Datasets
Datasets are the foundation for evaluating and improving your LLM application. They are structured collections of examples, each containing inputs and, optionally, reference outputs, that enable consistent and repeatable assessments. With them you can measure performance, track regressions, and run experiments reproducibly.
Store test cases: Replace scattered JSONL or CSV files with integrated, versioned datasets
Track model quality: Capture and re-evaluate LLM outputs using human or model-graded evaluations
Find new test cases: Store user-reviewed generations and production logs to identify regressions
Arize treats datasets as first-class citizens.
Integrated: Datasets connect seamlessly with experiments, evaluations, and production monitoring. You can tag spans from production or test different prompt strategies using dataset metadata.
Versioned: All changes—additions, deletions, updates—are tracked. You can pin an experiment to a specific dataset version to ensure reproducibility.
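One way to think about version pinning is content addressing: a dataset version can be identified by a hash of its contents, so an experiment pinned to that hash always sees the same examples. This is an illustrative sketch of the idea, not the Arize API:

```python
# Derive a stable version identifier from dataset contents, so an
# experiment pinned to it is reproducible even if the dataset is
# edited later. (Illustrative only; Arize tracks versions for you.)
import hashlib
import json

def dataset_version(examples):
    """Return a short content hash for a list of example dicts."""
    payload = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"input": "a", "output": "b"}])
v2 = dataset_version([{"input": "a", "output": "c"}])
# Any change to the examples yields a different version identifier.
```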
You can build datasets from a variety of sources:
1. Manually Curated Examples The best place to start. Based on your understanding of the application, you can define a handful of examples—20 well-crafted ones often go a long way. These examples help cover expected behavior and common edge cases.
2. Historical Logs Once your application is live, you’ll begin collecting valuable usage data. Logs can reveal examples where the app struggled (e.g., user dissatisfaction, high latency). Add these examples to datasets to continually test against real-world issues.
3. Synthetic Data With a few solid examples in hand, you can use LLMs to generate many similar examples. Synthetic data is useful for scaling evaluations quickly, but it should be guided by well-designed source examples.
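The three sources above often feed the same dataset: start with curated examples, then fold in production logs where the app struggled. A minimal sketch of that merge, where the field names ("input", "latency_ms", "feedback") and the struggle heuristic are illustrative assumptions, not an Arize schema:

```python
# Combine hand-curated examples with production logs that show poor
# behavior (slow responses or negative feedback). All field names and
# thresholds here are illustrative.

def build_dataset(curated, logs, latency_threshold_ms=2000):
    """Merge curated examples with problematic logs, de-duplicating by input."""
    dataset = list(curated)
    seen = {ex["input"] for ex in dataset}
    for log in logs:
        struggled = (
            log["latency_ms"] > latency_threshold_ms
            or log.get("feedback") == "thumbs_down"
        )
        if struggled and log["input"] not in seen:
            dataset.append({"input": log["input"], "output": log["output"]})
            seen.add(log["input"])
    return dataset

curated = [{"input": "What is Arize?", "output": "An ML observability platform."}]
logs = [
    {"input": "Summarize this 50-page PDF", "output": "...", "latency_ms": 9000},
    {"input": "Hi", "output": "Hello!", "latency_ms": 120},
]
examples = build_dataset(curated, logs)
# Only the slow log is added alongside the curated example.
```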
Arize supports flexible formats depending on your LLM application's needs:
1. Key-Value Pairs Great for multi-input/multi-output tasks like function calls, agents, or classification tasks.
"Paul Graham is an investor, entrepreneur, and computer scientist known for..."
2. Prompt-Completion (String Pairs) Ideal for testing single-turn completions.
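A prompt-completion example is a simple string pair; the key names here are illustrative:

```python
# One prompt-completion example for a single-turn task.
example = {
    "prompt": "Translate 'hello' to French.",
    "completion": "bonjour",
}
```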
3. Messages or Chat Format Best suited for conversational agents.
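A chat-format example stores the conversation as a list of role/content messages, following the widely used chat-message convention (the roles and content below are illustrative):

```python
# One chat-format example: an ordered list of role/content messages,
# suited to multi-turn conversational agents.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful travel assistant."},
        {"role": "user", "content": "Find me a flight to Tokyo."},
        {"role": "assistant", "content": "What dates are you traveling?"},
    ]
}
```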
Golden Datasets A golden dataset contains trusted inputs and ideal outputs. These are typically hand-labeled and serve as a benchmark for model quality.
Input | Output
Paris is the capital of France | True
Canada borders the United States | True
The native language of Japan is English | False
Golden datasets are useful for regression testing and validating performance before releases.
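Benchmarking against a golden dataset can be as simple as scoring outputs by exact match against the trusted labels. A minimal sketch using the rows above (the stand-in model is hypothetical):

```python
# Score a model against a golden dataset by exact match on the label.
golden = [
    {"input": "Paris is the capital of France", "expected": "True"},
    {"input": "Canada borders the United States", "expected": "True"},
    {"input": "The native language of Japan is English", "expected": "False"},
]

def accuracy(model, dataset):
    """Fraction of rows where the model output equals the expected label."""
    correct = sum(1 for row in dataset if model(row["input"]) == row["expected"])
    return correct / len(dataset)

# A stand-in "model" that always answers True:
always_true = lambda _: "True"
score = accuracy(always_true, golden)  # 2 of 3 rows match -> ~0.67
```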
Regression Datasets A regression dataset captures examples where your application previously failed or performed poorly. These datasets are crucial for ensuring that fixes or improvements persist over time and don’t reintroduce bugs or regressions. Examples are often pulled from user feedback or logs with problematic behavior.
Input | Output
What's the boiling point of water on Mars? | I don't know
Translate 'cat' to Spanish | Translation not available
Summarize: 'The U.S. economy grew 3% last quarter.' | No summary found
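A regression dataset is typically re-run on every change to confirm the old failures stay fixed. A minimal sketch using two of the rows above (the stand-in model and field names are illustrative):

```python
# Re-run previously failing inputs and flag any that still reproduce
# the bad output recorded in the regression dataset.
regressions = [
    {"input": "What's the boiling point of water on Mars?",
     "bad_output": "I don't know"},
    {"input": "Translate 'cat' to Spanish",
     "bad_output": "Translation not available"},
]

def still_failing(model, dataset):
    """Return inputs whose output still matches the recorded bad output."""
    return [row["input"] for row in dataset
            if model(row["input"]) == row["bad_output"]]

# A stand-in model that fixed the translation case but nothing else:
fixed_model = lambda q: "gato" if "cat" in q else "I don't know"
remaining = still_failing(fixed_model, regressions)
# The Mars question still reproduces its recorded bad output.
```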
Check out the Quickstart guide for experiments.
Learn more about evaluation concepts by reading Evaluation Concepts and our definitive guide on LLM app evaluation.
Look at end-to-end examples.