Datasets
Version controlled examples to run your experiments
Last updated
Was this helpful?
Version controlled examples to run your experiments
Last updated
Was this helpful?
Datasets are the foundation for evaluating and improving your LLM application. They enable consistent and repeatable assessments by using structured collections of example data.
You can build datasets from a variety of sources:
1. Manually Curated Examples The best place to start. Based on your understanding of the application, you can define a handful of examples—20 well-crafted ones often go a long way. These examples help cover expected behavior and common edge cases.
2. Historical Logs Once your application is live, you’ll begin collecting valuable usage data. Logs can reveal examples where the app struggled (e.g., user dissatisfaction, high latency). Add these examples to datasets to continually test against real-world issues.
3. Synthetic Data With a few solid examples in hand, you can use LLMs to generate many similar examples. Synthetic data is useful for scaling evaluations quickly, but it should be guided by well-designed source examples.
Arize supports flexible formats depending on your LLM application's needs:
1. Key-Value Pairs Great for multi-input/multi-output tasks like function calls, agents, or classification tasks.
"Paul Graham is an investor, entrepreneur, and computer scientist known for..."
2. Prompt-Completion (String Pairs) Ideal for testing single-turn completions.
3. Messages or Chat Format Best suited for conversational agents.
Golden Datasets A golden dataset contains trusted inputs and ideal outputs. These are typically hand-labeled and serve as a benchmark for model quality.
Paris is the capital of France
True
Canada borders the United States
True
The native language of Japan is English
False
Golden datasets are useful for regression testing and validating performance before releases.
Regression Datasets A regression dataset captures examples where your application previously failed or performed poorly. These datasets are crucial for ensuring that fixes or improvements persist over time and don’t reintroduce bugs or regressions. Examples are often pulled from user feedback or logs with problematic behavior.
What's the boiling point of water on Mars?
I don't know
Translate 'cat' to Spanish
Translation not available
Summarize: 'The U.S. economy grew 3% last quarter
No summary found
to store test cases and track version history
and track application performance over time
to modify or add to existing dataset versions
to manipulate in code or download
Quickstart
Start experimenting with datasets in UI or code
Learn more about evals
Read our evaluation concepts page
Look at end to end examples of .