Datasets
Datasets are the foundation for evaluating and improving your LLM application. They are structured collections of examples, each containing inputs and, optionally, reference outputs, that enable consistent and repeatable assessments. With them you can measure performance, track regressions, and run experiments reproducibly.
Store test cases: Replace scattered JSONL or CSV files with integrated, versioned datasets
Track model quality: Capture and re-evaluate LLM outputs using human or model-graded evaluations
Find new test cases: Store user-reviewed generations and production logs to identify regressions
Arize treats datasets as first-class citizens.
Integrated: Datasets connect seamlessly with experiments, evaluations, and production monitoring. You can tag spans from production or test different prompt strategies using dataset metadata.
Versioned: All changes—additions, deletions, updates—are tracked. You can pin an experiment to a specific dataset version to ensure reproducibility.
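One way to think about version pinning is content addressing: a dataset version can be identified by a hash of its contents, so an experiment pinned to that hash always sees the same examples. This is an illustrative sketch of the idea, not the Arize API:

```python
# Derive a stable version identifier from dataset contents, so an
# experiment pinned to it is reproducible even if the dataset is
# edited later. (Illustrative only; Arize tracks versions for you.)
import hashlib
import json

def dataset_version(examples):
    """Return a short content hash for a list of example dicts."""
    payload = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"input": "a", "output": "b"}])
v2 = dataset_version([{"input": "a", "output": "c"}])
# Any change to the examples yields a different version identifier.
```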
You can build datasets from a variety of sources:
1. Manually Curated Examples The best place to start. Based on your understanding of the application, you can define a handful of examples—20 well-crafted ones often go a long way. These examples help cover expected behavior and common edge cases.
2. Historical Logs Once your application is live, you’ll begin collecting valuable usage data. Logs can reveal examples where the app struggled (e.g., user dissatisfaction, high latency). Add these examples to datasets to continually test against real-world issues.
3. Synthetic Data With a few solid examples in hand, you can use LLMs to generate many similar examples. Synthetic data is useful for scaling evaluations quickly, but it should be guided by well-designed source examples.
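The three sources above often feed the same dataset: start with curated examples, then fold in production logs where the app struggled. A minimal sketch of that merge, where the field names ("input", "latency_ms", "feedback") and the struggle heuristic are illustrative assumptions, not an Arize schema:

```python
# Combine hand-curated examples with production logs that show poor
# behavior (slow responses or negative feedback). All field names and
# thresholds here are illustrative.

def build_dataset(curated, logs, latency_threshold_ms=2000):
    """Merge curated examples with problematic logs, de-duplicating by input."""
    dataset = list(curated)
    seen = {ex["input"] for ex in dataset}
    for log in logs:
        struggled = (
            log["latency_ms"] > latency_threshold_ms
            or log.get("feedback") == "thumbs_down"
        )
        if struggled and log["input"] not in seen:
            dataset.append({"input": log["input"], "output": log["output"]})
            seen.add(log["input"])
    return dataset

curated = [{"input": "What is Arize?", "output": "An ML observability platform."}]
logs = [
    {"input": "Summarize this 50-page PDF", "output": "...", "latency_ms": 9000},
    {"input": "Hi", "output": "Hello!", "latency_ms": 120},
]
examples = build_dataset(curated, logs)
# Only the slow log is added alongside the curated example.
```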
Arize supports flexible formats depending on your LLM application's needs:
1. Key-Value Pairs Great for multi-input/multi-output tasks like function calls, agents, or classification tasks.
"Paul Graham is an investor, entrepreneur, and computer scientist known for..."
2. Prompt-Completion (String Pairs) Ideal for testing single-turn completions.
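A prompt-completion example is a simple string pair; the key names here are illustrative:

```python
# One prompt-completion example for a single-turn task.
example = {
    "prompt": "Translate 'hello' to French.",
    "completion": "bonjour",
}
```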
3. Messages or Chat Format Best suited for conversational agents.
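A chat-format example stores the conversation as a list of role/content messages, following the widely used chat-message convention (the roles and content below are illustrative):

```python
# One chat-format example: an ordered list of role/content messages,
# suited to multi-turn conversational agents.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful travel assistant."},
        {"role": "user", "content": "Find me a flight to Tokyo."},
        {"role": "assistant", "content": "What dates are you traveling?"},
    ]
}
```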
Golden Datasets A golden dataset contains trusted inputs and ideal outputs. These are typically hand-labeled and serve as a benchmark for model quality.
Input | Output
Paris is the capital of France | True
Canada borders the United States | True
The native language of Japan is English | False
Golden datasets are useful for regression testing and validating performance before releases.
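Benchmarking against a golden dataset can be as simple as scoring outputs by exact match against the trusted labels. A minimal sketch using the rows above (the stand-in model is hypothetical):

```python
# Score a model against a golden dataset by exact match on the label.
golden = [
    {"input": "Paris is the capital of France", "expected": "True"},
    {"input": "Canada borders the United States", "expected": "True"},
    {"input": "The native language of Japan is English", "expected": "False"},
]

def accuracy(model, dataset):
    """Fraction of rows where the model output equals the expected label."""
    correct = sum(1 for row in dataset if model(row["input"]) == row["expected"])
    return correct / len(dataset)

# A stand-in "model" that always answers True:
always_true = lambda _: "True"
score = accuracy(always_true, golden)  # 2 of 3 rows match -> ~0.67
```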
Regression Datasets A regression dataset captures examples where your application previously failed or performed poorly. These datasets are crucial for ensuring that fixes or improvements persist over time and don’t reintroduce bugs or regressions. Examples are often pulled from user feedback or logs with problematic behavior.
Input | Output
What's the boiling point of water on Mars? | I don't know
Translate 'cat' to Spanish | Translation not available
Summarize: 'The U.S. economy grew 3% last quarter.' | No summary found
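A regression dataset is typically re-run on every change to confirm the old failures stay fixed. A minimal sketch using two of the rows above (the stand-in model and field names are illustrative):

```python
# Re-run previously failing inputs and flag any that still reproduce
# the bad output recorded in the regression dataset.
regressions = [
    {"input": "What's the boiling point of water on Mars?",
     "bad_output": "I don't know"},
    {"input": "Translate 'cat' to Spanish",
     "bad_output": "Translation not available"},
]

def still_failing(model, dataset):
    """Return inputs whose output still matches the recorded bad output."""
    return [row["input"] for row in dataset
            if model(row["input"]) == row["bad_output"]]

# A stand-in model that fixed the translation case but nothing else:
fixed_model = lambda q: "gato" if "cat" in q else "I don't know"
remaining = still_failing(fixed_model, regressions)
# The Mars question still reproduces its recorded bad output.
```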
Check out the Quickstart guide for experiments.
Learn more about evaluation concepts by reading Evaluation Concepts and our definitive guide on LLM app evaluation.
Look at end-to-end examples.