Concepts: Datasets
Datasets are integral to evaluation and experimentation. They are collections of examples that provide the inputs, outputs, and any other attributes needed for assessing your application. Each example within a dataset represents a single data point, consisting of an inputs dictionary, an optional output dictionary, and an optional metadata dictionary. The optional output dictionary often contains the expected LLM application output for the given input.
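For instance, a single example might look like the following. This is a minimal sketch in plain Python; the field names inside each dictionary (question, response, topic, source) are illustrative, not a required schema:

```python
# A single dataset example: an inputs dict, plus optional output and metadata dicts.
example = {
    "inputs": {"question": "How do I reset my password?"},
    # The expected ("golden") application output for this input.
    "output": {"response": "Go to Settings > Account > Reset Password."},
    # Arbitrary attributes, useful for tracking segments and use cases.
    "metadata": {"topic": "account", "source": "manual"},
}
```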
Datasets let you collect data from production, staging, and evaluations, or add examples manually. The collected examples are then used to run experiments and evaluations to track improvements.
Use datasets to:
Store evaluation test cases for your eval script instead of managing large JSONL or CSV files (see the sketch after this list)
Capture generations to assess quality manually or using LLM-graded evals
Store user-reviewed generations to find new test cases
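For example, a handful of test cases can be uploaded directly as a dataset. This is a minimal sketch using the experimental datasets client from the Arize Python SDK; the import path, constructor arguments, and create_dataset signature are assumptions that may differ across SDK versions, so check the SDK reference for the exact API:

```python
import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient  # assumed import path

# Credentials and space ID are placeholders.
client = ArizeDatasetsClient(api_key="YOUR_API_KEY")

# A small set of hand-written eval test cases, instead of a JSONL/CSV file.
test_cases = pd.DataFrame(
    [
        {"input": "How do I reset my password?", "expected_output": "Go to Settings > Account."},
        {"input": "Why was my card charged twice?", "expected_output": "Escalate to billing."},
    ]
)

dataset_id = client.create_dataset(  # signature simplified; may differ in your SDK version
    space_id="YOUR_SPACE_ID",
    dataset_name="support-eval-cases",
    data=test_cases,
)
```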
With Arize, datasets are:
Integrated. Datasets are integrated with the platform, so you can add production spans to datasets, use datasets to run experiments, and use metadata to track different segments and use cases.
Versioned. Every insert, update, and delete is versioned, so you can pin experiments and evaluations to a specific version of a dataset and track changes over time.
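Because every change produces a new version, an experiment can read a fixed snapshot rather than whatever the dataset currently contains. A hedged sketch, continuing the client from the example above; the dataset_version parameter and its format are assumptions, so consult the SDK docs for how versions are actually referenced:

```python
# Sketch: pin an evaluation run to a specific dataset version for reproducibility.
df = client.get_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_id=dataset_id,
    dataset_version="v2024-06-01",  # hypothetical version identifier
)
```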
Creating Datasets
There are various ways to get started with datasets:
Manually Curated Examples
This is how we recommend you start. From building your application, you probably have an idea of what types of inputs it should handle and what "good" responses look like. You will also want to cover a few common edge cases and situations you can imagine. Even 20 high-quality, manually curated examples can go a long way.
Historical Logs
Once you ship an application, you start gleaning valuable information: how users are actually using it. Capturing this usage in datasets lets you test against specific use cases as you iterate on your application.
If your application is going well, you will likely get a lot of usage. How can you determine which data points are valuable to add? There are a few heuristics you can follow. If possible, try to collect end-user feedback. You can then see which data points got negative feedback. That is super valuable! These are spots where your application did not perform well, and you should add them to your dataset to test against in the future. You can also use other heuristics to identify interesting data points: for example, runs that took a long time to complete could be worth analyzing and adding to a dataset. Both heuristics are sketched below.
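As a concrete illustration, here is one way to apply both heuristics to exported production spans with pandas. The file name and column names (user_feedback, latency_ms) are assumptions about your export schema:

```python
import pandas as pd

# Production spans exported to a local file; the path and columns are hypothetical.
spans = pd.read_parquet("production_spans.parquet")

# Heuristic 1: data points where end users gave negative feedback.
negative = spans[spans["user_feedback"] == "thumbs_down"]

# Heuristic 2: unusually slow runs, here those above the 95th-percentile latency.
slow = spans[spans["latency_ms"] > spans["latency_ms"].quantile(0.95)]

# Combine both sets of candidates for review before adding them to a dataset.
candidates = pd.concat([negative, slow]).drop_duplicates()
```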
Synthetic Data
Once you have a few examples, you can generate synthetic examples to scale up to many data points quickly. It's generally advisable to have a few good handcrafted examples first, as the synthetic data will often resemble the source examples in some way.
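One common approach is to prompt an LLM with your handcrafted seeds and ask for variations. A rough sketch using the OpenAI Python client; the model name and prompt are illustrative, and the generated questions should still be reviewed before they go into a dataset:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Handcrafted seed inputs the synthetic examples should resemble.
seed_questions = [
    "How do I reset my password?",
    "Why was my card charged twice?",
]

prompt = (
    "Here are example user questions for a support assistant:\n"
    + "\n".join(f"- {q}" for q in seed_questions)
    + "\n\nWrite 10 new, realistic questions in the same style, one per line."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
synthetic_inputs = response.choices[0].message.content.strip().splitlines()
```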