Create a dataset

There are four ways to load data into a dataset: from a CSV, from your spans, with code, or from synthetic examples.

Create a dataset from CSV

You can upload a CSV as a dataset in Arize. The columns in the file become attributes that you can access in experiments or in prompt playground.

The primary requirement for the CSV is that it must have an id column. Here's an example CSV snippet, which has the question and response columns as attributes.

id,question,response
1,"here is a question","a satisfactory answer"
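
Before uploading, it can help to check the file locally. The sketch below uses only the Python standard library; `check_dataset_csv` is a hypothetical helper, not part of the Arize SDK. Only the presence of the id column comes from the requirement above. The checks for empty or duplicate ids are an added assumption about what makes a clean file.

```python
import csv
import io


def check_dataset_csv(text):
    """Return a list of problems found in CSV text; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    if "id" not in columns:
        return ["missing required 'id' column"]

    problems = []
    seen = set()
    for row in reader:
        line_no = reader.line_num  # line of the source text this row ended on
        row_id = (row["id"] or "").strip()
        if not row_id:
            # Assumption: empty ids are treated as a problem.
            problems.append(f"line {line_no}: empty id")
        elif row_id in seen:
            # Assumption: ids are expected to be unique.
            problems.append(f"line {line_no}: duplicate id {row_id!r}")
        seen.add(row_id)
    return problems


sample = 'id,question,response\n1,"here is a question","a satisfactory answer"\n'
print(check_dataset_csv(sample))
```

To check a file on disk, read it first, for example `check_dataset_csv(open("examples.csv", newline="").read())`, where `examples.csv` is a placeholder path.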

Create a dataset from your spans

If you have added tracing to your application, you can create datasets in Arize from your application's spans. Go to the traces page and filter for the examples you care about, such as spans with a hallucination label.

Create a dataset with code

If you'd like to create your datasets programmatically, you can use our Python SDK to create, update, and delete datasets.

To start, let's install the packages we need:

!pip install "arize[Datasets]" pandas

Let's get your developer key by clicking "code" on the datasets page.

from arize.experimental.datasets import ArizeDatasetsClient
import pandas as pd

# Paste the developer key you copied from the datasets page
developer_key = "YOUR_DEVELOPER_KEY"
client = ArizeDatasetsClient(developer_key=developer_key)

You can create many different kinds of datasets. The examples below are sorted by complexity.

To upload a standard set of examples with string inputs, create the dataframe as follows.

import pandas as pd
import json
from arize.experimental.datasets.utils.constants import GENERATIVE

data = [{
    "persona": "An aspiring musician who is writing their own songs",
    "problem": "I often get stuck overthinking my lyrics and melodies.",
}]

df = pd.DataFrame(data)

dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID", 
    dataset_name="Your Dataset",
    dataset_type=GENERATIVE,
    data=df
)

In this dataset, we'll attach the prompt to the data points so that when you import it into prompt playground, the prompt appears automatically.

import pandas as pd
import json
from arize.experimental.datasets.utils.constants import GENERATIVE

PROMPT_TEMPLATE = """
You are an expert product manager recommending features for a target user. 

Persona: {persona}
Problem: {problem}
"""

data = [{
    "attributes.llm.prompt_template.template": PROMPT_TEMPLATE,
    "persona": "An aspiring musician who is writing their own songs",
    "problem": "I often get stuck overthinking my lyrics and melodies.",
}]

df = pd.DataFrame(data)

dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID", 
    dataset_name="Your Dataset",
    dataset_type=GENERATIVE,
    data=df
)

In this dataset, each row stores its prompt variables as a JSON string next to the template, so several sets of variables can share the same prompt.

import pandas as pd
import json
from arize.experimental.datasets.utils.constants import GENERATIVE

PROMPT_TEMPLATE = """
You are an expert product manager recommending features for a target user. 

Persona: {persona}
Problem: {problem}
"""

data = [
    {
        "attributes.llm.prompt_template.template": PROMPT_TEMPLATE,
        "attributes.llm.prompt_template.variables": json.dumps({
            "persona": "An aspiring musician who is writing their own songs",
            "problem": "I often get stuck overthinking my lyrics and melodies.",
        })
    },
    {
        "attributes.llm.prompt_template.template": PROMPT_TEMPLATE,
        "attributes.llm.prompt_template.variables": json.dumps({
            "persona": "A Christian who goes to church every week",
            "problem": "I'm often too tired for deep Bible study at the end of the day.",
        })
    },
]

df = pd.DataFrame(data)

dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID", 
    dataset_name="Your Dataset",
    dataset_type=GENERATIVE,
    data=df
)

Here's how the dataset looks when imported into prompt playground, which makes it easy to iterate on your prompt and test new outputs across many data points.
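
When a dataset has many rows, a small builder can catch rows whose variables do not cover the template's placeholders before anything is uploaded. This is a minimal sketch using only the standard library. `template_fields` and `build_rows` are hypothetical helpers, and the sketch assumes the template uses single-brace `{name}` placeholders as in the examples above. The column names are the ones used in the examples above.

```python
import json
from string import Formatter


def template_fields(template):
    """Return the set of {placeholder} names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def build_rows(template, variable_sets):
    """Build one dataset row per variable set, failing early on missing variables."""
    needed = template_fields(template)
    rows = []
    for i, variables in enumerate(variable_sets):
        missing = needed - set(variables)
        if missing:
            raise ValueError(f"row {i} is missing variables: {sorted(missing)}")
        rows.append({
            "attributes.llm.prompt_template.template": template,
            # Variables are serialized to a JSON string, as in the example above.
            "attributes.llm.prompt_template.variables": json.dumps(variables),
        })
    return rows


PROMPT_TEMPLATE = """
You are an expert product manager recommending features for a target user.

Persona: {persona}
Problem: {problem}
"""

rows = build_rows(PROMPT_TEMPLATE, [
    {
        "persona": "An aspiring musician who is writing their own songs",
        "problem": "I often get stuck overthinking my lyrics and melodies.",
    },
])
```

The resulting `rows` list can be passed to `pd.DataFrame(rows)` and then to `client.create_dataset` in the same way as the examples above.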

Create a synthetic dataset

When you are first developing with LLMs, you typically start with a prompt and little else. The early iteration gets you to a point where the video demo looks amazing, but there's a lack of confidence in its reliability and robustness.

This is where you can use LLMs to generate examples for you based on your prompt. Here's an example prompt you can give ChatGPT or your LLM of choice to create a set of examples to upload to Arize.

You are a data analyst. You are using LLMs to summarize a document. Create a CSV of 20 test cases with the following columns:

1. Input: The full document text, usually five paragraphs of articles about beauty products.
2. Prompt Variables: A JSON string of metadata attached to the article, such as the article title, date, and website URL
3. Output: The one-line summary

This will generate a CSV file for you that looks like:

Coming soon, you'll be able to do this directly in the Arize platform based on your traces and prompts, but in the interim, you can upload this data with code or CSV.
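
LLM-generated CSVs can contain malformed JSON and will not have the id column that CSV upload requires. The sketch below parses the generated text, checks that each "Prompt Variables" value is valid JSON, and adds sequential ids. `rows_from_generated_csv` is a hypothetical helper, the column names are assumed to match the prompt above, and the sample row is placeholder text rather than real model output.

```python
import csv
import io
import json


def rows_from_generated_csv(text):
    """Parse LLM-generated CSV text into rows with an added id column."""
    rows = []
    for i, record in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        # Raises json.JSONDecodeError if the model emitted malformed JSON.
        json.loads(record["Prompt Variables"])
        rows.append({"id": str(i), **record})
    return rows


# Build a placeholder CSV with csv.writer so the JSON column is quoted correctly.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Input", "Prompt Variables", "Output"])
writer.writerow([
    "Placeholder document text.",
    json.dumps({"title": "Placeholder title"}),
    "Placeholder summary.",
])

rows = rows_from_generated_csv(buffer.getvalue())
print(rows[0]["id"], rows[0]["Output"])
```

From here, `rows` can be written back out with `csv.DictWriter` for CSV upload, or loaded into a dataframe for upload with code.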