Log Evaluation Results

This guide shows how to send LLM evaluation results stored in dataframes to Phoenix.

An evaluation must have a name (e.g. "Q&A Correctness"), and its DataFrame must contain identifiers for the subject of evaluation, such as a span or a document (more on that below), along with values in at least one of the score, label, or explanation columns. See Evaluations for more information.
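
These requirements can be checked locally before anything is sent. The following is a minimal sketch, not part of Phoenix: check_eval_dataframe is a hypothetical helper that assumes pandas, and the identifier names it looks for (span_id, and optionally document_position) are the ones described in the sections below. Phoenix may apply its own checks in addition to these.

```python
import pandas as pd

VALUE_COLUMNS = ("score", "label", "explanation")


def check_eval_dataframe(df, id_names=("span_id",)):
    """Return a list of problems; an empty list means the frame looks usable.

    Hypothetical helper for illustration only, not a Phoenix API.
    """
    problems = []
    # Identifiers may live in the index or in ordinary columns.
    available = set(df.columns) | {n for n in df.index.names if n is not None}
    for name in id_names:
        if name not in available:
            problems.append(f"missing identifier: {name}")
    # At least one of the value columns must be present.
    if not any(col in df.columns for col in VALUE_COLUMNS):
        problems.append("need at least one of: score, label, explanation")
    return problems


good = pd.DataFrame({"span_id": ["5B8EF798A381"], "label": ["correct"]})
bad = pd.DataFrame({"label": ["correct"]})
print(check_eval_dataframe(good))
print(check_eval_dataframe(bad))
```

For document evaluations, the same helper could be called with id_names=("span_id", "document_position").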

Connect to Phoenix

Before accessing px.Client(), be sure you've set the following environment variables:

import os

# Replace ... with your Phoenix API key.
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=..."
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

If you're self-hosting Phoenix, ignore the client headers and change the collector endpoint to your endpoint.
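
The hosted and self-hosted cases can be handled by one small function. This is a sketch under assumptions: configure_phoenix is a hypothetical helper (only the two environment variable names come from this guide), and the localhost URL is a placeholder for your own endpoint.

```python
import os


def configure_phoenix(endpoint, api_key=None):
    """Set the environment variables described above.

    Hypothetical helper for illustration; not part of Phoenix.
    """
    os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = endpoint
    if api_key is not None:
        os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={api_key}"
    else:
        # Self-hosted: no client headers are needed, so clear any stale value.
        os.environ.pop("PHOENIX_CLIENT_HEADERS", None)


# Self-hosted example; substitute the address of your own deployment.
configure_phoenix("http://localhost:6006")
```

Call it before creating px.Client() so the client sees the values.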

Span Evaluations

A dataframe of span evaluations would look similar to the table below. It must contain span_id as an index or as a column. Once ingested, Phoenix uses the span_id to associate the evaluation with its target span.

| span_id | label | score | explanation |
|---|---|---|---|
| 5B8EF798A381 | correct | 1 | "this is correct ..." |
| E19B7EC3GG02 | incorrect | 0 | "this is incorrect ..." |

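For illustration, the rows in the table above could be assembled with pandas as follows. This is only a sketch: the values are the sample values from the table, and in practice span IDs come from your own traces and the labels, scores, and explanations from your evaluator.

```python
import pandas as pd

# Sample values copied from the table above; real span IDs come from your traces.
qa_correctness_eval_df = pd.DataFrame(
    {
        "span_id": ["5B8EF798A381", "E19B7EC3GG02"],
        "label": ["correct", "incorrect"],
        "score": [1, 0],
        "explanation": ["this is correct ...", "this is incorrect ..."],
    }
).set_index("span_id")  # span_id may also be left as a regular column

print(qa_correctness_eval_df)
```
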
The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the eval_name= parameter. In this case we name it "Q&A Correctness".

import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
)

Document Evaluations

A dataframe of document evaluations would look something like the table below. It must contain span_id and document_position as either indices or columns. document_position is the document's (zero-based) index in the span's list of retrieved documents. Once ingested, Phoenix uses the span_id and document_position to associate the evaluation with its target span and document.

| span_id | document_position | label | score | explanation |
|---|---|---|---|---|
| 5B8EF798A381 | 0 | relevant | 1 | "this is ..." |
| 5B8EF798A381 | 1 | irrelevant | 0 | "this is ..." |
| E19B7EC3GG02 | 0 | relevant | 1 | "this is ..." |

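As a sketch, the table above could be built with pandas using a two-level index, one level per identifier. The values are the sample values from the table; setting the index is optional, since both identifiers may also stay as regular columns.

```python
import pandas as pd

# Sample values copied from the table above.
document_relevance_eval_df = pd.DataFrame(
    {
        "span_id": ["5B8EF798A381", "5B8EF798A381", "E19B7EC3GG02"],
        # Zero-based position of each document in its span's retrieved list.
        "document_position": [0, 1, 0],
        "label": ["relevant", "irrelevant", "relevant"],
        "score": [1, 0, 1],
        "explanation": ["this is ...", "this is ...", "this is ..."],
    }
).set_index(["span_id", "document_position"])

print(document_relevance_eval_df)
```
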
The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the eval_name= parameter. In this case we name it "Relevance".

import phoenix as px
from phoenix.trace import DocumentEvaluations

px.Client().log_evaluations(
    DocumentEvaluations(
        dataframe=document_relevance_eval_df,
        eval_name="Relevance",
    ),
)

Logging Multiple Evaluation DataFrames

Multiple sets of evaluations can be logged in a single px.Client().log_evaluations() call.

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
    DocumentEvaluations(
        dataframe=document_relevance_eval_df,
        eval_name="Relevance",
    ),
    SpanEvaluations(
        dataframe=hallucination_eval_df,
        eval_name="Hallucination",
    ),
    # ... as many as you like
)

Specifying A Project for the Evaluations

By default, the client sends evaluations to the project named in the PHOENIX_PROJECT_NAME environment variable, or to the default project if that variable is not set. To choose the destination project explicitly, pass the project name as a parameter.

import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=qa_correctness_eval_df,
        eval_name="Q&A Correctness",
    ),
    project_name="<my-project>"
)
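
The order of precedence described above can be sketched as follows. resolve_project_name is a hypothetical helper, not a Phoenix function, and the fallback name "default" is an assumption about what the default project is called.

```python
import os


def resolve_project_name(explicit=None):
    """Mirror the precedence described above (hypothetical helper).

    1. An explicitly passed project name wins.
    2. Otherwise use PHOENIX_PROJECT_NAME if it is set.
    3. Otherwise fall back to the default project (name assumed here).
    """
    if explicit:
        return explicit
    return os.environ.get("PHOENIX_PROJECT_NAME", "default")


print(resolve_project_name("<my-project>"))
```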
