LLM Evaluations

Evaluate LLM outputs with Arize

LLM Evals Overview

Supported Evaluation Approaches

Arize supports various evaluation approaches depending on your available data and use case.
| Approach | Description | Example |
| --- | --- | --- |
| End-user feedback | Thumbs up/thumbs down on the LLM response | |
| Task-based metrics | Different metrics for different tasks | ROUGE for summarization, BLEU for translation |
| LLM-assisted evaluation | A separate LLM call to evaluate the LLM response | |
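For task-based metrics, scores such as ROUGE are typically computed offline and then logged to Arize alongside the response. A minimal sketch, assuming the open-source `rouge-score` package (not part of the Arize SDK; any ROUGE or BLEU implementation can be substituted):

```python
# Sketch: computing a task-based metric (ROUGE-L) for a summarization task.
# Requires `pip install rouge-score`; the texts below are placeholders.
from rouge_score import rouge_scorer

reference = "The quarterly report shows revenue grew 12% year over year."
summary = "Revenue grew 12% compared to last year."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, summary)
print(scores["rougeL"].fmeasure)  # F-measure between 0 and 1, loggable as a tag or label
```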

LLM-Assisted Evaluation

Use LLM-assisted evaluations when end-user feedback is scarce or unstructured to assess the responses of your LLM application or the relevance of retrieved context in retrieval-augmented generation (RAG).
Arize supports both pre-tested and custom LLM-assisted evaluations.

Pre-Tested Evals

The Arize Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations. All eval templates are tested against golden datasets that ship as part of the library's benchmarked datasets and target precision of 70-90% and F1 of 70-85%. Use the pre-tested templates to (a sketch of running one follows the list below):
- Pinpoint potential issues or inaccuracies in user query responses for further analysis when user feedback is not available
- Identify issues from insufficient or irrelevant context in the knowledge base
- Evaluate responses for toxic, inappropriate, or negative attributes
- Evaluate the overall performance of a summarization task
- Identify model hallucinations by assessing the model's output against contextual information
- Assess the correctness and readability of code from a code generation process
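As a rough illustration, the sketch below runs the benchmarked retrieval-relevance template with Phoenix's `llm_classify`. Import paths and argument names vary across Phoenix releases (older versions expose the same names under `phoenix.experimental.evals`, and the model argument may be `model_name`), and the dataframe columns are assumed to match the template's variables, so treat this as a version-dependent sketch rather than exact API:

```python
# Sketch: running a pre-tested Phoenix eval (retrieval relevance) over a dataframe.
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    llm_classify,
)

# Column names assume the template's variables are `input` (the query)
# and `reference` (the retrieved context).
df = pd.DataFrame(
    {
        "input": ["How do I reset my password?"],
        "reference": ["To reset your password, open Settings > Security > Reset password."],
    }
)

relevance_evals = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # adds an explanation column, useful for troubleshooting
)
print(relevance_evals["label"])  # per-row labels such as "relevant" / "unrelated"
```

The returned dataframe has one row per input with a label column (plus an explanation column when `provide_explanation=True`); these are the eval results you can upload to Arize as described below.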

How to Upload Evals to Arize

After you run an LLM-assisted eval with Phoenix, you may want to upload those evals to Arize for in-depth troubleshooting and root cause analysis.
🗺️ Follow a step-by-step tutorial here.

Log Evals

Log evals to Arize using tags within the model schema. Learn more about logging LLM model data here.
```python
# The Schema maps dataframe columns to Arize fields; eval results are logged as tags.
from arize.utils.types import Schema

schema = Schema(
    prediction_id_column_name="task_id",
    prediction_label_column_name="readable",
    tag_column_names=["user_feedback", "readability"],  # readability and user_feedback are the evals
    prompt_column_names="input",
    response_column_names="output",
)
```
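Defining the schema does not send anything by itself; it is passed to the Arize pandas logger together with your dataframe. A minimal sketch, assuming placeholder credentials and a dataframe `df` containing the columns named above (the exact `Client` constructor arguments depend on your SDK version):

```python
# Sketch: logging the dataframe and schema above to Arize as a generative LLM model.
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes

client = Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY")  # placeholders

response = client.log(
    dataframe=df,  # must contain task_id, readable, user_feedback, readability, input, output
    schema=schema,
    model_id="llm-eval-demo",
    model_version="v1",
    model_type=ModelTypes.GENERATIVE_LLM,
    environment=Environments.PRODUCTION,
)
if response.status_code != 200:
    print(f"Logging failed: {response.text}")
```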

Troubleshooting Evals

After you've logged your data to Arize, select the evaluation metric you want to measure on the Performance Tracing tab.
To do this, make sure the Actual Class (i.e., the evaluation metric used) or User Feedback columns are selected in the Primary Columns selector, then sort those columns from lowest score to highest.

Custom Evals

The Arize LLM Evals library is designed to support building any custom eval template your use case needs.
Learn how to build your own eval with Arize Phoenix here.
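As a sketch of what a custom eval might look like, the template string, rails, and column names below are illustrative only; `llm_classify` accepts any template whose variables match your dataframe columns (the same version caveats as above apply):

```python
# Sketch: a custom "instruction following" eval run with Phoenix's llm_classify.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical template for illustration; write whatever criteria fit your use case.
CUSTOM_TEMPLATE = """You are checking whether a response follows the user's instructions.
[Instructions]: {input}
[Response]: {output}
Answer with a single word, either "followed" or "ignored"."""

df = pd.DataFrame(
    {
        "input": ["Summarize the article in one sentence."],
        "output": ["The article argues that eval-driven development shortens LLM iteration cycles."],
    }
)

custom_evals = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),
    template=CUSTOM_TEMPLATE,
    rails=["followed", "ignored"],  # constrain the judge to these labels
)
print(custom_evals["label"])
```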