
Phoenix LLM Evals

Evaluating LLM outputs is best tackled by using a separate evaluation LLM. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.
LLM Evals

The Problem with LLM Evaluations

  1. Most evaluation libraries do not follow the benchmarking rigor needed for production environments. Production LLM Evals need to benchmark both a model and a prompt template. (For example, the OpenAI “model” Evals focus only on evaluating the model, which is a different use case.)
  2. It is typically difficult to use the same evals across benchmarking, development, production, and the LangChain/LlamaIndex callback system. Evals should also process batches of data at optimal speed.
  3. Many libraries oblige you to use chain abstractions. LangChain shouldn't be a prerequisite for obtaining evaluations for pipelines that don't use it.

Our Solution: Phoenix LLM Evals

1. Support for Pre-Tested Eval Templates & Custom Eval Templates

Phoenix provides pre-tested eval templates and convenience functions for a set of common Eval “tasks”. Learn more about pre-tested templates here. The library is split into two layers: high-level functions that run rigorously pre-tested evals, and building blocks for modifying those evals or creating your own.
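To make the idea of an eval template concrete, here is a minimal, self-contained sketch. It is not the Phoenix API: the template text, the label list, and the names `snap_to_rails`, `run_eval`, and `stub_llm` are all hypothetical, and the LLM call is stubbed so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical template for illustration only; it is not one of Phoenix's
# pre-tested templates.
RELEVANCE_TEMPLATE = (
    "You are comparing a reference text to a question.\n"
    "Question: {query}\n"
    "Reference: {reference}\n"
    "Answer with a single word, either 'relevant' or 'irrelevant'."
)
ALLOWED_LABELS = ["relevant", "irrelevant"]  # the only labels the eval may return


def snap_to_rails(raw: str, labels: List[str], default: str = "NOT_PARSABLE") -> str:
    """Map free-form LLM output onto one of the allowed labels."""
    text = raw.strip().lower()
    # Check longer labels first so 'irrelevant' is not mistaken for 'relevant'.
    for label in sorted(labels, key=len, reverse=True):
        if label in text:
            return label
    return default


def run_eval(record: Dict[str, str], llm: Callable[[str], str]) -> str:
    """Fill the template from one record and classify the LLM's answer."""
    prompt = RELEVANCE_TEMPLATE.format(**record)
    return snap_to_rails(llm(prompt), ALLOWED_LABELS)


def stub_llm(prompt: str) -> str:
    """Stand-in for a real evaluation LLM call (assumption: offline demo)."""
    return "Irrelevant." if "weather" in prompt else "Relevant"
```

Swapping `stub_llm` for a function that calls a real evaluation model, or editing the template string, is the kind of customization the building-block layer is meant to support.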

2. Data Science Rigor when Benchmarking Evals for Reproducible Results

The Phoenix team is dedicated to testing model and template combinations and is continually improving templates for optimized performance. Find the most up-to-date template on GitHub.
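As a rough illustration of what benchmarking a model and template combination can involve, the sketch below scores an eval's predicted labels against a hand-labeled ("golden") dataset. The function name `benchmark` and the choice of precision, recall, and F1 are assumptions made for this example, not a description of the Phoenix team's actual procedure.

```python
from typing import Dict, List


def benchmark(predicted: List[str], golden: List[str], positive: str) -> Dict[str, float]:
    """Score eval output against ground-truth labels for one positive class."""
    if len(predicted) != len(golden):
        raise ValueError("predicted and golden must have the same length")
    pairs = list(zip(predicted, golden))
    tp = sum(p == positive and g == positive for p, g in pairs)  # true positives
    fp = sum(p == positive and g != positive for p, g in pairs)  # false positives
    fn = sum(p != positive and g == positive for p, g in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running the same golden dataset through each model and template pair, then comparing the resulting scores, gives a reproducible basis for choosing between them.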

3. Designed for Throughput

Phoenix Evals are designed to run as fast as possible on batches of Eval data and to maximize the throughput and usage of your API key. The current Phoenix library delivers 10x the throughput of the call-by-call approaches integrated into the LLM App Framework Evals.
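The difference between call-by-call and batched evaluation can be sketched with plain `asyncio`. This is not how Phoenix is implemented internally; `stub_eval_call`, `evaluate_batch`, and the `max_concurrency` parameter are hypothetical names, and the network request is simulated with a short sleep.

```python
import asyncio
from typing import List


async def stub_eval_call(prompt: str) -> str:
    """Stand-in for one network-bound request to an evaluation LLM."""
    await asyncio.sleep(0.05)  # simulated latency
    return "relevant" if "phoenix" in prompt.lower() else "irrelevant"


async def evaluate_batch(prompts: List[str], max_concurrency: int = 8) -> List[str]:
    """Run many eval calls concurrently, capped to respect API rate limits."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await stub_eval_call(prompt)

    # asyncio.gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(guarded(p) for p in prompts)))
```

With requests overlapping, total wall-clock time is bounded by the number of waves of `max_concurrency` calls rather than by the number of rows, which is where a batch-oriented design gains over awaiting each call in turn.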

4. Run the Same Evals in Different Environments (Notebooks, Python Pipelines, LangChain/LlamaIndex Callbacks)

Phoenix Evals are designed to run on dataframes, in Python pipelines, or in LangChain and LlamaIndex callbacks. Evals are also supported in Python pipelines for normal LLM deployments that do not use LlamaIndex or LangChain. There is also one-click support for LangChain and LlamaIndex.
Same Eval Harness Different Environment
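One way to picture a single eval harness serving several environments is a plain function that both a batch job and a per-request callback call into. Everything here is hypothetical: `evaluate_record` is a keyword-matching stub standing in for an LLM-backed eval, and `EvalCallback` does not correspond to any real LangChain, LlamaIndex, or Phoenix class.

```python
from typing import Dict, List

Record = Dict[str, str]


def evaluate_record(record: Record) -> str:
    """Hypothetical eval; a real one would prompt an evaluation LLM."""
    keywords = {w.strip("?.,!").lower() for w in record["query"].split() if len(w) > 4}
    reference = record["reference"].lower()
    return "relevant" if any(k in reference for k in keywords) else "irrelevant"


def evaluate_rows(records: List[Record]) -> List[str]:
    """Offline use: a notebook or pipeline scoring a table of rows."""
    return [evaluate_record(r) for r in records]


class EvalCallback:
    """Online use: a hypothetical hook an application calls after each retrieval."""

    def __init__(self) -> None:
        self.labels: List[str] = []

    def on_retrieval(self, query: str, reference: str) -> None:
        self.labels.append(evaluate_record({"query": query, "reference": reference}))
```

Because both paths share `evaluate_record`, a label produced in a notebook benchmark matches the label the callback would record for the same input in production.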

5. Run Evals on Span and Chain Level

Evals are supported at the span level for LangChain and LlamaIndex.
Running on Spans/Callbacks