Overview: Evals

The standard for evaluating text is human labeling. However, high-quality LLM outputs are becoming cheaper and faster to produce, and human evaluation cannot scale. In this context, evaluating the performance of LLM applications is best tackled by using a separate evaluation LLM. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.

  • Simple callback system integration for applying Evals to spans in LangChain and LlamaIndex

  • Support for one-click explanations

  • Fast on batches: async, concurrent API calls with built-in rate-limit management

  • Support for custom datasets and custom Eval creation

  • Pre-tested Evaluations with model benchmarks

  • Extensive support for RAG Evals: Benchmarking scripts, retrieval Evals and citation Evals

The Problem with LLM Evaluations

  1. Most evaluation libraries do not apply the benchmarking rigor necessary for production environments. Production LLM Evals need to benchmark both a model and a prompt template (e.g., OpenAI's model Evals focus only on evaluating the model, which is a different use case).

  2. Evals are typically difficult to integrate across benchmarking, development, production, and the LangChain/LlamaIndex callback systems. Evals should process batches of data with optimal speed.

  3. Many tools force the use of chain abstractions (e.g., LangChain shouldn't be a prerequisite for obtaining evaluations on pipelines that don't use it).

Our Solution: Phoenix LLM Evals

1. Support for Pre-Tested Eval Templates & Custom Eval Templates

Phoenix provides pre-tested Eval templates and convenience functions for a set of common Eval “tasks”. Learn more about pre-tested templates here. The library is split into high-level functions that run rigorously pre-tested Evals out of the box and building blocks for modifying templates or creating your own Evals.
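To make the split concrete, here is a minimal sketch of running the pre-tested hallucination Eval on a small batch of records. It assumes the `phoenix.evals` module layout with `llm_classify`, `OpenAIModel`, `HALLUCINATION_PROMPT_TEMPLATE`, and `HALLUCINATION_PROMPT_RAILS_MAP`; import paths and parameter names (e.g., `model` vs. `model_name` on `OpenAIModel`) vary across Phoenix versions, so check your installed release.

```python
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# A small batch of records to evaluate. Column names must match the
# template's variables (the hallucination template expects input,
# reference, and output).
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital of France."],
        "output": ["The capital of France is Paris."],
    }
)

# Rails constrain the evaluation LLM's answer to the template's labels.
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

eval_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),       # the evaluation LLM
    template=HALLUCINATION_PROMPT_TEMPLATE,  # pre-tested template
    rails=rails,
    provide_explanation=True,                # one-click explanations
)
print(eval_df[["label", "explanation"]])
```

A custom Eval uses the same call with your own template string and rails in place of the pre-tested ones.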

2. Data Science Rigor when Benchmarking Evals for Reproducible Results

The Phoenix team is dedicated to testing model and template combinations and is continually improving templates for optimized performance. Find the most up-to-date templates on GitHub.
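To reproduce this kind of benchmarking on your own data, one hedged approach is to run an Eval over a golden dataset that already carries human labels and compare the two label columns. The file `golden_hallucination.csv` and its `human_label` column below are hypothetical placeholders (the labels are assumed to use the same strings as the template's rails), and the comparison uses scikit-learn rather than any Phoenix-specific helper.

```python
import pandas as pd
from sklearn.metrics import classification_report

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Hypothetical golden dataset: rows already labeled by humans, with a
# "human_label" column holding the ground-truth label for each row.
golden_df = pd.read_csv("golden_hallucination.csv")

# Run the model + template combination under test over the golden dataset.
eval_df = llm_classify(
    dataframe=golden_df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
)

# Precision, recall, and F1 of the LLM judge against the human labels
# indicate whether this combination is trustworthy enough for production.
print(classification_report(golden_df["human_label"], eval_df["label"]))
```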

3. Designed for Throughput

Phoenix Evals are designed to run as fast as possible on batches of Eval data, maximizing the throughput and usage of your API key. The current Phoenix library is 10x faster in throughput than call-by-call approaches integrated into LLM app framework Evals.
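One practical detail when chasing this throughput in a notebook: Phoenix issues its API calls asynchronously, and a notebook already runs an event loop, so the usual recipe is to patch the loop with `nest_asyncio` before running batch Evals. This is a small setup sketch, not a Phoenix-specific API.

```python
# In notebooks, allow Phoenix's async, concurrent Eval calls to run on the
# already-active event loop. Plain Python pipelines do not need this step.
import nest_asyncio

nest_asyncio.apply()
```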

4. Run the Same Evals in Different Environments (Notebooks, Python Pipelines, LangChain/LlamaIndex Callbacks)

Phoenix Evals are designed to run on dataframes, in Python pipelines, or in LangChain and LlamaIndex callbacks, with one-click support for the latter two. The same Evals also run in plain Python pipelines for LLM deployments that use neither LlamaIndex nor LangChain; the sketch below illustrates that framework-independent path.
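As an illustration of that framework independence, the sketch below wires the same dataframe-based Eval into an ordinary Python pipeline with no LangChain or LlamaIndex involved. `generate_answer` is a hypothetical stand-in for whatever LLM call your pipeline already makes.

```python
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)


def generate_answer(question: str, context: str) -> str:
    """Hypothetical stand-in for your pipeline's existing LLM call."""
    return "The capital of France is Paris."


# Collect the pipeline's inputs and outputs into a dataframe; no chain
# abstraction is required anywhere in this flow.
records = [
    {
        "input": "What is the capital of France?",
        "reference": "Paris is the capital of France.",
    }
]
for record in records:
    record["output"] = generate_answer(record["input"], record["reference"])

# Run the same Eval that would run in a notebook or behind a callback.
eval_df = llm_classify(
    dataframe=pd.DataFrame(records),
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
)
```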

5. Run Evals at the Span and Chain Level

Evals are supported at the span level for LangChain and LlamaIndex, as sketched below.
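For concreteness, here is a hedged sketch of the span-level flow: export the question/answer/context spans that the callback instrumentation captured into a dataframe, run an Eval over them, and log the results back so they appear alongside the traces in Phoenix. Helpers such as `get_qa_with_reference` and `SpanEvaluations` exist in recent Phoenix releases, but import paths and signatures have shifted across versions, so verify them against your installed version.

```python
import phoenix as px

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.session.evaluation import get_qa_with_reference
from phoenix.trace import SpanEvaluations

# Export spans captured by the LangChain or LlamaIndex callback
# instrumentation from the running Phoenix session; the resulting
# dataframe is indexed by span id.
client = px.Client()
queries_df = get_qa_with_reference(client)

# Run the Eval over the exported spans.
eval_df = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# Attach the results to the spans so they show up in the Phoenix UI.
client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=eval_df)
)
```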
