Phoenix LLM Evals
- 1. Most evaluation libraries do not apply the benchmarking rigor necessary for production environments. Production LLM Evals need to benchmark both a model and a prompt template (e.g. OpenAI's model Evals focus only on evaluating the model, which is a different use case).
- 2. Most libraries are difficult to integrate across benchmarking, development, production, and the LangChain/LlamaIndex callback systems. Evals should process batches of data with optimal speed.
- 3. Many libraries require chain abstractions (i.e. LangChain should not be a prerequisite for obtaining evaluations on pipelines that don't use it).
Phoenix evals are designed to run as fast as possible on batches of eval data and to maximize the throughput and usage of your API key. The current Phoenix library delivers roughly 10x higher throughput than the call-by-call approaches built into LLM app framework Evals.
Phoenix Evals are designed to run on dataframes, in Python pipelines, or in LangChain and LlamaIndex callbacks. Python pipelines for LLM deployments that do not use LlamaIndex or LangChain are fully supported, and one-click integrations are available for LangChain and LlamaIndex.
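The sketch below shows what a batched dataframe eval can look like. It assumes the `phoenix.evals` interface (`llm_classify`, `OpenAIModel`, and the built-in hallucination template and rails); the column values and model name are illustrative, not prescribed.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# A batch of eval data: each row is one LLM call to be judged.
# The hallucination template expects "input", "reference", and "output" columns.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source LLM observability library."],
        "output": ["Phoenix is an observability library for LLM applications."],
    }
)

# Allowed output labels for the classifier (e.g. "factual" / "hallucinated").
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

# llm_classify evaluates the whole dataframe as a batch of concurrent requests,
# rather than one blocking call per row.
eval_df = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
)

print(eval_df[["label", "explanation"]])
```

The same call works whether the dataframe comes from an offline benchmark, a production log export, or spans collected by the callback integrations.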
Same Eval Harness, Different Environment
Evals are supported at the span level for LangChain and LlamaIndex.
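As a sketch of span-level evaluation, the example below pulls Q&A spans recorded by the LangChain/LlamaIndex callbacks out of a running Phoenix instance, scores them with the same eval harness, and logs the results back onto the spans. It assumes the `phoenix.Client`, the `get_qa_with_reference` helper, and the `SpanEvaluations` container; exact names and signatures may differ between Phoenix versions.

```python
import phoenix as px
from phoenix.trace import SpanEvaluations
from phoenix.session.evaluation import get_qa_with_reference
from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Pull question/answer spans (with retrieved reference text) into a
# dataframe indexed by span id.
client = px.Client()
qa_df = get_qa_with_reference(client)

# Run the same batched eval harness over the span dataframe.
qa_eval_df = llm_classify(
    dataframe=qa_df,
    template=QA_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=list(QA_PROMPT_RAILS_MAP.values()),
)

# Attach the eval results back to the traced spans in Phoenix.
client.log_evaluations(
    SpanEvaluations(eval_name="Q&A Correctness", dataframe=qa_eval_df)
)
```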
Running on Spans/Callbacks
How evals work