Phoenix: AI Observability & Evaluation
Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.
Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting.
The toolset ingests inference data for LLM, CV, NLP, and tabular models, as well as LLM traces. It lets AI Engineers and Data Scientists quickly visualize their data, evaluate performance, track down issues, and export insights for improvement.
Running Phoenix for the first time? Select a quickstart below.
Don't know which one to choose? Phoenix has two main data ingestion methods:
LLM Traces: Phoenix sits on top of trace data generated by frameworks such as LlamaIndex and LangChain. The typical use case is troubleshooting LLM applications, including agentic workflows.
Inferences: Phoenix troubleshoots models whose datasets can be expressed as Python DataFrames, such as LLM applications built in Python, CV, NLP, and tabular models (see the sketch below).
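For orientation, here is a minimal sketch of the inference path: describing a DataFrame with a schema and launching the app. Column names are illustrative, and in later Phoenix releases the `px.Dataset` class was renamed `px.Inferences`.

```python
import pandas as pd
import phoenix as px

# Toy predictions DataFrame; in practice this is your model's inference log.
df = pd.DataFrame(
    {
        "prediction": ["fraud", "not_fraud", "fraud"],
        "actual": ["fraud", "fraud", "not_fraud"],
    }
)

# The schema tells Phoenix which columns mean what.
schema = px.Schema(
    prediction_label_column_names="prediction",
    actual_label_column_names="actual",
)

# px.Dataset was renamed px.Inferences in later releases.
dataset = px.Dataset(dataframe=df, schema=schema, name="production")
session = px.launch_app(primary=dataset)  # opens the Phoenix UI
```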
Evaluate Performance of LLM Tasks with Evals Library: Use the Phoenix Evals library to easily evaluate tasks such as hallucination, summarization, and retrieval relevance, or create your own custom template.
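A hedged sketch of the built-in hallucination eval is below. The import path has moved between `phoenix.experimental.evals` and `phoenix.evals` across releases, and `OpenAIModel` assumes an `OPENAI_API_KEY` in your environment.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The hallucination template reads "input", "reference", and "output" columns.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source observability library."],
        "output": ["Phoenix is a closed-source database."],
    }
)

# Rails constrain the LLM judge's answer to the template's allowed labels.
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4"),  # older versions use model_name=
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,
)
print(results["label"])  # one label per row, e.g. "hallucinated" / "factual"
```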
Troubleshoot Agentic Workflows: Get visibility into where your complex or agentic workflow broke, or find performance bottlenecks, across different span types with LLM Tracing.
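As an illustration, tracing a LangChain application can be as small as the sketch below; the instrumentor import path has varied across Phoenix versions, and LlamaIndex has an analogous instrumentor.

```python
import phoenix as px
from phoenix.trace.langchain import LangChainInstrumentor

px.launch_app()  # start the local Phoenix collector and UI
LangChainInstrumentor().instrument()  # LangChain runs now emit trace spans

# ... run your chains or agents as usual; spans, latencies, and errors
# for each step of the workflow appear in the Phoenix UI ...
```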
Optimize Retrieval Systems: Identify when context is missing from your knowledge base or irrelevant context is retrieved by visualizing query embeddings alongside knowledge base embeddings with RAG Analysis.
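A sketch of that setup under assumed column names (query_embedding, chunk_embedding, and friends are hypothetical; in the versions we have seen, the knowledge base is passed as a corpus dataset whose schema uses document_column_names, and `px.Dataset` may be `px.Inferences`):

```python
import pandas as pd
import phoenix as px

# Hypothetical query log and knowledge-base chunks with precomputed embeddings.
query_df = pd.DataFrame(
    {
        "query_text": ["How do I reset my password?"],
        "query_embedding": [[0.1, 0.3, 0.5]],
    }
)
corpus_df = pd.DataFrame(
    {
        "chunk_text": ["To reset your password, open the settings page."],
        "chunk_embedding": [[0.1, 0.2, 0.5]],
    }
)

query_schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="query_embedding",
        raw_data_column_name="query_text",
    ),
)
corpus_schema = px.Schema(
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="chunk_embedding",
        raw_data_column_name="chunk_text",
    ),
)

# Queries and knowledge base render in the same embedding view in the UI.
px.launch_app(
    primary=px.Dataset(dataframe=query_df, schema=query_schema, name="queries"),
    corpus=px.Dataset(dataframe=corpus_df, schema=corpus_schema, name="kb"),
)
```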
Compare Model Versions: Compare and evaluate performance across model versions prior to deploying to production.
Exploratory Data Analysis: Connect teams and workflows by continuing analysis of production data from Arize in a notebook environment, for example in fine-tuning workflows.
Find Clusters of Issues to Export for Model Improvement: Find clusters of problems using performance metrics or drift. Export clusters for retraining workflows.
Surface Model Drift and Multivariate Drift: Use the Embeddings Analyzer to surface data drift for computer vision, NLP, and tabular models.
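The version-comparison, cluster-export, and drift workflows above share one pattern: launch Phoenix with a primary dataset (production data or a candidate model version) and a reference dataset (training data or a baseline version), then explore drift and clusters in the UI. A minimal sketch, with hypothetical data and the same schema style as the earlier example:

```python
import pandas as pd
import phoenix as px

schema = px.Schema(
    prediction_label_column_names="prediction",
    actual_label_column_names="actual",
)

# Hypothetical baseline (reference) and candidate (primary) inference logs.
baseline_df = pd.DataFrame({"prediction": ["a", "b"], "actual": ["a", "b"]})
candidate_df = pd.DataFrame({"prediction": ["a", "a"], "actual": ["a", "b"]})

session = px.launch_app(
    primary=px.Dataset(dataframe=candidate_df, schema=schema, name="candidate"),
    reference=px.Dataset(dataframe=baseline_df, schema=schema, name="baseline"),
)

# After selecting and exporting a problem cluster in the UI, the exported
# rows can be pulled back into the notebook for retraining workflows
# (the exact export API surface varies by Phoenix version), e.g.:
# exported_df = session.exports[-1].dataframe
```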
Check out a comprehensive list of example notebooks for LLM Traces, Evals, RAG Analysis, and more.
Learn about best practices, and how to get started with use case examples such as Q&A with Retrieval, Summarization, and Chatbots.
Join the Phoenix Slack community to ask questions, share findings, provide feedback, and connect with other developers.