Learn how to trace your LLM application and run evaluations in Arize
To trace your LLM app and start troubleshooting your LLM calls, you'll need to do the following:
Install our tracing packages
Run the commands below to install our open-source tracing packages, which work on top of OpenTelemetry. The example below uses OpenAI, and we support many other LLM providers as well.
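For example, a typical install for this OpenAI example looks like the following (the exact package list is a sketch; other providers use the matching openinference-instrumentation-<provider> package):
# Install the Arize OTEL helper, the OpenAI instrumentor, and the OpenTelemetry SDK/exporter
pip install -q arize-otel openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp openai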
Go to your space settings in the left navigation, and you will see your API keys on the right-hand side. You'll need the space key and API key for the next part.
Add our tracing code
The following code snippet showcases how to automatically instrument your OpenAI application.
# Import open-telemetry dependencies
from arize_otel import register_otel, Endpoints
# Setup OTEL via our convenience function
register_otel(
endpoints = Endpoints.ARIZE,
space_key = "your-space-key", # in app space settings page
api_key = "your-api-key", # in app space settings page
model_id = "your-model-id", # name this to whatever you would like
)
# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.openai import OpenAIInstrumentor
# Finish automatic instrumentation
OpenAIInstrumentor().instrument()
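To verify the instrumentation, you can make a single call with the OpenAI client. This is a minimal sketch (the model name and prompt are placeholders) and assumes your OPENAI_API_KEY environment variable is set:
import openai

# Each call made through the instrumented client is exported as a trace to Arize
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about observability."}],
)
print(response.choices[0].message.content)
Now start asking questions to your LLM app and watch the traces being collected by Arize.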
The following code snippet showcases how to automatically instrument your LlamaIndex application.
import os
# Import open-telemetry dependencies
from arize_otel import register_otel, Endpoints
# Setup OTEL via our convenience function
register_otel(
endpoints = Endpoints.ARIZE,
space_key = "your-space-key", # in app Space settings page
api_key = "your-api-key", # in app Space settings page
model_id = "your-model-id",
)
# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
# Finish automatic instrumentation
LlamaIndexInstrumentor().instrument()
To test, you can create a simple RAG application using LlamaIndex.
from gcsfs import GCSFileSystem
from llama_index.core import (
Settings,
StorageContext,
load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
fs=file_system,
persist_dir=index_path,
)
Settings.llm = OpenAI(model="gpt-4-turbo-preview")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
index = load_index_from_storage(
storage_context,
)
query_engine = index.as_query_engine()
response = query_engine.query("What is Arize and how can it help me as an AI Engineer?")
Now start asking questions to your LLM app and watch the traces being collected by Arize.
The following code snippet showcases how to automatically instrument your LangChain application.
import os
# Import open-telemetry dependencies
from arize_otel import register_otel, Endpoints
# Setup OTEL via our convenience function
register_otel(
endpoints = Endpoints.ARIZE,
space_key = "your-space-key", # in app Space settings page
api_key = "your-api-key", # in app Space settings page
model_id = "your-model-id",
)
# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.langchain import LangChainInstrumentor
# Finish automatic instrumentation
LangChainInstrumentor().instrument()
To test, you can create a simple RAG application using LangChain.
import numpy as np

from langchain.chains import RetrievalQA
from langchain.retrievers import KNNRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# df is assumed to be a dataframe of documents with a "text" column and a
# precomputed "text_vector" embedding column
knn_retriever = KNNRetriever(
    index=np.stack(df["text_vector"]),
    texts=df["text"].tolist(),
    embeddings=OpenAIEmbeddings(),
)
chain_type = "stuff" # stuff, refine, map_reduce, and map_rerank
chat_model_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=chat_model_name)
chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type=chain_type,
retriever=knn_retriever,
metadata={"application_type": "question_answering"},
)
response = chain.invoke("What is Arize and how can it help me as an AI Engineer?")
Now start asking questions to your LLM app and watch the traces being collected by Arize.
Run your LLM application
Once you've executed a sufficient number of queries (or chats) to your application, you can view the details on the LLM Tracing page.
Evaluation
Install the Arize SDK
pip install -q 'arize[LLM_Evaluation]'
conda install -c conda-forge arize
Import your spans in code
Once you have traces in Arize, visit the LLM Tracing tab to see your traces and export them in code. Clicking the export button gives you boilerplate code to copy and paste into your evaluator.
# this will be prefilled by the export command.
# Note: This uses a different API Key than the one above.
ARIZE_API_KEY = ''
# import statements required for getting your spans
import os
os.environ['ARIZE_API_KEY'] = ARIZE_API_KEY
from datetime import datetime
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
# Exporting your dataset into a dataframe
client = ArizeExportClient()
primary_df = client.export_model_to_df(
space_id='', # this will be prefilled by export
model_id='', # this will be prefilled by export
environment=Environments.TRACING,
start_time=datetime.fromisoformat(''), # this will be prefilled by export
end_time=datetime.fromisoformat(''), # this will be prefilled by export
)
Run a custom evaluator using Phoenix
Import the functions from our Phoenix library to run a custom evaluation using OpenAI.
import os
from phoenix.evals import OpenAIModel, llm_classify
Ensure your OpenAI API key is set up correctly for your OpenAI model.
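For example, a minimal sketch (assuming you keep the key in an environment variable rather than hard-coding it; the OpenAI client typically reads OPENAI_API_KEY by default):
import os

# Make sure the key is available to the evaluator model
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "your-openai-api-key"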
Create a prompt template for the LLM to judge the quality of your responses. Below is an example which judges the positivity or negativity of the LLM output.
MY_CUSTOM_TEMPLATE = '''
You are evaluating the positivity or negativity of the responses to questions.
[BEGIN DATA]
************
[Question]: {input}
************
[Response]: {output}
[END DATA]
Please focus on the tone of the response.
Your answer must be a single word, either "positive" or "negative"
'''
You can use the code below to check which attributes are in the traces in your dataframe.
primary_df.columns
Use the code below to set the input and output variables needed for the prompt above.
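Below is a minimal sketch of this step. The column names attributes.input.value and attributes.output.value are assumptions based on OpenInference conventions (run primary_df.columns to confirm what your instrumentation emits), and the llm_classify arguments may vary slightly across phoenix versions:
from phoenix.evals import OpenAIModel, llm_classify

# Copy the trace attributes into the {input} and {output} template variables.
# These column names are assumptions; verify them with primary_df.columns.
eval_input_df = primary_df.copy()
eval_input_df["input"] = eval_input_df["attributes.input.value"]
eval_input_df["output"] = eval_input_df["attributes.output.value"]

# Run the custom evaluation; rails restrict the judge to the two allowed labels
evals_df = llm_classify(
    dataframe=eval_input_df,
    template=MY_CUSTOM_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),
    rails=["positive", "negative"],
    provide_explanation=True,
)
evals_df.head()
Depending on your Arize SDK version, the resulting eval columns may need to follow a naming convention such as eval.<name>.label before logging; check the Arize SDK docs for the exact schema expected by log_evaluations below.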
Currently, our evaluations are logged within Arize every 24 hours, and we're working on making them as close to instant as possible! Reach out to support@arize.com if you're having trouble here.
import os
from arize.pandas.logger import Client
API_KEY = os.environ.get("ARIZE_API_KEY")
SPACE_KEY = os.environ.get("ARIZE_SPACE_KEY")
# Initialize the Arize client using the space key and API key from above
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
model_id = "quickstart-llm-tutorial"  # use the same model_id you traced with above
# Set the evals_df to have the correct span ID to log it to Arize
evals_df = evals_df.set_index(primary_df["context.span_id"])
# Use Arize client to log evaluations
response = arize_client.log_evaluations(
dataframe=evals_df,
model_id=model_id,
)
# If successful, the server will return a status_code of 200
if response.status_code != 200:
print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
print(f"✅ You have successfully logged evaluations to Arize")
Next steps
Dive deeper into the following topics to keep improving your LLM application!
Arize works as an OpenTelemetry collector, which means you can configure your own tracer and span processor for more OTEL configurability.
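If you want to wire this up yourself, a rough sketch with the standard OpenTelemetry SDK might look like the following; the endpoint URL and header names here are assumptions for illustration, so confirm them against Arize's OTEL configuration docs before relying on them:
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# NOTE: endpoint and header names are assumptions; verify them in the Arize docs.
exporter = OTLPSpanExporter(
    endpoint="https://otlp.arize.com/v1",
    headers={
        "space_key": os.environ["ARIZE_SPACE_KEY"],
        "api_key": os.environ["ARIZE_API_KEY"],
    },
)
provider = TracerProvider(resource=Resource.create({"model_id": "your-model-id"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)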
Are you coding with JavaScript instead of Python? See our detailed guides with JavaScript examples.
For more examples of instrumenting OpenAI applications, check our documentation.
A detailed view of a trace of a RAG application using LlamaIndex
Notice the variables in brackets for {input} and {output} above. You will need to set those variables appropriately for the dataframe so you can run your custom template. We use OpenInference as a set of conventions (complementary to OpenTelemetry) to trace AI applications, which means the attributes of a trace will differ depending on the provider you are using.
Use the llm_classify function to run the evaluation with your custom template. You will be using the dataframe from the traces you generated above.
If you'd like more information, see our detailed evaluation guides. You can also use our pre-built evaluators for hallucination, toxicity, retrieval, and more.
Export the evals you generated above to Arize using the log_evaluations function in our Python SDK. See our documentation for more information on how to do this.