OpenInference is an open standard that encompasses model inference and LLM application tracing.
The specification encompasses two data models: model inference data and LLM application traces.
The OpenInference data format is designed to provide an open, interoperable data format for model inference files. Our goal is for modern ML systems, such as model servers and ML observability platforms, to interface with each other using a common data format.
The specification defines production inference logs in a way that can be used on top of many file formats, including Parquet, Avro, CSV, and JSON. It will also support future formats such as Lance.
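As a rough illustration of the "same logical columns, many file formats" idea, the sketch below serializes the same records to CSV and to newline-delimited JSON using only the Python standard library. The column names (`prediction_id`, `feature_age`, `prediction`, `actual`) are hypothetical and are not taken from the specification.

```python
import csv
import io
import json

# Hypothetical inference log records; the column names are illustrative only.
records = [
    {"prediction_id": "p-001", "feature_age": 42, "prediction": "approve", "actual": "approve"},
    {"prediction_id": "p-002", "feature_age": 27, "prediction": "deny", "actual": ""},
]


def to_csv(rows):
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json_lines(rows):
    """Serialize records as newline-delimited JSON, one event per line."""
    return "\n".join(json.dumps(row) for row in rows)


csv_text = to_csv(records)
jsonl_text = to_json_lines(records)

# Both encodings carry the same logical columns in the same order.
csv_columns = csv_text.splitlines()[0].split(",")
json_columns = list(json.loads(jsonl_text.splitlines()[0]).keys())
```

Columnar formats such as Parquet or Avro would need a third-party library, but the logical schema would stay the same.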
Inference Table in Inference Store
An inference store is a common approach to storing model inferences, typically in a data lake or data warehouse.
- Text Generative - Prompt and Response
- Text Classification
- NER Span Categorization
- Classification + Score
- Time Series Forecasting
- Bounding Box
In an inference store, the prediction ID is a unique identifier for a model prediction event. The prediction ID ties together the inputs to the model, the model outputs, latently linked ground truth (actuals), metadata (tags), and model internals (embeddings and/or SHAP values).
This section reviews a flat (non-nested) prediction event; the following sections cover how to handle nested structures.
Prediction Inference Event Data
LLM Inference Data
A prediction event can represent a prompt-response pair for LLMs, where the conversation ID maintains the thread of the conversation.
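A minimal sketch of how a conversation ID might thread prompt-response events together is shown below. The field names (`prediction_id`, `conversation_id`, `prompt`, `response`) are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical prompt-response events; each one is a single prediction event.
llm_events = [
    {"prediction_id": "p-1", "conversation_id": "c-1", "prompt": "Hi", "response": "Hello!"},
    {"prediction_id": "p-2", "conversation_id": "c-2", "prompt": "Status?", "response": "All good."},
    {"prediction_id": "p-3", "conversation_id": "c-1", "prompt": "Thanks", "response": "Any time."},
]


def group_by_conversation(events):
    """Collect events into threads keyed by conversation ID, preserving arrival order."""
    threads = defaultdict(list)
    for event in events:
        threads[event["conversation_id"]].append(event)
    return dict(threads)


threads = group_by_conversation(llm_events)
first_thread_prompts = [event["prompt"] for event in threads["c-1"]]
```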
Core Model Inference Data
The core components of an inference event are:
- Model input (features/prompt)
- Model output (prediction/response)
- Ground truth (actuals or latent actuals)
- Model ID
- Model Version
- Conversation ID
Additional data may include:
- SHAP values
- Raw links to data
- Bounding boxes
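The components listed above could be modeled as a single record type. The following is a sketch only: the class name, field names, and the `feature.` / `shap.` column prefixes are assumptions, not part of the specification.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class InferenceEvent:
    """Hypothetical container for one flat inference event."""

    # Core components
    prediction_id: str
    model_id: str
    model_version: str
    features: Dict[str, Any]                 # model input (features/prompt)
    prediction: Any                          # model output (prediction/response)
    actual: Optional[Any] = None             # ground truth, possibly arriving later
    conversation_id: Optional[str] = None
    # Additional data
    shap_values: Dict[str, float] = field(default_factory=dict)
    raw_data_link: Optional[str] = None
    bounding_boxes: List[List[float]] = field(default_factory=list)

    def to_flat_row(self) -> Dict[str, Any]:
        """Flatten the event into a single row with prefixed column names."""
        row = {
            "prediction_id": self.prediction_id,
            "model_id": self.model_id,
            "model_version": self.model_version,
            "prediction": self.prediction,
            "actual": self.actual,
        }
        for name, value in self.features.items():
            row[f"feature.{name}"] = value
        for name, value in self.shap_values.items():
            row[f"shap.{name}"] = value
        return row


event = InferenceEvent(
    prediction_id="p-001",
    model_id="loan-default",
    model_version="v2",
    features={"loan_amount": 12000},
    prediction="approve",
    shap_values={"loan_amount": 0.31},
)
row = event.to_flat_row()
```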
The fundamental storage unit in an inference store is an inference event. These events are stored in groups that are logically separated by model ID, model version and environment.
Model Data and Version
Environment describes where the model is running. For example, we use the environments training, validation/test, and production to describe the different places a model runs.
The production environment is commonly a streaming-like environment. It is streaming in the sense that a production dataset has no beginning or end, and data can be added to it continuously. In most production use cases, data is added in small mini-batches or in real time, event by event.
The training and validation environments are commonly used to send data in batches. These batches define a group of data for analysis purposes. It’s common in validation/test and training to have the timestamp be optional.
Note: historical backtesting comparisons on time series data can require timestamps in training and validation that are set explicitly rather than taken from the time the model ran.
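The environment distinction can be sketched as an enumeration plus a simple timestamp rule. The text above only says that timestamps are commonly optional in training and validation/test; treating them as required in production is an assumption made for this example.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Environment(Enum):
    TRAINING = "training"
    VALIDATION = "validation"
    PRODUCTION = "production"


def timestamp_is_acceptable(environment: Environment, timestamp: Optional[datetime]) -> bool:
    """Assumed rule: production events need a timestamp, batch environments do not."""
    if environment is Environment.PRODUCTION:
        return timestamp is not None
    return True


now = datetime.now(timezone.utc)
production_ok = timestamp_is_acceptable(Environment.PRODUCTION, now)
production_missing = timestamp_is_acceptable(Environment.PRODUCTION, None)
training_missing = timestamp_is_acceptable(Environment.TRAINING, None)
```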
The model ID is a unique human readable identifier for a model within a workspace - it completely separates the model data between logical instances.
The model version is a logical separator for metrics and analysis used to look at different builds of a model. A model version can capture common changes such as weight updates and feature additions.
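Since events are logically separated by model ID, model version, and environment, one way to picture the store is as a mapping keyed by that triple. This is a sketch with hypothetical field names, not a description of how any particular inference store is implemented.

```python
from collections import defaultdict

# Hypothetical events carrying the three separating keys.
stored_events = [
    {"prediction_id": "p-1", "model_id": "fraud", "model_version": "v1", "environment": "production"},
    {"prediction_id": "p-2", "model_id": "fraud", "model_version": "v2", "environment": "production"},
    {"prediction_id": "p-3", "model_id": "fraud", "model_version": "v2", "environment": "training"},
    {"prediction_id": "p-4", "model_id": "fraud", "model_version": "v2", "environment": "production"},
]


def partition_events(events):
    """Group events by (model ID, model version, environment)."""
    groups = defaultdict(list)
    for event in events:
        key = (event["model_id"], event["model_version"], event["environment"])
        groups[key].append(event)
    return dict(groups)


groups = partition_events(stored_events)
v2_production_ids = [e["prediction_id"] for e in groups[("fraud", "v2", "production")]]
```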
Unlike infrastructure observability, the inference store needs some mutability: there must be a way to add or update ground truth for a prediction event.
Ground truth is required in the data in order to analyze performance metrics such as precision, recall, AUC, LogLoss, and Accuracy.
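To make the dependency on ground truth concrete, the sketch below computes precision, recall, and accuracy for a binary classifier from (prediction, actual) pairs. Skipping events whose ground truth has not yet arrived is an assumption of this example; the labels are made up.

```python
def classification_metrics(pairs, positive_label):
    """Compute precision, recall, and accuracy from (predicted, actual) pairs.

    Pairs whose actual is None (ground truth not yet available) are skipped.
    """
    tp = fp = fn = tn = 0
    for predicted, actual in pairs:
        if actual is None:
            continue
        if predicted == positive_label and actual == positive_label:
            tp += 1
        elif predicted == positive_label:
            fp += 1
        elif actual == positive_label:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


pairs = [
    ("fraud", "fraud"),          # true positive
    ("fraud", "not_fraud"),      # false positive
    ("not_fraud", "fraud"),      # false negative
    ("not_fraud", "not_fraud"),  # true negative
    ("fraud", "fraud"),          # true positive
    ("not_fraud", None),         # no ground truth yet, skipped
]
metrics = classification_metrics(pairs, positive_label="fraud")
```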
Latent ground truth data may need to be “joined” to a prediction ID to enable performance visualization. In Phoenix, the library requires ground truth to be pre-joined to prediction data. In an ML Observability system such as Arize the joining of ground truth is typically done by the system itself.
Latent Ground Truth
The above image shows a common use case in ML Observability in which latent ground truth is received by a system and linked back to the original prediction based on a prediction ID.
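Linking latent ground truth back to predictions amounts to a keyed update on the prediction ID. The sketch below assumes an in-memory dictionary as the store and made-up IDs; a real system would do the same join in a warehouse or streaming layer.

```python
# Hypothetical store keyed by prediction ID; actuals are unknown at prediction time.
prediction_store = {
    "p-1": {"prediction": "approve", "actual": None},
    "p-2": {"prediction": "deny", "actual": None},
}

# Ground truth arriving later, possibly for IDs the store has never seen.
late_actuals = [("p-1", "approve"), ("p-3", "deny")]


def join_actuals(store, actuals):
    """Attach each actual to its prediction; return the IDs that could not be matched."""
    unmatched = []
    for prediction_id, actual in actuals:
        event = store.get(prediction_id)
        if event is None:
            unmatched.append(prediction_id)
            continue
        event["actual"] = actual  # add or overwrite the ground truth
    return unmatched


unmatched_ids = join_actuals(prediction_store, late_actuals)
```

Returning the unmatched IDs rather than silently dropping them makes it easier to spot ground truth that arrived for predictions that were never logged.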
Latent Metadata (Tags)
In addition to ground truth, latent metadata also needs to be linked to a prediction ID. Latent metadata can be critical for analyzing model results, because it adds data tags tied to the original prediction ID.
Examples of Metadata (Tags):
- Loan default amount
- Loan status
- Revenue from conversion or click
- Server region
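Once latent tags are attached to their prediction IDs, results can be sliced by tag value. The following sketch uses hypothetical tag names based on the examples above and an in-memory dictionary as the store.

```python
from collections import Counter

# Hypothetical events that already carry one tag known at prediction time.
tagged_events = {
    "p-1": {"prediction": "approve", "tags": {"server_region": "us-east"}},
    "p-2": {"prediction": "approve", "tags": {"server_region": "eu-west"}},
}

# Tags that only become known after the prediction was made.
latent_tags = [
    ("p-1", {"loan_status": "defaulted", "loan_default_amount": 5000}),
    ("p-2", {"loan_status": "repaid"}),
]


def attach_tags(store, tag_updates):
    """Merge late-arriving tags into the tags already stored for each prediction."""
    for prediction_id, tags in tag_updates:
        if prediction_id in store:
            store[prediction_id]["tags"].update(tags)


attach_tags(tagged_events, latent_tags)

# Slice the predictions by a latent tag.
status_counts = Counter(
    event["tags"].get("loan_status", "unknown") for event in tagged_events.values()
)
```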
Image bounding boxes, NLP NER, and image segmentation
The above picture shows how a nested set of detections can occur for a single image in the prediction body with bounding boxes within the image itself.
A model may have multiple inputs with different embeddings and images for each generating a prediction class. An example might be an insurance claim event with multiple images and a single prediction estimate for the claim.
The above prediction shows hierarchical data. The current version of Phoenix is designed to ingest a flat structure so teams will need to flatten the above hierarchy. An example of flattening is below.
Hierarchical Data Flattened
The example above shows an exploded representation of the hierarchical data.
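One possible way to explode a nested prediction into flat rows is sketched below: each detection becomes its own row, and the parent fields are repeated on every row. The nested layout, the file path, and the flattened column names are all assumptions for illustration, since the text does not fix a flattened schema.

```python
# Hypothetical nested prediction: one image with several detections.
nested_event = {
    "prediction_id": "p-100",
    "image_path": "images/example.jpg",  # made-up path
    "detections": [
        {"label": "car", "score": 0.91, "box": [10, 20, 110, 220]},
        {"label": "person", "score": 0.78, "box": [150, 40, 200, 180]},
    ],
}


def explode_detections(event):
    """Return one flat row per detection, repeating the parent fields on each row."""
    parent = {key: value for key, value in event.items() if key != "detections"}
    rows = []
    for index, detection in enumerate(event["detections"]):
        x_min, y_min, x_max, y_max = detection["box"]
        row = dict(parent)
        row.update(
            {
                "detection_index": index,
                "label": detection["label"],
                "score": detection["score"],
                "box_x_min": x_min,
                "box_y_min": y_min,
                "box_x_max": x_max,
                "box_y_max": y_max,
            }
        )
        rows.append(row)
    return rows


flat_rows = explode_detections(nested_event)
```

Keeping `prediction_id` on every row means the flattened rows can still be grouped back into the original event.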
OpenInference Tracing provides a detailed and holistic view of the operations happening within an LLM application. It offers a way to understand the "path" or journey a request takes from start to finish, helping with debugging, performance optimization, and ensuring the smooth flow of operations. Tracing takes advantage of two key components to instrument your code.
1. Tracer: Responsible for creating spans that contain information about various operations.
2. Trace Exporters: Responsible for sending the generated traces to consumers, which can be standard output for debugging or an OpenInference collector such as Phoenix.
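The division of labor between a tracer and an exporter can be sketched in plain Python. The `Tracer` and `ListExporter` classes below are hypothetical stand-ins written for this example; they are not the OpenInference or Phoenix API.

```python
import time
import uuid
from contextlib import contextmanager


class ListExporter:
    """Hypothetical exporter that keeps finished spans in memory.

    A debugging exporter could print each span instead, and a collector
    exporter would send it over the network.
    """

    def __init__(self):
        self.spans = []

    def export(self, span):
        self.spans.append(span)


class Tracer:
    """Hypothetical tracer that creates spans and hands them to an exporter."""

    def __init__(self, exporter):
        self.exporter = exporter
        self.trace_id = uuid.uuid4().hex

    @contextmanager
    def start_span(self, name):
        span = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex[:16],
            "name": name,
            "start_time": time.time(),
            "attributes": {},
        }
        try:
            yield span
        finally:
            # The span is exported when the step finishes, even if it raised.
            span["end_time"] = time.time()
            self.exporter.export(span)


exporter = ListExporter()
tracer = Tracer(exporter)
with tracer.start_span("retrieve documents") as span:
    span["attributes"]["document_count"] = 3
```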
OpenInference traces are built on top of a unit of work called a span. A span keeps track of how long the execution of a given LLM application step takes and can also store important information about the step in the form of attributes. At a high level, a span has:
1. Span Context: Contains the trace ID (representing the trace the span belongs to) and the span's ID.
2. Attributes: Key-value pairs containing metadata to annotate a span. They provide insights about the operation being tracked. Semantic attributes offer standard naming conventions for common metadata.
3. Span Events: Structured log messages on a span, denoting a significant point in time during the span's duration.
4. Span Status: Attached to a span to denote its outcome as Unset, Ok, or Error.
5. Span Kind: Provides a hint on how to assemble the trace. Types include:
   - Chain: Represents the starting point or link between different LLM application steps.
   - Retriever: Represents a data retrieval step.
   - LLM: Represents a call to an LLM.
   - Embedding: Represents a call to an LLM for embedding.
   - Tool: Represents a call to an external tool.
   - Agent: Encompasses calls to LLMs and Tools, describing a reasoning block.
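The five parts of a span listed above map naturally onto a small set of types. This is an illustrative sketch: the class names, enum string values, and attribute keys (such as `llm.model_name`) are assumptions, and the specification should be consulted for the actual semantic attribute names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class SpanKind(Enum):
    CHAIN = "CHAIN"
    RETRIEVER = "RETRIEVER"
    LLM = "LLM"
    EMBEDDING = "EMBEDDING"
    TOOL = "TOOL"
    AGENT = "AGENT"


class SpanStatus(Enum):
    UNSET = "UNSET"
    OK = "OK"
    ERROR = "ERROR"


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # the trace this span belongs to
    span_id: str   # this span's own ID


@dataclass
class SpanEvent:
    name: str
    timestamp: float
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Span:
    name: str
    context: SpanContext
    kind: SpanKind
    status: SpanStatus = SpanStatus.UNSET
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[SpanEvent] = field(default_factory=list)


llm_span = Span(
    name="generate answer",
    context=SpanContext(trace_id="trace-1", span_id="span-2"),
    kind=SpanKind.LLM,
    attributes={"llm.model_name": "example-model"},  # hypothetical attribute key
)
llm_span.events.append(SpanEvent(name="first token received", timestamp=0.42))
llm_span.status = SpanStatus.OK
```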
OpenInference Tracing offers a comprehensive view of the inner workings of an LLM application. By breaking the process down into spans and categorizing each span, it gives a clear picture of the operations and their interrelations, making troubleshooting and optimization easier and more effective. For the full details of OpenInference tracing, please consult the specification.