Quickstart: Prompt Engineering
Copyright © 2023 Arize AI, Inc
This Quickstart guide walks you through the following steps:
Find problematic production examples and pull them into the playground for debugging.
Improve your model and prompt template to address identified issues.
Test the updated prompt on a dataset of examples.
Save the updated prompt as an experiment and evaluate aggregate metrics to share with your team.
Publish the updated prompt to the Prompt Hub to update your production application.
The Prompt Playground provides an interface for experimenting with prompt templates, input variables, LLMs, and model parameters. Because it requires no code, both developers and non-developers can use it to refine prompts for production applications.
The most common way to enter the Prompt Playground is through a span on the LLM tracing page. For instance, users can filter spans where an Online Evaluator flagged the LLM output as a hallucination and then bring one of these examples into the Prompt Playground to refine the prompt, ensuring the LLM produces factual responses in the future.
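The filtering step described above happens in the tracing UI, but the underlying idea can be sketched in a few lines of Python. This is only an illustration: the span records and the field names (`span_id`, `output`, `eval_label`) are hypothetical and do not reflect the actual Arize span schema.

```python
# Hypothetical span records. The field names are illustrative only and are
# not the real Arize span schema.
spans = [
    {"span_id": "a1", "output": "Paris is the capital of France.", "eval_label": "factual"},
    {"span_id": "b2", "output": "The warranty lasts 10 years.", "eval_label": "hallucinated"},
    {"span_id": "c3", "output": "Refunds take 3 days.", "eval_label": "hallucinated"},
]


def flagged_spans(spans, label="hallucinated"):
    """Return the spans whose evaluator label equals `label`."""
    return [span for span in spans if span.get("eval_label") == label]


# Candidates to bring into the playground for debugging.
candidates = flagged_spans(spans)
print([span["span_id"] for span in candidates])
```

Any one of the returned spans would be a reasonable starting point for the debugging session described next.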
With the span loaded in the Playground, the user can iterate on the prompt to reduce the likelihood of hallucinations. One effective approach is to adjust the model and its parameters. In this example, the user reduces the model's temperature and upgrades to GPT-4o for improved accuracy and reliability.
Note that the Prompt Playground supports all major model providers and allows integration with custom model endpoints.
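To make the parameter change concrete, the sketch below models the "before" and "after" settings as plain Python data. The `ModelConfig` class, the `changed_fields` helper, the placeholder name `previous-model`, and the specific temperature values are all assumptions made for illustration; the guide does not state the original model or the exact temperatures, and in the Playground these settings are changed through the UI.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the settings adjusted in the playground."""
    model: str
    temperature: float


# Assumed starting point: the original model and temperature are placeholders.
original = ModelConfig(model="previous-model", temperature=0.7)

# Lower the temperature and switch to GPT-4o, as in the example above.
updated = replace(original, model="gpt-4o", temperature=0.1)


def changed_fields(before, after):
    """Map each changed field name to its (before, after) values."""
    b, a = asdict(before), asdict(after)
    return {key: (b[key], a[key]) for key in b if b[key] != a[key]}


print(changed_fields(original, updated))
```

Keeping the two configurations side by side makes it easier to attribute any change in output to a specific setting.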
Another approach to reducing hallucinations is modifying the template. Using Copilot, the user optimizes the prompt, instructing the LLM to respond with 'I don’t know' when the answer is not found in the provided context. After pressing 'Run' with the updated prompt template, the New Output confirms that the LLM now responds with 'I don’t know' instead of generating a fabricated answer.
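A template with this kind of fallback instruction might look like the following sketch. The wording of the template and the variable names `context` and `question` are assumptions; the actual template produced by Copilot is not shown in this guide.

```python
# Illustrative template. The wording and the variable names are assumptions,
# not the template generated in the example above.
TEMPLATE = (
    "Answer the question using only the context below.\n"
    "If the answer is not in the context, reply exactly: I don't know\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def render(template, **variables):
    """Fill the template's named placeholders with the given variables."""
    return template.format(**variables)


prompt = render(
    TEMPLATE,
    context="Our store is open Monday through Friday.",
    question="What is the return policy?",
)
print(prompt)
```

Here the context says nothing about returns, so a model that follows the instruction should answer "I don't know" rather than invent a policy.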
While we have observed improved performance on a single example, how can we ensure consistent improvement across all hallucinated spans? To validate that the new prompt effectively reduces hallucinations more broadly, we can load a dataset of hallucinated examples into the Prompt Playground and test the updated prompt against the entire dataset.
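Conceptually, testing against a dataset means rendering the same template once per example and collecting the outputs. The sketch below assumes a hypothetical dataset layout and uses `stub_llm`, a stand-in that always returns a fixed string, in place of a real model call; in the Playground this loop is run for you.

```python
# Illustrative template; the variable names are assumptions.
TEMPLATE = (
    "Answer using only the context. If the answer is missing, "
    "reply exactly: I don't know\n\nContext: {context}\nQuestion: {question}"
)

# Hypothetical dataset of previously hallucinated examples.
dataset = [
    {"id": "ex-1", "variables": {"context": "Open Mon-Fri.", "question": "Return policy?"}},
    {"id": "ex-2", "variables": {"context": "Ships in 2 days.", "question": "Warranty length?"}},
]


def stub_llm(prompt):
    """Stand-in for a real model call; always returns the fallback answer."""
    return "I don't know"


def run_over_dataset(dataset, template, llm):
    """Render the template for every row and record the model output."""
    results = []
    for row in dataset:
        prompt = template.format(**row["variables"])
        results.append({"id": row["id"], "prompt": prompt, "output": llm(prompt)})
    return results


results = run_over_dataset(dataset, TEMPLATE, stub_llm)
for result in results:
    print(result["id"], "->", result["output"])
```

Passing the model in as a callable keeps the loop independent of any particular provider, so the stub can be swapped for a real client without changing `run_over_dataset`.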
Having observed improvements across the entire dataset, the next step is to save the output for team review and collaboration. The output is saved as an experiment, allowing comparison with other templates. Evaluation metrics are computed and aggregated for each experiment, offering a quantitative analysis to complement the qualitative review of the new outputs.
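As a rough illustration of what an aggregated metric looks like, the sketch below computes a per-experiment hallucination rate from labeled rows. The row format, the experiment names, and the labels are made-up sample data, not results from a real experiment, and the platform computes its own metrics.

```python
from collections import defaultdict

# Made-up sample rows: one evaluation label per example per experiment.
rows = [
    {"experiment": "baseline", "label": "hallucinated"},
    {"experiment": "baseline", "label": "hallucinated"},
    {"experiment": "baseline", "label": "hallucinated"},
    {"experiment": "baseline", "label": "factual"},
    {"experiment": "updated", "label": "factual"},
    {"experiment": "updated", "label": "factual"},
    {"experiment": "updated", "label": "hallucinated"},
    {"experiment": "updated", "label": "factual"},
]


def hallucination_rate(rows):
    """Return the fraction of rows labeled 'hallucinated' for each experiment."""
    flags = defaultdict(list)
    for row in rows:
        flags[row["experiment"]].append(row["label"] == "hallucinated")
    return {name: sum(values) / len(values) for name, values in flags.items()}


rates = hallucination_rate(rows)
print(rates)
```

A single number per experiment makes it straightforward to compare templates, while the individual rows remain available for qualitative review.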
The template can also be saved to the Prompt Hub, making it especially valuable for production use cases and collaboration. Refer to the next section for details.