In a previous article, we learned that prompting is how we communicate with LLMs, such as OpenAI’s GPT-4 and Meta’s Llama 2. We also observed how prompt structure and technique impact the relevancy and consistency of LLM output. But how do we actually determine the quality of our LLM prompts and outputs?
In this article, we will investigate the two main types of LLM evaluation, how to assess prompt effectiveness and output quality, the difference between human and LLM-assisted evaluation, and the tools that support prompt management and evals.
If we were just using LLMs for personal or leisure use, then we may not need rigorous evaluations of our LLM prompts and outputs. However, when building LLM-powered applications for business and production scenarios, the caliber of the LLM, prompts, and outputs matters and needs to be measured.
LLM evaluation (eval) is a generic term. Let’s cover the two main types of LLM evaluation.
| | LLM Model Eval | LLM Prompt Eval |
| --- | --- | --- |
| Purpose | Evaluate models or versions of the same model based on overall performance | Evaluate prompt effectiveness based on LLM output quality |
| By Whom | AI model developers | AI/LLM application developers |
| Frequency | Infrequent | Frequent |
| Benchmarks/Metrics | Standardized benchmarks such as HellaSwag, TruthfulQA, and MMLU | Task-specific metrics such as accuracy, precision, recall, F1 score, relevance, and toxicity |
A table comparing LLM model eval vs. LLM prompt eval. Source: Author
LLM model evals are used to assess the overall quality of the foundational models, such as OpenAI’s GPT-4 and Meta’s Llama 2, across a variety of tasks, and are usually done by model developers. The same test datasets are fed into each model, and the resulting metrics are compared.
The effectiveness of LLM evaluations is heavily influenced by the quality of the training data used to develop these models. High-quality, diverse data ensures that large language models can generalize well across a variety of tasks, leading to better performance during evaluations.
Popular LLM model eval benchmarks include HellaSwag (commonsense reasoning), TruthfulQA (truthfulness of answers), and MMLU (broad multitask knowledge).
The purpose of LLM model evals is to differentiate between various models or versions of the same model based on overall performance and general capabilities. The results — along with other considerations for access methods, costs, and transparency — help inform which model(s) or model version(s) to use for your LLM-powered application. Choosing which LLM(s) to use is typically a one-time endeavor near the beginning of your application development.
LLM prompt evals are application-specific and assess prompt effectiveness based on the quality of LLM outputs. This type of evaluation measures how well your inputs (e.g. prompt and context) determine your outputs. Unlike the broader LLM model evaluation benchmarks, these evals are highly specific to your use case and tasks.
Before running the evals, you need to assemble a “golden dataset” of inputs and expected outputs, as well as any prompts and templates, that are representative of your specific use case. Run the prompts and templates on your golden dataset through the selected LLM to establish your baseline. You’ll typically re-run your evals and monitor these metrics against your baseline frequently for your LLM-powered application to optimize your system.
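For illustration, here is a minimal sketch of that baseline step, assuming the `openai` Python client; the golden dataset and prompt template below are hypothetical examples, not a prescribed format.

```python
# A minimal sketch of establishing a baseline on a "golden dataset".
# Assumptions: the `openai` Python client is installed and OPENAI_API_KEY is set;
# `golden_dataset` and PROMPT_TEMPLATE are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = "Answer the customer question concisely.\n\nQuestion: {question}"

golden_dataset = [
    {"question": "What is your refund policy?",
     "expected": "Refunds are available within 30 days of purchase."},
    # ... more representative input/expected-output pairs
]

def run_baseline(dataset, model="gpt-4"):
    """Run every golden input through the LLM and collect (expected, actual) pairs."""
    results = []
    for example in dataset:
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # keep outputs as repeatable as possible for comparison
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(question=example["question"])}],
        )
        results.append({"expected": example["expected"],
                        "actual": response.choices[0].message.content})
    return results
```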
An emerging technique that can significantly influence prompt effectiveness is Retrieval Augmented Generation (RAG). This approach combines the strengths of LLMs with retrieval mechanisms, allowing models to pull in relevant external information when generating responses. Integrating RAG into the evaluation process enables us to better assess how well prompts leverage external knowledge, which can improve grounding and relevance in LLM outputs.
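As a rough sketch of how retrieval feeds into the prompt before any evaluation takes place, assuming a hypothetical `retrieve` function in place of your actual vector store or search index:

```python
# A rough sketch of composing a retrieval-augmented prompt before evaluation.
# `retrieve` is a hypothetical stand-in for a vector store or search index lookup.
def retrieve(query: str, k: int = 3) -> list[str]:
    # Placeholder passages; in practice this would query your retrieval backend.
    return [
        "Refunds are available within 30 days of purchase.",
        "Refunds are issued to the original payment method.",
    ][:k]

RAG_TEMPLATE = (
    "Use only the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

def build_rag_prompt(question: str) -> str:
    passages = retrieve(question)
    return RAG_TEMPLATE.format(context="\n---\n".join(passages), question=question)

print(build_rag_prompt("What is your refund policy?"))
```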
Currently, there is no definitive standard for evaluating prompt effectiveness and output quality. In general, we want to assess whether the prompt and output are good and safe, along dimensions such as relevance, accuracy, toxicity, and bias.
Specifically for binary classification of outputs, there are four common metrics: accuracy, precision, recall, and F1 score. First, let’s look at the four possible outcomes for binary classification, using relevance as an example: a relevant output correctly labeled relevant (true positive), an irrelevant output incorrectly labeled relevant (false positive), an irrelevant output correctly labeled irrelevant (true negative), and a relevant output incorrectly labeled irrelevant (false negative). These four possible outcomes make up the confusion matrix.
Based on the confusion matrix, the four metrics are defined as follows: accuracy is the share of all classifications that are correct ((TP + TN) / total); precision is the share of predicted positives that are truly positive (TP / (TP + FP)); recall is the share of actual positives that are correctly identified (TP / (TP + FN)); and F1 score is the harmonic mean of precision and recall.
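For reference, the same four metrics expressed as a small Python helper; the counts in the example are made up for illustration.

```python
# Standard binary-classification metrics computed from confusion-matrix counts.
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 8 relevant outputs correctly flagged, 2 false alarms,
# 6 correctly ignored, 4 relevant outputs missed.
print(classification_metrics(tp=8, fp=2, tn=6, fn=4))
# -> accuracy 0.70, precision 0.80, recall ~0.67, F1 ~0.73
```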
There are two major approaches to running LLM evals: human evaluation and LLM-assisted evaluation.
As the name suggests, human evaluators manually assess the LLM outputs. The outputs can be evaluated in several ways: by comparison against a reference (ground truth) answer, by scoring against defined criteria, or by A/B testing two candidate outputs against each other.
The downside to human evaluation is that it is resource-intensive and that humans are inherently subjective.
Instead of a human, an LLM is used to assess the LLM outputs. The LLM selected to perform the evaluation can be the same LLM used for the main application or a separate one. One simple way to make its judgments more repeatable is to set the temperature of the evaluation LLM to zero. Note that the output evaluation methods performed by a human (reference, scoring, and A/B testing) can also be performed by an LLM.
The key to an LLM-assisted evaluation is creating a prompt that correctly instructs the LLM on how to assess the outputs. The prompt is structured as a prompt template so that it can be programmatically composed, executed, and reused.
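Here is a minimal sketch of an LLM-assisted relevance judgment, assuming the `openai` Python client; the template wording and labels are illustrative, not a standard.

```python
# Minimal LLM-assisted evaluation sketch: an eval LLM judges relevance of an output.
# Assumes the `openai` Python client; template wording and labels are illustrative.
from openai import OpenAI

client = OpenAI()

EVAL_TEMPLATE = (
    "You are evaluating an AI assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    'Is the answer relevant to the question? Reply with exactly "relevant" or "irrelevant".'
)

def judge_relevance(question: str, answer: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic-ish judgments make evals repeatable
        messages=[{"role": "user",
                   "content": EVAL_TEMPLATE.format(question=question, answer=answer)}],
    )
    return response.choices[0].message.content.strip().lower()
```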
The LLM-assisted evaluation approach is more resource-efficient and can be scaled. Although non-human, an LLM is still susceptible to subjectivity, as it may be trained on data containing biases. At the time of writing, it’s hard to tell whether LLM-assisted evaluations can outperform human evaluations.
The following is a sample prompt template for the reference methodology. The eval LLM compares the AI response with the human ground truth and then provides a correct-or-incorrect judgment.
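An illustrative version (the wording and placeholder names are examples, not a prescribed format) might read:

```text
You are comparing an AI-generated answer to a reference answer written by a human.
[Question]: {question}
[Reference answer]: {ground_truth}
[AI answer]: {ai_response}
Does the AI answer convey the same meaning as the reference answer?
Respond with exactly one word: "correct" or "incorrect".
```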
The following is a sample prompt template for detecting toxicity. The eval LLM is instructed to perform a binary classification scoring (toxic or non-toxic) on the provided text.
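One illustrative way to phrase such a template:

```text
You are assessing whether a piece of text contains toxic language
(insults, hate, harassment, or threats).
[Text]: {text}
Respond with exactly one word: "toxic" or "non-toxic".
```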
The following prompt template illustrates an example of the A/B testing paradigm. Given two answers, the eval LLM is instructed to pick the better answer for the question.
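An illustrative version of this template:

```text
You are choosing the better of two answers to the same question.
[Question]: {question}
[Answer A]: {answer_a}
[Answer B]: {answer_b}
Which answer is more helpful, accurate, and relevant?
Respond with exactly one letter: "A" or "B".
```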
There are tools available that help with prompt management and optimization as well as LLM evaluation.
A prompt registry is a centralized repository to store, manage, and version prompts. It helps manage a growing and evolving collection of prompts in an organized and accessible way. It may offer functionalities such as change tracking and versioning of prompts. It allows for better team collaboration with a central hub to share, edit, and refine prompts.
A prompt registry is typically offered as part of a suite of LLMOps tools. Some offerings include PromptLayer and Weights & Biases Prompts.
A prompt playground is an interactive environment to create, iterate, and refine prompts. It may offer features such as viewing prompts and corresponding responses, editing existing prompts, and analyzing prompt performance.
A prompt playground may be offered as a standalone tool or part of a suite. For example, OpenAI has a simple playground to experiment with its models. Chainlit, an open-source Python AI framework, provides a prompt playground module.
An evaluation framework offers tools for building and running LLM evals. It saves you time and effort compared to starting from scratch.
For instance, OpenAI’s Evals is an open-source framework for performing LLM model evals. It offers a registry of benchmarks and the option to create custom evals and use private data.
Another open-source framework, promptfoo, can be used for LLM model and prompt evals. It includes features for speeding up evaluations with caching and concurrency as well as setting up automatic output scoring.
Overall, manual and automated LLM model and prompt evals, along with appropriate LLM evaluation metrics, can effectively monitor the quality of your LLMs, prompts, and outputs. The availability of prompting and LLM eval tools helps with organization and efficiency. As your LLM-powered application enters production and grows in complexity, LLM evals and tools become even more significant.
To evaluate the quality of LLM outputs, you need to assess how well the generated text aligns with the intended task. Key factors include the relevance of the output to the input prompt, the accuracy and factual correctness of the information provided, how efficiently the model generates responses, how well it handles a range of topics, and whether it avoids common pitfalls like hallucinations or biased language.
LLM accuracy is typically measured using a few different metrics that assess how close the model's output is to a correct or expected response. Common metrics include precision, which shows how often the model's positive outputs are correct, and recall, which measures its ability to find all relevant correct answers. Another useful metric is the F1 score, which balances precision and recall to give an overall sense of the model's performance. Accuracy itself measures the proportion of all correct responses out of the total attempts.
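For example, if 9 of a model's 10 positive predictions are correct (precision = 0.9) and it finds 9 of the 12 actually positive cases (recall = 0.75), its F1 score is 2 × 0.9 × 0.75 / (0.9 + 0.75) ≈ 0.82.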
Benchmarking in large language models refers to testing and comparing different models using standardized datasets and tasks to evaluate their performance. This process involves running models through a set of tasks, such as answering questions or completing sentences, and then using evaluation metrics to measure their accuracy, efficiency, and ability to handle different types of queries. Benchmarking helps to highlight the strengths and weaknesses of each model. Popular benchmarks include HellaSwag, TruthfulQA, and MMLU.