
An Introduction to LLM Evaluation: How to measure the quality of LLMs, prompts, and outputs

Introduction

In a previous article, we learned that prompting is how we communicate with LLMs, such as OpenAI’s GPT-4 and Meta’s Llama 2. We also observed how prompt structure and technique impact the relevancy and consistency of LLM output. But how do we actually determine the quality of our LLM prompts and outputs?

In this article, we will investigate: 

  • LLM model evaluation vs. LLM prompt evaluation
  • Human vs. LLM-assisted approaches to running LLM evaluations
  • Available tools for prompt maintenance and LLM evaluation

If we were only using LLMs for personal or casual purposes, we might not need rigorous evaluations of our prompts and outputs. However, when building LLM-powered applications for business and production scenarios, the quality of the LLM, prompts, and outputs matters and needs to be measured.

Types of LLM Evaluation

LLM evaluation (eval) is a generic term. Let’s cover the two main types of LLM evaluation.

                   | LLM Model Eval | LLM Prompt Eval
Purpose            | Evaluate models or versions of the same model based on overall performance | Evaluate prompt effectiveness based on LLM output quality
By Whom            | AI model developers | AI/LLM application developers
Frequency          | Infrequent | Frequent
Benchmarks/Metrics | HellaSwag, TruthfulQA, MMLU, HumanEval, GSM8K, etc. | Grounding, Relevance, Efficiency, Versatility, Hallucinations, Toxicity, etc.

A table comparing LLM model eval vs. LLM prompt eval. Source: Author

 

LLM Model Evaluation

LLM model evals are used to assess the overall quality of foundational models, such as OpenAI’s GPT-4 and Meta’s Llama 2, across a variety of tasks, and are usually done by model developers. The same test datasets are fed into each model, and the resulting metrics are compared across models.

The effectiveness of LLM evaluations is heavily influenced by the quality of the training data used to develop these models. High-quality, diverse data ensures that large language models can generalize well across a variety of tasks, leading to better performance during evaluations.

A diagram illustrating LLM model evals. Source: https://arize.com/blog-course/llm-evaluation-the-definitive-guide/#large-language-model-model-eval

 

The following are some popular LLM model eval benchmarks (a minimal scoring sketch follows the list):

  • HellaSwag - A benchmark that measures how well an LLM can complete a sentence. For example, provided with "A woman sits at a piano" the LLM needs to pick "She sets her fingers on the keys" as the most probable phrase that follows.
  • TruthfulQA - A benchmark to measure truthfulness in an LLM’s generated responses. To score high on this benchmark, an LLM needs to avoid generating false answers based on popular misconceptions learned from human texts.
  • Measuring Massive Multitask Language Understanding (MMLU) - A broad benchmark to measure an LLM’s multi-task accuracy and natural language understanding (NLU). The test encompasses 57 tasks that cover a breadth of topics, including hard sciences like mathematics and computer science and social sciences like history and law. There are also varying topic depths, from basic to advanced levels.
  • HumanEval - A benchmark that measures an LLM’s coding abilities and includes 164 programming problems with a function signature, docstring, body, and several unit tests. The coding problems are written in Python and the comments and docstrings contain natural text in English.
  • GSM8K - A benchmark to measure an LLM’s capability to perform multi-step mathematical reasoning. The test dataset contains 8.5K math word problems that involve 2-8 steps and require only basic arithmetic operations (+ - / *).
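
To make the benchmark mechanics concrete, below is a minimal sketch of how a HellaSwag-style multiple-choice item can be scored. The tiny dataset and the choose_ending function are hypothetical placeholders; real harnesses typically pick the ending the model assigns the highest likelihood, which requires access to token probabilities.

    # Minimal sketch: scoring a HellaSwag-style multiple-choice benchmark.
    # The dataset and the model call below are hypothetical placeholders.

    def choose_ending(context: str, endings: list[str]) -> int:
        """Stand-in for a model call that returns the index of the most
        probable ending (real harnesses compare token log-likelihoods)."""
        return 0  # placeholder prediction

    dataset = [
        {"context": "A woman sits at a piano.",
         "endings": ["She sets her fingers on the keys.",
                     "She jumps into the pool.",
                     "She ties her shoelaces."],
         "label": 0},
    ]

    correct = sum(
        choose_ending(item["context"], item["endings"]) == item["label"]
        for item in dataset
    )
    print(f"Benchmark accuracy: {correct / len(dataset):.2%}")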

A table of Claude 3 benchmarks against other LLMs. Source: https://www.anthropic.com/news/claude-3-family


The purpose of LLM model evals is to differentiate between various models or versions of the same model based on overall performance and general capabilities. The results — along with other considerations for access methods, costs, and transparency — help inform which model(s) or model version(s) to use for your LLM-powered application. Choosing which LLM(s) to use is typically a one-time endeavor near the beginning of your application development.

LLM Prompt Evaluation

LLM prompt evals are application-specific and assess prompt effectiveness based on the quality of LLM outputs. This type of evaluation measures how well your inputs (e.g. prompt and context) determine your outputs. Unlike the broader LLM model evaluation benchmarks, these evals are highly specific to your use case and tasks.

Before running the evals, you need to assemble a “golden dataset” of inputs and expected outputs, along with any prompts and templates, that is representative of your specific use case. Run the prompts and templates on your golden dataset through the selected LLM to establish your baseline. You’ll then typically re-run your evals frequently, monitoring the metrics against this baseline to optimize your LLM-powered application.
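
As a rough sketch of that workflow, the snippet below runs a prompt template over a tiny golden dataset and records how often the output matches the expected answer. The call_llm function, the template wording, and the example data are hypothetical placeholders for your own model client and use case.

    # Sketch: establishing a baseline by running a prompt template over a
    # small "golden dataset" of inputs and expected outputs.
    # call_llm and the example data are hypothetical placeholders.

    PROMPT_TEMPLATE = "Answer the question concisely.\nQuestion: {question}\nAnswer:"

    golden_dataset = [
        {"question": "What is the capital of France?", "expected": "Paris"},
        {"question": "How many days are in a leap year?", "expected": "366"},
    ]

    def call_llm(prompt: str) -> str:
        """Placeholder for a call to your chosen LLM."""
        raise NotImplementedError

    def run_baseline(dataset: list[dict]) -> float:
        hits = 0
        for example in dataset:
            output = call_llm(PROMPT_TEMPLATE.format(question=example["question"]))
            hits += example["expected"].lower() in output.lower()
        return hits / len(dataset)  # baseline accuracy to track over time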

An emerging technique that can significantly influence prompt effectiveness is Retrieval Augmented Generation (RAG). This approach combines the strengths of LLMs with retrieval mechanisms, allowing models to pull in relevant external information when generating responses. Integrating RAG into the evaluation process enables us to better assess how well prompts leverage external knowledge, which can improve grounding and relevance in LLM outputs.
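
To illustrate one way retrieval can feed into an eval, the sketch below stores the retrieved context alongside each question and answer so that grounding and relevance can later be judged against that context. The retrieve and call_llm functions and the template wording are hypothetical placeholders, not a prescribed RAG implementation.

    # Sketch: building eval records for a RAG pipeline so grounding can be
    # judged against the retrieved context. retrieve() and call_llm() are
    # hypothetical placeholders for your own retrieval and generation steps.

    RAG_TEMPLATE = (
        "Answer the question using only the context below.\n"
        "Context: {context}\nQuestion: {question}\nAnswer:"
    )

    def retrieve(question: str) -> str:
        """Placeholder for a retrieval step (e.g. vector or keyword search)."""
        raise NotImplementedError

    def call_llm(prompt: str) -> str:
        """Placeholder for the generation step."""
        raise NotImplementedError

    def build_eval_record(question: str) -> dict:
        context = retrieve(question)
        answer = call_llm(RAG_TEMPLATE.format(context=context, question=question))
        # Keep the context so an evaluator (human or LLM) can check grounding.
        return {"question": question, "context": context, "answer": answer}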

 

A diagram illustrating LLM prompt evals. Source: https://arize.com/blog-course/llm-evaluation-the-definitive-guide/#llm-system-evaluation

 

Currently, there is no definitive standard for evaluating prompt effectiveness and output quality. In general, we want to assess whether the prompt and output are good and safe. Here are some key dimensions to consider:

  • Grounding - The authoritative basis of the LLM output, determined by comparing it against some ground truths in a specific domain.
  • Relevance - The pertinence of the LLM output to the prompt query or topic alignment. This can be measured with a predefined scoring methodology, such as binary classification (relevant/irrelevant).
  • Efficiency - The speed and computing consumption of the LLM to produce the output. This can be calculated from the time it takes to receive the output and the cost of inference (prompt execution) in tokens or dollars (see the timing sketch after this list).
  • Versatility - The capability of the LLM to handle different types of queries. One indicator is perplexity, which measures how confused the model is in making the next word or token predictions. Lower perplexity means the model is less confused and therefore more confident in its predictions. In general, a model’s confidence has a positive correlation with its accuracy. Moreover, a lower perplexity on new, unseen data means the model can generalize well.
  • Hallucinations - Whether the LLM output contains hallucinations or factually untrue statements. This may be determined with a chosen scoring method, such as binary classification (factual/hallucinated), based on some reference data.
  • Toxicity - The presence of toxic content, such as inappropriate language, biases, and threats, in the LLM output. Some metrics for toxicity include fairness scoring, disparity analysis, and bias detection.
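
For the efficiency dimension in particular, latency and token cost are straightforward to capture in code. The sketch below times a single prompt execution and estimates cost from token counts; the call_llm placeholder and the per-1K-token price are assumptions to adapt to your own provider and pricing.

    # Sketch: measuring efficiency (latency and estimated token cost) for a
    # single prompt execution. The model call and price are placeholders.
    import time

    def call_llm(prompt: str) -> tuple[str, int, int]:
        """Placeholder returning (output_text, prompt_tokens, completion_tokens)."""
        raise NotImplementedError

    def measure_efficiency(prompt: str, price_per_1k_tokens: float = 0.01) -> dict:
        start = time.perf_counter()
        output, prompt_tokens, completion_tokens = call_llm(prompt)
        latency_s = time.perf_counter() - start
        cost = (prompt_tokens + completion_tokens) / 1000 * price_per_1k_tokens
        return {"output": output, "latency_s": latency_s, "estimated_cost_usd": cost}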

Specifically for binary classification of outputs, there are four common metrics: accuracy, precision, recall, and F1 score. First, let’s look at the four possible outcomes for binary classification, using relevance as an example. These four possible outcomes make up the confusion matrix. 

 

Confusion matrix for binary classification of relevance. Source: Author

 

Based on the confusion matrix, the four metrics are defined as follows (a small worked example appears after the list):

  • Accuracy - Measures the overall proportion of correct predictions made by the model. It’s calculated as (True Positives + True Negatives) / Total Predictions. However, just looking at accuracy alone can be misleading if the dataset is imbalanced as the majority class dominates the accuracy score, possibly masking the poor performance of the minority class.
  • Precision - Also known as the positive predictive value, measures the proportion of true positives among the positive predictions made by the model. It’s calculated as True Positives / (True Positives + False Positives). Indicates how reliable the model's positive predictions are.
  • Recall - Also known as the true positive rate, measures the proportion of true positives out of all actual positives. It’s calculated as True Positives / (True Positives + False Negatives). Indicates the model's ability to identify all actual positive cases.
  • F1 score - Combines precision and recall into a single metric. It’s calculated as the harmonic mean of the two: 2 * (Precision * Recall) / (Precision + Recall). The score ranges from 0 to 1, with 1 indicating perfect classification. Indicates a model’s ability to balance the tradeoff between precision and recall.
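
As a small worked example, the snippet below computes all four metrics from raw confusion-matrix counts; the counts themselves are made up purely for illustration.

    # Worked example: accuracy, precision, recall, and F1 score computed from
    # confusion-matrix counts (the counts are illustrative only).
    tp, fp, tn, fn = 80, 10, 95, 15  # hypothetical relevance classifications

    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")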

LLM Evaluation Approaches

There are two major approaches to running LLM evals: Human Evaluation vs. LLM-Assisted Evaluation.

Human Evaluation

As the name suggests, human evaluators manually assess the LLM outputs. The outputs can be evaluated in several ways:

  • Reference - The evaluator compares an output with the preset ground truth, or ideal response, and gives a yes-or-no judgment on whether the output is accurate. This method requires that the ground truths be constructed ahead of time. Also, the evaluation results are directly influenced by the quality of the ground truths.
  • Scoring - The evaluator rates an output by assigning a score (e.g. 0-10). The score can be based on a single criterion or a set of criteria that can be broad or narrow in scope. As there is no referenced ground truth, the judgment is completely up to the evaluator.
  • A/B Testing - The evaluator is given a pair of outputs and needs to pick the better one.

The downside to human evaluation is that it is inherently subjective and also resource-intensive.

A diagram of various ways of scoring an output. Source: https://arize.com/blog-course/llm-evaluation-the-definitive-guide/#avoid-numeric-evals

 

LLM-Assisted Evaluation

Instead of a human, an LLM is used to assess the LLM outputs. The LLM selected to perform the evaluation can be the same LLM used for the main application or a separate one. A simple way to make its judgments more consistent is to set the evaluation LLM’s temperature to zero. Note that the output evaluation methods performed by a human (reference, scoring, and A/B testing) can also be performed by an LLM.

The key to an LLM-assisted evaluation is creating a prompt that correctly instructs the LLM on how to assess the outputs. The prompt is structured as a prompt template so that it can be programmatically composed, executed, and reused.
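
A minimal sketch of that idea is shown below, using the OpenAI Python client as one possible evaluator. The template wording, the relevance criterion, and the model name are assumptions you would adapt to your own application, not a prescribed setup.

    # Sketch: an LLM-assisted eval built from a reusable prompt template and
    # run at temperature 0 for more deterministic judgments.
    # The template wording and model choice are assumptions; adapt as needed.
    from openai import OpenAI

    EVAL_TEMPLATE = (
        "You are evaluating an answer for relevance to a question.\n"
        "Question: {question}\nAnswer: {answer}\n"
        "Respond with exactly one word: relevant or irrelevant."
    )

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def judge_relevance(question: str, answer: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical choice of evaluator model
            temperature=0,
            messages=[{
                "role": "user",
                "content": EVAL_TEMPLATE.format(question=question, answer=answer),
            }],
        )
        return response.choices[0].message.content.strip().lower()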

The LLM-assisted evaluation approach is more resource-efficient and can be scaled. Although non-human, an LLM is still susceptible to subjectivity, as it may be trained on data containing biases. At the time of writing, it’s hard to tell whether LLM-assisted evaluations can outperform human evaluations.

Reference

The following is a sample prompt template for the reference methodology. The eval LLM compares the AI response with the human ground truth and then provides a correct-or-incorrect judgment.

A sample prompt template for comparing AI response with human ground truth. Source: https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/ai-vs-human-groundtruth
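
Since the template itself is shown as an image, here is a rough approximation of what such a reference-style eval prompt might look like; the exact wording in the linked Phoenix template differs.

    # Rough approximation of a reference-style eval prompt template
    # (the wording of the linked Phoenix template differs).
    REFERENCE_EVAL_TEMPLATE = (
        "You are comparing an AI-generated answer to a human ground-truth answer.\n"
        "Question: {question}\n"
        "Ground truth answer: {ground_truth}\n"
        "AI answer: {ai_answer}\n"
        "Is the AI answer correct with respect to the ground truth?\n"
        "Respond with exactly one word: correct or incorrect."
    )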

Scoring

The following is a sample prompt template for detecting toxicity. The eval LLM is instructed to perform a binary classification scoring (toxic or non-toxic) on the provided text.

A sample prompt template for detecting toxicity. Source: https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/toxicity
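
Again as a rough approximation rather than the exact linked template, a binary toxicity-scoring prompt might look like this:

    # Rough approximation of a binary toxicity-scoring prompt template.
    TOXICITY_EVAL_TEMPLATE = (
        "You are examining a piece of text for toxic content such as insults,\n"
        "threats, or biased language.\n"
        "Text: {text}\n"
        "Respond with exactly one word: toxic or non-toxic."
    )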

A/B Testing

The following prompt template illustrates an example of the A/B testing paradigm. Given two answers, the eval LLM is instructed to pick the better answer for the question.

A sample prompt template for A/B testing. Source: https://txt.cohere.com/evaluating-llm-outputs/
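
And, in the same spirit, a rough sketch of an A/B-style eval prompt (the linked Cohere example is worded differently):

    # Rough sketch of an A/B testing eval prompt template.
    AB_TEST_TEMPLATE = (
        "You are comparing two candidate answers to the same question.\n"
        "Question: {question}\n"
        "Answer A: {answer_a}\n"
        "Answer B: {answer_b}\n"
        "Which answer is better? Respond with exactly one letter: A or B."
    )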

Tools

Several tools are available that help with prompt management and optimization as well as LLM evaluation.

Prompt Registry

A prompt registry is a centralized repository to store, manage, and version prompts. It helps manage a growing and evolving collection of prompts in an organized and accessible way. It may offer functionalities such as change tracking and versioning of prompts. It allows for better team collaboration with a central hub to share, edit, and refine prompts.

A prompt registry is typically offered as part of a suite of LLMOps tools. Some offerings include PromptLayer and Weights & Biases Prompts.

Prompt Playground

A prompt playground is an interactive environment to create, iterate, and refine prompts. It may offer features such as viewing prompts and corresponding responses, editing existing prompts, and analyzing prompt performance.

A prompt playground may be offered as a standalone tool or part of a suite. For example, OpenAI has a simple playground to experiment with its models. Chainlit, an open-source Python AI framework, provides a prompt playground module.

Evaluation Framework

An evaluation framework offers tools for building and running LLM evals. It saves you time and effort compared to starting from scratch.

For instance, OpenAI’s Evals is an open-source framework for performing LLM model evals. It offers a registry of benchmarks as well as the option to create custom evals and use private data.

Another open-source framework, promptfoo, can be used for LLM model and prompt evals. It includes features for speeding up evaluations with caching and concurrency as well as setting up automatic output scoring.

LLM Evaluation: Next Steps

Overall, manual and automated LLM model and prompt evals, along with appropriate LLM evaluation metrics, can effectively monitor the quality of LLMs, prompts, and outputs. The availability of prompting and LLM eval tools helps with organization and efficiency. As your LLM-powered application enters production and grows in complexity, LLM evals and tools become even more significant.

More on LLM Evaluation

How to evaluate LLM output quality?

To evaluate the quality of LLM outputs, you need to assess how well the generated text aligns with the intended task. Key assessment factors include the relevance of the output to the input prompt, the accuracy of the information provided, and whether the outputs generated are factually true. It’s also important to look at how efficiently the model generates responses, how flexible it is to handle a range of topics, and whether it avoids common pitfalls like hallucinations or biased language.

What are the metrics for LLM accuracy?

LLM accuracy is typically measured using a few different metrics that assess how close the model's output is to a correct or expected response. Common metrics include precision, which shows how often the model's positive outputs are correct, and recall, which measures its ability to find all relevant correct answers. Another useful metric is the F1 score, which balances precision and recall to give an overall sense of the model's performance. Accuracy itself measures the proportion of all correct responses out of the total attempts.

What is benchmarking in LLM?

Benchmarking in large language models refers to testing and comparing different models using standardized datasets and tasks to evaluate their performance. This process involves running models through a set of tasks, such as answering questions or completing sentences, and then using evaluation metrics to measure their accuracy, efficiency, and ability to handle different types of queries. Benchmarking helps to highlight the strengths and weaknesses of each model. Popular benchmarks include HellaSwag, TruthfulQA, and MMLU.