LLM Evaluation: Measuring the Quality of LLMs, Prompts, and Outputs

Learn how LLM evaluation can assess model quality, prompt effectiveness, and output relevance for optimized performance.

LLM evaluation is divided into two key areas:

Model Evaluation – Measures overall performance using standardized benchmarks. Uses established benchmarks like HellaSwag, TruthfulQA, and MMLU to test accuracy and reasoning. Focuses on areas such as factual correctness, coherence, and contextual understanding.

Prompt Evaluation – Assesses the quality of responses based on various criteria. Examines accuracy, efficiency, relevance, versatility, hallucinations, and bias in LLM outputs. Helps improve prompt engineering by refining input structures and wording.

Human-Led Assessments – Experts review and score LLM outputs.

LLM-Assisted Methods – AI models analyze and provide feedback on outputs.

Various tools exist to assess and maintain prompt quality. Combining automated and manual evaluations leads to more reliable assessments.

In a previous article, we learned that prompting is how we communicate with LLMs, such as OpenAI’s GPT-4 and Meta’s Llama 2. We also observed how prompt structure and technique impact the relevancy and consistency of LLM output. But how do we actually determine the quality of our LLM prompts and outputs? In this article, we will investigate:

LLM model evaluation vs. LLM prompt evaluation

Human vs. LLM-assisted approaches to running LLM evaluations

Available tools for prompt maintenance and LLM evaluation

If we are using LLMs only for personal or leisure purposes, we may not need rigorous evaluations of our LLM prompts and outputs. However, when building LLM-powered applications for business and production scenarios, the caliber of the LLM, prompts, and outputs matters and needs to be measured.

LLM evaluation (eval) is a generic term. Let’s cover the two main types of LLM evaluation.

| | LLM Model Eval | LLM Prompt Eval |
| --- | --- | --- |
| Purpose | Evaluate models or versions of the same model based on overall performance | Evaluate prompt effectiveness based on LLM output quality |
| By Whom | AI model developers | AI/LLM application developers |
| Frequency | Infrequent | Frequent |
| Benchmarks/Metrics | Massive Multitask Language Understanding (MMLU) | |

A table comparing LLM model eval vs. LLM prompt eval. Source: Author

LLM model evals are used to assess the overall quality of foundation models, such as OpenAI’s GPT-4 and Meta’s Llama 2, across a variety of tasks and are usually done by model developers. The same test datasets are fed into each model, and the resulting metrics are compared.

How well a model performs in these evaluations is heavily influenced by the quality of the training data used to develop it. High-quality, diverse data helps large language models generalize across a variety of tasks, which leads to better performance during evaluations.

A diagram illustrating LLM model evals. Source: https://arize.com/blog-course/llm-evaluation-the-definitive-guide/#large-language-model-model-eval

The following are some popular LLM model eval metrics available:

HellaSwag - A benchmark that measures how well an LLM can complete a sentence. For example, provided with "A woman sits at a piano" the LLM needs to pick "She sets her fingers on the keys" as the most probable phrase that follows.

TruthfulQA - A benchmark to measure truthfulness in an LLM’s generated responses. To score high on this benchmark, an LLM needs to avoid generating false answers based on popular misconceptions learned from human texts.

Massive Multitask Language Understanding (MMLU) - A broad benchmark to measure an LLM’s multi-task accuracy and natural language understanding (NLU). The test encompasses 57 tasks that cover a breadth of topics, including STEM subjects like mathematics and computer science and humanities and social sciences like history and law. Difficulty also varies, from elementary to advanced professional levels.

HumanEval - A benchmark that measures an LLM’s coding abilities and includes 164 programming problems with a function signature, docstring, body, and several unit tests. The coding problems are written in Python and the comments and docstrings contain natural text in English.

GSM8K - A benchmark to measure an LLM’s capability to perform multi-step mathematical reasoning. The test dataset contains 8.5K math word problems that involve 2-8 steps and require only basic arithmetic operations (+ - / *).

A table of Claude 3 benchmarks against other LLMs. Source: https://www.anthropic.com/news/claude-3-family
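To make the format of a code benchmark such as HumanEval more concrete, here is a minimal sketch of how a single problem might be scored. The problem, the completions, and the `passes_tests` helper are all hypothetical and are not taken from the actual benchmark or its harness. A real harness would also run untrusted model output in a sandbox rather than calling `exec` directly.

```python
# Hypothetical HumanEval-style problem: a function signature plus docstring.
# The content is illustrative only, not an item from the real dataset.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

# Unit tests that a correct completion must pass.
TESTS = '''
assert add(1, 2) == 3
assert add(-1, 1) == 0
'''


def passes_tests(prompt: str, completion: str, tests: str) -> bool:
    """Define the function from prompt + completion, then run the unit tests."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # builds the candidate function
        exec(tests, namespace)                # raises AssertionError on failure
        return True
    except Exception:
        return False


# Two candidate function bodies standing in for model outputs.
good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

results = [passes_tests(PROMPT, c, TESTS) for c in (good_completion, bad_completion)]
pass_rate = sum(results) / len(results)
print(results, pass_rate)  # [True, False] 0.5
```

The benchmark score is then an aggregate of such pass/fail results over all problems.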

The purpose of LLM model evals is to differentiate between various models or versions of the same model based on overall performance and general capabilities. The results — along with other considerations for access methods, costs, and transparency — help inform which model(s) or model version(s) to use for your LLM-powered application. Choosing which LLM(s) to use is typically a one-time endeavor near the beginning of your application development.

An emerging technique that can significantly influence prompt effectiveness is Retrieval-Augmented Generation (RAG). This approach combines the strengths of LLMs with retrieval mechanisms, allowing models to pull in relevant external information when generating responses. Integrating RAG into the evaluation process enables us to better assess how well prompts leverage external knowledge, which can improve grounding and relevance in LLM outputs.

A diagram illustrating LLM prompt evals. Source: https://arize.com/blog-course/llm-evaluation-the-definitive-guide/#llm-system-evaluation

Currently, there is no definitive standard for evaluating prompt effectiveness and output quality. In general, we want to assess whether the prompt and output are good and safe. Here are some key dimensions to consider:

Grounding - The authoritative basis of the LLM output, determined by comparing it against some ground truths in a specific domain.

Relevance - The pertinence of the LLM output to the prompt query or topic alignment. This can be measured with a predefined scoring methodology, such as binary classification (relevant/irrelevant).

Efficiency - The speed and compute consumption of the LLM to produce the output. This can be measured as the time it takes to receive the output and the cost of inference (prompt execution) in tokens or dollars.

Versatility - The capability of the LLM to handle different types of queries. One indicator is perplexity, which measures how confused the model is in making the next word or token predictions. Lower perplexity means the model is less confused and therefore more confident in its predictions. In general, a model’s confidence has a positive correlation with its accuracy. Moreover, a lower perplexity on new, unseen data means the model can generalize well.

Hallucinations - Whether the LLM output contains hallucinations or factually untrue statements. This may be determined with a chosen scoring method, such as binary classification (factual/hallucinated), based on some reference data.

Toxicity - The presence of toxic content, such as inappropriate language, biases, and threats in the LLM output. Some metrics for toxicity include fairness scoring, disparity analysis, and bias detection.
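Two of the dimensions above, efficiency and perplexity, can be computed directly from numbers that many LLM APIs report. The sketch below assumes we already have token counts and per-token natural-log probabilities for a response. The function names, prices, and probabilities are made up for illustration. Perplexity is taken here as the exponential of the average negative log-probability per token.

```python
import math


def inference_cost(prompt_tokens: int, output_tokens: int,
                   usd_per_1k_prompt: float, usd_per_1k_output: float) -> float:
    """Dollar cost of one call from token counts and per-1K-token prices."""
    return ((prompt_tokens / 1000) * usd_per_1k_prompt
            + (output_tokens / 1000) * usd_per_1k_output)


def perplexity(token_logprobs: list[float]) -> float:
    """exp(average negative log-probability) over the generated tokens."""
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)


# Hypothetical per-token probabilities for two responses.
confident = [math.log(p) for p in (0.9, 0.8, 0.95)]
uncertain = [math.log(p) for p in (0.2, 0.1, 0.3)]

print(perplexity(confident))   # close to 1: the model was rarely "surprised"
print(perplexity(uncertain))   # noticeably higher: the model was less sure

# Hypothetical prices: $0.01 per 1K prompt tokens, $0.03 per 1K output tokens.
print(inference_cost(500, 200, 0.01, 0.03))
```

A model that assigns probability 0.5 to every token has a perplexity of 2, which is one way to read the number: roughly how many equally likely options the model was choosing between at each step.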

Specifically for binary classification of outputs, there are four common metrics: accuracy, precision, recall, and F1 score. First, let’s look at the four possible outcomes for binary classification, using relevance as an example. These four possible outcomes make up the confusion matrix.

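Confusion matrix for binary classification of relevance. Source: Author

The four outcomes are true positive (predicted relevant, actually relevant), false positive (predicted relevant, actually irrelevant), true negative (predicted irrelevant, actually irrelevant), and false negative (predicted irrelevant, actually relevant). Based on the confusion matrix, the four metrics are defined: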

Accuracy - Measures the overall proportion of correct predictions made by the model. It’s calculated as (True Positives + True Negatives) / Total Predictions. However, accuracy alone can be misleading if the dataset is imbalanced, because the majority class dominates the accuracy score and can mask poor performance on the minority class.

Precision - Also known as the positive predictive value, measures the proportion of true positives among the positive predictions made by the model. It’s calculated as True Positives / (True Positives + False Positives). Indicates how trustworthy the model's positive predictions are.

Recall - Also known as the true positive rate, measures the proportion of true positives out of all actual positives. It’s calculated as True Positives / (True Positives + False Negatives). Indicates the model's ability to identify all actual positive cases.

F1 score - Combines precision and recall into a single metric. It’s calculated as the harmonic mean 2 * (Precision * Recall) / (Precision + Recall). The score ranges from 0 to 1, with 1 indicating perfect classification. Indicates a model’s ability to balance the tradeoff between precision and recall.
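The four formulas above can be sketched in a few lines of Python. The function names and the sample labels are illustrative. Reporting an undefined ratio (division by zero) as 0.0 is a simplifying assumption of this sketch. The sample data is deliberately imbalanced to show how accuracy can hide a failure on the minority class.

```python
def confusion_counts(predicted: list[bool], actual: list[bool]) -> dict[str, int]:
    """Count the four outcomes, treating True as the positive (relevant) class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for pred, truth in zip(predicted, actual):
        if pred and truth:
            counts["tp"] += 1
        elif pred and not truth:
            counts["fp"] += 1
        elif not pred and not truth:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def classification_metrics(c: dict[str, int]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1; undefined ratios are reported as 0.0."""
    total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
    accuracy = (c["tp"] + c["tn"]) / total if total else 0.0
    precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Imbalanced sample: only 2 of 10 outputs are actually relevant,
# and the classifier labels everything as irrelevant.
actual = [True] * 2 + [False] * 8
predicted = [False] * 10

scores = classification_metrics(confusion_counts(predicted, actual))
print(scores)  # {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

Accuracy looks respectable at 0.8, while precision, recall, and F1 all show that no relevant output was found.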

There are two major approaches to running LLM evals: human evaluation and LLM-assisted evaluation. In human evaluation, as the name suggests, human evaluators manually assess the LLM outputs. The outputs can be evaluated in several ways:

Reference - The evaluator compares an output with the preset ground truth, or ideal response, and gives a yes-or-no judgment on whether the output is accurate. This method requires that the ground truths be constructed ahead of time. Also, the evaluation results are directly influenced by the quality of the ground truths.

Scoring - The evaluator rates an output by assigning a score (e.g. 0-10). The score can be based on a single criterion or a set of criteria that can be broad or narrow in scope. As there is no referenced ground truth, the judgment is completely up to the evaluator.

A/B Testing - The evaluator is given a pair of outputs and needs to pick the better one.

The downside to human evaluation is that human judgment is inherently subjective and the process is resource-intensive.

A diagram of various ways of scoring an output. Source: https://arize.com/blog-course/llm-evaluation-the-definitive-guide/#avoid-numeric-evals

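In LLM-assisted evaluation, a separate LLM (the eval LLM) assesses the outputs in place of a human evaluator and can apply the same methods. The following is a sample prompt template for the reference methodology. The eval LLM compares the AI response with the human ground truth and then provides a correct-or-incorrect judgment.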

A sample prompt template for comparing AI response with human ground truth. Source: https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/ai-vs-human-groundtruth
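As a rough illustration of the reference methodology in code, the sketch below fills a template, sends it to an eval LLM, and maps the reply onto a fixed label. The template wording is hypothetical and is not the Phoenix template cited above. `call_llm` is an assumed placeholder for whatever client function sends a prompt and returns the model's text, and `fake_llm` is a stub so the example runs without a model.

```python
from typing import Callable, Optional

# Hypothetical template for illustration only.
REFERENCE_TEMPLATE = """You are comparing an AI answer to a human ground truth answer.
[Question]: {question}
[Human Ground Truth Answer]: {ground_truth}
[AI Answer]: {ai_answer}
Respond with a single word, "correct" or "incorrect"."""


def parse_label(raw: str, allowed: tuple[str, ...]) -> Optional[str]:
    """Normalize the eval LLM's reply; return None if it is not an allowed label."""
    cleaned = raw.strip().lower().strip(".\"'")
    return cleaned if cleaned in allowed else None


def reference_eval(question: str, ground_truth: str, ai_answer: str,
                   call_llm: Callable[[str], str]) -> Optional[str]:
    """Ask the eval LLM whether the AI answer matches the ground truth."""
    prompt = REFERENCE_TEMPLATE.format(
        question=question, ground_truth=ground_truth, ai_answer=ai_answer
    )
    return parse_label(call_llm(prompt), ("correct", "incorrect"))


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real eval LLM call."""
    return " Correct.\n"


label = reference_eval(
    "What is the capital of France?", "Paris", "The capital is Paris.", fake_llm
)
print(label)  # correct
```

Restricting the reply to a small set of allowed labels, and treating anything else as unparseable, keeps the results easy to aggregate with the binary classification metrics described earlier.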

The following is a sample prompt template for detecting toxicity. The eval LLM is instructed to perform a binary classification scoring (toxic or non-toxic) on the provided text.

A sample prompt template for detecting toxicity. Source: https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/toxicity

The following prompt template illustrates an example of the A/B testing paradigm. Given two answers, the eval LLM is instructed to pick the better answer for the question.

A sample prompt template for A/B testing. Source: https://txt.cohere.com/evaluating-llm-outputs/
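A pairwise comparison like this can be tallied over many questions to see which variant wins more often. The following is a minimal sketch under stated assumptions: the template text is hypothetical (not the Cohere template cited above), `call_llm` is a placeholder for a real model call, and `fake_llm` is a toy stub that simply prefers the longer answer so the example can run on its own.

```python
from collections import Counter
from typing import Callable

# Hypothetical A/B template for illustration only.
AB_TEMPLATE = """Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer is better? Reply with exactly "A" or "B"."""


def pick_better(question: str, answer_a: str, answer_b: str,
                call_llm: Callable[[str], str]) -> str:
    """Return "A" or "B" as chosen by the eval LLM, or "invalid" otherwise."""
    prompt = AB_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    reply = call_llm(prompt).strip().upper()
    return reply if reply in ("A", "B") else "invalid"


def win_counts(cases: list[tuple[str, str, str]],
               call_llm: Callable[[str], str]) -> Counter:
    """Tally the eval LLM's choices over (question, answer_a, answer_b) cases."""
    return Counter(pick_better(q, a, b, call_llm) for q, a, b in cases)


def fake_llm(prompt: str) -> str:
    """Toy stub: prefers the longer of the two single-line answers."""
    lines = prompt.splitlines()
    answer_a = lines[1].removeprefix("Answer A: ")
    answer_b = lines[2].removeprefix("Answer B: ")
    return "A" if len(answer_a) >= len(answer_b) else "B"


cases = [
    ("What is 2 + 2?", "4", "2 + 2 equals 4."),
    ("Name a primary color.", "Red is a primary color.", "Red"),
]
counts = win_counts(cases, fake_llm)
print(counts["A"], counts["B"])  # 1 1
```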

Tools are available that help with prompt management and optimization as well as LLM evaluation.

A prompt playground is an interactive environment to create, iterate, and refine prompts. It may offer features such as viewing prompts and corresponding responses, editing existing prompts, and analyzing prompt performance. A prompt playground may be offered as a standalone tool or part of a suite. For example, OpenAI has a simple playground to experiment with its models. Chainlit, an open-source Python AI framework, provides a prompt playground module.

Overall, manual and automated LLM model and prompt evals, along with the use of appropriate LLM evaluation metrics, can effectively monitor the quality of LLMs, prompts, and outputs. The availability of prompting and LLM eval tools helps with organization and efficiency. As your LLM-powered application enters production mode and grows in complexity, LLM evals and tools become more significant.

Diana Cheung (ex-LinkedIn software engineer, USC MBA, and Codesmith alum) is a technical writer on technology and business. She is an avid learner and has a soft spot for tea and meows.