Is your LLM performing @k? A discussion of common metrics.

Tags:
Cite
@misc{frieder2025llmsok,
title={Is your LLM performing @k? A discussion of common metrics.},
author={Simon Frieder},
year={2025}
month={July}
url={https://www.friederrr.org/blog/is-your-llm-ok}
howpublished = {\url{https://www.friederrr.org/blog/is-your-llm-ok}}

You might think it’s performing OK on various …@k metrics, but you might not understand what the metrics you measure actually mean and measure: In recent times, several types of metrics to evaluate the outputs of black-box systems (typically LLMs) have been proposed, which have all the following in common: a) there is a test set of p-many problems, b) for each problem, k solution candidates are generated by that black-box system for that problem, c) some (human or automatic) assessment is made how many of these k solution candidates come close to solving the problem, d) and finally the metric aggregates all the p \times k assessments into a single number.

I collectively call these ”…@k metrics”. Unfortunately, they often are not fully defined, perhaps even ill-defined, confusing, and inconsistent. In this post I will clean up with several misconceptions around metrics. Knowing about these issues will hopefully make your evals more rigorous.

I will here focus on metrics that operate on a black-box system and a black-box assessment, which makes only a binary (correct/incorrect) assessment, because these are the most general, and therefore most widespread. There are other types of metrics, that can be disinguished upon whether they are execution-based and execution-free metrics (see, e.g. https://arxiv.org/pdf/2212.10481), but we will not go into these. Not that classic metric, such as BLEU, are subsumed in the black-box assessment; one can use pass@k with BLEU metrics, but just as well with accuracy.

Two closely related metrics are the following:

n@k = x%: This metric says: Sample k candidates to solve a given problem. Let your system pick n<=k of these for each problem. Check if at least one of the n candidates solves, and if does, mark that problem as solved. Report the percentage of the p problems that were solved this way. This is a metric defined first in the AlphaCode paper [https://arxiv.org/pdf/2203.07814, page 6 and page 15].

The problem is: This is an ill-defined metric, the metric changes depending on how the system chooses the n samples, but this information is not included in the metric name. If some function f that chooses the n samples is used, then the metric should actually be called something like f@n@k. For example, if you use logprobs (assuming your system relies on these) to order all the k samples and choose the top n ones, this metric should be called logprobrank@n@k. If some other scoring function is used, for example a reward model M that makes a guess how well a generated problem will work, then metric should be named M@n@k. The metrics f@n@k and M@n@k may not agree, but reporting them as “n@k” without providing the ranking procedure is incomplete: If I just tell you that n@k = 0.7 and you don’t have access to the system (which you don’t in case of commercial LLMs), you will not be able to fully interpret the evaluation, as you don’t know the function f that chose the n samples! The AlphaCode paper makes this suboptimal state of affairs explicit (page 6): “The filtering method is up to the system itself”.

One way to fix this situation, if we want to keep with the “n@k” name of the metric, would be to change its definition slightly. Here’s a very similar definition which makes the n@k metric rigorous, let’s call this “n@@k”, which I define in the following way: There exists a subset of n samples, out of a total of k samples, such that on that sample x% are achieved. Notice that his definition doesn’t rely on a mechanism to find n samples anymore. Then n@@k becomes an upper bound for the entire family of n@k-like metrics, since each choosing mechanism will pick a specific subset of n sample, but there might be a subset of n samples that wasn’t picked that might score higher.

I am not saying that n@@k is useful - I am saying it’s rigorous, and n@k is not as for the latter you typically have to read around to figure out the mechanism by which the n samples are chosen.

pass@k. This metric is simple to define. Not that if n=k, then the n@k metric (for any choosing mechanisms), coincides with n@@k. This is the pass@k matric as all of the k samples are being assessed for correctness, not just a subset of n of them. pass@k was first defined in the SPoC paper (https://proceedings.neurips.cc/paper/2019/file/7298332f04ac004a0ca44cc69ecf6f6b-Paper.pdf), but not using the pass@k name. Rather it is called “success rate at B” (page 6), where B is a “budget” of trials (i.e. what we denote by k) that a system can generate per problem. The original definition is slightly ambiguous and we will return to it below.

There are several issues with these metrics:

  1. The black-box assessment was binary.

The precise wording is slightly ambiguous, as the definition of this metric is “the fraction of test examples where the system generates an accepted program under the budget of B trials”. The “test examples” here is the black-box assessment, which
Without further context, this sentence could be both understood as

  • there is one problem to the system generates B-many candidates in total, and all of the
  • the system gener

In is important to note that for the n@k metric (and thus also pass@k) the n samples are made independently of eachother. This means that the system receives teh

ratiopass@k. pass@k is not sensitive to how many of the k samples were positive: Obviously, it makes a difference if for a given problem only a few solution candidates are correct vs. most solution candidates are correct. pass@k doesn’t pick up on this difference. One such example of a metric is the ratiopass@k metric introduced only recently, in 2024: https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0357. This is strange, since it is a completely natural metric to consider: ratiopass@k metric was introduced in the context of assessing generated software programs, where there are test input-output pairs. In this context it is natural to move away from a binary evaluation if all tests suceed or not, and allow for fractional points based on the number of tests that passed. This is not the case for any kinds of assessments: For example, if you assess mathematical proofs, it is not straightforward to split the assessments in “additive categories” (like test input-output pairs for programs) so that the entire assessment passes if individual tests pass.