Systematic evaluation is essential at every stage: after pretraining (does the model understand language?), after fine-tuning (does it follow instructions?), after alignment (is it safe and helpful?), and after quantization (did we lose accuracy?).
Perplexity (PPL) measures how well a model predicts the next token; lower is better.
PPL = exp(-(1/N) Σ_{i=1}^{N} log P(x_i | x_{<i}))
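The formula above is straightforward to compute once you have per-token log-probabilities from the model. A minimal sketch (the log-prob values are illustrative):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum of log P(x_i | x_<i))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Three tokens with log-probs -1.0, -2.0, -3.0:
# mean negative log-likelihood = 2.0, so PPL = exp(2.0) ≈ 7.389
print(perplexity([-1.0, -2.0, -3.0]))
```

Note that frameworks differ on details (natural log vs. log base 2, token- vs. word-level normalization), so perplexities are only comparable when computed the same way over the same tokenization.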
BLEU measures n-gram overlap between generated text and a reference:
BLEU = BP · exp(Σ_{n=1}^{4} w_n · log p_n)
Where p_n = modified n-gram precision, BP = brevity penalty, w_n = uniform weights (typically 1/4).
ROUGE measures recall of n-grams from the reference text: the fraction of reference n-grams that also appear in the generated text.
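ROUGE-N recall can be sketched in a few lines (clipped counts, single reference; full ROUGE also reports precision, F1, and longest-common-subsequence variants):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: share of reference n-grams recovered by the candidate."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())

# Candidate recovers 2 of the reference's 3 unigrams -> 2/3
print(rouge_n_recall(["the", "cat"], ["the", "cat", "sat"], n=1))
```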
Pass@k is used for code generation: sample multiple completions per problem and check how many pass the unit tests.
Pass@k = 1 - C(n-c, k) / C(n, k)
Where n = total samples, c = correct samples. Unbiased estimator of the probability that at least one of k samples is correct.
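A direct implementation of the estimator above:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = total samples drawn, c = samples that passed the tests."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k must include a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# 1 correct out of 2 samples, drawing 1 -> probability 0.5
print(pass_at_k(n=2, c=1, k=1))
```

Drawing n > k samples and applying this estimator gives lower-variance pass@k numbers than literally sampling k completions per problem.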
| Benchmark | Measures | Format | Tasks/Subjects |
|---|---|---|---|
| MMLU | World knowledge | Multiple choice | 57 subjects (STEM, humanities, social sciences) |
| HellaSwag | Commonsense reasoning | Completion selection | 10K scenarios |
| HumanEval | Code generation | Executable Python | 164 programming problems |
| MBPP | Code generation | Executable Python | 974 problems |
| GSM8K | Math reasoning | Free-form numerical | 8.5K grade school math |
| MATH | Advanced math | Free-form | Competition-level problems |
| TruthfulQA | Factual accuracy | QA pairs | 817 questions designed to elicit falsehoods |
| ToxiGen | Safety and bias | Classification | 274K toxic/benign statements |
| ARC | Science reasoning | Multiple choice | 7.7K grade-school science questions |
| WinoGrande | Coreference resolution | Fill-in-the-blank | 44K pronoun resolution |
LLM-as-a-judge: use a strong LLM (e.g., GPT-4) to evaluate model outputs.
Advantages: captures nuance that automatic metrics miss. Limitations: expensive, and judges show biases toward certain response styles (e.g., longer, more verbose answers).
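A minimal sketch of the prompt-and-parse plumbing, independent of any particular API client (the prompt wording and `Rating:` convention here are illustrative, loosely modeled on MT-Bench-style single-answer grading):

```python
import re

def build_judge_prompt(question, answer):
    """Construct a single-answer grading prompt for the judge model."""
    return (
        "Rate the following answer on a 1-10 scale for helpfulness and "
        "accuracy. Reply with 'Rating: <score>'.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )

def parse_rating(judge_reply):
    """Extract the numeric score from the judge model's reply, or None."""
    match = re.search(r"Rating:\s*(\d+)", judge_reply)
    return int(match.group(1)) if match else None

# The prompt is sent to whatever judge model you use; the reply is parsed:
print(parse_rating("Rating: 8"))
```

In practice, randomizing answer order (for pairwise comparisons) and averaging over multiple judge calls helps mitigate position and style biases.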
EleutherAI's lm-evaluation-harness, the standard open-source evaluation framework:
lm_eval --model hf --model_args pretrained=meta-llama/Llama-3-70B \
--tasks mmlu,hellaswag,gsm8k \
--num_fewshot 5 \
--batch_size 8
HELM (Holistic Evaluation of Language Models) is Stanford's comprehensive evaluation framework, scoring models across many scenarios and metrics beyond accuracy.
| Stage | Key Metrics | What to Watch |
|---|---|---|
| Pretraining | Perplexity, MMLU, HellaSwag | Training loss convergence, benchmark trends |
| SFT | MT-Bench, instruction-following accuracy | Overfitting to training format |
| Alignment | TruthfulQA, ToxiGen, MT-Bench | Alignment tax (capability regression) |
| Quantization | Perplexity delta, MMLU delta | <1% degradation on critical metrics |
| Deployment | Latency, throughput, user satisfaction | A/B test against baseline |
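The "<1% degradation" gate in the quantization row can be automated. A sketch of such a check, with the tolerance and direction handling as illustrative assumptions:

```python
def degradation_ok(baseline, candidate, tol=0.01, higher_is_better=True):
    """Return True if the metric regressed by at most `tol` (relative).
    Use higher_is_better=False for metrics like perplexity."""
    if higher_is_better:
        delta = (baseline - candidate) / baseline
    else:
        delta = (candidate - baseline) / baseline
    return delta <= tol

# MMLU dropped from 0.700 to 0.695 (~0.7% relative) -> within a 1% budget
print(degradation_ok(0.700, 0.695))
```

Running such gates on both accuracy-style metrics (MMLU) and loss-style metrics (perplexity delta) before deployment catches quantization regressions early.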