nvidia-generative-ai-notes

Model Evaluation

Systematic evaluation is essential at every stage: after pretraining (does the model understand language?), after fine-tuning (does it follow instructions?), after alignment (is it safe and helpful?), and after quantization (did we lose accuracy?).

Automatic Metrics

Perplexity

Measures how well a model predicts the next token. Lower is better.

PPL = exp(-1/N Σ_{i=1}^{N} log P(x_i | x_{<i}))
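
The formula maps directly to code. A minimal sketch, assuming you already have per-token natural-log probabilities from a model:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities log P(x_i | x_<i)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Sanity check: a uniform distribution over 4 tokens gives PPL = 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))  # ≈ 4.0
```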

BLEU (Bilingual Evaluation Understudy)

Measures n-gram overlap between generated text and reference:

BLEU = BP · exp(Σ_{n=1}^{4} w_n · log p_n)

Where p_n = modified n-gram precision, BP = brevity penalty, w_n = uniform weights (typically 1/4).
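
A simplified sentence-level implementation of the formula above (single reference, no smoothing; production toolkits such as sacreBLEU add both, plus standardized tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1/max_n and brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())     # clipped (modified) counts
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:                     # log(0) is undefined: BLEU is 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

ref = "the cat sat on the mat".split()
print(bleu(ref, ref))  # identical strings → 1.0
```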

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Measures how much of the reference text the generated text recovers. ROUGE-N is n-gram recall (overlapping n-grams divided by total reference n-grams); ROUGE-L uses the longest common subsequence. Commonly reported for summarization, where covering the reference content matters more than exact phrasing.
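
A minimal ROUGE-N recall sketch (real implementations also report precision and F1):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref = grams(reference)
    overlap = sum((grams(candidate) & ref).values())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat sat".split(),
                     "the cat sat on the mat".split()))  # 3/6 = 0.5
```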

Exact Match and F1

The standard metrics for extractive QA (e.g., SQuAD). Exact Match scores 1 only when the normalized prediction matches a reference answer exactly; F1 is the harmonic mean of token-level precision and recall, giving partial credit for overlap.
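
A token-level F1 sketch (SQuAD-style, minus the usual answer normalization such as lowercasing and punctuation stripping):

```python
from collections import Counter

def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the Eiffel Tower", "Eiffel Tower"))  # ≈ 0.8
```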

Pass@k

For code generation: generate k samples, check how many pass unit tests.

Pass@k = 1 - C(n-c, k) / C(n, k)

Where n = total samples, c = correct samples. Unbiased estimator of the probability that at least one of k samples is correct.
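
The estimator translates directly to code, with a guard for the case where every size-k draw must contain a correct sample:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., Codex): probability that at
    least one of k samples drawn without replacement from n is correct,
    given that c of the n samples passed the unit tests."""
    if n - c < k:
        return 1.0   # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # ≈ 0.3 — for k=1 this is just the raw success rate
```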

Benchmark Suites

| Benchmark | Measures | Format | Tasks/Subjects |
|-----------|----------|--------|----------------|
| MMLU | World knowledge | Multiple choice | 57 subjects (STEM, humanities, social sciences) |
| HellaSwag | Commonsense reasoning | Completion selection | 10K scenarios |
| HumanEval | Code generation | Executable Python | 164 programming problems |
| MBPP | Code generation | Executable Python | 974 problems |
| GSM8K | Math reasoning | Free-form numerical | 8.5K grade-school math problems |
| MATH | Advanced math | Free-form | Competition-level problems |
| TruthfulQA | Factual accuracy | QA pairs | 817 questions designed to elicit falsehoods |
| ToxiGen | Safety and bias | Classification | 274K toxic/benign statements |
| ARC | Science reasoning | Multiple choice | 7.7K grade-school science questions |
| WinoGrande | Coreference resolution | Fill-in-the-blank | 44K pronoun-resolution problems |

LLM-as-Judge Evaluation

Use a strong LLM (e.g., GPT-4) to evaluate model outputs, typically in one of three modes: single-answer grading against a rubric, pairwise comparison of two models' responses, or reference-guided grading against a gold answer.

Advantages: captures nuance that automatic metrics miss, and correlates well with human preference on open-ended tasks. Limitations: expensive at scale, imperfectly reproducible, and subject to known biases such as position bias, verbosity bias, and self-preference for outputs in the judge's own style.
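
A minimal sketch of the pairwise-judging plumbing. The prompt wording and the `parse_verdict` helper are illustrative, and the actual call to the judge model (whatever strong LLM you use) is omitted:

```python
def build_pairwise_prompt(question, answer_a, answer_b):
    """Pairwise judging prompt in the MT-Bench style; wording is illustrative."""
    return (
        "You are an impartial judge. Compare the two assistant answers to the "
        "user question below and decide which is better.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Answer A]\n{answer_a}\n\n"
        f"[Answer B]\n{answer_b}\n\n"
        'End your reply with a verdict line: "Verdict: A", "Verdict: B", '
        'or "Verdict: tie".'
    )

def parse_verdict(judge_reply):
    """Pull the verdict out of the judge model's reply; None if missing."""
    for line in reversed(judge_reply.strip().splitlines()):
        if line.startswith("Verdict:"):
            return line.split(":", 1)[1].strip().upper()
    return None

# The reply would come from the judge model; a canned one for illustration:
print(parse_verdict("Answer A is more complete.\nVerdict: A"))  # → A
```

To mitigate position bias, a common practice is to judge each pair in both orders and average (or discard disagreements).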

Evaluation Frameworks

EleutherAI lm-evaluation-harness

The standard open-source evaluation framework:

lm_eval --model hf --model_args pretrained=meta-llama/Llama-3-70B \
    --tasks mmlu,hellaswag,gsm8k \
    --num_fewshot 5 \
    --batch_size 8

HELM (Holistic Evaluation of Language Models)

Stanford's comprehensive evaluation framework, which scores models along multiple axes (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) rather than on a single headline metric.

OpenAI Evals

An open-source framework and registry of evaluations: existing evals can be run against any model behind a compatible API, and new evals can be defined declaratively in YAML or as custom Python classes.

Evaluation at Each Stage

| Stage | Key Metrics | What to Watch |
|-------|-------------|---------------|
| Pretraining | Perplexity, MMLU, HellaSwag | Training loss convergence, benchmark trends |
| SFT | MT-Bench, instruction-following accuracy | Overfitting to the training format |
| Alignment | TruthfulQA, ToxiGen, MT-Bench | Alignment tax (capability regression) |
| Quantization | Perplexity delta, MMLU delta | <1% degradation on critical metrics |
| Deployment | Latency, throughput, user satisfaction | A/B test against baseline |
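
The quantization gate in the table ("<1% degradation on critical metrics") can be expressed as a simple check. This helper is illustrative, assuming higher-is-better metrics such as MMLU accuracy; for perplexity, where lower is better, the comparison would flip:

```python
def passes_quantization_gate(baseline, quantized, max_rel_drop=0.01):
    """Compare quantized metrics against the full-precision baseline and
    flag any metric whose relative degradation exceeds the threshold."""
    for name, base in baseline.items():
        drop = (base - quantized[name]) / base
        if drop > max_rel_drop:
            return False, name, drop
    return True, None, 0.0

ok, metric, drop = passes_quantization_gate(
    {"mmlu": 0.70, "hellaswag": 0.82},     # hypothetical FP16 scores
    {"mmlu": 0.697, "hellaswag": 0.818},   # hypothetical INT4 scores
)
print(ok)  # True: both drops are under 1%
```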

Best Practices