When you send a prompt to an LLM and it generates a response, inference happens in two distinct phases:

The prefill phase processes all input prompt tokens in parallel in a single forward pass.
Characteristics:
- All prompt tokens are processed at once as large matrix multiplications
- Compute-bound: GPU utilization is high
- Its latency determines Time to First Token (TTFT)
- The K and V projections for every prompt token are computed and stored in the KV cache

In the decode phase, the model generates tokens one at a time, autoregressively: each new token is appended to the context and fed back in.
Characteristics:
- One token per forward pass, so the work is inherently sequential
- Memory-bound: every step must stream the model weights from memory, so GPU utilization is low
- Its latency determines Time per Output Token (TPOT)
- Each step extends the KV cache by one entry per layer

Without a KV cache, generating token 100 would require recomputing the keys and values for all 99 previous tokens at every layer, repeating work already done when generating tokens 1-99.

With a KV cache (the standard approach), generating token 100 only requires computing Q, K, and V for the new token; the keys and values for tokens 1-99 are read back from the cache.
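The difference can be sketched with a toy single-head attention step. The dimensions and random "projection" matrices below are illustrative assumptions, not a real model; the point is that the cached path does one K/V projection per step while the uncached path redoes all of them.

```python
import numpy as np

# Toy single-head attention, to contrast decode cost with and without a KV cache.
d = 16                                 # head dimension (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_no_cache(hidden_states):
    """Without a cache: step t recomputes K and V for all t tokens (O(t) projections)."""
    x_t = hidden_states[-1]
    K = hidden_states @ W_k.T
    V = hidden_states @ W_v.T
    return attend(W_q @ x_t, K, V)

def decode_with_cache(x_t, K_cache, V_cache):
    """With a cache: only the new token's K and V are computed, then appended (O(1))."""
    K_cache.append(W_k @ x_t)
    V_cache.append(W_v @ x_t)
    return attend(W_q @ x_t, np.array(K_cache), np.array(V_cache))
```

Both paths produce identical attention outputs; the cache only removes redundant computation.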

KV Cache Memory = 2 × num_layers × num_heads × seq_len × head_dim × bytes_per_element × batch_size
Note:
- 2 accounts for the Key and Value matrices
- num_layers: each layer keeps its own K and V cache
- num_heads: the number of heads whose K/V are stored (with grouped-query attention, this is the number of KV heads, not query heads)
- seq_len: the number of tokens in the context
- head_dim: the dimension of each attention head
- bytes_per_element: set by the precision (2 for FP16, 4 for FP32)
- batch_size: the number of sequences processed simultaneously
Example: LLaMA 2 70B
- Layers: 80
- Heads: 64 (with GQA: 8 KV heads)
- Head dim: 128
- Sequence: 4096 tokens
- Dtype: float16 (2 bytes)
- Batch: 1
KV Cache = 2 × 80 × 8 × 4096 × 128 × 2 × 1
= 1.34 GB per sequence!
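The formula above drops straight into a small helper function. This is a minimal sketch of the same arithmetic; the function name and defaults are my own, not from any serving library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim,
                   bytes_per_element=2, batch_size=1):
    """KV cache size in bytes:
    2 (K and V) x layers x KV heads x tokens x head_dim x dtype size x batch."""
    return (2 * num_layers * num_kv_heads * seq_len * head_dim
            * bytes_per_element * batch_size)

# LLaMA 2 70B with GQA (8 KV heads), 4096-token context, FP16, batch 1:
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, seq_len=4096, head_dim=128)
print(f"{size / 1e9:.2f} GB")   # -> 1.34 GB per sequence
```

Doubling the batch size or the context length doubles this figure, which is why long-context, high-batch serving runs out of KV cache memory long before it runs out of compute.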
This is why KV cache memory is often the bottleneck for serving LLMs.
| Aspect | Prefill | Decode |
|---|---|---|
| Tokens processed | All prompt tokens (parallel) | One token (sequential) |
| Compute intensity | High (compute-bound) | Low (memory-bound) |
| GPU utilization | High (matrix ops) | Low (memory transfers) |
| Bottleneck | FLOPS | Memory bandwidth |
| Latency | Time to First Token (TTFT) | Time per Output Token (TPOT) |
| KV cache | Created | Extended |
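Because decode is memory-bandwidth bound, a common back-of-envelope estimate of its throughput ceiling divides the bytes that must be streamed per token by the GPU's memory bandwidth. The numbers below (140 GB of FP16 weights for a 70B model, 2 TB/s of HBM bandwidth) are illustrative assumptions, not measurements.

```python
def max_decode_tokens_per_sec(model_bytes, kv_cache_bytes, bandwidth_bytes_per_sec):
    """Rough upper bound on batch-1 decode throughput: each generated token
    must stream the full model weights (plus the KV cache) from memory once."""
    bytes_per_token = model_bytes + kv_cache_bytes
    return bandwidth_bytes_per_sec / bytes_per_token

# Assumed figures: 70B params in FP16 ~ 140 GB, ~1.34 GB KV cache, 2 TB/s HBM.
tps = max_decode_tokens_per_sec(140e9, 1.34e9, 2.0e12)
print(f"~{tps:.0f} tokens/sec")
```

This is why batching helps decode so much: the weights are streamed once per step regardless of batch size, so the same memory traffic serves many sequences at once.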
During decode, for each token: