TensorRT-LLM is NVIDIA’s library for compiling and optimizing LLMs for production inference. It transforms model checkpoints into highly optimized inference engines with kernel fusion, quantization, and advanced batching — delivering 2-5x speedup over native PyTorch inference.
PyTorch inference is flexible but leaves significant performance on the table: each operation launches its own kernel, matmuls and activations run unfused, static batching idles on the longest sequence in the batch, and KV-cache memory is reserved for the worst case. TensorRT-LLM addresses these bottlenecks by compiling the model into an optimized execution graph.
```
Model Checkpoint (HuggingFace / NeMo / Megatron)
                  ↓
  1. Convert to TensorRT-LLM checkpoint format
                  ↓
  2. Build TensorRT engine (.engine file)
                  ↓
  3. Deploy via Triton or direct API
```
```bash
# Convert HuggingFace checkpoint to TRT-LLM format
python convert_checkpoint.py \
    --model_dir /models/llama-70b-hf \
    --output_dir /checkpoints/llama-70b-trtllm \
    --dtype float16 \
    --tp_size 8
```
```bash
# Build the engine with paged KV cache and in-flight batching enabled
trtllm-build \
    --checkpoint_dir /checkpoints/llama-70b-trtllm \
    --output_dir /engines/llama-70b \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096 \
    --paged_kv_cache enable \
    --use_inflight_batching
```
```python
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

# Load the built engine and generate directly from Python
runner = ModelRunner.from_dir("/engines/llama-70b")
outputs = runner.generate(
    input_texts=["Explain quantum computing:"],
    max_new_tokens=256,
)
```
Kernel fusion combines multiple operations into a single GPU kernel, so intermediate results stay in registers or shared memory instead of round-tripping through device memory, and per-kernel launch overhead is paid once instead of once per operation.
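A toy NumPy sketch of the idea (illustrative only, not TensorRT-LLM code): the unfused version makes three passes over memory, while the fused version computes the same result in one expression, mimicking what a single fused GPU kernel does.

```python
import numpy as np

def unfused_mlp(x, w, b):
    # Three separate "kernels": each writes its intermediate back to memory
    y = x @ w                 # kernel 1: matmul
    y = y + b                 # kernel 2: bias add
    return np.maximum(y, 0)   # kernel 3: ReLU

def fused_mlp(x, w, b):
    # A fused kernel computes the same result in one pass; on a real GPU
    # the intermediates never leave registers/shared memory
    return np.maximum(x @ w + b, 0)
```

Both produce identical results; the win on a GPU is memory traffic and launch overhead, not arithmetic.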
TensorRT-LLM supports multiple precision modes at build time:
| Mode | Description | Typical Speedup |
|---|---|---|
| FP16 | Standard half-precision | Baseline |
| FP8 | H100 Transformer Engine | ~2x |
| INT8 (SmoothQuant) | W8A8 with smoothing | ~1.5-2x |
| INT4 (AWQ) | 4-bit weight-only, activation-aware scaling | ~2-3x |
| INT4 (GPTQ) | 4-bit weight-only, layer-wise error minimization | ~2-3x |
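Much of the speedup in the table comes from shrinking weight traffic. A rough footprint calculation for a 70B-parameter model (weights only, ignoring KV cache and activations):

```python
# Approximate weight memory for a 70B-parameter model at each precision
params = 70e9
bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

for mode, nbytes in bytes_per_param.items():
    print(f"{mode}: {params * nbytes / 1e9:.0f} GB")
```

At FP16 the weights alone are ~140 GB (hence multi-GPU for 70B models); INT4 brings that to ~35 GB, which also cuts the memory bandwidth needed per decoded token.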
```bash
# Build with INT4 AWQ quantization
trtllm-build --checkpoint_dir /checkpoints/llama-70b-awq \
    --output_dir /engines/llama-70b-int4 \
    --quant_mode int4_awq
```
In-flight batching, also called continuous batching, lets new requests join the running batch as soon as completed ones exit, rather than waiting for every sequence in a static batch to finish. This keeps the GPU saturated and sharply reduces queueing latency under load.
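The scheduling idea can be sketched as a toy simulation (a hypothetical helper, not TensorRT-LLM's actual scheduler): each request needs some number of decode steps, and a freed slot is refilled immediately instead of waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy simulation: requests are (id, num_tokens) pairs. A new request
    joins the batch as soon as a finished one frees a slot."""
    queue = deque(requests)
    active = {}   # request id -> remaining tokens to generate
    steps = 0
    while queue or active:
        # Refill free slots immediately (no waiting for the batch to empty)
        while queue and len(active) < max_batch:
            rid, tokens = queue.popleft()
            active[rid] = tokens
        # One decode step: every active request emits one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
        steps += 1
    return steps
```

With requests of lengths 2, 5, 1, 3, 2 and two slots, continuous batching finishes in 7 decode steps (the 13 total tokens packed perfectly), whereas static batches of two would idle short sequences behind long ones and take 10.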
Paged KV cache manages KV-cache memory the way an OS manages virtual memory: the cache is divided into fixed-size blocks allocated on demand and mapped through a per-sequence block table, eliminating fragmentation and the need to reserve memory for the maximum sequence length up front.
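A minimal sketch of the allocator concept (hypothetical class, not the library's implementation): blocks come from a shared pool one at a time, and a finished sequence returns its blocks immediately for reuse.

```python
class PagedKVCache:
    """Toy paged KV-cache allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=64):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_size = block_size         # tokens per block
        self.tables = {}                     # seq id -> list of block ids
        self.lengths = {}                    # seq id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        # Allocate a new block only when the current one is full
        if n % self.block_size == 0:
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # A finished sequence returns its blocks to the pool immediately
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence only ever holds the blocks it has actually filled, no memory is wasted on unreached positions, and the pool can be shared across many concurrent requests.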
TensorRT-LLM supports multi-GPU inference for models that don’t fit on a single GPU:
```bash
# Build for 8-GPU tensor parallelism
trtllm-build --checkpoint_dir /checkpoints/llama-70b \
    --tp_size 8 --pp_size 1 \
    --output_dir /engines/llama-70b-tp8

# Launch with MPI
mpirun -n 8 python run_inference.py --engine_dir /engines/llama-70b-tp8
```
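A quick back-of-envelope for why `--tp_size 8` is needed here (weights only, ignoring KV cache and activations): tensor parallelism shards each weight matrix across GPUs, dividing the per-GPU footprint by the TP degree.

```python
# Per-GPU weight memory for a 70B-parameter model at FP16 under tensor
# parallelism (illustrative arithmetic only)
params = 70e9
bytes_fp16 = 2

for tp in (1, 2, 4, 8):
    per_gpu_gb = params * bytes_fp16 / tp / 1e9
    print(f"tp_size={tp}: {per_gpu_gb:.1f} GB of weights per GPU")
```

At `tp_size=8` each GPU holds ~17.5 GB of weights, which fits on an 80 GB H100 with ample headroom left for the KV cache; at `tp_size=1` the 140 GB of weights cannot fit on any single GPU.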
TensorRT-LLM includes optimized implementations for popular architecture families, including LLaMA, GPT, Falcon, Mistral, and Mixtral.
Typical throughput compared with PyTorch (FP16 baseline, batch size 1):
| Model | PyTorch | TRT-LLM FP16 | TRT-LLM INT4 |
|---|---|---|---|
| LLaMA 7B | 35 tok/s | 85 tok/s | 150 tok/s |
| LLaMA 70B (8xH100) | 15 tok/s | 40 tok/s | 80 tok/s |
Actual numbers depend on hardware, batch size, sequence length, and quantization.
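Reading the table above as speedup ratios over the PyTorch baseline (simple arithmetic on the listed tok/s figures):

```python
# Speedup ratios implied by the benchmark table (tok/s, batch size 1)
baseline    = {"LLaMA 7B": 35, "LLaMA 70B (8xH100)": 15}
trtllm_fp16 = {"LLaMA 7B": 85, "LLaMA 70B (8xH100)": 40}
trtllm_int4 = {"LLaMA 7B": 150, "LLaMA 70B (8xH100)": 80}

for model in baseline:
    fp16_x = trtllm_fp16[model] / baseline[model]
    int4_x = trtllm_int4[model] / baseline[model]
    print(f"{model}: FP16 {fp16_x:.1f}x, INT4 {int4_x:.1f}x")
```

This works out to roughly 2.4-2.7x for FP16 and 4.3-5.3x for INT4, consistent with the 2-5x range quoted at the top of this article.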
TensorRT-LLM engines deploy via Triton Inference Server using the tensorrtllm_backend, which exposes HTTP/gRPC endpoints and drives the in-flight batching scheduler.
NVIDIA NIM packages TensorRT-LLM engines with Triton into pre-built containers, providing one-command deployment with automatic optimization for the target GPU.