nvidia-generative-ai-notes

TensorRT-LLM

TensorRT-LLM is NVIDIA’s library for compiling and optimizing LLMs for production inference. It transforms model checkpoints into highly optimized inference engines using kernel fusion, quantization, and advanced batching, typically delivering a 2-5x speedup over native PyTorch inference.

Why TensorRT-LLM?

PyTorch inference is flexible but leaves significant performance on the table:

- Eager execution launches many small, unfused GPU kernels, each paying launch overhead
- Generic kernels are not tuned for the specific target GPU architecture
- Static batching makes short requests wait for the longest sequence in the batch
- The KV cache is pre-allocated for the maximum sequence length, wasting GPU memory

TensorRT-LLM addresses all of these by compiling the model into an optimized execution graph with fused, architecture-specific kernels, in-flight batching, and a paged KV cache.

Compilation Workflow

Model Checkpoint (HuggingFace / NeMo / Megatron)
    ↓
1. Convert to TensorRT-LLM checkpoint format
    ↓
2. Build TensorRT engine (.engine file)
    ↓
3. Deploy via Triton or direct API

Step 1: Checkpoint Conversion

# Convert HuggingFace checkpoint to TRT-LLM format
python convert_checkpoint.py \
    --model_dir /models/llama-70b-hf \
    --output_dir /checkpoints/llama-70b-trtllm \
    --dtype float16 \
    --tp_size 8
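
With --tp_size 8, the conversion pre-shards the weights into one checkpoint file per tensor-parallel rank, so the build step can emit one engine per GPU.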

Step 2: Engine Build

trtllm-build \
    --checkpoint_dir /checkpoints/llama-70b-trtllm \
    --output_dir /engines/llama-70b \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096 \
    --paged_kv_cache enable \
    --use_inflight_batching
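
Here max_seq_len caps prompt plus generated tokens, so a full 2048-token prompt leaves room for up to 2048 new tokens.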

Step 3: Run Inference

from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer  # generate() takes token IDs, not raw text

# Exact ModelRunner arguments vary across TRT-LLM releases
tokenizer = AutoTokenizer.from_pretrained("/models/llama-70b-hf")
runner = ModelRunner.from_dir("/engines/llama-70b")
ids = tokenizer("Explain quantum computing:", return_tensors="pt").input_ids[0]
outputs = runner.generate(batch_input_ids=[ids], max_new_tokens=256,
                          end_id=tokenizer.eos_token_id, pad_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))

Key Optimizations

Kernel Fusion

Combines multiple operations into a single GPU kernel, eliminating per-launch overhead and round-trips through global memory. Typical fusions include:

- GEMM + bias + activation in the MLP layers
- Fused multi-head attention (QK^T, softmax, and the value GEMM in one kernel)
- Residual add + LayerNorm / RMSNorm
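
A toy PyTorch contrast of what fusion removes (illustrative only; TRT-LLM performs the fusion at the compiled-kernel level, not in Python):

import torch
import torch.nn.functional as F

x = torch.randn(4, 512)
w = torch.randn(512, 512)
b = torch.randn(512)

# Unfused: three kernel launches, two intermediate tensors in global memory
y = x @ w          # kernel 1: GEMM
y = y + b          # kernel 2: bias add
y = F.gelu(y)      # kernel 3: activation

# A fused GEMM+bias+GELU kernel produces the same result in one launch,
# keeping intermediates in registers/shared memory.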

Graph Optimization

Beyond fusion, the TensorRT builder applies graph-level rewrites: constant folding, elimination of dead layers, and per-operation selection of the fastest kernel implementation for the target GPU.

Quantization Support

TensorRT-LLM supports multiple precision modes at build time:

| Mode | Description | Typical Speedup |
|------|-------------|-----------------|
| FP16 | Standard half-precision | Baseline |
| FP8 | H100 Transformer Engine | ~2x |
| INT8 (SmoothQuant) | W8A8 with smoothing | ~1.5-2x |
| INT4 (AWQ) | 4-bit weight-only | ~2-3x |
| INT4 (GPTQ) | 4-bit weight-only | ~2-3x |

# Build with INT4 AWQ quantization
trtllm-build --checkpoint_dir /checkpoints/llama-70b-awq \
    --output_dir /engines/llama-70b-int4 \
    --quant_mode int4_awq
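
The AWQ checkpoint referenced above is produced beforehand with TRT-LLM's quantization example script; a sketch of the typical invocation (script path and flags vary across releases):

# Quantize the HF checkpoint to INT4 AWQ before building
python examples/quantization/quantize.py \
    --model_dir /models/llama-70b-hf \
    --output_dir /checkpoints/llama-70b-awq \
    --qformat int4_awq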

In-Flight Batching

Also called continuous batching: new requests enter the batch as completed ones exit, at the granularity of individual decoding steps. The effects:

- No request waits for the longest sequence in its batch to finish
- GPU utilization stays high under mixed short/long workloads
- Throughput and average latency both improve under load, as the toy scheduler below illustrates
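
A minimal Python sketch of the scheduling policy (a toy model, not TRT-LLM internals):

from collections import deque

# (request id, tokens still to generate)
queue = deque([("req1", 3), ("req2", 6), ("req3", 2)])
active, MAX_BATCH, step = {}, 2, 0

while queue or active:
    # Admit queued requests the moment a batch slot frees up
    while queue and len(active) < MAX_BATCH:
        rid, remaining = queue.popleft()
        active[rid] = remaining
    # One decoding step: every active request generates one token
    for rid in list(active):
        active[rid] -= 1
        if active[rid] == 0:
            del active[rid]   # finished request leaves mid-batch
    step += 1

print(step)  # 6 steps; static batching would take 8 for the same workload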

Paged KV Cache

Manages KV cache memory like OS virtual memory: the cache is split into fixed-size blocks allocated on demand as sequences grow, with a per-sequence block table mapping logical positions to physical blocks. This avoids pre-allocating for the maximum sequence length, eliminates fragmentation, and lets freed blocks be reused immediately, allowing larger effective batch sizes.
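
A toy allocator sketching the bookkeeping (block size and pool size are illustrative):

BLOCK_TOKENS = 64              # tokens stored per physical block
free_blocks = list(range(8))   # pool of physical GPU blocks
block_tables = {}              # sequence id -> list of physical block ids

def grow(seq_id, new_length):
    """Allocate blocks on demand as a sequence grows, like demand paging."""
    table = block_tables.setdefault(seq_id, [])
    while len(table) * BLOCK_TOKENS < new_length:
        table.append(free_blocks.pop())

def release(seq_id):
    """Return a finished sequence's blocks to the pool for immediate reuse."""
    free_blocks.extend(block_tables.pop(seq_id))

grow("a", 100)   # 2 blocks, not a worst-case reservation
grow("b", 40)    # 1 block
release("a")     # a's blocks become reusable
grow("b", 300)   # grows incrementally to 5 blocks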

Multi-GPU Inference

TensorRT-LLM supports multi-GPU inference for models that don’t fit on a single GPU, using tensor parallelism (tp_size, splitting each layer across GPUs) and pipeline parallelism (pp_size, splitting the layer stack into stages). The product tp_size × pp_size must match the number of ranks launched:

# Build for 8-GPU tensor parallelism
trtllm-build --checkpoint_dir /checkpoints/llama-70b \
    --tp_size 8 --pp_size 1 \
    --output_dir /engines/llama-70b-tp8

# Launch with MPI
mpirun -n 8 python run_inference.py --engine_dir /engines/llama-70b-tp8

Supported Architectures

TensorRT-LLM includes optimized implementations for most popular model families, including:

- LLaMA / Llama 2 / Llama 3, Mistral, and Mixtral (MoE)
- GPT-2, GPT-J, GPT-NeoX, and Falcon
- Qwen, Gemma, and Phi
- Encoder-decoder models such as T5 and Whisper

Performance Characteristics

Typical speedups over PyTorch (FP16, batch size 1):

| Model | PyTorch | TRT-LLM FP16 | TRT-LLM INT4 |
|-------|---------|--------------|--------------|
| LLaMA 7B | 35 tok/s | 85 tok/s | 150 tok/s |
| LLaMA 70B (8xH100) | 15 tok/s | 40 tok/s | 80 tok/s |

Actual numbers depend on hardware, batch size, sequence length, and quantization.

Integration with Triton

TensorRT-LLM engines deploy via Triton Inference Server through the tensorrtllm_backend, which runs the engine with in-flight batching and server-side request scheduling. A model repository typically chains tokenization, the engine, and detokenization into an ensemble, as sketched below.
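
The usual repository layout from the tensorrtllm_backend examples (directory names follow those examples; details vary by release):

model_repository/
├── preprocessing/     # Python backend: tokenizes request text
├── tensorrt_llm/      # runs the compiled engine with in-flight batching
├── postprocessing/    # Python backend: detokenizes output token IDs
└── ensemble/          # chains the three stages into a single endpoint

Once serving, Triton's generate endpoint can be queried directly (field names follow the backend's example configs):

curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "Explain quantum computing:", "max_tokens": 256}'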

Integration with NIM

NVIDIA NIM packages TensorRT-LLM engines with Triton into pre-built containers, providing one-command deployment with automatic optimization for the target GPU.
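
The launch pattern is a single container run; a sketch (image name illustrative, and an NGC API key is required to pull NIM images):

docker run --rm --gpus all \
    -e NGC_API_KEY \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

The container picks or builds a TensorRT-LLM engine for the detected GPU and serves an OpenAI-compatible API on port 8000.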