nvidia-generative-ai-notes

Triton Inference Server

Triton Inference Server is NVIDIA’s production-grade inference serving platform. It supports multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT, TensorRT-LLM) and provides dynamic batching, concurrent model execution, and comprehensive monitoring.

Why Triton?

Moving from notebook inference to production requires:

- handling many concurrent requests with high, predictable throughput
- serving models from different frameworks behind a single API
- versioning models and rolling out updates safely (e.g., A/B testing)
- observability: metrics, health checks, and logging

Triton addresses all of these with a single, unified serving platform.

Deployment Workflow

Step 1: Export from NeMo

NeMo Checkpoint → TensorRT-LLM Engine → Triton Model Repository
NeMo Checkpoint → ONNX Export        → Triton Model Repository
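
On the command line, the two export paths might look like the following sketch. The checkpoint and output paths are hypothetical, `export_to_onnx.py` is a placeholder script name, and this assumes the TensorRT-LLM toolchain (`trtllm-build`) is installed:

```shell
# Path A (hypothetical paths): build a TensorRT-LLM engine from a
# checkpoint already converted to TensorRT-LLM format, writing the
# engine into a Triton model version directory.
trtllm-build \
    --checkpoint_dir /ckpts/llama_70b_trtllm \
    --output_dir /models/llama_70b/1

# Path B (hypothetical script name): export a NeMo checkpoint to
# ONNX instead, placing model.onnx under the model repository.
python export_to_onnx.py \
    --nemo-checkpoint /ckpts/embedding_model.nemo \
    --output /models/embedding_model/1/model.onnx
```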

Step 2: Organize Model Repository

model_repository/
├── llama_70b/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.plan          # TensorRT-LLM engine
│   └── 2/
│       └── model.plan          # Version 2 (for A/B testing)
└── embedding_model/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
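
The layout above is just directories and files, so it can be scripted. A minimal sketch that builds the same tree with the standard library; the `config.pbtxt` contents follow Triton's model configuration schema, but the tensor names and dims are illustrative assumptions, not taken from a real model:

```python
import tempfile
from pathlib import Path

# Minimal config.pbtxt for the ONNX model. Field names follow Triton's
# model configuration schema; tensor names/dims are illustrative.
EMBEDDING_CONFIG = """\
name: "embedding_model"
backend: "onnxruntime"
max_batch_size: 32
input [
  { name: "input_ids", data_type: TYPE_INT64, dims: [ -1 ] }
]
output [
  { name: "embedding", data_type: TYPE_FP32, dims: [ 768 ] }
]
"""

def build_repository(root: Path) -> None:
    """Create the layout Triton expects: one directory per model,
    a config.pbtxt at its top level, and numbered version dirs."""
    llama = root / "llama_70b"
    for version in ("1", "2"):                    # two versions for A/B testing
        (llama / version).mkdir(parents=True, exist_ok=True)
        (llama / version / "model.plan").touch()  # placeholder for the TRT-LLM engine
    (llama / "config.pbtxt").touch()

    emb = root / "embedding_model" / "1"
    emb.mkdir(parents=True, exist_ok=True)
    (emb / "model.onnx").touch()                  # placeholder for the ONNX file
    (root / "embedding_model" / "config.pbtxt").write_text(EMBEDDING_CONFIG)

repo = Path(tempfile.mkdtemp()) / "model_repository"
build_repository(repo)
print(sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*")))
```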

Step 3: Configure and Launch

tritonserver --model-repository=/models --log-verbose=1
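
Once the server is up, clients reach it over HTTP (port 8000 by default) using the KServe v2 inference protocol. A sketch of building such a request with only the standard library; the model and tensor names are illustrative, and actually sending the request requires a running server:

```python
import json

def make_infer_request(input_ids):
    """Build a KServe-v2-style JSON payload for
    POST /v2/models/<model_name>/infer on Triton's HTTP port."""
    return json.dumps({
        "inputs": [
            {
                "name": "input_ids",           # must match config.pbtxt
                "shape": [1, len(input_ids)],
                "datatype": "INT64",
                "data": [input_ids],
            }
        ],
        "outputs": [{"name": "embedding"}],
    })

body = make_infer_request([101, 2023, 102])
print(body)
# To send (requires a running server):
# urllib.request.urlopen(urllib.request.Request(
#     "http://localhost:8000/v2/models/embedding_model/infer",
#     data=body.encode(), headers={"Content-Type": "application/json"}))
```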

Key Features

Dynamic Batching

Automatically groups requests that arrive within a configurable queue delay into larger batches on the server side, trading a small amount of added latency for substantially higher GPU utilization and throughput.
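
Dynamic batching is enabled per model in config.pbtxt. A minimal fragment (field names follow Triton's model configuration schema; the values are illustrative):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

With this fragment, requests arriving within the 100 µs window are merged up to a preferred batch size before execution.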

TensorRT-LLM Backend

Purpose-built for LLM inference: the TensorRT-LLM backend adds in-flight (continuous) batching, paged KV-cache management, multi-GPU tensor/pipeline parallelism, and quantization support.

Concurrent Model Execution

Triton can run multiple models, or multiple instances of the same model, concurrently on the same GPU or across GPUs; instance count and placement are set per model in config.pbtxt.
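
An illustrative config.pbtxt fragment that runs two instances of a model on GPU 0 (values are assumptions; field names follow Triton's model configuration schema):

```
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
]
```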

Model Analyzer

Model Analyzer profiles a model under different configurations (batch sizes, instance counts, dynamic batching settings) and reports the throughput/latency trade-offs, helping you choose an optimal config.pbtxt.
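
An illustrative invocation sketch (paths are hypothetical; assumes the Model Analyzer package is installed alongside Triton):

```shell
# Sweep configurations for llama_70b and write the best ones
# to a separate output repository.
model-analyzer profile \
    --model-repository /models \
    --profile-models llama_70b \
    --output-model-repository-path /models_optimized
```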

Monitoring

Triton exposes Prometheus-compatible metrics (by default on port 8002 at /metrics), including request counts, queue and compute latencies, and GPU utilization.
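
The metrics endpoint returns plain Prometheus text. A sketch of parsing it with the standard library; the sample below is a hand-written excerpt using metric names Triton documents, with made-up values, not real server output:

```python
import re

# Illustrative excerpt in Prometheus text format; values are invented.
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="llama_70b",version="1"} 1042
nv_inference_count{model="llama_70b",version="1"} 8336
"""

def parse_metrics(text):
    """Return {metric_name: {label_string: float_value}}."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        m = re.match(r'(\w+)\{([^}]*)\}\s+([\d.eE+-]+)', line)
        if m:
            name, labels, value = m.groups()
            metrics.setdefault(name, {})[labels] = float(value)
    return metrics

parsed = parse_metrics(SAMPLE)
labels = 'model="llama_70b",version="1"'
# Average batch size = total inferences / total requests.
avg_batch = parsed["nv_inference_count"][labels] / parsed["nv_inference_request_success"][labels]
print(round(avg_batch, 2))  # 8336 / 1042 = 8.0
```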

Integration Points