NVIDIA NIM packages optimized LLM inference into pre-built, containerized microservices. It abstracts the complexity of TensorRT-LLM compilation, Triton configuration, and optimization tuning into a single container with an OpenAI-compatible API.
Deploying an optimized LLM in production normally requires:

- Compiling a TensorRT-LLM engine for the target GPU
- Configuring a Triton inference server around that engine
- Tuning batching, precision, and parallelism settings
- Exposing and maintaining a stable serving API

NIM handles all of this automatically, reducing deployment from days of engineering to a single command.
```
Client (OpenAI-compatible API)
        ↓ HTTP/gRPC
┌─────────────────────────────┐
│        NIM Container        │
│  ┌───────────────────────┐  │
│  │      API Gateway      │  │
│  │  (OpenAI-compatible)  │  │
│  └───────────┬───────────┘  │
│              ↓              │
│  ┌───────────────────────┐  │
│  │  TensorRT-LLM Engine  │  │
│  │   (auto-optimized)    │  │
│  └───────────┬───────────┘  │
│              ↓              │
│  ┌───────────────────────┐  │
│  │     GPU Execution     │  │
│  │  (multi-GPU support)  │  │
│  └───────────────────────┘  │
└─────────────────────────────┘
```
```shell
# Pull and run a NIM container
docker run -d --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nvidia/nim/meta-llama3-70b-instruct:latest
```
```python
# Use with the OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta-llama3-70b-instruct",
    messages=[{"role": "user", "content": "Explain transformers in 3 sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
NIM automatically selects the best optimization profile (e.g., latency- vs. throughput-optimized, precision, tensor-parallel degree) for the GPU it detects at startup.
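As a sketch of how profile selection can be inspected or overridden — the `list-model-profiles` utility and the `NIM_MODEL_PROFILE` variable are taken from NVIDIA's NIM documentation, but exact names should be verified against the docs for your specific container:

```shell
# List the optimization profiles bundled with a NIM container
# (utility name per NVIDIA's NIM docs; verify for your image)
docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nvidia/nim/meta-llama3-70b-instruct:latest \
  list-model-profiles

# Pin a specific profile instead of relying on auto-selection
# (env var name per NVIDIA's NIM docs)
docker run -d --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id-from-the-list-above> \
  -p 8000:8000 \
  nvcr.io/nvidia/nim/meta-llama3-70b-instruct:latest
```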
Drop-in replacement for OpenAI endpoints:

- `/v1/chat/completions` — chat interface
- `/v1/completions` — text completion
- `/v1/models` — list available models
- `/v1/embeddings` — embedding generation (for embedding NIMs)

NIM also distributes models across available GPUs automatically.
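Because these endpoints mirror OpenAI's schema, a request can be built as a plain JSON payload with any HTTP client. A minimal sketch of the chat-completions request shape (field names follow the OpenAI API specification):

```python
import json

# Minimal OpenAI-style chat-completions payload, as accepted by a NIM's
# /v1/chat/completions endpoint (field names per the OpenAI API schema).
payload = {
    "model": "meta-llama3-70b-instruct",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain transformers in 3 sentences."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

# The serialized body is what an HTTP client would POST to the endpoint.
body = json.dumps(payload)
print(body[:50])
```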
| NIM Type | Purpose | Examples |
|---|---|---|
| LLM | Text generation | LLaMA, Mistral, Mixtral, Gemma |
| Embedding | Dense vector generation | NV-Embed, E5 |
| Reranking | Cross-encoder scoring | NV-RerankQA |
| Vision-Language | Multimodal generation | LLaVA, VILA |
| Speech | ASR and TTS | Riva, Parakeet |
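To illustrate how the embedding and reranking NIM types compose in a retrieval pipeline, here is a self-contained sketch in which hard-coded vectors and scores stand in for calls to an embedding NIM's `/v1/embeddings` endpoint and a reranking NIM:

```python
import math

# Stubbed embeddings: in practice these would come from an embedding NIM;
# they are hard-coded here for illustration.
doc_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.7, 0.6, 0.1],
}
query_embedding = [1.0, 0.0, 0.0]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Stage 1: dense retrieval — rank documents by embedding similarity.
retrieved = sorted(
    doc_embeddings,
    key=lambda d: cosine(query_embedding, doc_embeddings[d]),
    reverse=True,
)[:2]

# Stage 2: reranking — a cross-encoder NIM would score (query, doc) pairs;
# the stubbed scores here invert the dense order to show why reranking matters.
rerank_scores = {"doc_a": 0.4, "doc_c": 0.8}
final = sorted(retrieved, key=lambda d: rerank_scores.get(d, 0.0), reverse=True)
print(final)  # → ['doc_c', 'doc_a']
```

The two-stage pattern (cheap dense retrieval over many documents, expensive cross-encoder scoring over a short list) is the standard way these two NIM types are combined.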
```shell
# Single-GPU deployment
docker run --gpus '"device=0"' -p 8000:8000 nvcr.io/nvidia/nim/model:latest

# Multi-GPU deployment
docker run --gpus all -p 8000:8000 nvcr.io/nvidia/nim/model:latest
```
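Under the hood, multi-GPU serving typically relies on tensor parallelism: each GPU holds a shard of every weight matrix, computes a partial result, and the shards are gathered back together. A toy sketch, with plain Python lists standing in for per-GPU tensors:

```python
# Toy tensor-parallel matmul: split the weight matrix column-wise across
# "GPUs" (here, plain lists), compute partial products, then concatenate.

def matmul(a, b):
    # a: m x k, b: k x n  ->  m x n
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

x = [[1.0, 2.0]]                     # 1 x 2 activation
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]           # 2 x 4 weight matrix

# Shard W column-wise across two "GPUs".
w_gpu0 = [row[:2] for row in w]      # columns 0-1
w_gpu1 = [row[2:] for row in w]      # columns 2-3

# Each "GPU" computes its partial output independently...
y0 = matmul(x, w_gpu0)
y1 = matmul(x, w_gpu1)

# ...then an all-gather concatenates the shards into the full output.
y = [y0[0] + y1[0]]
assert y == matmul(x, w)             # matches the unsharded computation
print(y)                             # → [[11.0, 14.0, 17.0, 20.0]]
```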
For Kubernetes, NVIDIA publishes Helm charts for production deployment.
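A hedged sketch of a Helm-based install — the chart location and the values key are assumptions modeled on NVIDIA's published NIM charts, so consult the chart's own `values.yaml` and the NGC catalog for the real names:

```shell
# Fetch the NIM LLM chart from NGC (chart URL/version are assumptions;
# check NVIDIA's NGC catalog for the current ones).
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version>.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY

# Install with the NGC key supplied as a value (the key name is an assumption).
helm install my-nim nim-llm-<version>.tgz \
  --set model.ngcAPIKey=$NGC_API_KEY
```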
NIM runs on any infrastructure with NVIDIA GPUs, whether AWS, Azure, GCP, or an on-prem cluster.
You can also deploy your own fine-tuned models via NIM, which supports serving custom checkpoints and LoRA adapters alongside the base model.
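A hedged sketch of serving LoRA adapters — the `NIM_PEFT_SOURCE` variable is taken from NVIDIA's NIM LoRA documentation, but verify the name and expected directory layout against the docs for your container:

```shell
# Mount a directory of LoRA adapters and point NIM at it
# (env var name per NVIDIA's NIM docs; verify for your image).
docker run -d --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  -v /path/to/loras:/opt/nim/loras \
  -p 8000:8000 \
  nvcr.io/nvidia/nim/meta-llama3-70b-instruct:latest
```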
| Aspect | NIM | Manual (TRT-LLM + Triton) |
|---|---|---|
| Setup time | Minutes | Days |
| Optimization | Automatic | Manual tuning |
| API compatibility | OpenAI-compatible | Custom |
| Flexibility | Standard configs | Full control |
| Updates | Container pull | Manual rebuild |
| Best for | Production teams | ML infrastructure teams |