Large language models have moved from research curiosity to production infrastructure. Training and deploying them at scale requires more than understanding the transformer architecture — it demands a systems-level grasp of distributed computation, memory management, and tooling. NVIDIA has built a vertically integrated stack for this purpose: from GPU communication primitives (NCCL) through training frameworks (Megatron-Core) to orchestration layers (NeMo 2.0). This article walks through that stack, layer by layer, connecting foundational concepts to the engineering decisions that make billion-parameter models practical.
Every LLM is built on the transformer, but not every transformer is the same. Three architectural variants dominate, each suited to different workloads:
| Architecture | Structure | Strengths | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional self-attention, masked language modeling | Classification, embeddings, semantic search | BERT, RoBERTa |
| Decoder-only | Causal (masked) self-attention, autoregressive generation | Open-ended text generation, in-context learning | GPT-4, LLaMA |
| Encoder-decoder | Encoder processes input; decoder generates output via cross-attention | Translation, summarization, structured seq2seq tasks | T5, BART |
Decoder-only models dominate current LLM development due to their simpler training objective (next-token prediction) and strong scaling behavior. Encoder-decoder models remain relevant for tasks requiring tight input-output alignment, such as machine translation and retrieval-augmented generation (RAG).
The scaled dot-product attention formula sits at the heart of every transformer:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
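The formula can be sanity-checked in a few lines of NumPy (a minimal, framework-free sketch; helper names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (seq_q, seq_k)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))  # 6 key/value positions
V = rng.normal(size=(6, 8))
out, weights = attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, weighted by query-key similarity.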
Multi-head attention runs h parallel attention heads, each learning different relationship patterns. Recent attention variants, such as multi-query attention (MQA) and grouped-query attention (GQA), share key/value projections across heads, trading some expressiveness for a far smaller KV cache.
These variants matter enormously at inference time, where KV cache memory is often the binding constraint.
Before a transformer sees text, it must be converted to numerical representations: a tokenizer (BPE, WordPiece, or SentencePiece) splits raw text into subword IDs, and an embedding layer maps each ID to a dense vector the model can process.
LLM inference has two distinct phases with fundamentally different performance profiles:
Prefill phase: The entire input prompt is processed in parallel. This phase is compute-bound — GPU arithmetic units are the bottleneck. It produces the initial KV cache.
Decode phase: Tokens are generated one at a time, each requiring a read of the full KV cache. This phase is memory-bandwidth-bound. For a model like LLaMA 2 70B (80 layers, grouped-query attention with 8 KV heads) at a 4096-token sequence, the fp16 KV cache alone consumes ~1.34 GB per sequence.
The KV cache formula, where num_kv_heads equals the number of attention heads for standard MHA and is smaller under GQA/MQA:
KV Cache = 2 x num_layers x num_kv_heads x seq_len x head_dim x bytes_per_element x batch_size
Without caching, every new token would require recomputing attention over the entire sequence — O(n^3) complexity. The KV cache reduces this to O(n^2), but at the cost of memory that grows linearly with sequence length and batch size.
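The formula is easy to verify in code. This sketch plugs in an assumed LLaMA 2 70B configuration (80 layers, GQA with 8 KV heads, head_dim 128, fp16 cache) and reproduces the ~1.34 GB figure quoted above:

```python
def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim,
                   bytes_per_element=2, batch_size=1):
    # Factor of 2: both K and V are cached for every layer.
    return (2 * num_layers * num_kv_heads * seq_len * head_dim
            * bytes_per_element * batch_size)

# Assumed LLaMA 2 70B config: 80 layers, 8 KV heads (GQA), head_dim 128.
size_gb = kv_cache_bytes(num_layers=80, num_kv_heads=8,
                         seq_len=4096, head_dim=128) / 1e9
print(f"{size_gb:.2f} GB per sequence")  # 1.34 GB per sequence
```

Doubling the batch size or sequence length doubles this figure, which is why cache management dominates serving economics.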
Six techniques form the modern inference optimization toolkit:
Continuous Batching — New requests enter the batch as completed ones exit, eliminating idle GPU cycles from waiting for the longest sequence to finish.
Paged Attention (vLLM) — Manages KV cache like virtual memory: non-contiguous physical blocks mapped through a block table. Eliminates memory fragmentation and enables memory sharing across beam search candidates.
Speculative Decoding — A small draft model proposes multiple tokens; the large model verifies them in a single forward pass. Accepted tokens come nearly for free: several tokens are produced for the cost of one large-model forward pass plus a few cheap draft passes.
KV Cache Quantization — Compressing cache entries from fp16 to int8 or int4 doubles or quadruples the number of concurrent sequences, independent of model weight quantization.
GQA/MQA — Architectural choices made at training time that pay dividends at inference by shrinking the KV cache by 8-64x.
Flash Attention — Restructures the attention computation to process data in blocks, reducing memory usage from O(n^2) to O(n) by avoiding materialization of the full attention matrix.
The key insight: most production inference bottlenecks are memory and scheduling problems, not raw compute problems.
How a model selects the next token from its probability distribution significantly affects output quality. Here is an overview of core sampling techniques:
| Strategy | Mechanism | Trade-off |
|---|---|---|
| Greedy | Always pick the highest-probability token | Deterministic but repetitive |
| Temperature | Scale logits before softmax (T<1 = sharper, T>1 = flatter) | Controls randomness |
| Top-k | Sample from the k most probable tokens | Fixed candidate set |
| Top-p (Nucleus) | Sample from the smallest set whose cumulative probability >= p | Adaptive candidate set |
| Beam Search | Maintain multiple candidate sequences | Better for structured outputs |
| Speculative Decoding | Draft-then-verify with two models | Faster wall-clock time |
Advanced techniques include repetition penalties, min-p sampling, contrastive decoding (comparing expert vs. amateur model outputs), and typical sampling (information-theoretic approach filtering tokens by expected information content).
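The core strategies from the table compose naturally. Here is a sketch (our own helper, not a library API) of temperature scaling followed by optional top-k and top-p filtering over a logit vector:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Apply temperature, then optional top-k / top-p filtering, then sample."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:
        # Zero out everything outside the k most probable tokens.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative mass >= p.
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return int(rng.choice(len(probs), p=probs))
```

With temperature near zero the distribution collapses onto the argmax, recovering greedy decoding; with top_k=1 the candidate set is forced to a single token.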
Prompting strategies can dramatically improve LLM reasoning without changing model weights:
Chain of Thought (CoT): “Let’s think step by step” — prompts the model to show intermediate reasoning. Yields 20-50% accuracy gains on complex problems. Works in both zero-shot and few-shot settings.
ReAct (Reason + Act): Alternates between reasoning (Thought), tool use (Action), and feedback (Observation). Essential for tasks requiring external information retrieval or computation.
Tree of Thoughts (ToT): Explores multiple reasoning paths via BFS/DFS/beam search, evaluating each branch. Stronger than linear CoT for problems with dead ends or creative solutions.
Graph of Thoughts (GoT): Extends ToT by allowing thoughts to form arbitrary graphs — merging, referencing, and synthesizing across branches.
LLM Compiler: Plans tool calls as a directed acyclic graph (DAG) upfront and executes them in parallel, rather than sequentially as in ReAct.
Language Agent Tree Search (LATS): Combines Monte Carlo Tree Search with LLM agents, learning from both successes and failures through strategic exploration.
These frameworks represent a shift toward using more compute at inference time (test-time compute) to extract better answers from existing models.
Training data quality directly determines model quality. NeMo Curator is NVIDIA’s GPU-accelerated pipeline for preparing datasets at petabyte scale. Its processing stages include document download and text extraction, language identification, heuristic and classifier-based quality filtering, exact and fuzzy deduplication, and PII redaction.
NeMo Curator also supports multimodal pipelines: image aesthetic filtering, video scene detection, and audio transcription quality assessment. Output is clean, shuffled JSONL or Parquet ready for tokenization.
Models with hundreds of billions of parameters cannot fit on a single GPU. Seven parallelism strategies address this, each distributing a different dimension of the computation:
Data Parallelism (DP) — Each GPU holds a full model copy and processes a different data batch. Gradients are synchronized via all-reduce. Simple and effective, but the model must fit in a single GPU’s memory.
Tensor Parallelism (TP) — Individual layers are split across GPUs. For a linear layer with weight matrix W, column-parallel splitting sends different output dimensions to different GPUs. Requires high-bandwidth interconnect (NVLink) since communication happens within every layer.
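Column-parallel splitting is easy to simulate on one machine. This toy NumPy sketch shards a weight matrix across two pretend GPUs and checks that gathering the partial outputs reproduces the full matmul:

```python
import numpy as np

# Column-parallel linear layer across two simulated GPUs: each device
# holds half of W's output columns and computes its slice independently.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))          # batch of activations (replicated)
W = rng.normal(size=(16, 32))         # full weight matrix

W0, W1 = W[:, :16], W[:, 16:]         # shard columns: "GPU 0" / "GPU 1"
y0, y1 = x @ W0, x @ W1               # no communication needed for the matmul
y = np.concatenate([y0, y1], axis=1)  # all-gather the output shards

assert np.allclose(y, x @ W)          # matches the unsharded computation
```

The communication cost shows up in the gather step (and in the all-reduce required by the subsequent row-parallel layer), which is why NVLink bandwidth is critical.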
Pipeline Parallelism (PP) — The model is partitioned into sequential stages assigned to different GPUs. Micro-batching (GPipe) and interleaved 1F1B scheduling reduce pipeline bubble overhead. Best for cross-node communication where bandwidth is limited.
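The bubble overhead that micro-batching amortizes follows a simple formula for a GPipe-style schedule with p stages and m micro-batches:

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """GPipe-style pipeline bubble fraction: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# With 4 stages and only 4 micro-batches, ~43% of step time is bubble;
# raising micro-batches to 32 shrinks it to under 9%.
print(pipeline_bubble_fraction(4, 4))   # ~0.43
print(pipeline_bubble_fraction(4, 32))  # ~0.086
```

This is why pipeline parallelism needs many micro-batches to stay efficient, and why interleaved schedules further cut the effective stage count.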
Sequence Parallelism (SP) — The sequence dimension is split across GPUs for the operations that tensor parallelism leaves replicated (LayerNorm, dropout), reducing activation memory. In Megatron it is enabled alongside tensor parallelism.
Context Parallelism (CP) — Splits the sequence dimension across GPUs for very long contexts (32K+ tokens). Each GPU computes local attention for its chunk, with ring-based communication for cross-chunk KV dependencies. Unlike SP, context parallelism is specifically optimized for long-sequence attention computation, often combined with Flash Attention. It enables training and inference on contexts beyond single-GPU memory limits.
Expert Parallelism (EP) — For Mixture-of-Experts (MoE) models, different experts reside on different GPUs. All-to-all communication routes tokens to the appropriate expert.
ZeRO — The Zero Redundancy Optimizer eliminates memory redundancy across data-parallel ranks: Stage 1 shards optimizer states, Stage 2 additionally shards gradients, and Stage 3 additionally shards the parameters themselves.
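Using the standard mixed-precision Adam accounting from the ZeRO paper (2 bytes of fp16 weights + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer state per parameter), the per-GPU savings of each stage can be estimated:

```python
def zero_memory_per_gpu_gb(num_params, dp_size, stage):
    """Per-GPU training memory (GB) under ZeRO, per the ZeRO paper's
    mixed-precision Adam accounting: 16 bytes per parameter total."""
    P, N = num_params, dp_size
    if stage == 0:    # plain data parallelism: everything replicated
        per_param = 2 + 2 + 12
    elif stage == 1:  # shard optimizer states
        per_param = 2 + 2 + 12 / N
    elif stage == 2:  # also shard gradients
        per_param = 2 + 2 / N + 12 / N
    else:             # stage 3: also shard the parameters themselves
        per_param = (2 + 2 + 12) / N
    return P * per_param / 1e9

# Illustrative numbers: a 7B-parameter model across 8 data-parallel GPUs.
print(zero_memory_per_gpu_gb(7e9, 8, 0))  # 112.0 GB -- won't fit on one GPU
print(zero_memory_per_gpu_gb(7e9, 8, 3))  # 14.0 GB
```

Activations and temporary buffers come on top of these figures, but the 8x reduction from Stage 3 is what makes otherwise impossible configurations fit.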
Production training combines DP x TP x PP: tensor parallelism within nodes (over NVLink), pipeline parallelism across nodes (over InfiniBand), and data parallelism across replica groups. This is how models like LLaMA 70B are actually trained.
All of this depends on NCCL (NVIDIA Collective Communications Library), which provides optimized primitives — AllReduce, AllGather, ReduceScatter, Broadcast — over NVLink, PCIe, InfiniBand, and TCP. PyTorch’s distributed.init_process_group("nccl") initializes this layer, while torchrun handles rank assignment and rendezvous coordination. For a concrete NCCL all-reduce benchmark setup and mpirun command breakdown, see the performance benchmark with NCCL.
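To build intuition for what NCCL’s AllReduce does under the hood, here is a toy, single-process simulation of the classic ring algorithm (a reduce-scatter pass followed by an all-gather pass). It is illustrative only and shares no code with NCCL:

```python
import numpy as np

def ring_allreduce(chunks):
    """Simulate ring all-reduce. chunks[r][c] is rank r's copy of chunk c;
    with n ranks the tensor is pre-split into n chunks."""
    n = len(chunks)
    # Phase 1: reduce-scatter. Each step, every rank forwards one chunk to
    # its ring neighbor, which accumulates it. Messages are snapshotted
    # first to model the simultaneous sends of a real ring.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                for r in range(n)]
        for r, c, data in msgs:
            chunks[(r + 1) % n][c] += data
    # After n-1 steps, rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2: all-gather. Each rank circulates its reduced chunk around
    # the ring so every rank ends with the complete reduced tensor.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                for r in range(n)]
        for r, c, data in msgs:
            chunks[(r + 1) % n][c] = data
    return chunks

n = 4
rng = np.random.default_rng(7)
start = [[rng.normal(size=3) for _ in range(n)] for _ in range(n)]
expected = [sum(start[r][c] for r in range(n)) for c in range(n)]
result = ring_allreduce([[c.copy() for c in rank] for rank in start])
```

The key property of the ring algorithm is that each rank sends and receives only 2(n-1)/n of the tensor size regardless of rank count, which is why it scales to large clusters.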
Three components form the core of NVIDIA’s training infrastructure:
The low-level, high-performance training library extracted from Megatron-LM. It provides the core parallelism implementations (tensor, pipeline, sequence, context, and expert parallelism), GPU-optimized transformer building blocks with fused kernels, activation recomputation, and distributed checkpointing.
Megatron-Core is what you use when training a 70B model on 64 GPUs with TP=8, PP=4, and sequence parallelism enabled.
NVIDIA’s end-to-end framework for developing and deploying models across NLP, speech, audio, and vision. It wraps Megatron-Core with higher-level, Pythonic configuration APIs, data and model modules, fine-tuning and alignment workflows, and export paths for deployment.
NeMo also provides Megatron recipes — pre-configured training setups for common model sizes (1.3B, 7B, 13B, 70B, 175B). These recipes encode tested parallelism strategies, batch sizes, and hyperparameters, providing a proven starting point rather than trial-and-error configuration.
The standard NeMo workflow: prepare data (tokenize into Megatron binary format) -> train (NeMo configs + Megatron-Core backend) -> evaluate (BLEU, accuracy, perplexity) -> deploy (export or serve via Triton).
The orchestration and compatibility layer that connects NVIDIA’s ecosystem to the broader PyTorch world:
torchrun is PyTorch’s distributed launcher — it sets RANK, WORLD_SIZE, and LOCAL_RANK environment variables, initializes DDP, and coordinates multi-node training. NeMo-Run sits on top, providing config-driven orchestration that supports local execution, Slurm clusters, and Kubernetes environments. It replaces the older NeMo Framework Launcher.
Before PEFT or alignment, models typically undergo Supervised Fine-Tuning on task-specific demonstration data. SFT transforms a base pretrained model into an instruction-following model by training on high-quality instruction-response pairs. Training typically runs for 1-3 epochs to avoid overfitting. SFT is the behavioral foundation for subsequent alignment (RLHF/DPO), which refines the model’s behavior using human preference signals.
Pretrained Model → SFT (demonstrations) → Alignment (preferences) → Production
NeMo supports SFT with the same distributed training features as pretraining. See NeMo fine-tuning for configuration.
Full fine-tuning of a large model is expensive. PEFT methods update only a small fraction of parameters: LoRA injects trainable low-rank matrices into existing weight projections, QLoRA combines LoRA with a 4-bit quantized frozen base model, and prompt-based methods (P-tuning, adapters) add small trainable modules while keeping the backbone frozen.
NeMo 2.0 exposes these via config classes (LoraPEFTConfig, QLoraPEFTConfig, etc.), making it straightforward to swap methods experimentally. See the NeMo-Run fine-tuning script for a working example.
Knowledge distillation trains a smaller student model to mimic a larger teacher. The standard loss combines ground-truth supervision with soft-label matching, with both distributions typically softened by a temperature T:
Loss = alpha * CE(student_output, true_labels) + beta * KL(teacher_probs || student_probs)
The teacher runs in eval mode with frozen weights, generating soft probability distributions. The student learns from both the hard labels and the teacher’s “dark knowledge” — the relative probabilities across all tokens, not just the correct one.
Advanced variants include hidden-state matching (student reproduces intermediate layer representations) and attention distillation (student mimics the teacher’s attention patterns).
In the NVIDIA stack, this workflow runs on Megatron-Core with distributed training: the teacher generates logits, the student computes combined CE + KL loss, and backpropagation updates only the student’s weights. See the distillation implementation for reference code.
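Numerically, the combined loss for a single token position might look like this sketch (NumPy, helper names ours; the T² factor on the KL term follows the common Hinton-style convention):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label,
                      alpha=0.5, beta=0.5, T=2.0):
    """Combined CE + KL distillation loss for one token position.
    Temperature T softens both distributions; T**2 rescales the KL
    term so its gradient magnitude matches the hard-label CE term."""
    p_student = softmax(student_logits)
    ce = -np.log(p_student[true_label])                # hard-label loss
    ps_T = softmax(student_logits, T)
    pt_T = softmax(teacher_logits, T)
    kl = np.sum(pt_T * (np.log(pt_T) - np.log(ps_T)))  # KL(teacher || student)
    return alpha * ce + beta * (T ** 2) * kl
```

When student and teacher logits coincide, the KL term vanishes and only the ground-truth CE remains.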
Moving models between training configurations and ecosystems requires checkpoint translation: importing HuggingFace checkpoints into Megatron format, exporting trained Megatron checkpoints back to HuggingFace, and resharding checkpoints across different TP/PP configurations.
Megatron Bridge handles all three operations, making it possible to train with Megatron-Core’s performance optimizations and then deploy through HuggingFace’s ecosystem. See the bridge script for a conversion example.
After supervised fine-tuning, production models require alignment to ensure helpful, harmless, and honest behavior. NeMo Aligner provides GPU-accelerated implementations of the core alignment methods.
The classic three-stage pipeline: (1) supervised fine-tuning on demonstrations, (2) reward model training on human preference pairs, and (3) reinforcement learning (PPO) against the learned reward model.
The reward model learns from preference pairs using the Bradley-Terry model. The RL phase uses the reward signal with a KL divergence penalty to prevent drift:
L_PPO = -E[reward(x, y)] + β · KL(π_θ || π_SFT)
RLHF is powerful but expensive — four models must coexist in memory (policy, reference, reward, value).
DPO simplifies RLHF by eliminating the reward model entirely. It directly optimizes the policy on preference pairs:
L_DPO = -E[log σ(β · (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))]
DPO is simpler to implement, more stable, and requires only two models in memory (policy + frozen reference). It often achieves comparable alignment quality to RLHF with significantly less computational overhead.
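The loss is straightforward to compute from four sequence log-probabilities; a minimal sketch (variable names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w_policy, logp_w_ref, logp_l_policy, logp_l_ref, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities
    of the chosen (w) and rejected (l) responses under the policy and the
    frozen reference model."""
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    return -np.log(sigmoid(margin))

# If the policy already prefers the chosen response more than the
# reference does, the margin is positive and the loss is small.
print(dpo_loss(-10.0, -12.0, -15.0, -13.0))
```

At zero margin the loss equals log 2; it decreases monotonically as the policy’s preference for the chosen response grows relative to the reference.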
While alignment shapes behavior during training, NeMo Guardrails enforces safety policies at inference time: input rails (e.g., jailbreak detection), dialog rails (steering conversation flow), output rails (content moderation), and retrieval rails for RAG pipelines.
Guardrails uses Colang, a modeling language for conversational flows, enabling programmable safety without retraining. NeMo Aligner supports distributed training with the same TP/PP/DP strategies as pretraining, making it practical to align models at any scale.
Quantization reduces model precision to lower memory footprint and increase throughput. A 70B model in FP16 requires ~140 GB; INT4 brings that to ~35 GB.
| Method | When Applied | Retraining? | Accuracy | NVIDIA Support |
|---|---|---|---|---|
| PTQ | Post-training | No (calibration only) | Good | TensorRT-LLM, modelopt |
| QAT | During training | Yes | Best | Megatron-Core |
| SmoothQuant | Post-training | No | Very Good | TensorRT-LLM |
| AWQ | Post-training | No | Excellent | TensorRT-LLM |
| GPTQ | Post-training | No | Excellent | Community + TRT-LLM |
PTQ (Post-Training Quantization): Calibrate on a small dataset, convert weights and activations. Fast but may degrade accuracy at very low precision.
SmoothQuant: Migrates quantization difficulty from activations (which have outliers) to weights (which are smooth) via per-channel scaling. Enables accurate W8A8 inference.
AWQ: Protects the most important weight channels (identified by activation magnitudes) from aggressive quantization. Excellent INT4 accuracy without retraining.
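The mechanics of the simplest PTQ variant, symmetric per-tensor int8, fit in a few lines (a toy sketch; production flows like modelopt use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 PTQ: scale by max |w|, round, clip."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
# Worst-case rounding error is half a quantization step.
assert err <= s / 2 + 1e-8
```

Outlier values stretch the scale and waste quantization levels on the bulk of the distribution, which is exactly the problem SmoothQuant and AWQ attack.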
H100 GPUs include native FP8 Tensor Cores. The Transformer Engine automatically manages FP8 training with dynamic loss scaling and amax tracking. Result: ~2x training throughput over BF16 with minimal accuracy loss. Megatron-Core and NeMo 2.0 integrate Transformer Engine, enabling FP8 with a single config flag.
TensorRT-LLM applies quantization during engine compilation. NVIDIA’s modelopt toolkit provides a unified API for PTQ, QAT, and sparsity, with direct export to TensorRT-LLM format.
Production models require systematic evaluation across multiple dimensions:
Perplexity measures language modeling quality:
PPL = exp(-1/N Σ log P(x_i | x_{<i}))
Lower perplexity = better prediction. Standard benchmarks: WikiText-2, The Pile validation.
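A quick sanity check of the formula (natural-log probabilities assumed):

```python
import numpy as np

def perplexity(token_logprobs):
    """PPL = exp(-(1/N) * sum log P(x_i | x_<i)) over natural-log probs."""
    lp = np.asarray(token_logprobs, dtype=np.float64)
    return float(np.exp(-lp.mean()))

# A model assigning probability 0.25 to every token has PPL 4: it is,
# on average, as uncertain as a uniform choice among 4 tokens.
print(perplexity(np.log([0.25, 0.25, 0.25, 0.25])))  # ~4.0
```

Perplexity is the exponentiated average negative log-likelihood, so it is sensitive to the tokenizer: models with different vocabularies are not directly comparable on raw PPL.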
Task-Specific: BLEU/ROUGE for summarization and translation, Exact Match and F1 for QA, Pass@k for code generation (HumanEval).
| Benchmark | Measures | Format |
|---|---|---|
| MMLU | World knowledge (57 subjects) | Multiple choice |
| HellaSwag | Commonsense reasoning | Completion selection |
| HumanEval | Code generation correctness | Executable tests |
| GSM8K | Math reasoning | Free-form numerical |
| TruthfulQA | Factual accuracy | QA pairs |
| ToxiGen | Safety and bias | Classification |
| MT-Bench | Multi-turn conversation quality | LLM-as-judge |
EleutherAI lm-evaluation-harness is the standard open-source framework with 200+ pre-implemented tasks and support for zero-shot and few-shot evaluation.
Track metrics at every stage: pretraining (perplexity, MMLU), SFT (instruction-following), alignment (TruthfulQA, ToxiGen — watch for alignment tax), and quantization (<1% degradation on critical metrics). See evaluation details for comprehensive guidance.
While the inference optimization techniques covered earlier improve the computation itself, production deployment requires a full serving stack.
TensorRT-LLM compiles model checkpoints into optimized inference engines:
NeMo/HuggingFace Checkpoint → TRT-LLM Builder → Optimized Engine (.engine)
Typical speedup: 2-5x over native PyTorch inference with multi-GPU support (TP, PP).
Triton provides production-grade model serving: dynamic batching, concurrent execution of multiple models, backends for TensorRT, PyTorch, and ONNX, model ensembles, and metrics for monitoring.
NVIDIA NIM packages the entire inference stack into pre-built, containerized microservices:
docker run --gpus all nvcr.io/nvidia/nim/model:latest
NIM abstracts TensorRT-LLM builds, Triton configuration, and optimization tuning. It integrates with NeMo Guardrails for runtime safety and supports embedding and reranking NIMs for RAG pipelines.
RAG extends LLMs with external knowledge, addressing hallucination and knowledge cutoff:
User Query → Embed → Retrieve top-k from Vector DB → Augment Prompt → Generate Answer
Document Chunking: Split documents into semantically coherent chunks (typically 256-512 tokens with 10-20% overlap). Strategies range from fixed-size splitting to semantic boundary detection.
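Fixed-size chunking with overlap, the baseline strategy, can be sketched as follows (token-level; helper name ours):

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Fixed-size chunking: each chunk shares `overlap` tokens with its
    predecessor so context is not cut cleanly at chunk boundaries."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

doc = list(range(1200))  # stand-in for a tokenized document
chunks = chunk_tokens(doc, chunk_size=512, overlap=64)
# Consecutive chunks overlap by 64 tokens (~12.5% of 512).
```

Semantic chunkers replace the fixed `step` with boundaries detected from sentence or section structure, but the overlap principle carries over.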
Embedding Models: Convert text to dense vectors for similarity search. See vector database embeddings for model choices including BGE, E5, and Cohere embed-v3.
Vector Databases: Store and index embeddings for fast approximate nearest neighbor (ANN) search. Options include Milvus (GPU-accelerated), Weaviate, Pinecone, FAISS, and pgvector.
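What an ANN index approximates is plain cosine top-k; a brute-force NumPy version makes the semantics concrete:

```python
import numpy as np

def top_k_cosine(query, corpus, k=3):
    """Brute-force dense retrieval: cosine similarity of a query vector
    against every corpus embedding (the exact result an ANN index
    approximates at far lower cost)."""
    q = query / np.linalg.norm(query)
    C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = C @ q
    idx = np.argsort(sims)[::-1][:k]
    return idx, sims[idx]

rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 64))                # 1000 doc embeddings
query = corpus[42] + 0.05 * rng.normal(size=64)     # query near document 42
idx, scores = top_k_cosine(query, corpus)
# Document 42 should rank first.
```

At millions of vectors this O(N·d) scan becomes the bottleneck, which is what HNSW- or IVF-style indexes in Milvus, FAISS, and pgvector avoid.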
Retrieval + Re-ranking: Initial top-k retrieval via ANN search, refined with cross-encoder re-ranking for more accurate relevance scoring. Cross-encoders (like bge-reranker-large, Cohere Rerank) jointly encode query-document pairs to produce precise relevance scores, achieving NDCG@10 of 0.80-0.90+. Hybrid search (dense + BM25 sparse) catches both semantic matches and exact keyword hits.
Combine embedding NIM + reranking NIM + LLM NIM for end-to-end RAG serving, with NeMo Guardrails for safety and Milvus for GPU-accelerated vector search.
The NVIDIA generative AI stack is optimized for specific GPU hardware. Understanding the hardware hierarchy is essential for choosing parallelism strategies.
Hopper (H100): 4th-gen Tensor Cores with native FP8 support via Transformer Engine. 80 GB HBM3, 3.35 TB/s memory bandwidth, 900 GB/s NVLink. The current workhorse for LLM training.
Blackwell (B200): 5th-gen Tensor Cores with FP4 support. 192 GB HBM3e, 8 TB/s bandwidth, 1,800 GB/s NVLink. ~2x throughput improvement over Hopper.
| Interconnect | Bandwidth | Latency | Use Case |
|---|---|---|---|
| NVLink | 900 GB/s (H100) | ~1 μs | Tensor Parallelism within node |
| NVSwitch | Full mesh at NVLink speed | ~1 μs | All-to-all GPU connectivity in DGX |
| InfiniBand NDR | 400 Gbps (~50 GB/s) | ~5 μs | Pipeline/Data Parallelism across nodes |
DGX H100: 8x H100 GPUs connected via NVSwitch for full-bandwidth mesh communication. The building block for SuperPOD clusters (32+ DGX nodes).
The bandwidth hierarchy dictates optimal parallelism mapping:
This is why 3D parallelism configurations use TP=8 within DGX nodes and PP across nodes — matching communication patterns to hardware topology. See distributed training for strategy details and NCCL for communication primitives.
Training and deploying LLMs at scale requires understanding where compute is being spent and where bottlenecks exist. NVIDIA Nsight is a suite of profiling and debugging tools designed for GPU workloads, providing visibility into everything from system-level communication patterns to individual CUDA kernel performance.
NVIDIA Nsight comprises four main tools, each targeting different layers of the stack:
| Tool | Purpose | Key Insights |
|---|---|---|
| Nsight Systems | System-level performance tracing | CPU-GPU interaction, communication overlap, idle time |
| Nsight Compute | CUDA kernel profiling | Tensor core utilization, memory bandwidth, occupancy |
| Nsight Graphics | Graphics debugging | Rendering pipelines, shader performance |
| Nsight Eclipse/VS | CUDA debugging in IDE | Breakpoints, memory inspection, runtime errors |
Nsight Systems traces the timeline of CPU and GPU activity, answering critical questions for distributed training: is communication overlapping with computation, and where do GPUs sit idle? It visualizes CUDA kernel launches, memory copies, NCCL collectives, and CPU threads in a unified timeline:
nsys profile -o training_profile python train.py
The resulting timeline shows whether your parallelism strategy is CPU-bound (too much launch overhead), communication-bound (NCCL dominates), or compute-bound (GPUs saturated with work). In well-tuned distributed training, you should see NCCL all-reduce operations overlapping with backward pass computation.
Nsight Compute drills into individual CUDA kernels, providing hardware-level metrics:
ncu --set full -o kernel_profile python train.py
Example output for an attention kernel:
Kernel: flash_attention_fwd
SM Occupancy: 72%
Tensor Core Utilization: 90%
Memory Bandwidth: 63% of peak
DRAM Throughput: 2.1 TB/s
Low tensor core utilization might indicate dimension alignment issues (use multiples of 8 for FP16, 16 for FP8). Low memory bandwidth with high occupancy suggests compute-bound workloads — exactly what you want for training. High memory bandwidth with low compute suggests memory-bound operations that might benefit from kernel fusion.
In multi-GPU setups with Megatron-Core or DeepSpeed, Nsight reveals:
Pipeline Parallelism Stalls: Nsight Systems shows pipeline bubbles — idle time when GPUs wait for the previous stage. Interleaved schedules (1F1B) should minimize these bubbles.
NCCL Communication Overhead: All-reduce, all-gather, and reduce-scatter operations should overlap with computation. If NCCL dominates the timeline, consider increasing gradient accumulation steps or adjusting TP/PP ratios.
Tensor Core Utilization Across GPUs: Nsight Compute can profile specific ranks. Uneven utilization suggests load imbalance — possibly from uneven layer distribution in pipeline parallelism.
Kernel Fusion Efficiency: Check if attention kernels use fused implementations (Flash Attention). Unfused attention shows as separate QK^T, softmax, and softmax*V kernels — a sign you’re leaving performance on the table.
NeMo 2.0 and Megatron-Core are instrumented for profiling. PyTorch’s profiler can trigger Nsight traces, and NeMo-Run can launch profiling jobs automatically:
nsys_profile:
  enabled: true
  start_step: 10
  end_step: 20
  ranks: [0]
This captures a representative training window without the overhead of profiling the entire run. Analyze the profile to identify kernel hotspots, GPU idle gaps, NCCL collectives that fail to overlap with computation, and data-loading stalls.
Performance optimization is iterative: profile, identify bottlenecks, adjust configuration (batch size, parallelism, sequence length), profile again. Nsight makes the invisible visible, transforming GPU utilization from guesswork into data-driven engineering.
Vision-Language Models (VLMs) extend LLMs to process images alongside text:
Image → Vision Encoder (ViT) → Projection Layer ─┐
↓
Text → Token Embedding ────────────────→ LLM Decoder → Output
Vision Encoder: Vision Transformer (ViT) processes image patches as tokens. CLIP encoders are commonly used for their pre-trained alignment between image and text representations.
Projection Layer: Maps vision embedding space to LLM token embedding space (linear layer or small MLP). Often the only component trained from scratch when combining pretrained vision and language models.
LLM Decoder: Standard autoregressive transformer processes interleaved vision and text tokens through its self-attention mechanism.
| Strategy | What’s Trained | Cost | Quality |
|---|---|---|---|
| Frozen encoders + train projection | Projection layer only | Low | Baseline |
| Frozen vision + LoRA on LLM | Projection + LLM adapters | Medium | Good |
| End-to-end fine-tuning | Everything | High | Best |
Resolution directly impacts compute: a 336×336 image produces 576 vision tokens vs. 256 for 224×224 — with attention cost scaling quadratically. Solutions include dynamic resolution, token pooling, and tiled processing.
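The token counts above follow directly from the patch grid (assuming 14x14 patches, as in CLIP ViT-L/14):

```python
def vision_tokens(image_size, patch_size=14):
    """Number of ViT patch tokens for a square image."""
    return (image_size // patch_size) ** 2

print(vision_tokens(336))  # 576
print(vision_tokens(224))  # 256
# Self-attention cost grows quadratically with token count, so the
# 2.25x increase in tokens costs roughly 5x in attention FLOPs:
print((vision_tokens(336) / vision_tokens(224)) ** 2)  # ~5.06
```

This quadratic blow-up is why token pooling and tiling are worth the engineering effort at high resolutions.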
NeMo 2.0 provides multimodal model definitions, distributed training for VLMs (TP/PP across vision and language components), and export to TensorRT for joint vision-text inference. Vision-Language NIMs enable production deployment of models like LLaVA and VILA.
The NVIDIA generative AI stack is best understood as a pipeline where each layer addresses a specific engineering challenge:
┌───────────────────────────────────────────────────────────┐
│ Hardware Layer │
│ H100 / Blackwell GPUs + NVLink / NVSwitch / InfiniBand │
└─────────────────────────┬─────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Data Curation (NeMo Curator) │
│ Ingestion → Filtering → Deduplication │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Tokenization + Binary Dataset Creation │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Pretraining │
│ Megatron-Core + NeMo 2.0 │
│ TP / PP / CP / DP / ZeRO │
│ FP8 (Transformer Engine) + NCCL │
│ torchrun / NeMo-Run orchestration │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Fine-Tuning │
│ SFT → PEFT (LoRA/QLoRA) │
│ Knowledge Distillation │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Alignment (NeMo Aligner) │
│ RLHF / DPO / SteerLM │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Evaluation │
│ MMLU, HumanEval, TruthfulQA, MT-Bench │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Quantization (modelopt) │
│ PTQ / QAT / AWQ / SmoothQuant / FP8 │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Checkpoint Translation │
│ Megatron Bridge (Megatron ↔ HuggingFace)│
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Inference Optimization │
│ TensorRT-LLM compilation │
│ Kernel fusion + KV cache optimization │
│ In-flight batching + Paged Attention │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Deployment │
│ Option A: Triton Inference Server │
│ Option B: NVIDIA NIM (containerized) │
│ + NeMo Guardrails (runtime safety) │
└─────────────────────────┬───────────────┘
↓
┌─────────────────────────────────────────┐
│ Applications │
│ Direct inference │ RAG pipelines │
│ Multimodal systems │ AI agents │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Cross-Cutting: Performance Profiling │
│ Nsight Systems (system timeline) │
│ Nsight Compute (kernel analysis) │
│ → Identify bottlenecks at every stage │
└─────────────────────────────────────────┘
Each component is modular — you can use NeMo Curator without NeMo 2.0, or TensorRT-LLM without Triton — but the stack is designed to work together, with NVIDIA GPU acceleration at every stage.
For engineers entering this space, the most productive path is to start with the fundamentals (attention, tokenization, embeddings), understand the inference memory profile (prefill vs. decode, KV cache sizing), and then work through the parallelism strategies that make large-scale training possible. From there, the post-training pipeline (SFT → alignment → evaluation → quantization) and the serving stack (TensorRT-LLM → Triton/NIM) become the path to production. The NVIDIA tooling becomes intuitive once you understand the problems it solves.