nvidia-generative-ai-notes

Checkpoint Translation

Checkpoint translation (shard/unshard checkpoints) refers to converting model checkpoints between different distributed training formats:

Sharding: Converting a single, consolidated model checkpoint into multiple shards (pieces) that can be distributed across multiple GPUs/devices. Each shard contains a portion of the model weights.

Example: A 7B parameter model checkpoint split into 8 shards for 8-GPU training with tensor parallelism.

Unsharding: The reverse process—combining multiple sharded checkpoints back into a single consolidated checkpoint.

Useful when you want to save a distributed model as a standard HuggingFace model or deploy it on a single device.

Format Translation: Megatron Bridge specifically handles converting between:

Megatron format: Sharded checkpoints optimized for distributed training with tensor/pipeline parallelism HuggingFace format: Standard consolidated checkpoint format used by the HF ecosystem

Why this matters:

Training at scale: When using tensor or pipeline parallelism, model weights are split across GPUs. You need sharded checkpoints for this.
Interoperability: Enables seamless conversion between Megatron and HuggingFace formats, allowing you to: Train with Megatron-Core (distributed, optimized) Deploy with HuggingFace (standard ecosystem) Resume training from different formats
Checkpointing: Efficiently save and load large models during distributed training without consolidating everything into a single file (which would be memory-intensive).

In essence, Megatron Bridge acts as a bridge that handles the complexity of translating checkpoints between distributed and consolidated formats, abstracting away the details from the user.