nvidia-generative-ai-notes

Checkpoint Translation

Checkpoint translation (sharding/unsharding checkpoints) refers to converting model checkpoints between different distributed-training formats:

  1. Sharding: Converting a single, consolidated model checkpoint into multiple shards (pieces) that can be distributed across multiple GPUs/devices. Each shard contains a portion of the model weights.

Example: A 7B parameter model checkpoint split into 8 shards for 8-GPU training with tensor parallelism.
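The column-slicing at the heart of tensor-parallel sharding can be sketched in a few lines. This is a minimal illustration, not any framework's real API: plain Python lists stand in for tensors, and `shard_checkpoint` is a hypothetical helper name.

```python
# Minimal sketch of column-wise sharding for tensor parallelism.
# A "checkpoint" here is just a dict of plain Python matrices (lists of
# rows); real frameworks slice torch tensors, but the logic is the same.

def shard_checkpoint(checkpoint, num_shards):
    """Split every weight matrix column-wise into num_shards pieces."""
    shards = [dict() for _ in range(num_shards)]
    for name, matrix in checkpoint.items():
        cols = len(matrix[0])
        assert cols % num_shards == 0, f"{name}: {cols} cols not divisible"
        step = cols // num_shards
        for rank in range(num_shards):
            # Each rank keeps its contiguous slice of columns.
            shards[rank][name] = [row[rank * step:(rank + 1) * step]
                                  for row in matrix]
    return shards

# A toy 2x4 weight split across 2 "GPUs":
ckpt = {"linear.weight": [[1, 2, 3, 4],
                          [5, 6, 7, 8]]}
shards = shard_checkpoint(ckpt, 2)
print(shards[0]["linear.weight"])  # [[1, 2], [5, 6]]
print(shards[1]["linear.weight"])  # [[3, 4], [7, 8]]
```

Real implementations also shard some weights row-wise and replicate others (e.g. layer norms), but the divide-along-a-dimension idea is the same.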

  2. Unsharding: The reverse process—combining multiple sharded checkpoints back into a single consolidated checkpoint.

This is useful when you want to save a distributed model as a standard HuggingFace checkpoint or deploy it on a single device.
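Unsharding is the inverse operation: concatenate each rank's slice back along the dimension it was split on. A minimal sketch under the same assumptions as before (plain Python lists as tensors, column-wise splits, hypothetical helper name):

```python
# Minimal sketch of unsharding: concatenating column shards back into
# one consolidated state dict. Plain Python lists stand in for tensors.

def unshard_checkpoint(shards):
    """Merge column-wise shards back into a single checkpoint."""
    merged = {}
    for name in shards[0]:
        num_rows = len(shards[0][name])
        merged[name] = [
            # Row r of the full matrix = row r of every shard, in rank order.
            [value for shard in shards for value in shard[name][row]]
            for row in range(num_rows)
        ]
    return merged

# Two shards of a 2x4 weight, e.g. from 2-way tensor parallelism:
shards = [
    {"linear.weight": [[1, 2], [5, 6]]},
    {"linear.weight": [[3, 4], [7, 8]]},
]
full = unshard_checkpoint(shards)
print(full["linear.weight"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```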

  3. Format Translation: Megatron Bridge specifically handles converting between:

Megatron format: Sharded checkpoints optimized for distributed training with tensor/pipeline parallelism.
HuggingFace format: The standard consolidated checkpoint format used by the HF ecosystem.
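Beyond resharding, format translation means mapping parameter names between the two naming conventions. The sketch below shows only that renaming idea; the key patterns are hypothetical and the real Megatron Bridge mapping is more involved (it also splits fused QKV weights, reorders attention heads, etc.).

```python
import re

# Hypothetical Megatron-style -> HF-style key patterns, for illustration
# only; not the actual mapping table used by Megatron Bridge.
MEGATRON_TO_HF = [
    (r"^decoder\.layers\.(\d+)\.self_attention\.linear_proj",
     r"model.layers.\1.self_attn.o_proj"),
    (r"^decoder\.layers\.(\d+)\.mlp\.linear_fc2",
     r"model.layers.\1.mlp.down_proj"),
]

def translate_key(megatron_key):
    """Rewrite a Megatron-style parameter name into HF style."""
    for pattern, replacement in MEGATRON_TO_HF:
        new_key, n = re.subn(pattern, replacement, megatron_key)
        if n:
            return new_key
    return megatron_key  # pass unmapped keys through unchanged

print(translate_key("decoder.layers.3.self_attention.linear_proj.weight"))
# model.layers.3.self_attn.o_proj.weight
```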

Why this matters:

In essence, Megatron Bridge handles the complexity of translating checkpoints between distributed and consolidated formats, abstracting those details away from the user.