nvidia-generative-ai-notes

LLM Architecture

Introduction to Encoder-Decoder Structures in Large Language Models

Encoder-decoder architectures form a foundational pillar in the evolution of large language models (LLMs), particularly for handling sequence-to-sequence (seq2seq) tasks where inputs and outputs are variable-length sequences. Originating from the Transformer model introduced in 2017, this structure consists of two main components: an encoder that processes the input sequence into a rich, contextual representation, and a decoder that generates the output sequence based on that representation.

In the context of LLMs, which are typically built on Transformer variants, encoder-decoder models excel in scenarios requiring deep understanding of input context to produce structured outputs, unlike simpler architectures that might focus solely on generation or embedding. The encoder uses self-attention mechanisms to create bidirectional embeddings, capturing relationships across the entire input. The decoder, employing masked self-attention (to prevent peeking at future tokens) and cross-attention (to attend to the encoder's outputs), generates tokens autoregressively—one at a time, conditioned on previous outputs and the encoded input.

This design addresses limitations of earlier recurrent neural networks (RNNs) such as LSTMs by enabling parallel processing and better handling of long-range dependencies.
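The three attention patterns above can be sketched with plain NumPy. This is a minimal illustration, not a full Transformer: no multi-head projections, layer norm, or feed-forward blocks, and the embeddings are random stand-ins. It shows how the same scaled dot-product attention gives (1) bidirectional encoder self-attention, (2) causally masked decoder self-attention, and (3) cross-attention from decoder positions over encoder outputs.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; masked positions are suppressed before softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
src_len, tgt_len = 5, 3                      # variable-length input and output
src = rng.normal(size=(src_len, d_model))    # stand-in encoder input embeddings
tgt = rng.normal(size=(tgt_len, d_model))    # stand-in decoder input embeddings

# 1. Encoder self-attention: bidirectional, every token attends to every token.
enc_out, _ = attention(src, src, src)

# 2. Decoder masked self-attention: a causal mask blocks future positions.
causal = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
dec_hidden, w = attention(tgt, tgt, tgt, mask=causal)
# The first target token can only attend to itself: all weight on position 0.
assert np.isclose(w[0, 0], 1.0) and np.allclose(w[0, 1:], 0.0)

# 3. Cross-attention: decoder queries attend over the encoder's outputs.
dec_out, _ = attention(dec_hidden, enc_out, enc_out)
print(dec_out.shape)  # (3, 8): one contextual vector per target position
```

Note how cross-attention is the only place the two stacks meet: the decoder supplies the queries, while keys and values come from the encoder's output.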

Comparison with Other LLM Architectures

To fully appreciate encoder-decoder structures, it’s useful to contrast them with encoder-only and decoder-only architectures, which dominate modern LLMs. Encoder-only models (e.g., BERT) focus on input understanding for tasks like classification, while decoder-only models (e.g., GPT series) emphasize generative capabilities through next-token prediction. Encoder-decoder models combine both for input-output mapping.

| Architecture Type | Structure | Pretraining Objectives | Key Strengths | Limitations | Examples |
|---|---|---|---|---|---|
| Encoder-Only | Stacks of self-attention layers for bidirectional context; no generation component | Masked language modeling (e.g., predicting masked tokens) and sentence-pair tasks | Excellent for embeddings and classification; strong bidirectional understanding | Cannot generate sequences autoregressively; limited to understanding tasks | BERT, RoBERTa |
| Decoder-Only | Masked self-attention for autoregressive generation; no separate encoder | Next-token prediction on vast unlabeled text | Efficient for open-ended generation; supports in-context learning; simpler training | May struggle with complex input–output mappings requiring deep contextual alignment | GPT-3, GPT-4, LLaMA |
| Encoder–Decoder | Encoder for input embeddings + decoder for generation with cross-attention | Denoising (e.g., reconstructing corrupted text) or unified text-to-text tasks | Superior for seq2seq tasks with variable lengths; strong contextual understanding | More computationally intensive to train; often requires paired data | T5, BART, Original Transformer |
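The denoising objective listed for encoder-decoder models can be made concrete with T5's span corruption: contiguous spans of the input are replaced by sentinel tokens (T5 uses `<extra_id_0>`, `<extra_id_1>`, …), and the decoder's target reconstructs the dropped spans. The sketch below is a simplified, whitespace-tokenized illustration of that input/target format, not T5's actual preprocessing pipeline.

```python
# Simplified T5-style span corruption: replace chosen spans with sentinels
# and build the matching reconstruction target.

def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs to corrupt."""
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[prev:start])   # keep uncorrupted text in the input
        inp.append(sentinel)             # sentinel marks the dropped span
        tgt.append(sentinel)             # target pairs each sentinel...
        tgt.extend(tokens[start:end])    # ...with the span it replaced
        prev = end
    inp.extend(tokens[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # final sentinel terminates the target
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(1, 3), (6, 7)])
print(" ".join(inp))  # Thank <extra_id_0> inviting me to <extra_id_1> party last week
print(" ".join(tgt))  # <extra_id_0> you for <extra_id_1> your <extra_id_2>
```

Because the corrupted input and the reconstruction target are different sequences, this objective naturally exercises both halves of the architecture, unlike next-token prediction, which needs only a decoder.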

Applications in Large Language Models

Advantages and Limitations

Advantages of encoder-decoder designs in LLMs include robust handling of complex input–output dependencies, making them well suited to conditional generation. For translation and summarization they often outperform decoder-only models in output quality, as 2024 studies on AI translation suggest. However, they are more resource-intensive to train (typically requiring paired datasets) and less flexible than decoder-only models for zero-shot tasks. Recent trends favor decoder-only architectures for their scalability, but encoder-decoder models remain vital where precision matters in specialized applications. In summary, while decoder-only models like GPT dominate general-purpose LLMs, encoder-decoder structures provide critical capabilities for seq2seq tasks, with ongoing research improving their efficiency and their integration with knowledge retrieval.
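The conditional-generation loop described above can be sketched as follows. This is a toy sketch, not a trained model: `encode` and `decode_step` are hypothetical stand-ins (here implementing a trivial copy task) for the encoder and the decoder-plus-cross-attention step, but the control flow is the real one: encode the input once, then generate one token per step, conditioned on the encoder memory and everything generated so far, until an end-of-sequence token.

```python
EOS = "<eos>"

def encode(src_tokens):
    """Stand-in for the encoder: in a real model this returns hidden states."""
    return list(src_tokens)

def decode_step(memory, generated):
    """Stand-in for decoder + cross-attention: here, a toy copy rule that
    echoes the source token at the current position, then emits EOS."""
    pos = len(generated)
    return memory[pos] if pos < len(memory) else EOS

def greedy_decode(src_tokens, max_len=10):
    memory = encode(src_tokens)      # input is encoded once
    out = []
    while len(out) < max_len:        # tokens are generated one at a time
        tok = decode_step(memory, out)
        if tok == EOS:
            break
        out.append(tok)              # each step conditions on prior outputs
    return out

print(greedy_decode(["guten", "Tag"]))  # ['guten', 'Tag']
```

The same loop structure underlies real seq2seq inference (greedy or beam search); only the stand-in functions would be replaced by the trained encoder and decoder.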