nvidia-generative-ai-notes

LLM Architecture

Introduction to Encoder-Decoder Structures in Large Language Models

Encoder-decoder architectures form a foundational pillar in the evolution of large language models (LLMs), particularly for handling sequence-to-sequence (seq2seq) tasks where inputs and outputs are variable-length sequences. Originating from the Transformer model introduced in 2017, this structure consists of two main components: an encoder that processes the input sequence into a rich, contextual representation, and a decoder that generates the output sequence based on that representation.

In the context of LLMs, which are typically built on Transformer variants, encoder-decoder models excel in scenarios requiring deep understanding of input context to produce structured outputs, unlike simpler architectures that might focus solely on generation or embedding. The encoder uses self-attention mechanisms to create bidirectional embeddings, capturing relationships across the entire input. The decoder, employing masked self-attention (to prevent peeking at future tokens) and cross-attention (to attend to the encoder's outputs), generates tokens autoregressively—one at a time, conditioned on previous outputs and the encoded input.

This design addresses limitations of earlier recurrent neural networks (RNNs) such as LSTMs by enabling parallel processing and better handling of long-range dependencies.
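The three attention patterns above can be sketched with plain NumPy. This is a minimal illustration, not a full Transformer: no multi-head projections, layer norm, or feed-forward blocks, and the embeddings are random stand-ins. It shows how the same scaled dot-product attention gives (1) bidirectional encoder self-attention, (2) causally masked decoder self-attention, and (3) cross-attention from decoder positions over encoder outputs.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; masked positions are suppressed before softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
src_len, tgt_len = 5, 3                      # variable-length input and output
src = rng.normal(size=(src_len, d_model))    # stand-in encoder input embeddings
tgt = rng.normal(size=(tgt_len, d_model))    # stand-in decoder input embeddings

# 1. Encoder self-attention: bidirectional, every token attends to every token.
enc_out, _ = attention(src, src, src)

# 2. Decoder masked self-attention: a causal mask blocks future positions.
causal = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
dec_hidden, w = attention(tgt, tgt, tgt, mask=causal)
# The first target token can only attend to itself: all weight on position 0.
assert np.isclose(w[0, 0], 1.0) and np.allclose(w[0, 1:], 0.0)

# 3. Cross-attention: decoder queries attend over the encoder's outputs.
dec_out, _ = attention(dec_hidden, enc_out, enc_out)
print(dec_out.shape)  # (3, 8): one contextual vector per target position
```

Note how cross-attention is the only place the two stacks meet: the decoder supplies the queries, while keys and values come from the encoder's output.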

Comparison with Other LLM Architectures

To fully appreciate encoder-decoder structures, it’s useful to contrast them with encoder-only and decoder-only architectures, which dominate modern LLMs. Encoder-only models (e.g., BERT) focus on input understanding for tasks like classification, while decoder-only models (e.g., GPT series) emphasize generative capabilities through next-token prediction. Encoder-decoder models combine both for input-output mapping.

| Architecture Type | Structure | Pretraining Objectives | Key Strengths | Limitations | Examples |
|---|---|---|---|---|---|
| Encoder-Only | Stacks of self-attention layers for bidirectional context; no generation component | Masked language modeling (e.g., predicting masked tokens) and sentence-pair tasks | Excellent for embeddings and classification; strong bidirectional understanding | Cannot generate sequences autoregressively; limited to understanding tasks | BERT, RoBERTa |
| Decoder-Only | Masked self-attention for autoregressive generation; no separate encoder | Next-token prediction on vast unlabeled text | Efficient for open-ended generation; supports in-context learning; simpler training | May struggle with complex input–output mappings requiring deep contextual alignment | GPT-3, GPT-4, LLaMA |
| Encoder–Decoder | Encoder for input embeddings + decoder for generation with cross-attention | Denoising (e.g., reconstructing corrupted text) or unified text-to-text tasks | Superior for seq2seq tasks with variable lengths; strong contextual understanding | More computationally intensive to train; often requires paired data | T5, BART, Original Transformer |
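The denoising objective listed for encoder-decoder models can be made concrete with T5's span corruption: contiguous spans of the input are replaced by sentinel tokens (T5 uses `<extra_id_0>`, `<extra_id_1>`, …), and the decoder's target reconstructs the dropped spans. The sketch below is a simplified, whitespace-tokenized illustration of that input/target format, not T5's actual preprocessing pipeline.

```python
# Simplified T5-style span corruption: replace chosen spans with sentinels
# and build the matching reconstruction target.

def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs to corrupt."""
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[prev:start])   # keep uncorrupted text in the input
        inp.append(sentinel)             # sentinel marks the dropped span
        tgt.append(sentinel)             # target pairs each sentinel...
        tgt.extend(tokens[start:end])    # ...with the span it replaced
        prev = end
    inp.extend(tokens[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # final sentinel terminates the target
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(1, 3), (6, 7)])
print(" ".join(inp))  # Thank <extra_id_0> inviting me to <extra_id_1> party last week
print(" ".join(tgt))  # <extra_id_0> you for <extra_id_1> your <extra_id_2>
```

Because the corrupted input and the reconstruction target are different sequences, this objective naturally exercises both halves of the architecture, unlike next-token prediction, which needs only a decoder.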

Applications in Large Language Models

Advantages and Limitations

Advantages of encoder-decoder designs in LLMs include robust handling of complex input–output dependencies, making them well suited to conditional generation. For translation and summarization they often outperform decoder-only models in output quality, as 2024 studies on AI translation suggest. However, they are more resource-intensive to train (typically requiring paired datasets) and less flexible than decoder-only models for zero-shot tasks. Recent trends favor decoder-only architectures for their scalability, but encoder-decoder models remain vital where precision matters in specialized applications. In summary, while decoder-only models like GPT dominate general-purpose LLMs, encoder-decoder structures provide critical capabilities for seq2seq tasks, with ongoing research improving their efficiency and their integration with knowledge retrieval.
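The conditional-generation loop described above can be sketched as follows. This is a toy sketch, not a trained model: `encode` and `decode_step` are hypothetical stand-ins (here implementing a trivial copy task) for the encoder and the decoder-plus-cross-attention step, but the control flow is the real one: encode the input once, then generate one token per step, conditioned on the encoder memory and everything generated so far, until an end-of-sequence token.

```python
EOS = "<eos>"

def encode(src_tokens):
    """Stand-in for the encoder: in a real model this returns hidden states."""
    return list(src_tokens)

def decode_step(memory, generated):
    """Stand-in for decoder + cross-attention: here, a toy copy rule that
    echoes the source token at the current position, then emits EOS."""
    pos = len(generated)
    return memory[pos] if pos < len(memory) else EOS

def greedy_decode(src_tokens, max_len=10):
    memory = encode(src_tokens)      # input is encoded once
    out = []
    while len(out) < max_len:        # tokens are generated one at a time
        tok = decode_step(memory, out)
        if tok == EOS:
            break
        out.append(tok)              # each step conditions on prior outputs
    return out

print(greedy_decode(["guten", "Tag"]))  # ['guten', 'Tag']
```

The same loop structure underlies real seq2seq inference (greedy or beam search); only the stand-in functions would be replaced by the trained encoder and decoder.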