Tokenization is the process of converting raw text into a sequence of tokens that can be processed by a language model. A token can be a word, a subword, or even a character, depending on the tokenization strategy used.
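For example, a pretrained subword tokenizer maps a sentence first to token strings and then to the integer ids the model actually consumes. A minimal sketch, assuming the Hugging Face transformers package is installed and can download the pretrained GPT-2 tokenizer (neither is mentioned above):

```python
from transformers import AutoTokenizer

# Load a pretrained subword (BPE) tokenizer; "gpt2" is just a convenient example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization converts raw text into tokens."
tokens = tokenizer.tokenize(text)  # token strings (subword pieces)
ids = tokenizer.encode(text)       # the integer ids a model would consume

print(tokens)
print(ids)
```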
Word-level tokenization: Each word is treated as a separate token. This can lead to a large vocabulary and issues with out-of-vocabulary (OOV) words.
Subword tokenization: Words are broken down into smaller units (subwords). This allows the model to handle OOV words by combining subwords. Common algorithms include Byte Pair Encoding (BPE) and WordPiece.
Character-level tokenization: Each character is treated as a token. This captures fine-grained information but produces much longer sequences. The sketch below contrasts the three granularities on the same sentence.
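A minimal illustration of the three granularities; the sentence and the subword split are made up, since the actual pieces would depend on a trained subword vocabulary:

```python
sentence = "unhappiness is rare"

# Word-level: split on whitespace; every distinct word needs its own vocabulary entry.
word_tokens = sentence.split()        # ['unhappiness', 'is', 'rare']

# Character-level: very small vocabulary, but long sequences.
char_tokens = list(sentence)          # ['u', 'n', 'h', 'a', ...]

# Subword-level: hypothetical pieces a learned vocabulary might produce.
subword_tokens = ["un", "happi", "ness", "is", "rare"]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```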
Byte Pair Encoding (BPE): Starts from characters and iteratively merges the most frequent adjacent pair of symbols into a new symbol, repeating until a predefined vocabulary size is reached (see the sketch after this list).
WordPiece: Similar to BPE, but instead of merging the most frequent pair it picks the merge that most increases the likelihood of the training data under the current vocabulary. It is used in models like BERT.
SentencePiece: A tokenizer that operates directly on raw text, treating whitespace as an ordinary symbol, so it needs no language-specific pre-tokenization; it supports both BPE and unigram language-model segmentation and is used in models like T5.
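To make the BPE merge procedure concrete, here is a minimal sketch of the training loop on a toy corpus. The words, frequencies, and number of merges are made up for illustration, and real implementations add details such as an end-of-word marker:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_corpus = []
    for symbols, freq in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_corpus.append((out, freq))
    return merged_corpus

# Toy corpus: (word split into characters, frequency).
corpus = [(list("lower"), 5), (list("lowest"), 2),
          (list("newer"), 6), (list("wider"), 3)]

num_merges = 10  # stand-in for "until the vocabulary reaches its target size"
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    print("merged", best)
```

Each learned merge becomes a new vocabulary entry; at tokenization time the same merges are replayed in order, which is how unseen words get segmented into known subwords.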