Artificial Intelligence

Byte-Pair Encoding

A subword tokenization algorithm that starts with individual characters and iteratively merges the most frequent adjacent pair of symbols to build a vocabulary of subword units. It balances vocabulary size against the ability to represent rare words.

Why It Matters

BPE is the tokenization method used by most modern LLMs. Because any unseen word can be broken into smaller known subwords (ultimately individual characters), it handles misspellings and novel terms while keeping the vocabulary to a manageable, fixed size.

Example

Starting from the individual characters 'l', 'o', 'w', 'e', 'r', BPE might first merge 'l'+'o' → 'lo', then 'lo'+'w' → 'low', building up common subwords that are shared across many words.
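The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the toy corpus and its word frequencies are made up for the example, and real implementations (e.g. byte-level BPE) add many details omitted here.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical toy corpus: words pre-split into characters, with counts.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w"): 3,
}

merges = []
for _ in range(3):  # learn 3 merge rules
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # → [('l', 'o'), ('lo', 'w'), ('n', 'e')]
```

With these frequencies the first two learned merges are exactly the 'l'+'o' → 'lo' and 'lo'+'w' → 'low' steps described above; the ordered merge list is what a trained BPE tokenizer later replays to segment new text.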

Think of it like...

Like creating abbreviations for common letter combinations in shorthand writing — frequent patterns get their own symbol, making the system efficient.

Related Terms