Tokenization
The process of breaking text into smaller units (tokens) for processing by NLP models. Tokenization can split text into words, subwords, or characters depending on the method used.
Why It Matters
Tokenization is the first step in any NLP pipeline. How text is tokenized affects a model's vocabulary size, its handling of rare and unseen words, and how well it supports different languages.
Example
The sentence 'I don't like mushrooms' might be tokenized as ['I', 'don', "'", 't', 'like', 'mushrooms'] or as ['I', 'do', "n't", 'like', 'mush', 'rooms'], depending on the tokenizer.
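The word-level splits above can be reproduced with simple regular expressions (a sketch using Python's standard `re` module; the patterns are illustrative, not what any real tokenizer uses, and subword splits like 'mush' + 'rooms' come from learned vocabularies rather than regexes):

```python
import re

text = "I don't like mushrooms"

# Naive split: runs of word characters, with each punctuation mark as its own token.
naive = re.findall(r"\w+|[^\w\s]", text)
print(naive)  # ['I', 'don', "'", 't', 'like', 'mushrooms']

# Contraction-aware split: peel "n't" off as a separate token.
aware = re.findall(r"\w+(?=n't)|n't|\w+|[^\w\s]", text)
print(aware)  # ['I', 'do', "n't", 'like', 'mushrooms']
```

The choice of pattern alone changes what building blocks the model sees, which is why tokenizer design matters downstream.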
Think of it like...
Like breaking a sentence into Scrabble tiles — the way you split it up determines what building blocks the model has to work with.
Related Terms
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
Tokenizer
A component that converts raw text into tokens (numerical representations) that a language model can process. Different tokenizers split text differently, affecting model performance and efficiency.
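The text-to-numbers step can be sketched with a toy vocabulary (a minimal illustration; real tokenizers use learned vocabularies of tens of thousands of entries, and the `<unk>` fallback shown here is one common convention, not universal):

```python
# Hypothetical toy vocabulary mapping each known token to an integer ID.
vocab = {"I": 0, "like": 1, "mushrooms": 2, "<unk>": 3}
inverse = {i: tok for tok, i in vocab.items()}

def encode(text):
    # Split on whitespace and look up each token, falling back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

def decode(ids):
    # Map IDs back to tokens; lossy for anything that hit <unk>.
    return " ".join(inverse[i] for i in ids)

print(encode("I like mushrooms"))  # [0, 1, 2]
print(encode("I like pizza"))      # [0, 1, 3] — "pizza" is out of vocabulary
```

The out-of-vocabulary case is exactly what subword tokenization is designed to avoid: instead of collapsing unknown words to a single ID, they are broken into smaller known pieces.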
Byte-Pair Encoding
A subword tokenization algorithm that starts with individual characters and iteratively merges the most frequent pairs to create a vocabulary of subword units. It balances vocabulary size with handling of rare words.
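The merge loop at the heart of BPE training can be sketched in a few lines of standard-library Python (a simplified version for clarity; production implementations add byte-level handling, word-frequency preprocessing, and end-of-word markers):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Start with each word as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "low", "lower", "lowest"], 2))
```

On this tiny corpus the most frequent pairs merge first ('l'+'o', then 'lo'+'w'), building up the shared stem "low" as a single subword unit, which is how BPE keeps the vocabulary small while still covering rare words like "lowest" from known pieces.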
Natural Language Processing
The branch of AI that deals with the interaction between computers and human language. NLP enables machines to read, understand, generate, and make sense of human language in a useful way.