Artificial Intelligence

Tokenization Strategy

The approach and rules for how text is split into tokens. Different strategies (word-level, subword, character-level) make different tradeoffs between vocabulary size and sequence length.

Why It Matters

Your tokenization strategy affects model efficiency, multilingual support, and how well the model handles rare or novel words.

Example

A subword tokenizer splits 'unhappiness' into 'un' + 'happiness', while a word-level tokenizer treats it as a single token. The subword approach handles new and rare word forms better, because unseen words can still be assembled from known pieces.
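The contrast can be sketched in a few lines. This is a minimal illustration, not a real tokenizer such as BPE or WordPiece: the tiny vocabulary and the greedy longest-match-first scheme are assumptions made for the example.

```python
# Illustrative subword vocabulary (an assumption for this sketch).
SUBWORD_VOCAB = {"un", "happi", "happiness", "ness", "happy"}

def word_tokenize(text):
    """Word-level: one token per whitespace-separated word."""
    return text.split()

def subword_tokenize(word, vocab=SUBWORD_VOCAB):
    """Greedy longest-match-first split of a single word into subwords."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining prefix that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a character-level token.
            tokens.append(word[i])
            i += 1
    return tokens

print(word_tokenize("unhappiness"))    # ['unhappiness'] (one opaque token)
print(subword_tokenize("unhappiness")) # ['un', 'happiness']
```

A word-level tokenizer must map any word not in its vocabulary to an unknown token, while the subword version degrades gracefully: a word like 'happily' still yields the known piece 'happi' plus character-level fallbacks for the rest.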

Think of it like...

Like choosing how to cut a log — you can cut it into large planks, medium boards, or small strips, each useful for different building projects.

Related Terms