Artificial Intelligence & Machine Learning

Tokenization

Definition

Tokenization is the process of breaking down a piece of text into smaller units called "tokens". These tokens can be words, characters, or sub-words.
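For illustration, here is a minimal plain-Python sketch of one short text split at each of these granularities; the subword split is hand-picked for the example, whereas real subword tokenizers learn their segmentation from data:

    text = "Tokenization matters."

    # Word-level tokens: split on whitespace (punctuation stays attached here).
    word_tokens = text.split()          # ['Tokenization', 'matters.']

    # Character-level tokens: every character becomes its own token.
    char_tokens = list(text)            # ['T', 'o', 'k', 'e', 'n', ...]

    # Subword-level tokens: a hand-picked segmentation showing how a
    # subword tokenizer might break a long word into frequent pieces.
    subword_tokens = ["Token", "ization", " matters", "."]

    print(word_tokens)
    print(char_tokens)
    print(subword_tokens)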

Why It Matters

Tokenization is the very first step in almost any Natural Language Processing (NLP) task. Computers cannot process raw text directly; the text must be converted into a numerical representation, and tokenization is the first stage of that conversion.

Contextual Example

The sentence "The cat sat on the mat." could be tokenized into the following word tokens: ["The", "cat", "sat", "on", "the", "mat", "."]. Each token is then typically mapped to an integer ID.
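A minimal sketch of that mapping in plain Python; the vocabulary below is built on the fly purely for illustration, whereas real tokenizers ship with a fixed, pre-built vocabulary:

    sentence = "The cat sat on the mat."

    # Naive word-level tokenization: detach the final period, then split on spaces.
    tokens = sentence.replace(".", " .").split()
    # -> ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']

    # Toy vocabulary: assign each distinct token an integer ID in order of appearance.
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)

    token_ids = [vocab[token] for token in tokens]
    print(tokens)     # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
    print(token_ids)  # [0, 1, 2, 3, 4, 5, 6]

Note that "The" and "the" receive different IDs here because this toy tokenizer is case-sensitive; real tokenizers make an explicit choice about how to handle casing.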

Common Misunderstandings

  • Tokens are not necessarily whole words. Different strategies exist, such as word-based, character-based, or subword-based (like the Byte-Pair Encoding used in many LLMs).
  • Tokenization alone does not produce the numbers a model works with. After tokenization, the tokens (via their integer IDs) are converted into numerical vectors (embeddings) that the model can process, as sketched after this list.
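As a minimal sketch of that second point, continuing from the toy token IDs above: each token ID selects a row from an embedding table. The tiny random table and four-dimensional vectors below are made up for illustration; in a real model the table is a learned parameter with hundreds or thousands of dimensions.

    import random

    random.seed(0)

    token_ids = [0, 1, 2, 3, 4, 5, 6]  # toy IDs from the example above
    vocab_size = 7
    embedding_dim = 4                  # made-up, tiny dimension for illustration

    # Toy embedding table: one vector per vocabulary entry.
    # In a real model, this table is learned during training.
    embedding_table = [
        [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
        for _ in range(vocab_size)
    ]

    # Embedding lookup: the token-ID sequence becomes a sequence of vectors
    # that the model can actually process.
    embedded = [embedding_table[i] for i in token_ids]
    print(embedded[0])  # the 4-dimensional vector for token ID 0 ("The")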

Related Terms

  • Embedding
  • Byte-Pair Encoding (BPE)
  • Natural Language Processing (NLP)

Last Updated: December 17, 2025