Artificial Intelligence & Machine Learning
Tokenization
Definition
Tokenization is the process of breaking a piece of text into smaller units called "tokens". Depending on the strategy, these tokens can be words, characters, or subwords.
Why It Matters
Tokenization is the very first step in almost any Natural Language Processing (NLP) task. Models cannot operate on raw text directly: the text must be converted into a numerical representation, and tokenization is the first stage of that conversion.
Contextual Example
The sentence "The cat sat on the mat." could be tokenized into the following word tokens: ["The", "cat", "sat", "on", "the", "mat", "."]. Each token is then typically mapped to an integer ID.
Common Misunderstandings
- Tokens are not always whole words: different strategies exist, including word-based, character-based, and subword-based tokenization (such as the Byte-Pair Encoding used in many LLMs).
- Tokenization by itself does not produce vectors: after tokenization, the integer token IDs are looked up in an embedding table to obtain the numerical vectors (embeddings) the model actually processes (see the sketch after this list).
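To make the subword and embedding points concrete, here is a hedged sketch. It assumes the `tiktoken` and `numpy` packages are installed; the encoding name `cl100k_base` and the embedding dimension of 8 are arbitrary choices for illustration, the embedding table is random rather than learned, and the exact subword splits depend on the vocabulary used:

```python
import numpy as np
import tiktoken  # assumption: a BPE tokenizer library such as tiktoken is available

text = "Tokenization matters."

# Subword (BPE) tokenization: the text becomes a sequence of integer token IDs.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode(text)

# Decode each ID individually to inspect the subword piece it represents.
pieces = [enc.decode([i]) for i in token_ids]
print(token_ids, pieces)

# Embedding lookup: each ID indexes a row of a (vocab_size x dim) matrix.
# Here the matrix is random; in a trained model it is a learned parameter.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(enc.n_vocab, 8))  # dim=8 for illustration
embeddings = embedding_table[token_ids]              # shape: (num_tokens, 8)
print(embeddings.shape)
```

The two stages are distinct: the tokenizer only produces IDs, and the embedding table (part of the model) turns those IDs into the vectors the network computes with.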