Artificial Intelligence & Machine Learning
Tokenization
Definition
Tokenization is the process of breaking a piece of text into smaller units called "tokens". Depending on the strategy, these tokens can be words, characters, or subwords.
Why It Matters
Tokenization is the very first step in almost any Natural Language Processing (NLP) task. Models cannot operate on raw text directly: the text must be converted into a numerical representation, and tokenization is the first stage of that conversion.
Contextual Example
The sentence "The cat sat on the mat." could be tokenized into the following word tokens: ["The", "cat", "sat", "on", "the", "mat", "."]. Each token is then typically mapped to an integer ID.
Common Misunderstandings
- Tokens are not always whole words: different strategies exist, including word-based, character-based, and subword-based tokenization (such as the Byte-Pair Encoding used in many LLMs).
- Tokenization by itself does not produce vectors: after tokenization, the integer token IDs are looked up in an embedding table to obtain the numerical vectors (embeddings) the model actually processes (see the sketch after this list).
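To make the subword and embedding points concrete, here is a hedged sketch. It assumes the `tiktoken` and `numpy` packages are installed; the encoding name `cl100k_base` and the embedding dimension of 8 are arbitrary choices for illustration, the embedding table is random rather than learned, and the exact subword splits depend on the vocabulary used:

```python
import numpy as np
import tiktoken  # assumption: a BPE tokenizer library such as tiktoken is available

text = "Tokenization matters."

# Subword (BPE) tokenization: the text becomes a sequence of integer token IDs.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode(text)

# Decode each ID individually to inspect the subword piece it represents.
pieces = [enc.decode([i]) for i in token_ids]
print(token_ids, pieces)

# Embedding lookup: each ID indexes a row of a (vocab_size x dim) matrix.
# Here the matrix is random; in a trained model it is a learned parameter.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(enc.n_vocab, 8))  # dim=8 for illustration
embeddings = embedding_table[token_ids]              # shape: (num_tokens, 8)
print(embeddings.shape)
```

The two stages are distinct: the tokenizer only produces IDs, and the embedding table (part of the model) turns those IDs into the vectors the network computes with.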