Databases & Data Storage

MapReduce

Definition

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Why It Matters

MapReduce, popularized by Google, was a breakthrough in simplifying large-scale data processing. It breaks down a computation into two main phases: a "Map" phase that processes data in parallel, and a "Reduce" phase that aggregates the results.

Contextual Example

To count the occurrences of each word in a massive collection of documents, the Map phase would read the documents in parallel and emit a `(word, 1)` pair for each word. The Reduce phase would then sum up the counts for each identical word.

Common Misunderstandings

  • Hadoop MapReduce is the most well-known implementation of this model.
  • While powerful, MapReduce can be complex to program directly and has been largely replaced by higher-level tools like Apache Spark.

Related Terms

Last Updated: December 17, 2025