Databases & Data Storage
MapReduce
Definition
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
Why It Matters
MapReduce, popularized by Google, was a breakthrough in simplifying large-scale data processing. It breaks down a computation into two main phases: a "Map" phase that processes data in parallel, and a "Reduce" phase that aggregates the results.
Contextual Example
To count the occurrences of each word in a massive collection of documents, the Map phase would read the documents in parallel and emit a `(word, 1)` pair for each word. The Reduce phase would then sum up the counts for each identical word.
Common Misunderstandings
- Hadoop MapReduce is the most well-known implementation of this model.
- While powerful, MapReduce can be complex to program directly and has been largely replaced by higher-level tools like Apache Spark.