Apache Hadoop
Definition
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
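To make the MapReduce model concrete, below is the canonical word-count job written against Hadoop's Java MapReduce API: the map phase emits a (word, 1) pair for each token in the input, and the reduce phase sums the counts per word. The input and output paths, taken from the command line here, are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be launched with `hadoop jar wordcount.jar WordCount /input /output`; the framework handles splitting the input, scheduling map and reduce tasks across the cluster, and retrying failed tasks.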
Why It Matters
Hadoop pioneered the big data field. Its two core components, HDFS (the Hadoop Distributed File System) for distributed storage and MapReduce for distributed processing, made it possible to store and analyze petabyte-scale datasets on clusters of commodity hardware.
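As a minimal sketch of how an application talks to HDFS, the following uses Hadoop's Java FileSystem API to write a file and read it back. The NameNode URI (hdfs://namenode:8020) and the file path are illustrative assumptions; in a real deployment, fs.defaultFS usually comes from core-site.xml.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS is normally read from core-site.xml; this URI is illustrative.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/tmp/hello.txt");  // hypothetical path

      // Write: the client streams data blocks to DataNodes; the NameNode tracks metadata.
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back and copy its contents to stdout.
      try (FSDataInputStream in = fs.open(path)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}
```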
Contextual Example
Early big data applications at companies like Yahoo and Facebook were built on Hadoop. It allowed them to process and analyze the massive amounts of user-generated data and web logs they were collecting.
Common Misunderstandings
- MapReduce, Hadoop's original processing engine, is still used but has been largely superseded by the faster and more flexible Apache Spark, which keeps intermediate results in memory instead of writing them to disk between stages.
- Hadoop is not a single piece of software but a complex ecosystem of related projects, including YARN for resource management, Hive for SQL-style queries, and HBase for low-latency storage.