
The Hadoop Ecosystem

Hadoop's core (HDFS, MapReduce, YARN) is just the foundation. A rich ecosystem of open-source tools has grown around it.

Storage Layer

Tool	Description
HDFS	Distributed filesystem, the default Hadoop storage
Apache HBase	Column-family NoSQL database on top of HDFS; supports random read/write
Apache Kudu	Columnar storage for fast analytics and streaming ingest

Data Ingestion

Tool	Description
Apache Sqoop	Batch import/export between RDBMS and HDFS
Apache Flume	Streaming log and event data ingestion into HDFS
Apache Kafka	Distributed message bus; often feeds data into the cluster in real time

Processing & Query

Tool	Description
Apache Hive	SQL-like query language (HiveQL) translated to MapReduce or Tez
Apache Pig	Dataflow scripting language for complex ETL pipelines
Apache Spark	In-memory distributed compute engine; often cited as up to 100x faster than MapReduce for workloads that fit in memory
Apache Tez	DAG execution engine that replaces MapReduce for Hive and Pig
Apache Impala	Low-latency SQL engine with direct HDFS reads
Apache Drill	Schema-free SQL query engine across HDFS, HBase, S3, and more

Workflow & Coordination

Tool	Description
Apache Oozie	Workflow scheduler for Hadoop jobs
Apache ZooKeeper	Distributed coordination service for configuration, naming, synchronization, and leader election
Apache Airflow	Modern DAG-based workflow orchestration platform
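
To make the idea of a DAG-based workflow concrete, here is a minimal sketch in plain Python of how a scheduler might derive a valid run order from task dependencies. It uses the standard-library `graphlib` module rather than the Oozie or Airflow APIs, and the task names and the `execution_order` helper are hypothetical, chosen only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks that must finish first.
PIPELINE = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "aggregate_daily": {"clean_logs"},
    "export_report": {"aggregate_daily"},
}


def execution_order(dag):
    """Return one valid run order for the tasks in a dependency graph.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

A real scheduler adds retries, time-based triggers, and distributed execution on top of this ordering step, but the dependency graph is the core abstraction in both Oozie and Airflow.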

Choosing the Right Tool

Interactive SQL queries → Impala or Hive + Tez
Batch ETL → Spark or Hive
Real-time streaming → Kafka with Spark Structured Streaming or Apache Flink
Random row lookups → HBase
Workflow scheduling → Oozie or Airflow
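
The guidance above can be expressed as a simple lookup table. The following is a sketch only: the workload keys and the `recommend` function are hypothetical names introduced for illustration and do not belong to any Hadoop API.

```python
# Hypothetical lookup table mirroring the guidance above.
RECOMMENDATIONS = {
    "interactive_sql": ["Impala", "Hive + Tez"],
    "batch_etl": ["Spark", "Hive"],
    "real_time_streaming": ["Kafka + Spark Structured Streaming", "Kafka + Flink"],
    "random_row_lookups": ["HBase"],
    "workflow_scheduling": ["Oozie", "Airflow"],
}


def recommend(workload):
    """Return the candidate tools for a workload key.

    Raises ValueError for keys that are not in the table.
    """
    try:
        return list(RECOMMENDATIONS[workload])
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None


if __name__ == "__main__":
    print(recommend("batch_etl"))
```

In practice the choice also depends on factors the table does not capture, such as data volume, latency targets, and the tools your team already operates.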

Next Steps

Head to Advanced Topics to explore high availability, security, and performance tuning.
