Hadoop's core (HDFS, MapReduce, YARN) is just the foundation. A rich ecosystem of open-source tools has grown around it.
## Storage Layer
| Tool | Description |
|---|---|
| HDFS | Distributed filesystem, the default Hadoop storage |
| Apache HBase | Column-family NoSQL database on top of HDFS; supports random read/write |
| Apache Kudu | Columnar storage for fast analytics and streaming ingest |
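What distinguishes HBase from plain HDFS files is its data model: sorted row keys, column families, and versioned cells, which is what makes random reads and writes cheap. A minimal sketch of that model in plain Python (an illustration only, not the HBase client API):

```python
from collections import defaultdict
import time

class MiniHBaseTable:
    """Toy illustration of HBase's sorted, versioned cell layout.
    Cells are addressed by (row_key, "family:qualifier") and keep
    multiple timestamped versions, newest first."""

    def __init__(self):
        # row_key -> column -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, ts=None):
        ts = ts if ts is not None else time.time_ns()
        self.rows[row_key][column].insert(0, (ts, value))

    def get(self, row_key, column):
        versions = self.rows[row_key][column]
        return versions[0][1] if versions else None  # newest version wins

    def scan(self, start_row, stop_row):
        # HBase scans return rows in sorted row-key order
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, {c: v[0][1] for c, v in self.rows[key].items()}

table = MiniHBaseTable()
table.put("user#001", "info:name", "Ada")
table.put("user#001", "info:name", "Ada Lovelace")  # newer version
table.put("user#002", "info:name", "Alan")
print(table.get("user#001", "info:name"))  # prints "Ada Lovelace"
```

The sorted row-key scan is why HBase row-key design matters so much in practice: range queries are only efficient when related rows sort together.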
## Data Ingestion
| Tool | Description |
|---|---|
| Apache Sqoop | Batch import/export between RDBMS and HDFS (now retired to the Apache Attic) |
| Apache Flume | Streaming log and event data ingestion into HDFS |
| Apache Kafka | Distributed message bus; often feeds data into the cluster in real time |
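Kafka's scalability comes from its storage model: a topic is split into partitions, each an append-only log, and records with the same key always land in the same partition, preserving per-key order. A toy in-memory sketch of that model (illustrative names, not the Kafka client API):

```python
import hashlib

class MiniTopic:
    """Toy sketch of a Kafka topic: one append-only log per partition.
    Records with the same key hash to the same partition, so per-key
    ordering is preserved; consumers track their own read offsets."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition_for(self, key):
        # stable hash of the key picks the partition (like Kafka's
        # default key partitioner, though Kafka uses murmur2)
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def produce(self, key, value):
        p = self._partition_for(key)
        self.partitions[p].append((key, value))
        return p

    def consume(self, partition, offset):
        # consumers pull from an offset; the log itself is never mutated
        return self.partitions[partition][offset:]

topic = MiniTopic()
p = topic.produce("sensor-42", "temp=21.5")
topic.produce("sensor-42", "temp=21.7")
print(topic.consume(p, 0))  # both records, in production order
```

Because consumers manage their own offsets, the same log can feed a real-time Spark job and a slower batch loader independently.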
## Processing & Query
| Tool | Description |
|---|---|
| Apache Hive | SQL-like queries (HiveQL) compiled into MapReduce, Tez, or Spark jobs |
| Apache Pig | Dataflow scripting language for complex ETL pipelines |
| Apache Spark | In-memory distributed compute engine; often 10-100x faster than MapReduce for iterative workloads |
| Apache Tez | DAG execution engine that replaces MapReduce for Hive and Pig |
| Apache Impala | Low-latency SQL engine with direct HDFS reads |
| Apache Drill | Schema-free SQL query engine across HDFS, HBase, S3, and more |
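Most of these engines still boil down to the same three-stage model that Hive and Pig originally compiled to: map over input splits, shuffle (group values by key), then reduce each group. The classic word count, sketched in plain Python rather than the Hadoop API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # mapper: emit a (word, 1) pair for every word in an input split
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # shuffle: group all values by key, like Hadoop's sort-and-merge step
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reducer: fold each key's values into a single result
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "big cluster"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # prints 3
```

The performance differences between the engines come largely from where intermediate data lives during the shuffle: MapReduce spills it to disk between stages, while Spark and Tez keep it in memory and pipeline stages in a DAG.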
## Workflow & Coordination
| Tool | Description |
|---|---|
| Apache Oozie | Workflow scheduler for Hadoop jobs |
| Apache ZooKeeper | Distributed coordination service for leader election, configuration, and distributed locks |
| Apache Airflow | Modern DAG-based workflow orchestration platform |
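At their core, Oozie and Airflow both model a pipeline as a DAG of tasks and run each task only after its upstream dependencies succeed. The scheduling idea can be sketched with Python's standard library (the task names and the workflow are hypothetical, and this is not the Airflow API):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

ran = []  # records execution order

# Hypothetical four-step ETL workflow; each "task" is just a callable here
tasks = {
    "extract":   lambda: ran.append("extract"),
    "clean":     lambda: ran.append("clean"),
    "aggregate": lambda: ran.append("aggregate"),
    "load":      lambda: ran.append("load"),
}

# task -> set of upstream dependencies, the same shape as an Airflow DAG
deps = {
    "clean":     {"extract"},
    "aggregate": {"clean"},
    "load":      {"aggregate"},
}

# run tasks in topological order: every task waits for its upstreams
for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(ran)  # prints ['extract', 'clean', 'aggregate', 'load']
```

Real schedulers add what this sketch omits: retries, backfills, parallel execution of independent branches, and persistence of task state across failures.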
## Which Tool for Which Job?

- Interactive SQL queries → Impala or Hive on Tez
- Batch ETL → Spark or Hive
- Real-time streaming → Kafka + Spark Structured Streaming or Flink
- Random row lookups → HBase
- Workflow scheduling → Oozie or Airflow
## Next Steps
Head to Advanced Topics to explore high availability, security, and performance tuning.