
The Hadoop Ecosystem

Hadoop's core (HDFS, MapReduce, YARN) is just the foundation. A rich ecosystem of open-source tools has grown around it.

Storage Layer

Tool | Description
HDFS | Distributed filesystem; the default Hadoop storage layer
Apache HBase | Column-family NoSQL database on top of HDFS; supports random reads and writes
Apache Kudu | Columnar storage for fast analytics and streaming ingest
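
To make the column-family model concrete, here is a minimal in-memory sketch in Python. It is not the HBase client API: the class `ColumnFamilyTable`, the table `users`, and the row key `user#42` are hypothetical names chosen for illustration, and the sketch ignores cell versioning, regions, and persistence. It only shows the addressing scheme (row key, then column family, then qualifier) that makes single-row reads and writes cheap.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy in-memory stand-in for a column-family table (NOT the HBase API)."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> {"family:qualifier": value}
        self._rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        """Write one cell; an existing cell at the same address is overwritten."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key, family=None):
        """Read one row by key, optionally restricted to one column family."""
        row = self._rows.get(row_key, {})
        if family is None:
            return dict(row)
        prefix = f"{family}:"
        return {col: val for col, val in row.items() if col.startswith(prefix)}


users = ColumnFamilyTable(families=["info", "stats"])
users.put("user#42", "info", "name", "Ada")
users.put("user#42", "stats", "logins", 7)
users.put("user#42", "stats", "logins", 8)  # random write: update a single cell

print(users.get("user#42", family="stats"))  # {'stats:logins': 8}
```

Qualifiers can differ from row to row, which is why this layout suits sparse, wide data better than a fixed relational schema.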

Data Ingestion

Tool | Description
Apache Sqoop | Batch import/export between RDBMS and HDFS
Apache Flume | Streaming log and event data ingestion into HDFS
Apache Kafka | Distributed message bus; often feeds data into the cluster in real time

Processing & Query

Tool | Description
Apache Hive | SQL-like query language (HiveQL) translated to MapReduce or Tez jobs
Apache Pig | Dataflow scripting language for complex ETL pipelines
Apache Spark | In-memory distributed compute engine; commonly cited as 10-100x faster than MapReduce on workloads that fit in memory, with smaller gains otherwise
Apache Tez | DAG execution engine that can replace MapReduce as the backend for Hive and Pig
Apache Impala | Low-latency SQL engine that reads HDFS data directly instead of running MapReduce jobs
Apache Drill | Schema-free SQL query engine across HDFS, HBase, S3, and more
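
To give a feel for what "translated to MapReduce" means, the sketch below runs the three MapReduce phases in plain Python for a query shaped like `SELECT page, COUNT(*) FROM page_views GROUP BY page`. The table `page_views`, its columns, and the function names are hypothetical. Real Hive plans are considerably more elaborate and run across many machines, so treat this as a conceptual model only.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, 1) pair per input record."""
    for record in records:
        yield record["page"], 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input rows standing in for a page_views table.
page_views = [
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]

counts = reduce_phase(shuffle_phase(map_phase(page_views)))
print(counts)  # {'/home': 2, '/docs': 1}
```

In classic MapReduce, the intermediate results of each job are written to disk before the next job starts. Engines such as Tez and Spark chain stages in a single DAG and avoid much of that overhead.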

Workflow & Coordination

Tool | Description
Apache Oozie | Workflow scheduler for Hadoop jobs
Apache ZooKeeper | Distributed coordination service, used for leader election, configuration management, and distributed locking
Apache Airflow | Modern DAG-based workflow orchestration platform; not specific to Hadoop
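
The core idea behind DAG-based schedulers can be sketched with Python's standard library: tasks declare their dependencies, and the scheduler derives a valid run order from them. This is not Oozie or Airflow code. The task names and the `execution_order` helper are hypothetical, and a real orchestrator adds retries, scheduling, and parallel execution on top.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_logs": set(),
    "ingest_orders": set(),
    "clean": {"ingest_logs", "ingest_orders"},
    "aggregate": {"clean"},
    "publish_report": {"aggregate"},
}


def execution_order(dag):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

The two ingest tasks have no dependency on each other, so a scheduler is free to run them in parallel. `TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a cycle, which is the same reason workflow tools require the graph to be acyclic.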

Choosing the Right Tool

  • Interactive SQL queries → Impala or Hive + Tez
  • Batch ETL → Spark or Hive
  • Real-time streaming → Kafka feeding Spark Structured Streaming or Apache Flink
  • Random row lookups → HBase
  • Workflow scheduling → Oozie or Airflow
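
The guidance above can also be written as a small lookup table, which is handy if you want to embed it in internal tooling or documentation checks. The workload keys and the `recommend` function are hypothetical names. The recommendations themselves are copied from the list and are starting points, not rules.

```python
# Workload type -> candidate tools, taken from the list above.
RECOMMENDATIONS = {
    "interactive_sql": ["Impala", "Hive + Tez"],
    "batch_etl": ["Spark", "Hive"],
    "real_time_streaming": ["Kafka + Spark Structured Streaming", "Kafka + Flink"],
    "random_row_lookups": ["HBase"],
    "workflow_scheduling": ["Oozie", "Airflow"],
}


def recommend(workload):
    """Return the candidate tools for a workload, or raise ValueError if unknown."""
    try:
        return RECOMMENDATIONS[workload]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(
            f"unknown workload {workload!r}; expected one of: {known}"
        ) from None


print(recommend("batch_etl"))  # ['Spark', 'Hive']
```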

Next Steps

Head to Advanced Topics to explore high availability, security, and performance tuning.