Capacity Planning
Sizing a Hadoop cluster correctly upfront saves costly re-architecture later. This guide covers how to estimate storage, compute, and network requirements for your workload.
Start With Your Data
Answer these questions first:
| Question | Why It Matters |
|---|---|
| How much raw data per day? | Drives storage growth rate |
| What is the retention period? | Total storage footprint |
| What replication factor? | Multiplies raw storage |
| What compression ratio? | Reduces storage (typically 2–5×) |
| Is data hot or cold? | Affects disk type (SSD vs HDD) |
| How many concurrent jobs? | Drives CPU and memory sizing |
Storage Sizing
Formula:
```
Total HDFS Storage = Raw Data per Day × Retention (days) × Replication Factor ÷ Compression Ratio
```
Example:
- Raw data ingested: 500 GB/day
- Retention: 365 days
- Replication factor: 3
- Compression ratio (ORC + Snappy): 4×
```
500 GB/day × 365 days × 3 ÷ 4 = 136,875 GB ≈ 136.9 TB
```
Add 25% overhead for intermediate files, temp space, and growth headroom → plan for ~171 TB of usable HDFS capacity.
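A quick way to sanity-check the arithmetic is to script the formula. A minimal sketch in Python; the function name and the decimal TB convention (1 TB = 1,000 GB) are choices made here, not part of the guide:

```python
def hdfs_storage_tb(raw_gb_per_day, retention_days,
                    replication=3, compression_ratio=4.0, overhead=0.25):
    """Required usable HDFS capacity in TB (decimal: 1 TB = 1,000 GB)."""
    stored_gb = raw_gb_per_day * retention_days * replication / compression_ratio
    return stored_gb * (1 + overhead) / 1000  # 25% headroom from above

# The worked example: 500 GB/day, 365-day retention, 3x replication, 4x compression
print(f"{hdfs_storage_tb(500, 365):.1f} TB")  # ~171.1 TB
```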
Per-Node Storage
| Use Case | Recommended Disk |
|---|---|
| General purpose | 6–12 × 4 TB HDD (JBOD, no RAID) |
| Low-latency reads | 2–4 × 2 TB SSD (NVMe preferred) |
| Archive/cold data | 6–12 × 8–16 TB HDD |
HDFS provides its own replication. Using RAID wastes capacity and adds complexity. Configure DataNode disks as JBOD (Just a Bunch of Disks).
Nodes needed:
```
DataNodes = ceil( Total HDFS Storage ÷ Usable storage per node )
```
With 12 × 4 TB HDD per node (48 TB raw, ~42 TB usable after OS and reserved space):
```
ceil( 171 TB ÷ 42 TB ) = 5 DataNodes minimum
```
Add a 20% buffer → 6–7 DataNodes for this example.
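Continuing the same sketch, the node count follows directly; the 42 TB usable-per-node figure and the 20% buffer are the assumptions from this example:

```python
import math

def datanodes_needed(required_tb, usable_tb_per_node=42.0, buffer=0.20):
    """Minimum DataNode count, plus a buffered recommendation."""
    minimum = math.ceil(required_tb / usable_tb_per_node)
    return minimum, math.ceil(minimum * (1 + buffer))

print(datanodes_needed(171))  # (5, 6): 5 minimum, 6 with the 20% buffer
```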
Compute Sizing
CPU
MapReduce and Spark are CPU-intensive for transformation workloads.
- Mapper slots: (node cores - 1 for OS) × 0.8
- Reducer slots: typically 40–50% of total mapper slots
For a 16-core node:
- YARN vCores: 14 (leave 2 for OS + DataNode)
- Map containers (2 cores each): 7 per node
- Reduce containers: ~3–4 per node
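The container math is easy to get wrong when node specs vary, so it helps to script it too. A sketch under the assumptions above (2 cores reserved for OS + DataNode, 2 vCores per map container, reducers at roughly half the mapper count):

```python
def yarn_cpu_layout(node_cores, reserved_cores=2, cores_per_map=2):
    """Split physical cores into YARN vCores and per-node container counts."""
    vcores = node_cores - reserved_cores
    map_containers = vcores // cores_per_map
    reduce_containers = round(map_containers * 0.5)  # 40-50% rule of thumb
    return vcores, map_containers, reduce_containers

print(yarn_cpu_layout(16))  # (14, 7, 4) for the 16-core example
```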
Memory
| Service | RAM Allocation | Notes |
|---|---|---|
| OS + DataNode | 8–16 GB | Fixed overhead per node |
| YARN NodeManager | Rest of RAM | e.g., 56 GB on 64 GB node |
| Map container | 2–4 GB | Tune per job type |
| Reduce container | 4–8 GB | Needs more for large shuffles |
Example (64 GB node):
- OS + DataNode: 8 GB
- YARN available: 56 GB
- Containers (4 GB each): 14 concurrent containers per node
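The same arithmetic applies to memory. A sketch deriving the YARN allocation from total node RAM, using the fixed 8 GB overhead and 4 GB containers from the example (these are planning numbers, not Hadoop configuration keys):

```python
def yarn_memory_layout(node_ram_gb, os_datanode_gb=8, container_gb=4):
    """YARN NodeManager memory and concurrent container count per node."""
    yarn_gb = node_ram_gb - os_datanode_gb
    return yarn_gb, yarn_gb // container_gb

print(yarn_memory_layout(64))  # (56, 14): 56 GB for YARN, 14 x 4 GB containers
```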
NameNode RAM
The NameNode keeps the entire namespace in heap. A rough estimate:
```
NameNode Heap ≈ (Total Files + Blocks) ÷ 1,000,000 × 1 GB
```
| File Count | Recommended Heap |
|---|---|
| < 10M files | 4 GB |
| 10–50M files | 8–16 GB |
| 50–200M files | 32–64 GB |
| > 200M files | Consider HDFS Federation |
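A direct transcription of the heap formula; the file and block counts are illustrative (pull real values from the JMX snippet in the Growth Planning section below), and the formula should be read as a conservative estimate to cross-check against the table:

```python
def namenode_heap_gb(total_files, total_blocks):
    """Rough NameNode heap: ~1 GB per million namespace objects."""
    return (total_files + total_blocks) / 1_000_000

print(namenode_heap_gb(2_000_000, 2_000_000))  # 4.0 GB for ~2M mostly single-block files
```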
Network Sizing
Hadoop is network-intensive during shuffle (MapReduce), replication, and bulk loads.
| Tier | Minimum | Recommended |
|---|---|---|
| Node NICs | 1 GbE | 10 GbE |
| Top-of-rack switch | 10 GbE | 25–100 GbE uplink |
| Cross-rack bandwidth | 10 GbE | 40 GbE+ |
Replication bandwidth: When a DataNode fails, HDFS re-replicates all of its blocks from the surviving replicas. If that traffic is bottlenecked on a single 1 GbE link (~125 MB/s), re-replicating 24 TB takes:
```
24,000 GB ÷ 125 MB/s ≈ 53 hours
```
With 10 GbE the same transfer takes ~5 hours. Always use at least 10 GbE for production.
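The recovery-time arithmetic generalizes to any node size and link speed. A sketch that, like the example, assumes re-replication is bottlenecked on a single link at line rate:

```python
def rereplication_hours(data_tb, link_gbps):
    """Hours to move data_tb through one link at line rate."""
    mb_per_s = link_gbps * 1000 / 8  # Gbit/s -> MB/s
    return data_tb * 1_000_000 / mb_per_s / 3600

print(f"{rereplication_hours(24, 1):.0f} h")   # ~53 h on 1 GbE
print(f"{rereplication_hours(24, 10):.1f} h")  # ~5.3 h on 10 GbE
```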
Cluster Topology Reference
Small cluster (< 10 nodes)
| Role | Count | Spec |
|---|---|---|
| NameNode + ResourceManager | 1 (or 2 for HA) | 32–64 GB RAM, 4–8 cores, 2× SSD for OS |
| DataNode + NodeManager | 3–8 | 64–128 GB RAM, 12–16 cores, 8× 4 TB HDD, 10 GbE |
| ZooKeeper (if HA) | 3 | 8 GB RAM, 4 cores, SSD |
Medium cluster (10–100 nodes)
| Role | Count | Spec |
|---|---|---|
| Active + Standby NameNode | 2 | 128–256 GB RAM, 16–24 cores |
| JournalNodes | 3 | 16 GB RAM, SSD for journal storage |
| Active + Standby ResourceManager | 2 | 64 GB RAM, 16 cores |
| DataNode + NodeManager | 10–80 | 128–256 GB RAM, 24–32 cores, 12× 4–8 TB HDD |
| ZooKeeper | 3–5 | 16 GB RAM, SSD |
| Edge/Gateway nodes | 2–4 | 32 GB RAM (user-facing, no data) |
Large cluster (100+ nodes)
At this scale, consider:
- HDFS Federation for namespace scaling
- Dedicated YARN queues per team (Capacity Scheduler)
- Separate networks for storage traffic vs management
- Rack-aware topology with 40+ GbE cross-rack
Growth Planning
Track these metrics monthly and project 12–24 months out:
```bash
# Current HDFS usage and remaining
hdfs dfsadmin -report | grep -E "DFS Used|DFS Remaining|DFS Used%"

# File and block count growth
curl -s "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState" \
  | python3 -c "import json,sys; d=json.load(sys.stdin)['beans'][0]; \
print('Files:', d['FilesTotal'], 'Blocks:', d['BlocksTotal'])"
```
Set alerts at 70% capacity and plan procurement before you reach 85%.
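To turn monthly measurements into a procurement date, a simple linear projection is usually enough to start with. A sketch; the usage, capacity, and growth figures here are hypothetical, and real plans should also account for new workloads and seasonality:

```python
def months_until(threshold_pct, used_tb, capacity_tb, growth_tb_per_month):
    """Months until HDFS usage crosses threshold_pct of current capacity."""
    target_tb = capacity_tb * threshold_pct / 100
    return max(0.0, (target_tb - used_tb) / growth_tb_per_month)

# Hypothetical cluster: 120 TB used of 250 TB, growing 6 TB/month
for pct in (70, 85):
    print(f"{pct}%: {months_until(pct, 120, 250, 6):.1f} months")
```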
Quick Sizing Checklist
- Estimated raw data volume and daily growth rate
- Retention policy defined (30 / 90 / 365 days)
- Replication factor chosen (3 for production)
- Compression format selected (ORC/Parquet + Snappy)
- NameNode RAM sized for expected file count
- DataNode disks configured as JBOD (no RAID)
- Network at 10 GbE minimum per node
- HA configured (NameNode + ResourceManager)
- ZooKeeper ensemble (3 or 5 nodes) provisioned
- 20–25% headroom added to all estimates