Capacity Planning
Sizing a Hadoop cluster correctly upfront saves costly re-architecture later. This guide covers how to estimate storage, compute, and network requirements for your workload.
Start With Your Data
Answer these questions first:
| Question | Why It Matters |
|---|---|
| How much raw data per day? | Drives storage growth rate |
| What is the retention period? | Total storage footprint |
| What replication factor? | Multiplies raw storage |
| What compression ratio? | Reduces storage (typically 2–5×) |
| Is data hot or cold? | Affects disk type (SSD vs HDD) |
| How many concurrent jobs? | Drives CPU and memory sizing |
Storage Sizing
Formula:
```
Total HDFS Storage = Raw Data per Day × Retention (days) × Replication Factor ÷ Compression Ratio
```
Example:
- Raw data ingested: 500 GB/day
- Retention: 365 days
- Replication factor: 3
- Compression ratio (ORC + Snappy): 4×
```
500 GB/day × 365 days × 3 ÷ 4 = 136,875 GB ≈ 136.9 TB
```
Add 25% overhead for intermediate files, temp space, and growth headroom → plan for ~171 TB of usable HDFS capacity.
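A quick way to sanity-check the arithmetic is to script the formula. A minimal sketch in Python; the function name and the decimal TB convention (1 TB = 1,000 GB) are choices made here, not part of the guide:

```python
def hdfs_storage_tb(raw_gb_per_day, retention_days,
                    replication=3, compression_ratio=4.0, overhead=0.25):
    """Required usable HDFS capacity in TB (decimal: 1 TB = 1,000 GB)."""
    stored_gb = raw_gb_per_day * retention_days * replication / compression_ratio
    return stored_gb * (1 + overhead) / 1000  # 25% headroom from above

# The worked example: 500 GB/day, 365-day retention, 3x replication, 4x compression
print(f"{hdfs_storage_tb(500, 365):.1f} TB")  # ~171.1 TB
```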
Per-Node Storage
| Use Case | Recommended Disk |
|---|---|
| General purpose | 6–12 × 4 TB HDD (JBOD, no RAID) |
| Low-latency reads | 2–4 × 2 TB SSD (NVMe preferred) |
| Archive/cold data | 6–12 × 8–16 TB HDD |
HDFS provides its own replication. Using RAID wastes capacity and adds complexity. Configure DataNode disks as JBOD (Just a Bunch of Disks).
Nodes needed:
```
DataNodes = ceil( Total HDFS Storage ÷ Usable storage per node )
```
With 12 × 4 TB HDD per node (48 TB raw, ~42 TB usable after OS and reserved space):
```
ceil( 171 TB ÷ 42 TB ) = 5 DataNodes minimum
```
Add a 20% buffer → 6–7 DataNodes for this example.
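Continuing the same sketch, the node count follows directly; the 42 TB usable-per-node figure and the 20% buffer are the assumptions from this example:

```python
import math

def datanodes_needed(required_tb, usable_tb_per_node=42.0, buffer=0.20):
    """Minimum DataNode count, plus a buffered recommendation."""
    minimum = math.ceil(required_tb / usable_tb_per_node)
    return minimum, math.ceil(minimum * (1 + buffer))

print(datanodes_needed(171))  # (5, 6): 5 minimum, 6 with the 20% buffer
```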
Compute Sizing
CPU
MapReduce and Spark are CPU-intensive for transformation workloads.
- Mapper slots: (node cores - 1 for OS) × 0.8
- Reducer slots: typically 40–50% of total mapper slots
For a 16-core node:
- YARN vCores: 14 (leave 2 for OS + DataNode)
- Map containers (2 cores each): 7 per node
- Reduce containers: ~3–4 per node
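The container math is easy to get wrong when node specs vary, so it helps to script it too. A sketch under the assumptions above (2 cores reserved for OS + DataNode, 2 vCores per map container, reducers at roughly half the mapper count):

```python
def yarn_cpu_layout(node_cores, reserved_cores=2, cores_per_map=2):
    """Split physical cores into YARN vCores and per-node container counts."""
    vcores = node_cores - reserved_cores
    map_containers = vcores // cores_per_map
    reduce_containers = round(map_containers * 0.5)  # 40-50% rule of thumb
    return vcores, map_containers, reduce_containers

print(yarn_cpu_layout(16))  # (14, 7, 4) for the 16-core example
```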
Memory
| Service | RAM Allocation | Notes |
|---|---|---|
| OS + DataNode | 8–16 GB | Fixed overhead per node |
| YARN NodeManager | Rest of RAM | e.g., 56 GB on 64 GB node |
| Map container | 2–4 GB | Tune per job type |
| Reduce container | 4–8 GB | Needs more for large shuffles |
Example (64 GB node):
- OS + DataNode: 8 GB
- YARN available: 56 GB
- Containers (4 GB each): 14 concurrent containers per node
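The same arithmetic applies to memory. A sketch deriving the YARN allocation from total node RAM, using the fixed 8 GB overhead and 4 GB containers from the example (these are planning numbers, not Hadoop configuration keys):

```python
def yarn_memory_layout(node_ram_gb, os_datanode_gb=8, container_gb=4):
    """YARN NodeManager memory and concurrent container count per node."""
    yarn_gb = node_ram_gb - os_datanode_gb
    return yarn_gb, yarn_gb // container_gb

print(yarn_memory_layout(64))  # (56, 14): 56 GB for YARN, 14 x 4 GB containers
```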
NameNode RAM
The NameNode keeps the entire namespace in heap. A rough estimate:
```
NameNode Heap ≈ (Total Files + Blocks) ÷ 1,000,000 × 1 GB
```
| File Count | Recommended Heap |
|---|---|
| < 10M files | 4 GB |
| 10–50M files | 8–16 GB |
| 50–200M files | 32–64 GB |
| > 200M files | Consider HDFS Federation |
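A direct transcription of the heap formula; the file and block counts are illustrative (pull real values from the JMX snippet in the Growth Planning section below), and the formula should be read as a conservative estimate to cross-check against the table:

```python
def namenode_heap_gb(total_files, total_blocks):
    """Rough NameNode heap: ~1 GB per million namespace objects."""
    return (total_files + total_blocks) / 1_000_000

print(namenode_heap_gb(2_000_000, 2_000_000))  # 4.0 GB for ~2M mostly single-block files
```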
Network Sizing
Hadoop is network-intensive during shuffle (MapReduce), replication, and bulk loads.
| Tier | Minimum | Recommended |
|---|---|---|
| Node NICs | 1 GbE | 10 GbE |
| Top-of-rack switch | 10 GbE | 25–100 GbE uplink |
| Cross-rack bandwidth | 10 GbE | 40 GbE+ |
Replication bandwidth: When a DataNode fails, HDFS re-replicates all of its blocks from the surviving replicas. If that traffic is bottlenecked on a single 1 GbE link (~125 MB/s), re-replicating 24 TB takes:
```
24,000 GB ÷ 125 MB/s ≈ 53 hours
```
With 10 GbE the same transfer takes ~5 hours. Always use at least 10 GbE for production.
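The recovery-time arithmetic generalizes to any node size and link speed. A sketch that, like the example, assumes re-replication is bottlenecked on a single link at line rate:

```python
def rereplication_hours(data_tb, link_gbps):
    """Hours to move data_tb through one link at line rate."""
    mb_per_s = link_gbps * 1000 / 8  # Gbit/s -> MB/s
    return data_tb * 1_000_000 / mb_per_s / 3600

print(f"{rereplication_hours(24, 1):.0f} h")   # ~53 h on 1 GbE
print(f"{rereplication_hours(24, 10):.1f} h")  # ~5.3 h on 10 GbE
```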
Cluster Topology Reference
Small cluster (< 10 nodes)
| Role | Count | Spec |
|---|---|---|
| NameNode + ResourceManager | 1 (or 2 for HA) | 32–64 GB RAM, 4–8 cores, 2× SSD for OS |
| DataNode + NodeManager | 3–8 | 64–128 GB RAM, 12–16 cores, 8× 4 TB HDD, 10 GbE |
| ZooKeeper (if HA) | 3 | 8 GB RAM, 4 cores, SSD |
Medium cluster (10–100 nodes)
| Role | Count | Spec |
|---|---|---|
| Active + Standby NameNode | 2 | 128–256 GB RAM, 16–24 cores |
| JournalNodes | 3 | 16 GB RAM, SSD for journal storage |
| Active + Standby ResourceManager | 2 | 64 GB RAM, 16 cores |
| DataNode + NodeManager | 10–80 | 128–256 GB RAM, 24–32 cores, 12× 4–8 TB HDD |
| ZooKeeper | 3–5 | 16 GB RAM, SSD |
| Edge/Gateway nodes | 2–4 | 32 GB RAM (user-facing, no data) |
Large cluster (100+ nodes)
At this scale, consider:
- HDFS Federation for namespace scaling
- Dedicated YARN queues per team (Capacity Scheduler)
- Separate networks for storage traffic vs management
- Rack-aware topology with 40+ GbE cross-rack
Growth Planning
Track these metrics monthly and project 12–24 months out:
```bash
# Current HDFS usage and remaining
hdfs dfsadmin -report | grep -E "DFS Used|DFS Remaining|DFS Used%"

# File and block count growth
curl -s "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState" \
  | python3 -c "import json,sys; d=json.load(sys.stdin)['beans'][0]; \
print('Files:', d['FilesTotal'], 'Blocks:', d['BlocksTotal'])"
```
Set alerts at 70% capacity and plan procurement before you reach 85%.
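To turn monthly measurements into a procurement date, a simple linear projection is usually enough to start with. A sketch; the usage, capacity, and growth figures here are hypothetical, and real plans should also account for new workloads and seasonality:

```python
def months_until(threshold_pct, used_tb, capacity_tb, growth_tb_per_month):
    """Months until HDFS usage crosses threshold_pct of current capacity."""
    target_tb = capacity_tb * threshold_pct / 100
    return max(0.0, (target_tb - used_tb) / growth_tb_per_month)

# Hypothetical cluster: 120 TB used of 250 TB, growing 6 TB/month
for pct in (70, 85):
    print(f"{pct}%: {months_until(pct, 120, 250, 6):.1f} months")
```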
Quick Sizing Checklist
- Estimated raw data volume and daily growth rate
- Retention policy defined (30 / 90 / 365 days)
- Replication factor chosen (3 for production)
- Compression format selected (ORC/Parquet + Snappy)
- NameNode RAM sized for expected file count
- DataNode disks configured as JBOD (no RAID)
- Network at 10 GbE minimum per node
- HA configured (NameNode + ResourceManager)
- ZooKeeper ensemble (3 or 5 nodes) provisioned
- 20–25% headroom added to all estimates