Capacity Planning

Sizing a Hadoop cluster correctly upfront saves costly re-architecture later. This guide covers how to estimate storage, compute, and network requirements for your workload.

Start With Your Data

Answer these questions first:

| Question | Why It Matters |
| --- | --- |
| How much raw data per day? | Drives storage growth rate |
| What is the retention period? | Total storage footprint |
| What replication factor? | Multiplies raw storage |
| What compression ratio? | Reduces storage (typically 2–5×) |
| Is data hot or cold? | Affects disk type (SSD vs HDD) |
| How many concurrent jobs? | Drives CPU and memory sizing |

Storage Sizing

Formula:

Total HDFS Storage = Raw Data × Retention (days) × Replication Factor ÷ Compression Ratio

Example:

  • Raw data ingested: 500 GB/day
  • Retention: 365 days
  • Replication factor: 3
  • Compression ratio (ORC + Snappy): 4×

500 GB × 365 days × 3 ÷ 4 = 136,875 GB ≈ 136.9 TB

Add 25% overhead for intermediate files, temp space, and growth headroom → ~171 TB usable.
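
The same arithmetic as a quick Python sketch, using the example values above (substitute your own workload figures):

raw_gb_per_day = 500      # raw data ingested per day
retention_days = 365      # retention period
replication = 3           # HDFS replication factor
compression = 4           # ORC + Snappy, ~4x

hdfs_gb = raw_gb_per_day * retention_days * replication / compression
usable_gb = hdfs_gb * 1.25   # +25% for temp space, intermediates, growth

print(f"HDFS storage: {hdfs_gb / 1000:.1f} TB")     # ~136.9 TB
print(f"With headroom: {usable_gb / 1000:.1f} TB")  # ~171.1 TB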

Per-Node Storage

| Use Case | Recommended Disk |
| --- | --- |
| General purpose | 6–12 × 4 TB HDD (JBOD, no RAID) |
| Low-latency reads | 2–4 × 2 TB SSD (NVMe preferred) |
| Archive/cold data | 6–12 × 8–16 TB HDD |
No RAID

HDFS provides its own replication. Using RAID wastes capacity and adds complexity. Configure DataNode disks as JBOD (Just a Bunch of Disks).

Nodes needed:

DataNodes = ceil( Total HDFS Storage ÷ Usable storage per node )

With 12 × 4 TB HDD per node (48 TB raw, ~42 TB usable after OS and non-HDFS reserved space):

ceil( 171 TB ÷ 42 TB ) = 5 DataNodes minimum

Add 20% buffer → 6–7 DataNodes for this example.
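
In Python, with the per-node figures above:

import math

total_hdfs_tb = 171        # from the storage sizing example
usable_per_node_tb = 42    # 12 x 4 TB JBOD minus OS/reserved space

nodes = math.ceil(total_hdfs_tb / usable_per_node_tb)  # 5
with_buffer = math.ceil(nodes * 1.2)                   # 6
print(nodes, with_buffer)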

Compute Sizing

CPU

MapReduce and Spark are CPU-intensive for transformation workloads.

  • Map slots: (node cores − cores reserved for OS/DataNode) × 0.8
  • Reduce slots: typically 40–50% of total map slots

For a 16-core node:

  • YARN vCores: 14 (leave 2 for OS + DataNode)
  • Map containers (2 cores each): 7 per node
  • Reduce containers: ~3–4 per node
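
The container math as a short sketch (the 2-vCores-per-map-container split is this example's assumption; tune it per workload):

node_cores = 16
yarn_vcores = node_cores - 2                      # reserve 2 for OS + DataNode
map_containers = yarn_vcores // 2                 # 2 vCores per map container
reduce_containers = round(map_containers * 0.45)  # ~40-50% of map capacity

print(yarn_vcores, map_containers, reduce_containers)  # 14 7 3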

Memory

| Service | RAM Allocation | Notes |
| --- | --- | --- |
| OS + DataNode | 8–16 GB | Fixed overhead per node |
| YARN NodeManager | Rest of RAM | e.g., 56 GB on a 64 GB node |
| Map container | 2–4 GB | Tune per job type |
| Reduce container | 4–8 GB | Needs more for large shuffles |

Example (64 GB node):

  • OS + DataNode: 8 GB
  • YARN available: 56 GB
  • Containers (4 GB each): 14 concurrent containers per node
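
A sketch that turns this split into the YARN settings it implies. The property names are the standard yarn-site.xml keys; the printed values are illustrative starting points, not tuned defaults:

node_ram_gb = 64
os_overhead_gb = 8                            # OS + DataNode
nm_memory_gb = node_ram_gb - os_overhead_gb   # 56 GB left for containers
container_gb = 4

print("concurrent containers:", nm_memory_gb // container_gb)  # 14

# Suggested starting values for yarn-site.xml (verify against your distro):
print("yarn.nodemanager.resource.memory-mb =", nm_memory_gb * 1024)  # 57344
print("yarn.scheduler.maximum-allocation-mb =", 8 * 1024)  # room for large reducers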

NameNode RAM

The NameNode keeps the entire filesystem namespace in heap memory. A rough estimate:

NameNode Heap ≈ (Total Files + Blocks) ÷ 1,000,000 × 1 GB

| File Count | Recommended Heap |
| --- | --- |
| < 10M files | 4 GB |
| 10–50M files | 8–16 GB |
| 50–200M files | 32–64 GB |
| > 200M files | Consider HDFS Federation |
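
Worked through the formula, a small namespace of ~2M files lands on the first row of the table; the block count here is an assumption (~1 block per file, typical when files fit in a single block):

files_total = 2_000_000
blocks_total = 2_000_000   # assumed ~1 block per file

# Rough rule from above: ~1 GB of heap per million namespace objects
heap_gb = (files_total + blocks_total) / 1_000_000
print(f"Estimated NameNode heap: ~{heap_gb:.0f} GB")   # ~4 GB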

Network Sizing

Hadoop is network-intensive during shuffle (MapReduce), replication, and bulk loads.

| Tier | Minimum | Recommended |
| --- | --- | --- |
| Node NICs | 1 GbE | 10 GbE |
| Top-of-rack switch | 10 GbE | 25–100 GbE uplink |
| Cross-rack bandwidth | 10 GbE | 40 GbE+ |

Replication bandwidth: When a DataNode fails, HDFS re-replicates all its blocks. On a node holding 24 TB of data with a 1 GbE network, recovery takes:

24,000 GB ÷ 125 MB/s ≈ 53 hours

With 10 GbE: ~5 hours. Always use at least 10 GbE for production.
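
The recovery estimate, generalized to other link speeds. This uses the same simplified line-rate model as above; real re-replication is parallelized across many DataNodes, so treat it as a worst-case single-link bound:

data_tb = 24   # blocks stored on the failed DataNode
for gbit in (1, 10, 25):
    mb_per_s = gbit * 125                      # 1 Gb/s ~= 125 MB/s line rate
    hours = data_tb * 1_000_000 / mb_per_s / 3600
    print(f"{gbit} GbE: ~{hours:.0f} h")       # 1: ~53 h, 10: ~5 h, 25: ~2 h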

Cluster Topology Reference

Small cluster (< 10 nodes)

| Role | Count | Spec |
| --- | --- | --- |
| NameNode + ResourceManager | 1 (or 2 for HA) | 32–64 GB RAM, 4–8 cores, 2× SSD for OS |
| DataNode + NodeManager | 3–8 | 64–128 GB RAM, 12–16 cores, 8× 4 TB HDD, 10 GbE |
| ZooKeeper (if HA) | 3 | 8 GB RAM, 4 cores, SSD |

Medium cluster (10–100 nodes)

| Role | Count | Spec |
| --- | --- | --- |
| Active + Standby NameNode | 2 | 128–256 GB RAM, 16–24 cores |
| JournalNodes | 3 | 16 GB RAM, SSD for journal storage |
| Active + Standby ResourceManager | 2 | 64 GB RAM, 16 cores |
| DataNode + NodeManager | 10–80 | 128–256 GB RAM, 24–32 cores, 12× 4–8 TB HDD |
| ZooKeeper | 3–5 | 16 GB RAM, SSD |
| Edge/Gateway nodes | 2–4 | 32 GB RAM (user-facing, no data) |

Large cluster (100+ nodes)

At this scale, consider:

  • HDFS Federation for namespace scaling
  • Dedicated YARN queues per team (Capacity Scheduler)
  • Separate networks for storage traffic and management traffic
  • Rack-aware topology with 40+ GbE cross-rack

Growth Planning

Track these metrics monthly and project 12–24 months out:

# Current HDFS usage and remaining
hdfs dfsadmin -report | grep -E "DFS Used|DFS Remaining|DFS Used%"

# File and block count growth
curl -s "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState" \
| python3 -c "import json,sys; d=json.load(sys.stdin)['beans'][0]; \
print('Files:', d['FilesTotal'], 'Blocks:', d['BlocksTotal'])"

Set alerts at 70% capacity and plan procurement before you reach 85%.
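
A minimal projection sketch, assuming roughly linear growth; the capacity and usage figures here are placeholders, so feed it the numbers you collect from the commands above:

capacity_tb = 250          # total configured HDFS capacity (example)
used_tb = 140              # current DFS Used (example)
growth_tb_per_month = 6.0  # observed monthly growth (example)

for threshold in (0.70, 0.85):
    months = (capacity_tb * threshold - used_tb) / growth_tb_per_month
    print(f"{threshold:.0%} capacity in ~{months:.0f} months")  # ~6 and ~12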

Quick Sizing Checklist

  • Estimated raw data volume and daily growth rate
  • Retention policy defined (30 / 90 / 365 days)
  • Replication factor chosen (3 for production)
  • Compression format selected (ORC/Parquet + Snappy)
  • NameNode RAM sized for expected file count
  • DataNode disks configured as JBOD (no RAID)
  • Network at 10 GbE minimum per node
  • HA configured (NameNode + ResourceManager)
  • ZooKeeper ensemble (3 or 5 nodes) provisioned
  • 20–25% headroom added to all estimates