HDFS Deep Dive
HDFS (Hadoop Distributed File System) is the primary storage layer of Hadoop. It is designed to run on commodity hardware and to reliably store very large files, typically gigabytes to terabytes each, on clusters that scale to petabytes of total data.
Architecture
HDFS uses a master/worker architecture:
- NameNode — Manages the filesystem namespace (file tree and metadata). There is one active NameNode (plus an optional Standby for HA).
- DataNode — Stores actual data blocks. There are many DataNodes spread across the cluster.
- Secondary NameNode — Periodically merges the NameNode's edit log with the filesystem image (not a hot standby).
Client
├─► NameNode (metadata: where is block X?)
└─► DataNode 1, DataNode 2, DataNode 3 (actual data blocks)
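You can inspect this topology from the command line. As a quick sketch (assuming a running cluster and a configured client; the report command may require HDFS superuser privileges), the following shows the configured NameNode hosts and the state of each DataNode:
# Show the configured NameNode host(s)
hdfs getconf -namenodes
# Summarize cluster capacity and list live/dead DataNodes
hdfs dfsadmin -report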
How Replication Works
By default, each block is replicated 3 times across different DataNodes, with rack-aware placement so that the replicas do not all share one rack. If a DataNode fails (stops sending heartbeats), the NameNode detects the under-replicated blocks and schedules new copies from the surviving replicas onto other DataNodes.
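A quick way to see replication in action (a sketch; the path below is just an example file you have already uploaded) is to change a file's replication factor and then ask fsck where each replica lives:
# Raise the replication factor of one file to 4 and wait until it is applied
hdfs dfs -setrep -w 4 /user/hadoop/data/localfile.txt
# Show which DataNodes and racks hold each replica of the file's blocks
hdfs fsck /user/hadoop/data/localfile.txt -files -blocks -locations -racks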
Basic HDFS Commands
# List root directory
hdfs dfs -ls /
# Create a directory
hdfs dfs -mkdir -p /user/hadoop/data
# Upload a local file
hdfs dfs -put localfile.txt /user/hadoop/data/
# Download a file from HDFS
hdfs dfs -get /user/hadoop/data/localfile.txt ./output.txt
# View file contents
hdfs dfs -cat /user/hadoop/data/localfile.txt
# Check disk usage
hdfs dfs -du -h /user/hadoop/
# Delete a file
hdfs dfs -rm /user/hadoop/data/localfile.txt
# Check filesystem health
hdfs fsck / -files -blocks
Block Size
The default HDFS block size is 128 MB (configurable cluster-wide or per file via dfs.blocksize). Larger blocks mean fewer blocks per file, which reduces the metadata the NameNode must keep in memory and keeps sequential reads dominated by data transfer rather than per-block seek and connection overhead.
# Upload a file with a 256 MB block size (one-off override of dfs.blocksize)
hdfs dfs -D dfs.blocksize=256m -put bigfile.csv /data/
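To confirm what was actually applied (a sketch; /data/bigfile.csv is just the example path from above), you can query the cluster default and the per-file value:
# Print the cluster-wide default block size, in bytes
hdfs getconf -confKey dfs.blocksize
# Print block size, replication factor, length, and name for one file
hdfs dfs -stat "%o %r %b %n" /data/bigfile.csv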
Safe Mode
After startup, the NameNode enters safe mode while DataNodes report their blocks. The filesystem is read-only during this time: writes, deletions, and replication are blocked until a configurable percentage of blocks has reached its minimum replication, after which the NameNode leaves safe mode automatically.
# Check whether the NameNode is currently in safe mode
hdfs dfsadmin -safemode get
# Force the NameNode out of safe mode (use with care)
hdfs dfsadmin -safemode leave
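In startup scripts it is usually safer to wait for the NameNode to leave safe mode on its own rather than forcing it out. A minimal sketch (the mkdir is just a placeholder for whatever work follows):
# Block until the NameNode exits safe mode, then proceed
hdfs dfsadmin -safemode wait
hdfs dfs -mkdir -p /user/hadoop/data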
Next Steps
Move on to MapReduce Fundamentals to learn how to process the data you store.