Common HDFS Errors and Fixes
A reference guide for the most frequently encountered HDFS errors in production, with root causes and step-by-step fixes.
1. Safe Mode — Cluster Won't Accept Writes
Symptom:
org.apache.hadoop.hdfs.server.namenode.SafeModeException:
Cannot create file. Name node is in safe mode.
Cause: The NameNode enters safe mode on startup until enough DataNodes check in and block replication reaches the threshold (default: 99.9% of blocks replicated). Also triggered manually or by very low disk space.
Fix:
# Check current safe mode status
hdfs dfsadmin -safemode get
# Wait for automatic exit (normal startup)
hdfs dfsadmin -safemode wait
# Force exit safe mode (only after confirming cluster is healthy)
hdfs dfsadmin -safemode leave
# If caused by low disk — free space first, then:
hdfs dfsadmin -safemode leave
To see why safe mode was entered:
curl -s http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState \
| python3 -m json.tool | grep -i safe
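For scripted monitoring, the JMX response can be parsed instead of grepped. A minimal sketch; the field names mirror the real FSNamesystemState bean, but the sample payload below is invented:

```python
import json

# Invented sample of an FSNamesystemState JMX response. FSState is
# "safeMode" while the NameNode is in safe mode, "Operational" otherwise.
SAMPLE_JMX = '''{"beans": [{
    "name": "Hadoop:service=NameNode,name=FSNamesystemState",
    "FSState": "safeMode",
    "NumLiveDataNodes": 2,
    "BlocksTotal": 51204}]}'''

def safe_mode_status(jmx_json):
    """Extract the safe-mode-relevant fields from a JMX dump."""
    bean = json.loads(jmx_json)["beans"][0]
    return {
        "in_safe_mode": bean["FSState"] == "safeMode",
        "live_datanodes": bean["NumLiveDataNodes"],
        "blocks_total": bean["BlocksTotal"],
    }

print(safe_mode_status(SAMPLE_JMX))
```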
2. Under-Replicated Blocks
Symptom:
hdfs fsck / | grep -iE "under[- ]replicated"
# FSCK started ... Under-replicated blocks: 142
Cause: One or more DataNodes went offline, leaving some blocks with fewer copies than dfs.replication.
Fix:
# Check which DataNodes are dead
hdfs dfsadmin -report | grep -A5 "Dead datanodes"
# If the dead node is coming back, wait — HDFS re-replicates
# automatically once the NameNode marks the node dead
# (controlled by the heartbeat recheck interval, ~10.5 minutes by default)
# Dump block-level replication state for inspection
# (metasave writes the file under the NameNode's hadoop.log.dir, not /tmp)
hdfs dfsadmin -metasave meta.txt
# Check replication progress
hdfs fsck / | tail -20
If the DataNode is permanently lost:
# Decommission the dead node (graceful removal)
# Add to dfs.hosts.exclude, then:
hdfs dfsadmin -refreshNodes
# Wait for replication to complete, then remove node from cluster
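To gauge how much copy work the NameNode still has ahead of it, a small sketch; the per-block replica counts are hypothetical inputs (for example scraped from metasave output or monitoring):

```python
# Count how many additional block copies the NameNode must create to
# bring every block back up to the replication target (dfs.replication,
# default 3). The replica counts below are invented for illustration.
def replication_deficit(replica_counts, target=3):
    return sum(max(0, target - n) for n in replica_counts)

# Two under-replicated blocks: one missing 1 copy, one missing 2
print(replication_deficit([3, 3, 2, 1, 3]))  # 3
```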
3. Corrupt Blocks
Symptom:
hdfs fsck / | grep "corrupt"
# CORRUPT: /data/sales/part-00003.snappy.orc
Cause: A DataNode disk failure corrupted a block, and all replicas of that block are bad (or the block count fell to zero).
Fix:
# Find all corrupt files
hdfs fsck / -list-corruptfileblocks
# Option 1: Delete corrupt files (if data can be regenerated)
hdfs fsck / -delete
# Option 2: Move corrupt files to /lost+found for investigation
hdfs fsck / -move
# Option 3: Harden the write pipeline against failing DataNodes
# Set in hdfs-site.xml (client side):
# dfs.client.block.write.replace-datanode-on-failure.policy = ALWAYS
After fixing, verify:
hdfs fsck / | grep -E "corrupt|missing|healthy"
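When the corrupt list needs to feed a re-ingestion or selective-delete job, the `-list-corruptfileblocks` output can be parsed. The exact layout varies by Hadoop version, so the sample below is an approximation:

```python
# Invented approximation of `hdfs fsck / -list-corruptfileblocks` output;
# real output pairs a block ID with the file path that contains it.
SAMPLE = """\
The list of corrupt blocks under path '/':
blk_1073741825\t/data/sales/part-00003.snappy.orc
blk_1073741901\t/data/sales/part-00017.snappy.orc
"""

def corrupt_paths(output):
    """Collect the unique file paths named in the corrupt-block listing."""
    return sorted({line.split()[-1]
                   for line in output.splitlines()
                   if line.startswith("blk_")})

print(corrupt_paths(SAMPLE))
```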
4. DataNode Disk Failure
Symptom:
WARN datanode.DataNode: Removed volume: /data/disk3
ERROR datanode.DataNode: Disk error on DataNode: Failed to place block
Cause: A physical disk in a DataNode failed. By default the DataNode shuts down on any volume failure; dfs.datanode.failed.volumes.tolerated controls how many failed volumes it tolerates before shutting down.
Fix:
# Check DataNode logs on the affected host
tail -200 /var/log/hadoop/hadoop-hdfs-datanode-*.log | grep -iE "error|failed|removed"
# Check current volume failures via JMX
curl "http://datanode-host:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeInfo" \
| python3 -m json.tool | grep -i volume
# After replacing the disk, re-add it to hdfs-site.xml:
# dfs.datanode.data.dir = /data/disk1,/data/disk2,/data/disk3,/data/disk4
# Restart the DataNode
hdfs --daemon stop datanode
hdfs --daemon start datanode
Allow tolerated failures (so DataNode stays alive with one bad disk):
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>1</value>
</property>
5. NameNode Out of Memory (OOM)
Symptom:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hdfs.server.namenode.FSDirectory
Cause: NameNode heap is too small for the number of files and blocks in the namespace (~150 bytes per file+block object).
Fix:
Increase NameNode heap in hadoop-env.sh (Hadoop 3 uses HDFS_NAMENODE_OPTS; Hadoop 2 used HADOOP_NAMENODE_OPTS):
export HDFS_NAMENODE_OPTS="-Xms16g -Xmx16g -XX:+UseG1GC \
-XX:G1HeapRegionSize=32m -XX:InitiatingHeapOccupancyPercent=35"
Check current file and block count:
curl -s "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState" \
| python3 -c "import json,sys; b=json.load(sys.stdin)['beans'][0]; \
print('Files:', b['FilesTotal'], 'Blocks:', b['BlocksTotal'])"
Rule of thumb: 1 GB heap per 1 million files+blocks.
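The rule of thumb translates directly into a sizing helper. A minimal sketch; the 4 GB floor is an assumption, not an official minimum:

```python
import math

# Apply the rule of thumb above: ~1 GB of NameNode heap per 1 million
# namespace objects (files + blocks). The 4 GB floor is an assumption
# added here so tiny namespaces still get a workable heap.
def namenode_heap_gb(files, blocks):
    objects = files + blocks
    return max(4, math.ceil(objects / 1_000_000))

# 50M files + 60M blocks = 110M objects -> 110 GB recommended heap
print(namenode_heap_gb(50_000_000, 60_000_000))  # 110
```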
Long-term: clean up small files (they bloat the namespace), and consider HDFS Federation if namespace exceeds 200M files.
6. Too Many Small Files
Symptom:
- NameNode memory grows without bound
- MapReduce jobs spawn thousands of tiny map tasks
- hdfs fsck / reports millions of blocks for modest data sizes
Cause: Each file and each block consumes ~150 bytes of NameNode RAM regardless of size. A directory with 10M files of 1 KB each consumes as much NameNode memory as 10M files of 128 MB each, while holding a tiny fraction of the data.
Fix:
# Identify directories with excessive small files
# (column 2 = file count; omit -h so the numbers sort correctly)
hdfs dfs -count /data/logs/* | sort -k2 -rn | head -20
# Compact small files with Hadoop Archive (HAR)
hadoop archive -archiveName logs-2026-04.har \
-p /data/logs 2026-04 \
/data/archives/
# Verify archive
hdfs dfs -ls har:///data/archives/logs-2026-04.har/
# Remove original after verification
hdfs dfs -rm -r /data/logs/2026-04
Prevention — configure output merging in Hive:
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000; -- 256 MB target file size
SET hive.merge.smallfiles.avgsize=64000000;
SET hive.merge.tezfiles=true;      -- also needed when running on Tez
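The effect of these merge settings can be sketched with a simple first-fit packing over file sizes. This is illustrative only; Hive's actual merge job works differently:

```python
# Pack small-file sizes into bins of at most `target` bytes, mimicking
# the goal of hive.merge.size.per.task (256 MB). First-fit-decreasing
# is a sketch, not Hive's real algorithm.
TARGET = 256_000_000

def merged_file_count(sizes_bytes, target=TARGET):
    """Return how many merged output files the inputs would collapse into."""
    bins = []
    for size in sorted(sizes_bytes, reverse=True):
        for i, used in enumerate(bins):
            if used + size <= target:
                bins[i] += size
                break
        else:
            bins.append(size)
    return len(bins)

# 10,000 files of 1 KB (10 MB total) collapse into a single merged file
print(merged_file_count([1_000] * 10_000))  # 1
```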
7. Connection Refused / NameNode Unreachable
Symptom:
java.net.ConnectException: Connection refused to host: namenode:8020
Cause: NameNode process is down, wrong hostname/port in core-site.xml, or firewall blocking.
Fix:
# Check if NameNode process is running
ps aux | grep NameNode
# Check NameNode logs
tail -100 /var/log/hadoop/hadoop-hdfs-namenode-*.log
# Verify configuration
hdfs getconf -confKey fs.defaultFS
# Test connectivity
nc -zv namenode 8020
# Restart NameNode if it crashed
hdfs --daemon start namenode
# For HA clusters — check which NameNode is active
hdfs haadmin -getAllServiceState
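Where netcat isn't installed, the `nc -zv` check can be reproduced in a few lines of Python (the hostname and port mirror the example above):

```python
import socket

# Attempt a TCP connect with a short timeout and report reachability,
# equivalent to `nc -zv host port`.
def port_open(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("namenode", 8020) should be True on a healthy cluster
```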
8. HDFS Quota Exceeded
Symptom:
org.apache.hadoop.hdfs.protocol.DSQuotaExceededException:
The DiskSpace quota of /user/alice is exceeded
Cause: A namespace quota (file count) or space quota (bytes) was set on a directory and has been reached.
Fix:
# Check quota on a directory
hdfs dfs -count -q -h /user/alice
# Output: QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE DIR_COUNT FILE_COUNT CONTENT_SIZE PATH
# Increase namespace quota (max files)
hdfs dfsadmin -setQuota 1000000 /user/alice
# Increase space quota (bytes)
hdfs dfsadmin -setSpaceQuota 5t /user/alice # 5 TB (space quotas count replicated bytes)
# Remove quota entirely
hdfs dfsadmin -clrQuota /user/alice
hdfs dfsadmin -clrSpaceQuota /user/alice
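Scripts that report or enforce quotas can parse the `-count -q` line directly. Run without `-h` so every column is a plain number; the sample values below are invented:

```python
# Field order matches the `hdfs dfs -count -q` header shown above.
FIELDS = ["quota", "remaining_quota", "space_quota", "remaining_space_quota",
          "dir_count", "file_count", "content_size", "path"]

def parse_count_q(line):
    """Split one -count -q output line into named fields (as strings)."""
    return dict(zip(FIELDS, line.split()))

sample = "1000000 12562 5497558138880 131941395333 341 987438 1788954214400 /user/alice"
info = parse_count_q(sample)
print(info["remaining_quota"], info["path"])  # 12562 /user/alice
```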
9. Slow or Stuck MapReduce Jobs
Symptom: Jobs run much slower than expected or hang at 99% for a long time.
Cause and Fix by scenario:
| Scenario | Diagnosis | Fix |
|---|---|---|
| Reducer bottleneck | 1 reducer handling huge data skew | Add a DISTRIBUTE BY key or use a combiner |
| Straggler task | One slow task holds up the whole job | Enable mapreduce.map.speculative and mapreduce.reduce.speculative |
| GC pressure | Excessive GC in task logs | Increase container heap; switch to G1GC |
| Network shuffle | Slow cross-rack shuffle | Enable map output compression (Snappy) |
| YARN preemption | Jobs losing containers | Tune queue capacity; check yarn.scheduler.capacity |
# Find the slow task in a running job
yarn logs -applicationId application_XXXX_YYYY | grep -i "slow\|lagging"
# Check container resource usage
yarn application -status application_XXXX_YYYY
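For the reducer-bottleneck row, a quick skew check on reducer input sizes, which you would pull from the job counters; the numbers below are invented:

```python
# Compare the largest reducer's input to the mean across reducers.
# A ratio well above 1 indicates a hot key concentrating data on
# one reducer. Input sizes here are invented illustration values.
def skew_ratio(reducer_input_bytes):
    mean = sum(reducer_input_bytes) / len(reducer_input_bytes)
    return max(reducer_input_bytes) / mean

sizes = [100, 110, 95, 105, 2000]  # one hot key dominates
print(round(skew_ratio(sizes), 1))  # 4.1
```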
10. hdfs fsck Quick Reference
# Full filesystem check
hdfs fsck /
# Check specific path, show file details
hdfs fsck /data/warehouse -files -blocks -locations
# List only corrupt files
hdfs fsck / -list-corruptfileblocks
# Delete corrupt files automatically
hdfs fsck / -delete
# Move corrupt files to /lost+found
hdfs fsck / -move
# Include snapshot data in the check
hdfs fsck / -includeSnapshots
Healthy output ends with:
The filesystem under path '/' is HEALTHY
Unhealthy indicators to watch:
Under replicated blocks: X ← DataNode outage, will self-heal
Mis-replicated blocks: X ← Rack awareness misconfiguration
Corrupt blocks: X ← Data loss risk — act immediately
Missing blocks: X ← Data loss — act immediately
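The four indicators above can be folded into a small triage helper. The fsck wording varies slightly across Hadoop versions, so the patterns here are an approximation:

```python
import re

# (pattern, action) pairs mirroring the indicator list above.
CHECKS = [
    (r"Corrupt blocks:\s+(\d+)", "data loss risk, act immediately"),
    (r"Missing blocks:\s+(\d+)", "data loss, act immediately"),
    (r"Mis-replicated blocks:\s+(\d+)", "check rack awareness"),
    (r"Under[- ]replicated blocks:\s+(\d+)", "will self-heal"),
]

def triage(report):
    """Return the actions warranted by any non-zero indicator count."""
    return [action for pattern, action in CHECKS
            if (m := re.search(pattern, report)) and int(m.group(1)) > 0]

sample = "Under-replicated blocks: 142\nCorrupt blocks: 0\n"
print(triage(sample))  # ['will self-heal']
```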