HDFS Snapshots
What Are Snapshots?
An HDFS snapshot is a read-only, point-in-time copy of a directory. Snapshot creation is nearly instantaneous and consumes no additional storage until files actually change: the NameNode records only the differences from the current state, and data blocks are never duplicated.
Use snapshots to:
- Protect against accidental deletion or corruption
- Provide a stable baseline for backup jobs
- Allow users to recover individual files without admin intervention
- Create a consistent view for ETL pipelines while writes continue
Enable Snapshots on a Directory
Snapshots must be explicitly enabled on a per-directory basis by an administrator:
# Allow snapshots on /data/warehouse
hdfs dfsadmin -allowSnapshot /data/warehouse
# Disallow snapshots (fails until all existing snapshots are deleted)
hdfs dfsadmin -disallowSnapshot /data/warehouse
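As a sketch, enabling can be made idempotent by checking `hdfs lsSnapshottableDir` first. The `is_listed` and `ensure_snapshottable` helper names, and the assumption that the path is the last field on each line of `lsSnapshottableDir` output, are mine, not part of the HDFS CLI:

```shell
#!/bin/bash
# is_listed DIR LISTING — succeed if DIR appears as the last field of any
# LISTING line (assumes lsSnapshottableDir prints the path last)
is_listed() {
  local dir=$1 listing=$2
  echo "$listing" | awk -v d="$dir" '$NF == d { found = 1 } END { exit !found }'
}

# ensure_snapshottable DIR — enable snapshots only when not already enabled
ensure_snapshottable() {
  local dir=$1
  if is_listed "$dir" "$(hdfs lsSnapshottableDir)"; then
    echo "already snapshottable: $dir"
  else
    hdfs dfsadmin -allowSnapshot "$dir"
  fi
}
```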
Creating and Listing Snapshots
# Create a snapshot with an auto-generated name (s<yyyyMMdd-HHmmss.SSS>)
hdfs dfs -createSnapshot /data/warehouse
# Create a snapshot with a custom name
hdfs dfs -createSnapshot /data/warehouse snap-2026-04-29
# List all snapshots on a directory
hdfs dfs -ls /data/warehouse/.snapshot
# List files inside a specific snapshot
hdfs dfs -ls /data/warehouse/.snapshot/snap-2026-04-29
The .snapshot directory is a hidden virtual directory — it does not appear in normal ls output but is accessible at any time.
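Because every snapshot lives under the fixed `.snapshot` component, the path to any file as of a given snapshot can be built mechanically. A minimal sketch (the helper name is mine, not an HDFS command):

```shell
#!/bin/bash
# snapshot_path DIR SNAP [REL] — virtual path of REL as it existed in SNAP
snapshot_path() {
  local dir=$1 snap=$2 rel=$3
  echo "${dir%/}/.snapshot/${snap}${rel:+/$rel}"
}
```

For example, `hdfs dfs -cat "$(snapshot_path /data/warehouse snap-2026-04-29 important-table/part-00000)"` reads the file exactly as it was at snapshot time.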
Recovering Files from a Snapshot
# Recover a deleted file from snapshot back to the live directory
hdfs dfs -cp \
/data/warehouse/.snapshot/snap-2026-04-29/important-table/part-00000 \
/data/warehouse/important-table/part-00000
# Recover an entire directory
hdfs dfs -cp \
/data/warehouse/.snapshot/snap-2026-04-29/important-table \
/data/warehouse/important-table-recovered
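The copy commands above can be wrapped in a guard so a restore never clobbers live data. This sketch uses the standard `hdfs dfs -test -e` existence check; the function name and argument convention are assumptions:

```shell
#!/bin/bash
# restore_from_snapshot DIR SNAP REL — copy REL out of SNAP back into DIR,
# refusing to overwrite an existing live path
restore_from_snapshot() {
  local dir=$1 snap=$2 rel=$3
  local src="$dir/.snapshot/$snap/$rel"
  local dst="$dir/$rel"
  if hdfs dfs -test -e "$dst"; then
    echo "refusing to overwrite existing path: $dst" >&2
    return 1
  fi
  hdfs dfs -cp "$src" "$dst"
}
```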
Comparing Snapshots (SnapshotDiff)
Use snapshotDiff to see what changed between two points in time:
hdfs snapshotDiff /data/warehouse snap-2026-04-28 snap-2026-04-29
Output format:
M . (directory itself modified)
+ ./new-table (added)
- ./old-table (deleted)
M ./customers/part-00001 (modified)
R ./tmp/work -> ./tmp/done (renamed)
This is useful for incremental backup jobs — copy only the + and M entries since the last snapshot.
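One way to turn that diff into a copy list, assuming the two-column marker/path layout shown above (the helper name is mine):

```shell
#!/bin/bash
# diff_to_copy_list — read snapshotDiff output on stdin and keep only the
# paths of added (+) and modified (M) entries, dropping the "." entry for
# the directory itself
diff_to_copy_list() {
  awk '($1 == "+" || $1 == "M") && $2 != "." { print $2 }'
}
```

Typical use: `hdfs snapshotDiff /data/warehouse snap-2026-04-28 snap-2026-04-29 | diff_to_copy_list`, then feed the resulting paths to the backup copy job.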
Renaming and Deleting Snapshots
# Rename a snapshot
hdfs dfs -renameSnapshot /data/warehouse snap-2026-04-29 daily-backup-20260429
# Delete a snapshot (frees space used by data only in this snapshot)
hdfs dfs -deleteSnapshot /data/warehouse daily-backup-20260429
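Converting between the two naming schemes shown above is pure string manipulation; a small sketch (the helper name is mine):

```shell
#!/bin/bash
# daily_name SNAPNAME — map snap-YYYY-MM-DD to daily-backup-YYYYMMDD
daily_name() {
  local d=${1#snap-}            # strip the snap- prefix
  echo "daily-backup-${d//-/}"  # drop the dashes from the date
}
```

For example, `hdfs dfs -renameSnapshot /data/warehouse snap-2026-04-29 "$(daily_name snap-2026-04-29)"`.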
Automating Daily Snapshots
A simple cron job for daily snapshots with 7-day retention:
#!/bin/bash
DIR=/data/warehouse
DATE=$(date +%Y-%m-%d)
RETAIN=7
# Create today's snapshot
hdfs dfs -createSnapshot "$DIR" "snap-$DATE"
# Delete snapshots older than RETAIN days
# Delete snapshots older than RETAIN days
# (NR > 1 skips the "Found N items" header line that -ls prints)
hdfs dfs -ls "$DIR/.snapshot" | awk 'NR > 1 {print $NF}' | while read -r snap; do
  snapname=$(basename "$snap")
  # Skip anything that does not follow the snap-YYYY-MM-DD naming scheme
  [[ $snapname == snap-* ]] || continue
  snapdate=${snapname#snap-}
  age=$(( ( $(date +%s) - $(date -d "$snapdate" +%s) ) / 86400 ))
  if [[ $age -gt $RETAIN ]]; then
    hdfs dfs -deleteSnapshot "$DIR" "$snapname"
    echo "Deleted old snapshot: $snapname"
  fi
done
Add to crontab (runs daily at 1 AM):
0 1 * * * /opt/hadoop/scripts/snapshot-rotate.sh >> /var/log/hdfs-snapshot.log 2>&1
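The age arithmetic in the script is worth isolating so it can be tested on its own; this variant takes an explicit reference date and computes in UTC to avoid DST off-by-one errors (GNU `date -d` assumed, as in the script above; the function name is mine):

```shell
#!/bin/bash
# snap_age_days SNAPNAME TODAY — whole days between a snap-YYYY-MM-DD
# snapshot name and TODAY (YYYY-MM-DD), computed in UTC
snap_age_days() {
  local snapdate=${1#snap-} today=$2
  echo $(( ( $(date -u -d "$today" +%s) - $(date -u -d "$snapdate" +%s) ) / 86400 ))
}
```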
Snapshot Limitations
| Limitation | Detail |
|---|---|
| Max snapshots per directory | 65,536 |
| Snapshottable directories | Must be enabled per directory; nesting is not allowed (a directory cannot be made snapshottable while an ancestor or descendant already is) |
| Rename across snapshot boundaries | Files renamed across snapshotted directories may appear in both old and new paths |
| Quota accounting | Snapshot storage counts against the directory quota |
Checking Snapshot Status
# List all snapshottable directories in the cluster
hdfs lsSnapshottableDir
# Show snapshot counts (-s) with a header row (-v)
hdfs dfs -count -v -s /data/warehouse