HDFS Snapshots
What Are Snapshots?
An HDFS snapshot is a read-only, point-in-time copy of a directory. Snapshot creation is nearly instantaneous and consumes no additional storage until files actually change: the NameNode records only the differences from the current state, and data blocks are never duplicated.
Use snapshots to:
- Protect against accidental deletion or corruption
- Provide a stable baseline for backup jobs
- Allow users to recover individual files without admin intervention
- Create a consistent view for ETL pipelines while writes continue
Enable Snapshots on a Directory
Snapshots must be explicitly enabled on a per-directory basis by an administrator:
# Allow snapshots on /data/warehouse
hdfs dfsadmin -allowSnapshot /data/warehouse
# Disallow snapshots (fails until all existing snapshots are deleted)
hdfs dfsadmin -disallowSnapshot /data/warehouse
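As a sketch, enabling can be made idempotent by checking `hdfs lsSnapshottableDir` first. The `is_listed` and `ensure_snapshottable` helper names, and the assumption that the path is the last field on each line of `lsSnapshottableDir` output, are mine, not part of the HDFS CLI:

```shell
#!/bin/bash
# is_listed DIR LISTING — succeed if DIR appears as the last field of any
# LISTING line (assumes lsSnapshottableDir prints the path last)
is_listed() {
  local dir=$1 listing=$2
  echo "$listing" | awk -v d="$dir" '$NF == d { found = 1 } END { exit !found }'
}

# ensure_snapshottable DIR — enable snapshots only when not already enabled
ensure_snapshottable() {
  local dir=$1
  if is_listed "$dir" "$(hdfs lsSnapshottableDir)"; then
    echo "already snapshottable: $dir"
  else
    hdfs dfsadmin -allowSnapshot "$dir"
  fi
}
```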
Creating and Listing Snapshots
# Create a snapshot with an auto-generated name (s<yyyyMMdd-HHmmss.SSS>)
hdfs dfs -createSnapshot /data/warehouse
# Create a snapshot with a custom name
hdfs dfs -createSnapshot /data/warehouse snap-2026-04-29
# List all snapshots on a directory
hdfs dfs -ls /data/warehouse/.snapshot
# List files inside a specific snapshot
hdfs dfs -ls /data/warehouse/.snapshot/snap-2026-04-29
The .snapshot directory is a hidden virtual directory — it does not appear in normal ls output but is accessible at any time.
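Because every snapshot lives under the fixed `.snapshot` component, the path to any file as of a given snapshot can be built mechanically. A minimal sketch (the helper name is mine, not an HDFS command):

```shell
#!/bin/bash
# snapshot_path DIR SNAP [REL] — virtual path of REL as it existed in SNAP
snapshot_path() {
  local dir=$1 snap=$2 rel=$3
  echo "${dir%/}/.snapshot/${snap}${rel:+/$rel}"
}
```

For example, `hdfs dfs -cat "$(snapshot_path /data/warehouse snap-2026-04-29 important-table/part-00000)"` reads the file exactly as it was at snapshot time.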
Recovering Files from a Snapshot
# Recover a deleted file from snapshot back to the live directory
hdfs dfs -cp \
/data/warehouse/.snapshot/snap-2026-04-29/important-table/part-00000 \
/data/warehouse/important-table/part-00000
# Recover an entire directory
hdfs dfs -cp \
/data/warehouse/.snapshot/snap-2026-04-29/important-table \
/data/warehouse/important-table-recovered
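The copy commands above can be wrapped in a guard so a restore never clobbers live data. This sketch uses the standard `hdfs dfs -test -e` existence check; the function name and argument convention are assumptions:

```shell
#!/bin/bash
# restore_from_snapshot DIR SNAP REL — copy REL out of SNAP back into DIR,
# refusing to overwrite an existing live path
restore_from_snapshot() {
  local dir=$1 snap=$2 rel=$3
  local src="$dir/.snapshot/$snap/$rel"
  local dst="$dir/$rel"
  if hdfs dfs -test -e "$dst"; then
    echo "refusing to overwrite existing path: $dst" >&2
    return 1
  fi
  hdfs dfs -cp "$src" "$dst"
}
```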
Comparing Snapshots (SnapshotDiff)
Use snapshotDiff to see what changed between two points in time:
hdfs snapshotDiff /data/warehouse snap-2026-04-28 snap-2026-04-29
Output format:
M . (directory itself modified)
+ ./new-table (added)
- ./old-table (deleted)
M ./customers/part-00001 (modified)
R ./tmp/work -> ./tmp/done (renamed)
This is useful for incremental backup jobs — copy only the + and M entries since the last snapshot.
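One way to turn that diff into a copy list, assuming the two-column marker/path layout shown above (the helper name is mine):

```shell
#!/bin/bash
# diff_to_copy_list — read snapshotDiff output on stdin and keep only the
# paths of added (+) and modified (M) entries, dropping the "." entry for
# the directory itself
diff_to_copy_list() {
  awk '($1 == "+" || $1 == "M") && $2 != "." { print $2 }'
}
```

Typical use: `hdfs snapshotDiff /data/warehouse snap-2026-04-28 snap-2026-04-29 | diff_to_copy_list`, then feed the resulting paths to the backup copy job.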
Renaming and Deleting Snapshots
# Rename a snapshot
hdfs dfs -renameSnapshot /data/warehouse snap-2026-04-29 daily-backup-20260429
# Delete a snapshot (frees space used by data only in this snapshot)
hdfs dfs -deleteSnapshot /data/warehouse daily-backup-20260429
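Converting between the two naming schemes shown above is pure string manipulation; a small sketch (the helper name is mine):

```shell
#!/bin/bash
# daily_name SNAPNAME — map snap-YYYY-MM-DD to daily-backup-YYYYMMDD
daily_name() {
  local d=${1#snap-}            # strip the snap- prefix
  echo "daily-backup-${d//-/}"  # drop the dashes from the date
}
```

For example, `hdfs dfs -renameSnapshot /data/warehouse snap-2026-04-29 "$(daily_name snap-2026-04-29)"`.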
Automating Daily Snapshots
A simple cron job for daily snapshots with 7-day retention:
#!/bin/bash
DIR=/data/warehouse
DATE=$(date +%Y-%m-%d)
RETAIN=7
# Create today's snapshot
hdfs dfs -createSnapshot "$DIR" "snap-$DATE"
# Delete snapshots older than RETAIN days
# Delete snapshots older than RETAIN days
# (NR > 1 skips the "Found N items" header line that -ls prints)
hdfs dfs -ls "$DIR/.snapshot" | awk 'NR > 1 {print $NF}' | while read -r snap; do
  snapname=$(basename "$snap")
  # Skip anything that does not follow the snap-YYYY-MM-DD naming scheme
  [[ $snapname == snap-* ]] || continue
  snapdate=${snapname#snap-}
  age=$(( ( $(date +%s) - $(date -d "$snapdate" +%s) ) / 86400 ))
  if [[ $age -gt $RETAIN ]]; then
    hdfs dfs -deleteSnapshot "$DIR" "$snapname"
    echo "Deleted old snapshot: $snapname"
  fi
done
Add to crontab (runs daily at 1 AM):
0 1 * * * /opt/hadoop/scripts/snapshot-rotate.sh >> /var/log/hdfs-snapshot.log 2>&1
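The age arithmetic in the script is worth isolating so it can be tested on its own; this variant takes an explicit reference date and computes in UTC to avoid DST off-by-one errors (GNU `date -d` assumed, as in the script above; the function name is mine):

```shell
#!/bin/bash
# snap_age_days SNAPNAME TODAY — whole days between a snap-YYYY-MM-DD
# snapshot name and TODAY (YYYY-MM-DD), computed in UTC
snap_age_days() {
  local snapdate=${1#snap-} today=$2
  echo $(( ( $(date -u -d "$today" +%s) - $(date -u -d "$snapdate" +%s) ) / 86400 ))
}
```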
Snapshot Limitations
| Limitation | Detail |
|---|---|
| Max snapshots per directory | 65,536 |
| Snapshottable directories | Must be enabled per directory; nesting is not allowed (a directory cannot be made snapshottable while an ancestor or descendant already is) |
| Rename across snapshot boundaries | Files renamed across snapshotted directories may appear in both old and new paths |
| Quota accounting | Snapshot storage counts against the directory quota |
Checking Snapshot Status
# List all snapshottable directories in the cluster
hdfs lsSnapshottableDir
# Show snapshot counts (-s) with a header row (-v)
hdfs dfs -count -v -s /data/warehouse