HDFS vs Amazon S3: Choosing Your Hadoop Storage
As organizations move workloads to the cloud, one of the most common questions is: should I use HDFS or Amazon S3 as my Hadoop storage layer? Both are valid choices, but they have very different performance profiles and operational characteristics.
Architecture Differences
HDFS co-locates storage and compute. DataNodes store data on local disks, and Hadoop's data locality optimization ensures that MapReduce and Spark tasks run on the same node that holds the data — eliminating network I/O for reads.
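The locality preference can be sketched as a toy scheduler: given which nodes hold each block's replicas, assign the task to one of those nodes when possible. This is an illustrative sketch only (all node and block names are hypothetical); YARN's and Spark's real schedulers are far more sophisticated.

```python
def assign_tasks(block_locations, free_nodes):
    """block_locations: {block_id: [nodes holding a replica]}
    free_nodes: set of nodes with open task slots.
    Returns {block_id: node}, preferring a node that already holds the block."""
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if n in free_nodes]
        # Prefer a data-local node (no network read); fall back to any free node.
        assignment[block] = local[0] if local else next(iter(free_nodes))
    return assignment

blocks = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
placement = assign_tasks(blocks, {"node-a", "node-c", "node-d"})
print(placement)  # both tasks land on nodes that already hold their block
```

On S3 there is no equivalent of `block_locations`: every node is equally remote, so the fallback branch is effectively the only branch.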
Amazon S3 separates storage from compute. Your cluster nodes have no local data; all reads and writes traverse the network. This means data locality is impossible, but it also means your compute cluster can be resized independently of your storage.
Performance
| Scenario | HDFS | S3 |
|---|---|---|
| Sequential large file reads | ✅ Data locality | ❌ Network I/O |
| Random small file access | ✅ Local disk | ❌ High latency per request |
| Cluster scale-up/down | ❌ Data rebalancing needed | ✅ Instant, no rebalancing |
| Long-term cold storage cost | ❌ Always-on nodes | ✅ Pay only for storage |
| Multi-framework access (Spark, Presto) | ❌ Must be on same cluster | ✅ Any cluster can read |
Latency
Reading from a local DataNode, HDFS typically returns the first bytes within roughly 10-40 ms. An S3 GET request has a time-to-first-byte of roughly 100-200 ms before any data is transferred, and that cost is paid per request. For workloads with many small files, this per-request overhead dominates total runtime.
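A back-of-envelope calculation shows how the per-request overhead compounds. The figures below are the midpoints of the latency ranges quoted above, and the serial-read model is a deliberate simplification (real jobs read in parallel):

```python
# Total time to read 10,000 small files serially, using midpoint latencies.
N_FILES = 10_000
HDFS_LATENCY_S = 0.025   # ~25 ms per local DataNode read (midpoint of 10-40 ms)
S3_LATENCY_S = 0.150     # ~150 ms time-to-first-byte per GET (midpoint of 100-200 ms)

hdfs_total = N_FILES * HDFS_LATENCY_S   # ~250 s, about 4 minutes
s3_total = N_FILES * S3_LATENCY_S       # ~1500 s, about 25 minutes

print(f"HDFS: {hdfs_total:.0f} s, S3: {s3_total:.0f} s, "
      f"ratio {s3_total / hdfs_total:.0f}x")
```

Parallelism narrows the wall-clock gap but not the total request cost, which is why compacting small files (or batching reads) matters so much more on S3 than on HDFS.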
Cost Model
With HDFS, you pay for the full node (CPU + memory + disk) even when the cluster is idle. With S3, you pay only for stored bytes when not running jobs, making it dramatically cheaper for intermittent or serverless architectures.
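A rough monthly comparison makes the difference concrete. All prices and cluster sizes below are hypothetical placeholders chosen for illustration; substitute your provider's actual rates:

```python
# Monthly cost for 50 TB of data with jobs running ~4 hours/day.
# All rates are made-up placeholders, not real pricing.
TB = 50
HOURS_PER_MONTH = 730

# HDFS: always-on cluster sized to hold the data
# (assume 10 nodes at a hypothetical $0.50/hr each).
hdfs_monthly = 10 * 0.50 * HOURS_PER_MONTH          # nodes bill even when idle

# S3: storage billed all month, compute only while jobs run
# (assumed $0.023/GB-month storage; same 10 nodes, 4 hr/day, 30 days).
s3_storage = TB * 1024 * 0.023
s3_compute = 10 * 0.50 * 4 * 30
s3_monthly = s3_storage + s3_compute

print(f"HDFS: ${hdfs_monthly:,.0f}/mo  S3: ${s3_monthly:,.0f}/mo")
```

The gap widens as the duty cycle drops: for a cluster that runs one nightly batch, the always-on HDFS nodes are idle most of the month while S3 storage costs stay flat.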
The Hybrid Pattern
Many organizations adopt a hybrid approach:
- Hot data → HDFS (low latency, high throughput)
- Cold/archival data → S3 (cheap, durable, accessible)
- Hadoop on S3 → use the `s3a://` connector, with EMRFS on EMR or Hadoop 3's S3A committers, for correct and efficient output commits
```shell
# Access S3 from Hadoop (hadoop fs is preferred over hdfs dfs for non-HDFS filesystems)
hadoop fs -ls s3a://my-bucket/data/
hadoop jar myapp.jar -input s3a://my-bucket/input/ -output s3a://my-bucket/output/
```
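On plain Hadoop 3, the S3A committer can be enabled in `core-site.xml`. A minimal sketch follows; the property names come from Hadoop's hadoop-aws module, but the values (instance-profile credentials, the `directory` committer) are illustrative choices, not the only valid ones:

```xml
<!-- core-site.xml: minimal S3A setup -->
<configuration>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
  </property>
  <!-- Use an S3A committer instead of rename-based commits,
       which are slow and non-atomic on S3 -->
  <property>
    <name>mapreduce.outputcommitter.factory.scheme.s3a</name>
    <value>org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory</value>
  </property>
  <property>
    <name>fs.s3a.committer.name</name>
    <value>directory</value>
  </property>
</configuration>
```

On EMR, EMRFS handles commit correctness for you, so this configuration is only needed when running open-source Hadoop directly against S3.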
Recommendation
- On-premises cluster, performance-critical batch jobs → HDFS
- Cloud-native, variable workloads, or data lake → S3 with a cloud Hadoop distribution (EMR, Dataproc, HDInsight)
- Migrating to cloud → Start with S3 for new data, migrate HDFS data gradually
