HDFS vs Amazon S3: Choosing Your Hadoop Storage
As organizations move workloads to the cloud, one of the most common questions is: should I use HDFS or Amazon S3 as my Hadoop storage layer? Both are valid choices, but they have very different performance profiles and operational characteristics.
Architecture Differences
HDFS co-locates storage and compute. DataNodes store data on local disks, and Hadoop's data locality optimization ensures that MapReduce and Spark tasks run on the same node that holds the data — eliminating network I/O for reads.
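The locality preference can be sketched as a toy scheduler: given which nodes hold each block's replicas, assign the task to one of those nodes when possible. This is an illustrative sketch only (all node and block names are hypothetical); YARN's and Spark's real schedulers are far more sophisticated.

```python
def assign_tasks(block_locations, free_nodes):
    """block_locations: {block_id: [nodes holding a replica]}
    free_nodes: set of nodes with open task slots.
    Returns {block_id: node}, preferring a node that already holds the block."""
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if n in free_nodes]
        # Prefer a data-local node (no network read); fall back to any free node.
        assignment[block] = local[0] if local else next(iter(free_nodes))
    return assignment

blocks = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
placement = assign_tasks(blocks, {"node-a", "node-c", "node-d"})
print(placement)  # both tasks land on nodes that already hold their block
```

On S3 there is no equivalent of `block_locations`: every node is equally remote, so the fallback branch is effectively the only branch.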
Amazon S3 separates storage from compute. Your cluster nodes have no local data; all reads and writes traverse the network. This means data locality is impossible, but it also means your compute cluster can be resized independently of your storage.
Performance
| Scenario | HDFS | S3 |
|---|---|---|
| Sequential large file reads | ✅ Data locality | ❌ Network I/O |
| Random small file access | ✅ Local disk | ❌ High latency per request |
| Cluster scale-up/down | ❌ Data rebalancing needed | ✅ Instant, no rebalancing |
| Long-term cold storage cost | ❌ Always-on nodes | ✅ Pay only for storage |
| Multi-framework access (Spark, Presto) | ❌ Must be on same cluster | ✅ Any cluster can read |
Latency
Reading from a local DataNode, HDFS typically returns the first bytes within roughly 10-40 ms. An S3 GET request has a time-to-first-byte of roughly 100-200 ms before any data is transferred, and that cost is paid per request. For workloads with many small files, this per-request overhead dominates total runtime.
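A back-of-envelope calculation shows how the per-request overhead compounds. The figures below are the midpoints of the latency ranges quoted above, and the serial-read model is a deliberate simplification (real jobs read in parallel):

```python
# Total time to read 10,000 small files serially, using midpoint latencies.
N_FILES = 10_000
HDFS_LATENCY_S = 0.025   # ~25 ms per local DataNode read (midpoint of 10-40 ms)
S3_LATENCY_S = 0.150     # ~150 ms time-to-first-byte per GET (midpoint of 100-200 ms)

hdfs_total = N_FILES * HDFS_LATENCY_S   # ~250 s, about 4 minutes
s3_total = N_FILES * S3_LATENCY_S       # ~1500 s, about 25 minutes

print(f"HDFS: {hdfs_total:.0f} s, S3: {s3_total:.0f} s, "
      f"ratio {s3_total / hdfs_total:.0f}x")
```

Parallelism narrows the wall-clock gap but not the total request cost, which is why compacting small files (or batching reads) matters so much more on S3 than on HDFS.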
Cost Model
With HDFS, you pay for the full node (CPU + memory + disk) even when the cluster is idle. With S3, you pay only for stored bytes when not running jobs, making it dramatically cheaper for intermittent or serverless architectures.
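A rough monthly comparison makes the difference concrete. All prices and cluster sizes below are hypothetical placeholders chosen for illustration; substitute your provider's actual rates:

```python
# Monthly cost for 50 TB of data with jobs running ~4 hours/day.
# All rates are made-up placeholders, not real pricing.
TB = 50
HOURS_PER_MONTH = 730

# HDFS: always-on cluster sized to hold the data
# (assume 10 nodes at a hypothetical $0.50/hr each).
hdfs_monthly = 10 * 0.50 * HOURS_PER_MONTH          # nodes bill even when idle

# S3: storage billed all month, compute only while jobs run
# (assumed $0.023/GB-month storage; same 10 nodes, 4 hr/day, 30 days).
s3_storage = TB * 1024 * 0.023
s3_compute = 10 * 0.50 * 4 * 30
s3_monthly = s3_storage + s3_compute

print(f"HDFS: ${hdfs_monthly:,.0f}/mo  S3: ${s3_monthly:,.0f}/mo")
```

The gap widens as the duty cycle drops: for a cluster that runs one nightly batch, the always-on HDFS nodes are idle most of the month while S3 storage costs stay flat.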
The Hybrid Pattern
Many organizations adopt a hybrid approach:
- Hot data → HDFS (low latency, high throughput)
- Cold/archival data → S3 (cheap, durable, accessible)
- Hadoop on S3 → use the `s3a://` connector, with EMRFS on EMR or Hadoop 3's S3A committers, for correct and efficient output commits
```shell
# Access S3 from Hadoop (hadoop fs is preferred over hdfs dfs for non-HDFS filesystems)
hadoop fs -ls s3a://my-bucket/data/
hadoop jar myapp.jar -input s3a://my-bucket/input/ -output s3a://my-bucket/output/
```
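On plain Hadoop 3, the S3A committer can be enabled in `core-site.xml`. A minimal sketch follows; the property names come from Hadoop's hadoop-aws module, but the values (instance-profile credentials, the `directory` committer) are illustrative choices, not the only valid ones:

```xml
<!-- core-site.xml: minimal S3A setup -->
<configuration>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
  </property>
  <!-- Use an S3A committer instead of rename-based commits,
       which are slow and non-atomic on S3 -->
  <property>
    <name>mapreduce.outputcommitter.factory.scheme.s3a</name>
    <value>org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory</value>
  </property>
  <property>
    <name>fs.s3a.committer.name</name>
    <value>directory</value>
  </property>
</configuration>
```

On EMR, EMRFS handles commit correctness for you, so this configuration is only needed when running open-source Hadoop directly against S3.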
Recommendation
- On-premises cluster, performance-critical batch jobs → HDFS
- Cloud-native, variable workloads, or data lake → S3 with a cloud Hadoop distribution (EMR, Dataproc, HDInsight)
- Migrating to cloud → Start with S3 for new data, migrate HDFS data gradually
