Using Hadoop with Amazon S3: The S3A Connector Explained
The s3a:// filesystem connector lets Hadoop applications read and write Amazon S3 through the standard Hadoop FileSystem API, so S3 can take the place of HDFS as the storage layer. It's the foundation for cost-effective data lake architectures where compute and storage are decoupled. This guide covers configuration, performance tuning, and production best practices.
Why S3A?
Amazon S3 offers virtually unlimited capacity at a fraction of the cost of on-premises HDFS. With the S3A connector (the third and current generation, replacing the older s3:// and s3n:// implementations), Hadoop jobs read and write S3 objects using familiar s3a://bucket/path URIs — no code changes required.
Key advantages of S3A:
- Storage decoupling — scale compute and storage independently
- Durability — S3 provides 99.999999999% (11 nines) object durability
- Cost — typically 70–80% cheaper than equivalent HDFS on-premises storage per GB
- Multi-region availability — replicate data across AWS regions easily
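For example, migrating an existing HDFS dataset into S3 needs nothing more than a DistCp run with an s3a:// destination; the bucket and paths below are placeholders:
# Copy a dataset from HDFS to S3 (paths are placeholders)
hadoop distcp hdfs:///data/events s3a://my-bucket/data/events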
Core Configuration
Add the following to core-site.xml on all cluster nodes. Never store credentials in config files for production — use IAM roles instead.
<!-- core-site.xml -->
<configuration>
<!-- S3A implementation class -->
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<!-- AWS credentials (use IAM roles in production instead) -->
<property>
<name>fs.s3a.access.key</name>
<value>YOUR_ACCESS_KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>YOUR_SECRET_KEY</value>
</property>
<!-- AWS region -->
<property>
<name>fs.s3a.endpoint.region</name>
<value>us-east-1</value>
</property>
</configuration>
For production on EC2, use an IAM instance role and omit access/secret keys entirely — Hadoop will retrieve credentials from the EC2 metadata service automatically.
Authentication Methods
S3A supports multiple credential providers, tried in order:
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
com.amazonaws.auth.profile.ProfileCredentialsProvider
</value>
</property>
Provider priority (recommended for production):
- IAM Instance Role (EC2) or IAM Task Role (ECS/EKS)
- AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- AWS credentials file (~/.aws/credentials)
- Hardcoded keys in core-site.xml (avoid in production)
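As a quick illustration of the environment-variable option, the EnvironmentVariableCredentialsProvider picks up the standard AWS variables from the shell that launches the Hadoop client (values are placeholders):
# Credentials via environment variables
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
hadoop fs -ls s3a://my-bucket/data/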
Performance Tuning
S3 is not a filesystem — it's an object store. Network latency and request overhead are the main performance factors. These settings dramatically improve throughput:
Parallel Upload (Multipart)
<!-- Size of each multipart upload part; S3 requires at least 5 MB per part and allows at most 10,000 parts per object -->
<property>
<name>fs.s3a.multipart.size</name>
<value>67108864</value> <!-- 64MB -->
</property>
<!-- Threshold above which multipart upload is used (default 128MB) -->
<property>
<name>fs.s3a.multipart.threshold</name>
<value>134217728</value> <!-- 128MB -->
</property>
Connection Pool Size
<!-- Max connections to S3 per JVM -->
<property>
<name>fs.s3a.connection.maximum</name>
<value>100</value>
</property>
<!-- Use HTTPS (TLS) for S3 connections; default is true -->
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>true</value>
</property>
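A larger connection pool only helps if the S3A thread pool can use it. The companion settings below use illustrative starting values, not defaults; tune them to your workload and instance size:
<!-- Threads available for multipart uploads and other queued filesystem work -->
<property>
<name>fs.s3a.threads.max</name>
<value>64</value>
</property>
<!-- Operations that may be queued before writers block -->
<property>
<name>fs.s3a.max.total.tasks</name>
<value>32</value>
</property>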
Prefetch and Read-Ahead
<!-- Prefetch block size for sequential reads -->
<property>
<name>fs.s3a.readahead.range</name>
<value>1048576</value> <!-- 1MB -->
</property>
<!-- Enable predictive prefetch (Hadoop 3.3.5+) -->
<property>
<name>fs.s3a.prefetch.enabled</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.prefetch.block.size</name>
<value>8388608</value> <!-- 8MB -->
</property>
Fast Upload Buffer
S3A's fast upload mode buffers output in memory or on disk and uploads it to S3 in blocks as the stream is written, improving job throughput:
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>
<!-- Buffer location: disk, array, or bytebuffer -->
<property>
<name>fs.s3a.fast.upload.buffer</name>
<value>disk</value>
</property>
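When buffering to disk, two related settings are worth knowing: where the temporary blocks are written, and how many blocks a single output stream may have queued or in flight at once. The values below are illustrative rather than defaults, and the directory path is a placeholder:
<!-- Local directories used for disk-buffered blocks -->
<property>
<name>fs.s3a.buffer.dir</name>
<value>/mnt/hadoop/s3a</value>
</property>
<!-- Blocks a single stream may queue or upload concurrently -->
<property>
<name>fs.s3a.fast.upload.active.blocks</name>
<value>8</value>
</property>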
Committer: Handling the Rename Problem
S3 doesn't support atomic directory renames. The classic FileOutputCommitter moves output files from a _temporary/ directory to the final path — on HDFS this is a metadata operation, but on S3 it means copying every byte. For large outputs this is catastrophic.
Use the S3A Magic Committer instead:
<!-- mapred-site.xml -->
<property>
<name>mapreduce.outputcommitter.factory.scheme.s3a</name>
<value>org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory</value>
</property>
<property>
<name>fs.s3a.committer.name</name>
<value>magic</value>
</property>
<property>
<name>fs.s3a.committer.magic.enabled</name>
<value>true</value>
</property>
The Magic Committer uses S3's multipart upload API to write files directly to their final location during the task — commit just calls CompleteMultipartUpload and is nearly instantaneous.
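A simple way to verify which committer actually ran: the S3A committers write a small JSON summary, including the committer name, into the job's _SUCCESS file, so you can inspect it after the job finishes (the path below assumes the word-count example shown later in this guide):
# Inspect the committer summary written by the S3A committers
hadoop fs -cat s3a://my-bucket/output/wordcount/_SUCCESS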
Working with S3A
Basic file operations
# List bucket contents
hadoop fs -ls s3a://my-bucket/data/
# Copy local file to S3
hadoop fs -put localfile.csv s3a://my-bucket/input/
# Run a MapReduce job with S3 input/output
hadoop jar hadoop-mapreduce-examples-*.jar wordcount \
s3a://my-bucket/input/ \
s3a://my-bucket/output/wordcount/
# Check S3 usage (S3 is an object store, so there are no HDFS-style blocks)
hadoop fs -du -h s3a://my-bucket/
Per-bucket configuration
You can configure different credentials or endpoints per bucket using per-bucket properties:
<property>
<name>fs.s3a.bucket.my-west-bucket.endpoint.region</name>
<value>us-west-2</value>
</property>
<property>
<name>fs.s3a.bucket.my-west-bucket.access.key</name>
<value>WEST_REGION_ACCESS_KEY</value>
</property>
S3-Compatible Object Stores
S3A works with any S3-compatible object store by overriding the endpoint:
<!-- MinIO example -->
<property>
<name>fs.s3a.endpoint</name>
<value>http://minio.internal:9000</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>
This enables the same Hadoop code to run against MinIO (on-premises), Ceph RGW, Cloudflare R2, or Backblaze B2 — same config pattern, different endpoint.
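Per-bucket overrides combine naturally with endpoint settings, so a single cluster can talk to AWS and an on-premises store in the same job; the bucket and host names below are placeholders:
<!-- Route only this bucket to a local MinIO endpoint -->
<property>
<name>fs.s3a.bucket.minio-archive.endpoint</name>
<value>http://minio.internal:9000</value>
</property>
<property>
<name>fs.s3a.bucket.minio-archive.path.style.access</name>
<value>true</value>
</property>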
Common Issues and Fixes
| Symptom | Cause | Fix |
|---|---|---|
| AccessDeniedException | Wrong credentials or missing IAM policy | Verify the IAM policy includes s3:GetObject, s3:PutObject, s3:ListBucket |
| Slow output commit | Using the old FileOutputCommitter | Switch to the S3A Magic Committer |
| IOException: No AWS credentials | No credential provider found | Configure an IAM role or a credential provider chain |
| High S3 costs (LIST requests) | Many small files / frequent listing | Combine small files and avoid repeatedly listing deep directory trees |
| FileNotFoundException on read after write | S3 eventual consistency (old SDKs and releases) | Hadoop 3.3.1+ uses S3 strong consistency; upgrade |
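For the AccessDeniedException case, a minimal bucket-scoped IAM policy looks roughly like the sketch below; the bucket name is a placeholder, and committers and multipart uploads may need additional actions (for example s3:AbortMultipartUpload) in practice:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}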
Summary
The S3A connector transforms Hadoop into a hybrid compute engine that can process data wherever it lives. Key takeaways:
- Use IAM roles for authentication — never hardcode credentials
- Enable the Magic Committer to avoid catastrophic rename overhead
- Tune multipart upload size and connection pool for your workload
- S3 strong consistency (available since 2020) eliminates the "read-after-write" problem in modern Hadoop releases
- Per-bucket configuration lets you span multiple regions and credential sets
With S3A, running Hadoop workloads on transient EMR clusters or spot instances — reading and writing directly to S3 — becomes a practical, cost-efficient architecture.
