
Using Hadoop with Amazon S3: The S3A Connector Explained

5 min read
Hadoop.so Editorial Team
Big Data Engineers

The s3a:// filesystem connector in Hadoop lets you use Amazon S3 as a drop-in replacement for HDFS storage. It's the foundation for cost-effective data lake architectures where compute and storage are decoupled. This guide covers configuration, performance tuning, and production best practices.

Why S3A?

Amazon S3 offers virtually unlimited capacity at a fraction of the cost of on-premises HDFS. With the S3A connector (the third and current generation, replacing the older s3:// and s3n:// implementations), Hadoop jobs read and write S3 objects using familiar s3a://bucket/path URIs — no code changes required.
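
As a concrete example, migrating an existing HDFS dataset and listing it back needs nothing beyond the URI scheme (the bucket and paths here are placeholders):

# Copy an HDFS dataset into S3 with DistCp; only the URIs change
hadoop distcp hdfs:///warehouse/events s3a://my-bucket/warehouse/events

# Read it back through the same FileSystem shell
hadoop fs -ls s3a://my-bucket/warehouse/events/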

Key advantages of S3A:

  • Storage decoupling — scale compute and storage independently
  • Durability — S3 provides 99.999999999% (11 nines) object durability
  • Cost — typically 70–80% cheaper than equivalent HDFS on-premises storage per GB
  • Multi-region availability — replicate data across AWS regions easily

Core Configuration

Add the following to core-site.xml on all cluster nodes. Never store credentials in config files for production — use IAM roles instead.

<!-- core-site.xml -->
<configuration>
<!-- S3A implementation class -->
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>

<!-- AWS credentials (use IAM roles in production instead) -->
<property>
<name>fs.s3a.access.key</name>
<value>YOUR_ACCESS_KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>YOUR_SECRET_KEY</value>
</property>

<!-- AWS region -->
<property>
<name>fs.s3a.endpoint.region</name>
<value>us-east-1</value>
</property>
</configuration>

For production on EC2, use an IAM instance role and omit access/secret keys entirely — Hadoop will retrieve credentials from the EC2 metadata service automatically.
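
If jobs still fail with credential errors on EC2, it's worth confirming that a role is actually attached before touching Hadoop config. A quick check against the instance metadata service (IMDSv2 shown) prints the attached role name:

# Query the EC2 instance metadata service for the attached IAM role
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/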


Authentication Methods

S3A supports multiple credential providers, tried in order:

<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
com.amazonaws.auth.profile.ProfileCredentialsProvider
</value>
</property>

Provider priority (recommended for production):

  1. IAM Instance Role (EC2) or IAM Task Role (ECS/EKS)
  2. AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), shown in the quick test after this list
  3. AWS credentials file (~/.aws/credentials)
  4. Hardcoded keys in core-site.xml (avoid in production)
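
For a quick smoke test from a workstation (option 2 above), exporting the standard AWS variables is enough, assuming EnvironmentVariableCredentialsProvider is in the configured chain; the key values and bucket are placeholders:

# Environment-variable credentials for a throwaway test (not for production)
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
hadoop fs -ls s3a://my-bucket/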

Performance Tuning

S3 is not a filesystem — it's an object store. Network latency and request overhead are the main performance factors. These settings dramatically improve throughput:

Parallel Upload (Multipart)

<!-- Size of each part in a multipart upload (S3 requires at least 5MB per part) -->
<property>
<name>fs.s3a.multipart.size</name>
<value>67108864</value> <!-- 64MB -->
</property>

<!-- Threshold above which multipart upload is used (default 128MB) -->
<property>
<name>fs.s3a.multipart.threshold</name>
<value>134217728</value> <!-- 128MB -->
</property>

Connection Pool Size

<!-- Maximum number of simultaneous connections to S3 (per S3A client) -->
<property>
<name>fs.s3a.connection.maximum</name>
<value>100</value>
</property>

<!-- Use HTTPS for connections to S3 (default: true) -->
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>true</value>
</property>

Prefetch and Read-Ahead

<!-- Read-ahead range for sequential reads (default 64KB) -->
<property>
<name>fs.s3a.readahead.range</name>
<value>1048576</value> <!-- 1MB -->
</property>

<!-- Enable the experimental prefetching input stream (Hadoop 3.3.5+) -->
<property>
<name>fs.s3a.prefetch.enabled</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.prefetch.block.size</name>
<value>8388608</value> <!-- 8MB -->
</property>

Fast Upload Buffer

S3A buffers output on local disk or in memory and uploads parts in the background, which keeps tasks writing at full speed. On Hadoop 3.x this fast upload path is always active and the fs.s3a.fast.upload flag is ignored; the buffering mode is what you tune:

<!-- Only needed on Hadoop 2.x; always on (and ignored) in Hadoop 3.x -->
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>

<!-- Buffer location: disk, array, or bytebuffer -->
<property>
<name>fs.s3a.fast.upload.buffer</name>
<value>disk</value>
</property>
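
None of these tuning properties have to live in core-site.xml; for a one-off run you can pass them as generic -D options, assuming the job parses generic options the way the bundled examples do (bucket and paths are placeholders):

# Per-job S3A tuning via generic options instead of core-site.xml
hadoop jar hadoop-mapreduce-examples-*.jar wordcount \
  -D fs.s3a.multipart.size=67108864 \
  -D fs.s3a.fast.upload.buffer=disk \
  -D fs.s3a.connection.maximum=100 \
  s3a://my-bucket/input/ \
  s3a://my-bucket/output/wordcount/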

Committer: Handling the Rename Problem

S3 doesn't support atomic directory renames. The classic FileOutputCommitter moves output files from a _temporary/ directory to the final path — on HDFS this is a metadata operation, but on S3 it means copying every byte. For large outputs this is catastrophic.

Use the S3A Magic Committer instead:

<!-- mapred-site.xml -->
<property>
<name>mapreduce.outputcommitter.factory.scheme.s3a</name>
<value>org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory</value>
</property>
<property>
<name>fs.s3a.committer.name</name>
<value>magic</value>
</property>
<property>
<name>fs.s3a.committer.magic.enabled</name>
<value>true</value>
</property>

The Magic Committer uses S3's multipart upload API to write files directly to their final location during the task — commit just calls CompleteMultipartUpload and is nearly instantaneous.
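
A quick way to confirm the magic committer was actually used: the S3A committers write a JSON summary (committer name, statistics, files written) into the _SUCCESS marker, so printing it after a job is a cheap sanity check; the output path below is a placeholder:

# Inspect the JSON manifest the S3A committer leaves in _SUCCESS
hadoop fs -cat s3a://my-bucket/output/wordcount/_SUCCESS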


Working with S3A

Basic file operations

# List bucket contents
hadoop fs -ls s3a://my-bucket/data/

# Copy local file to S3
hadoop fs -put localfile.csv s3a://my-bucket/input/

# Run a MapReduce job with S3 input/output
hadoop jar hadoop-mapreduce-examples-*.jar wordcount \
  s3a://my-bucket/input/ \
  s3a://my-bucket/output/wordcount/

# Check S3 usage (note: no block count, S3 is object store)
hadoop fs -du -h s3a://my-bucket/

Per-bucket configuration

You can configure different credentials or endpoints per bucket using per-bucket properties:

<property>
<name>fs.s3a.bucket.my-west-bucket.endpoint.region</name>
<value>us-west-2</value>
</property>
<property>
<name>fs.s3a.bucket.my-west-bucket.access.key</name>
<value>WEST_REGION_ACCESS_KEY</value>
</property>

S3-Compatible Object Stores

S3A works with any S3-compatible object store by overriding the endpoint:

<!-- MinIO example -->
<property>
<name>fs.s3a.endpoint</name>
<value>http://minio.internal:9000</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>

This enables the same Hadoop code to run against MinIO (on-premises), Ceph RGW, Cloudflare R2, or Backblaze B2 — same config pattern, different endpoint.
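
For a one-off test against such a store, the same overrides can be passed on the command line instead of editing core-site.xml; the endpoint matches the MinIO example above, while the bucket and credentials are placeholders (command-line credentials are for throwaway tests only):

# Point a single command at MinIO without touching core-site.xml
hadoop fs \
  -D fs.s3a.endpoint=http://minio.internal:9000 \
  -D fs.s3a.path.style.access=true \
  -D fs.s3a.access.key=YOUR_MINIO_ACCESS_KEY \
  -D fs.s3a.secret.key=YOUR_MINIO_SECRET_KEY \
  -ls s3a://demo-bucket/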


Common Issues and Fixes

Symptom | Cause | Fix
AccessDeniedException | Wrong credentials or missing IAM policy | Verify the IAM policy grants s3:GetObject, s3:PutObject, and s3:ListBucket
Slow output commit | Using the old FileOutputCommitter | Switch to the S3A Magic Committer
IOException: No AWS credentials | No credential provider found | Configure an IAM role or a credential provider chain
High S3 costs (LIST requests) | Many small files / frequent listing | Compact small files and keep directory trees shallow so listings touch fewer prefixes
FileNotFoundException on read after write | S3 eventual consistency (before December 2020) | S3 has been strongly consistent since December 2020; upgrade to Hadoop 3.3.1+ if you still see this

Summary

The S3A connector transforms Hadoop into a hybrid compute engine that can process data wherever it lives. Key takeaways:

  • Use IAM roles for authentication — never hardcode credentials
  • Enable the Magic Committer to avoid catastrophic rename overhead
  • Tune multipart upload size and connection pool for your workload
  • S3 strong consistency (available since 2020) eliminates the "read-after-write" problem in modern Hadoop releases
  • Per-bucket configuration lets you span multiple regions and credential sets

With S3A, running Hadoop workloads on transient EMR clusters or spot instances — reading and writing directly to S3 — becomes a practical, cost-efficient architecture.