
Securing Your Hadoop DataNode: Kerberos, Wire Encryption, and Best Practices

· 5 min read
Hadoop.so Editorial Team
Big Data Engineers

An unsecured Hadoop cluster is a ticking time bomb. Without authentication, any user on the network can read, write, or delete HDFS data. This guide covers the essential security layers for HDFS DataNodes: Kerberos authentication, data transfer encryption, block access tokens, and OS-level hardening.

Why DataNodes Are a Security Target

DataNodes are the workhorses of HDFS — they store actual data blocks and serve reads/writes to clients. In an unsecured cluster:

  • Any process that can reach port 9866 (DataNode transfer port) can read or write blocks directly
  • There's no per-user access control on who reads which data
  • A rogue client can inject corrupt or malicious blocks

Hadoop's security model addresses all of this through Kerberos-based mutual authentication, block access tokens, and optional wire encryption.


Layer 1: Kerberos Authentication

Kerberos is the foundation of Hadoop security. Every Hadoop service (NameNode, DataNode, ResourceManager, NodeManager) authenticates with a Kerberos principal before communicating.

Prerequisites

  • A running Kerberos KDC (MIT Kerberos or Active Directory)
  • DNS properly configured (Kerberos is very sensitive to hostname resolution)
  • Synchronized clocks across all nodes (within 5 minutes; use NTP)
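These prerequisites can be sanity-checked from any cluster host with a short script. This is a sketch: `kdc.example.com` is a placeholder for your KDC, and the clock-sync check simply uses whichever of chrony or ntp is installed.

```shell
#!/bin/sh
KDC_HOST=kdc.example.com   # placeholder: substitute your KDC hostname

# DNS: Kerberos requires the FQDN to resolve consistently
echo "FQDN: $(hostname -f 2>/dev/null || hostname)"

# Clock skew: Kerberos rejects tickets beyond ~5 minutes of skew
if command -v chronyc >/dev/null 2>&1; then
  chronyc tracking
elif command -v ntpstat >/dev/null 2>&1; then
  ntpstat
else
  echo "WARN: no chrony/ntp client found"
fi

# KDC reachability (Kerberos defaults to port 88)
if command -v nc >/dev/null 2>&1; then
  nc -z -w 2 "$KDC_HOST" 88 && echo "KDC reachable" || echo "WARN: cannot reach $KDC_HOST:88"
fi
```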

Create Service Principals

For each DataNode host, create a service principal, plus an HTTP principal for the SPNEGO-secured web UI:

# On the KDC
kadmin.local -q "addprinc -randkey hdfs/datanode1.example.com@EXAMPLE.COM"
kadmin.local -q "addprinc -randkey HTTP/datanode1.example.com@EXAMPLE.COM"

# Export both principals into the keytab
kadmin.local -q "ktadd -k /etc/security/keytabs/hdfs.keytab hdfs/datanode1.example.com@EXAMPLE.COM"
kadmin.local -q "ktadd -k /etc/security/keytabs/hdfs.keytab HTTP/datanode1.example.com@EXAMPLE.COM"

Copy keytabs to each DataNode at /etc/security/keytabs/hdfs.keytab with ownership hdfs:hdfs and mode 400.
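After copying, a quick verification on each DataNode might look like the following (paths as above; `klist` ships with the Kerberos client packages, and `stat -c` assumes GNU coreutils):

```shell
KEYTAB=/etc/security/keytabs/hdfs.keytab

if [ -f "$KEYTAB" ]; then
  # Expect: owner hdfs:hdfs and mode 400
  stat -c 'owner=%U:%G mode=%a %n' "$KEYTAB"
  # List the principals stored in the keytab to confirm it is valid
  if command -v klist >/dev/null 2>&1; then
    klist -kt "$KEYTAB"
  fi
else
  echo "keytab not found at $KEYTAB"
fi
```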

Enable Security in hdfs-site.xml

<!-- hdfs-site.xml -->
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>

<!-- DataNode SASL RPC authentication -->
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/hdfs.keytab</value>
</property>

Enable Security in core-site.xml

<!-- core-site.xml -->
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value> <!-- or privacy for encryption -->
</property>

Layer 2: Block Access Tokens

Block access tokens prevent unauthorized direct block reads/writes even from nodes that have network access to a DataNode. The NameNode issues a short-lived token when a client requests a block location; the DataNode validates the token before serving data.

Enable with:

<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<property>
<name>dfs.block.access.token.lifetime</name>
<value>600</value> <!-- minutes; default 600 (10 hours) -->
</property>

Without block tokens, a client who obtains a block location (host:port + block ID) can read that block without further auth. With tokens, the NameNode effectively gatekeeps all data transfers.


Layer 3: Wire Encryption

Even with Kerberos, data transferred between DataNodes and clients is in plaintext by default. Enable encryption for data in transit:

RPC Encryption (control plane)

<!-- core-site.xml -->
<property>
<name>hadoop.rpc.protection</name>
<value>privacy</value> <!-- authentication = auth only; integrity = + checksums; privacy = + encryption -->
</property>
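To confirm the value the daemons will actually pick up, `hdfs getconf` prints effective configuration. The guard below just makes the check a no-op on hosts without the Hadoop client installed:

```shell
# Expect "privacy" once the change above is deployed
if command -v hdfs >/dev/null 2>&1; then
  hdfs getconf -confKey hadoop.rpc.protection
else
  echo "hdfs CLI not found on PATH"
fi
```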

Data Transfer Encryption (data plane)

<!-- hdfs-site.xml -->
<property>
<name>dfs.encrypt.data.transfer</name>
<value>true</value>
</property>
<property>
<name>dfs.encrypt.data.transfer.algorithm</name>
<value>3des</value> <!-- 3des or rc4; both are legacy, used only for the initial key exchange once the AES cipher suite below is configured -->
</property>
<property>
<name>dfs.encrypt.data.transfer.cipher.suites</name>
<value>AES/CTR/NoPadding</value> <!-- hardware-accelerated AES on modern CPUs -->
</property>

AES/CTR with hardware acceleration (AES-NI, available on most modern Intel/AMD CPUs) adds only 5–10% overhead compared to unencrypted transfer.
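Whether a host's CPU actually advertises AES-NI can be checked on Linux before relying on that overhead estimate:

```shell
# AES-NI shows up as the "aes" flag in /proc/cpuinfo on x86 Linux
if grep -qm1 '\baes\b' /proc/cpuinfo 2>/dev/null; then
  echo "AES-NI available: AES/CTR will be hardware-accelerated"
else
  echo "AES-NI not detected: AES/CTR falls back to software"
fi
```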


Layer 4: DataNode SASL on Privileged Ports

Running the DataNode data transfer on a privileged port (below 1024) proves that the process was started as root (via jsvc) and later dropped privileges — adding OS-level verification. This is optional but adds defense in depth.

<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>

When using SASL on the data transfer protocol (Hadoop 2.6+) instead of privileged ports, the DataNode proves its identity through Kerberos without needing root-owned ports. Configure dfs.data.transfer.protection, keep dfs.datanode.address on a non-privileged port, and serve the web UI over HTTPS:

<property>
<name>dfs.data.transfer.protection</name>
<value>privacy</value> <!-- authentication, integrity, or privacy -->
</property>
<property>
<name>dfs.http.policy</name>
<value>HTTPS_ONLY</value>
</property>

In this mode, make sure the HDFS_DATANODE_SECURE_USER environment variable is not defined, or the DataNode will still attempt a privileged startup.

Layer 5: OS Hardening

Kerberos secures the Hadoop layer, but the underlying OS must also be locked down:

File Permissions

# DataNode data directories should be owned by the hdfs user
chown -R hdfs:hadoop /data/hdfs/dn
chmod 700 /data/hdfs/dn

# Keytab files must not be world-readable
chmod 400 /etc/security/keytabs/hdfs.keytab
chown hdfs:hdfs /etc/security/keytabs/hdfs.keytab
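A small audit loop can flag any keytab that drifts from mode 400 (directory path as above):

```shell
KEYTAB_DIR=/etc/security/keytabs

if [ -d "$KEYTAB_DIR" ]; then
  # Print any keytab whose permissions are not exactly 400
  find "$KEYTAB_DIR" -name '*.keytab' ! -perm 400 \
    -exec echo "BAD PERMS:" {} \;
else
  echo "no keytab directory at $KEYTAB_DIR"
fi
```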

Network Restrictions

Restrict DataNode ports to cluster-internal network ranges using iptables or firewalld:

# Only allow DataNode transfer port from within the cluster subnet
iptables -A INPUT -p tcp --dport 9866 -s 10.0.0.0/8 -j ACCEPT
iptables -A INPUT -p tcp --dport 9866 -j DROP
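If the hosts run firewalld instead of raw iptables, the equivalent rich rules look like this (same example subnet; run as root — the guard just makes the snippet a no-op where firewalld is absent):

```shell
if command -v firewall-cmd >/dev/null 2>&1; then
  # Allow the DataNode transfer port only from the cluster subnet, drop the rest
  firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=10.0.0.0/8 port port=9866 protocol=tcp accept'
  firewall-cmd --permanent --add-rich-rule='rule family=ipv4 port port=9866 protocol=tcp drop'
  firewall-cmd --reload
else
  echo "firewall-cmd not installed"
fi
```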

Run DataNode as Non-Root

The DataNode process should run as the hdfs system user, not root:

# /etc/hadoop/hadoop-env.sh
# SASL mode: the DataNode runs entirely as the hdfs user
export HDFS_DATANODE_USER=hdfs
# Privileged-port mode only: start as root via jsvc, which drops to this user
# export HDFS_DATANODE_SECURE_USER=hdfs

Verifying Security Configuration

After enabling security, verify that everything works:

# Obtain a Kerberos ticket for the hdfs service user
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/namenode.example.com@EXAMPLE.COM

# List HDFS root (should succeed)
hdfs dfs -ls /

# Check that unauthenticated access is denied
kdestroy
hdfs dfs -ls / # Should fail with "No valid credentials"

Run an HDFS health check with auth:

kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/namenode.example.com@EXAMPLE.COM
hdfs dfsadmin -report
hdfs fsck / # prints a block-level health summary

Security Audit Checklist

  • [ ] Kerberos principals created for all service hosts
  • [ ] Keytab files owned by the service user, mode 400
  • [ ] hadoop.security.authentication = kerberos
  • [ ] dfs.block.access.token.enable = true
  • [ ] Data transfer encryption enabled
  • [ ] DataNode data dirs owned by the hdfs user, mode 700
  • [ ] Firewall restricts DataNode ports to the cluster subnet
  • [ ] HDFS audit logging enabled
  • [ ] NTP synchronized (< 5 min skew)
  • [ ] Ranger or Sentry deployed for fine-grained authorization
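Several of these items can be checked mechanically. A sketch using `hdfs getconf` (Hadoop client required; the `check` helper is ours, and the list is easy to extend):

```shell
# Compare effective config values against the expected secure settings
check() {
  key=$1 expected=$2
  actual=$(hdfs getconf -confKey "$key" 2>/dev/null)
  if [ "$actual" = "$expected" ]; then
    echo "OK   $key = $actual"
  else
    echo "FAIL $key = ${actual:-<unset>} (expected $expected)"
  fi
}

if command -v hdfs >/dev/null 2>&1; then
  check hadoop.security.authentication kerberos
  check hadoop.security.authorization true
  check dfs.block.access.token.enable true
  check dfs.encrypt.data.transfer true
else
  echo "hdfs CLI not found on PATH"
fi
```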

Summary

Securing a Hadoop DataNode involves multiple complementary layers:

  1. Kerberos — mutual authentication between services and clients
  2. Block access tokens — prevent unauthorized direct block access
  3. Wire encryption — protect data in transit (RPC + data transfer)
  4. Privileged ports or SASL — OS-level service identity verification
  5. OS hardening — file permissions, firewall, non-root user

No single layer is sufficient on its own. A properly secured DataNode requires all these working together. For fine-grained row-level and column-level access control beyond what HDFS ACLs provide, look at Apache Ranger as the next step.