Apache ZooKeeper
What Is ZooKeeper?
Apache ZooKeeper is a distributed coordination service — a highly available system that provides primitives for building distributed applications: configuration management, distributed locking, leader election, and service discovery.
In the Hadoop ecosystem, ZooKeeper is a foundational dependency used by:
| System | ZooKeeper Role |
|---|---|
| HDFS HA | NameNode automatic failover (ZKFC) |
| YARN HA | ResourceManager leader election |
| HBase | Region Server coordination, Master election |
| Kafka | Broker registration, topic/partition metadata (pre-Kafka 3.x) |
| Oozie | Job state coordination |
| Hive | HiveServer2 instance discovery |
ZooKeeper Data Model
ZooKeeper stores data in a hierarchical namespace (like a filesystem), where each node is called a znode:
/
├── hadoop-ha/
│ ├── nameservice1/
│ │ ├── ActiveBreadCrumb ← which NameNode is active
│ │ └── ActiveStandbyElectorLock
├── hbase/
│ ├── master ← HBase active master
│ ├── rs/ ← region server registrations
│ └── table/
└── kafka/
├── brokers/
└── topics/
Znode types:
- Persistent — survives client disconnection
- Ephemeral — deleted automatically when the creating client disconnects (used for leader election and service registration)
- Sequential — appends a unique monotonic counter to the node name
Architecture
ZooKeeper Ensemble (3 or 5 nodes recommended)
Leader ──► Follower 1
──► Follower 2
──► Follower 3
Quorum requirement: majority must be alive
3 nodes → tolerates 1 failure
5 nodes → tolerates 2 failures
Never run 2 or 4 nodes — an even number doesn't increase fault tolerance but wastes a server.
Configuration (zoo.cfg)
# Data directory for snapshots
dataDir=/var/zookeeper/data
# Transaction log directory (put on separate disk from dataDir)
dataLogDir=/var/zookeeper/logs
# Client port
clientPort=2181
# Tick time (ms) — base unit for heartbeats and timeouts
tickTime=2000
# Follower init sync timeout (10 × tickTime)
initLimit=10
# Follower-leader sync timeout (5 × tickTime)
syncLimit=5
# Session timeout bounds (ms)
minSessionTimeout=4000
maxSessionTimeout=40000
# Ensemble members: server.id=host:leader-port:election-port
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
Each ZooKeeper node needs a myid file containing its server ID:
# On zk1:
echo 1 > /var/zookeeper/data/myid
# On zk2:
echo 2 > /var/zookeeper/data/myid
# On zk3:
echo 3 > /var/zookeeper/data/myid
Starting ZooKeeper
# Start
zkServer.sh start
# Stop
zkServer.sh stop
# Check status (shows if node is Leader or Follower)
zkServer.sh status
ZooKeeper CLI (zkCli)
# Connect to local ZooKeeper
zkCli.sh -server localhost:2181
# Connect to a specific ensemble member
zkCli.sh -server zk1.example.com:2181
Common commands inside the CLI:
# List children of a znode
ls /
# Get data stored in a znode
get /hadoop-ha/nameservice1/ActiveBreadCrumb
# Create a persistent znode with data
create /myapp/config "version=1.2"
# Create an ephemeral znode
create -e /myapp/locks/worker-01 "locked"
# Set data on an existing znode
set /myapp/config "version=1.3"
# Watch a znode for changes (triggers once)
get -w /myapp/config
# Delete a znode
delete /myapp/config
# Delete recursively
deleteall /myapp
Four-Letter Commands (Monitoring)
ZooKeeper responds to short text commands over TCP for health monitoring:
# Check if node is OK
echo ruok | nc zk1.example.com 2181
# Returns: imok
# Show server statistics
echo stat | nc zk1.example.com 2181
# List active connections
echo dump | nc zk1.example.com 2181
# Show environment info
echo envi | nc zk1.example.com 2181
# Show outstanding requests (should be near 0)
echo wchs | nc zk1.example.com 2181
Enable 4-letter commands in zoo.cfg (required in newer versions):
4lw.commands.whitelist=ruok,stat,dump,envi,wchs,mntr
Key Tuning Parameters
| Parameter | Default | Recommendation |
|---|---|---|
tickTime | 2000ms | Keep at 2000ms for most clusters |
maxSessionTimeout | 20× tickTime | Increase to 60000ms for slow HBase clients |
dataLogDir | same as dataDir | Always set to a separate disk (reduces latency) |
autopurge.snapRetainCount | 3 | Set to 5–10 to keep enough snapshots |
autopurge.purgeInterval | 0 (off) | Set to 1 (hourly purge) to prevent disk fill |
| JVM heap | 500MB | 1–2 GB for busy ensembles |
Enable automatic log purge
autopurge.snapRetainCount=5
autopurge.purgeInterval=1
JVM heap (zkEnv.sh or zookeeper-env.sh)
export ZK_SERVER_HEAP=1024 # 1 GB
ZooKeeper and HDFS HA
When configuring HDFS NameNode HA with automatic failover, ZooKeeper holds the active NameNode lock. The ZooKeeper Failover Controller (ZKFC) runs on each NameNode host:
# Format the ZooKeeper ZNode for HDFS HA (run once)
hdfs zkfc -formatZK
# Start ZKFC on each NameNode host
hdfs --daemon start zkfc
If the active NameNode fails, the ZKFC detects it and triggers a failover — the standby NameNode acquires the ZooKeeper lock and transitions to active automatically.