YARN Containers Deep Dive: How Resource Allocation Really Works
YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource management layer. Understanding how YARN allocates containers — the fundamental unit of computation — is essential for getting good utilization and avoiding the frustrating "application is waiting for resources" message that plagues many clusters.
What Is a YARN Container?
A YARN container is a reservation of resources (CPU cores and memory) on a NodeManager. It's the unit in which YARN launches application tasks — MapReduce map/reduce tasks, Spark executors, Tez tasks, and so on all run inside YARN containers.
Each container has:
- Memory (in MB) — a hard limit, enforced either by the NodeManager's memory monitor (pmem/vmem checks) or by cgroups
- vCores (virtual CPU cores) — a logical unit, not necessarily a physical core
- A host (the NodeManager it runs on)
- A priority within the application
Containers are ephemeral: they're created for a task, run to completion, and are released.
How Container Allocation Works
The flow from application submission to container launch:
Client
│
└──► ApplicationMaster (AM) submitted to YARN
│
└──► AM registers with ResourceManager
│
└──► AM requests containers (resource requests)
│
└──► ResourceManager scheduler evaluates queue + node availability
│
└──► ResourceManager issues container allocations
│
└──► AM contacts NodeManager → container launch
The ResourceManager's scheduler is the brain. It tracks available resources on each NodeManager and matches pending container requests against available capacity, obeying queue limits, node labels, and locality preferences.
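The matching step above can be sketched as a toy model. This is not YARN source code, just a minimal first-fit illustration of what the scheduler does on each node heartbeat: walk pending requests, find a node with enough free memory and vCores, and deduct the allocation; all names and numbers are hypothetical.

```python
# Illustrative model (not YARN source): first-fit matching of pending
# container requests against per-node free capacity.

def assign_containers(nodes, requests):
    """nodes: {host: {"mem_mb": free_mem, "vcores": free_cores}}
    requests: list of {"mem_mb": ..., "vcores": ...} pending asks."""
    assignments = []
    for req in requests:
        for host, free in nodes.items():
            if free["mem_mb"] >= req["mem_mb"] and free["vcores"] >= req["vcores"]:
                # Reserve the capacity on this node and record the allocation.
                free["mem_mb"] -= req["mem_mb"]
                free["vcores"] -= req["vcores"]
                assignments.append((host, req))
                break
    return assignments

nodes = {"nm1": {"mem_mb": 4096, "vcores": 4}, "nm2": {"mem_mb": 2048, "vcores": 2}}
reqs = [{"mem_mb": 3072, "vcores": 2}, {"mem_mb": 2048, "vcores": 2}]
print(assign_containers(nodes, reqs))  # first request lands on nm1, second on nm2
```

The real scheduler layers queue limits, node labels, and locality on top of this loop, but the core capacity bookkeeping is exactly this subtract-and-assign.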
Resource Request Parameters
An ApplicationMaster sends resource requests with these key parameters:
| Parameter | Description |
|---|---|
| memory | Container memory in MB |
| vCores | Virtual CPU cores |
| nodes | Preferred host list (data locality) |
| racks | Preferred rack list (rack locality) |
| priority | Request priority (lower number = higher priority) |
| relaxLocality | Whether to fall back to any node when preferred nodes are unavailable |
Locality preference order
YARN tries to honor locality in this order:
- Node-local — container on the same node as the data block
- Rack-local — container on a node in the same rack
- Off-rack — any node in the cluster
For MapReduce, HDFS block locality means the map task ideally runs on a node that already has the block locally — avoiding network transfer entirely.
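The three-tier fallback can be written out as a small sketch. This is an illustration of the ordering, not YARN's actual implementation; the hosts, racks, and the `rack_of` mapping are all hypothetical.

```python
# Sketch of the locality fallback order when relaxLocality is enabled:
# node-local first, then rack-local, then any node (off-rack).

def pick_node(preferred_nodes, preferred_racks, available, rack_of, relax_locality=True):
    """available: hosts with free capacity; rack_of: host -> rack mapping."""
    for host in available:
        if host in preferred_nodes:
            return host, "node-local"
    if not relax_locality:
        return None, "waiting"  # strict locality: leave the request pending
    for host in available:
        if rack_of[host] in preferred_racks:
            return host, "rack-local"
    return (available[0], "off-rack") if available else (None, "waiting")

rack_of = {"a1": "r1", "a2": "r1", "b1": "r2"}
# Data lives on a1 (rack r1), but only a2 and b1 have capacity:
print(pick_node(["a1"], ["r1"], ["a2", "b1"], rack_of))  # ('a2', 'rack-local')
```

With relaxLocality off, the request simply waits for a preferred node instead of degrading, which is why strict locality can stall an application on a busy cluster.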
NodeManager Configuration
Configure how much resource each NodeManager exposes to YARN in yarn-site.xml:
<!-- Total memory available to YARN containers on this node -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>49152</value> <!-- 48GB; leave some for OS and system processes -->
</property>
<!-- Total vCores available on this node -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>20</value> <!-- Leave 2-4 for OS -->
</property>
<!-- Minimum container memory allocation -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- Maximum container memory allocation -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>16384</value>
</property>
<!-- Minimum container vCores -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Maximum container vCores -->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>8</value>
</property>
Important: Container allocations are rounded up to the nearest increment of minimum-allocation-mb. Requesting 1.5GB on a cluster with 1GB minimum gets you 2GB. Size your minimum allocation to match typical workloads.
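The round-up rule is simple arithmetic; here is a minimal sketch using the minimum/maximum values from the configuration above:

```python
import math

# Sketch of request normalization: round up to a multiple of
# yarn.scheduler.minimum-allocation-mb, capped at maximum-allocation-mb.

def normalize_mb(requested_mb, min_alloc_mb=1024, max_alloc_mb=16384):
    rounded = math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb
    return min(max(rounded, min_alloc_mb), max_alloc_mb)

print(normalize_mb(1536))   # 1.5GB request -> 2048 (rounded up to 2GB)
print(normalize_mb(20000))  # capped at 16384 by maximum-allocation-mb
```

The waste compounds: a 1.5GB ask that becomes 2GB burns 25% of each container, which is why the minimum should match your smallest typical request.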
Memory Enforcement: Virtual vs Physical
YARN can enforce memory limits in two ways:
Virtual Memory Check (default on)
YARN monitors the virtual memory (VM size) of each container process. If it exceeds yarn.nodemanager.vmem-pmem-ratio times the allocated physical memory, the container is killed:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value> <!-- Default; container can use 2.1x its physical memory as virtual memory -->
</property>
This is a common source of spurious container kills on modern Linux systems, where JVM virtual memory usage appears large even though little of it is resident. If the container diagnostics report it is running beyond virtual memory limits, either increase the ratio or disable the check:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
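The kill condition is a single comparison; this sketch shows the arithmetic for a 2GB container at the default 2.1 ratio:

```python
# Sketch of the vmem check: a container is killed when its virtual memory
# footprint exceeds vmem-pmem-ratio times its allocated physical memory.

def vmem_exceeded(vmem_used_mb, alloc_mb, vmem_pmem_ratio=2.1):
    return vmem_used_mb > alloc_mb * vmem_pmem_ratio

# A 2048MB container may use up to 2048 * 2.1 = 4300.8MB of virtual memory.
print(vmem_exceeded(4200, 2048))  # False: under the limit, container survives
print(vmem_exceeded(4500, 2048))  # True: over the limit, container is killed
```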
Physical Memory Check
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>true</value>
</property>
Physical memory enforcement kills containers that exceed their allocated memory in RAM. This prevents one container from starving others on the same node.
cgroups Enforcement (recommended)
The most reliable enforcement uses Linux cgroups. The NodeManager enforces both CPU and memory limits at the OS level:
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
<value>/hadoop-yarn</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu.enabled</name>
<value>true</value>
</property>
Scheduler Types
Capacity Scheduler (default)
The Capacity Scheduler divides cluster resources into queues, each with a guaranteed minimum capacity and an optional maximum:
<!-- capacity-scheduler.xml -->
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,production,development</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.production.capacity</name>
<value>60</value> <!-- 60% of cluster -->
</property>
<property>
<name>yarn.scheduler.capacity.root.development.capacity</name>
<value>30</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>10</value>
</property>
Queues can borrow unused capacity from sibling queues (elastic sharing) and return it when the owner queue needs it.
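The queue percentages translate directly into absolute guaranteed capacity. A minimal sketch, assuming a hypothetical 10-node cluster where each node exposes the 48GB configured earlier:

```python
# Sketch: converting Capacity Scheduler queue percentages into absolute
# guaranteed memory. Cluster size here is a hypothetical example.

CLUSTER_MEM_MB = 10 * 49152  # assumed: 10 nodes x 48GB exposed to YARN

queues = {"production": 60, "development": 30, "default": 10}  # percent of root

guaranteed = {q: CLUSTER_MEM_MB * pct // 100 for q, pct in queues.items()}
print(guaranteed["production"])  # 294912 MB (288GB) guaranteed minimum
```

Elastic sharing means production can temporarily exceed 294912 MB when the other queues are idle (up to its configured maximum-capacity, if set), and gives the excess back as siblings submit work.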
Fair Scheduler
The Fair Scheduler distributes resources so that all running applications get an equal share over time. Good for multi-tenant environments with diverse workload sizes:
<!-- yarn-site.xml -->
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
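The "equal share over time" idea reduces to a weighted division of cluster capacity among running applications. A simplified sketch of the instantaneous share (the real Fair Scheduler also handles min/max shares and hierarchical queues):

```python
# Simplified sketch of instantaneous fair share: total capacity divided
# among active apps in proportion to their weights (equal by default).

def fair_share_mb(total_mb, num_apps, weights=None):
    if weights is None:
        weights = [1.0] * num_apps
    total_w = sum(weights)
    return [total_mb * w / total_w for w in weights]

print(fair_share_mb(98304, 3))                    # three equal apps split 96GB evenly
print(fair_share_mb(98304, 2, weights=[3.0, 1.0]))  # weighted 3:1 split
```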
Diagnosing Container Issues
Application stuck waiting for resources
# Check what resources are available
yarn node -list -all
# Check scheduler queue utilization
yarn queue -status default
# Check ResourceManager logs for why containers aren't being assigned
# Look for: "Application ... is waiting for ..."
Container killed: OOM
# Check NodeManager logs on the host where the container ran
# Look for: "is running beyond physical memory limits"
# Or check YARN application logs:
yarn logs -applicationId application_XXXX_XXXX
Increase the container memory allocation in your job configuration:
# MapReduce: pass -D overrides at submission time (the driver must use
# ToolRunner/GenericOptionsParser for -D flags to take effect)
hadoop jar your-job.jar -Dmapreduce.map.memory.mb=4096 -Dmapreduce.reduce.memory.mb=8192
Container killed: virtual memory
# Quick fix: disable vmem check in yarn-site.xml
# Better fix: understand JVM memory layout on your OS
# The JVM reserves large amounts of virtual address space for mapped libraries
Sizing Containers for Common Workloads
MapReduce
<!-- mapred-site.xml -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value> <!-- 80% of container memory for JVM heap -->
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3276m</value>
</property>
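The 80% rule in the comments above is a convention, not a YARN requirement; this sketch shows the arithmetic behind those `-Xmx` values:

```python
# Sketch of the common 80% heap rule: leave ~20% of the container for
# JVM non-heap memory (metaspace, thread stacks, direct buffers).

def heap_opts(container_mb, heap_fraction=0.8):
    return f"-Xmx{int(container_mb * heap_fraction)}m"

print(heap_opts(2048))  # -Xmx1638m (matches the map config above)
print(heap_opts(4096))  # -Xmx3276m (matches the reduce config above)
```

If the task uses significant off-heap memory (native compression libraries, large direct buffers), drop the fraction below 0.8 rather than risk the container exceeding its physical memory limit.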
Spark on YARN
spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 20 \
  --conf spark.executor.memoryOverhead=1024 \
my_spark_app.jar
memoryOverhead covers JVM off-heap memory (direct buffers, native libraries, metaspace). Spark defaults it to the larger of 384MB or 10% of executor memory; raise it if YARN kills executors whose heaps look healthy. (Before Spark 2.3 the property was named spark.yarn.executor.memoryOverhead.)
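Spark's default overhead rule is easy to sanity-check by hand:

```python
# Sketch of Spark's default executor overhead: max(384MB, 10% of executor memory).

def executor_overhead_mb(executor_mem_mb, factor=0.10, floor_mb=384):
    return max(floor_mb, int(executor_mem_mb * factor))

print(executor_overhead_mb(8192))  # 819 -> an 8GB executor gets ~819MB overhead
print(executor_overhead_mb(2048))  # 384 -> the floor applies for small executors
```

The YARN container for an executor is therefore executor-memory plus overhead, so an 8GB executor actually consumes roughly 9GB of cluster capacity; budget node memory accordingly.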
Summary
| Concept | Key Point |
|---|---|
| Container | CPU + memory reservation on a NodeManager |
| AM negotiates | ApplicationMaster requests containers from ResourceManager |
| Locality | Node-local > rack-local > off-rack |
| vmem check | Common source of spurious kills; consider disabling on modern Linux |
| cgroups | Best enforcement; use LinuxContainerExecutor |
| Capacity Scheduler | Guaranteed queues with elastic borrowing |
| Fair Scheduler | Equal sharing over time for multi-tenant |
| Heap = 80% of MB | Set JVM -Xmx to ~80% of container memory |
Mastering YARN container allocation turns resource utilization from a guessing game into a predictable, measurable engineering discipline. Profile your actual workloads, size containers deliberately, and use queue priorities to give critical jobs the resources they need.
