
YARN Containers Deep Dive: How Resource Allocation Really Works

6 min read
Hadoop.so Editorial Team
Big Data Engineers

YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource management layer. Understanding how YARN allocates containers — the fundamental unit of computation — is essential for getting good utilization and avoiding the frustrating "application is waiting for resources" message that plagues many clusters.

What Is a YARN Container?

A YARN container is a reservation of resources (CPU cores and memory) on a NodeManager. It's the unit in which YARN launches application tasks — MapReduce map/reduce tasks, Spark executors, Tez tasks, and so on all run inside YARN containers.

Each container has:

  • Memory (in MB) — hard limit enforced by cgroups or virtual memory check
  • vCores (virtual CPU cores) — a logical unit, not necessarily a physical core
  • A host (the NodeManager it runs on)
  • A priority within the application

Containers are ephemeral: they're created for a task, run to completion, and are released.
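
To make the spec above concrete, here is a minimal Java sketch using the standard YARN records API; the values are illustrative, not from any particular cluster:

// Minimal sketch using the YARN records API (org.apache.hadoop.yarn.api.records).
// The 2048 MB / 2 vCore values are illustrative only.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;

public class ContainerSpecSketch {
    public static void main(String[] args) {
        // A container capability: 2048 MB of memory and 2 vCores
        Resource capability = Resource.newInstance(2048, 2);

        // Priority within the application; lower numeric value = higher priority
        Priority priority = Priority.newInstance(1);

        System.out.println("capability=" + capability + ", priority=" + priority);
    }
}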


How Container Allocation Works

The flow from application submission to container launch:

  1. Client submits the application; YARN launches the ApplicationMaster (AM)
  2. AM registers with the ResourceManager
  3. AM requests containers (resource requests)
  4. ResourceManager scheduler evaluates queue capacity and node availability
  5. ResourceManager issues container allocations to the AM
  6. AM contacts the assigned NodeManager, which launches the container

The ResourceManager's scheduler is the brain. It tracks available resources on each NodeManager and matches pending container requests against available capacity, obeying queue limits, node labels, and locality preferences.
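
The same flow is visible from the ApplicationMaster's side. The sketch below uses the standard synchronous AMRMClient API; it is a simplified illustration that omits token setup, error handling, and the NMClient step that actually launches the container:

// Simplified AM-side allocation loop (org.apache.hadoop.yarn.client.api.AMRMClient).
// Omits security setup, error handling, and launching containers via NMClient.
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AllocationLoopSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Step 1: register the AM with the ResourceManager
        rmClient.registerApplicationMaster("", 0, "");

        // Step 2: ask for one 2 GB / 2 vCore container
        rmClient.addContainerRequest(new ContainerRequest(
                Resource.newInstance(2048, 2), null, null, Priority.newInstance(1)));

        // Step 3: heartbeat (allocate calls) until the scheduler hands back an allocation
        List<Container> allocated;
        do {
            allocated = rmClient.allocate(0.1f).getAllocatedContainers();
            Thread.sleep(1000);
        } while (allocated.isEmpty());

        // Step 4: the AM would now contact each container's NodeManager (via NMClient) to launch it
        System.out.println("Allocated containers: " + allocated);

        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rmClient.stop();
    }
}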


Resource Request Parameters

An ApplicationMaster sends resource requests with these key parameters:

Parameter       Description
memory          Container memory in MB
vCores          Virtual CPU cores
nodes           Preferred host list (data locality)
racks           Preferred rack list (rack locality)
priority        Request priority (lower value = higher priority)
relaxLocality   Whether to fall back to any node if preferred nodes are unavailable

Locality preference order

YARN tries to honor locality in this order:

  1. Node-local — container on the same node as the data block
  2. Rack-local — container on a node in the same rack
  3. Off-rack — any node in the cluster

For MapReduce, HDFS block locality means the map task ideally runs on a node that already has the block locally — avoiding network transfer entirely.
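
Continuing the AMRMClient sketch from the allocation-flow section, a locality-aware request combining the parameters above might look like this (the host and rack names are placeholders):

// A request that prefers a specific host and rack but may fall back to any node.
// "worker-node-07" and "/rack-2" are placeholder names.
ContainerRequest localityAwareRequest = new ContainerRequest(
        Resource.newInstance(2048, 1),        // memory (MB) + vCores
        new String[] { "worker-node-07" },    // nodes: preferred hosts (node-local)
        new String[] { "/rack-2" },           // racks: preferred racks (rack-local)
        Priority.newInstance(1),              // lower value = higher priority
        true);                                // relaxLocality: allow off-rack fallback
rmClient.addContainerRequest(localityAwareRequest);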


NodeManager Configuration

Configure how much resource each NodeManager exposes to YARN in yarn-site.xml:

<!-- Total memory available to YARN containers on this node -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>49152</value> <!-- 48GB; leave some for OS and system processes -->
</property>

<!-- Total vCores available on this node -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>20</value> <!-- Leave 2-4 for OS -->
</property>

<!-- Minimum container memory allocation -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>

<!-- Maximum container memory allocation -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>16384</value>
</property>

<!-- Minimum container vCores -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>

<!-- Maximum container vCores -->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>8</value>
</property>

Important: Container allocations are rounded up to the nearest increment of minimum-allocation-mb. Requesting 1.5GB on a cluster with 1GB minimum gets you 2GB. Size your minimum allocation to match typical workloads.


Memory Enforcement: Virtual vs Physical

YARN can enforce memory limits in two ways:

Virtual Memory Check (default on)

YARN monitors the virtual memory (VM size) of each container process. If it exceeds yarn.nodemanager.vmem-pmem-ratio times the allocated physical memory, the container is killed:

<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value> <!-- Default; container can use 2.1x its physical memory as virtual memory -->
</property>
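
As a worked example: a container allocated 4096 MB of physical memory with the default 2.1 ratio may use up to 4096 MB × 2.1 ≈ 8601 MB of virtual memory before the NodeManager kills it.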

This is a common source of spurious container kills on modern Linux systems where JVM virtual memory usage appears large. If the NodeManager logs show containers killed with "is running beyond virtual memory limits", either increase the ratio or disable the check:

<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>

Physical Memory Check

<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>true</value>
</property>

Physical memory enforcement kills containers that exceed their allocated memory in RAM. This prevents one container from starving others on the same node.

The most reliable enforcement uses Linux cgroups. The NodeManager enforces both CPU and memory limits at the OS level:

<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
<value>/hadoop-yarn</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
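
Note that LinuxContainerExecutor also requires the setuid container-executor binary to be configured; at minimum, the executor group set in yarn-site.xml must match the group declared in container-executor.cfg. A sketch (the group name hadoop is an assumption; use your cluster's YARN group):

<property>
<name>yarn.nodemanager.linux-container-executor.group</name>
<value>hadoop</value> <!-- must match the group configured in container-executor.cfg -->
</property>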

Scheduler Types

Capacity Scheduler (default)

The Capacity Scheduler divides cluster resources into queues, each with a guaranteed minimum capacity and an optional maximum:

<!-- capacity-scheduler.xml -->
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,production,development</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.production.capacity</name>
<value>60</value> <!-- 60% of cluster -->
</property>
<property>
<name>yarn.scheduler.capacity.root.development.capacity</name>
<value>30</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>10</value>
</property>

Queues can borrow unused capacity from sibling queues (elastic sharing) and return it when the owner queue needs it.
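
To bound that borrowing, each queue can also declare a maximum-capacity. A sketch for the development queue defined above (the 50% cap is illustrative):

<property>
<name>yarn.scheduler.capacity.root.development.maximum-capacity</name>
<value>50</value> <!-- development may elastically grow to at most 50% of the cluster -->
</property>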

Fair Scheduler

The Fair Scheduler distributes resources so that all running applications get an equal share over time. Good for multi-tenant environments with diverse workload sizes:

<!-- yarn-site.xml -->
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
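
Queue weights and minimum shares for the Fair Scheduler live in a separate allocation file (pointed to by yarn.scheduler.fair.allocation.file). A minimal sketch with illustrative queue names and weights:

<!-- fair-scheduler.xml -->
<allocations>
<queue name="production">
<weight>3.0</weight> <!-- production gets 3x development's share when both queues are busy -->
</queue>
<queue name="development">
<weight>1.0</weight>
</queue>
</allocations>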

Diagnosing Container Issues

Application stuck waiting for resources

# Check what resources are available
yarn node -list -all

# Check scheduler queue utilization
yarn queue -status default

# Check ResourceManager logs for why containers aren't being assigned
# Look for: "Application ... is waiting for ..."

Container killed: OOM

# Check NodeManager logs on the host where the container ran
# Look for: "is running beyond physical memory limits"
# Or check YARN application logs:
yarn logs -applicationId application_XXXX_XXXX

Increase the container memory allocation in your job configuration:

# MapReduce: pass at submission time (the driver must use ToolRunner/GenericOptionsParser
# for -D options to apply); my-job.jar and MyDriver are placeholders
hadoop jar my-job.jar MyDriver \
  -Dmapreduce.map.memory.mb=4096 \
  -Dmapreduce.reduce.memory.mb=8192

Container killed: virtual memory

# Quick fix: disable vmem check in yarn-site.xml
# Better fix: understand JVM memory layout on your OS
# The JVM reserves large amounts of virtual address space for mapped libraries

Sizing Containers for Common Workloads

MapReduce

<!-- mapred-site.xml -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value> <!-- 80% of container memory for JVM heap -->
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3276m</value>
</property>

Spark on YARN

spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 20 \
--conf spark.yarn.executor.memoryOverhead=1024 \
my_spark_app.jar

memoryOverhead covers JVM off-heap memory (direct buffers, native libraries). Set it to at least 384MB or 10% of executor memory, whichever is larger. On Spark 2.3 and later the setting is spelled spark.executor.memoryOverhead; spark.yarn.executor.memoryOverhead is the older, deprecated name.
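
With the flags above, each executor asks YARN for 8192 MB of heap plus 1024 MB of overhead = 9216 MB per container (each request is rounded up to a multiple of yarn.scheduler.minimum-allocation-mb), so 20 executors reserve roughly 180 GB of cluster memory.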


Summary

Concept              Key Point
Container            CPU + memory reservation on a NodeManager
AM negotiates        ApplicationMaster requests containers from the ResourceManager
Locality             Node-local > rack-local > off-rack
vmem check           Common source of spurious kills; consider disabling on modern Linux
cgroups              Best enforcement; use LinuxContainerExecutor
Capacity Scheduler   Guaranteed queue capacities with elastic borrowing
Fair Scheduler       Equal sharing over time for multi-tenant clusters
Heap sizing          Set JVM -Xmx to ~80% of container memory

Mastering YARN container allocation turns resource utilization from a guessing game into a predictable, measurable engineering discipline. Profile your actual workloads, size containers deliberately, and use queue priorities to give critical jobs the resources they need.