9 posts tagged with "YARN"

YARN resource management and containers

Apache Spark Key Components Explained: RDDs, DataFrames, Datasets, and the DAG

June 11, 2026 · 9 min read

Big Data Practitioner

If you have ever processed big data, you have almost certainly touched Apache Spark — and behind its speed sit four ideas worth understanding deeply: the RDD, the DataFrame, the Dataset, and the DAG. Master these and Spark stops feeling like magic; you can reason about why a job is fast, why a stage is slow, and how Spark recovers when a node dies.

This article is a from-scratch tour of Spark's key components and its execution model. It pairs with our companion post on how Spark supports big data processing — there we covered the why; here we go under the hood into the what and how.

How Does Apache Spark Support Big Data Processing? In-Memory Speed, DAGs, and a Unified Engine

June 11, 2026 · 7 min read

Bryan

Big Data Practitioner

Apache Spark supports big data processing by combining in-memory computation, distributed execution across a cluster, and a single unified programming model that covers batch, streaming, machine learning, and graph workloads. Where older systems wrote intermediate results to disk at every step, Spark keeps working data in memory and schedules the work through a DAG-based execution engine, which is why it became the default processing layer for modern big data.

This guide explains how Spark actually does this: the architecture that makes it fast, the libraries that make it flexible, and the optimizations that make it efficient — without requiring you to be a distributed-systems expert.

Hadoop YARN Architecture Explained: Components, Workflow, and How It Works

June 2, 2026 · 7 min read

Bryan

Big Data Practitioner

YARN — short for "Yet Another Resource Negotiator" — is the layer that turned Hadoop from a single-purpose MapReduce engine into a general-purpose cluster operating system. Introduced in Hadoop 2.0, it pulled resource management out of MapReduce and made it a service in its own right, so Spark, Flink, Tez, and batch MapReduce could all share the same cluster.

This guide breaks down the YARN architecture in plain terms: the daemons that run it, how a job flows through the system from submission to shutdown, and the real-world strengths and trade-offs of running YARN.

What Is Hadoop? A Plain-English Guide to Big Data's Foundational Framework

June 2, 2026 · 9 min read

Bryan

Big Data Practitioner

Apache Hadoop is an open-source framework that stores and processes enormous datasets by spreading the work across a cluster of ordinary computers instead of relying on one expensive machine. If a single server would buckle under the volume, Hadoop splits the data into pieces, hands each piece to a different node, and lets them all work in parallel.

This guide explains what Hadoop is in plain language: where it came from, the four components that make it tick, what people actually use it for, its strengths and weaknesses, and a practical path to learning it in 2026.

Hadoop 3 Features and Enhancements: A Deep Dive (2026)

May 22, 2026 · 12 min read

Hadoop.so Editorial Team

Big Data Engineers

Apache Hadoop 3 was the first release in nearly a decade that made operators rethink how they buy storage. Erasure coding cut disk overhead from 200% to 50%. NameNode HA grew from a single standby to multiple standbys. The MapReduce shuffle path moved into native code. YARN learned to manage long-running services and Docker workloads. And every default port that lived in the Linux ephemeral range was moved out of it.
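The two storage figures above follow from simple arithmetic. The sketch below assumes the 200% figure refers to default 3x replication and the 50% figure to a Reed-Solomon layout with 6 data cells and 3 parity cells per stripe; the excerpt does not name the policy, and the helper `storage_overhead_pct` is a hypothetical name used only for illustration.

```python
def storage_overhead_pct(data_units: int, redundancy_units: int) -> float:
    """Extra raw storage needed, as a percentage of the logical data size."""
    return 100.0 * redundancy_units / data_units


# Assumption: 3x replication, i.e. 1 original block plus 2 extra copies.
replication_overhead = storage_overhead_pct(data_units=1, redundancy_units=2)

# Assumption: Reed-Solomon with 6 data cells plus 3 parity cells per stripe.
erasure_coding_overhead = storage_overhead_pct(data_units=6, redundancy_units=3)

print(f"3x replication: {replication_overhead:.0f}% overhead")
print(f"RS(6,3) erasure coding: {erasure_coding_overhead:.0f}% overhead")
```

Under these assumptions, 1 PB of logical data needs about 3 PB of raw disk with replication and about 1.5 PB with erasure coding.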

Several years after the 3.0 GA, the Hadoop 3.3 and 3.4 lines are the de facto on-prem standard, and most cloud Hadoop distributions (EMR, Dataproc, HDInsight, CDP) ship a 3.x core. This deep dive walks through every major feature in the Hadoop 3 line (what changed, why it matters, and where the trade-offs hide) and ends with a side-by-side Hadoop 2.x vs 3.x comparison table.

What Is a Hadoop Cluster? Architecture, Sizing, and Best Practices

May 7, 2026 · 10 min read

Hadoop.so Editorial Team

Big Data Engineers

A Hadoop cluster is a network of commodity servers working in concert to store and process massive datasets that would be impractical to handle on a single machine. Understanding how a cluster is structured — and how to size and operate it properly — is essential knowledge for any big data engineer.

Upgrading from Hadoop 2 to Hadoop 3: A Complete How-To

April 24, 2026 · 5 min read

Hadoop.so Editorial Team

Big Data Engineers

Hadoop 3.x introduced erasure coding, YARN Timeline Service v2, support for more than two NameNodes, and significant performance improvements. If you're still running Hadoop 2.x, this guide walks through a safe, rolling upgrade path that avoids data loss and extended downtime.

YARN Containers Deep Dive: How Resource Allocation Really Works

April 20, 2026 · 6 min read

Hadoop.so Editorial Team

Big Data Engineers

YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource management layer. Understanding how YARN allocates containers, the fundamental unit of resource allocation, is essential for getting good utilization and avoiding the frustrating "application is waiting for resources" message that plagues many clusters.

YARN vs Kubernetes: Which Should Orchestrate Your Big Data Workloads?

April 19, 2026 · 6 min read

Hadoop.so Editorial Team

Big Data Engineers

Kubernetes has become the default orchestration platform for containerized applications. But should you migrate your Hadoop workloads off YARN onto Kubernetes? The answer depends heavily on your workload patterns, team expertise, and existing infrastructure. This post compares both platforms head-to-head.