Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more
powerful Fair Scheduler. In addition to doing nearly all that it could
do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule
non-MapReduce jobs, schedule based on fine-grained memory instead of
slots, and support hierarchical queues. In this post, you’ll learn what
the Fair Scheduler’s role is and how it fulfills it, what it means to be
a YARN “scheduler,” and what its new features are and how to get them
running on your cluster.
YARN/MR2 vs. MR1
YARN uses updated terminology to reflect that it no longer manages
resources for MapReduce alone. From YARN’s perspective, a MapReduce
job is an application. YARN schedules containers for map and reduce
tasks to live in. What were referred to as pools in the MR1 Fair
Scheduler are now called queues, for consistency with the Capacity
Scheduler. An excellent and deeper explanation is available here.
How Does it Work?
How a Hadoop scheduler functions is often a source of confusion, so
we’ll start with a short overview of what the Fair Scheduler does and
how it works.
A Hadoop scheduler is responsible for deciding which tasks run where
and when. The Fair Scheduler,
originally developed at Facebook, seeks to promote fairness between
schedulable entities by awarding free space to those that are the most
underserved. (Cloudera recommends the Fair Scheduler for its wide set of
features and ease of use, and Cloudera Manager sets it as the
default. More than 95% of Cloudera’s customers use it.)
In Hadoop, the scheduler is a pluggable piece of code that lives
inside the ResourceManager (the JobTracker, in MR1), the central
execution-managing service. The ResourceManager constantly receives
updates from the NodeManagers that sit on each node in the cluster, saying
“What’s up, here are all the tasks I was running that just completed, do
you have any work for me?” The ResourceManager passes these updates to
the scheduler, and the scheduler then decides what new tasks, if any, to
assign to that node.
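To make that flow concrete, here is a rough sketch of the heartbeat-driven
loop. The class and interface names below (NodeUpdate, Scheduler, and so
on) are simplified stand-ins for illustration, not the real YARN APIs.

    import java.util.List;

    // Simplified stand-in for what a NodeManager reports on each heartbeat.
    class NodeUpdate {
      String nodeId;
      List<String> completedContainers; // containers that finished since the last heartbeat
      long freeMemoryMb;                // capacity the node currently has available
    }

    // The scheduler is pluggable: the ResourceManager forwards every
    // NodeManager heartbeat to whichever scheduler implementation is configured.
    interface Scheduler {
      void handleNodeUpdate(NodeUpdate update);
    }

    class FairSchedulerSketch implements Scheduler {
      @Override
      public void handleNodeUpdate(NodeUpdate update) {
        // 1. Release the resources held by the containers that just completed.
        // 2. While the node still has spare capacity, pick the most underserved
        //    queue and application (see the selection sketch below) and assign
        //    it a new container on this node.
      }
    }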
How does the scheduler decide? For the Fair Scheduler, it’s simple:
every application belongs to a “queue”, and we give a container to the
queue that has the fewest resources allocated to it right now. Within
that queue, we offer it to the application that has the fewest resources
allocated to it right now. The Fair Scheduler supports a number of
features that modify this a little, like weights on queues, minimum
shares, maximum shares, and FIFO policy within queues, but the basic
idea remains the same.
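In code, that decision boils down to repeatedly picking the entity with
the smallest current allocation, scaled by its weight where one is set.
The sketch below is only an illustration of that idea (memory is the only
resource considered, and the Queue and App classes are hypothetical), not
the actual FairScheduler implementation.

    import java.util.Comparator;
    import java.util.List;

    // Illustrative only: pick the most underserved queue, then the most
    // underserved application within it, exactly as described above.
    class FairShareSelection {

      static class App {
        long allocatedMb;      // resources this application currently holds
      }

      static class Queue {
        double weight = 1.0;   // queues with higher weight deserve a larger share
        long allocatedMb;      // sum of its applications' current allocations
        List<App> apps;
      }

      // Returns the application that should receive the next free container.
      static App selectNext(List<Queue> queues) {
        // The queue with the least allocation relative to its weight goes first.
        Queue neediest = queues.stream()
            .min(Comparator.comparingDouble(q -> q.allocatedMb / q.weight))
            .orElseThrow(IllegalStateException::new);

        // Within that queue, the application with the least allocation goes first.
        return neediest.apps.stream()
            .min(Comparator.comparingLong(a -> a.allocatedMb))
            .orElseThrow(IllegalStateException::new);
      }
    }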
Beyond MapReduce
In MR1, the Fair Scheduler was purely a MapReduce scheduler. If you
wanted to run multiple parallel computation frameworks on the same
cluster, you would have to statically partition resources — or cross
your fingers and hope that the resources given to a MapReduce job
wouldn’t also be handed out by the other framework’s scheduler,
causing OSes to thrash. With YARN, the same scheduler can manage
resources for different applications on the same cluster, which should
allow for more multi-tenancy and a richer, more diverse Hadoop
ecosystem.
Scheduling Resources, Not Slots
A big change in the YARN Fair Scheduler is how it defines a
“resource”. In MR1, the basic unit of scheduling was the “slot”, an
abstraction of a space for a task on a machine in the cluster. Because
YARN expects to schedule jobs with heterogeneous task resource requests,
it instead allows containers to request variable amounts of memory and
schedules based on those. Cluster resources no longer need to be
partitioned into map and reduce slots, meaning that a large job can use
all the resources in the cluster in its map phase and then do so again
in its reduce phase. This allows for better utilization of the cluster,
better treatment of tasks with high resource requests, and more
portability of jobs between clusters — a developer no longer needs to
worry about a slot meaning different things on different clusters;
rather, they can request concrete resources to satisfy their jobs’
needs. Additionally, work is being done (YARN-326) that will allow the
Fair Scheduler to schedule based on CPU requirements and availability as
well.
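For example, an MR2 job can ask for concrete per-task memory through its
configuration. A minimal sketch, assuming the standard Hadoop 2.x
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MemoryRequestExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Request concrete container sizes instead of thinking in "slots":
        // 1GB for each map task's container, 2GB for each reduce task's.
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.setInt("mapreduce.reduce.memory.mb", 2048);

        Job job = Job.getInstance(conf, "memory-request-example");
        // ... set the mapper, reducer, and input/output paths as usual,
        // then submit; the scheduler hands out containers sized to the requests.
      }
    }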
An implementation detail of this change that prevents applications
from starving under this new flexibility is the notion of reserved
containers. Imagine two jobs are running that each have enough tasks to
saturate more than the entire cluster. One job wants each of its mappers
to get 1GB, and another job wants its mappers to get 2GB. Suppose the
first job starts and fills up the entire cluster. Whenever one of its
tasks finishes, it will free up 1GB on some node. Even though the second job
deserves the space, a naive policy will give it to the first one
because it’s the only job with tasks that fit. This could cause the
second job to be starved indefinitely.
To prevent this unfortunate situation, when space on a node is offered
to an application that cannot immediately use it, the application
reserves it, and no other application can be allocated a container on
that node until the reservation is fulfilled. Each node may have only
one reserved container. The total reserved memory amount is reported in
the ResourceManager UI. A high number means that it may take longer for
new jobs to get space.
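The rule described above can be summed up in a few lines. The sketch
below uses hypothetical classes (NodeSketch, AppSketch), not the real
FairScheduler code: if the neediest application’s request does not fit in
the node’s free space, the node is reserved for it rather than handed to
whatever happens to fit.

    // Illustrative only: a minimal model of per-node container reservations.
    class AppSketch {}

    class NodeSketch {
      long freeMb;
      AppSketch reservedFor;   // at most one reserved container per node

      // Try to place a container of requestMb for app on this node.
      boolean offer(AppSketch app, long requestMb) {
        if (reservedFor != null && reservedFor != app) {
          return false;        // node is held for another application's pending request
        }
        if (requestMb <= freeMb) {
          freeMb -= requestMb;
          reservedFor = null;  // a fulfilled reservation is cleared
          return true;
        }
        // Not enough room yet (for example, a 2GB request against a 1GB gap):
        // reserve the node so that smaller requests cannot keep taking the space.
        reservedFor = app;
        return false;
      }
    }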