Monday, July 15, 2013


Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly all that it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule non-MapReduce jobs, schedule based on fine-grained memory instead of slots, and support hierarchical queues. In this post, you’ll learn what the Fair Scheduler’s role is and how it fulfills it, what it means to be a YARN “scheduler,” and dive into its new features and how to get them running on your cluster.

YARN/MR2 vs. MR1

YARN uses an updated terminology to reflect that it no longer just manages resources for MapReduce. From YARN’s perspective, a MapReduce job is an application. YARN schedules containers for map and reduce tasks to live in. What was referred to as pools in the MR1 Fair Scheduler has been updated to queue for consistency with the capacity scheduler. An excellent and deeper explanation is available here.

How Does it Work?

How a Hadoop scheduler functions can often be confusing, so we’ll start with a short overview of what the Fair Scheduler does and how it works.
A Hadoop scheduler is responsible for deciding which tasks get to run where and when to run them. The Fair Scheduler, originally developed at Facebook, seeks to promote fairness between schedulable entities by awarding free space to those that are the most underserved. (Cloudera recommends the Fair Scheduler for its wide set of features and ease of use, and Cloudera Manager sets it as the default. More than 95% of Cloudera’s customers use it.)

In Hadoop, the scheduler is a pluggable piece of code that lives inside ResourceManager (the JobTracker, in MR1) the central execution managing service. The ResourceManager is constantly receiving updates from the NodeManagers that sit on each node in the cluster, that say “What’s up, here are all the tasks I was running that just completed, do you have any work for me?” The ResourceManager passes these updates to the scheduler, and the scheduler then decides what new tasks, if any, to assign to that node.
How does the scheduler decide? For the Fair Scheduler, it’s simple: every application belongs to a “queue”, and we give a container to the queue that has the fewest resources allocated to it right now. Within that queue, we offer it to the application that has the fewest resources allocated to it right now. The Fair Scheduler supports a number of features that modify this a little, like weights on queues, minimum shares, maximum shares, and FIFO policy within queues, but the basic idea remains the same.

Beyond MapReduce

In MR1, the Fair Scheduler was purely a MapReduce scheduler. If you wanted to run multiple parallel computation frameworks on the same cluster, you would have to statically partition resources — or cross your fingers and hope that the resources given to a MapReduce job wouldn’t also be given to something else by that framework’s scheduler, causing OSes to thrash. With YARN, the same scheduler can manage resources for different applications on the same cluster, which should allow for more multi-tenancy and a richer, more diverse Hadoop ecosystem.

Scheduling Resources, Not Slots

A big change in the YARN Fair Scheduler is how it defines a “resource”. In MR1, the basic unit of scheduling was the “slot”, an abstraction of a space for a task on a machine in the cluster. Because YARN expects to schedule jobs with heterogeneous task resource requests, it instead allows containers to request variable amounts of memory and schedules based on those. Cluster resources no longer need to be partitioned into map and reduce slots, meaning that a large job can use all the resources in the cluster in its map phase and then do so again in its reduce phase. This allows for better utilization of the cluster, better treatment of tasks with high resource requests, and more portability of jobs between clusters — a developer no longer needs to worry about a slot meaning different things on different clusters; rather, they can request concrete resources to satisfy their jobs’ needs. Additionally, work is being done (YARN-326) that will allow the Fair Scheduler to schedule based on CPU requirements and availability as well.
An implementation detail of this change that prevents applications from starving under this new flexibility is the notion of reserved containers. Imagine two jobs are running that each have enough tasks to saturate more than the entire cluster. One job wants each of its mappers to get 1GB, and another job wants its mappers to get 2GB. Suppose the first job starts and fills up the entire cluster. Whenever one of its task finishes, it will leave open a 1GB slot. Even though the second job deserves the space, a naive policy will give it to the first one because it’s the only job with tasks that fit. This could cause the second job to be starved indefinitely.
To prevent this unfortunate situation, when space on a node is offered to an application, if the application cannot immediately use it, it reserves it, and no other application can be allocated a container on that node until the reservation is fulfilled. Each node may have only one reserved container. The total reserved memory amount is reported in the ResourceManager UI. A high number means that it may take longer for new jobs to get space.

No comments:

Post a Comment