The first question people raise when getting started with Hadoop is how to select appropriate hardware for their Hadoop cluster. This blog post describes the various factors that Hadoop administrators take into account, and we encourage others to chime in with their experience configuring production Hadoop clusters. Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. For example, users with IO-intensive workloads will invest in more spindles per core. In this blog post we’ll discuss workload evaluation and the critical role it plays in hardware selection.
Marrying storage and compute
Over the past decade IT organizations have standardized on blades and SANs (Storage Area Networks) to satisfy their grid and processing-intensive workloads. While this model makes a lot of sense for a number of standard applications such as web servers, app servers, smaller structured databases and simple ETL (Extract, Transform, Load), the infrastructure requirements have been changing as the amount of data and the number of users have grown. Web servers now have caching tiers, databases have gone massively parallel with local disk, and ETL jobs are pushing more data than they can handle locally. Hardware vendors have created innovative systems to address these requirements, including storage blades, SAS (Serial Attached SCSI) switches, external SATA arrays and larger-capacity rack units.
Hadoop was designed based on a new approach to storing and processing complex data. Instead of relying on a SAN for massive storage and reliability and then moving the data to a collection of blades for processing, Hadoop handles large data volumes and reliability in the software tier. Hadoop distributes data across a cluster of balanced machines and uses replication to ensure data reliability and fault tolerance. Because data is distributed on machines with compute power, processing can be sent directly to the machines storing the data. Since each machine in a Hadoop cluster both stores and processes data, these machines need to be configured to satisfy both data storage and processing requirements.
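To make the storage side of that design concrete, here is a minimal sketch (assuming the classic Hadoop Java client API and a made-up file path) of a client writing a file whose blocks HDFS replicates across datanodes; that replication factor is what provides the reliability described above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml if they are on the classpath.
        Configuration conf = new Configuration();

        // dfs.replication controls how many datanodes hold a copy of each block;
        // three copies is the usual default and is what gives HDFS fault tolerance.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Every block of this (hypothetical) file ends up on three datanodes,
        // so a map task can usually be scheduled next to a local copy.
        FSDataOutputStream out = fs.create(new Path("/tmp/replication-demo.txt"));
        out.writeUTF("data lives with the compute");
        out.close();
        fs.close();
    }
}
```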
Why workloads matter
In nearly all cases, a MapReduce job will either be constrained by reading data from disk or from the network (an IO-bound job) or by processing data (a CPU-bound job). An example of an IO-bound job is sorting, which requires very little processing (simple comparisons) and a lot of reading and writing to disk. An example of a CPU-bound job is classification, where some input data is processed in very complex ways to determine an ontology. A small sketch contrasting the two kinds of mapper follows the lists below.
Here are several more examples of IO-bound workloads:
- Indexing
- Searching
- Grouping
- Decoding/decompressing
- Data importing and exporting
Here are several more examples of CPU-bound workloads:
- Machine learning
- Complex text mining
- Natural language processing
- Feature extraction
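To make the distinction concrete, here is a minimal sketch using the classic org.apache.hadoop.mapred API (the class names are hypothetical). The first mapper simply passes records through, as the map side of a sort does, so the job is dominated by disk and network IO; the second burns CPU on every record, so the cores become the bottleneck while the disks sit idle:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// IO-bound: the map side of a sort does almost no work per record.
public class PassThroughMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        output.collect(key, value);
    }
}

// CPU-bound: a contrived mapper that hashes each record many times before
// emitting it, so throughput is limited by the cores rather than the disks.
class ExpensiveScoringMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        long score = 0;
        for (int i = 0; i < 100000; i++) {
            score = 31 * score + value.toString().hashCode() + i;
        }
        output.collect(value, new LongWritable(score));
    }
}
```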
It is straightforward to measure live workloads and determine bottlenecks by putting thorough monitoring in place on the Hadoop cluster. We recommend installing Ganglia on all Hadoop machines to provide real-time statistics about CPU, disk, and network load. With Ganglia installed, a Hadoop administrator can then run MapReduce jobs and check the Ganglia dashboard to see how each machine is performing.
In addition to building out a cluster appropriate for the workload, we encourage our customers to work with a hardware vendor to understand the economics of power and cooling. Since Hadoop runs on tens, hundreds, or thousands of nodes, an operations team can save a significant amount of money by investing in power-efficient hardware. Each hardware vendor can provide tools and recommendations for monitoring power and cooling.
How to pick hardware for your Hadoop cluster
The first step in choosing a machine configuration is to understand the type of hardware your operations team already manages. Operations teams often have opinions about new machine purchases and will prefer to work with hardware they are already familiar with; Hadoop is not the only system that benefits from efficiencies of scale. If you are new to Hadoop and do not yet understand your workload, plan on using balanced hardware for the initial cluster.
There are four types of nodes in a basic Hadoop cluster. We refer here to a node as a machine performing a particular task. Most of the machines will function as both datanodes and tasktrackers; as described above, these nodes both store data and perform processing functions. We recommend the following specifications for datanodes/tasktrackers in a balanced Hadoop cluster (a sketch of how these numbers map onto Hadoop's task-slot settings follows the list):
- Four 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
- Two quad-core CPUs, running at least 2-2.5GHz
- 16-24GB of RAM (24-32GB if you're considering HBase)
- Gigabit Ethernet
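One way these numbers feed back into Hadoop is through the tasktracker's slot counts. The sketch below is a rough starting point rather than a prescription, assuming the Hadoop 1.x slot properties and the balanced spec above (8 cores and 4 disks per machine); the actual values belong in mapred-site.xml on each tasktracker:

```java
import org.apache.hadoop.conf.Configuration;

public class SlotSizingSketch {
    public static void main(String[] args) {
        // Illustrative numbers matching the balanced spec above:
        // two quad-core CPUs (8 cores) and 4 data disks per datanode/tasktracker.
        int cores = 2 * 4;
        int disks = 4;

        // A rough starting point: about one task slot per core, biased toward
        // map slots, while keeping the spindle count in mind so slots do not
        // all contend for the same disks. Benchmarking should refine these.
        int mapSlots = Math.min(cores, 2 * disks);   // 8
        int reduceSlots = Math.max(2, cores / 4);    // 2

        // These properties are read by each tasktracker from its own
        // mapred-site.xml; the Configuration object here only demonstrates
        // the property names and the derived values.
        Configuration conf = new Configuration(false);
        conf.setInt("mapred.tasktracker.map.tasks.maximum", mapSlots);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", reduceSlots);

        System.out.println(conf.get("mapred.tasktracker.map.tasks.maximum") + " map slots, "
                + conf.get("mapred.tasktracker.reduce.tasks.maximum") + " reduce slots");
    }
}
```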
When you expect your Hadoop cluster to grow beyond 20 machines, we recommend configuring the initial cluster as if it were to span two racks, where each rack has a top-of-rack gigabit switch and those switches are connected with a 10 GigE interconnect or core switch. Having two logical racks gives the operations team a better understanding of the network requirements for intra-rack and cross-rack communication.
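If you do lay the cluster out across two logical racks, tell Hadoop about it so block replicas span both racks and tasks can be scheduled rack-locally. Below is a minimal sketch, assuming the Hadoop 1.x org.apache.hadoop.net.DNSToSwitchMapping interface and a hypothetical hostname convention that encodes the rack; it would be plugged in via topology.node.switch.mapping.impl, though a topology script configured with topology.script.file.name is the more common route:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

// Hypothetical mapping for a two-rack cluster whose hostnames look like
// "rack1-node07" or "rack2-node03".
public class TwoRackMapping implements DNSToSwitchMapping {
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>(names.size());
        for (String name : names) {
            if (name.startsWith("rack1-")) {
                racks.add("/rack1");
            } else if (name.startsWith("rack2-")) {
                racks.add("/rack2");
            } else {
                racks.add("/default-rack"); // anything we do not recognize
            }
        }
        return racks;
    }
}
```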
With a Hadoop cluster in place, the team can start identifying workloads and prepare to benchmark those workloads to identify CPU and IO bottlenecks. After some time benchmarking and monitoring, the team will have a good understanding of how additional machines should be configured. Heterogeneous Hadoop clusters are common, especially as they grow in size, so starting with a set of machines that are not perfect for your workload will not be a waste.
Below is a list of various hardware configurations for different workloads, including our original “base” recommendation:
- Light Processing Configuration (1U/machine): Two quad-core CPUs, 8GB memory, and 4 disk drives (1TB or 2TB). Note that CPU-intensive work such as natural language processing involves loading large models into RAM before processing data and should be configured with 2GB RAM/core instead of 1GB RAM/core.
- Balanced Compute Configuration (1U/machine): Two quad-core CPUs, 16-24GB memory, and 4 disk drives (1TB or 2TB) directly attached using the motherboard controller. These are often available as twins, with two motherboards and 8 drives in a single 2U cabinet.
- Storage Heavy Configuration (2U/machine): Two quad-core CPUs, 16-24GB memory, and 12 disk drives (1TB or 2TB). Power consumption for this type of machine starts at around 200W at idle and can reach roughly 350W when active.
- Compute Intensive Configuration (2U/machine): Two quad-core CPUs, 48-72GB memory, and 8 disk drives (1TB or 2TB). These are often used when a combination of large in-memory models and heavy reference-data caching is required.
Other hardware considerations
When we encounter applications that produce large amounts of intermediate data (on the order of the same amount as is read in), we recommend two ports on a single Ethernet card or two channel-bonded Ethernet cards to provide 2 Gbps per machine. Alternatively, for customers who have already moved to 10 Gigabit Ethernet or InfiniBand, these solutions can be used to address network-bound workloads. Be sure that your operating system and BIOS are compatible if you're considering switching to 10 Gigabit Ethernet.
When computing memory requirements, factor in that Java uses up to 10% of it for managing the virtual machine. We recommend configuring Hadoop to use strict heap size restrictions in order to avoid memory swapping to disk. Swapping greatly impacts MapReduce job performance and can be avoided by configuring machines with more RAM.
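Here is a minimal sketch of that heap budgeting, with purely illustrative numbers (16GB of RAM and the slot counts from the earlier sketch), assuming the Hadoop 1.x mapred.child.java.opts property:

```java
import org.apache.hadoop.conf.Configuration;

public class HeapBudgetSketch {
    public static void main(String[] args) {
        // Illustrative budget for a 16GB datanode/tasktracker, not a prescription:
        // leave roughly 2GB for the OS plus the DataNode and TaskTracker daemons,
        // which leaves about 14GB for task JVMs. With 8 map slots and 2 reduce
        // slots, that is a little over 1GB per task, so cap each child JVM at 1GB.
        int totalRamMb = 16 * 1024;
        int reservedMb = 2 * 1024;
        int taskSlots = 8 + 2;
        int perTaskMb = (totalRamMb - reservedMb) / taskSlots; // ~1433MB available
        int childHeapMb = Math.min(perTaskMb, 1024);           // keep a safety margin

        // mapred.child.java.opts is passed to every task JVM; a fixed -Xmx keeps
        // the cluster from overcommitting memory and swapping to disk.
        Configuration conf = new Configuration(false);
        conf.set("mapred.child.java.opts", "-Xmx" + childHeapMb + "m");

        System.out.println("child JVM opts: " + conf.get("mapred.child.java.opts"));
    }
}
```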
It is also important to optimize RAM for the memory channel width. For example, when using dual-channel memory each machine should be configured with pairs of DIMMs, and with triple-channel memory each machine should have triplets of DIMMs. This means a machine might end up with 18GB (9x2GB) of RAM instead of 16GB (4x4GB).