Wednesday, July 31, 2013

Java Interview questions for Hadoop developer


Q1. Explain difference of Class Variable and Instance Variable and how are they declared in Java

Ans: A class variable is a variable declared with the static modifier; an instance variable is a variable declared in a class without the static modifier.
The main difference is that memory for class variables is allocated only once, when the class is first loaded into memory. Class variables therefore do not depend on objects of that class: no matter how many objects are created, only one copy of each class variable exists, created at class-loading time. Instance variables, by contrast, get a separate copy in every object.
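A minimal sketch to illustrate the difference (the Counter class and its fields are invented for this example):

    public class Counter {
        // Class variable: one copy shared by all Counter objects,
        // allocated when the class is loaded.
        static int instancesCreated = 0;

        // Instance variable: each Counter object gets its own copy.
        int value = 0;

        public Counter() {
            instancesCreated++;
        }

        public static void main(String[] args) {
            Counter a = new Counter();
            Counter b = new Counter();
            a.value = 5;
            System.out.println(b.value);                  // 0 (separate copies)
            System.out.println(Counter.instancesCreated); // 2 (single shared copy)
        }
    }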

Q2. Explain Encapsulation, Inheritance and Polymorphism

Ans: Encapsulation is the process of binding or wrapping the data and the code that operates on that data into a single entity. This keeps the data safe from outside interference and misuse. One way to think about encapsulation is as a protective wrapper that prevents code and data from being arbitrarily accessed by other code defined outside the wrapper.
Inheritance is the process by which one object acquires the properties of another object.
Polymorphism means, roughly, "one name, many forms". Polymorphism enables one entity to be used as a general category for different types of actions; the specific action is determined by the exact nature of the situation. The concept is often summarized as "one interface, multiple methods".
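A small sketch of "one interface, multiple methods" (the Shape hierarchy below is invented for illustration):

    abstract class Shape {
        abstract double area();                       // one interface...
    }

    class Circle extends Shape {
        private final double r;
        Circle(double r) { this.r = r; }
        double area() { return Math.PI * r * r; }     // ...one implementation
    }

    class Square extends Shape {
        private final double side;
        Square(double side) { this.side = side; }
        double area() { return side * side; }         // ...another implementation
    }

    public class PolymorphismDemo {
        public static void main(String[] args) {
            Shape[] shapes = { new Circle(1.0), new Square(2.0) };
            for (Shape s : shapes) {
                // The JVM picks the right area() at runtime (dynamic dispatch).
                System.out.println(s.area());
            }
        }
    }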

Q3. Explain garbage collection?

Ans: Garbage collection is one of the most important features of Java.
Garbage collection is also called automatic memory management, because the JVM automatically removes objects that are no longer referenced from memory. A user program cannot directly free an object; instead, it is the job of the garbage collector to automatically reclaim objects that are no longer referenced by the program. Every class inherits the finalize() method from java.lang.Object; the garbage collector calls finalize() when it determines that no more references to the object exist. In Java it is a good idea to explicitly assign null to a variable that is no longer in use, so the object it referred to becomes eligible for collection.
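A minimal sketch of making an object eligible for collection; note that System.gc() is only a hint and the JVM gives no guarantee about whether or when finalize() runs:

    public class GcDemo {
        private final byte[] payload = new byte[1024 * 1024]; // some memory to reclaim

        @Override
        protected void finalize() {
            // Invoked by the garbage collector once the object is unreachable;
            // never call it yourself.
            System.out.println("GcDemo instance collected");
        }

        public static void main(String[] args) throws InterruptedException {
            GcDemo d = new GcDemo();
            d = null;          // no references remain: the object is now eligible for GC
            System.gc();       // a hint only; the JVM may ignore it
            Thread.sleep(500); // give the collector a chance to run
        }
    }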

Q4. What are the similarities/differences between an abstract class and an interface?

Ans: Differences
- Interfaces provide a form of multiple inheritance, since a class can extend only one other class.
- Interfaces are limited to public methods and constants with no implementation. Abstract classes can have a partial implementation, protected parts, static methods, etc.
- A class may implement several interfaces, but it may extend only one abstract class.
- Interface calls can be slightly slower, as they require extra indirection to find the corresponding method in the actual class; abstract class calls are faster.
Similarities
- Neither abstract classes nor interfaces can be instantiated.
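A brief sketch of the distinction (the types below are invented for illustration; note that default methods in interfaces only arrived later, in Java 8):

    // Interface: public abstract methods and constants only (pre-Java 8).
    interface Compressor {
        int BUFFER_SIZE = 4096;          // implicitly public static final
        byte[] compress(byte[] input);   // implicitly public abstract
    }

    // Abstract class: may mix implemented and abstract members.
    abstract class AbstractCodec {
        protected String name;                    // protected state is allowed
        abstract byte[] encode(byte[] input);     // left to subclasses
        public byte[] encodeTwice(byte[] input) { // partial implementation
            return encode(encode(input));
        }
    }

    // A class can implement several interfaces but extend only one class.
    class GzipCodec extends AbstractCodec implements Compressor {
        byte[] encode(byte[] input) { return input; }                  // placeholder body
        public byte[] compress(byte[] input) { return encode(input); }
    }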

Q5. What are the different ways to make your class multithreaded in Java?

Ans: There are two ways to create new kinds of threads:
- Define a new class that extends the Thread class
- Define a new class that implements the Runnable interface, and pass an object of that class to a Thread’s constructor.
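Both approaches in a minimal sketch:

    // Approach 1: extend Thread and override run().
    class Worker extends Thread {
        @Override
        public void run() {
            System.out.println("running in " + getName());
        }
    }

    // Approach 2: implement Runnable and hand an instance to a Thread.
    class Task implements Runnable {
        @Override
        public void run() {
            System.out.println("running in " + Thread.currentThread().getName());
        }
    }

    public class ThreadDemo {
        public static void main(String[] args) {
            new Worker().start();           // starts a new thread
            new Thread(new Task()).start(); // Runnable passed to Thread's constructor
        }
    }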

Q6. What do you understand by Synchronization? How do you synchronize a method call in Java? How do you synchronize a block of code in Java?

Ans: Synchronization is the process of controlling access to shared resources by multiple threads in such a manner that only one thread can access a resource at a time. In a non-synchronized multithreaded application, it is possible for one thread to modify a shared object while another thread is in the process of using or updating the object’s value. Synchronization prevents this kind of data corruption.
- Synchronizing a method: put the keyword synchronized in the method declaration.
- Synchronizing a block of code inside a method: put the block of code in synchronized (this) { Some Code }
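Both forms in a minimal sketch (the BankAccount class is invented for the example):

    public class BankAccount {
        private int balance = 0;

        // Synchronized method: the lock is the BankAccount instance itself.
        public synchronized void deposit(int amount) {
            balance += amount;
        }

        // Synchronized block: only the critical section holds the lock.
        public void withdraw(int amount) {
            // ...non-critical work can happen here without holding the lock...
            synchronized (this) {
                if (balance >= amount) {
                    balance -= amount;
                }
            }
        }
    }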

Q7. What is a transient variable?

Ans: A transient variable cannot be serialized. For example, if a variable is declared as transient in a Serializable class and the object is written to an ObjectOutputStream, the value of the variable is not written to the stream; when the object is read back from the ObjectInputStream, the variable takes the default value for its type (null for object references).
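A minimal sketch (the class and field names are invented for the example):

    import java.io.Serializable;

    public class Session implements Serializable {
        private static final long serialVersionUID = 1L;

        private String userId;             // serialized as usual
        private transient String password; // skipped during serialization

        public Session(String userId, String password) {
            this.userId = userId;
            this.password = password;
        }
        // After a serialize/deserialize round trip, userId is restored
        // but password comes back as null (the default for references).
    }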

Q8. What is the Properties class in Java? Which class does it extend?

Ans: The Properties class represents a persistent set of properties. The properties can be saved to a stream or loaded from a stream. Each key and its corresponding value in the property list is a string. Properties extends java.util.Hashtable.
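A small usage sketch (the file name and property key are made up for the example):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.Properties;

    public class PropertiesDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("fs.defaultFS", "hdfs://localhost:9000");

            // Save the property list to a stream...
            try (FileOutputStream out = new FileOutputStream("cluster.properties")) {
                props.store(out, "sample settings");
            }

            // ...and load it back from a stream.
            Properties loaded = new Properties();
            try (FileInputStream in = new FileInputStream("cluster.properties")) {
                loaded.load(in);
            }
            System.out.println(loaded.getProperty("fs.defaultFS"));
        }
    }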

Q9. Explain the concept of shallow copy vs deep copy in Java

Ans: In a shallow copy, the cloned object refers to the same objects as the original, because only the object references are copied and not the referred objects themselves.
In a deep copy, a clone of the object and of all the objects it refers to is made.
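A sketch of the difference (the Employee and Address classes are invented for illustration):

    class Address implements Cloneable {
        String city;
        Address(String city) { this.city = city; }
        @Override
        protected Address clone() throws CloneNotSupportedException {
            return (Address) super.clone();
        }
    }

    class Employee implements Cloneable {
        String name;
        Address address;
        Employee(String name, Address address) { this.name = name; this.address = address; }

        // Shallow copy: the clone shares the same Address object.
        Employee shallowCopy() throws CloneNotSupportedException {
            return (Employee) super.clone();
        }

        // Deep copy: the referred Address is cloned as well.
        Employee deepCopy() throws CloneNotSupportedException {
            Employee copy = (Employee) super.clone();
            copy.address = this.address.clone();
            return copy;
        }
    }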

Q10. How can you make a shallow copy of an object in Java?

Ans: Use the clone() method inherited from the Object class.

Q11. How would you make a copy of an entire Java object (deep copy) with its state?

Ans: Have the class implement the Cloneable interface and override clone() so that the objects it refers to are copied as well; by default, Object.clone() produces only a shallow copy.
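Reusing the Employee and Address classes from the Q9 sketch above, a deep copy keeps its state independent of the original, whereas a shallow copy does not:

    public class CopyDemo {
        public static void main(String[] args) throws CloneNotSupportedException {
            Employee original = new Employee("Asha", new Address("Pune"));

            Employee shallow = original.shallowCopy();
            Employee deep = original.deepCopy();

            original.address.city = "Mumbai";
            System.out.println(shallow.address.city); // "Mumbai" -- shared Address
            System.out.println(deep.address.city);    // "Pune"   -- independent copy
        }
    }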

Thursday, July 25, 2013

More questions

1. Does Hadoop require SSH?

2. I am seeing connection refused in the logs. How do I troubleshoot this?

3. Why do I see broken images in the jobdetails.jsp page?

4. How do I change the final output file names to desired names rather than the default partition names like part-00000, part-00001?

5. When writing a new InputFormat, what is the format of the array of strings returned by InputSplit#getLocations()?

6. Does the name-node stay in safe mode till all under-replicated files are fully replicated?

Wednesday, July 24, 2013

A discussion on Hadoop

1. Consider you are uploading a file of 300 MB into HDFS and 200 MB have been uploaded successfully, when another client simultaneously wants to read the uploaded data (uploading is still continuing). What happens in this situation?
                 a) Arises an exception
                 b) Data will be displayed successfully
                 c) Uploading is interrupted
                 d) The uploaded 200 MB will be displayed

2. Why should you stop all the Task Trackers while decommissioning nodes in a Hadoop cluster?
                 a) To overcome the situation of Speculative execution
                 b) To avoid external interference on the new nodes
                 c) In order to make the new nodes identify for Namenode
                 d) JobTracker receives heartbeats from new nodes only when it is restarted
3. When does the Namenode enter safe mode?
                  a) When 80% of its metadata is filled
                  b) When the minimum replication factor is reached
                  c) Both
                  d) When the edit log is full

Tuesday, July 23, 2013

Hadoop FAQs


1. You are given a directory SampleDir containing the following files: first.txt, _second.txt, .third.txt, #fourth.txt. If you provide SampleDir to the MR job, how many files are processed?

2. You have an external jar file of size 1.3MB that has the required dependencies to run your MR job. What steps do you take to copy the jar file to the task trackers?



3. When a job is run, your properties file is copied to the distributed cache so that your map tasks can access it. How do you access the property file?
 
4. If you have m mappers and n reducers in a given job, how many copy and write operations will the shuffle and sort phase result in?
 
5. You have 100 map tasks running, out of which 99 have completed and one task is running slow. The system replicates the slower running task on a different machine, the output is collected from the first completed attempt, and the remaining attempts are killed. What is this phenomenon?

Monday, July 22, 2013

Hadoop in ETL process

Traditional ETL architectures can no longer provide the scalability required by the business at an affordable cost. That's why many organizations are turning to Hadoop. But Hadoop alone is not a data integration solution. Performing even the simplest ETL tasks requires mastering disparate tools and writing hundreds of lines of code. A Hadoop ETL solution provides a smarter approach, turning your Hadoop environment into a complete data integration solution!

Everything you need for Hadoop ETL. No coding, No Tuning, No Kidding!

  • Connect to any data source or target
  • Exploit mainframe data
  • Develop MapReduce ETL jobs without coding
  • Jump-start your Hadoop productivity with use case accelerators to help you build common ETL tasks
  • Build, re-use, and check impact analysis with enhanced metadata capabilities
  • Optimize performance and efficiency of each individual node
  • Never tune again

Apache Hadoop has two main subprojects:

  • MapReduce - The framework that understands and assigns work to the nodes in a cluster.
  • HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive and ZooKeeper, that extend the value of Hadoop and improve its usability.

So what’s the big deal?

Hadoop changes the economics and the dynamics of large scale computing. Its impact can be boiled down to four salient characteristics. 

About Hadoop®

Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Hadoop enables a computing solution that is:
  • Scalable– New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
  • Cost effective– Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
  • Flexible– Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
  • Fault tolerant– When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

Think Hadoop is right for you?

Eighty percent of the world’s data is unstructured, and most businesses don’t even attempt to use this data to their advantage. Imagine if you could afford to keep all the data generated by your business? Imagine if you had a way to analyze that data?

Sunday, July 21, 2013


What is big data?

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

Big data in action

What types of business problems can a big data platform help you address? There are multiple uses for big data in every industry – from analyzing larger volumes of data than was previously possible to drive more precise answers, to analyzing data in motion to capture opportunities that were previously lost. A big data platform will enable your organization to tackle complex problems that previously could not be solved.

Big data spans three dimensions: Volume, Velocity, Variety.


Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
  • Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
  • Convert 350 billion annual meter readings to better predict power consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
  • Scrutinize 5 million trade events created each day to identify potential fraud
  • Analyze 500 million daily call detail records in real-time to predict customer churn faster
Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
  • Monitor hundreds of live video feeds from surveillance cameras to target points of interest
  • Exploit the 80% data growth in images, video and documents to improve customer satisfaction

Big data = Big Return on Investment (ROI)

While there is a lot of buzz about big data in the market, it isn’t hype. Plenty of customers are seeing tangible ROI using IBM solutions to address their big data challenges:
  • Healthcare: 20% decrease in patient mortality by analyzing streaming patient data
  • Telco: 92% decrease in processing time by analyzing networking and call data
  • Utilities: 99% improved accuracy in placing power generation resources by analyzing 2.8 petabytes of untapped data

The 5 game changing big data use cases

What is a use case?

A use case helps you solve a specific business challenge by using patterns or examples of technology solutions. Your use case, customized for your unique issue, provides answers to your business problem.

While much of the big data activity in the market up to now has been experimenting and learning about big data technologies, IBM has been focused on also helping organizations understand what problems big data can address.

We’ve identified the top 5 high value use cases that can be your first step into big data:

  1. Big Data Exploration

    Find, visualize, and understand all big data to improve decision making. Big data exploration addresses the challenge that every large organization faces: information is stored in many different systems and silos, and people need access to that data to do their day-to-day work and make important decisions.

  2. Enhanced 360° View of the Customer

    Extend existing customer views by incorporating additional internal and external information sources. Gain a full understanding of customers: what makes them tick, why they buy, how they prefer to shop, why they switch, what they’ll buy next, and what factors lead them to recommend a company to others.

  3. Security/Intelligence Extension

    Lower risk, detect fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types (e.g. social media, emails, sensors, telco) and sources of under-leveraged data to significantly improve intelligence, security and law enforcement insight.

  4. Operations Analysis

    Analyze a variety of machine and operational data for improved business results. The abundance and growth of machine data, which can include anything from IT machines to sensors, meters and GPS devices, requires complex analysis and correlation across different types of data sets. By using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions and behavior.

  5. Data Warehouse Augmentation

    Integrate big data and data warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining what data should be moved to the data warehouse. Offload infrequently accessed or aged data from warehouse and application databases using information integration software and tools.

An introduction to Apache Hadoop

Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists of three main functions: storage, processing and resource management.
  1. Processing – MapReduce
    Computation in Hadoop is based on the MapReduce paradigm that distributes tasks across a cluster of coordinated “nodes.” It was designed to run on commodity hardware and to scale up or down without system interruption. (A minimal word-count example is sketched after this list.)
  2. Storage – HDFS
    Storage is accomplished with the Hadoop Distributed File System (HDFS) – a reliable and distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
  3. Resource Management – YARN (New in Hadoop 2.0)
    YARN performs the resource management function in Hadoop 2.0 and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models. The YARN based architecture of Hadoop 2 is the most significant change introduced to the Hadoop project.
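To make the MapReduce model in item 1 concrete, here is a minimal word-count sketch against the Hadoop 2.x mapreduce API; the input and output paths are passed as arguments, and this is an illustrative sketch rather than a tuned production job:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }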

Tuesday, July 16, 2013

Apache Hadoop mirrors at the hub...

Apache Hadoop mirrors

  http://code.google.com/p/autosetup1/downloads/detail?name=hadoop-0.20.2.tar.gz&can=2&q=
  http://www.apache.org/dyn/closer.cgi
  http://hadoop.apache.org/releases.html

JDK6
http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html#jdk-6u45-oth-JPR

JDK7 
for 32bit Download linux-x86
for 64bit Download linux-x64

  http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

ECOSYSTEM DOWNLOADING MIRRORS

 Hive
  http://hive.apache.org/releases.html
  http://www.apache.org/dyn/closer.cgi/hive/
 Pig
  http://archive.apache.org/dist/hadoop/pig/stable/
 
 Hbase
  http://www.apache.org/dyn/closer.cgi/hbase/
  http://sourceforge.net/projects/hbasemanagergui/
 Sqoop
  http://www.apache.org/dyn/closer.cgi/sqoop/1.4.3
 Flume
  http://flume.apache.org/download.html
  http://flume.apache.org/
 Chukwa
  http://incubator.apache.org/chukwa/
  
 

When to use HBase and when MapReduce?

Very often I get a query on when to use HBase and when to use MapReduce. HBase provides an SQL-like interface with Phoenix, and MapReduce provides a similar SQL interface with Hive. Both can be used to get insights from the data.


I would liken HBase/MapReduce to a plane/train. A train can carry a lot of material at a slow pace, while a plane can carry relatively less material at a faster pace. Depending on the amount of material to be transferred from one location to another and the urgency, either a plane or a train can be used to move the material.

Similarly, HBase (or in fact any database) provides relatively low latency (response time) at the cost of low throughput (data transferred/processed), while MapReduce provides high throughput at the cost of high latency. So, depending on the NFR (Non Functional Requirements) of the application, either HBase or MapReduce can be picked.

E-commerce or any customer-facing application requires a quick response time for the end user, and also only a few records related to the customer have to be picked, so HBase would fit the bill. But for all the back-end/batch processing, MapReduce can be used.
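To illustrate the low-latency, few-records access pattern where HBase fits, here is a sketch of a single-row lookup with the HBase Java client API; the "customers" table, "profile" column family and row key are invented for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CustomerLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("customers"))) {
                // Fetch a single customer row by key: a millisecond-scale lookup,
                // in contrast to a MapReduce scan over the whole data set.
                Get get = new Get(Bytes.toBytes("customer#12345"));
                Result result = table.get(get);
                byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
                System.out.println(name == null ? "not found" : Bytes.toString(name));
            }
        }
    }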

HBase Use Cases

HBase at Pinterest

Pinterest is completely deployed on Amazon EC2. Pinterest uses a follow model where users follow other users. This requires a following feed for every user that gets updated every time a followee creates or updates a pin. This is a classic social media application problem. For Pinterest, this amounts to 100s of millions of pins per month that get fanned out as billions of writes per day.
So the ‘Following Feed’ is implemented using HBase. Some specifics:
  • They chose a wide schema where each user’s following feed is a single row in HBase. This exploits the sorting order within columns for ordering (each user wants to see the latest in his feed) and results in atomic transactions per user.
  • To optimize writes, they increased the per-region memstore size. A 512MB memstore leads to 40MB HFiles instead of the small 8MB files created by the default memstore; this leads to less frequent compactions.
  • They take care of the potential for infinite columns by trimming the feed during compactions: there really is not much point having an infinite feed anyway.
  • They also had to do GC tuning (who doesn’t?), opting for more frequent but smaller pauses.
Another very interesting fact: they maintain a mean time to recovery (MTTR) of less than 2 minutes. This is a great accomplishment, since HBase favors consistency over availability. They achieve this by reducing various timeout settings (socket, connect, stale node, etc.) and the number of retries. They also avoid a single point of failure by using 2 clusters. To avoid NameNode failure, they keep a copy on EBS.

HBase at Groupon

Groupon has two distinct use cases: delivering deals to users via email (a batch process) and providing a relevant user experience on the website. They have increasingly tuned their deals to be more accurate and relevant to individual users (personalization).
They started out with running Hadoop MapReduce (MR) jobs for email deal delivery and used MySQL for their online application – but ideally wanted the same system for both.
They now run their Relevance and Personalization system on HBase. In order to cater to the very different workload characteristics of the two systems (email, online), they run 2 HBase clusters that are replicated, so they have the same content but are tuned and accessed differently.
Groupon also uses a very wide schema: one column family for ‘user history and profile’ and the other for email history.
A 10 node cluster runs HBase (apart from the 100 node Hadoop cluster). Each node has 96GB RAM, 2

HBase at Longtail Video

This company provides JW Player, an online video player used by over 2 million websites. They have lots of data which is processed by their online analytics tool. They too are completely deployed on AWS and as such use HBase and EMR from Amazon. They read data from and write data to S3.
They had the following requirements:
  • fast queries across data sets
  • support for date-range queries
  • store huge amounts of aggregated data
  • flexibility in dimensions used for rollup tables
HBase fit the bill. They use multiple clusters to partition their read- and write-intensive workloads, similar to Groupon. They are a full-fledged Python shop, so they use HappyBase and have Thrift running on all the nodes of the HBase cluster.

Basic Hardware requirements for Hadoop Fully Distributed cluster

The first question people raise when getting started with Hadoop is about selecting appropriate hardware for their Hadoop cluster. This blog post describes the various factors that Hadoop administrators take into account.  We encourage others to chime in with their experience configuring production Hadoop clusters.  Although Hadoop is designed to run on industry standard hardware, recommending an ideal cluster configuration is not as easy as just delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. For example, users with IO-intensive workloads will invest in more spindles per core. In this blog post we’ll discuss workload evaluation and the critical role it plays in hardware selection.

Marrying storage and compute

Over the past decade IT organizations have standardized on blades and SANs (Storage Area Networks) to satisfy their grid and processing-intensive workloads. While this model makes a lot of sense for a number of standard applications such as web servers, app servers, smaller structured databases and simple ETL (Extract, Transform, Load), the requirements for infrastructure have been changing as the amount of data and number of users has grown. Web servers now have caching tiers, databases have gone massively parallel with local disk, and ETL jobs are pushing more data than they can handle locally. Hardware vendors have created innovative systems to address these requirements, including storage blades, SAS (Serial Attached SCSI) switches, external SATA arrays and larger capacity rack units.
Hadoop was designed based on a new approach to storing and processing complex data. Instead of relying on a SAN for massive storage and reliability then moving it to a collection of blades for processing, Hadoop handles large data volumes and reliability in the software tier. Hadoop distributes data across a cluster of balanced machines and uses replication to ensure data reliability and fault tolerance. Because data is distributed on machines with compute power, processing can be sent directly to the machines storing the data. Since each machine in a Hadoop cluster both stores and processes data, they need to be configured to satisfy both data storage and processing requirements.

Why workloads matter

In nearly all cases, a MapReduce job will either encounter a bottleneck reading data from disk or from the network (known as an IO-bound job) or in processing data (CPU-bound). An example of an IO-bound job is sorting, which requires very little processing (simple comparisons) and a lot of reading and writing to disk. An example of a CPU-bound job is classification, where some input data is processed in very complex ways to determine an ontology.
Here are several more examples of IO-bound workloads:
  • Indexing
  • Searching
  • Grouping
  • Decoding/decompressing
  • Data importing and exporting
Here are several more examples of CPU-bound workloads:
  • Machine learning
  • Complex text mining
  • Natural language processing
  • Feature extraction
Since our customers need to understand their workloads in order to fully optimize their Hadoop hardware, we often start out with a classic chicken and egg problem on our hands. Most teams looking to build a Hadoop cluster don’t yet know the profile of their workload and often the first jobs that an organization runs with Hadoop are far different than the jobs that Hadoop is used for as they build proficiency.  Additionally, some workloads might be bound in unforeseen ways.  For example, sometimes theoretical IO-bound workloads might actually be CPU-bound because of a user’s choice of compression.  Or sometimes different implementations of an algorithm might change how the MapReduce job is constrained.  For these reasons it makes sense to invest in a balanced Hadoop cluster when the team is unfamiliar with the types of jobs they are going to run.  The team can benchmark MapReduce jobs once they’re running on the balanced cluster, to understand how they’re bound.
It is straightforward to measure live workloads and determine bottlenecks by putting thorough monitoring in place on the Hadoop cluster. We recommend installing Ganglia on all Hadoop machines to provide real-time statistics about CPU, disk, and network load.  With Ganglia installed a Hadoop administrator can then run their MapReduce jobs and check the Ganglia dashboard to see how each machine is performing.

In addition to building out a cluster appropriate for the workload, we encourage our customers to work with a hardware vendor and understand the economics of power and cooling. Since Hadoop runs on tens, hundreds, or thousands of nodes an operations team can save a significant amount of money by investing in power-efficient hardware. Each hardware vendor will be able to provide tools and recommendations for how to monitor power and cooling.

How to pick hardware for your Hadoop cluster

The first step in choosing a machine configuration is to understand the type of hardware your operations team already manages. Operations teams often have opinions about new machine purchases and will prefer to work with hardware that they’re already familiar with. Hadoop is not the only system that benefits from efficiencies of scale. Remember to plan on using balanced hardware for an initial cluster when new to Hadoop and if you do not yet understand your workload.
There are four types of nodes in a basic Hadoop cluster. We refer here to a node as a machine performing a particular task. Most of the machines will function as both datanodes and tasktrackers. As we described, these nodes both store data and perform processing functions. We recommend the following specifications for datanodes/tasktrackers in a balanced Hadoop cluster:
  • 4 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
  • 2 quad core CPUs, running at least 2-2.5GHz
  • 16-24GBs of RAM (24-32GBs if you’re considering HBase)
  • Gigabit Ethernet
The namenode is responsible for coordinating data storage on the cluster and the jobtracker for coordinating data processing.  The last type of node is the secondarynamenode, which can be colocated on the namenode machine for small clusters, and will run on a separate machine with the same hardware specification as the namenode for larger clusters.  We recommend our customers purchase hardened machines for running the namenodes and jobtrackers, with redundant power and enterprise-grade RAIDed disks. Namenodes also require more RAM relative to the number of data blocks in the cluster. A good rule of thumb is to assume 1GB of namenode memory for every one million blocks stored in the distributed file system. With 100 datanodes in a cluster, 32GB of RAM on the namenode provides plenty of room to grow. We also recommend having a standby machine to replace the namenode or jobtracker, in case one of these fails suddenly.
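A back-of-the-envelope sketch of that rule of thumb; the data volume and block size below are assumptions chosen for the example:

    public class NamenodeHeapEstimate {
        public static void main(String[] args) {
            long totalDataBytes = 200L * 1024 * 1024 * 1024 * 1024; // assume 200 TB of raw data
            long blockSizeBytes = 128L * 1024 * 1024;               // assume 128 MB HDFS blocks
            long blocks = totalDataBytes / blockSizeBytes;          // ~1.6 million blocks

            // Rule of thumb from the text: ~1 GB of namenode heap per million blocks.
            double heapGb = blocks / 1000000.0;
            System.out.printf("blocks=%d, suggested namenode heap ~ %.1f GB%n", blocks, heapGb);
        }
    }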
When you expect your Hadoop cluster to grow beyond 20 machines, we recommend that the initial cluster be configured as if it were to span two racks, where each rack has a top-of-rack gigabit switch, and those switches are connected with a 10 GigE interconnect or core switch. Having two logical racks gives the operations team a better understanding of the network requirements for intra-rack and cross-rack communication.
With a Hadoop cluster in place the team can start identifying workloads and prepare to benchmark those workloads to identify CPU and IO bottlenecks. After some time benchmarking and monitoring, the team will have a good understanding as to how additional machines should be configured. It is common to have heterogeneous Hadoop clusters especially as they grow in size. Starting with a set of machines that are not perfect for your workload will not be a waste.
Below is a list of various hardware configurations for different workloads, including our original “base” recommendation:
  • Light Processing Configuration (1U/machine): Two quad core CPUs, 8GB memory, and 4 disk drives (1TB or 2TB). Note that CPU-intensive work such as natural language processing involves loading large models into RAM before processing data and should be configured with 2GB RAM/core instead of 1GB RAM/core.
  • Balanced Compute Configuration (1U/machine): Two quad core CPUs, 16 to 24GB memory, and 4 disk drives (1TB or 2TB) directly attached using the motherboard controller. These are often available as twins with two motherboards and 8 drives in a single 2U cabinet.
  • Storage Heavy Configuration (2U/machine): Two quad core CPUs, 16 to 24GB memory, and 12 disk drives (1TB or 2TB). The power consumption for this type of machine starts around ~200W in idle state and can go as high as ~350W when active.
  • Compute Intensive Configuration (2U/machine): Two quad core CPUs, 48-72GB memory, and 8 disk drives (1TB or 2TB). These are often used when a combination of large in-memory models and heavy reference data caching is required.
Note that we expect to adopt 6- and 8-core configurations as they arrive.

Other hardware considerations

When we encounter applications that produce large amounts of intermediate data–on the order of the same amount as is read in–we recommend two ports on a single Ethernet card or two channel-bonded Ethernet cards to provide 2 Gbps per machine. Alternatively for customers who have already moved to 10 Gigabit Ethernet or Infiniband, these solutions can be used to address network bound workloads.  Be sure that your operating system and BIOS are compatible if you’re considering switching to 10 Gigabit Ethernet.
When computing memory requirements, factor in that Java uses up to 10% of memory for managing the virtual machine. We recommend configuring Hadoop to use strict heap size restrictions in order to avoid memory swapping to disk. Swapping greatly impacts MapReduce job performance and can be avoided by configuring machines with more RAM.
It is also important to optimize RAM for the memory channel width. For example, when using dual-channel memory each machine should be configured with pairs of DIMMs. With triple-channel memory each machine should have triplets of DIMMs. This means a machine might end up with 18GBs (9x2GB) of RAM instead of 16GBs (4x4GB).

Conclusions

Purchasing appropriate hardware for a Hadoop cluster requires benchmarking and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous and we recommend deploying initial hardware with balanced specifications when getting started.