Sunday, July 21, 2013

An introduction to Apache Hadoop

Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It provides three main functions: processing, storage and resource management.
  1. Processing – MapReduce
    Computation in Hadoop is based on the MapReduce paradigm, which splits a job into tasks and distributes them across a cluster of coordinated “nodes.”
  2. Storage – HDFS
    Storage is accomplished with the Hadoop Distributed File System (HDFS) – a reliable and distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
  3. Resource Management – YARN (New in Hadoop 2.0)
    YARN performs the resource management function in Hadoop 2.0 and extends Hadoop beyond MapReduce by supporting workloads written for other programming models. The YARN-based architecture is the most significant change that Hadoop 2 introduces to the project.
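
The MapReduce paradigm from item 1 can be sketched in a few lines of plain Python. This is a single-process illustration under simplifying assumptions, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and on a real cluster the map and reduce steps would run as separate tasks on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would do between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine every value seen for one key into one result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence in a single process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

Each call to map_phase depends only on its own line, and each call to reduce_phase depends only on its own key. That independence is what lets a framework spread the work across many machines.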

