Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more
powerful Fair Scheduler. In addition to doing nearly all that it could
do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule
non-MapReduce jobs, schedule based on fine-grained memory instead of
slots, and support hierarchical queues. In this post, you’ll learn what
the Fair Scheduler’s role is and how it fulfills it, what it means to be
a YARN “scheduler,” and what its new features are and how to get them
running on your cluster.
YARN/MR2 vs. MR1
YARN uses updated terminology to reflect that it no longer manages
resources for MapReduce alone. From YARN’s perspective, a MapReduce
job is an application. YARN schedules containers for map and reduce
tasks to live in. What were referred to as pools in the MR1 Fair
Scheduler are now called queues, for consistency with the Capacity
Scheduler. An excellent and deeper explanation is available here.
How Does it Work?
How a Hadoop scheduler functions is often a source of confusion, so
we’ll start with a short overview of what the Fair Scheduler does and
how it works.
A Hadoop scheduler is responsible for deciding which tasks run where
and when. The Fair Scheduler,
originally developed at Facebook, seeks to promote fairness between
schedulable entities by awarding free space to those that are the most
underserved. (Cloudera recommends the Fair Scheduler for its wide set of
features and ease of use, and Cloudera Manager sets it as the
default. More than 95% of Cloudera’s customers use it.)
In Hadoop, the scheduler is a pluggable piece of code that lives
inside the ResourceManager (the JobTracker, in MR1), the central
execution-managing service. The ResourceManager constantly receives
updates from the NodeManagers that sit on each node in the cluster, saying
“What’s up, here are all the tasks I was running that just completed, do
you have any work for me?” The ResourceManager passes these updates to
the scheduler, and the scheduler then decides what new tasks, if any, to
assign to that node.
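To make that flow concrete, here is a rough sketch of the heartbeat-driven
loop. The class and interface names below (NodeUpdate, Scheduler, and so
on) are simplified stand-ins for illustration, not the real YARN APIs.

    import java.util.List;

    // Simplified stand-in for what a NodeManager reports on each heartbeat.
    class NodeUpdate {
      String nodeId;
      List<String> completedContainers; // containers that finished since the last heartbeat
      long freeMemoryMb;                // capacity the node currently has available
    }

    // The scheduler is pluggable: the ResourceManager forwards every
    // NodeManager heartbeat to whichever scheduler implementation is configured.
    interface Scheduler {
      void handleNodeUpdate(NodeUpdate update);
    }

    class FairSchedulerSketch implements Scheduler {
      @Override
      public void handleNodeUpdate(NodeUpdate update) {
        // 1. Release the resources held by the containers that just completed.
        // 2. While the node still has spare capacity, pick the most underserved
        //    queue and application (see the selection sketch below) and assign
        //    it a new container on this node.
      }
    }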
How does the scheduler decide? For the Fair Scheduler, it’s simple:
every application belongs to a “queue”, and we give a container to the
queue that has the fewest resources allocated to it right now. Within
that queue, we offer it to the application that has the fewest resources
allocated to it right now. The Fair Scheduler supports a number of
features that modify this a little, like weights on queues, minimum
shares, maximum shares, and FIFO policy within queues, but the basic
idea remains the same.
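In code, that decision boils down to repeatedly picking the entity with
the smallest current allocation, scaled by its weight where one is set.
The sketch below is only an illustration of that idea (memory is the only
resource considered, and the Queue and App classes are hypothetical), not
the actual FairScheduler implementation.

    import java.util.Comparator;
    import java.util.List;

    // Illustrative only: pick the most underserved queue, then the most
    // underserved application within it, exactly as described above.
    class FairShareSelection {

      static class App {
        long allocatedMb;      // resources this application currently holds
      }

      static class Queue {
        double weight = 1.0;   // queues with higher weight deserve a larger share
        long allocatedMb;      // sum of its applications' current allocations
        List<App> apps;
      }

      // Returns the application that should receive the next free container.
      static App selectNext(List<Queue> queues) {
        // The queue with the least allocation relative to its weight goes first.
        Queue neediest = queues.stream()
            .min(Comparator.comparingDouble(q -> q.allocatedMb / q.weight))
            .orElseThrow(IllegalStateException::new);

        // Within that queue, the application with the least allocation goes first.
        return neediest.apps.stream()
            .min(Comparator.comparingLong(a -> a.allocatedMb))
            .orElseThrow(IllegalStateException::new);
      }
    }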
Beyond MapReduce
In MR1, the Fair Scheduler was purely a MapReduce scheduler. If you
wanted to run multiple parallel computation frameworks on the same
cluster, you would have to statically partition resources — or cross
your fingers and hope that the resources given to a MapReduce job
wouldn’t also be handed out by the other framework’s scheduler,
causing OSes to thrash. With YARN, the same scheduler can manage
resources for different applications on the same cluster, which should
allow for more multi-tenancy and a richer, more diverse Hadoop
ecosystem.
Scheduling Resources, Not Slots
A big change in the YARN Fair Scheduler is how it defines a
“resource”. In MR1, the basic unit of scheduling was the “slot”, an
abstraction of a space for a task on a machine in the cluster. Because
YARN expects to schedule jobs with heterogeneous task resource requests,
it instead allows containers to request variable amounts of memory and
schedules based on those. Cluster resources no longer need to be
partitioned into map and reduce slots, meaning that a large job can use
all the resources in the cluster in its map phase and then do so again
in its reduce phase. This allows for better utilization of the cluster,
better treatment of tasks with high resource requests, and more
portability of jobs between clusters — a developer no longer needs to
worry about a slot meaning different things on different clusters;
rather, they can request concrete resources to satisfy their jobs’
needs. Additionally, work is being done (YARN-326) that will allow the
Fair Scheduler to schedule based on CPU requirements and availability as
well.
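For example, an MR2 job can ask for concrete per-task memory through its
configuration. A minimal sketch, assuming the standard Hadoop 2.x
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MemoryRequestExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Request concrete container sizes instead of thinking in "slots":
        // 1GB for each map task's container, 2GB for each reduce task's.
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.setInt("mapreduce.reduce.memory.mb", 2048);

        Job job = Job.getInstance(conf, "memory-request-example");
        // ... set the mapper, reducer, and input/output paths as usual,
        // then submit; the scheduler hands out containers sized to the requests.
      }
    }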
An implementation detail of this change that prevents applications
from starving under this new flexibility is the notion of reserved
containers. Imagine two jobs are running that each have enough tasks to
saturate more than the entire cluster. One job wants each of its mappers
to get 1GB, and another job wants its mappers to get 2GB. Suppose the
first job starts and fills up the entire cluster. Whenever one of its
tasks finishes, it will free up 1GB on some node. Even though the second job
deserves the space, a naive policy will give it to the first one
because it’s the only job with tasks that fit. This could cause the
second job to be starved indefinitely.
To prevent this unfortunate situation, when space on a node is offered
to an application that cannot immediately use it, the application
reserves it, and no other application can be allocated a container on
that node until the reservation is fulfilled. Each node may have only
one reserved container. The total reserved memory amount is reported in
the ResourceManager UI. A high number means that it may take longer for
new jobs to get space.
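The rule described above can be summed up in a few lines. The sketch
below uses hypothetical classes (NodeSketch, AppSketch), not the real
FairScheduler code: if the neediest application’s request does not fit in
the node’s free space, the node is reserved for it rather than handed to
whatever happens to fit.

    // Illustrative only: a minimal model of per-node container reservations.
    class AppSketch {}

    class NodeSketch {
      long freeMb;
      AppSketch reservedFor;   // at most one reserved container per node

      // Try to place a container of requestMb for app on this node.
      boolean offer(AppSketch app, long requestMb) {
        if (reservedFor != null && reservedFor != app) {
          return false;        // node is held for another application's pending request
        }
        if (requestMb <= freeMb) {
          freeMb -= requestMb;
          reservedFor = null;  // a fulfilled reservation is cleared
          return true;
        }
        // Not enough room yet (for example, a 2GB request against a 1GB gap):
        // reserve the node so that smaller requests cannot keep taking the space.
        reservedFor = app;
        return false;
      }
    }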