Tuesday, July 16, 2013

HBase Use Cases

HBase at Pinterest

Pinterest is deployed entirely on Amazon EC2. Pinterest uses a follow model where users follow other users. This requires a following feed for every user that gets updated every time a followee creates or updates a pin. This is a classic social media application problem. For Pinterest, this amounts to hundreds of millions of pins per month that get fanned out as billions of writes per day.
So the ‘Following Feed’ is implemented using HBase. Some specifics:
  • They chose a wide schema where each user’s following feed is a single row in HBase. This exploits the sorted order of columns within a row (each user wants to see the latest pins in his feed) and makes each feed update an atomic, per-user operation (see the sketch after this list).
  • To optimize writes, they increased the per-region memstore size. A 512MB memstore leads to 40MB HFiles instead of the small 8MB files created by the default memstore size, which means less frequent compactions.
  • They take care of the potential for infinite columns by trimming the feed during compactions: there really is not much point in having an infinite feed anyway.
  • They also had to do GC tuning (who doesn’t?), opting for more frequent but smaller pauses.
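To make the wide-row design concrete, below is a minimal sketch using the HBase Java client API of that era (0.94-style). The table name ‘following_feed’, the ‘pins’ column family, and the example values are assumptions for illustration, not Pinterest’s actual schema; the point is one row per follower, a reverse-timestamp column qualifier so the newest pin sorts first, and a 512MB memstore flush size on the table.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FollowingFeedSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // One table, one column family; each user's feed is a single row.
            HTableDescriptor desc = new HTableDescriptor("following_feed");
            desc.addFamily(new HColumnDescriptor("pins"));
            // Larger per-region memstore (512MB) -> bigger flushed HFiles, fewer compactions.
            desc.setMemStoreFlushSize(512L * 1024 * 1024);
            new HBaseAdmin(conf).createTable(desc);

            // Fan-out write: add one column per new pin to each follower's row.
            // A reverse-timestamp qualifier makes the newest pin sort first within the row.
            HTable table = new HTable(conf, "following_feed");
            long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();
            Put put = new Put(Bytes.toBytes("user42"));                // row key = follower's user id
            put.add(Bytes.toBytes("pins"),
                    Bytes.toBytes(String.format("%019d", reverseTs)), // qualifier = reverse timestamp
                    Bytes.toBytes("pin:12345"));                      // value = pin id (illustrative)
            table.put(put);                                           // a single Put is atomic per row
            table.close();
        }
    }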
Another very interesting fact: they maintain a mean time to recovery (MTTR) of less than 2 minutes. This is a great accomplishment, since HBase favors consistency over availability. They achieve this by reducing various timeout settings (socket, connect, stale node, etc.) and the number of retries. They also avoid a single point of failure by using 2 clusters. To guard against NameNode failure, they keep a copy of the NameNode data on EBS.
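The exact settings are not listed, but the approach boils down to failing fast: fewer client retries, shorter RPC and ZooKeeper timeouts, and HDFS stale-node detection so reads and writes steer clear of a dead DataNode. A hedged sketch of the kind of knobs involved (the values are illustrative, and the HDFS properties normally live in hdfs-site.xml on the cluster rather than in client code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class LowMttrConfigSketch {
        public static Configuration create() {
            Configuration conf = HBaseConfiguration.create();

            // Fail fast on the client: fewer retries, shorter RPC and ZooKeeper timeouts.
            conf.setInt("hbase.client.retries.number", 3);    // default is much higher
            conf.setInt("hbase.rpc.timeout", 10000);          // 10s RPC timeout
            conf.setInt("zookeeper.session.timeout", 30000);  // detect dead region servers sooner
            conf.setInt("hbase.client.pause", 500);           // shorter backoff between retries

            // HDFS stale-node detection (normally configured in hdfs-site.xml):
            // steer reads/writes away from DataNodes that have stopped heartbeating.
            conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
            conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
            conf.setLong("dfs.namenode.stale.datanode.interval", 30000);

            return conf;
        }
    }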

HBase at Groupon

Groupon has two distinct use cases: delivering deals to users via email (a batch process) and providing a relevant user experience on the website. They have increasingly tuned their deals to be more accurate and relevant to individual users (personalization).
They started out running Hadoop MapReduce (MR) jobs for email deal delivery and used MySQL for their online application – but ideally wanted the same system for both.
They now run their Relevance and Personalization system on HBase. To cater to the very different workload characteristics of the two systems (email and online), they run 2 HBase clusters that are replicated, so they have the same content but are tuned and accessed differently.
Groupon also uses a very wide schema – one column family for ‘user history and profile’ and another for ‘email history’.
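As a rough illustration of that layout (the table and family names here are assumed, not Groupon’s actual ones), the table carries two column families, each with replication scope enabled so edits ship to the second cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class RelevanceSchemaSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTableDescriptor desc = new HTableDescriptor("user_relevance");

            // One family for user history and profile data...
            HColumnDescriptor profile = new HColumnDescriptor("profile");
            profile.setScope(1);   // REPLICATION_SCOPE = 1: edits replicate to the peer cluster
            desc.addFamily(profile);

            // ...and one family for email history.
            HColumnDescriptor email = new HColumnDescriptor("email_history");
            email.setScope(1);
            desc.addFamily(email);

            new HBaseAdmin(conf).createTable(desc);
        }
    }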
A 10-node cluster runs HBase (apart from the 100-node Hadoop cluster). Each node has 96GB RAM, 2

HBase at Longtail Video

This company provides JW Player, an online video player used by over 2 million websites. They have lots of data, which is processed by their online analytics tool. They too are deployed entirely on AWS and as such use HBase and EMR from Amazon. They read data from and write data to S3.
They had the following requirements:
  • fast queries across data sets
  • support for date-range queries
  • store huge amounts of aggregated data
  • flexibility in dimensions used for rollup tables
HBase fit the bill. They use multiple clusters to partition their read- and write-intensive workloads, similar to Groupon. They are a full-fledged Python shop, so they use HappyBase and have Thrift running on all the nodes of the HBase cluster.
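Their actual access path is HappyBase over Thrift, but the date-range pattern is the same regardless of client: put the entity and date at the front of the row key so a date range becomes a start/stop row scan. Below is a sketch using the native Java Scan API, with an assumed table name and key layout (not Longtail’s real schema):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DateRangeScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "video_stats_daily");  // assumed rollup table name

            // Row keys like "<videoId>#<yyyyMMdd>" keep one video's days contiguous,
            // so a date range is just a scan between a start and stop row.
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("video123#20130601"));
            scan.setStopRow(Bytes.toBytes("video123#20130701"));   // stop row is exclusive

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }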
