Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Friday, 31 October 2014

Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day

Facebook did it again. They've built another system capable of doing something useful with ginormous streams of realtime data. Last time we saw Facebook release their New Real-Time Messaging System: HBase To Store 135+ Billion Messages A Month. This time it's a realtime analytics system handling over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.
Alex Himel, Engineering Manager at Facebook, explains what they've built (video) and the scale required:

Social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.

Alex does an excellent job with the presentation. Highly recommended. But let's take a little deeper look at what's going on...
The need for such a high powered analytics system is driven by Facebook's brilliant plan for world wide web domination via the viral propagation of social plugins, all tying the non-Facebook web back into Facebook and the Facebook web back into the non-Facebook web. Basically anything that people can do is captured and fed back through Facebook and anything done on Facebook can be displayed on your website, building closer relations between the two.
Facebook's Social Plugins are Roman Empire Management 101. You don't have to conquer everyone to build an empire. You just have control everyone with the threat they could be conquered while making them realize, oh by the way, there's lots of money to be made being friendly with Rome. This strategy worked for quite a while as I recall.
You've no doubt seen Social Plugins on websites out the wild. A social plugin lets you see what your friends have liked, commented on or shared on sites across the web. The idea is putting social plugins on a site makes the content more engaging. Your friends can see what you are liking and in turn websites can see what everyone is liking. Content that is engaging gives you more clicks, more likes, and more comments. For a business or brand, or even an individual, the more engaging the content is, the more people see it, the more it pops up in news feeds, the more it drives traffic to a site.
The formerly lone-wolf web, where content hunters stalked web sites silently and singly, has been turned into a charming little village, where everyone knows your name. That's the power of social.
Posts here on HighScalability, for example, now have Like buttons. TechCrunch has famously moved to using Facebook's commenting system. Immediately the debate centered on the quality of the comment system itself, but that was hardly the point, the point was to plunge TechCrunch more deeply into Facebook's ecosystem of 500+ million users. Other plugins include: Recommendations, Activity Feed, Login, Registration, Facepile, and Live Stream.
All that data doesn't mean much unless you can make sense of it and also prove to content providers that social plugins actually do make their sites more engaging. That's where Facebook's Insights System comes in. It's an analytics system giving you access to all that juicy data being collected. It offers stats like Like button analytics, Comments box analytics, Popular pages, Demographics, and Organic sharing.
Imagine millions of websites and billions of pages and millions of people continually streaming data in via these social plugins. How do you make sense of all that data in real-time? It's a challenging problem.

Value Proposition

With an Insights System content producers can see what people like, which will enable content producers to generate more of what people like, which raises the content quality of the web, which gives users a better Facebook experience.

System Goals

Give people realtime counters, in a very reliable way, across a bunch of different metrics, and account for data skew.
Provide anonymous data. You can not figure out who the people are.
Show why plugins are valuable. What value is your business deriving from it?
Make the data more actionable. Help users take action to make their content more valuable.
- New UI metaphors. Use the idea of a funnel.
- How many people see a plugin, how many people take action on it, and how many are converted to traffic back on your site.
Make the data more timely.
- They went realtime. Went from a 48-hour turn around to 30 seconds.
- Multiple points of failure were removed to make this goal.

Challenges

Lots of Event Types
- Tracking 100+ metrics.
- Plugin impressions.
- Likes
- News Feed Impressions
- News Feed Clicks
- Demographics
Massive Amounts of Data
- 20 billion events per day (200,000 events per second)
Data Skew - Uneven Distribution of Keys
- Likes follow something like a power law distribution. The long tail gets very few likes, but some resources get huge numbers of likes.
- This brings up issues of hot regions, hot keys, and lock contention.

Implemented a Bunch of Different Prototypes

MySQL DB Counters
- Have a row with a key and a counter.
- Results in lots of database activity.
- Stats are kept at a day bucket granularity. Every day at midnight the stats would roll over.
  - When the roll over period is reached this resulted in a lot of writes to the database, which caused a lot of lock contention.
  - Tried to spread the work by taking into account time zones.
  - Tried to shard things differently.
- The high write rate led to lock contention, it was easy to overload the databases, had to constantly monitor the databases, and had to rethink their sharding strategy.
- Solution not well tailored to the problem.
In-Memory Counters
- If you are worried about bottlenecks in IO then throw it all in-memory.
- No scale issues. Counters are stored in memory so writes are fast and the counters are easy to shard.
- Felt in-memory counters, for reasons not explained, weren't as accurate as other approaches. Even a 1% failure rate would be unacceptable. Analytics drive money so the counters have to be highly accurate.
- They didn't implement this system. It was a thought experiment and the accuracy issue caused them to move on.
MapReduce
- Used Hadoop/Hive for previous solution.
- Flexible. Easy to get running. Can handle IO, both massive writes and reads. Don't have to know how they will query ahead of time. The data can be stored and then queried.
- Not realtime. Many dependencies. Lots of points of failure. Complicated system. Not dependable enough to hit realtime goals.
Cassandra
- HBase seemed a better solution based on availability and the write rate.
- Write rate was the huge bottleneck being solved.

The Winner: HBase + Scribe + Ptail + Puma

At a high level:
- HBase stores data across distributed machines.
- Use a tailing architecture, new events are stored in log files, and the logs are tailed.
- A system rolls the events up and writes them into storage.
- A UI pulls the data out and displays it to users.
Data Flow
- User clicks Like on a web page.
- Fires AJAX request to Facebook.
- Request is written to a log file using Scribe.
  - Scribe handles issues like file roll over.
  - Scribe is built on the same HTFS file store Hadoop is built on.
  - Write extremely lean log lines. The more compact the log lines the more can be stored in memory.
- Ptail
  - Data is read from the log files using Ptail. Ptail is an internal tool built to aggregate data from multiple Scribe stores. It tails the log files and pulls data out.
  - Ptail data is separated out into three streams so they can eventually be sent to their own clusters in different datacenters.
    - Plugin impression
    - News feed impressions
    - Actions (plugin + news feed)
- Puma
  - Batch data to lessen the impact of hot keys. Even though HBase can handle a lot of writes per second they still want to batch data. A hot article will generate a lot of impressions and news feed impressions which will cause huge data skews which will cause IO issues. The more batching the better.
  - Batch for 1.5 seconds on average. Would like to batch longer but they have so many URLs that they run out of memory when creating a hashtable.
  - Wait for last flush to complete for starting new batch to avoid lock contention issues.
- UI Renders Data
  - Frontends are all written in PHP.
  - The backend is written in Java and Thrift is used as the messaging format so PHP programs can query Java services.
- Caching solutions are used to make the web pages display more quickly.
  - Performance varies by the statistic. A counter can come back quickly. Find the top URL in a domain can take longer. Range from .5 to a few seconds.
  - The more and longer data is cached the less realtime it is.
  - Set different caching TTLs in memcache.
- MapReduce
  - The data is then sent to MapReduce servers so it can be queried via Hive.
  - This also serves as a backup plan as the data can be recovered from Hive.
  - Raw logs are removed after a period of time.
HBase is a distribute column store.
- Database interface to Hadoop. Facebook has people working internally on HBase.
- Unlike a relational database you don't create mappings between tables.
- You don't create indexes. The only index you have a primary row key.
- From the row key you can have millions of sparse columns of storage. It's very flexible. You don't have to specify the schema. You define column families to which you can add keys at anytime.
- Key feature to scalability and reliability is the WAL, write ahead log, which is a log of the operations that are supposed to occur.
  - Based on the key, data is sharded to a region server.
  - Written to WAL first.
  - Data is put into memory. At some point in time or if enough data has been accumulated the data is flushed to disk.
  - If the machine goes down you can recreate the data from the WAL. So there's no permanent data loss.
  - Use a combination of the log and in-memory storage they can handle an extremely high rate of IO reliably.
- HBase handles failure detection and automatically routes across failures.
- Currently HBase resharding is done manually.
  - Automatic hot spot detection and resharding is on the roadmap for HBase, but it's not there yet.
  - Every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
Schema
- Store on a per URL basis a bunch of counters.
- A row key, which is the only lookup key, is the MD5 hash of the reverse domain
  - Selecting the proper key structure helps with scanning and sharding.
  - A problem they have is sharding data properly onto different machines. Using a MD5 hash makes it easier to say this range goes here and that range goes there.
  - For URLs they do something similar, plus they add an ID on top of that. Every URL in Facebook is represented by a unique ID, which is used to help with sharding.
  - A reverse domain, com.facebook/ for example, is used so that the data is clustered together. HBase is really good at scanning clustered data, so if they store the data so it's clustered together they can efficiently calculate stats across domains.
- Think of every row a URL and every cell as a counter, you are able to set different TTLs (time to live) for each cell. So if keeping an hourly count there's no reason to keep that around for every URL forever, so they set a TTL of two weeks. Typically set TTLs on a per column family basis.
Per server they can handle 10,000 writes per second.
Checkpointing is used to prevent data loss when reading data from log files.
- Tailers save log stream check points in HBase.
- Replayed on startup so won't lose data.
Useful for detecting click fraud, but it doesn't have fraud detection built in.
Tailer Hot Spots
- In a distributed system there's a chance one part of the system can be hotter than another.
- One example are region servers that can be hot because more keys are being directed that way.
- One tailer can be lag behind another too.
- If one tailer is an hour behind and the others are up to date, what numbers do you display in the UI?
  - For example, impressions have a way higher volume than actions, so CTR rates were way higher in the last hour.
  - Solution is to figure out the least up to date tailer and use that when querying metrics.
Future Directions
- Top Lists
  - Very hard to find the top URLs, the URLs with the most likes, for domains like YouTube which have millions of URLs shared very quickly.
  - Need more creative solutions to keep an in-memory sorts and keep it up to date as data changes.
- Distinct User Counts
  - How many people across a time window liked a URL. Easy to do in MapReduce, hard to do with a naive counter solution.
- Generalize for Applications other than Social Plugins
- Move to Multiple Datacenters
  - Single datacenter currently, but hope to move to multiple datacenters.
  - Current fallback plan is to use the MapReduce system.
  - The backup systems are tested each night. Queries against Hive and this new system are compared to see that they match.
Project
- Took about 5 months.
- Two engineers first started working on the project. Then a 50% engineer was added.
- Two UI people worked on the front-end.
- Looks like about 14 people worked on the product in from engineering, design, PM, and operations.

When we look at the messaging system and this analytics system, we notice what the two systems have in common: large numbers, HBase, real-time. The challenge of dealing with huge write loads in a reliable and timely fashion is a common substrate to these problems. Facebook is focussing on the HBase, Hadoop, HDFS ecosystem and counting on the operational quirks to be ironed out later. Others choose Cassandra because they love it's scalability, multi-datacenter functionality, and ease of operational use, but it doesn't fit as cleanly into the overall analytics stack.
What does this mean for you? Even if you aren't Facebook this architecture is simple enough and composed of enough off the shelf tools that it could work for much small projects too.

Twitter’s Plan to Analyze 100 Billion Tweets

If Twitter is the “nervous system of the web” as some people think, then what is the brain that makes sense of all those signals (tweets) from the nervous system? That brain is the Twitter Analytics System and Kevin Weil, as Analytics Lead at Twitter, is the homunculus within in charge of figuring out what those over 100 billion tweets (approximately the number of neurons in the human brain) mean.
Twitter has only 10% of the expected 100 billion tweets now, but a good brain always plans ahead. Kevin gave a talk, Hadoop and Protocol Buffers at Twitter, at the Hadoop Meetup, explaining how Twitter plans to use all that data to an answer key business questions.
What type of questions is Twitter interested in answering? Questions that help them better understand Twitter. Questions like:

How many requests do we serve in a day?
What is the average latency?
How many searches happen in day?
How many unique queries, how many unique users, what is their geographic distribution?
What can we tell about as user from their tweets?
Who retweets more?
How does usage differ for mobile users?
What went wrong at the same time?
Which features get users hooked?
What is a user’s reputation?
How deep do retweets go?
Which new features worked?

And many many more. The questions help them understand Twitter, their analytics system helps them get the answers faster.

Hadoop and Pig are Used for Analysis

Any question you can think of requires analyzing big data for answers. 100 billion is a lot of tweets. That’s why Twitter uses Hadoop and Pig as their analysis platform. Hadoop provides: key-value storage on a distributed file system, horizontal scalability, fault tolerance, and map-reduce for computation. Pig is a query a mechanism that makes it possible to write complex queries on top of Hadoop.
Saying you are using Hadoop is really just the beginning of the story. The rest of the story is what is the best way to use Hadoop? For example, how do you store data in Hadoop?
This may seem an odd question, but the answer has big consequences. In a relational database you don’t store the data, the database stores you, er, it stores the data for you. APIs move that data around in a row format.
Not so with Hadoop. Hadoop’s key-value model means it’s up to you how data is stored. Your choice has a lot to do with performance, how much data can be stored, and how agile you can be in reacting to future changes.
Each tweet has 12 fields, 3 of which have sub structure, and the fields can and will change over time as new features are added. What is the best way to store this data?

Data is Stored in Protocol Buffers to Keep it Efficient and Flexible

Twitter considered CSV, XML, JSON, and Protocol Buffers as possible storage formats. Protocol Buffer is a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats. BSON (binary JSON) was not evaluated, but would probably not work because it doesn’t have an IDL (interface definition language.) Avro is one potential option that they’ll look into in the future.
An evaluation matrix was created which declared Protocol Buffers the winner. Protocol Buffers won because it allows data to be split across different nodes; it is reusable for data other than just tweets (logs, file storage, RPC, etc); it parses efficiently; fields can be added, changed, and deleted without having to change deployed code; the encoding is small; it supports hierarchical relationships. All the other options failed one or more of these criteria.

IDL Used for Codegen

Surprisingly efficiency, flexibility and other sacred geek metrics were not the only reason Twitter liked Protocol Buffers. What is often considered a weakness, Protocol Buffer’s use of an IDL to describe data structures, is actually considered a big win by Twitter. Having to define data structure IDL is often seen as a useless waste of time. But from the IDL they generate, as part of the build process, all Hadoop related code: Protocol Buffer InoutFOrmats, OutputFormats, Writables, Pig LoadFuncs, Pig StoreFuncs, and more.
All the code that once was written by hand for each new data structure is now simply auto generated from the IDL. This saves ton of effort and the code is much less buggy. IDL actually saves time.
At one point model driven auto generation was a common tactic on many projects. Then fashion moved to hand generating everything. Codegen it seems wasn't agile enough. Once you hand generate everything you start really worrying about the verbosity of your language, which moved everyone to more dynamic languages, and ironically DSLs were still often listed as an advantage of languages like Ruby. Another consequence of hand coding was the framework of the weekitis. Frameworks help blunt the damage caused by thinking everything must be written from scratch.
It’s good to see code generation coming into fashion again. There’s a lot of power in using a declarative specification and then writing highly efficient, system specific code generators. Data structures are the low hanging fruit, but it’s also possible to automate larger more complex processes.
Overall it was a very interesting and useful talk. I like seeing the careful evaluation of different options based on knowing what you want and why. It's refreshing to see how these smart choices can synergize and make a better and more stable system.

Hadoop and Protocol Buffers at Twitter
A Peek Under Twitter's Hood - Twitter’s open source page goes live.
Hadoop
ProtocolBuffers
Hadoop Bay Area User Group - Feb 17th at Yahoo! - RECAP
Twitter says "Today, we are seeing 50 million tweets per day—that's an average of 600 tweets per second."