
Friday 31 October 2014

Twitter’s Plan to Analyze 100 Billion Tweets

If Twitter is the “nervous system of the web,” as some people think, then what is the brain that makes sense of all those signals (tweets) from the nervous system? That brain is the Twitter Analytics System, and Kevin Weil, as Analytics Lead at Twitter, is the homunculus within, in charge of figuring out what those more than 100 billion tweets (approximately the number of neurons in the human brain) mean.
Twitter has only 10% of the expected 100 billion tweets now, but a good brain always plans ahead. Kevin gave a talk, Hadoop and Protocol Buffers at Twitter, at the Hadoop Meetup, explaining how Twitter plans to use all that data to answer key business questions.
What type of questions is Twitter interested in answering? Questions that help them better understand Twitter. Questions like:
  1. How many requests do we serve in a day?
  2. What is the average latency?
  3. How many searches happen in a day?
  4. How many unique queries, how many unique users, what is their geographic distribution?
  5. What can we tell about a user from their tweets?
  6. Who retweets more?
  7. How does usage differ for mobile users?
  8. What went wrong at the same time?
  9. Which features get users hooked?
  10. What is a user’s reputation?
  11. How deep do retweets go?
  12. Which new features worked?
And many, many more. The questions help them understand Twitter; their analytics system helps them get the answers faster.

Hadoop and Pig are Used for Analysis

Any question you can think of requires analyzing big data for answers. 100 billion is a lot of tweets. That’s why Twitter uses Hadoop and Pig as their analysis platform. Hadoop provides: key-value storage on a distributed file system, horizontal scalability, fault tolerance, and map-reduce for computation. Pig is a query mechanism that makes it possible to write complex queries on top of Hadoop.
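To make the map-reduce part of that stack concrete, here is a minimal sketch, in Python, of a Hadoop Streaming style mapper and reducer that could answer a question like “how many requests do we serve in a day?” The tab-separated log layout with a leading date field is an assumption for illustration, and in practice Twitter expresses jobs like this as Pig queries rather than hand-written MapReduce:

    # mapper.py - hedged sketch for Hadoop Streaming: emit (date, 1) per log line.
    # Assumes tab-separated request logs whose first field is a YYYY-MM-DD date.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(fields[0] + "\t1")

    # reducer.py - sum the counts per date (Hadoop sorts mapper output by key).
    import sys

    current_date, count = None, 0
    for line in sys.stdin:
        date, value = line.rstrip("\n").split("\t")
        if date != current_date and current_date is not None:
            print(current_date + "\t" + str(count))
            count = 0
        current_date = date
        count += int(value)
    if current_date is not None:
        print(current_date + "\t" + str(count))

A job like this would be launched with the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <logs> -output <counts>, with the paths as placeholders.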
Saying you are using Hadoop is really just the beginning of the story. The rest of the story is what is the best way to use Hadoop? For example, how do you store data in Hadoop?
This may seem an odd question, but the answer has big consequences. In a relational database you don’t store the data, the database stores you, er, it stores the data for you. APIs move that data around in a row format.
Not so with Hadoop. Hadoop’s key-value model means it’s up to you how data is stored. Your choice has a lot to do with performance, how much data can be stored, and how agile you can be in reacting to future changes.
Each tweet has 12 fields, 3 of which have sub structure, and the fields can and will change over time as new features are added. What is the best way to store this data?

Data is Stored in Protocol Buffers to Keep it Efficient and Flexible

Twitter considered CSV, XML, JSON, and Protocol Buffers as possible storage formats. Protocol Buffers is a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats. BSON (binary JSON) was not evaluated, but would probably not work because it doesn’t have an IDL (interface definition language). Avro is one potential option that they’ll look into in the future.
An evaluation matrix was created which declared Protocol Buffers the winner. Protocol Buffers won because it allows data to be split across different nodes; it is reusable for data other than just tweets (logs, file storage, RPC, etc); it parses efficiently; fields can be added, changed, and deleted without having to change deployed code; the encoding is small; it supports hierarchical relationships. All the other options failed one or more of these criteria.
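As a rough sketch of what this looks like in practice, the hypothetical schema and Python snippet below show the compact binary encoding and the kind of forward-compatible field evolution the evaluation rewarded. The tweet.proto contents, field names, and the generated tweet_pb2 module are illustrative assumptions, not Twitter’s actual schema:

    # A hypothetical tweet.proto (illustrative only, not Twitter's real schema):
    #
    #   message Tweet {
    #     optional int64  id         = 1;
    #     optional string user       = 2;
    #     optional string text       = 3;
    #     optional int64  created_at = 4;   // fields can be added later without
    #   }                                   // breaking already-deployed readers
    #
    # After `protoc --python_out=. tweet.proto`, the generated classes give a
    # compact binary record format:

    from tweet_pb2 import Tweet   # module generated by protoc from the sketch above

    t = Tweet(id=1, user="someuser", text="an example tweet")
    blob = t.SerializeToString()           # small binary encoding, stored in HDFS files

    parsed = Tweet()
    parsed.ParseFromString(blob)           # readers skip fields they don't know about
    print(parsed.user, len(blob), "bytes")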

IDL Used for Codegen

Surprisingly, efficiency, flexibility and other sacred geek metrics were not the only reason Twitter liked Protocol Buffers. What is often considered a weakness, Protocol Buffers’ use of an IDL to describe data structures, is actually considered a big win by Twitter. Having to define a data structure in an IDL is often seen as a useless waste of time. But from the IDL they generate, as part of the build process, all Hadoop-related code: Protocol Buffer InputFormats, OutputFormats, Writables, Pig LoadFuncs, Pig StoreFuncs, and more.
All the code that once was written by hand for each new data structure is now simply auto generated from the IDL. This saves a ton of effort and the code is much less buggy. The IDL actually saves time.
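A minimal sketch of how such a codegen step might be wired into a build, assuming a schemas/ directory of .proto files; Twitter’s real pipeline generates Java InputFormats, Writables, and Pig functions, while this only shows the generic protoc invocation:

    # Hedged sketch: regenerate serialization code from every IDL file at build time,
    # so nothing structure-specific is written (or maintained) by hand.
    import glob
    import subprocess

    for proto in glob.glob("schemas/*.proto"):          # hypothetical schema directory
        subprocess.run(
            ["protoc", "--proto_path=schemas", "--python_out=generated", proto],
            check=True,                                 # fail the build on schema errors
        )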
At one point model driven auto generation was a common tactic on many projects. Then fashion moved to hand generating everything. Codegen, it seems, wasn't agile enough. Once you hand generate everything you start really worrying about the verbosity of your language, which moved everyone to more dynamic languages, where, ironically, DSLs were still often listed as an advantage of languages like Ruby. Another consequence of hand coding was framework-of-the-week-itis. Frameworks help blunt the damage caused by thinking everything must be written from scratch.
It’s good to see code generation coming into fashion again. There’s a lot of power in using a declarative specification and then writing highly efficient, system specific code generators. Data structures are the low hanging fruit, but it’s also possible to automate larger more complex processes.
Overall it was a very interesting and useful talk. I like seeing the careful evaluation of different options based on knowing what you want and why.  It's refreshing to see how these smart choices can synergize and make a better and more stable system.

Related Articles

  1. Hadoop and Protocol Buffers at Twitter
  2. A Peek Under Twitter's Hood - Twitter’s open source page goes live.
  3. Hadoop
  4. ProtocolBuffers
  5. Hadoop Bay Area User Group - Feb 17th at Yahoo! - RECAP
  6. Twitter says "Today, we are seeing 50 million tweets per day—that's an average of 600 tweets per second."

Thursday 31 July 2014

BigData Challenges


The Data Science Process

A data science project may begin with a very well-defined question  -- Which of these 200 genetic markers are the best predictors of disease X? -- or an open-ended one -- How can we  decrease emergency room wait time in a hospital? Either way, once the motivating question has been identified, a data science project progresses through five iterative stages:

  • Harvest Data: Find and choose data sources
  • Clean Data: Load data into pre-processing environment; prep data for analysis
  • Analysis: Develop and execute the actual analysis
  • Visualize: Display the results in ways that effectively communicate new insights, or point out where the analysis needs to be further developed
  • Publish: Deliver the results to their intended recipient, whether human or machine
Each of these stages is associated with its own challenges, and correspondingly, with a plethora of tools that have sprung up to address those particular challenges.  Data science is an iterative process; at any stage, it may be necessary to circle back to earlier stages in order to incorporate new data or revise models.

Below is an outline of challenges that arise in each stage; it is meant to give the reader a sense of the scale and complexity of such challenges, not to enumerate them exhaustively:

Challenges of stage 1: Harvest Data



This is a classic needle-in-a-haystack problem: there exist millions of available data sets in the world, and of those only a handful are suitable, much less accessible, for a particular project. The exact criteria for what counts as "suitable data" will vary from project to project, but even when the criteria are fairly straightforward, finding data sets and proving that they meet those criteria can be a complex and time-consuming process.

When dealing with public data, the data sets are scattered and often poorly described. Organizations ranging from the federal government to universities to companies have begun to publish and/or curate large public data sets, but this is a fairly new practice, and there is much room for improvement. Dealing with internal data is not any easier: within an enterprise, there can be multiple data warehouses belonging to different departments, each contributed to by multiple users with little integration or uniformity across warehouses.

In addition to format and content, metadata, particularly provenance information, is crucial: the history of a data set, who produced it, how it was produced, when it was last updated, etc. also determine how suitable a data set is for a given project. However, this information is not often tracked and/or stored with the data, and if it is, it may be incomplete or manually generated.

Challenges of stage 2: Cleanse/Prep Data



This stage can require operations as simple as visually inspecting samples of the data to ones as complex as transforming the entire data set. Format and content are two major areas of concern.

With respect to format, data comes in a variety of forms, from highly structured (relational) to unstructured (photos, text documents) to everything in between (XML, CSVs), and these formats may not play well together. The user may need to write custom conversion code, rely on purpose-built software, or even manually manipulate the data in programs like Excel. This latter path becomes a non-option once the data set exceeds a certain size.
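As a small illustration of the kind of conversion code this stage often demands, the sketch below flattens a CSV file into newline-delimited JSON so it can sit next to semi-structured sources; the file names and columns are assumptions:

    # Hedged sketch: convert a CSV extract to newline-delimited JSON.
    # "records.csv" and its columns are placeholders for a real data set.
    import csv
    import json

    with open("records.csv", newline="", encoding="utf-8") as src, \
         open("records.jsonl", "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps(row) + "\n")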

With respect to content and data quality, there are numerous criteria to consider, but  some major ones are accuracy, internal consistency, and compliance with applicable regulations (e.g. privacy laws, internal policies). The same data may be stored in different ways across data sets (e.g. multiple possible formats for date/time information), or the data set may have multiple "parent" data sets whose content must meet the same criteria.
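A tiny sketch of that consistency problem: normalizing several plausible date spellings to a single ISO format before records from different parent data sets are merged. The candidate formats are assumptions to be extended for real data:

    # Hedged sketch: normalize assorted date spellings to ISO 8601.
    from datetime import datetime

    CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"]

    def normalize_date(raw):
        for fmt in CANDIDATE_FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return None   # flag for manual review rather than silently guessing

    print(normalize_date("07/31/2014"))    # -> 2014-07-31
    print(normalize_date("31 Jul 2014"))   # -> 2014-07-31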

In the Hadoop ecosystem, one common tool for initially inspecting and prepping data is Hive. Hive is commonly used for ad-hoc querying and data summarization, and in this context, Hive's strengths are its familiar SQL-like  query language (HiveQL) and its ability to handle both structured and semi-structured data.
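For a sense of what that ad-hoc summarization looks like, here is a hedged sketch using the PyHive client; the host, the web_logs table, and its columns are illustrative assumptions rather than a standard schema:

    # Hedged sketch: an ad-hoc HiveQL summary issued through the PyHive client.
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000)   # placeholder HiveServer2
    cursor = conn.cursor()
    cursor.execute(
        "SELECT to_date(event_time) AS day, COUNT(*) AS events "
        "FROM web_logs GROUP BY to_date(event_time)"
    )
    for day, events in cursor.fetchall():
        print(day, events)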

However, Hive lacks the functional flexibility needed for significantly transforming raw data into a form more fitting for the planned analysis, often a standard part of the "data munging" process. Outside of Hadoop, data scientists use languages such as R, Python or Perl to execute these transformations, but these tools are limited to the processing power of the machines - often the users' own laptops - on which they are installed.

Challenges of stage 3: Analyze



Once the data is prepared, there is often a "scene change"; that is, the analytics take place in an environment different from the pre-processing environment. For instance, the pre-processing environment may be a data warehouse, while the analysis happens in a desktop application. This can prove to be another logistical challenge, particularly if the pre-processing environment has a greater capacity than the analytical one.

This stage is where data science most clearly borrows from, or is an extension of, statistics. It requires starting with data, forming a hypothesis about what that data says about a given slice of reality, formally modelling that hypothesis, running data through the model, observing the results, refining the hypothesis, refining the model and repeating. Having specialist-level knowledge of the relevant sector or field of study is also very helpful; on the other hand, there is also a risk of confirmation bias if one's background knowledge is given undue weight over what the numbers say.

Given that data sets can contain up to thousands of variables and millions or billions of records, performing data science on these very large data sets often calls for approaches such as machine learning and data mining. Both involve programs based on statistical principles that can complete tasks and answer questions without explicit human direction, usually by means of pattern recognition. Machine learning is defined by algorithms and performance metrics enabling programs to interpret new data in the context of historical data and continuously revise predictions.

A machine learning program is typically aimed at answering a specific question about a data set with generally known characteristics, all with minimal human interaction. In contrast, data mining is defined by the need to discover previously unknown features of a data set that may be especially large and unstructured. In this case, the task or question is less specific, and a program may require more explicit direction from the human data scientist to reveal useful features of the data.
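To make the contrast concrete, here is a small scikit-learn sketch on synthetic data (the data and model choices are assumptions purely for illustration): the supervised model answers a specific, pre-defined question from labeled history, while the clustering step hunts for structure nobody asked about explicitly.

    # Hedged sketch: machine learning vs. a data-mining-flavored pass, on toy data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Supervised: answer a known question ("which class does this record belong to?")
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))

    # Unsupervised: look for previously unknown groupings in the same records
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("records per cluster:", [int((clusters == k).sum()) for k in range(3)])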

Challenges of stage 4: Visualize


Visualization is necessary for both the "Analyze" and "Publish" stages, though it plays slightly different roles in each:
In the former, the data scientist uses visualization to more easily see the results of each round of testing. Graphics at this stage are often bare-bones and simple - scatterplots, histograms, and the like - but they effectively capture feedback for the data scientist on what the latest round of modeling and testing indicates; a sketch of such plots appears below.

In the latter, the emphasis is often on interactivity and intuitive graphics, so that the data products can be used as effectively as possible by the end users. For example, if a data science project's goal is to mine hospital data for insights on how the hospital can ensure continuity in the medical team assigned to a patient throughout their stay, it is not the data scientist's job to predict all the exact situations in which the doctors will use the results of the analysis. Rather, the goal of visualization in this case is to expose facets of the data in a way that is intuitive to the user and still gives them flexibility; i.e. it does not lock them into one view or use of the data.
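As a concrete example of the bare-bones, Analyze-stage graphics described above, here is a sketch of the quick diagnostic plots a data scientist might glance at between modeling rounds; the random data stands in for real predictions and residuals:

    # Hedged sketch: quick diagnostic plots (scatterplot + histogram) on stand-in data.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    predicted = rng.normal(size=500)
    actual = predicted + rng.normal(scale=0.5, size=500)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(predicted, actual, s=8)
    ax1.set_xlabel("predicted"); ax1.set_ylabel("actual")
    ax2.hist(actual - predicted, bins=30)
    ax2.set_xlabel("residual")
    plt.tight_layout()
    plt.show()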

The emphasis on visualization in data science is a result of the same factors that have produced the increased demand for data science itself: the scale and complexity of available data has grown to a point where useful insights do not lie in plain view but must be unearthed, polished, and displayed to their best advantage. Visualization is a key part of accomplishing those goals.

Challenges of stage 5: Publish



This stage could also be called "Implementation." The potential challenges here are as varied as the potential goals of the use case. A data science project could be part of product development for a smartphone app, in which case the intended recipient of the output could be either the app designers or the already-deployed app itself.

Similarly, a data science project could be used by financial firms to inform investment decision-making, in which case the recipient could be either a piece of automated trading software or a team of brokers. Suffice it to say that data scientists will need to be concerned with many of the same features of the output data sets as they were with the input data sets - format, content, provenance - and that a data scientist's mastery of the data also involves knowing how to best present it, whether to machines or to humans.
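One low-tech but useful habit at this stage is to publish results alongside minimal provenance metadata, so a downstream program or reader knows where the numbers came from. A sketch, with illustrative field names and paths:

    # Hedged sketch: write results as JSON with basic provenance metadata attached.
    import json
    from datetime import datetime, timezone

    results = {"segment_a": 0.42, "segment_b": 0.57}        # stand-in analysis output

    payload = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "source_datasets": ["warehouse.events_v3"],         # hypothetical lineage
        "method": "logistic regression, model v1",
        "results": results,
    }

    with open("published_results.json", "w", encoding="utf-8") as out:
        json.dump(payload, out, indent=2)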

BigData Success Stories


Every company, in particular large enterprises, faces both great opportunities and challenges with respect to extracting value from the data available to them. In each of the following notable examples, organizations have used data science to craft solutions to pressing problems, a process which in turn opens up more opportunities and challenges.

"Moneyball"


One of the best-known examples of data science, due to a best-selling book and recent film, is the collective movement in baseball toward intensive statistical analysis of player performance to complement traditional appraisals. Led by the work of baseball analyst Bill James, and ultimately by the decisions of Oakland Athletics general manager Billy Beane, teams discovered that some players were undervalued by traditional metrics. Since Boston hired James in 2003, the Red Sox have won two World Series, following more than 80 years without a title.
Among sports, baseball presented an especially rich opportunity for analysis, given meticulous records for games dating back more than a hundred years. Today other sports organizations, such as the NBA, have begun to apply some of the same techniques.

 

Get Out the Vote 2012


In the 2012 presidential election, both campaigns supported get-out-the-vote activities with intricate polling, data mining, and statistical analysis. In the Obama campaign, this analytics effort was code-named Narwhal, while the Romney campaign dubbed theirs Orca. The aim of these systems, described by Atlantic Monthly and other sources, was to identify voters inclined to support the candidate, persuade them to do so, and direct resources to ensure they reached the polling place on Election Day. Reportedly, Narwhal had a capacity of 175 million names and employed more than 120 data scientists. Orca was comparatively small, with a capacity of 23 million names.

Ultimately, Narwhal is credited with having delivered the Obama campaign's "ground game" in battleground states, turning a seemingly close election into a comfortable victory and validating many polls that had predicted a small but consistent advantage for the incumbent.


The "Like" Button


In 2010, Facebook sought to determine whether the proposed "like" button would catch on with users and how this link to other websites could drive traffic to the site. The company offered the feature to a small sample of users and collected detailed web logs for those users. This data was processed into structured form (using Hadoop and Hive). Analytics showed that overall content generation increased, even content unrelated to the "like" button.
Aside from the impact of the feature on the Facebook website itself, this feature provides the company with a wealth of information about user behavior across the Internet which can be used to predict reactions to new features and determine the impact of online advertising.


Pregnancy Prediction


In early 2012, the New York Times described a concerted effort by retail giant Target to use purchase records - both the identity of the purchased items and the timing of those purchases - to classify customers, particularly pregnant women. Predicting when women were in the early stages of a pregnancy presented the opportunity to gain almost exclusive consumer loyalty for a period when families might want to save time by shopping at a single location. Once individuals were identified by the company's data science team, Target mailed coupons for products often purchased by pregnant women. In one incident, a man called the company to complain that his daughter had received mailings filled with such coupons, and that it was unacceptable for the company to encourage her to become pregnant. Days later, the man called back to apologize: Target had, in fact, correctly surmised that his daughter was pregnant.


Open-Source Recommendation Algorithms


For online businesses, making good recommendations to customers is a classic data science challenge. As described by Wired, online retailer Overstock.com used to spend $2 million annually on software to drive recommendations for additional purchases. Last year, Overstock.com's R&D team used machine learning algorithms from Mahout, an open source project in the Hadoop ecosystem, to develop a news article recommendation app based on articles that the user had read.
The success of this app inspired the company to use Mahout to replace the recommendation service for products on its own main website. By turning to an open source solution for data science, the company is saving millions of dollars. In addition to Overstock.com, other players in online retail are now taking a close look at Mahout and other Hadoop technologies.


Transparency in Healthcare


An abundance of publicly available data online has stoked further interest in "democratizing" access to information that would help the average citizen understand certain markets. Wired talked to Fred Trotter about his successful Freedom of Information Act (FOIA) request to obtain records on doctor referrals, which could provide insights into the healthcare system. The resulting data set, which he called the Doctor Social Graph, has already been turned into a tool for patients. Trotter hopes to combine this data set with others from healthcare organizations to develop a tool for rating doctors.


Photo Quality Prediction


A travel app startup called JetPac had a problem: it needed a better way to automatically identify the best pictures among the thousands taken by its users, based on metadata such as captions, dimensions, and location. In fall 2011, the company partnered with another startup, Kaggle, to set up a competition among data scientists around the world to develop such an algorithm. The ideal algorithm would allow a machine to come to the same conclusion as a human about the quality of a picture, and the top prize was $5,000.
Building on the highest-ranked algorithm from the competition, JetPac successfully introduced the new functionality into their product. According to Wired, the company subsequently received $2.4 million in venture capital funding. With a little help from the data science community, JetPac is well on its way to building its business.


Energy Efficiency


For almost five years, the DC-based startup Opower has built a business on providing power consumers with recommendations on how to reduce their bills. The company's system collects data from 75 utilities on more than 50 million homes, then sends recommendations through email and other means.
Although other companies offer similar services, Opower has been able to scale up significantly using Hadoop to store large amounts of data from power meters across the country. The more data the company can analyze, the greater the opportunity for good recommendations, and the more energy there is to be saved. Success has enabled Opower to develop new offerings in partnership with established companies such as Facebook.


Improving Patient Outcomes



Health insurance giant Aetna was not achieving the desired level of success in addressing symptoms of metabolic syndrome, which is associated with heart disease and strokes. In summer 2012, the company created an in-house data science team, which went to work on the issue. In partnership with an outside lab focused on metabolic syndrome, Aetna used its data on more than 18 million customers to design personalized recommendations for patients suffering from related symptoms.
Aetna intends to harvest additional data available for this kind of analysis by incorporating natural language processing to read notes handwritten by doctors. Ultimately, the company plans to use data science to bring down costs and improve outcomes for cancer patients.


Pre-Paid Phone Service


Every company wants to know the right time to reach out to customers, making sure the message has the maximum impact and avoiding a perception of saturation. A company called Globys is helping large telecommunications corporations understand when to make the pitch. In particular, Globys analyzed data on users of pre-paid phone services, who are not locked into a longer contract. These users face a decision on a regular basis of whether to stay with a particular company or make a change. Globys was able to identify the right time in the user's "recharge cycle" for the company to reach out. With these recommendations, companies have seen revenue from pre-paid services increase up to 50 percent.


How a little open source project came to dominate big data


It began as a nagging technical problem that needed solving. Now, it’s driving a market that’s expected to be worth $50.2 billion by 2020.

There are countless open source projects with crazy names in the software world today, but the vast majority of them never make it onto enterprises’ collective radar. Hadoop is an exception of pachydermic proportions.

Named after a child’s toy elephant, Hadoop is now powering big data applications at companies such as Yahoo and Facebook; more than half of the Fortune 50 use it, providers say.

The software’s “refreshingly unique approach to data management is transforming how companies store, process, analyze and share big data,” according to Forrester analyst Mike Gualtieri. “Forrester believes that Hadoop will become must-have infrastructure for large enterprises.”

Globally, the Hadoop market was valued at $1.5 billion in 2012; by 2020, it is expected to reach $50.2 billion.

It’s not often a grassroots open source project becomes a de facto standard in industry. So how did it happen?

‘A market that was in desperate need’

“Hadoop was a happy coincidence of a fundamentally differentiated technology, a permissively licensed open source codebase and a market that was in desperate need of a solution for exploding volumes of data,” said RedMonk cofounder and principal analyst Stephen O’Grady. “Its success in that respect is no surprise.”
Created by Doug Cutting and Mike Cafarella, the software—like so many other inventions—was born of necessity. In 2002, the pair were working on an open source search engine called Nutch. “We were making progress and running it on a small cluster, but it was hard to imagine how we’d scale it up to running on thousands of machines the way we suspected Google was,” Cutting said.

Shortly thereafter Google published a series of academic papers on its own Google File System and MapReduce infrastructure systems, and “it was immediately clear that we needed some similar infrastructure for Nutch,” Cafarella said.
“The way Google was approaching things was different and powerful,” Cutting explained. Whereas up to that point “you had to build a special-purpose system for each distributed thing you wanted to do,” Google’s approach offered a general-purpose automated framework for distributed computing. “It took care of the hard part of distributed computing so you could focus just on your application,” Cutting said.

Both Cutting and Cafarella (who are now chief architect at Cloudera and University of Michigan assistant professor of computer science and engineering, respectively) knew they wanted to make a version of their own—not just for Nutch, but for the benefit of others as well—and they knew they wanted to make it open source.
“I don’t enjoy the business aspects,” Cutting said. “I’m a technical guy. I enjoy working on the code, tackling the problems with peers and trying to improve it, not trying to sell it. I’d much rather tell people, ‘It’s kind of OK at this; it’s terrible at that; maybe we can make it better.’ To be able to be brutally honest is really nice—it’s much harder to be that way in a commercial setting.”

But the pair knew that the potential upside of success could be staggering.  “If I was right and it was useful technology that lots of people wanted to use, I’d be able to pay my rent—and without having to risk my shirt on a startup,” Cutting said.
For Cafarella, “Making Nutch open source was part of a desire to see search engine technology outside the control of a few companies, but also a tactical decision that would maximize the likelihood of getting contributions from engineers at big companies. We specifically chose an open source license that made it easy for a company to contribute.”

It was a good decision. “Hadoop would not have become a big success without large investments from Yahoo and other firms,” Cafarella said.

‘How would you compete with open source?’

So Hadoop borrowed an idea from Google, made the concept open source, and both encouraged and got investment from powerhouses like Yahoo. But that wasn’t all that drove its success. Luck—in the form of sheer, unanticipated market demand—also played a key role.

“I knew other people would probably have similar problems, but I had no idea just how many other people,” Cutting said. “I thought it would be mostly people building text search engines. I didn’t see it being used by folks in insurance, banking, oil discovery—all these places where it’s being used today.”

Looking back, “my conjecture is that we were early enough, and that the combination of being first movers and being open source and being a substantial effort kept there from being a lot of competitors early on,” he said. “Mike and I got so far, but it took tens of engineers from Yahoo several more years to make it stable.”
And even if a competitor did manage to catch up, “how would you compete with something open source?” Cutting said. “Competing against open source is a tough game—everybody else is collaborating on it; the cost is zero. It’s easier to join than to fight.”

IBM, Microsoft, and Oracle are among the large companies that chose to collaborate on Hadoop.
Though Cafarella isn’t surprised that Web companies use Hadoop, he is astonished at “how many people now have data management problems that 12 years ago were exceedingly rare,” he said. “Everyone now has the problems that used to belong to just Yahoo and Google.”

Hadoop represents “somewhat of a turning point in the primary drivers of open source software technology,” said Jay Lyman, a senior analyst for enterprise software with 451 Research. Before, open source software such as the Linux operating system was best known for offering a cost-effective alternative to proprietary software like Microsoft’s Windows. “Cost savings and efficiency drove much of the enterprise use,” Lyman said.

With the advent of NoSQL databases and Hadoop, however, “we saw innovation among the primary drivers of adoption and use,” Lyman said. “When it comes to NoSQL or Hadoop technology, there is not really a proprietary alternative.”
Hadoop’s success has come as a pleasant surprise to its creators. “I didn’t expect an open source project would ever take over an industry like this,” Cutting said. “I’m overjoyed.”

And it’s still on a roll. “Hadoop is now much bigger than the original components,” Cafarella said. “It’s an entire stack of tools, and the stack keeps growing. Individual components might have some competition—mainly MapReduce—but I don’t see any strong alternative to the overall Hadoop ecosystem.”

The project’s adaptability “argues for its continued success,” RedMonk’s O’Grady said. “Hadoop today is a very different, and more versatile, project than it was even a year or two ago.”

But there’s plenty of work to be done. Looking ahead, Cutting—with the support of Cloudera—has begun to focus on the policy needed to accommodate big data technology.

“Now that we have this technology and so much digitization of just about every aspect of commerce and government and we have these tools to process all this digital data, we need to make sure we’re using it in ways we think are in the interests of society,” he said. “In many ways, the policy needs to catch up with the technology.

“One way or other, we are going to end up with laws. We want them to be the right ones.”

 
