Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Thursday, 31 July 2014

BigData Success Stories

Every company, and large enterprises in particular, faces both great opportunities and great challenges in extracting value from the data available to it. In each of the following notable examples, organizations have used data science to craft solutions to pressing problems, a process which in turn opens up more opportunities and challenges.

"Moneyball"

One of the best-known examples of data science, due to a best-selling book and recent film, is the collective movement in baseball toward intensive statistical analysis of player performance to complement traditional appraisals. Led by the works of baseball analyst Bill James, and ultimately by management decisions of Oakland Athletics general manager Billy Beane, teams discovered that some players were undervalued by traditional metrics. Since Boston hired James in 2003, the Red Sox have won two World Series, following more than 80 years without a title.
Among sports, baseball presented an especially rich opportunity for analysis, given access to meticulous records for many games dating back more than one hundred years. Today other sports organizations, such as the NBA, have begun to apply some of the same techniques.

BaseballProspectus.com: A Statistician Rereads Bill James
ESPN.com: John Hollinger reflects on the about-face of professional baseball's stance toward analytics
ESPN.com: Rick Carlisle, the Dallas Mavericks, and number-crunching

Get Out the Vote 2012

In the 2012 presidential election, both campaigns supported get-out-the-vote activities with intricate polling, data mining, and statistical analysis. In the Obama campaign, this analytics effort was code-named Narwhal, while the Romney campaign dubbed theirs Orca. The aim of these systems, described by The Atlantic and other sources, was to identify voters inclined to support the candidate, persuade them to vote for the candidate of the party in question, and direct resources to ensure they reached the polling place on Election Day. Reportedly, Narwhal had a capacity of 175 million names and employed more than 120 data scientists. Orca was comparatively small, with a capacity of 23 million names.

Ultimately, Narwhal is credited with having delivered for the Obama campaign's "ground game" in battleground states, turning a seemingly close election into a comfortable victory and validating many polls that had predicted a small but consistent advantage for the incumbent.

Additional Links

TheAtlantic.com: When the Nerds Go Marching In
Narwhal vs. Orca: A breakdown

The "Like" Button

In 2010, the social networking company Facebook sought to determine whether the proposed "like" button would catch on with users and how this link to other websites could drive traffic to the site. The company offered the feature to a small sample of users and collected detailed web logs for those users. This data was processed into structured form (using Hadoop and Hive). Analytics showed that overall content generation increased, even for content unrelated to the "like" button.
Aside from the impact of the feature on the Facebook website itself, this feature provides the company with a wealth of information about user behavior across the Internet which can be used to predict reactions to new features and determine the impact of online advertising.
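The comparison behind such a test can be sketched in a few lines. The following is only an illustration of a standard two-proportion z-test, not Facebook's actual analysis: the function name and the user counts are made up for the example.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (lift, z) comparing the rate in group B against group A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the assumption that both groups behave the same
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return p_b - p_a, (p_b - p_a) / se

# Hypothetical counts: users who generated content during the test period
control_posters, control_users = 4_000, 10_000   # feature not shown
treated_posters, treated_users = 4_300, 10_000   # feature shown

lift, z = two_proportion_z(control_posters, control_users,
                           treated_posters, treated_users)
print(f"lift={lift:.3f}, z={z:.2f}")
```

A z value well above roughly 2 suggests the difference between the two groups is unlikely to be due to chance alone.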

Additional Links

Infoq.com: Facebook on Hadoop, Hive, HBase, and A/B Testing

Pregnancy Prediction

In early 2012, the New York Times described a concerted effort by consumer goods giant Target to use purchase records, including both the identity of the purchased items and the temporal distribution of those purchases, to classify customers, particularly pregnant women. Predicting when women were in the early stages of a pregnancy presented the opportunity to gain almost exclusive consumer loyalty for a period when families might want to save time by shopping at a single location. Once individuals were identified by the company's data science team, Target mailed coupons for products often purchased by pregnant women. In one incident, a man called the company to complain that his daughter had received mailings filled with such coupons, and that it was unacceptable for the company to be encouraging her to become pregnant. Days later, the man called to apologize: Target had, in fact, correctly surmised that his daughter was pregnant.

Additional Links

Forbes.com: How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did
NYTimes.com: How Companies Learn Your Secrets

Open-Source Recommendation Algorithms

For online businesses, making good recommendations to customers is a classic data science challenge. As described by Wired, online retailer Overstock.com used to spend $2 million annually on software to drive recommendations for additional purchases. Last year, Overstock.com's R&D team used machine learning algorithms from Mahout, an open source Hadoop project, to develop a news article recommendation app based on articles that the user had read.
The success of this app inspired the company to use Mahout to replace the recommendation service for products on its own main website. By turning to an open source solution for data science, the company is saving millions of dollars. In addition to Overstock.com, other players in online retail are now taking a close look at Mahout and other Hadoop technologies.
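To make the idea of recommending items based on what a user has already read or bought more concrete, here is a minimal sketch of an item co-occurrence recommender in plain Python. It is not Mahout's API and not Overstock.com's implementation; the function names and the tiny `histories` data set are hypothetical and serve only to illustrate the general technique.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(histories):
    """Count how often each pair of items appears in the same user's history."""
    co = defaultdict(Counter)
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co

def recommend(user, histories, co, k=3):
    """Suggest up to k unseen items that co-occur most with the user's items."""
    seen = set(histories[user])
    scores = Counter()
    for item in seen:
        for other, count in co[item].items():
            if other not in seen:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]

# Hypothetical reading histories: user -> items already read
histories = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c", "d"],
    "u4": ["a"],
}
co = build_cooccurrence(histories)
print(recommend("u4", histories, co))
```

On a real catalog the pair counting is the expensive step, which is why this kind of job is a natural fit for a distributed framework such as Hadoop.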

Additional Links

Wired.com: Mahout, There It Is! Open Source Algorithms Remake Overstock.com

Transparency in Healthcare

An abundance of publicly available data online has stoked further interest in "democratizing" access to information that would help the average citizen understand certain markets. Wired talked to Fred Trotter about his successful Freedom of Information Act (FOIA) request to obtain records on doctor referrals, which could provide insights into the healthcare system. The resulting data set, which he called the Doctor Social Graph, has already been turned into a tool for patients. Trotter hopes to combine this data set with others from healthcare organizations to develop a tool for rating doctors.

Additional Links

Wired.com: Bringing Hidden Healthcare Data Into the Open
Fred Trotter: Tracking the Social Doctor: Opening Up Physician Referral Data (And Much More)

Photo Quality Prediction

A travel app startup called JetPac knew it had a problem. It needed better ways to automatically identify the best pictures among thousands taken by its users, based on metadata such as captions, dimensions, and location. In fall 2011, the company partnered with another startup, Kaggle, to set up a competition among data scientists all over the world to develop an algorithm. The ideal algorithm would allow a machine to come to the same conclusion as a human about the quality of a picture, and the top prize was $5,000.
Building on the highest-ranked algorithm from the competition, JetPac successfully introduced the new functionality into their product. According to Wired, the company subsequently received $2.4 million in venture capital funding. With a little help from the data science community, JetPac is well on its way to building its business.

Additional Links

Wired.com: How One Startup Turned a $5,000 Contest Into Millions
Kaggle.com: Photo Quality Prediction competition

Energy Efficiency

For almost five years, the DC-based startup Opower has built a business on providing power consumers recommendations on how to reduce their bills. The company's system collects data from 75 utilities on more than 50 million homes, then sends recommendations through email and other means.
Although other companies offer similar services, Opower has been able to scale up significantly using Hadoop to store large amounts of data from power meters across the country. The more data the company can analyze, the greater the opportunity for good recommendations, and the more energy there is to be saved. Success has enabled Opower to develop new offerings in partnership with established companies such as Facebook.

Additional Links

GigaOm.com: OPower - The Big Data Energy Player to Beat

Improving Patient Outcomes

Health insurance giant Aetna was not achieving the desired level of success in addressing symptoms of metabolic syndrome, which is associated with heart disease and strokes. In summer 2012, the company created an in-house data science team, and the group went to work on the issue. In partnership with an outside lab focused on metabolic syndrome, Aetna used its data on more than 18 million customers to design personalized recommendations for patients suffering from related symptoms.
Aetna intends to harvest additional data available for this kind of analysis by incorporating natural language processing to read notes handwritten by doctors. Ultimately, the company plans to use data science to bring down costs and improve outcomes for cancer patients.

Additional Links

Gigaom.com: How Aetna is using big data to improve patient health

Pre-Paid Phone Service

Every company wants to know the right time to reach out to customers, making sure the message has the maximum impact and avoiding a perception of saturation. A company called Globys is helping large telecommunications corporations understand when to make the pitch. In particular, Globys analyzed data on users of pre-paid phone services, who are not locked into a longer contract. These users regularly face the decision of whether to stay with a particular company or make a change. Globys was able to identify the right time in the user's "recharge cycle" for the company to reach out. With these recommendations, companies have seen revenue from pre-paid services increase by up to 50 percent.

Additional Links

ZDnet.com: Big data has potential to double telco prepaid revenue: Globys

BigData using R

The role of programming languages in data science

When it comes to data science, programming languages are ubiquitous and essential tools. Two of the most popular languages used in data science are Python and R; other commonly-used languages include Perl, Java, C#, and C++. These languages are used during all stages of data science: harvesting, cleaning or "munging," analysis, and visualization.
While the debates over which language is best are often characterized as "religious wars," some favor the balanced approach of using both at different stages, depending on the strengths of each language. To quote Rachel Schutt, currently a data scientist at Google, "Don't get too attached to tools, languages and methods; use what gets the job done. Be versatile."
With that in mind, this section will discuss one such language/tool, R, as a "Swiss Army knife" for statistical analysis.

R: a Swiss Army knife for data scientists

R Itself

R is an open-source statistical programming language and environment. In addition to being available to any user free of charge, R provides considerably more flexibility to users than proprietary statistical software packages. With R, data scientists can be especially creative in how they approach their problem, from powerful facilities for data cleaning to cutting-edge analytical techniques to refined visualization.

At first, R may not seem like the right tool for everyone. Compared to other analytics tools, there is a steep learning curve. R is a programming language in its own right, an implementation of the S language. The basic R download also includes a minimal user interface, relying only on the command line in the terminal or console.

Fortunately, other groups have developed more user-friendly interfaces such as RStudio IDE, which give R a similar feel to other statistical packages. There are also numerous resources for all levels of R users from novices to advanced users, as shown below. For learning and troubleshooting, R has thorough and accessible documentation supported by a responsive and knowledgeable online community.

Screenshot of RStudio

The most effective R users will have some background in both programming and statistics, like most data scientists. Today, R is used widely in fields such as public health, biostatistics, climate science, market research, economics and financial analysis. Large enterprises often use R for prototyping analysis from start to finish. Known corporate users include Google, Facebook, The New York Times Infographics, Kickstarter, Bing, and Zillow.
One of the most powerful features of R is the extensive library of packages with advanced statistical techniques and custom functions developed by the robust community of expert users. These include packages for domain-specific analysis such as PerformanceAnalytics and Quantmod (finance), geoplot and RGoogleMaps (location data), and Bioconductor (bioinformatics). Other packages enable advanced data science methods such as machine learning and data mining. Finally, packages extend R's already-impressive graphics capabilities, including ggplot and Lattice for static graphics and D3 for interactive graphics.

Learning R

There are many resources for learning and using R. The Comprehensive R Archive Network (CRAN) is an online repository for documentation from the R Development Core Team and all packages developed for R. Key documents include An Introduction to R, R Data Import/Export, and R Language Definition. The R FAQ also provides a broad overview of the language.
For novices, various websites offer tutorials in R. Online learning portals such as Coursera offer courses in data analysis using R as the language of instruction.

In addition to these online resources, there are various books on R. R in a Nutshell by Joseph Adler and R Cookbook by Paul Teetor provide an excellent introduction and reference.

For advanced users, R has a strong community represented in numerous websites and blogs. Much of the content is accessible through the dedicated search engine RSeek. The R Development Core Team organizes an annual conference called useR! as well as The R Journal.

RHadoop

One immediate challenge for R is that all operations are performed in memory. In the context of very large data sets, this slows down computation or makes it impossible to even load the data into R, if the machine does not have sufficient memory resources. However, new R packages are available that adapt R for use in Hadoop. While they are not meant to duplicate all R functionality for use in Hadoop, they greatly enhance the appeal of R for enterprises interested in data science.
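The alternative to loading everything into memory is to process records one at a time and keep only a small running summary. The sketch below illustrates that idea in plain Python rather than R or Hadoop; the column names `meter_id` and `kwh` and the sample rows are hypothetical.

```python
import csv
import io

def streaming_mean(rows, column):
    """Average one column while holding only a count and a running total."""
    count = 0
    total = 0.0
    for row in rows:          # rows are consumed one at a time
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")

# A tiny in-memory stand-in for a file that would be too large to load whole
sample = io.StringIO("meter_id,kwh\n1,10.0\n2,14.0\n3,12.0\n")
mean_kwh = streaming_mean(csv.DictReader(sample), "kwh")
print(mean_kwh)
```

Because the function never holds more than one row, its memory use stays constant no matter how large the input grows. Frameworks such as Hadoop apply the same principle across many machines.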

RHadoop: rhbase, rhdfs, rmr

RHadoop is a collection of three open source R libraries. The rhdfs library allows R to read and write files in the Hadoop Distributed File System (HDFS). The rhbase library lets R work with tables in HBase. Finally, the rmr library allows R users to write MapReduce jobs using familiar R constructs and syntax: users specify the 'map' and 'reduce' portions of a job as functions. Altogether, RHadoop provides an interface to Hadoop that is familiar to R users and addresses the limitations of performing computations in memory on very large data sets.
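For readers new to MapReduce, the division of work between the 'map' and 'reduce' portions can be sketched as follows. This is a single-machine illustration in plain Python, not rmr code and not Hadoop itself; `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names chosen for the example.

```python
from collections import defaultdict

def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce step: combine all values emitted for the same key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group the pairs by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

lines = ["big data with R", "R and Hadoop", "big data"]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

In a real cluster the map calls run in parallel on different blocks of the input and the grouping happens across the network, but the two user-supplied functions keep the same shape.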

Additional Links

RHadoop Tutorial: MapReduce in R
