
Tuesday 21 October 2014

The Developer’s Guide to Data Science

When developers talk about using data, they are usually concerned with ACID, scalability, and other operational aspects of managing data. But data science is not just about making fancy business intelligence reports for management. Data drives the user experience directly, not after the fact.
Large scale analysis and adaptive features are being built into the fabric of many of today’s applications. The world is already full of applications that learn what we like. Gmail sorts our priority inbox for us. Facebook decides what's important in our newsfeed on our behalf. E-commerce sites are full of recommendations, sometimes eerily accurate. We see automatic tagging and classification of natural language resources. Ad-targeting systems predict how likely you are to click on a given ad. The list goes on and on.
Many of the applications discussed above emerged from web giants like Google, Yahoo, and Facebook, and from other successful startups. Yes, these places are filled to the brim with very smart people, working on the bleeding edge. But make no mistake, this trend will trickle down into “regular” application development too. In fact, it already has. When users interact with slick and intelligent apps every day, their expectations for business applications rise as well. For enterprise applications it's not a matter of if, but when.
This is why many enterprise developers will need to familiarize themselves with data science. Granted, the term is incredibly hyped, but there's a lot of substance behind the hype. So we might as well give it a name and try to figure out what it means for us as developers.

From developer to data scientist

How do we cope with these increased expectations? It's not just a software engineering problem. You can't just throw libraries at it and hope for the best. Yes, there are great machine learning libraries, like Apache Mahout (Java) and scikit-learn (Python). There are even programming languages squarely aimed at doing data science, such as the R language. But it's not just about that. There is a more fundamental level of understanding you need to attain before you can properly wield these tools.
This article will not be enough to gain the required level of understanding. It can, however, show you the landmarks along the road to data science. This diagram (adapted from Drew Conway's original) shows the lay of the land:
[Figure: the data science Venn diagram]
As software engineers, we can relate to hacking skills. It's our bread and butter. And that's good, because from that solid foundation you can branch out into the other fields and become more well-rounded.
Let's tackle domain expertise first. It may sound obvious, but if you want to create good models for your data, then you need to know what you're talking about. This is not strictly true for all approaches. For example, deep learning and other machine learning techniques might be viewed as an exception. In general though, having more domain-specific knowledge is better. So start looking beyond the user-stories in your backlog and talk to your domain experts about what really makes the clock tick. Beware though: if you only know your domain and can churn out decent code, you're in the danger zone. This means you're at risk of re-inventing the wheel, misapplying techniques, and shooting yourself in the foot in a myriad of other ways.
Of course, the elephant in the room here is “math & statistics.” The link between math and the implementation of features such as recommendation or classification is very strong. Even if you're not building a recommender algorithm from scratch (which hopefully you wouldn't have to), you need to know what goes on under the hood in order to select the right one and to tune it correctly. As the diagram points out, the combination of domain expertise and math and statistics knowledge is traditionally the expertise area of researchers and analysts within companies. But when you combine these skills with software engineering prowess, many new doors will open.
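To make that linear algebra concrete, here is a minimal sketch (mine, not from any particular library) of the idea behind item-based recommendation: items whose rating vectors point in roughly the same direction are treated as similar. The rating matrix is invented for illustration.

```python
import numpy as np

# Hypothetical user-by-item rating matrix (rows = users, columns = items, 0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between item columns: dot products normalized by vector lengths.
item_vectors = ratings.T                                    # one row per item
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
similarity = (item_vectors @ item_vectors.T) / (norms * norms.T)

print(np.round(similarity, 2))   # items 0/1 and 2/3 end up with high mutual similarity
```

Real recommenders add plenty on top of this (matrix factorization, implicit feedback, scaling concerns), but the core is exactly this kind of vector and matrix arithmetic.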
What can you do as developer if you don't want to miss the bus? Before diving head-first into libraries and tools, there are several areas where you can focus your energy:
  • Data management
  • Statistics
  • Math
We'll look at each of them in the remainder of this article. Think of these items as the major stops on the road to data science.

Data management

Recommendation, classification, and prediction engines cannot be coded in a vacuum. You need data to drive the process of creating/tuning a good recommender engine for your application, in your specific context. It all starts with gathering relevant data, which might already be in your databases. If you don’t already have the data, you might have to set up new ways of capturing relevant data. Then comes the act of combining and cleaning data. This is also known as data wrangling or munging. Different algorithms have different pre-conditions on input data. You'll have to develop a strong intuition for good data versus messy data.
Typically, this phase of a data science project is very experimental. You'll need tools that help you quickly process lots of heterogeneous data and iterate on different strategies. Real world data is ugly and lacks structure. Dynamic scripting languages are often used to filter and organize data because they fit this challenge perfectly. A popular choice is Python with Pandas or the R language.
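As a flavour of what such a script looks like, here is a minimal Pandas sketch; the file and column names (orders.csv, order_id, and so on) are invented for illustration.

```python
import pandas as pd

# Load a hypothetical raw export, parsing dates on the way in.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Typical wrangling steps: normalize text, drop obvious junk, fill gaps.
orders["country"] = orders["country"].str.strip().str.upper()
orders = orders.drop_duplicates(subset="order_id")
orders = orders.dropna(subset=["customer_id"])        # rows we cannot use at all
orders["discount"] = orders["discount"].fillna(0.0)

# Write the cleaned copy next to the raw one so every step stays traceable.
orders.to_csv("orders_clean.csv", index=False)
```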
It's important to keep a close eye on everything related to data munging. Just because it's not production code doesn't mean it's not important. There won't be any compiler errors or test failures when you silently omit or distort data, but it will influence the validity of all subsequent steps. Make sure you keep all your data management scripts, and keep both the raw and the munged data. That way you can always trace your steps. Garbage in, garbage out applies as always.

Statistics

Once you have data in the appropriate format, the time has come to do something useful with it. Much of the time you’ll be working with sample data to create models that handle yet unseen data. How can you infer valid information from this sample? How do you even know your data is representative? This is where we enter the domain of statistics, a vitally important part of data science. I've heard it said: “a Data Scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician.”
What should you know? Start by mastering the basics. Understand probabilities and probability distributions. When is a sample large enough to be representative? Know about common assumptions such as independence of probabilities, or that values are expected to follow a normal distribution. Many statistical procedures only make sense in the context of these assumptions. How do you test the significance of your findings? How do you select promising features from your data as input for algorithms? Any introductory material on statistics can teach you this. After that, move on to Bayesian statistics. It will pop up more and more in the context of machine learning.
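As a toy illustration of those basics, the sketch below uses SciPy on synthetic data to check a normality assumption and test whether an observed difference is significant; the numbers are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=15.0, size=200)      # e.g. a baseline group
treatment = rng.normal(loc=104.0, scale=15.0, size=200)    # e.g. after some change

# Is the normality assumption reasonable? (Shapiro-Wilk test)
w_stat, normality_p = stats.shapiro(control)
print("normality p-value:", round(normality_p, 3))

# Is the difference between the groups significant? (two-sample t-test)
t_stat, p_value = stats.ttest_ind(treatment, control)
print("t =", round(t_stat, 2), " p =", round(p_value, 4))
```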
It's not just theory. Did you notice how we conveniently glossed over the “science” part of data science up till now? Doing data science is essentially setting up experiments with data. Fortunately, the world of statistics knows a thing or two about experimental setup. You'll learn that you should always divide your data into a training set (to build your model) and a test set (to validate your model). Otherwise, your model won’t work for real-world data: you’ll end up with an overfitting model. Even then, you're still susceptible to pitfalls like multiple testing. There's a lot to take into account.
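A minimal sketch of that train/test discipline, using scikit-learn's bundled iris data rather than anything project-specific:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out part of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The held-out score is the honest estimate; training accuracy alone
# would overstate how well the model generalizes.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```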

Math

Statistics tells you about the when and why, but for the how, math is unavoidable. Many popular techniques, such as linear regression, neural networks, and various recommendation algorithms, boil down to math. Linear algebra, to be more precise. So brushing up on vector and matrix manipulations is a must. Again, many libraries abstract over the details for you, but it is essential to know what is going on behind the scenes in order to know which knobs to turn. When results differ from what you expected, you need to know how to debug the algorithm.
It's also very instructive to try and code at least one algorithm from scratch. Take linear regression for example, implemented with gradient descent. You will experience the intimate connection between optimization, derivatives, and linear algebra when researching and implementing it. Andrew Ng's Machine Learning class on Coursera takes you through this journey in a surprisingly accessible way.
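For reference, a from-scratch sketch of that exercise on synthetic data; the variable names and learning rate are my own choices, not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 4.0 + rng.normal(0, 1, size=100)     # true line: y = 3x + 4

# Add a bias column so the intercept is just another weight.
Xb = np.hstack([np.ones((100, 1)), X])
theta = np.zeros(2)
learning_rate = 0.01

for _ in range(5000):
    predictions = Xb @ theta
    gradient = (2 / len(y)) * Xb.T @ (predictions - y)   # derivative of the mean squared error
    theta -= learning_rate * gradient

print("intercept, slope:", np.round(theta, 2))           # should land close to (4, 3)
```

Writing even this little loop forces you to confront the cost function, its derivative, and the matrix form of the computation, which is exactly the point of the exercise.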

But wait, there's more...

Besides the fundamentals discussed so far, getting good at data science includes many other skills, such as clearly communicating the results of data-driven experiments, or scaling whatever algorithm or data munging method you selected across a cluster for large datasets. Also, many algorithms in data science are “batch-oriented,” requiring expensive recalculations whenever new data arrives; translating them into online versions that update incrementally is often necessary. Fortunately, many (open source) products and libraries can help with the last two challenges.
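As one illustration of the batch-versus-online point, scikit-learn's SGDClassifier can be updated incrementally with partial_fit instead of being retrained from scratch on every new batch; the data here is synthetic and the setup deliberately simplistic.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
model = SGDClassifier()
classes = np.array([0, 1])

for _ in range(10):                                   # ten mini-batches "arriving" over time
    X_batch = rng.normal(size=(50, 3))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

# The incrementally trained model can score new points immediately.
print(model.predict(np.array([[2.0, 2.0, 0.0], [-2.0, -2.0, 0.0]])))
```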
Data science is a fascinating combination of real-world software engineering, math, and statistics. This explains why the field is currently dominated by PhDs. On the flipside, we live in an age where education has never been more accessible, be it through MOOCs, websites, or books. If you want a hands-on book to get started, read Machine Learning for Hackers, then move on to a more rigorous book like The Elements of Statistical Learning. There are no shortcuts on the road to data science. Broadening your view from software engineering to data science will be hard, but certainly rewarding.

Thursday 31 July 2014

BigData Challenges


The Data Science Process

A data science project may begin with a very well-defined question -- Which of these 200 genetic markers are the best predictors of disease X? -- or an open-ended one -- How can we decrease emergency room wait time in a hospital? Either way, once the motivating question has been identified, a data science project progresses through five iterative stages:

  • Harvest Data: Find and choose data sources
  • Clean Data: Load data into pre-processing environment; prep data for analysis
  • Analysis: Develop and execute the actual analysis
  • Visualize: Display the results in ways that effectively communicate new insights, or point out where the analysis needs to be further developed
  • Publish: Deliver the results to their intended recipient, whether human or machine
Each of these stages is associated with its own challenges, and correspondingly, with a plethora of tools that have sprung up to address those particular challenges. Data science is an iterative process; at any stage, it may be necessary to circle back to earlier stages in order to incorporate new data or revise models.

Below is an outline of challenges that arise in each stage; it is meant to give the reader a sense of the scale and complexity of such challenges, not to enumerate them exhaustively:

Challenges of stage 1: Harvest Data



This is a classic needle-in-a-haystack problem: there exist millions of available data sets in the world, and of those only a handful are suitable, much less accessible, for a particular project. The exact criteria for what counts as "suitable data" will vary from project to project, but even when the criteria are fairly straightforward, finding data sets and proving that they meet those criteria can be a complex and time-consuming process.

When dealing with public data, the data sets are scattered and often poorly described. Organizations ranging from the federal government to universities to companies have begun to publish and/or curate large public data sets, but this is a fairly new practice, and there is much room for improvement. Dealing with internal data is not any easier: within an enterprise, there can be multiple data warehouses belonging to different departments, each contributed to by multiple users with little integration or uniformity across warehouses.

In addition to format and content, metadata, particularly provenance information, is crucial: the history of a data set (who produced it, how it was produced, when it was last updated, and so on) also determines how suitable it is for a given project. However, this information is often not tracked or stored with the data, and when it is, it may be incomplete or manually generated.

Challenges of stage 2: Cleanse/Prep Data



This stage can require operations as simple as visually inspecting samples of the data to ones as complex as transforming the entire data set. Format and content are two major areas of concern.

With respect to format, data comes in a variety of formats, from highly structured (relational) to unstructured (photos, text documents) to anything in between (XML, CSVs), and these formats may not play well together. The user may need to write custom code to convert data sets to compatible formats, rely on purpose-built software, or even manually manipulate the data in programs like Excel. This latter path becomes a non-option once the data set exceeds a certain size.
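As a small example of such a conversion, the sketch below flattens a hypothetical XML file into a CSV using only the Python standard library; the element and file names are invented.

```python
import csv
import xml.etree.ElementTree as ET

tree = ET.parse("customers.xml")                       # hypothetical input file
rows = []
for customer in tree.getroot().iter("customer"):
    rows.append({
        "id": customer.get("id"),
        "name": customer.findtext("name", default=""),
        "city": customer.findtext("city", default=""),
    })

with open("customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "city"])
    writer.writeheader()
    writer.writerows(rows)
```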

With respect to content and data quality, there are numerous criteria to consider, but some major ones are accuracy, internal consistency, and compliance with applicable regulations (e.g. privacy laws, internal policies). The same data may be stored in different ways across data sets (e.g. multiple possible formats for date/time information), or the data set may have multiple "parent" data sets whose content must meet the same criteria.
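The date/time example is worth making concrete: the same date can arrive in several spellings, and each needs an explicit parse before the sets can be merged. A tiny Pandas sketch, with invented values:

```python
import pandas as pd

# The same date as it might arrive from three different source systems.
us_style = pd.to_datetime("07/31/2014", format="%m/%d/%Y")
eu_style = pd.to_datetime("31-07-2014", format="%d-%m-%Y")
iso_style = pd.to_datetime("2014-07-31", format="%Y-%m-%d")

# After normalization the three values compare equal and can be joined on safely.
print(us_style == eu_style == iso_style)               # True
```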

In the Hadoop ecosystem, one common tool for initially inspecting and prepping data is Hive. Hive is commonly used for ad-hoc querying and data summarization, and in this context, Hive's strengths are its familiar SQL-like query language (HiveQL) and its ability to handle both structured and semi-structured data.

However, Hive lacks the functional flexibility needed for significantly transforming raw data into a form more fitting for the planned analysis, often a standard part of the "data munging" process. Outside of Hadoop, data scientists use languages such as R, Python or Perl to execute these transformations, but these tools are limited to the processing power of the machines - often the users' own laptops - on which they are installed.

Challenges of stage 3: Analyze



Once the data is prepared, there is often a "scene change"; that is, the analysis takes place in an environment different from the pre-processing environment. For instance, the pre-processing environment may be a data warehouse, while the analysis happens in a desktop application. This can prove to be another logistical challenge, particularly if the pre-processing environment has a greater capacity than the analytical one.

This stage is where data science most clearly borrows from, or is an extension of, statistics. It requires starting with data, forming a hypothesis about what that data says about a given slice of reality, formally modelling that hypothesis, running data through the model, observing the results, refining the hypothesis, refining the model and repeating. Having specialist-level knowledge of the relevant sector or field of study is also very helpful; on the other hand, there is also a risk of confirmation bias if one's background knowledge is given undue weight over what the numbers say.

Given that data sets can contain up to thousands of variables and millions or billions of records, performing data science on these very large data sets often calls for approaches such as machine learning and data mining. Both involve programs based on statistical principles that can complete tasks and answer questions without explicit human direction, usually by means of pattern recognition. Machine learning is defined by algorithms and performance metrics enabling programs to interpret new data in the context of historical data and continuously revise predictions.

A machine learning program is typically aimed at answering a specific question about a data set with generally known characteristics, all with minimal human interaction. In contrast, data mining is defined by the need to discover previously unknown features of a data set that may be especially large and unstructured. In this case, the task or question is less specific, and a program may require more explicit direction from the human data scientist to reveal useful features of the data.
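A rough sketch of that contrast, using scikit-learn's bundled iris data: a supervised learner answers a specific question with labels provided, while clustering looks for structure with no labels at all.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Machine-learning flavour: predict a known target from historical, labeled data.
classifier = RandomForestClassifier(random_state=0).fit(X, y)
print("predicted class for first sample:", classifier.predict(X[:1]))

# Data-mining flavour: look for previously unknown groupings, with no labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("discovered cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
```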

Challenges of stage 4: Visualize


Visualization is necessary for both the "Analyze" and "Publish" stages, though it plays slightly different roles in each:
In the former, the data scientist uses visualization to more easily see the results of each round of testing. Graphics at this stage are often bare-bones and simple (scatterplots, histograms, and the like), but they effectively capture feedback for the data scientist on what the latest round of modeling and testing indicates.

In the latter, the emphasis is often on interactivity and intuitive graphics, so that the data products can be used as effectively as possible by the end users. For example, if a data science project's goal is to mine hospital data for insights on how the hospital can ensure continuity in the medical team assigned to a patient throughout their stay, it is not the data scientist's job to predict all the exact situations in which the doctors will use the results of the analysis. Rather, the goal of visualization in this case is to expose facets of the data in a way that is intuitive to the user and still gives them flexibility, i.e. one that does not lock them into a single view or use of the data.
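For the first of those two roles, the graphics really can be as plain as a scatterplot and a histogram; a matplotlib sketch on synthetic numbers:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
predicted = rng.normal(50, 10, size=300)
actual = predicted + rng.normal(0, 5, size=300)        # noisy "ground truth"

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(predicted, actual, s=10)                   # model fit at a glance
ax1.set_xlabel("predicted")
ax1.set_ylabel("actual")
ax2.hist(actual - predicted, bins=30)                  # distribution of residuals
ax2.set_xlabel("residual")
plt.tight_layout()
plt.show()
```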

The emphasis on visualization in data science is a result of the same factors that have produced the increased demand for data science itself: the scale and complexity of available data has grown to a point where useful insights do not lie in plain view but must be unearthed, polished, and displayed to their best advantage. Visualization is a key part of accomplishing those goals.

Challenges of stage 5: Publish



This stage could also be called "Implementation." The potential challenges here are as varied as the potential goals of the use case. A data science project could be part of product development for a smartphone app, in which case the intended recipient of the output could be the app designers, or the output could feed an already-deployed app.

Similarly, a data science project could be used by financial firms to inform investment decision-making, in which case the recipient could be either a piece of automated trading software or a team of brokers. Suffice it to say that data scientists will need to be concerned with many of the same features of the output data sets as they were with the input data sets - format, content, provenance - and that a data scientist's mastery of the data also involves knowing how to best present it, whether to machines or to humans.