Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Tuesday, 21 October 2014

Big Data Architecture Best Practices

The marketing department of software vendors have done a good job making Big Data go mainstream, whatever that means. The promise of we can achieve anything if we make use of Big Data; business insight and beating our competitions to submission. Yet, there is no well-publicised Big Data successful implementation. The question is: why not? Clearly this silver bullet where businesses have seen billions of dollars invested in but no return on investment! Who is to blame? After all, businesses do not have to publicise their internal processes or projects. I have a different view to that and the cause is on the IT department. Most Big Data projects are driven by the technologist not the business there is create lack of understanding in aligning the architecture with the business vision for the future.
The Preliminary Phase Big Data projects are not different to any other IT projects. All projects spur out of business needs / requirements. This is not The Matrix; we cannot answer questions which have not been asked yet. Before any work begin or discussion around which technology to use, all stakeholders need to have an understanding of:

The organisational context
The key drivers and elements of the organisation
The requirements for architecture work
The architecture principles
The framework to be used
The relationships between management frameworks
The enterprise architecture maturity

In the majority of cases, Big Data projects involves knowing the current business technology landscape; in terms of current and future applications and services:

Strategies and business plans
Business principles, goals, and drivers
Major framework currently implemented in the business
Governance and legal frameworks
IT strategy
Pre-existing Architecture Framework, Organisational Model, and Architecture repository

The Big Data Continuum Big Data projects are not and should never been executed in isolation. The simple fact that Big Data need to feed from other system means there should a channel of communication open across teams. In order to have a successful architecture, I came up with five simple layers/ stacks to Big Data implementation. To the more technically inclined architect, this would seem obvious:

Data sources
Big Data ETL
Data Services API
Application
User Interface Services

Big Data Protocol Stack

Data Sources

Current and future applications will produce more and more data which will need to be process in order to gain any competitive advantages from them. Data comes in all sorts but we can categorise them into two:

Structured data – usually stored following a predefined formats such as using known and proven database techniques. Not all structured data are stored in database as there are many businesses using flat files such as Microsoft Excel or Tab Delimited files for storing data
Unstructured data – businesses generates great amount of unstructured data such emails, instant messaging, video conferencing, internet, flat files such documents and images, and the list is endless. We call the data "unstructured" as they do not follow a format which will make facilitate a user to query its content.

I have spent a large part of my career working on Enterprise Search technology before even "Big Data" was coined. Understanding where the data is coming from and in what shape is valuable to a successful implementation of a Big Data ETL project. Before a single a line of programming code is written, architects will have to try and normalise the data to common format.

Big Data ETL

This is the part that excites technologists and especially the development teams. There are so many blogs and articles published every day about Big Data tools that this creates confusions among non-tech people. Everybody is excited about processing petabytes of data using the coolest kid on the block: Hadoop and its ecosystem. Before we get carried away, we first need to put some baseline in place:

Real-time processing
Batch processing

Big Data - Data Consolidation

The purpose of Extract Transform Load projects, regardless of using Hadoop or not, is to consolidate the data into a single viewMaster Data Management for querying on demand. Hadoop and its ecosystem deals with the ETL aspect of Big Data not the querying part. The tools used will heavily depends of processing need of the project: either Real-time or batch; i.e. Hadoop is a batch processing framework for large volume of data. Once the data has been processed, the Master Data Management system (MDM) can be stored in a data repository such as NoSQL based or RDBMS – this will only depends on the querying requirements.

Data Services API

As most of the limelight goes to the tools for ETL, a very important area is usually overlooked until later almost as a secondary thought. MDM will need to be stored in a repository in order for the information to be retrieve when needed. In a true Service Oriented Architecture spirit, the data repository should be able to expose some interfaces to external third party applications for data retrieval and manipulation. In the past, MDM were mostly created in RDBMS and retrieval and manipulation were carried out through the use of the Structured Query Language. Well this does not have to change but architects should be aware of other forms of database such NoSQL types. The following questions should be asked when choosing a database solution:

Is there are standard query language
How do we connect to the database; DB drivers or available web services
Will the database scale when the data grows
What security mechanism are in place for protecting some or whole data

Other questions specific to the project should also be included in the checklist.

Business Applications

So far, we have extracted the data, transformed and loaded it into a Master Data Management system. The normalised data is now exposed through web services (or DB drivers) to be used by third party applications. Business applications are the reason why to undertake Big Data projects in the first place. Some will argue that we should hire Data Scientists (?). According many blogs, Data Scientist roles is to understand the data, explore the data, prototype (new answers to unknown questions) and evaluate their findings. This is interesting as it reminds me the motion picture The Matrix, where the Architect knew the answers to the questions before Neo has even asked them yet and decides which one are relevant or not. Now this is not how businesses are run. It will be extremely valuable if the data scientist may suggest subconsciously (Inception) a new way to do something but most of the time the questions will come from business to be answered by the Data Scientist or whoever knows the data. The business applications will be the answer to those questions.

User Interfaces Services

User interfaces are the make or break of the project; a badly designed UI will affect adoption regardless of the data behind it, an intuitive design will increase adoption and maybe user will start questioning the quality of the data. Users will access the data differently; mobile, TV and web as an example. Users will usually focus on a certain aspect of the data and therefore they will require the data to be presented in a customised way. Some other users will want the data to be available through their current dashboard and match their current look and feel. As always, security will also be a concern. Enterprise portal have been around for a long time and they are usually used for data integration projects. Nevertheless, standards such as Web Services for Remote Portlets (WSRP) make it possible for User Interfaces to be served through Web Service calls.
Conclusion This article show the importance of architecting a Big Data project before embarking on the project. The project needs to be in line with the business vision and have a good understanding of the current and future technology landscape. The data needs to bring value to the business and therefore business needs to be involved from the outset. Understanding how the data will be used is key to its success and taking a service oriented architecture approach will ensure that the data can serve many business needs.

The Developer’s Guide to Data Science

When developers talk about using data, they are usually concerned with ACID, scalability, and other operational aspects of managing data. But data science is not just about making fancy business intelligence reports for management. Data drives the user experience directly, not after the fact.

Large scale analysis and adaptive features are being built into the fabric of many of today’s applications. The world is already full of applications that learn what we like. Gmail sorts our priority inbox for us. Facebook decides what's important in our newsfeed on our behalf. E-commerce sites are full of recommendations, sometimes eerily accurate. We see automatic tagging and classification of natural language resources. Ad-targeting systems predict how likely you are to click on a given ad. The list goes on and on.

Many of the applications discussed above emerged from web giants like Google, Yahoo, and Facebook and other successful startups. Yes, these places are filled to the brim with very smart people, working on the bleeding edge. But make no mistake, this trend will trickle down into “regular” application development too. In fact, it already has. When users interact with slick and intelligent apps every day, their expectations for business applications rise as well. For enterprise applications it's not a matter of if, but when.

This is why many enterprise developers will need to familiarize themselves with data science. Granted, the term is incredibly hyped, but there's a lot of substance behind the hype. So we might as well give it a name and try to figure out what it means for us as developers.

From developer to data scientist

How do we cope with these increased expectations? It's not just a software engineering problem. You can't just throw libraries at it and hope for the best. Yes, there are great machine learning libraries, like Apache Mahout (Java) and scikit-learn (Python). There are even programming languages squarely aimed at doing data science, such as the R language. But it's not just about that. There is a more fundamental level of understanding you need to attain before you can properly wield these tools.

This article will not be enough to gain the required level of understanding. It can, however, show you the landmarks along the road to data science. This diagram (adapted from Drew Conway's original) shows the lay of the land:

As software engineers, we can relate to hacking skills. It's our bread and butter. And that's good, because from that solid foundation you can branch out into the other fields and become more well-rounded.

Let's tackle domain expertise first. It may sound obvious, but if you want to create good models for your data, then you need to know what you're talking about. This is not strictly true for all approaches. For example, deep learning and other machine learning techniques might be viewed as an exception. In general though, having more domain-specific knowledge is better. So start looking beyond the user-stories in your backlog and talk to your domain experts about what really makes the clock tick. Beware though: if you only know your domain and can churn out decent code, you're in the danger zone. This means you're at risk of re-inventing the wheel, misapplying techniques, and shooting yourself in the foot in a myriad of other ways.

Of course, the elephant in the room here is “math & statistics.” The link between math and the implementation of features such as recommendation or classification is very strong. Even if you're not building a recommender algorithm from scratch (which hopefully you wouldn't have to), you need to know what goes on under the hood in order to select the right one and to tune it correctly. As the diagram points out, the combination of domain expertise and math and statistics knowledge is traditionally the expertise area of researchers and analysts within companies. But when you combine these skills with software engineering prowess, many new doors will open.

What can you do as developer if you don't want to miss the bus? Before diving head-first into libraries and tools, there are several areas where you can focus your energy:

Data management
Statistics
Math

We'll look at each of them in the remainder of this article. Think of these items as the major stops on the road to data science.

Data management

Recommendation, classification, and prediction engines cannot be coded in a vacuum. You need data to drive the process of creating/tuning a good recommender engine for your application, in your specific context. It all starts with gathering relevant data, which might already be in your databases. If you don’t already have the data, you might have to set up new ways of capturing relevant data. Then comes the act of combining and cleaning data. This is also known as data wrangling or munging. Different algorithms have different pre-conditions on input data. You'll have to develop a strong intuition for good data versus messy data.

Typically, this phase of a data science project is very experimental. You'll need tools that help you quickly process lots of heterogeneous data and iterate on different strategies. Real world data is ugly and lacks structure. Dynamic scripting languages are often used to filter and organize data because they fit this challenge perfectly. A popular choice is Python with Pandas or the R language.

It's important to keep a close eye on everything related to data munging. Just because it's not production code, doesn't mean it's not important. There won't be any compiler errors or test failures when you silently omit or distort data, but it will influence the validity of all subsequent steps. Make sure you keep all your data management scripts, and keep both mangled and unmangled data. That way you can always trace your steps. Garbage in, garbage out applies as always.

Statistics

Once you have data in the appropriate format, the time has come to do something useful with it. Much of the time you’ll be working with sample data to create models that handle yet unseen data. How can you infer valid information from this sample? How do you even know your data is representative? This is where we enter the domain of statistics, a vitally important part of data science. I've heard it said: “a Data Scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician.”

What should you know? Start by mastering the basics. Understand probabilities and probability distributions. When is a sample large enough to be representative? Know about common assumptions such as independence of probabilities, or that values are expected to follow a normal distribution. Many statistical procedures only make sense in the context of these assumptions. How do you test the significance of your findings? How do you select promising features from your data as input for algorithms? Any introductory material on statistics can teach you this. After that, move on the Bayesian statistics. It will pop up more and more in the context of machine learning.

It's not just theory. Did you notice how we conveniently glossed over the “science” part of data science up till now? Doing data science is essentially setting up experiments with data. Fortunately, the world of statistics knows a thing or two about experimental setup. You'll learn that you should always divide your data into a training set (to build your model) and a test set (to validate your model). Otherwise, your model won’t work for real-world data: you’ll end up with an overfitting model. Even then, you're still susceptible to pitfalls like multiple testing. There's a lot to take into account.

Math

Statistics tells you about the when and why, but for the how, math is unavoidable. Many popular algorithms such as linear regression, neural networks, and various recommendation algorithms all boil down to math. Linear algebra, to be more precise. So brushing up on vector and matrix manipulations is a must. Again, many libraries abstract over the details for you, but it is essential to know what is going on behind the scenes in order to know which knobs to turn. When results are different than you expected, you need to know how to debug the algorithm.

It's also very instructive to try and code at least one algorithm from scratch. Take linear regression for example, implemented with gradient descent. You will experience the intimate connection between optimization, derivatives, and linear algebra when researching and implementing it. Andrew Ng's Machine Learning class on Coursera takes you through this journey in a surprisingly accessible way.

But wait, there's more...

Besides the fundamentals discussed so far, getting good at data science includes many other skills, such as clearly communicating the results of data-driven experiments, or scaling whatever algorithm or data munging method you selected across a cluster for large datasets. Also, many algorithms in data science are “batch-oriented,” requiring expensive recalculations. Translation into online versions of these algorithms is often necessary. Fortunately, many (open source) products and libraries can help with the last two challenges.

Data science is a fascinating combination between real-world software engineering, math, and statistics. This explains why the field is currently dominated by PhDs. On the flipside, we live in an age where education has never been more accessible. Be it through MOOCs, websites, or books. If you want read a hands-on book to get started, read Machine Learning for Hackers, then move on to a more rigorous book like Elements of Statistical Learning. There are no shortcuts on the road to data science. Broadening your view from software engineering to data science will be hard, but certainly rewarding.

Pages