Showing posts with label BigData. Show all posts

Tuesday 12 July 2016

Free Workshop on Big Data E-commerce Project End-to-End Explanation

------------------------------------------------------------------------------------------------------------
  Free Workshop on Big Data E-commerce Project End-to-End Explanation
------------------------------------------------------------------------------------------------------------

Details about the Workshop:

1. Understanding the software requirements specification (SRS).

2. Understanding the design of the product

3. Migrating an existing project from RDBMS => HBASE

4. Migrating an existing project from RDBMS => PHOENIX

5. Migrating an existing project from RDBMS => CASSANDRA

6. Migrating an existing project from RDBMS => MONGODB (a minimal Spark-based sketch follows this list)

7. Understanding the challenges with RDBMS, and why to move to HADOOP

8. Understanding the challenges with HADOOP, and why to move to SPARK

9. Visualize the REPORTS generated using RDBMS 

10. Visualize the REPORTS generated using HADOOP COMPONENTS

11. Visualize the REPORTS generated using SPARK

12. All of the above functionality is verified through a LIVE PROJECT
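
As a taster for items 3 to 6, here is a minimal, hedged sketch of how such a migration step can be wired up with Spark's JDBC reader and the Spark Cassandra connector; the host names, credentials, keyspace and table names are placeholders rather than details of the actual workshop project.

# Hypothetical sketch: copy one RDBMS table into Cassandra with Spark.
# Requires the spark-cassandra-connector package on the Spark classpath;
# all URLs, credentials and table names below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdbms-to-cassandra-migration")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder
         .getOrCreate())

# Extract: read the source table over JDBC (a MySQL driver is assumed).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://rdbms-host:3306/shop")  # placeholder URL
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          .load())

# Load: append the same rows into an existing Cassandra keyspace/table.
(orders.write.format("org.apache.spark.sql.cassandra")
 .options(keyspace="shop", table="orders")
 .mode("append")
 .save())

The same read side applies to the HBase, Phoenix and MongoDB targets in items 3, 4 and 6; only the write-side connector format and its options change.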




Friday 14 August 2015

How Big Data Analytics & Hadoop Complement Your Existing Data Warehouse

Syncsort and Cloudera provide a seamless approach to unlocking the value of all of your data – including mainframe data. This webinar will help you learn more about the cost savings and other benefits of ingesting and processing your mainframe data in Hadoop.






Date: Monday, Sep 23 2013

refer: http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/how-big-data-analytics---hadoop-complement-your-existing-data-wa.html

IBM Mainframe Makeover: Mobile, Big Data Reality Check

IBM z13 refresh packs performance gains, but can it really join the mobile, big data, and cloud age? Here's a look at aspirations versus realities.
Yes, many of the largest organizations -- big banks, big insurance companies, big retailers, and big government agencies -- still run IBM mainframes. So when IBM announces a new generation of the platform, as it did on Tuesday (on its usual two-year cycle) with the release of the z13, hardware unit financial results always perk up -- which is news IBM could use these days.
Advances in chips and memory always bring new heights of performance, and it's no different with the z13. IBM says the new machine can process 2.5 billion transactions per day -- "the equivalent of 100 Cyber Mondays every day of the year." But IBM is also doing its best to give the z13 a very modern spin, touting mobile, cloud, and big data analytics performance.

How real are these claims? Here's a closer look at realities versus aspirations.
Hardware constantly evolves, but 50-year-old mainframes have required periodic makeovers. Fifteen years ago, for example, IBM took advantage of virtualization to enable the z series to run Linux- and Java-based applications. That has opened the door to massive workload consolidation, and mainframes now run all types of Linux-based apps, such as mission-critical SAP apps and Oracle databases, while still running Unix apps; iSeries (formerly AS/400) apps; and classic CICS (Customer Information Control System), IMS, and DB2 workloads on the z/OS mainframe operating system.
That consolidation story still has strong appeal, and it helps the mainframe to win 80 to 100 new customers per year, according to IBM. (IBM could offer no details on how many customers get off mainframes each year, but a spokesperson says it's a short list of smaller companies.) Some new customers run Linux exclusively, as is the case with Brazilian credit union Sicoob, which first started using a z-series machine four years ago and is now among IBM's top-ten mainframe customers.
Mainframe as mobile platform?
IBM's z13 announcement put a big emphasis on meeting modern mobile demands, but surely it's not suggesting that companies should build mobile apps to run on mainframe, is it? Well, yes and no. Smartphones and tablets are in no way different from PCs, in that they're access devices calling on backend systems that run on mainframes. But because they're so handy and accessible, these new devices are generating a tsunami of requests for information. People now check account balances 12 times a month where they might have done so twice a month on a PC.
Mobile traffic is maxing out backend system calls, so z13 performance gains are needed. As for mobile apps running on mainframe? It's early days.
The z13 better addresses mobile workloads simply by offering more processing power. Another answer here is z/OS Connect, a connector technology that supports the delivery of information from CICS, IMS, and DB2 in a RESTful manner with simple API calls. Billed as a linchpin of Web, cloud, and mobile enablement for mainframe, z/OS Connect was introduced last year. It is compatible with z12 servers.
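To make the "RESTful manner with simple API calls" point concrete, here is a hedged sketch of what a client-side call against such a service might look like; the host, path, credentials and response fields are illustrative placeholders, not a documented z/OS Connect API.

# Hypothetical client call to a REST endpoint of the kind z/OS Connect can
# expose in front of CICS/IMS/DB2 assets; the URL, credentials and JSON
# fields are made up for illustration.
import requests

resp = requests.get(
    "https://zosconnect.example.com/banking/accounts/12345/balance",  # placeholder
    headers={"Accept": "application/json"},
    auth=("mobile_gateway", "secret"),  # placeholder credentials
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"account": "12345", "balance": 1042.17}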
IBM also hopes that mobile apps might be developed to run on the mainframe. Here, IBM has ported its Mobile First (formerly Worklight) mobile app development portfolio to run on zSeries, and IBM says there are also plans to support IBM's Cloudant NoSQL database service and third-party products like MongoDB on mainframe. NoSQL platforms, often running in the cloud, are where the lion's share of mobile development and application delivery is happening these days. It remains to be seen whether third-party vendors and customers will bring this work onto the IBM mainframe in IBM's cloud.
Analytics and big data on mainframe?
In 2011 IBM introduced the IBM DB2 Analytics Accelerator (IDAA), a coprocessor that essentially puts a Netezza-based data warehouse alongside mainframes handling transactional workloads. The benefit is that you can do analysis of the transactional data on the mainframe without moving that information or degrading the performance of the transactional system.
With z13, IBM is putting more of its software onto the IDAA, including SPSS for analytics and Cognos BI for reporting. This opens up support for real-time fraud detection, real-time upselling and cross-selling, and other opportunities driven by predictive analytics. One major insurance company using this capability is shifting to doing fraud-detection analysis against 100% of the claims it processes on the mainframe. Previously the company used red-flag rules and transaction thresholds to kick out roughly 10% of its transactions into a separate analytical fraud-detection app, but now it can analyze every claim in real time.
The benefits here are twofold: Claims that turn out to be legitimate won't be delayed, risking penalties. And fraudulent claims that didn't set rules or value thresholds won't fall through the cracks or require follow-up investigations. In another real-time use case, IBM says a chain of Eastern European gas stations and convenience stores is using real-time analysis for upselling at checkout time. When customers fill up their gas tanks, real-time checks of the loyalty program trigger discount offers that have increased total sales.
IBM says z13 can easily support distributed systems that have become synonymous with big data analytics, but here, too, IBM's ambition will confront market realities. IBM has ported its own BigInsights Hadoop distribution to run on the mainframe, for example, but there hasn't been a stampede of third-party players in the big data community following suit. Veristorm, for one, has a Hadoop distribution ported to run on System z, but most distributed big data platforms -- Hadoop and NoSQL databases -- were designed specifically to run on commodity x86 servers. The whole idea is cheap compute. IBM says that with virtualization, one System z chip can handle the equivalent of 15, 20, or even 25 x86 chips. But if the software was designed to run on x86 and nobody has bothered to port it, it's a theoretical argument.
Cloud on mainframe?
The list of software designed to run on System z and IBM Power servers is long, but what's compatible with x86 is much longer. This is a big reason IBM bought (x86-centric) Softlayer -- to be at the center of its cloud story. IBM's cloud didn't catch fire until Softlayer entered the picture. IBM can talk about support for Linux, KVM virtualization, and OpenStack, and ambitions to see all sorts of third-party software running on mainframe. But the software that can run on System z today and what IBM hopes for are two different things.
Where big companies already have big-iron workloads, the System z has already proven its appeal in workload consolidation, private cloud, and hybrid cloud scenarios. In fact, I suspect the success of System z has had a lot to do with the hardships of the IBM Power line (caught in a closing vise between System z and x86). But where the core of cloud deployment and big data activity is concerned, it will be an uphill climb for IBM to persuade those communities to run on System z, no matter how trivial porting software might be.
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise.


refer: http://www.informationweek.com/big-data/big-data-analytics/ibm-mainframe-makeover-mobile-big-data-reality-check/a/d-id/1318620

IBM Named Hadoop Leader, Launches Big Data Mainframe

IBM's huge investments in Big Data initiatives seem to be paying off, as it was recently named a leader in Hadoop accessibility by a research firm and this week launched a new Hadoop-capable mainframe computer it describes as "one of the most sophisticated computer systems ever built."
"In a survey of more than 1,000 Big Data developers, analyst firm Evans Data Corp. found that IBM is the leading provider of Hadoop among developers, with more than 25 percent of respondents identifying IBM's Hadoop as their principle distribution," the company announced Tuesday. "The survey also focused on key growth areas such as machine learning and streaming analytics, where 18 percent of developers cited IBM InfoSphere Streams as their preferred application for machine learning, making it the second most popular choice in the category."

IBM told this site it couldn't release more survey data, contained in a for-sale report by the software development research firm.
On the same day, the z13 mainframe supercomputer was unveiled, giving new meaning to the concept of Big Data transaction processing. "If you plan to be a computer engineer or software programmer, consider this amazing system as the platform for your career," said exec Ross Mauri in a blog post.
Specializing in mobile transactions, the z13 is reportedly the first system capable of processing up to 2.5 billion transactions per day, which IBM said is the equivalent of handling the traffic generated by 100 "Cyber Mondays" (the busiest online shopping day of the year) every day for a full year.
"The z13 includes new support for Hadoop, enabling unstructured data to be analyzed in the system. Other analytics advances include faster acceleration of queries by adding DB2 BLU for Linux providing an in-memory database, enhancements to the IBM DB2 analytics accelerator, and vastly improved performance for mathematically intense analytics workloads," the company said.
IBM said the z13 was five years in the making at a cost of $1 billion, leveraging more than 500 new patents and incorporating the work of some 60 client partners. The scope of this project exemplifies the vast resources IBM has ploughed into the mobile/cloud/Big Data space.
The mainframe is reportedly the first to feature practical real-time encryption of any amount of mobile transactions and first to come with embedded analytics to glean insights from ongoing transactions. That's all made possible by the world's fastest microprocessor and huge amounts of memory and bandwidth.
"We designed the z13 machine from the ground up with mobility in mind," Mauri said. "Over the past decade, the world has witnessed an explosion of data -- from electronic commerce, social media, business systems, Web sites and the Internet of Things. Today, our interactions with data and with each other are increasingly going mobile. As a result, we're consuming and creating data all the time, every day. And you can expect 100 times more data to flow 2-3 years from now. The z13, with 141 of the industry's fastest microprocessors (with 5 GHz performance) on board, is the only computer that's purpose-built to handle this mobile data tsunami."
The company also had some software news, unveiling a preview of new z/OS software for advanced analytic and data-serving capabilities. IBM said when it's available it will expand the ability of the new mainframe to process in-memory analytics and analyze mobile transactions.
IBM also announced favorable benchmark testing results for its SQL-on-Hadoop offering, Big SQL; noted that its online Big Data University is used regularly by more than 200,000 developers; and said almost 1,000 coders have signed on for its Big Data for Social Good Hadoop Challenge to solve civil and social problems.

About the Author
David Ramel is an editor and writer for 1105 Media. 

refer: https://adtmag.com/articles/2015/01/15/ibm-big-data.aspx

Access your data in near-real time for seamless integration

vStorm Enterprise


Veristorm’s vStorm Enterprise software brings near-real-time access to z/OS data so you can break mainframe quarantine and integrate your data across your IT infrastructure. With fast, self-service access to data, data analysts can bypass ETL in favor of Extract-Hadoop-Transform. You can copy z/OS data from: 
  • DB2
  • VSAM
  • IMS
  • Sequential files
  • System log files
  • Operator log files
It alleviates the need for SQL extracts and ETL consulting engagements, offering simple "point and click" data movement from z/OS to HDFS. EBCDIC code page and BCD issues are handled in-flight without MIPS costs, and users avoid the complexity of COBOL copybooks as well as the security risks, compliance challenges and delays of offloading data.
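As an aside on the "EBCDIC code page ... handled in-flight" point: on the open-systems side the same conversion is just a code-page decode. A minimal Python illustration, using the standard cp037 (EBCDIC, US/Canada) codec; the byte string is a made-up sample rather than vStorm output:

# Decode a small EBCDIC (code page 037) byte string to Unicode text.
# The bytes below are a made-up sample record, not actual z/OS data.
ebcdic_bytes = b"\xc8\x85\x93\x93\x96"   # "Hello" encoded in cp037
text = ebcdic_bytes.decode("cp037")
print(text)  # -> Hello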
The vStorm Enterprise software can output to the destination of your choice:
  • Linux on System z
  • Distributed Hadoop, including Cloudera, Hortonworks and BigInsights
  • Linux servers
  • High Performance Computing appliances, like Scalable Informatics
  • zDoop, our validated Hadoop on System z distribution, which also enables Hive
   

refer: http://www.veristorm.com/vstorm-enterprise

SUSE and Veristorm bring Hadoop to IBM Power Systems

on Tue, 03/17/2015 - 21:10
Today we announced that we’re extending our partnership with SUSE to cover IBM Power Systems and there are good reasons to be excited about it. Before this, our partnership was primarily about bringing Hadoop to IBM’s z Systems mainframe and about moving data from z/OS into (almost) any flavor of Hadoop. But Power Systems was missing, and now it’s not.
We’ve certified Veristorm Data Hub (VDH) with SUSE Linux on IBM Power Systems built on the POWER8 architecture. This is open source Apache Hadoop, free to install and use. You can still use vStorm Enterprise (now running on Power) to move z/OS data into Hadoop, and now your choices of Hadoop include Power. You can even move new data sources into VDH, including Oracle and other Hadoop distributions.
The result is that you can keep your data within the IBM ecosystem and take advantage of some of the most exciting hardware around. Power Systems on POWER8 promises 4x to 8x performance gains over the previous generation in many areas and we’re working on benchmarks now to see how that plays out for Hadoop.
For Power Systems customers, we think this is a big move toward more open systems. With SUSE Linux and support for little-endian mode, the same mode used by x86_64 platforms, it should be easy to run existing Hadoop applications on Power.
Of course, the other part of this big move is that this is open source, Apache Hadoop. You can install it and use it for free, as a sandbox or in production, and without fear of lock-in. The Hadoop community is vibrant, and now it has a home on Power Systems.
“The partnership… complements SUSE's existing partnerships with leading Hadoop vendors, providing customers with the ability to move valuable mainframe data to Hadoop to reap the benefits of big data-related programs.” — Naji Almahmoud, head of global business development for SUSE.
You can read the full Press Release here.

refer: http://www.veristorm.com/content/suse-and-veristorm-bring-hadoop-ibm-power-systems

From Mainframe to Big Data: What are the Challenges?

If your company is still managing its data using mainframe, then there may be a little yellow elephant in the room. As organizations look more and more to Big Data to help them garner the necessary business intelligence required to make informed decisions, the growing need for Big Data processing is becoming increasingly unavoidable. 
True, it’s very easy to forget in this smartphone-, Facebook- and Twitter-dominated world that the data stored on your mainframe – transactional data such as bookings, feedback, etc. – is as important now, and to Big Data, as it has ever been. Mainframes simply don’t enter the technology conversations very much anymore – but they are, of course, critical.

Mainframes Generate Data

Many of the world’s largest and most critical industries – healthcare, finance, insurance, retail, etc. – still generate the vast majority of their data on mainframes. Put simply, mainframe data cannot be ignored, but the time has come to start offloading that data onto a Big Data analytics platform. Indeed, if you haven’t done so already, then it’s seemingly inevitable that you will one day find yourself – just like numerous other mainframe professionals – beginning the process of offloading your mainframe batch to a framework like Hadoop.
You will, no doubt, already be aware of this pressing, modern day desideratum – but are you ready to make such a seismic shift? Do you have a big data strategy in place, and what skills will you need? But perhaps the first question you should be asking yourself is – is it the right decision?
Without meaning to seem curt – yes, it is, quite frankly, and here’s why:

The Advantages of Moving From Mainframe to Hadoop:

  • Mainframe is helplessly quarantined in the world of structured data (sales data, purchase orders, customer location etc.), from which analysts can only glean so much. Hadoop, on the other hand, handles and processes a much higher volume and variety of unstructured Big Data (documents, texts, images, social media, mobile data etc.) that’s generated by business, providing analysts with a much more detailed information cache to work with.
  • Licensing fees for mainframe software – as well as mainframe maintenance and support costs – are minimized when migrating to a Hadoop solution.
  • Eradicates the need to purchase more mainframes as processing demands increase in line with business growth.
  • Mainframe coding is outdated and inefficient.
  • Hadoop is open-source, rendering it cost effective in the first instance, but the time that it saves in batch processing data is an economic no-brainer in itself.
  • The technology already underpins gigantic technology firms such as Google, Yahoo, Facebook, Twitter, eBay and many more, proving its worth.
  • We live in the age of Big Data, and it’s growing exponentially with each second that ticks by. Utilizing this data is the only way that you can help your company stay competitive in the global business world – after all, everyone else is using it.
  • If you’ve mastered mainframe, you’ll be more than capable of learning Hadoop, which is much simpler in spite of its superior power as a processor. 

So, what are the Challenges of Moving from Mainframe to Big Data and Hadoop?

Integration

There’s a common misconception that moving mainframe data to Hadoop is simple. It’s not. For the most part this is because you’ll require two teams, or at least two people – one who understands mainframe, the other who understands Hadoop. Finding someone who’s skilled in both is very difficult indeed (see ‘Professional Skills and Skilled Professionals’ below).
In the first instance, your mainframe professional will have to identify and select the data required for transfer – something that Hadoop developers will find difficult, if not impossible, to do themselves. The data then needs to be prepared – customer data may need to be filtered or aggregated, for example – before it can be translated into a language that is understood by Hadoop.
But even after all this, the mainframe data still has to be rationalized with the COBOL copybook – something that requires a very particular skillset, and one that the Hadoop professional will almost certainly not have. Once this stage is complete, you can finally make the FTP transfer and load the files into Hadoop. Obviously the transition is achievable, but you have to be prepared for this very technical adventure along the way.
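
A rough sketch of that last mile (the FTP transfer plus the load into Hadoop), assuming the copybook rationalization and data preparation have already produced a plain-text extract; the host names, dataset name, credentials and the presence of a configured hdfs CLI are all assumptions for illustration.

# Hypothetical last mile: pull a prepared extract off the mainframe via FTP,
# then land it in HDFS with the hdfs CLI. Hosts, dataset names and credentials
# are placeholders; the copybook/translation work is assumed already done.
import ftplib
import subprocess

ftp = ftplib.FTP("mainframe.example.com")          # placeholder host
ftp.login(user="etluser", passwd="secret")         # placeholder credentials
with open("claims_extract.txt", "wb") as local_file:
    ftp.retrbinary("RETR 'PROD.CLAIMS.EXTRACT'", local_file.write)
ftp.quit()

# Assumes a Hadoop client is installed and configured on this machine.
subprocess.run(
    ["hdfs", "dfs", "-put", "-f", "claims_extract.txt", "/data/raw/claims/"],
    check=True,
)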

Security

The highly sensitive data contained in your mainframe means that all transfers onto Hadoop must be made with utmost care. Not a single mistake can be made when transferring data, and security must be guaranteed across the whole process. 
The huge security challenge that this presents can often mean that extraction software has to be installed in the first instance onto the mainframe. If you find this to be the case, then it’s absolutely imperative that you ensure that anything you install onto the mainframe to bring data into Hadoop is proven enterprise-grade software that can boast a watertight security track record and reputation. Using an authorization protocol for all users is an absolute must.

Cost

One of the main attractions of a Hadoop migration from mainframe is that organizations are looking to reduce IT expenses by curbing the amount of data that is being processed on their mainframe. The cost of storage can be hugely reduced on Hadoop clusters as compared to their mainframe counterparts. The cost of managing a single terabyte of data in a mainframe environment can be anything from $20,000 to $100,000, whereas on Hadoop this is reduced to about $1,000 (SearchDataManagement). 
It is no surprise, then, that for the most part mainframe modernization efforts have stalled. They are risky as well as expensive initiatives, and Hadoop is plugging the gap.
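
A back-of-the-envelope comparison using the article's ballpark per-terabyte figures; the 50 TB data volume is an arbitrary example, not a quoted case.

# Rough cost comparison using the per-TB figures cited above.
# The 50 TB volume is an arbitrary illustrative number.
data_tb = 50
mainframe_low = 20_000 * data_tb
mainframe_high = 100_000 * data_tb
hadoop = 1_000 * data_tb
print(f"Mainframe: ${mainframe_low:,} to ${mainframe_high:,}")  # $1,000,000 to $5,000,000
print(f"Hadoop:    ${hadoop:,}")                                # $50,000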

Professional Skills and Skilled Professionals

The future is almost certainly set for Hadoop to overtake mainframe as the default data managing system. As a result, this growing interest in Hadoop technologies is driving a huge demand for data scientists with Big Data skills. Companies will always need to find ways of staying competitive in the world of global business, and the intelligent utilization of big data is absolutely imperative in achieving this.
However, the task of analyzing, classifying and drawing pertinent information from enormous hoards of unstructured raw data requires the services of highly skilled and trained data professionals, and, quite simply, there’s currently a shortage of such people.
The USA alone faces a shortage of between 140,000 and 190,000 big data professionals with the analytical skills required to make key business decisions based on Big Data analysis.
However, what this does indicate is big career opportunities for mainframe professionals. Companies that are transitioning to Hadoop want people with experience and knowledge of analytics, and so, if you’re a mainframe professional, then now is the time to start bolstering your skill set and learning Hadoop and its approaches (MapReduce and the like).

Storage

Big Data, by its very nature, is growing so fast that it is becoming nigh on impossible to keep up with. Indeed, there isn’t even the storage space to cope with it all. As early as 2010, 35% more digital information was created than there was capacity to store it, with the figure approaching 60% today. Indeed, writing in Forbes, Christopher Frank points out that 90% of the data in the world today has been created in the last two years alone.

IBM Plugs the Skills Gap

Clearly, the Big Data skills gap needs to be plugged, and is something that has garnered global attention. Indeed, IBM has made a significant move towards solving the problem. Partnering with more than 1000 universities worldwide, IBM last year announced nine new academic collaborations designed to prepare students for the 4.4 million jobs that are set to be created globally to support Big Data by 2015. 
The hope – and the need – is to create a massive new fleet of knowledgeable professionals who can use ‘out of the box’ thinking to draw relevant inferences from the wealth of Big Data available to them, and convert the findings into meaningful business strategies. Eventually, nearly every area of research will be using Big Data analysis to draw conclusions that will affect how those specific businesses operate. And the pool is huge. Computing, humanities, social sciences, medicine, natural history – anything and everything will one day benefit from Big Data. 
The point is that no matter what field you’re working in, despite the challenges of recruitment surrounding Big Data analytics, the opportunities that it represents mean that sooner or later your skills as solely a mainframe professional will not suffice. For companies, finding the right person for the Big Data job role is the biggest challenge. Your transition from mainframe to Hadoop might still be in process, and of course the move might not be entire – you may still have some very pertinent uses for your mainframe. Finding the right person with the right skills to match the present and future requirements of your data management is indeed the most pressing demand. 

Tuesday 21 October 2014

Big Data Architecture Best Practices

The marketing departments of software vendors have done a good job of making Big Data go mainstream, whatever that means. The promise is that we can achieve anything if we make use of Big Data: business insight and beating our competition into submission. Yet there is no well-publicised successful Big Data implementation. The question is: why not? Businesses have invested billions of dollars in this supposed silver bullet with little return on investment. Who is to blame? After all, businesses do not have to publicise their internal processes or projects. I have a different view: the cause lies with the IT department. Most Big Data projects are driven by the technologists, not the business, so there is a real lack of understanding in aligning the architecture with the business vision for the future.
The Preliminary Phase
Big Data projects are no different from any other IT projects. All projects spring from business needs and requirements. This is not The Matrix; we cannot answer questions which have not been asked yet. Before any work begins, or any discussion around which technology to use, all stakeholders need to have an understanding of:
  • The organisational context
  • The key drivers and elements of the organisation
  • The requirements for architecture work
  • The architecture principles
  • The framework to be used
  • The relationships between management frameworks
  • The enterprise architecture maturity
In the majority of cases, Big Data projects involve knowing the current business technology landscape, in terms of current and future applications and services:
  • Strategies and business plans
  • Business principles, goals, and drivers
  • Major framework currently implemented in the business
  • Governance and legal frameworks
  • IT strategy
  • Pre-existing Architecture Framework, Organisational Model, and Architecture repository
The Big Data Continuum
Big Data projects are not, and should never be, executed in isolation. The simple fact that Big Data needs to feed from other systems means there should be a channel of communication open across teams. In order to have a successful architecture, I came up with five simple layers/stacks for a Big Data implementation. To the more technically inclined architect, these will seem obvious:
  • Data sources
  • Big Data ETL
  • Data Services API
  • Application
  • User Interface Services
Big Data Protocol Stack

Data Sources

Current and future applications will produce more and more data, which will need to be processed in order to gain any competitive advantage from it. Data comes in all sorts, but we can categorise it into two types:
  1. Structured data – usually stored following a predefined format, such as known and proven database techniques. Not all structured data is stored in a database, as many businesses use flat files such as Microsoft Excel or tab-delimited files for storing data.
  2. Unstructured data – businesses generate great amounts of unstructured data such as emails, instant messaging, video conferencing, internet content, and flat files such as documents and images; the list is endless. We call this data "unstructured" as it does not follow a format that facilitates querying its content.
I have spent a large part of my career working on Enterprise Search technology before "Big Data" was even coined. Understanding where the data is coming from and in what shape is valuable to a successful implementation of a Big Data ETL project. Before a single line of programming code is written, architects will have to try to normalise the data to a common format.
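
A hedged illustration of that normalisation step: flattening one structured (CSV) and one semi-structured (JSON lines) source onto a single common record layout. The file names and field names are invented for the example.

# Minimal normalisation sketch: map two differently shaped sources onto one
# common record layout. File names and field names are invented.
import csv
import json

def normalise_csv(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"customer": row["customer_id"], "text": row["comment"], "source": "csv"}

def normalise_json(path):
    with open(path) as f:
        for line in f:                      # one JSON document per line
            doc = json.loads(line)
            yield {"customer": doc["user"], "text": doc["body"], "source": "json"}

records = list(normalise_csv("feedback.csv")) + list(normalise_json("emails.jsonl"))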

Big Data ETL

This is the part that excites technologists, and especially the development teams. There are so many blogs and articles published every day about Big Data tools that this creates confusion among non-technical people. Everybody is excited about processing petabytes of data using the coolest kid on the block: Hadoop and its ecosystem. Before we get carried away, we first need to put some baselines in place:
  • Real-time processing
  • Batch processing

Big Data - Data Consolidation
The purpose of Extract Transform Load projects, whether they use Hadoop or not, is to consolidate the data into a single view – a Master Data Management system – for querying on demand. Hadoop and its ecosystem deal with the ETL aspect of Big Data, not the querying part. The tools used will depend heavily on the processing needs of the project, either real-time or batch; Hadoop, for instance, is a batch processing framework for large volumes of data. Once the data has been processed, the Master Data Management (MDM) system can be stored in a data repository, whether NoSQL-based or an RDBMS – this depends only on the querying requirements.
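
A minimal batch-processing sketch of that Extract-Transform-Load step in PySpark, consolidating raw order records into a single aggregated view for the MDM/query layer; the paths and column names are placeholders.

# Minimal batch ETL sketch: extract raw records, transform and deduplicate
# them, and load a consolidated view. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-etl").getOrCreate()

raw = spark.read.json("/data/raw/orders/")               # Extract
consolidated = (raw
    .withColumn("order_date", F.to_date("order_ts"))     # Transform
    .dropDuplicates(["order_id"])
    .groupBy("customer_id", "order_date")
    .agg(F.sum("amount").alias("daily_spend")))

consolidated.write.mode("overwrite").parquet("/data/mdm/orders_daily/")  # Load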

Data Services API

As most of the limelight goes to the ETL tools, a very important area is usually overlooked until later, almost as an afterthought. The MDM will need to be stored in a repository in order for the information to be retrieved when needed. In a true Service Oriented Architecture spirit, the data repository should be able to expose interfaces to external third-party applications for data retrieval and manipulation. In the past, MDM systems were mostly built on an RDBMS, and retrieval and manipulation were carried out through the Structured Query Language. This does not have to change, but architects should be aware of other forms of database, such as NoSQL types. The following questions should be asked when choosing a database solution:
  • Is there a standard query language?
  • How do we connect to the database – via DB drivers or available web services?
  • Will the database scale as the data grows?
  • What security mechanisms are in place for protecting some or all of the data?
Other questions specific to the project should also be included in the checklist.
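
To make the "expose interfaces" idea concrete, here is a minimal data-service sketch; Flask is used purely as an illustration, and the route, in-memory "repository" and record fields are assumptions rather than anything prescribed by the architecture.

# Minimal data-service sketch exposing MDM records over HTTP.
# Flask is used for illustration; the route and record fields are invented.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Stand-in for the real MDM repository (NoSQL or RDBMS behind a data layer).
CUSTOMERS = {
    "c-001": {"id": "c-001", "name": "Acme Ltd", "segment": "retail"},
}

@app.route("/customers/<customer_id>")
def get_customer(customer_id):
    record = CUSTOMERS.get(customer_id)
    if record is None:
        abort(404)
    return jsonify(record)

if __name__ == "__main__":
    app.run(port=8080)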

Business Applications

So far, we have extracted the data, transformed it and loaded it into a Master Data Management system. The normalised data is now exposed through web services (or DB drivers) to be used by third-party applications. Business applications are the reason to undertake Big Data projects in the first place. Some will argue that we should hire Data Scientists. According to many blogs, the Data Scientist's role is to understand the data, explore the data, prototype (new answers to unknown questions) and evaluate their findings. This is interesting, as it reminds me of the motion picture The Matrix, where the Architect knew the answers to the questions before Neo had even asked them and decided which ones were relevant or not. That is not how businesses are run. It would be extremely valuable if the data scientist could subconsciously suggest (Inception) a new way to do something, but most of the time the questions will come from the business to be answered by the Data Scientist or whoever knows the data. The business applications will be the answer to those questions.

User Interfaces Services

User interfaces make or break the project: a badly designed UI will hurt adoption regardless of the data behind it, while an intuitive design will increase adoption and perhaps lead users to start questioning the quality of the data instead. Users will access the data differently – mobile, TV and web, for example. Users will usually focus on a certain aspect of the data and therefore will require the data to be presented in a customised way. Some other users will want the data to be available through their current dashboard and to match their current look and feel. As always, security will also be a concern. Enterprise portals have been around for a long time and are usually used for data integration projects. Standards such as Web Services for Remote Portlets (WSRP) make it possible for user interfaces to be served through Web Service calls.
Conclusion
This article shows the importance of architecting a Big Data project before embarking on it. The project needs to be in line with the business vision and informed by a good understanding of the current and future technology landscape. The data needs to bring value to the business, and therefore the business needs to be involved from the outset. Understanding how the data will be used is key to success, and taking a service-oriented architecture approach will ensure that the data can serve many business needs.

Monday 20 October 2014

Hadoop Distributions


Here are a few of the MPP database vendors with their buyers:
Greenplum => EMC
Netezza => IBM
DATAllegro => Microsoft
Aster Data => Teradata
Vertica => HP
Similarly, I think the landscape of Hadoop vendors will change in the near future. Here are the major vendors in this Hadoop space as of September 2014:
Cloudera
  • Private
  • Investments: 2011 – $40M; 2014 – $900M
  • Around 600 employees
  • Founded in 2009
  • Partners with Oracle, Intel (funding), and Amazon (but also competes with Amazon)
Hortonworks
  • Private
  • Investments: 2011 – $23M + $25M
  • 201-500 employees
  • Founded in 2011
  • Partners with Yahoo, Teradata, and SAP
IBM
  • Public
  • $100B Revenue / year
  • 400K employees
  • Founded in 1911
MapR
  • Private
  • Investments: 2009 – $9M; 2014 – $110M
  • 201-500 employees
  • Founded in 2009
  • Partners with Google
Pivotal
  • Private
  • Investments: 2013 – $100M from GE and assets from EMC and VMWare
  • 3000+ employees
  • Founded in 2013 (Pivotal), 2003 (Greenplum), 1998 (VMWare) and 1979 (EMC)
  • Partners with EMC, VMWare, and GE
Amazon
  • Public
  • $75B Revenue / year
  • 132K employees
  • Founded in 1994
Hadoop Vendors Tomorrow
Cloudera => Oracle or Amazon
It will probably be Oracle because of the existing partnership and leadership that came from Oracle but Amazon may want it more. If Oracle doesn’t buy Cloudera, they will probably try to create their own distribution like they did with Linux.
Hortonworks => Teradata
It is only a matter of time before Teradata will have to buy Hortonworks. Microsoft might try to buy Hortonworks or just take a fork of the Windows version to rebrand. Microsoft worked with Sybase a long time ago with SQL Server and then took the code and ran rather than buying Sybase. So because of that history, I think Microsoft won’t buy and Teradata will.
Teradata bought Aster Data and Hortonworks would complete their data portfolio. Teradata for the EDW, Aster Data for Data Marts, and Hortonworks for their Data Lake.
MapR => Google
Google will snatch up MapR which will make MapR very happy.
So that leaves IBM and Amazon as the two publicly held companies. Pivotal is privately held, but by EMC, VMWare, and GE, which gives every indication, based on past actions by EMC, that this company will go public and be big.
Post Acquisitions
So after the big shakeup, I think you’ll see these vendors remaining selling Hadoop:
  • Pivotal: 100% Apache based with the best SQL Engine
  • IBM: Big Insights
  • Teradata: Hortonworks
  • Oracle: Cloudera
  • Google: MapR
  • Amazon: Elastic MapReduce
I could be wrong but I really do think there will be a consolidation of vendors in the near future.