Monday, 20 October 2014

Hadoop Distributions


Here are a few of the MPP database vendors and the companies that acquired them:
Greenplum => EMC
Netezza => IBM
DATAllegro => Microsoft
Aster Data => Teradata
Vertica => HP
Similarly, I think the landscape of Hadoop vendors will change in the near future. Here are the major vendors in the Hadoop space as of September 2014:
Cloudera
  • Private
  • Investments: 2011 – $40M; 2014 – $900M
  • Around 600 employees
  • Founded in 2009
  • Partners with Oracle, Intel (funding), and Amazon (but also competes with Amazon)
Hortonworks
  • Private
  • Investments: 2011 – $23M + $25M
  • 201-500 employees
  • Founded in 2011
  • Partners with Yahoo, Teradata, and SAP
IBM
  • Public
  • $100B Revenue / year
  • 400K employees
  • Founded in 1911
MapR
  • Private
  • Investments: 2009 – $9M; 2014 – $110M
  • 201-500 employees
  • Founded in 2009
  • Partners with Google
Pivotal
  • Private
  • Investments: 2013 – $100M from GE and assets from EMC and VMWare
  • 3000+ employees
  • Founded in 2013 (Pivotal), 2003 (Greenplum), 1998 (VMWare) and 1979 (EMC)
  • Partners with EMC, VMWare, and GE
Amazon
  • Public
  • $75B Revenue / year
  • 132K employees
  • Founded in 1994
Hadoop Vendors Tomorrow
Cloudera => Oracle or Amazon
It will probably be Oracle because of the existing partnership and because much of Cloudera's leadership came from Oracle, but Amazon may want it more. If Oracle doesn't buy Cloudera, it will probably try to create its own distribution like it did with Linux.
Hortonworks => Teradata
It is only a matter of time before Teradata has to buy Hortonworks. Microsoft might try to buy Hortonworks, or it might just fork and rebrand the Windows version. Microsoft worked with Sybase long ago on SQL Server, then took the code and ran rather than buying Sybase. Because of that history, I think Microsoft won't buy and Teradata will.
Teradata already bought Aster Data, and Hortonworks would complete their data portfolio: Teradata for the EDW, Aster Data for data marts, and Hortonworks for the data lake.
MapR => Google
Google will snatch up MapR, which will make MapR very happy.
That leaves IBM and Amazon as the two remaining publicly held companies. Pivotal is privately held, but by EMC, VMWare, and GE, and based on EMC's past actions, all indications are that Pivotal will go public and be big.
Post Acquisitions
So after the big shakeup, I think these are the vendors you'll see selling Hadoop:
  • Pivotal: 100% Apache based with the best SQL Engine
  • IBM: Big Insights
  • Teradata: Hortonworks
  • Oracle: Cloudera
  • Google: MapR
  • Amazon: Elastic MapReduce
I could be wrong but I really do think there will be a consolidation of vendors in the near future.

XML Parsing with Pivotal


There are many ways to handle XML files, but in this case I had very large files, so I needed a cluster of machines, and Hadoop is pretty good at that. The processing can be done with MapReduce or with a tool like Pig, which simplifies writing MapReduce jobs.
Solution 1
Steps
  • Load raw file to Hadoop
  • Transform XML to tab delimited file with Pig
  • Create External Table in HAWQ to read file data in Hadoop
Sample XML file.
<?xml version="1.0"?>
<catalog>
      <large-product>
         <name>foo1</name>
         <price>110</price>
      </large-product>
      <large-product>
         <name>foo2</name>
         <price>120</price>
      </large-product>
      <large-product>
         <name>foo3</name>
         <price>130</price>
      </large-product>
      <large-product>
         <name>foo4</name>
         <price>140</price>
      </large-product>
      <large-product>
         <name>foo5</name>
         <price>150</price>
      </large-product>
      <small-product>
         <name>bar1</name>
         <price>10</price>
      </small-product>
      <small-product>
         <name>bar2</name>
         <price>20</price>
      </small-product>
      <small-product>
         <name>bar3</name>
         <price>30</price>
      </small-product>
      <small-product>
         <name>bar4</name>
         <price>40</price>
      </small-product>
      <small-product>
         <name>bar5</name>
         <price>50</price>
      </small-product>
</catalog>
As you can see, there are two record sets, large products and small products, but I only want the small products in a table.
First, put the raw XML data into Hadoop.
hdfs dfs -mkdir /demo4
hdfs dfs -put catalog.xml /demo4
Here is the Pig script.
REGISTER /usr/lib/gphd/pig/piggybank.jar;

-- Read each <small-product>...</small-product> element as one chararray record.
A = LOAD '/demo4/catalog.xml'
USING org.apache.pig.piggybank.storage.XMLLoader('small-product')
AS (doc:chararray);

-- Extract the name and price values from each record with a regular expression.
clean = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<small-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</small-product>'))
AS (name:chararray,price:int);
store clean into '/demo4/alt_small_data';
What Pig does for me first is grab only the small-product records. This requires just a single line in the script and is very useful. The next step uses regular expressions to parse each tag. This is very painful to get right because Pig uses MapReduce to parse the data, which is powerful but relatively slow to iterate on until you get it right. Even with a small file, each iteration took at least 30 seconds to execute, and the full file took 22 minutes.
The last step is to create an External Table in HAWQ.
DROP EXTERNAL TABLE IF EXISTS ext_alt_demo4;
CREATE EXTERNAL TABLE ext_alt_demo4
(
  name text, price int
)
LOCATION (
  'pxf://pivhdsne:50070/demo4/alt_small_data/part*?profile=HdfsTextSimple'
)
FORMAT 'text' (delimiter E'\t');
And selecting the data in HAWQ.
SELECT * FROM ext_alt_demo4;
 name | price 
------+-------
 bar1 |    10
 bar2 |    20
 bar3 |    30
 bar4 |    40
 bar5 |    50
(5 rows)

Time: 127.334 ms
This was my first approach to XML parsing, until I got frustrated with the number of XML tags I had to write regular expressions for. The XML I had wasn't as neat as my example, so I had to re-run the Pig script over and over for each slight modification to the parsing logic.
Solution 2
This is the same basic process as Solution 1, but instead of parsing each record with regular expressions in Pig, I create a single column and do the parsing with SQL in HAWQ.
Here is my Pig script.
REGISTER /usr/lib/gphd/pig/piggybank.jar;
A = LOAD '/demo4/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray);
clean = foreach A generate REPLACE(REPLACE(doc, '\\u000D', ''), '\\u000A', '');
store clean into '/demo4/small_data';
So instead of using regular expressions, I'm removing the carriage return and newline characters from the XML so that each record sits on one row. Then I store that back in Hadoop.
Here is the External Table in HAWQ.
CREATE EXTERNAL TABLE ext_demo4
(
  xml_data text
)
LOCATION (
  'pxf://pivhdsne:50070/demo4/small_data/part*?profile=HdfsTextSimple'
)
FORMAT 'TEXT' (delimiter E'\t');
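Each row of ext_demo4 now holds one complete small-product element flattened onto a single line; a quick peek at a few raw rows confirms that before writing any parsing logic:
-- Look at a few raw rows; each one should be a full <small-product>...</small-product> string.
SELECT xml_data FROM ext_demo4 LIMIT 3;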
I then created a simple SQL function to parse the data.
CREATE OR REPLACE FUNCTION fn_extract_xml_value(p_tag text, p_xml text) RETURNS TEXT AS
$$
-- Capture everything between <tag> and </tag>, then keep only the text before the next '<'.
SELECT SPLIT_PART(SUBSTRING($2 FROM '<' || $1 || '>(.*)</' || $1 || '>'), '<', 1)
$$
LANGUAGE SQL;
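Because it is a plain SQL function, you can test it against a literal string taken from the sample file before ever touching the external table:
-- Test on one record from the sample catalog; this returns 'bar1'.
SELECT fn_extract_xml_value('name',
       '<small-product><name>bar1</name><price>10</price></small-product>');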
And my SQL statement that parses the data.
SELECT (fn_extract_xml_value('name', xml_data))::text AS name,
       (fn_extract_xml_value('price', xml_data))::int AS price
FROM ext_demo4;
 name | price 
------+-------
 bar1 |    10
 bar2 |    20
 bar3 |    30
 bar4 |    40
 bar5 |    50
(5 rows)

Time: 94.887 ms
The benefit of this second approach is the huge speed-up in iterating until the XML parsing is correct. Instead of taking several minutes to validate my code in Pig, I could execute a SQL statement that runs in less than a second. It took another quick second to modify the SQL function, and then I would try again.
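For example, a tweak like trimming stray whitespace around the extracted value (a hypothetical refinement, not something the sample data needs) is just another CREATE OR REPLACE followed by the same SELECT, and both run in about a second:
-- Hypothetical iteration: wrap the result in TRIM to strip stray whitespace.
CREATE OR REPLACE FUNCTION fn_extract_xml_value(p_tag text, p_xml text) RETURNS TEXT AS
$$
SELECT TRIM(SPLIT_PART(SUBSTRING($2 FROM '<' || $1 || '>(.*)</' || $1 || '>'), '<', 1))
$$
LANGUAGE SQL;

-- Re-run the same query to validate the change.
SELECT (fn_extract_xml_value('name', xml_data))::text AS name,
       (fn_extract_xml_value('price', xml_data))::int AS price
FROM ext_demo4;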
Summary
Hadoop is powerful and has become commodity software, with many distributions available that are all pretty much the same. The difference between distributions is the software that is unique to each vendor. Some vendors rely on their management tools, while Pivotal HD has HAWQ, which is the most robust SQL engine for Hadoop. This example shows how you can leverage the built-in functionality of Hadoop plus HAWQ to be more productive than with any other Hadoop distribution.