Monday, 20 October 2014

Hadoop Distributions


Here are a few of the MPP database vendors and the companies that acquired them:
Greenplum => EMC
Netezza => IBM
DATAllegro => Microsoft
Aster Data => Teradata
Vertica => HP
Similarly, I think the landscape of Hadoop vendors will change in the near future. Here are the major vendors in the Hadoop space as of September 2014:
Cloudera
  • Private
  • Investments: 2011 – $40M; 2014 – $900M
  • Around 600 employees
  • Founded in 2009
  • Partners with Oracle, Intel (funding), and Amazon (but also competes with Amazon)
Hortonworks
  • Private
  • Investments: 2011 – $23M + $25M
  • 201-500 employees
  • Founded in 2011
  • Partners with Yahoo, Teradata, and SAP
IBM
  • Public
  • $100B Revenue / year
  • 400K employees
  • Founded in 1911
MapR
  • Private
  • Investments: 2009 – $9M; 2014 – $110M
  • 201-500 employees
  • Founded in 2009
  • Partners with Google
Pivotal
  • Private
  • Investments: 2013 – $100M from GE and assets from EMC and VMWare
  • 3000+ employees
  • Founded in 2013 (Pivotal), 2003 (Greenplum), 1998 (VMWare) and 1979 (EMC)
  • Partners with EMC, VMWare, and GE
Amazon
  • Public
  • $75B Revenue / year
  • 132K employees
  • Founded in 1994
Hadoop Vendors Tomorrow
Cloudera => Oracle or Amazon
It will probably be Oracle because of the existing partnership and because much of Cloudera's leadership came from Oracle, but Amazon may want it more. If Oracle doesn't buy Cloudera, it will probably try to create its own distribution like it did with Linux.
Hortonworks => Teradata
It is only a matter of time before Teradata has to buy Hortonworks. Microsoft might try to buy Hortonworks, or it might just fork and rebrand the Windows version. Microsoft worked with Sybase long ago on SQL Server, then took the code and ran rather than buying Sybase. Because of that history, I think Microsoft won't buy and Teradata will.
Teradata already bought Aster Data, and Hortonworks would complete their data portfolio: Teradata for the EDW, Aster Data for data marts, and Hortonworks for the data lake.
MapR => Google
Google will snatch up MapR, which will make MapR very happy.
That leaves IBM and Amazon as the two remaining publicly held companies. Pivotal is privately held, but by EMC, VMWare, and GE, and based on EMC's past actions, all indications are that Pivotal will go public and be big.
Post Acquisitions
So after the big shakeup, I think these are the vendors you'll see selling Hadoop:
  • Pivotal: 100% Apache based with the best SQL Engine
  • IBM: Big Insights
  • Teradata: Hortonworks
  • Oracle: Cloudera
  • Google: MapR
  • Amazon: Elastic MapReduce
I could be wrong but I really do think there will be a consolidation of vendors in the near future.

XML Parsing with Pivotal


There are many ways to handle XML files, but in this case I had very large files, so I needed a cluster of machines, and Hadoop is pretty good at that. The processing can be done with MapReduce or with a tool like Pig, which simplifies writing MapReduce jobs.
Solution 1
Steps
  • Load raw file to Hadoop
  • Transform XML to tab delimited file with Pig
  • Create External Table in HAWQ to read file data in Hadoop
Sample XML file.
<?xml version="1.0"?>
<catalog>
      <large-product>
         <name>foo1</name>
         <price>110</price>
      </large-product>
      <large-product>
         <name>foo2</name>
         <price>120</price>
      </large-product>
      <large-product>
         <name>foo3</name>
         <price>130</price>
      </large-product>
      <large-product>
         <name>foo4</name>
         <price>140</price>
      </large-product>
      <large-product>
         <name>foo5</name>
         <price>150</price>
      </large-product>
      <small-product>
         <name>bar1</name>
         <price>10</price>
      </small-product>
      <small-product>
         <name>bar2</name>
         <price>20</price>
      </small-product>
      <small-product>
         <name>bar3</name>
         <price>30</price>
      </small-product>
      <small-product>
         <name>bar4</name>
         <price>40</price>
      </small-product>
      <small-product>
         <name>bar5</name>
         <price>50</price>
      </small-product>
</catalog>
As you can see, there are two record sets, large products and small products, but I only want the small products in a table.
First, put the raw XML data into Hadoop.
hdfs dfs -mkdir /demo4
hdfs dfs -put catalog.xml /demo4
Here is the Pig script.
REGISTER /usr/lib/gphd/pig/piggybank.jar;

-- Read each <small-product>...</small-product> element as one chararray record.
A = LOAD '/demo4/catalog.xml'
USING org.apache.pig.piggybank.storage.XMLLoader('small-product')
AS (doc:chararray);

-- Extract the name and price values from each record with a regular expression.
clean = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<small-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</small-product>'))
AS (name:chararray,price:int);
store clean into '/demo4/alt_small_data';
What Pig does for me first is grab only the small-product records. This requires just a single line in the script and is very useful. The next step uses regular expressions to parse each tag. This is very painful to get right because Pig uses MapReduce to parse the data, which is powerful but relatively slow to iterate on until you get it right. Even with a small file, each iteration took at least 30 seconds to execute, and the full file took 22 minutes.
The last step is to create an External Table in HAWQ.
DROP EXTERNAL TABLE IF EXISTS ext_alt_demo4;
CREATE EXTERNAL TABLE ext_alt_demo4
(
  name text, price int
)
LOCATION (
  'pxf://pivhdsne:50070/demo4/alt_small_data/part*?profile=HdfsTextSimple'
)
FORMAT 'text' (delimiter E'\t');
And selecting the data in HAWQ.
SELECT * FROM ext_alt_demo4;
 name | price 
------+-------
 bar1 |    10
 bar2 |    20
 bar3 |    30
 bar4 |    40
 bar5 |    50
(5 rows)

Time: 127.334 ms
This was my first approach to XML parsing, until I got frustrated with the number of XML tags I had to write regular expressions for. The XML I had wasn't as neat as my example, so I had to re-run the Pig script over and over for each slight modification to the parsing logic.
Solution 2
This is the same basic process as Solution 1, but instead of parsing each record with regular expressions in Pig, I create a single column and do the parsing with SQL in HAWQ.
Here is my Pig script.
REGISTER /usr/lib/gphd/pig/piggybank.jar;
A = LOAD '/demo4/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray);
clean = foreach A generate REPLACE(REPLACE(doc, '\\u000D', ''), '\\u000A', '');
store clean into '/demo4/small_data';
So instead of using regular expressions, I'm removing the carriage return and newline characters from the XML so that each record sits on one row. Then I store that back in Hadoop.
Here is the External Table in HAWQ.
CREATE EXTERNAL TABLE ext_demo4
(
  xml_data text
)
LOCATION (
  'pxf://pivhdsne:50070/demo4/small_data/part*?profile=HdfsTextSimple'
)
FORMAT 'TEXT' (delimiter E'\t');
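Each row of ext_demo4 now holds one complete small-product element flattened onto a single line; a quick peek at a few raw rows confirms that before writing any parsing logic:
-- Look at a few raw rows; each one should be a full <small-product>...</small-product> string.
SELECT xml_data FROM ext_demo4 LIMIT 3;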
I then created a simple SQL function to parse the data.
CREATE OR REPLACE FUNCTION fn_extract_xml_value(p_tag text, p_xml text) RETURNS TEXT AS
$$
-- Capture everything between <tag> and </tag>, then keep only the text before the next '<'.
SELECT SPLIT_PART(SUBSTRING($2 FROM '<' || $1 || '>(.*)</' || $1 || '>'), '<', 1)
$$
LANGUAGE SQL;
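Because it is a plain SQL function, you can test it against a literal string taken from the sample file before ever touching the external table:
-- Test on one record from the sample catalog; this returns 'bar1'.
SELECT fn_extract_xml_value('name',
       '<small-product><name>bar1</name><price>10</price></small-product>');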
And my SQL statement that parses the data.
SELECT (fn_extract_xml_value('name', xml_data))::text AS name,
       (fn_extract_xml_value('price', xml_data))::int AS price
FROM ext_demo4;
 name | price 
------+-------
 bar1 |    10
 bar2 |    20
 bar3 |    30
 bar4 |    40
 bar5 |    50
(5 rows)

Time: 94.887 ms
The benefit of this second approach is the huge speed-up in iterating until the XML parsing is correct. Instead of taking several minutes to validate my code in Pig, I could execute a SQL statement that runs in less than a second. It took another quick second to modify the SQL function, and then I would try again.
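For example, a tweak like trimming stray whitespace around the extracted value (a hypothetical refinement, not something the sample data needs) is just another CREATE OR REPLACE followed by the same SELECT, and both run in about a second:
-- Hypothetical iteration: wrap the result in TRIM to strip stray whitespace.
CREATE OR REPLACE FUNCTION fn_extract_xml_value(p_tag text, p_xml text) RETURNS TEXT AS
$$
SELECT TRIM(SPLIT_PART(SUBSTRING($2 FROM '<' || $1 || '>(.*)</' || $1 || '>'), '<', 1))
$$
LANGUAGE SQL;

-- Re-run the same query to validate the change.
SELECT (fn_extract_xml_value('name', xml_data))::text AS name,
       (fn_extract_xml_value('price', xml_data))::int AS price
FROM ext_demo4;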
Summary
Hadoop is powerful and has become commodity software, with many distributions available that are all pretty much the same. The difference between distributions is the software that is unique to each vendor. Some vendors rely on their management tools, while Pivotal HD has HAWQ, which is the most robust SQL engine for Hadoop. This example shows how you can leverage the built-in functionality of Hadoop plus HAWQ to be more productive than with any other Hadoop distribution.