Monday 20 October 2014

HAWQ with Parquet Files

Here is a quick example showing how this works.
First, create a table with some data. You can either use the CTAS method or the more traditional CREATE TABLE followed by INSERT. Either method works, so use whichever you prefer. This example generates only 100 records of fake customer data.
CTAS
CREATE TABLE CUSTOMER 
WITH (appendonly=true, orientation=parquet)
AS
SELECT i AS id, 'jon' || i AS fname, 'roberts' || i AS lname, i::text || ' main street'::text AS address, 'new york'::text AS city, 'ny'::text AS state, lpad(i, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub
DISTRIBUTED BY (id);
CREATE and then INSERT
CREATE TABLE customer
(
  id integer,
  fname text,
  lname text,
  address text,
  city text,
  state text,
  zip text
)
WITH (APPENDONLY=true, ORIENTATION=parquet, 
  OIDS=FALSE
)
DISTRIBUTED BY (id);

INSERT INTO customer
SELECT i AS id, 'jon' || i AS fname, 'roberts' || i AS lname, i::text || ' main street'::text AS address, 'new york'::text AS city, 'ny'::text AS state, lpad(i, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub;
Now you have data in the Parquet format in HAWQ. Pretty easy, huh?
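To sanity check the load, a couple of quick queries against the customer table created above will do it:

SELECT count(*) FROM customer;

SELECT * FROM customer ORDER BY id LIMIT 5;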
Next, I’ll use a nifty tool called gpextract that queries the HAWQ catalog and tells me where the Parquet files are stored in HDFS.
gpextract -o customer.yaml -W customer -d gpadmin
And here is the customer.yaml file it created.
DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.0.1 build
  8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr
  23 2014 16:12:32
DFS_URL: hdfs://phd1.pivotalguru.com:8020
Encoding: UTF8
FileFormat: Parquet
Parquet_FileLocations:
  Checksum: false
  CompressionLevel: 0
  CompressionType: null
  EnableDictionary: false
  Files:
  - path: /hawq_data/gpseg0/16385/16554/16622.0
    size: 4493
  - path: /hawq_data/gpseg1/16385/16554/16622.0
    size: 4499
  PageSize: 1048576
  RowGroupSize: 8388608
TableName: public.customer
Version: 1.0.0
Notice the paths to the files, which are stored in Hadoop in the Parquet format.
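If you want to see those files for yourself, you can list them straight from HDFS. This is just a sketch using the same paths reported in customer.yaml above; the gpseg and OID directory numbers will differ on your cluster:

hdfs dfs -ls /hawq_data/gpseg0/16385/16554/
hdfs dfs -du -h /hawq_data/gpseg0/16385/16554/16622.0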
Now you can use a tool like Pig to look at the data.
grunt> A = load '/hawq_data/gpseg{0,1}/16385/16554/16622' USING parquet.pig.ParquetLoader();
grunt> describe A;                                                                          
A: {id: int,fname: bytearray,lname: bytearray,address: bytearray,city: bytearray,state: bytearray,zip: bytearray}
grunt> B = foreach A generate id, fname, lname, address, city, state, zip;
grunt> dump B;
(2,jon2,roberts2,2 main street,new york,ny,00002)
(4,jon4,roberts4,4 main street,new york,ny,00004)
(6,jon6,roberts6,6 main street,new york,ny,00006)
(8,jon8,roberts8,8 main street,new york,ny,00008)
.....
Parquet is easy to use in HAWQ and doesn’t lock you into a Pivotal HD and HAWQ-only solution. Other tools such as Pig or MapReduce can read the Parquet files in your Hadoop cluster just as easily. No vendor lock-in.
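As one more illustration of that portability, the standalone parquet-tools utility from the Apache Parquet project can read the same files. This is only a sketch: the jar name and version below are assumptions (use whatever build you have), and the file path is the one reported by gpextract above.

hadoop jar parquet-tools-1.6.0.jar schema /hawq_data/gpseg0/16385/16554/16622.0
hadoop jar parquet-tools-1.6.0.jar head -n 5 /hawq_data/gpseg0/16385/16554/16622.0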
