HAWQ with Parquet Files
Parquet is a column-oriented storage format for data in HDFS. It is part of the Apache ecosystem and is available in Pivotal HD and HAWQ. HAWQ can store and read data in the Parquet format, and because the files live in HDFS, they are also accessible to the open source components of Pivotal HD such as Pig and MapReduce.
Here is a quick example showing how this works.
First, create a table with some data. You can either use the CTAS method or the more traditional CREATE TABLE followed by INSERT; either works, so use whichever you prefer. This example generates just 100 rows of fake customer data.
CTAS
CREATE TABLE CUSTOMER
WITH (appendonly=true, orientation=parquet) AS
SELECT i AS id,
       'jon' || i AS fname,
       'roberts' || i AS lname,
       i::text || ' main street'::text AS address,
       'new york'::text AS city,
       'ny'::text AS state,
       lpad(i, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub
DISTRIBUTED BY (id);
CREATE and then INSERT
CREATE TABLE customer
( id integer,
  fname text,
  lname text,
  address text,
  city text,
  state text,
  zip text )
WITH (APPENDONLY=true, ORIENTATION=parquet, OIDS=FALSE)
DISTRIBUTED BY (id);

INSERT INTO customer
SELECT i AS id,
       'jon' || i AS fname,
       'roberts' || i AS lname,
       i::text || ' main street'::text AS address,
       'new york'::text AS city,
       'ny'::text AS state,
       lpad(i, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub;
Now you have data in the Parquet format in HAWQ. Pretty easy, huh?
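To sanity-check the load, a couple of quick queries work. This is a minimal sketch run against the table created above; the expected count assumes the 100-row generate_series data.

```sql
-- Verify the row count and peek at a few rows.
SELECT count(*) FROM customer;              -- expect 100
SELECT * FROM customer ORDER BY id LIMIT 3;
```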
Next, I’ll use a nifty tool that queries the HAWQ catalog to find where the Parquet files are stored.
gpextract -o customer.yaml -W customer -dgpadmin
And here is the customer.yaml file it created.
DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.0.1 build
  8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on
  Apr 23 2014 16:12:32
DFS_URL: hdfs://phd1.pivotalguru.com:8020
Encoding: UTF8
FileFormat: Parquet
Parquet_FileLocations:
  Checksum: false
  CompressionLevel: 0
  CompressionType: null
  EnableDictionary: false
  Files:
  - path: /hawq_data/gpseg0/16385/16554/16622.0
    size: 4493
  - path: /hawq_data/gpseg1/16385/16554/16622.0
    size: 4499
  PageSize: 1048576
  RowGroupSize: 8388608
TableName: public.customer
Version: 1.0.0
Notice the paths to the files, which live in HDFS and are in the Parquet format.
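The YAML's PageSize, RowGroupSize, and CompressionType fields correspond to storage options you can set when the table is created. As a sketch, here is how the table could be created with Snappy compression and explicit sizes; the option names (compresstype, pagesize, rowgroupsize) are from HAWQ's Parquet storage options, so verify them against your HAWQ version. The table name customer_snappy is just an illustration.

```sql
-- Sketch: set Parquet compression and sizing at CREATE time.
-- Option names assumed from HAWQ's Parquet storage options.
CREATE TABLE customer_snappy
WITH (appendonly=true,
      orientation=parquet,
      compresstype=snappy,   -- instead of the default (null) shown in the YAML
      pagesize=1048576,
      rowgroupsize=8388608) AS
SELECT * FROM customer
DISTRIBUTED BY (id);
```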
Now you can use a tool like Pig to look at the data.
grunt> A = load '/hawq_data/gpseg{0,1}/16385/16554/16622' USING parquet.pig.ParquetLoader();
grunt> describe A;
A: {id: int,fname: bytearray,lname: bytearray,address: bytearray,city: bytearray,state: bytearray,zip: bytearray}
grunt> B = foreach A generate id, fname, lname, address, city, state, zip;
grunt> dump B;
(2,jon2,roberts2,2 main street,new york,ny,00002)
(4,jon4,roberts4,4 main street,new york,ny,00004)
(6,jon6,roberts6,6 main street,new york,ny,00006)
(8,jon8,roberts8,8 main street,new york,ny,00008)
.....
Parquet is easy to use in HAWQ and doesn’t lock you into a Pivotal HD and HAWQ-only solution. Other tools like Pig or MapReduce can read the same Parquet files in your Hadoop cluster. No vendor lock-in.