Monday, 20 October 2014

HAWQ with Parquet Files

Here is a quick example showing how this works.
First, create a table with some data. You can either use the CTAS method or the more traditional CREATE TABLE followed by INSERT. Either method works; which one you use is a matter of preference. This example generates just 100 rows of fake customer data.
CTAS
CREATE TABLE CUSTOMER 
WITH (appendonly=true, orientation=parquet)
AS
SELECT i AS id, 'jon' || i AS fname, 'roberts' || i AS lname, i::text || ' main street'::text AS address, 'new york'::text AS city, 'ny'::text AS state, lpad(i, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub
DISTRIBUTED BY (id);
CREATE and then INSERT
CREATE TABLE customer
(
  id integer,
  fname text,
  lname text,
  address text,
  city text,
  state text,
  zip text
)
WITH (APPENDONLY=true, ORIENTATION=parquet, 
  OIDS=FALSE
)
DISTRIBUTED BY (id);

INSERT INTO customer
SELECT i AS id, 'jon' || i AS fname, 'roberts' || i AS lname, i::text || ' main street'::text AS address, 'new york'::text AS city, 'ny'::text AS state, lpad(i, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub;
Now you have data in the Parquet format in HAWQ. Pretty easy, huh?
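If you want to double-check how the table is stored, a quick look at the catalog does the trick. This is just a sanity-check sketch; the exact contents of reloptions can vary by version, but it should reflect the appendonly and parquet storage options you specified.
--Confirm the storage options and the row count.
SELECT c.relname, c.reloptions
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname = 'public'
AND c.relname = 'customer';

SELECT count(*) FROM customer;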
Next, I’ll use a nifty tool that queries the HAWQ catalog to tell me where the Parquet files are.
gpextract -o customer.yaml -W customer -dgpadmin
And here is the customer.yaml file it created.
DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.0.1 build
  8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr
  23 2014 16:12:32
DFS_URL: hdfs://phd1.pivotalguru.com:8020
Encoding: UTF8
FileFormat: Parquet
Parquet_FileLocations:
  Checksum: false
  CompressionLevel: 0
  CompressionType: null
  EnableDictionary: false
  Files:
  - path: /hawq_data/gpseg0/16385/16554/16622.0
    size: 4493
  - path: /hawq_data/gpseg1/16385/16554/16622.0
    size: 4499
  PageSize: 1048576
  RowGroupSize: 8388608
TableName: public.customer
Version: 1.0.0
Notice the paths to the files, which live in Hadoop and are stored in the Parquet format.
Now you can use a tool like Pig to look at the data.
grunt> A = load '/hawq_data/gpseg{0,1}/16385/16554/16622' USING parquet.pig.ParquetLoader();
grunt> describe A;                                                                          
A: {id: int,fname: bytearray,lname: bytearray,address: bytearray,city: bytearray,state: bytearray,zip: bytearray}
grunt> B = foreach A generate id, fname, lname, address, city, state, zip;
grunt> dump B;
(2,jon2,roberts2,2 main street,new york,ny,00002)
(4,jon4,roberts4,4 main street,new york,ny,00004)
(6,jon6,roberts6,6 main street,new york,ny,00006)
(8,jon8,roberts8,8 main street,new york,ny,00008)
.....
Parquet is easy to use in HAWQ and doesn’t lock you into a Pivotal HD and HAWQ-only solution. Other tools like Pig or MapReduce can easily read the Parquet files in your Hadoop cluster. No vendor lock-in.

When to ANALYZE in Greenplum and HAWQ?

Greenplum and HAWQ will perform an ANALYZE automatically for you so the query optimizer has good statistics and can build a good plan. Greenplum and HAWQ also let you configure this behavior, if needed, with gp_autostats_mode and gp_autostats_on_change_threshold.
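You can check the current values with SHOW and change them for your session with SET; cluster-wide defaults are typically managed with gpconfig. A minimal sketch (the on_change value is just an illustration):
--Check the current settings for this session.
SHOW gp_autostats_mode;
SHOW gp_autostats_on_change_threshold;

--Change the mode for the current session only.
SET gp_autostats_mode = 'on_change';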
gp_autostats_mode
This specifies when an automatic ANALYZE should take place. The values for this configuration parameter can be:
none
on_change
on_no_stats
on_no_stats
The default is on_no_stats, and the database will automatically perform an ANALYZE for you. If you INSERT data into an empty table (a new table or one you just truncated), the database will automatically gather statistics with an ANALYZE.
Typically in an analytics data warehouse, you insert data once and then read it many times, so the default of on_no_stats automatically gives you statistics as needed with no additional work from you.
none
Self-explanatory. You have to execute ANALYZE yourself if you want statistics.
on_change
With this setting, if you perform an INSERT, UPDATE, or DELETE that exceeds the gp_autostats_on_change_threshold value, an automatic ANALYZE will happen. Note that in HAWQ, you can only INSERT.
In Greenplum, in the rare case where you are doing a DELETE or UPDATE on a table, you will need to execute an ANALYZE statement yourself. Or, you can set the database to use on_change and the ANALYZE will happen automatically.
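A minimal sketch of that manual refresh (the predicate is just a placeholder for whatever rows you are changing):
--After a large DELETE or UPDATE in Greenplum, refresh the statistics manually.
DELETE FROM customer WHERE id <= 10;
ANALYZE customer;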
gp_autostats_on_change_threshold
This is only relevant for on_change, and it is the number of rows that must change before an automatic ANALYZE will happen. The default is over 2 billion rows, so if you really want to use on_change, you will also need to reduce this configuration parameter to something more realistic.
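For example, to make on_change practical for a session, lower the threshold to roughly your typical load size. The 1,000,000 value here is just an illustration and matches the settings used in Example 2 below.
--Loads of 1M or more rows will now trigger an automatic ANALYZE.
SET gp_autostats_on_change_threshold = 1000000;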
Examples
Example 1 – The default
gp_autostats_mode = on_no_stats
gp_autostats_on_change_threshold = 2147483647
CREATE TABLE my_table AS 
SELECT state, count(*) AS counter
FROM customer
GROUP BY state
DISTRIBUTED BY (state);
--Statistics will be gathered automatically.
CREATE TABLE my_table 
(state varchar(100),
 counter int)
DISTRIBUTED BY (state);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically.
TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically.
TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
--Statistics will be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics won't be updated.  The planner will think the table has only 1 row.

ANALYZE my_table;
--Statistics manually gathered and correctly shows the correct number of rows in the table.
Example 2 – Using on_change
gp_autostats_mode = on_change
gp_autostats_on_change_threshold = 1000000
CREATE TABLE my_table AS 
SELECT state, count(*) AS counter
FROM customer
GROUP BY state
DISTRIBUTED BY (state);
--Statistics will be gathered automatically only if the number of rows is 1M or more.
CREATE TABLE my_table 
(state varchar(100),
 counter int)
DISTRIBUTED BY (state);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
--Statistics will not be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
Checking the Statistics
Here are a couple of queries you can use to see the statistics information.
SELECT c.oid, c.relpages, c.reltuples 
FROM pg_class c 
JOIN pg_namespace n ON c.relnamespace = n.oid 
WHERE n.nspname = 'public' 
AND c.relname = 'my_table';
--using the oid from the previous query
SELECT * FROM pg_statistic WHERE starelid = 298610;
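If you prefer not to look up the oid first, the standard pg_stats view presents the same information per column in a friendlier form; this is shown as a convenience rather than anything HAWQ-specific.
--Per-column statistics without needing the oid.
SELECT * FROM pg_stats
WHERE schemaname = 'public'
AND tablename = 'my_table';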
Summary
Greenplum and HAWQ automatically gather statistics for you in most cases. If you are doing a lot of DML activity, you can change the configuration so that statistics are still gathered automatically. These automatic settings make life easier for DBAs, developers, and analysts while still giving you the flexibility to configure statistics collection in the best way for your environment.