
Monday 20 October 2014

Pivotal HD and Greenplum Database Integration

In this example, I will generate some dummy data with a SQL statement in GPDB and load it into PHD with an INSERT statement. I will then read the data back out of PHD with a SELECT statement.
One-Time Configuration Steps
1. Download Pivotal HD that matches your version.
https://network.gopivotal.com/products/pivotal-hd
2. Copy the file to the Master server in the GPDB cluster. In my example, gpdbvm43 is a single-node VM with 2 segments.
scp PHD-2.0.1.0-148.tar.gz root@gpdbvm43:/root
3. ssh as gpadmin to GPDB 4.3 instance
vi .bashrc
[add these entries]
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/gphd
Note: this needs to be repeated on every host in the cluster. You can use gpscp or gpssh to make these changes to all hosts too.
4. Make sure the database is running.
gpstart -a
5. Change the gp_hadoop_target_version to be compatible with Pivotal HD 1.0 and greater.
gpconfig -c gp_hadoop_target_version -v gphd-2.0
6. Add an entry to your /etc/hosts for all of the hosts in the Hadoop cluster. This needs to be done on all hosts in the GPDB cluster too. I’m using a single node VM of PHD so I just have one entry.
vi /etc/hosts 
[add every node in the PHD cluster]
192.168.239.203 pivhdsne.localdomain pivhdsne
Note: You can then use gpscp to copy the revised hosts file to the other hosts in the cluster.
7. Install the PHD client in the GPDB cluster.
su - 
tar --no-same-owner -zxvf PHD-2.0.1.0-148.tar.gz
cd PHD-2.0.1.0-148/utility/rpm
rpm -i *.rpm
cd ../../zookeeper/rpm
rpm -i *.rpm
cd ../../hadoop/rpm
yum install nc
rpm -i *.rpm
exit
Note: You can use gpscp to copy the tar.gz file to the other hosts in the cluster and then use gpssh to execute these commands. Be sure to source the greenplum_path.sh after connecting as root. “nc” may not be needed on your cluster but was required with my VM.
8. Now that you are gpadmin again, bounce the database.
gpstop -r
Pivotal HD Configuration
You likely already have this done but if you are using the single node VM of Pivotal HD, then you will need to edit your /etc/hosts file there so that Hadoop is accessible remotely.
1. ssh as root to the VM
[root@pivhdsne ~]# ifconfig
eth1      Link encap:Ethernet  HWaddr 00:0C:29:20:A3:8F  
          inet addr:192.168.239.203  Bcast:192.168.239.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe20:a38f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9168 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1199 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:653659 (638.3 KiB)  TX bytes:199320 (194.6 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3779367 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3779367 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:4407394192 (4.1 GiB)  TX bytes:4407394192 (4.1 GiB)
Now vi /etc/hosts and change the pivhdsne entries from 127.0.0.1 to the IP address of the VM so the file looks like this:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.239.203 pivhdsne pivhdsne.localdomain
GPDB External Tables
1. Example of Writable External Table to PHD.
CREATE WRITABLE EXTERNAL TABLE ext_foo
(i int, bar text) LOCATION ('gphdfs://pivhdsne/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
2. Insert some data into PHD from GPDB.
INSERT INTO ext_foo SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;
3. Create an External Table to view the data in PHD.
CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION ('gphdfs://pivhdsne/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
4. Select the data.
postgres=# SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;
 i  |  bar   
----+--------
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)
Parallel!
Here is how you can see that it was done in parallel. Notice there are two files in the /foo_bar directory that I specified in the Writable External Table above.
[pivhdsne:~]$ hdfs dfs -ls /foo_bar
Found 2 items
-rw-r--r--   3 gpadmin hadoop        490 2014-05-24 00:19 /foo_bar/0_1400790662-0000004541
-rw-r--r--   3 gpadmin hadoop        494 2014-05-24 00:19 /foo_bar/1_1400790662-0000004541
There are two files because my single node VM of GPDB has two Segments. Each Segment wrote its file to Hadoop at the same time. Completely parallel and scalable.
More proof!
[pivhdsne:~]$ hdfs dfs -cat /foo_bar/0_1400790662-0000004541 | more
1|bar_1
3|bar_3
5|bar_5
7|bar_7
9|bar_9
...

[pivhdsne:~]$ hdfs dfs -cat /foo_bar/1_1400790662-0000004541 | more
2|bar_2
4|bar_4
6|bar_6
8|bar_8
10|bar_10
...
Extra Credit!
Log into HAWQ on the PHD cluster and create an External Table to the same files!
CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION 
('pxf://pivhdsne:50070/foo_bar?profile=HdfsTextSimple') FORMAT 'text' (delimiter '|' null 'null');

gpadmin=# SELECT * FROM ext_get_foo ORDER BY i limit 10;
 i  |  bar   
----+--------
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)

As you can see, there are many ways to move data between Greenplum database and Hadoop, covering a wide variety of use cases and business problems.

HAWQ versus Greenplum Database

This post will focus on the major differences in the filesystem, External Tables, DDL commands, and DML commands. I’m comparing Greenplum database version 4.3 with HAWQ version 1.2.0.1, which is shipped with Pivotal HD version 2.0.1.
HDFS versus POSIX Filesystem
With Greenplum database, each segment’s files are always local to the host. Even in a failover condition when the mirror is acting as the primary, the data is local to the processing.
With HAWQ, HDFS handles high availability by keeping three copies of the data across multiple nodes. Because of this, the mirroring that is built into Greenplum database is removed from HAWQ. If a node fails, Hadoop automatically re-replicates the data to maintain three copies. So a segment running on a Hadoop data node may not have the data it needs local to it and will need to get data from other physical nodes.
External Tables
Both HAWQ and Greenplum database have External Tables, but they differ in how they access external data in Hadoop.
HAWQ has PXF, the Pivotal Extension Framework. It can access files in HDFS stored as plain text as well as data in Hive, HBase, Avro, and GemFire XD, and you can write your own custom profiles to get data from HDFS. PXF can also gather some statistics about these files so the optimizer is smarter when accessing these External Tables.
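For example, reading a Hive table through PXF from HAWQ would look something like this. This is just a sketch: the Hive table sales_data and its columns are made up for illustration, and you would substitute your own PXF host and port.
CREATE EXTERNAL TABLE ext_hive_sales
(sale_id int, amount numeric, region text) LOCATION 
('pxf://pivhdsne:50070/sales_data?profile=Hive') FORMAT 'custom' (formatter='pxfwritable_import');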
Greenplum database doesn’t have PXF but does have GPHDFS. GPHDFS enables Greenplum database to read and write data in HDFS, but it doesn’t have built-in support for Avro, Hive, HBase, or GemFire XD, and it doesn’t gather statistics for these External Tables.
HAWQ is great at exploring and transforming data in Hadoop while Greenplum database is great at bulk loading data from Hadoop into the database as well as bulk writing data from Greenplum database into Hadoop. So land all of your data in Hadoop, transform it with SQL and then create data marts in Greenplum database.
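Here is a rough sketch of that pattern (the HDFS path and column names are made up for illustration): point a readable GPHDFS External Table at the transformed files and then build a data mart from it in Greenplum database.
CREATE EXTERNAL TABLE ext_sales_clean
(sale_id int, amount numeric, region text) LOCATION ('gphdfs://pivhdsne/transformed/sales') FORMAT 'text' (delimiter '|' null 'null');

CREATE TABLE sales_mart AS
SELECT region, sum(amount) AS total_amount
FROM ext_sales_clean
GROUP BY region
DISTRIBUTED BY (region);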
Functions
Both offer functions but HAWQ doesn’t have SECURITY DEFINER functions yet.
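For reference, this is the kind of function you can create in Greenplum database but not yet in HAWQ. The function itself is just a made-up illustration.
--Runs with the privileges of the function owner rather than the caller.
CREATE FUNCTION get_row_count() RETURNS bigint AS $$
  SELECT count(*) FROM my_table;
$$ LANGUAGE sql SECURITY DEFINER;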
DDL Commands
Here is a list of commands that are in Greenplum database but not in HAWQ (a couple of examples follow the list):
  • CREATE AGGREGATE: user-defined aggregate, like SUM and COUNT.
  • CREATE CAST: user-defined conversion between two datatypes.
  • CREATE CONVERSION: user-defined conversion of character set encodings.
  • CREATE DOMAIN: user-defined datatype with optional constraints.
  • CREATE INDEX: indexes aren’t supported in HAWQ.
  • CREATE OPERATOR: user-defined operator, e.g. declaring != to behave the same as <>.
  • CREATE OPERATOR CLASS: user-defined class describing how a data type is used within an index.
  • CREATE RULE: user-defined rewrite rule placed on a table or view, for example to add a filter like “gender_code in (‘M’, ‘F’)” to commands against it.
  • CREATE TABLESPACE: user-defined location in the POSIX filesystem in which to store database objects.
  • CREATE TRIGGER: user-defined trigger for a table. Note that trigger support is very limited in Greenplum database.
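To illustrate a couple of these, the following statements work in Greenplum database but would be rejected by HAWQ. The object names are made up for the example.
--User-defined datatype with a constraint.
CREATE DOMAIN us_postal_code AS text
CHECK (VALUE ~ '^[0-9]{5}$');

--Index on a table; indexes aren't supported in HAWQ.
CREATE INDEX idx_customer_state ON customer (state);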
DML Commands
HDFS is designed for “write once, read many” and cannot modify files in place, which is what DELETE and UPDATE commands require. Because of this, HAWQ doesn’t support UPDATE and DELETE commands while Greenplum database does.
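As a small example (the table and values are made up), these statements are valid in Greenplum database but would fail in HAWQ:
--Supported in Greenplum database, not in HAWQ.
UPDATE customer SET state = 'TX' WHERE state = 'Texas';
DELETE FROM customer WHERE state IS NULL;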
Summary
There are other small differences between the two products but these are the major ones.

When to ANALYZE in Greenplum and HAWQ?

Greenplum and HAWQ will perform an ANALYZE automatically for you so the query optimizer will have good statistics and build a good plan. Greenplum and HAWQ also allow you to configure this if needed with gp_autostats_mode and gp_autostats_on_change_threshold.
gp_autostats_mode
This specifies when an automatic ANALYZE should take place. The values for this configuration parameter can be:
none
on_change
on_no_stats
on_no_stats
The default is on_no_stats and the database will automatically perform an ANALYZE for you. If you INSERT data into an empty table (new table or a table you just did a TRUNCATE on), the database will automatically gather statistics with an ANALYZE.
Typically in an Analytics Data Warehouse, you will insert data once and then read it many times. So the default of on_no_stats will automatically give you statistics as needed with no additional work from you.
none
Self-explanatory: you have to execute ANALYZE yourself if you want statistics.
on_change
With this setting, if you perform an INSERT, UPDATE, or DELETE that exceeds the gp_autostats_on_change_threshold value, then an automatic ANALYZE will happen. Note that for HAWQ, you can only INSERT.
In Greenplum, in the rare case where you are doing a DELETE or UPDATE on a table, you will need to execute an ANALYZE statement yourself. Or, you can set the database to use on_change and the ANALYZE will happen automatically.
gp_autostats_on_change_threshold
This is only relevant for on_change and it is the number of rows that need to change before an automatic ANALYZE will happen. The default is over 2 billion rows so if you really want to use on_change, then you will also need to reduce this configuration parameter to something more realistic.
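You can check the current values of both parameters from a SQL session with SHOW:
SHOW gp_autostats_mode;
SHOW gp_autostats_on_change_threshold;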
Examples
Example 1 – The default
gp_autostats_mode = on_no_stats
gp_autostats_on_change_threshold = 2147483647
CREATE TABLE my_table AS 
SELECT state, count(*) AS counter
FROM customer
GROUP BY state
DISTRIBUTED BY (state);
--Statistics will be gathered automatically.
CREATE TABLE my_table 
(state varchar(100),
 counter int)
DISTRIBUTED BY (state);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically.
TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically.
TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
--Statistics will be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics won't be updated.  The planner will think the table has only 1 row.

ANALYZE my_table;
--Statistics manually gathered; the planner now sees the correct number of rows in the table.
Example 2 – Using on_change
gp_autostats_mode = on_change
gp_autostats_on_change_threshold = 1000000
CREATE TABLE my_table AS 
SELECT state, count(*) AS counter
FROM customer
GROUP BY state
DISTRIBUTED BY (state);
--Statistics will be gathered automatically only if the number of rows is 1M or more.
CREATE TABLE my_table 
(state varchar(100),
 counter int)
DISTRIBUTED BY (state);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
--Statistics will not be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
Checking the Statistics
And here are a couple of queries you can use to see the statistics information.
SELECT c.oid, c.relpages, c.reltuples 
FROM pg_class c 
JOIN pg_namespace n ON c.relnamespace = n.oid 
WHERE n.nspname = 'public' 
AND c.relname = 'my_table';
--using the oid from the previous query
SELECT * FROM pg_statistic WHERE starelid = 298610;
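If you don’t want to copy the oid by hand, you can combine the two lookups into a single query; here is one way to write it:
SELECT s.*
FROM pg_statistic s
JOIN pg_class c ON s.starelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname = 'public'
AND c.relname = 'my_table';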
Summary
Greenplum and HAWQ automatically gather statistics for you in most cases. If you are doing lots of DML activity, you can change the configuration to still automatically gather statistics for you too. These automatic settings make life easier for DBAs, Developers, and Analysts but still give you the flexibility to configure it in the best way for your environment.