Pivotal HD and Greenplum Database Integration
So you are thinking about building a Hadoop cluster, or you already have one and are wondering how to move data between Hadoop and an MPP database like Greenplum Database (GPDB). With GPDB, this transfer can be done in parallel, which makes it an ideal solution for an Analytics, Reporting, or Data Warehousing environment.
In this example, I will generate some dummy data with a SQL statement in GPDB and load it into Pivotal HD (PHD) with an INSERT statement. I will then read the data back out of PHD with a SELECT statement.
One-Time Configuration Steps
1. Download the Pivotal HD release that matches your cluster's version.
https://network.gopivotal.com/products/pivotal-hd
2. Copy the file to the Master server in the GPDB cluster. In my example, gpdbvm43 is a single-node VM with 2 Segments.
scp PHD-2.0.1.0-148.tar.gz root@gpdbvm43:/root
3. ssh as gpadmin to GPDB 4.3 instance
vi .bashrc

[add these entries]
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/gphd
Note: this needs to be repeated on every host in the cluster. You can use gpscp or gpssh to make these changes to all hosts too.
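For example, the note above can be scripted so you don't have to log into each host. This is a sketch only: the host names (sdw1, sdw2) are placeholders for your own segment hosts, and it assumes greenplum_path.sh has been sourced so gpssh is on the PATH.

```shell
# Hypothetical host names -- substitute the hosts in your own GPDB cluster.
# Append the two environment variables to gpadmin's .bashrc on every host at once.
gpssh -h gpdbvm43 -h sdw1 -h sdw2 -e \
  'echo "export JAVA_HOME=/usr/java/latest" >> ~/.bashrc; \
   echo "export HADOOP_HOME=/usr/lib/gphd" >> ~/.bashrc'
```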
4. Make sure the database is running.
gpstart -a
5. Change gp_hadoop_target_version to be compatible with Pivotal HD 2.0 and greater.
gpconfig -c gp_hadoop_target_version -v gphd-2.0
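It is worth confirming the change took effect across the cluster before moving on; gpconfig can report the current value from every host:

```shell
# Show the value of gp_hadoop_target_version on the master and all segments.
gpconfig -s gp_hadoop_target_version
```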
6. Add an entry to your /etc/hosts for every host in the Hadoop cluster. This needs to be done on all hosts in the GPDB cluster too. I'm using a single-node VM of PHD, so I just have one entry.
vi /etc/hosts

[add every node in the PHD cluster]
192.168.239.203 pivhdsne.localdomain pivhdsne
Note: You can then use gpscp to copy the revised hosts file to the other hosts in the cluster.
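A sketch of that gpscp step, assuming hypothetical segment host names (sdw1, sdw2) and that you run it as root since /etc/hosts is root-owned:

```shell
# Push the revised hosts file to every other host in the GPDB cluster.
# "=:" is replaced by each target host name in turn.
gpscp -h sdw1 -h sdw2 /etc/hosts =:/etc/hosts
```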
7. Install the PHD client in the GPDB cluster.
su -
tar --no-same-owner -zxvf PHD-2.0.1.0-148.tar.gz
cd PHD-2.0.1.0-148/utility/rpm
rpm -i *.rpm
cd ../../zookeeper/rpm
rpm -i *.rpm
cd ../../hadoop/rpm
yum install nc
rpm -i *.rpm
exit
Note: You can use gpscp to copy the tar.gz file to the other hosts in the cluster and then use gpssh to execute these commands. Be sure to source the greenplum_path.sh after connecting as root. “nc” may not be needed on your cluster but was required with my VM.
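The distribution step in that note can be sketched as follows; host names are again placeholders, and it assumes you are root with greenplum_path.sh sourced:

```shell
# Hypothetical host names -- substitute your own.
# Push the tarball to every other host, then extract it everywhere at once.
gpscp -h sdw1 -h sdw2 PHD-2.0.1.0-148.tar.gz =:/root/
gpssh -h sdw1 -h sdw2 -e 'cd /root && tar --no-same-owner -zxf PHD-2.0.1.0-148.tar.gz'
```

From there, the same rpm -i commands shown above can be run through gpssh as well.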
8. Now that you are gpadmin again, bounce the database.
gpstop -r
Pivotal HD Configuration
You likely already have this done but if you are using the single node VM of Pivotal HD, then you will need to edit your /etc/hosts file there so that Hadoop is accessible remotely.
1. ssh as root to the VM
[root@pivhdsne ~]# ifconfig
eth1      Link encap:Ethernet  HWaddr 00:0C:29:20:A3:8F
          inet addr:192.168.239.203  Bcast:192.168.239.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe20:a38f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9168 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1199 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:653659 (638.3 KiB)  TX bytes:199320 (194.6 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3779367 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3779367 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4407394192 (4.1 GiB)  TX bytes:4407394192 (4.1 GiB)
Now vi /etc/hosts and map the hostname to the VM's actual IP address instead of 127.0.0.1.
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.239.203 pivhdsne pivhdsne.localdomain
GPDB External Tables
1. Example of Writable External Table to PHD.
CREATE WRITABLE EXTERNAL TABLE ext_foo (i int, bar text)
LOCATION ('gphdfs://pivhdsne/foo_bar')
FORMAT 'text' (delimiter '|' null 'null');
2. Insert some data into PHD from GPDB.
INSERT INTO ext_foo SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;
3. Create External Table to view the data in PHD.
CREATE EXTERNAL TABLE ext_get_foo (i int, bar text)
LOCATION ('gphdfs://pivhdsne/foo_bar')
FORMAT 'text' (delimiter '|' null 'null');
4. Select the data.
postgres=# SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;
 i  |  bar
----+--------
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)
Parallel!
Here is how you can see that it was done in parallel. Notice there are two files in the /foo_bar directory that I specified in the External Writable Table above.
[pivhdsne:~]$ hdfs dfs -ls /foo_bar
Found 2 items
-rw-r--r--   3 gpadmin hadoop        490 2014-05-24 00:19 /foo_bar/0_1400790662-0000004541
-rw-r--r--   3 gpadmin hadoop        494 2014-05-24 00:19 /foo_bar/1_1400790662-0000004541
There are two files because my single node VM of GPDB has two Segments. Each Segment wrote its file to Hadoop at the same time. Completely parallel and scalable.
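You can also look at the split from inside the database. This is a sketch under one assumption worth verifying on your own cluster: that the gp_segment_id pseudo-column is exposed for external table scans in your GPDB version, in which case it tells you which segment read each row.

```sql
-- Count how many rows each Segment read from the external table.
SELECT gp_segment_id, count(*)
FROM ext_get_foo
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```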
More proof!
[pivhdsne:~]$ hdfs dfs -cat /foo_bar/0_1400790662-0000004541 | more
1|bar_1
3|bar_3
5|bar_5
7|bar_7
9|bar_9
...

[pivhdsne:~]$ hdfs dfs -cat /foo_bar/1_1400790662-0000004541 | more
2|bar_2
4|bar_4
6|bar_6
8|bar_8
10|bar_10
...
Extra Credit!
Log into HAWQ on the PHD cluster and create an External Table to the same files!
CREATE EXTERNAL TABLE ext_get_foo (i int, bar text)
LOCATION ('pxf://pivhdsne:50070/foo_bar?profile=HdfsTextSimple')
FORMAT 'text' (delimiter '|' null 'null');

gpadmin=# SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;
 i  |  bar
----+--------
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)
As you can see, there are many ways to move data between Greenplum Database and Hadoop, covering a wide variety of use cases for solving business problems.