Tuesday, 21 October 2014

Apache Hive – Getting Started

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. (Source: hive.apache.org)

This post is a fast-paced, instruction-based tutorial that dives directly into using Hive.

Creating a database

A database can be created using the CREATE DATABASE command at the Hive prompt.
Syntax:
CREATE DATABASE <database_name>;
E.g.
hive> CREATE DATABASE test_hive_db;
OK
Time taken: 0.048 seconds
The CREATE DATABASE command creates a directory for the database in HDFS under the default warehouse location, /user/hive/warehouse.
This can be verified using the DESCRIBE DATABASE command.
Syntax:
DESCRIBE DATABASE <database_name>;
E.g.
hive> DESCRIBE DATABASE test_hive_db;
OK
test_hive_db hdfs://localhost:54310/user/hive/warehouse/test_hive_db.db
Time taken: 0.042 seconds, Fetched: 1 row(s)
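As an aside, the default location can be overridden at creation time with a LOCATION clause (and an optional COMMENT). The following is only a sketch; the comment text and HDFS path are placeholders, not values from this setup:
hive> CREATE DATABASE IF NOT EXISTS test_hive_db
    >   COMMENT 'Scratch database for this tutorial'
    >   LOCATION '/user/hive/custom_warehouse/test_hive_db.db';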

Using a database

To start working in a database, we can use the USE command.
Syntax:
USE <database_name>;
E.g.
hive> USE test_hive_db;
OK
Time taken: 0.045 seconds
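To check which databases exist, or which tables the current database contains, the SHOW commands come in handy. A quick sketch (the output will depend on what has been created in your installation):
hive> SHOW DATABASES;
hive> SHOW TABLES;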

Dropping a database

To drop a database we can use the DROP DATABASE command.
Syntax:
DROP DATABASE <database_name>;
E.g.
hive> DROP DATABASE test_hive_db;
OK
Time taken: 0.233 seconds
To drop a database that still has tables in it, you need to add the CASCADE keyword to the DROP DATABASE command, as shown below.
Syntax:
DROP DATABASE <database_name> CASCADE;
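Following the same pattern as the earlier examples, dropping the test database along with any tables it contains would look like this (shown for illustration, not captured from a live session):
hive> DROP DATABASE test_hive_db CASCADE;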

Apache Hadoop Streaming

Apache Hadoop Streaming is a feature that allows developers to write MapReduce applications in languages such as Python, Ruby, and so on. Any language that can read from standard input (STDIN) and write to standard output (STDOUT) can be used to write the map and reduce functions.
In this post, I use Ruby to write the map and reduce functions.
First, let’s have some sample data. For a simple test, I have one file that has just one line with a few repeating words.
The contents of the file (sample.txt) are as follows:
she sells sea shells on the sea shore where she sells fish too
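If you want to create the same file locally, a simple echo will do:
$ echo "she sells sea shells on the sea shore where she sells fish too" > sample.txt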
Next, let’s create a Ruby file for the map function and call it map.rb.
Contents of map.rb
#!/usr/bin/env ruby

# Mapper: read lines from STDIN, split each line into words,
# and emit "word<TAB>1" for every word seen.
STDIN.each do |line|
  line.split.each do |word|
    puts word + "\t" + "1"
  end
end
In the above map code, we split each line into words and emit each word as a key with a value of 1.
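The mapper can be tried on its own before the reducer is involved. Running it over the sample file should print one word per line followed by a tab and a 1, in the order the words appear (only the first few lines are shown here):
$ cat sample.txt | ruby map.rb
she 1
sells 1
sea 1
shells 1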
Now, let’s create a Ruby file for the reduce function and call it reduce.rb.
Contents of reduce.rb
#!/usr/bin/env ruby

# Reducer: the input is sorted by key, so we can keep a running
# count for the current key and print a total whenever the key changes.
prev_key = nil
init_val = 1

STDIN.each do |line|
  key, value = line.split("\t")
  if prev_key != nil && prev_key != key
    # Key changed: print the total for the previous key and reset the count.
    puts prev_key + "\t" + init_val.to_s
    prev_key = key
    init_val = 1
  elsif prev_key == nil
    # Very first line: remember the key; init_val already counts this occurrence.
    prev_key = key
  elsif prev_key == key
    # Same key as before: add this value to the running total.
    init_val = init_val + value.to_i
  end
end

# Emit the total for the last key.
puts prev_key + "\t" + init_val.to_s
In the above reduce code, the input arrives sorted by key, so we keep a running count for the current key and print the total whenever the key changes, plus once more at the end for the last key.
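The reducer can also be sanity-checked by hand with a few pre-sorted, tab-separated lines (printf is used here only to fake the mapper’s output format):
$ printf "sea\t1\nsea\t1\nshe\t1\n" | ruby reduce.rb
sea 2
she 1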
We are now ready to test the map and reduce functions locally before running them on the cluster.
Execute the following:
$ cat sample.txt | ruby map.rb | sort | ruby reduce.rb
The output should be as follows:
fish 1
on 1
sea 2
sells 2
she 2
shells 1
shore 1
the 1
too 1
where 1
In the above command, the contents of sample.txt are piped to map.rb, whose output is sorted and then passed to reduce.rb. The sort step stands in for the shuffle-and-sort phase that Hadoop performs between the map and reduce stages, which is why the reducer can assume its input is grouped by key.
It looks like our program is working as expected. Now it’s time to deploy it on a Hadoop cluster.
First, let’s move the sample data to a folder in HDFS:
$ hadoop fs -copyFromLocal sample.txt /user/data/
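To confirm the file landed where we expect, a quick listing of the directory should show it:
$ hadoop fs -ls /user/data/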
Once we have our sample data in HDFS, we can run the following command from the hadoop/bin folder to launch our MapReduce job:
$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -file map.rb -mapper map.rb -file reduce.rb -reducer reduce.rb -input /user/data/* -output /user/wc
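One thing to keep in mind: Hadoop refuses to start a job whose output directory already exists, so if you need to re-run the job you must remove /user/wc first. On the Hadoop 1.x release used in this post that would be something like:
$ hadoop fs -rmr /user/wc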
If the job runs successfully, you should see output similar to the following on your terminal:
packageJobJar: [map.rb, reduce.rb, /home/hduser/tmp/hadoop-unjar2392048729049303810/] [] /tmp/streamjob3038768339999397115.jar tmpDir=null
13/12/12 10:25:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/12 10:25:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/12/12 10:25:01 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/12 10:25:01 INFO streaming.StreamJob: getLocalDirs(): [/home/hduser/tmp/mapred/local]
13/12/12 10:25:01 INFO streaming.StreamJob: Running job: job_201312120020_0007
13/12/12 10:25:01 INFO streaming.StreamJob: To kill this job, run:
13/12/12 10:25:01 INFO streaming.StreamJob: /home/hduser/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://localhost:9001 -kill job_201312120020_0007
13/12/12 10:25:01 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201312120020_0007
13/12/12 10:25:02 INFO streaming.StreamJob: map 0% reduce 0%
13/12/12 10:25:05 INFO streaming.StreamJob: map 50% reduce 0%
13/12/12 10:25:06 INFO streaming.StreamJob: map 100% reduce 0%
13/12/12 10:25:13 INFO streaming.StreamJob: map 100% reduce 33%
13/12/12 10:25:14 INFO streaming.StreamJob: map 100% reduce 100%
13/12/12 10:25:15 INFO streaming.StreamJob: Job complete: job_201312120020_0007
13/12/12 10:25:15 INFO streaming.StreamJob: Output: /user/wc
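The results land in the output directory we passed to the job. Listing it should show a part file written by the (single) reducer, typically named part-00000:
$ hadoop fs -ls /user/wc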
Now, let’s look at the output file generated to see our results:
$ hadoop fs -cat /user/wc/part-00000

fish 1
on 1
sea 2
sells 2
she 2
shells 1
shore 1
the 1
too 1
where 1
The results are just as we expected. We have successfully built and executed a Hadoop MapReduce application written in Ruby using Hadoop Streaming.