Tuesday 21 October 2014

Apache Hadoop Streaming

Apache Hadoop Streaming is a utility that allows developers to write MapReduce applications in languages like Python, Ruby, etc. Any language that can read from standard input (STDIN) and write to standard output (STDOUT) can be used to write the map and reduce programs.
In this post, I use Ruby to write the map and reduce functions.
First, let’s get some sample data. For a simple test, I have one file that has just one line with a few repeating words.
The contents of the file (sample.txt) are as follows:
she sells sea shells on the sea shore where she sells fish too
Next, let’s create a Ruby file for the map function and call it map.rb.
Contents of map.rb:
#!/usr/bin/env ruby

# Mapper: split each input line into words and emit each word as a
# key, followed by a tab and the count 1.
STDIN.each do |line|
  line.split.each do |word|
    puts word + "\t" + "1"
  end
end
In the above map code, we are splitting each line into words and emitting each word as a key with value 1.
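To see what the mapper emits on its own, you can run it directly against the sample file; for our one-line input it should print one key/value pair per word:
$ cat sample.txt | ruby map.rb
she 1
sells 1
sea 1
shells 1
on 1
the 1
sea 1
shore 1
where 1
she 1
sells 1
fish 1
too 1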
Now, let’s create a Ruby file for the reduce function and call it reduce.rb.
Contents of reduce.rb:
#!/usr/bin/env ruby

# Reducer: input arrives sorted by key, so identical keys are adjacent.
# Keep a running count for the current key and print it when the key changes.
prev_key = nil
init_val = 1

STDIN.each do |line|
  key, value = line.split("\t")
  if prev_key != nil && prev_key != key
    # key changed: print the total for the previous key and reset
    puts prev_key + "\t" + init_val.to_s
    prev_key = key
    init_val = 1
  elsif prev_key == nil
    # first line of input
    prev_key = key
  elsif prev_key == key
    # same key: add this line's value to the running total
    init_val = init_val + value.to_i
  end
end

# print the total for the last key
puts prev_key + "\t" + init_val.to_s
In the above reduce code we rely on the input being sorted by key, so all lines for a given word arrive together; we keep a running total for the current key and print it whenever the key changes (and once more at the end for the last key).
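For example, after sorting, the reducer receives consecutive lines such as:
sea 1
sea 1
sells 1
sells 1
and emits sea 2 followed by sells 2 as soon as the key changes.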
We are now ready to test the map and reduce functions locally before we run them on the cluster.
Execute the following:
$ cat sample.txt | ruby map.rb | sort | ruby reduce.rb
The output should be as follows:
fish 1
on 1
sea 2
sells 2
she 2
shells 1
shore 1
the 1
too 1
where 1
In the above command we provide the contents of sample.txt as input to map.rb, and the mapper’s output is sorted before it is piped to reduce.rb; this mirrors the sort that Hadoop performs between the map and reduce phases.
It looks like our program is working as expected. Now it’s time to deploy it on a Hadoop cluster.
First, let’s move the sample data to a directory in HDFS:
$ hadoop fs -copyFromLocal sample.txt /user/data/
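Also make sure the Ruby scripts are executable, since Hadoop Streaming launches them directly via their shebang line (alternatively, you can pass the interpreter explicitly, e.g. -mapper "ruby map.rb"):
$ chmod +x map.rb reduce.rb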
Once we have our sample data in HDFS, we can run the following command from the hadoop/bin folder to launch our MapReduce job:
$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -file map.rb -mapper map.rb -file reduce.rb -reducer reduce.rb -input /user/data/* -output /user/wc
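Note that the output directory (/user/wc) must not already exist when the job starts; if you need to re-run the job, remove it first (Hadoop 1.x syntax):
$ hadoop fs -rmr /user/wc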
If everything goes fine you should see the following output on your terminal:
packageJobJar: [map.rb, reduce.rb, /home/hduser/tmp/hadoop-unjar2392048729049303810/] [] /tmp/streamjob3038768339999397115.jar tmpDir=null
13/12/12 10:25:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/12 10:25:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/12/12 10:25:01 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/12 10:25:01 INFO streaming.StreamJob: getLocalDirs(): [/home/hduser/tmp/mapred/local]
13/12/12 10:25:01 INFO streaming.StreamJob: Running job: job_201312120020_0007
13/12/12 10:25:01 INFO streaming.StreamJob: To kill this job, run:
13/12/12 10:25:01 INFO streaming.StreamJob: /home/hduser/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://localhost:9001 -kill job_201312120020_0007
13/12/12 10:25:01 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201312120020_0007
13/12/12 10:25:02 INFO streaming.StreamJob: map 0% reduce 0%
13/12/12 10:25:05 INFO streaming.StreamJob: map 50% reduce 0%
13/12/12 10:25:06 INFO streaming.StreamJob: map 100% reduce 0%
13/12/12 10:25:13 INFO streaming.StreamJob: map 100% reduce 33%
13/12/12 10:25:14 INFO streaming.StreamJob: map 100% reduce 100%
13/12/12 10:25:15 INFO streaming.StreamJob: Job complete: job_201312120020_0007
13/12/12 10:25:15 INFO streaming.StreamJob: Output: /user/wc
Now, let’s look at the output file generated to see our results:
$ hadoop fs -cat /user/wc/part-00000

fish 1
on 1
sea 2
sells 2
she 2
shells 1
shore 1
the 1
too 1
where 1
The results are just as we expected. We have successfully built and executed a Hadoop MapReduce application written in Ruby using Hadoop Streaming.
