
Tuesday 21 October 2014

Why is Pig called a data flow language?

Hadoop is composed of two main components.
1. HDFS – Stores large amounts of data
2. MapReduce – Processes data stored in HDFS
Processing data involves quite a few common problems. In software terminology, solutions to commonly occurring problems are called design patterns. Following are some of the common problems that we as MapReduce developers have to solve.
1. Interpret data stored using some schema
2. Filter some of the data using some conditional logic
3. Sort the data
4. Apply joins between two or more data sets.
As developers, the first thing we would like to do is create a generic framework that can be reused in all cases involving the above common problems.
Apache Pig is a generic framework consisting of implementations of many MapReduce design patterns.
Apache Pig is implemented in the Java programming language.
Instead of providing a Java-based API, Pig provides its own scripting language, called Pig Latin.
Pig Latin is a very simple scripting language. It has constructs that can be used to apply different transformations on the data one after another.
The diagram above shows a sample data flow: after the data is loaded, multiple operators (e.g. filter, group, sort) are applied to it before the final output is stored.
Pig provides developers with many operators that can be applied to data one after another to produce the final output.
Once data is loaded, it flows through a chain of Pig operators.
This is why Pig is called a data flow language.
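To make this concrete, here is a minimal Pig Latin sketch of such a data flow. The file name, schema and field names are hypothetical, chosen only to illustrate how operators chain together:

```pig
-- load raw records (hypothetical input file and schema)
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- filter: keep only adult users
adults = FILTER users BY age >= 18;
-- group by age and count members of each group
by_age = GROUP adults BY age;
counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS cnt;
-- sort the aggregated result
sorted = ORDER counts BY cnt DESC;
-- store the final output
STORE sorted INTO 'output';
```

Each statement consumes the relation produced by the previous one, so the data literally flows through the chain of operators from LOAD to STORE.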

Code, Debug & Test Apache Pig Scripts using Eclipse on Windows

Introduction
While developing software, knowing how to debug the code is one of the most important skills. It helps us solve bugs easily and also understand the internals of the framework code we depend on. This definitely applies to Apache Pig scripting. In this blog I will explain how to code, debug and test Apache Pig scripts using Eclipse on Windows.
Prerequisites:
  1. Install Eclipse Juno or above
  2. Install m2eclipse plugin
  3. Install JDK 1.6 or above
  4. Install Cygwin 1.7.5 or above
  5. Add the <CYGWIN_HOME>/bin folder to the PATH environment variable, where CYGWIN_HOME is the installation directory of Cygwin.
Steps:
Before you start with the following steps, make sure all prerequisites are met.
1. Start Eclipse
2. From Eclipse File menu, create a new project.
3. From the New Project wizard, select Maven Project and click the Next button.
4. On the New Maven Project screen, click the Next button.
5. Select “maven-archetype-quickstart” as the project archetype and click the Next button.
6. Enter an appropriate Group Id, Artifact Id, Version & Package name and click the Finish button.
7. This creates a Maven project, which is shown in the Package Explorer. Expand the project to see its structure: it contains an auto-generated App.java file and a corresponding AppTest.java file holding the JUnit test code, plus pom.xml, the Maven metadata file.
8. Double-click pom.xml to open it in the POM editor.
9. Click the pom.xml tab in the POM editor to see the contents of pom.xml.
10. Add the Cloudera repository information to pom.xml. This project needs some important dependencies (e.g. pig, pigunit & hadoop), which are available as Maven artifacts in Cloudera’s Maven repository.
<repositories>
  <repository>
    <id>cloudera-releases</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
</repositories>
11. Add dependencies on pig, pigunit, hadoop-core and a few supporting libraries such as antlr and jackson. We are going to debug Pig scripts in Eclipse, which requires pig and pigunit; because pig requires hadoop-core to work, we also need a dependency on hadoop-core.
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.2-cdh3u6</version>
</dependency>
<dependency>
  <groupId>org.apache.pig</groupId>
  <artifactId>pigunit</artifactId>
  <version>0.10.0-cdh3u4</version>
</dependency>
<dependency>
  <groupId>org.apache.pig</groupId>
  <artifactId>pig</artifactId>
  <version>0.10.0-cdh3u4</version>
</dependency>
<dependency>
  <groupId>org.antlr</groupId>
  <artifactId>antlr</artifactId>
  <version>3.5</version>
</dependency>
<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.9.12</version>
</dependency>
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>3.8.1</version>
  <scope>test</scope>
</dependency>
12. While debugging, we also need the source code and Javadoc of the dependencies. To enable downloading them, go to Window > Preferences > Maven.
13. Select the checkboxes for Download Artifact Sources and Download Artifact JavaDoc. Click Apply and then OK. After this, the m2eclipse plugin downloads the sources and Javadoc and attaches them to the artifact JARs.
14. Right-click the main folder and create a new folder under it.
15. Name the newly created folder resources.
16. In the resources folder, create two files named wordcount.pig and sample.data. In the wordcount.pig file, we are going to write Pig code that counts the number of occurrences of each word present in the sample.data file.
17. Add data to the sample.data file.
Johny, Johny!
Yes, Papa
Eating sugar?
No, Papa
Telling lies?
No, Papa
Open your mouth!
Ha! Ha! Ha!
18. Add the Pig code to the wordcount.pig file.
-- load each line of the input file as a single field
A = load 'src/main/resources/sample.data';
-- split each line into words, one word per record
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- group identical words together
C = group B by word;
-- count the occurrences of each word
D = foreach C generate COUNT(B), group;
-- print the result to the console
dump D;
19. Now we need to add a PigUnit test case to test and debug this wordcount.pig file.
20. Double-click the AppTest.java file to open it in the Java editor.
21. In AppTest.java, remove all the existing methods from the AppTest class and add the testWordCountScript method.
// requires: import org.apache.pig.pigunit.PigTest;
public void testWordCountScript() throws Exception {
    PigTest pigTest = new PigTest("src/main/resources/wordcount.pig");
    pigTest.assertOutput("D", new String[] { "(2,No)", "(3,Ha!)",
            "(1,Yes)", "(1,Open)", "(3,Papa)", "(1,your)", "(1,Johny)",
            "(1,lies?)", "(1,Eating)", "(1,Johny!)", "(1,mouth!)",
            "(1,sugar?)", "(1,Telling)" });
}
22. Since we are going to run the Pig code inside Eclipse, we need a larger heap when running the PigUnit test case.
23. To do that, select the AppTest.java file and go to Run > Run Configurations…
24. In the Run Configurations window, double-click JUnit to create a run configuration for AppTest.
25. Go to the Arguments tab and add “-Xmx1024m” to the VM arguments section to set the maximum JVM heap to 1 GB.
26. Select the AppTest.java file again and go to Run > Debug Configurations…
27. Select AppTest and click Debug.
28. The test case should execute successfully and a green bar is shown.
29. Now we want to debug the COUNT UDF. Press Ctrl+Shift+T, the Eclipse shortcut to open a class. In the Open Type window, type COUNT in the first text field; this automatically shows all classes whose names match COUNT. We are interested in the COUNT class in the org.apache.pig.builtin package. Select it and click the OK button.
30. Because we have enabled source attachment for the JARs, the source code of the COUNT UDF is shown in the Java editor.
31. Because COUNT is an aggregate (algebraic) UDF, it contains implementations for the Initial, Intermed & Final stages. Put a debug breakpoint in the exec function of each of these three stage implementations.
32. Select the AppTest.java file again and go to Run > Debug Configurations…
33. Select AppTest and click Debug.
34. A “Confirm Perspective Switch” dialog box will appear. Click the Yes button.
35. You can see the active breakpoint being hit in the Java editor. Using Eclipse’s debug functionality, you can now step through the complete COUNT UDF.
36. Now you are ready to debug any UDF, or even Pig’s own code.
Hope it helps!