Tuesday 21 October 2014

Why is Pig called a data flow language?

Hadoop is composed of two main components.
1. HDFS – Stores large amounts of data
2. MapReduce – Processes data stored in HDFS
Processing data involves quite a few common problems. In software terminology, solutions to commonly occurring problems are called design patterns. The following are some of the common problems that we, as MapReduce developers, have to solve.
1. Interpret data stored using some schema
2. Filter some of the data using some conditional logic
3. Sort the data
4. Apply joins between two or more data sets.
As developers, the first thing we would like to do is create a generic framework that can be reused in all cases involving the above common problems.
Apache Pig is such a generic framework, consisting of implementations of many MapReduce design patterns.
Apache Pig is implemented in the Java programming language.
However, instead of providing a Java-based API, Pig provides its own scripting language, called Pig Latin.
Pig Latin is a very simple scripting language. It has constructs that can be used to apply different transformations to the data, one after another.
The diagram above shows a sample data flow: after the data is loaded, multiple operators (e.g. filter, group, sort) are applied to it before the final output is stored.
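A minimal Pig Latin sketch of such a flow might look like the following (the input file visits.log and the fields userid, url and ts are hypothetical names used only for illustration):

    -- Load the raw data and interpret it with a simple schema
    visits  = LOAD 'visits.log' USING PigStorage('\t')
              AS (userid:chararray, url:chararray, ts:long);

    -- Filter out records we are not interested in
    valid   = FILTER visits BY url IS NOT NULL;

    -- Group the remaining records by user and count visits per user
    grouped = GROUP valid BY userid;
    counts  = FOREACH grouped GENERATE group AS userid, COUNT(valid) AS num_visits;

    -- Sort by the computed count and store the final output
    sorted  = ORDER counts BY num_visits DESC;
    STORE sorted INTO 'visit_counts';

Each statement simply names the next operator in the flow; the output of one statement becomes the input of the next.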
Pig provides developers with many operators that can be applied to the data one after another to produce the final output.
Once the data is loaded, it flows through this chain of Pig operators.
This is why Pig is called a data flow language.
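As a rough illustration of how such a script is run, if the sketch above were saved in a file named visit_counts.pig (again a hypothetical name), it could be executed in Pig's local mode with:

    pig -x local visit_counts.pig

Behind the scenes, Pig translates the statements into one or more MapReduce jobs, so the developer works with the data flow rather than with the MapReduce API directly.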
