Pig

Hive Vs Pig

 

Feature                        | Hive              | Pig
-------------------------------|-------------------|----------------
Language                       | SQL-like          | Pig Latin
Schemas/Types                  | Yes (explicit)    | Yes (implicit)
Partitions                     | Yes               | No
Server                         | Optional (Thrift) | No
User Defined Functions (UDF)   | Yes (Java)        | Yes (Java)
Custom Serializer/Deserializer | Yes               | Yes
DFS Direct Access              | Yes (implicit)    | Yes (explicit)
Join/Order/Sort                | Yes               | Yes
Shell                          | Yes               | Yes
Streaming                      | Yes               | Yes
Web Interface                  | Yes               | No
JDBC/ODBC                      | Yes (limited)     | No

Apache Pig and Hive are two projects that layer on top of Hadoop and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a Bash or Perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
If Pig is "Scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific and higher-level language for querying data by running Hadoop jobs, rather than directly scripting, step by step, the operation of several MapReduce jobs on Hadoop. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running batch-oriented queries over massive data; it's not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data across data stored in HDFS and similar storage systems.

 

WORD COUNT EXAMPLE - PIG SCRIPT

Q) How do you find the number of occurrences of each word in a file using a Pig script?

You can find the famous word count example written as a MapReduce program on the Apache website. Here we will write a simple Pig script for the word count problem.

The following Pig script finds the number of times each word is repeated in a file:

Word Count Example Using Pig Script:

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

The above Pig script first loads the file line by line, then splits each line into words using the TOKENIZE function. TOKENIZE returns a bag of words, and FLATTEN unnests that bag so that each word becomes a separate record. In the third statement the records are grouped by word, so that the count can be computed in the fourth statement.
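
If you want to inspect the intermediate structure, you can describe the grouped relation. The output looks roughly like the comment below; the exact formatting may vary by Pig version.

DESCRIBE grouped;
-- grouped: {group: chararray, words: {(word: chararray)}}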

With just five lines of Pig, we have solved the word count problem very easily.
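
As a variation, you could sort the words by their counts and persist the result to HDFS instead of dumping it to the console. A short sketch; the output path is an assumption:

ordered = ORDER wordcount BY $1 DESC;   -- sort by the count, highest first
STORE ordered INTO '/user/hadoop/wordcount_output';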

 

HOW TO FILTER RECORDS - PIG TUTORIAL EXAMPLES

Pig allows you to remove unwanted records based on a condition. The FILTER functionality is similar to the WHERE clause in SQL. The FILTER operator in Pig is used to remove unwanted records from the data file. The syntax of the FILTER operator is shown below:
<new relation> = FILTER <relation> BY <condition>

Here, relation is the data set on which the filter is applied, condition is the filter condition, and new relation is the relation created after filtering the rows.

Pig Filter Examples: 

Let's consider the following sales data set as an example:
year,product,quantity
---------------------
2000, iphone, 1000
2001, iphone, 1500 
2002, iphone, 2000
2000, nokia,  1200
2001, nokia,  1500
2002, nokia,  900

1. Select products whose quantity is greater than or equal to 1000.
grunt> A = LOAD '/user/hadoop/sales' USING PigStorage(',') AS (year:int,product:chararray,quantity:int);
grunt> B = FILTER A BY quantity >= 1000;
grunt> DUMP B;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)

2. Select products whose quantity is greater than 1000 and the year is 2001.
grunt> C = FILTER A BY quantity > 1000 AND year == 2001;
grunt> DUMP C;
(2001,iphone,1500)
(2001,nokia,1500)

3. Select products whose year is not 2000.
grunt> D = FILTER A BY year != 2000;
grunt> DUMP D;
(2001,iphone,1500)
(2002,iphone,2000)
(2001,nokia,1500)
(2002,nokia,900)

You can use all the logical operators (NOT, AND, OR) and relational operators (<, >, ==, !=, >=, <=) in filter conditions.
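
For instance, here is a hedged sketch that combines these operators on the same sales relation A; the expected output follows from the data set above:

grunt> E = FILTER A BY (product == 'iphone' AND quantity >= 1500) OR NOT (year == 2002);
grunt> DUMP E;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)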

 

CREATING SCHEMA, READING AND WRITING DATA - PIG TUTORIAL

The first step in processing a data set using Pig is to define a schema for the data set. A schema is a representation of the data set in terms of fields. Let's see how to define a schema with an example.

Consider the following products data set in Hadoop as an example: 
10, iphone,  1000
20, samsung, 2000
30, nokia,   3000

Here the first field is the product id, the second field is the product name, and the third field is the product price.

Defining Schema: 

The LOAD operator is used to define a schema for a data set. Let's see different usages of the LOAD operator for defining the schema of the above data set.

1. Creating a schema without specifying any fields.

In this method, we don't specify any field names when creating the schema. An example is shown below:
grunt> A = LOAD '/user/hadoop/products';

Pig is a data flow language. Each operational statement in Pig consists of a relation and an operation: the left side of the statement is called the relation and the right side is called the operation. Pig statements must be terminated with a semicolon. Here A is a relation, and /user/hadoop/products is the input file in HDFS.

To view the schema of a relation, use the describe statement as shown below:
grunt> describe A;
Schema for A unknown.

As no fields are defined, the above describe statement on A reports "Schema for A unknown". To display the contents of a relation on the console, use the DUMP operator:
grunt> DUMP A;
(10,iphone,1000)
(20,samsung,2000)
(30,nokia,3000)

To write a data set to HDFS, use the STORE operator as shown below:
grunt> STORE A INTO '<hdfs output directory>';

2. Defining a schema without specifying any data types.

We can create a schema just by specifying the field names without any data types. An example is shown below: 
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id, product_name, price);

grunt> describe A;
A: {id: bytearray,product_name: bytearray,price: bytearray}

grunt> STORE A INTO '/user/hadoop/products_piped' USING PigStorage('|'); --Writes data with pipe as delimiter into a new HDFS directory.

PigStorage is used to specify the field delimiter. The default delimiter is tab, so if your data is tab-separated you can omit the USING PigStorage clause. In a STORE operation, you can likewise use PigStorage to specify the output separator. Note that the STORE target directory must not already exist, which is why the example above writes to a new directory rather than back to the input path.
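
For instance, a minimal sketch that relies on the default tab delimiter; the path is a hypothetical tab-separated copy of the data:

grunt> A = LOAD '/user/hadoop/products_tab' AS (id, product_name, price);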

You have to specify the field names in the AS clause. As we didn't specify any data types, Pig assigned the default type, bytearray, to each field.

3. Defining a schema with field names and data types.

To specify a data type, use a colon after the field name. Take a look at the example below:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);

grunt> describe A;
A: {id: int,product_name: chararray,price: int}

Accessing the Fields: 

So far, we have seen how to define a schema, how to print the contents of a relation on the console, and how to write data to HDFS. Now we will see how to access the fields.

The fields can be accessed in two ways: 

  • Field names: We can specify a field name to access the values of that particular field.
  • Positional parameters: Field positions start at 0, so $0 refers to the first field, $1 to the second, and so on.

Example:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);
grunt> B = FOREACH A GENERATE id;
grunt> C = FOREACH A GENERATE $1,$2;
grunt> DUMP B;
(10)
(20)
(30)
grunt> DUMP C;
(iphone,1000)
(samsung,2000)
(nokia,3000)

FOREACH is like a for loop used to iterate over the records of a relation. The GENERATE keyword specifies what to produce from each record. In the above example, GENERATE is used to project fields from the relation A.
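
GENERATE can also evaluate expressions rather than just project fields. A hedged sketch; the 10% discount is an arbitrary illustration:

grunt> D = FOREACH A GENERATE product_name, price * 0.9 AS discounted_price;
grunt> DUMP D;
(iphone,900.0)
(samsung,1800.0)
(nokia,2700.0)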

Note: It is always good practice to view the schema of a relation using the describe statement before performing an operation. By knowing the schema, you will know how to access the fields in it.

 

PIG DATA TYPES - PRIMITIVE AND COMPLEX

Pig has a fairly limited set of data types, classified into two categories. They are:
  • Primitive
  • Complex

Primitive Data Types: The primitive data types are also called simple data types. The simple data types that Pig supports are:
  • int : a signed 32-bit integer, similar to the Integer in Java.
  • long : a signed 64-bit integer, similar to the Long in Java.
  • float : a 32-bit floating-point number, similar to the Float in Java.
  • double : a 64-bit floating-point number, similar to the Double in Java.
  • chararray : a character array in Unicode UTF-8 format. This corresponds to Java's String object.
  • bytearray : used to represent raw bytes. It is the default data type: if you don't specify a data type for a field, bytearray is assigned.
  • boolean : represents true/false values.
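
As a quick illustration, here is a sketch of a schema using several of these types; the file path and field names are assumptions, and declaring boolean fields requires a reasonably recent Pig version:

grunt> A = LOAD '/user/hadoop/inventory' USING PigStorage(',') AS (id:long, name:chararray, price:double, in_stock:boolean);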

Complex Types: Pig supports three complex data types. They are listed below:
  • Tuple : An ordered set of fields. A tuple is written with parentheses. Example: (1,2)
  • Bag : A collection of tuples. A bag is written with curly braces. Example: {(1,2),(3,4)}
  • Map : A set of key-value pairs, written with square brackets. Example: [key#value]. The # separates the key and the value.

Pig allows nesting of complex data structures. For example, you can nest a tuple inside a tuple, a bag, or a map, as sketched below.
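
Here is a hedged sketch of a schema nesting a bag of tuples and a map inside each record; the path and field names are assumptions:

grunt> A = LOAD '/user/hadoop/students' AS (id:int, scores:bag{t:tuple(subject:chararray, mark:int)}, attrs:map[]);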

Null: Null is not a data type. A null is an undefined or corrupted value. For example, say you have declared a field as int, but the field actually contains character values. When reading data from this field, Pig converts those corrupted character values into nulls. Any operation involving a null results in null. Null in Pig is similar to NULL in SQL.
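
For example, to screen out nulls that came from corrupted input, you can filter with IS NOT NULL. A short sketch against the products schema defined earlier:

grunt> B = FILTER A BY price IS NOT NULL;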

 

RELATIONS, BAGS, TUPLES, FIELDS - PIG TUTORIAL

In this article, we will see what a relation, bag, tuple, and field are. Let's look at each of these in detail.

Let's consider the following products data set as an example:

Id, product_name
-----------------------
10, iphone
20, samsung
30, Nokia

  • Field: A field is a piece of data. In the above data set, product_name is a field.
  • Tuple: A tuple is a set of fields. Here Id and product_name together form a tuple. Tuples are written with parentheses. Example: (10, iphone).
  • Bag: A bag is a collection of tuples. A bag is written with curly braces. Example: {(10,iphone),(20,samsung),(30,Nokia)}.
  • Relation: A relation is the outermost structure. A relation is a bag; to be precise, an outer bag. We can describe a relation as a bag of tuples.
To compare with an RDBMS, a relation is like a table, and the tuples in the bag correspond to the rows in the table. Note, however, that tuples in Pig are not required to contain the same number of fields, and fields in the same position need not have the same data type.

 

HOW TO RUN PIG PROGRAMS - EXAMPLES

Pig programs can be run in three ways, all of which work in both local and MapReduce mode. They are:

  • Script Mode
  • Grunt Mode
  • Embedded Mode
Let's see each mode in detail.

Script Mode or Batch Mode: In script mode, Pig runs the commands specified in a script file. The following example shows how to run a Pig program from a script file:
> cat scriptfile.pig
A = LOAD 'script_file';
DUMP A;
> pig scriptfile.pig

(pig script mode example)
(pig runs on top of hadoop)

Grunt Mode or Interactive Mode: Grunt mode can also be called interactive mode. Grunt is Pig's interactive shell; it is started when no file is specified for Pig to run.
> pig
grunt> A = LOAD 'grunt_file';
grunt> DUMP A;

(pig grunt or interactive mode example)
(pig runs on top of hadoop)

You can also run Pig scripts from Grunt using the run and exec commands. The difference is that run executes the script in the context of the current Grunt shell, so the script's aliases remain accessible afterwards, while exec runs the script in a separate context.
grunt> run scriptfile.pig
grunt> exec scriptfile.pig

Embedded Mode: You can embed Pig programs in Java and run them from a Java application, for example through Pig's PigServer class.
