Monday 23 June 2014

SQL ==> PIG

How to map SQL queries to Pig Latin scripts

Each entry shows the SQL form and its Pig Latin equivalent.

SELECT
  SQL: SELECT column_name, column_name FROM table_name;
  Pig: FOREACH alias GENERATE column_name, column_name;

SELECT *
  SQL: SELECT * FROM table_name;
  Pig: FOREACH alias GENERATE *;

DISTINCT
  SQL: SELECT DISTINCT column_name, column_name FROM table_name;
  Pig: DISTINCT (FOREACH alias GENERATE column_name, column_name);

WHERE
  SQL: SELECT column_name, column_name FROM table_name WHERE column_name operator value;
  Pig: FOREACH (FILTER alias BY column_name operator value) GENERATE column_name, column_name;

AND/OR
  SQL: ... WHERE (column_name operator value1 AND column_name operator value2) OR column_name operator value3;
  Pig: FILTER alias BY (column_name operator value1 AND column_name operator value2) OR column_name operator value3;

ORDER BY
  SQL: ... ORDER BY column_name ASC|DESC, column_name ASC|DESC;
  Pig: ORDER alias BY column_name ASC|DESC, column_name ASC|DESC;

TOP/LIMIT
  SQL: SELECT TOP number column_name FROM table_name ORDER BY column_name ASC|DESC;
       SELECT column_name FROM table_name ORDER BY column_name ASC|DESC LIMIT number;
  Pig: X = ORDER alias BY column_name ASC|DESC; Y = LIMIT X number;
       or, for the top-n rows per group: TOP(number, column_index, alias);

GROUP BY
  SQL: SELECT function(column_name) FROM table GROUP BY column_name;
  Pig: FOREACH (GROUP alias BY column_name) GENERATE function(alias.column_name);

LIKE
  SQL: ... WHERE column_name LIKE pattern;
  Pig: FILTER alias BY REGEX_EXTRACT(column_name, pattern, 1) IS NOT NULL;

IN
  SQL: ... WHERE column_name IN (value1, value2, ...);
  Pig: FILTER alias BY column_name IN (value1, value2, ...);

JOIN
  SQL: SELECT column_name(s) FROM table1 JOIN table2 ON table1.column_name = table2.column_name;
  Pig: FOREACH (JOIN alias1 BY column_name, alias2 BY column_name) GENERATE column_name(s);

LEFT/RIGHT/FULL OUTER JOIN
  SQL: SELECT column_name(s) FROM table1 LEFT|RIGHT|FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;
  Pig: FOREACH (JOIN alias1 BY column_name LEFT|RIGHT|FULL, alias2 BY column_name) GENERATE column_name(s);

UNION ALL
  SQL: SELECT column_name(s) FROM table1 UNION ALL SELECT column_name(s) FROM table2;
  Pig: UNION alias1, alias2;

AVG
  SQL: SELECT AVG(column_name) FROM table_name;
  Pig: FOREACH (GROUP alias ALL) GENERATE AVG(alias.column_name);

COUNT
  SQL: SELECT COUNT(column_name) FROM table_name;
  Pig: FOREACH (GROUP alias ALL) GENERATE COUNT(alias.column_name);

COUNT DISTINCT
  SQL: SELECT COUNT(DISTINCT column_name) FROM table_name;
  Pig: FOREACH (GROUP alias ALL) { unique_column = DISTINCT alias.column_name; GENERATE COUNT(unique_column); };

MAX
  SQL: SELECT MAX(column_name) FROM table_name;
  Pig: FOREACH (GROUP alias ALL) GENERATE MAX(alias.column_name);

MIN
  SQL: SELECT MIN(column_name) FROM table_name;
  Pig: FOREACH (GROUP alias ALL) GENERATE MIN(alias.column_name);

SUM
  SQL: SELECT SUM(column_name) FROM table_name;
  Pig: FOREACH (GROUP alias ALL) GENERATE SUM(alias.column_name);

HAVING
  SQL: ... HAVING aggregate_function(column_name) operator value;
  Pig: FILTER alias BY aggregate_function(column_name) operator value; (applied to the relation produced by the GROUP/FOREACH that computes the aggregate)

UCASE/UPPER
  SQL: SELECT UCASE(column_name) FROM table_name;
  Pig: FOREACH alias GENERATE UPPER(column_name);

LCASE/LOWER
  SQL: SELECT LCASE(column_name) FROM table_name;
  Pig: FOREACH alias GENERATE LOWER(column_name);

SUBSTRING
  SQL: SELECT SUBSTRING(column_name, start, length) AS some_name FROM table_name;
  Pig: FOREACH alias GENERATE SUBSTRING(column_name, start, start + length) AS some_name;

LEN
  SQL: SELECT LEN(column_name) FROM table_name;
  Pig: FOREACH alias GENERATE SIZE(column_name);

ROUND
  SQL: SELECT ROUND(column_name, 0) FROM table_name;
  Pig: FOREACH alias GENERATE ROUND(column_name);
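To see several of these mappings working together, the following is a minimal sketch, assuming a tab-delimited input file 'sales' with made-up fields (store:chararray, amount:int). It translates the SQL query SELECT store, SUM(amount) FROM sales WHERE amount > 0 GROUP BY store ORDER BY store; into Pig Latin:

-- Sketch only: 'sales' and its schema are assumed for illustration.
sales    = LOAD 'sales' USING PigStorage('\t') AS (store:chararray, amount:int);
-- WHERE amount > 0
filtered = FILTER sales BY amount > 0;
-- GROUP BY store
grouped  = GROUP filtered BY store;
-- SELECT store, SUM(amount)
totals   = FOREACH grouped GENERATE group AS store, SUM(filtered.amount) AS total;
-- ORDER BY store
ordered  = ORDER totals BY store ASC;
DUMP ordered;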

Thursday 19 June 2014

Cloudera Certification Practice Test



Cloudera Certification Practice Test Exam with answers
-------------------------------------------------------------------------------------

1. You have decided to develop a custom composite key class that you will use for keys emitted during the map phase to be consumed by a reducer. Which interface must this key implement?
Correct Answer: WritableComparable 
 
Explanation
All keys and values must implement the Writable interface so that they can be written to disk. In addition, keys emitted during the map phase must implement WritableComparable so that the keys can be sorted during the shuffle and sort phase. WritableComparable inherits from Writable.

Further Reading
  • For more information, see the Yahoo! Developer Network Apache Hadoop Tutorial, Custom Key Types.
  • Hadoop: the Definitive Guide, chapter four, in the Serialization: The Writable Interface section.
2. You’ve built a MapReduce job that denormalizes a very large table, resulting in an extremely large amount of output data. Which two cluster resources will your job stress? (Choose two).
Correct Answer: network I/O , disk I/O 
 
Explanation
When denormalizing a table, the amount of data written by the map phase will far exceed the amount of data read by the map phase. All of the data written during the map phase is first written to local disk and then transferred over the network to the reducers during the beginning of the reduce phase. Writing a very large amount of data in the map phase will therefore create a large amount of local disk I/O on the machines running map tasks, as well as network I/O. Because map output is stored in a fixed-size buffer that is written periodically to disk, this operation will not tax the memory of the machines running the map tasks. Denormalizing a table is not a compute-intensive operation, so this operation will not tax the processors of the machines running the map tasks.

Further Reading
  • Hadoop: the Definitive Guide, chapter six, Shuffle and Sort: The Map Side section includes more information on the process for writing map output to disk.
  • Hadoop: the Definitive Guide, chapter six, Shuffle and Sort: The Reduce Side section explains further how data is transferred to the reducers.
  • Denormalizing a table is a form of join operation. You can read more about performing joins in MapReduce in Join Algorithms using Map/Reduce
3. When is the earliest that the reduce() method of any reduce task in a given job called?
Correct Answer: Not until all map tasks have completed

Explanation
No reduce task's reduce() method is called until all map tasks have completed. Every reduce task's reduce() method expects to receive its data in sorted order. If the reduce() method were called before all of the map tasks had completed, it would be possible for it to receive the data out of order.

Further Reading
  • Hadoop: The Definitive Guide, chapter six includes a detailed explanation of the shuffle and sort phase of a MapReduce job.
4.  You have 50 files in the directory /user/foo/example. Each file is 300MB. You submit a MapReduce job with /user/foo/example as the input path. 

How much data does a single Mapper process as this job executes?
Correct Answer: A single input split

Explanation
An input split is a unit of work (a chunk of data) that is processed by a single map task in a MapReduce program (represented by the Java interface InputSplit). The InputFormat you specify for the MapReduce program determines how input files are split into records and read. Each map task processes a single split; each split is further composed of records (key-value pairs) which the map task processes.

A MapReduce program run over a data set is usually called a MapReduce “job.” By splitting up input files, MapReduce can process a single file in parallel; if the file is very large, this represents a significant performance improvement. Also, because a single file is worked on in splits, MapReduce can schedule those tasks on different nodes in the cluster, preferring nodes that already store that piece of data locally, which also results in significant performance improvements. An InputSplit can span HDFS block boundaries.

Further Reading
  • Hadoop: The Definitive Guide, chapter two includes an excellent general discussion of input splits
Hadoop Administrator
5. In the execution of a MapReduce job, where does the Mapper place the intermediate data of each Map task?

Correct Answer: The Mapper stores the intermediate data on the underlying filesystem of the local disk of the machine which ran the Map task

Explanation
Intermediate data from a Mapper is stored on the local filesystem of the machine on which the Mapper ran. Each Reducer then copies its portion of that intermediate data to its own local disk. When the job terminates, all intermediate data is removed.

6. A client wishes to read a file from HDFS. To do that, it needs the block IDs (their names) which make up the file. From where does it obtain those block IDs?
Correct Answer: The NameNode reads the block IDs from memory and returns them to the client.

Explanation
When a client wishes to read a file from HDFS, it contacts the NameNode and requests the names and locations of the blocks which make up the file. For rapid access, the NameNode has the block IDs stored in RAM.

Further Reading
See Hadoop Operations, under the section "The Read Path." 

7. Your cluster has slave nodes in three different racks, and you have written a rack topology script identifying each machine as being in rack1, rack2, or rack3. A client machine outside of the cluster writes a small (one-block) file to HDFS. The first replica of the block is written to a node on rack2. How is block placement determined for the other two replicas?
Correct Answer: Either both will be written to nodes on rack1, or both will be written to nodes on rack3.

Explanation
For the default threefold replication, Hadoop’s rack placement policy is to write the first copy of a block on a node in one rack, then the other two copies on two nodes in a different rack. Since the first copy is written to rack2, the other two will either be written to two nodes on rack1, or two nodes on rack3.

8. A slave node in your cluster has 24GB of RAM, and 12 physical processor cores on hyperthreading-enabled processors. You set the value of mapred.child.java.opts to -Xmx1G, and the value of mapred.tasktracker.map.tasks.maximum to 12. What is the most appropriate value to set for mapred.tasktracker.reduce.tasks.maximum?
Correct Answer: 6

Explanation
For optimal performance, it is important to avoid the use of virtual memory (swapping) on slave nodes. From the configuration shown, the slave node could run 12 Map tasks simultaneously, and each one will use 1GB of RAM, resulting in a total of 12GB used. The TaskTracker daemon itself will use 1GB of RAM, as will the DataNode daemon. This is a total of 14GB. The operating system will also use some RAM -- a reasonable estimate would be 1-2GB. Thus, we can expect to have approximately 8-9GB of RAM available for Reducers. So the most appropriate of the choices presented is that we should configure the node to be able to run 6 Reducers simultaneously.

Further Reading
Hadoop: The Definitive Guide, 3rd Edition, Chapter 9, under the section “Environment Settings”
HBase
9. Your client application needs to scan a region for the row key value 104. Given a store file that contains the following list of Row Key values:

100, 101, 102, 103, 104, 105, 106, 107
What would a Bloom filter return?
Correct Answer: Confirmation that 104 may be contained in the set
Explanation
A Bloom filter is a kind of membership test using probability: it tells you whether an element may be a member of a set. It is quick and memory-efficient. The trade-off is that it is probabilistic, where false positives are possible though false negatives are not. Thus if the Bloom filter returns true, it confirms that a key may be contained in the table; if the Bloom filter returns false, it confirms that the key is definitely not contained in the table.

Enabling Bloom filters may save disk seeks and improve read latency.

Further Reading
  • The HBase documentation on Bloom filters (section 12.6.4, Bloom Filters) includes:
    Bloom Filters can be enabled per ColumnFamily. Use
    HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per ColumnFamily. Default = NONE for no bloom filters. If ROW, the hash of the row will be added to the bloom on each insert. If ROWCOL, the hash of the row + column family + column family qualifier will be added to the bloom on each key insert.
10. You have two tables in an existing RDBMS. One contains information about the products you sell (name, size, color, etc.) The other contains images of the products in JPEG format. These tables are frequently joined in queries to your database. You would like to move this data into HBase. What is the most efficient schema design for this scenario?
Correct Answer: Create a single table with two column families

Explanation
Access patterns are an important factor in HBase schema design. Even though the two tables in this scenario have very different data sizes and formats, it is better to store them in one table if you are accessing them together most of the time.

Column families allow for separation of data. You can store different types and formats of data in different column families. Attributes such as compression, Bloom filters, and replication are set on a per-column-family basis. In this example, it is better to store product information and product images in two different column families within one table.

Further Reading
HBase documentation on column families, Section 5.6, especially the part:
Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.
11. You need to create a Blogs table in HBase. The table will consist of a single Column Family called Content and two column qualifiers, Author and Comment. What HBase shell command should you use to create this table?
Correct Answer: create 'Blogs', 'Content'

Explanation
When you create an HBase table, you need to specify the table name and a column family name.
In the HBase shell, you can create a table with the command:
create 'myTable', 'myColumnFamily'

For this example:
  • Table name: 'Blogs'
  • ColumnFamily: 'Content'
We can create the table and verify it with the describe command.
hbase> create 'Blogs', 'Content'
hbase> describe 'Blogs'
{NAME => 'Blogs', FAMILIES => [{NAME => 'Content', ....

Further Reading
see the HBase Shell commands for create

Create table; pass table name, a dictionary of specifications per column family, and optionally a dictionary of table configuration. Dictionaries are described below in the GENERAL NOTES section.
Examples:
hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
hbase> # The above in shorthand would be the following:
hbase> create 't1', 'f1', 'f2', 'f3'
hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}

12. From within an HBase application you are performing a check and put operation with the following code:
table.checkAndPut(Bytes.toBytes("rowkey"), Bytes.toBytes("colfam"), Bytes.toBytes("qualifier"), Bytes.toBytes("barvalue"), newrow);

Which describes this check and put operation?
 
Correct Answer: Check if rowkey/colfam/qualifier exists and has the cell value "barvalue". If so, put the values in newrow and return "true".

Explanation
The method checkAndPut returns "true" if the existing value in the given row, column family, and column qualifier matches the expected value; if "true" is returned, it executes the put with the new data. If the expected value is not present in the row, it returns "false" and the put is not executed.


Data Science Essentials (DS-200)
13. What is the best way to evaluate the quality of the model found by an unsupervised algorithm like k-means clustering, given metrics for the cost of the clustering (how well it fits the data) and its stability (how similar the clusters are across multiple runs over the same data)?
Correct Answer: The lowest cost clustering subject to a stability constraint
 
Explanation
There is a tradeoff between cost and stability in unsupervised learning.  The more tightly you fit the data, the less stable the model will be, and vice versa.  The idea is to find a good balance with more weight given to the cost.  Typically a good approach is to set a stability threshold and select the model that achieves the lowest cost above the stability threshold.


14. A sandwich shop studies the number of men and women who enter the shop during the lunch hour, from noon to 1pm each day. They find that the number of men who enter can be modeled as a random variable with distribution Poisson(M), and likewise the number of women who enter as Poisson(W). What is likely to be the best model of the total number of customers who enter during the lunch hour?
Correct Answer: Poisson(M+W)

Explanation
The total number of customers is the sum of the number of men and women. The sum of two independent Poisson random variables also follows a Poisson distribution, with rate equal to the sum of their rates. The Normal and Binomial distributions can approximate the Poisson distribution in certain cases, but the other expressions offered do not model Poisson(M+W).

15. Which condition supports the choice of a support vector machine over logistic regression for a classification problem?
Correct Answer: The test set will be dense, and contain examples close to the decision boundary learned from the training set
 
Explanation
The SVM algorithm is a maximum margin classifier, and tries to pick a decision boundary that creates the widest margin between classes, rather than just any boundary that separates the classes. This helps generalization to test data, since it is less likely to misclassify points near the decision boundary, as the boundary maintains a large margin from training examples.

SVMs are not particularly better at multi-label classification. Linear separability is not required for either classification technique, and does not relate directly to an advantage of SVMs. SVMs are not particularly more suited to low-dimensional data.

Wednesday 18 June 2014

Map Reduce Scheduling information

JobTracker and TaskTracker: the MapReduce engine

Above the Hadoop distributed file system sits the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes its JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few seconds to report its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.
Known limitations of this approach are:
  • The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence of its actual availability.
  • If one TaskTracker is very slow, it can delay the entire MapReduce job – especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.

Scheduling

By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, which added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).
Fair scheduler
The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts.
  1. Jobs are grouped into Pools.
  2. Each pool is assigned a guaranteed minimum share.
  3. Excess capacity is split between jobs.
By default, jobs that are uncategorized go into a default pool. Pools can specify a minimum number of map slots and reduce slots, as well as a limit on the number of running jobs.
Capacity scheduler
The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features that are similar to the fair scheduler.
  • Jobs are submitted into queues.
  • Queues are allocated a fraction of the total resource capacity.
  • Free resources are allocated to queues beyond their total capacity.
  • Within a queue a job with a high level of priority has access to the queue's resources.
There is no preemption once a job is running.

Other applications

The HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive data warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive, and able to work on pieces of the data in parallel. As of October 2009, commercial applications of Hadoop included:
  • Log and/or clickstream analysis of various kinds
  • Marketing analytics
  • Machine learning and/or sophisticated data mining
  • Image processing
  • Processing of XML messages
  • Web crawling and/or text processing
  • General archiving, including of relational/tabular data, e.g. for compliance

Pig Function Cheat Sheet

  • Basic Operators
  • Relational Operators
  • Functions
  • Load/Store Functions
  • Math Functions
  • String Functions
  • Tuple, Bag, Map Functions
  • UDFs
  • Data Types
  • Complex Types

Basic Operators

Arithmetic operators: +, -, *, /, %, ?:
    X = FOREACH A GENERATE f1, f2, f1%f2;
    X = FOREACH A GENERATE f2, (f2==1 ? 1 : COUNT(B));

Boolean operators: and, or, not
    X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));

Cast operators: casting from one data type to another
    B = FOREACH A GENERATE (int)$0 + 1;
    B = FOREACH A GENERATE $0 + 1, $1 + 1.0;

Comparison operators: ==, !=, >, <, >=, <=, matches
    X = FILTER A BY (f1 == 8);
    X = FILTER A BY (f2 == 'apache');
    X = FILTER A BY (f1 matches '.*apache.*');

Construction operators: used to construct tuple (), bag {} and map []
    B = foreach A generate (name, age);
    B = foreach A generate {(name, age)}, {name, age};
    B = foreach A generate [name, gpa];

Dereference operators: dereference tuples (tuple.id or tuple.(id,…)), bags (bag.id or bag.(id,…)) and maps (map#'key')
    X = FOREACH A GENERATE f2.t1, f2.t3;  (dereferencing is used to retrieve two fields from tuple f2)

Disambiguate operator (::): used to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators
    A = load 'data1' as (x, y);
    B = load 'data2' as (x, y, z);
    C = join A by x, B by x;
    D = foreach C generate A::y;

Flatten operator: FLATTEN un-nests tuples as well as bags
    Consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, FLATTEN($1) will cause that tuple to become (a, b, c).

Null operators: is null, is not null
    X = FILTER A BY f1 is not null;

Sign operators: + has no effect; - changes the sign of a positive/negative number
    A = LOAD 'data' as (x, y, z);
    B = FOREACH A GENERATE -x, y;

Relational Operators

COGROUP/GROUP: Groups the data in one or more relations. The COGROUP operator groups together tuples that have the same group key (key field).
    A = load 'student' AS (name:chararray, age:int, gpa:float);
    B = GROUP A BY age;

CROSS: Computes the cross product of two or more relations.
    Given A = {(1,2,3), (4,2,1)} and B = {(2,4), (8,9), (1,3)}:
    X = CROSS A, B;
    DUMP X;
    (1,2,3,2,4)
    (1,2,3,8,9)
    (1,2,3,1,3)
    (4,2,1,2,4)
    (4,2,1,8,9)
    (4,2,1,1,3)

DEFINE: Assigns an alias to a UDF or streaming command.
    DEFINE CMD `perl PigStreaming.pl - nameMap` input(stdin using PigStreaming(',')) output(stdout using PigStreaming(','));
    A = LOAD 'file';
    B = STREAM A THROUGH CMD;

DISTINCT: Removes duplicate tuples in a relation.
    Given A = {(8,3,4), (1,2,3), (4,3,3), (4,3,3), (1,2,3)}:
    X = DISTINCT A;
    DUMP X;
    (1,2,3)
    (4,3,3)
    (8,3,4)

FILTER: Selects tuples from a relation based on some condition.
    Given A = {(1,2,3), (4,5,6), (7,8,9), (4,3,3), (8,4,3)}:
    X = FILTER A BY f3 == 3;
    DUMP X;
    (1,2,3)
    (4,3,3)
    (8,4,3)

FOREACH: Generates a transformation of the data for each row, as specified.
    Given A = {(1,2,3), (4,2,5), (8,3,6)}:
    X = FOREACH A GENERATE a1, a2;
    DUMP X;
    (1,2)
    (4,2)
    (8,3)

IMPORT: Imports macros defined in a separate file.
    /* myscript.pig */
    IMPORT 'my_macro.pig';

JOIN: Performs an inner join of two or more relations based on common field values.
    Given A = {(1,2), (4,5)} and B = {(1,3), (1,2), (4,7)}:
    X = JOIN A BY a1, B BY b1;
    DUMP X;
    (1,2,1,3)
    (1,2,1,2)
    (4,5,4,7)

LOAD: Loads data from the file system.
    A = LOAD 'myfile.txt';
    A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);

MAPREDUCE: Executes native MapReduce jobs inside a Pig script.
    A = LOAD 'WordcountInput.txt';
    B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
        AS (word:chararray, count:int) `org.myorg.WordCount inputDir outputDir`;

ORDER BY: Sorts a relation based on one or more fields.
    A = LOAD 'mydata' AS (x: int, y: map[]);
    B = ORDER A BY x;

SAMPLE: Selects a random data sample with the stated sample size.
    Relation X will contain 1% of the data in relation A:
    A = LOAD 'data' AS (f1:int, f2:int, f3:int);
    X = SAMPLE A 0.01;

SPLIT: Partitions a relation into two or more relations based on some expression.
    SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null);

STORE: Stores or saves results to the file system.
    STORE A INTO 'myoutput' USING PigStorage('*');
    Output file contents:
    1*2*3
    4*2*1

STREAM: Sends data to an external script or program.
    A = LOAD 'data';
    B = STREAM A THROUGH `stream.pl -n 5`;

UNION: Computes the union of two or more relations. (Does not preserve the order of tuples.)
    Given A = {(1,2,3), (4,2,1)} and B = {(2,4), (8,9), (1,3)}:
    X = UNION A, B;
    DUMP X;
    (1,2,3)
    (4,2,1)
    (2,4)
    (8,9)
    (1,3)

Functions

AVG(expression): Computes the average of the numeric values in a single-column bag.
CONCAT(expression, expression): Concatenates two expressions of identical type.
COUNT(expression): Computes the number of elements in a bag; ignores nulls.
COUNT_STAR(expression): Computes the number of elements in a bag; includes nulls.
DIFF(expression, expression): Compares two fields in a tuple; any tuples that are in one bag but not the other are returned in a bag.
IsEmpty(expression): Checks if a bag or map is empty.
MAX(expression): Computes the maximum of the numeric values or chararrays in a single-column bag.
MIN(expression): Computes the minimum of the numeric values or chararrays in a single-column bag.
SIZE(expression): Computes the number of elements based on any Pig data type. SIZE includes NULL values in the size computation.
SUM(expression): Computes the sum of the numeric values in a single-column bag.
TOKENIZE(expression [, 'field_delimiter']): Splits a string and outputs a bag of words.
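The aggregate functions above operate on bags, which is why they usually appear after a GROUP. As a small illustration, the classic word count can be written with TOKENIZE, FLATTEN, and COUNT; this is a sketch only, and the input file name 'input.txt' is assumed:

-- Sketch: count word occurrences in an assumed text file 'input.txt'.
lines   = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN un-nests that bag into rows.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words, then COUNT the bag produced for each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
DUMP counts;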

Load/Store Functions

Handling compression: PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.
    A = load 'myinput.gz';
    store A into 'myoutput.gz';
BinStorage: Loads and stores data in machine-readable format.
    A = LOAD 'data' USING BinStorage();
JsonLoader, JsonStorage: Load or store JSON data.
    A = load 'a.json' using JsonLoader();
PigDump: Stores data in UTF-8 format.
    STORE X INTO 'output' USING PigDump();
PigStorage: Loads and stores data as structured text files.
    A = LOAD 'student' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
TextLoader: Loads unstructured data in UTF-8 format.
    A = LOAD 'data' USING TextLoader();

Math Functions

ABS(expression): Returns the absolute value of an expression. If the result is not negative (x ≥ 0), the result is returned. If the result is negative (x < 0), the negation of the result is returned.
ACOS(expression): Returns the arc cosine of an expression.
ASIN(expression): Returns the arc sine of an expression.
ATAN(expression): Returns the arc tangent of an expression.
CBRT(expression): Returns the cube root of an expression.
CEIL(expression): Returns the value of an expression rounded up to the nearest integer. This function never decreases the result value.
COS(expression): Returns the trigonometric cosine of an expression.
COSH(expression): Returns the hyperbolic cosine of an expression.
EXP(expression): Returns Euler's number e raised to the power of x.
FLOOR(expression): Returns the value of an expression rounded down to the nearest integer. This function never increases the result value.
LOG(expression): Returns the natural logarithm (base e) of an expression.
LOG10(expression): Returns the base-10 logarithm of an expression.
RANDOM(): Returns a pseudo-random number (type double) greater than or equal to 0.0 and less than 1.0.
ROUND(expression): Returns the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double).
SIN(expression): Returns the sine of an expression.
SINH(expression): Returns the hyperbolic sine of an expression.
SQRT(expression): Returns the positive square root of an expression.
TAN(expression): Returns the trigonometric tangent of an angle.
TANH(expression): Returns the hyperbolic tangent of an expression.

String Functions

INDEXOF(string, 'character', startIndex): Returns the index of the first occurrence of a character in a string, searching forward from a start index.
LAST_INDEX_OF(expression): Returns the index of the last occurrence of a character in a string, searching backward from the end of the string.
LCFIRST(expression): Converts the first character in a string to lower case.
LOWER(expression): Converts all characters in a string to lower case.
REGEX_EXTRACT(string, regex, index): Performs regular expression matching and extracts the matched group defined by an index parameter. The function uses Java regular expression form.
REGEX_EXTRACT_ALL(string, regex): Performs regular expression matching and extracts all matched groups. The function uses Java regular expression form.
REPLACE(string, 'oldChar', 'newChar'): Replaces existing characters in a string with new characters.
STRSPLIT(string, regex, limit): Splits a string around matches of a given regular expression.
SUBSTRING(string, startIndex, stopIndex): Returns a substring of a given string.
TRIM(expression): Returns a copy of a string with leading and trailing whitespace removed.
UCFIRST(expression): Returns a string with the first character converted to upper case.
UPPER(expression): Returns a string converted to upper case.
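REGEX_EXTRACT is also the usual way to express a SQL LIKE predicate, as noted in the SQL-to-Pig mapping above. The following is a minimal sketch, assuming a relation 'emails' with made-up fields (user, address), that keeps only addresses ending in @example.com:

-- Sketch only: 'emails' and its schema are assumed for illustration.
emails   = LOAD 'emails' AS (user:chararray, address:chararray);
-- Group 1 of the regex is non-null only when the whole address matches the pattern.
filtered = FILTER emails BY REGEX_EXTRACT(address, '(.*@example\\.com)', 1) IS NOT NULL;
DUMP filtered;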

Tuple, Bag, Map Functions

TOTUPLE(expression [, expression ...]): Converts one or more expressions to type tuple.
TOBAG(expression [, expression ...]): Converts one or more expressions to individual tuples, which are then placed in a bag.
TOMAP(key-expression, value-expression [, key-expression, value-expression ...]): Converts key/value expression pairs into a map. Needs an even number of expressions as parameters. The elements must comply with map type rules.
TOP(topN, column, relation): Returns the top-n tuples from a bag of tuples.

User Defined Functions (UDFs)

Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in several languages, including Java, Python, JavaScript, and Ruby.
Registering UDFs
Registering Java UDFs:
--register_java_udf.pig
register 'your_path_to_piggybank/piggybank.jar';
divs      = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                date:chararray, dividends:float);
Registering Python UDFs (The Python script must be in your current directory):
--register_python_udf.pig
register 'production.py' using jython as bballudfs;
players  = load 'baseball' as (name:chararray, team:chararray,
                pos:bag{t:(p:chararray)}, bat:map[]);
Writing UDFs
Java UDFs:
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String)input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
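Once compiled and packaged, the UDF is registered and invoked from Pig Latin by its fully qualified name. The following is a minimal sketch; the jar name myudfs.jar and the input file 'student_data' are assumed for illustration:

-- Sketch only: jar and input names are assumed.
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name:chararray, age:int, gpa:float);
-- Invoke the Java UDF as package.ClassName.
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;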
Python UDFs
# Square - square of a number of any data type
@outputSchemaFunction("squareSchema")  # Defines a script delegate function that defines the schema for this function depending upon the input type.
def square(num):
    return num * num

@schemaFunction("squareSchema")  # Defines the delegate function; it is not registered to Pig.
def squareSchema(input):
    return input

# Percent - percentage
@outputSchema("percent:double")  # Defines the schema for a script UDF in a format that Pig understands and is able to parse.
def percent(num, total):
    return num * 100 / total
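To use these Python UDFs from a Pig script, register the file with the jython engine and call the functions through the namespace given in the register statement. A minimal sketch follows; the file name 'myfuncs.py' (holding the functions above) and the 'numbers' input relation are assumed for illustration:

-- Sketch only: script and input names are assumed.
register 'myfuncs.py' using jython as myfuncs;
A = LOAD 'numbers' AS (n:int, total:int);
-- Call the Python UDFs through the 'myfuncs' namespace.
B = FOREACH A GENERATE myfuncs.square(n), myfuncs.percent(n, total);
DUMP B;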

Data Types

Simple Types

int: Signed 32-bit integer. Example: 10
long: Signed 64-bit integer. Data: 10L or 10l; displayed as 10L
float: 32-bit floating point. Data: 10.5F, 10.5f, 10.5e2f, or 10.5E2F; displayed as 10.5F or 1050.0F
double: 64-bit floating point. Data: 10.5, 10.5e2, or 10.5E2; displayed as 10.5 or 1050.0
chararray: Character array (string) in Unicode UTF-8 format. Example: hello world
bytearray: Byte array (blob)
boolean: Boolean. Example: true/false (case insensitive)

Complex Types

tuple: An ordered set of fields. Example: (19,2)
bag: A collection of tuples. Example: {(19,2), (18,1)}
map: A set of key/value pairs. Example: [name#John,phone#5551212]
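As a brief illustration of how these complex types are declared and dereferenced in a script, the following is a sketch only; the file name 'students' and its layout are assumed:

-- Assumed layout: a tuple, a bag, and a map per record.
A = LOAD 'students' AS (t:tuple(name:chararray, age:int),
                        b:bag{rec:(score:int)},
                        m:map[]);
-- Tuples and bags are dereferenced with '.', maps with '#'.
B = FOREACH A GENERATE t.name, b.score, m#'phone';
DUMP B;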