Wednesday 23 October 2013

Anatomy of a MapReduce Job

Hadoop MapReduce jobs are divided into a set of map tasks and reduce tasks that run in a distributed fashion on a cluster of computers. Each task works on the small subset of the data it has been assigned, so that the load is spread across the cluster.
The input to a MapReduce job is a set of files stored across HDFS. In Hadoop, these files are split by an input format, which defines how to separate a file into input splits. You can think of an input split as a byte-oriented view of a chunk of a file that is loaded by one map task.
A map task generally performs loading, parsing, transformation and filtering operations, whereas a reduce task is responsible for grouping and aggregating the data produced by the map tasks to generate the final output. This simple paradigm is enough to solve a wide range of problems, from simple numerical aggregation to complex join operations and Cartesian products.
Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner and partitioner. The output of the map phase, called the intermediate keys and values, is sent to the reducers. Each reduce task is broken into the following phases: shuffle, sort, reducer and output format. The Hadoop framework tries to assign map tasks to the DataNodes where the data to be processed resides. As a result the data typically does not have to move over the network, which saves network bandwidth, and it is processed on the local machine itself; such a map task is said to be data local.
MapReduce Framework
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

Mapper

Record Reader:

The record reader translates an input split generated by the input format into records. Its purpose is to parse the data into records, not to parse the records themselves. It passes the data to the mapper in the form of key/value pairs. Usually the key in this context is positional information and the value is the chunk of data that composes a record. In our future articles we will discuss more about NLineInputFormat and custom record readers.
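
As a rough illustration of the record reader step, the sketch below mimics what a line-oriented reader hands to the mapper: the key is the offset of the line within the input and the value is the line itself. The helper name read_records is hypothetical, and the offset arithmetic assumes single-byte characters and a one-byte newline. It is plain shell, not Hadoop code.

```shell
# read_records: emit "offset<TAB>line" for each line of stdin, where offset is
# the position of the line's first character in the input (assumption: one
# byte per character and a one-byte newline).
read_records() {
  awk 'BEGIN { offset = 0 } { print offset "\t" $0; offset += length($0) + 1 }'
}

# Prints "0<TAB>ab" and then "3<TAB>cde".
printf 'ab\ncde\n' | read_records
```

The mapper would then receive each of these pairs in turn and decide for itself how to parse the value.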

Map:

The map function is the heart of the map task. It is executed on each key/value pair from the record reader and produces zero or more key/value pairs, called intermediate pairs. What to use as the key and the value depends on what the MapReduce job is accomplishing: the data is grouped on the key, and the value is the information pertinent to the analysis in the reducer.

Combiner:

The combiner is an optional component, but it is highly useful and can give a large performance gain to a MapReduce job. It is not applicable to every MapReduce algorithm: the framework may run it zero, one or several times, so the operation must give the same final result in each case (in practice it should be commutative and associative). Wherever it can be applied, it is recommended. It takes the intermediate keys from the mapper and applies a user-provided method to aggregate values within the small scope of that one mapper. For example, sending (hadoop, 3) requires fewer bytes than sending (hadoop, 1) three times over the network. We will cover the combiner in much more depth in our future articles.
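
The (hadoop, 3) example can be sketched with ordinary shell tools. The helper name combine is hypothetical and the tab-separated "key, count" layout is an assumption made for the illustration; a real combiner is written against the Hadoop API.

```shell
# combine: sum the counts for each key emitted by one mapper (local aggregation).
# Input and output are tab-separated "key<TAB>count" lines; the final sort only
# makes the output order predictable.
combine() {
  awk -F '\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' | sort
}

# One mapper emitted (hadoop, 1) three times and (hdfs, 1) once;
# after combining, only (hadoop, 3) and (hdfs, 1) need to be sent.
printf 'hadoop\t1\nhdfs\t1\nhadoop\t1\nhadoop\t1\n' | combine
```

Four pairs go in and two come out, which is the saving in network traffic the combiner is meant to provide.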

Partitioner:

The partitioner takes the intermediate key/value pairs from the mapper (or from the combiner, if one is used) and splits them into shards, one shard per reducer. By default the shard is chosen from a hash of the key, which spreads the keyspace roughly evenly over the reducers while still ensuring that equal keys emitted by different mappers end up at the same reducer. The partitioned data is written to the local filesystem of each map task and waits to be pulled by its respective reducer.
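
The idea of hashing a key to pick a reducer can be sketched as follows. The helper name partition_for is hypothetical, and cksum is used only as a stand-in for a hash function; the sketch shows the principle (hash modulo number of reducers), not the exact values Hadoop would compute.

```shell
# partition_for KEY NUM_REDUCERS: print the 0-based reducer index for KEY.
# cksum stands in for the key's hash; equal keys always give the same value.
partition_for() {
  hash=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( hash % $2 ))
}

# The two "hadoop" keys are always routed to the same reducer.
for key in hadoop hdfs hadoop; do
  echo "$key -> reducer $(partition_for "$key" 4)"
done
```

Because the result depends only on the key and the number of reducers, every mapper makes the same routing decision without any coordination.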

Reducer

Shuffle and Sort:

The reduce task starts with the shuffle and sort step. This step takes the output files written by the partitioners of all the map tasks and downloads them to the local machine on which the reducer is running. These individual pieces of data are then sorted by key into one larger data list. The purpose of this sort is to group equivalent keys together so that their values can be iterated over easily in the reduce task.

Reduce:

The reducer takes the grouped data as input and runs a reduce function once per key grouping. The function is passed the key and an iterator over all the values associated with that key. A wide range of processing can happen in this function: the data can be aggregated, filtered, and combined in a number of ways. Once it is done, it sends zero or more key/value pairs to the final step, the output format.
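
The map, shuffle/sort and reduce steps have a well-known analogue in a Unix pipeline. The sketch below counts words; the helper names map_words and reduce_counts are hypothetical, and sort stands in for the shuffle and sort step. It runs on one machine and only illustrates the data flow.

```shell
# map: emit one "word<TAB>1" pair per word on stdin.
map_words() {
  tr -s ' ' '\n' | awk 'NF { print $0 "\t1" }'
}

# reduce: the input is sorted by key, so equal keys are adjacent;
# sum each run of equal keys and emit one pair per key.
reduce_counts() {
  awk -F '\t' '
    NR > 1 && $1 != key { print key "\t" sum; sum = 0 }
    { key = $1; sum += $2 }
    END { if (NR > 0) print key "\t" sum }
  '
}

# sort plays the role of shuffle and sort: it groups equal keys together.
# Prints "hadoop<TAB>3" and then "hdfs<TAB>2".
printf 'hadoop hdfs hadoop\nhdfs hadoop\n' | map_words | sort | reduce_counts
```

The reduce function never has to search for matching keys; it relies entirely on the sort having placed them next to each other.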

Output Format:

The output format translates the final key/value pairs from the reduce function and writes them out to a file using a record writer. By default, it separates the key and value with a tab and separates records with a newline character. We will discuss how to write your own customized output format in our future articles.

 

Wednesday 9 October 2013

UNIX INTERVIEW QUESTIONS

UNIX INTERVIEW QUESTIONS ON AWK COMMAND

Awk is a powerful tool in Unix. It is excellent for processing files whose data is arranged in rows and columns, and it makes a good filter and report writer.
1. How do you run the awk commands specified in a file?
awk -f script_file input_file

2. Write a command to print the squares of numbers from 1 to 10 using awk command
awk 'BEGIN { for(i=1;i<=10;i++) {print "square of",i,"is",i*i;}}'

3. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | awk '/^-/ {sum += $5} END {print sum + 0}'
The /^-/ pattern restricts the sum to regular files, so the "total" line and the directories are skipped.

4. In the text file, some lines are delimited by colon and some are delimited by space. Write a command to print the third field of each line.

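awk '{ if ($0 ~ /:/) { split($0, f, ":") } else { split($0, f, " ") } print f[3] }' filename
Assigning FS inside the action would only affect the next line that awk reads, so the current line is split explicitly with split().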
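
A quick way to check an answer to the mixed-delimiter question is to run it on a few sample lines. The sketch below wraps a split()-based approach in a hypothetical helper, third_field. It uses split() because awk divides a line into fields when the line is read, so a change to FS made inside an action is only seen from the following line onward.

```shell
# third_field: print the third field of each stdin line, splitting on ":"
# when the line contains a colon and on whitespace otherwise.
third_field() {
  awk '{
    if ($0 ~ /:/) { split($0, f, ":") } else { split($0, f, " ") }
    print f[3]
  }'
}

# Prints c, f and i, one per line.
printf 'a:b:c\nd e f\ng:h:i\n' | third_field
```

Alternating the two delimiters in the sample input is deliberate: it is exactly the case where a per-line FS assignment gives wrong fields.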

5. Write a command to print the line number before each line?
awk '{print NR, $0}' filename

6. Write a command to print the second and third line of a file without using NR.
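awk 'BEGIN {RS=""; FS="\n"; OFS="\n"} {print $2, $3; exit}' filename
This reads the file in paragraph mode, so each line becomes a field. It assumes there is no blank line within the first three lines.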

7. Write a command to print zero byte size files?
ls -l | awk '/^-/ {if ($5 == 0) print $9}'

8. Write a command to rename the files in a directory with "_new" as postfix?
ls | awk '{print "mv "$1" "$1"_new"}' | sh
This assumes that the file names contain no spaces.

9. Write a command to print the fields in a text file in reverse order?
awk '{ for (i = NF; i > 1; i--) printf "%s ", $i; print $1 }' filename

10. Write a command to find the total number of lines in a file without using NR
awk 'BEGIN {sum=0} {sum=sum+1} END {print sum}' filename

Another way to print the number of lines is by using the NR. The command is
awk 'END{print NR}' filename


UNIX INTERVIEW QUESTIONS ON GREP COMMAND

grep is one of the most powerful tools in Unix. The name comes from the ed command g/re/p, "globally search for a regular expression and print". Much of the power of grep lies in its use of regular expressions.

The general syntax of grep command is
grep [options] pattern [files]

1. Write a command to print the lines that have the pattern "july" in all the files in a particular directory?

grep july *
This will print all the lines in all files that contain the word "july" along with the file name. If any of the files contain words like "JULY" or "July", the above command would not print those lines.

2. Write a command to print the lines that have the word "july" in all the files in a directory and also suppress the filename in the output.

grep -h july *

3. Write a command to print the lines that have the word "july" while ignoring the case.

grep -i july *
The '-i' option makes grep treat the pattern as case insensitive.

4. When you use a single file as input to the grep command to search for a pattern, it won't print the filename in the output. Now write a grep command to print the filename in the output without using the '-H' option.
grep pattern filename /dev/null
/dev/null, the null device, is a special file that discards any data written to it and always reads as an empty file. Because grep is now given two files, it prints the filename for every match.

Another way to print the filename is using the '-H' option. The grep command for this is
grep -H pattern filename

5. Write a Unix command to display the lines in a file that do not contain the word "july"?
grep -v july filename
The '-v' option tells the grep to print the lines that do not contain the specified pattern.

6. Write a command to print the names of the files in a directory that contain the word "july"?
grep -l july *
The '-l' option makes grep print only the filename, without printing the content of the file. As soon as grep finds the pattern in a file, it prints the filename and stops searching the remaining lines of that file.

7. Write a command to print the names of the files in a directory that do not contain the word "july"?
grep -L july *
The '-L' option makes grep print the names of the files that do not contain the specified pattern.

8. Write a command to print the line numbers along with the lines that have the word "july"?
grep -n july filename
The '-n' option prints the line number of each matching line. The line numbers start from 1.

9. Write a command to print the lines that start with the word "start"?
grep '^start' filename
The '^' symbol tells grep to search for the pattern at the start of the line.

10. Write a command to print the lines which end with the word "end"?
grep 'end$' filename
The '$' symbol tells grep to search for the pattern at the end of the line.

11. Write a command to select only those lines containing "july" as a whole word?
grep -w july filename
The '-w' option makes grep match the pattern only as a whole word. If the pattern occurs only as part of a longer word, the line is not selected. For example, the string "mikejulymak" contains the pattern "july", but "july" is not a whole word in that string.
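
The difference that '-w' makes is easy to see on a few sample lines. The helper name sample and its three lines are made up for the illustration.

```shell
# Three sample lines: only the first and third contain "july" as a whole word.
sample() { printf 'july\nmikejulymak\nin july we travel\n'; }

sample | grep july      # matches all three lines
sample | grep -w july   # matches the first and third lines only
```

The second line is dropped by '-w' because "july" there is surrounded by other letters.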


UNIX INTERVIEW QUESTIONS ON SED COMMAND

sed is a stream editor used for modifying text automatically.

1. Write a command to replace the first occurrence of the word "bad" with "good" on each line of a file?
sed 's/bad/good/' < filename

2. Write a command to replace the word "bad" with "good" globally in a file?
sed 's/bad/good/g' < filename

3. Write a command to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename

4. Write a command to replace the word "apple" with "(apple)" in a file?
sed 's/apple/(&)/' < filename

5. Write a command to switch the two consecutive words "apple" and "mango" in a file?
sed 's/\(apple\) \(mango\)/\2 \1/' < filename

6. Write a command to replace the second occurrence of the word "bat" with "ball" in a file?
sed 's/bat/ball/2' < filename
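
To see how the flag after the last delimiter selects which occurrence is replaced, the three variants can be run on the same sample line. The sample text is made up for the illustration.

```shell
line='bat bat bat'

echo "$line" | sed 's/bat/ball/'    # prints: ball bat bat
echo "$line" | sed 's/bat/ball/2'   # prints: bat ball bat
echo "$line" | sed 's/bat/ball/g'   # prints: ball ball ball
```

With no flag only the first match on each line is replaced, a number picks that one match, and g replaces every match.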

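7. Write a command to remove all the occurrences of the word "jhon" except the first one in a line, within the entire file?
sed 's/jhon//2g' < filename
Combining a number with g (replace the 2nd match and all the following ones) is a GNU sed extension.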

8. Write a command to remove the first number on line 5 in file?
sed '5 s/[0-9][0-9]*//' < filename

9. Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename

10. Write a command to replace the word "gum" with "drum" in the first 100 lines of a file?
sed '1,100 s/gum/drum/' < filename

11. Write a command to replace the word "lite" with "light" from 100th line to last line in a file?
sed '100,$ s/lite/light/' < filename

12. Write a command to remove the first 10 lines from a file?
sed '1,10 d' < filename

13. Write a command to duplicate each line in a file?
sed 'p' < filename

14. Write a command to duplicate empty lines in a file?
sed '/^$/ p' < filename

15. Write a sed command to print the lines that do not contain the word "run"?
sed -n '/run/!p' < filename
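
The '!' after the address negates it, so the p command runs only on the lines that do not match. A small check on made-up sample lines, next to the equivalent grep command:

```shell
lines() { printf 'run fast\nwalk slowly\nrunning late\n'; }

lines | sed -n '/run/!p'   # prints only: walk slowly
lines | grep -v run        # same result with grep
```

Note that "running late" is also filtered out, because the pattern matches anywhere in the line.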


UNIX INTERVIEW QUESTIONS ON CUT COMMAND

The cut command is used to display selected columns or fields from each line of a file. The cut command works in two modes:
  • Delimited selection: The fields in the line are delimited by a single character such as a blank or a comma.
  • Range selection: Each field starts at a certain fixed offset, defined as a range.
1. Write a command to display the third and fourth character from each line of a file?
cut -c 3,4 filename

2. Write a command to display the characters from 10 to 20 from each line of a file?
cut -c 10-20 filename

3. Write a command to display the first 10 characters from each line of a file?
cut -c -10 filename

4. Write a command to display from the 10th character to the end of the line?
cut -c 10- filename

5. The fields in each line are delimited by comma. Write a command to display third field from each line of a file?
cut -d',' -f3 filename

6. Write a command to print the fields from 10 to 20 from each line of a file?
cut -d',' -f10-20 filename

7. Write a command to print the first 5 fields from each line?
cut -d',' -f-5 filename

8. Write a command to print the fields from 10th to the end of the line?
cut -d',' -f10- filename

9. By default the cut command displays the entire line if there is no delimiter in it. Which cut option is used to suppress these kinds of lines?
The -s option is used to suppress the lines that do not contain the delimiter.
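
The effect of -s can be checked on a small sample in which one line has no delimiter. The helper name rows and its contents are made up for the illustration.

```shell
rows() { printf 'a,b,c\nno delimiter here\nx,y,z\n'; }

rows | cut -d ',' -f 2      # prints b, then the whole middle line, then y
rows | cut -d ',' -s -f 2   # prints b and y only
```

Without -s the middle line is passed through unchanged, which can silently pollute a column extracted from a file with stray header or comment lines.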

10. Write a cut command to extract the username from 'who am i' command?
who am i | cut -f1 -d' '


UNIX INTERVIEW QUESTIONS ON FIND COMMAND

Find utility is used for searching files using the directory information.

1. Write a command to search for the file 'test' in the current directory?
find . -name test -type f

2. Write a command to search for the file 'temp' in '/usr' directory?
find /usr -name temp -type f

3. Write a command to search for zero byte size files in the current directory?
find . -size 0 -type f

4. Write a command to list the files that were accessed 5 days ago in the current directory?
find . -atime 5 -type f

5. Write a command to list the files that were modified 5 days ago in the current directory?
find . -mtime 5 -type f

6. Write a command to search for the files in the current directory which are not owned by any user in the /etc/passwd file?
find . -nouser -type f

7. Write a command to search for the files in '/usr' directory that start with 'te'?
find /usr -name 'te*' -type f

8. Write a command to search for the files that start with 'te' in the current directory and then display the contents of the file?
find . -name 'te*' -type f -exec cat {} \;

9. Write a command to list the files whose status is changed 5 days ago in the current directory?
find . -ctime 5 -type f

10. Write a command to list the files in '/usr' directory that start with 'ch' and then display the number of lines in each file?
find /usr -name 'ch*' -type f -exec wc -l {} \;


TOP UNIX INTERVIEW QUESTIONS - PART 1

1. How to display the 10th line of a file?
head -10 filename | tail -1

2. How to remove the header from a file?
sed -i '1 d' filename

3. How to remove the footer from a file?
sed -i '$ d' filename

4. Write a command to find the length of a line in a file?

The below command can be used to get the n-th line from a file.
sed -n '<n> p' filename

We will see how to find the length of the 10th line in a file. Note that wc -c also counts the newline character at the end of the line.
sed -n '10 p' filename|wc -c
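
If the newline should not be counted, one option is awk's length function. The helper name line_length is hypothetical; it reads the file on stdin and takes the line number as its argument. Depending on the awk implementation and locale, length may count characters rather than bytes.

```shell
# line_length N: print the length of line N of stdin, excluding the newline.
line_length() {
  awk -v n="$1" 'NR == n { print length($0); exit }'
}

# The second line is "hello", so this prints 5.
printf 'one\nhello\n' | line_length 2
```

For a file, the call would look like: line_length 10 < filename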

5. How to get the nth word of a line in Unix?
cut -f<n> -d' '

6. How to reverse a string in unix?
echo "java" | rev

7. How to get the last word from a line in Unix file?
echo "unix is good" | rev | cut -f1 -d' ' | rev

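8. How to replace the n-th line in a file with a new line in Unix?
sed -i '10 s/.*/new line/' filename      # replaces the whole content of line 10

Alternatively, delete the line and then insert the new text before the line that takes its place (this needs at least one line after line 10):
sed -i'' '10 d' filename      # d stands for delete
sed -i'' '10 i new inserted line' filename    # i stands for insert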

9. How to check if the last command was successful in Unix?
echo $?
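
The value printed by echo $? is the exit status of the last command: 0 means success and any other value means failure. The sketch below wraps that check in a hypothetical helper, report, that runs a command quietly and describes the result.

```shell
# report CMD...: run a command with its output discarded and describe
# its exit status.
report() {
  "$@" > /dev/null 2>&1
  status=$?
  if [ "$status" -eq 0 ]; then
    echo "success"
  else
    echo "failed with status $status"
  fi
}

report true    # prints: success
report false   # prints a "failed with status ..." line
```

The status has to be saved in a variable straight away, because every later command (including the test itself) overwrites $?.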

10. Write command to list all the links from a directory?
ls -lrt | grep "^l"

11. How will you find which operating system your system is running on in UNIX?
uname -a

12. Create a read-only file in your home directory?
touch ~/file; chmod 400 ~/file

13. How do you see command line history in UNIX?

The 'history' command can be used to get the list of commands that we have executed.

14. How to display the first 20 lines of a file?

By default, the head command displays the first 10 lines from a file. If we change the option of head, then we can display as many lines as we want.
head -20 filename

An alternative solution is using the sed command
sed '21,$ d' filename

The d command here deletes the lines from 21 to the end of the file, so only the first 20 lines are printed.

15. Write a command to print the last line of a file?

The tail command can be used to display the last lines from a file.
tail -1 filename

Alternative solutions are:
sed -n '$ p' filename
awk 'END{print $0}' filename


TOP UNIX INTERVIEW QUESTIONS - PART 2

1. How do you rename the files in a directory with _new as suffix?
ls -lrt|grep '^-'| awk '{print "mv "$9" "$9"_new"}' | sh

2. Write a command to convert a string from lower case to upper case?
echo "apple" | tr 'a-z' 'A-Z'

3. Write a command to convert a string to Initcap.
echo apple | awk '{print toupper(substr($1,1,1)) tolower(substr($1,2))}'

4. Write a command to redirect the output of date command to multiple files?

The tee command writes the output to multiple files and also displays the output on the terminal.
date | tee -a file1 file2 file3

5. How do you list the hidden files in current directory?
ls -a | grep '^\.'

6. List out some of the Hot Keys available in bash shell? 
  • Ctrl+l - Clears the screen.
  • Ctrl+r - Searches backward through the previously given commands.
  • Ctrl+u - Deletes everything from the cursor back to the beginning of the line.
  • Ctrl+a - Places the cursor at the beginning of the command line.
  • Ctrl+e - Places the cursor at the end of the command line.
  • Ctrl+d - Sends end-of-file; on an empty command line this exits the shell.
  • Ctrl+z - Suspends the currently running foreground process (use bg to resume it in the background).

7. How do you make an existing file empty?
cat /dev/null >  filename

8. How do you remove the first number on 10th line in file?
sed '10 s/[0-9][0-9]*//' < filename

9. What is the difference between join -v and join -a?
join -v N : outputs only the unmatched lines from file N (1 or 2) and suppresses the matched lines.
join -a N : outputs the unmatched lines from file N in addition to the matched lines.
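
The two options are easiest to compare on a pair of small sorted files. The helper name join_demo and the sample data are made up for the illustration; the helper passes its arguments straight through to join.

```shell
# join_demo [JOIN_OPTIONS...]: join two small sorted files on their first field.
join_demo() {
  dir=$(mktemp -d)
  printf '1 a\n2 b\n3 c\n' > "$dir/left"
  printf '1 x\n3 z\n4 w\n' > "$dir/right"
  join "$@" "$dir/left" "$dir/right"
  rm -rf "$dir"
}

join_demo         # matched lines only: "1 a x" and "3 c z"
join_demo -v 1    # only the unmatched line of the first file: "2 b"
join_demo -a 1    # matched lines plus "2 b" from the first file
```

Key 4 exists only in the second file, so it appears in none of these outputs; join -v 2 or join -a 2 would be needed to see it.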

10. How do you display from the 5th character to the end of the line from a file?
cut -c 5- filename


TOP UNIX INTERVIEW QUESTIONS - PART 3

1. Display all the files in current directory sorted by size?
ls -l | grep '^-' | awk '{print $5,$9}' |sort -n|awk '{print $2}'

2. Write a command to search for the file 'map' in the current directory?
find . -name map -type f

3. How to display the first 10 characters from each line of a file?
cut -c -10 filename

4. Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename

5. How to print the names of the files in a directory that contain the word "term"?
grep -l term *

The '-l' option makes grep print only the filename, without printing the content of the file. As soon as grep finds the pattern in a file, it prints the filename and stops searching the remaining lines of that file.

6. How do you run the awk commands specified in a file?
awk -f script_file input_file

7. How do you display the calendar for the month of March in the year 1985?

The cal command can be used to display the calendar of the current month. You can pass the month and year as arguments to display the calendar for the required month and year.
cal 03 1985

This will display the calendar for March 1985.

8. Write a command to find the total number of lines in a file?
wc -l filename

Other ways to print the total number of lines are
awk 'BEGIN {sum=0} {sum=sum+1} END {print sum}' filename
awk 'END{print NR}' filename

9. How to duplicate empty lines in a file?
sed '/^$/ p' < filename

10. Explain iostat, vmstat and netstat?
  • Iostat: reports on terminal, disk and tape I/O activity.
  • Vmstat: reports on virtual memory statistics for processes, disk, tape and CPU activity.
  • Netstat: reports on the contents of network data structures.



TOP UNIX INTERVIEW QUESTIONS - PART 4

1. How do you write the contents of 3 files into a single file?
cat file1 file2 file3 > file

2. How to display the fields in a text file in reverse order?
awk '{ for (i = NF; i > 1; i--) printf "%s ", $i; print $1 }' filename

3. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | grep '^-'| awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'

4. Write a command to print the lines which end with the word "end"?
grep 'end$' filename

The '$' symbol specifies the grep command to search for the pattern at the end of the line.

5. Write a command to select only those lines containing "july" as a whole word?
grep -w july filename

The '-w' option makes grep match the pattern only as a whole word. If the pattern occurs only as part of a longer word, the line is not selected. For example, the string "mikejulymak" contains the pattern "july", but "july" is not a whole word in that string.

6. How to remove the first 10 lines from a file?
sed '1,10 d' < filename

7. Write a command to duplicate each line in a file?
sed 'p' < filename

8. How to extract the username from 'who am i' command?
who am i | cut -f1 -d' '

9. Write a command to list the files in '/usr' directory that start with 'ch' and then display the number of lines in each file?
wc -l /usr/ch*

Another way is 
find /usr -name 'ch*' -type f -exec wc -l {} \;

10. How to remove blank lines in a file ?
grep -v '^$' filename > new_filename

TOP UNIX INTERVIEW QUESTIONS - PART 5

1. How to display the processes that were run by your user name ?
ps -aef | grep <user_name>

2. Write a command to display all the files recursively with path under current directory?
find . -depth -print

3. Display zero byte size files in the current directory?
find . -size 0 -type f

4. Write a command to display the third and fifth character from each line of a file?
cut -c 3,5 filename

5. Write a command to print the fields from 10th to the end of the line. The fields in the line are delimited by a comma?
cut -d',' -f10- filename

6. How to replace the word "Gun" with "Pen" in the first 100 lines of a file?
sed '1,100 s/Gun/Pen/' < filename

7. Write a Unix command to display the lines in a file that do not contain the word "RAM"?
grep -v RAM filename

The '-v' option tells the grep to print the lines that do not contain the specified pattern.

8. How to print the squares of numbers from 1 to 10 using awk command
awk 'BEGIN { for(i=1;i<=10;i++) {print "square of",i,"is",i*i;}}'

9. Write a command to display the files in the directory by file size?
ls -l | grep '^-' |sort -nr -k 5

10. How to find out the usage of the CPU by the processes?

The top utility can be used to display the CPU usage by the processes.

TOP UNIX INTERVIEW QUESTIONS - PART 6

1. Write a command to remove the prefix of the string ending with '/'.

The basename utility deletes any prefix ending in /. The usage is mentioned below:
basename /usr/local/bin/file 

This will display only file

2. How to display zero byte size files?
ls -l | awk '/^-/ {if ($5 == 0) print $9}'

3. How to replace the second occurrence of the word "bat" with "ball" in a file?
sed 's/bat/ball/2' < filename

4. How to remove all the occurrences of the word "jhon" except the first one in a line, within the entire file?
sed 's/jhon//2g' < filename
Combining a number with g is a GNU sed extension.

5. How to replace the word "lite" with "light" from 100th line to last line in a file?
sed '100,$ s/lite/light/' < filename

6. How to list the files that were accessed 5 days ago in the current directory?
find . -atime 5 -type f

7. How to list the files that were modified 5 days ago in the current directory?
find . -mtime 5 -type f

8. How to list the files whose status was changed 5 days ago in the current directory?
find . -ctime 5 -type f

9. How to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename

10. Write a command to find the number of files in a directory.
ls -l|grep '^-'|wc -l

TOP UNIX INTERVIEW QUESTIONS - PART 7

1. Write a command to display your name 100 times.
The yes utility can be used to repeatedly output a line with the specified string, or 'y' if no string is given.
yes <your_name> | head -100

2. Write a command to display the first 10 characters from each line of a file?
cut -c -10 filename

3. The fields in each line are delimited by comma. Write a command to display third field from each line of a file?
cut -d',' -f3 filename

4. Write a command to print the fields from 10 to 20 from each line of a file?
cut -d',' -f10-20 filename

5. Write a command to print the first 5 fields from each line?
cut -d',' -f-5 filename

6. By default the cut command displays the entire line if there is no delimiter in it. Which cut option is used to suppress these kind of lines?

The -s option is used to suppress the lines that do not contain the delimiter.

7. Write a command to replace the word "bad" with "good" in file?
sed s/bad/good/ < filename

8. Write a command to replace the word "bad" with "good" globally in a file?
sed s/bad/good/g < filename

9. Write a command to replace the word "apple" with "(apple)" in a file?
sed 's/apple/(&)/' < filename

10. Write a command to switch the two consecutive words "apple" and "mango" in a file?
sed 's/\(apple\) \(mango\)/\2 \1/' < filename

11. Write a command to display the characters from 10 to 20 from each line of a file?
cut -c 10-20 filename
 

TOP UNIX INTERVIEW QUESTIONS - PART 8

1. Write a command to print the lines that have the pattern "july" in all the files in a particular directory?
grep july *

This will print all the lines in all files that contain the word "july" along with the file name. If any of the files contain words like "JULY" or "July", the above command would not print those lines.

2. Write a command to print the lines that have the word "july" in all the files in a directory and also suppress the file name in the output.
grep -h july *

3. Write a command to print the lines that have the word "july" while ignoring the case.
grep -i july *

The '-i' option makes grep treat the pattern as case insensitive.

4. When you use a single file as input to the grep command to search for a pattern, it won't print the filename in the output. Now write a grep command to print the file name in the output without using the '-H' option.
grep pattern filename /dev/null

/dev/null, the null device, is a special file that discards any data written to it and always reads as an empty file. Because grep is now given two files, it prints the file name for every match.
Another way to print the file name is using the '-H' option. The grep command for this is
grep -H pattern filename

5. Write a command to print the names of the files in a directory that do not contain the word "july"?
grep -L july *

The '-L' option makes grep print the names of the files that do not contain the specified pattern.

6. Write a command to print the line numbers along with the lines that have the word "july"?
grep -n july filename

The '-n' option prints the line number of each matching line. The line numbers start from 1.

7. Write a command to print the lines that start with the word "start"?
grep '^start' filename

The '^' symbol tells grep to search for the pattern at the start of the line.

8. In the text file, some lines are delimited by colon and some are delimited by space. Write a command to print the third field of each line.
awk '{ if ($0 ~ /:/) { split($0, f, ":") } else { split($0, f, " ") } print f[3] }' filename

9. Write a command to print the line number before each line?
awk '{print NR, $0}' filename

10. Write a command to print the second and third line of a file without using NR.
awk 'BEGIN {RS=""; FS="\n"; OFS="\n"} {print $2, $3; exit}' filename

11. How to create an alias for the complex command and remove the alias?

The alias utility is used to create an alias for a command. The below command creates an alias for the ps -aef command.
alias pg='ps -aef'

If you use pg, it will work the same way as ps -aef.

To remove the alias simply use the unalias command as
unalias pg

12. Write a command to display today's date in the format of 'yyyy-mm-dd'?

The date command displays today's date and time by default. A format string starting with '+' selects only the parts that are needed.
date '+%Y-%m-%d'

HADOOP FS SHELL COMMANDS

HADOOP FS SHELL COMMANDS EXAMPLES - TUTORIALS

Hadoop file system (fs) shell commands are used to perform various file operations like copying file, changing permissions, viewing the contents of the file, changing ownership of files, creating directories etc. 

The syntax of fs shell command is 
hadoop fs <args>

All the fs shell commands take path URIs as arguments. The format of a URI is scheme://authority/path. The scheme and authority are optional. For HDFS the scheme is hdfs and for the local file system the scheme is file. If you do not specify a scheme, the default scheme is taken from the configuration file. You can specify an HDFS directory either with the full URI, as hdfs://namenodehost/dir1/dir2, or simply as /dir1/dir2.

The hadoop fs commands are very similar to the unix commands. Let us see each of the fs shell commands in detail with examples:
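
To make the scheme://authority/path layout concrete, the sketch below takes a URI apart with plain shell parameter expansion. The helper name split_uri is hypothetical and the code only illustrates the three parts; it is not how Hadoop itself resolves paths.

```shell
# split_uri URI: print the scheme, authority and path of a
# scheme://authority/path URI; a bare path has an empty scheme and authority.
split_uri() {
  uri=$1
  case "$uri" in
    *://*)
      scheme=${uri%%://*}
      rest=${uri#*://}
      authority=${rest%%/*}
      case "$rest" in
        */*) path=/${rest#*/} ;;
        *)   path=/ ;;
      esac
      ;;
    *)
      scheme=; authority=; path=$uri ;;
  esac
  echo "scheme=$scheme authority=$authority path=$path"
}

split_uri hdfs://namenodehost/dir1/dir2   # scheme=hdfs authority=namenodehost path=/dir1/dir2
split_uri /dir1/dir2                      # scheme= authority= path=/dir1/dir2
```

In the second call the scheme and authority are empty, which is the case where the defaults from the configuration file apply.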


Hadoop fs Shell Commands


hadoop fs ls: 

The hadoop ls command is used to list out the directories and files. An example is shown below: 
> hadoop fs -ls /user/hadoop/employees
Found 1 items
-rw-r--r--   2 hadoop hadoop 2 2012-06-28 23:37 /user/hadoop/employees/000000_0

The above command lists out the files in the employees directory. 
> hadoop fs -ls /user/hadoop/dir
Found 1 items
drwxr-xr-x   - hadoop hadoop  0 2013-09-10 09:47 /user/hadoop/dir/products

The output of the hadoop fs ls command is very similar to that of the unix ls command. The only difference is in the second field. For a file, the second field indicates the number of replicas, and for a directory, the second field is shown as a hyphen (-).

hadoop fs lsr: 

The hadoop lsr command recursively displays the directories, subdirectories and files in the specified directory. In Hadoop 2 and later, -lsr is deprecated in favor of hadoop fs -ls -R. The usage example is shown below:
> hadoop fs -lsr /user/hadoop/dir
Found 2 items
drwxr-xr-x   - hadoop hadoop  0 2013-09-10 09:47 /user/hadoop/dir/products
-rw-r--r--   2 hadoop hadoop    1971684 2013-09-10 09:47 /user/hadoop/dir/products/products.dat

The hadoop fs lsr command is similar to the ls -R command in unix. 

hadoop fs cat: 

Hadoop cat command is used to print the contents of the file on the terminal (stdout). The usage example of hadoop cat command is shown below: 
> hadoop fs -cat /user/hadoop/dir/products/products.dat

cloudera book by amazon
cloudera tutorial by ebay

hadoop fs chgrp: 

hadoop chgrp shell command is used to change the group association of files. Optionally you can use the -R option to change recursively through the directory structure. The usage of hadoop fs -chgrp is shown below: 
hadoop fs -chgrp [-R] <NewGroupName> <file or directory name>

hadoop fs chmod: 

The hadoop chmod command is used to change the permissions of files. The -R option can be used to recursively change the permissions of a directory structure. The usage is shown below: 
hadoop fs -chmod [-R] <mode | octal mode> <file or directory name>

hadoop fs chown: 

The hadoop chown command is used to change the ownership of files. The -R option can be used to recursively change the owner of a directory structure. The usage is shown below: 
hadoop fs -chown [-R] <NewOwnerName>[:NewGroupName] <file or directory name>

hadoop fs mkdir: 

The hadoop mkdir command is for creating directories in the hdfs. You can use the -p option for creating parent directories. This is similar to the unix mkdir command. The usage example is shown below: 
> hadoop fs -mkdir /user/hadoop/hadoopdemo

The above command creates the hadoopdemo directory in the /user/hadoop directory. 
> hadoop fs -mkdir -p /user/hadoop/dir1/dir2/demo

The above command creates the dir1/dir2/demo directory in /user/hadoop directory. 

hadoop fs copyFromLocal: 

The hadoop copyFromLocal command is used to copy a file from the local file system to the hadoop hdfs. The syntax and usage example are shown below: 
Syntax:
hadoop fs -copyFromLocal <localsrc> URI

Example:

Check the data in the local file
> cat sales
2000,iphone
2001, htc

Now copy this file to hdfs

> hadoop fs -copyFromLocal sales /user/hadoop/hadoopdemo

View the contents of the hdfs file.

> hadoop fs -cat /user/hadoop/hadoopdemo/sales
2000,iphone
2001, htc

hadoop fs copyToLocal: 

The hadoop copyToLocal command is used to copy a file from the hdfs to the local file system. The syntax and usage example are shown below: 
Syntax:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Example:

hadoop fs -copyToLocal /user/hadoop/hadoopdemo/sales salesdemo

The -ignorecrc option is used to copy the files that fail the crc check. The -crc option is for copying the files along with their CRC. 

hadoop fs cp: 

The hadoop cp command is for copying the source into the target. The cp command can also be used to copy multiple files into the target. In this case the target should be a directory. The syntax is shown below: 
hadoop fs -cp /user/hadoop/SrcFile /user/hadoop/TgtFile
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 hdfs://namenodehost/user/hadoop/TgtDirectory

hadoop fs put: 

The hadoop put command copies a single source or multiple sources from the local file system to the destination file system. It can also read input from stdin and write it to the destination. The different forms of the put command are shown below: 
Syntax1: copy single file to hdfs

hadoop fs -put localfile /user/hadoop/hadoopdemo

Syntax2: copy multiple files to hdfs

hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdemo

Syntax3: Read input from stdin
hadoop fs -put - hdfs://namenodehost/user/hadoop/hadoopdemo

hadoop fs get: 

Hadoop get command copies the files from hdfs to the local file system. The syntax of the get command is shown below: 
hadoop fs -get /user/hadoop/hadoopdemo/hdfsFileName localFileName

hadoop fs getmerge: 

The hadoop getmerge command concatenates the files in the source directory into a single destination file on the local file system. The syntax of the getmerge shell command is shown below: 
hadoop fs -getmerge <src> <localdst> [addnl]

The addnl option adds a newline character at the end of each file. 
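
To make the effect of getmerge and addnl concrete, here is a rough local-file sketch in Python. It works on an ordinary local directory instead of HDFS, the function name merge_files is hypothetical, and merging in file-name order is an assumption of this sketch.

```python
from pathlib import Path


def merge_files(src_dir: str, dst_file: str, add_newline: bool = False) -> None:
    """Concatenate every regular file in src_dir into dst_file.

    Files are taken in name order (an assumption of this sketch).
    With add_newline=True a newline is written after each file,
    which mirrors what the addnl option is described as doing.
    """
    with open(dst_file, "wb") as out:
        for part in sorted(Path(src_dir).iterdir()):
            if not part.is_file():
                continue  # skip sub directories
            out.write(part.read_bytes())
            if add_newline:
                out.write(b"\n")
```

For example, a directory holding part-00000 and part-00001 would end up as one local file containing the bytes of the first part followed by the bytes of the second.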

hadoop fs moveFromLocal: 

The hadoop moveFromLocal command moves a file from local file system to the hdfs directory. It removes the original source file. The usage example is shown below: 
> hadoop fs -moveFromLocal products /user/hadoop/hadoopdemo

hadoop fs mv: 

It moves the files from source hdfs to destination hdfs. Hadoop mv command can also be used to move multiple source files into the target directory. In this case the target should be a directory. The syntax is shown below: 
hadoop fs -mv /user/hadoop/SrcFile /user/hadoop/TgtFile
hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2 hdfs://namenodehost/user/hadoop/TgtDirectory

hadoop fs du: 

The du command displays the aggregate length of the files contained in a directory, or the length of the file if the path is just a file. The syntax and usage are shown below: 
hadoop fs -du hdfs://namenodehost/user/hadoop

hadoop fs dus: 

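The hadoop dus command prints a summary of file lengths, that is, a single total for the path. In Hadoop 2 and later, -dus is deprecated in favor of hadoop fs -du -s. 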
> hadoop fs -dus hdfs://namenodehost/user/hadoop
hdfs://namenodehost/user/hadoop 21792568333

hadoop fs expunge: 

Used to empty the trash. The usage of expunge is shown below: 
hadoop fs -expunge

hadoop fs rm: 

Removes the specified list of files and empty directories. An example is shown below: 
hadoop fs -rm /user/hadoop/file

hadoop fs rmr: 

Recursively deletes the files and subdirectories. In Hadoop 2 and later, -rmr is deprecated in favor of hadoop fs -rm -r. The usage of rmr is shown below: 
hadoop fs -rmr /user/hadoop/dir

hadoop fs setrep: 

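Hadoop setrep is used to change the replication factor of a file. Use the -R option for recursively changing the replication factor of the files under a directory. The -w option makes the command wait until the replication has completed. 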
hadoop fs -setrep -w 4 -R /user/hadoop/dir

hadoop fs stat: 

Hadoop stat returns the stats information on a path. The syntax of stat is shown below: 
hadoop fs -stat URI

> hadoop fs -stat /user/hadoop/
2013-09-24 07:53:04

hadoop fs tail: 

The hadoop tail command prints the last kilobyte of the file. The -f option can be used to keep printing data as the file grows, the same as in unix. 
> hadoop fs -tail /user/hadoop/sales.dat

12345 abc
2456 xyz
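
As a rough illustration of what "the last kilobyte" means, the Python sketch below reads the final 1024 bytes of a local file. It is only an analogy for the behaviour described above, not how Hadoop implements tail, and the function name tail_last_kilobyte is hypothetical.

```python
import os


def tail_last_kilobyte(path: str, size: int = 1024) -> bytes:
    """Return the last `size` bytes of a local file (the whole file if shorter)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)       # jump to the end to learn the length
        length = f.tell()
        f.seek(max(0, length - size))  # step back at most `size` bytes
        return f.read()
```

Because the cut is made by byte count, the first line printed may start in the middle of a record.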

hadoop fs test: 

The hadoop test is used for file test operations. The syntax is shown below: 
hadoop fs -test -[ezd] URI

Here "e" checks whether the path exists, "z" checks whether the file is zero length, and "d" checks whether the path is a directory. As with other shell commands, the test command returns exit status 0 when the check succeeds and a non-zero status otherwise. 
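
Since the result of -test comes back as an exit status, a script has to inspect that status rather than any printed output. Below is a minimal Python sketch, assuming the hadoop command is on the PATH. The helper name hdfs_test and its run parameter are made up for this example; an exit status of 0 is treated as "the check passed".

```python
import subprocess
from typing import Callable, Sequence


def hdfs_test(flag: str, path: str,
              run: Callable[[Sequence[str]], int] = subprocess.call) -> bool:
    """Run `hadoop fs -test -<flag> <path>` and return True on exit status 0.

    `run` is injectable so the helper can be tried without a cluster;
    by default it is subprocess.call, which returns the exit status.
    """
    if flag not in ("e", "z", "d"):
        raise ValueError(f"unsupported test flag: {flag!r}")
    return run(["hadoop", "fs", "-test", f"-{flag}", path]) == 0


# Example (requires a working Hadoop client):
# if hdfs_test("d", "/user/hadoop/hadoopdemo"):
#     print("directory exists")
```

Passing a stand-in for run makes it easy to check the calling code on a machine without Hadoop.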

hadoop fs text: 

The hadoop text command displays the source file in text format. The allowed source file formats are zip and TextRecordInputStream. The syntax is shown below: 
hadoop fs -text <src>

hadoop fs touchz: 

The hadoop touchz command creates a zero byte file. This is similar to the touch command in unix. The syntax is shown below: 
hadoop fs -touchz /user/hadoop/filename

Hive Interview Questions

What is Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive was originally developed at Facebook. It is now an Apache project with many contributors. Users can concentrate on the high-level Hive language rather than on Java MapReduce programs. One of the main advantages of Hive is its SQL-like nature, which improves its usability to a great extent.


A hive program will be automatically compiled into map-reduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts to be plugged into queries.

Hive example:
Selecting the names of employees whose salary is more than 100 dollars from a Hive table called tbl_employee:
SELECT employee_name FROM tbl_employee WHERE salary > 100;
Users are excited to use Hive since it is very similar to SQL.
 

What are the types of tables in Hive?
There are two types of tables.
1. Managed tables.
2. External tables.

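The main difference between managed and external tables shows up with the DROP TABLE command: dropping a managed table deletes both the metadata and the underlying data, while dropping an external table deletes only the metadata and leaves the data files in place. Otherwise, both types of tables are very similar.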

Does Hive support record level Insert, delete or update?
Hive does not provide record-level update, insert, or delete, and consequently it does not provide transactions either. However, users can combine CASE statements and the built-in functions of Hive while rewriting a table to achieve the effect of these DML operations. As a result, a complex update query in an RDBMS may need many lines of code in Hive.

What kind of datawarehouse application is suitable for Hive?
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.


Hive is most suited for data warehouse applications, where
1) Relatively static data is analyzed,
2) Fast response times are not required, and
3) The data is not changing rapidly.


Hive doesn’t provide crucial features required for OLTP (Online Transaction Processing). It’s closer to being an OLAP (Online Analytic Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

How can the columns of a table in hive be written to a file?
By using the awk command in the shell, the column names from the HiveQL DESCRIBE output can be written to a file.
hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output
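
The same first-column extraction can be sketched in Python, which may help if the DESCRIBE output is being post-processed in a script anyway. The sample text and the function name column_names are invented for this illustration. Skipping lines that begin with "#" is an assumption meant to ignore header lines such as those printed for partitioned tables.

```python
from typing import List


def column_names(describe_output: str) -> List[str]:
    """Return the first whitespace-separated field of each non-empty line."""
    names = []
    for line in describe_output.splitlines():
        fields = line.split()  # splits on tabs and spaces, like awk's default
        if fields and not fields[0].startswith("#"):
            names.append(fields[0])
    return names


sample = "id\tint\t\nemployee_name\tstring\tname of the employee\n"
print(column_names(sample))  # ['id', 'employee_name']
```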



CONCAT function in Hive with Example?
The CONCAT function concatenates the input strings. You can specify any number of strings separated by commas.

Example:
CONCAT ('Hive','-','performs','-','good','-','in','-','Hadoop');

Output:
Hive-performs-good-in-Hadoop

Here you have to repeat the delimiter '-' between every pair of strings. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter once as the first argument.

CONCAT_WS ('-','Hive','performs','good','in','Hadoop');
Output: Hive-performs-good-in-Hadoop

REPEAT function in Hive with example?

The REPEAT function repeats the input string the number of times given as the second argument.

Example:
REPEAT('Hadoop',3);

Output:
HadoopHadoopHadoop

Note: To separate the repetitions, you can include a space in the input string, for example REPEAT('Hadoop ',3).


TRIM function in Hive with example?

The TRIM function removes the leading and trailing spaces from a string.

Example:
TRIM('  Hadoop  ');

Output:
Hadoop

Note: If you want to remove only leading or trailing spaces, you can use the commands below respectively.
LTRIM('  Hadoop');
RTRIM('Hadoop  ');

REVERSE function in Hive with example?
REVERSE function will reverse the characters in a string.

Example:
REVERSE('Hadoop');

Output:
poodaH

LOWER or LCASE function in Hive with example?
LOWER or LCASE function will convert the input string to lower case characters.

Example:
LOWER('Hadoop');
LCASE('Hadoop');

Output:
hadoop

Note:
If the characters are already in lower case then they will be preserved.

UPPER or UCASE function in Hive with example?
UPPER or UCASE function will convert the input string to upper case characters.

Example:
UPPER('Hadoop');
UCASE('Hadoop');

Output:
HADOOP

Note:
If the characters are already in upper case then they will be preserved.

Double type in Hive – Important points?
It is important to know how the double type behaves in Hive. Hive may present double values differently from an RDBMS, using scientific notation for some values.
See the double type data below:
24624.0
32556.0
3.99893E5
4366.0

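E5 represents 10^5 here. So, the value 3.99893E5 represents 399893. Calculations on the double type follow IEEE 754 double-precision arithmetic, which carries about 15 to 17 significant decimal digits, so results can be approximate for many decimal fractions. The maximum finite value of an IEEE 754 double is about 1.8E308.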


This matters when exporting double data to an RDBMS, since values in scientific notation may be interpreted wrongly. So, it is advisable to cast the double type to an appropriate type before exporting.
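
Python floats are IEEE 754 doubles on mainstream platforms, so a few lines of Python are enough to check the points above. This is only an illustration of double behaviour in general, not of Hive's own formatting code.

```python
import sys

value = float("3.99893E5")        # scientific notation, as in the listing above
print(value == 399893)            # True: E5 means "times 10^5"
print(f"{value:.2f}")             # 399893.00, a fixed-point rendering for export
print(0.1 + 0.2 == 0.3)           # False: doubles are binary approximations
print(sys.float_info.max)         # 1.7976931348623157e+308
```

Rendering the value in fixed-point form, as the second print does, is the kind of conversion that avoids misinterpretation on the receiving side.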

Rename a table in Hive – How to do it?
Using the ALTER command, we can rename a table in Hive.
ALTER TABLE hive_table_name RENAME  TO new_name;

There is another way to rename a table in Hive. Sometimes, ALTER may take more time if the underlying table has many partitions. In that case, the EXPORT and IMPORT options can be used. Here you save the Hive data into HDFS and import it back into a new table, as below.
EXPORT TABLE tbl_name TO 'HDFS_location';
IMPORT TABLE new_tbl_name FROM 'HDFS_location';

If you prefer to just preserve the data, you can create a new table from old table like below.
CREATE TABLE new_tbl_name AS SELECT * FROM old_tbl_name;
DROP TABLE old_tbl_name;

How to change a column data type in Hive?
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
Example: If you want to change the data type of ID column from integer to bigint in a table called employee.
ALTER TABLE employee CHANGE id id BIGINT;

Difference between order by and sort by in hive?
SORT BY will sort the data within each reducer. You can use any number of reducers for SORT BY operation.
ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER BY in hive uses single reducer.
ORDER BY guarantees total order in the output, while SORT BY only guarantees ordering of the rows within a reducer. If there is more than one reducer, SORT BY may give partially ordered final results.
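
The difference can be simulated in a few lines of Python. This is a toy model only: rows are plain integers, and the round-robin assignment of rows to reducers is an assumption made for the example (Hive decides the distribution itself). The function names sort_by and order_by are just labels for the two behaviours.

```python
def sort_by(rows, num_reducers):
    """Sort rows independently inside each (simulated) reducer."""
    # Hypothetical distribution: row i goes to reducer i % num_reducers.
    partitions = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(p) for p in partitions]


def order_by(rows):
    """Sort all rows together, as a single reducer would."""
    return sorted(rows)


rows = [5, 1, 4, 2, 3, 6]
print(sort_by(rows, 2))  # [[3, 4, 5], [1, 2, 6]]
print(order_by(rows))    # [1, 2, 3, 4, 5, 6]
```

Each reducer's output is sorted, but reading the two outputs one after the other gives 3, 4, 5, 1, 2, 6, which is only partially ordered.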

RLIKE in Hive?
RLIKE is the regular-expression version of LIKE in Hive: A RLIKE B evaluates to true if any substring of A matches the Java regular expression B. Users don't need to put the % symbol for a simple match in RLIKE.

Examples:
'Express' RLIKE 'Exp' --> True
'Express' RLIKE '^E.*' --> True (Regular expression)

Moreover, RLIKE comes in handy when the string has some extra spaces, because a substring match succeeds without using the TRIM function. Suppose A has the value 'Express  ' (with 2 additional trailing spaces) and B has the value 'Express':
'Express  ' RLIKE 'Express' --> True

Note:
RLIKE evaluates to NULL if A or B is NULL.
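
The semantics described above can be approximated in Python with re.search, which also succeeds when any substring matches. Hive uses Java regular expressions, so this is a sketch that assumes the pattern uses syntax common to both. The function name rlike is chosen for the example, and None stands in for NULL.

```python
import re
from typing import Optional


def rlike(a: Optional[str], b: Optional[str]) -> Optional[bool]:
    """Approximate A RLIKE B: True if any substring of a matches pattern b."""
    if a is None or b is None:
        return None  # NULL in, NULL out
    return re.search(b, a) is not None


print(rlike("Express", "Exp"))        # True
print(rlike("Express", "^E.*"))       # True
print(rlike("Express  ", "Express"))  # True, trailing spaces do not matter
print(rlike("Express", "^press"))     # False
print(rlike(None, "Exp"))             # None
```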









 