Showing posts with label Big Data Interview Questions and Answers. Show all posts

Thursday 5 December 2013

Hadoop admin interview question and answers

Which operating system(s) are supported for production Hadoop deployment?
The main supported operating system is Linux. However, with some additional software Hadoop can be deployed on Windows.

What is the role of the namenode?
The namenode is the "brain" of the Hadoop cluster and is responsible for managing the distribution of blocks on the system based on the replication policy. The namenode also supplies the specific addresses of the data in response to client requests.

What happens on the namenode when a client tries to read a data file?
The namenode will look up the information about the file in the edit log and then retrieve the remaining information from the in-memory filesystem snapshot. Since the namenode needs to support a large number of clients, the primary namenode will only send back the location of the data. The datanode itself is responsible for the retrieval.

What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?
There are no requirements for datanodes. However, the namenodes require a specified amount of RAM to store the filesystem image in memory. Based on the design of the primary namenode and secondary namenode, the entire filesystem information will be stored in memory. Therefore, both namenodes need to have enough memory to contain the entire filesystem image.


What mode(s) can Hadoop code be run in?
Hadoop can be deployed in standalone mode, pseudo-distributed mode or fully-distributed mode. Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine and as a single process for testing purposes.


How would a Hadoop administrator deploy various components of Hadoop in production?
Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes. Only one namenode and one jobtracker are needed on the system. The number of datanodes depends on the available hardware.



What is the best practice to deploy the secondary namenode?
Deploy the secondary namenode on a separate standalone machine, so that it does not interfere with primary namenode operations. The secondary namenode has the same memory requirements as the main namenode.



Is there a standard procedure to deploy Hadoop?
No, there are some differences between various distributions. However, they all require that Hadoop jars be installed on the machine. There are some common requirements for all Hadoop distributions, but the specific procedures will differ between vendors since they all have some degree of proprietary software.

What is the role of the secondary namenode?
The secondary namenode performs the CPU-intensive operation of combining edit logs and current filesystem snapshots. It was separated out as a process because of these CPU-intensive operations and the additional requirements for metadata back-up.


What are the side effects of not running a secondary name node?
The cluster performance will degrade over time since edit log will grow bigger and bigger. If the secondary namenode is not running at all, the edit log will grow significantly and it will slow the system down. Also, the system will go into safemode for an extended time since the namenode needs to combine the edit log and the current filesystem checkpoint image.



What happens if a datanode loses network connection for a few minutes?
The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted. The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node. Once a datanode is considered unavailable, the namenode triggers replication of the data from the existing replicas. However, if the datanode comes back up, over-replicated data will be deleted. Note: the data might be deleted from the original datanode.


What happens if one of the datanodes has a much slower CPU?
The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact. Hadoop was specifically designed to work with commodity hardware, and speculative execution helps to offset the slow workers. Multiple instances of the same task will be created; the jobtracker will take the first result into consideration and the second instance of the task will be killed.



What is speculative execution?
If speculative execution is enabled, the job tracker will issue multiple instances of the same task on multiple nodes and it will take the result of the task that finished first. The other instances of the task will be killed.
The speculative execution is used to offset the impact of the slow workers in the cluster. The jobtracker creates multiple instances of the same task and takes the result of the first successful task. The rest of the tasks will be discarded.


How many racks do you need to create a Hadoop cluster in order to make sure that the cluster operates reliably?
In order to ensure reliable operation it is recommended to have at least 2 racks with rack placement configured. Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.

Are there any special requirements for namenode?
Yes, the namenode holds information about all files in the system and needs to be extra reliable. The namenode is a single point of failure, so its metadata needs to be replicated in multiple places. Note that the community is working on solving the single point of failure issue with the namenode.

If you have a file 128M size and replication factor is set to 3, how many blocks can you find on the cluster that will correspond to that file (assuming the default apache and cloudera configuration)?
6
Based on the configuration settings the file will be divided into multiple blocks according to the default block size of 64M. 128M / 64M = 2 . Each block will be replicated according to replication factor settings (default 3). 2 * 3 = 6 .
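The arithmetic above can be sketched in shell. This is only an illustration of the calculation, using the 64M block size and replication factor of 3 stated in the answer; the variable names are made up for the example.

```shell
# Values from the question; 64 MB is the default block size assumed in the answer.
file_mb=128
block_mb=64
replication=3

# Round up so that a partial final block still counts as one block.
blocks=$(( (file_mb + block_mb - 1) / block_mb ))

# Every block is stored once per replica.
total_replicas=$(( blocks * replication ))

echo "blocks=$blocks total_replicas=$total_replicas"
```

With a different block size setting the same formula applies; only block_mb changes.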

What is distributed copy (distcp)?
Distcp is a Hadoop utility for launching MapReduce jobs to copy data. Its primary usage is copying large amounts of data. One of the major challenges in the Hadoop environment is copying data across multiple clusters, and distcp allows multiple datanodes to be leveraged for parallel copying of the data.

What is replication factor?
Replication factor controls how many times each individual block is replicated.
Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor improves data availability in the event of failure.

What daemons run on Master nodes?
NameNode, Secondary NameNode and JobTracker
Hadoop is comprised of five separate daemons, and each of these daemons runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on Master nodes. DataNode and TaskTracker run on each Slave node.

What is rack awareness?
Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions. Hadoop will try to keep network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this due to rack awareness.



What is the role of the jobtracker in a Hadoop cluster?
The jobtracker is responsible for scheduling tasks on slave nodes, collecting results and retrying failed tasks. The jobtracker is the main component of map-reduce execution. It controls the division of the job into smaller tasks, submits tasks to individual tasktrackers, tracks the progress of the jobs and reports results back to the calling code.

How does the Hadoop cluster tolerate datanode failures?
Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
The namenode actively tracks the status of all datanodes and acts immediately if the datanodes become non-responsive. The namenode is the central "brain" of the HDFS and starts replication of the data the moment a disconnect is detected.

What is the procedure for namenode recovery?
A namenode can be recovered in two ways: starting new namenode from backup metadata or promoting secondary namenode to primary namenode.
The namenode recovery procedure is very important to ensure the reliability of the data. It can be accomplished by starting a new namenode using backup data or by promoting the secondary namenode to primary.


Web-UI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?
This means that the namenode is trying to retrieve data from those datanodes by moving replicas to the remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes before decommissioning has finished.
Due to replication strategy it is possible to lose some data due to datanodes removal en masse prior to completing the decommissioning process. Decommissioning refers to namenode trying to retrieve data from datanodes by moving replicas to remaining datanodes.

What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?
Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.
Hadoop cluster will detect new datanodes automatically. However, in order to optimize the cluster performance it is recommended to start rebalancer to redistribute the data between datanodes evenly.

If the Hadoop administrator needs to make a change, which configuration file does he need to change?
Each node in the Hadoop cluster has its own configuration files and the changes need to be made in every file. One of the reasons for this is that configuration can be different for every node.



Map Reduce jobs are failing on a cluster that was just restarted. They worked before restart. What could be wrong?
The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.
This is a very common mistake by Hadoop administrators when there is no secondary namenode on the cluster and the cluster has not been restarted in a long time. The namenode will go into safemode and combine the edit log and the current filesystem checkpoint image.

Map Reduce jobs take too long. What can be done to improve the performance of the cluster?
One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of the tasks. The number of tasks has to match the number of available slots on the cluster.
Hadoop is not a hardware aware system. It is the responsibility of the developers and the administrators to make sure that the resource supply and demand match.

How often do you need to reformat the namenode?
Never. The namenode needs to be formatted only once, in the beginning. Reformatting the namenode will lead to loss of the data on the entire file system.
The namenode is the only component that needs to be formatted, and only once. Formatting creates the directory structure for file system metadata and creates the namespaceID for the entire file system.

After increasing the replication level, I still see that data is under replicated. What could be wrong?
Data replication takes time due to large quantities of data. The Hadoop administrator should allow sufficient time for data replication.
Depending on the data size the data replication will take some time. Hadoop cluster still needs to copy data around and if data size is big enough it is not uncommon that replication will take from a few minutes to a few hours.

Wednesday 9 October 2013

UNIX INTERVIEW QUESTIONS

UNIX INTERVIEW QUESTIONS ON AWK COMMAND

Awk is a powerful tool in Unix. It is an excellent tool for processing files that have data arranged in rows and columns. It is a good filter and report writer.
1. How to run awk command specified in a file?
awk -f filename

2. Write a command to print the squares of numbers from 1 to 10 using awk command
awk 'BEGIN { for(i=1;i<=10;i++) {print "square of",i,"is",i*i;}}'

3. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'
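The summing command can be tried on a scratch directory. The sketch below assumes a temporary directory created with mktemp and two made-up files of known size; the "total" line that ls -l prints has no fifth field, so it contributes 0 to the sum.

```shell
# Build a scratch directory with two files of known size (names are illustrative).
dir=$(mktemp -d)
printf 'abc' > "$dir/a.txt"      # 3 bytes
printf 'hello' > "$dir/b.txt"    # 5 bytes

# Field 5 of "ls -l" output is the file size in bytes.
total=$(ls -l "$dir" | awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}')
echo "$total"

rm -rf "$dir"
```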

4. In the text file, some lines are delimited by colon and some are delimited by space. Write a command to print the third field of each line.

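awk '{ if ($0 ~ /:/) { split($0, f, ":") } else { split($0, f, " ") } print f[3] }' filename

Note: assigning FS inside the action only takes effect from the next record, so the current line has to be split explicitly with split().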
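A mixed-delimiter command is easy to check on a small sample. The sketch below assumes a hypothetical two-line file, one line colon-delimited and one space-delimited, and splits each line explicitly with split() so the result does not depend on when awk applies a changed FS.

```shell
# Sample input: one colon-delimited line and one space-delimited line.
sample=$(mktemp)
printf 'a:b:c:d\nw x y z\n' > "$sample"

# Choose the separator per line and split the line into the array f.
third=$(awk '{
  if ($0 ~ /:/) { split($0, f, ":") } else { split($0, f, " ") }
  print f[3]
}' "$sample")
echo "$third"

rm -f "$sample"
```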

5. Write a command to print the line number before each line?
awk '{print NR, $0}' filename

6. Write a command to print the second and third line of a file without using NR.
awk 'BEGIN {RS="";FS="\n"} {print $2,$3}' filename

7. Write a command to print zero byte size files?
ls -l | awk '/^-/ {if ($5 == 0) print $9 }'

8. Write a command to rename the files in a directory with "_new" as postfix?
ls | awk '{print "mv "$1" "$1"_new"}' | sh

9. Write a command to print the fields in a text file in reverse order?
awk 'BEGIN {ORS=""} { for(i=NF;i>0;i--) print $i," "; print "\n"}' filename

10. Write a command to find the total number of lines in a file without using NR
awk 'BEGIN {sum=0} {sum=sum+1} END {print sum}' filename

Another way to print the number of lines is by using the NR. The command is
awk 'END{print NR}' filename


UNIX INTERVIEW QUESTIONS ON GREP COMMAND

Grep is one of the most powerful tools in Unix. The name comes from the ed command g/re/p, "globally search for a regular expression and print". The power of grep lies mostly in its use of regular expressions.

The general syntax of grep command is
grep [options] pattern [files]

1. Write a command to print the lines that have the pattern "july" in all the files in a particular directory?

grep july *
This will print all the lines in all files that contain the word “july” along with the file name. If any of the files contain words like "JULY" or "July", the above command would not print those lines.

2. Write a command to print the lines that has the word "july" in all the files in a directory and also suppress the filename in the output.

grep -h july *

3. Write a command to print the lines that has the word "july" while ignoring the case.

grep -i july *
The -i option makes the grep command treat the pattern as case insensitive.

4. When you use a single file as input to the grep command to search for a pattern, it won't print the filename in the output. Now write a grep command to print the filename in the output without using the '-H' option.
grep pattern filename /dev/null
The /dev/null or null device is a special file that discards the data written to it. So, /dev/null is always an empty file.

Another way to print the filename is using the '-H' option. The grep command for this is
grep -H pattern filename

5. Write a Unix command to display the lines in a file that do not contain the word "july"?
grep -v july filename
The '-v' option tells the grep to print the lines that do not contain the specified pattern.

6. Write a command to print the file names in a directory that has the word "july"?
grep -l july *
The '-l' option makes the grep command print only the file name without printing the content of the file. As soon as the grep command finds the pattern in a file, it prints the file name and stops searching the remaining lines of that file.

7. Write a command to print the file names in a directory that does not contain the word "july"?
grep -L july *
The '-L' option makes the grep command to print the filenames that do not contain the specified pattern.

8. Write a command to print the line numbers along with the line that has the word "july"?
grep -n july filename
The '-n' option is used to print the line numbers in a file. The line numbers start from 1

9. Write a command to print the lines that starts with the word "start"?
grep '^start' filename
The '^' symbol specifies the grep command to search for the pattern at the start of the line.

10. Write a command to print the lines which end with the word "end"?
grep 'end$' filename
The '$' symbol specifies the grep command to search for the pattern at the end of the line.

11. Write a command to select only those lines containing "july" as a whole word?
grep -w july filename
The '-w' option makes the grep command search for exact whole words. If the specified pattern is found only as part of a longer string, it is not considered a whole word. For example: in the string "mikejulymak", the pattern "july" is found. However, "july" is not a whole word in that string.
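The difference between these options is easiest to see side by side. The sketch below assumes a made-up three-line sample file and runs the -w, -i (with -c to count) and -n options against it.

```shell
# Sample file: "july" inside a longer string, as a whole word, and in upper case.
sample=$(mktemp)
printf 'mikejulymak\nin july we travel\nJULY sale\n' > "$sample"

# -w keeps only the line where "july" stands alone as a word.
whole_word=$(grep -w july "$sample")

# -i ignores case; -c prints the number of matching lines instead of the lines.
any_case_count=$(grep -ic july "$sample")

# -n prefixes each matching line with its line number.
line_numbers=$(grep -n july "$sample")

echo "$whole_word"
echo "$any_case_count"
echo "$line_numbers"

rm -f "$sample"
```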


UNIX INTERVIEW QUESTIONS ON SED COMMAND

SED is a special editor used for modifying files automatically.

1. Write a command to replace the word "bad" with "good" in file?
sed s/bad/good/ < filename

2. Write a command to replace the word "bad" with "good" globally in a file?
sed s/bad/good/g < filename

3. Write a command to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename

4. Write a command to replace the word "apple" with "(apple)" in a file?
sed 's/apple/(&)/' < filename

5. Write a command to switch the two consecutive words "apple" and "mango" in a file?
sed 's/\(apple\) \(mango\)/\2 \1/' < filename

6. Write a command to replace the second occurrence of the word "bat" with "ball" in a file?
sed 's/bat/ball/2' < filename
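The three substitution features used in the last few answers (the & placeholder, \1 and \2 back-references, and a numeric occurrence flag) can be checked on single lines of input. This is a small sketch with made-up input strings; the expressions are quoted so the shell does not interpret the parentheses or the ampersand.

```shell
# & stands for the whole matched text.
wrapped=$(printf 'one apple here\n' | sed 's/apple/(&)/')

# \( \) capture groups; \1 and \2 refer back to them in the replacement.
swapped=$(printf 'apple mango\n' | sed 's/\(apple\) \(mango\)/\2 \1/')

# A trailing number replaces only that occurrence on each line.
second=$(printf 'bat bat bat\n' | sed 's/bat/ball/2')

echo "$wrapped"
echo "$swapped"
echo "$second"
```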

7. Write a command to remove all the occurrences of the word "jhon" except the first one in a line with in the entire file?
sed 's/jhon//2g' < filename

8. Write a command to remove the first number on line 5 in file?
sed '5 s/[0-9][0-9]*//' < filename

9. Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename

10. Write a command to replace the word "gum" with "drum" in the first 100 lines of a file?
sed '1,100 s/gum/drum/' < filename

11. write a command to replace the word "lite" with "light" from 100th line to last line in a file?
sed '100,$ s/lite/light/' < filename

12. Write a command to remove the first 10 lines from a file?
sed '1,10 d' < filename

13. Write a command to duplicate each line in a file?
sed 'p' < filename

14. Write a command to duplicate empty lines in a file?
sed '/^$/ p' < filename

15. Write a sed command to print the lines that do not contain the word "run"?
sed -n '/run/!p' < filename


UNIX INTERVIEW QUESTIONS ON CUT COMMAND

The cut command is used to display selected columns or fields from each line of a file. Cut command works in two modes:
  • Delimited selection: The fields in the line are delimited by a single character like blank,comma etc.
  • Range selection: Each field starts with certain fixed offset defined as range.
1. Write a command to display the third and fourth character from each line of a file?
cut -c 3,4 filename

2. Write a command to display the characters from 10 to 20 from each line of a file?
cut -c 10-20 filename

3. Write a command to display the first 10 characters from each line of a file?
cut -c -10 filename

4. Write a command to display from the 10th character to the end of the line?
cut -c 10- filename

5. The fields in each line are delimited by comma. Write a command to display third field from each line of a file?
cut -d',' -f3 filename

6. Write a command to print the fields from 10 to 20 from each line of a file?
cut -d',' -f10-20 filename

7. Write a command to print the first 5 fields from each line?
cut -d',' -f-5 filename

8. Write a command to print the fields from 10th to the end of the line?
cut -d',' -f10- filename

9. By default the cut command displays the entire line if there is no delimiter in it. Which cut option is used to suppress these kinds of lines?
The -s option is used to suppress the lines that do not contain the delimiter.
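The field, character and -s behaviours described in this section can be verified on literal input. The sketch below uses made-up strings piped through printf rather than a real file.

```shell
line='a,b,c,d,e'

# Field selection with a comma delimiter: the third field, then fields up to 2.
third=$(printf '%s\n' "$line" | cut -d',' -f3)
first_two=$(printf '%s\n' "$line" | cut -d',' -f-2)

# Character selection: the third and fourth characters.
chars=$(printf 'abcdefghij\n' | cut -c 3,4)

# With -s, a line that has no delimiter is dropped instead of printed whole.
no_delim=$(printf 'no delimiter here\n' | cut -s -d',' -f1)

echo "$third $first_two $chars [$no_delim]"
```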

10. Write a cut command to extract the username from 'who am i' command?
who am i | cut -f1 -d' '


UNIX INTERVIEW QUESTIONS ON FIND COMMAND

Find utility is used for searching files using the directory information.

1. Write a command to search for the file 'test' in the current directory?
find . -name test -type f

2. Write a command to search for the file 'temp' in '/usr' directory?
find /usr -name temp -type f

3. Write a command to search for zero byte size files in the current directory?
find . -size 0 -type f

4. Write a command to list the files that are accessed 5 days ago in the current directory?
find . -atime 5 -type f

5. Write a command to list the files that were modified 5 days ago in the current directory?
find . -mtime 5 -type f

6. Write a command to search for the files in the current directory which are not owned by any user in the /etc/passwd file?
find . -nouser -type f

7. Write a command to search for the files in '/usr' directory that start with 'te'?
find /usr -name 'te*' -type f

8. Write a command to search for the files that start with 'te' in the current directory and then display the contents of the file?
find . -name 'te*' -type f -exec cat {} \;

9. Write a command to list the files whose status is changed 5 days ago in the current directory?
find . -ctime 5 -type f

10. Write a command to list the files in '/usr' directory that start with 'ch' and then display the number of lines in each file?
find /usr -name 'ch*' -type f -exec wc -l {} \;
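The -size, -name and -exec tests can be tried safely in a scratch directory instead of /usr. The sketch below assumes a temporary directory from mktemp and two made-up files, one empty and one whose name starts with "te".

```shell
# Scratch directory with one empty file and one file starting with "te".
dir=$(mktemp -d)
: > "$dir/empty.log"
printf 'hello\n' > "$dir/test"

# -size 0 matches files that occupy zero blocks, i.e. empty files.
zero_byte=$(find "$dir" -size 0 -type f)

# -exec runs the given command once per match; {} is replaced by the path.
contents=$(find "$dir" -name 'te*' -type f -exec cat {} \;)

echo "$zero_byte"
echo "$contents"

rm -rf "$dir"
```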


TOP UNIX INTERVIEW QUESTIONS - PART 1

1. How to display the 10th line of a file?
head -10 filename | tail -1

2. How to remove the header from a file?
sed -i '1 d' filename

3. How to remove the footer from a file?
sed -i '$ d' filename

4. Write a command to find the length of a line in a file?

The below command can be used to get a line from a file.
sed -n '<n> p' filename

We will see how to find the length of 10th line in a file
sed -n '10 p' filename|wc -c
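One detail worth knowing: wc -c also counts the trailing newline, so the figure it reports is one more than the number of visible characters. The sketch below compares it with awk's length() on a made-up two-line file; the tr call only strips padding that some wc implementations add.

```shell
sample=$(mktemp)
printf 'first\nsecond line\n' > "$sample"

# sed prints line 2 including its newline, so wc -c reports 11 + 1.
with_newline=$(sed -n '2 p' "$sample" | wc -c | tr -d ' ')

# length($0) counts only the characters of the line itself.
without_newline=$(awk 'NR == 2 { print length($0) }' "$sample")

echo "$with_newline $without_newline"

rm -f "$sample"
```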

5. How to get the nth word of a line in Unix?
cut -f<n> -d' '

6. How to reverse a string in unix?
echo "java" | rev

7. How to get the last word from a line in Unix file?
echo "unix is good" | rev | cut -f1 -d' ' | rev

8. How to replace the n-th line in a file with a new line in Unix?
sed -i'' '10 d' filename      # d stands for delete
sed -i'' '10 i new inserted line' filename    # i stands for insert

9. How to check if the last command was successful in Unix?
echo $?
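The value of $? is the exit status of the most recent command: 0 means success and any other value means failure. A small sketch, using ls and grep purely as example commands:

```shell
# A command that succeeds leaves 0 in $?.
ls / > /dev/null 2>&1
ok_status=$?

# grep exits with 1 when no line matches; /dev/null is always empty.
miss_status=0
grep -q july /dev/null || miss_status=$?

echo "$ok_status $miss_status"
```

Because $? is overwritten by every command, it has to be read or saved immediately after the command of interest.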

10. Write command to list all the links from a directory?
ls -lrt | grep "^l"

11. How will you find which operating system your system is running on in UNIX?
uname -a

12. Create a read-only file in your home directory?
touch file; chmod 400 file

13. How do you see command line history in UNIX?

The 'history' command can be used to get the list of commands that have been executed.

14. How to display the first 20 lines of a file?

By default, the head command displays the first 10 lines from a file. If we change the option of head, then we can display as many lines as we want.
head -20 filename

An alternative solution is using the sed command
sed '21,$ d' filename

The d command here deletes the lines from 21 to the end of the file.

15. Write a command to print the last line of a file?

The tail command can be used to display the last lines from a file.
tail -1 filename

Alternative solutions are:
sed -n '$ p' filename
awk 'END{print $0}' filename


TOP UNIX INTERVIEW QUESTIONS - PART 2

1. How do you rename the files in a directory with _new as suffix?
ls -lrt|grep '^-'| awk '{print "mv "$9" "$9"_new"}' | sh

2. Write a command to convert a string from lower case to upper case?
echo "apple" | tr 'a-z' 'A-Z'

3. Write a command to convert a string to Initcap.
echo apple | awk '{print toupper(substr($1,1,1)) tolower(substr($1,2))}'
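Both case conversions can be checked on literal strings. In this sketch the tr ranges are quoted so the shell cannot expand the brackets as a file name pattern, and the Initcap input is deliberately mixed case (an illustrative choice) to show that the tail of the word is lowered.

```shell
# Upper-case every letter.
upper=$(echo "apple" | tr 'a-z' 'A-Z')

# Upper-case the first character and lower-case the rest.
initcap=$(echo "aPPLE" | awk '{print toupper(substr($1,1,1)) tolower(substr($1,2))}')

echo "$upper $initcap"
```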

4. Write a command to redirect the output of date command to multiple files?

The tee command writes the output to multiple files and also displays the output on the terminal.
date | tee -a file1 file2 file3

5. How do you list the hidden files in current directory?
ls -a | grep '^\.'

6. List out some of the Hot Keys available in bash shell? 
  • Ctrl+l - Clears the Screen.
  • Ctrl+r - Does a search in previously given commands in shell.
  • Ctrl+u - Clears the typing before the hotkey.
  • Ctrl+a - Places cursor at the beginning of the command at shell.
  • Ctrl+e - Places cursor at the end of the command at shell.
  • Ctrl+d - Sends end-of-file; on an empty prompt this exits the shell.
  • Ctrl+z - Suspends the currently running process; use bg to resume it in the background.

7. How do you make an existing file empty?
cat /dev/null >  filename

8. How do you remove the first number on 10th line in file?
sed '10 s/[0-9][0-9]*//' < filename

9. What is the difference between join -v and join -a?
join -v FILENUM : outputs only the unpairable (unmatched) lines from the given file and suppresses the matched lines.
join -a FILENUM : in addition to the matched lines, outputs the unpairable lines from the given file.
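The two options can be compared on a pair of small sorted files. The sketch below assumes two made-up files that share only the key 2; both options take a file number (1 or 2) as their argument.

```shell
a=$(mktemp)
b=$(mktemp)
printf '1 apple\n2 banana\n' > "$a"
printf '2 yellow\n3 green\n' > "$b"

# Plain join: only lines whose first field appears in both files.
matched=$(join "$a" "$b")

# -v 1: only the lines of file 1 that have no partner in file 2.
only_unpaired=$(join -v 1 "$a" "$b")

# -a 1: the matched lines plus the unpaired lines of file 1.
with_unpaired=$(join -a 1 "$a" "$b")

echo "$matched"
echo "$only_unpaired"
echo "$with_unpaired"

rm -f "$a" "$b"
```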

10. How do you display from the 5th character to the end of the line from a file?
cut -c 5- filename


TOP UNIX INTERVIEW QUESTIONS - PART 3

1. Display all the files in current directory sorted by size?
ls -l | grep '^-' | awk '{print $5,$9}' |sort -n|awk '{print $2}'

2. Write a command to search for the file 'map' in the current directory?
find . -name map -type f

3. How to display the first 10 characters from each line of a file?
cut -c -10 filename

4. Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename

5. How to print the file names in a directory that has the word "term"?
grep -l term *

The '-l' option makes the grep command print only the file name without printing the content of the file. As soon as the grep command finds the pattern in a file, it prints the file name and stops searching the remaining lines of that file.

6. How to run awk command specified in a file?
awk -f filename

7. How do you display the calendar for the month march in the year 1985?

The cal command can be used to display the current month calendar. You can pass the month and year as arguments to display the required year, month combination calendar.
cal 03 1985

This will display the calendar for March 1985.

8. Write a command to find the total number of lines in a file?
wc -l filename

Other ways to print the total number of lines are
awk 'BEGIN {sum=0} {sum=sum+1} END {print sum}' filename
awk 'END{print NR}' filename

9. How to duplicate empty lines in a file?
sed '/^$/ p' < filename

10. Explain iostat, vmstat and netstat?
  • Iostat: reports on terminal, disk and tape I/O activity.
  • Vmstat: reports on virtual memory statistics for processes, disk, tape and CPU activity.
  • Netstat: reports on the contents of network data structures.



TOP UNIX INTERVIEW QUESTIONS - PART 4

1. How do you write the contents of 3 files into a single file?
cat file1 file2 file3 > file

2. How to display the fields in a text file in reverse order?
awk 'BEGIN {ORS=""} { for(i=NF;i>0;i--) print $i," "; print "\n"}' filename

3. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | grep '^-'| awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'

4. Write a command to print the lines which end with the word "end"?
grep 'end$' filename

The '$' symbol specifies the grep command to search for the pattern at the end of the line.

5. Write a command to select only those lines containing "july" as a whole word?
grep -w july filename

The '-w' option makes the grep command to search for exact whole words. If the specified pattern is found in a string, then it is not considered as a whole word. For example: In the string "mikejulymak", the pattern "july" is found. However "july" is not a whole word in that string.

6. How to remove the first 10 lines from a file?
sed '1,10 d' < filename

7. Write a command to duplicate each line in a file?
sed 'p' < filename

8. How to extract the username from 'who am i' command?
who am i | cut -f1 -d' '

9. Write a command to list the files in '/usr' directory that start with 'ch' and then display the number of lines in each file?
wc -l /usr/ch*

Another way is 
find /usr -name 'ch*' -type f -exec wc -l {} \;

10. How to remove blank lines in a file ?
grep -v '^$' filename > new_filename

TOP UNIX INTERVIEW QUESTIONS - PART 5

1. How to display the processes that were run by your user name ?
ps -aef | grep <user_name>

2. Write a command to display all the files recursively with path under current directory?
find . -depth -print

3. Display zero byte size files in the current directory?
find . -size 0 -type f

4. Write a command to display the third and fifth character from each line of a file?
cut -c 3,5 filename

5. Write a command to print the fields from 10th to the end of the line. The fields in the line are delimited by a comma?
cut -d',' -f10- filename

6. How to replace the word "Gun" with "Pen" in the first 100 lines of a file?
sed '1,100 s/Gun/Pen/' < filename

7. Write a Unix command to display the lines in a file that do not contain the word "RAM"?
grep -v RAM filename

The '-v' option tells the grep to print the lines that do not contain the specified pattern.

8. How to print the squares of numbers from 1 to 10 using awk command
awk 'BEGIN { for(i=1;i<=10;i++) {print "square of",i,"is",i*i;}}'

9. Write a command to display the files in the directory by file size?
ls -l | grep '^-' |sort -nr -k 5

10. How to find out the usage of the CPU by the processes?

The top utility can be used to display the CPU usage by the processes.

TOP UNIX INTERVIEW QUESTIONS - PART 6

1. Write a command to remove the prefix of the string ending with '/'.

The basename utility deletes any prefix ending in /. The usage is mentioned below:
basename /usr/local/bin/file 

This will display only file

2. How to display zero byte size files?
ls -l | grep '^-' | awk '/^-/ {if ($5 == 0) print $9 }'

3. How to replace the second occurrence of the word "bat" with "ball" in a file?
sed 's/bat/ball/2' < filename

4. How to remove all the occurrences of the word "jhon" except the first one in a line with in the entire file?
sed 's/jhon//2g' < filename

5. How to replace the word "lite" with "light" from 100th line to last line in a file?
sed '100,$ s/lite/light/' < filename

6. How to list the files that are accessed 5 days ago in the current directory?
find . -atime 5 -type f

7. How to list the files that were modified 5 days ago in the current directory?
find . -mtime 5 -type f

8. How to list the files whose status is changed 5 days ago in the current directory?
find . -ctime 5 -type f

9. How to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename

10. Write a command to find the number of files in a directory.
ls -l|grep '^-'|wc -l

TOP UNIX INTERVIEW QUESTIONS - PART 7

1. Write a command to display your name 100 times.
The yes utility can be used to repeatedly output a line with the specified string or 'y'.
yes <your_name> | head -100

2. Write a command to display the first 10 characters from each line of a file?
cut -c -10 filename

3. The fields in each line are delimited by comma. Write a command to display third field from each line of a file?
cut -d',' -f3 filename

4. Write a command to print the fields from 10 to 20 from each line of a file?
cut -d',' -f10-20 filename

5. Write a command to print the first 5 fields from each line?
cut -d',' -f-5 filename

6. By default the cut command displays the entire line if there is no delimiter in it. Which cut option is used to suppress these kind of lines?

The -s option is used to suppress the lines that do not contain the delimiter.

7. Write a command to replace the word "bad" with "good" in file?
sed s/bad/good/ < filename

8. Write a command to replace the word "bad" with "good" globally in a file?
sed s/bad/good/g < filename

9. Write a command to replace the word "apple" with "(apple)" in a file?
sed 's/apple/(&)/' < filename

10. Write a command to switch the two consecutive words "apple" and "mango" in a file?
sed 's/\(apple\) \(mango\)/\2 \1/' < filename

11. Write a command to display the characters from 10 to 20 from each line of a file?
cut -c 10-20 filename
 

TOP UNIX INTERVIEW QUESTIONS - PART 8

1. Write a command to print the lines that have the pattern "july" in all the files in a particular directory?
grep july *

This will print all the lines in all files that contain the word “july” along with the file name. If any of the files contain words like "JULY" or "July", the above command would not print those lines.

2. Write a command to print the lines that has the word "july" in all the files in a directory and also suppress the file name in the output.
grep -h july *

3. Write a command to print the lines that has the word "july" while ignoring the case.
grep -i july *

The '-i' option makes the grep command treat the pattern as case insensitive.

4. When you use a single file as input to the grep command to search for a pattern, it won't print the filename in the output. Now write a grep command to print the file name in the output without using the '-H' option.
grep pattern filename /dev/null

The /dev/null or null device is a special file that discards the data written to it, so /dev/null is always an empty file. Because grep is now given two files to search, it prefixes each matching line with the file name.
Another way to print the file name is using the '-H' option. The grep command for this is
grep -H pattern filename

5. Write a command to print the file names in a directory that does not contain the word "july"?
grep -L july *

The '-L' option makes the grep command to print the file names that do not contain the specified pattern.

6. Write a command to print the line numbers along with the line that has the word "july"?
grep -n july filename

The '-n' option is used to print the line numbers in a file. The line numbers start from 1

7. Write a command to print the lines that starts with the word "start"?
grep '^start' filename

The '^' symbol specifies the grep command to search for the pattern at the start of the line.

8. In the text file, some lines are delimited by colon and some are delimited by space. Write a command to print the third field of each line.
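awk '{ if ($0 ~ /:/) split($0, f, ":"); else split($0, f, " "); print f[3] }' filename

Assigning FS inside the rule would only take effect from the next record, so each line is split explicitly instead.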
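The same per-line choice of delimiter can be sketched in Python. This is only an illustration: the function name third_field is hypothetical, and it assumes that any line containing a colon is colon-delimited.

```python
def third_field(line):
    """Return the third field of a line, or "" if the line has fewer than three fields."""
    # Assumption: a line that contains a colon is colon-delimited,
    # every other line is whitespace-delimited.
    fields = line.split(":") if ":" in line else line.split()
    return fields[2] if len(fields) >= 3 else ""


for line in ["a:b:c:d", "one two three four"]:
    print(third_field(line))
```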

9. Write a command to print the line number before each line?
awk '{print NR, $0}' filename

10. Write a command to print the second and third line of a file without using NR.
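awk 'BEGIN {RS=""; FS="\n"; OFS="\n"} {print $2, $3}' filename

This treats the whole file as a single record, so it assumes the file contains no blank lines.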

11. How to create an alias for the complex command and remove the alias?

The alias utility is used to create the alias for a command. The below command creates alias for ps -aef command.
alias pg='ps -aef'

If you use pg, it will work the same way as ps -aef.

To remove the alias simply use the unalias command as
unalias pg

12. Write a command to display today's date in the format of 'yyyy-mm-dd'?

The date command can be used to display today's date with time
date '+%Y-%m-%d'

Hive Interview Questions

What is Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive was originally developed at Facebook. It is now an Apache project with many contributors. Users can concentrate on the high-level Hive query language rather than on Java MapReduce programs. One of the main advantages of Hive is its SQL-like nature, which makes it considerably easier to use.


A hive program will be automatically compiled into map-reduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts to be plugged into queries.

Hive example:
Selecting the names of employees whose salary is more than 100 dollars from a Hive table called tbl_employee:
SELECT employee_name FROM tbl_employee WHERE salary > 100;
Users are excited to use Hive since it is very similar to SQL.
 

What are the types of tables in Hive?
There are two types of tables.
1. Managed tables.
2. External tables.

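Only the DROP TABLE command differentiates managed and external tables: dropping a managed table deletes both the metadata and the data, while dropping an external table deletes only the metadata and leaves the data files in place. Otherwise, both types of tables are very similar.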

Does Hive support record level Insert, delete or update?
Hive versions before 0.14 do not provide record-level update, insert, or delete, and therefore do not provide transactions either (later releases add them for transactional tables). However, users can use CASE statements and the built-in functions of Hive to achieve the effect of these DML operations. As a result, a complex update query in an RDBMS may need many lines of code in Hive.

What kind of datawarehouse application is suitable for Hive?
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.


Hive is most suited for data warehouse applications, where
1) Relatively static data is analyzed,
2) Fast response times are not required, and
3) The data does not change rapidly.


Hive doesn’t provide crucial features required for OLTP (Online Transaction Processing). It’s closer to being an OLAP (Online Analytic Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

How can the columns of a table in hive be written to a file?
By using the awk command in the shell, the output from HiveQL (DESCRIBE) can be written to a file.
hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output



CONCAT function in Hive with Example?
CONCAT function will concat the input strings. You can specify any number of strings separated by comma.

Example:
CONCAT ('Hive','-','performs','-','good','-','in','-','Hadoop');

Output:
Hive-performs-good-in-Hadoop

Here every string is delimited by '-'. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.

CONCAT_WS ('-','Hive','performs','good','in','Hadoop');
Output: Hive-performs-good-in-Hadoop

REPEAT function in Hive with example?

REPEAT function will repeat the input string n times specified in the command.

Example:
REPEAT('Hadoop',3);

Output:
HadoopHadoopHadoop

Note: You can add a space with the input string also.


TRIM function in Hive with example?

TRIM function will remove the spaces associated with a string.

Example:
TRIM('  Hadoop  ');

Output:
Hadoop

Note: If you want to remove only leading or trailing spaces then you can use the below functions respectively.
LTRIM('  Hadoop');
RTRIM('Hadoop  ');

REVERSE function in Hive with example?
REVERSE function will reverse the characters in a string.

Example:
REVERSE('Hadoop');

Output:
poodaH

LOWER or LCASE function in Hive with example?
LOWER or LCASE function will convert the input string to lower case characters.

Example:
LOWER('Hadoop');
LCASE('Hadoop');

Output:
hadoop

Note:
If the characters are already in lower case then they will be preserved.

UPPER or UCASE function in Hive with example?
UPPER or UCASE function will convert the input string to upper case characters.

Example:
UPPER('Hadoop');
UCASE('Hadoop');

Output:
HADOOP

Note:
If the characters are already in upper case then they will be preserved.

Double type in Hive – Important points?
It is important to know about the double type in Hive. Hive may display double values differently from an RDBMS, using E notation for some values.
See the double type data below:
24624.0
32556.0
3.99893E5
4366.0

E5 represents 10^5 here. So, the value 3.99893E5 represents 399893. Calculations on the double type are carried out in IEEE 754 double precision, which is accurate to about 15 to 17 significant decimal digits. The maximum value for an IEEE 754 double is about 1.8E308.
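The E notation can be checked with a short Python sketch. Python floats are also IEEE 754 doubles, so they parse these strings the same way. The two-decimal fixed-point target is an assumption chosen for illustration, not something Hive prescribes.

```python
from decimal import Decimal

# The sample double values, as they appear in text output.
raw_values = ["24624.0", "32556.0", "3.99893E5", "4366.0"]

# Parsing as double precision floats: E5 means "times 10^5".
values = [float(v) for v in raw_values]
print(values[2])

# Casting to a fixed-point type (two decimal places is an assumed target)
# removes the E notation before the data is handed to another system.
as_decimal = [Decimal(v).quantize(Decimal("0.01")) for v in raw_values]
print(as_decimal[2])
```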


It is crucial while exporting the double type data to any RDBMS since the type may be wrongly interpreted. So, it is advised to cast the double type into appropriate type before exporting.

Rename a table in Hive – How to do it?
Using ALTER command, we can rename a table in Hive.
ALTER TABLE hive_table_name RENAME  TO new_name;

There is another way to rename a table in Hive. Sometimes, ALTER may take more time if the underlying table has more partitions/functions. In that case, Import and export options can be utilized. Here you are saving the hive data into HDFS and importing back to new table like below.
EXPORT TABLE tbl_name TO 'HDFS_location';
IMPORT TABLE new_tbl_name FROM 'HDFS_location';

If you prefer to just preserve the data, you can create a new table from old table like below.
CREATE TABLE new_tbl_name AS SELECT * FROM old_tbl_name;
DROP TABLE old_tbl_name;

How to change a column data type in Hive?
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
Example: If you want to change the data type of ID column from integer to bigint in a table called employee.
ALTER TABLE employee CHANGE id id BIGINT;

Difference between order by and sort by in hive?
SORT BY will sort the data within each reducer. You can use any number of reducers for SORT BY operation.
ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER BY in hive uses single reducer.
ORDER BY guarantees total order in the output, while SORT BY only guarantees ordering of the rows within a reducer. If there is more than one reducer, SORT BY may give partially ordered final results.
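The difference can be simulated in a few lines of Python. This is a toy model rather than Hive's implementation: the function names are hypothetical, and rows are handed to reducers round-robin purely for illustration.

```python
def sort_by(rows, num_reducers):
    """Simulate SORT BY: each reducer sorts only the rows it received."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: round-robin distribution, chosen only for illustration.
        buckets[i % num_reducers].append(row)
    return [sorted(bucket) for bucket in buckets]


def order_by(rows):
    """Simulate ORDER BY: all rows pass through one reducer and are fully sorted."""
    return sorted(rows)


rows = [5, 1, 4, 2, 3, 6]
per_reducer = sort_by(rows, 2)
concatenated = [row for bucket in per_reducer for row in bucket]

print(per_reducer)     # each inner list is sorted
print(concatenated)    # but the combined output is not
print(order_by(rows))  # total order
```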

RLIKE in Hive?
RLIKE is a special operator in Hive that evaluates to true if any substring of A matches B, where B is treated as a Java regular expression. Users don't need to put the % symbol for a simple match in RLIKE.

Examples:
'Express' RLIKE 'Exp' --> True
'Express' RLIKE '^E.*' --> True (Regular expression)

Moreover, RLIKE comes in handy when the string has some trailing spaces, because it can match without the TRIM function. Suppose A has the value 'Express  ' (with 2 additional spaces) and B has the value 'Express'. RLIKE will still match without using TRIM.
'Express  ' RLIKE 'Express' --> True

Note:
RLIKE evaluates to NULL if A or B is NULL.
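The substring-matching behaviour can be approximated in Python with re.search. This is a rough analogue only: the helper name rlike is hypothetical, and Python's regular expression syntax differs from Java's in some details.

```python
import re


def rlike(a, pattern):
    """Approximate Hive's RLIKE: true if any substring of a matches pattern."""
    # Mirror the NULL rule: a NULL operand gives NULL (None here).
    if a is None or pattern is None:
        return None
    return re.search(pattern, a) is not None


print(rlike("Express", "Exp"))
print(rlike("Express", "^E.*"))
print(rlike("Express  ", "Express"))
print(rlike(None, "Express"))
```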









 

Hadoop interview questions

Name the most common InputFormats defined in Hadoop? Which one is default ?
 The following 3 are the most common InputFormats defined in Hadoop
- TextInputFormat
- KeyValueTextInputFormat
- SequenceFileInputFormat

 
TextInputFormat is the Hadoop default.


What is the difference between the TextInputFormat and KeyValueTextInputFormat classes?
TextInputFormat: It reads lines of text files and provides the byte offset of the line as the key to the Mapper and the actual line as the value to the Mapper.

KeyValueTextInputFormat: It reads text files and parses lines into key, value pairs. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value to the Mapper.

What is InputSplit in Hadoop?
When a Hadoop job is run, it splits the input files into chunks and assigns each chunk to a mapper to process. Each such chunk is called an input split.

How is the splitting of file invoked in Hadoop Framework ?

It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (like FileInputFormat) defined by the user.

Consider case scenario: In M/R system,
    - HDFS block size is 64 MB
    - Input format is FileInputFormat
    - We have 3 files of size 64 KB, 65 MB and 127 MB
then how many input splits will be made by Hadoop framework?

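Going by HDFS blocks, Hadoop will make 5 splits as follows
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file

Note that FileInputFormat lets the last split of a file run up to 10% over the split size, so in practice the 65 MB file may be read as a single split, giving 4 splits in total.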
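The split arithmetic for this scenario can be sketched as follows. This is a simplified model, not Hadoop's code: count_splits and its slop parameter are hypothetical names, the split size is assumed to equal the 64 MB block size, and the 1.1 slop value (a last split may be up to 10% oversized) is an assumption about FileInputFormat's behaviour.

```python
KB = 1024
MB = 1024 * KB


def count_splits(file_size, split_size, slop=1.0):
    """Count the input splits for one non-empty file.

    slop=1.0 cuts strictly at split_size; a larger slop lets the final
    split be slightly oversized instead of creating a tiny extra split.
    """
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


files = [64 * KB, 65 * MB, 127 * MB]

# Strict block-by-block counting.
strict = [count_splits(size, 64 * MB) for size in files]
# Assumed 10% tolerance on the last split.
with_slop = [count_splits(size, 64 * MB, slop=1.1) for size in files]

print(strict, sum(strict))
print(with_slop, sum(with_slop))
```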

What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat

After the Map phase finishes, the hadoop framework does "Partitioning, Shuffle and sort". Explain what happens in this phase?
- Partitioning
Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same

- Shuffle
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

- Sort
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer

If no custom partitioner is defined in the hadoop then how is data partitioned before its sent to the reducer?
The default partitioner (HashPartitioner) computes a hash value for the key and assigns the partition as that hash modulo the number of reducers.
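A minimal sketch of hash partitioning is shown below. It is illustrative only: java_string_hash reproduces Java's String.hashCode formula as a stand-in for the key's hash (Hadoop key types such as Text compute their own hash codes), and both function names are hypothetical.

```python
def java_string_hash(s):
    """Stand-in hash: Java's String.hashCode formula, as a signed 32-bit value."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit result to a signed value, as in Java.
    return h - 0x100000000 if h >= 0x80000000 else h


def get_partition(key, num_reducers):
    """Mask off the sign bit, then take the hash modulo the number of reducers."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


# The same key always maps to the same partition, whichever mapper emits it.
for key in ["apple", "mango", "apple"]:
    print(key, get_partition(key, 4))
```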

What is a Combiner?

The Combiner is a "mini-reduce" process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
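The effect of a combiner can be illustrated with a small word-count sketch. The function names are hypothetical and everything runs in memory. The point is only that the combiner shrinks the mapper output before it would be sent to the reducers.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per word locally, before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)

print(len(mapper_output), "pairs before combining")
print(len(combined), "pairs after combining")
print(combined)
```

A combiner like this is only safe when the reduce operation is associative and commutative, as summing counts is.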

What is job tracker?
Job Tracker is the service within Hadoop that runs Map Reduce jobs on the cluster

What are some typical functions of Job Tracker?
The following are some typical tasks of Job Tracker
- Accepts jobs from clients
- It talks to the NameNode to determine the location of the data
- It locates TaskTracker nodes with available slots at or near the data
- It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker

What is task tracker?
A Task Tracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

What is the relationship between Jobs and Tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.

Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other task tracker. Only if the task fails 4 times (the default setting, which can be changed) will it fail the whole job.

Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative Execution

How does speculative execution work in Hadoop?

Job tracker makes different task trackers process same input. When tasks complete, they announce this fact to the Job Tracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the Task Trackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.

Using command line in Linux, how will you
- see all jobs running in the hadoop cluster
- kill a job

- hadoop job -list
- hadoop job -kill jobid

What is Hadoop Streaming  ?

Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations


What characteristic of the streaming API makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, awk etc.?
Hadoop Streaming allows arbitrary programs to be used for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
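A word-count mapper and reducer in this style might look like the sketch below. It is a local simulation under stated assumptions: the function names are hypothetical, lines are passed in as Python lists rather than read from stdin, and the sorted() call stands in for the sort that the framework performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word. The input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


lines = ["to be or not", "to be"]
shuffled = sorted(mapper(lines))  # stand-in for the framework's shuffle and sort
result = list(reducer(shuffled))
print(result)
```

In a real streaming job, each function would live in its own script that iterates over sys.stdin and prints each output line to stdout.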


What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.


What is the benefit of Distributed Cache? Why can't we just have the file in HDFS and have the application read it?
This is because the distributed cache is much faster. It copies the file to all task trackers at the start of the job. If a task tracker then runs 10 or 100 mappers or reducers, they all use the same local copy. On the other hand, if you write code in the MR job to read the file from HDFS, every mapper will access HDFS itself, so a task tracker that runs 100 map tasks will read the file 100 times from HDFS. HDFS is also not very efficient when used like this.


What mechanism does the Hadoop framework provide to synchronize changes made in the Distributed Cache during runtime of the application?
This is a trick question. There is no such mechanism. Distributed Cache is by design read-only during job execution.


Have you ever used Counters in Hadoop. Give us an example scenario ?
Anybody who claims to have worked on a Hadoop project is expected to use counters

Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job  ?

Yes. The input format class provides methods to add multiple directories as input to a Hadoop job, for example FileInputFormat.addInputPath, which can be called once per directory.


Is it possible to have Hadoop job output in multiple directories. If yes then how ?
Yes, by using the MultipleOutputs class.


What will a hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit

The hadoop job will throw an exception and exit.


How can you set an arbitrary number of mappers to be created for a job in Hadoop ?

This is a trick question. You cannot set it directly: the number of mappers is determined by the number of input splits.


How can you set an arbitrary number of reducers to be created for a job in Hadoop?

You can either do it programmatically by using the setNumReduceTasks method in the JobConf class, or set it up as a configuration setting.


How will you write a custom partitioner for a Hadoop job ?

To have Hadoop use a custom partitioner, you will have to do at least the following three things:
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, either
    - add the custom partitioner to the job programmatically using the setPartitionerClass method, or
    - add the custom partitioner to the job through configuration (if your wrapper reads from a config file or Oozie)


How did you debug your Hadoop code ?
There can be several ways of doing this but most common ways are
- By using counters
- The web interface provided by Hadoop framework


Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for any reason?

It's an open-ended question, but most candidates who have written a production job should talk about some type of alert mechanism, such as an email being sent or their monitoring system raising an alert. Since Hadoop works on unstructured data, it's very important to have a good alerting system for errors, because unexpected data can very easily break the job.


Did you ever run into a lopsided job that resulted in an out-of-memory error? If yes, how did you handle it?
This is an open-ended question, but a candidate who claims to be an intermediate developer and has worked on large data sets (10-20 GB minimum) should have run into this problem. There can be many ways to handle this problem, but the most common way is to alter your algorithm and break the job down into more MapReduce phases, or to use a combiner if possible.



What is HDFS?
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications

What does the statement "HDFS is a block structured file system" mean?

It means that in HDFS individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity

What does the term "Replication factor" mean?

Replication factor is the number of times a file needs to be replicated in HDFS

What is the default replication factor in HDFS? 

3

What is the default block size of an HDFS block?  

64 MB (the default in Hadoop 1.x; Hadoop 2.x raised it to 128 MB)

What is the benefit of having such big block size (when compared to block size of linux file system like ext)?

It allows HDFS to decrease the amount of metadata storage required per file (the list of blocks per file will be smaller as the size of individual blocks increases). Furthermore, it allows for fast streaming reads of data, by keeping large amounts of data sequentially laid out on the disk
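The metadata argument comes down to simple arithmetic, sketched below. The 1 GB file size and the 4 KB comparison block size are illustrative assumptions (4 KB being a typical local file system block size), and block_count is a hypothetical helper.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_size, block_size):
    """Number of blocks needed to hold a file (ceiling division on integers)."""
    return -(-file_size // block_size)


# Blocks the file system must track for one 1 GB file.
print(block_count(1 * GB, 64 * MB))  # with 64 MB blocks
print(block_count(1 * GB, 4 * KB))   # with 4 KB blocks
```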

Why is it recommended to have few very large files instead of a lot of small files in HDFS?

This is because the Name node holds the metadata of each and every file in HDFS, and more files means more metadata. Since the Name node loads all the metadata into memory for speed, having a lot of files may make the metadata big enough to exceed the memory available on the Name node.

True/false question. What is the lowest granularity at which you can apply replication factor in HDFS
- You can choose replication factor per directory
- You can choose replication factor per file in a directory
- You can choose replication factor per block of a file

- True
- True
- False

What is a datanode in HDFS?  

Individual machines in the HDFS cluster that hold blocks of data are called datanodes

What is a Namenode in HDFS?  

The Namenode stores all the metadata for the file system

What alternate way does HDFS provide to recover data in case a Namenode, without backup, fails and cannot be recovered?

There is no way. If Namenode dies and there is no backup then there is no way to recover data

Describe how a HDFS client will read a file in HDFS, like will it talk to data node or namenode ... how will data flow etc?

To open a file, a client contacts the Name Node and retrieves a list of locations for the blocks that comprise the file. These locations identify the Data Nodes which hold each block. Clients then read file data directly from the Data Node servers, possibly in parallel. The Name Node is not directly involved in this bulk data transfer, keeping its overhead to a minimum.

Using the Linux command line, how will you
- List the files in an HDFS directory
- Create a directory in HDFS
- Copy a file from your local directory to HDFS

hadoop fs -ls
hadoop fs -mkdir
hadoop fs -put localfile hdfsfile
 


Advantages of Hadoop?
• Bringing compute and storage together on commodity hardware: The result is blazing speed at low cost.
• Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining if Hadoop can complement or replace aspects of your current architecture.
• Linear Scalability: Every parallel technology makes claims about scale up. Hadoop has genuine scalability since the latest release is expanding the limit on the number of nodes to beyond 4,000.
• Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. Hadoop programming model does not solve all problems, but it is a strong solution for many tasks.

Definition of Big data?
According to Gartner, Big data can be defined as high volume, velocity and variety information requiring innovative and cost effective forms of information processing for enhanced decision making.

How Big data differs from database ?
Datasets which are beyond the ability of the database to store, analyze and manage can be defined as Big. The technology extracts required information from large volume whereas the storage area is limited for a database.

Who are all using Hadoop? Give some examples?
• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!

Pig for Hadoop - Give some points?
Pig is Data-flow oriented language for analyzing large data sets.
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.

At the present time, Pig infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming.
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

Extensibility.
Users can create their own functions to do special-purpose processing.

Features of Pig:
– data transformation functions
– datatypes include sets, associative arrays, tuples
– high-level language for marshalling data
- developed at yahoo!


Hive for Hadoop - Give some points?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Keypoints:
• SQL-based data warehousing application
– features similar to Pig
– more strictly SQL-type
• Supports SELECT, JOIN, GROUP BY,etc
• Analyzing very large data sets
– log processing, text mining, document indexing
• Developed at Facebook

Map Reduce in Hadoop?
Map reduce :
It is a framework for processing huge datasets in parallel using a large number of computers, referred to as a cluster. It involves two processes, namely Map and Reduce.

Map Process:
In this process the input is taken by the master node, which divides it into smaller tasks and distributes them to the worker nodes. The worker nodes process these sub-tasks and pass the results back to the master node.

Reduce Process :
In this process the master node combines all the answers provided by the worker nodes to get the result of the original task. The main advantage of MapReduce is that map and reduce are performed in distributed mode. Since each map operation is independent, the maps can be performed in parallel, which reduces the net computing time.

What is a heartbeat in HDFS?
A heartbeat is a signal that a node sends to indicate that it is alive. A data node sends heartbeats to the Name node, and a task tracker sends its heartbeats to the job tracker. If the Name node or job tracker does not receive a heartbeat, it decides that there is a problem with the data node, or that the task tracker is unable to perform the assigned task.

What is metadata?
Metadata is the information about the data stored in data nodes such as location of the file, size of the file and so on.

Is Namenode also a commodity?
No. The Namenode can never be commodity hardware because the entire HDFS relies on it.
It is the single point of failure in HDFS. Namenode has to be a high-availability machine.

Can Hadoop be compared to NOSQL database like Cassandra?
Though NOSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NOSQL. Hadoop is not a database. It’s a filesystem (HDFS) and distributed programming framework (MapReduce).

What is Key value pair in HDFS?
Key value pair is the intermediate data generated by maps and sent to reduces for generating the final output.

What is the difference between MapReduce engine and HDFS cluster?
HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. Map Reduce Engine is the programming module which is used to retrieve and analyze data.

What is a rack?
A rack is a physical collection of datanodes housed together at a single location. There can be multiple racks in a single location, and the racks of a cluster can also be spread across different locations.

How indexing is done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.

History of Hadoop?
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.

Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.

What is meant by Volunteer Computing?
Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed.
SETI@home is the most well-known of many volunteer computing projects.

How Hadoop differs from SETI (Volunteer computing)?
Although SETI (Search for Extra-Terrestrial Intelligence) may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world, since the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.

Compare RDBMS and MapReduce?
Data size:
RDBMS - Gigabytes
MapReduce - Petabytes
Access:
RDBMS - Interactive and batch
MapReduce - Batch
Updates:
RDBMS - Read and write many times
MapReduce - Write once, read many times
Structure:
RDBMS - Static schema
MapReduce - Dynamic schema
Integrity:
RDBMS - High
MapReduce - Low
Scaling:
RDBMS - Nonlinear
MapReduce - Linear

What is HBase?
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

What is ZooKeeper?
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.

What is Chukwa?
A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports. (At the time of this writing, Chukwa had only recently graduated from a “contrib” module in Core to its own subproject.)

What is Avro?
A data serialization system for efficient, cross-language RPC, and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)

What is the Core subproject in Hadoop?
A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).

What are all Hadoop subprojects?
Pig, Chukwa, Hive, HBase, MapReduce, HDFS, ZooKeeper, Core, Avro

What is a split?
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced.

On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster.
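
The trade-off above is easier to see with numbers. The following is a minimal sketch, not Hadoop's actual split logic: the function name plan_splits is hypothetical, and it ignores refinements that real input formats apply (such as record boundaries). It only shows how a file is carved into fixed-size pieces, one map task per piece.

```python
MB = 1024 * 1024

def plan_splits(file_size_bytes, split_size_bytes=64 * MB):
    """Return (offset, length) pairs that cover the file in fixed-size splits.

    Hypothetical helper for illustration; the 64 MB default mirrors the
    block size quoted in the text.
    """
    splits = []
    offset = 0
    while offset < file_size_bytes:
        # The last split may be shorter than the others.
        length = min(split_size_bytes, file_size_bytes - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 200 MB file with 64 MB splits: three full splits plus a short one.
splits = plan_splits(200 * MB)
print(len(splits), "map tasks")

# The same file with 1 MB splits needs far more map tasks, which is
# where task-creation overhead starts to matter.
print(len(plan_splits(200 * MB, 1 * MB)), "map tasks")
```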

Map tasks write their output to local disk, not to HDFS. Why is this?
Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to recreate the map output.

Explain the MapReduce data flow with a single reduce task.
The input to a single reduce task is normally the output from all mappers.
The sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability.
For each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes.
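
The merge step described above can be sketched in a few lines. This is only an illustration under stated assumptions: reduce_side is a hypothetical name, the sample records are made up, and each inner list stands in for one map task's already-sorted output after it has been copied to the reduce node.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_side(sorted_map_outputs, reduce_fn):
    """Merge sorted (key, value) runs and call reduce_fn once per key."""
    # heapq.merge keeps the combined stream sorted by key without
    # loading and re-sorting everything at once.
    merged = heapq.merge(*sorted_map_outputs, key=itemgetter(0))
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append(reduce_fn(key, values))
    return results

# Made-up sample data: (year, temperature) pairs from three map tasks.
map_outputs = [
    [("1949", 111), ("1950", 0)],
    [("1949", 78), ("1950", 22)],
    [("1950", -11)],
]

# The user-defined reduce function here keeps the maximum value per key.
max_temp = reduce_side(map_outputs, lambda year, temps: (year, max(temps)))
print(max_temp)
```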

Explain the MapReduce data flow with multiple reduce tasks.
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for every key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works well.
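
A minimal sketch of hash partitioning, assuming string keys. The names partition_for and partition_map_output are hypothetical, and zlib.crc32 is used only as a stand-in deterministic hash; it is not the hash function Hadoop itself uses. The point is that the partition depends only on the key, so every record for a given key reaches the same reduce task.

```python
import zlib

def partition_for(key, num_reducers):
    """Pick a partition from the key alone (stand-in hash, not Hadoop's)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_map_output(records, num_reducers):
    """Split one map task's (key, value) records into per-reducer lists."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions

# Made-up map output with a repeated key.
records = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4)]
parts = partition_map_output(records, 3)
for index, part in enumerate(parts):
    print("partition", index, part)
```

A custom partitioner would replace partition_for with any other function of the key, for example one that routes keys by range.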

Explain the MapReduce data flow with no reduce tasks.
It’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle since the processing can be carried out entirely in parallel.
In this case, the only off-node data transfer is when the map tasks write to HDFS.

What is a block in HDFS?
Filesystems deal with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. HDFS has the same concept of a block, but it is a much larger unit, 64 MB by default.

Why is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
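
A back-of-the-envelope calculation makes the argument concrete. The figures below (a 10 ms seek and a 100 MB/s transfer rate) are assumptions chosen for illustration, not measurements, and the function names are hypothetical.

```python
def seek_overhead(block_size_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of total read time spent seeking, for one block."""
    transfer_ms = block_size_mb / transfer_mb_per_s * 1000.0
    return seek_ms / (seek_ms + transfer_ms)

def block_size_for_ratio(seek_to_transfer, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Block size (MB) at which seek time is the given fraction of transfer time."""
    transfer_ms = seek_ms / seek_to_transfer
    return transfer_ms / 1000.0 * transfer_mb_per_s

# With a 4 KB block almost all the time goes to seeking;
# with a 64 MB block seeking is a small fraction of the read.
print(round(seek_overhead(0.004), 3))
print(round(seek_overhead(64), 3))

# Block size needed for the seek to cost 1% of the transfer time.
print(round(block_size_for_ratio(0.01)), "MB")
```

Under these assumed figures, keeping the seek at about 1% of the transfer time calls for a block of roughly 100 MB, which is the same order of magnitude as the HDFS default.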

File permissions in HDFS?
HDFS has a permissions model for files and directories.
There are three types of permission: the read permission (r), the write permission (w) and the execute permission (x). The read permission is required to read files or list the contents of a directory. The write permission is required to write a file, or for a directory, to create or delete files or directories in it. The execute permission is ignored for a file since you can’t execute a file on HDFS; for a directory, it is required to access the directory's children.
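
The check can be sketched as follows, assuming a POSIX-style mode in which each file or directory has an owner, a group, and three rwx bit sets (owner, group, others). This is an illustration only: the function name allowed and the sample users are hypothetical, and the sketch leaves out special cases such as a superuser who bypasses permission checks.

```python
def allowed(mode, owner, group, user, user_groups, want):
    """Return True if `user` holds permission `want` ('r', 'w' or 'x').

    `mode` is an octal value such as 0o750. Exactly one class of bits
    applies: owner first, then group, then others.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        bits = (mode >> 6) & 7
    elif group in user_groups:
        bits = (mode >> 3) & 7
    else:
        bits = mode & 7
    return bool(bits & bit)

# A directory with mode rwxr-x--- owned by alice, group analysts.
mode = 0o750
print(allowed(mode, "alice", "analysts", "alice", ["analysts"], "w"))  # owner may create files
print(allowed(mode, "alice", "analysts", "bob", ["analysts"], "r"))    # group may list contents
print(allowed(mode, "alice", "analysts", "bob", ["analysts"], "w"))    # group may not create files
print(allowed(mode, "alice", "analysts", "carol", ["guests"], "x"))    # others may not access children
```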

What is Thrift in HDFS?
The Thrift API in the “thriftfs” contrib module exposes Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.
To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.

How does Hadoop interact with C?
Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface.
It works using the Java Native Interface (JNI) to call a Java filesystem client.
The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.

What is FUSE in Hadoop HDFS?
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix utilities (such as ls and cat) to interact with the filesystem.
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.

Explain WebDAV in Hadoop?
WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as filesystems on most operating systems, so by exposing HDFS (or other Hadoop filesystems) over WebDAV, it’s possible to access HDFS as a standard filesystem.

What is Sqoop in Hadoop?

Sqoop is a tool designed to transfer data between relational database management systems (RDBMS) and HDFS.
With Sqoop, we can import data from an RDBMS such as MySQL or Oracle into HDFS, and export data from HDFS files back to an RDBMS.
Sqoop reads the table row by row, and the import runs in parallel, so the output may be spread across multiple files.
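Example (dbhost, database, table, condition and /directory are placeholders; this imports the rows that SELECT * FROM database.table WHERE condition would return):
sqoop import --connect jdbc:mysql://dbhost/database --table table \
  --where "condition" --target-dir /directory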










 