Tuesday, 21 October 2014

How to use blank as null in Hive

In Hadoop, null values are generally represented as blanks in HDFS files.
In databases, however, a null value has a special meaning.
When Hive SQL is used to generate reports, it is common to use the IS NULL construct.
But Hive does not treat blank and null in the same way.
Hence the output of a Hive SQL query with the IS NULL construct may not match your expectations if you come from a database background.
Let's look at an example.
1. Create sample input file
We will create an input file with three columns (name, title, birth_year).
Column values are separated by the pipe ('|') character.
-bash-3.2$ echo -e "mark||2011\ntony|Mr.|2012\nrichard||2013" > input.txt
-bash-3.2$ cat input.txt
mark||2011
tony|Mr.|2012
richard||2013
If you look at the data above:
In the first record, the title column is blank.
In the second record, the title column has the value Mr.
In the third record, the title column is blank.
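The blank titles can also be checked locally before uploading anything. The following is a minimal sketch, assuming awk is available; it recreates the sample file from step 1 and prints the names of the records whose second field is empty. The variable name blank_titles is only illustrative.

```shell
# Recreate the sample file from step 1 (printf avoids relying on echo -e).
printf 'mark||2011\ntony|Mr.|2012\nrichard||2013\n' > input.txt

# Split each line on '|' and print the name when the second field (title) is empty.
blank_titles=$(awk -F'|' '$2 == "" { print $1 }' input.txt)
echo "$blank_titles"
```

These are the rows we would like a Hive IS NULL query to return.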
2. Upload sample input file to HDFS
-bash-3.2$ hadoop fs -mkdir /person
-bash-3.2$ hadoop fs -copyFromLocal input.txt /person/input.txt
-bash-3.2$ hadoop fs -cat /person/input.txt
mark||2011
tony|Mr.|2012
richard||2013
3. Start Hive CLI and create Hive table
We will use the following DDL statement to create the Hive table.
CREATE EXTERNAL TABLE person
(
name string,
title string,
birth_year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/person';
4. Find the rows where the title column is null
Run the following SELECT statement in the Hive CLI.
select * from person where title is null;
If you look at the output of this query, you will find that it returns zero rows, which is unexpected.
We were expecting the result to contain 2 rows.
The reason is that Hive reads each blank title as an empty string, not as NULL.
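One way to see what Hive actually loaded is to compare the title column against the empty string instead of NULL. The sketch below assumes the person table from step 3 exists and that the hive command is on your PATH; the QUERY variable is only illustrative, and the query is simply printed if Hive is not installed.

```shell
# Query that matches empty strings rather than NULLs.
QUERY="SELECT name, title, birth_year FROM person WHERE title = '';"

# Run the query only where the Hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -e "$QUERY"
else
  echo "hive CLI not found; query was: $QUERY"
fi
```

If this returns the rows for mark and richard, the blanks were loaded as empty strings.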
How to handle this scenario?
Hive developers have already thought about this: Hive provides a table property, serialization.null.format, which lets you treat a string of your choice as null in Hive SQL. By default, Hive text tables use \N to represent null. In this case we want to use blank as null.
Drop the Hive table above and use the following DDL statement instead of the one specified above. Because the table is external, dropping it does not delete the data file.
CREATE EXTERNAL TABLE person
(
name string,
title string,
birth_year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/person'
TBLPROPERTIES('serialization.null.format'='');
Now if you run the SELECT statement above, you will get the expected result: the two rows for mark and richard.
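If you would rather not drop and recreate the table, the same property can usually be set on the existing table with ALTER TABLE. This is a sketch rather than a tested recipe for every Hive version: the script name set_null_format.hql is an arbitrary choice, and on some setups the property may need to be set through SET SERDEPROPERTIES instead.

```shell
# Write the property change and the check query to a script file.
cat > set_null_format.hql <<'EOF'
ALTER TABLE person SET TBLPROPERTIES ('serialization.null.format'='');
SELECT * FROM person WHERE title IS NULL;
EOF

# Run the script only where the Hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f set_null_format.hql
else
  echo "hive CLI not found; script saved as set_null_format.hql"
fi
```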
Hope it helps!!

How to Fix Hadoop on a Non-Unix Platform (Windows OS)


Overview

Users of org.expasy.jpl.io.util.JPLHMapSerializer on Windows may encounter the following exception, because Hadoop makes a system call to the Unix tool chmod:
Exception in thread "main" java.io.IOException: Cannot run program "chmod": CreateProcess error=2,
The system cannot find the file specified :
        at java.lang.ProcessBuilder.start(Unknown Source)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:201)
        at org.apache.hadoop.util.Shell.run(Shell.java:183)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:462)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:445)
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:543)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:535)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:336)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:400)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:610)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:591)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:498)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:490)
        at org.apache.hadoop.hbase.io.hfile.HFile$Writer.<init>(HFile.java:306)
        at org.expasy.jpl.io.util.JPLHMapSerializer.init(JPLHMapSerializer.java:125)
      

Fixing the Dependency Bug


The solution is to install Cygwin on your Windows system, or to install only a subset of it, since only chmod and its DLLs are needed. The steps below cover the second alternative:

First Step: Get "chmod" Resources

Here are the archives for different Windows architectures:
  • Windows 32-bit - contains chmod.exe, cygwin1.dll, cygiconv-2.dll, cygintl-8.dll and cyggcc_s-1.dll
  • Windows 64-bit - not yet available

Second Step: Set your Path in Windows

Do not forget to add the chmod directory to your PATH variable in Windows, otherwise chmod will not be found!
  1. Right-click the My Computer icon on your Desktop and click Properties. Alternatively, press the Windows key + Pause/Break key
  2. In the window that opens, click the Advanced tab
  3. Click Environment Variables
  4. Under System Variables, edit or create the PATH variable and add the path to the cygwin-chmod directory
