Tuesday 21 October 2014

How to use blank as null in Hive

In Hadoop, Generally null values are represented as blank in HDFS file.
But in databases null value has a special meaning.
When Hive SQL is used to generate reports, then its common to use IS NULL construct.
But Hive does not treat blank and null in the same way.
Hence output of Hive SQL query with IS NULL construct many not be as per your expectation if you come from database background.
Lets look at an example.
1. Create sample input file
We will create an input file with three columns ( name , title,  birth_year )
Column values are seperated by pipe(‘|’) character.
12345
-bash-3.2$ echo -e "mark||2011\ntony|Mr.|2012\nrichard||2013" > input.txt
-bash-3.2$ cat input.txt
mark||2011
tony|Mr.|2012
richard||2013
view rawinput.txt hosted with ❤ by GitHub
If you look at above data,
In First record, title column has blank value
In Second record, title column has Mr. value
In Third record, title column has blank value
2. Upload sample input file to HDFS
12345
-bash-3.2$ hadoop fs -copyFromLocal input.txt /user/abhijit/input.txt
-bash-3.2$ hadoop fs -cat /user/abhijit/input.txt
mark||2011
tony|Mr.|2012
richard||2013
view rawinput.hdfs.txt hosted with ❤ by GitHub
3. Start Hive CLI and create Hive table
We will use following DDL statement to create Hive table.
12345678
CREATE EXTERNAL TABLE person
(
name string,
title string,
birth_year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/person';
view rawperson.sql hosted with ❤ by GitHub
4. Find out rows where title column has null values
Fire following SELECT SQL using HIVE CLI.
1
select * from person where title is null;
If you see the output of this SQL, you will find that it returns zero  results which is unexpected.
We were expecting that the result should contain 2 rows.
How to handle this scenario?
Hive developers have already thought about it and Hive provides you a table propertyserialization.null.format which can be used to treat a character of your choice as null in Hive SQL. In this case we want to use blank as null.
Drop above Hive table and use following DDL statement instead of the one specified above
123456789
CREATE EXTERNAL TABLE person
(
name string,
title string,
birth_year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/person'
TBLPROPERTIES('serialization.null.format'='');
view rawperson_null.sql hosted with ❤ by GitHub
Now if you fire above SELECT SQL then you will get the expected result.
Hope it helps!!

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...