These are the steps for parsing XML files with Pig
---------------------------------------------------------------------------------------------------------------
Step 1: Add the Pig bin directory to your PATH
export PATH=/home/hadoop/work/pig-0.11.1/bin:$PATH
Step 2: Register the jar file
REGISTER '/home/hadoop/work/pig-0.11.1/contrib/piggybank/java/piggybank.jar'
Step 3: Load the data
xml = load '/home/hadoop/work/hadoop-1.1.2/conf/mapred-site.xml' USING
org.apache.pig.piggybank.storage.XMLLoader('name') as(doc:chararray);
The data looks like:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
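To make the loader's role concrete, here is a rough Java sketch of the idea behind XMLLoader('name'): each <name>...</name> element in the file becomes one single-field record, tags included. This is not the piggybank implementation. The class XmlRecordSketch and the method records are made-up names for illustration, and the regex-based splitting is an assumption chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig or piggybank.
class XmlRecordSketch {
    // Returns each <tag>...</tag> element, tags included, as one record.
    static List<String> records(String xml, String tag) {
        String t = Pattern.quote(tag);
        // DOTALL lets '.' cross line breaks; '.*?' stops at the first closing tag.
        Pattern p = Pattern.compile("<" + t + ">.*?</" + t + ">", Pattern.DOTALL);
        Matcher m = p.matcher(xml);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<property>\n<name>fs.default.name</name>\n"
                   + "<value>hdfs://localhost:8020</value>\n</property>";
        // Prints [<name>fs.default.name</name>]
        System.out.println(records(xml, "name"));
    }
}
```

Because each record still carries its opening and closing tags, the next step strips them with a regular expression.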
Step 4: Parse the file and retrieve the value
value = foreach xml GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<name>(.*)</name>')) AS name:chararray;
Step 5: Show the value
dump value;
Output will be:
(fs.default.name)
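The pattern in Step 4 has to cover the whole record, assuming REGEX_EXTRACT_ALL follows java.util.regex whole-string matching (Matcher.matches()), which is how the patterns in this post are written. The Java sketch below mimics that behaviour for one record. The class ExtractAllSketch and the method extractAll are hypothetical stand-ins, not Pig's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for REGEX_EXTRACT_ALL, for illustration only.
class ExtractAllSketch {
    // Returns the capture groups if the pattern matches the whole input, else null.
    static List<String> extractAll(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        List<String> groups = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            groups.add(m.group(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        String doc = "<name>fs.default.name</name>";
        // Prints [fs.default.name]
        System.out.println(extractAll(doc, "<name>(.*)</name>"));
        // Extra text around the element breaks the whole-string match: prints null
        System.out.println(extractAll(" " + doc, "<name>(.*)</name>"));
    }
}
```

If a pattern that looks right returns nothing, check for leading or trailing text in the record that the pattern does not account for.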
Parsing a file with multiple elements per record
The data looks like:
<property>
<fname>kalyan</fname>
<lname>hadoop</lname>
<landmark>annapurna block</landmark>
<city>hyderabad</city>
<state>Telengana</state>
<contact>1234567890</contact>
<email>kalyan@gmail.com</email>
<PAN_Card>0011542</PAN_Card>
<URL>kalyanhadooptraining.blogspot.com</URL>
</property>
Load the data:
pigdata = load '/home/hadoop/work/input/file.txt' USING
org.apache.pig.piggybank.storage.XMLLoader('property') as (doc:chararray);
Parse the values:
values = foreach pigdata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<property>\\s*<fname>(.*)</fname>\\s*<lname>(.*)</lname>\\s*<landmark>(.*)</landmark>\\s*<city>(.*)</city>\\s*<state>(.*)</state>\\s*<contact>(.*)</contact>\\s*<email>(.*)</email>\\s*<PAN_Card>(.*)</PAN_Card>\\s*<URL>(.*)</URL>\\s*</property>')) AS (fname:chararray, lname:chararray, landmark:chararray, city:chararray, state:chararray, contact:int, email:chararray, PAN_Card:chararray, URL:chararray);
dump values;
Output will be:
(kalyan,hadoop,annapurna block,hyderabad,Telengana,1234567890,kalyan@gmail.com,0011542,kalyanhadooptraining.blogspot.com)
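The long pattern above is the same small piece repeated once per element: an opening tag, a capture group, a closing tag, then \s* to absorb the line break before the next element. The Java sketch below builds that pattern from a list of tag names and applies it to the sample record. It assumes whole-string matching as in java.util.regex, and PropertyPatternSketch, buildRegex and parse are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Pig.
class PropertyPatternSketch {
    static final List<String> TAGS = Arrays.asList(
        "fname", "lname", "landmark", "city", "state",
        "contact", "email", "PAN_Card", "URL");

    // Builds <property>\s*<fname>(.*)</fname>\s*...<URL>(.*)</URL>\s*</property>
    static String buildRegex() {
        StringBuilder sb = new StringBuilder("<property>\\s*");
        for (String tag : TAGS) {
            sb.append("<").append(tag).append(">(.*)</").append(tag).append(">\\s*");
        }
        return sb.append("</property>").toString();
    }

    // Returns the captured values as a tuple-like string,
    // or null when the pattern does not match the whole record.
    static String parse(String doc) {
        Matcher m = Pattern.compile(buildRegex()).matcher(doc);
        if (!m.matches()) {
            return null;
        }
        StringJoiner tuple = new StringJoiner(",", "(", ")");
        for (int i = 1; i <= m.groupCount(); i++) {
            tuple.add(m.group(i));
        }
        return tuple.toString();
    }

    public static void main(String[] args) {
        String doc = "<property>\n<fname>kalyan</fname>\n<lname>hadoop</lname>\n"
            + "<landmark>annapurna block</landmark>\n<city>hyderabad</city>\n"
            + "<state>Telengana</state>\n<contact>1234567890</contact>\n"
            + "<email>kalyan@gmail.com</email>\n<PAN_Card>0011542</PAN_Card>\n"
            + "<URL>kalyanhadooptraining.blogspot.com</URL>\n</property>";
        System.out.println(parse(doc));
    }
}
```

Note that the pattern depends on the elements appearing in exactly this order. A record with a missing or reordered element does not match, and parse returns null for it.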