In one of the projects I am working on, there was a requirement to parse a very large XML file (around 1.2 GB) in Ruby. The traditional approach, in which the whole XML file is loaded into memory and then parsed, was not feasible for a file of this size.
So, I started exploring different methods for XML parsing and came across the libxml library.
The SAX parser in libxml is event-based: it reads the file as a stream instead of loading it all into memory, and looks for XML elements as it goes. When an element is encountered, an event is fired. To parse the contents of the file, these events need to be handled.
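To make the event flow concrete, here is a minimal pure-Ruby simulation. It does not use libxml at all: the `EventLogger` class and the hand-written `events` list are assumptions made purely for illustration, and only the callback names mirror the SAX callbacks used later in this post. It shows the order in which a SAX-style parser would report a tiny document.

```ruby
# Simulated SAX event stream for:
#   <book id="bk101"><title>XML</title></book>
# No XML library is involved; the events below are written out by hand.

class EventLogger
  attr_reader :log

  def initialize
    @log = []
  end

  def on_start_element(element, attributes)
    attrs = attributes.map { |key, value| "#{key}=#{value}" }.join(' ')
    @log << "start #{element} #{attrs}".strip
  end

  def on_characters(chars)
    @log << "text #{chars}"
  end

  def on_end_element(element)
    @log << "end #{element}"
  end
end

# Each entry is [callback name, *arguments], listed in document order.
events = [
  [:on_start_element, 'book', { 'id' => 'bk101' }],
  [:on_start_element, 'title', {}],
  [:on_characters, 'XML'],
  [:on_end_element, 'title'],
  [:on_end_element, 'book']
]

logger = EventLogger.new
events.each { |name, *args| logger.public_send(name, *args) }
puts logger.log
```

The handler never sees the document as a whole, only one event at a time, which is why memory use stays flat however large the file is.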
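To get started, we need to install the libxml2 development packages and the libxml-ruby gem. On Debian or Ubuntu, install the system packages first, since the gem builds a native extension against them:

    sudo apt-get install libxml2
    sudo apt-get install libxml2-dev libxslt1-dev
    gem install libxml-ruby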
The structure of a typical program to parse using libxml is as follows:

    require 'libxml'
    include LibXML

    class Parser
      include XML::SaxParser::Callbacks

      def initialize
        # Constructor
      end

      def on_start_element(element, attributes)
        # This event is fired when the start of an element is found.
      end

      def on_cdata_block(cdata)
        # This event is fired when a CDATA block is found.
      end

      def on_characters(chars)
        # This event is fired when characters are encountered between
        # the start and end of an element.
      end

      def on_end_element(element)
        # This event is fired when the end of an element is found.
      end
    end

    parser = XML::SaxParser.file("large_file.xml")
    parser.callbacks = Parser.new
    parser.parse
Let us try this out with an example. For this example I am using the XML file from the following location:
I have saved the XML file as large_file.xml. As this is just an example, I am using a small file; however, the code above will work for large files too, without any change.
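If you want to try the parser on a genuinely large input, one option is to generate one. The following is a rough sketch and not part of the original project: the `write_catalog` helper, the output file name, and the placeholder field values are all made up for illustration. It writes one record at a time, so the generator itself never holds the whole document in memory.

```ruby
require 'tmpdir'

# Hypothetical helper: writes a catalog with `count` synthetic <book> records.
# The field values are placeholders, not real data.
def write_catalog(path, count)
  File.open(path, 'w') do |file|
    file.puts '<?xml version="1.0"?>'
    file.puts '<catalog>'
    count.times do |i|
      file.puts %(  <book id="bk#{101 + i}">)
      file.puts "    <author>Author #{i}</author>"
      file.puts "    <title>Title #{i}</title>"
      file.puts '    <genre>Computer</genre>'
      file.puts '    <price>9.95</price>'
      file.puts '    <publish_date>2000-10-01</publish_date>'
      file.puts "    <description>Description of book #{i}.</description>"
      file.puts '  </book>'
    end
    file.puts '</catalog>'
  end
end

# Raise the count to produce a bigger file.
path = File.join(Dir.tmpdir, 'generated_catalog.xml')
write_catalog(path, 1_000)
puts "Wrote #{File.size(path)} bytes to #{path}"
```

The generated file uses the same element names as the sample catalog, so it can be passed to the parser in place of large_file.xml.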
Sample from the XML file:

    <catalog>
      <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <genre>Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>An in-depth look at creating applications with XML.</description>
      </book>
    </catalog>
The code to parse an XML file containing book elements like the one shown above is as follows:

    require 'libxml'
    include LibXML

    class Parser
      include XML::SaxParser::Callbacks

      # Child elements of <book> whose text is printed, with their output labels.
      FIELD_LABELS = {
        'author'       => 'Author',
        'title'        => 'Title',
        'genre'        => 'Genre',
        'price'        => 'Price',
        'publish_date' => 'Publish Date',
        'description'  => 'Description'
      }.freeze

      def initialize
        # Text collected for the element being read; nil outside a tracked element.
        @read_string = nil
      end

      def on_start_element(element, attributes)
        name = element.to_s
        case name
        when 'catalog'
          puts 'Catalog Started'
        when 'book'
          puts "ID : #{attributes['id']}"
        else
          # Start collecting text when a tracked field opens.
          @read_string = '' if FIELD_LABELS.key?(name)
        end
      end

      def on_cdata_block(cdata)
        puts "CDATA Found: #{cdata}"
      end

      def on_characters(chars)
        # Text can arrive in several chunks, so append to what we already have.
        @read_string += chars unless @read_string.nil?
      end

      def on_end_element(element)
        name = element.to_s
        case name
        when 'catalog'
          puts 'Catalog Ended'
        when 'book'
          # Blank line between books.
          puts
        else
          if FIELD_LABELS.key?(name)
            puts "#{FIELD_LABELS[name]} :#{@read_string}"
            @read_string = nil
          end
        end
      end
    end

    parser = XML::SaxParser.file('large_file.xml')
    parser.callbacks = Parser.new
    parser.parse
As you can see above, the event handlers process the XML file element by element: text is collected between a start and an end event, and printed when the end event arrives.
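For a really large file, printing is rarely the goal; more often each record is handed to some other code and then discarded. Here is a possible variation, sketched under assumptions: the `BookCollector` class and its block-based interface are hypothetical and not part of the original code, and it assumes every child of `book` contains only text. It builds one hash per `book` element and yields it when the closing tag arrives, so only the current record is kept in memory. The `require` is wrapped in a `rescue` only so that the class can also be exercised by hand without the gem.

```ruby
begin
  require 'libxml'
rescue LoadError
  # libxml-ruby is not installed; the callbacks can still be called by hand.
end

# Hypothetical handler: yields each <book> as a Hash once it is complete.
class BookCollector
  include LibXML::XML::SaxParser::Callbacks if defined?(LibXML)

  def initialize(&on_book)
    @on_book = on_book
    @book = nil # record being built; nil outside <book>
    @text = nil # text of the current child element; nil between children
  end

  def on_start_element(element, attributes)
    if element.to_s == 'book'
      @book = { 'id' => attributes['id'] }
    elsif @book
      @text = ''
    end
  end

  def on_characters(chars)
    @text += chars if @text
  end

  def on_end_element(element)
    if element.to_s == 'book'
      @on_book.call(@book)
      @book = nil
    elsif @book && @text
      @book[element.to_s] = @text.strip
      @text = nil
    end
  end
end

collector = BookCollector.new { |book| puts "#{book['id']}: #{book['title']}" }

# Driving the callbacks by hand, in the order a SAX parser would fire them.
collector.on_start_element('book', { 'id' => 'bk101' })
collector.on_start_element('title', {})
collector.on_characters("XML Developer's ")
collector.on_characters('Guide')
collector.on_end_element('title')
collector.on_end_element('book')
# prints "bk101: XML Developer's Guide"
```

With the gem installed, the same object should be attachable to the parser in the usual way, by assigning it to `parser.callbacks` before calling `parser.parse`.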
Sample output of the above code:

    Catalog Started
    ID : bk101
    Author :Gambardella, Matthew
    Title :XML Developer's Guide
    Genre :Computer
    Price :44.95
    Publish Date :2000-10-01
    Description :An in-depth look at creating applications with XML.

    ID : bk102
    Author :Ralls, Kim
    Title :Midnight Rain
    Genre :Fantasy
    Price :5.95
    Publish Date :2000-12-16
    Description :A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.

    ID : bk103
    Author :Corets, Eva
    Title :Maeve Ascendant
    Genre :Fantasy
    Price :5.95
    Publish Date :2000-11-17
    Description :After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.

    .....
    .....
    Catalog Ended
The code for the above can be found at the following location:
I hope this helps. Let me know if you need further information.