In one of the projects I am working on, there was a requirement to parse a very large XML file (around 1.2 GB) in Ruby. The traditional method of parsing, wherein the whole XML file is loaded into memory and then parsed, was not a feasible approach for a file of this size.
So, I started exploring different methods for XML parsing and came across the libxml library.
The approach I used, libxml's SAX parser, is event based: the parser reads the file sequentially from start to end and looks for XML elements. When an element is encountered, an event is fired. To parse the contents of the file, these events need to be handled.
To get started we need to install the following:
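```shell
# Install the libxml2 library and headers first; the gem builds a native extension against them.
sudo apt-get install libxml2
sudo apt-get install libxml2-dev libxslt1-dev
gem install libxml-ruby
```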
The structure of a typical program to parse using libxml is as follows:
```ruby
require 'libxml'
include LibXML

class Parser
  include XML::SaxParser::Callbacks

  def initialize
    # Constructor
  end

  def on_start_element(element, attributes)
    # This event is fired when the start of an element is found.
  end

  def on_cdata_block(cdata)
    # This event is fired when a CDATA block is found.
  end

  def on_characters(chars)
    # This event is fired when characters are encountered between the start and end of an element.
  end

  def on_end_element(element)
    # This event is fired when the end of an element is found.
  end
end

parser = XML::SaxParser.file('large_file.xml')
parser.callbacks = Parser.new
parser.parse
```
Let us try this out with an example. For this example I am using the XML file from the following location:
I have saved the XML file as large_file.xml. As this is just an example, I am using a small file; however, the code structure shown above will work for large files too, without any change.
Sample from the XML file:
```xml
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications with XML.</description>
  </book>
</catalog>
```
The code to parse an XML file containing book elements like the one shown above is as follows:
```ruby
require 'libxml'
include LibXML

class Parser
  include XML::SaxParser::Callbacks

  # Child elements of <book> whose text is printed, mapped to their output labels.
  LABELS = {
    'author'       => 'Author',
    'title'        => 'Title',
    'genre'        => 'Genre',
    'price'        => 'Price',
    'publish_date' => 'Publish Date',
    'description'  => 'Description'
  }.freeze

  def initialize
    # Text collected for the current element; nil while outside a tracked element.
    @read_string = nil
  end

  def on_start_element(element, attributes)
    name = element.to_s
    if name == 'catalog'
      puts 'Catalog Started'
    elsif name == 'book'
      puts "ID : #{attributes['id']}"
    elsif LABELS.key?(name)
      # Start collecting the text of this element.
      @read_string = ''
    end
  end

  def on_cdata_block(cdata)
    puts "CDATA Found: #{cdata}"
  end

  def on_characters(chars)
    # The text of one element may arrive in several chunks, so append each one.
    @read_string += chars unless @read_string.nil?
  end

  def on_end_element(element)
    name = element.to_s
    if name == 'catalog'
      puts 'Catalog Ended'
    elsif name == 'book'
      puts # blank line between books
    elsif LABELS.key?(name)
      puts "#{LABELS[name]} :#{@read_string}"
      @read_string = nil
    end
  end
end

parser = XML::SaxParser.file('large_file.xml')
parser.callbacks = Parser.new
parser.parse
```
As the code above shows, the event handlers process the XML file element by element.
Sample output of the above code:
```
Catalog Started
ID : bk101
Author :Gambardella, Matthew
Title :XML Developer's Guide
Genre :Computer
Price :44.95
Publish Date :2000-10-01
Description :An in-depth look at creating applications with XML.

ID : bk102
Author :Ralls, Kim
Title :Midnight Rain
Genre :Fantasy
Price :5.95
Publish Date :2000-12-16
Description :A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.

ID : bk103
Author :Corets, Eva
Title :Maeve Ascendant
Genre :Fantasy
Price :5.95
Publish Date :2000-11-17
Description :After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.
...
Catalog Ended
```
The code for the parser example in this post can be found at the following location:
Hope this helps and let me know if you need further information.