XML file to pandas DataFrame object – sax

In this article, I describe how to load data from an XML file into a pandas DataFrame object using the xml.sax module.

Project files are available for download >>here<<

The same XML file as described in the previous post is used.

In addition to xml.sax, I also use the pandas module and the defaultdict class from the collections module.

The first step in reading data from an XML file using SAX is to implement your own class that inherits from the ContentHandler class. The handler overrides three methods of the base class: startElement(), which is called when the parser reaches an opening tag; characters(), which receives the text stored inside an element; and endElement(), which is called when the parser reaches the corresponding closing tag. In addition, in the __init__() method, I create the dictionary in which the data read from the XML file will be saved.
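To see the order in which these three callbacks fire, here is a minimal sketch of a handler that only records the events it receives. The EventLogger class and the sample document are made up for illustration; they are not part of the project files.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records every SAX callback it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs maps attribute names to their string values
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        # Skip whitespace-only text between tags
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


# Made-up sample document, not the real persons.xml
sample = b'<persons><person id="1"><email>a@example.com</email></person></persons>'

logger = EventLogger()
xml.sax.parseString(sample, logger)
for event in logger.events:
    print(event)
```

Each element produces a "start" event, then its text (if any), then an "end" event, and nested elements are reported inside their parent's start and end events.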

In the startElement() method, I store the name of the current tag in an instance attribute called tag, and if the processed tag is 'person' I save the value of its id attribute.

Then, in the characters() method, I save the text stored in each element in the corresponding instance attribute.
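One caveat: the xml.sax documentation allows a parser to deliver the text of a single element in several chunks, so a plain assignment in characters() can end up keeping only the last chunk. A possible variation, sketched below, collects the chunks in a buffer and joins them when the element ends. The BufferedPersonsHandler name, the FIELDS set and the sample data are assumptions made for this example; the element names are taken from the handler used in this post.

```python
import xml.sax
from collections import defaultdict

# Element names assumed from the handler in this post
FIELDS = {"position", "first_name", "last_name", "email", "salary"}


class BufferedPersonsHandler(xml.sax.ContentHandler):
    """Hypothetical variant that tolerates text delivered in several chunks."""

    def __init__(self):
        super().__init__()
        self.persons = defaultdict(list)
        self.buffer = []

    def startElement(self, tag, attr):
        # A new element starts, so drop any text collected so far
        self.buffer = []
        if tag == "person":
            self.persons["id"].append(attr["id"])

    def characters(self, content):
        # May be called more than once per element
        self.buffer.append(content)

    def endElement(self, tag):
        if tag in FIELDS:
            self.persons[tag].append("".join(self.buffer).strip())
        self.buffer = []


# Made-up sample; the entity is a typical place where text may be split
sample = b"""<persons>
  <person id="7">
    <first_name>Ann</first_name>
    <last_name>O&apos;Neil</last_name>
    <salary>1000.50</salary>
  </person>
</persons>"""

handler = BufferedPersonsHandler()
xml.sax.parseString(sample, handler)
print(dict(handler.persons))
```

Whether the text actually arrives in pieces depends on the parser and the input, but the joined result is the same either way.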

In the endElement() method, I append the retrieved values to the persons dictionary.

Parsing the XML file with the xml.sax module starts with a call to the make_parser() function, which returns a parser instance. A handler instance, in this case an instance of the PersonsHandler class, is then attached to the parser by passing it to the setContentHandler() method.

The parse() method is then called to parse the source XML document.
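The wiring described above can be tried out without any file on disk, because parse() also accepts a file-like object. The sketch below uses a throwaway TagCounter handler and an in-memory document, both invented for this example.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts <person> elements."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "person":
            self.count += 1


# In-memory stand-in for an XML file (made-up content)
source = io.BytesIO(b'<persons><person id="1"/><person id="2"/></persons>')

handler = TagCounter()
parser = xml.sax.make_parser()      # create the parser
parser.setContentHandler(handler)   # attach the handler instance
parser.parse(source)                # the callbacks run during this call

print(handler.count)  # 2
```

Keeping a reference to the handler in a variable makes it easy to read the collected data after parsing. The module-level xml.sax.parse(source, handler) function combines the three parser steps into one call.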

The persons variable holds the dictionary built by the handler, and that dictionary is passed as a parameter when creating the DataFrame.
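The dictionary maps each column name to a list of values, which is a shape that pandas accepts directly. Here is a small sketch with made-up values standing in for what the handler would collect:

```python
from collections import defaultdict

import pandas as pd

# Made-up data in the same shape the handler produces: column name -> list of strings
persons = defaultdict(list)
persons["id"] += ["1", "2"]
persons["first_name"] += ["Ann", "Bob"]
persons["salary"] += ["1000.50", "2500"]

df = pd.DataFrame(persons).set_index("id")
# SAX delivers every value as text, so numeric columns need an explicit conversion
df["salary"] = df["salary"].astype(float)

print(df.sort_values(by="salary", ascending=False))
```

All lists must have the same length. If some person in the XML file lacked one of the elements, the lists would differ in length and pandas would raise a ValueError when building the DataFrame.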

main.py source code:

import xml.sax
from collections import defaultdict
import pandas as pd

class PersonsHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        # Column name -> list of values, one entry per person
        self.persons = defaultdict(list)
        self.tag = None

    def startElement(self, tag, attr):
        # Remember which element is being read so characters() knows where to store text
        self.tag = tag
        if tag == 'person':
            self.persons['id'].append(attr['id'])

    def characters(self, content):
        # Ignore whitespace-only text between tags
        if content.strip():
            if self.tag == 'position': self.position = content
            elif self.tag == 'first_name': self.first_name = content
            elif self.tag == 'last_name': self.last_name = content
            elif self.tag == 'email': self.email = content
            elif self.tag == 'salary': self.salary = content

    def endElement(self, tag):
        # The element is complete, so append its value to the matching column
        if tag == 'position': self.persons['position'].append(self.position)
        elif tag == 'first_name': self.persons['first_name'].append(self.first_name)
        elif tag == 'last_name': self.persons['last_name'].append(self.last_name)
        elif tag == 'email': self.persons['email'].append(self.email)
        elif tag == 'salary': self.persons['salary'].append(self.salary)

parser = xml.sax.make_parser()
parser.setContentHandler(PersonsHandler())
# parse() accepts a file name, so no file handle is left open
parser.parse('persons.xml')
persons = parser.getContentHandler().persons


df = pd.DataFrame(persons, columns=persons.keys()).set_index('id')
df['salary'] = df['salary'].astype(float)
print(df.sort_values(by='salary', ascending=False))
