XML file to pandas DataFrame object – minidom

In this article, I will describe how to load data from an XML file into a DataFrIn the created document, I search for a list of elements from the XML file named person and, in turn, I retrieve the value of the id attribute for each of them, which I save in the persons dictionary.ame object using minidom module.

Project files are available for download >>here<<

The same XML file as described in the previous post is used.

In addition to xml.dom.minidom, I also use the pandas module and the defaultdict class from the collections module.

Using the context manager, I open the XML file, which is loaded and from which the DOM document is created. Moreover, for each child tag, ie position, first_name, last_name, etc., I save its value.

The persons dictionary is as an argument when creating a DataFrame object.

main.py source code:

import xml.dom.minidom
from collections import defaultdict
import pandas as pd

persons = defaultdict(list)
with xml.dom.minidom.parse(open('persons.xml')) as tree:
    persons_list = tree.getElementsByTagName('person')
    for person in persons_list:
        persons['id'].append(person.getAttribute('id'))
        for tag in ('position', 'first_name', 'last_name', 'email', 'salary'):
            persons[tag].append(person.getElementsByTagName(tag)[0].firstChild.data)


df = pd.DataFrame(persons, columns=persons.keys()).set_index('id')
df['salary'] = df['salary'].astype(float)
print(df.sort_values(by='salary', ascending=False))

XML file to pandas DataFrame object – sax

In this article, I will describe how to load data from an XML file into a DataFrame object using xml.sax module.

Project files are available for download >>here<<

The same XML file as described in the previous post is used.

In addition to xml.sax, I also use the pandas module and the defaultdict class from the collections module.

The first step in reading data from an XML file using SAX is to implement your own class that inherits from the ContentHandler() class. The handler class overrides three methods of the base class: the startElement() method, which is called when reading the next tag starts, characters () method, which reads the values ​​stored for a particular element and the endElement () method, which is called when reading the corresponding tag has finished. In addition, in the init () method, I create an instance of the dictionary in which the data read from the XML file will be saved.

In the startElement () method, I define a class attribute called tag, and if the processed tag is ‘person’ I write the id number for the person.

Then, in the characters() method, I save the values ​​stored in each element as the appropriate class variables.

In the endElement () method, I save the retrieved values ​​to the persons dictionary.

Parsing the XML file with the xml.sax module is done by calling the make_parser() method, which returns an instance of the parser. Then the handler instance is assigned to the created parser as an argument to the setContentHandler function() – in this case the PersonsHandler class.

The parse () method is then called to parse the source XML document.

The persons variable holds the values ​​of the dictionary created by the handler, and that dictionary is passed as a parameter when creating a DataFrame.

main.py souce code:

import xml.sax
from collections import defaultdict
import pandas as pd

class PersonsHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.persons = defaultdict(list)
    def startElement(self, tag, attr):
        self.tag = tag
        if tag == 'person':
            self.persons['id'].append(attr['id'])
            
    def characters(self, content):
        if content.strip():
            if self.tag == 'position': self.position = content
            elif self.tag == 'first_name': self.first_name = content
            elif self.tag == 'last_name': self.last_name = content
            elif self.tag == 'email': self.email = content
            elif self.tag == 'salary': self.salary = content
    
    def endElement(self, tag):
        if tag == 'position': self.persons['position'].append(self.position)
        elif tag == 'first_name': self.persons['first_name'].append(self.first_name)
        elif tag == 'last_name': self.persons['last_name'].append(self.last_name)
        elif tag == 'email': self.persons['email'].append(self.email)
        elif tag == 'salary': self.persons['salary'].append(self.salary)

parser = xml.sax.make_parser()
parser.setContentHandler(PersonsHandler())
parser.parse(open('persons.xml'))
persons = parser.getContentHandler().persons


df = pd.DataFrame(persons, columns=persons.keys()).set_index('id')
df['salary'] = df['salary'].astype(float)
print(df.sort_values(by='salary', ascending=False))

XML file to pandas DataFrame object – ElementTree

In this article, I will describe how to load data from an XML file into a DataFrame object using xml.etree.ElementTree.

Project files are available for download >>here<<

The sample XML file will describe people employed in the company and will have the following form:

<persons>
    <person id="">
        <position></position>
        <first_name></first_name>
        <last_name></last_name>
        <email></email>
        <salary></salary>
    </person>
</persons>

The following data is saved for each person with a unique id attribute: position, name, surname, email address and salary.

First, I import the necessary modules, ie xml.etree.ElementTree to parse the XML document. From the collections module I import defaultdict, which will store lists containing first names, surnames, salaries, etc., which I then pass as an argument of the DataFrame class.

import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd

In the next line, I create a dictionary that will store employee data obtained from the XML file:

persons = defaultdict(list)

Then I load the XML file and download the parent element – root – persons, and then for each child element – person I get the value of its id attribute, as well as the values ​​of its elements: position, first_name, last_name, etc. Each of these values ​​is added as the next element of the appropriate dictionary list.

tree = et.parse("persons.xml")
root = tree.getroot()
for child in root:
    id = child.attrib.get('id')
    position = child.find('position').text
    first_name = child.find('first_name').text
    last_name = child.find('last_name').text
    email = child.find('email').text
    salary = child.find('salary').text
    
    persons['id'].append(id)
    persons['position'].append(position)
    persons['first_name'].append(first_name)
    persons['last_name'].append(last_name)
    persons['email'].append(email)
    persons['salary'].append(salary)

I pass the created dictionary as an argument to the DataFrame object being created, with the names of the dictionary keys as the names of the columns and the id column as the DataFrame index.

Then I change the data type of the salary column to float to sort the frame in order of decreasing values ​​of that column. Alternatively, the conversion to float could be done when creating the DataFrame object, additionally providing the dtype argument.