XML file to pandas DataFrame object #1

In this article, I will describe how to load data from an XML file into a DataFrame object using xml.etree.ElementTree.


The sample XML file will describe people employed in the company and will have the following form:

<persons>
    <person id="">
        <position></position>
        <first_name></first_name>
        <last_name></last_name>
        <email></email>
        <salary></salary>
    </person>
</persons>

Each person is identified by a unique id attribute and has the following data: position, first name, last name, email address and salary.

First, I import the necessary modules: xml.etree.ElementTree to parse the XML document, and pandas to build the DataFrame. From the collections module I import defaultdict, which will hold the lists of positions, first names, last names, salaries, etc. that I later pass to the DataFrame constructor.

import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd

In the next line, I create a dictionary that will store employee data obtained from the XML file:

persons = defaultdict(list)

Then I parse the XML file and get its root element, persons. For each child element, person, I read the value of its id attribute and the text of its child elements: position, first_name, last_name, etc. Each value is appended to the matching list in the dictionary.

# Parse the file and get the root <persons> element
tree = et.parse("persons.xml")
root = tree.getroot()
for child in root:
    # person_id avoids shadowing the built-in id()
    person_id = child.attrib.get('id')
    position = child.find('position').text
    first_name = child.find('first_name').text
    last_name = child.find('last_name').text
    email = child.find('email').text
    salary = child.find('salary').text

    # Append each value to the list stored under its key
    persons['id'].append(person_id)
    persons['position'].append(position)
    persons['first_name'].append(first_name)
    persons['last_name'].append(last_name)
    persons['email'].append(email)
    persons['salary'].append(salary)
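
One thing to keep in mind with the loop above: child.find() returns None when an element is missing, so reading .text from it raises AttributeError. The sketch below is one possible variant, not the version used in this project, that relies on Element.findtext(), which returns None for a missing element. The inline SAMPLE_XML string and the FIELDS tuple are made up for illustration so the snippet runs without persons.xml.

```python
import xml.etree.ElementTree as et
from collections import defaultdict

# Hypothetical sample data; the second person has no <email> element.
SAMPLE_XML = """<persons>
    <person id="1">
        <position>Developer</position>
        <first_name>Anna</first_name>
        <last_name>Nowak</last_name>
        <email>anna.nowak@example.com</email>
        <salary>5200.50</salary>
    </person>
    <person id="2">
        <position>Tester</position>
        <first_name>Jan</first_name>
        <last_name>Kowalski</last_name>
        <salary>4300</salary>
    </person>
</persons>"""

# Child elements expected under each <person> (assumed from the sample form)
FIELDS = ('position', 'first_name', 'last_name', 'email', 'salary')

persons = defaultdict(list)
for child in et.fromstring(SAMPLE_XML):
    persons['id'].append(child.get('id'))
    for field in FIELDS:
        # findtext() returns None instead of raising when the element is absent
        persons[field].append(child.findtext(field))

print(persons['email'])  # ['anna.nowak@example.com', None]
```

Every list still gets one entry per person, so the lists stay the same length and a missing value simply shows up as None.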

I pass the dictionary to the DataFrame constructor. The dictionary keys become the column names, and the id column is set as the DataFrame index.
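
A minimal sketch of this step could look as follows. The small hand-written dictionary is hypothetical and only stands in for the persons dictionary filled from the XML file; the variable name df is my own choice as well.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical stand-in for the dictionary built while reading the XML file
persons = defaultdict(list)
persons['id'] += ['1', '2']
persons['position'] += ['Developer', 'Tester']
persons['first_name'] += ['Anna', 'Jan']
persons['last_name'] += ['Nowak', 'Kowalski']
persons['email'] += ['anna.nowak@example.com', 'jan.kowalski@example.com']
persons['salary'] += ['5200.50', '4300']

# Keys become column names; set_index() moves the id column into the index
df = pd.DataFrame(persons).set_index('id')
print(df)
```

Using set_index() is one way to do it; building the frame with pd.DataFrame(persons) and keeping id as an ordinary column would work just as well if the index is not needed.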

Then I change the data type of the salary column to float, because values read from XML are strings, and sort the frame by that column in descending order. Alternatively, the conversion could be requested when creating the DataFrame object by providing the dtype argument, but dtype applies to every column, so that only suits frames whose columns are all numeric.
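
The conversion and the sort can be sketched like this. The three-row frame is hypothetical sample data standing in for the frame built from persons.xml, and df_sorted is a name I picked for the result.

```python
import pandas as pd

# Hypothetical frame; salary holds strings, as it would after reading XML
df = pd.DataFrame(
    {
        'id': ['1', '2', '3'],
        'last_name': ['Nowak', 'Kowalski', 'Lis'],
        'salary': ['5200.50', '7100', '4300.25'],
    }
).set_index('id')

# Convert the text values to float so they sort numerically
df['salary'] = df['salary'].astype(float)

# Highest salary first
df_sorted = df.sort_values('salary', ascending=False)
print(df_sorted)
```

Without the conversion the column would be sorted as text, so for example '10000' would be placed before '7100' in ascending order.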
