In this article, I describe how to load data from an XML file into a pandas DataFrame using xml.etree.ElementTree.
The sample XML file describes the people employed in a company and has the following form:
<persons>
    <person id="">
        <position></position>
        <first_name></first_name>
        <last_name></last_name>
        <email></email>
        <salary></salary>
    </person>
</persons>
Each person is identified by a unique id attribute and has the following data: position, first name, last name, email address and salary.
First, I import the necessary modules: xml.etree.ElementTree to parse the XML document, and pandas to build the DataFrame. From the collections module I import defaultdict, which will hold one list per field (first names, last names, salaries, etc.); I later pass this dictionary to the DataFrame constructor.
import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd
In the next line, I create a dictionary that will store employee data obtained from the XML file:
persons = defaultdict(list)
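As a brief illustration of why defaultdict(list) is convenient here, the sketch below appends to keys that do not exist yet. The variable name and the sample values are hypothetical and are not taken from the project files.

```python
from collections import defaultdict

columns = defaultdict(list)

# No need to create the lists first: a missing key starts as an empty list.
columns['first_name'].append('Anna')
columns['first_name'].append('Jan')
columns['salary'].append('4800')

print(dict(columns))
```

With a plain dict, the first append for each key would raise a KeyError unless the list had been created beforehand.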
Then I parse the XML file and get its root element, persons. For each child element, person, I read the value of its id attribute and the text of its child elements: position, first_name, last_name, etc. Each value is appended to the corresponding list in the dictionary.
tree = et.parse("persons.xml")
root = tree.getroot()
for child in root:
    # person_id avoids shadowing the built-in id() function
    person_id = child.attrib.get('id')
    position = child.find('position').text
    first_name = child.find('first_name').text
    last_name = child.find('last_name').text
    email = child.find('email').text
    salary = child.find('salary').text
    # append each value to the list for its column
    persons['id'].append(person_id)
    persons['position'].append(position)
    persons['first_name'].append(first_name)
    persons['last_name'].append(last_name)
    persons['email'].append(email)
    persons['salary'].append(salary)
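The loop above assumes that every person has all five child elements. When an element is missing, find() returns None and reading .text raises an AttributeError. A possible alternative, shown here only as a sketch, uses findtext(), which returns a default value instead. The inline XML string, its values and the fields list are hypothetical and are used only so that the example runs on its own.

```python
import xml.etree.ElementTree as et
from collections import defaultdict

# Hypothetical sample; the second person has no <email> element.
xml_text = """
<persons>
  <person id="1">
    <position>Developer</position>
    <first_name>Anna</first_name>
    <last_name>Nowak</last_name>
    <email>anna@example.com</email>
    <salary>5200.50</salary>
  </person>
  <person id="2">
    <position>Analyst</position>
    <first_name>Jan</first_name>
    <last_name>Kowalski</last_name>
    <salary>4800</salary>
  </person>
</persons>
"""

fields = ['position', 'first_name', 'last_name', 'email', 'salary']
persons = defaultdict(list)

root = et.fromstring(xml_text)
for person in root.iter('person'):
    persons['id'].append(person.get('id'))
    for field in fields:
        # findtext returns the default (None here) when the element is absent.
        persons[field].append(person.findtext(field, default=None))

print(persons['email'])
```

Looping over a list of field names also keeps the code short when more elements are added to each person.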
I pass the dictionary to the DataFrame constructor. The dictionary keys become the column names, and the id column is set as the DataFrame index.
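This step could look roughly like the sketch below. The two records are hypothetical stand-ins for the values parsed from persons.xml, and the variable name df is an assumption.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample data standing in for the values parsed from persons.xml.
persons = defaultdict(list)
persons['id'] += ['1', '2']
persons['position'] += ['Developer', 'Analyst']
persons['first_name'] += ['Anna', 'Jan']
persons['last_name'] += ['Nowak', 'Kowalski']
persons['email'] += ['anna@example.com', 'jan@example.com']
persons['salary'] += ['5200.50', '4800']

# Dictionary keys become column names; the id column becomes the index.
df = pd.DataFrame(persons).set_index('id')
print(df)
```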
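Then I change the data type of the salary column to float, because the values read from the XML file are strings, and sort the frame by that column in descending order. Alternatively, a dtype argument can be passed when creating the DataFrame, but it accepts only a single type that applies to the whole frame, so converting just the salary column afterwards is usually simpler here.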
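A minimal sketch of the conversion and sorting, assuming a frame with a salary column stored as text. The sample rows and the name df are hypothetical.

```python
import pandas as pd

# Hypothetical frame with salaries still stored as strings, as read from XML.
df = pd.DataFrame(
    {'first_name': ['Anna', 'Jan', 'Ewa'], 'salary': ['5200.50', '4800', '6100']},
    index=pd.Index(['1', '2', '3'], name='id'),
)

# Convert the text values to floats, then sort from highest to lowest salary.
df['salary'] = df['salary'].astype(float)
df = df.sort_values('salary', ascending=False)
print(df)
```

Sorting before the conversion would compare the salaries as strings, so for example '900' would be placed above '5200.50'.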