Refueling report from xlsx file to Pandas + localized string resources

The source files of the program along with a sample report from a gas station can be found >> here <<

The Reports class has three methods: __init__(), load_resources(), and my_parser().

The __init__() method reads all files in the Data directory that have an xlsx extension. The last added file from this file list is loaded – the pandas frame object is created.

In the load_resources() method, I load localized strings depending on the user’s locale. If there is no appropriate file, the version with messages in English is loaded.

In the my_parser () method, I sum up the collected fuel with the division into vehicles. In case of discrepancies with the source documents, you can view a detailed list of refueling for a specific vehicle giving part of the registration number.

script code: main.py

import os
import glob
import locale
import calendar
import pandas as pd
import importlib_resources


class Reports:
	
    def __init__(self):
        self.load_resources()
        script_dir = os.path.dirname(os.path.abspath(__file__))
        data_dir = os.path.join(script_dir, 'Data')
        list_of_files = glob.glob(os.path.join(data_dir, '*.xlsx'))
        latest_file = max(list_of_files, key=os.path.getctime)
        path_to_file = os.path.join(data_dir, latest_file)
        self.df = pd.read_excel(path_to_file)
        self.my_parser()
    
    def load_resources(self):
        self.locale, encoding = locale.getdefaultlocale()
        r = importlib_resources.files('Resources')
        try:
            r_strings = (r / f'strings_{self.locale}.txt').read_text(encoding='utf-8').splitlines()
        except:
            r_strings = (r / f'strings_en_US.txt').read_text(encoding='utf-8').splitlines()
            self.locale = 'en_US'
        self.r_str = dict(x.split(':') for x in r_strings)


    
    def my_parser(self):
        os.system('cls' if os.name == 'nt' else 'clear')
        print()
        f_date = self.df['Date'].min()
        l_date = self.df['Date'].max()

        with calendar.different_locale(self.locale):
            print(self.r_str['title'], f'({f_date.day} -  {l_date.day} {calendar.month_abbr[int(l_date.month)]} {l_date.year})')
        self.df['Amount'] = self.df['Amount'].astype(str)
        self.df['Amount'] = self.df['Amount'].str.replace(',', '.')
        self.df['Amount'] = self.df['Amount'].astype(float)
        print()
        print(self.r_str['fuel_total'], self.df['Amount'].sum())
        cars_group = self.df.groupby('Registration number')
        print()
        # print(cars_group['Amount'].sum()) #standard output
        for item in zip(cars_group.groups, cars_group['Amount'].sum().values):
            name, value = item
            print(f'{name}    {value}')


        print()
        while True:
            choice = input(self.r_str['choice'] + ': ')
            os.system('cls' if os.name == 'nt' else 'clear')
            if choice == '/q': raise SystemExit
            if choice == '/m': self.my_parser()
            for key, value in cars_group['Amount']:
                if choice.upper() in key:
                    print(self.r_str['refueling'], key)
                    total = 0
                    for i, v in zip(value.index, value.values):
                        print(f"{self.df.loc[i, 'Date']}   {v}")
                        total +=v
                    print(self.r_str['ref_sum'] + f' {key}: {total}')
                    print()


if __name__ == "__main__":
    reports = Reports()

XML file to pandas DataFrame object – minidom

In this article, I will describe how to load data from an XML file into a DataFrIn the created document, I search for a list of elements from the XML file named person and, in turn, I retrieve the value of the id attribute for each of them, which I save in the persons dictionary.ame object using minidom module.

Project files are available for download >>here<<

The same XML file as described in the previous post is used.

In addition to xml.dom.minidom, I also use the pandas module and the defaultdict class from the collections module.

Using the context manager, I open the XML file, which is loaded and from which the DOM document is created. Moreover, for each child tag, ie position, first_name, last_name, etc., I save its value.

The persons dictionary is as an argument when creating a DataFrame object.

main.py source code:

import xml.dom.minidom
from collections import defaultdict
import pandas as pd

persons = defaultdict(list)
with xml.dom.minidom.parse(open('persons.xml')) as tree:
    persons_list = tree.getElementsByTagName('person')
    for person in persons_list:
        persons['id'].append(person.getAttribute('id'))
        for tag in ('position', 'first_name', 'last_name', 'email', 'salary'):
            persons[tag].append(person.getElementsByTagName(tag)[0].firstChild.data)


df = pd.DataFrame(persons, columns=persons.keys()).set_index('id')
df['salary'] = df['salary'].astype(float)
print(df.sort_values(by='salary', ascending=False))

XML file to pandas DataFrame object – sax

In this article, I will describe how to load data from an XML file into a DataFrame object using xml.sax module.

Project files are available for download >>here<<

The same XML file as described in the previous post is used.

In addition to xml.sax, I also use the pandas module and the defaultdict class from the collections module.

The first step in reading data from an XML file using SAX is to implement your own class that inherits from the ContentHandler() class. The handler class overrides three methods of the base class: the startElement() method, which is called when reading the next tag starts, characters () method, which reads the values ​​stored for a particular element and the endElement () method, which is called when reading the corresponding tag has finished. In addition, in the init () method, I create an instance of the dictionary in which the data read from the XML file will be saved.

In the startElement () method, I define a class attribute called tag, and if the processed tag is ‘person’ I write the id number for the person.

Then, in the characters() method, I save the values ​​stored in each element as the appropriate class variables.

In the endElement () method, I save the retrieved values ​​to the persons dictionary.

Parsing the XML file with the xml.sax module is done by calling the make_parser() method, which returns an instance of the parser. Then the handler instance is assigned to the created parser as an argument to the setContentHandler function() – in this case the PersonsHandler class.

The parse () method is then called to parse the source XML document.

The persons variable holds the values ​​of the dictionary created by the handler, and that dictionary is passed as a parameter when creating a DataFrame.

main.py souce code:

import xml.sax
from collections import defaultdict
import pandas as pd

class PersonsHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.persons = defaultdict(list)
    def startElement(self, tag, attr):
        self.tag = tag
        if tag == 'person':
            self.persons['id'].append(attr['id'])
            
    def characters(self, content):
        if content.strip():
            if self.tag == 'position': self.position = content
            elif self.tag == 'first_name': self.first_name = content
            elif self.tag == 'last_name': self.last_name = content
            elif self.tag == 'email': self.email = content
            elif self.tag == 'salary': self.salary = content
    
    def endElement(self, tag):
        if tag == 'position': self.persons['position'].append(self.position)
        elif tag == 'first_name': self.persons['first_name'].append(self.first_name)
        elif tag == 'last_name': self.persons['last_name'].append(self.last_name)
        elif tag == 'email': self.persons['email'].append(self.email)
        elif tag == 'salary': self.persons['salary'].append(self.salary)

parser = xml.sax.make_parser()
parser.setContentHandler(PersonsHandler())
parser.parse(open('persons.xml'))
persons = parser.getContentHandler().persons


df = pd.DataFrame(persons, columns=persons.keys()).set_index('id')
df['salary'] = df['salary'].astype(float)
print(df.sort_values(by='salary', ascending=False))

XML file to pandas DataFrame object – ElementTree

In this article, I will describe how to load data from an XML file into a DataFrame object using xml.etree.ElementTree.

Project files are available for download >>here<<

The sample XML file will describe people employed in the company and will have the following form:

<persons>
    <person id="">
        <position></position>
        <first_name></first_name>
        <last_name></last_name>
        <email></email>
        <salary></salary>
    </person>
</persons>

The following data is saved for each person with a unique id attribute: position, name, surname, email address and salary.

First, I import the necessary modules, ie xml.etree.ElementTree to parse the XML document. From the collections module I import defaultdict, which will store lists containing first names, surnames, salaries, etc., which I then pass as an argument of the DataFrame class.

import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd

In the next line, I create a dictionary that will store employee data obtained from the XML file:

persons = defaultdict(list)

Then I load the XML file and download the parent element – root – persons, and then for each child element – person I get the value of its id attribute, as well as the values ​​of its elements: position, first_name, last_name, etc. Each of these values ​​is added as the next element of the appropriate dictionary list.

tree = et.parse("persons.xml")
root = tree.getroot()
for child in root:
    id = child.attrib.get('id')
    position = child.find('position').text
    first_name = child.find('first_name').text
    last_name = child.find('last_name').text
    email = child.find('email').text
    salary = child.find('salary').text
    
    persons['id'].append(id)
    persons['position'].append(position)
    persons['first_name'].append(first_name)
    persons['last_name'].append(last_name)
    persons['email'].append(email)
    persons['salary'].append(salary)

I pass the created dictionary as an argument to the DataFrame object being created, with the names of the dictionary keys as the names of the columns and the id column as the DataFrame index.

Then I change the data type of the salary column to float to sort the frame in order of decreasing values ​​of that column. Alternatively, the conversion to float could be done when creating the DataFrame object, additionally providing the dtype argument.

DataFrame object in Pandas based on data from pdf file

Reading data from a pdf file requires the tabula-py module to be installed. This module also enables saving the read data to a data file in csv or json format.

import tabula
df_list = tabula.read_pdf('file.pdf')

The read_pdf function reads one page from a pdf file by default, if no value is given for the pages parameter (if you want to load all pages: pages = ‘all’).

The above function returns a list object containing successive DataFrame objects, for example:

df = df_list[0]        # first DataFrame object