This script is designed to extract financial and date-related information from PDF files and save the results in Excel and CSV formats. It processes files in a specified directory, normalizing values and dates to a consistent format. The script also supports creating an executable file to ensure easy execution on systems without Python installed.
- Column Consistency: All extracted values, regardless of type, are stored in the same column in the Excel file.
- File Differences:
- One type of file contains "Gross Total incl. VAT" values in euros, using the European number system (comma for decimals).
- The other type contains "Total" values in USD, using the standard number system (period for decimals).
- Normalization:
- Values in euros are converted to the standard number system.
- Currency symbols are stripped from all extracted values.
- Date Handling:
- Single dates are directly extracted.
- For date ranges, only the starting date is considered.
- All dates are normalized to the format
DD-MM-YYYY.
Ensures the correct file path is taken for both Python scripts and compiled executables, enabling the script to process files located in the same directory as the executable or script.
This function extracts the target financial values ("Gross Total incl. VAT" or "Total") from the PDF files. The process involves:
- Searching for a specified target string.
- Extracting the value corresponding to the target string.
Converts extracted financial values to a standard format by:
- Replacing commas with periods (and vice versa) for the European number system.
- Stripping currency symbols.
- Ensuring consistent numerical representation.
- Date Extraction: Extracts either a single date or a range of dates from the PDF.
- Normalization: Converts all dates to the format
DD-MM-YYYY. For date ranges, only the starting date is considered.
Processes all PDF files in the specified directory by:
- Searching for the "Gross Total incl. VAT" string and extracting its value.
- If the above string is not found, searching for "Total" instead and extracting its value.
- Extracting dates and normalizing them.
Saves the extracted and processed data in the following formats:
- Excel:
- Contains two sheets:
- Data Sheet: Stores the extracted values and dates.
- Pivot Table Sheet: Provides summarized data.
- Contains two sheets:
- CSV: Contains all the data, including headers, and uses a semicolon (;) as the separator.
The script can be converted into an executable file using PyInstaller. This allows it to run on systems without Python installed. Ensure PyInstaller is installed by running:
pip install pyinstallerThe command used to create the executable is:
pyinstaller --onefile task.py- Python 3.7 or later (if running the script directly).
- Required Python libraries:
Pdfplumber(for PDF processing)pandas(for data handling)openpyxl(for Excel file creation)
Install the required libraries using:
pip install pdfplumber pandas openpyxl- Place the PDF files in the same directory as the script or executable.
- Run the script:
- If using Python:
python task.py
- If using the executable, just double click on it
- If using Python:
- The output files (
invoice_data.xlsxandinvoice_data.csv) will be generated in the same directory.
- Excel File: Contains two sheets:
Data Sheet: Stores extracted financial values and normalized dates.Pivot Table Sheet: Provides a summary of the data.
- CSV File: A flat file containing the processed data.
project_directory/
|-- task.py # Python script
|-- task.exe # Executable file (if created)
|-- sample_file_1.pdf # Sample PDF file (Gross Total incl. VAT in euros)
|-- sample_file_2.pdf # Sample PDF file (Total in USD)
|-- invoice_data.xlsx # Extracted data in Excel format
|-- invoice_data.csv # Extracted data in CSV format
- The script assumes consistent formatting in the target PDFs.
- Complex PDF layouts may require additional adjustments to the extraction logic.
- Currency conversion is not handled; the script only normalizes and extracts values.
For any issues or inquiries, please contact shahzaib.beyg@gmail.com.