This document describes the technical implementation and workflow for processing tabular invoice data (CSV, XLSX, XLS) within the AI e-Invoicing system.
The system treats tabular files as highly structured text inputs. Unlike PDF/image processing, which requires OCR, tabular processing reads cell values directly from the file using pandas, so the text passed to the LLM-based extraction layer matches the file contents and carries no OCR recognition errors.
| Layer | Component | Technology |
|---|---|---|
| Ingestion | `excel_processor.py` | pandas, openpyxl, tabulate |
| Orchestration | `orchestrator.py` | asyncio, pathlib |
| Brain | `extractor.py` | DeepSeek-V3 / GPT-4o, Direct OpenAI Client |
| Core | `models.py` | SQLAlchemy 2.0, PostgreSQL (JSONB) |
| Interface | `invoices.py` | FastAPI, Pydantic v2 |
- Trigger: User uploads a file or a batch processing script (e.g., `process_invoices.py`) is run.
- Detection: `ingestion.file_discovery.get_file_type` identifies `.csv`, `.xlsx`, or `.xls` extensions.
- Routing: `ingestion.orchestrator.process_invoice_file` dispatches the file to the `process_excel` handler.
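
The detection and routing steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names come from the text above, but their signatures, the extension mapping, and the stand-in `process_excel` body are assumptions. In particular, this sketch simply rejects non-tabular files, whereas the real orchestrator presumably routes them to other handlers.

```python
from pathlib import Path

# Assumption: this extension-to-type mapping is illustrative; the real
# get_file_type in ingestion.file_discovery may recognise more formats.
TABULAR_EXTENSIONS = {".csv": "csv", ".xlsx": "excel", ".xls": "excel"}


def get_file_type(path: Path) -> str:
    """Classify a file by its extension, ignoring case."""
    return TABULAR_EXTENSIONS.get(path.suffix.lower(), "unknown")


def process_excel(path: Path) -> dict:
    """Stand-in for the real tabular handler; returns routing metadata only."""
    return {"handler": "process_excel", "file_name": path.name}


def process_invoice_file(path: Path) -> dict:
    """Dispatch tabular files to process_excel and reject everything else."""
    file_type = get_file_type(path)
    if file_type in ("csv", "excel"):
        return process_excel(path)
    raise ValueError(f"Unsupported file type for tabular processing: {path.suffix!r}")
```

Lower-casing the suffix means `Invoice.XLSX` and `invoice.xlsx` are routed identically.
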
- Reading: `pandas.read_csv` or `pandas.read_excel` loads the data into DataFrames.
- Multi-Sheet Support: For Excel files, every sheet is parsed and converted.
- Textualization: The DataFrames are converted to a Markdown representation using `df.to_markdown()`. This preserves the column-row relationships in a way that LLMs can easily parse.
- Return: A dictionary containing the combined markdown text, sheet-wise record counts, and metadata.
- Contextual Prompting: The system prompt includes specific instructions for "TABULAR DATA": "Pay close attention to data columns. The raw text might look like a table or delimited text. Map the columns correctly to the schema."
- LLM Processing: The LLM analyzes the markdown table and maps it to the `ExtractedDataSchema` (Vendor, Date, Line Items, etc.).
- Self-Correction: If the first extraction fails validation (e.g., math mismatch), the `refine_extraction` logic provides the tabular text back to the LLM with error feedback for a second attempt.
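
The validate-then-retry pattern behind self-correction might look like the sketch below. It is not the project's `refine_extraction` implementation: the field names (`line_items`, `quantity`, `unit_price`, `total`), the single math check, and the `extract` callable standing in for the LLM call are all assumptions made for illustration.

```python
from decimal import Decimal
from typing import Callable, Optional


def validate_totals(data: dict) -> Optional[str]:
    """Return an error message if line items do not sum to the total, else None."""
    line_sum = sum(
        (
            Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
            for item in data["line_items"]
        ),
        Decimal("0"),
    )
    total = Decimal(str(data["total"]))
    if line_sum != total:
        return f"Line items sum to {line_sum} but total is {total}"
    return None


def extract_with_refinement(
    raw_text: str, extract: Callable[[str, Optional[str]], dict]
) -> dict:
    """Run one extraction, then retry once with error feedback if validation fails."""
    data = extract(raw_text, None)
    error = validate_totals(data)
    if error is not None:
        # Second attempt: same tabular text, plus the validation error as feedback.
        data = extract(raw_text, error)
        error = validate_totals(data)
    return {"data": data, "validation_error": error}


def fake_extract(raw_text: str, feedback: Optional[str]) -> dict:
    """Hypothetical stub for the LLM: wrong total first, corrected after feedback."""
    items = [
        {"quantity": 2, "unit_price": "9.50"},
        {"quantity": 1, "unit_price": "20.00"},
    ]
    return {"line_items": items, "total": "39.00" if feedback else "40.00"}


outcome = extract_with_refinement("| description | quantity | ...", fake_extract)
```

`Decimal` is used instead of `float` so that currency arithmetic compares exactly.
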
- Invoice Record: Metadata (filename, hash, size) is stored in the `invoices` table.
- Structured Data: The extracted fields and line items are stored in the `extracted_data` table, with line items persisting in a `JSONB` column.
- Raw Traceability: The full markdown representation generated during ingestion is saved in the `raw_text` field for debugging and human review.
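
As an illustration of the invoice-record metadata, the sketch below computes a filename, content hash, and size for a file. The hash algorithm is not stated above, so SHA-256 is an assumption, as are the helper name and the dictionary keys.

```python
import hashlib
import tempfile
from pathlib import Path


def build_invoice_metadata(path: Path) -> dict:
    """Collect the filename, a content hash, and the size in bytes."""
    content = path.read_bytes()
    return {
        "file_name": path.name,
        # Assumption: SHA-256; the actual system may use a different digest.
        "file_hash": hashlib.sha256(content).hexdigest(),
        "file_size": len(content),
    }


# Usage: write a tiny CSV to a temporary directory and describe it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "invoice.csv"
    sample.write_bytes(b"a,b\n1,2\n")
    meta = build_invoice_metadata(sample)
```

Hashing the file contents rather than the name lets a re-uploaded file be recognised even if it has been renamed.
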
- `ingestion/`: Contains `excel_processor.py` (the core parser).
- `brain/`: Handles the mapping of tabular text to structured fields.
- `core/`: Defines the database schema and data models.
- `interface/`: Provides the API endpoints to trigger processing.
- `data/`: The directory where input files are stored/scanned.
- Parallelism: Since CSV/XLSX processing is lightweight compared with OCR, many files can be processed in parallel via the `pgqueuer` background system.
- Memory: pandas loads each file into memory for parsing. Invoice files are typically small, so this is rarely a constraint. Note that chunked reading (`chunksize`) is available for `pandas.read_csv` but not for `pandas.read_excel`.
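
To illustrate bounded parallel processing of many small files, here is a sketch using plain `asyncio` rather than `pgqueuer`, whose API is not shown here. The names `parse_file` and `process_batch` and the concurrency limit are hypothetical; `parse_file` is a placeholder for the blocking pandas parse.

```python
import asyncio
from pathlib import Path


def parse_file(path: Path) -> dict:
    """Placeholder for the blocking parse step (pandas work would go here)."""
    return {"file_name": path.name}


async def process_batch(paths: list, max_concurrency: int = 4) -> list:
    """Parse files concurrently, with at most max_concurrency running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: Path) -> dict:
        async with semaphore:
            # Run the blocking parse in a thread so the event loop stays responsive.
            return await asyncio.to_thread(parse_file, path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))


results = asyncio.run(process_batch([Path("a.csv"), Path("b.xlsx")]))
```

The semaphore caps how many files are held in memory at once, which matters more for the occasional large workbook than for typical invoices.
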