An intelligent spend categorization system that automatically classifies invoice-level procurement data using a hybrid of rule-based logic and AI-driven semantic matching. It is designed to scale to large volumes of invoice descriptions while keeping a human in the loop, so that categorization decisions remain controllable, auditable, and transparent.
This system processes invoice-level data to categorize spend items according to the UNSPSC (United Nations Standard Products and Services Code) taxonomy. It uses a hybrid pipeline that combines:
- AI-Powered Categorization using OpenAI GPT models
- Rule-Based Classification for deterministic tagging
- Manual Review Interface for edge cases and quality assurance
- Automatic Categorization: Categorize thousands of line items at scale
- Hybrid Classification: Leverages both deterministic rules and probabilistic AI
- Confidence Scoring: Track classification certainty with every result
- Manual Review: Built-in UI for validating uncertain predictions
- Data Visualization: Interactive spend analytics that can also be downloaded for detailed analysis
- Export Capability: Download categorized and reviewed data in CSV format
- Taxonomy-Aware Embeddings: Uses semantic similarity against the UNSPSC hierarchy
- Scheduled Taxonomy Sync: Automatically refreshes UNSPSC data every 3 months
- Real-time Search: Instant search across categorized items
- Learning System: Use feedback to improve future classifications
- Feedback History: Maintain audit trail of all corrections
- Continuous Improvement: Regular prompt updates based on feedback
- Detailed Analytics Dashboard:
  - Confidence distribution analysis
  - Category-wise performance metrics
  - Source distribution (Rule-based vs AI-based)
- Downloadable Reports:
  - Export metrics as CSV
- Dynamic Prompt Engineering: Automatically optimize prompts based on performance
- Context-Aware Prompts: Adapt prompts based on item category
- Interactive Charts:
  - Category distribution bar graphs
  - Supplier amount distribution
  - Confidence score pie charts
  - Source distribution analysis
- Downloadable Visualizations:
  - Export charts as PNG
  - Save metrics as CSV
  - Generate comprehensive reports
Prepare your invoice CSV like this:
| Invoice ID | SKU | Description | Supplier | Amount |
|---|---|---|---|---|
| 001 | 10001 | Black toner cartridge | OfficeSupplyCo | 89.99 |
| 002 | 20003 | Fiber optic cables, 50ft | NetGear Inc. | 129.50 |
Required columns: description, supplier, sku, invoice_id
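
The sample table uses display-style headers (`Invoice ID`, `SKU`) while the required columns are listed in snake_case. A minimal sketch of how an input file could be checked before processing is shown below. It assumes headers are normalized by lower-casing and replacing spaces with underscores; `normalize_header` and `load_invoices` are hypothetical helper names, not functions from this repository.

```python
import csv
import io

# Column names taken from the "Required columns" line above.
REQUIRED_COLUMNS = {"description", "supplier", "sku", "invoice_id"}


def normalize_header(name: str) -> str:
    """Lower-case a header and replace spaces with underscores (assumed convention)."""
    return name.strip().lower().replace(" ", "_")


def load_invoices(csv_text: str) -> list:
    """Parse CSV text, normalize headers, and verify the required columns exist."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = {normalize_header(h) for h in (reader.fieldnames or [])}
    missing = REQUIRED_COLUMNS - headers
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    rows = []
    for raw in reader:
        # Skip the None key that DictReader uses for surplus fields.
        rows.append({normalize_header(k): v for k, v in raw.items() if k is not None})
    return rows


sample = """Invoice ID,SKU,Description,Supplier,Amount
001,10001,Black toner cartridge,OfficeSupplyCo,89.99
002,20003,"Fiber optic cables, 50ft",NetGear Inc.,129.50
"""

rows = load_invoices(sample)
print(rows[0]["invoice_id"], "|", rows[1]["description"])
```

Failing fast on a missing column keeps malformed files out of the pipeline, which matters more when each downstream step may call a paid API.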
👉🏻 Check the deployed site: Live Demo
- Sanitization: Cleans and normalizes invoice text
- Rule Matching: Applies analyst-defined keyword rules
- Semantic Retrieval: Finds top UNSPSC candidates via embedding similarity
- AI Selection: GPT-4 picks the most likely UNSPSC code
- Confidence Routing: Items below threshold are queued for manual review
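
The first three stages above can be sketched roughly as follows. This is an illustrative sketch only, not the project's implementation: the rules and taxonomy entries are made-up placeholders, and a bag-of-words cosine similarity stands in for the embedding-based retrieval so the example runs without external models. The GPT selection step is omitted because it requires an API key.

```python
import math
import re
from collections import Counter

# Hypothetical analyst rules: keyword -> category label (placeholders, not real UNSPSC codes).
RULES = {"toner": "Printer toner", "fiber optic": "Fiber optic cable"}

# Hypothetical taxonomy titles standing in for the UNSPSC hierarchy.
TAXONOMY = ["Printer toner", "Fiber optic cable", "Office furniture"]


def sanitize(text: str) -> str:
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def match_rule(text: str):
    """Return the label of the first rule whose keyword occurs in the text, else None."""
    for keyword, label in RULES.items():
        if keyword in text:
            return label
    return None


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_candidates(text: str, k: int = 2) -> list:
    """Rank taxonomy titles by similarity to the text (stand-in for embedding search)."""
    query = Counter(text.split())
    scored = [(cosine(query, Counter(sanitize(title).split())), title) for title in TAXONOMY]
    return [title for score, title in sorted(scored, reverse=True)[:k] if score > 0]


def classify(description: str) -> dict:
    """Apply rules first; fall back to candidate retrieval for a later AI selection step."""
    text = sanitize(description)
    label = match_rule(text)
    if label is not None:
        return {"label": label, "source": "rule", "candidates": []}
    return {"label": None, "source": "ai", "candidates": top_candidates(text)}


print(classify("Black toner cartridge"))
print(classify("Ergonomic office chair"))
```

Running the deterministic rules before retrieval means items with an unambiguous keyword never reach the model, which reduces cost and makes those decisions easy to audit.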
- Multi-core parallel processing via multiprocessing
- Confidence-based classification routing
- Logs available in logs/pipeline.log
- More robust rule engine (regex, entity recognition)
- Spend analytics dashboard
- RESTful API endpoints for integration
- Scheduled batch job manager
- User-friendly UI
- Komal Meena
- Subhav Jain
- Sidhant Budhiraja
- Prayash Pandey
- Python 3.9+
- OpenAI API Key
- Clone the Repository
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    git clone https://github.com/yourusername/acme-spend-categorization.git
    cd acme-spend-categorization
    python3 -m venv venv
    source venv/bin/activate  # macOS/Linux
    # OR for Windows
    venv\Scripts\activate
    pip install -r requirements.txt
    python src/08_pipeline.py
Run Full Pipeline - Processes invoices and writes outputs
python src/08_pipeline.py
Outputs
- data/categorized.csv: High-confidence auto-tagged items
- data/manual_review.csv: Items requiring human validation
- logs/pipeline.log: Detailed logging of categorization events
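
The split between the two CSV outputs can be pictured with the sketch below. It assumes each result carries a numeric `confidence` field and uses a threshold of 0.8. Both are assumptions for illustration, since the actual field names and cut-off are not stated here, and the sample rows are invented.

```python
import csv
import io

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off, not the project's documented value
FIELDS = ["invoice_id", "category", "confidence"]


def route(results, threshold=CONFIDENCE_THRESHOLD):
    """Split results into (auto_accepted, needs_review) by confidence."""
    accepted = [r for r in results if r["confidence"] >= threshold]
    review = [r for r in results if r["confidence"] < threshold]
    return accepted, review


def to_csv(rows, fieldnames=FIELDS) -> str:
    """Serialize rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented example results, for illustration only.
results = [
    {"invoice_id": "001", "category": "Printer toner", "confidence": 0.95},
    {"invoice_id": "002", "category": "Fiber optic cable", "confidence": 0.55},
]

accepted, review = route(results)
print(to_csv(accepted))  # content that would go to data/categorized.csv
print(to_csv(review))    # content that would go to data/manual_review.csv
```

The sketch returns CSV text rather than writing files so it can run anywhere; in practice the two strings would be written to the paths listed above.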
- OpenAI for powerful AI APIs
- UNSPSC.org for the classification taxonomy
- Sentence-Transformers for semantic search










