A dataset for this project is available at the following Google Drive link: Dataset Download
A sophisticated financial document analysis and LLM fine-tuning system that automatically extracts financial insights from PDF documents and creates Question-Answer-Reasoning (QAR) datasets for training specialized financial AI models.
FinQAR (Finance Question-Answer-Reasoning Triplets) is designed to bridge the gap between unstructured financial documents and structured AI training data. The system performs OCR on financial PDFs, generates contextual questions grounded in financial reasoning, provides comprehensive answers, and creates detailed reasoning chains - all optimized for fine-tuning language models into domain-specific financial experts.
- Advanced OCR Processing: Uses EasyOCR for page-level text extraction from financial PDFs (see the OCR sketch after this list)
- Large Context Analysis: Handles up to 120K tokens for comprehensive document understanding
- Smart Question Generation: Creates financial reasoning-based questions per page using configurable token thresholds
- Professional Answer Generation: Leverages large-context LLMs via NVIDIA API or local GPU inference
- Detailed Reasoning Chains: Generates CFA-level analytical reasoning for each question-answer pair
- Multiple Processing Options: Support for both NVIDIA API and local GPU processing
- Efficient Fine-tuning: Uses LoRA (Low-Rank Adaptation) for memory-efficient training
- 4-bit Quantization: Optimized for resource-constrained environments
- Flexible Model Support: Compatible with various instruction-tuned models
- Performance Comparison: Built-in inference comparison between original and fine-tuned models
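For orientation, here is a minimal sketch of the page-level OCR step, assuming pdf2image and EasyOCR from the installation list; the function and parameter choices are illustrative and not the exact ones used in the QAR-Generation scripts.

```python
# Illustrative page-level OCR sketch; not the exact implementation in QAR-Generation/.
# pdf2image additionally requires poppler to be installed on the system.
from typing import List

import numpy as np
import easyocr
from pdf2image import convert_from_path

reader = easyocr.Reader(["en"], gpu=True)  # loads detection + recognition models once

def ocr_pdf_pages(pdf_path: str) -> List[str]:
    """Return one extracted-text string per PDF page."""
    pages = convert_from_path(pdf_path, dpi=300)  # render each page as a PIL image
    texts = []
    for page in pages:
        # EasyOCR accepts numpy arrays; detail=0 with paragraph=True returns plain text chunks
        fragments = reader.readtext(np.array(page), detail=0, paragraph=True)
        texts.append("\n".join(fragments))
    return texts
```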
```
FinQAR/
├── data/
│   └── README.md                          # Dataset documentation and download link
├── QAR-Generation/
│   ├── generate_QAR_local_GPU.py
│   ├── generate_QAR_triplets_from_pdf.py
│   └── qarNvidiaAPI.py
├── scripts/
│   ├── finetune.py
│   └── inference.py
└── README.md                              # Project documentation
```
- data/: Contains dataset documentation and download instructions.
- QAR-Generation/: Scripts for generating Question-Answer-Reasoning (QAR) triplets from financial PDFs using local GPU or NVIDIA API.
- scripts/: Scripts for model fine-tuning and inference/comparison.
- README.md: Main project documentation and instructions.
- Python 3.8+
- CUDA-compatible GPU (recommended)
- NVIDIA API key (for API-based processing)
- Hugging Face account and token
```bash
pip install torch transformers
pip install easyocr pdf2image
pip install pandas numpy
pip install peft bitsandbytes
pip install tiktoken requests
pip install huggingface_hub datasets
```
- Set your NVIDIA API key in the configuration:
```python
NVIDIA_API_KEY = "nvapi-your-key-here"
```
- Configure paths and parameters:
```python
PDF_DIR = "/path/to/your/pdfs"
OUTPUT_ROOT = "qar_dataset"
TOKENS_PER_QUESTION = 100   # Adjust question density
```
- Set your Hugging Face token:
```python
HF_TOKEN = "hf_your-token-here"
```
- Configure GPU memory allocation:
```python
max_mem = {
    0: "48GiB",   # Adjust based on your GPU
}
```
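This mapping uses the format transformers/accelerate accept for per-device memory caps. A minimal sketch of how it is typically passed to from_pretrained (the model ID is only an example taken from the local-model list later in this README):

```python
# Sketch: how a per-GPU cap like max_mem is typically consumed by transformers/accelerate.
# GPTQ checkpoints additionally need the optimum and auto-gptq packages installed.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",   # example model ID, not fixed by the project
    device_map="auto",                        # accelerate places layers within the caps
    max_memory={0: "48GiB"},                  # same format as max_mem above
)
```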
```bash
python generate_QAR_triplets_from_pdf.py
```
- Uses advanced models like Llama-3.1-70B for superior reasoning quality
- Handles large context windows (up to 128K tokens)
- Automatic context truncation and management
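A hedged sketch of the kind of context management described above, using tiktoken for token counting; the helper name and exact truncation policy are assumptions, with the 120K cap taken from the technical details later in this README.

```python
# Illustrative context truncation with tiktoken; not the exact logic in the script.
import tiktoken

MAX_CONTEXT_TOKENS = 120_000                    # cap quoted elsewhere in this README
enc = tiktoken.get_encoding("cl100k_base")

def extend_context(context: str, new_page_text: str) -> str:
    """Append a page to the running context, dropping the oldest tokens if over budget."""
    combined = context + "\n" + new_page_text
    tokens = enc.encode(combined)
    if len(tokens) > MAX_CONTEXT_TOKENS:
        tokens = tokens[-MAX_CONTEXT_TOKENS:]   # keep the most recent tokens
    return enc.decode(tokens)
```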
```bash
python generate_QAR_local_GPU.py
```
- Processes documents entirely on local hardware
- Uses quantized models for memory efficiency
- Suitable for sensitive documents requiring on-premise processing
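A minimal sketch of fully local, quantized generation with transformers and bitsandbytes; the model ID (a gated repo that requires an HF token) and the generation settings are assumptions, not the values used in generate_QAR_local_GPU.py.

```python
# Sketch of fully local, 4-bit quantized inference; details are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # example; gated, requires HF_TOKEN

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompt = "Summarize the main liquidity risks discussed on this page:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```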
```bash
python qarNvidiaAPI.py
```
- Professional-grade system prompts for institutional analysis
- Enhanced reasoning with CFA-level financial expertise
- Batch question generation for improved efficiency
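As an illustration, a single request to the NVIDIA API might look like the following, assuming the OpenAI-compatible chat completions endpoint; the URL, model ID, and token limits are assumptions, so check qarNvidiaAPI.py for the exact request format.

```python
# Sketch of one NVIDIA API call for answer generation; endpoint and model are assumed.
import requests

NVIDIA_API_KEY = "nvapi-your-key-here"
URL = "https://integrate.api.nvidia.com/v1/chat/completions"   # assumed endpoint

def ask_llm(system_prompt: str, user_prompt: str,
            model: str = "meta/llama-3.1-70b-instruct") -> str:
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {NVIDIA_API_KEY}"},
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "temperature": 0.1,          # low temperature for consistent answers
            "max_tokens": 1024,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```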
```bash
python finetune.py
```
Key features:
- LoRA Configuration: Efficient fine-tuning with minimal resource requirements
- 4-bit Quantization: Reduces memory usage while maintaining performance
- Flexible Training: Configurable batch sizes and learning rates
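A hedged sketch of the LoRA-plus-4-bit setup described above, using peft and bitsandbytes; the base model, rank, and target modules are illustrative defaults rather than the values in finetune.py.

```python
# Illustrative LoRA + 4-bit setup; hyperparameters are not taken from finetune.py.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-3.1-8B-Instruct"   # example base model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only a small fraction of weights are trainable
```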
```bash
python inference.py
```

- Side-by-side comparison of original vs fine-tuned models
- Interactive questioning interface
- Performance evaluation for financial domain expertise
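Conceptually, the comparison boils down to generating from the base model and the LoRA-adapted model with the same prompt; in this sketch the model ID and adapter path are placeholders.

```python
# Sketch of base vs. fine-tuned comparison; model ID and adapter path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_dir = "outputs/finqar-lora"          # wherever finetune.py saved the adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", torch_dtype="auto")
tuned = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", torch_dtype="auto"),
    adapter_dir,
)

def answer(model, question: str) -> str:
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

q = "What does a rising debt-to-equity ratio imply for refinancing risk?"
print("BASE:", answer(base, q))
print("FINE-TUNED:", answer(tuned, q))
```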
The system employs professionally crafted prompts designed for institutional-grade financial analysis:
- Focuses on valuation, risk analysis, market behavior, and business strategy
- Limited to 50 words for clarity and precision
- Ensures financial domain relevance
- Senior financial analyst perspective with 200-word responses
- Explicit information extraction with clear limitation statements
- Professional objectivity with insightful interpretation
- CFA charterholder-level analytical rigor
- Step-by-step reasoning connecting numbers to finance principles
- Comprehensive 400-word analytical reasoning chains
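The exact prompts live in the QAR-Generation scripts; purely to illustrate the structure described above, the three templates might be shaped along these lines.

```python
# Illustrative prompt skeletons only; see the QAR-Generation scripts for the real prompts.
# Placeholders are filled with str.format() at generation time.
QUESTION_PROMPT = (
    "You are a financial analyst. From the page below, write one question (max 50 words) "
    "about valuation, risk, market behavior, or business strategy.\n\nPAGE:\n{page_text}"
)
ANSWER_PROMPT = (
    "You are a senior financial analyst. Answer the question in about 200 words using only "
    "the document context. If the answer is not in the context, reply '[Not Directly Stated]'.\n\n"
    "CONTEXT:\n{context}\n\nQUESTION:\n{question}"
)
REASONING_PROMPT = (
    "You are a CFA charterholder. In about 400 words, explain step by step how the figures in "
    "the context lead to the answer, citing the finance principles involved.\n\n"
    "CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"
)
```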
The system generates structured CSV files containing the following fields; a short loading example is sketched after the list:
- page_number: Source page reference
- page_tokens: Token count for context sizing
- questions_generated: Number of questions per page
- cumulative_context_tokens: Growing context size
- question: Generated financial question
- answer: Comprehensive analytical response
- reasoning: Detailed step-by-step analysis
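A small sketch of loading such a CSV and turning it into instruction-tuning records with pandas and datasets; the file path and record layout are assumptions.

```python
# Sketch: turning a generated QAR CSV into instruction-tuning records.
# The CSV path is a placeholder; column names follow the schema listed above.
import pandas as pd
from datasets import Dataset

df = pd.read_csv("qar_dataset/example_report.csv")

records = [
    {
        "instruction": row["question"],
        "output": f"{row['answer']}\n\nReasoning:\n{row['reasoning']}",
    }
    for _, row in df.iterrows()
]
train_ds = Dataset.from_list(records)   # ready to feed a fine-tuning script
print(train_ds)
```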
- Maximum Context: 120K tokens with intelligent truncation
- Token Counting: Uses tiktoken for accurate token management
- Memory Optimization: Automatic garbage collection and CUDA cache clearing
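The memory optimization mentioned above typically amounts to explicit cleanup between pages; a minimal sketch:

```python
# Minimal sketch of the per-page cleanup pattern referred to above.
import gc
import torch

def free_memory() -> None:
    gc.collect()                      # release Python-side references
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached CUDA blocks to the driver
```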
- NVIDIA API Models: Llama-3.1-70B, Nemotron-70B, Llama-4-Scout, Llama-4-Maverick
- Local Models: Qwen2.5-32B-Instruct-GPTQ, Llama-3.1-8B
- Quantization: 4-bit and 8-bit support for resource efficiency
- Temperature Settings: Low temperature (0.1) for consistent answers
- Retry Logic: Automatic retry for "[Not Directly Stated]" responses
- Professional Standards: Institutional-grade analytical rigor
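A hedged sketch of the retry pattern described above; the marker string matches the one quoted, everything else is illustrative.

```python
# Sketch of the retry pattern for placeholder answers; ask_llm is assumed to be an
# LLM call such as the NVIDIA API sketch earlier in this README.
NOT_STATED = "[Not Directly Stated]"

def answer_with_retry(ask_llm, system_prompt: str, user_prompt: str,
                      max_retries: int = 2) -> str:
    answer = ask_llm(system_prompt, user_prompt)
    for _ in range(max_retries):
        if NOT_STATED not in answer:
            break
        # Re-ask, nudging the model to ground the answer in the provided context.
        answer = ask_llm(system_prompt,
                         user_prompt + "\nAnswer strictly from the context above.")
    return answer
```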
- GPU Memory: Requires 24GB+ VRAM for optimal local processing
- Question Density: 1 question per 100 tokens of page text (configurable via TOKENS_PER_QUESTION)
- Quality vs. Privacy: API processing generally offers higher quality; local processing keeps documents on-premise
- Investment Research: Automated analysis of financial reports and documents
- Risk Assessment: Systematic extraction of risk factors and mitigation strategies
- Financial Education: Training datasets for financial AI applications
- Document Digitization: Converting legacy financial documents into structured data
- Compliance Analysis: Automated review of regulatory and compliance documents
This project is designed for financial professionals, AI researchers, and developers working on domain-specific language models. Contributions focusing on enhanced financial reasoning, improved OCR accuracy, or model optimization are welcome.
This project is licensed under the MIT License.