Context-Aware RAG System for Indonesian Financial Reports
Making sense of complex financial tables and documents shouldn't require hours of manual reading.
Version: 1.0.0 (MVP)
IDX-Analyst is a context-aware RAG (Retrieval-Augmented Generation) system designed to extract actionable insights from Indonesian corporate financial reports. It solves the challenge of retrieving accurate financial data from complex documents by preserving context during the retrieval process.
Key Problem: Investors and analysts spend hours manually reading through hundreds of pages of annual reports to find critical financial information. Traditional RAG systems fail because they lose context when splitting documents into chunks, making numerical data unretrievable.
Our Solution: We implement contextual retrieval inspired by Anthropic's approach, adding explanatory context to each chunk before embedding, which reduces retrieval failures by up to 67% when combined with reranking.
Indonesian stock market investors face three critical obstacles:
1. Complex Document Structure
- Financial tables span multiple pages with inconsistent formatting
- Dense text mixes qualitative narratives with quantitative metrics
- Technical accounting terminology and regulatory disclosures obscure key data
2. Standard RAG Failures
- Chunking breaks table structure, destroying row/column relationships
- Numbers lose semantic meaning without proper context
- Multi-page tables fragment across chunks, making retrieval inaccurate
3. Lost Context
- Queries can't determine which company or fiscal year is referenced
- Relationships between balance sheet, income statement, and cash flow sections are broken
- Financial context (YoY growth, segment breakdown) disappears
Example of the Problem:
Query: "What is Bank BCA's total debt in 2023?"
Traditional RAG might retrieve:
"Total liabilities: Rp 987,654 million"
(Missing: company name, fiscal year, context that this is a banking sector metric)
IDX-Analyst implements three core innovations:
1. Contextual Enrichment
Before embedding, we generate rich context using specialized LLM prompts that include:
- Company identification and business segment
- Specific financial metrics and reporting periods
- Year-over-year comparisons and trends
- Market position and strategic context
Example Transformation:
Original Chunk:
"Total Assets: 1,234,567 (in millions)"
Generated Context:
"Bank BCA's consolidated balance sheet for FY 2023 shows total assets
of Rp 1,234,567 million, representing a 12% YoY increase driven by
loan portfolio expansion in consumer banking."
→ This contextualized chunk is then embedded and indexed
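The enrichment step above can be sketched in a few lines. The function names (`generate_context`, `contextualize`) and the prompt shape are illustrative stubs, not the project's actual API; in production the context would come from an LLM call (Gemini or Groq) given the full document as background.

```python
# Sketch of contextual enrichment: prepend LLM-generated context to each
# chunk so the embedded text is self-contained. Names here are
# illustrative, not the project's real interfaces.

def generate_context(chunk: str, doc_summary: str) -> str:
    """Stub: a real implementation prompts an LLM to situate the chunk
    within the document (company, fiscal year, statement section)."""
    return f"{doc_summary} This excerpt reports: {chunk}"

def contextualize(chunk: str, doc_summary: str) -> str:
    """Build the text that actually gets embedded and indexed."""
    context = generate_context(chunk, doc_summary)
    return f"{context}\n\n{chunk}"

enriched = contextualize(
    "Total Assets: 1,234,567 (in millions)",
    "Bank BCA consolidated balance sheet, FY 2023.",
)
```

Because the company name and fiscal year now live inside the embedded text itself, a query like "Bank BCA total assets 2023" can match the chunk even though the original table row mentioned neither.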
2. Hybrid Retrieval
- Dense Retrieval (Qwen 0.6B): Captures semantic relationships in contextualized chunks
- Sparse Retrieval (SPLADE-PP-V2): Matches specific financial terms and numbers
- Context-Aware Reranking (BGE-M3): Prioritizes results with matching metadata and highest relevance
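The fusion of dense and sparse result lists can be illustrated with reciprocal-rank fusion (RRF), a common way to merge rankings before reranking. This is a generic sketch, assuming plain Python lists of chunk ids; the actual system relies on Qdrant's hybrid search and a BGE-M3 cross-encoder for the final ordering.

```python
# Illustrative reciprocal-rank fusion of a dense and a sparse ranking.
# Chunks that rank well in BOTH lists accumulate the highest score.

def rrf_fuse(dense_ids: list, sparse_ids: list, k: int = 60) -> list:
    """Merge two ranked lists of chunk ids into one fused ranking."""
    scores: dict = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranking):
            # 1 / (k + rank) damps the influence of low-ranked hits
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["c1", "c2", "c3"], ["c2", "c4", "c1"])
# "c2" and "c1" appear in both lists, so they rise above "c3" and "c4"
```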
3. Structure-Aware Parsing
We use LlamaParse for advanced PDF parsing that:
- Preserves complex table structures (headers, rows, columns)
- Maintains relationships across multi-page tables
- Handles nested tables and irregular layouts
- Extracts data with accurate unit preservation
Figure 1: Performance of BGE reranker
Figure 2: Performance of Cohere reranker
We created an evaluation dataset with 25 financial questions:
Dataset Structure:
```json
{
  "id": "id",
  "question": "Berapa total aset Bank BCA tahun 2023?",
  "answer": "Rp 1,234.5 trillion",
  "context": "Bank Central Asia..."
}
```

Evaluation metrics include Hit Rate (did the relevant chunk appear in the top-k?), MRR (Mean Reciprocal Rank), and NDCG (ranking quality).
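Hit Rate and MRR are simple to compute from the rank of the first relevant chunk per query. The sketch below uses made-up ranks for illustration, not the project's actual evaluation data.

```python
# Hit@k and MRR from 1-based ranks of the first relevant chunk per
# query; None means the relevant chunk never appeared in the results.

def hit_at_k(ranks: list, k: int) -> float:
    """Fraction of queries whose relevant chunk appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr(ranks: list) -> float:
    """Mean of 1/rank, with 0 contributed by missed queries."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 3, None, 2, 1]  # illustrative first-relevant ranks, 5 queries
print(hit_at_k(ranks, 3))   # 0.8
print(mrr(ranks))           # (1 + 1/3 + 0 + 1/2 + 1) / 5 ≈ 0.567
```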
| Metric | BGE Rerank | Cohere Rerank | Winner |
|---|---|---|---|
| Hit@3 | 76.0% | 60.0% | BGE +16% |
| Hit@5 | 88.0% | 80.0% | BGE +8% |
| Hit@10 | 96.0% | 96.0% | Equal |
| MRR (Mean Reciprocal Rank) | 69.2% | 64.8% | BGE +4.4% |
| Mean Rank Position | 2.21 | 2.83 | BGE (lower is better) |
Key Finding: BGE reranker consistently places the correct answer in top 2-3 positions, reducing user scrolling through irrelevant results.
Production Latency: Current latency reflects CPU-only inference. GPU deployment (NVIDIA A10+) is expected to achieve P50 latency of 0.5-1 second (roughly 23x faster).
We evaluate IDX-Analyst using consolidated financial statements from six major Indonesian public companies across different sectors:
| Company | Ticker | Sector | Report Type |
|---|---|---|---|
| PT Aneka Tambang (Antam) | ANTM | Mining | Annual Report 2024 |
| Bank Jago | ARTO | Digital Banking | Annual Report 2024 |
| Bank Central Asia | BBCA | Banking | Annual Report 2024 |
| Bank Rakyat Indonesia | BBRI | Banking | Annual Report 2024 |
| PT Astra International | ASII | Automotive | Annual Report 2024 |
| PT Alamtri Resources | ADRO | Resources/Mining | Annual Report 2024 |
- Source: Indonesian Stock Exchange (IDX) official portal
- Document Type: Consolidated Financial Statements only
- Coverage: Balance Sheets, Income Statements, Cash Flow Statements, Notes
- Pages: Maximum 20 pages per document (most relevant sections)
- Period: 2024 annual reports
All financial data is extracted from publicly available annual reports with permission from the IDX.
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Backend | FastAPI | 0.109+ | REST API & async operations |
| Orchestration | LangGraph | 0.2+ | RAG workflow management |
| Vector DB | Qdrant | 1.7+ | Hybrid dense + sparse embeddings |
| PDF Parser | LlamaParse (GPT-4 Mini) | Latest | Structure-aware PDF extraction with unit preservation |
| Dense Encoder | Qwen3-Embedding-0.6B | Latest | Semantic embeddings (1024-dim) |
| Sparse Encoder | SPLADE-PP-V2 | Latest | Lexical expansion & term weighting |
| Reranker | BGE-M3-V2 | Latest | Cross-encoder ranking |
| LLM | Gemini 2.5 Flash, Groq (GPT-OSS 20B) | Latest | Context generation & responses |
| Containerization | Docker + Compose | 24+ | Multi-service orchestration |
High API Costs
- PDF parsing via LlamaParse with GPT-4 Mini incurs per-page charges
- Context generation using LLM APIs (Groq, Gemini) adds significant overhead per chunk
- Cost scales with document complexity and page count
Unit Conversion Accuracy Trade-off
- We use GPT-4 Mini for LlamaParse PDF parsing to balance cost and accuracy
- While more powerful models (Claude Sonnet 4.5, GPT-4) produce more reliable unit conversions, they significantly increase costs
- GPT-4 Mini occasionally confuses numerical units (billion vs. trillion) in complex tables.
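One cheap mitigation is a post-parse normalization pass that converts every extracted figure to a canonical rupiah amount, making billion/trillion mix-ups easier to catch with sanity checks. The regex and unit table below are an assumption for illustration, not the project's actual parser output format.

```python
import re

# Illustrative normalization of parsed figures such as
# "Rp 1,234.5 trillion" or "987,654 juta" to plain rupiah. Unit names
# and the pattern are assumptions, not the system's real schema.
UNITS = {
    "million": 1e6, "juta": 1e6,
    "billion": 1e9, "miliar": 1e9,
    "trillion": 1e12, "triliun": 1e12,
}

def to_rupiah(text: str) -> float:
    """Convert a '<number> <unit>' figure to a plain rupiah amount."""
    m = re.search(r"([\d,.]+)\s*(\w+)", text)
    number = float(m.group(1).replace(",", ""))  # drop thousand separators
    return number * UNITS[m.group(2).lower()]

to_rupiah("Rp 1,234.5 trillion")  # ≈ 1.2345e15
```

Normalized values make order-of-magnitude checks trivial (e.g. flag any bank whose total assets parse to under Rp 1 billion), which is where unit-confusion errors surface.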
- Docker 24+ and Docker Compose v2
- Python 3.11+ (for local development)
- Required API Keys:
- Groq (for LLM context generation)
- Gemini API (alternative LLM)
- LlamaParse (for PDF parsing)
- Cohere (optional, for Cohere reranker)
1. Parse a Financial Document
```shell
python src/document_processor/cli.py \
  --input data/BBCA_annual_2024.pdf \
  --ticker BBCA \
  --company "Bank Central Asia" \
  --output data/processed
```
2. Set Up Local Development
```shell
# Create environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env

# Start API server
make run
```
3. Access the API
- Swagger UI: http://localhost:7860/docs
- ReDoc: http://localhost:7860/redoc
Figure 3: RAG Workflow
The system follows this workflow:
- Document Ingestion → PDF uploaded and parsed by LlamaParse
- Context Generation → Each chunk enriched with LLM-generated context
- Embedding & Indexing → Both dense and sparse embeddings stored in Qdrant
- Query Processing → User query embedded and searched against indices
- Reranking → BGE-M3 reranks results by relevance
- Response Generation → Top-k results passed to LLM for synthesis
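The six stages above compose into a linear pipeline. The sketch below uses stub functions standing in for the real components (LlamaParse, LLM enrichment, Qdrant, BGE-M3, LLM synthesis); none of these names are the project's actual LangGraph nodes.

```python
# Illustrative end-to-end pipeline; each stage is a stub for the real
# component named in the workflow above.

def parse(pdf_path):          # LlamaParse: PDF -> structured chunks
    return [f"chunk from {pdf_path}"]

def enrich(chunks):           # LLM: prepend explanatory context
    return [f"[context] {c}" for c in chunks]

def index(chunks):            # Qdrant: store dense + sparse embeddings
    return {i: c for i, c in enumerate(chunks)}

def search(store, query):     # hybrid search over both indices
    return list(store.values())[:10]

def rerank(hits, query):      # BGE-M3: keep the most relevant few
    return hits[:3]

def answer(hits, query):      # LLM: synthesize a grounded response
    return f"Answer to {query!r} from {len(hits)} chunks"

store = index(enrich(parse("BBCA_annual_2024.pdf")))
query = "total aset 2023"
result = answer(rerank(search(store, query), query), query)
```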
Current (MVP):
- Embedding: Unified Embedding API (flexible model switching)
- Reranker: Cohere API (production-ready)
- Focus: Rapid prototyping and validation
- Infrastructure: CPU-based (cost-effective for POC)
Planned (Production):
- Embedding: Self-hosted on GPU (NVIDIA A10+)
- Models: Qwen3-Embedding-0.6B, SPLADE-PP-V2, BGE-M3
- Benefits: Lower latency, cost savings at scale, data privacy, custom fine-tuning
- Infrastructure: GPU acceleration for sub-second latency
We welcome community contributions!
How to contribute:
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Make changes with clear commit messages
- Update documentation as needed
- Submit a Pull Request with test results
Contribution areas: Bug fixes, documentation improvements, new embedding models, evaluation metrics, dataset expansions.
Research & Inspiration:
- Anthropic's Contextual Retrieval approach (September 2024) - Core methodology for context preservation
- LangChain Documentation - RAG patterns and best practices
- Qdrant Vector Database - Hybrid search implementation
- LlamaParse - Advanced PDF parsing
- FastAPI Best Practices - Production API design
Model Resources:
- Qwen3-Embedding-0.6B - Dense encoder
- SPLADE-PP-V2 - Sparse encoder
- BGE-M3 - Cross-encoder reranker
Need Help?
- 🐛 GitHub Issues - Report bugs
- 💬 GitHub Discussions - Ask questions
Connect with Us:
- Maintainer: Fahmi Aziz Fadhil
- Email: fahmiazizfadhil09@gmail.com
- LinkedIn: Fahmi Aziz Fadhil
- GitHub: @fahmiaziz98
MIT License - see LICENSE file for details.


