Demonstration Video: https://youtu.be/X5pfqrTf0rA

Demo clips:
- Team.Collateral.Damage.-.mp4
- Personal.loan.mp4
- Tamil.business.loan.mp4
- Intro.mp4
FinSense is a cutting-edge application designed to streamline the loan application and verification process using Optical Character Recognition (OCR) and AI agents backed by a rule-based approval system. The application leverages advanced technologies to let users apply for a loan simply by recording a video instead of filling out lengthy forms!
Speech Transcription: We use OpenAI Whisper and IndicWav2Vec for accurate speech-to-text conversion. Tamil transcripts are generated first and then translated into English for further processing.
Text Translation: We leverage IndicTrans2 to convert Tamil transcripts into English before processing.
LLM for AI Agent & Q&A Mapping: Our system is powered by LLaMA-3.2-70B, which enables intelligent agent interactions and precise question-answer mapping.
- Node.js: Ensure you have Node.js installed on your machine. You can download it from nodejs.org.
- Python: Ensure you have Python 3.6 or higher installed. You can download it from python.org.
- Git: Make sure Git is installed. You can download it from git-scm.com.
- Clone the repository:

  ```bash
  git clone https://github.com/Abilaashss/FinSense.git
  cd FinSense
  ```

- Install Node.js dependencies:

  ```bash
  npm install
  ```

- Install Python dependencies. Navigate to the `Document-Identifier` directory and install the required Python packages:

  ```bash
  cd Document-Identifier
  pip install -r requirements.txt
  ```

- Start the Next.js development server. From the root directory of the project, run:

  ```bash
  npm run dev
  ```
The OCR Document Classifier is designed to recognize and extract structured data from various identification documents, including PAN cards, Aadhaar cards, Passports, and Driving Licenses. The model leverages state-of-the-art image recognition, text extraction, and natural language processing techniques to convert document images into a structured JSON format.
The classifier uses deep learning models such as CRAFT for text segmentation, TrOCR and Florence-2-large for text extraction, and LLaMA-3B for parsing and structuring the extracted text into a JSON schema. This document outlines the full architecture, accuracy metrics, datasets used, system requirements, and code snippets for running on various operating systems.
The following schema is used to define the output structure for any document processed by the OCR classifier:
```json
{
  "document_type": "Type of the document (PAN/Aadhar/Passport/Driving)",
  "document_id": "Aadhar number/PAN Number/Driving License PIN/other (null otherwise)",
  "name": "Name of the Person (null otherwise)",
  "dob": "Date of Birth of the given person (DD/MM/YYYY format) (null otherwise)",
  "gender": "Gender of the Person (M/F) (null otherwise)",
  "address": "Address of the person (null otherwise)",
  "mobile": "Mobile Number/Phone Number of the person (null otherwise)",
  "doi": "Date of Issue of the document (DD/MM/YYYY format) (null otherwise)",
  "doe": "Date of Expiry of the document (DD/MM/YYYY format) (null otherwise)",
  "place_of_issue": "Place of Issue of the document (null otherwise)"
}
```

Input Image:
Intermediate Output:
OCR Output:
```
{'<OCR>': 'भारती संस्कीर-GOVERNMENT OF INDIA-मिलेस रहिNilesh SinghSAMPLEजनम शिूरी / DOB : 01/08/1985पुकर Male4444 3333 2222आधाय - आप अादमी | का अधि को'}
```

Structured JSON Output:
```json
{
  "document_type": "Aadhar",
  "document_id": "4444 3333 2222",
  "name": "Nilesh Singh",
  "dob": "01/08/1985",
  "gender": "M",
  "address": null,
  "mobile": null,
  "doi": null,
  "doe": null,
  "place_of_issue": null
}
```
Approach-1:

- Input: Document Image (e.g., Aadhaar card, PAN card)
- Process:
  - CRAFT: Splits the image into individual text lines.
  - TrOCR: Converts line segments into words.
  - LLaMA-3B: Maps the extracted words to the JSON schema.
- Estimated Accuracy: 95.71%
- Memory Usage:
  - MacBook M3 Max: 18GB
  - Nvidia DGX A100: 20GB

Pipeline: (Document Image) -> CRAFT (Single line splits) -> TrOCR (Splits-2-Words) -> LLaMA-3B (Words-2-JSON)
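As an illustration, here is a minimal sketch of this pipeline's first two stages. It assumes a hypothetical `detect_text_lines` helper wrapping CRAFT-pytorch (the repo's actual wrapper may differ) and uses the public `transformers` TrOCR API with `microsoft/trocr-base-printed` as an example checkpoint:

```python
# Sketch of Approach-1: CRAFT (line detection) -> TrOCR (line recognition).
# detect_text_lines is a hypothetical wrapper around CRAFT-pytorch that
# returns cropped line images; the actual project code may differ.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
trocr = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

def ocr_document(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    lines = detect_text_lines(image)  # hypothetical CRAFT helper (list of crops)
    texts = []
    for line in lines:
        pixel_values = processor(images=line, return_tensors="pt").pixel_values
        ids = trocr.generate(pixel_values)
        texts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
    # The concatenated text is then passed to LLaMA-3B together with the
    # JSON schema to produce the structured output.
    return "\n".join(texts)
```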
Approach-2:

- Input: Document Image
- Process:
  - Florence-2-large: Extracts words directly from the document image.
  - LLaMA-3B: Converts extracted words into the JSON schema.
- Estimated Accuracy: 98.36%
- Memory Usage:
  - MacBook M3 Max: 18GB (with 10GB swap)
  - Nvidia DGX A100: 30GB

Pipeline: (Document Image) -> Florence-2-large (Words) -> LLaMA-3B (JSON)
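For reference, the Florence-2 OCR stage can be run with the standard `transformers` API as sketched below; the `<OCR>` task prompt yields output in the same shape as the intermediate output shown earlier, and the file name is a placeholder:

```python
# Sketch of Approach-2's OCR stage with Florence-2-large.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

image = Image.open("sample_document.jpg").convert("RGB")  # placeholder path
inputs = processor(text="<OCR>", images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
    )

raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation returns a dict like {'<OCR>': '...'},
# matching the intermediate output shown above.
result = processor.post_process_generation(raw, task="<OCR>", image_size=image.size)
print(result["<OCR>"])
```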
We use Embedding Score to measure the similarity between the predicted JSON and the ground truth JSON. Higher similarity between embeddings reflects better accuracy in extracting and structuring document information.
- Embedding Score Rationale:
  - Text embeddings from the predicted and ground truth JSON are compared using cosine similarity.
  - The final metric is computed by averaging similarity scores across the dataset.
- Additional Metrics:
  - Precision, Recall, and F1-Score indirectly benefit from improved embedding scores, especially for critical fields like `document_id`, `dob`, and `name`.
Sentence Transformers are used to compute embeddings for the predicted JSON and ground truth JSON. The method ensures a fine-grained similarity comparison.
- Steps:
  - Convert both JSONs into embedding vectors.
  - Calculate the cosine similarity between the embeddings.
  - Compute the average similarity score across the dataset (see the sketch below).
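A minimal sketch of these steps with `sentence-transformers`; the checkpoint `all-MiniLM-L6-v2` is an assumed example, since the exact embedding model is not named here:

```python
# Embedding-score sketch: serialize both JSONs, embed them, compare with
# cosine similarity, then average over the dataset.
import json
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed example checkpoint

def embedding_score(predicted: dict, ground_truth: dict) -> float:
    emb_pred = model.encode(json.dumps(predicted, sort_keys=True), convert_to_tensor=True)
    emb_true = model.encode(json.dumps(ground_truth, sort_keys=True), convert_to_tensor=True)
    return util.cos_sim(emb_pred, emb_true).item()

def dataset_score(pairs) -> float:
    scores = [embedding_score(pred, truth) for pred, truth in pairs]
    return sum(scores) / len(scores)
```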
The test dataset is provided by Roboflow, and includes images of Aadhaar cards for evaluating the OCR model's ability to extract key fields such as Aadhaar number, name, and date of birth.
- Dataset Link: Aadhaar Card Detection Dataset
No separate training dataset is required at evaluation time, since hyperparameters were tuned through AutoML. During pre-training, however, the model was fine-tuned on both synthetic and real-world document data.
- Data Sources:
- Synthetic Document Generation: Thousands of images with variations in format, font, and layout.
- Public Government Document Datasets: Aadhaar, PAN, passport, and driving license documents for pre-training.
- Batch Size: 16
- Learning Rate: 3e-5
- Optimizer: AdamW
- Warmup Steps: 500
- Number of Epochs: 10
- Max Sequence Length: 256
AutoML was used to fine-tune these hyperparameters through techniques like grid search and random search. The pipeline automatically optimized:
- Learning Rate Scheduling: Dynamic adjustments based on performance.
- Data Augmentation: Simulated variations such as noise and distortion.
- Early Stopping: Prevented overfitting.
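As an illustration, the hyperparameters listed above could be wired together as follows. The linear model and random data are placeholders purely to make the sketch self-contained; the real pipeline fine-tunes the OCR/LLM stack instead:

```python
# Sketch: AdamW at lr 3e-5, 500 warmup steps, batch size 16, 10 epochs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

# Placeholder model and data, only to make the example runnable.
model = nn.Linear(256, 10)
dataset = TensorDataset(torch.randn(160, 256), torch.randint(0, 10, (160,)))
train_loader = DataLoader(dataset, batch_size=16)  # batch size 16

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # AdamW, lr 3e-5
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=len(train_loader) * 10
)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):  # 10 epochs
    for inputs, labels in train_loader:
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # dynamic learning-rate adjustment
        optimizer.zero_grad()
```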
- MacBook M3 Max: 18GB RAM (10GB swap for Approach-2)
- Nvidia DGX A100: 20GB (Approach-1) or 30GB (Approach-2)
- Python: 3.8+
- CUDA: 11+ (for GPU acceleration)
Install the necessary libraries:

```bash
pip install torch transformers sentence-transformers opencv-python
pip install git+https://github.com/clovaai/CRAFT-pytorch.git
pip install florence-transformers
```

On Linux:

```bash
# Clone the repository
git clone https://github.com/your-repo/ocr-document-classifier.git
cd ocr-document-classifier

# Install dependencies
pip install -r requirements.txt

# Run the model
python run_model.py --image_path path/to/image --approach 1  # For Approach-1
python run_model.py --image_path path/to/image --approach 2  # For Approach-2
```

On macOS:

```bash
# Install Homebrew (if not installed), then install Python
brew install python3

# Install dependencies
pip3 install -r requirements.txt

# Run the model
python3 run_model.py --image_path path/to/image --approach 1
```

On Windows:

```bash
# Ensure Python and pip are installed
python --version
pip --version

# Install dependencies
pip install -r requirements.txt

# Run the model
python run_model.py --image_path path/to/image --approach 2
```

In both Approach-1 and Approach-2, we have incorporated an MLP (Multi-Layer Perceptron) head after the Transformer layers to refine the text embeddings extracted by the OCR models. This adjustment significantly enhances the accuracy and structural representation of the JSON outputs, especially for unstructured and noisy document data.
Let’s delve into the architecture changes:
Both TrOCR and Florence-2-large use Transformer blocks that rely on multi-head self-attention (MHSA) and feed-forward neural networks (FFN). These blocks are critical for processing input sequences in parallel and learning dependencies between words or characters.
The self-attention mechanism in each Transformer block operates as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:

- $Q = XW_Q$ (the query matrix),
- $K = XW_K$ (the key matrix),
- $V = XW_V$ (the value matrix),
- $W_Q, W_K, W_V$ are learnable parameter matrices,
- $d_k$ is the dimensionality of the key vectors.

This mechanism computes a weighted sum of the values $V$, with the weights given by the softmax of the scaled query-key similarities.
In multi-head attention, instead of using a single attention function, multiple attention heads are computed in parallel. Each head uses different linear projections of $Q$, $K$, and $V$:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O$$

Where:

- $\text{head}_i = \text{Attention}(QW_{Q_i}, KW_{K_i}, VW_{V_i})$,
- $W_{Q_i}, W_{K_i}, W_{V_i}$ are projection matrices for each head $i$,
- $W_O$ is the final output projection matrix.

Each attention head focuses on different parts of the input sequence, allowing the model to capture more nuanced relationships between words.
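A compact PyTorch illustration of multi-head self-attention; the dimensions are illustrative, not the exact TrOCR or Florence-2 configurations:

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 768, 12, 256  # illustrative sizes
mhsa = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)  # token embeddings X
out, attn_weights = mhsa(x, x, x)     # Q = K = V = X for self-attention
print(out.shape)                      # torch.Size([1, 256, 768])
```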
Each Transformer block contains a position-wise feed-forward network applied independently to each position:

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

Where:

- $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$,
- $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$,
- $d_{\text{model}}$ is the dimensionality of the model,
- $d_{\text{ff}}$ is the dimensionality of the intermediate feed-forward layer (usually much larger than $d_{\text{model}}$),
- $b_1$ and $b_2$ are biases, and the GELU function introduces non-linearity.
After each sub-layer (self-attention and feed-forward), Layer Normalization is applied:

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Where:

- $\mu$ is the mean of the input,
- $\sigma$ is the standard deviation, and
- $\epsilon$ is a small constant to avoid division by zero.

This normalization stabilizes the training process and accelerates convergence.
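Putting these sub-layers together, one post-norm Transformer block might look like the sketch below; dimensions are illustrative, not the exact production configuration:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),  # W_1, b_1
            nn.GELU(),                 # non-linearity from the FFN equation
            nn.Linear(d_ff, d_model),  # W_2, b_2
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm after each sub-layer, with residual connections
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.ffn(x))
```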
After passing through the transformer layers, the output is fine-tuned using a Multi-Layer Perceptron (MLP) head. The MLP head performs task-specific transformations, mapping the output embeddings to the structured JSON schema fields (e.g., document_id, name, dob).
The MLP consists of multiple fully-connected layers with non-linear activations:

$$\text{MLP}(x) = \sigma(xW_1 + b_1)W_2 + b_2$$

Where:

- $x \in \mathbb{R}^{d_{\text{model}}}$ is the input embedding from the final transformer layer,
- $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{hidden}}}$ and $W_2 \in \mathbb{R}^{d_{\text{hidden}} \times d_{\text{output}}}$ are weight matrices,
- $b_1 \in \mathbb{R}^{d_{\text{hidden}}}$ and $b_2 \in \mathbb{R}^{d_{\text{output}}}$ are biases,
- $d_{\text{hidden}}$ is the dimensionality of the hidden layer,
- $d_{\text{output}}$ is the dimensionality of the output space (number of JSON fields),
- $\sigma$ is an activation function, often ReLU or GELU.
In this architecture, we introduce Layer Normalization and Dropout for regularization and faster convergence:

$$h = \text{LayerNorm}(\text{Dropout}(\sigma(xW_1 + b_1)))$$

Where:

- Dropout helps to prevent overfitting by randomly setting a fraction of the activations to zero during training.
- LayerNorm normalizes the hidden states to stabilize learning.
The final layer of the MLP head transforms the hidden representations into the required JSON fields. These transformations allow the model to map the document information (text embeddings) to specific labels (e.g., document_id, dob, etc.).
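A sketch of such an MLP head in PyTorch; the sizes are assumptions, with `d_output=10` matching the ten fields in the JSON schema above:

```python
import torch.nn as nn

class MLPHead(nn.Module):
    def __init__(self, d_model=768, d_hidden=2048, d_output=10, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),   # W_1, b_1
            nn.GELU(),                      # activation sigma
            nn.LayerNorm(d_hidden),         # stabilizes hidden states
            nn.Dropout(p_drop),             # regularization against overfitting
            nn.Linear(d_hidden, d_output),  # W_2, b_2 -> JSON field logits
        )

    def forward(self, x):
        return self.net(x)
```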
The complexity of the attention mechanism is $O(n^2 \cdot d_{\text{model}})$, where:

- $n$ is the sequence length (number of tokens),
- $d_{\text{model}}$ is the embedding dimension.
To improve efficiency, techniques like Sparse Attention or Local Attention could be used, but in this architecture, we rely on the full attention mechanism for maximum accuracy.
The computational complexity of the MLP head is:

$$O(n \cdot d_{\text{model}} \cdot d_{\text{hidden}} + n \cdot d_{\text{hidden}} \cdot d_{\text{output}})$$

Where:

- $n$ is the number of tokens,
- $d_{\text{model}}$ is the model's dimensionality,
- $d_{\text{hidden}}$ is the hidden layer size, typically larger than $d_{\text{model}}$,
- $d_{\text{output}}$ is the number of output classes (in this case, the JSON fields).
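For intuition, a quick back-of-envelope comparison of the two terms at illustrative sizes:

```python
# Illustrative operation counts, not measured values.
n, d_model, d_hidden, d_output = 256, 768, 2048, 10

attention_ops = n**2 * d_model                            # ~50.3M: quadratic in n
mlp_ops = n * (d_model * d_hidden + d_hidden * d_output)  # ~408M: linear in n
print(f"attention: {attention_ops:,}  mlp head: {mlp_ops:,}")
# Attention comes to dominate only as the sequence length n grows large.
```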
We've incorporated cross-attention layers in the MLP head for tasks requiring interactions between different document fields. The cross-attention mechanism computes dependencies across different embeddings before final classification.
For the cross-attention:

$$\text{CrossAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \quad Q = X_{\text{field}_1}W_Q,\; K = X_{\text{field}_2}W_K,\; V = X_{\text{field}_2}W_V$$

This ensures fields like `document_id` and `dob` can interact, improving the model's understanding of document structure.
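A sketch of that interaction using `nn.MultiheadAttention`, where the queries come from one field's token embeddings and the keys/values from another; the field embeddings here are random placeholders:

```python
import torch
import torch.nn as nn

d_model = 768
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

doc_id_emb = torch.randn(1, 4, d_model)  # placeholder tokens for document_id
dob_emb = torch.randn(1, 3, d_model)     # placeholder tokens for dob

# Q from document_id; K and V from dob:
fused, _ = cross_attn(doc_id_emb, dob_emb, dob_emb)
```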
- Memory Usage: The addition of the MLP head increases memory consumption slightly, but the improved accuracy compensates for this.
  - On the Nvidia DGX A100, we observe a memory increase from 20GB to 22GB for Approach-1 and from 30GB to 33GB for Approach-2.
- Performance Gains: The embedding accuracy increases by ~1.5%, notably improving field extraction in noisy images, particularly for long address fields and complex date formats.
Team:

- Hariharan Mudaliar
- Umesh G J H
- Abilaash S S
- Kavya Udhayashankar
- Arvind Kumar CM
- R Siddharth

