Binary file added .DS_Store
Binary file not shown.
Binary file not shown.
90 changes: 90 additions & 0 deletions README copy.md
@@ -0,0 +1,90 @@
# oneAPI-GenAI-Hackathon-2023 - Hack2Skill

#### Team Name - Team BhuMe
#### Problem Statement - AI-Enhanced Legal Practice Platform
#### Team Leader Email - [email protected]

### 📜 Overview
This project is part of the Intel oneAPI Hackathon 2023, submitted by Team BhuMe under the theme "Generative AI Large Language Models Fine Tuned For Legal Practice Platform".

We set out to build a robust LLM, fine-tuned on Indian property registry and documentation data, that extracts structured information from unstructured and complex property records. The goal is to simplify and speed up property legal due diligence, which has to be completed in the short window while a deal is being negotiated. For this task we used Mistral-7B, a recent open-source large language model. Data is downloaded at runtime from publicly available government digital land records. The product is aimed at real estate investors and brokers: for a given village and plot number, it fetches ownership data automatically and presents it in a simple, easy-to-understand UI. Selenium can be used to automate the download from public records, and the fine-tuned Mistral model extracts structured information that can be shown neatly on a vector map UI. The Mistral model is fine-tuned on downstream tasks (adapter layer) to improve accuracy and make its output less verbose.

Training was run with Mistral both on Intel hardware using Intel Extension for PyTorch (IPEX), for faster inference, and on an Nvidia T4 GPU, for benchmarking. Real-time inference will be served using Intel Extension for PyTorch. The inference model is hosted on Intel Developer Cloud and exposed using ngrok.


### 📜 A Brief of the Prototype:
The app is available at https://app.bhume.in/

Brokers and real estate investors use our services to automate and simplify property ownership documentation.

Lawyers use this tool to automatically download property registry data from the government website and then filter for the property of interest based on the property schedule, which contains the khasra no., survey no., plot no. and other fields.

Valuers use this tool to extract sale instances of properties near their area of interest.
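
The schedule-based filtering described for lawyers could be sketched roughly as follows. This is only an illustration: the `RegistryRecord` class, its field names, the `filter_by_schedule` helper, and the sample village names are all hypothetical, since the README does not describe the actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class RegistryRecord:
    # Hypothetical fields; the real record schema is not given in this README.
    document_id: str
    village: str
    khasra_no: Optional[str] = None
    survey_no: Optional[str] = None
    plot_no: Optional[str] = None


def _norm(value: Optional[str]) -> Optional[str]:
    """Normalise an identifier so '12/ 3A' and '12/3a' compare equal."""
    if value is None:
        return None
    return "".join(value.split()).lower()


def filter_by_schedule(
    records: Iterable[RegistryRecord],
    village: str,
    khasra_no: Optional[str] = None,
    survey_no: Optional[str] = None,
    plot_no: Optional[str] = None,
) -> List[RegistryRecord]:
    """Keep records in the given village that match every identifier supplied."""
    wanted = {
        "khasra_no": _norm(khasra_no),
        "survey_no": _norm(survey_no),
        "plot_no": _norm(plot_no),
    }
    # Identifiers the caller did not supply are not used as filters.
    wanted = {key: val for key, val in wanted.items() if val is not None}

    matches = []
    for rec in records:
        if _norm(rec.village) != _norm(village):
            continue
        if all(_norm(getattr(rec, key)) == val for key, val in wanted.items()):
            matches.append(rec)
    return matches
```

Normalising whitespace and case before comparing is a design choice for this sketch; real registry identifiers may need stricter or looser matching.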


### Tech Stack:

The prototype is built with the Intel® AI Analytics Toolkit and its libraries.
![tech_stack](tech_stack.png)
We use a mix of React, Python, Django and PostgreSQL, along with libraries such as Selenium and scikit-learn and fine-tuned LLMs from OpenAI, to build and run the app. Data is scraped and stored for each request at runtime. The downloaded data is filtered by document type and then fed into the LLMs one document at a time to extract the information that lets us simplify the UI.
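
The filter-then-extract flow described above might look something like the sketch below. The document dictionary keys (`doc_id`, `doc_type`, `text`), the `extract_from_documents` helper, and the `fake_extract` stand-in are assumptions made for illustration; in the real app the extractor would be a call to the fine-tuned LLM.

```python
from typing import Callable, Dict, Iterable, List, Set


def extract_from_documents(
    documents: Iterable[Dict[str, str]],
    wanted_types: Set[str],
    extract: Callable[[str], Dict[str, object]],
) -> List[Dict[str, object]]:
    """Filter documents by type, then run the extractor on each one in turn."""
    results = []
    for doc in documents:
        # Step 1: skip document types that are not relevant to the request.
        if doc.get("doc_type") not in wanted_types:
            continue
        # Step 2: extract structured fields from one document at a time.
        fields = extract(doc["text"])
        results.append({"doc_id": doc["doc_id"], **fields})
    return results


def fake_extract(text: str) -> Dict[str, object]:
    """Hypothetical stand-in for the LLM call; it only counts words."""
    return {"word_count": len(text.split())}
```

Passing the extractor in as a function keeps the scraping and filtering code independent of whichever model is used.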

### Step-by-Step Code Execution Instructions:

Run the frontend:
1. `cd frontend`
2. `npm install`
3. `npm start`

Run the inference model:
1. `python Mistral_calling.py`

Then go to http://localhost:3000


### Step-by-Step Finetuning
Use the following commands to set up and activate the conda environment:
```bash
conda create -n venv python==3.8.10
conda activate venv
pip install -r requirements.txt
```

Set the environment variable to select the Intel AMX ISA:
```bash
export ONEDNN_MAX_CPU_ISA="AVX512_CORE_AMX"
```
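
Before relying on the AMX setting, it can help to confirm that the CPU actually reports AMX support. The sketch below is one way to check on Linux by reading the `flags` line of `/proc/cpuinfo`; the helper names `amx_flags` and `read_cpuinfo` are made up for this example. As far as we understand, `ONEDNN_MAX_CPU_ISA` sets an upper limit, so on a CPU without AMX oneDNN should fall back to the best ISA it supports.

```python
def amx_flags(cpuinfo_text: str) -> set:
    """Return the AMX-related CPU flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The line looks like "flags\t\t: fpu vme ... amx_bf16 amx_tile".
            _, _, value = line.partition(":")
            found.update(flag for flag in value.split() if flag.startswith("amx"))
    return found


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the cpuinfo file; return an empty string where it does not exist."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    flags = amx_flags(read_cpuinfo())
    if flags:
        print("AMX flags reported:", ", ".join(sorted(flags)))
    else:
        print("No AMX flags reported on this machine.")
```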

Preprocessing

Prepare the dataset using preprocess.py:
```bash
python preprocess.py
```

Finetuning

Run the command:
```bash
python falcon-tune.py --bf16 True --use_ipex True --max_seq_length 512
```

Inference
```bash
python falcon-tuned-inference.py --checkpoints <PATH-TO-CHECKPOINT> --max_length 200 --top_k 10
```

### 🚩 Benchmarking Results:
![benchmark](benchmark.png)
*inference time is measured in seconds
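
For reference, inference times like those in the chart could be collected with a small timing helper along these lines. This is a generic sketch, not the script used to produce the benchmark; `time_call` and its parameters are hypothetical names.

```python
import statistics
import time


def time_call(fn, *args, repeats=5, warmup=1, **kwargs):
    """Time fn(*args, **kwargs) in seconds over several runs."""
    # Warm-up runs are excluded so one-off setup cost does not skew the numbers.
    for _ in range(warmup):
        fn(*args, **kwargs)

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append(time.perf_counter() - start)

    return {
        "runs": repeats,
        "mean_s": statistics.mean(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }
```

Calling `time_call(generator, prompt, repeats=10)` with the same prompt on each platform would give comparable mean, minimum and maximum latencies.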

### Future Scope:
Property disputes account for 70% of all civil court cases in India. Proper due diligence before any transaction can help a person avoid legal issues. The current barrier to due diligence is the data cleaning, processing and analysis required for each property in the short time while the deal is being negotiated.

The future scope of work is to:
1. integrate other legal documents pertaining to property ownership (depth)
2. expand to other states
3. download ownership data for each district, one by one, and make it available to users

96 changes: 77 additions & 19 deletions README.md
@@ -1,32 +1,90 @@
# oneAPI-GenAI-Hackathon-2023 - Hack2Skill

Welcome to the official repository for the oneAPI-GenAI-Hackathon-2023 organized by Hack2Skill!
#### Team Name - Team BhuMe
#### Problem Statement - AI-Enhanced Legal Practice Platform
#### Team Leader Email - [email protected]

## Getting Started

To get started with the oneAPI-GenAI-Hackathon-2023 repository, follow these steps:
### 📜 Overview
This project is part of the Intel oneAPI Hackathon 2023, submitted by Team BhuMe under the theme "Generative AI Large Language Models Fine Tuned For Legal Practice Platform".

We set out to build a robust LLM, fine-tuned on Indian property registry and documentation data, that extracts structured information from unstructured and complex property records. The goal is to simplify and speed up property legal due diligence, which has to be completed in the short window while a deal is being negotiated. For this task we used Mistral-7B, a recent open-source large language model. Data is downloaded at runtime from publicly available government digital land records. The product is aimed at real estate investors and brokers: for a given village and plot number, it fetches ownership data automatically and presents it in a simple, easy-to-understand UI. Selenium can be used to automate the download from public records, and the fine-tuned Mistral model extracts structured information that can be shown neatly on a vector map UI. The Mistral model is fine-tuned on downstream tasks (adapter layer) to improve accuracy and make its output less verbose.

Training was run with Mistral both on Intel hardware using Intel Extension for PyTorch (IPEX), for faster inference, and on an Nvidia T4 GPU, for benchmarking. Real-time inference will be served using Intel Extension for PyTorch. The inference model is hosted on Intel Developer Cloud and exposed using ngrok.


### Submission Instruction:
1. Fork this repository
2. Create a folder with your Team Name
3. Upload all the code and necessary files in the created folder
4. Upload a **README.md** file in your folder with the below mentioned informations.
5. Generate a Pull Request with your Team Name. (Example: submission-XYZ_team)
### 📜 A Brief of the Prototype:
The app is available at https://app.bhume.in/

Brokers and real estate investors use our services to automate and simplify property ownership documentation.

### README.md must consist of the following information:
Lawyers use this tool to automatically download property registry data from the government website and then filter for the property of interest based on the property schedule, which contains the khasra no., survey no., plot no. and other fields.

Valuers use this tool to extract sale instances of properties near their area of interest.

#### Team Name -
#### Problem Statement -
#### Team Leader Email -

### A Brief of the Prototype:
This section must include UML Diagrams and prototype description

### Tech Stack:
List Down all technologies used to Build the prototype

The prototype is built with the Intel® AI Analytics Toolkit and its libraries.
![tech_stack](tech_stack.png)
We use a mix of React, Python, Django and PostgreSQL, along with libraries such as Selenium and scikit-learn and fine-tuned LLMs from OpenAI, to build and run the app. Data is scraped and stored for each request at runtime. The downloaded data is filtered by document type and then fed into the LLMs one document at a time to extract the information that lets us simplify the UI.

### Step-by-Step Code Execution Instructions:

Run the frontend:
1. `cd frontend`
2. `npm install`
3. `npm start`

Run the inference model:
1. `python Mistral_calling.py`

Then go to http://localhost:3000


### Step-by-Step Finetuning
Use the following commands to set up and activate the conda environment:
```bash
conda create -n venv python==3.8.10
conda activate venv
pip install -r requirements.txt
```

Set the environment variable to select the Intel AMX ISA:
```bash
export ONEDNN_MAX_CPU_ISA="AVX512_CORE_AMX"
```

Preprocessing

Prepare the dataset using preprocess.py:
```bash
python preprocess.py
```

Finetuning

Run the command:
```bash
python falcon-tune.py --bf16 True --use_ipex True --max_seq_length 512
```

Inference
```bash
python falcon-tuned-inference.py --checkpoints <PATH-TO-CHECKPOINT> --max_length 200 --top_k 10
```

### 🚩 Benchmarking Results:
![benchmark](benchmark.png)
*inference time is measured in seconds

### Future Scope:
Property disputes account for 70% of all civil court cases in India. Proper due diligence before any transaction can help a person avoid legal issues. The current barrier to due diligence is the data cleaning, processing and analysis required for each property in the short time while the deal is being negotiated.

The future scope of work is to:
1. integrate other legal documents pertaining to property ownership (depth)
2. expand to other states
3. download ownership data for each district, one by one, and make it available to users

Binary file added Screenshot 2023-12-06 at 11.24.51 PM.png
Binary file added Screenshot 2023-12-06 at 11.26.08 PM.png
Binary file added Team BhuMe/Architecture Diagram.png
Binary file added Team BhuMe/Process Flow.png
Binary file added bechmark.png
72 changes: 72 additions & 0 deletions falcon-tune.py
@@ -0,0 +1,72 @@
# falcon-tune.py
import argparse
import time

from datasets import load_dataset
from trl import SFTTrainer
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments)


def str2bool(value):
    """Parse 'True'/'False' CLI values (type=bool treats any non-empty string as True)."""
    if isinstance(value, bool):
        return value
    if value.lower() in ("true", "t", "yes", "y", "1"):
        return True
    if value.lower() in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def main(FLAGS):

    # dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
    # Assumption: load_and_process_json() is not defined in this file, so the
    # JSON is loaded directly here; each record is expected to have a "text" field.
    dataset = load_dataset("json", data_files="path_to_your_file.json", split="train")

    model_name = "tiiuae/falcon-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Falcon has no dedicated padding token, so reuse the end-of-sequence token.
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

    print('setting training arguments')

    training_arguments = TrainingArguments(
        output_dir="./results",
        bf16=FLAGS.bf16,  # change for CPU
        use_ipex=FLAGS.use_ipex,  # change for CPU IPEX
        no_cuda=True,  # train on CPU only
        fp16_full_eval=False,
    )

    print('Creating SFTTrainer')

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=FLAGS.max_seq_length,
        tokenizer=tokenizer,
        args=training_arguments,
        packing=True,
    )

    print('Starting Training')
    start = time.time()

    trainer.train()

    total = time.time() - start
    print(f'Time to tune {total}')


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('-bf16',
                        '--bf16',
                        type=str2bool,
                        default=True,
                        help="activate mixed precision training with bf16")
    parser.add_argument('-ipex',
                        '--use_ipex',
                        type=str2bool,
                        default=True,
                        help="use Intel Extension for PyTorch (IPEX) during training")
    parser.add_argument('-msq',
                        '--max_seq_length',
                        type=int,
                        default=512,
                        help="maximum sequence length of the training examples")

    FLAGS = parser.parse_args()
    main(FLAGS)
68 changes: 68 additions & 0 deletions falcon-tuned-inference.py
@@ -0,0 +1,68 @@
# falcon-tuned-inference.py

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
import argparse
import time


def main(FLAGS):

    model = AutoModelForCausalLM.from_pretrained(FLAGS.checkpoints, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(FLAGS.checkpoints, trust_remote_code=True)
    # Reuse the end-of-sequence token for padding.
    tokenizer.pad_token = tokenizer.eos_token

    generator = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )

    user_input = "start"

    # Keep prompting until the user types "stop".
    while user_input != "stop":

        user_input = input("Provide Input to tuned falcon: ")

        start = time.time()

        if user_input != "stop":
            # With do_sample=False decoding is greedy, so top_k only takes
            # effect if sampling is enabled.
            sequences = generator(
                f""" {user_input}""",
                max_length=FLAGS.max_length,
                do_sample=False,
                top_k=FLAGS.top_k,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id,)

            inference_time = time.time() - start

            for seq in sequences:
                print(f"Result: {seq['generated_text']}")

            print(f'Total Inference Time: {inference_time} seconds')


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('-c',
                        '--checkpoints',
                        type=str,
                        required=True,
                        help="path to model checkpoint files")
    parser.add_argument('-ml',
                        '--max_length',
                        type=int,
                        default=200,
                        help="used to control the maximum length of the generated text in text generation tasks")
    parser.add_argument('-tk',
                        '--top_k',
                        type=int,
                        default=10,
                        help="specifies the number of highest probability tokens to consider at each step")

    FLAGS = parser.parse_args()
    main(FLAGS)
7 changes: 7 additions & 0 deletions frontend/.gitignore
@@ -0,0 +1,7 @@
# OSX
.DS_Store
build/
node_modules/
npm-debug.log
yarn-error.log
