Commit 4d3aa23

Merge pull request #1718 from oracle-devrel/ocr-llm-demo
Added ocr-llm-demo
2 parents 5b0839e + c8f7ecc commit 4d3aa23

File tree: 9 files changed (+666, -0 lines changed)

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
Copyright (c) 2021, 2023 Oracle and/or its affiliates.

The Universal Permissive License (UPL), Version 1.0

Subject to the condition set forth below, permission is hereby granted to any
person obtaining a copy of this software, associated documentation and/or data
(collectively the "Software"), free of charge and under any and all copyright
rights in the Software, and any and all patent rights owned or freely
licensable by each licensor hereunder covering either (i) the unmodified
Software as contributed to or provided by such licensor, or (ii) the Larger
Works (as defined below), to deal in both

(a) the Software, and
(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
one is included with the Software (each a "Larger Work" to which the Software
is contributed by such licensors),

without restriction, including without limitation the rights to copy, create
derivative works of, display, perform, and distribute the Software and make,
use, sell, offer for sale, import, export, have made, and have sold the
Software and the Larger Work(s), and to sublicense the foregoing rights on
either these or other terms.

This license is subject to the following condition:

The above copyright notice and either this complete permission notice or at
a minimum a reference to the UPL must be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# ocr-llm-demo

How to quickly set up a demo that showcases the OCR capabilities of multimodal LLMs.

The new generation of open multimodal LLMs is a very good fit for complex OCR workloads.
Many of the functionalities that would require fine-tuning with traditional OCR models can now be achieved with prompt engineering (the demo's default query, for instance, is simply "Extract text from picture precisely as JSON").
Multilingual support and the ability to recognize handwriting are some of the features that can be used to improve OCR workloads.

## Prerequisites of the demo

To download the model weights you will need a Hugging Face access token.
The demo was run on Ubuntu 24.04, but it should be possible to run it on other Ubuntu versions.
You need to install:

- the CUDA toolkit and NVIDIA driver
- Anaconda or Miniconda
- poppler, needed by `pdf2image` for PDF conversion: `sudo apt-get install poppler-utils`

### Install and activate

```
conda env create -f ocr-llm.yaml
conda activate ocr-llm
```

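After activating the environment, you can quickly confirm that the NVIDIA driver and CUDA toolkit are visible from Python. A minimal sketch, assuming PyTorch is pulled into the environment as a vLLM dependency:

```
import torch

# Expect True and a device count of at least 1 on a GPU shape.
print(torch.cuda.is_available(), torch.cuda.device_count())
```
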
## LLM Models

In this demo we use vLLM to serve multimodal models through the OpenAI API. We tested Pixtral-12B on a VM with 2 A10 GPUs, and Qwen2-VL on a single A10.
Llama-3.2-11B-Vision-Instruct is also an option.

You first need to log in to Hugging Face to download the weights:

```
huggingface-cli login
```

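If you prefer to authenticate from a provisioning script rather than the interactive CLI, the `huggingface_hub` package exposes the same login programmatically. A minimal sketch; the token value below is a placeholder for your own access token:

```
from huggingface_hub import login

login(token="hf_xxx")  # placeholder: use your own Hugging Face access token
```
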
Then serve one of the vLLM-supported vision models. Depending on the number of GPUs in your shape, you might be able to execute one or more models concurrently. For Llama-3.2, access in Europe is currently restricted.

```
vllm serve mistralai/Pixtral-12B-2409 --dtype auto --tokenizer-mode mistral -tp 2 --port 8001 --max-model-len 32768

vllm serve Qwen/Qwen2-VL-7B-Instruct --dtype auto --max-model-len 8192 --enforce-eager --port 8000

vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --dtype auto --port 8002 --max-model-len 32768
```

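Once a server is up, you can smoke-test the endpoint before launching the GUI. A minimal sketch against the Qwen2-VL server above on port 8000, mirroring the request that `gui.py` sends; the image URL is a placeholder, and vLLM ignores the API key by default:

```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract text from picture precisely as JSON"},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```
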
## Sample images

The folder `pictures` includes some example pictures that can be used in the demo. You can add additional images to improve the demo.
The formats supported by the LLMs are PNG, JPG, WEBP, and non-animated GIF. I also added automated conversion for PDF files, but for a multipage PDF only the first page will be considered (see the sketch below).

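The PDF handling follows what `gui.py` does with `pdf2image`: render the pages to PIL images, keep the first one, and pass it on as a base64-encoded JPEG. A minimal sketch; the file name is a placeholder:

```
import base64
from io import BytesIO

from pdf2image import convert_from_path  # requires the poppler binaries

pages = convert_from_path("sample.pdf")   # one PIL image per page
buffer = BytesIO()
pages[0].save(buffer, format="JPEG")      # only the first page is kept
base64_image = base64.b64encode(buffer.getvalue()).decode("utf-8")
```
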
## Running the GUI

A Gradio-based GUI is available:

```
python gui.py
```

Gradio is configured to proxy to a public connection, similar to the following one:

![Alt text](files/gui.png?raw=true "GUI")

## Executing Qwen-2.5-VL as a backend API

Qwen-2.5-VL models are now supported by vLLM, but you might still need to install transformers from the GitHub repo.

You can execute the 72B model on the 8 A100 GPUs of a BM.GPU.4.8 shape with:

```
vllm serve Qwen/Qwen2.5-VL-72B-Instruct --dtype auto -tp 8 --port 9193
```

Inference time is 40-50 seconds.

It can also be executed on a BM.GPU.L40s.4 by limiting the context length:

```
vllm serve Qwen/Qwen2.5-VL-72B-Instruct --dtype auto -tp 4 --port 9193 --max-model-len 16000 --enforce-eager
```

Inference time is about 60 seconds.

The 7B model can be executed on 2 GPUs:

```
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --dtype auto -tp 2 --port 9192
```

Binary file (604 KB) not shown.
Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
import base64
import os
import shutil
from io import BytesIO

import gradio as gr
import magic
import validators
from gradio_pdf import PDF
from openai import OpenAI
from pdf2image import convert_from_path
from PIL import Image as Pil

# Map each GUI model label to the served model name and its local vLLM port.
MODEL_PORTS = {
    "Pixtral-12B": ("mistralai/Pixtral-12B-2409", "8001"),
    "Qwen2-VL": ("Qwen/Qwen2-VL-7B-Instruct", "8000"),
    "Qwen2.5-VL": ("Qwen/Qwen2.5-VL-7B-Instruct", "9192"),
    "Qwen2.5-VL-72B": ("Qwen/Qwen2.5-VL-72B-Instruct", "9193"),
    "Llama-3.2-Vision": ("meta-llama/Llama-3.2-11B-Vision-Instruct", "8002"),
}

uploaded_files = set()

def upload_file(file):
    global uploaded_files
    if file.name in uploaded_files:
        return
    UPLOAD_FOLDER = "./pictures"
    shutil.copy(file, UPLOAD_FOLDER)
    gr.Info("File uploaded", duration=2)
    uploaded_files.add(file.name)

def update_file_explorer_2():
    return gr.FileExplorer(root_dir="./pictures")

def upload_file2(file):
    UPLOAD_FOLDER = "./pictures"
    shutil.copy(file, UPLOAD_FOLDER)
    gr.Info("File uploaded", duration=2)
    return gr.FileExplorer(root_dir="./")

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "./test.png"

def is_pdf(file_path):
    mime = magic.Magic(mime=True)
    file_type = mime.from_file(file_path)
    return file_type == 'application/pdf'

def contact_llm(model_label, query, image_path):
    # if upload_button:
    #     upload_file(upload_button)

    if not image_path:
        return None, None, None

    # Getting the base64 string
    if is_pdf(image_path):
        # Render the PDF with poppler; only the first page is used.
        pages = convert_from_path(image_path)
        image = pages[0]
        im_file = BytesIO()
        image.save(im_file, format="JPEG")
        im_bytes = im_file.getvalue()
        base64_image = base64.b64encode(im_bytes).decode('utf-8')
    else:
        base64_image = encode_image(image_path)
        image = Pil.open(image_path)

    model, port = MODEL_PORTS[model_label]

    if query != "":
        text_query = query
    else:
        gr.Info("Using default Query", duration=1)
        text_query = "Extract text from picture precisely as JSON"

    client = OpenAI(
        base_url="http://localhost:" + port + "/v1",
        api_key="EMPTY"  # vLLM doesn't require an API key by default
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": text_query,
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        },
                    },
                ],
            }
        ],
    )

    #print(response.choices[0])
    return text_query, response.choices[0], image

def get_from_url(url_input, model_label, query):
    # if upload_button:
    #     upload_file(upload_button)

    valid = validators.url(url_input)
    print(valid, url_input)
    if not valid:
        # Don't query the model until the textbox holds a well-formed URL.
        return None, None, None

    model, port = MODEL_PORTS[model_label]

    if query != "":
        text_query = query
    else:
        gr.Info("Using default Query", duration=1)
        text_query = "Extract text from picture precisely as JSON"

    client = OpenAI(
        base_url="http://localhost:" + port + "/v1",
        api_key="EMPTY"  # vLLM doesn't require an API key by default
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": text_query,
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": url_input  # the model server fetches the image itself
                        },
                    },
                ],
            }
        ],
    )

    #print(response.choices[0])
    return text_query, response.choices[0], None

if __name__ == "__main__":
    with gr.Blocks() as demo:
        gr.Markdown("# VLM based OCR")
        gr.Markdown("Provide an image and ask questions based on the context generated from it.")

        with gr.Row():
            with gr.Column(scale=1):
                model = gr.Dropdown(
                    # Labels must match the MODEL_PORTS keys above.
                    ["Qwen2.5-VL", "Qwen2.5-VL-72B", "Pixtral-12B", "Qwen2-VL", "Llama-3.2-Vision"],
                    label="Model",
                    info="Pick the model to use"
                )
                query_input = gr.Textbox(label="Enter your query", placeholder="Ask a question about the content")
                url_input = gr.Textbox(label="Enter image URL", placeholder="Paste image URL")
                file_explorer = gr.FileExplorer(glob="**/**", root_dir="./pictures", ignore_glob="**/__init__.py", file_count="single")
                file_upload = gr.File(file_count="single")
                submit_btn = gr.Button("Submit")

            with gr.Column(scale=1):
                query_output = gr.Textbox(label="Query")
                response_output = gr.Textbox(label="Response")
                image_output = gr.Image(type="pil")

        submit_btn.click(
            fn=contact_llm,
            inputs=[model, query_input, file_explorer],
            outputs=[query_output, response_output, image_output]
        )
        file_upload.upload(fn=upload_file2, inputs=file_upload, outputs=file_explorer).then(update_file_explorer_2, outputs=file_explorer)
        url_input.input(fn=get_from_url, inputs=[url_input, model, query_input], outputs=[query_output, response_output, image_output])
        # Launch the interface
        url = demo.launch(share=True, auth=("opc", "H789lf4z"))
