
Commit 13caf87

Merge pull request #120 from seanpedrick-case/dev
Added LLM support for redaction and summarisation. GUI improvements, including a 'Walkthrough' redaction process. Added an efficient OCR option that uses multithreading and splits text extraction between visual OCR and simple text extraction. Various bug fixes.
2 parents 3be9b9b + a671c9b commit 13caf87

30 files changed: +22770 additions, -4886 deletions

README.md

Lines changed: 5 additions & 5 deletions
@@ -11,7 +11,7 @@ short_description: OCR / redact PDF documents and tabular data
 ---
 # Document redaction
 
-version: 1.6.7
+version: 1.7.0
 
 Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
 
@@ -1005,7 +1005,7 @@ The following parameters can be configured by your system administrator to fine-
 When VLM options are enabled, the following settings are available:
 
 - **SHOW_VLM_MODEL_OPTIONS** (default: False): If enabled, VLM options will be shown in the UI.
-- **SELECTED_MODEL** (default: "Dots.OCR"): The VLM model to use. Options include: "Nanonets-OCR2-3B", "Dots.OCR", "Qwen3-VL-2B-Instruct", "Qwen3-VL-4B-Instruct", "Qwen3-VL-8B-Instruct", "PaddleOCR-VL". Generally, the Qwen3-VL-8B-Instruct model is the most accurate, and vlm/inference server inference is based on using this model, but is also the slowest. Qwen3-VL-4B-Instruct can also work quite well on easier documents.
+- **SELECTED_LOCAL_TRANSFORMERS_VLM_MODEL** (default: "Dots.OCR"): The VLM model to use. Options include: "Nanonets-OCR2-3B", "Dots.OCR", "Qwen3-VL-2B-Instruct", "Qwen3-VL-4B-Instruct", "Qwen3-VL-8B-Instruct", "PaddleOCR-VL". Generally, the Qwen3-VL-8B-Instruct model is the most accurate, and vlm/inference server inference is based on using this model, but is also the slowest. Qwen3-VL-4B-Instruct can also work quite well on easier documents.
 - **MAX_SPACES_GPU_RUN_TIME** (default: 60): Maximum seconds to run GPU operations on Hugging Face Spaces.
 - **MAX_NEW_TOKENS** (default: 30): Maximum number of tokens to generate for VLM responses.
 - **MAX_INPUT_TOKEN_LENGTH** (default: 4096): Maximum number of tokens that can be input to the VLM.
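
Taken together, these settings can be placed in the app's configuration env file. The fragment below is a hypothetical sketch, not part of the diff: the model choice and the file location (config/app_config.env, mentioned later in this README) are illustrative.

```
SHOW_VLM_MODEL_OPTIONS=True
SELECTED_LOCAL_TRANSFORMERS_VLM_MODEL=Qwen3-VL-4B-Instruct
MAX_NEW_TOKENS=30
MAX_INPUT_TOKEN_LENGTH=4096
```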
@@ -1021,11 +1021,11 @@ When VLM options are enabled, the following settings are available:
 
 ### Using an alternative OCR model
 
-If the SHOW_LOCAL_OCR_MODEL_OPTIONS, SHOW_PADDLE_MODEL_OPTIONS, and SHOW_INFERENCE_SERVER_OPTIONS are set to 'True' in your app_config.env file, you should see the following options available under 'Change default redaction settings...' on the front tab. The different OCR options can be used in different contexts.
+If the SHOW_LOCAL_OCR_MODEL_OPTIONS, SHOW_PADDLE_MODEL_OPTIONS, and SHOW_INFERENCE_SERVER_VLM_OPTIONS are set to 'True' in your app_config.env file, you should see the following options available under 'Change default redaction settings...' on the front tab. The different OCR options can be used in different contexts.
 
 - **Tesseract (option 'tesseract')**: Best for documents with clear, well-formatted text, providing a good balance of speed and accuracy with precise word-level bounding boxes. But struggles a lot with handwriting or 'noisy' documents (e.g. scanned documents).
 - **PaddleOCR (option 'paddle')**: More powerful than Tesseract, but slower. Does a decent job with unclear typed text on scanned documents. Also, bounding boxes may not all be accurate as they will be calculated from the line-level bounding boxes produced by Paddle after analysis.
-- **VLM (option 'vlm')**: Recommended for use with the Qwen-3-VL 8B model (can set this with the SELECTED_MODEL environment variable in config.py). This model is extremely good at identifying difficult to read handwriting and noisy documents. However, it is much slower than the above options.
+- **VLM (option 'vlm')**: Recommended for use with the Qwen-3-VL 8B model (can set this with the SELECTED_LOCAL_TRANSFORMERS_VLM_MODEL environment variable in config.py). This model is extremely good at identifying difficult to read handwriting and noisy documents. However, it is much slower than the above options.
 Other models are available as you can see in the tools/run_vlm.py code file. This will conduct inference with the transformers package, and quantise with bitsandbytes if the QUANTISE_VLM_MODELS environment variable is set to True. Inference with this package is *much* slower than with e.g. llama.cpp or vllm servers, which can be used with the inference-server options described below.
 - **Inference server (option 'inference-server')**: This can be used with OpenAI compatible API endpoints, for example [llama-cpp using llama-server](https://github.com/ggml-org/llama.cpp), or [vllm](https://docs.vllm.ai/en/stable). Both of these options will be much faster for inference than the VLM 'in-app' model calls described above, and produce results of a similar quality, but you will need to be able to set up the server separately.
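
As a sketch of what an inference-server OCR call involves: OpenAI-compatible endpoints such as llama-server and vllm accept a chat-completions request whose message content mixes text and a base64 image. The helper below only builds that request body; the model name and prompt are illustrative assumptions, not values taken from the app.

```python
import json


def build_ocr_payload(image_b64: str, model: str = "Qwen3-VL-8B-Instruct") -> dict:
    """Assemble an OpenAI-compatible /v1/chat/completions request body that
    asks a vision model to transcribe a page supplied as base64 PNG data."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Transcribe all text in this image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 512,
    }


# Placeholder base64 string, not a real document page.
payload = build_ocr_payload("iVBORw0KGgo=")
print(json.dumps(payload, indent=2))
```

The resulting JSON would be POSTed to the server's /v1/chat/completions route; how the app actually formats its requests is not shown in this diff.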

@@ -1064,7 +1064,7 @@ llama-server \
 If running llama.cpp on the same computer as the doc redaction app, you can then set the following variable in config/app_config.env to run:
 
 ```
-SHOW_INFERENCE_SERVER_OPTIONS=True
+SHOW_INFERENCE_SERVER_VLM_OPTIONS=True
 INFERENCE_SERVER_API_URL=http://localhost:7862
 ```
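
Before pointing the app at the server, it can help to confirm llama-server is reachable: its HTTP API exposes a GET /health readiness endpoint. A minimal sketch, assuming the INFERENCE_SERVER_API_URL value shown above:

```python
from urllib.parse import urljoin


def health_url(api_url: str) -> str:
    # llama-server exposes GET /health for readiness checks; normalise the
    # base URL so a missing trailing slash still joins correctly.
    return urljoin(api_url.rstrip("/") + "/", "health")


print(health_url("http://localhost:7862"))  # http://localhost:7862/health
```

A curl request against that URL should return a small JSON status once the model has finished loading.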

_quarto.yml

Lines changed: 2 additions & 2 deletions
@@ -1,12 +1,12 @@
 project:
   type: website
-  output-dir: docs # Common for GitHub Pages
+  output-dir: docs
   render:
   - "*.qmd"
 
 website:
   title: "Document Redaction App"
-  page-navigation: true # Often enabled for floating TOC to highlight current section
+  page-navigation: true
   back-to-top-navigation: true
   search: true
   navbar:
