Commit 83576cb

Updated DynamoDB and S3 log load scripts. Updated user guide. Fix to loading review file when loading in _redactions_for_review file. Front page GUI fix
1 parent 6103474 commit 83576cb

10 files changed

+816
-271
lines changed

README.md

Lines changed: 26 additions & 28 deletions
@@ -11,7 +11,7 @@ short_description: OCR / redact PDF documents and tabular data
1111
---
1212
# Document redaction
1313

14-
version: 1.7.1
14+
version: 1.7.2
1515

1616
Redact personally identifiable information (PII) from documents (PDF, PNG, JPG), Word files (DOCX), or tabular data (XLSX/CSV/Parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
1717

@@ -263,6 +263,7 @@ Now you have the app installed, what follows is a guide on how to use it for bas
263263

264264
### Advanced user guide
265265
- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
266+
- [Document summarisation tab](#document-summarisation-tab)
266267
- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
267268
- [Using _for_review.pdf files with Adobe Acrobat](#using-_for_reviewpdf-files-with-adobe-acrobat)
268269
- [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
@@ -275,7 +276,6 @@ Now you have the app installed, what follows is a guide on how to use it for bas
275276
### Features for expert users/system administrators
276277
- [Advanced OCR options (Hybrid OCR)](#advanced-ocr-options-hybrid-ocr)
277278
- [PII identification with LLMs](#pii-identification-with-llms)
278-
- [Document summarisation tab](#document-summarisation-tab)
279279
- [Command Line Interface (CLI)](#command-line-interface-cli)
280280

281281
## Built-in example data
@@ -334,7 +334,7 @@ On the **'Redact PDFs/images'** tab, the **'Redaction settings'** accordion at t
334334

335335
### Text extraction
336336

337-
Inside the same **'Redaction settings'** accordion, open the nested accordion **'Change default redaction settings'** (it may already be open). If enabled, under **'Change default text extraction OCR method'** you can choose how text is extracted:
337+
Inside the same **'Redaction settings'** accordion, open the nested accordion **'Change default text extraction settings'** (it may already be open). If enabled, under **'Change default text extraction OCR method'** you can choose how text is extracted:
338338

339339
- **'Local model - selectable text'** - Reads text directly from PDFs that have selectable text (using PikePDF). Best for most PDFs; finds nothing if the PDF has no selectable text and is not suitable for handwriting or signatures. Image files are passed to the next option.
340340
- **'Local OCR model - PDFs without selectable text'** - Uses a local OCR model (Tesseract) to extract text from PDFs/images. Handles most typed text without selectable text but is less reliable for handwriting and signatures; use the AWS option below if you need those.
@@ -350,13 +350,16 @@ If you select **'AWS Textract service - all PDF types'** as the text extraction
350350

351351
### PII redaction method
352352

353-
At the start of the **'Change PII identification method'** accordion (under **'Change default redaction settings'**) you will see **'Choose redaction method'**, a radio with three options. **'Extract text only'** runs text extraction without redaction—useful when you only need OCR output or want to review text before redacting; when selected, the **'Select entity types to redact'** and **'Terms to always include or exclude in redactions...'** accordions are hidden. **'Redact all PII'** (the default) uses the chosen PII detection method to find and redact personal information; the entity-types accordion is shown and the terms (allow/deny/page) accordion is hidden. **'Redact selected terms'** shows both accordions and focuses on custom allow/deny lists and entity types (e.g. CUSTOM) so you can redact only the terms you specify.
353+
At the start of the **'Change PII identification method'** accordion (under **'Change default redaction settings'**) you will see **'Choose redaction method'**, a radio with three options:
354+
- **'Extract text only'** runs text extraction without redaction, which is useful when you only need OCR output or want to review text before redacting.
355+
- **'Redact all PII'** (the default) uses the chosen PII detection method to find and redact personal information across a range of types that you can customise below.
356+
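- **'Redact selected terms'** shows both the **'Select entity types to redact'** and **'Terms to always include or exclude in redactions...'** accordions and focuses on custom allow/deny lists so you can redact only the terms you specify.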
354357

355-
Still under **'Change default redaction settings'**, you may see the **'Change PII identification method'** section, if enabled, which lets you choose how PII is detected:
358+
Still under **'Change default redaction settings'**, you may see the **'Change PII identification model'** section (if enabled), which lets you choose how PII is detected. The following options may be available:
356359

357-
- **'Only extract text - (no redaction)'** - Use this if you only need extracted text (e.g. for duplicate detection or to review on the Review redactions tab).
358360
- **'Local'** - Uses a local model (e.g. spaCy) to detect PII at no extra cost. Often enough when you mainly care about custom terms (see [Customising redaction options](#customising-redaction-options)).
359361
- **'AWS Comprehend'** - Uses AWS Comprehend for PII detection when the app is configured for AWS; typically more accurate but incurs a cost (around £0.0075 ($0.01) per 10,000 characters).
362+
- Other options may be available depending on the app settings (e.g. AWS Bedrock, local LLM models).
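
As a rough illustration of the AWS Comprehend cost quoted above, the following minimal Python sketch estimates the charge for a given number of extracted characters. It assumes the approximate rate of $0.01 per 10,000 characters stated in this guide; the function name and the example document size are hypothetical, and actual AWS pricing may differ by region and over time.

```python
# Approximate rate taken from this guide: about $0.01 per 10,000 characters.
# This is an assumption for illustration, not an official AWS price.
USD_PER_10K_CHARS = 0.01


def estimate_comprehend_cost_usd(num_characters: int) -> float:
    """Return a rough cost estimate in USD for PII detection on text of this size."""
    if num_characters < 0:
        raise ValueError("num_characters must be non-negative")
    return num_characters / 10_000 * USD_PER_10K_CHARS


# Hypothetical example: a 200-page document with roughly 3,000 characters per page.
total_chars = 200 * 3_000
print(f"Estimated cost: ${estimate_comprehend_cost_usd(total_chars):.2f}")
```

An estimate like this can help decide whether the **'Local'** option is sufficient for a large batch before sending it to a paid service.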
360363

361364
Under **'Select entity types to redact'** you can choose which types of PII to redact (e.g. names, emails, dates). The dropdown label varies by method (Local, AWS Comprehend, or LLM); click in the box or near the dropdown arrow to see the full list.
362365

@@ -506,9 +509,9 @@ On the 'Review redactions' tab you have a visual interface that allows you to in
506509

507510
### Uploading documents for review
508511

509-
The top area has a file upload area where you can upload files for review . In the left box, upload the original PDF file. Click '1. Upload original PDF'. In the right box, you can upload the '..._review_file.csv' that is produced by the redaction process.
512+
The top of the tab has a file upload area where you can upload documents to review redactions. In the left box (1.), upload the original PDF file. If you have previously redacted the document, you can also upload the '...redactions_for_review.pdf' file produced by the redaction process, which will load in the previous redactions.
510513

511-
Optionally, you can upload a '..._ocr_result_with_words' file here, that will allow you to search through the text and easily [add new redactions based on word search](#searching-and-adding-custom-redactions). You can also upload one of the '..._ocr_output.csv' file here that comes out of a redaction task, so that you can navigate the extracted text from the document. Click the button '2. Upload Review or OCR csv files' load in these files.
514+
In the second input file box to the right (2.), you can upload a '..._ocr_result_with_words' file, which allows you to search through the text and easily [add new redactions based on word search](#searching-and-adding-custom-redactions). You can also upload one of the '..._ocr_output.csv' files that come out of a redaction task, so that you can navigate the extracted text from the document. Click the button '2. Upload Review or OCR csv files' to load in these files.
512515

513516
Now you can review and modify the suggested redactions using the interface described below.
514517

@@ -826,6 +829,21 @@ Using these deny list with spelling mistakes, the app fuzzy match these terms to
826829

827830
![Fuzzy match review outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/fuzzy_search/img/fuzzy_search_review.PNG)
828831

832+
## Document summarisation tab
833+
834+
When summarisation is enabled (e.g. **SHOW_SUMMARISATION** and at least one LLM option available), a **Document summarisation** tab is shown in the app. It lets you generate LLM-based summaries from OCR output CSVs (e.g. from a previous redaction run).
835+
836+
**How to use the Document summarisation tab**
837+
838+
1. **Upload OCR output files**: In the summarisation tab, use "Upload one or multiple 'ocr_output.csv' files to summarise" to attach one or more `*_ocr_output.csv` files (produced by the redaction pipeline when you extract text from PDFs/images).
839+
2. **Summarisation settings** (accordion):
840+
- **Choose LLM inference method for summarisation**: Choose from the LLM options available in the app settings.
841+
- **Max pages per page-group summary**: Limits how many pages are summarised together before recursive summarisation.
842+
- **Summary format**: **Concise** (key themes only) or **Detailed** (as much detail as possible).
843+
- **Additional summary instructions (optional)**: e.g. "Focus on key obligations and termination clauses".
844+
3. **Generate summary**: Click **"Generate summary"** to run the summarisation. The app groups pages, calls the LLM for each group, and creates a combined summary.
845+
4. **Outputs**: When finished, you can download summary files and view the summary in the tab.
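
The page-grouping flow described in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the app's actual implementation: it assumes each OCR output row has `page` and `text` fields, and `summarise_with_llm` is a hypothetical placeholder for whichever LLM backend is configured.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_pages(rows: Iterable[dict], max_pages_per_group: int) -> list[list[str]]:
    """Join the text of each page, then split the ordered pages into groups."""
    text_by_page = defaultdict(list)
    for row in rows:
        # 'page' and 'text' are assumed column names for this sketch.
        text_by_page[int(row["page"])].append(row["text"])
    pages = sorted(text_by_page)
    return [
        [" ".join(text_by_page[p]) for p in pages[i:i + max_pages_per_group]]
        for i in range(0, len(pages), max_pages_per_group)
    ]


def summarise_document(
    rows: Iterable[dict],
    max_pages_per_group: int,
    summarise_with_llm: Callable[[str], str],
) -> str:
    """Summarise each page group, then combine the group summaries."""
    group_summaries = [
        summarise_with_llm("\n".join(group))
        for group in group_pages(rows, max_pages_per_group)
    ]
    if not group_summaries:
        return ""
    if len(group_summaries) == 1:
        return group_summaries[0]
    # Final pass: summarise the concatenated group summaries.
    return summarise_with_llm("\n".join(group_summaries))


def fake_llm(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keeps only the first 40 characters."""
    return text[:40]


example_rows = [
    {"page": "1", "text": "This agreement is made between two parties."},
    {"page": "2", "text": "Either party may terminate with notice."},
    {"page": "3", "text": "Governing law and signatures."},
]
print(summarise_document(example_rows, max_pages_per_group=2, summarise_with_llm=fake_llm))
```

In this sketch, a smaller `max_pages_per_group` keeps each LLM call short at the cost of more calls and a longer final combining step.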
846+
829847
## Export to and import from Adobe
830848

831849
Files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/export_to_adobe/).
@@ -1118,26 +1136,6 @@ On the **'Redact PDFs/images'** tab, under **'Redaction settings'**, choose the
11181136

11191137
Model choice (Bedrock model ID, inference server URL, or local model name) and parameters (temperature, max tokens) are typically set in **Settings** or via environment variables; see the App settings / config documentation for your deployment.
11201138

1121-
## Document summarisation tab
1122-
1123-
When summarisation is enabled (e.g. **SHOW_SUMMARISATION** and at least one LLM option available), a **Document summarisation** tab is shown in the app. It lets you generate LLM-based summaries from OCR output CSVs (e.g. from a previous redaction run).
1124-
1125-
**How to use the Document summarisation tab**
1126-
1127-
1. **Upload OCR output files**: In the summarisation tab, use "Upload one or multiple 'ocr_output.csv' files to summarise" to attach one or more `*_ocr_output.csv` files (produced by the redaction pipeline when you extract text from PDFs/images).
1128-
2. **Summarisation settings** (accordion):
1129-
- **Choose LLM inference method for summarisation**: e.g. "LLM (AWS Bedrock)", "Local transformers LLM", or "Local inference server", depending on what is enabled.
1130-
- **Temperature**: Controls randomness (lower is more deterministic).
1131-
- **Max pages per page-group summary**: Limits how many pages are summarised together before recursive summarisation.
1132-
- **API Key (if required)**: For providers that need an API key.
1133-
- **Additional context (optional)**: Short description of the document type (e.g. "This is a partnership agreement").
1134-
- **Summary format**: **Concise** (key themes only) or **Detailed** (as much detail as possible).
1135-
- **Additional summary instructions (optional)**: e.g. "Focus on key obligations and termination clauses".
1136-
3. **Generate summary**: Click **"Generate summary"** to run the summarisation. The app groups pages, calls the LLM, and optionally recurses if the combined summary is long.
1137-
4. **Outputs**: When finished, you can download summary files and view the summary in the tab.
1138-
1139-
Summarisation uses the same LLM/inference settings as configured for the app (AWS region, inference server URL, etc.). For batch or scripted summarisation, use the CLI `--task summarise` (see Command Line Interface).
1140-
11411139
## Command Line Interface (CLI)
11421140

11431141
The app includes a comprehensive command-line interface (`cli_redact.py`) that allows you to perform redaction, deduplication, AWS Textract batch operations, and document summarisation directly from the terminal. This is particularly useful for batch processing, automation, and integration with other systems.
