PDF Reader Markdown Doc Updates (#85)

adrianoaru-nhs · web-flow · commit 19803439ba49 · 2025-05-28T13:17:27.000+01:00
## Description

Addressed feedback given during reviews of the markdown document.

## Context

## Type of changes

Allows for easier understanding of the util.

- [x] Refactoring (non-breaking change)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would change existing functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)

## Checklist

- [x] I am familiar with the [contributing guidelines](https://github.com/nhs-england-tools/playwright-python-blueprint/blob/main/CONTRIBUTING.md)
- [x] I have followed the code style of the project
- [ ] I have added tests to cover my changes (where appropriate)
- [x] I have updated the documentation accordingly
- [ ] This PR is a result of pair or mob programming

---

## Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others' privacy, we kindly ask you to NOT include [PII (Personal Identifiable Information) / PID (Personal Identifiable Data)](https://digital.nhs.uk/data-and-information/keeping-data-safe-and-benefitting-the-public) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that does contain any sensitive information. We really appreciate your cooperation in this matter.

- [x] I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.
diff --git a/docs/utility-guides/PDFReader.md b/docs/utility-guides/PDFReader.md
@@ -1,6 +1,6 @@
 # Utility Guide: PDF Reader
 
-The PDF Reader utility allows for reading of PDF files and performing specific tasks on them.
+The PDF Reader utility allows for reading PDF files and extracting NHS numbers from them.
 
 ## Table of Contents
 
@@ -10,26 +10,55 @@ The PDF Reader utility allows for reading of PDF files and performing specific t
     - [Extract NHS No From PDF](#extract-nhs-no-from-pdf)
       - [Required Arguments](#required-arguments)
       - [How This Function Works](#how-this-function-works)
+      - [Example Usage](#example-usage)
 
 ## Functions Overview
 
-For this utility we have the following functions/methods:
+For this utility, the following function is available:
 
 - `extract_nhs_no_from_pdf`
 
 ### Extract NHS No From PDF
 
-This is called to extract all NHS numbers from a PDF file.
-The way it finds an NHS number is by looking for the string **"NHS No:"**
+This function extracts all NHS numbers from a PDF file by searching for the string **"NHS No:"** on each page.
 
 #### Required Arguments
 
 - `file`:
   - Type: `str`
-  - This is the file path stored as a string.
+  - The file path to the PDF file as a string.
 
 #### How This Function Works
 
-1. It starts off by storing the PDF file as a PdfReader object, this is from the `pypdf` package.
-2. Then it loops through each page.
-3. If it finds the string *"NHS No"* in the page, it extracts it and removes any whitespaces, then adds it to a pandas DataFrame - `nhs_no_df`
+1. Loads the PDF file into a `PdfReader` object from the `pypdf` package.
+2. Loops through each page of the PDF.
+3. Searches for the string *"NHS No"* on each page.
+4. If found, extracts the NHS number, removes any whitespace, and adds it to a pandas DataFrame (`nhs_no_df`).
+5. If the string is not found on a page, moves on to the next page.
+6. Returns the DataFrame containing all extracted NHS numbers.
+
+#### Example Usage
+
+You can use this utility to extract NHS numbers from a PDF file either as part of the [`Batch Processing`](BatchProcessing.md) utility or directly, by providing the file path as a string.
+
+**Extracting NHS numbers using a file path:**
+
+```python
+from utils.pdf_reader import extract_nhs_no_from_pdf
+file_path = "path/to/your/file.pdf"
+nhs_no_df = extract_nhs_no_from_pdf(file_path)
+```
+
+**Extracting NHS numbers using batch processing:**
+
+```python
+from utils.pdf_reader import extract_nhs_no_from_pdf
+get_subjects_from_pdf = True
+file = download_file.suggested_filename  # Set by Playwright when the "Retrieve" button on a batch is clicked.
+
+nhs_no_df = (
+  extract_nhs_no_from_pdf(file)
+  if file.endswith(".pdf") and get_subjects_from_pdf
+  else None
+)
+```
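
The steps listed under "How This Function Works" can be illustrated with a minimal, standard-library-only sketch. It is not the code in `utils/pdf_reader.py`, which this diff does not show. The function name `find_nhs_numbers`, the regular expression, and the sample page text are all hypothetical. The sketch assumes each page's text has already been extracted into a string (the real utility does this with `pypdf`) and returns a plain list rather than a pandas DataFrame.

```python
import re

# Hypothetical pattern (an assumption, not taken from the utility): the label
# "NHS No" with an optional colon, followed by ten digits that may be
# separated by whitespace, e.g. "123 456 7890".
NHS_NO_PATTERN = re.compile(r"NHS No:?\s*((?:\d\s*){10})")


def find_nhs_numbers(page_texts):
    """Collect every number that follows an "NHS No" label, whitespace removed."""
    nhs_numbers = []
    for text in page_texts:  # one string per PDF page
        # Pages without the label yield no matches and are simply skipped.
        for match in NHS_NO_PATTERN.finditer(text):
            # Remove all whitespace so "123 456 7890" becomes "1234567890".
            nhs_numbers.append(re.sub(r"\s+", "", match.group(1)))
    return nhs_numbers


# Made-up page text for illustration only; these are not real NHS numbers.
pages = [
    "Dear patient\nNHS No: 123 456 7890\nPlease attend...",
    "This page has no number on it.",
    "NHS No: 987 654 3210",
]
print(find_nhs_numbers(pages))
```

The sketch only checks that ten digits follow the label; it does not validate them as NHS numbers.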