Altering docstrings/markdowns and adding a markdown document for the PDF Reader util

adrianoaru-nhs · adrianoaru-nhs · commit 135661afcf8e · 2025-05-01T21:28:31.000+01:00
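
For reviewers: the new PDFReader.md describes `extract_nhs_no_from_pdf` as looping through each page, finding the string "NHS No", and stripping whitespace from the extracted number. The sketch below illustrates only that text-matching step on plain strings, so it runs without `pypdf` or `pandas`. The helper name `extract_nhs_numbers`, the regular expression, and the assumed layout of the text after "NHS No" (optional colon, ten digits possibly separated by spaces) are illustrative assumptions, not the code in `utils/pdf_reader.py`.

```python
import re

# Assumption: the label "NHS No" is followed by an optional colon and a
# 10-digit number that may be split by spaces, e.g. "NHS No: 943 476 5919".
NHS_NO_PATTERN = re.compile(r"NHS No\s*:?\s*((?:\d\s*){10})")


def extract_nhs_numbers(page_texts: list[str]) -> list[str]:
    """Collect NHS numbers from already-extracted page text (hypothetical helper)."""
    nhs_numbers = []
    for text in page_texts:
        for match in NHS_NO_PATTERN.finditer(text):
            # Strip all whitespace so "943 476 5919" becomes "9434765919"
            nhs_numbers.append(re.sub(r"\s+", "", match.group(1)))
    return nhs_numbers


# Made-up page text standing in for what a PDF reader would return per page
pages = [
    "Dear subject\nNHS No: 943 476 5919\nYour kit is enclosed.",
    "This page has no number.",
    "NHS No 9876543210",
]
print(extract_nhs_numbers(pages))
```

In the real utility the page text comes from a `PdfReader` object and the results are stored in a DataFrame with the column `subject_nhs_number`; the list returned here stands in for that column.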
diff --git a/docs/utility-guides/BatchProcessing.md b/docs/utility-guides/BatchProcessing.md
@@ -20,7 +20,7 @@ The Batch Processing utility allows for the processing of batches on the active
 
 ## Functions Overview
 
-For this utility we have the following functions/methods:
+For this utility we have the following functions:
 
 - `batch_processing`
 - `prepare_and_print_batch`
@@ -34,25 +34,25 @@ This will call the other two functions in order to successfully process a batch.
 #### Required Arguments
 
 - `page`:
-  - Type: **Page**
+  - Type: `Page`
   - This is the playwright page object which is used to tell playwright what page the test is currently on.
 - `batch_type`:
-  - Type: **str**
+  - Type: `str`
   - This is the event code for the batch. For example: **S1** or **A323**
 - `batch_description`:
-  - Type: **str**
+  - Type: `str`
   - This is the description of the batch. For example: **Pre-invitation (FIT)** or **Post-investigation Appointment NOT Required**
 - `latest_event_status`:
-  - Type: **str**
+  - Type: `str`
   - This is the status the subject will get updated to after the batch has been processed. It is used to check that the subject has been updated to the correct status after a batch has been printed
 
 #### Optional Arguments
 
 - `run_timed_events`:
-  - Type: **bool**
+  - Type: `bool`
   - If this is set to **True**, then bcss_timed_events will be executed against all the subjects found in the batch
 - `get_subjects_from_pdf`:
-  - Type: **bool**
+  - Type: `bool`
   - If this is set to **True**, then the subjects will be retrieved from the downloaded PDF file instead of from the DB
 
 #### How This Function Works
@@ -66,6 +66,7 @@ This will call the other two functions in order to successfully process a batch.
 6. After the ID is stored, it clicks on the ID to get to the Manage Active Batch page
 7. From Here it calls the `prepare_and_print_batch` function.
    1. If `get_subjects_from_pdf` was set to False it calls `get_nhs_no_from_batch_id`, which is imported from *utils.oracle.oracle_specific_functions*, to get the subjects from the batch and stores them as a pandas DataFrame - **nhs_no_df**
+   2. If `get_subjects_from_pdf` was set to True, the subjects are retrieved from the downloaded PDF file instead. For more information, see [PDFReader](PDFReader.md)
 8. Once this is complete it calls the `check_batch_in_archived_batch_list` function
 9. Finally, once that function is complete it calls `verify_subject_event_status_by_nhs_no` which is imported from *utils/screening_subject_page_searcher*
 
@@ -77,13 +78,13 @@ It is in charge of pressing on the following button: **Prepare Batch**, **Retrie
 #### Arguments
 
 - `page`:
-  - Type: **Page**
+  - Type: `Page`
   - This is the playwright page object which is used to tell playwright what page the test is currently on.
 - `link_text`:
-  - Type: **str**
+  - Type: `str`
   - This is the batch ID of the batch currently being processed
 - `get_subjects_from_pdf`:
-  - Type: **bool**
+  - Type: `bool`
   - This is an optional argument and if this is set to **True**, then the subjects will be retrieved from the downloaded PDF file instead of from the DB
 
 #### How This Function Works
@@ -103,10 +104,10 @@ This function checks that the batch that was just prepared and printed is now vi
 #### Arguments
 
 - `page`:
-  - Type: **Page**
+  - Type: `Page`
   - This is the playwright page object which is used to tell playwright what page the test is currently on.
 - `link_text`:
-  - Type: **str**
+  - Type: `str`
   - This is the batch ID of the batch currently being processed
 
 #### How This Function Works
diff --git a/docs/utility-guides/PDFReader.md b/docs/utility-guides/PDFReader.md
@@ -0,0 +1,35 @@
+# Utility Guide: PDF Reader
+
+The PDF Reader utility allows for reading of PDF files and performing specific tasks on them.
+
+## Table of Contents
+
+- [Utility Guide: PDF Reader](#utility-guide-pdf-reader)
+  - [Table of Contents](#table-of-contents)
+  - [Functions Overview](#functions-overview)
+    - [Extract NHS No From PDF](#extract-nhs-no-from-pdf)
+      - [Required Arguments](#required-arguments)
+      - [How This Function Works](#how-this-function-works)
+
+## Functions Overview
+
+For this utility we have the following function:
+
+- `extract_nhs_no_from_pdf`
+
+### Extract NHS No From PDF
+
+This function extracts all of the NHS numbers found in a PDF file.
+It returns them in a pandas DataFrame with the column `subject_nhs_number`.
+
+#### Required Arguments
+
+- `file`:
+  - Type: `str`
+  - This is the file path stored as a string.
+
+#### How This Function Works
+
+1. It starts off by loading the PDF file into a `PdfReader` object from the `pypdf` package.
+2. Then it loops through each page.
+3. If it finds the string *"NHS No"* in the page, it extracts the NHS number, removes any whitespace, and adds it to a pandas DataFrame - `nhs_no_df`
diff --git a/utils/batch_processing.py b/utils/batch_processing.py
@@ -27,13 +27,15 @@ def batch_processing(
     get_subjects_from_pdf: bool = False,
 ) -> None:
     """
-    This util is used to process batches. It expects the following inputs:
-    - page: This is playwright page variable
-    - batch_type: This is the event code of the batch. E.g. S1 or S9
-    - batch_description: This is the description of the batch. E.g. Pre-invitation (FIT)
-    - latest_event_status: This is the status the subject will get updated to after the batch has been processed.
-    - run_timed_events: This is an optional input that executes bcss_timed_events if set to True
-    - get_subjects_from_pdf: This is an optial input to change the method of retrieving subjects from the batch from the Db to the PDF file.
+    This is used to process batches.
+
+    Args:
+        page (Page): This is the playwright page object
+        batch_type (str): The event code of the batch. E.g. S1 or S9
+        batch_description (str): The description of the batch. E.g. Pre-invitation (FIT)
+        latest_event_status (str): The status the subject will get updated to after the batch has been processed.
+        run_timed_events (bool): An optional input that executes bcss_timed_events if set to True
+        get_subjects_from_pdf (bool): An optional input that retrieves the subjects in the batch from the PDF file instead of the DB.
     """
     logging.info(f"Processing {batch_type} - {batch_description} batch")
     BasePage(page).click_main_menu_link()
@@ -86,8 +88,16 @@ def prepare_and_print_batch(
     page: Page, link_text: str, get_subjects_from_pdf: bool = False
 ) -> pd.DataFrame | None:
     """
-    This method prepares the batch, retreives the files and confirms them as printed
+    This prepares the batch, retrieves the files and confirms them as printed
     Once those buttons have been pressed it waits for the message 'Batch Successfully Archived'
+
+    Args:
+        page (Page): This is the playwright page object
+        link_text (str): The batch ID
+        get_subjects_from_pdf (bool): An optional input that retrieves the subjects in the batch from the PDF file instead of the DB.
+
+    Returns:
+        nhs_no_df (pd.DataFrame | None): if get_subjects_from_pdf is True, this is a DataFrame with the column 'subject_nhs_number' and each NHS number being a record, otherwise it is None
     """
     ManageActiveBatch(page).click_prepare_button()
     page.wait_for_timeout(
@@ -142,7 +152,11 @@ def prepare_and_print_batch(
 
 def check_batch_in_archived_batch_list(page: Page, link_text) -> None:
     """
-    This method checks the the batch that was just prepared and printed is now visible in the archived batch list
+    Checks that the batch that was just prepared and printed is now visible in the archived batch list.
+
+    Args:
+        page (Page): This is the playwright page object
+        link_text (str): The batch ID
     """
     BasePage(page).click_main_menu_link()
     BasePage(page).go_to_communications_production_page()
diff --git a/utils/pdf_reader.py b/utils/pdf_reader.py
@@ -3,6 +3,15 @@
 
 
 def extract_nhs_no_from_pdf(file: str) -> pd.DataFrame:
+    """
+    Extracts all of the NHS Numbers in a PDF file and stores them in a pandas DataFrame.
+
+    Args:
+        file (str): The file path stored as a string.
+
+    Returns:
+        nhs_no_df (pd.DataFrame): A DataFrame with the column 'subject_nhs_number' and each NHS number being a record
+    """
     reader = PdfReader(file)
     nhs_no_df = pd.DataFrame(columns=["subject_nhs_number"])
     # For loop looping through all pages of the file to find the NHS Number