Commit a995a0b

Addressing feedback given in Jira tickets
1 parent bfa4ed7 commit a995a0b

1 file changed: +36 -8 lines changed


docs/utility-guides/PDFReader.md

Lines changed: 36 additions & 8 deletions
@@ -1,6 +1,6 @@
 # Utility Guide: PDF Reader
 
-The PDF Reader utility allows for reading of PDF files and performing specific tasks on them.
+The PDF Reader utility allows for reading PDF files and extracting NHS numbers from them.
 
 ## Table of Contents
 
@@ -10,26 +10,54 @@ The PDF Reader utility allows for reading of PDF files and performing specific t
 - [Extract NHS No From PDF](#extract-nhs-no-from-pdf)
 - [Required Arguments](#required-arguments)
 - [How This Function Works](#how-this-function-works)
+- [Example Usage](#example-usage)
 
 ## Functions Overview
 
-For this utility we have the following functions/methods:
+For this utility, the following function is available:
 
 - `extract_nhs_no_from_pdf`
 
 ### Extract NHS No From PDF
 
-This is called to extract all NHS numbers from a PDF file.
-The way it finds an NHS number is by looking for the string **"NHS No:"**
+This function extracts all NHS numbers from a PDF file by searching for the string **"NHS No:"** on each page.
 
 #### Required Arguments
 
 - `file`:
 - Type: `str`
-- This is the file path stored as a string.
+- The file path to the PDF file as a string.
 
 #### How This Function Works
 
-1. It starts off by storing the PDF file as a PdfReader object, this is from the `pypdf` package.
-2. Then it loops through each page.
-3. If it finds the string *"NHS No"* in the page, it extracts it and removes any whitespaces, then adds it to a pandas DataFrame - `nhs_no_df`
+1. Loads the PDF file using the `PdfReader` object from the `pypdf` package.
+2. Loops through each page of the PDF.
+3. Searches for the string *"NHS No"* on each page.
+4. If found, extracts the NHS number, removes any whitespaces, and adds it to a pandas DataFrame (`nhs_no_df`).
+5. Returns the DataFrame containing all extracted NHS numbers.
+
+#### Example Usage
+
+You can use this utility to extract NHS numbers from a PDF file as part of the [`Batch Processing`](BatchProcessing.md) utility or by providing the file path as a string.
+
+**Extracting NHS numbers using a file path:**
+
+```python
+from utils.pdf_reader import extract_nhs_no_from_pdf
+file_path = "path/to/your/file.pdf"
+nhs_no_df = extract_nhs_no_from_pdf(file_path)
+```
+
+**Extracting NHS numbers using batch processing:**
+
+```python
+from utils.pdf_reader import extract_nhs_no_from_pdf
+get_subjects_from_pdf = True
+file = download_file.suggested_filename # This is done via playwright when the "Retrieve button" on a batch is clicked.
+
+nhs_no_df = (
+    extract_nhs_no_from_pdf(file)
+    if file.endswith(".pdf") and get_subjects_from_pdf
+    else None
+)
+```
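
For readers of this change, the five steps listed under "How This Function Works" can be sketched roughly as follows. This is an illustration only, not the code in `utils/pdf_reader.py`, which this commit does not show. The names `find_nhs_numbers` and `extract_nhs_no_from_pdf_sketch`, the regular expression, the assumed ten-digit layout, and the `NHS_Number` column name are all assumptions made for the example.

```python
import re

# Assumed layout: the label "NHS No" (optional colon), then ten digits that
# may be separated by spaces, e.g. "NHS No: 123 456 7890".
NHS_NO_PATTERN = re.compile(r"NHS No:?\s*((?:\d\s*){10})")


def find_nhs_numbers(page_text: str) -> list[str]:
    """Return every NHS number found in one page of text, whitespace removed."""
    return [re.sub(r"\s+", "", match) for match in NHS_NO_PATTERN.findall(page_text)]


def extract_nhs_no_from_pdf_sketch(file: str):
    """Collect NHS numbers from every page of a PDF into a pandas DataFrame."""
    # Imported here so find_nhs_numbers can be used without these packages.
    import pandas as pd
    from pypdf import PdfReader

    reader = PdfReader(file)  # step 1: load the PDF
    nhs_numbers: list[str] = []
    for page in reader.pages:  # step 2: loop through each page
        text = page.extract_text() or ""
        nhs_numbers.extend(find_nhs_numbers(text))  # steps 3 and 4: search and clean
    # step 5: return the DataFrame (the column name is an assumption)
    return pd.DataFrame({"NHS_Number": nhs_numbers})
```

Keeping the text matching in its own helper means the pattern can be checked against plain strings without needing a real PDF file.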
