Commit 1980343

PDF Reader Markdown Doc Updates (#85)
## Description

Addressed feedback given during reviews of the markdown document.

## Context

Allows for easier understanding of the utility.

## Type of changes

- [x] Refactoring (non-breaking change)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would change existing functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)

## Checklist

- [x] I am familiar with the [contributing guidelines](https://github.com/nhs-england-tools/playwright-python-blueprint/blob/main/CONTRIBUTING.md)
- [x] I have followed the code style of the project
- [ ] I have added tests to cover my changes (where appropriate)
- [x] I have updated the documentation accordingly
- [ ] This PR is a result of pair or mob programming

---

## Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others' privacy, we kindly ask you NOT to include [PII (Personal Identifiable Information) / PID (Personal Identifiable Data)](https://digital.nhs.uk/data-and-information/keeping-data-safe-and-benefitting-the-public) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that contains sensitive information. We really appreciate your cooperation in this matter.

- [x] I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.
1 parent bfa4ed7 commit 1980343

1 file changed, 37 additions and 8 deletions

docs/utility-guides/PDFReader.md

Lines changed: 37 additions & 8 deletions
@@ -1,6 +1,6 @@
 # Utility Guide: PDF Reader

-The PDF Reader utility allows for reading of PDF files and performing specific tasks on them.
+The PDF Reader utility allows for reading PDF files and extracting NHS numbers from them.

 ## Table of Contents

@@ -10,26 +10,55 @@ The PDF Reader utility allows for reading of PDF files and performing specific t
 - [Extract NHS No From PDF](#extract-nhs-no-from-pdf)
 - [Required Arguments](#required-arguments)
 - [How This Function Works](#how-this-function-works)
+- [Example Usage](#example-usage)

 ## Functions Overview

-For this utility we have the following functions/methods:
+For this utility, the following function is available:

 - `extract_nhs_no_from_pdf`

 ### Extract NHS No From PDF

-This is called to extract all NHS numbers from a PDF file.
-The way it finds an NHS number is by looking for the string **"NHS No:"**
+This function extracts all NHS numbers from a PDF file by searching for the string **"NHS No:"** on each page.

 #### Required Arguments

 - `file`:
   - Type: `str`
-  - This is the file path stored as a string.
+  - The file path to the PDF file as a string.

 #### How This Function Works

-1. It starts off by storing the PDF file as a PdfReader object, this is from the `pypdf` package.
-2. Then it loops through each page.
-3. If it finds the string *"NHS No"* in the page, it extracts it and removes any whitespaces, then adds it to a pandas DataFrame - `nhs_no_df`
+1. Loads the PDF file using the `PdfReader` object from the `pypdf` package.
+2. Loops through each page of the PDF.
+3. Searches for the string *"NHS No"* on each page.
+4. If found, extracts the NHS number, removes any whitespaces, and adds it to a pandas DataFrame (`nhs_no_df`).
+5. If no NHS numbers are found on that page, it goes to the next page.
+6. Returns the DataFrame containing all extracted NHS numbers.
+
+#### Example Usage
+
+You can use this utility to extract NHS numbers from a PDF file as part of the [`Batch Processing`](BatchProcessing.md) utility or by providing the file path as a string.
+
+**Extracting NHS numbers using a file path:**
+
+```python
+from utils.pdf_reader import extract_nhs_no_from_pdf
+file_path = "path/to/your/file.pdf"
+nhs_no_df = extract_nhs_no_from_pdf(file_path)
+```
+
+**Extracting NHS numbers using batch processing:**
+
+```python
+from utils.pdf_reader import extract_nhs_no_from_pdf
+get_subjects_from_pdf = True
+file = download_file.suggested_filename # This is done via playwright when the "Retrieve button" on a batch is clicked.
+
+nhs_no_df = (
+    extract_nhs_no_from_pdf(file)
+    if file.endswith(".pdf") and get_subjects_from_pdf
+    else None
+)
+```
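
The six steps described under "How This Function Works" can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the utility's actual code: the names `NHS_NO_PATTERN`, `find_nhs_numbers`, and `extract_nhs_no_sketch`, and the `nhs_number` column, are hypothetical. The regular expression assumes each number follows the label "NHS No:" as digits optionally separated by spaces. Only `PdfReader`, `reader.pages`, and `page.extract_text()` are taken from `pypdf`.

```python
import re
from typing import Iterable, List

# Assumed label format: "NHS No:" followed by digits that may be space-separated.
NHS_NO_PATTERN = re.compile(r"NHS No:\s*([0-9 ]+)")


def find_nhs_numbers(page_texts: Iterable[str]) -> List[str]:
    """Collect NHS numbers from page texts, with all whitespace removed."""
    numbers: List[str] = []
    for text in page_texts:
        # Pages without the label produce no matches, so they are skipped.
        for raw in NHS_NO_PATTERN.findall(text):
            number = "".join(raw.split())
            if number:
                numbers.append(number)
    return numbers


def extract_nhs_no_sketch(file: str):
    """Hypothetical stand-in for extract_nhs_no_from_pdf; returns a pandas DataFrame."""
    # Imported here so find_nhs_numbers can be tried without these packages installed.
    import pandas as pd
    from pypdf import PdfReader

    reader = PdfReader(file)
    # Fall back to an empty string for pages that yield no text.
    page_texts = (page.extract_text() or "" for page in reader.pages)
    # "nhs_number" is an assumed column name.
    return pd.DataFrame({"nhs_number": find_nhs_numbers(page_texts)})


# Plain strings stand in for extracted page text, so no PDF file is needed here.
sample_pages = [
    "Name: Example\nNHS No: 943 476 5919\n",
    "A page with no matching label",
    "NHS No: 1234567890\nNHS No: 098 765 4321",
]
print(find_nhs_numbers(sample_pages))
```

Keeping the text matching separate from the PDF loading means the parsing rule can be checked against plain strings, while `extract_nhs_no_sketch("path/to/your/file.pdf")` would exercise the full path through `pypdf` and pandas.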
