-The PDF Reader utility allows for reading of PDF files and performing specific tasks on them.
+The PDF Reader utility allows for reading PDF files and extracting NHS numbers from them.

 ## Table of Contents

@@ -10,26 +10,54 @@ The PDF Reader utility allows for reading of PDF files and performing specific t
 - [Extract NHS No From PDF](#extract-nhs-no-from-pdf)
   - [Required Arguments](#required-arguments)
   - [How This Function Works](#how-this-function-works)
+  - [Example Usage](#example-usage)

 ## Functions Overview

-For this utility we have the following functions/methods:
+For this utility, the following function is available:

 - `extract_nhs_no_from_pdf`

 ### Extract NHS No From PDF

-This is called to extract all NHS numbers from a PDF file.
-The way it finds an NHS number is by looking for the string **"NHS No:"**
+This function extracts all NHS numbers from a PDF file by searching for the string **"NHS No:"** on each page.

 #### Required Arguments

 - `file`:
   - Type: `str`
-  - This is the file path stored as a string.
+  - The file path to the PDF file as a string.

 #### How This Function Works

-1. It starts off by storing the PDF file as a PdfReader object, this is from the `pypdf` package.
-2. Then it loops through each page.
-3. If it finds the string *"NHS No"* in the page, it extracts it and removes any whitespaces, then adds it to a pandas DataFrame - `nhs_no_df`
+1. Loads the PDF file using the `PdfReader` object from the `pypdf` package.
+2. Loops through each page of the PDF.
+3. Searches for the string *"NHS No"* on each page.
+4. If found, extracts the NHS number, removes any whitespaces, and adds it to a pandas DataFrame (`nhs_no_df`).
+5. Returns the DataFrame containing all extracted NHS numbers.
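
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the utility's actual code: the names `find_nhs_numbers`, `extract_nhs_no_from_pdf_sketch`, and `NHS_NO_PATTERN`, the `nhs_no` column name, and the assumption that a ten-digit number (optionally grouped with spaces) follows the "NHS No" label are all hypothetical. The text matching is kept separate from the PDF loading so it can be checked without a PDF file.

```python
import re
from typing import Iterable, List

# Assumed layout: "NHS No" with an optional colon, followed by ten digits
# that may be separated by whitespace (for example "943 476 5919").
NHS_NO_PATTERN = re.compile(r"NHS No:?\s*((?:\d\s*){10})")


def find_nhs_numbers(page_texts: Iterable[str]) -> List[str]:
    """Return every NHS number found in the page texts, with whitespace removed."""
    numbers = []
    for text in page_texts:
        for match in NHS_NO_PATTERN.finditer(text):
            # Strip spaces and newlines captured between or after the digits.
            numbers.append(re.sub(r"\s+", "", match.group(1)))
    return numbers


def extract_nhs_no_from_pdf_sketch(file: str):
    """Hypothetical stand-in for extract_nhs_no_from_pdf (not the project's code)."""
    # Imported here so find_nhs_numbers can be used without these packages installed.
    import pandas as pd
    from pypdf import PdfReader

    reader = PdfReader(file)  # step 1: load the PDF
    # Steps 2-3: walk the pages and pull out their text; extract_text() may
    # yield an empty result for image-only pages, so fall back to "".
    page_texts = (page.extract_text() or "" for page in reader.pages)
    # Steps 4-5: collect the cleaned numbers into a DataFrame and return it.
    return pd.DataFrame({"nhs_no": find_nhs_numbers(page_texts)})
```

Calling `find_nhs_numbers` on a list of plain strings is a quick way to check the matching logic against the label format used in your PDFs before running it on real files.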
+
+#### Example Usage
+
+You can use this utility to extract NHS numbers from a PDF file as part of the [`Batch Processing`](BatchProcessing.md) utility or by providing the file path as a string.
+
+**Extracting NHS numbers using a file path:**
+
+```python
+from utils.pdf_reader import extract_nhs_no_from_pdf
+file_path = "path/to/your/file.pdf"
+nhs_no_df = extract_nhs_no_from_pdf(file_path)
+```
+
+**Extracting NHS numbers using batch processing:**
+
+```python
+from utils.pdf_reader import extract_nhs_no_from_pdf
+get_subjects_from_pdf = True
+file = download_file.suggested_filename  # This is done via playwright when the "Retrieve button" on a batch is clicked.