Commit 3db84cf

Update README.md
1 parent fd40268 commit 3db84cf

1 file changed

+29
-9
lines changed

README.md

Lines changed: 29 additions & 9 deletions
@@ -2,21 +2,41 @@
22
Python Multiple PDF Documents Text Extraction - Python 3.7
33
![Logo](XPDF.jpg)
44

5-
## Resources
6-
- [Overview about PDF Processing with Python](https://towardsdatascience.com/pdf-preprocessing-with-python-19829752af9f)
7-
- **pdf2txt** tool forked from [pdfminer.six](https://github.com/pdfminer/pdfminer.six) project.
8-
- **merger** and **splitter** tools forked from [PyPDF2](https://github.com/mstamy2/PyPDF2) project.
5+
## Introduction
6+
**As a data scientist, you cannot restrict yourself to a single data format.**
7+
8+
PDFs are a good source of data; many organizations release their data only as PDFs. **As AI grows, we need more data for prediction and classification**; hence, ignoring PDFs as a data source could be a blunder.
9+
10+
*As you know, PDF processing falls under text analytics.*
11+
12+
13+
Most text analytics libraries and frameworks are written in Python, which gives it an advantage for text analytics. One more thing: you cannot process a PDF directly in existing machine learning or natural language processing frameworks. Unless they provide an explicit interface for this, **we have to convert the PDF to text first.**
14+
## Problem
15+
Most Python libraries for PDF processing, such as PyPDF2 and pdfminer.six, perform well at text extraction, but this performance is limited to a single sample PDF document.
16+
17+
That is why the **PDFs-TextExtract** project was developed: to **extract text from multiple and large PDF documents.**
918

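To illustrate the "convert PDF to text first" step for several documents at once, here is a minimal sketch. It is not the project's own script: it assumes the `extract_text` helper from `pdfminer.high_level` (shipped with pdfminer.six) rather than the forked **pdf2txt** tool, and the function name `extract_folder` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def extract_folder(
    samples_dir: str,
    output_dir: str,
    extractor: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Write one .txt file into output_dir for every PDF found in samples_dir.

    Hypothetical helper, not part of PDFs-TextExtract. Returns a mapping
    from each PDF file name to the path of the text file written for it.
    """
    if extractor is None:
        # Assumption: pdfminer.six is installed. It is imported lazily so
        # the helper can also be tried with a stand-in extractor.
        from pdfminer.high_level import extract_text
        extractor = extract_text

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written: Dict[str, str] = {}
    for pdf_path in sorted(Path(samples_dir).glob("*.pdf")):
        text = extractor(str(pdf_path))
        txt_path = out / (pdf_path.stem + ".txt")
        txt_path.write_text(text, encoding="utf-8")
        written[pdf_path.name] = str(txt_path)
    return written
```

Calling `extract_folder("samples", "output")` would process each PDF in turn; passing a custom `extractor` makes it possible to swap in a different conversion tool.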
1019
## Setup Environment
11-
- **Step 1:** Select Version of Python to Install from Python.org website.
20+
21+
- **Step 1:** Select the version of Python to install (Python 3.7) from the [Python.org](https://www.python.org/) website.
1222
- **Step 2:** Download Python Executable Installer.
1323
- **Step 3:** Run Executable Installer.
1424
- **Step 4:** Verify Python Was Installed On Windows.
1525
- **Step 5:** Verify Pip Was Installed.
1626
- **Step 6:** Add Python Path to Environment Variables (Optional).
17-
- **Step 7:** Install Python extension for your IDE.
18-
- **Step 8:** Now you’ll be able to execute python scripts with your IDE.
19-
- **Step 9:** *Terminal* : pip install pdfminer.six
20-
- **Step 10:** *Terminal* : pip install PyPDF2
27+
- **Step 7:** Install the Python extension for your IDE (Visual Studio Code).
28+
- **Step 8:** You can now execute Python scripts from your IDE (Visual Studio Code).
29+
- **Step 9:** Run this command in the IDE terminal: **pip install pdfminer.six**
30+
- **Step 10:** Run this command in the IDE terminal: **pip install PyPDF2**
2131

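The verification and install steps above can be double-checked from Python itself. The following is only a sketch: the function name `check_environment` is hypothetical, and it assumes the import names `pdfminer` (for pdfminer.six) and `PyPDF2`.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 7), modules=("pdfminer", "PyPDF2")):
    """Report whether the interpreter and the required libraries are available.

    Hypothetical helper: returns a dict with a "python_ok" flag plus one
    True/False entry per module name.
    """
    report = {"python_ok": sys.version_info[:2] >= tuple(min_version)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'OK' if ok else 'missing'}")
```

If either library is reported as missing, repeating the corresponding `pip install` step in the same interpreter's environment is the usual remedy.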
32+
## Usage
33+
- **Step 1:** Open the **..\PDFs-TextExtract\samples** folder and put your PDF documents inside.
34+
- **Step 2:** Execute **..\PDFs-TextExtract\Scripts\merged.py** script.
35+
- **Step 3:** Execute **..\PDFs-TextExtract\Scripts\spliter.py** script.
36+
- **Step 4:** Execute **..\PDFs-TextExtract\Scripts\extract_text.py** script.
37+
- **Step 5:** Open the **..\PDFs-TextExtract\output** folder; you will find the results there.
2238

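Steps 2 to 4 above can also be run in one go. This is a sketch under stated assumptions: the script names come from the usage steps, but `build_commands`, `run_pipeline`, and the `project_root` parameter are hypothetical, and the scripts may expect a particular working directory that this sketch does not set.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the usage steps; the order matters.
SCRIPTS = ("merged.py", "spliter.py", "extract_text.py")


def build_commands(project_root):
    """Return one command per script, in the order the steps list them."""
    scripts_dir = Path(project_root) / "Scripts"
    return [[sys.executable, str(scripts_dir / name)] for name in SCRIPTS]


def run_pipeline(project_root):
    """Run each script in turn and stop at the first failure."""
    for command in build_commands(project_root):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling `run_pipeline` with the path to your local PDFs-TextExtract checkout would execute the merge, split, and extraction scripts in sequence using the current Python interpreter.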
39+
## Resources
40+
- [Overview about PDF Processing with Python](https://towardsdatascience.com/pdf-preprocessing-with-python-19829752af9f)
41+
- **pdf2txt** tool forked from [pdfminer.six](https://github.com/pdfminer/pdfminer.six) project.
42+
- **merger** and **spliter** tools forked from [PyPDF2](https://github.com/mstamy2/PyPDF2) project.
