|
2 | 2 | Python Multiple PDF Documents Text Extraction - Python 3.7 |
3 | 3 |  |
4 | 4 |
|
5 | | -## Resources |
6 | | -- [Overview about PDF Processing with Python](https://towardsdatascience.com/pdf-preprocessing-with-python-19829752af9f) |
7 | | -- **pdf2txt** tool forked from [pdfminer.six](https://github.com/pdfminer/pdfminer.six) project. |
8 | | -- **merger** and **splitter** tools forked from [PyPDF2](https://github.com/mstamy2/PyPDF2) project. |
| 5 | +## Introduction |
| 6 | +**As a data scientist, you cannot limit yourself to a single data format.**
| 7 | + |
| 8 | +PDFs are a good source of data, and many organizations release their data only as PDFs. **As AI grows, we need more data for prediction and classification**; hence, ignoring PDFs as a data source could be a blunder.
| 9 | + |
| 10 | +*PDF processing falls under text analytics.*
| 11 | + |
| 12 | + |
| 13 | +Most text analytics libraries and frameworks are written in Python, which gives Python an advantage for text analytics. Note also that you cannot process a PDF directly with existing machine learning or natural language processing frameworks. Unless a framework provides an explicit interface for this, **we have to convert the PDF to text first.**
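The conversion step described above can be sketched as follows. This is a minimal illustration, not the project's own code: it assumes a version of pdfminer.six that provides `pdfminer.high_level.extract_text`, and the function name `pdf_to_text` and its `extractor` parameter are hypothetical names introduced here.

```python
def pdf_to_text(pdf_path, extractor=None):
    """Return the text content of one PDF file as a string.

    `extractor` is a hypothetical hook: any callable that takes a file
    path and returns text. It defaults to pdfminer.six.
    """
    if extractor is None:
        # Imported lazily so this module loads even when pdfminer.six
        # is not installed yet.
        from pdfminer.high_level import extract_text
        extractor = extract_text
    return extractor(str(pdf_path))


# Usage (assumes a file named report.pdf exists in the current folder):
# text = pdf_to_text("report.pdf")
# print(text[:200])
```

Keeping the extractor replaceable makes it possible to try a different library later without touching the code that consumes the text.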
| 14 | +## Problem
| 15 | +Most Python libraries for PDF processing, such as PyPDF2 and pdfminer.six, perform well at text extraction, but that performance is limited to a single sample PDF document.
| 16 | + |
| 17 | +That is why the **PDFs-TextExtract** project was developed: to **extract text from multiple and large PDF documents.**
9 | 18 |
|
10 | 19 | ## Setup Environment |
11 | | -- **Step 1:** Select Version of Python to Install from Python.org website. |
| 20 | + |
| 21 | +- **Step 1:** Select the version of Python to install (Python 3.7) from the [Python.org](https://www.python.org/) website.
12 | 22 | - **Step 2:** Download Python Executable Installer. |
13 | 23 | - **Step 3:** Run Executable Installer. |
14 | 24 | - **Step 4:** Verify Python Was Installed On Windows. |
15 | 25 | - **Step 5:** Verify Pip Was Installed. |
16 | 26 | - **Step 6:** Add Python Path to Environment Variables (Optional). |
17 | | -- **Step 7:** Install Python extension for your IDE. |
18 | | -- **Step 8:** Now you’ll be able to execute python scripts with your IDE. |
19 | | -- **Step 9:** *Terminal* : pip install pdfminer.six |
20 | | -- **Step 10:** *Terminal* : pip install PyPDF2 |
| 27 | +- **Step 7:** Install the Python extension for your IDE (Visual Studio Code).
| 28 | +- **Step 8:** Now you’ll be able to execute Python scripts with your IDE (Visual Studio Code).
| 29 | +- **Step 9:** Run this command in the IDE's terminal: **pip install pdfminer.six**
| 30 | +- **Step 10:** Run this command in the IDE's terminal: **pip install PyPDF2**
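Once the setup steps are done, a quick check like the one below can confirm that both packages are importable. This is only a sketch: the helper name `missing_packages` is hypothetical, and it relies on the fact that pdfminer.six is imported as `pdfminer` and PyPDF2 as `PyPDF2`.

```python
import importlib.util

# Maps each pip package name to the module name it is imported under.
REQUIRED = {"pdfminer.six": "pdfminer", "PyPDF2": "PyPDF2"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


# Usage:
# for name in missing_packages():
#     print("Not installed, run: pip install " + name)
```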
21 | 31 |
|
| 32 | +## Usage |
| 33 | +- **Step 1:** Open the **..\PDFs-TextExtract\samples** folder and put your PDF documents inside.
| 34 | +- **Step 2:** Execute the **..\PDFs-TextExtract\Scripts\merged.py** script.
| 35 | +- **Step 3:** Execute the **..\PDFs-TextExtract\Scripts\spliter.py** script.
| 36 | +- **Step 4:** Execute the **..\PDFs-TextExtract\Scripts\extract_text.py** script.
| 37 | +- **Step 5:** Open **..\PDFs-TextExtract\output** and you will find the result there.
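Steps 2 to 4 above can also be chained from a small driver script, sketched below. The script names and their order come from the steps; everything else is an assumption for illustration: the helper names `build_commands` and `run_pipeline` are hypothetical, the project root must be supplied by the caller, and the scripts are assumed to run correctly when launched from that root.

```python
import subprocess
import sys
from pathlib import Path

# Script names and order taken from the Usage steps.
PIPELINE = ("merged.py", "spliter.py", "extract_text.py")


def build_commands(project_root):
    """Return one interpreter command per script, in pipeline order."""
    scripts_dir = Path(project_root) / "Scripts"
    return [[sys.executable, str(scripts_dir / name)] for name in PIPELINE]


def run_pipeline(project_root):
    """Run each script in turn and stop at the first failure."""
    for command in build_commands(project_root):
        # Assumption: the scripts expect the project root as working directory.
        subprocess.run(command, cwd=str(project_root), check=True)


# Usage (replace the argument with the path to your PDFs-TextExtract folder):
# run_pipeline("PDFs-TextExtract")
```

Using `check=True` means a failing merge or split stops the run before text extraction starts on incomplete input.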
22 | 38 |
|
| 39 | +## Resources |
| 40 | +- [Overview about PDF Processing with Python](https://towardsdatascience.com/pdf-preprocessing-with-python-19829752af9f) |
| 41 | +- **pdf2txt** tool forked from [pdfminer.six](https://github.com/pdfminer/pdfminer.six) project. |
| 42 | +- **merger** and **spliter** tools forked from [PyPDF2](https://github.com/mstamy2/PyPDF2) project. |