|
1 | | -# PDFs-TextExtract |
2 | | -Text extraction from multiple and large PDF documents in Python (Python 3.7) |
3 | | - |
4 | | - |
5 | | - |
6 | | - |
7 | | -## Introduction |
8 | | -**As a data scientist, you cannot limit yourself to a single data format.** |
9 | | - |
10 | | -PDFs are a good source of data, and many organizations release their data only as PDFs. **As AI grows, we need more data for prediction and classification**; ignoring PDFs as a data source could therefore be a blunder. |
11 | | - |
12 | | -*PDF processing falls under text analytics.* |
13 | | - |
14 | | - |
15 | | -Most text analytics libraries and frameworks are written in Python, which makes it a natural choice for this kind of work. Existing machine learning and natural language processing frameworks generally cannot process a PDF directly. Unless they provide an explicit interface for it, **we have to convert the PDF to text first.** |
16 | | -## Problem |
17 | | -Most Python libraries for PDF processing, such as PyPDF2 and pdfminer.six, can extract text, but they perform well only on small and simple PDF documents. |
18 | | - |
19 | | -That is why the **PDFs-TextExtract** project was developed: to **extract text from multiple and large PDF documents.** |
20 | | - |
21 | | -## Setup Environment |
22 | | - |
23 | | -#### To use the scripts on macOS, modify them to remove "/PDFs-TextExtract" from the path. |
24 | | - |
25 | | -- **Step 1:** Select the version of Python to install (Python 3.7) from the [Python.org](https://www.python.org/) website. |
26 | | -- **Step 2:** Download Python Executable Installer. |
27 | | -- **Step 3:** Run Executable Installer. |
28 | | -- **Step 4:** Verify Python Was Installed On Windows. |
29 | | -- **Step 5:** Verify Pip Was Installed. |
30 | | -- **Step 6:** Add Python Path to Environment Variables (Optional). |
31 | | -- **Step 7:** Install Python extension for your IDE (Visual Studio Code). |
32 | | -- **Step 8:** Now you’ll be able to execute Python scripts with your IDE (Visual Studio Code). |
33 | | -- **Step 9:** |
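
Steps 4 and 5 above (verifying Python and pip) can also be checked from inside Python itself. The following is only a minimal sketch: the helper names `python_is_recent_enough` and `pip_is_available` are hypothetical and not part of this project, and the 3.7 threshold is taken from the version this README names.

```python
import importlib.util
import sys

# Assumption: the project targets Python 3.7, the version named in this README.
REQUIRED = (3, 7)


def python_is_recent_enough(version_info=sys.version_info, required=REQUIRED):
    """Return True if the given interpreter version is at least `required`."""
    return tuple(version_info[:2]) >= tuple(required)


def pip_is_available():
    """Return True if the pip package is importable by this interpreter."""
    return importlib.util.find_spec("pip") is not None


print("Python version:", sys.version.split()[0])
print("Python >= 3.7:", python_is_recent_enough())
print("pip available:", pip_is_available())
```

Running the same interpreter that your IDE uses is the point of the check: a different Python on the system path may have a different version or no pip at all.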
34 | | - |
35 | | -## Install dependencies |
36 | | - |
37 | | - pip install -r requirements.txt |
38 | | - |
39 | | -## Usage |
40 | | -- **Step 1:** Open the **..\PDFs-TextExtract-master\samples** folder and put your PDF documents inside. |
41 | | -- **Step 2:** Execute the **..\PDFs-TextExtract-master\Scripts\merged.py** script. |
42 | | -- **Step 3:** Execute the **..\PDFs-TextExtract-master\Scripts\spliter.py** script. |
43 | | -- **Step 4:** Execute the **..\PDFs-TextExtract-master\Scripts\extract_text.py** script. |
44 | | -- **Step 5:** Open the **..\PDFs-TextExtract-master\output** folder, where you will find the result. |
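
Steps 2 to 4 above can be chained from a single Python driver. This is a hedged sketch, not the project's own tooling: `build_commands` and `run_pipeline` are hypothetical helpers, and it assumes the three scripts take no command-line arguments and can be launched from the current working directory. The scripts may rely on hard-coded paths, so check them before using this.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from the usage steps: merge, then split, then extract.
SCRIPT_ORDER = ["merged.py", "spliter.py", "extract_text.py"]


def build_commands(scripts_dir, python=sys.executable):
    """Return one command per script, in the order the usage steps list them."""
    scripts_dir = Path(scripts_dir)
    return [[python, str(scripts_dir / name)] for name in SCRIPT_ORDER]


def run_pipeline(scripts_dir):
    """Run each script in turn, stopping at the first one that fails."""
    for command in build_commands(scripts_dir):
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(command, check=True)


# Example call (the path is an assumption; adjust it to your checkout):
# run_pipeline(Path("PDFs-TextExtract-master") / "Scripts")
```

Stopping at the first failure matters here because each step consumes the previous step's output, so running the extraction after a failed merge or split would only produce misleading results.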
45 | | - |
46 | | -## With a Bash script |
47 | | -Execute: |
48 | | - sh main.sh |
49 | | - |
50 | | -## Resources |
51 | | -- [Overview about PDF Processing with Python](https://towardsdatascience.com/pdf-preprocessing-with-python-19829752af9f) |
52 | | -- **pdf2txt** tool forked from [pdfminer.six](https://github.com/pdfminer/pdfminer.six) project. |
53 | | -- **merger** and **spliter** tools forked from [PyPDF2](https://github.com/mstamy2/PyPDF2) project. |