Commit 72f787f

new optimized version for data extraction API
1 parent 2074a4b commit 72f787f

46 files changed, 335 additions and 285 deletions

.gitignore

Lines changed: 0 additions & 3 deletions
This file was deleted.

CODE_OF_CONDUCT.md

Lines changed: 0 additions & 76 deletions
This file was deleted.

Dockerfile

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+FROM python:3.7
+COPY . /app/
+WORKDIR /app
+RUN pip install -r requirements.txt
+ENTRYPOINT ["python3"]
+CMD ["app.py"]
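As an aside on how the last two Dockerfile lines combine: with the exec form, CMD supplies default arguments to ENTRYPOINT, and any arguments given to `docker run` replace CMD. The small Python model below only mirrors that rule; the function name `effective_command` is hypothetical and is not part of this commit or of Docker.

```python
from typing import List, Optional

# Values copied from the Dockerfile added in this commit.
ENTRYPOINT = ["python3"]
CMD = ["app.py"]


def effective_command(
    entrypoint: List[str],
    cmd: List[str],
    run_args: Optional[List[str]] = None,
) -> List[str]:
    """Model the process argv for an exec-form ENTRYPOINT plus CMD.

    Arguments passed to `docker run <image> ...` replace CMD; otherwise
    CMD is appended to ENTRYPOINT as the default arguments.
    """
    args = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + args


# No extra arguments: the container runs `python3 app.py`.
print(effective_command(ENTRYPOINT, CMD))  # ['python3', 'app.py']

# Extra arguments replace CMD but keep the ENTRYPOINT.
print(effective_command(ENTRYPOINT, CMD, ["--version"]))  # ['python3', '--version']
```

Under this model, the image built from the Dockerfile above starts `python3 app.py` by default.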

LICENSE

Lines changed: 0 additions & 21 deletions
This file was deleted.

README.md

Lines changed: 0 additions & 53 deletions
@@ -1,53 +0,0 @@
-# PDFs-TextExtract
-Python Multiple and Large PDF Documents Text Extraction - Python 3.7
-![Logo](XPDF.jpg)
-
-
-
-## Introduction
-**As a data scientist, you may not be able to stick to a single data format.**
-
-PDFs are a good source of data; many organizations release their data only as PDFs. **As AI grows, we need more data for prediction and classification**; hence, ignoring PDFs as a data source could be a blunder.
-
-*As you know, PDF processing comes under text analytics.*
-
-
-Most text analytics libraries and frameworks are written in Python, which gives Python leverage in text analytics. However, existing machine learning and natural language processing frameworks cannot process a PDF directly. Unless they provide an explicit interface for this, **we have to convert the PDF to text first.**
-## Problem
-Most Python libraries for PDF processing, such as PyPDF2 and pdfminer.six, can extract text, but their performance is limited to small and simple PDF documents.
-
-That's why the **PDFs-TextExtract** project was developed: to **extract text from multiple and large PDF documents.**
-
-## Setup Environment
-
-#### For use with macOS, the scripts will need to be modified to remove "/PDFs-TextExtract" from the path.
-
-- **Step 1:** Select the version of Python (Python 3.7) to install from the [Python.org](https://www.python.org/) website.
-- **Step 2:** Download the Python executable installer.
-- **Step 3:** Run the executable installer.
-- **Step 4:** Verify that Python was installed on Windows.
-- **Step 5:** Verify that pip was installed.
-- **Step 6:** Add the Python path to the environment variables (optional).
-- **Step 7:** Install the Python extension for your IDE (Visual Studio Code).
-- **Step 8:** Now you'll be able to execute Python scripts with your IDE (Visual Studio Code).
-- **Step 9:**
-
-## Install dependencies
-
-pip install -r requirements.txt
-
-## Usage
-- **Step 1:** Open the **..\PDFs-TextExtract-master\samples** folder and put your PDF documents inside.
-- **Step 2:** Execute the **..\PDFs-TextExtract-master\Scripts\merged.py** script.
-- **Step 3:** Execute the **..\PDFs-TextExtract-master\Scripts\spliter.py** script.
-- **Step 4:** Execute the **..\PDFs-TextExtract-master\Scripts\extract_text.py** script.
-- **Step 5:** Open **..\PDFs-TextExtract-master\output** and you will find the result there.
-
-## With bash script
-Execute
-sh main.sh
-
-## Resources
-- [Overview about PDF Processing with Python](https://towardsdatascience.com/pdf-preprocessing-with-python-19829752af9f)
-- **pdf2txt** tool forked from the [pdfminer.six](https://github.com/pdfminer/pdfminer.six) project.
-- **merger** and **spliter** tools forked from the [PyPDF2](https://github.com/mstamy2/PyPDF2) project.
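The removed scripts are not shown in this commit view, so the following is only a rough sketch of the final step the deleted README describes: reading PDFs from a `samples` folder and writing text files to an `output` folder. The function names `pdfminer_extract` and `extract_folder` are hypothetical, and the sketch assumes pdfminer.six is installed and uses its `pdfminer.high_level.extract_text` helper; it is not the code that was deleted.

```python
from pathlib import Path
from typing import Callable, List


def pdfminer_extract(pdf_path: Path) -> str:
    """Extract text from one PDF using pdfminer.six.

    The import is done lazily so the rest of the module can be used
    (for example with a different extractor) without pdfminer.six.
    """
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    return extract_text(str(pdf_path))


def extract_folder(
    samples_dir: Path,
    output_dir: Path,
    extractor: Callable[[Path], str] = pdfminer_extract,
) -> List[Path]:
    """Write one .txt file per *.pdf found in samples_dir.

    Returns the paths of the text files written, in sorted order.
    """
    samples = Path(samples_dir)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf in sorted(samples.glob("*.pdf")):
        target = out / (pdf.stem + ".txt")
        target.write_text(extractor(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Calling `extract_folder(Path("samples"), Path("output"))` would process every PDF in `samples`. Passing the extractor as a parameter is a design choice of this sketch: it keeps the folder-walking logic testable without real PDF files.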

Scripts/extract_text.py

Lines changed: 0 additions & 63 deletions
This file was deleted.

Scripts/merged.py

Lines changed: 0 additions & 15 deletions
This file was deleted.

Scripts/spliter.py

Lines changed: 0 additions & 39 deletions
This file was deleted.
3.14 KB
Binary file not shown.
1.19 KB
Binary file not shown.
