A modular FastAPI backend that extracts text, tables, and images from uploaded PDF files.
The extracted content is returned as a ZIP file containing:
- ✅ text.txt – extracted text
- ✅ table_X.csv – tables saved as CSV (written with a lightweight CSV writer, without pandas)
- ✅ extracted image files (image_1.png, image_2.png, …)
The project has a clean modular structure, ships with a Dockerized deployment, and can be consumed by any frontend (e.g., Streamlit).
- Upload a PDF via REST API
- Extract:
  - Text (saved as .txt)
  - Tables (saved as .csv without pandas, using Python's built-in csv module)
  - Images (saved as .png)
- Get everything in a single downloadable ZIP file
- Modular project structure (services, utils, routes)
- Dockerized for easy deployment
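The CSV and ZIP steps listed above can be sketched with the standard library alone. This is a minimal sketch, not the project's actual code. It assumes each table arrives as a list of rows whose cells are strings or None, and that images are already PNG-encoded bytes. The helper names `table_to_csv_bytes` and `build_zip` are hypothetical.

```python
import csv
import io
import zipfile


def table_to_csv_bytes(rows):
    """Serialize one table (a list of rows) to CSV bytes with the csv module."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        # Assumption: empty cells may be None; write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue().encode("utf-8")


def build_zip(text, tables, images):
    """Bundle text, tables, and PNG bytes into an in-memory ZIP archive."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("text.txt", text)
        for index, rows in enumerate(tables, start=1):
            zf.writestr(f"table_{index}.csv", table_to_csv_bytes(rows))
        for index, png_bytes in enumerate(images, start=1):
            zf.writestr(f"image_{index}.png", png_bytes)
    return archive.getvalue()
```

Building the archive in memory avoids temporary files, which may suit a stateless container. For very large PDFs, writing to disk could be the safer choice.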
pdf-extractor-backend/
│── app/
│ ├── main.py # FastAPI entrypoint
│ ├── routes/
│ │ └── extract.py # API endpoint
│ ├── services/
│ │ └── extractor.py # PDF extraction logic
│ ├── utils/
│ │ └── file_ops.py # File saving helpers
│── requirements.txt # Python dependencies
│── Dockerfile # Container build file
│── README.md # Documentation
git clone https://github.com/Dipesh-Ydv/pdf-extractor-backend-api.git
cd pdf-extractor-backend-api
pip install -r requirements.txt
uvicorn app.main:app --reload
Go to: http://127.0.0.1:8000/docs
POST /extract/pdf
Upload a PDF file with the form key "file".
Example using curl:
curl -X POST "http://127.0.0.1:8000/extract/pdf" \
-F "file=@your_file.pdf" \
-o output.zip
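The same request as the curl call above can be made from Python with only the standard library. This is a sketch under assumptions: the server is running locally on port 8000 as described earlier, and the function names `build_multipart` and `extract_pdf` are hypothetical, not part of this project.

```python
import os
import urllib.request
import uuid


def build_multipart(field_name, filename, payload, content_type="application/pdf"):
    """Encode one file as multipart/form-data; return (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_pdf(pdf_path, out_path="output.zip",
                url="http://127.0.0.1:8000/extract/pdf"):
    """POST the PDF under the form key 'file' and save the returned ZIP."""
    with open(pdf_path, "rb") as fh:
        body, header = build_multipart("file", os.path.basename(pdf_path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
    return out_path
```

Calling `extract_pdf("your_file.pdf")` would write `output.zip` to the current directory, mirroring the `-o output.zip` flag.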
- Returns a ZIP file containing:
  - text.txt
  - table_1.csv, table_2.csv, …
  - image_1.png, image_2.png, …
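A client can unpack the returned archive without touching the disk. The sketch below groups the members by the naming scheme listed above. It assumes text.txt and the CSV files are UTF-8 encoded, which the project does not state, and the function name `read_extraction` is hypothetical.

```python
import csv
import io
import zipfile


def read_extraction(zip_bytes):
    """Split the archive into text, parsed tables, and image file names."""
    result = {"text": None, "tables": {}, "images": []}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name == "text.txt":
                # Assumption: the extracted text is UTF-8 encoded.
                result["text"] = zf.read(name).decode("utf-8")
            elif name.startswith("table_") and name.endswith(".csv"):
                content = zf.read(name).decode("utf-8")
                reader = csv.reader(io.StringIO(content, newline=""))
                result["tables"][name] = list(reader)
            elif name.startswith("image_") and name.endswith(".png"):
                result["images"].append(name)
    return result
```

Each table comes back as a list of rows of strings, so no pandas dependency is needed on the client side either.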
docker build -t pdf-extractor-backend .
docker run -d -p 8000:8000 pdf-extractor-backend
The API is now available at: 👉 http://localhost:8000/docs
docker tag pdf-extractor-backend:latest dipeshydv/pdf-extractor-backend:latest
docker push dipeshydv/pdf-extractor-backend:latest
docker pull dipeshydv/pdf-extractor-backend:latest
docker run -d -p 8000:8000 dipeshydv/pdf-extractor-backend:latest
See requirements.txt:
fastapi
uvicorn[standard]
python-multipart
pdfplumber
pillow
pandas
zipfile36
pyMuPdf
- Fork the project
- Create a feature branch (git checkout -b feature/xyz)
- Commit changes (git commit -m 'Add xyz')
- Push to branch (git push origin feature/xyz)
- Create a Pull Request
MIT License – free to use & modify.