pip install document_text_extractor==1.0.0
├── document_text_extractor
│ ├── document_text_extractor
│ │ ├── data
│ │ ├── __init__.py
│ │ ├── CommonOperations.py
│ │ ├── Configuration.py
│ │ ├── GetTextFromImage.py
│ │ ├── DocumentTextExtractor.py
│ │ ├── PDFFileDplitter.py
│ ├── MANIFEST.in
│ ├── README.md
│ ├── setup.py
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-eng
Current Version: v1.0.0
Follow these steps to set up and run the Document Text Extractor on your local machine.
- Python 3.8 or higher (the pinned FastAPI 0.109.0 and Pillow 10.2.0 releases do not support older versions)
- Clone the repository:
git clone https://github.com/harshad208/document_text_extractor.git
- Navigate to the project directory:
cd document_text_extractor/document_text_extractor
- Create and activate a virtual environment (on Windows, activate with venv\Scripts\activate instead):
python -m venv venv
source venv/bin/activate
- Install the Python dependencies:
pip install -r requirements.txt
- For Linux (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-eng # Optional: English language pack
- For Windows: download and install Tesseract OCR from https://github.com/tesseract-ocr/tesseract, add Tesseract to your system's PATH, then verify the installation by running tesseract --version in the command prompt.
- For macOS:
brew install tesseract
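Whichever platform you are on, the setup steps above end with the same requirement: the tesseract binary must be reachable on PATH. The following is a minimal sketch of checking that from Python using only the standard library. The helper name tesseract_version is hypothetical and is not part of this package.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if unavailable.

    Hypothetical helper for illustration; not part of document_text_extractor.
    """
    # shutil.which mirrors what the shell does when resolving a command name.
    path = shutil.which(binary)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some Tesseract releases write the version banner to stderr, so fall
    # back to it when stdout is empty.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("Tesseract was not found on PATH; revisit the installation steps.")
    else:
        print(f"Found: {version}")
```

If the script reports that Tesseract is missing even though it is installed, the usual cause is that its install directory has not been added to PATH (most commonly on Windows).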
The following Python packages are required to run this project. You can install them using the
provided requirements.txt file:
uvicorn==0.25.0
fastapi==0.109.0
python-multipart==0.0.6
pytesseract==0.3.10
Pillow==10.2.0
PyMuPDF==1.23.12
# v1.0.0
1. Initial phase
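To confirm that an environment matches the pinned requirements listed earlier, one option is to compare each pin against the installed distribution metadata. This is a sketch under stated assumptions: the pins are copied inline rather than read from requirements.txt, only name==version lines are handled, and the helper names (parse_pins, installed_version, check_pins) are hypothetical.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the requirements list in this README.
PINNED = """\
uvicorn==0.25.0
fastapi==0.109.0
python-multipart==0.0.6
pytesseract==0.3.10
Pillow==10.2.0
PyMuPDF==1.23.12
"""


def parse_pins(text: str) -> Dict[str, str]:
    """Parse simple `name==version` lines, ignoring blanks and comments."""
    pins: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package name to a (pinned, installed) version pair."""
    return {name: (pinned, installed_version(name)) for name, pinned in pins.items()}


if __name__ == "__main__":
    for name, (pinned, found) in check_pins(parse_pins(PINNED)).items():
        status = "ok" if found == pinned else "MISMATCH"
        print(f"{name}: pinned {pinned}, installed {found} [{status}]")
```

A mismatch is not necessarily a problem, but it is a useful first thing to rule out when OCR or PDF handling behaves differently from what is expected.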