Image to CSV Converter

A Python tool that converts PDF and image files (JPG, PNG) to CSV format using OCR (Optical Character Recognition). This tool is particularly useful for extracting tabular data from documents and images.

Features

- Convert PDF files to CSV
- Convert image files (JPG, PNG) to CSV
- Automatic table detection and extraction
- Image enhancement for better OCR results
- Support for multi-page PDFs
- Clean and organized output in CSV format

Prerequisites

- Python 3.8 or higher
- Tesseract OCR installed on your system

Installing Tesseract OCR

macOS

brew install tesseract

Ubuntu/Debian

sudo apt-get update
sudo apt-get install tesseract-ocr

Windows

Download the installer from UB Mannheim
Run the installer
Add Tesseract to your system PATH
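
Whichever platform you are on, it can help to confirm both prerequisites before running the converter. The snippet below is a minimal sketch, not part of this project: `check_prerequisites` is a hypothetical helper, and it assumes the Tesseract executable is named `tesseract` and is expected on PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8), tesseract_cmd="tesseract"):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # Compare the running interpreter against the minimum version.
    if sys.version_info < min_python:
        found = ".".join(map(str, sys.version_info[:3]))
        needed = ".".join(map(str, min_python))
        problems.append(f"Python {needed}+ required, found {found}")
    # shutil.which searches PATH the same way the shell would.
    if shutil.which(tesseract_cmd) is None:
        problems.append(f"'{tesseract_cmd}' not found on PATH")
    return problems
```

Calling `check_prerequisites()` and printing each returned string gives a quick report; an empty list suggests the environment is ready.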

Installation

Clone the repository:

git clone https://github.com/yourusername/image-to-csv.git
cd image-to-csv

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Unix/macOS
# or
.venv\Scripts\activate  # On Windows

Install the package and its dependencies in editable mode:

pip install -e .

Usage

Place your PDF or image files (JPG, PNG) in the input/ directory.
Run the converter from the repository root:

python -m image_to_csv

Find the converted CSV files in the output directory
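
To illustrate the first step, here is one way a converter could discover the files it should process. This is a sketch under assumptions rather than the project's actual code: `find_inputs` and `SUPPORTED` are hypothetical names, and accepting `.jpeg` alongside `.jpg` is a guess that the README does not confirm.

```python
from pathlib import Path

# Assumed extension set: the README names PDF, JPG and PNG; ".jpeg" is a guess.
SUPPORTED = {".pdf", ".jpg", ".jpeg", ".png"}


def find_inputs(input_dir):
    """Return the supported files directly inside input_dir, sorted by path."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        # Skip subdirectories and match extensions case-insensitively.
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

For example, `find_inputs("input")` would return `Path` objects for the PDFs and images in `input/`, ignoring anything else stored there.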

File Naming Convention

For PDFs: {original_name}_page{page_number}.csv
For images: {original_name}.csv
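
The convention above can be expressed as a small function. This is an illustrative sketch: `output_name` is a hypothetical helper, and whether the tool numbers pages from 1 or from 0 is not stated here, so the caller supplies the page number.

```python
from pathlib import Path


def output_name(source, page=None):
    """Build the CSV file name for a source file.

    Pass a page number for PDFs; leave it as None for images.
    """
    stem = Path(source).stem  # file name without directory or extension
    if page is None:
        return f"{stem}.csv"
    return f"{stem}_page{page}.csv"
```

With this helper, `output_name("input/invoice.pdf", 2)` yields `invoice_page2.csv` and `output_name("input/scan.png")` yields `scan.csv`.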

How It Works

PDF Processing:
- Converts PDF pages to images
- Enhances image quality for better OCR
- Performs OCR on each page
- Detects and extracts tabular data
- Saves each page as a separate CSV file

Image Processing:
- Enhances image quality
- Performs OCR
- Detects and extracts tabular data
- Saves as CSV file
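
The "detects and extracts tabular data" step is not described in detail here. One simple approach, shown below purely as an assumption and not as this project's method, is to take the word boxes an OCR engine reports (for example the `left`, `top`, and `text` fields returned by pytesseract's `image_to_data`) and group them into rows by vertical position. The names `words_to_rows` and `rows_to_csv`, and the 10-pixel tolerance, are hypothetical. The sketch treats each word as one cell, so multi-word cells would need extra column clustering.

```python
import csv
import io


def words_to_rows(words, row_tolerance=10):
    """Group (left, top, text) word boxes into rows of cell strings.

    A word joins the current row when its top coordinate is within
    row_tolerance pixels of the first word in that row; cells within
    a row are ordered left to right.
    """
    rows = []
    for left, top, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(top - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})
    return [[text for _, text in sorted(r["cells"])] for r in rows]


def rows_to_csv(rows):
    """Serialize rows to CSV text with the standard library writer."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Feeding in words such as `(10, 10, "Item")`, `(120, 12, "Qty")`, `(10, 42, "Apple")`, `(120, 40, "3")` groups them into a header row and a data row, which `rows_to_csv` then writes out as two CSV lines.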

Project Structure

image-to-csv/
├── input/          # Input directory for PDF and image files
├── output/         # Output directory for CSV files
├── image_to_csv/   # Source code
│   ├── __init__.py
│   ├── __main__.py
│   └── converter.py
├── pyproject.toml  # Project configuration
└── README.md       # This file

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
