Commit 25c38ab

Update README.md
1 parent 2fea064 commit 25c38ab

1 file changed: +34 -23 lines changed

README.md

Lines changed: 34 additions & 23 deletions
@@ -1,7 +1,11 @@
11
# Table Extraction
22

3+
![Docker Cloud Build Status](https://img.shields.io/docker/cloud/build/abdullahibneat/table-extraction)
4+
35
![Extracting tabular data as JSON data](https://i.imgur.com/vUUQ4g1.png)
46

7+
![The web interface](https://i.imgur.com/on76ccg.png)
8+
59
## Overview
610

711
This framework was developed as part of my undergraduate final year project at University and allows for the extraction of tabular data from raster images. It uses **line information** to locate cells, and an algorithm arranges the cells in memory to reconstruct the tabular structure. It then uses the Tesseract OCR engine to extract the text and returns the entire table as JSON data. It achieved 89% cell detection accuracy when extracting prayer times from timetables (see `data` folder for some examples).
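
The cell-arrangement step described above can be illustrated with a minimal sketch. This is not the framework's actual algorithm: it assumes each detected cell is an `(x, y, width, height)` bounding box, and the names `Box`, `arrange_cells` and `row_tolerance` are hypothetical. It groups boxes into rows by their top edge, then orders each row left to right.

```python
from typing import List, Tuple

# (x, y, width, height) of a detected cell; this layout is an assumption.
Box = Tuple[int, int, int, int]


def arrange_cells(boxes: List[Box], row_tolerance: int = 10) -> List[List[Box]]:
    """Group cell boxes into rows by vertical position, then sort each row left to right."""
    rows: List[List[Box]] = []
    # Visit boxes top to bottom so rows are built in reading order.
    for box in sorted(boxes, key=lambda b: b[1]):
        # Join the current row if the top edge is within tolerance of that
        # row's first cell; otherwise start a new row.
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]
```

A fixed pixel tolerance is the simplest choice here; skewed or densely packed tables would likely need something more robust.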
@@ -17,41 +21,48 @@ Below is a summary of how the framework works. This structure is reflected in `T
1721

1822
![Overview of processes involved](https://i.imgur.com/oz6YSGK.jpg)
1923

20-
## Tesseract setup
24+
## Docker
2125

22-
Follow the instruction from [https://github.com/sirfz/tesserocr](https://github.com/sirfz/tesserocr).
26+
This is the recommended way to run this project as the environment is all set up and ready to use. For convenience, Docker images are automatically built and released on [Docker Hub](https://hub.docker.com/repository/docker/abdullahibneat/table-extraction).
2327

24-
## Get started
28+
To run the Docker container locally:
2529

26-
1. Make sure Python 3.7.x is installed. `❗❗❗THIS IS IMPORTANT❗❗❗`
27-
2. Set up a Python 3.7 [virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)
28-
3. Install the requirements (tesserocr might require extra steps, see below): `pip install -r requirements.txt`
29-
4. Run the `main.py` file
30+
```
31+
docker pull abdullahibneat/table-extraction
32+
docker run -d -p 5000:5000 abdullahibneat/table-extraction
33+
```
3034

31-
## Flask API server
35+
Then visit http://localhost:5000 and you're ready to go!
3236

33-
A simple Flask API was written to interact with the table extractor. Run the `server.py` file with Flask:
37+
When using a cloud provider, you can change the port by setting the `PORT` environment variable. In Heroku, the port is set automatically so this repository can simply be pushed to the Heroku remote.
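
As a rough illustration of the `PORT` behaviour, a server entry point might resolve its port as sketched below. How this project actually reads `PORT` is not shown here; `resolve_port` and `DEFAULT_PORT` are hypothetical names, and the default of 5000 is taken from the `docker run` example above.

```python
import os
from typing import Mapping, Optional

DEFAULT_PORT = 5000  # assumed default, matching the docker run example


def resolve_port(environ: Optional[Mapping[str, str]] = None) -> int:
    """Return the port from PORT, falling back to DEFAULT_PORT when unset or empty."""
    env = os.environ if environ is None else environ
    raw = env.get("PORT", "").strip()
    return int(raw) if raw else DEFAULT_PORT
```

Passing the environment as an argument keeps the function easy to test without modifying `os.environ`.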
3438

35-
```
36-
FLASK_APP=server
37-
flask run
38-
```
39+
## Manual setup
3940

40-
and visit the address (default: `127.0.0.1:5000`). Alternatively, store the image as form data (it can have any name) and send a `POST` request to the root endpoint.
41+
### OCR setup
4142

42-
In production, use Gunicorn:
43+
An OCR engine is NOT required to run the project, though without one the returned table object will contain cell numbers instead of the cell contents.
4344

44-
```
45-
gunicorn server:app
46-
```
45+
This project uses [tesserocr](https://github.com/sirfz/tesserocr) as the Tesseract wrapper out-of-the-box. Follow the instructions there to set up tesserocr.
4746

48-
## Docker
47+
Alternatively, use your own OCR implementation by removing the tesserocr requirement from `requirements.txt` and updating the code in `main.py` and/or `server.py` with your own implementation.
48+
49+
### Get started
50+
51+
1. Make sure Python 3.7.x is installed. `❗❗❗THIS IS IMPORTANT❗❗❗`
52+
2. `Recommended:` Set up a Python 3.7 [virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)
53+
3. Install the requirements (tesserocr might require extra steps, see below): `pip install -r requirements.txt`
54+
4. Run the `main.py` file
55+
56+
### Flask API server
4957

50-
To run as Docker container locally:
58+
A simple Flask API was written to interact with the table extractor. Run the `app` module with Flask:
5159

5260
```
53-
docker build -t table-extraction .
54-
docker run -p 5000:5000 -e PORT=5000 table-extraction
61+
FLASK_APP=app flask run
5562
```
5663

57-
When using a cloud provider, you can change the port by setting the `PORT` environment variable. In Heroku, the port is set automatically so this repository can simply be pushed to the Heroku remote.
64+
and visit the address (default: `http://localhost:5000`). Alternatively, send the image as form data (it can have any name) in a `POST` request to the root endpoint:
65+
66+
```
67+
curl -F "image=@<path-to-image>" http://localhost:5000
68+
```
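
The same form-data `POST` can be sent from Python. The sketch below uses only the standard library and assumes the server is listening on `http://localhost:5000`; the helper names `build_multipart` and `extract_table` and the form field name `image` are illustrative, since the field can have any name.

```python
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/octet-stream") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_table(image_path: str, url: str = "http://localhost:5000") -> bytes:
    """POST the image to the root endpoint and return the raw response body."""
    with open(image_path, "rb") as f:
        body, header = build_multipart("image", os.path.basename(image_path), f.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

`extract_table` returns the raw bytes rather than parsing them, so the caller decides how to decode the response.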
