
+
+
## Overview

This framework was developed as part of my undergraduate final year project at University and allows for the extraction of tabular data from raster images. It uses **line information** to locate cells, and an algorithm arranges the cells in memory to reconstruct the tabular structure. It then uses the Tesseract OCR engine to extract the text and returns the entire table as JSON data. It achieved 89% cell detection accuracy when extracting prayer times from timetables (see `data` folder for some examples).
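
To make the "arranges the cells in memory" step concrete, here is a minimal sketch of one way detected cells could be ordered into rows. It is an illustration only and not the framework's actual algorithm: the `(x, y, width, height)` box format, the `arrange_cells` name and the `row_tolerance` parameter are all assumptions introduced for this example.

```python
from typing import List, Tuple

# Hypothetical cell format: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


def arrange_cells(boxes: List[Box], row_tolerance: int = 10) -> List[List[Box]]:
    """Group cell boxes into rows (top to bottom), each sorted left to right."""
    rows: List[List[Box]] = []
    # Visit boxes from top to bottom so each row is filled before the next starts.
    for box in sorted(boxes, key=lambda b: b[1]):
        # A box joins the current row if its top edge is close to the
        # top edge of the first box in that row.
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Order the cells within each row by their left edge.
    return [sorted(row, key=lambda b: b[0]) for row in rows]
```

The tolerance absorbs small vertical offsets between cells of the same row, which are common when the ruling lines in a scanned image are slightly skewed.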
@@ -17,41 +21,48 @@ Below is a summary of how the framework works. This structure is reflected in `T

-## Tesseract setup
+## Docker

-Follow the instruction from [https://github.com/sirfz/tesserocr](https://github.com/sirfz/tesserocr).
+This is the recommended way to run this project as the environment is all set up and ready to use. For convenience, Docker images are automatically built and released on [Docker Hub](https://hub.docker.com/repository/docker/abdullahibneat/table-extraction).

-## Get started
+To run the Docker container locally:

-1. Make sure Python 3.7.x is installed. `❗❗❗THIS IS IMPORTANT❗❗❗`
-2. Set up a Python 3.7 [virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)
-3. Install the requirements (tesserocr might require extra steps, see below): `pip install -r requirements.txt`
-4. Run the `main.py` file
+```
+docker pull abdullahibneat/table-extraction
+docker run -d -p 5000:5000 abdullahibneat/table-extraction
+```

-## Flask API server
+Then visit http://localhost:5000 and you're ready to go!

-A simple Flask API was written to interact with the table extractor. Run the `server.py` file with Flask:
+When using a cloud provider, you can change the port by setting the `PORT` environment variable. In Heroku, the port is set automatically so this repository can simply be pushed to the Heroku remote.
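
For readers unfamiliar with the `PORT` convention, the sketch below shows how a Python server can pick its port from the environment and fall back to a default. How this project's own server reads `PORT` is not shown here, so `resolve_port` and `DEFAULT_PORT` are hypothetical names used only for illustration.

```python
import os
from typing import Mapping

# Assumed default, matching the port used in the docker run example.
DEFAULT_PORT = 5000


def resolve_port(env: Mapping[str, str] = os.environ) -> int:
    """Return the port from PORT, or DEFAULT_PORT when it is unset or empty."""
    value = env.get("PORT", "").strip()
    return int(value) if value else DEFAULT_PORT
```

With Docker, the variable can be passed with `-e`, for example `-e PORT=8080` together with a matching `-p 8080:8080` mapping.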

-```
-FLASK_APP=server
-flask run
-```
+## Manual setup

-and visit the address (default: `127.0.0.1:5000`). Alternatively, store the image as form data (it can have any name) and send a `POST` request to the root endpoint.
+### OCR setup

-In production, use Gunicorn:
+An OCR engine is NOT required to run the project, though without one the returned table object will contain cell numbers instead of the cell contents.

-```
-gunicorn server:app
-```
+This project uses [tesserocr](https://github.com/sirfz/tesserocr) as the Tesseract wrapper out-of-the-box. Follow the instructions there to set up tesserocr.

-## Docker
+Alternatively, use your own OCR implementation by removing the tesserocr requirement from `requirements.txt` and updating the code in `main.py` and/or `server.py` with your own implementation.
+
+### Get started
+
+1. Make sure Python 3.7.x is installed. `❗❗❗THIS IS IMPORTANT❗❗❗`
+2. `Recommended:` Set up a Python 3.7 [virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)
+3. Install the requirements (tesserocr might require extra steps, see OCR setup above): `pip install -r requirements.txt`
+4. Run the `main.py` file
+
+### Flask API server

-To run as Docker container locally:
+A simple Flask API was written to interact with the table extractor. Run the `app` module with Flask:

```
-docker build -t table-extraction .
-docker run -p 5000:5000 -e PORT=5000 table-extraction
+FLASK_APP=app flask run
```

-When using a cloud provider, you can change the port by setting the `PORT` environment variable. In Heroku, the port is set automatically so this repository can simply be pushed to the Heroku remote.
+and visit the address (default: `http://localhost:5000`). Alternatively, send the image as form data (it can have any name) in a `POST` request to the root endpoint:
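
A minimal sketch of such a request, using only the Python standard library, might look like the following. The helper names `build_multipart` and `post_image` and the default field name `image` are assumptions made for this example (the text above says any field name is accepted), and the JSON response format is inferred from the Overview rather than confirmed here.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/octet-stream") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_image(url: str, image_path: str, field: str = "image"):
    """POST the image at image_path to url and decode the response as JSON."""
    with open(image_path, "rb") as f:
        body, header = build_multipart(field, os.path.basename(image_path), f.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the server replies with the extracted table as JSON.
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, a call such as `post_image("http://localhost:5000", "timetable.png")` would send the file, where `timetable.png` is a placeholder for your own image.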