- Remove NiceGUI references (removed in v1.0.0)
- Remove duplicate Installation section
- Update encoding table: all 8 encodings now Built-in
- EasyOCR and Tesseract both listed as core (not optional)
- Add bilingual output, source language auto-detection to features
- Add system dependencies section (tesseract, translate-shell)
- Update architecture diagram with OCR engines and bilingual output
- Simplify contributing section
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
 **Legacy Font PDF Translator** - Translate PDF documents with legacy Indian font encodings to English.

+## Problem
+
+Millions of government documents, legal papers, and archival materials in Indian regional languages (Marathi, Hindi, Tamil, etc.) were created using legacy font encoding systems (Shree-Lipi, Kruti Dev, APS, Chanakya, etc.). These fonts map Devanagari/regional script glyphs to ASCII/Latin code points, making them unreadable by standard translation tools.
+
+**Example:**
+- What the PDF displays: महाराष्ट्र राजभाषा अधिनियम
+- What text extraction produces: `´ÖÆüÖ¸üÖ™Òü ¸üÖ•Ö³ÖÖÂÖÖ †×¬Ö×®ÖμÖ´Ö`
+- What Google Translate sees: Gibberish
+
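Note on the example above: the garbled extraction comes from a glyph-to-code-point remapping, so in principle a lookup table can undo it. A minimal Python sketch, assuming a toy four-entry table guessed from the first word of the README example. `TOY_MAP` and `legacy_to_unicode` are hypothetical names for illustration, not LegacyLipi's API or its real mapping tables.

```python
# Toy table guessed from the README example (first word only).
# This is NOT the project's real Shree-Lipi mapping.
TOY_MAP = {
    "\u00b4\u00d6": "\u092e",  # ´Ö -> म
    "\u00c6\u00fc": "\u0939",  # Æü -> ह
    "\u00b8\u00fc": "\u0930",  # ¸ü -> र
    "\u00d6": "\u093e",        # Ö  -> ा
}


def legacy_to_unicode(text, mapping=TOY_MAP):
    """Convert legacy-encoded text using greedy longest-match lookup."""
    max_len = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so two-character keys win
        # over their one-character prefixes or suffixes.
        for n in range(max_len, 0, -1):
            chunk = text[i:i + n]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += len(chunk)
                break
        else:
            # Unknown character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


SAMPLE = "\u00b4\u00d6\u00c6\u00fc\u00d6\u00b8\u00fc\u00d6"  # ´ÖÆüÖ¸üÖ
print(legacy_to_unicode(SAMPLE))  # महारा
```

Longest-match-first matters here because some letters span two legacy code points. Real converters generally also need reordering rules (for example, for vowel signs typed before the consonant), which this sketch omits.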
+## Solution
+
+LegacyLipi:
+1. **Detects** the font encoding scheme used in a PDF (legacy or Unicode)
+2. **Converts** legacy-encoded text to proper Unicode
+3. **Alternatively**, uses **OCR** (Tesseract or EasyOCR) to extract text from scanned PDFs
+4. **Translates** the Unicode text to the target language
+5. **Outputs** translated text in various formats (text, markdown, PDF) with optional bilingual side-by-side output
+
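Note on step 1: the legacy-versus-Unicode decision could be approximated with a character-range heuristic. The following is a sketch under stated assumptions, not LegacyLipi's actual detector. `guess_encoding_kind` is a hypothetical name, the 0.5 threshold is arbitrary, and only the Devanagari block is checked.

```python
def guess_encoding_kind(text, threshold=0.5):
    """Classify extracted PDF text as 'unicode', 'legacy', or 'unknown'.

    Assumption: Unicode Devanagari text falls mostly in U+0900..U+097F,
    while the legacy-font output shown in the README falls mostly in
    the Latin-1 Supplement range U+00A1..U+00FF.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return "unknown"

    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    latin1_high = sum(1 for ch in chars if "\u00a1" <= ch <= "\u00ff")

    if devanagari / len(chars) > threshold:
        return "unicode"
    if latin1_high / len(chars) > threshold:
        return "legacy"
    return "unknown"


print(guess_encoding_kind("\u092e\u0939\u093e\u0930\u093e"))   # unicode
print(guess_encoding_kind("\u00b4\u00d6\u00c6\u00fc\u00d6"))   # legacy
print(guess_encoding_kind("Hello world"))                      # unknown
```

A legacy font that reuses plain ASCII letters would slip past this check, so a practical detector would likely also need other signals, such as the font names embedded in the PDF.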
 ## Installation

 ### From PyPI (Recommended)
@@ -10,7 +28,13 @@
 pip install legacylipi
 ```

-Or with uv:
+Or with uv (one command, no install):
+
+```bash
+uvx legacylipi api
+```
+
+Or install as a tool:

 ```bash
 uv tool install legacylipi
@@ -24,17 +48,8 @@ cd legacylipi
 uv sync
 ```

-### Frontend (for development)
-
-```bash
-cd frontend
-npm install
-```
-
 ### Docker

-Build and run with Docker:
-
 ```bash
 # Build the image
 docker build -t legacylipi .
@@ -53,60 +68,9 @@ To process local files, mount volumes:
 docker run -p 8000:8000 -v ./input:/app/input -v ./output:/app/output legacylipi
 ```

-### Usage
-
-```bash
-# CLI translation
-legacylipi translate input.pdf -o output.txt
-
-# Launch React web UI (production build served by FastAPI)
-legacylipi api
-
-# Launch legacy NiceGUI web UI (deprecated)
-legacylipi ui
-```
-
-## Problem
-
-Millions of government documents, legal papers, and archival materials in Indian regional languages (Marathi, Hindi, Tamil, etc.) were created using legacy font encoding systems (Shree-Lipi, Kruti Dev, APS, Chanakya, etc.). These fonts map Devanagari/regional script glyphs to ASCII/Latin code points, making them unreadable by standard translation tools.
-
-**Example:**
-- What the PDF displays: महाराष्ट्र राजभाषा अधिनियम
-- What text extraction produces: `´ÖÆüÖ¸üÖ™Òü ¸üÖ•Ö³ÖÖÂÖÖ †×¬Ö×®ÖμÖ´Ö`
-- What Google Translate sees: Gibberish
-
-## Solution
-
-LegacyLipi:
-1. **Detects** the font encoding scheme used in a PDF (legacy or Unicode)
-2. **Converts** legacy-encoded text to proper Unicode
-3. **Alternatively**, uses **OCR** to extract text from scanned PDFs
-4. **Translates** the Unicode text to the target language
-5. **Outputs** translated text in various formats (text, markdown, PDF)