Commit cfb4afd

biswasbiplob and claude committed

docs: Update README to reflect v1.0 changes

- Remove NiceGUI references (removed in v1.0.0)
- Remove duplicate Installation section
- Update encoding table: all 8 encodings now Built-in
- EasyOCR and Tesseract both listed as core (not optional)
- Add bilingual output, source language auto-detection to features
- Add system dependencies section (tesseract, translate-shell)
- Update architecture diagram with OCR engines and bilingual output
- Simplify contributing section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 01aa51d commit cfb4afd

1 file changed: +80 -121 lines changed

README.md

Lines changed: 80 additions & 121 deletions
@@ -2,6 +2,24 @@
22

33
**Legacy Font PDF Translator** - Translate PDF documents with legacy Indian font encodings to English.
44

5+
## Problem
6+
7+
Millions of government documents, legal papers, and archival materials in Indian regional languages (Marathi, Hindi, Tamil, etc.) were created using legacy font encoding systems (Shree-Lipi, Kruti Dev, APS, Chanakya, etc.). These fonts map Devanagari/regional script glyphs to ASCII/Latin code points, making them unreadable by standard translation tools.
8+
9+
**Example:**
10+
- What the PDF displays: महाराष्ट्र राजभाषा अधिनियम
11+
- What text extraction produces: `´ÖÆüÖ¸üÖ™Òü ¸üÖ•Ö³ÖÖÂÖÖ †×¬Ö×®ÖμÖ´Ö`
12+
- What Google Translate sees: Gibberish
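
To make the example above concrete, here is a minimal sketch of table-driven conversion, the general technique for turning legacy-encoded text back into Unicode. The table is a toy written for illustration only; it is not LegacyLipi's mapping data, and `TOY_TABLE` and `convert_legacy` are hypothetical names.

```python
# Toy mapping for illustration only: NOT a real or complete legacy font table.
TOY_TABLE = {
    "d": "क",
    "[k": "ख",   # a two-character legacy sequence that maps to one letter
    "x": "ग",
    "k": "ा",
}


def convert_legacy(text: str, table: dict[str, str]) -> str:
    """Replace legacy code sequences with Unicode, trying longer keys first."""
    # Longest keys first, so "[k" is matched before the single "k".
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping found: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(convert_legacy("dk x", TOY_TABLE))
```

Matching the longest key first matters because legacy fonts often encode one letter as several Latin characters; a naive character-by-character replacement would split those sequences.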
13+
14+
## Solution
15+
16+
LegacyLipi:
17+
1. **Detects** the font encoding scheme used in a PDF (legacy or Unicode)
18+
2. **Converts** legacy-encoded text to proper Unicode
19+
3. **Alternatively**, uses **OCR** (Tesseract or EasyOCR) to extract text from scanned PDFs
20+
4. **Translates** the Unicode text to the target language
21+
5. **Outputs** translated text in various formats (text, markdown, PDF) with optional bilingual side-by-side output
22+
523
## Installation
624

725
### From PyPI (Recommended)
@@ -10,7 +28,13 @@
1028
pip install legacylipi
1129
```
1230

13-
Or with uv:
31+
Or with uv (one command, no install):
32+
33+
```bash
34+
uvx legacylipi api
35+
```
36+
37+
Or install as a tool:
1438

1539
```bash
1640
uv tool install legacylipi
@@ -24,17 +48,8 @@ cd legacylipi
2448
uv sync
2549
```
2650

27-
### Frontend (for development)
28-
29-
```bash
30-
cd frontend
31-
npm install
32-
```
33-
3451
### Docker
3552

36-
Build and run with Docker:
37-
3853
```bash
3954
# Build the image
4055
docker build -t legacylipi .
@@ -53,60 +68,9 @@ To process local files, mount volumes:
5368
docker run -p 8000:8000 -v ./input:/app/input -v ./output:/app/output legacylipi
5469
```
5570

56-
### Usage
57-
58-
```bash
59-
# CLI translation
60-
legacylipi translate input.pdf -o output.txt
61-
62-
# Launch React web UI (production build served by FastAPI)
63-
legacylipi api
64-
65-
# Launch legacy NiceGUI web UI (deprecated)
66-
legacylipi ui
67-
```
68-
69-
## Problem
70-
71-
Millions of government documents, legal papers, and archival materials in Indian regional languages (Marathi, Hindi, Tamil, etc.) were created using legacy font encoding systems (Shree-Lipi, Kruti Dev, APS, Chanakya, etc.). These fonts map Devanagari/regional script glyphs to ASCII/Latin code points, making them unreadable by standard translation tools.
72-
73-
**Example:**
74-
- What the PDF displays: महाराष्ट्र राजभाषा अधिनियम
75-
- What text extraction produces: `´ÖÆüÖ¸üÖ™Òü ¸üÖ•Ö³ÖÖÂÖÖ †×¬Ö×®ÖμÖ´Ö`
76-
- What Google Translate sees: Gibberish
77-
78-
## Solution
79-
80-
LegacyLipi:
81-
1. **Detects** the font encoding scheme used in a PDF (legacy or Unicode)
82-
2. **Converts** legacy-encoded text to proper Unicode
83-
3. **Alternatively**, uses **OCR** to extract text from scanned PDFs
84-
4. **Translates** the Unicode text to the target language
85-
5. **Outputs** translated text in various formats (text, markdown, PDF)
86-
87-
## Installation
88-
89-
```bash
90-
# Clone and install
91-
git clone https://github.com/biswasbiplob/legacylipi.git
92-
cd legacylipi
93-
uv sync
94-
95-
# With all optional backends
96-
uv sync --all-extras
97-
```
98-
99-
### OCR Support (Optional)
100-
101-
LegacyLipi supports multiple OCR backends:
102-
103-
| Backend | Description | GPU Support |
104-
|---------|-------------|-------------|
105-
| Tesseract | Local, free, most language packs | CPU only |
106-
| Google Vision | Cloud, paid, best accuracy | N/A |
107-
| EasyOCR | Local, free, good for Indian languages | CUDA, MPS (Apple Silicon) |
71+
### System Dependencies
10872

109-
**Tesseract (default):**
73+
**Tesseract** (for OCR - recommended):
11074
```bash
11175
# Ubuntu/Debian
11276
sudo apt-get install tesseract-ocr tesseract-ocr-mar tesseract-ocr-hin
@@ -115,39 +79,35 @@ sudo apt-get install tesseract-ocr tesseract-ocr-mar tesseract-ocr-hin
11579
brew install tesseract tesseract-lang
11680
```
11781

118-
**EasyOCR with GPU (optional):**
82+
**Translate-Shell** (recommended translation backend):
11983
```bash
120-
# Install with EasyOCR support
121-
uv sync --extra easyocr
122-
123-
# For GPU acceleration, install PyTorch with CUDA or MPS support
124-
```
84+
# Ubuntu/Debian
85+
sudo apt-get install translate-shell
12586

126-
**Google Vision (optional):**
127-
```bash
128-
uv sync --extra vision
129-
# Requires GCP credentials (GOOGLE_APPLICATION_CREDENTIALS)
87+
# macOS
88+
brew install translate-shell
13089
```
13190

132-
See [docs/cli-reference.md](docs/cli-reference.md) for detailed OCR options and language codes.
133-
13491
## Quick Start
13592

13693
```bash
13794
# Basic translation
138-
uv run legacylipi translate input.pdf -o output.txt
95+
legacylipi translate input.pdf -o output.txt
13996

14097
# Output as PDF (preserves layout)
141-
uv run legacylipi translate input.pdf -o output.pdf --format pdf
98+
legacylipi translate input.pdf -o output.pdf --format pdf
99+
100+
# Bilingual side-by-side output
101+
legacylipi translate input.pdf -o output.pdf --bilingual
142102

143103
# OCR for scanned documents
144-
uv run legacylipi translate input.pdf --use-ocr -o output.txt
104+
legacylipi translate input.pdf --use-ocr -o output.txt
145105

146106
# Use local LLM (requires Ollama)
147-
uv run legacylipi translate input.pdf --translator ollama --model llama3.2
107+
legacylipi translate input.pdf --translator ollama --model llama3.2
148108

149109
# Detect encoding only
150-
uv run legacylipi detect input.pdf
110+
legacylipi detect input.pdf
151111
```
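
For scripted or batch use, the commands above could be assembled from Python. This is a sketch, not part of LegacyLipi: `build_translate_cmd` is a hypothetical helper, it uses only the flags shown in the Quick Start, and it assumes the CLI accepts options in any order.

```python
import subprocess


def build_translate_cmd(pdf, out, *, fmt=None, bilingual=False, use_ocr=False):
    """Build an argv list for `legacylipi translate` (hypothetical helper)."""
    cmd = ["legacylipi", "translate", str(pdf), "-o", str(out)]
    if fmt:
        cmd += ["--format", fmt]
    if bilingual:
        cmd.append("--bilingual")
    if use_ocr:
        cmd.append("--use-ocr")
    return cmd


def run_translate(pdf, out, **options):
    """Run the CLI and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_translate_cmd(pdf, out, **options), check=True)


# Example: print the command that would be run for a scanned document.
print(" ".join(build_translate_cmd("input.pdf", "output.txt", use_ocr=True)))
```

Passing an argument list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.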
152112

153113
See [docs/cli-reference.md](docs/cli-reference.md) for complete CLI documentation.
@@ -160,9 +120,9 @@ LegacyLipi includes a modern React-based web interface backed by a FastAPI REST
160120

161121
```bash
162122
# Serves the built React frontend + API on one port
163-
uv run legacylipi api
123+
legacylipi api
164124
# or
165-
uv run legacylipi-web
125+
uvx legacylipi api
166126
```
167127

168128
Open **http://localhost:8000** in your browser.
@@ -178,15 +138,6 @@ This runs:
178138
- **Backend** at http://localhost:8000 (FastAPI with auto-reload)
179139
- **Frontend** at http://localhost:5173 (Vite dev server with HMR, proxies `/api` to backend)
180140

181-
### Legacy NiceGUI UI (deprecated)
182-
183-
The original NiceGUI-based UI is still available but deprecated:
184-
185-
```bash
186-
uv run legacylipi ui
187-
# Open http://localhost:8080
188-
```
189-
190141
**Workflow Modes:**
191142
- **Scanned Copy** - Create image-based PDF copy (adjust DPI, color, quality)
192143
- **Convert to Unicode** - OCR + Unicode conversion without translation
@@ -196,12 +147,27 @@ uv run legacylipi ui
196147
- Drag-and-drop PDF upload
197148
- Workflow-based UI with mode selection
198149
- Multiple translation backends (Translate-Shell, Google, Ollama, OpenAI, etc.)
199-
- OCR support with engine and language selection
150+
- OCR support with EasyOCR and Tesseract engine selection
200151
- Structure-preserving or flowing text modes
152+
- Bilingual side-by-side output
153+
- Source language auto-detection from encoding
201154
- Real-time SSE progress streaming
202155
- Direct download of translated files
203156
- Responsive dark-theme design
204157

158+
## Supported Encodings
159+
160+
| Encoding | Font Family | Language | Status |
161+
|----------|-------------|----------|--------|
162+
| shree-dev | SHREE-DEV-0708, 0714, 0715, 0721 | Marathi | Built-in |
163+
| shree-lipi | Shree-Lipi, SDL-DEV | Marathi | Built-in |
164+
| dvb-tt | DVBWTTSurekh, DVBTTSurekh | Marathi | Built-in |
165+
| kruti-dev | KrutiDev010, KrutiDev040 | Hindi | Built-in |
166+
| chanakya | Chanakya | Hindi/Sanskrit | Built-in |
167+
| aps-dv | APS-DV-TT | Hindi | Built-in |
168+
| walkman-chanakya | Walkman Chanakya | Hindi | Built-in |
169+
| shusha | Shusha | Marathi/Hindi | Built-in |
170+
205171
## Translation Backends
206172

207173
| Backend | Description | Setup |
@@ -215,39 +181,36 @@ uv run legacylipi ui
215181

216182
See [docs/translation-backends.md](docs/translation-backends.md) for detailed setup guides.
217183

218-
## Supported Encodings
184+
## OCR Backends
219185

220-
| Encoding | Font Family | Language | Status |
221-
|----------|-------------|----------|--------|
222-
| shree-lipi | Shree-Lipi, Shree-Dev-0714 | Marathi | ✅ Built-in |
223-
| kruti-dev | Kruti Dev | Hindi | ✅ Built-in |
224-
| aps-dv | APS-DV | Hindi | 🔄 Detection only |
225-
| chanakya | Chanakya | Hindi | 🔄 Detection only |
226-
| dvb-tt | DVB-TT, DV-TTYogesh | Hindi | 🔄 Detection only |
227-
| walkman-chanakya | Walkman Chanakya | Hindi | 🔄 Detection only |
228-
| shusha | Shusha | Hindi | 🔄 Detection only |
186+
Both OCR engines are included as core dependencies:
187+
188+
| Backend | Description | GPU Support |
189+
|---------|-------------|-------------|
190+
| EasyOCR | Local, free, good for Indian languages (default) | CUDA, MPS (Apple Silicon) |
191+
| Tesseract | Local, free, most language packs | CPU only |
192+
| Google Vision | Cloud, paid, best accuracy | N/A |
193+
194+
Google Vision requires an additional install: `pip install legacylipi[vision]`
195+
196+
See [docs/cli-reference.md](docs/cli-reference.md) for detailed OCR options and language codes.
229197

230198
## CLI Commands
231199

232200
| Command | Description |
233201
|---------|-------------|
234202
| `api` | Launch the React web UI + FastAPI REST API |
235-
| `translate` | Full pipeline: parsedetectconverttranslate output |
203+
| `translate` | Full pipeline: parse, detect, convert, translate, output |
236204
| `convert` | Convert legacy encoding to Unicode (no translation) |
237205
| `extract` | Extract text from PDF (OCR or font-based) |
238206
| `detect` | Analyze PDF and report detected encoding |
239207
| `scan-copy` | Create an image-based scanned copy of a PDF |
240208
| `encodings` | List supported font encodings |
241209
| `usage` | Show API usage statistics |
242-
| `ui` | Launch legacy NiceGUI web interface (deprecated) |
243210

244211
See [docs/cli-reference.md](docs/cli-reference.md) for full command reference.
245212

246-
## Development
247-
248-
See [docs/development.md](docs/development.md) for setup instructions, running tests, project structure, and adding new encodings.
249-
250-
### Architecture
213+
## Architecture
251214

252215
```
253216
┌─────────────────────────────────────────────────────────────────────────┐
@@ -268,23 +231,20 @@ See [docs/development.md](docs/development.md) for setup instructions, running t
268231
│ ┌──────────────────────────────────────────────────────────────────┐ │
269232
│ │ Core Pipeline │ │
270233
│ │ │ │
271-
│ │ PDF Parser / OCR Parser │ │
234+
│ │ PDF Parser / OCR Parser (Tesseract + EasyOCR) │ │
272235
│ │ │ │ │
273236
│ │ Encoding Detector → Unicode Converter │ │
274237
│ │ │ │ │
275238
│ │ Translation Engine (trans, Google, Ollama, OpenAI, GCP, ...) │ │
276239
│ │ │ │ │
277-
│ │ Output Generator (.txt, .md, .pdf) │ │
240+
│ │ Output Generator (.txt, .md, .pdf, bilingual) │ │
278241
│ └──────────────────────────────────────────────────────────────────┘ │
279242
└─────────────────────────────────────────────────────────────────────────┘
280243
```
281244

282-
**Pipeline Flow:**
283-
1. **Parse PDF** → Extract text with PDF parser or OCR
284-
2. **Detect Encoding** → Identify legacy encoding scheme
285-
3. **Convert to Unicode** → Transform legacy text to Unicode
286-
4. **Translate** → Use translation backend
287-
5. **Generate Output** → Create PDF/text/markdown
245+
## Development
246+
247+
See [docs/development.md](docs/development.md) for setup instructions, running tests, project structure, and adding new encodings.
288248

289249
## License
290250

@@ -296,7 +256,6 @@ Contributions are welcome! Please:
296256

297257
1. Fork the repository
298258
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
299-
3. Make your changes
300-
4. Run tests (`uv run pytest`)
301-
5. Commit and push
302-
6. Open a Pull Request
259+
3. Run checks (`./scripts/check.sh`)
260+
4. Commit and push
261+
5. Open a Pull Request
