Precise technical-document translation powered by glossary-aware prompts and robust Markdown segmentation. This project provides both CLI and WebUI with a unified module layer.
We’ve shipped an exciting refactor focused on Dynamic Terminology Management, dramatically improving stability and consistency for long-form document translation:
- Terminology union matching: newly detected terms are collected during translation and unioned with the CSV glossary for matching, maximizing term hits.
- Controlled merge switch: at the end, choose whether to merge new terms back into the glossary; if not, a separate “new-terms CSV” is exported for review.
- Paragraph-level retries: transient API hiccups no longer break the flow; failures reset on the next success.
- Proper-noun matching optimization: better case handling for names like HOPKINS/Hopkins without harmful lemmatization.
- Structured output and prompts: translation and notes are separated; notes are list-form and include reasons for new terminology choices, closing the review loop.
Plan change: the “empty glossary generation” feature is temporarily deprecated due to insufficient effectiveness; it may be rebuilt or removed.
- Overview
- Key Features
- Project Structure (Unified Root)
- Installation
- Quick Start (uv)
- Usage
- CLI
- WebUI
- Environment & Defaults
- Testing
- FAQ
- Changelog & Migration Notes
- Roadmap
- Contact
AIwork4translator focuses on translating technical documents while preserving terminology accuracy. It detects and protects domain-specific terms and provides an editor workflow to review translations efficiently.
- Glossary-first prompts: Aho-Corasick automaton and regex matching identify domain terms per segment; prompts instruct the model to use exact translations consistently.
- High Efficiency: Compared to typical RAG uploads for small glossaries, early tests show ~99% less extra token overhead and ~35% more term hits. Overall quality improves due to stable terminology.
- Adaptive: Supports plural normalization (e.g., "outlaws" -> "outlaw") and smart article handling ("the Outlaw" matches "an outlaw").
- Continuous Improvement: Since no glossary can be complete, newly detected terms are automatically highlighted/summarized in the output to aid review and glossary updates.
- Terminology detection & protection: Efficient Aho-Corasick matching with plural/article normalization.
- Multi-format input support: Native
.md/.txt, and conversion for PDF/Word/Excel/HTML via MarkItDown. - Multiple LLM providers: Kimi, OpenAI (GPT), DeepSeek, and optional local models via Ollama.
- CLI and WebUI modes: Unified core logic with interactive CLI and feature-rich WebUI.
- CLI Preference Memory: Automatically remembers last used provider and file paths for faster startup.
- Two-column Markdown editor: WebUI split-view (original vs translation) with live progress.
- Real-time progress: Live polling and safe autosave after completion.
- Interactive confirmation: For non-Markdown conversion steps.
- Optional empty glossary generation: Via NER (CLI) (Note: planned temporary deprecation; see Roadmap).
- Structured translation: Smart segmentation based on Markdown headers.
- Dynamic terminology management (refactor core): Real-time union of new terms with the CSV glossary for matching; end-of-run merge switch; paragraph-level retries for stability.
project_root/
├── data/
│ ├── .env
│ └── README_en.md
├── models/
│ └── dbmdz/bert-large-cased-finetuned-conll03-english-for-ner/
├── modules/
│ ├── api_tool.py # LLMService, returns (content, tokens), no RAG cache
│ ├── config.py # GlobalConfig (env-first)
│ ├── read_tool.py # Structured Markdown segmentation
│ ├── count_tool.py # Paragraph counting (mirrors segmentation logic)
│ ├── csv_process_tool.py # Aho-Corasick term matching & CSV validation
│ ├── markitdown_tool.py # Non-Markdown → Markdown conversion
│ ├── write_out_tool.py # Structured/flat writing (no "# end" by default)
│ └── ner_list_tool.py # Robust model path detection (prefers root models/)
├── templates/
│ ├── index.html
│ └── editor.html
├── static/
│ ├── style.css
│ ├── script.js
│ └── editor.js
├── uploads/
├── app.py # WebUI entry (provides main())
├── main.py # CLI entry (interactive, with preference persistence)
├── baseline.py # Baseline CLI (unified API call, no RAG cache)
├── ner_list_check.py # NER-based glossary generator
└── pyproject.toml # Scripts: CLI/WebUI
- Requirements: Python 3.11+
- Dependency manager:
uv - Virtual environment:
.venv(default)
uv sync
# Sync project dependencies (creates .venv)
uv sync
# Install / remove dependencies
uv add <package>
uv remove <package>
# Run CLI (recommended)
uv run main.py
# Or
python main.py
# Run WebUI
python app.py
# Or script entry
program-translator-webui
# Run tests
python -m pytest
- Configuration & secrets: Set API keys in
data/.envand never commit secrets. On ≥3 consecutive API failures, run the CLI API test and check your network/proxy. Kimi defaults:BASE_URL=https://api.moonshot.cn/v1,MODEL=kimi-k2-turbo-preview. - Glossary quality & performance: CSV must have 2 columns, UTF-8 encoding, no empty rows, and unique source terms. Large glossaries are supported efficiently via Aho-Corasick algorithm. Plural forms are automatically handled.
- CLI Preferences: The CLI remembers your last choices (provider, file paths) in
data/.prefs.json. Press Enter to accept the default value shown in[...]. - Input & structure: Prefer native Markdown. Non-Markdown files are auto-converted; please review the converted file before continuing. Structured mode relies on heading levels—use
#/##reasonably. - Paragraphs & large files: Very long paragraphs are split intelligently. Large files take longer; WebUI recommends ≤10MB. CLI shows progress as
[current/total]. - Consistency: Lower
temperatureimproves term consistency. The prompt explicitly injects term mappings to stabilize outputs. - Local models: For Ollama, set
OLLAMA_BASE_URLandOLLAMA_MODELin.env. Ensure the service is running before use. - Logging & outputs: CLI prints
Total time: xh xm xs(Chinese format enabled).counting_table.csvrecords raw seconds and tokens. WebUI logs include timestamps.
- Run
uv run main.py(orpython main.py) - Select the provider (kimi/gpt/deepseek/ollama) - defaults to last used
- Enter the input file path - defaults to last used
- If the file is not Markdown, it will be converted via MarkItDown and you will be prompted to continue (y/n)
- If you don’t have a glossary, choose
nto generate an empty glossary via NER, then exit and fill the translation column - If you have a glossary, provide the CSV path and proceed with translation
- Supports
.xlsxinput (auto-converted to.csv)
- Supports
- Progress prints:
Translating segment [current]/[total]
- Run
uv run app.py - Open
http://localhost:8008/ - On the home page:
- Upload original file (
.mdor convertible format) and validate - Upload glossary CSV and validate
- Choose provider, click “Use Two-Column Editor”, then in the editor page click “Start Translation”
- Upload original file (
- Endpoints tested:
/validate-file,/start-translation,/translation-progress,/save-content,/download,/open-file - After completion, autosave checks every ~20 seconds
- Environment file:
data/.env - Default provider config for Kimi:
KIMI_BASE_URL=https://api.moonshot.cn/v1model=kimi-k2-turbo-preview
- Suggested parameters:
max_tokens=2048,temperature=0.7,top_p=0.95 - Keep API keys secure and do not commit to the repository
- Run all tests:
python -m pytest - Tests are aligned to the unified root structure
- The editor shows progress as segments are processed; content updates incrementally
- Non-Markdown files are converted and require a user confirmation before translation
- Default writing mode does not append
# endin flat mode (to match CLI/WebUI behavior)
- 2025-12-04: Added CLI preference persistence, Aho-Corasick term matching optimization, and plural/article normalization.
- 2025-11-29: Unified root structure (merged CLI/WebUI), removed RAG cache, aligned modules, updated script entries, cleaned legacy directories
- v1.2.0: Core features completed (CLI, WebUI, provider support, unified modules).
- Future versions:
- Enhance Markdown parsing performance with MinerU.
- Add a CAT-like interactive table view (source/target aligned per sentence) for efficient editing.
Support real-time glossary updates: auto-adding unfamiliar terms during translation and allowing quick edits.Completed: Dynamic Terminology Management (union matching, merge switch, new-term collection/export).Improve empty glossary generation: include type plus translation suggestions for faster review.New plan: temporarily deprecate blank glossary generation due to poor effectiveness; may be rebuilt or removed.
Email: chasen0315@gmail.com (reply within 24 hours by 2025-12-29)