This is a full tool set for building an open dictionary, based on Wiktionary data.
- Core logic lives in `src/open_dictionary`. The CLI entry point defined in `pyproject.toml` resolves to `open_dictionary:main`, which dispatches into `src/open_dictionary/cli.py`; keep any new commands registered there while delegating business logic to feature modules.
- Data access helpers sit under `src/open_dictionary/db` (for example `access.py`) and should remain focused on PostgreSQL streaming semantics.
- Wiktionary ingestion utilities are split by concern under `src/open_dictionary/wikitionary/`: `downloader.py`, `extract.py`, `transform.py` (streaming COPY + table helpers), `pipeline.py` (orchestration), `filter.py` (language table materialization), and `progress.py` (shared progress reporters).
- LLM-facing enrichments live in `src/open_dictionary/llm`, while cross-cutting utilities (environment loading, helpers) belong in `src/open_dictionary/utils`.
- Runtime artifacts such as dumps or extracted JSONL files are expected in a local `data/` directory (not tracked); scripts should accept paths rather than hard-code locations.
- `uv sync` installs all dependencies declared in `pyproject.toml`.
- `uv run open-dictionary download --output data/raw-wiktextract-data.jsonl.gz` streams the upstream Wiktextract snapshot.
- `uv run open-dictionary pipeline --workdir data --table dictionary --column data --truncate` executes download → extract → load → partition in one shot; add `--skip-*` flags for partial runs.
- `uv run open-dictionary filter en zh --table dictionary_all --column data` copies only selected languages into `dictionary_lang_*` tables; pass `all` as the first positional argument to materialize every language code.
- `uv run open-dictionary db-clean --table dictionary_en` removes rows that fail quality heuristics (numeric tokens, zero scores, legacy tags, etc.).
- `uv run open-dictionary db-commonness --table dictionary_en` streams wordfreq-derived `common_score` values into the target table (add `--recompute-existing` to refresh populated rows).
- `uv run python -m pytest` is the expected test runner once suites are added; for now, rely on targeted CLI runs against a disposable PostgreSQL database.
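The download and filter steps both stream the gzipped Wiktextract JSONL line by line rather than loading it into memory. A minimal sketch of that streaming idea, under the assumption that records carry a `lang_code` field as in Wiktextract output (the file names and helper below are illustrative, not the repo's code):

```python
import gzip
import json
from pathlib import Path
from typing import Iterator

def iter_language(path: Path, wanted: set[str]) -> Iterator[dict]:
    """Yield records whose lang_code is in `wanted`, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("lang_code") in wanted:
                yield record

# Demo with a tiny synthetic snapshot (the real dump is far larger).
sample = Path("sample.jsonl.gz")
with gzip.open(sample, "wt", encoding="utf-8") as fh:
    fh.write(json.dumps({"word": "dog", "lang_code": "en"}) + "\n")
    fh.write(json.dumps({"word": "犬", "lang_code": "ja"}) + "\n")

english = [r["word"] for r in iter_language(sample, {"en"})]
print(english)  # ['dog']
sample.unlink()
```

Because the generator never holds more than one line, the same pattern scales from this toy file to the full snapshot.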
- Target Python 3.12+, four-space indentation, and `snake_case` for functions, modules, and CLI subcommand names.
- Prefer type hints and `pydantic` models for structured payloads (see `llm/define.py`), and keep side effects behind small helpers for easier testing.
- Environment keys (`DATABASE_URL`, `LLM_KEY`, `LLM_API`, `LLM_MODEL`) are loaded through `utils.env_loader`; never fetch them ad hoc inside command bodies.
- Focus on integration tests that exercise the CLI contract end-to-end with a seeded PostgreSQL container; isolate I/O with temp directories under `tmp_path`.
- Name test modules `test_<feature>.py` and colocate fixtures under `tests/conftest.py` once the suite exists.
- Validate large operations by asserting row counts, emitted table names, and LLM scaffolding errors rather than snapshotting full JSON.
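A test module following these conventions might take the shape below. It is a sketch only: `run_extract` is a hypothetical stand-in for a real pipeline step, and a real test would drive the CLI against a seeded PostgreSQL container rather than a pure function.

```python
# tests/test_extract.py — illustrative shape; run with `uv run python -m pytest`.
import json
from pathlib import Path

# Hypothetical stand-in for the real extraction step.
def run_extract(source: Path, dest: Path) -> int:
    rows = [json.loads(line) for line in source.read_text().splitlines()]
    dest.write_text("\n".join(json.dumps(row) for row in rows))
    return len(rows)

def test_extract_reports_row_count(tmp_path: Path) -> None:
    # Isolate all I/O under pytest's tmp_path fixture.
    source = tmp_path / "snapshot.jsonl"
    source.write_text('{"word": "dog"}\n{"word": "cat"}\n')
    dest = tmp_path / "out.jsonl"

    # Assert on row counts rather than snapshotting full JSON.
    assert run_extract(source, dest) == 2
    assert dest.exists()
```

Note the assertions target counts and artifact presence, matching the guidance above about avoiding full-JSON snapshots.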
- Follow the existing history: concise imperative subject lines (e.g. “Add DB iterator”), optional body wrapped at ~72 chars.
- Reference issue IDs in the body when available and note required migrations or manual steps.
- PRs should describe the dataset used for validation, include command transcripts (`uv run …`) for any pipelines executed, and, when UI/CLI behavior changes, attach representative logs or screenshots.
- Keep `.env` files local; share example variables via documentation rather than version control.
- Never commit API keys or database URLs. If sensitive configuration is required in CI, use repository secrets and reference them through environment loader helpers.