The genuine open dictionary, grounded in Wiktionary and enriched with LLM explanations

Open English Dictionary

Rebuild in progress (WIP)

Currently, this project is being rebuilt.

New features:

  • Streamlined process + pipeline integration
  • Wiktionary grounding + LLM explanations
    • Extensive word data across multiple languages
    • Extremely detailed definitions
  • New distribution formats: JSONL, SQLite, and more to be determined
  • Options to select specific categories of words

Behold and stay tuned!

Prerequisites

  • Install project dependencies: uv sync
  • Configure a .env file with DATABASE_URL
  • Ensure a PostgreSQL database is reachable via that URL
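
A minimal .env might look like the following (placeholder credentials; adjust the host, database name, and user to your setup; the LLM_* variables are only needed for the LLM commands described later):

```ini
# Placeholder connection string; replace with your own credentials.
DATABASE_URL=postgresql://user:password@localhost:5432/open_dictionary

# Required only for LLM commands such as llm-define.
LLM_MODEL=your-model-name
LLM_KEY=your-api-key
LLM_API=https://your-llm-endpoint.example.com
```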

Run the Wiktionary Workflow

Download the compressed dump:

uv run open-dictionary download --output data/raw-wiktextract-data.jsonl.gz

Extract the JSONL file:

uv run open-dictionary extract \
  --input data/raw-wiktextract-data.jsonl.gz \
  --output data/raw-wiktextract-data.jsonl

Stream the JSONL into PostgreSQL (dictionary_all.data is JSONB):

uv run open-dictionary load data/raw-wiktextract-data.jsonl \
  --table dictionary_all \
  --column data \
  --truncate

Run everything end-to-end with optional partitioning:

uv run open-dictionary pipeline \
  --workdir data \
  --table dictionary_all \
  --column data \
  --truncate

Split rows by language code into per-language tables when needed:

uv run open-dictionary partition \
  --table dictionary_all \
  --column data \
  --lang-field lang_code

Materialize a smaller set of languages into dedicated tables with a custom prefix:

uv run open-dictionary filter en zh \
  --table dictionary_all \
  --column data \
  --table-prefix dictionary_filtered

Pass all to emit every language into its own table:

uv run open-dictionary filter all --table dictionary_all --column data

Populate the common_score column with word frequency data (re-run with --recompute-existing to refresh scores):

uv run open-dictionary db-commonness --table dictionary_filtered_en
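
The command's internals aren't shown in this README, but conceptually a commonness score maps each word to its corpus frequency. A toy sketch (the frequency table, score scale, and function name are all hypothetical, not the project's actual implementation):

```python
import math

# Toy corpus frequencies (hypothetical data); the real command would pull
# frequencies from a word-frequency dataset.
FREQUENCIES = {"the": 50_000_000, "dictionary": 120_000, "sesquipedalian": 40}

def common_score(word: str) -> float:
    """Map raw corpus frequency to a log-scaled commonness score.

    Words absent from the frequency table get a score of 0.0, which is
    the kind of row the db-clean step later removes.
    """
    freq = FREQUENCIES.get(word.lower(), 0)
    return round(math.log10(freq + 1), 2)

scores = {w: common_score(w) for w in ["the", "sesquipedalian", "zzzz"]}
```

A log scale keeps scores comparable across the huge frequency range between function words and rare vocabulary.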

Normalize raw Wiktionary payloads into a slimmer JSONB column without invoking LLMs (writes to process by default):

uv run open-dictionary pre-process \
  --table dictionary_filtered_en \
  --source-column data \
  --target-column processed

Add --toon to convert the output to the TOON format instead, which reduces token usage by 30-60% for LLM workflows and stores the result as TEXT rather than JSONB:

uv run open-dictionary pre-process \
  --table dictionary_filtered_en \
  --source-column data \
  --target-column processed \
  --toon
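
The kind of slimming pre-process performs can be sketched as keeping only the fields an LLM needs. The input field names below follow the wiktextract schema (word, pos, senses[].glosses); the exact keys the project keeps and the slim helper are assumptions for illustration:

```python
# A raw wiktextract-style entry, trimmed to a few representative fields.
raw = {
    "word": "serendipity",
    "pos": "noun",
    "etymology_text": "Coined by Horace Walpole in 1754...",
    "senses": [
        {"glosses": ["An unsought, unintended fortunate discovery."],
         "id": "en-serendipity-noun-abc123"},
    ],
    "sounds": [{"ipa": "/ˌsɛɹ.ənˈdɪp.ɪ.ti/"}],
}

def slim(entry: dict) -> dict:
    """Drop bulky metadata, keeping only word, part of speech, and glosses."""
    return {
        "word": entry["word"],
        "pos": entry["pos"],
        "glosses": [g for s in entry.get("senses", [])
                    for g in s.get("glosses", [])],
    }

slimmed = slim(raw)
```

Slimming before the LLM stage matters because every retained byte is re-sent on each of millions of LLM calls.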

Remove low-quality rows (zero common score, numeric tokens, legacy tags) directly in PostgreSQL:

uv run open-dictionary db-clean --table dictionary_filtered_en

Generate structured, Chinese-learner-friendly entries with the LLM define workflow (writes JSONB into new_speak by default). This streams rows in batches, dispatches up to 50 concurrent LLM calls with exponential-backoff retries, and resumes automatically on restart:

uv run open-dictionary llm-define \
  --table dictionary_filtered_en \
  --source-column processed \
  --target-column new_speak

Provide LLM_MODEL, LLM_KEY, and LLM_API in your environment (e.g., .env) before running LLM commands.

Each command streams data in chunks to handle the 10M+ line dataset efficiently.
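
The chunked-streaming approach can be sketched as a batched generator (a minimal illustration under assumed batch sizes, not the project's actual loader):

```python
import io
import json
from typing import Iterator

def stream_jsonl(fh, chunk_size: int = 2) -> Iterator[list[dict]]:
    """Yield parsed JSONL rows in fixed-size batches so memory stays flat
    even for a 10M+ line file. In practice chunk_size would be in the
    thousands; 2 here is just for the demo."""
    batch: list[dict] = []
    for line in fh:                    # file iteration is already lazy
        if line.strip():
            batch.append(json.loads(line))
            if len(batch) == chunk_size:
                yield batch
                batch = []
    if batch:
        yield batch                    # flush the final partial batch

demo = io.StringIO('{"word": "a"}\n{"word": "b"}\n{"word": "c"}\n')
batches = list(stream_jsonl(demo))
```

Because the generator never holds more than one batch in memory, peak usage depends on chunk_size rather than file size.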
