Pali Text Word Mapping & Context-Aware Translation #179

Devamitta · 2025-11-07T02:37:31Z

Devamitta
Nov 7, 2025
Collaborator

This is a continuation of sasanarakkha#25 and this discussion.

Create a tool that analyzes Pali text, maps each word to the Digital Pali Dictionary (DPD) database entries, and generates contextualized translations.

Input: Pali text (sentences/paragraphs)

Process:
1. Tokenize input word by word
2. Match words against DPD database using inflections column for declensions/conjugations
3. Handle compound words that are not yet in the db, using the lookup table's deconstructor column
4. Use AI (Gemini/DeepSeek/OpenRouter) to disambiguate word senses when multiple meanings exist (e.g., buddha_1 vs buddha_2)
5. Select the appropriate id from the dpd_headwords table based on context
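
A minimal sketch of step 1, assuming that plain whitespace-and-punctuation splitting is good enough for a first pass. The function name `tokenize` is hypothetical, and quotation marks or elision apostrophes inside words are not handled here.

```python
import re
import unicodedata


def tokenize(text: str) -> list[str]:
    """Split Pali text into lowercase words, dropping punctuation and digits."""
    # NFC keeps letters such as "ā" or "ṭ" as single code points,
    # so the letter-only pattern below does not split them apart.
    normalized = unicodedata.normalize("NFC", text).lower()
    # [^\W\d_]+ matches runs of Unicode letters only.
    return re.findall(r"[^\W\d_]+", normalized)


words = tokenize(
    "tatra, bhikkhave, ye te makkaṭā abālajātikā alolajātikā, "
    "te taṃ lepaṃ disvā ārakā parivajjanti."
)
print(words)
```

Each token can then be matched against the database as described in steps 2 and 3.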

Database Structure:
• SQLite database
• dpd_headwords table: id, lemma_1, pos, grammar, meaning_1, inflections, etc.
• lookup table: lookup_key, deconstructor
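
To illustrate matching an inflected form against the inflections column, here is a rough sketch on an in-memory SQLite database. It assumes the column stores a comma-separated list of forms and uses the dpd_headwords spelling of the table name from later in this thread; the rows, ids and meanings are toy values, not real DPD data.

```python
import sqlite3


def make_toy_db() -> sqlite3.Connection:
    """Build a tiny stand-in for the headword table (toy rows, hypothetical ids)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dpd_headwords ("
        "id INTEGER PRIMARY KEY, lemma_1 TEXT, pos TEXT, "
        "meaning_1 TEXT, inflections TEXT)"
    )
    conn.executemany(
        "INSERT INTO dpd_headwords VALUES (?, ?, ?, ?, ?)",
        [
            (1, "buddha 1", "pp", "awakened", "buddha,buddho,buddhassa"),
            (2, "buddha 2", "masc", "the Buddha", "buddha,buddho,buddhassa"),
            (3, "dhamma 1", "masc", "teaching", "dhamma,dhammo,dhammassa"),
        ],
    )
    return conn


def candidates(conn: sqlite3.Connection, word: str) -> list[tuple]:
    """Return every headword whose inflection list contains the exact form."""
    # Wrapping both sides in commas avoids partial matches such as
    # "buddha" matching inside "buddhassa". Tokens are assumed to be
    # letters only, so they contain no LIKE wildcards.
    return conn.execute(
        "SELECT id, lemma_1, pos, meaning_1 FROM dpd_headwords "
        "WHERE ',' || inflections || ',' LIKE ? ORDER BY id",
        (f"%,{word},%",),
    ).fetchall()


conn = make_toy_db()
print(candidates(conn, "buddhassa"))
```

A scan with LIKE is slow on a large table; a precomputed lookup from inflected form to headword ids would avoid it.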

Output:

Detailed Table:

• Original Pali sentence
• Table with columns: word | id | pos | grammar | meaning
• English translation of the sentence using meaning_1 with context

CSV Export:

• All unique words from analyzed text
• Columns: id, lemma_1, pos, meaning_1
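
The CSV export could look roughly like this, assuming each analysed word arrives as a dict that already carries the chosen headword's fields. The function name `unique_words_csv` and the sample rows are hypothetical.

```python
import csv
import io


def unique_words_csv(rows: list[dict]) -> str:
    """Write one CSV line per unique headword id, keeping first-seen order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["id", "lemma_1", "pos", "meaning_1"],
        extrasaction="ignore",  # silently drop extra keys such as "word"
        lineterminator="\n",
    )
    writer.writeheader()
    seen = set()
    for row in rows:
        if row["id"] in seen:
            continue  # the same headword was already exported
        seen.add(row["id"])
        writer.writerow(row)
    return buffer.getvalue()


# Toy rows with hypothetical ids; "buddho" and "buddhassa" share a headword.
analysed = [
    {"word": "buddho", "id": 2, "lemma_1": "buddha 2", "pos": "masc", "meaning_1": "the Buddha"},
    {"word": "dhammo", "id": 3, "lemma_1": "dhamma 1", "pos": "masc", "meaning_1": "teaching"},
    {"word": "buddhassa", "id": 2, "lemma_1": "buddha 2", "pos": "masc", "meaning_1": "the Buddha"},
]
print(unique_words_csv(analysed))
```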

Technical Notes
• AI should be able to read full database for context-aware disambiguation
• All Tipitaka words have corresponding constructions in the lookup table

khemarato · 2025-11-07T03:25:45Z

khemarato
Nov 7, 2025

Thanks for writing this up, would be a great project for automatically generating word-by-word glosses suitable for translation help, chanting, memorization, etc. 😊

I'm a little confused by your technical note about full DB access for the AI. Shouldn't just giving them the relevant records in the context window be good enough? Why give the model the ability to populate its own context window with extra information? From the research I've seen, irrelevant context can significantly degrade performance.

1 reply

Devamitta Nov 7, 2025
Collaborator Author

What I meant is:

"AI should be able to read full database for context-aware disambiguation"

So it can distinguish between various meanings of the words.

But maybe including only those various meanings in the prompt will be enough. For example, for the word buddhassa it would provide only buddha 1 and buddha 2.

However, I still believe that if the AI can read the whole db, it can provide a better-quality analysis.
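
As a sketch of the "only the relevant senses" option, the prompt can be assembled from just the candidate records for each word. The function name `build_prompt`, the prompt wording and the sample senses are all hypothetical.

```python
def build_prompt(sentence: str, candidates: dict[str, list[dict]]) -> str:
    """Build a disambiguation prompt that lists only the candidate senses."""
    lines = [
        "Choose the best dictionary sense for each word in this Pali sentence.",
        f"Sentence: {sentence}",
        "Candidate senses:",
    ]
    for word, senses in candidates.items():
        lines.append(f"{word}:")
        for sense in senses:
            lines.append(
                f"  - id={sense['id']} {sense['lemma_1']} "
                f"({sense['pos']}): {sense['meaning_1']}"
            )
    lines.append("Answer with exactly one id per word.")
    return "\n".join(lines)


# Toy senses with hypothetical ids and meanings.
senses = {
    "buddhassa": [
        {"id": 1, "lemma_1": "buddha 1", "pos": "pp", "meaning_1": "awakened"},
        {"id": 2, "lemma_1": "buddha 2", "pos": "masc", "meaning_1": "the Buddha"},
    ]
}
prompt = build_prompt("namo buddhassa", senses)
print(prompt)
```

Nothing else from the database enters the context window, so the prompt stays small, and the full-database variant can be compared against it on the same sentences.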

Devamitta · 2025-12-25T06:35:55Z

Devamitta
Dec 25, 2025
Collaborator Author

Bhante @bdhrs, could you please describe in a few sentences how you would approach this task and what you would start with?

1 reply

bdhrs Dec 25, 2025
Maintainer

@Devamitta I would make a FastMCP server that provides a few useful tools to an agent. Even just one to start with, something like fetch_headwords_and_grammar.

Example workflow

1. Take a Pāḷi sentence, remove punctuation and split it into single inflected words.
2. Look up each word in the dpd.db lookup table.
3. Grab the headwords and grammar columns from the lookup table.
4. Look up those headword ids in the dpd_headwords table.
5. Return a simplified, structured set of data in JSON, TSV or the like.
6. Package all the db data into a carefully worded prompt, basically saying: decide the best meanings from the given data, and give a response packaged as a markdown table with word, pos, meaning, grammar, or whatever you want the response to look like.
7. Send the request to an API, and see what the results are like.
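
The database half of this workflow might be sketched as below, using an in-memory SQLite database in place of dpd.db. It assumes the lookup table's headwords and grammar columns hold JSON lists; the rows and ids are toy values, and the body of `fetch_headwords_and_grammar` is only one guess at what such a tool could return.

```python
import json
import re
import sqlite3


def make_toy_db() -> sqlite3.Connection:
    """Create toy lookup and dpd_headwords tables (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lookup (lookup_key TEXT PRIMARY KEY, headwords TEXT, grammar TEXT)")
    conn.execute("CREATE TABLE dpd_headwords (id INTEGER PRIMARY KEY, lemma_1 TEXT, pos TEXT, meaning_1 TEXT)")
    conn.execute(
        "INSERT INTO lookup VALUES (?, ?, ?)",
        ("buddhassa", "[1, 2]", '["masc dat sg", "masc gen sg"]'),
    )
    conn.executemany(
        "INSERT INTO dpd_headwords VALUES (?, ?, ?, ?)",
        [(1, "buddha 1", "pp", "awakened"), (2, "buddha 2", "masc", "the Buddha")],
    )
    return conn


def fetch_headwords_and_grammar(conn: sqlite3.Connection, sentence: str) -> str:
    """Return JSON with the candidate headwords and grammar for each word."""
    payload = []
    for word in re.findall(r"[^\W\d_]+", sentence.lower()):
        row = conn.execute(
            "SELECT headwords, grammar FROM lookup WHERE lookup_key = ?", (word,)
        ).fetchone()
        if row is None:
            # Unknown word: keep it in the output so the gap stays visible.
            payload.append({"word": word, "grammar": [], "candidates": []})
            continue
        ids = json.loads(row[0])
        heads = []
        if ids:
            marks = ",".join("?" * len(ids))
            heads = conn.execute(
                "SELECT id, lemma_1, pos, meaning_1 FROM dpd_headwords "
                f"WHERE id IN ({marks}) ORDER BY id",
                ids,
            ).fetchall()
        payload.append({
            "word": word,
            "grammar": json.loads(row[1]),
            "candidates": [
                {"id": h[0], "lemma_1": h[1], "pos": h[2], "meaning_1": h[3]} for h in heads
            ],
        })
    return json.dumps(payload, ensure_ascii=False, indent=2)


conn = make_toy_db()
print(fetch_headwords_and_grammar(conn, "namo buddhassa"))
```

The returned JSON is what would be packaged into the prompt; the API call itself is left out here.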

Once you've got the process working, iterate and refine it to get the exact output you want again and again, including more complicated edge cases.

Most of the tooling already exists in the webapp, and just needs to be modified for the above process.

bdhrs · 2025-12-25T13:00:49Z

bdhrs
Dec 25, 2025
Maintainer

Just as an experiment, I fed the link for this conversation into Gemini CLI conductor and it produced a pretty useful working version in minutes.

Please feel free to refine the database calls and prompts for the LLM to exactly suit your needs. At the moment it's just using a free OpenRouter model, xiaomi/mimo-v2-flash:free, nothing fancy, but quite fast, under 5 seconds.

You can test it out by running

uv run python exporter/mcp/ai_pali_translate.py

The output will appear in the terminal and be saved in markdown format to the exporter/mcp/output folder.

This is a typical example of the output:

Analysis of: tatra, bhikkhave, ye te makkaṭā abālajātikā alolajātikā, te taṃ lepaṃ disvā ārakā parivajjanti.

Of course. Here is the translation and analysis of the provided Pāḷi sentence.

English Translation

Fluent Translation:
"Here, monks, those monkeys who are not foolish or restless, upon seeing that sticky paste, avoid it from a distance."

Literal Translation:
"In that (place), monks, whoever those monkeys (are) not-foolish-type not-restless-type, they, that sticky-paste having-seen, from-far they-avoid."

Word-by-Word Analysis

| ID | Word in Sentence | Grammar | Meaning | Construction | Root |
|---|---|---|---|---|---|
| 29605 | tatra | ind, adv | there; in that place | ta + tra | |
| 49868 | bhikkhave | masc, voc pl | monks! | bhikkhu + ave | √bhikkh (beg) |
| 54158 | ye | pron, masc nom pl | those who | ya + e | |
| 31134 | te | pron, masc nom pl | they; those | ta + e | |
| 50445 | makkaṭā | masc, nom pl | monkeys | | |
| 7192 | abālajātikā | adj, nom pl | not foolish; not childish | na + bāla + jātika | |
| 9262 | alolajātikā | adj, nom pl | not restless; not agitated | na + lola + jātika | |
| 31134 | te | pron, masc nom pl | they; those | ta + e | |
| 30216 | taṃ | pron, nt acc sg | that | ta + aṃ | |
| 55949 | lepaṃ | masc, acc sg | sticky paste; tar | √lip + a | √lip (smear, stick) |
| 32674 | disvā | abs | having seen | √dis + tvā | √dis (see) |
| 12339 | ārakā | ind, prep | far away (from) | āraka + ā | ā √ar (go, move) |
| 44204 | parivajjanti | pr, 3rd pl | they avoid; they shun | pari + vajja + ti | √vajj (avoid) |

Grammatical Commentary

Complex Subject: The subject of the main verb parivajjanti is a compound phrase: ye te makkaṭā abālajātikā alolajātikā, te. This translates to "those monkeys who are not foolish or restless, they...". The ye... te construction ("whoever... those") is a common way to define a specific group that will be the subject of the sentence.
Compound Adjectives: The words abālajātikā and alolajātikā are dvandva or coordinative compounds. The negation prefix a- applies to each member: a-bāla (not foolish) and a-lola (not restless), both modifying jātika (of the type/kind).
Absolutive Case: disvā is the absolutive (or indeclinable participle) of the root √dis (to see). It indicates an action that precedes the main verb. Here, the monkeys first see the paste, and then they avoid it. A literal translation is "having seen."
Sandhi: The word ārakā is an adverb in the ablative sense, meaning "from far away." The final ā is the remnant of the ablative case ending. Parivajjanti is formed from the prefix pari- (around, away from) and the verb vajjati (to avoid), which comes from the root √vajj.

0 replies

Devamitta · 2025-12-26T06:27:37Z

Devamitta
Dec 26, 2025
Collaborator Author

It already looks quite good!

Next steps may include:

Pali analysis and formatting-related:

Integrate the deconstructor data to ensure a correct sandhi breakdown, using the lookup table columns "lookup_key" and the corresponding data in the "deconstructor" column.
Also add the sandhi itself to the table, especially when it is already in the database (dpd_headwords table). Clearly mark the parts of a sandhi as such (see example 4 in exporter/mcp/examples.md).
The same applies to compounds: mark them as such, so that they can be broken down and their parts added to the table as well (see example 3 in exporter/mcp/examples.md).
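
A possible shape for the deconstructor lookup, assuming the deconstructor column stores a JSON list of "a + b" style splits. The function name `split_word` and the toy rows are hypothetical, and the real column format should be checked against dpd.db.

```python
import json
import sqlite3


def make_toy_lookup() -> sqlite3.Connection:
    """Create a toy lookup table with two sandhi entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lookup (lookup_key TEXT PRIMARY KEY, deconstructor TEXT)")
    conn.executemany(
        "INSERT INTO lookup VALUES (?, ?)",
        [("natthi", '["na + atthi"]'), ("cāpi", '["ca + api"]')],
    )
    return conn


def split_word(conn: sqlite3.Connection, word: str) -> list[list[str]]:
    """Return every stored breakdown of a word as a list of its parts."""
    row = conn.execute(
        "SELECT deconstructor FROM lookup WHERE lookup_key = ?", (word,)
    ).fetchone()
    if row is None or not row[0]:
        return []  # no breakdown known for this word
    return [
        [part.strip() for part in option.split("+")]
        for option in json.loads(row[0])
    ]


conn = make_toy_lookup()
print(split_word(conn, "natthi"))
```

Each part can then be looked up like any other word and added to the table as a row marked as a sandhi or compound member.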

GUI-related:

Create a GUI so that we can select another meaning of a word from the database and correct the sentence translation if necessary. Make it a new tab in gui2/main.py.
Add an option to export CSV for Anki, with settings for which data to export from the db (see scripts/export/anki_csv.py for an example).
Add an option to export MD or PDF.
Add an option to select a whole sutta to analyse, simply by choosing the book and the name of the sutta (similar to def make_words_to_add_list_sutta in gui/functions_db_dps.py).

For the online format:

I believe it could be part of an exporter/web app, just another tab with all the functionality of the above GUI. We only need to find a free model with generous limits so that many people can use it. Alternatively, we could even make it possible to use personal tokens and switch between models and providers.

Further development

An AI-powered tool that goes through a text and tries to attribute each word to DPD entries, pointing out entries without examples or with missing meanings.

0 replies
