
English-Arabic Dataset and NLP Processing #85

Merged
adelpro merged 20 commits into adelpro:develop from MostafaOsmanFathi:feat/english-to-arabic-mapping-dataset-nlp
Mar 15, 2026

Conversation

@MostafaOsmanFathi
Contributor

English-Arabic Dataset and NLP Processing

Overview:

  • Generated the English-Arabic dictionary dataset from the Colored English Word by Word Translation.
  • Tested the dataset with several techniques to ensure there is no misalignment in the English-Arabic mapping.

NLP-based Grouping:

  • Implemented grouping of near-meaning words using a simple NLP library.

  • Extracted the most informative token from each sentence (a minimal sketch follows this list). Examples:

    "you alone": "alone",
    "they are alone": "alone"

    so these two records are merged into one

  • Sentences that have no meaningful word are saved to debug_no_meaning_sentence.json for later processing. Example entries:

    "those who",
    "and out of what",
    "and those who",
    "in what",
    "to you",
    "and what"
  • These sentences are skipped during processing for now but kept for potential future handling.
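
For illustration, a minimal sketch of this grouping, assuming a plain stop-word filter rather than the specific NLP library used in the script (STOP_WORDS, extractKeyToken, and groupByKeyToken are illustrative names):

// Illustrative only: drop stop words, keep the last remaining token as the key,
// and merge dictionary entries that share the same key token.
const STOP_WORDS = new Set([
  'you', 'they', 'are', 'and', 'of', 'what', 'who', 'in', 'to', 'those', 'out',
]);

function extractKeyToken(sentence) {
  const tokens = sentence
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token && !STOP_WORDS.has(token));
  // Nothing meaningful left: the caller logs the sentence to debug_no_meaning_sentence.json.
  return tokens.length > 0 ? tokens[tokens.length - 1] : null;
}

function groupByKeyToken(entries) {
  const grouped = new Map();
  for (const { english, arabic } of entries) {
    const key = extractKeyToken(english);
    if (!key) continue; // skipped for now, saved separately for later handling
    const bucket = grouped.get(key) ?? { english: key, arabic: new Set() };
    for (const word of arabic) bucket.arabic.add(word);
    grouped.set(key, bucket);
  }
  return [...grouped.values()].map((e) => ({ english: e.english, arabic: [...e.arabic] }));
}

// "you alone" and "they are alone" both collapse to the key token "alone".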

Synonyms Extraction:

  • Used WordNet to extract the top 5 nearest synonyms for each word produced by the NLP step (a rough sketch follows this list).
  • Populated the synonyms field in the dataset with these top-ranked synonyms.
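
A rough sketch of the synonym lookup, assuming the natural package's WordNet bindings with the WordNet data installed (the actual script may use a different wrapper; topSynonyms is an illustrative helper):

import natural from 'natural';

const wordnet = new natural.WordNet();

// Collect synonyms across all senses and keep the first five distinct ones.
function topSynonyms(word) {
  return new Promise((resolve) => {
    wordnet.lookup(word, (results) => {
      const synonyms = new Set();
      for (const result of results) {
        for (const synonym of result.synonyms) {
          if (synonym !== word) synonyms.add(synonym.replace(/_/g, ' '));
        }
      }
      resolve([...synonyms].slice(0, 5));
    });
  });
}

// topSynonyms('straight') might yield something like ['directly', 'neat', 'true', ...].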

Next Steps / Feedback:

  • Please review the dataset and the NLP/synonym processing.
  • Let me know if you need further changes or additional handling for the sentences without meaningful tokens.

Here is the dataset if you want to check it directly: english-arabic-dictionary.json

@coderabbitai

coderabbitai Bot commented Mar 11, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f7330ae-5ffc-4303-bfc9-7c5e3da3b08b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@adelpro
Owner

adelpro commented Mar 11, 2026

Salam @MostafaOsmanFathi ,
Thank you for your effort on this PR.
I have prepared a documentation file explaining the process of implementing the English-to-Arabic search feature in the Quran search engine. Please take a look when you have time:
https://github.com/adelpro/quran-search-engine/blob/develop/docs/english-arabic-search.md

I would appreciate your feedback and thoughts on it.

Also, please consider removing the pnpm-lock.yaml file, as we migrated to Yarn in the latest version.

Baraka Allahu fik.

@MostafaOsmanFathi
Contributor Author

Thank you for the documentation and for explaining the implementation process. I’ve synced my branch with the develop branch and resolved the merge conflict by removing the pnpm-lock.yaml file, since the project has migrated to Yarn.

I’ve reviewed the docs, and everything looks good from my side.

Wa fika Baraka Allahu.

@MostafaOsmanFathi
Contributor Author

Before merging the pull request, could you please check the following and let me know how you would like me to handle them?

Some words in the dataset have no meaning assigned. Should I keep them exactly as they appear in the reference, remove them, or leave them with empty strings as they are now?

You can see an example at line 519 in english-arabic-dictionary.json.

I’d like to know the preferred approach since I’m not fully familiar with all the intended use cases for this dataset.

I have also extracted all entries that have no meaning and collected them here:
debug_no_meaning_sentence.json

@adelpro
Owner

adelpro commented Mar 11, 2026

Hey @MostafaOsmanFathi,

A couple of cleanup tasks for the entries file:

1. Delete the entry with an empty english field.

2. For all remaining entries, apply two normalizations:

  • english: consolidate all English translations into a single field (already the current structure, just ensure consistency).
  • arabic: reduce word forms down to their roots only, using the root data already available in the package (a sketch of this transformation follows the example below).

Example:

Before:

{
  "english": "straight",
  "arabic": ["المستقيم", "مستقيم", "مستقيما", "قيما", "سواء", "صراط", "أقوم", "فاستقيموا", "يستقيم"],
  "synonyms": ["directly", "neat", "full-strength", "true", "unbent"]
}

After:

{
  "english": ["straight", "directly", "neat", "full-strength", "true", "unbent"],
  "arabic": ["قوم", "صرط", "سوي"]
}
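
A minimal sketch of this transformation, assuming a formToRoot lookup built from the root data in word-map.json (normalizeEntry and formToRoot are illustrative names):

// Merge english and synonyms into one english array, and map each Arabic form to its root.
function normalizeEntry(entry, formToRoot) {
  const english = [...new Set([].concat(entry.english, entry.synonyms ?? []))];
  const arabic = [...new Set(entry.arabic.map((form) => formToRoot[form] ?? form))];
  return { english, arabic };
}

// With formToRoot mapping e.g. "المستقيم" to "قوم", the "straight" entry above becomes the "After" shape.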

Can you handle this, @MostafaOsmanFathi?

Baraka Allahu fik, akhi.

@MostafaOsmanFathi
Contributor Author

I’ve implemented the changes you requested. However, while working with the dataset I noticed some issues that might affect the accuracy of the results.

First, the intended meaning of some English words can differ from the roots extracted from the Arabic text. For example:

"english": ["allah"],
"arabic": [
  "أله",
  "ضلل",
  "وله",
  "كون",
  "فلل",
  "عند",
  "من",
  "عذب",
  "توب",
  "علم",
  "أتي"
]

As you can see, roots like ضلل don’t correspond to the meaning of “Allah.” This happens because when extracting the most relevant word from a sentence, the algorithm may pick الله instead of another intended word such as عذاب.

Another issue is that word-map.json contains a number of inconsistencies, so it wasn’t reliable enough for extracting word roots. Because of that, I used an external API to perform the root extraction:

const controller = new AbortController(); // used to abort slow requests
const response = await fetch('https://rootna.net/api/process-word', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ word }),
  signal: controller.signal,
});

I’m not sure if relying on this API fits your project’s requirements. If you prefer using word-map.json, I also have a version that uses it, but as mentioned earlier it produced less reliable results.

In addition, I refactored the processing to use asynchronous concurrent tasks (with p-limit). This significantly speeds up the generation of Arabic roots since the script makes a large number of API calls. I applied the same idea to the WordNet synonym extraction as well, since that step also takes a considerable amount of time.
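
For reference, a small sketch of that concurrency pattern with p-limit (the limit value and the fetchRoot helper are illustrative):

import pLimit from 'p-limit';

const limit = pLimit(10); // cap the number of in-flight API requests

// fetchRoot(word) wraps the fetch call shown above and returns the extracted root.
const roots = await Promise.all(words.map((word) => limit(() => fetchRoot(word))));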

If you’d like me to adjust anything or change the approach, please let me know.

@adelpro
Copy link
Copy Markdown
Owner

adelpro commented Mar 12, 2026


Thanks a lot for your efforts, and especially for your patience. The word-map.json is derived from The Quranic Arabic Corpus, released under a GPL license: https://corpus.quran.com/license.jsp

For the newly suggested https://rootna.net/api/process-word, I have concerns about its license.

For the data flow, I suggest updating word-map.json first, then using it to update the English-to-Arabic dataset, so that we keep both.

@MostafaOsmanFathi
Contributor Author

Ok, I think that would be better. I will update the word map using that API.
Could you please clarify how exactly I should update the word-map.json file? Should I add a new field for the data, or overwrite an existing one such as the lemma or root field?

After updating it, I will proceed with extracting the roots from it.

@MostafaOsmanFathi
Contributor Author

I have updated word-map.json to include all the results from the API. I also added some missing records that were needed by english-to-arabic-builder, along with their roots.

Additionally, I refactored the english-to-arabic-builder script so that instead of fetching the roots from the API, it now reads them directly from word-map.json.
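
A rough sketch of that lookup, assuming each word-map.json entry exposes its surface form and root (field and variable names are illustrative):

import { readFileSync } from 'node:fs';

// Build a form -> root lookup once, so the builder no longer needs the external API.
const wordMap = JSON.parse(readFileSync('word-map.json', 'utf8'));
const formToRoot = new Map(wordMap.map((entry) => [entry.word, entry.root]));

const root = formToRoot.get('المستقيم') ?? null; // e.g. "قوم" if present in the map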

Please check the changes and let me know if any further modifications are needed. I’d be happy to help.

@adelpro
Owner

adelpro commented Mar 13, 2026

This is it, exceptional work @MostafaOsmanFathi, Baraka Allahu fik.

Before merging this PR, please:

  • Can we use the quran.json data to replace ayahs.csv?
  • Group the related data for english-to-arabic-builder in one folder and name it english-to-arabic-builder-data
  • Update the documentation to reference your work
  • I suggest naming the json: quran-english-arabic-roots.json
    • Purpose : Quran-related (your search engine)
    • Source language : English
    • Target content : Arabic roots (not full words)

@MostafaOsmanFathi
Contributor Author

Thank you

I’ve implemented all the requested changes:

  • Replaced ayahs.csv with quran.json for word-level data.
  • Grouped related data for english-to-arabic-builder in a new folder named english-to-arabic-builder-data.
  • Updated the documentation to reference the changes and added phonetic-inverted-index-generation.md details.
  • Renamed the JSON to quran-english-arabic-roots.json with English as the source language and Arabic roots as the target content.

All commits are ready for review. Please let me know if any further adjustments are needed.

adelpro merged commit 47c8baa into adelpro:develop on Mar 15, 2026
5 of 7 checks passed
@adelpro
Owner

adelpro commented Mar 15, 2026


Baraka Allahu fik @MostafaOsmanFathi, the data is ready thanks to you. I will continue the implementation, inshallah.

Vexxo-Dev pushed a commit to Vexxo-Dev/quran-search-engine that referenced this pull request Mar 15, 2026
…arabic-mapping-dataset-nlp

English-Arabic Dataset and NLP Processing
