
English-Arabic Dataset and NLP Processing #85

Merged
adelpro merged 20 commits into adelpro:develop from MostafaOsmanFathi:feat/english-to-arabic-mapping-dataset-nlp
Mar 15, 2026

Conversation

@MostafaOsmanFathi
Contributor

English-Arabic Dataset and NLP Processing

Overview:

  • Generated the English-Arabic dictionary dataset from the Colored English Word by Word Translation.
  • Tested the dataset with several techniques to ensure there is no misalignment in the English-Arabic mapping.

NLP-based Grouping:

  • Implemented grouping of near-meaning words using a simple NLP library.

  • Extracted the most informative token from each sentence (a minimal sketch follows this list). Examples:

    "you alone": "alone",
    "they are alone": "alone"

    so these two records are merged into one

  • Sentences that have no meaningful word are saved to debug_no_meaning_sentence.json for later processing. Example entries:

    "those who",
    "and out of what",
    "and those who",
    "in what",
    "to you",
    "and what"
  • These sentences are skipped during processing for now but kept for potential future handling.
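
For illustration, a minimal sketch of this grouping, assuming a plain stop-word filter rather than the specific NLP library used in the script (STOP_WORDS, extractKeyToken, and groupByKeyToken are illustrative names):

// Illustrative only: drop stop words, keep the last remaining token as the key,
// and merge dictionary entries that share the same key token.
const STOP_WORDS = new Set([
  'you', 'they', 'are', 'and', 'of', 'what', 'who', 'in', 'to', 'those', 'out',
]);

function extractKeyToken(sentence) {
  const tokens = sentence
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token && !STOP_WORDS.has(token));
  // Nothing meaningful left: the caller logs the sentence to debug_no_meaning_sentence.json.
  return tokens.length > 0 ? tokens[tokens.length - 1] : null;
}

function groupByKeyToken(entries) {
  const grouped = new Map();
  for (const { english, arabic } of entries) {
    const key = extractKeyToken(english);
    if (!key) continue; // skipped for now, saved separately for later handling
    const bucket = grouped.get(key) ?? { english: key, arabic: new Set() };
    for (const word of arabic) bucket.arabic.add(word);
    grouped.set(key, bucket);
  }
  return [...grouped.values()].map((e) => ({ english: e.english, arabic: [...e.arabic] }));
}

// "you alone" and "they are alone" both collapse to the key token "alone".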

Synonyms Extraction:

  • Used WordNet to extract the top 5 nearest synonyms for each word produced by the NLP step (a rough sketch follows this list).
  • Populated the synonyms field in the dataset with these top-ranked synonyms.
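
A rough sketch of the synonym lookup, assuming the natural package's WordNet bindings with the WordNet data installed (the actual script may use a different wrapper; topSynonyms is an illustrative helper):

import natural from 'natural';

const wordnet = new natural.WordNet();

// Collect synonyms across all senses and keep the first five distinct ones.
function topSynonyms(word) {
  return new Promise((resolve) => {
    wordnet.lookup(word, (results) => {
      const synonyms = new Set();
      for (const result of results) {
        for (const synonym of result.synonyms) {
          if (synonym !== word) synonyms.add(synonym.replace(/_/g, ' '));
        }
      }
      resolve([...synonyms].slice(0, 5));
    });
  });
}

// topSynonyms('straight') might yield something like ['directly', 'neat', 'true', ...].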

Next Steps / Feedback:

  • Please review the dataset and the NLP/synonym processing.
  • Let me know if you need further changes or additional handling for the sentences without meaningful tokens.

Here is the dataset if you want to check it directly: english-arabic-dictionary.json

@coderabbitai

coderabbitai Bot commented Mar 11, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f7330ae-5ffc-4303-bfc9-7c5e3da3b08b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@adelpro
Owner

adelpro commented Mar 11, 2026

Salam @MostafaOsmanFathi ,
Thank you for your effort on this PR.
I have prepared a documentation file explaining the process of implementing the English-to-Arabic search feature in the Quran search engine. Please take a look when you have time:
https://github.com/adelpro/quran-search-engine/blob/develop/docs/english-arabic-search.md

I would appreciate your feedback and thoughts on it.

Also, please consider removing the pnpm-lock.yaml file, as we migrated to Yarn in the latest version.

Baraka Allahu fik.

@MostafaOsmanFathi
Contributor Author

Thank you for the documentation and for explaining the implementation process. I’ve synced my branch with the develop branch and resolved the merge conflict by removing the pnpm-lock.yaml file, since the project has migrated to Yarn.

I’ve reviewed the docs, and everything looks good from my side.

Wa fika Baraka Allahu.

@MostafaOsmanFathi
Contributor Author

Before merging the pull request, could you please check the following and let me know how you would like me to handle them?

Some words in the dataset have no meaning assigned. Should I keep them exactly as they appear in the reference, remove them, or leave them with empty strings as they are now?

You can see an example at line 519 in english-arabic-dictionary.json.

I’d like to know the preferred approach since I’m not fully familiar with all the intended use cases for this dataset.

I have also extracted all entries that have no meaning and collected them here:
debug_no_meaning_sentence.json

@adelpro
Owner

adelpro commented Mar 11, 2026

Hey @MostafaOsmanFathi,

A couple of cleanup tasks for the entries file:

1. Delete the entry with an empty english field.

2. For all remaining entries, apply two normalizations:

  • english: consolidate all English translations into a single field (already the current structure, just ensure consistency).
  • arabic: reduce word forms down to their roots only, using the root data already available in the package (a sketch of this transformation follows the example below).

Example:

Before:

{
  "english": "straight",
  "arabic": ["المستقيم", "مستقيم", "مستقيما", "قيما", "سواء", "صراط", "أقوم", "فاستقيموا", "يستقيم"],
  "synonyms": ["directly", "neat", "full-strength", "true", "unbent"]
}

After:

{
  "english": ["straight", "directly", "neat", "full-strength", "true", "unbent"],
  "arabic": ["قوم", "صرط", "سوي"]
}
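
A minimal sketch of this transformation, assuming a formToRoot lookup built from the root data in word-map.json (normalizeEntry and formToRoot are illustrative names):

// Merge english and synonyms into one english array, and map each Arabic form to its root.
function normalizeEntry(entry, formToRoot) {
  const english = [...new Set([].concat(entry.english, entry.synonyms ?? []))];
  const arabic = [...new Set(entry.arabic.map((form) => formToRoot[form] ?? form))];
  return { english, arabic };
}

// With formToRoot mapping e.g. "المستقيم" to "قوم", the "straight" entry above becomes the "After" shape.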

Can you handle this, @MostafaOsmanFathi?

Baraka Allahu fik, akhi.

@MostafaOsmanFathi
Contributor Author

I’ve implemented the changes you requested. However, while working with the dataset I noticed some issues that might affect the accuracy of the results.

First, the intended meaning of some English words can differ from the roots extracted from the Arabic text. For example:

"english": ["allah"],
"arabic": [
  "أله",
  "ضلل",
  "وله",
  "كون",
  "فلل",
  "عند",
  "من",
  "عذب",
  "توب",
  "علم",
  "أتي"
]

As you can see, roots like ضلل don’t correspond to the meaning of “Allah.” This happens because when extracting the most relevant word from a sentence, the algorithm may pick الله instead of another intended word such as عذاب.

Another issue is that word-map.json contains a number of inconsistencies, so it wasn’t reliable enough for extracting word roots. Because of that, I used an external API to perform the root extraction:

const controller = new AbortController(); // used to abort slow requests
const response = await fetch('https://rootna.net/api/process-word', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ word }),
  signal: controller.signal,
});

I’m not sure if relying on this API fits your project’s requirements. If you prefer using word-map.json, I also have a version that uses it, but as mentioned earlier it produced less reliable results.

In addition, I refactored the processing to use asynchronous concurrent tasks (with p-limit). This significantly speeds up the generation of Arabic roots since the script makes a large number of API calls. I applied the same idea to the WordNet synonym extraction as well, since that step also takes a considerable amount of time.
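
For reference, a small sketch of that concurrency pattern with p-limit (the limit value and the fetchRoot helper are illustrative):

import pLimit from 'p-limit';

const limit = pLimit(10); // cap the number of in-flight API requests

// fetchRoot(word) wraps the fetch call shown above and returns the extracted root.
const roots = await Promise.all(words.map((word) => limit(() => fetchRoot(word))));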

If you’d like me to adjust anything or change the approach, please let me know.

@adelpro
Copy link
Copy Markdown
Owner

adelpro commented Mar 12, 2026


Thanks a lot for your efforts, and especially for your patience. The word-map.json is derived from The Quranic Arabic Corpus, released under a GPL license: https://corpus.quran.com/license.jsp

For the newly suggested https://rootna.net/api/process-word, I have concerns about its license.

For the data flow, I suggest updating word-map.json first, then using it to update the English-to-Arabic dataset, so that we keep both.

@MostafaOsmanFathi
Contributor Author

Ok, I think that would be better. I will update the word map using that API.
Could you please clarify how exactly I should update the word-map.json file? Should I add a new field for the data, or overwrite an existing one such as the lemma or root field?

After updating it, I will proceed with extracting the roots from it.

@MostafaOsmanFathi
Contributor Author

I have updated word-map.json to include all the results from the API. I also added some missing records that were needed by english-to-arabic-builder, along with their roots.

Additionally, I refactored the english-to-arabic-builder script so that instead of fetching the roots from the API, it now reads them directly from word-map.json.
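
A rough sketch of that lookup, assuming each word-map.json entry exposes its surface form and root (field and variable names are illustrative):

import { readFileSync } from 'node:fs';

// Build a form -> root lookup once, so the builder no longer needs the external API.
const wordMap = JSON.parse(readFileSync('word-map.json', 'utf8'));
const formToRoot = new Map(wordMap.map((entry) => [entry.word, entry.root]));

const root = formToRoot.get('المستقيم') ?? null; // e.g. "قوم" if present in the map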

Please check the changes and let me know if any further modifications are needed. I’d be happy to help.

@adelpro
Owner

adelpro commented Mar 13, 2026

This is it, exceptional work @MostafaOsmanFathi, Baraka Allahu fik.

Before merging this PR, please:

  • Can we use the quran.json data to replace ayahs.csv?
  • Group the related data for english-to-arabic-builder in one folder and name it english-to-arabic-builder-data
  • Update the documentation to reference your work
  • I suggest naming the json: quran-english-arabic-roots.json
    • Purpose : Quran-related (your search engine)
    • Source language : English
    • Target content : Arabic roots (not full words)

@MostafaOsmanFathi
Contributor Author

Thank you

I’ve implemented all the requested changes:

  • Replaced ayahs.csv with quran.json for word-level data.
  • Grouped related data for english-to-arabic-builder in a new folder named english-to-arabic-builder-data.
  • Updated the documentation to reference the changes and added phonetic-inverted-index-generation.md details.
  • Renamed the JSON to quran-english-arabic-roots.json with English as the source language and Arabic roots as the target content.

All commits are ready for review. Please let me know if any further adjustments are needed.

adelpro merged commit 47c8baa into adelpro:develop on Mar 15, 2026
5 of 7 checks passed
@adelpro
Owner

adelpro commented Mar 15, 2026


Baraka Allahu fik @MostafaOsmanFathi, the data is ready thanks to you. I will continue the implementation, inshallah.

Vexxo-Dev pushed a commit to Vexxo-Dev/quran-search-engine that referenced this pull request Mar 15, 2026
…arabic-mapping-dataset-nlp

English-Arabic Dataset and NLP Processing
