English-Arabic Dataset and NLP Processing #85
|
Salam @MostafaOsmanFathi, I would appreciate your feedback and thoughts on it. Also, please consider removing the pnpm-lock.yaml file, as we migrated to Yarn in the latest version. Baraka Allahu fik. |
|
Thank you for the documentation and for explaining the implementation process. I’ve synced my branch with the I’ve reviewed the docs, and everything looks good from my side. Wa fika Baraka Allahu. |
|
Before merging the pull request, could you please check the following and let me know how you would like me to handle them? Some words in the dataset do not have meanings. Should I keep them exactly as they appear in the reference, remove them, or keep them with empty quotes as they are now? You can see an example at line 519 in I’d like to know the preferred approach since I’m not fully familiar with all the intended use cases for this dataset. I have also extracted all entries that have no meaning and collected them here: |
|
Hey @MostafaOsmanFathi, a couple of cleanup tasks for the entries file:
1. Delete the entry with an empty
2. For all remaining entries, apply two normalizations:
Example:
Before:
{
  "english": "straight",
  "arabic": ["المستقيم", "مستقيم", "مستقيما", "قيما", "سواء", "صراط", "أقوم", "فاستقيموا", "يستقيم"],
  "synonyms": ["directly", "neat", "full-strength", "true", "unbent"]
}
After:
{
  "english": ["straight", "directly", "neat", "full-strength", "true", "unbent"],
  "arabic": ["قوم", "صرط", "سوي"]
}
Can you handle this, @MostafaOsmanFathi? Baraka allaho fik akhi. |
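Judging only from the Before/After example above, the two normalizations appear to be (a) folding `synonyms` into a single `english` array and (b) replacing each Arabic surface form with its root, de-duplicated. A minimal sketch of that reading follows. The function name `normalizeEntry`, the `rootOf` lookup, and the choice to drop words with no known root are all assumptions made for illustration, not the project's actual code.

```javascript
// Sketch only: `rootOf` is a hypothetical word-to-root lookup, not the project's real data.
function normalizeEntry(entry, rootOf) {
  // Merge the headword and its synonyms into one de-duplicated `english` array.
  const head = Array.isArray(entry.english) ? entry.english : [entry.english];
  const english = [...new Set([...head, ...(entry.synonyms ?? [])])];

  // Replace each Arabic surface form with its root.
  // Assumption: words with no known root are dropped rather than kept as-is.
  const roots = entry.arabic
    .map((word) => rootOf.get(word))
    .filter((root) => root !== undefined);
  const arabic = [...new Set(roots)];

  return { english, arabic };
}

// Tiny illustrative lookup covering a subset of the words in the example.
const rootOf = new Map([
  ['المستقيم', 'قوم'],
  ['مستقيم', 'قوم'],
  ['صراط', 'صرط'],
  ['سواء', 'سوي'],
]);

const normalized = normalizeEntry(
  {
    english: 'straight',
    arabic: ['المستقيم', 'مستقيم', 'صراط', 'سواء'],
    synonyms: ['directly', 'neat'],
  },
  rootOf
);
console.log(JSON.stringify(normalized));
```

Roots come out in order of first appearance, so the order in a real run may differ from the hand-written After example.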
|
I’ve implemented the changes you requested. However, while working with the dataset I noticed some issues that might affect the accuracy of the results. First, the intended meaning of some English words can differ from the roots extracted from the Arabic text. For example:
"english": ["allah"],
"arabic": [
"أله",
"ضلل",
"وله",
"كون",
"فلل",
"عند",
"من",
"عذب",
"توب",
"علم",
"أتي"
]
As you can see, roots like Another issue is that
fetch('https://rootna.net/api/process-word', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ word }),
signal: controller.signal,
});
I’m not sure if relying on this API fits your project’s requirements. If you prefer using In addition, I refactored the processing to use asynchronous concurrent tasks (with If you’d like me to adjust anything or change the approach, please let me know. |
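The request above passes `controller.signal`, which suggests an `AbortController` timeout, and the comment mentions running the lookups as concurrent asynchronous tasks. The sketch below shows one way those two pieces could fit together. The helper names `mapWithConcurrency` and `fetchRoot`, the timeout value, and the concurrency limit are hypothetical, and the shape of the API response is not stated in this thread, so the parsed JSON is returned untouched.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    // Each runner pulls the next unclaimed index until the list is exhausted.
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Hypothetical worker: POST one word and abort if the server takes too long.
async function fetchRoot(word, timeoutMs = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch('https://rootna.net/api/process-word', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ word }),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return await res.json(); // response shape is an unknown here
  } finally {
    clearTimeout(timer); // always clear the timer, even on failure
  }
}
```

A call such as `mapWithConcurrency(words, 5, (w) => fetchRoot(w))` would then process the word list five requests at a time. Capping the number of in-flight requests is a courtesy to a third-party server as much as a performance choice.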
Thanks a lot for your efforts and especially for your patience, the For the newly suggested https://rootna.net/api/process-word, I have concerns about the license, For the data flow, I suggest updating the |
|
Ok, I think that would be better. I will update the word map using that API. After updating it, I will proceed with extracting the roots from it. |
|
I have updated word-map.json to include all the results from the API. I also added some missing records that were needed by english-to-arabic-builder, along with their roots. Additionally, I refactored the english-to-arabic-builder script so that instead of importing the roots from the API, it now reads them directly from word-map.json. Please check the changes and let me know if any further modifications are needed. I’d be happy to help. |
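The comment above mentions adding records that the builder needed but that were missing from word-map.json. One way to spot such gaps is sketched below. It assumes, purely for illustration, that word-map.json is a flat object mapping each Arabic word to its root. The real file layout is not shown in this thread, and `findMissingWords` is a hypothetical helper.

```javascript
// Assumed shape (not confirmed by the thread): { "<arabic word>": "<root>", ... }
// In a real script the map might be loaded with something like
//   JSON.parse(fs.readFileSync('word-map.json', 'utf8'))
// where the path is also an assumption.
function findMissingWords(entries, wordMap) {
  const missing = new Set();
  for (const entry of entries) {
    for (const word of entry.arabic) {
      // Own-property check so inherited keys such as "toString" never count as hits.
      if (!Object.prototype.hasOwnProperty.call(wordMap, word)) {
        missing.add(word);
      }
    }
  }
  return [...missing];
}

const sampleWordMap = { 'مستقيم': 'قوم', 'صراط': 'صرط' };
const sampleEntries = [
  { english: ['straight'], arabic: ['مستقيم', 'صراط', 'سواء'] },
  { english: ['path'], arabic: ['صراط', 'سواء'] },
];
const missingWords = findMissingWords(sampleEntries, sampleWordMap);
console.log(missingWords);
```

Running a check like this before the builder step makes it clear which words still need a root, instead of silently producing entries with fewer roots.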
|
This is it, exceptional work @MostafaOsmanFathi, baraka allaho fik. Before merging this PR, please:
|
|
Thank you. I’ve implemented all the requested changes:
All commits are ready for review. Please let me know if any further adjustments are needed. |
Baraka allaho fik @MostafaOsmanFathi, the data is ready thanks to you. I will continue the implementation, inchalah. |
English-Arabic Dataset and NLP Processing
Overview:
NLP-based Grouping:
Implemented grouping of near-meaning words using a simple NLP library.
Extracted the most valuable token or word from sentences. Examples:
so those 2 records become one
Sentences that have no meaningful word are saved to debug_no_meaning_sentence.json for later processing. Example entries:
These sentences were ignored in processing for now but saved for potential future handling.
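The grouping described above (pick the most valuable word from each phrase, merge records that share it, and set aside phrases with no meaningful word) can be illustrated with the sketch below. It does not use the NLP library from the PR, which is not named here. A small stop-word list and a "longest remaining word" rule stand in for it, and `keyWord`, `groupRecords`, and `STOP_WORDS` are hypothetical names.

```javascript
// Naive stand-in for the NLP step: both the stop-word list and the
// "longest remaining word wins" rule are assumptions for illustration.
const STOP_WORDS = new Set([
  'a', 'an', 'the', 'of', 'to', 'in', 'on', 'and', 'or',
  'is', 'it', 'that', 'this', 'for', 'with',
]);

function keyWord(phrase) {
  const tokens = phrase.toLowerCase().match(/[a-z]+/g) ?? [];
  const candidates = tokens.filter((t) => !STOP_WORDS.has(t));
  if (candidates.length === 0) return null; // no meaningful word found
  // Keep the longest candidate; on ties the earliest one stays.
  return candidates.reduce((best, t) => (t.length > best.length ? t : best));
}

function groupRecords(records) {
  const groups = new Map(); // key word -> Set of Arabic words
  const noMeaning = [];     // phrases to save for later processing
  for (const { english, arabic } of records) {
    const key = keyWord(english);
    if (key === null) {
      noMeaning.push(english);
      continue;
    }
    const merged = groups.get(key) ?? new Set();
    arabic.forEach((word) => merged.add(word));
    groups.set(key, merged);
  }
  const grouped = [...groups].map(([english, set]) => ({ english, arabic: [...set] }));
  return { grouped, noMeaning };
}

const demo = groupRecords([
  { english: 'the straight', arabic: ['مستقيم'] },
  { english: 'straight', arabic: ['صراط', 'مستقيم'] },
  { english: 'of the', arabic: ['من'] },
]);
console.log(JSON.stringify(demo));
```

In this sketch the first two records collapse into one `straight` entry, and `of the` lands in `noMeaning`, which plays the role of debug_no_meaning_sentence.json.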
Synonyms Extraction:
synonyms field in the dataset with these top-ranked synonyms.
Next Steps / Feedback:
Here is the dataset if you want to check it directly: english-arabic-dictionary.json