yenaing/myanmar-language-dataset-collection

 
 

Myanmar Language Dataset Collection

This repository collects Myanmar language datasets, covering both speech and text resources. Because such datasets are scarce and hard to find, our goal is to create a centralized reference point for researchers, developers, and language enthusiasts, and we encourage contributions from the community.

If you know of or have access to additional Myanmar language datasets not listed here, please consider contributing by submitting a pull request or opening an issue. Let's collaborate to build a comprehensive inventory of Myanmar language resources.

Myanmar Language Speech Dataset

  1. Myanmar Speech Dataset for ASR

    • This is a collection of available Myanmar speech datasets for training ASR models.
    • Datasets in this collection:
      • OpenSLR (See No.2)
      • Google Fleurs (See No.4)
    • HuggingFace Dataset
  2. Crowdsourced high-quality Burmese speech dataset (SLR80)

  3. BloomSpeech

    • HuggingFace Dataset
    • Notebook (Loading Myanmar Language)
    • Notes: Although the dataset labels it as Burmese, the data under language='mya' is actually in the Palaung (De'ang / Ta'ang / Riang) language.
  4. Google Fleurs

Myanmar Language Text Dataset

  1. Asian Language Treebank (ALT)
    • Download Page
    • HuggingFace Dataset
    • It supports translation between the following language pairs:
      • Myanmar (Burmese) To Bengali
      • Myanmar (Burmese) To English
      • Myanmar (Burmese) To Filipino
      • Myanmar (Burmese) To Hindi
      • Myanmar (Burmese) To Bahasa Indonesia
      • Myanmar (Burmese) To Japanese
      • Myanmar (Burmese) To Khmer
      • Myanmar (Burmese) To Lao
      • Myanmar (Burmese) To Malay
      • Myanmar (Burmese) To Thai
      • Myanmar (Burmese) To Vietnamese
      • Myanmar (Burmese) To Chinese (Simplified Chinese)
  2. A Corpus of Modern Burmese
  3. Myanmar Spoken and Written Language Dataset
  4. Myanmar NRC Format Dataset
  5. Myanmar Wikipedia Dataset
  6. Myanmar Book Corpus Dataset (MM-Lib)
  7. Myanmar C4 Dataset (Converted Zawgyi to Unicode)
  8. Myanmar CulturaX Dataset (Converted Zawgyi to Unicode)
  9. Myanmar CC100 Dataset (Converted Zawgyi to Unicode)
  10. ChannelMyanmar Movie Summary Dataset
  11. Myanmar Fineweb2 Dataset (Converted Zawgyi to Unicode)
  12. Myanmar Dhamma Article Dataset (Converted Zawgyi to Unicode)
  13. Myanmar Dhamma Question and Answer Dataset
  14. Myanmar Aya Dataset
  15. Burmese Microbiology 1K
  16. Mpox Myanmar
  17. Myanmar Agriculture 1K
  18. Myanmar Instruction Tuning Dataset
    • This is a collection of available Myanmar question-and-answer datasets for instruction fine-tuning of LLMs.
    • Datasets in this collection:
      • Burmese Microbiology 1K (See No.15)
      • Mpox Myanmar (See No.16)
      • Myanmar Agriculture 1K (See No.17)
      • Myanmar Aya Dataset (See No.14)
      • Myanmar Dhamma Question and Answer Dataset (See No.13)
      • Myanmar Football Dataset (See No.21)
    • HuggingFace Dataset
    • Dataset Generating Notebook
  19. Myanmar Social Media Sentiment Analysis Dataset
  20. myXNLI - Myanmar Natural Language Inference Corpus
  21. Myanmar Football Dataset
  22. Myanmar Facebook Flores Dataset
  23. Myanmar Text Segmentation Dataset
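
The Myanmar Instruction Tuning Dataset (No.18) combines several question-and-answer datasets into one collection. The sketch below shows one way such a merge could work; it is not the repository's actual generating notebook. The column names (`question`, `answer`), the output fields (`instruction`, `output`, `source`), the helper names, and the sample rows are all assumptions made for illustration, and the real datasets may use different schemas.

```python
from typing import Dict, List

Record = Dict[str, str]


def to_instruction_records(
    name: str,
    rows: List[Record],
    question_key: str = "question",  # assumed column name
    answer_key: str = "answer",      # assumed column name
) -> List[Record]:
    """Map one source's Q&A rows to a shared instruction format."""
    records = []
    for row in rows:
        question = str(row.get(question_key, "")).strip()
        answer = str(row.get(answer_key, "")).strip()
        # Skip rows where either side is missing or blank.
        if not question or not answer:
            continue
        records.append(
            {"instruction": question, "output": answer, "source": name}
        )
    return records


def merge_sources(sources: Dict[str, List[Record]]) -> List[Record]:
    """Concatenate all sources, dropping exact duplicate Q&A pairs."""
    merged = []
    seen = set()
    for name, rows in sources.items():
        for record in to_instruction_records(name, rows):
            key = (record["instruction"], record["output"])
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged


# Placeholder rows only; these are not taken from the listed datasets.
sample_sources = {
    "Mpox Myanmar": [
        {"question": "placeholder question 1", "answer": "placeholder answer 1"},
        {"question": "   ", "answer": "dropped: blank question"},
    ],
    "Myanmar Agriculture 1K": [
        {"question": "placeholder question 2", "answer": "placeholder answer 2"},
        {"question": "placeholder question 1", "answer": "placeholder answer 1"},
    ],
}

merged = merge_sources(sample_sources)
for record in merged:
    print(record["source"], "|", record["instruction"])
```

Keeping a `source` field on every record makes it possible to trace each example back to its original dataset, or to filter one source out later without regenerating the whole collection.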
