
Pretraining dataset

Data sources come from the following categories:

  1. Web crawler dataset:
  • UET website (University of Engineering and Technology): tuyensinh.uet.vnu.edu.vn; new.uet.vnu.edu.vn
  • HUS website (University of Science): hus.vnu.edu.vn
  • UEB website (University of Economics and Business): ueb.vnu.edu.vn
  • IS website (International School): is.vnu.edu.vn
  • Education website (University of Education): education.vnu.edu.vn
  • VNU Press website: press.vnu.edu.vn
    List of crawled domains
  2. CC100:
    link to CC100 vi
  3. C4_vi:
    link to C4_vi
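Restricting the crawler to the VNU sites listed above amounts to a host allowlist check. A minimal sketch (the `ALLOWED_DOMAINS` set and `in_scope` helper are hypothetical names, not from the repository):

```python
from urllib.parse import urlparse

# Hypothetical allowlist built from the VNU domains listed above.
ALLOWED_DOMAINS = {
    "tuyensinh.uet.vnu.edu.vn",
    "new.uet.vnu.edu.vn",
    "hus.vnu.edu.vn",
    "ueb.vnu.edu.vn",
    "is.vnu.edu.vn",
    "education.vnu.edu.vn",
    "press.vnu.edu.vn",
}

def in_scope(url: str) -> bool:
    """Keep only URLs whose host is on the crawl allowlist."""
    return urlparse(url).netloc.lower() in ALLOWED_DOMAINS
```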

Tokenizer

We use the tokenizer from meta-llama/Llama-3.1-8B.

Training models

We apply continual pretraining to meta-llama/Llama-3.1-8B on our processed dataset. Training lasted 10 days on 2 NVIDIA A100 GPUs and reached an average training loss of 1.9.
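For intuition, a mean cross-entropy loss of 1.9 (in nats per token) corresponds to a perplexity of exp(1.9) ≈ 6.7:

```python
import math

def perplexity(avg_ce_loss_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(avg_ce_loss_nats)

print(round(perplexity(1.9), 2))  # exp(1.9) ≈ 6.69
```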

Filtering models

  1. Quality classification model:
  • Model type: deberta-v2
  • Params: 278 M
  • Size: 1.11 GB
  2. Domain classification model:
  • Model type: fasttext
  • Size: 2.02 GB
  3. Toxic detection model:
  • Model type: RoBERTa with classification layers
  • Params: 136 M
  • Size: 544 MB
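The classifiers above can be combined into a single keep/drop decision per document. A minimal sketch with stub scorers standing in for the real DeBERTa/RoBERTa models; the threshold values in `FilterConfig` are hypothetical, not taken from the repository:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FilterConfig:
    # Hypothetical thresholds; the real classifiers above produce the scores.
    min_quality: float = 0.5
    max_toxicity: float = 0.2

def keep(doc: str,
         quality_fn: Callable[[str], float],
         toxicity_fn: Callable[[str], float],
         cfg: FilterConfig = FilterConfig()) -> bool:
    """A document survives if its quality score is high enough
    and its toxicity score is low enough."""
    return quality_fn(doc) >= cfg.min_quality and toxicity_fn(doc) <= cfg.max_toxicity

# Stub scorers standing in for the quality and toxicity classifiers.
docs = ["clean educational text", "short spam!!!"]
kept = [d for d in docs
        if keep(d,
                quality_fn=lambda d: 0.9 if "educational" in d else 0.1,
                toxicity_fn=lambda d: 0.0)]
```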

Deduplication

Locality-sensitive hashing (LSH) with MinHash.
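MinHash approximates the Jaccard similarity between documents from fixed-size signatures, so near-duplicates can be found without comparing full shingle sets. A self-contained stdlib sketch (function names and the 64-hash signature size are illustrative, not from the repository):

```python
import hashlib

NUM_HASHES = 64  # signature length; more hashes give a tighter estimate

def shingles(text: str, k: int = 3) -> set[str]:
    """Character k-gram shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(text: str) -> list[int]:
    """MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)))
    return sig

def est_jaccard(a: str, b: str) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / NUM_HASHES
```

In a full LSH pipeline the signature is split into bands and documents that collide in any band become candidate duplicate pairs; the sketch above shows only the signature and similarity estimate.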
