Data sources come from the following categories:
- Web crawler dataset:
  - UET website (VNU University of Engineering and Technology): tuyensinh.uet.vnu.edu.vn; new.uet.vnu.edu.vn
  - HUS website (VNU University of Science): hus.vnu.edu.vn
  - UEB website (VNU University of Economics and Business): ueb.vnu.edu.vn
  - IS website (VNU International School): is.vnu.edu.vn
  - Education website (VNU University of Education): education.vnu.edu.vn
  - VNU Press website: press.vnu.edu.vn
  List of crawled domains
- CC100: link to CC100 vi
- C4_vi: link to C4_vi
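Both public corpora are available on the Hugging Face Hub. The sketch below shows one way to stream them with the `datasets` library; the dataset identifiers and config arguments (`cc100` with `lang="vi"`, the `vi` config of `allenai/c4`) are assumptions based on the public dataset cards, not necessarily the exact snapshots linked above.

```python
from datasets import load_dataset

# Vietnamese split of CC100 (identifier and `lang` argument follow the public dataset card).
cc100_vi = load_dataset("cc100", lang="vi", split="train", streaming=True)

# Vietnamese portion of C4 (the multilingual configs live under allenai/c4).
c4_vi = load_dataset("allenai/c4", "vi", split="train", streaming=True)

# Peek at a few documents without downloading the full corpora.
for example in cc100_vi.take(3):
    print(example["text"][:80])
```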
We use the tokenizer from meta-llama/Llama-3.1-8B.
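For reference, this tokenizer can be loaded with the Hugging Face `transformers` library (the meta-llama repository is gated, so an access token may be required):

```python
from transformers import AutoTokenizer

# Tokenizer used for all preprocessing and training in this project.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

sample = "Đại học Quốc gia Hà Nội"
ids = tokenizer(sample)["input_ids"]
print(len(ids), tokenizer.decode(ids))
```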
We apply continual pretraining to meta-llama/Llama-3.1-8B on our processed dataset. The training run lasted 10 days on 2 NVIDIA A100 GPUs and reached an average training loss of 1.9.
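The exact training configuration is not reproduced here; the sketch below only illustrates the general shape of such a continual-pretraining run with the Hugging Face `Trainer`. The data path, sequence length, and hyperparameters are placeholder assumptions, not the values we used.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# "processed_corpus/*.txt" is a placeholder for the cleaned Vietnamese corpus described above.
raw = load_dataset("text", data_files={"train": "processed_corpus/*.txt"})

def tokenize(batch):
    # 2048 is an illustrative sequence length, not necessarily the one used in training.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="llama31-8b-vi-cpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    # Causal-LM collator: labels come from the input ids themselves, no masking objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```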
- Model type: deberta-v2
  - Parameters: 278M
  - Size: 1.11 GB
- Model type: fasttext
  - Size: 2.02 GB
- Model type: RoBERTa with classification layers
  - Parameters: 136M
  - Size: 544 MB
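If these classifiers are applied to score or filter documents during preprocessing, usage follows the standard `transformers` text-classification pipeline; the checkpoint path below is a placeholder, not one of the models listed above.

```python
from transformers import pipeline

# Placeholder checkpoint; substitute the actual DeBERTa-v2 or RoBERTa classifier weights.
classifier = pipeline("text-classification", model="path/to/quality-classifier")

docs = [
    "Thông báo tuyển sinh đại học chính quy năm 2024.",
    "asdkjh qwe 123 zzz",
]
for doc, pred in zip(docs, classifier(docs, truncation=True)):
    print(pred["label"], round(pred["score"], 3), doc[:40])
```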
Locality-Sensitive Hashing: MinHash
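A minimal sketch of MinHash-based near-duplicate detection with the `datasketch` library is shown below; the shingle size, number of permutations, and similarity threshold are illustrative choices, not necessarily the settings used in our pipeline.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128, shingle=5):
    """Build a MinHash signature from character shingles of the text."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf-8"))
    return m

docs = {
    "a": "Đại học Quốc gia Hà Nội thông báo tuyển sinh năm 2024.",
    "b": "Đại học Quốc gia Hà Nội thông báo tuyển sinh năm 2024!",  # near-duplicate of "a"
    "c": "Nhà xuất bản ĐHQG Hà Nội phát hành giáo trình mới.",
}

# LSH index: documents whose estimated Jaccard similarity exceeds the threshold collide.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash(text) for key, text in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

# Near-duplicates of document "a" (the result includes "a" itself).
print(lsh.query(signatures["a"]))
```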