The OpenHathi approach to fine-tuning Llama 2 for Hindi first pre-trains the model on translation and then on next-word prediction. To do the same for Malayalam, we need to collect English-to-Malayalam translation dataset(s).
- Find existing translation datasets and post them in this issue thread.
- Translate a Wikipedia dataset to Malayalam using open-source models or Google Translate.
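
For the second task, a rough sketch of how translated passages could be stored as English-Malayalam pairs. Everything here is an assumption for illustration: the `translate` callable is a hypothetical stand-in for whichever open-source model or translation service we end up using, and the `en`/`ml` field names and the JSON Lines output are just one possible layout, not something OpenHathi prescribes.

```python
import json
from typing import Callable, Iterable, Iterator


def build_translation_records(
    english_texts: Iterable[str],
    translate: Callable[[str], str],  # hypothetical: wraps the chosen model or service
) -> Iterator[dict]:
    """Yield one {"en": ..., "ml": ...} record per non-empty English passage."""
    for text in english_texts:
        text = text.strip()
        if not text:
            continue  # skip blank passages
        malayalam = translate(text).strip()
        if not malayalam:
            continue  # skip passages where the translator returned nothing
        yield {"en": text, "ml": malayalam}


def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Write records as JSON Lines and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Malayalam script readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Passing the translator in as a function means the same pipeline should work whether we use an open-source model or Google Translate, and existing parallel datasets can be written to the same format without translation.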