Closed
Description
RedPajama is an open dataset containing more than 1.2 trillion tokens - https://www.together.xyz/blog/redpajama.
It has a permissive license and a large amount of data, so it could contribute a lot of knowledge to the project.
It would also make it possible to switch from a LLaMA-based model to a custom one or, for example, a Dolly-based one.
GitHub: https://github.com/togethercomputer/RedPajama-Data
Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T