Dataset: RedPajama #2782

@iAdanos

Description

RedPajama is an open dataset containing more than 1.2 trillion tokens: https://www.together.xyz/blog/redpajama.
It has a permissive license and a large amount of data, so it could contribute a lot of knowledge to the project.
It would also make it possible to switch from a LLaMA-based model to a custom one or, for example, a Dolly-based one.

Github: https://github.com/togethercomputer/RedPajama-Data
Huggingface: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
