Continue pretraining: use books before supervised fine tuning #4

@AmitMY

Description

There are many books and articles on Sign Language Processing, but they are not easily or widely accessible. Collecting a large dataset of text about sign language could be valuable for large-scale pretraining.

It is probably worthwhile to start from https://research.sign.mt/ and, ideally, automatically explore all references: if a reference is about sign language (judged by an LLM classifier), download it, and so on. A cleaner approach might be to add an S2ORC corpus, which is open access: download it from https://github.com/allenai/s2orc/, iterate over its ~10 million papers, and classify each abstract as sign-language-related or not. If yes, save the paper; if not, skip it.
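A minimal sketch of that filtering pass, assuming each S2ORC record is a dict carrying an `abstract` field. The keyword pre-filter here is a hypothetical stand-in for the proposed LLM classifier; a real pipeline would send borderline abstracts to the model instead of relying on term matching alone.

```python
# Hypothetical pre-filter for S2ORC records: keep papers whose abstract
# mentions a sign-language term. Stands in for the LLM classifier.

SIGN_LANGUAGE_TERMS = (
    "sign language",
    "signed language",
    "fingerspelling",
    "deaf",
)


def is_sign_language_paper(abstract: str) -> bool:
    """Return True if the abstract mentions any sign-language term.

    A cheap recall-oriented pre-filter; an LLM classifier would make
    the final keep/drop decision in the proposed pipeline.
    """
    text = abstract.lower()
    return any(term in text for term in SIGN_LANGUAGE_TERMS)


def filter_papers(papers):
    """Yield only the records whose abstract looks sign-language related."""
    for paper in papers:
        if is_sign_language_paper(paper.get("abstract", "")):
            yield paper
```

Running the pre-filter first keeps the number of LLM calls proportional to the candidate set rather than to all 10 million abstracts.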

We should do the same for books, though copyright is a problem there.

Then, each file could be saved under the DOI of that paper or book, with a .tex, .md, or .txt extension depending on the format we can extract.
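One wrinkle in the DOI-as-filename scheme: DOIs contain `/`, which filesystems reject in file names. A minimal sketch of a naming helper, where the `_` replacement is an assumption (a convention, not a standard):

```python
from pathlib import Path


def doi_to_filename(doi: str, extension: str) -> str:
    """Turn a DOI into a filesystem-safe filename.

    DOIs include '/' separators, so replace them with '_'
    (an assumed convention) and append the extracted format's
    extension (tex, md, or txt).
    """
    safe = doi.replace("/", "_")
    return f"{safe}.{extension}"


def save_paper(root: Path, doi: str, extension: str, content: str) -> Path:
    """Write extracted paper text under its DOI-derived name."""
    path = root / doi_to_filename(doi, extension)
    path.write_text(content, encoding="utf-8")
    return path
```

The mapping is reversible as long as the original DOI contained no underscores; if that matters, a sidecar metadata file per paper would be safer than relying on the filename alone.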
