Continue pretraining: use books before supervised fine tuning #4

@AmitMY

Description

There are many books and articles on Sign Language Processing, but they are not easily or widely accessible. Collecting a large dataset of text about sign language could be valuable for large-scale pretraining.

It is probably worthwhile to start from https://research.sign.mt/ and, ideally, automatically explore all references: if a reference is about sign language (judged by an LLM classifier), download it, and so on. A cleaner approach might be to add an S2ORC corpus, which is open access: download it from https://github.com/allenai/s2orc/, iterate over its ~10 million papers, and classify each abstract as sign-language-related or not. If yes, save the paper; if not, skip it.
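A minimal sketch of that filtering pass, assuming each S2ORC record is a dict carrying an `abstract` field. The keyword pre-filter here is a hypothetical stand-in for the proposed LLM classifier; a real pipeline would send borderline abstracts to the model instead of relying on term matching alone.

```python
# Hypothetical pre-filter for S2ORC records: keep papers whose abstract
# mentions a sign-language term. Stands in for the LLM classifier.

SIGN_LANGUAGE_TERMS = (
    "sign language",
    "signed language",
    "fingerspelling",
    "deaf",
)


def is_sign_language_paper(abstract: str) -> bool:
    """Return True if the abstract mentions any sign-language term.

    A cheap recall-oriented pre-filter; an LLM classifier would make
    the final keep/drop decision in the proposed pipeline.
    """
    text = abstract.lower()
    return any(term in text for term in SIGN_LANGUAGE_TERMS)


def filter_papers(papers):
    """Yield only the records whose abstract looks sign-language related."""
    for paper in papers:
        if is_sign_language_paper(paper.get("abstract", "")):
            yield paper
```

Running the pre-filter first keeps the number of LLM calls proportional to the candidate set rather than to all 10 million abstracts.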

We should do the same for books, though copyright is a problem there.

Then, each file could be saved under the DOI of that paper or book, with a .tex, .md, or .txt extension depending on the format we can extract.
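One wrinkle in the DOI-as-filename scheme: DOIs contain `/`, which filesystems reject in file names. A minimal sketch of a naming helper, where the `_` replacement is an assumption (a convention, not a standard):

```python
from pathlib import Path


def doi_to_filename(doi: str, extension: str) -> str:
    """Turn a DOI into a filesystem-safe filename.

    DOIs include '/' separators, so replace them with '_'
    (an assumed convention) and append the extracted format's
    extension (tex, md, or txt).
    """
    safe = doi.replace("/", "_")
    return f"{safe}.{extension}"


def save_paper(root: Path, doi: str, extension: str, content: str) -> Path:
    """Write extracted paper text under its DOI-derived name."""
    path = root / doi_to_filename(doi, extension)
    path.write_text(content, encoding="utf-8")
    return path
```

The mapping is reversible as long as the original DOI contained no underscores; if that matters, a sidecar metadata file per paper would be safer than relying on the filename alone.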
