Description
There are many books and articles on Sign Language Processing, but they are not easily or widely accessible. Collecting a large dataset of text about sign language could be beneficial for large-scale pretraining.
It is probably worthwhile to start from https://research.sign.mt/ and, ideally, automatically explore all of its references: if an LLM classifier identifies a reference as being about sign language, we can download it. Perhaps a cleaner way would be to add an open-access S2ORC corpus that downloads from https://github.com/allenai/s2orc/, then goes over the 10 million papers and classifies each abstract as sign language or not. If it is, the paper is saved; if not, it is skipped.
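The abstract-filtering pass could be sketched roughly as below. This is only an illustration: it swaps the LLM classifier for a cheap keyword prefilter, and it assumes the papers arrive as JSON Lines records with an `abstract` field. The function names, the field name, and the keyword list are all hypothetical and would need checking against the actual S2ORC release.

```python
import json
import re
from typing import Iterable, Iterator

# Hypothetical keyword prefilter standing in for the LLM classifier.
# Phrases are matched case-insensitively; acronyms are case-sensitive
# to avoid matching unrelated lowercase strings.
PHRASES = re.compile(r"\bsign(?:ed)?\s+languages?\b|\bfingerspelling\b", re.IGNORECASE)
ACRONYMS = re.compile(r"\b(?:ASL|BSL|DGS|LSF)\b")


def is_sign_language(abstract: str) -> bool:
    """Return True if the abstract looks like it is about sign language."""
    return bool(PHRASES.search(abstract) or ACRONYMS.search(abstract))


def filter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield only the JSONL records whose abstract passes the prefilter.

    Assumes one JSON object per line with an "abstract" key (an assumption,
    not a confirmed S2ORC schema). Missing or null abstracts are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        abstract = record.get("abstract") or ""
        if is_sign_language(abstract):
            yield record
```

A prefilter like this could also run before an LLM classifier, so that the more expensive model only sees the small fraction of abstracts that mention sign language at all.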
We should also do the same for books; the problem is that copyright is involved.
Then, maybe save each file under the DOI of that paper or book, with a .tex, .md, or .txt extension depending on the format we can extract.
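One practical wrinkle with DOI-based file names is that DOIs contain `/`, which cannot appear in a file name. A minimal sketch of one way around this is to percent-encode the DOI so the mapping stays reversible. The helper names and the `papers` root directory are hypothetical, not part of any existing code.

```python
from pathlib import Path
from urllib.parse import quote, unquote

# Extensions named in the proposal above.
ALLOWED_EXTENSIONS = {".tex", ".md", ".txt"}


def doi_to_path(doi: str, extension: str, root: Path = Path("papers")) -> Path:
    """Map a DOI to a file path, percent-encoding characters such as '/'."""
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {extension}")
    # safe="" makes quote() encode '/' as well, so the result is one path component.
    return root / (quote(doi.strip(), safe="") + extension)


def path_to_doi(path: Path) -> str:
    """Recover the original DOI from a path produced by doi_to_path."""
    return unquote(path.stem)


print(doi_to_path("10.1000/xyz123", ".md").name)  # 10.1000%2Fxyz123.md
```

Percent-encoding is just one option; replacing `/` with another separator would give more readable names but would not round-trip cleanly for every DOI.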