Open
Description
Currently we have four datasets containing S2ORC, Arxiv, and PMC data:
- `lm_en_s2orc_ai2_pdf_parses`
- `lm_en_s2orc_ai2_abstracts`
- `lm_en_arxiv`
- `lm_en_pmc`

There are a few concerns:
- Overlap between abstracts and pdf parses of S2ORC. Since there are many more abstracts than full pdf parses, we probably don't want to discard all abstracts. Currently investigating whether we can match on `paper_id` to discard only the abstracts of papers that also have pdf parses.
- There is probably significant overlap between Arxiv/PMC and the S2ORC pdf parses, but the former are probably larger. So it would make sense to exclude the Arxiv/PMC-sourced papers from S2ORC. The source info exists in principle in the S2ORC dataset but seems not to be present in the datasets above. Asked Kyle if there is a way to get that info.
- The Arxiv/PMC sources are less preprocessed; for example, reference sections should be removed. This requires a custom filter/map.
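
A rough sketch of what the first and third points could look like, in plain Python over lists of dict records. Only `paper_id` comes from the discussion above; the `text` field, the function names, and the "cut at the last References/Bibliography heading" heuristic are assumptions for illustration and have not been checked against the actual datasets.

```python
import re


def drop_abstracts_with_pdf_parse(abstracts, pdf_parses, id_key="paper_id"):
    """Keep only abstracts whose paper has no full pdf parse.

    Assumes both collections expose the same identifier under `id_key`.
    """
    parsed_ids = {record[id_key] for record in pdf_parses}
    return [record for record in abstracts if record[id_key] not in parsed_ids]


# Hypothetical heuristic: a line consisting only of "References" or
# "Bibliography" marks the start of the reference section.
_REF_HEADING = re.compile(
    r"^[ \t]*(references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text):
    """Cut the text at the last reference-section heading, if there is one."""
    matches = list(_REF_HEADING.finditer(text))
    if not matches:
        return text  # no heading found, leave the document untouched
    return text[: matches[-1].start()].rstrip()


if __name__ == "__main__":
    abstracts = [
        {"paper_id": "1", "text": "abstract of paper 1"},
        {"paper_id": "2", "text": "abstract of paper 2"},
    ]
    pdf_parses = [{"paper_id": "1", "text": "full text of paper 1"}]
    print(drop_abstracts_with_pdf_parse(abstracts, pdf_parses))
    print(strip_references("Intro\nBody text\nReferences\n[1] Foo"))
```

If the data is loaded with the Hugging Face `datasets` library, the same logic could be passed to `Dataset.filter` (for the `paper_id` check) and `Dataset.map` (for the reference stripping). The set of parsed ids has to fit in memory for this approach to work.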
SaulLu