S2ORC vs Arxiv vs PMC #11

Currently we have four datasets containing S2ORC, Arxiv, and PMC data:

- `lm_en_s2orc_ai2_pdf_parses`
- `lm_en_s2orc_ai2_abstracts`
- `lm_en_arxiv`
- `lm_en_pmc`

There are a few concerns:
1. Overlap between abstracts and pdf parses of S2ORC. Since there are many more abstracts than full pdf parses, we probably don't want to discard all abstracts. Currently investigating whether we can match on `paper_id` to discard only the abstracts of papers that also have pdf parses.
2. There is probably significant overlap between the Arxiv/PMC datasets and the S2ORC pdf parses, but the former are probably larger. So it would make sense to exclude the Arxiv/PMC-sourced papers from S2ORC. The source info exists in principle in the S2ORC dataset but does not seem to be present in the datasets above. Asked Kyle if there is a way to get that info.
3. The Arxiv/PMC sources are less preprocessed; e.g. references should be removed. This requires a custom filter/map.
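
A rough sketch of what concerns 1 and 3 could look like in plain Python. Everything here is an assumption for illustration: the record layout (dicts with `paper_id` and `text` keys), the function names `drop_abstracts_with_parses` and `strip_references`, and the heading-based heuristic for finding the reference section. None of this reflects the actual dataset schema or an existing filter in the pipeline.

```python
import re
from typing import Dict, Iterable, Iterator, Set

# Hypothetical record layout: each example is a dict with "paper_id" and "text".
Record = Dict[str, str]


def drop_abstracts_with_parses(
    abstracts: Iterable[Record], parsed_ids: Set[str]
) -> Iterator[Record]:
    """Yield only the abstracts whose paper_id has no full pdf parse."""
    for record in abstracts:
        if record["paper_id"] not in parsed_ids:
            yield record


# Heuristic (an assumption): a line consisting only of a references-style heading.
_REF_HEADING = re.compile(
    r"^[ \t]*(references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text: str) -> str:
    """Cut the text at the last references-style heading, if one is found."""
    matches = list(_REF_HEADING.finditer(text))
    if not matches:
        return text  # no heading found, leave the document untouched
    return text[: matches[-1].start()].rstrip()
```

Usage would be along the lines of building `parsed_ids = {r["paper_id"] for r in pdf_parses}` once and streaming the abstracts through `drop_abstracts_with_parses`, then mapping `strip_references` over the Arxiv/PMC texts. Cutting at the last matching heading is a deliberate choice so that an early mention of "References" on its own line does not truncate the body, but it will also drop any appendix that follows the reference list.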


