S2ORC vs Arxiv vs PMC #11

@lvwerra

Currently we have four datasets containing S2ORC, Arxiv, and PMC data:

  • lm_en_s2orc_ai2_pdf_parses
  • lm_en_s2orc_ai2_abstracts
  • lm_en_arxiv
  • lm_en_pmc

There are a few concerns:

  1. Overlap between abstracts and pdf parses of S2ORC. Since there are many more abstracts than full pdf parses, we probably don't want to discard all abstracts. Currently investigating whether we can match on paper_id to discard the abstracts of papers that have pdf parses.
  2. There is probably significant overlap between Arxiv/PMC and the S2ORC pdf parses, but the former are probably larger. So it would make sense to exclude the Arxiv/PMC sources from S2ORC. The source info exists in principle in the S2ORC dataset but does not seem to be present in the datasets above. Asked Kyle if there is a way to get that info.
  3. The Arxiv/PMC sources are less preprocessed, and e.g. references should be removed. This requires a custom filter/map.
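
For concern 1, a minimal sketch of the paper_id matching could look like the following. It assumes that both the abstracts and the pdf parses expose a `paper_id` field (still to be confirmed) and that records can be iterated as plain dicts; the function names and the toy records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set


def collect_paper_ids(pdf_parses: Iterable[Dict]) -> Set[str]:
    """Gather the paper_id of every record that has a full pdf parse."""
    # Assumption: every pdf-parse record carries a "paper_id" field.
    return {str(rec["paper_id"]) for rec in pdf_parses}


def drop_covered_abstracts(
    abstracts: Iterable[Dict], parsed_ids: Set[str]
) -> Iterator[Dict]:
    """Yield only the abstracts whose paper has no pdf parse."""
    for rec in abstracts:
        if str(rec["paper_id"]) not in parsed_ids:
            yield rec


# Toy records, made up for illustration only.
pdf_parses = [{"paper_id": "1", "text": "full text of paper 1"}]
abstracts = [
    {"paper_id": "1", "text": "abstract of paper 1"},
    {"paper_id": "2", "text": "abstract of paper 2"},
]

parsed_ids = collect_paper_ids(pdf_parses)
kept = list(drop_covered_abstracts(abstracts, parsed_ids))
print([rec["paper_id"] for rec in kept])
```

The ids are held in a set so that each lookup is constant time; whether the full set of pdf-parse ids fits in memory would need to be checked against the real data.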

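For concern 3, one possible shape for the custom map is a heading-based cut. This is only a heuristic sketch: it assumes the references section starts at a line consisting solely of "References" or "Bibliography" (optionally numbered), and that the document text lives in a `text` field. Both are assumptions, and anything after the heading (e.g. appendices) would be dropped too.

```python
import re

# Hypothetical heuristic: a line that contains only a references-style
# heading, optionally preceded by a section number such as "7." or "7".
REFERENCES_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text: str) -> str:
    """Cut the text at the last references heading, if one is found."""
    matches = list(REFERENCES_HEADING.finditer(text))
    if not matches:
        return text  # No heading detected: leave the document untouched.
    return text[: matches[-1].start()].rstrip()


def strip_references_map(example: dict) -> dict:
    """Map-style wrapper; the "text" field name is an assumption."""
    return {**example, "text": strip_references(example["text"])}


sample = {"text": "Intro.\n\nResults.\n\nReferences\n[1] A. Author. Some title."}
print(strip_references_map(sample)["text"])
```

Cutting at the last match rather than the first avoids truncating a paper that has an early line reading "References" in a table of contents, at the cost of missing reference lists that are followed by a second such heading.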