Tokenization is not consistent across datasets #18

@JohnGiorgi

Tokenization is not the same across datasets. I don't know how big the issue is, but JNLPBA seems to have coarser tokenization than the other datasets.

For example, in JNLPBA "interleukin-n" is kept together while in the other datasets it appears as "interleukin", "-", "n".
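To make the mismatch concrete, here is a minimal sketch of the finer tokenization, assuming the only difference is that hyphens are split off as their own tokens. The function name `split_hyphens` is hypothetical and only illustrates the example above; it is not how any of the datasets were actually tokenized.

```python
import re
from typing import List


def split_hyphens(token: str) -> List[str]:
    """Split a token on hyphens, keeping each hyphen as its own token.

    Illustrative only: assumes hyphens are the sole source of the mismatch.
    """
    # The capturing group makes re.split keep the "-" separators in the result.
    return [piece for piece in re.split(r"(-)", token) if piece]


coarse = "interleukin-n"        # as it appears in JNLPBA
fine = split_hyphens(coarse)    # as it appears in the other datasets
print(fine)
```

Re-tokenizing in place like this would also require re-aligning the tags, which is one reason swapping in an already consistently tokenized corpus may be simpler.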

Replace the JNLPBA corpus here with the one from: https://github.com/spyysalo/jnlpba

This will involve:

  • Removing duplicates
  • Creating the valid split
  • Outputting in both IOB and IOBES, and for each entity type
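The steps above could be sketched roughly as follows. This is a sketch under stated assumptions, not the repository's actual preprocessing: sentences are assumed to be lists of `(token, tag)` pairs, tags are assumed to be in IOB2 form (every entity starts with `B-`), and the helper names, the 10% validation fraction, and the fixed seed are all hypothetical choices.

```python
import random
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs; assumed representation


def dedupe(sentences: List[Sentence]) -> List[Sentence]:
    """Drop exact duplicate sentences, keeping the first occurrence."""
    seen = set()
    unique = []
    for sent in sentences:
        key = tuple(sent)
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique


def split_train_valid(sentences: List[Sentence], valid_fraction: float = 0.1,
                      seed: int = 0) -> Tuple[List[Sentence], List[Sentence]]:
    """Hold out a random fraction of sentences as the valid split (assumed scheme)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def iob_to_iobes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to IOBES (S- for single-token, E- for entity-final)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"  # does the same entity go on?
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


def keep_entity_type(tags: List[str], label: str) -> List[str]:
    """Mask every tag that does not belong to the given entity type with O."""
    return [t if t != "O" and t.split("-", 1)[1] == label else "O" for t in tags]
```

For example, `iob_to_iobes(keep_entity_type(tags, "DNA"))` would give the IOBES tags for a DNA-only variant of a sentence. Filtering by entity type before converting matters, since conversion depends on which neighboring tags remain.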
