Tokenization is not consistent across datasets

Tokenization differs across datasets. I don't know how much this matters in practice, but JNLPBA appears to be tokenized more coarsely than the other datasets.

For example, in JNLPBA "interleukin-`n`" is kept together while in the other datasets it appears as "interleukin", "-", "`n`".
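To make the difference concrete, here is a minimal sketch of a finer-grained splitter that reproduces the "interleukin", "-", "`n`" style. The function name `split_fine` and the regex rule (runs of letters/digits versus single punctuation characters) are assumptions for illustration only; the other datasets may have been tokenized with different rules.

```python
import re

# Hypothetical rule: runs of ASCII letters/digits form one token,
# every other non-space character becomes its own token.
_FINE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def split_fine(token):
    """Split one coarse token into finer pieces."""
    return _FINE.findall(token)


# A coarse, JNLPBA-style token with a concrete number in place of n.
print(split_fine("interleukin-2"))
```

Comparing the output of such a splitter against the existing JNLPBA tokens would give a rough count of how many tokens are affected.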

Replace the JNLPBA corpus here with the one from: https://github.com/spyysalo/jnlpba

This will involve:

- [ ] Removing duplicates
- [ ] Creating the validation split
- [ ] Outputting in both IOB and IOBES, and for each entity type
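The steps above could be sketched roughly as follows. This is not the repository's actual preprocessing code: the function names are hypothetical, sentences are assumed to be lists of `(token, tag)` pairs, the input tags are assumed to be IOB2 (`B-X`, `I-X`, `O`), and `keep_entity_type` is only one possible reading of "for each entity type" (one output per type, with all other types mapped to `O`).

```python
def drop_duplicates(sentences):
    """Keep the first occurrence of each sentence, preserving order."""
    seen = set()
    unique = []
    for sent in sentences:
        key = tuple(sent)  # list of (token, tag) pairs -> hashable key
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique


def iob_to_iobes(tags):
    """Convert one sentence's IOB2 tags to IOBES."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label  # does the entity go on?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out


def keep_entity_type(tags, label):
    """Keep tags for one entity type; map every other tag to O."""
    return [t if t != "O" and t.split("-", 1)[1] == label else "O"
            for t in tags]


tags = ["B-protein", "I-protein", "O", "B-DNA"]
print(iob_to_iobes(tags))
print(keep_entity_type(tags, "DNA"))
```

Deduplicating before creating the splits avoids the same sentence ending up in both the training and validation data.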
