
String interning #21

@vzhong


Hey @arunchaganty ,

@jekbradbury and @bmccann recently discovered a huge performance oversight in @jekbradbury's tokenization library: adding string interning improved DecaNLP performance by something like 100x. It dawned on me that we don't seem to do this in this Python client, so the output annotations are storing a bazillion copies of words, glosses, tags, whitespace, etc.? Can you confirm/deny this?

For reference the issue in question is here: jekbradbury/revtok#4
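For context, here's a minimal sketch of what interning could look like when the client deserializes annotations. The `Token` class and `from_wire` helper are illustrative, not the client's actual API; the point is just that `sys.intern` collapses equal strings into one shared object, so a document with millions of repeated words, tags, and whitespace strings stores each distinct value once:

```python
# Hypothetical sketch: interning strings as annotations are deserialized.
# Token and from_wire are illustrative names, not this client's real API.
import sys


class Token:
    __slots__ = ("word", "pos", "before")

    def __init__(self, word, pos, before):
        # sys.intern returns the canonical copy of each string, so equal
        # values across tokens alias one object instead of N copies
        self.word = sys.intern(word)
        self.pos = sys.intern(pos)
        self.before = sys.intern(before)


def from_wire(s):
    # Simulate strings arriving off the wire: protobuf/JSON decoding yields
    # fresh str objects, unlike literals, which CPython interns on its own.
    return s.encode("utf-8").decode("utf-8")


tokens = [Token(from_wire("the"), from_wire("DT"), from_wire(" "))
          for _ in range(3)]
assert tokens[0].word is tokens[1].word  # one shared object after interning
```

Without the `sys.intern` calls, every deserialized token would hold its own copy of "the", "DT", etc., which is exactly the bazillion-copies situation described above.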
