
String interning #21

@vzhong


Hey @arunchaganty ,

@jekbradbury and @bmccann recently discovered a huge performance oversight in @jekbradbury's tokenization library: adding string interning improved DecaNLP performance by something like 100x. It dawned on me that we don't seem to do this in this Python client, so the output annotations are storing a bazillion copies of words, glosses, tags, whitespace, etc.? Can you confirm/deny this?

For reference the issue in question is here: jekbradbury/revtok#4
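For context, here's a minimal sketch of what interning could look like when the client deserializes annotations. The `Token` class and `from_wire` helper are illustrative, not the client's actual API; the point is just that `sys.intern` collapses equal strings into one shared object, so a document with millions of repeated words, tags, and whitespace strings stores each distinct value once:

```python
# Hypothetical sketch: interning strings as annotations are deserialized.
# Token and from_wire are illustrative names, not this client's real API.
import sys


class Token:
    __slots__ = ("word", "pos", "before")

    def __init__(self, word, pos, before):
        # sys.intern returns the canonical copy of each string, so equal
        # values across tokens alias one object instead of N copies
        self.word = sys.intern(word)
        self.pos = sys.intern(pos)
        self.before = sys.intern(before)


def from_wire(s):
    # Simulate strings arriving off the wire: protobuf/JSON decoding yields
    # fresh str objects, unlike literals, which CPython interns on its own.
    return s.encode("utf-8").decode("utf-8")


tokens = [Token(from_wire("the"), from_wire("DT"), from_wire(" "))
          for _ in range(3)]
assert tokens[0].word is tokens[1].word  # one shared object after interning
```

Without the `sys.intern` calls, every deserialized token would hold its own copy of "the", "DT", etc., which is exactly the bazillion-copies situation described above.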
