scripts/preprocess.py should sort tokens lexicographically #43

@AlekzNet

Currently, indexes are assigned to tokens on a first-occurrence basis. If the text changes (think of fixing a typo, or training a pre-trained model on a different corpus), the indexes may be reassigned, which will break subsequent training initialized from a checkpoint.
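
To illustrate the difference, here is a minimal sketch rather than the actual code in scripts/preprocess.py. The function names `first_occurrence_vocab` and `sorted_vocab` are hypothetical, tokens are assumed to be single characters, and starting the indexes at 1 is an arbitrary choice made for the example.

```python
def first_occurrence_vocab(text):
    """Assign indexes in the order tokens first appear (the behavior described above)."""
    token_to_idx = {}
    for token in text:
        if token not in token_to_idx:
            # Next free index; 1-based here purely as an assumption for the sketch.
            token_to_idx[token] = len(token_to_idx) + 1
    return token_to_idx


def sorted_vocab(text):
    """Assign indexes by lexicographic order of the distinct tokens (the proposed behavior)."""
    return {token: idx for idx, token in enumerate(sorted(set(text)), start=1)}


original = "hello world"
edited = "world hello"  # same set of characters, different order of first occurrence

print(first_occurrence_vocab(original) == first_occurrence_vocab(edited))  # False
print(sorted_vocab(original) == sorted_vocab(edited))                      # True
```

With sorting, the mapping depends only on the set of distinct tokens and not on where they first occur. If an edit adds or removes a token, indexes of tokens that sort after it would still shift, so sorting alone may not cover every case.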
