TokensLoader with parallel meta field #753

@Aceticia

Description

🚀 Feature

Add a TokensLoaderWithMeta class that stores additional parallel metadata alongside the tokens. It could be used to store data with more structure than a flat sequence, such as image tokens. Here's an example:

{
    "token":   [1,2,3,4,5],
    "token_x": [0,0,0,1,1],
    "token_y": [0,1,2,0,1]
}

Notice they all have the same length.
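
To make the same-length requirement concrete, here is a minimal sketch in plain Python of how parallel fields might be validated and cut into fixed-size blocks at identical offsets. The helper names `check_parallel` and `split_into_blocks` are hypothetical and only illustrate the idea; they are not part of litdata, and the proposed TokensLoaderWithMeta does not exist yet.

```python
from typing import Dict, List


def check_parallel(sample: Dict[str, List[int]]) -> int:
    """Return the shared length of all fields, or raise if they differ."""
    lengths = {key: len(values) for key, values in sample.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"fields have different lengths: {lengths}")
    return next(iter(lengths.values()))


def split_into_blocks(
    sample: Dict[str, List[int]], block_size: int
) -> List[Dict[str, List[int]]]:
    """Cut every field at the same offsets so tokens and metadata stay aligned.

    Assumption: trailing items that do not fill a whole block are dropped.
    """
    n = check_parallel(sample)
    blocks = []
    for start in range(0, n - block_size + 1, block_size):
        end = start + block_size
        blocks.append({key: values[start:end] for key, values in sample.items()})
    return blocks


sample = {
    "token":   [1, 2, 3, 4, 5],
    "token_x": [0, 0, 0, 1, 1],
    "token_y": [0, 1, 2, 0, 1],
}

blocks = split_into_blocks(sample, block_size=2)
for block in blocks:
    print(block)
```

Because every field is sliced with the same `start` and `end`, each token in a block keeps its matching `token_x` and `token_y` values, which is the property a positional encoding experiment would rely on.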

Motivation

I've been using TokensLoader to train models and find it really handy. Unfortunately, it is a bit difficult to use when I want to experiment with different positional encoding schemes.

Alternatives

The alternative is to create a normal LitDataset with these fields, but that is less efficient to store and load, and it makes packing samples harder.

Metadata

Labels

enhancement: New feature or request
