Incorrect handling of token lengths #17

@Waguy02

Description

In the long-term forecast experiments, you fix the maximum text length to 1024 tokens, see https://github.com/AdityaLab/MM-TSFlib/blob/e789ce78c9bafd8e3ba0d8850f9ad2becbe83548/exp/exp_long_term_forecasting.py#L649 (in accordance with the GPT-2 context window, I guess?), but this implies an arbitrary truncation of the input text.
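For illustration, here is a minimal sketch of what a hard-coded `max_length=1024` does to the window text, assuming a HuggingFace GPT-2 tokenizer (this is not the exact code from `exp_long_term_forecasting.py`, just the kind of truncation I mean):

```python
# Illustration only: what a hard-coded max_length=1024 does to the window text.
# Assumes the HuggingFace `transformers` GPT-2 tokenizer; the repo's exact call may differ.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def encode_window_text(window_texts, max_length=1024):
    """Concatenate the text of every timestamp in the input window and tokenize it.

    With truncation=True, everything past `max_length` tokens is silently dropped,
    so the later timestamps in the window can lose their text entirely.
    """
    joined = " ".join(window_texts)
    enc = tokenizer(joined, truncation=True, max_length=max_length, return_tensors="pt")
    return enc["input_ids"]

# e.g. 3 Agriculture timestamps at ~425 tokens each -> ~1275 tokens, ~250 silently cut.
```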

  • I calculated the following statistics per dataset:
| Dataset | Average number of text tokens per timestamp |
| --- | --- |
| Agriculture | 425 |
| Climate | 560 |
| Economy | 416 |
| Energy | 57 |
| Environment | 546 |
| PublicHealth | 65 |
| Security | 656 |
| SocialGood | 350 |
| Traffic | 64 |

Given those statistics, in a setting with an input window length of 3 or more and using any of the Agriculture, Climate, Economy, Environment, Security, or SocialGood datasets, the total number of text tokens within the input window will clearly overflow the context of 1024 hard-coded in the code (e.g. 425 * 3 > 1024).
A truncation methodology should be carefully defined so that the text information within the input window is encompassed fairly (one possibility is sketched below).
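One possible way to do this, given as a rough sketch rather than a concrete proposal for the library (the helper name and the even split are my own choices), is to allocate the 1024-token budget evenly across the timestamps of the window instead of truncating the concatenated text from the end:

```python
# Hypothetical helper (not part of MM-TSFlib): give each timestamp an equal token budget
# so that no timestamp's text is dropped wholesale by the final truncation.
def truncate_per_timestamp(window_texts, tokenizer, max_length=1024):
    budget = max_length // max(len(window_texts), 1)
    ids = []
    for text in window_texts:
        # Truncate each timestamp's text to its share of the context window.
        ids.extend(tokenizer(text, truncation=True, max_length=budget)["input_ids"])
    return ids[:max_length]
```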

My solution is simply to use an LLM with a larger context window, which is what I did in my own reimplementation (see the sketch below).
It would be great if you could adapt the library code to handle this correctly.
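For reference, switching to a longer-context text encoder is straightforward with the HuggingFace API; a minimal sketch, where the choice of Longformer is only an example and not necessarily the model I used:

```python
# Illustrative larger-context text encoder; no specific model is prescribed here.
from transformers import AutoTokenizer, AutoModel

model_name = "allenai/longformer-base-4096"  # 4096-token context instead of GPT-2's 1024
tokenizer = AutoTokenizer.from_pretrained(model_name)
text_encoder = AutoModel.from_pretrained(model_name)

window_texts = ["text for timestamp t-2", "text for timestamp t-1", "text for timestamp t"]
enc = tokenizer(" ".join(window_texts), truncation=True, max_length=4096, return_tensors="pt")
hidden = text_encoder(**enc).last_hidden_state  # (1, seq_len, hidden) embeddings for the window
```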

Thanks
