Using non-ASCII characters in start_text crashes sample.lua #47

@ostrosablin

Passing non-ASCII characters in arguments breaks the -start_text option of sample.lua. My terminal encoding is UTF-8. Here's sample output from an attempt to seed the network with the Russian word for "test":

th sample.lua -checkpoint models/test/checkpoint_27350.t7 -length 1000 -sample 1 -gpu -1 -temperature 1 -start_text тест
/home/vostrosa/torch/install/bin/luajit: ./LanguageModel.lua:129: Got invalid idx
stack traceback:
        [C]: in function 'assert'
        ./LanguageModel.lua:129: in function 'encode_string'
        ./LanguageModel.lua:174: in function 'sample'
        sample.lua:41: in main chunk
        [C]: in function 'dofile'
        ...ator/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405ec0

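This is very unfortunate, since most of the datasets I train the network on consist mostly of Russian UTF-8 encoded text, and I'm unable to pre-seed the network. My guess is that it treats UTF-8 as a single-byte encoding, which would explain why it yields invalid indices: each Cyrillic letter takes two bytes in UTF-8, so a byte-at-a-time lookup would never match an entry in a vocabulary built from whole characters.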
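
To illustrate the guess, here is a minimal Python sketch of the suspected mismatch. It is not the code from LanguageModel.lua. The vocabulary `token_to_idx` and both helper functions are hypothetical names made up for this example, and the sketch assumes the vocabulary is keyed by whole characters. It only shows that looking up UTF-8 text one byte at a time finds nothing, while looking it up one code point at a time works.

```python
# Hypothetical vocabulary keyed by whole characters (an assumption for
# illustration, not the mapping stored in a real checkpoint).
token_to_idx = {ch: i + 1 for i, ch in enumerate(sorted(set("тест test")))}


def encode_bytewise(text):
    """Look up every UTF-8 byte on its own, as a single-byte string routine would."""
    raw = text.encode("utf-8")
    # latin-1 maps each byte to exactly one character, so this mimics
    # treating the input as a single-byte encoding.
    return [token_to_idx.get(raw[i:i + 1].decode("latin-1")) for i in range(len(raw))]


def encode_by_code_point(text):
    """Look up every Unicode code point, so multi-byte characters stay whole."""
    return [token_to_idx.get(ch) for ch in text]


start_text = "тест"
print(len(start_text), len(start_text.encode("utf-8")))  # 4 characters, 8 bytes
print(encode_bytewise(start_text))       # every lookup misses and returns None
print(encode_by_code_point(start_text))  # every lookup hits a vocabulary index
print(encode_bytewise("test"))           # ASCII text is unaffected
```

If the guess is right, the misses in the byte-wise case would correspond to the "Got invalid idx" assertion in the traceback above, and ASCII-only start text would keep working because each ASCII character is exactly one byte.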
