Closed
Description
Non-ASCII characters in arguments (in particular UTF-8, which is my terminal encoding) break the -start_text functionality of sample.lua. Here's sample output from an attempt to seed the network with the Russian word for "test":
th sample.lua -checkpoint models/test/checkpoint_27350.t7 -length 1000 -sample 1 -gpu -1 -temperature 1 -start_text тест
/home/vostrosa/torch/install/bin/luajit: ./LanguageModel.lua:129: Got invalid idx
stack traceback:
[C]: in function 'assert'
./LanguageModel.lua:129: in function 'encode_string'
./LanguageModel.lua:174: in function 'sample'
sample.lua:41: in main chunk
[C]: in function 'dofile'
...ator/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405ec0
This is very unfortunate, since most of the datasets I train the network on consist mostly of UTF-8 encoded Russian text, and I'm unable to pre-seed the network. My guess is that encode_string treats UTF-8 as a single-byte encoding, which would explain why it yields invalid indices.
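To illustrate the guess above, here is a minimal sketch in Python (not the project's Lua code) of how a byte-wise reader and a character-wise reader would see the same start text. The vocabulary `token_to_idx` and both helper functions are hypothetical; the sketch assumes the vocabulary is keyed by whole characters, which I have not verified against the checkpoint.

```python
# Hypothetical vocabulary keyed by whole characters (an assumption,
# not taken from the project's code).
text = "тест"
token_to_idx = {ch: i + 1 for i, ch in enumerate(sorted(set(text)))}


def encode_by_code_point(s, vocab):
    """Look up each Unicode character; raises KeyError on unknown ones."""
    return [vocab[ch] for ch in s]


def encode_by_byte(s, vocab):
    """Mimic a byte-wise reader: treat every UTF-8 byte as its own token."""
    raw = s.encode("utf-8")
    # Each byte is interpreted as a separate one-byte (Latin-1) character,
    # so a multi-byte character is split into pieces the vocabulary lacks.
    return [vocab.get(bytes([b]).decode("latin-1")) for b in raw]


utf8_len = len(text.encode("utf-8"))
by_char = encode_by_code_point(text, token_to_idx)
by_byte = encode_by_byte(text, token_to_idx)

print(len(text), utf8_len)  # number of characters vs. number of UTF-8 bytes
print(by_char)              # every character resolves to an index
print(by_byte)              # every byte lookup misses (None)
```

Each Cyrillic letter occupies two bytes in UTF-8, so the four-character word becomes eight single-byte tokens, none of which exist in a character-keyed vocabulary. That would be consistent with the "Got invalid idx" assertion, though it remains a guess until checked against encode_string itself.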