kenlm bin LM #2

@abnerLing

Is ARPA the only KenLM model format supported? ARPA models work fine for me, but when I try to load a binary (.bin) model I get the error below.

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-30-c9c4a045dd64> in <module>
      5 alpha = 2.5 # LM Weight
      6 beta = 0.0 # LM Usage Reward
----> 7 word_lm_scorer = ctcdecode.WordKenLMScorer('../lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin', alpha, beta) # use your own kenlm model
      8 decoder = ctcdecode.BeamSearchDecoder(
      9     vocabulary,

~/work/wav2vec/py-ctc-decode/ctcdecode/scorer.py in __init__(self, path, alpha, beta)
     45         self.lm = kenlm.Model(path)
     46 
---> 47         self.words = self._get_words(path)
     48         self.word_prefixes = self._get_word_prefixes(self.words)
     49 

~/work/wav2vec/py-ctc-decode/ctcdecode/scorer.py in _get_words(self, path)
    107 
    108             while not end_1_gram:
--> 109                 line = f.readline().strip()
    110 
    111                 if line == '\\1-grams:':

~/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 62: invalid start byte
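
From the traceback, `kenlm.Model(path)` on line 45 seems to load the binary model without complaint; the failure comes later, when `_get_words` reopens the same file as UTF-8 text and scans for the `\1-grams:` header, which only exists in ARPA files. Below is a minimal sketch of how the vocabulary step could check the format before parsing. The helper names (`looks_like_arpa`, `read_arpa_unigrams`, `load_vocabulary`) are hypothetical and not part of py-ctc-decode, and the check is only a heuristic based on the ARPA `\data\` header.

```python
def looks_like_arpa(path, probe_bytes=4096):
    """Heuristic: an ARPA file has a line reading `\\data\\` near the top."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    # Compare raw bytes so that binary content never triggers a decode error.
    return any(line.strip() == b"\\data\\" for line in head.splitlines())


def read_arpa_unigrams(path):
    """Collect the words listed in the `\\1-grams:` section of an ARPA file."""
    words = []
    in_unigrams = False
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if line == "\\1-grams:":
                in_unigrams = True
            elif in_unigrams and line.startswith("\\"):
                break  # reached the next section (\2-grams: or \end\)
            elif in_unigrams and line:
                fields = line.split()  # logprob, word, optional backoff
                if len(fields) >= 2:
                    words.append(fields[1])
    return words


def load_vocabulary(path):
    """Hypothetical wrapper: fail with a clear message on non-ARPA input."""
    if not looks_like_arpa(path):
        raise ValueError(
            f"{path} does not look like an ARPA file; "
            "the word list cannot be read from a binary model this way"
        )
    return read_arpa_unigrams(path)
```

With something like this, passing a `.bin` file would raise a readable `ValueError` instead of a `UnicodeDecodeError`. A possible workaround in the meantime, assuming the original ARPA file is still available, would be to read the word list from the ARPA file and use the binary only for scoring.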
