
Parsing a large TextDict fails on recent Python versions #539

@NeoLegends

For example, converting the LibriSpeech 960h (LS960) text dictionary fails with:

OverflowError                             Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/sisyphus/task.py:188, in Task.run(self=<Task 'run' job=Job<alias/datasets/LibriSpeech/t...ext/convert/TextDictToTextLinesJob.xMTMuHiJ4xBa>>, task_id=1, resume_job=False, logging_thread=<LoggingThread(Thread-2, started daemon 140001375073856)>)
    186             logging.info("Starting subtask for arg id: %d args: %s" % (arg_id, str(args)))
    187             logging.info("-" * 60)
--> 188             f(*args)
        f = <bound method TextDictToTextLinesJob.run of Job<alias/datasets/LibriSpeech/train_other_960_corpus_text_lines work/i6_core/text/convert/TextDictToTextLinesJob.xMTMuHiJ4xBa>>
        args = []
    189 except sp.CalledProcessError as e:
    190     if e.returncode == 137:
    191         # TODO move this into engine class

File recipe/i6_core/text/convert.py:33, in TextDictToTextLinesJob.run(self=Job<alias/datasets/LibriSpeech/train_other_960_c...text/convert/TextDictToTextLinesJob.xMTMuHiJ4xBa>)
     30 def run(self):
     31     # nan/inf should not be needed, but avoids errors at this point and will print an error below,
     32     # that we don't expect an N-best list here.
---> 33     d = eval(uopen(self.text_dict, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
        {"nan": float("nan"), "inf": float("inf")} = {'nan': nan, 'inf': inf}
        float("nan") = nan
        float("inf") = inf
        self = Job<alias/datasets/LibriSpeech/train_other_960_corpus_text_lines work/i6_core/text/convert/TextDictToTextLinesJob.xMTMuHiJ4xBa>
        self.text_dict = <Path work/i6_core/corpus/convert/CorpusToTextDictJob.JIQTGMdLEmbz/output/text_dictionary.py.gz>
     34     assert isinstance(d, dict)  # seq_tag -> text
     36     with uopen(self.out_text_lines, "wt") as out:

OverflowError: line number table is too long

Working on a fix.
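
One possible direction, shown here only as a minimal sketch and not necessarily the fix that will land: avoid compiling the whole file as a single expression and parse it entry by entry instead. This assumes the text dict is written with one `key: value,` entry per line between a leading `{` line and a trailing `}` line. The helper name `parse_text_dict_lines` is hypothetical, and unlike the `eval` call in the traceback it does not accept `nan`/`inf`.

```python
import ast
from typing import Dict, Iterable


def parse_text_dict_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse a text dict line by line (hypothetical helper, not part of i6_core).

    Assumed layout: a "{" line, then one "key: value," entry per line, then a "}" line.
    Each entry is parsed on its own, so no single compiled expression spans the whole file.
    """
    result: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and the opening/closing braces of the dict literal.
        if not line or line in ("{", "}"):
            continue
        # Wrap the single entry in braces so it is a valid one-item dict literal.
        entry = ast.literal_eval("{" + line + "}")
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError("expected exactly one 'key: value,' entry per line, got: %r" % line)
        result.update(entry)
    return result
```

Since it takes any iterable of lines, it could be fed the open file object directly, so the file would not need to be read into memory as one string first.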
