I solved my own problem! My custom reader was producing incorrect output. Once I copied the implementation of JsonlCorpus, it started working. I'm still not sure exactly what was going wrong in my code (so I would appreciate some input), but it works now. Here are the changes:

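import json
from typing import Callable, Iterator
from spacy.language import Language
from spacy.training import Example

def stream_pretrain_data(path: str, limit: int, train_test_split_seed: int, shuffle_seed: int, shuffle_buf_size: int) -> Callable[[Language], Iterator[Example]]:
    def doc_generator(nlp: "Language"):
        # Each JSONL record stores its text as a list of strings under "t".
        with open(path, "r", encoding="utf8") as f:
            for line in f:
                text_arr = json.loads(line)["t"]
                yield nlp.make_doc(" ".join(text_arr))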

    def generate_stream(nlp):
        count = 0
     …
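
For anyone debugging a similar reader, it can help to check the JSONL parsing step on its own, before spaCy is involved. The following is only a minimal sketch using the standard library: `iter_texts` and the sample records are hypothetical names and data, not part of my actual code, and it assumes each record stores its text as a list of strings under `"t"`, as in the snippet above.

```python
import io
import json
from typing import Iterator, TextIO


def iter_texts(lines: TextIO, key: str = "t") -> Iterator[str]:
    """Yield one space-joined text per JSONL record.

    Assumes each record holds a list of strings under `key`
    (hypothetical helper, for illustration only).
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank or trailing empty lines
        record = json.loads(line)
        tokens = record[key]
        # Fail loudly on malformed records instead of passing bad text on.
        if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
            raise ValueError(f"line {line_no}: expected a list of strings under {key!r}")
        yield " ".join(tokens)


# Hypothetical sample data standing in for the real file.
sample = io.StringIO('{"t": ["Hello", "world"]}\n\n{"t": ["spaCy", "pretraining"]}\n')
texts = list(iter_texts(sample))
print(texts)
```

Each yielded string would then be passed to `nlp.make_doc`, and, going by the `Iterator[Example]` return type in the signature above, the resulting docs wrapped in `Example` objects before being handed to pretraining.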

Answer selected by bennmcgregor
Labels: feat / tok2vec (Token-to-vector layer and pretraining), feat / training (Training utils, Example, Corpus and converters)