Error running NEL on large data #10255
-
I'm trying to train a NEL model (based on https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson) with 1.2 million test/train documents, but the process fails.

I've split the data into multiple train_{num}.spacy and dev_{num}.spacy files and placed them in corpus/train/train_{num}.spacy. I've also updated the project.yml file to point to the directory rather than a specific .spacy file.
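For context, the sharding step itself looks roughly like this (a sketch; the shard size, output directory, and the `docs` placeholder are illustrative, not from the tutorial):

```python
from pathlib import Path

from spacy.tokens import DocBin

out_dir = Path("corpus/train")  # assumed output directory
out_dir.mkdir(parents=True, exist_ok=True)

docs = []            # placeholder: the annotated Doc objects go here
SHARD_SIZE = 10_000  # assumed shard size; tune to your memory budget

doc_bin = DocBin()
shard = 0
for doc in docs:
    doc_bin.add(doc)
    if len(doc_bin) >= SHARD_SIZE:
        doc_bin.to_disk(out_dir / f"train_{shard}.spacy")
        doc_bin = DocBin()
        shard += 1
# flush whatever is left over into a final shard
if len(doc_bin):
    doc_bin.to_disk(out_dir / f"train_{shard}.spacy")
```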
The training step now fails with a permissions problem (on Windows). I've tried changing the permissions on Windows, but it keeps failing, and running the terminal as administrator hits the same issues. Any idea why the permissions keep changing? Also, possibly more importantly: is this the correct approach?
-
I can see where the files actually get loaded, so I've modified the code to loop through all the .spacy files in a given directory:

```python
import os
from pathlib import Path
from typing import Iterable

from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


def read_files(filepath: Path, nlp: Language) -> Iterable[Example]:
    directory = os.fsencode(filepath)
    # we run the full pipeline and not just nlp.make_doc to ensure we have
    # entities and sentences, which are needed during training of the entity linker
    with nlp.select_pipes(disable="entity_linker"):
        for file in os.listdir(directory):
            filename = os.fsdecode(file)
            doc_bin = DocBin().from_disk(str(filepath) + "\\" + filename)
        docs = doc_bin.get_docs(nlp.vocab)
        for doc in docs:
            yield Example(nlp(doc.text), doc)
```

I'm running this on a smaller dataset to see if it works. I'm not sure if the code is adding all the .spacy files or just the last one?
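A quick sanity check on the small set could be to compare the doc count on disk with what the reader yields (a sketch only; the corpus directory and pipeline path are placeholders):

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin

corpus_dir = Path("corpus/train")        # assumed corpus directory
nlp = spacy.load("my_trained_pipeline")  # placeholder: the pipeline passed to read_files

# docs stored across all .spacy files on disk
total_on_disk = sum(len(DocBin().from_disk(f)) for f in corpus_dir.glob("*.spacy"))

# examples the reader actually yields
yielded = sum(1 for _ in read_files(corpus_dir, nlp))

# if yielded < total_on_disk, some files are being skipped
print(total_on_disk, yielded)
```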
-
There was a bug in the above code. I've tried the following on my small dataset and it works as normal:

```python
import os
from pathlib import Path
from typing import Iterable

from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


def read_files(filepath: Path, nlp: Language) -> Iterable[Example]:
    directory = os.fsencode(filepath)
    # we run the full pipeline and not just nlp.make_doc to ensure we have
    # entities and sentences, which are needed during training of the entity linker
    with nlp.select_pipes(disable="entity_linker"):
        # merge every .spacy file in the directory into a single DocBin
        doc_bin = DocBin()
        for file in os.listdir(directory):
            filename = os.fsdecode(file)
            doc_bin_loop = DocBin().from_disk(str(filepath) + "\\" + filename)
            doc_bin.merge(doc_bin_loop)
        docs = doc_bin.get_docs(nlp.vocab)
        for doc in docs:
            yield Example(nlp(doc.text), doc)
```

I've also updated the evaluate file with similar code. I haven't tested this with my large dataset yet; that'll take hours. I'd be happy to hear of alternative solutions. Thanks
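Edit: one alternative I'm considering, but haven't tested, is to stream each file instead of merging everything into one DocBin, so memory stays bounded by the largest shard (a sketch only):

```python
from pathlib import Path
from typing import Iterable

from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


def read_files(filepath: Path, nlp: Language) -> Iterable[Example]:
    # yield examples file by file; sorted() keeps the shard order deterministic
    with nlp.select_pipes(disable="entity_linker"):
        for fname in sorted(Path(filepath).glob("*.spacy")):
            doc_bin = DocBin().from_disk(fname)
            for doc in doc_bin.get_docs(nlp.vocab):
                yield Example(nlp(doc.text), doc)
```

This is essentially the first version with the yield moved inside the loop, which avoids holding all 1.2 million docs in memory at once.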