Error running NEL on large data #10255
-
I'm trying to train a NEL model (based on https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson) with 1.2 million test/train documents, but the process fails.

I've split the data into multiple train_{num}.spacy and dev_{num}.spacy files and placed them in corpus/train/train_{num}.spacy. I've also updated the project.yml file to point to the directory rather than a specific .spacy file.
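For context, the sharding step itself looks roughly like this (a sketch; the shard size, output directory, and the `docs` placeholder are illustrative, not from the tutorial):

```python
from pathlib import Path

from spacy.tokens import DocBin

out_dir = Path("corpus/train")  # assumed output directory
out_dir.mkdir(parents=True, exist_ok=True)

docs = []            # placeholder: the annotated Doc objects go here
SHARD_SIZE = 10_000  # assumed shard size; tune to your memory budget

doc_bin = DocBin()
shard = 0
for doc in docs:
    doc_bin.add(doc)
    if len(doc_bin) >= SHARD_SIZE:
        doc_bin.to_disk(out_dir / f"train_{shard}.spacy")
        doc_bin = DocBin()
        shard += 1
# flush whatever is left over into a final shard
if len(doc_bin):
    doc_bin.to_disk(out_dir / f"train_{shard}.spacy")
```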
The training step now fails with a permissions problem (on Windows). I've tried changing the permissions on Windows, but it keeps failing, and running the terminal as administrator hits the same issues. Any idea why the permissions keep changing? Also, possibly more importantly: is this the correct approach?
-
I can see where the files actually get loaded, so I've modified the code to loop through all the .spacy files in a given directory:

```python
import os
from pathlib import Path
from typing import Iterable

from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


def read_files(filepath: Path, nlp: Language) -> Iterable[Example]:
    directory = os.fsencode(filepath)
    # we run the full pipeline and not just nlp.make_doc to ensure we have
    # entities and sentences, which are needed during training of the entity linker
    with nlp.select_pipes(disable="entity_linker"):
        for file in os.listdir(directory):
            filename = os.fsdecode(file)
            doc_bin = DocBin().from_disk(str(filepath) + "\\" + filename)
        docs = doc_bin.get_docs(nlp.vocab)
        for doc in docs:
            yield Example(nlp(doc.text), doc)
```

I'm running this on a smaller dataset to see if it works. I'm not sure if the code is adding all the .spacy files or just the last one?
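A quick sanity check on the small set could be to compare the doc count on disk with what the reader yields (a sketch only; the corpus directory and pipeline path are placeholders):

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin

corpus_dir = Path("corpus/train")        # assumed corpus directory
nlp = spacy.load("my_trained_pipeline")  # placeholder: the pipeline passed to read_files

# docs stored across all .spacy files on disk
total_on_disk = sum(len(DocBin().from_disk(f)) for f in corpus_dir.glob("*.spacy"))

# examples the reader actually yields
yielded = sum(1 for _ in read_files(corpus_dir, nlp))

# if yielded < total_on_disk, some files are being skipped
print(total_on_disk, yielded)
```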
-
There was a bug in the above code. I've tried the following on my small dataset and it works as normal:

```python
import os
from pathlib import Path
from typing import Iterable

from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


def read_files(filepath: Path, nlp: Language) -> Iterable[Example]:
    directory = os.fsencode(filepath)
    # we run the full pipeline and not just nlp.make_doc to ensure we have
    # entities and sentences, which are needed during training of the entity linker
    with nlp.select_pipes(disable="entity_linker"):
        # merge every .spacy file in the directory into a single DocBin
        doc_bin = DocBin()
        for file in os.listdir(directory):
            filename = os.fsdecode(file)
            doc_bin_loop = DocBin().from_disk(str(filepath) + "\\" + filename)
            doc_bin.merge(doc_bin_loop)
        docs = doc_bin.get_docs(nlp.vocab)
        for doc in docs:
            yield Example(nlp(doc.text), doc)
```

I've also updated the evaluate file with similar code. I haven't tested this with my large dataset yet; that'll take hours. I'd be happy to hear of alternative solutions. Thanks
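Edit: one alternative I'm considering, but haven't tested, is to stream each file instead of merging everything into one DocBin, so memory stays bounded by the largest shard (a sketch only):

```python
from pathlib import Path
from typing import Iterable

from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


def read_files(filepath: Path, nlp: Language) -> Iterable[Example]:
    # yield examples file by file; sorted() keeps the shard order deterministic
    with nlp.select_pipes(disable="entity_linker"):
        for fname in sorted(Path(filepath).glob("*.spacy")):
            doc_bin = DocBin().from_disk(fname)
            for doc in doc_bin.get_docs(nlp.vocab):
                yield Example(nlp(doc.text), doc)
```

This is essentially the first version with the yield moved inside the loop, which avoids holding all 1.2 million docs in memory at once.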