Error when using nltk with spacy #10238
-
I'm trying to use nltk to remove stopwords from a string before it gets sent to the training process, but I'm getting errors. Python code:

with json_loc.open("r", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        sentence = example["text"]
        print(sentence)
        sentence_stop_words = ' '.join([word for word in sentence.split() if word not in cachedStopWords])
        print(sentence_stop_words)
        print("")
        if example["answer"] == "accept":
            QID = example["accept"][0]
            doc = nlp.make_doc(sentence_stop_words)
            print(doc)
            gold_ids.append(QID)
            offset = (example["spans"][0]["start"], example["spans"][0]["end"])
            links_dict = {QID: 1.0}
            entity_label = example["spans"][0]["label"]
            entities = [(offset[0], offset[1], entity_label)]
            # we assume only 1 annotated span per sentence, and only 1 KB ID per span
            entity = doc.char_span(
                example["spans"][0]["start"],
                example["spans"][0]["end"],
                label=example["spans"][0]["label"],
                kb_id=QID,
            )
            doc.ents = [entity]
            for i, t in enumerate(doc):
                doc[i].is_sent_start = i == 0
            docs.append(doc)
            dataset.append((sentence, {"links": {offset: links_dict}, "entities": entities}))

And the error output is
I don't quite understand what the issue is; both sentence_stop_words and sentence are strings, but the error appears once the stopword-filtered string goes into the pipeline. The two strings print fine, so nltk seems to be running just fine.
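A quick check to narrow this down, reusing example, sentence, sentence_stop_words and nlp from the snippet above (just a sketch, not part of the original code): spaCy's Doc.char_span returns None when the character offsets don't map onto the given text, so printing the lengths and offsets side by side shows whether they still line up after the stopwords are stripped.

    # Sketch: compare the annotated offsets against the filtered string.
    start = example["spans"][0]["start"]
    end = example["spans"][0]["end"]
    print(len(sentence), len(sentence_stop_words))  # the filtered string is shorter
    print(start, end)                               # offsets were measured on the original text

    doc = nlp.make_doc(sentence_stop_words)
    print(doc.char_span(start, end))  # None if the offsets no longer fit this text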
-
I think I may have worked out the issue. In my data the start and end characters no longer point to the right place, because the string is shorter after the stopwords have been removed. The data had the start char at 418, but there are only 417 characters in the string now the stopwords are removed! I'm guessing that's the issue. I'll have to create a new set of training data.
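If rebuilding the training data isn't an option, one workaround is to recompute the offsets on the filtered string instead of reusing the ones stored with the original text. The sketch below assumes the annotated entity text itself contains no stopwords (so it survives the filtering unchanged); remap_example and the variable names are illustrative, not spaCy API, and examples whose spans can't be realigned are simply skipped.

    def remap_example(example, cached_stop_words, nlp):
        """Recompute the entity span on the stopword-filtered text (illustrative sketch)."""
        sentence = example["text"]
        filtered = " ".join(w for w in sentence.split() if w not in cached_stop_words)

        span = example["spans"][0]
        entity_text = sentence[span["start"]:span["end"]]

        # Find where the annotated text now sits in the filtered string.
        new_start = filtered.find(entity_text)
        if new_start == -1:
            return None  # the entity text itself was altered by the filtering; skip
        new_end = new_start + len(entity_text)

        doc = nlp.make_doc(filtered)
        entity = doc.char_span(new_start, new_end, label=span["label"], kb_id=example["accept"][0])
        if entity is None:
            return None  # offsets don't land on token boundaries; skip
        doc.ents = [entity]
        return doc

Skipping misaligned examples loses a little data but keeps the offsets honest; regenerating the annotations on the filtered text, as suggested above, is the cleaner fix.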