Error when using nltk with spacy #10238
-
I'm trying to use nltk to remove stopwords from a string before it gets sent to the training process, but I'm getting errors. Python code:

with json_loc.open("r", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        sentence = example["text"]
        print(sentence)
        sentence_stop_words = ' '.join([word for word in sentence.split() if word not in cachedStopWords])
        print(sentence_stop_words)
        print("")
        if example["answer"] == "accept":
            QID = example["accept"][0]
            doc = nlp.make_doc(sentence_stop_words)
            print(doc)
            gold_ids.append(QID)
            offset = (example["spans"][0]["start"], example["spans"][0]["end"])
            links_dict = {QID: 1.0}
            entity_label = example["spans"][0]["label"]
            entities = [(offset[0], offset[1], entity_label)]
            # we assume only 1 annotated span per sentence, and only 1 KB ID per span
            entity = doc.char_span(
                example["spans"][0]["start"],
                example["spans"][0]["end"],
                label=example["spans"][0]["label"],
                kb_id=QID,
            )
            doc.ents = [entity]
            for i, t in enumerate(doc):
                doc[i].is_sent_start = i == 0
            docs.append(doc)
            dataset.append((sentence, {"links": {offset: links_dict}, "entities": entities}))

And the error output is
I don't quite understand what the issue is; both sentence_stop_words and sentence are strings, but the error appears once the stopword-filtered string goes into the pipeline. The two strings print fine, so nltk seems to be running just fine.
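A quick check to narrow this down, reusing example, sentence, sentence_stop_words and nlp from the snippet above (just a sketch, not part of the original code): spaCy's Doc.char_span returns None when the character offsets don't map onto the given text, so printing the lengths and offsets side by side shows whether they still line up after the stopwords are stripped.

    # Sketch: compare the annotated offsets against the filtered string.
    start = example["spans"][0]["start"]
    end = example["spans"][0]["end"]
    print(len(sentence), len(sentence_stop_words))  # the filtered string is shorter
    print(start, end)                               # offsets were measured on the original text

    doc = nlp.make_doc(sentence_stop_words)
    print(doc.char_span(start, end))  # None if the offsets no longer fit this text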
-
I think I may have worked out the issue. In my data the start and end characters no longer point to the right place, because the string is shorter after the stopwords have been removed. The data had the start char at 418, but there are only 417 characters in the string now the stopwords are removed! I'm guessing that's the issue. I'll have to create a new set of training data.
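If rebuilding the training data isn't an option, one workaround is to recompute the offsets on the filtered string instead of reusing the ones stored with the original text. The sketch below assumes the annotated entity text itself contains no stopwords (so it survives the filtering unchanged); remap_example and the variable names are illustrative, not spaCy API, and examples whose spans can't be realigned are simply skipped.

    def remap_example(example, cached_stop_words, nlp):
        """Recompute the entity span on the stopword-filtered text (illustrative sketch)."""
        sentence = example["text"]
        filtered = " ".join(w for w in sentence.split() if w not in cached_stop_words)

        span = example["spans"][0]
        entity_text = sentence[span["start"]:span["end"]]

        # Find where the annotated text now sits in the filtered string.
        new_start = filtered.find(entity_text)
        if new_start == -1:
            return None  # the entity text itself was altered by the filtering; skip
        new_end = new_start + len(entity_text)

        doc = nlp.make_doc(filtered)
        entity = doc.char_span(new_start, new_end, label=span["label"], kb_id=example["accept"][0])
        if entity is None:
            return None  # offsets don't land on token boundaries; skip
        doc.ents = [entity]
        return doc

Skipping misaligned examples loses a little data but keeps the offsets honest; regenerating the annotations on the filtered text, as suggested above, is the cleaner fix.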