Problem in retraining spacy for sentence splitting #12573

nikhilajoshy · 2023-04-25T06:27:42Z


I created training data using

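from spacy.tokens import DocBin
from tqdm import tqdm

# Assumes `nlp` is an already loaded pipeline and `data` is a DataFrame with a
# `text` column and a `starts` column of sentence-start character offsets.
db = DocBin()

for _, rw in tqdm(data.iterrows(), total=len(data)):
    doc = nlp.make_doc(rw.text)  # tokenize only, no pipeline components run
    starts = set(rw["starts"])
    for token in doc:
        # A token opens a sentence if its character offset is a listed start
        token.is_sent_start = token.idx in starts
    for sent in doc.sents:
        print("--", sent, "---")
    db.add(doc)

db.to_disk("./train.spacy")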

But when I debug the config file, I get this:

Why is the training data split into one word per sentence?

Answered by adrianeboyd

adrianeboyd · 2023-04-25T11:03:27Z


For data where you have set sentence boundaries with token.is_sent_start, you want to train a senter component instead of a parser component.

(The parser uses dependency trees to identify sentence boundaries, so in data without dependency annotation, each word looks like its own separate tree and own separate sentence.)
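As a rough illustration of the suggestion above, here is a minimal sketch of training a senter on data annotated only with token.is_sent_start. The blank English pipeline, the toy sentence, and the number of update steps are assumptions made for this example and are not taken from the original question; a real setup would more likely use a config with senter in the pipeline and the train.spacy file created above.

```python
import spacy
from spacy.tokens import Doc
from spacy.training import Example

# Assumption: a blank English pipeline with only a senter component.
nlp = spacy.blank("en")
nlp.add_pipe("senter")

# Toy gold data: sentence boundaries only, no dependency annotation.
words = ["Hello", "world", ".", "How", "are", "you", "?"]
sent_starts = [True, False, False, True, False, False, False]
gold = Doc(nlp.vocab, words=words, sent_starts=sent_starts)

# The predicted side is an unannotated doc with the same tokens.
predicted = Doc(nlp.vocab, words=words)
example = Example(predicted, gold)

# Initialize the model from the example, then run a few update steps.
optimizer = nlp.initialize(lambda: [example])
for _ in range(20):
    losses = {}
    nlp.update([example], sgd=optimizer, losses=losses)

# The trained senter now sets sentence boundaries on new docs.
doc = nlp("Hello world. How are you?")
for sent in doc.sents:
    print("--", sent.text, "---")
```

The gold doc here carries the same kind of annotation as the training data in the question, so the senter learns from it directly, whereas a parser would find no dependency trees to learn from.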
