morph reading in token is not merged properly when using merge_entities pipeline #12856

lawctan · 2023-07-24T19:06:54Z

How to reproduce the behaviour

import spacy
import json
import fileinput


def process(nlp, texts):
    # Print one JSON record per token, sentence by sentence.
    docs = list(nlp.pipe(texts, n_process=1, batch_size=2000))
    for doc in docs:
        for sent in doc.sents:
            for token in sent:
                tokenInfo = {
                    "idx": token.i,
                    "orth": token.orth_,
                    "pos": token.pos_,
                    "lemma": token.lemma_,
                    "norm": token.norm_,
                    "dep": token.dep_,
                    "morph": token.morph.to_json(),
                }
                print(json.dumps(tokenInfo, ensure_ascii=False))


nlp = spacy.load('ja_core_news_lg')


nlp.add_pipe("merge_subtokens")
nlp.add_pipe("merge_entities")


texts = []

for line in fileinput.input():
    texts.append(line.strip())

process(nlp, texts)

Command to test

echo "４月１日に試験があるので" | python parse-jap.py

returns

{"idx": 0, "orth": "４月１日", "pos": "NOUN", "lemma": "4月1日", "norm": "４月１日", "dep": "obl", "morph": "Reading=ツイタチ"}
{"idx": 1, "orth": "に", "pos": "ADP", "lemma": "に", "norm": "に", "dep": "case", "morph": "Reading=ニ"}
{"idx": 2, "orth": "試験", "pos": "NOUN", "lemma": "試験", "norm": "試験", "dep": "nsubj", "morph": "Reading=シケン"}
{"idx": 3, "orth": "が", "pos": "ADP", "lemma": "が", "norm": "が", "dep": "case", "morph": "Reading=ガ"}
{"idx": 4, "orth": "ある", "pos": "VERB", "lemma": "ある", "norm": "有る", "dep": "ROOT", "morph": "Inflection=五段-ラ行;連体形-一般|Reading=アル"}
{"idx": 5, "orth": "の", "pos": "SCONJ", "lemma": "の", "norm": "の", "dep": "mark", "morph": "Reading=ノ"}
{"idx": 6, "orth": "で", "pos": "AUX", "lemma": "だ", "norm": "だ", "dep": "fixed", "morph": "Inflection=助動詞-ダ;連用形-一般|Reading=デ"}

Note that for ４月１日 the output shows "morph": "Reading=ツイタチ". The reading for ４月 has been dropped from the merged token.

Your Environment

spaCy version: 3.5.3
Platform: macOS-12.5-arm64-arm-64bit
Python version: 3.10.10
Pipelines: ja_core_news_sm (3.2.0), ja_ginza (5.1.2), ja_core_news_trf (3.2.0), ja_ginza_electra (5.1.2), ja_core_news_lg (3.2.0)

Answered by adrianeboyd

adrianeboyd · 2023-07-25T06:23:17Z

The retokenizer can merge different morph features like A=1 + B=2 -> A=1|B=2, but it doesn't know how to automatically merge multiple values for the same feature like A=1 + A=2 -> A=???, so it uses the value from one token instead of trying to merge them. I'd have to double-check to be sure, but I think the default is to take the value from the head token in the phrase, and if there's no parse then it's taken from the first token.
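The same-feature case can be checked without a trained pipeline. The following is a minimal sketch, assuming a hand-built `Doc` with no dependency parse; the `Reading` value given to ４月 is an assumption for illustration, not model output. It prints whichever value the retokenizer keeps rather than asserting a particular one.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built doc; the Reading values are illustrative, not model output.
doc = Doc(
    Vocab(),
    words=["４月", "１日"],
    spaces=[False, False],
    morphs=["Reading=シガツ", "Reading=ツイタチ"],
)
before = [token.morph.get("Reading") for token in doc]

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

# The merged token carries a single Reading value taken from one token.
after = doc[0].morph.get("Reading")
print(before, "->", after)
```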

As a workaround, you can merge the values using your own custom method before retokenizing. Set the same value on all tokens in the entity/span to be sure that this value gets used for the new token:

from spacy.tokens import MorphAnalysis
span = doc[0:2]
reading = "".join([token.morph.get("Reading")[0] for token in span])
for token in span:
    morph_dict = token.morph.to_dict()
    morph_dict["Reading"] = reading
    token.morph = MorphAnalysis(nlp.vocab, morph_dict)
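The snippet above hard-codes one span. A possible generalisation, sketched under some assumptions, is a small custom component that applies the same idea to every entity before `merge_entities` runs. The names `unify_feature` and `unify_entity_readings` are hypothetical, the example `Doc` is built by hand with illustrative readings, and tokens that lack a `Reading` are skipped instead of raising an `IndexError`.

```python
from spacy.language import Language
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def unify_feature(span, feature="Reading"):
    """Write the concatenated `feature` values onto every token in `span`."""
    # Use the first value per token; tokens lacking the feature are skipped.
    joined = "".join(
        value for token in span for value in token.morph.get(feature)[:1]
    )
    if not joined:
        return
    for token in span:
        morph_dict = token.morph.to_dict()
        morph_dict[feature] = joined
        token.set_morph(morph_dict)


# Hypothetical component name; intended to run before "merge_entities".
@Language.component("unify_entity_readings")
def unify_entity_readings(doc):
    for ent in doc.ents:
        unify_feature(ent)
    return doc


# Hand-built example; the readings and the DATE label are illustrative only.
doc = Doc(
    Vocab(),
    words=["４月", "１日", "に"],
    spaces=[False, False, False],
    morphs=["Reading=シガツ", "Reading=ツイタチ", "Reading=ニ"],
)
doc.ents = [Span(doc, 0, 2, label="DATE")]
doc = unify_entity_readings(doc)

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print(doc[0].text, doc[0].morph)
```

In a real pipeline the component would presumably be added with `nlp.add_pipe("unify_entity_readings", before="merge_entities")`. Spans merged by `merge_subtokens` would need similar handling and are not covered by this sketch.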

1 reply

lawctan Jul 29, 2023
Author

Thank you, that approach works!
