Inquiry on how to go about improving Japanese parsing quality #6749

nlovell1 · 2021-01-17T22:32:05Z


Hi! I'm a newbie to NLP and the whole world of language as it relates to technology, so there are probably a lot of gaps in my knowledge. Here is my question.

Essentially, I am an avid user of Morphman, which has recently been experimenting with spaCy's models to parse sentences so that sentence flashcards can be reordered such that each card introduces only one new vocabulary item or grammar structure.

Take 気になった、 for example: it seems to be parsed as [('気', 'NOUN'), ('に', 'ADP'), ('なっ', 'VERB'), ('た', 'AUX'), ('。', 'PUNCT')]. While the model does extremely well at catching exactly how the morphemes relate to one another, this type of breakdown is not helpful for the learner. Going by JMDict (the primary Japanese-English dictionary), I imagine it would be parsed as something like 気になっ (VERB) た (AUX). I know I can add exceptions to the parsing, but I am wondering whether there is a way to brute-force the JMDict entries into the great parsing going on here, so that the output is more semantically descriptive.

Additionally, has there been any interest in training a model that works well for spoken language?

Thanks a ton

polm · 2021-01-18T06:59:23Z


Given what you want to do, I would recommend running JMDict entries through spaCy and using them to generate Matcher rules. Then you could have a "JMDict entry" entity and extract that.

> I imagine it would be parsed as something like 気になっ (VERB) た (AUX).

You seem to be suggesting you want to change the output of the tokenizer and POS tagger, but that doesn't really make sense. What the Entity Matcher will do is give you Spans, which are lists of tokens in the sentence, which you can turn into single strings if you need to.
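A minimal sketch of that suggestion, under stated assumptions: the `Doc` objects are built by hand so the example runs without a Japanese model, the lemmas are the ones I would expect from the tokenization quoted above, and `pattern_from_entry` and the `JMDICT` label are hypothetical names. In real use, both `entry_doc` and `doc` would come from `nlp(...)` with a loaded Japanese pipeline.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()


def make_doc(words, pos, lemmas):
    """Hand-built stand-in for nlp(text); Japanese text has no spaces."""
    return Doc(vocab, words=words, spaces=[False] * len(words),
               pos=pos, lemmas=lemmas)


def pattern_from_entry(entry_doc):
    """Turn a parsed dictionary headword into a lemma-based Matcher pattern."""
    return [{"LEMMA": token.lemma_} for token in entry_doc]


# Assumed analysis of the JMDict headword 気になる.
entry_doc = make_doc(["気", "に", "なる"],
                     ["NOUN", "ADP", "VERB"],
                     ["気", "に", "なる"])

matcher = Matcher(vocab)
matcher.add("JMDICT", [pattern_from_entry(entry_doc)])

# The sentence from the original post, with assumed lemmas.
doc = make_doc(["気", "に", "なっ", "た", "。"],
               ["NOUN", "ADP", "VERB", "AUX", "PUNCT"],
               ["気", "に", "なる", "た", "。"])

# as_spans=True returns Span objects labelled with the match key.
spans = matcher(doc, as_spans=True)
found = [(span.label_, span.text) for span in spans]
print(found)
```

Matching on lemmas rather than surface forms lets one rule per dictionary entry cover inflected forms such as なっ. The tokens themselves are left untouched; the match is only a `Span` over them.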

40 replies

lawctan Jul 24, 2023

Hi @nlovell1, I'm curious whether you have been able to solve this issue with Japanese parsing quality. Did you end up just doing the retokenization approach? Thanks!

nlovell1 Jul 24, 2023
Author

Hi @lawctan, it's my impression that detecting multiword expressions (MWEs) in general is not something that's of great interest in the field. Some attempts have had limited success (Babieno 2022, for metaphors, for example), but accuracy never breaks 70 percent. I would suggest rethinking your use case (why do you need to detect MWEs?).

This problem is likely outside the scope of spaCy.

Let me know if you have any questions.

lawctan Jul 25, 2023

It's basically to achieve something similar to what jisho.org can do. For example, in https://jisho.org/search/%E6%98%A8%E6%97%A5%E3%81%99%E3%81%8D%E7%84%BC%E3%81%8D%E3%82%92%E9%A3%9F%E3%81%B9%E3%81%BE%E3%81%97%E3%81%9F, it's able to tell that 食べました belongs together.

lawctan Jul 25, 2023

Eventually, I went with an approach where I looked at the dep and pos fields of the tokens neighboring a verb in order to retokenize them.
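One way such an approach might look, as a rough sketch rather than the code actually used: the `Doc` is built by hand with an assumed tokenization, tags, and dependency labels for 昨日すき焼きを食べました, and `merge_verbs_with_auxiliaries` is a hypothetical helper name. With a real pipeline, the `Doc` would come from `nlp(...)` and the labels may differ.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def merge_verbs_with_auxiliaries(doc):
    """Merge each VERB with the AUX tokens that directly follow it and attach to it."""
    spans = []
    i = 0
    while i < len(doc):
        if doc[i].pos_ == "VERB":
            j = i + 1
            # Extend over consecutive auxiliaries whose head is this verb.
            while (j < len(doc) and doc[j].pos_ == "AUX"
                   and doc[j].dep_ == "aux" and doc[j].head.i == i):
                j += 1
            if j > i + 1:
                spans.append(doc[i:j])
            i = j
        else:
            i += 1
    # Collect spans first, then merge them all inside one retokenize block.
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc


# Assumed analysis; heads are absolute token indices.
words = ["昨日", "すき焼き", "を", "食べ", "まし", "た"]
doc = Doc(Vocab(), words=words, spaces=[False] * len(words),
          pos=["NOUN", "NOUN", "ADP", "VERB", "AUX", "AUX"],
          heads=[3, 3, 1, 3, 3, 3],
          deps=["obl", "obj", "case", "ROOT", "aux", "aux"])

merged = [token.text for token in merge_verbs_with_auxiliaries(doc)]
print(merged)
```

Checking the head as well as the part of speech keeps an auxiliary that belongs to some other word from being swallowed by a neighboring verb.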

lawctan Jul 25, 2023

Actually, I just realized that the original post is about a word that was treated as a noun even though it should have been treated as part of a verb.
