Incorrect tokenization of dash punctuation in Spanish when not preceded or followed by a space #13055
Unanswered
jonathanknebel
asked this question in
Help: Other Questions
Replies: 2 comments
-
Thank you for reporting this issue! We will evaluate whether it makes sense to change the Spanish default (since it would break the opposite case, where text mistakenly uses en/em dashes in place of hyphens). In the meantime, you could add the en/em dash to the tokenizer's infix expressions to handle this.
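A minimal sketch of that workaround, assuming spaCy v3 and a blank Spanish pipeline. The pattern and the name `EXTRA_INFIXES` are illustrative assumptions, not the snippet from the original reply. It appends one character class covering the en dash (U+2013) and em dash (U+2014) to the default infix patterns and recompiles the tokenizer's infix matcher:

```python
import spacy
from spacy.util import compile_infix_regex

# Blank Spanish pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("es")

# Assumption: treat every en dash (U+2013) and em dash (U+2014) as an infix,
# even when it has no surrounding whitespace. EXTRA_INFIXES is a made-up name.
EXTRA_INFIXES = [r"[\u2013\u2014]"]

# Defaults.infixes may be a list or a tuple depending on the language data.
infixes = list(nlp.Defaults.infixes) + EXTRA_INFIXES
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

text = "—Pues bien,—dijo el extranjero,—el año que viene debe Vd. hacer el tiempo para sus viñas."
tokens = [t.text for t in nlp(text)]
print(tokens)
```

As far as I can tell, the tokenizer does not re-split the pieces on either side of an infix, so a comma directly before the dash may stay attached to the preceding word (`bien,`) unless another pattern handles it. This rule also splits en/em dashes that were typed in place of hyphens, which is the trade-off mentioned above.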
-
Thank you. I ended up implementing basically this very workaround.
-
This is related to this (now closed) issue: #3277.
How to reproduce the behaviour
Following the fixes for the above issue (https://github.com/explosion/spaCy/pull/3281/files), an en/em dash is now split into its own token whenever it is preceded or followed by a space. When the dash has no space on either side, that is, when it touches a word or punctuation mark on both sides, it seems to be treated as a hyphen and is not split off. Worse, the dash and its preceding punctuation mark are merged into a single token together with the word before the punctuation mark and the word after the dash, as in the example below.
["—", "Pues", "bien,—dijo", "el", "extranjero,—el", "año", "que", "viene", "debe", "Vd.", "hacer", "el", "tiempo", "para", "sus", "viñas", "."]
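The token list above can be reproduced with something like the following sketch. The input sentence is reconstructed from the tokens shown (an assumption, since the original snippet is not preserved here), and a blank Spanish pipeline is assumed to behave like the default Spanish tokenizer:

```python
import spacy

# Blank Spanish pipeline: uses the default Spanish tokenizer rules only.
nlp = spacy.blank("es")

# Assumption: sentence rebuilt by joining the reported tokens with spaces.
text = "—Pues bien,—dijo el extranjero,—el año que viene debe Vd. hacer el tiempo para sus viñas."

tokens = [t.text for t in nlp(text)]
print(tokens)
```

On affected versions, `bien,—dijo` and `extranjero,—el` each come back as a single token.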
There are many instances in which this Spanish dash is connected on both sides in dialogue. Almost every Spanish book at Gutenberg.org that has dialogue has examples of this: https://www.gutenberg.org/ebooks/search/?query=l.spanish
The issue is not unique to 3.5; it also affects earlier versions of spaCy.
Your Environment