NER for alphanumeric tokens #11374
-
My text mainly contains alphanumeric tokens, and I want to detect units within sentences using spaCy NER.
1. Consider an example: how does '5centimeter' get a vector with MultiHashEmbed, supposing it's tokenised as ['5centimeter'] and isn't present in the vocabulary? For the ORTH representation, wouldn't both 'sm' and 'No model' initialise the vector to [0, 0] and proceed with training (assuming it's not present in the vocab)? I was also doubting whether I should be using en_core_web_sm at all, since my objective of measurements with numbers is not at all related to the original en_core_web_sm objective. I have masked numerals with a particular token, like '#'.
2. Text
3. For MultiHashEmbed, I saw an example in https://explosion.ai/blog/bloom-embeddings showing …
-
Just before answering your specific questions - while it's good to understand what the code is doing, and general principles of what works and doesn't, NLP is often empirical. For any specific project, one of the most important things is to get an end-to-end pipeline working as a baseline and then improve performance iteratively. This helps you make sure that your task is feasible, and lets you avoid overthinking details that may end up not having a significant impact. In particular, CPU models are very fast to train, so it's easy to try out different configurations.
MultiHashEmbed, as the name suggests, uses multiple sources of features, not just word embeddings. So for example you would get features from word shape, prefix, and suffix, depending on your configuration, even without a word vector for the given word.
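To make that concrete, here is a minimal sketch (the example sentence and token are my own, not from this thread) of the lexical attributes MultiHashEmbed can hash for an out-of-vocabulary token like '5centimeter'; these are computed on the fly, even with a blank pipeline and no vectors:

```python
import spacy

# A blank English pipeline: no vectors, no trained components, just the
# tokenizer and the lexical attributes spaCy computes on the fly.
nlp = spacy.blank("en")
doc = nlp("The pipe is 5centimeter wide.")

tok = next(t for t in doc if t.text == "5centimeter")
print(tok.norm_)    # '5centimeter'  (NORM)
print(tok.prefix_)  # '5'            (PREFIX: first character by default)
print(tok.suffix_)  # 'ter'          (SUFFIX: last three characters by default)
print(tok.shape_)   # 'dxxxx'        (SHAPE: digits -> 'd', lowercase -> 'x', long runs truncated)
```

Each of these attributes is hashed into its own embedding table, so a token that was never seen in training still gets a non-trivial representation from its shape, prefix and suffix.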
It's true the pretrained pipelines are unlikely to help you in this task, since they're trained on very different kinds of text.
There's no concept of "special characters" in spaCy input, so this can work. Regarding masking in general, ideally the word shape feature would take care of this for you, but for numbers like this it's not unheard of for masking to help. Do note that your # will be confused with pre-existing hashes. That may not be a problem, depending on your data.
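As a quick illustration of why the shape feature often already normalises numerals (the example strings below are arbitrary, my own):

```python
import spacy

nlp = spacy.blank("en")

# SHAPE maps every digit to 'd', so numbers with different values already
# collapse onto a small set of shapes without any manual masking.
for text in ["5", "42", "3.5", "2022", "5centimeter"]:
    print(text, "->", nlp(text)[0].shape_)
# 5           -> d
# 42          -> dd
# 3.5         -> d.d
# 2022        -> dddd
# 5centimeter -> dxxxx
```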
MultiHashEmbed typically, and by default, does not just use word vectors, but also uses token attributes, as mentioned above. Regarding your questions about fastText and windows: context surrounding a token is captured in the CNN layer of the tok2vec (or similarly in Transformers), so you don't have to do anything extra for it. For subtoken information, you should look at floret.
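For reference, here is a sketch of the encoder half of a typical CPU tok2vec config (the values are common defaults, not something prescribed in this thread), parsed from Python to keep it self-contained. Each MaxoutWindowEncoder layer looks window_size tokens to either side, and the layers are stacked depth times, so the effective receptive field is roughly depth * window_size tokens on each side:

```python
from spacy.util import load_config_from_str

# Illustrative encoder settings: with depth = 4 and window_size = 1, each
# token effectively "sees" about 4 neighbours on either side.
ENCODER_CFG = """
[model]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
"""

config = load_config_from_str(ENCODER_CFG)
print(config["model"])
```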
Tokeniser prefixes and the prefix used by MultiHashEmbed are different things and don't interact; see here for the source that MultiHashEmbed uses.
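A minimal sketch of that distinction (the example string is my own): tokenizer prefixes are regex rules that split characters off the front of a whitespace-delimited chunk, while the PREFIX lexical attribute that MultiHashEmbed can hash is, by default, just the first character of each resulting token:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("($5centimeter)")

# Tokenizer prefix/suffix rules split off the punctuation and the currency sign.
print([t.text for t in doc])     # ['(', '$', '5centimeter', ')']

# Token.prefix_ (the PREFIX attribute) is computed per token afterwards.
print([t.prefix_ for t in doc])  # ['(', '$', '5', ')']
```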