Custom NER doesn't predict new entities #11654
Hello all,

To do the annotation, I used a list of URLs (news text), a list of known malware names, and PhraseMatcher. Then, using the spaCy config system, I created a config file for an NER + transformer model and trained it. It works in the sense that it detects the entities I used to train it, but it doesn't detect new malware names. I tried tok2vec, transformers, etc., without success. Could you help me improve it?
Replies: 1 comment 6 replies
What context do you want to predict malware names in?

spaCy's NER is designed to detect things like the names of people, organizations, or places in newspaper articles. So the main features are token features (parts of the actual tokens to label, like being upper case, or using words like "John" or "LLC") or context features ("Today [XXX] said...", "Recently [XXX] was acquired by [YYY] ..."). This general model of entities is effective in many contexts, but not for the samples you've given here.

If you have just a list of names and want to classify malware, or if you have just a URL, those are going to be single tokens, and spaCy doesn't really have enough information to make a reasonable decision about items it hasn't seen before. Those documents will have no context features, and token features, even allowing for prefix/suffix information, will be severely limited.

As a workaround, you could use a custom tokenizer to split URLs into meaningful segments, or even to split the input into individual characters as tokens, but I'm not sure how well that would work.

Unfortunately I'm not very familiar with how malware classification works, but it might be worth looking at what features are used in recent research. spaCy's features aren't that unusual, so I would expect most other standard NER frameworks to have similar issues.
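To make the tokenizer workaround a bit more concrete, here is a minimal sketch of the splitting step only, using just the Python standard library. The helper names `split_url` and `split_chars`, the separator set, and the example URL are all assumptions made for illustration; none of this is spaCy API, and whether the resulting tokens actually help the model is untested.

```python
import re
from typing import List
from urllib.parse import urlsplit


def split_url(url: str) -> List[str]:
    """Hypothetical helper: break a URL into smaller, more word-like segments."""
    parts = urlsplit(url)
    pieces = [parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment]
    segments: List[str] = []
    for piece in pieces:
        # Split on punctuation commonly used as separators inside URLs
        # (an assumed set; adjust it to the data) and drop empty strings.
        segments.extend(s for s in re.split(r"[./\-_?=&:+%]+", piece) if s)
    return segments


def split_chars(text: str) -> List[str]:
    """Hypothetical helper: the character-level alternative, one token per character."""
    return list(text)


print(split_url("https://news.example.com/2022/10/emotet-returns?ref=home"))
print(split_chars("Emotet"))
```

To use something like this in a pipeline, a custom tokenizer would still need to build its `Doc` from these pieces, and the training annotations would have to be aligned to the new token boundaries.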