Custom NER doesn't predict new entities #11654

MagiCsito · 2022-10-15T13:03:28Z

Hello all,
The idea of the project is to use NER to identify malware names, including new malware names that appear in the future. I tried some pre-trained models, but they didn't work well, so I decided to create a custom NER model.

For the annotation step, I used a list of URLs (news text), a list of known malware names, and PhraseMatcher. Then, using the spaCy config system, I created a config file for an NER + transformer model and trained it.

It works in the sense that it detects the entities I used to train it; however, it doesn't detect new malware names. I tried tok2vec, transformers, etc., but without success. Could you help me improve it?

Answered by polm

polm · 2022-10-17T04:49:09Z

What context do you want to predict malware names in?

spaCy's NER is designed to detect things like the names of people, organizations, or places in newspaper articles. So the main features are token features (parts of the actual tokens to label, like being upper case, or using words like "John" or "LLC") or context features ("Today [XXX] said...", "Recently [XXX] was acquired by [YYY] ..."). This general model of entities is effective in many contexts, but not for the samples you've given here.
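To make the distinction between token features and context features concrete, here is a minimal sketch in plain Python. It is not spaCy's actual feature extraction: the `word_shape` and `describe` helpers, the window size, and the example sentence are all assumptions made up for illustration. It only shows that a name on its own carries the same token features as the name inside a sentence, but none of the surrounding context.

```python
def word_shape(token):
    """Simplified word shape: uppercase -> X, lowercase -> x, digit -> d."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # keep punctuation and other characters as-is
    return "".join(out)


def describe(tokens, i, window=2):
    """Collect token features for tokens[i] plus its neighbours as context."""
    tok = tokens[i]
    return {
        "token": {
            "lower": tok.lower(),
            "prefix": tok[:1],
            "suffix": tok[-3:],
            "shape": word_shape(tok),
        },
        "left": tokens[max(0, i - window):i],
        "right": tokens[i + 1:i + 1 + window],
    }


# Hypothetical inputs: the same name inside a sentence and on its own.
in_article = "Recently the LockBit ransomware hit a hospital".split()
alone = ["LockBit"]

a = describe(in_article, 2)
b = describe(alone, 0)

print(a["left"], a["right"])  # neighbouring words are available
print(b["left"], b["right"])  # nothing to learn context from
```

In the sentence, the neighbouring words ("the", "ransomware") are what would let a model generalise to an unseen name in the same slot. With the bare name, only the spelling is left.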

If you have just a list of names and want to classify malware, or if you have just a URL, those are going to be single tokens, and spaCy doesn't really have enough information to make a reasonable decision about items it hasn't seen before. Those documents will have no context features, and token features, even allowing for prefix/suffix information, will be severely limited.

As a workaround, you could use a custom tokenizer to split URLs into meaningful segments, or even to split the input into individual characters as tokens, but I'm not sure how well that would work.
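As a rough sketch of the splitting idea only (not a spaCy tokenizer, and with a made-up example URL), the segmentation could look something like this in plain Python. The `split_url` helper and the choice to split on every non-alphanumeric character are assumptions; wiring the result into a custom spaCy tokenizer is left out.

```python
import re
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into alphanumeric segments (scheme, host parts, path parts, query parts)."""
    parts = urlsplit(url)
    segments = []
    for piece in (parts.scheme, parts.netloc, parts.path, parts.query):
        # Split on any run of non-alphanumeric characters and drop empty strings.
        segments.extend(s for s in re.split(r"[^A-Za-z0-9]+", piece) if s)
    return segments


# Hypothetical URL, used only to show the shape of the output.
segments = split_url("http://example.com/downloads/free-update.exe?id=42")
print(segments)

# The character-level alternative is simply one token per character.
chars = list("free-update.exe")
print(chars)
```

Either way, the model would then see several tokens per URL instead of one opaque string, which at least gives it something to condition on.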

Unfortunately I'm not very familiar with how malware classification works, but it might be worth looking at what features are used in recent research - spaCy's features aren't that unusual, so I would expect most other standard NER frameworks to have similar issues.

6 replies

polm Oct 19, 2022

Ah OK, I misunderstood what you are trying to do. Sometimes people try to classify URLs (not their contents) to determine if they have malware.

If you are trying to predict ransomware names in articles, then your training data should look like the article text, with ransomware names annotated. If you use just the names of ransomware it won't work, as the model can't learn the context around the words (like I described above).
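A minimal sketch of what "article text with names annotated" could look like, assuming annotations are stored as character offsets over the full sentence. The `annotate` helper, the `MALWARE` label, and the example sentence are made up for illustration, and the regex matching is only a stand-in for PhraseMatcher. Overlapping names are not handled.

```python
import re


def annotate(text, known_names, label="MALWARE"):
    """Return (text, {"entities": [(start, end, label), ...]}) for every known name found in text."""
    spans = []
    for name in known_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), label))
    return text, {"entities": sorted(spans)}


# Illustrative sentence; the whole sentence is kept, not just the names.
article = "Researchers say the Emotet botnet is back, and it now drops Qakbot."
example = annotate(article, ["Emotet", "Qakbot"])

for start, end, label in example[1]["entities"]:
    print(article[start:end], label)
```

The point is that each training example keeps the surrounding words. Offsets like these can then be turned into spans on a spaCy `Doc` (for example with `Doc.char_span`) before serializing the training data.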

polm Oct 19, 2022

Also, this flowchart might be helpful.

https://github.com/explosion/assets/blob/main/Prodigy/Prodigy_NER_flowchart_v2_0_0_light.pdf

MagiCsito Oct 19, 2022
Author

OK, this is exactly what I am doing, so I guess I need to use a bigger dataset.
Thank you, polm.

polm Oct 20, 2022

You might also want to look into "weak supervision", which it seems is what you're doing by automatically adding entities. skweak is one library that works with spaCy, and we have a sample project using it here.
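To illustrate the general idea of weak supervision, here is a small plain-Python sketch. It does not use skweak's API. The labelling functions, the cue-word list, the made-up name "Zorblax", and the simple union of votes are all assumptions for illustration; a real library would combine noisy labelling functions in a more principled way.

```python
import re

KNOWN = {"emotet", "qakbot"}  # hypothetical list of known malware names
CUES = {"ransomware", "trojan", "botnet", "malware"}  # hypothetical cue words


def lf_gazetteer(tokens):
    """Labelling function 1: vote for tokens found in the list of known names."""
    return {i for i, tok in enumerate(tokens) if tok.lower() in KNOWN}


def lf_context(tokens):
    """Labelling function 2: vote for a capitalised token directly followed by a cue word."""
    hits = set()
    for i in range(len(tokens) - 1):
        if tokens[i][:1].isupper() and tokens[i + 1].lower() in CUES:
            hits.add(i)
    return hits


def weak_labels(text):
    """Combine both labelling functions by taking the union of their votes."""
    tokens = re.findall(r"\w+", text)
    votes = lf_gazetteer(tokens) | lf_context(tokens)
    return [tokens[i] for i in sorted(votes)]


found = weak_labels("The Emotet botnet returned while the Zorblax ransomware spread.")
print(found)
```

The context-based function is what lets a name that is missing from the list still get labelled, which is the kind of training signal a list lookup alone cannot provide.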

MagiCsito Oct 20, 2022
Author

Oh thank you so much sir! I will give it a try and give you the report :P
