Detect bots from user agent strings

User-agent-ml detects whether a user agent string belongs to a bot or to a regular browser. It takes a hybrid approach that combines rules with machine learning. The rule-based part keeps detection efficient, since most bot hits come from well-known sources (e.g., search engines). The machine learning part detects less well-known or new user agent strings, which would be hard to capture with a purely rule-based system.
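The hybrid idea described above can be illustrated with a minimal sketch. This is not the library's actual implementation: the pattern list, the `hybrid_is_bot` function, and the `toy_model` stand-in are all hypothetical names chosen for illustration, and the order in which user-agent-ml applies its rules and its model is an assumption here.

```python
import re
from typing import Callable

# Hypothetical rule list: substrings often seen in well-known crawler user agents.
KNOWN_BOT_PATTERN = re.compile(r"googlebot|bingbot|dotbot|crawler|spider", re.IGNORECASE)


def hybrid_is_bot(user_agent: str, model_predict: Callable[[str], bool]) -> bool:
    """Return True if the user agent string looks like a bot."""
    # Fast path: a cheap rule match settles the well-known cases.
    if KNOWN_BOT_PATTERN.search(user_agent):
        return True
    # Fallback: defer to a learned classifier for everything the rules miss.
    return model_predict(user_agent)


def toy_model(user_agent: str) -> bool:
    # Stand-in for a trained classifier (assumption): flags any agent
    # that does not start with the usual "Mozilla/" browser prefix.
    return not user_agent.startswith("Mozilla/")
```

With this sketch, a string containing "DotBot" is caught by the rule alone, while an unfamiliar string such as "python-requests/2.31.0" falls through to the model.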

Usage

import user_agent_ml

# Load the model file
uaml = user_agent_ml.user_agent_ml("data/user-agent.model")

uaml.predict("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12")
# False

uaml.predict("Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)")
# True

The .predict("user-agent-string") method returns True if the user agent string is identified as a bot, and False if it is identified as a regular browser.

Performance

System effectiveness, measured as weighted F1-score, is 99.11% (± 0.0303). The reported figure is the average over 10-fold cross-validation on 53,829 records and takes class imbalance into account: the data contains 50,918 examples of (mobile) browsers and 2,911 examples of bots.
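To make the metric concrete, the following sketch computes a weighted F1-score by hand: the per-class F1 is averaged with each class weighted by its share of the true labels, which is how class imbalance enters the figure. The `weighted_f1` function and the toy labels are illustrative assumptions, not the project's evaluation code or data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its number of true examples."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * count / total
    return score


# Toy data (hypothetical): 8 browsers (False) and 2 bots (True).
y_true = [False] * 8 + [True] * 2
y_pred = [False] * 7 + [True] + [True, False]

# Browser F1 = 0.875 (weight 0.8), bot F1 = 0.5 (weight 0.2).
result = weighted_f1(y_true, y_pred)
```

In a cross-validation setup, a score like this would be computed once per fold and then averaged over the folds.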

Machine learning

At its core, user-agent-ml uses a Random Forest classifier, an ensemble of decision trees. Features are mostly textual; for example, a vocabulary was created from all tokens in the user agent strings. More elaborate features and feature selection methods are likely to further increase classification effectiveness, but this is left for future versions.
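As an illustration of token-based features, the sketch below builds a vocabulary from a small set of user agent strings and turns a string into a vector of token counts. The tokenization rule (lowercased runs of letters) and all function names are assumptions made for this example; the project's actual feature extraction may differ.

```python
import re

# Assumed tokenization: runs of ASCII letters, lowercased.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def tokenize(user_agent):
    return [token.lower() for token in TOKEN_RE.findall(user_agent)]


def build_vocabulary(user_agents):
    """Map each distinct token to a column index, in order of first appearance."""
    vocabulary = {}
    for user_agent in user_agents:
        for token in tokenize(user_agent):
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def vectorize(user_agent, vocabulary):
    """Count vocabulary tokens in one user agent string; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokenize(user_agent):
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector


corpus = ["Mozilla/5.0 Safari/600.7.12", "Mozilla/5.0 (compatible; DotBot/1.1)"]
vocabulary = build_vocabulary(corpus)
features = vectorize("DotBot/1.1 Mozilla UnknownToken", vocabulary)
```

Vectors of this kind could then be passed to a Random Forest implementation, for example scikit-learn's `RandomForestClassifier`, together with the bot/browser labels.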

Acknowledgements

User-agent-ml is developed by 904Labs B.V. User-agent-ml uses training data from project MAUL. The code skeleton for machine learning was taken from project NERD. The development of user-agent-ml was partially supported by Europeana, the European Cultural Heritage Search Engine.
