Description
I'm concerned that we have a significant amount of bot traffic creating sessions on mybinder.org. We should confirm that this isn't the case, or fix this issue, if others are worried about this too.
Context
The 2i2c team (thanks @jmunroe!) recently discovered that we may have caught a significant amount of bot and scraper activity leading to launches on a public BinderHub that we run. Binder sessions were being spun up, followed by a period of no activity, before they were shut down. We think it's something like an LLM scraper or a bot that is hitting URLs and causing sessions to spin up.
Why I think we might have a lot of bots spawning Binder sessions
We use Plausible for web analytics, and I am pretty sure Plausible filters out any known bot activity [1]. So we can compare Plausible's logs against mybinder.org's analytics archive to get an idea of "launches that Plausible filtered out".
For example for December 9th:
- Visits to website logged by Plausible: 3,100
- Mybinder launch events: 5,184
December 5th:
- Visits to website logged by Plausible: 2,500
- Mybinder launch events: 4,398
In both cases, roughly 40% of mybinder.org launch events (about 40% on December 9th and 43% on December 5th) are not being logged by Plausible.
Some of them might be "back-end" launches where nobody touches the browser, but it's plausible that a high percentage of these unlogged launches come from bots that trigger Binder launches without being logged in Plausible.
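To make the arithmetic behind that estimate explicit, here is a minimal sketch. It assumes each Plausible visit corresponds to at most one launch, which is a simplification, and the helper name `unlogged_share` is hypothetical, just for illustration:

```python
def unlogged_share(launches: int, plausible_visits: int) -> float:
    """Fraction of launch events with no matching Plausible visit.

    Assumption (not from the analytics archive itself): one visit maps
    to at most one launch, so the surplus launches are "unlogged".
    """
    if launches <= 0:
        raise ValueError("launches must be positive")
    # Clamp at zero in case Plausible logs more visits than there are launches.
    return max(launches - plausible_visits, 0) / launches


# Figures quoted above: (launch events, visits logged by Plausible)
days = {
    "December 9th": (5184, 3100),
    "December 5th": (4398, 2500),
}

for day, (launches, visits) in days.items():
    share = unlogged_share(launches, visits)
    print(f"{day}: {share:.1%} of launches not seen by Plausible")
```

This only bounds the gap between the two data sources. It can't tell bots apart from legitimate "back-end" launches.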
Footnotes
[1] Plausible is a privacy-first service so we don't have access to the actual user agent names, HTTP requests, etc. That's why we have to kinda infer an outcome here.