add request Origin to launch events #2053
choldgraf
left a comment
This isn't a review of the implementation, but I think the functionality sounds reasonable to me. We are running a free, semi-anonymous public service. I think it's reasonable for us to ask for information about where requests are coming from. Especially given the more anonymous nature of script-based launches, and the potential for abuse, that feels like a useful signal to have.
|
This could reveal internal private domains so a 4th option is to record this for internal analysis, but not make it public on https://archive.analytics.mybinder.org/ @rgaiacs what do you think? |
|
Personally, I think it's OK if it reveals private domains. We're running a massive, public, free service. I think that it's reasonable to ask that the things that use that service are also public. If somebody wanted to be able to run BinderHub functionality but without doing so in public, it feels like a good target for "run your own BinderHub instead of relying on mybinder.org". The exception I could see here is if somebody would accidentally reveal some personally identifiable information (e.g. if they were testing something out locally, we don't want to accidentally identify them in public data). If that's a risk here, I think it's reasonable to try and avoid that! |
|
If this had been the policy from the beginning I'd 100% support it. The problem is we're changing what information is made available and we don't have a way to communicate that to everyone. I'm happy for this information to be public if @rgaiacs is since the Gesis node has to comply with Gesis policies. |
|
For reference here is our user privacy policy https://mybinder.readthedocs.io/en/latest/about/user-guidelines.html#how-we-ensure-user-privacy I think it does imply people aren't identifiable in their session use. Would this PR violate that? If so I suggest we
If not, then I think we could merge this change without violating expectations we have intentionally set |
|
fwiw I think this is fine to merge for our privacy policy. It's also mitigated by the fact that we are only tracking origins and not full URLs. To me this is in the same line as the fact that the repo you are using is also publicly visible here. |
|
I would be okay with just recording the value of If we do want to semi-privately track the Origin, we can use server-side events with plausible. Build URLs are not tracked at all in plausible (we get a plausible event for But (at least to some degree) our plausible analytics are public, too. I'm not sure how public events would be if we logged them to plausible in this way. Our public Plausible analytics already track Referer hosts: https://plausible.io/mybinder.org?period=7d , though not on an individual event level, only in aggregate. It will reveal private hosts if they link to mybinder.org. |
|
I'm curious what @manics thinks - Simon I think you're the most concerned about this, and I want to make sure you're comfortable with whatever we decide on. What do you think? |
rgaiacs
left a comment
The code looks good to me.
|
Regarding storing the URL or domain that invokes a new session on mybinder.org, my view is that it does not violate user privacy but does violate publisher privacy. I believe that publishers who care about their privacy / anonymity must run their own instance of BinderHub. Publishers that want to use mybinder.org must agree to have their contribution to mybinder.org load publicly recorded, the same way that we record the Git repositories.
My morning brain cannot think of a fair case for using thebe + mybinder.org from an internal private domain.
I can imagine the scenario where someone is doing "local" development, using traefik, of a website that uses thebe + mybinder.org. The "local" development is accessible at http://project-id.staff-id.company-name.com and we will end up with a domain that reveals personally identifiable information. I see this as the price to pay for using community-led free infrastructure. My recommendation is for us to hold this PR until the end of February. We write a blog post at https://blog.jupyter.org/ informing of the upcoming changes, add a banner to mybinder.org, and mention this change in the upcoming roadmap workshop. On 1st March, we merge this. |
|
I'm happy for this to go in, but thought it was important to be transparent about the risks. |
|
This makes me realize that we don't have a blog to communicate things like this, other than the jupyter blog which feels like a bit much for a user comms post... Hmmm, how could we communicate this change and what it might mean for a user that was doing private launches in this way? Make a Discourse post? Either way, I don't think we should block merging this on that! |
|
I think the Jupyter Blog is OK as long as it's pitched in the right way, e.g. describe the principles and justifications we've used to make this decision, and that we will follow in future when deciding what other usage information to make available. |
|
I think the Jupyter blog is a fine place for it |
|
btw, I re-read our privacy policy, and it doesn't seem to mention the events archive. It probably should. |
|
a couple updates to our user doc while re-reading it: https://github.com/jupyterhub/mybinder.org-user-guide/pulls |
|
The draft announcement looks good to me! |
|
I left a few brief notes in the blog post, I think it's good to go either way IMO. Thanks so much for writing that up! |
|
submitted to Jupyter blog, not sure who approves them at this point. |
|
https://jupyter.org/media_submissions#response-time-volunteer-working-group says blog submissions are reviewed weekly. |
|
End of February is here! |
jupyterhub/binderhub#2053 Merge pull request #2053 from minrk/track-origin
This will allow us to identify thebe-like launches from interactive documentation sites, which will always have an Origin set. When Origin is not set (typical for same-site requests), the Sec-Fetch-Site header is used, which will generally have the string same-origin. "script" launches (e.g. via curl/requests/etc.) will generally have null values here, though scripts can't be rigorously identified, since they can spoof any browser behavior they want.

So this new field will have values like:

- https://course.spacy.io (thebe-like launch)
- same-origin (typical browser launches)
- null (scripts that don't send browser info)

This adds the Origin websites (just proto://host[:port], not pages) using mybinder.org via e.g. Thebe to the public analytics archive. I think this is okay and appropriate, given the public nature of mybinder.org, but worth bringing up. If we don't want that info in the events archive, we could just record the Sec-Fetch-Site header, which would identify all thebe launches as cross-site, but not identify the site itself. Then we'd generally only have 3 values:

- cross-site (thebe launches)
- same-origin (browser launches)
- null (script launches)

(though same-site is technically possible as well, but not usefully distinguished from same-origin for our deployments, I think).

Note that Origin is not Referrer, so we're not tracking anything about how people get to binder here. Just the sites that embed binder kernels, which I think is fair game.
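
For readers skimming the thread, the fallback described above can be sketched roughly as follows. This is only an illustration of the described behaviour, not the code merged in this PR: the function name `launch_origin` and the plain-dict header handling are assumptions made for the example.

```python
from typing import Mapping, Optional


def launch_origin(headers: Mapping[str, str]) -> Optional[str]:
    """Pick the value to record for a launch event (hypothetical helper).

    Prefers the Origin header, falls back to Sec-Fetch-Site, and returns
    None (serialized as null) when neither header is present.
    """
    # HTTP header names are case-insensitive; normalise so a plain dict works.
    lowered = {name.lower(): value for name, value in headers.items()}

    origin = lowered.get("origin")
    if origin:
        # Cross-origin browser launches (e.g. thebe): proto://host[:port] only.
        return origin

    # Typical browser launches from the site itself: usually "same-origin".
    # Scripts that send no browser headers end up with None.
    return lowered.get("sec-fetch-site") or None


if __name__ == "__main__":
    print(launch_origin({"Origin": "https://course.spacy.io"}))
    print(launch_origin({"Sec-Fetch-Site": "same-origin"}))
    print(launch_origin({"User-Agent": "curl/8.0"}))
```

As noted in the description, none of these values is trustworthy for identifying scripts, since a script can send whatever headers it likes.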