add request Origin to launch events by minrk · Pull Request #2053 · jupyterhub/binderhub

minrk · 2026-01-14T17:49:48Z

This will allow us to identify thebe-like launches from interactive documentation sites, which will always have an Origin set. When Origin is not set (typical for same-site requests), the Sec-Fetch-Site header is used, which will generally have the string same-origin. "script" launches (e.g. via curl/requests/etc.) will generally have null values here, though scripts can't be rigorously identified, since they can spoof any browser behavior they want.

So this new field will have values like:

https://course.spacy.io (thebe-like launch)
same-origin (typical browser launches)
null (scripts that don't send browser info)
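As a rough illustration of the fallback described above, here is a minimal sketch, not the PR's actual implementation. The function name `get_request_origin` and the plain-mapping header interface are assumptions made for this example; the real handler may read headers differently.

```python
from typing import Mapping, Optional


def get_request_origin(headers: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: pick the value to record for a launch event.

    Prefer the Origin header (set on cross-origin, thebe-like launches),
    fall back to Sec-Fetch-Site (typically "same-origin" for browser
    launches), and return None when neither header is present.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    origin = lowered.get("origin")
    if origin:
        return origin
    # An empty string is treated the same as a missing header.
    return lowered.get("sec-fetch-site") or None
```

A client that sends neither header (e.g. a plain curl call) yields `None`, which would show up as null in the event record. As noted above, a script can still send whatever headers it likes, so this is a signal, not proof.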

This adds the Origin of websites that use mybinder.org via e.g. Thebe (just proto://host[:port], not pages) to the public analytics archive. I think this is okay and appropriate, given the public nature of mybinder.org, but it's worth bringing up. If we don't want that info in the events archive, we could just record the Sec-Fetch-Site header, which would identify all thebe launches as cross-site, but not identify the site itself. Then we'd generally only have 3 values:

cross-site (thebe launches)
same-origin (browser launches)
null (script launches)

(though same-site is technically possible as well, but not usefully distinguished from same-origin for our deployments, I think).
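For the Sec-Fetch-Site-only alternative, the mapping from header value to launch type could be sketched as follows. The function name `classify_launch` and the category labels are hypothetical, chosen only to mirror the three values listed above; nothing like this is claimed to exist in BinderHub.

```python
from typing import Optional


def classify_launch(sec_fetch_site: Optional[str]) -> str:
    """Hypothetical classifier for the Sec-Fetch-Site-only variant."""
    if sec_fetch_site is None:
        # No browser metadata at all: typical of curl/requests-style clients.
        return "script"
    if sec_fetch_site == "cross-site":
        # Page on another site talking to the hub: thebe-like embedding.
        return "thebe-like"
    if sec_fetch_site in ("same-origin", "same-site"):
        # Not usefully distinguished from each other for this purpose.
        return "browser"
    # Any other value (e.g. "none" for direct navigation) is left unclassified.
    return "other"
```

This only separates the three groups; unlike recording Origin, it says nothing about which site embedded the launch.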

Note that Origin is not Referrer, so we're not tracking anything about how people get to binder here. Just the sites that embed binder kernels, which I think is fair game.

choldgraf

This isn't a review of the implementation, but I think the functionality sounds reasonable to me. We are running a free, semi-anonymous public service. I think it's reasonable for us to ask for information about where requests are coming from. Especially given the more anonymous nature of script-based launches, and the potential for abuse, that feels like a useful signal to have.

manics · 2026-01-14T19:13:44Z

This could reveal internal private domains so a 4th option is to record this for internal analysis, but not make it public on https://archive.analytics.mybinder.org/

@rgaiacs what do you think?

choldgraf · 2026-01-14T19:59:00Z

Personally, I think it's OK if it reveals private domains. We're running a massive, public, free service. I think that it's reasonable to ask that the things that use that service are also public. If somebody wanted to be able to run BinderHub functionality but without doing so in public, it feels like a good target for "run your own BinderHub instead of relying on mybinder.org"

The exception I could see here is if somebody would accidentally reveal some personally-identifiable information (e.g. if they were testing something out locally, we don't want to accidentally identify them in public data). If that's a risk here, I think it's reasonable to try and avoid that!

manics · 2026-01-14T20:20:44Z

If this had been the policy from the beginning I'd 100% support it. The problem is we're changing what information is made available and we don't have a way to communicate that to everyone.

I'm happy for this information to be public if @rgaiacs is since the Gesis node has to comply with Gesis policies.

choldgraf · 2026-01-14T21:10:14Z

For reference here is our user privacy policy

https://mybinder.readthedocs.io/en/latest/about/user-guidelines.html#how-we-ensure-user-privacy

I think it does imply people aren't identifiable in their session use. Would this PR violate that?

If so, I suggest we:

make the simplest possible fix to track thebe launches so that it doesn't violate it.
note that we are changing our policy in N months to start tracking the origin of a request to programmatically spawn binder sessions

If not, then I think we could merge this change without violating expectations we have intentionally set

yuvipanda · 2026-01-14T21:35:38Z

fwiw I think this is fine to merge for our privacy policy. It's also mitigated by the fact that we are only tracking origins and not full URLs. To me this is in the same line as the fact that the repo you are using is also publicly visible here.
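To make the "origins, not full URLs" point concrete, here is a small sketch of how much of a page URL survives in an origin. The helper `origin_of` and the example page URLs are made up for illustration; browsers compute the Origin header themselves, so the server never needs to do this.

```python
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Hypothetical helper: reduce a URL to proto://host[:port].

    Path, query string, and fragment are dropped, which is why an
    origin identifies the embedding site but not the page visited.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Keep the port only when the URL spells one out explicitly.
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return f"{parts.scheme}://{host}"
```

For example, `origin_of("https://course.spacy.io/en/chapter1?step=2")` gives `https://course.spacy.io`, so the chapter and query are not part of what would be recorded.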

minrk · 2026-01-15T00:06:41Z

I would be okay with just recording the value of Sec-Fetch-Site. This will separate thebe-like launches, regular browser visits, and typical (non-spoofy) script launches. We can add this request_origin later, if we need it.

If we do want to semi-privately track the Origin, we can use server-side events with plausible. Build URLs are not tracked at all in plausible (we get a plausible event for /v2/gh/..., but not /build/gh/...), so adding them would technically increase the accuracy of our plausible analytics, and we wouldn't double-count anything. It would mean plausible always tracks builds, not just page visits.

But (at least to some degree) our plausible analytics are public, too. I'm not sure how public events would be if we logged them to plausible in this way. Our public Plausible analytics already track Referer hosts: https://plausible.io/mybinder.org?period=7d , though not on an individual event level, only in aggregate. It will reveal private hosts if they link to mybinder.org.

choldgraf · 2026-01-15T01:01:11Z

I'm curious what @manics thinks - Simon I think you're the most concerned about this, and I want to make sure you're comfortable with whatever we decide on. What do you think?

rgaiacs

The code looks good to me.

rgaiacs · 2026-01-15T08:32:02Z

Regarding storing the URL or domain that invokes a new session on mybinder.org, my view is that it does not violate user privacy but does violate publisher privacy. I believe that publishers who are interested in their privacy / anonymity must run their own instance of BinderHub. Publishers that want to use mybinder.org must agree to have their contribution to the mybinder.org load publicly recorded, the same way that we record the Git repositories.

> This could reveal internal private domains

My morning brain cannot think of a fair case for using thebe + mybinder.org from an internal private domain.

> The exception I could see here is if somebody would accidentally reveal some personally-identifiable information (e.g. if they were testing something out locally, we don't want to accidentally identify them in public data). If that's a risk here, I think it's reasonable to try and avoid that!

I can imagine a scenario where someone is using traefik for "local" development of a website that uses thebe + mybinder.org. The "local" development is accessible at http://project-id.staff-id.company-name.com, and we end up with a domain that reveals personally identifiable information. I see this as the price to pay for using community-led free infrastructure.

My recommendation is for us to hold this PR until the end of February. We write a blog post at https://blog.jupyter.org/ announcing the upcoming changes, add a banner to mybinder.org, and mention this change in the upcoming roadmap workshop. On 1st March, we merge this.

manics · 2026-01-15T09:53:20Z

I'm happy for this to go in, but thought it was important to be transparent about the risks.

choldgraf · 2026-01-15T16:53:32Z

This makes me realize that we don't have a blog to communicate things like this, other than the jupyter blog which feels like a bit much for a user comms post...

Hmmm, how could we communicate this change and what it might mean for a user that was doing private launches in this way? Make a Discourse post?

Either way, I don't think we should block merging this on that!

manics · 2026-01-15T18:06:46Z

I think the Jupyter Blog is OK as long as it's pitched in the right way, e.g. describe the principles and justifications we've used to make this decision, and that we will follow in future when deciding what other usage information to make available.

minrk · 2026-01-15T21:32:25Z

I think the Jupyter blog is a fine place for it

minrk · 2026-01-15T22:01:57Z

btw, I re-read our privacy policy, and it doesn't seem to mention the events archive. It probably should.

minrk · 2026-01-15T23:13:56Z

draft announcement: https://medium.com/@minrk/mybinder-org-adding-request-origin-to-events-archive-92fe0f954eab

minrk · 2026-01-15T23:20:43Z

a couple updates to our user doc while re-reading it: https://github.com/jupyterhub/mybinder.org-user-guide/pulls

manics · 2026-01-16T09:46:33Z

The draft announcement looks good to me!

choldgraf · 2026-01-16T17:17:19Z

I left a few brief notes in the blog post, I think it's good to go either way IMO. Thanks so much for writing that up!

minrk · 2026-01-24T23:07:23Z

submitted to Jupyter blog, not sure who approves them at this point.

manics · 2026-01-28T11:42:17Z

https://jupyter.org/media_submissions#response-time-volunteer-working-group says blog submissions are reviewed weekly.

manics · 2026-02-02T16:31:21Z

It's live https://blog.jupyter.org/mybinder-org-adding-request-origin-to-events-archive-92fe0f954eab

yuvipanda · 2026-02-25T19:42:26Z

End of February is here!

jupyterhub/binderhub#2053 Merge pull request #2053 from minrk/track-origin

add request Origin to launch events

3a1bb06

minrk requested a review from choldgraf January 14, 2026 17:49

choldgraf approved these changes Jan 14, 2026


manics requested a review from rgaiacs January 14, 2026 19:13

jupyterhub-pr-triage-board-bot bot added this to PR triage (experimental) Jan 14, 2026

rgaiacs approved these changes Jan 15, 2026


Sync with main

ce1f827

minrk merged commit 9dd0c2c into jupyterhub:main Mar 2, 2026
20 of 21 checks passed

minrk deleted the track-origin branch March 2, 2026 22:40

github-project-automation bot moved this to Done in PR triage (experimental) Mar 2, 2026

consideRatio pushed a commit to jupyterhub/helm-chart that referenced this pull request Mar 2, 2026

[binderhub] Automatic update for commit 1.0.0-0.dev.git.3910.h9dd0c2cc

0d39a4c

jupyterhub/binderhub#2053 Merge pull request #2053 from minrk/track-origin

jupyterhub-bot mentioned this pull request Mar 2, 2026

Updates binderhub chart to 1.0.0-0.dev.git.3913.h4ca7d5ee jupyterhub/mybinder.org-deploy#3675

Merged
