add request Origin to launch events #2053

Merged
minrk merged 2 commits into jupyterhub:main from minrk:track-origin
Mar 2, 2026
@minrk
Member

@minrk minrk commented Jan 14, 2026

This will allow us to identify thebe-like launches from interactive documentation sites, which will always have an Origin set. When Origin is not set (typical for same-site requests), the Sec-Fetch-Site header is used, which will generally have the string same-origin. "script" launches (e.g. via curl/requests/etc.) will generally have null values here, though scripts can't be rigorously identified, since they can spoof any browser behavior they want.

So this new field will have values like:

  • https://course.spacy.io (thebe-like launch)
  • same-origin (typical browser launches)
  • null (scripts that don't send browser info)
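The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper name `request_origin` and the assumption that header names arrive lowercased in a plain mapping are both hypothetical.

```python
from typing import Mapping, Optional


def request_origin(headers: Mapping[str, str]) -> Optional[str]:
    """Pick the value to record for a launch request.

    Hypothetical helper for illustration only; it assumes the header
    names have already been lowercased by the caller.
    """
    # Cross-origin browser requests (e.g. Thebe) carry an Origin header.
    origin = headers.get("origin")
    if origin:
        return origin
    # Same-site browser requests usually omit Origin but send Sec-Fetch-Site.
    # Scripts typically send neither, so this falls through to None (null).
    return headers.get("sec-fetch-site") or None
```

As noted above, a script can send any headers it likes, so the result only describes what the client claims about itself.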

This adds the Origin websites (just proto://host[:port], not pages) using mybinder.org via e.g. Thebe to the public analytics archive. I think this is okay and appropriate, given the public nature of mybinder.org, but worth bringing up. If we don't want that info in the events archive, we could just record the Sec-Fetch-Site header, which would identify all thebe launches as cross-site, but not identify the site itself. Then we'd generally only have 3 values:

  • cross-site (thebe launches)
  • same-origin (browser launches)
  • null (script launches)

(same-site is technically possible as well, but I don't think it is usefully distinguished from same-origin for our deployments).
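A rough sketch of that Sec-Fetch-Site-only alternative, under the same assumptions (lowercased header names in a plain mapping). The function name and the human-readable labels are made up for illustration; only the header values themselves come from the browser.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical labels for the values discussed above.
LABELS = {
    "cross-site": "thebe-like launch",
    "same-origin": "browser launch",
    "same-site": "browser launch",
}


def classify_launch(headers: Mapping[str, str]) -> Tuple[Optional[str], str]:
    """Return (recorded value, illustrative label) using only Sec-Fetch-Site."""
    site = headers.get("sec-fetch-site") or None
    if site is None:
        # No browser metadata at all: most likely curl/requests/etc.
        return None, "script launch"
    # Any other value (e.g. "none") is recorded as-is with a generic label.
    return site, LABELS.get(site, "other")
```

With this variant the embedding site is never recorded, only whether the request crossed a site boundary.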

Note that Origin is not Referrer, so we're not tracking anything about how people get to binder here. Just the sites that embed binder kernels, which I think is fair game.

@minrk minrk requested a review from choldgraf January 14, 2026 17:49
Member

@choldgraf choldgraf left a comment

This isn't a review of the implementation, but I think the functionality sounds reasonable to me. We are running a free, semi-anonymous public service. I think it's reasonable for us to ask for information about where requests are coming from. Especially given the more anonymous nature of script-based launches, and the potential for abuse, that feels like a useful signal to have.

@manics
Member

manics commented Jan 14, 2026

This could reveal internal private domains so a 4th option is to record this for internal analysis, but not make it public on https://archive.analytics.mybinder.org/

@rgaiacs what do you think?

@choldgraf
Member

choldgraf commented Jan 14, 2026

Personally, I think it's OK if it reveals private domains. We're running a massive, public, free service. I think that it's reasonable to ask that the things that use that service are also public. If somebody wanted to be able to run BinderHub functionality but without doing so in public, it feels like a good target for "run your own BinderHub instead of relying on mybinder.org"

The exception I could see here is if somebody would accidentally reveal some personally-identifiable information (e.g. if they were testing something out locally, we don't want to accidentally identify them in public data). If that's a risk here, I think it's reasonable to try and avoid that!

@manics
Member

manics commented Jan 14, 2026

If this had been the policy from the beginning I'd 100% support it. The problem is we're changing what information is made available and we don't have a way to communicate that to everyone.

I'm happy for this information to be public if @rgaiacs is since the Gesis node has to comply with Gesis policies.

@choldgraf
Member

For reference here is our user privacy policy

https://mybinder.readthedocs.io/en/latest/about/user-guidelines.html#how-we-ensure-user-privacy

I think it does imply people aren't identifiable in their session use. Would this PR violate that?

If so I suggest we

  • make the simplest possible fix to track thebe launches so that it doesn't violate it.
  • note that we are changing our policy in N months to start tracking the origin of a request to programmatically spawn binder sessions

If not, then I think we could merge this change without violating expectations we have intentionally set

@yuvipanda
Collaborator

fwiw I think this is fine to merge for our privacy policy. It's also mitigated by the fact that we are only tracking origins and not full URLs. To me this is in the same line as the fact that the repo you are using is also publicly visible here.

@minrk
Member Author

minrk commented Jan 15, 2026

I would be okay with just recording the value of Sec-Fetch-Site. This will separate thebe-like launches, regular browser visits, and typical (non-spoofy) script launches. We can add this request_origin later, if we need it.

If we do want to semi-privately track the Origin, we can use server-side events with plausible. Build URLs are not tracked at all in plausible (we get a plausible event for /v2/gh/..., but not /build/gh/...), so adding them would technically increase the accuracy of our plausible analytics, and we wouldn't double-count anything. It would mean plausible always tracks builds, not just page visits.

But (at least to some degree) our plausible analytics are public, too. I'm not sure how public events would be if we logged them to plausible in this way. Our public Plausible analytics already track Referer hosts: https://plausible.io/mybinder.org?period=7d , though not on an individual event level, only in aggregate. It will reveal private hosts if they link to mybinder.org.
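For reference, a server-side event along these lines could be built roughly as below. This is only a sketch under stated assumptions: it targets Plausible's events endpoint (`POST /api/event`), but the event name `build`, the `request_origin` property name, and the function itself are hypothetical, and the request is constructed without being sent.

```python
import json
import urllib.request
from typing import Optional

PLAUSIBLE_EVENT_URL = "https://plausible.io/api/event"


def build_plausible_event(
    build_url: str,
    request_origin: Optional[str],
    user_agent: str,
    client_ip: str,
) -> urllib.request.Request:
    """Construct (but do not send) a server-side Plausible event.

    Hypothetical sketch: the event and property names are assumptions.
    """
    payload = {
        "domain": "mybinder.org",
        "name": "build",  # assumed custom event name
        "url": build_url,
        # Custom properties must be scalars, so None becomes the string "null".
        "props": {"request_origin": request_origin or "null"},
    }
    return urllib.request.Request(
        PLAUSIBLE_EVENT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Forward the original client's details so the visit is attributed to it.
            "User-Agent": user_agent,
            "X-Forwarded-For": client_ip,
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the event; whether such properties would be visible on the public dashboard is exactly the open question above.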

@choldgraf
Member

I'm curious what @manics thinks - Simon I think you're the most concerned about this, and I want to make sure you're comfortable with whatever we decide on. What do you think?

Contributor

@rgaiacs rgaiacs left a comment

The code looks good to me.

@rgaiacs
Contributor

rgaiacs commented Jan 15, 2026

Regarding storing the URL or domain that invokes a new session on mybinder.org, my view is that it does not violate user privacy but does violate publisher privacy. I believe that publishers who are interested in their privacy / anonymity must run their own instance of BinderHub. Publishers who want to use mybinder.org must agree to have their contribution to the mybinder.org load publicly recorded, the same way that we record the Git repositories.

> This could reveal internal private domains

My morning brain cannot think of a fair case for using thebe + mybinder.org from an internal private domain.

> The exception I could see here is if somebody would accidentally reveal some personally-identifiable information (e.g. if they were testing something out locally, we don't want to accidentally identify them in public data). If that's a risk here, I think it's reasonable to try and avoid that!

I can imagine the scenario where someone is doing "local" development, using traefik, of a web site that uses thebe + mybinder.org. The "local" development is accessible at http://project-id.staff-id.company-name.com, and we will end up with a domain that reveals personally identifiable information. I see this as the price to pay for using community-led free infrastructure.
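To make the scenario concrete, here is a small sketch of what an origin keeps from a page URL. The helper `origin_of` is hypothetical (the browser computes the Origin header itself) and it ignores edge cases such as IPv6 hosts; it only shows that the path and query are dropped while the full hostname survives.

```python
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Reduce a page URL to proto://host[:port] (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Keep the port only when one is given explicitly in the URL.
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://{host}{port}"


# The page path and query string are gone, but every label of the
# hostname, including any staff or project identifiers, remains.
print(origin_of("http://project-id.staff-id.company-name.com/docs/page.html?user=1"))
```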

My recommendation is for us to hold this PR until the end of February. We write a blog post at https://blog.jupyter.org/ informing of the upcoming changes, add a banner to mybinder.org and we mention this change in the upcoming roadmap workshop. On 1st March, we merge this.

@manics
Member

manics commented Jan 15, 2026

I'm happy for this to go in, but thought it was important to be transparent about the risks.

@choldgraf
Member

This makes me realize that we don't have a blog to communicate things like this, other than the jupyter blog which feels like a bit much for a user comms post...

Hmmm, how could we communicate this change and what it might mean for a user that was doing private launches in this way? Make a Discourse post?

Either way, I don't think we should block merging this on that!

@manics
Member

manics commented Jan 15, 2026

I think the Jupyter Blog is OK as long as it's pitched in the right way, e.g. describe the principles and justifications we've used to make this decision, and that we will follow in future when deciding what other usage information to make available.

@minrk
Member Author

minrk commented Jan 15, 2026

I think the Jupyter blog is a fine place for it

@minrk
Member Author

minrk commented Jan 15, 2026

btw, I re-read our privacy policy, and it doesn't seem to mention the events archive. It probably should.

@minrk
Member Author

minrk commented Jan 15, 2026

@minrk
Member Author

minrk commented Jan 15, 2026

a couple updates to our user doc while re-reading it: https://github.com/jupyterhub/mybinder.org-user-guide/pulls

@manics
Member

manics commented Jan 16, 2026

The draft announcement looks good to me!

@choldgraf
Member

I left a few brief notes in the blog post, I think it's good to go either way IMO. Thanks so much for writing that up!

@minrk
Member Author

minrk commented Jan 24, 2026

submitted to Jupyter blog, not sure who approves them at this point.

@manics
Member

manics commented Jan 28, 2026

https://jupyter.org/media_submissions#response-time-volunteer-working-group says blog submissions are reviewed weekly.

@manics
Member

manics commented Feb 2, 2026

It's live https://blog.jupyter.org/mybinder-org-adding-request-origin-to-events-archive-92fe0f954eab

@yuvipanda
Collaborator

End of February is here!

@minrk minrk merged commit 9dd0c2c into jupyterhub:main Mar 2, 2026
20 of 21 checks passed
@minrk minrk deleted the track-origin branch March 2, 2026 22:40
