fix(relay): Create alternate bearer token auth for FxA (MPP-3505) #6049

jwhitlock · 2025-11-14T22:05:58Z

This is a recreation of PR #5272. It includes the removal of debug logging by PR #5883. I don't quite understand that code, and I'm also far enough away from this code that I don't fully understand it anymore. If this proves to be unreviewable, I should start over.

The original PR statement:

This PR adds a new implementation of Bearer Token Authentication, for creating a new user or authorizing with a Mozilla account user logged into Firefox. By default, the existing authentication implementation is used. Setting the environment variable FXA_TOKEN_AUTH_VERSION=2025 picks the new implementation.
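As a rough illustration of the environment switch described above, the selection could look like the sketch below. The helper name `use_new_fxa_auth` and the direct read of `os.environ` are assumptions made for this example; the PR itself may read the setting differently.

```python
import os
from collections.abc import Mapping
from typing import Optional


def use_new_fxa_auth(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when FXA_TOKEN_AUTH_VERSION is exactly "2025".

    Hypothetical helper: any other value, or no value at all, keeps the
    existing authentication implementation.
    """
    env = os.environ if environ is None else environ
    return env.get("FXA_TOKEN_AUTH_VERSION") == "2025"
```

Passing a plain dict instead of the real environment makes the switch easy to exercise in tests.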

While working on MPP-3505 (An IntegrityError while setting up a Relay account for a Mozilla account user through Firefox), I found one important bug in the existing implementation. The cache for the Accounts introspection response uses hash(token) as a cache key. This gives a consistent number for a given Python instance, but a different number on a different Python instance, like a different pod or possibly a different gunicorn thread on the same pod. This means these responses were almost never cached.
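For illustration, a cache key that stays the same across Python instances can be derived from a cryptographic digest instead of the built-in hash(), whose string hashes are salted per interpreter process unless PYTHONHASHSEED is fixed. This is a minimal sketch: the function name and the `fxa-introspect:` prefix are assumptions, not the names used in the PR.

```python
import hashlib


def stable_cache_key(token: str) -> str:
    """Build a cache key that is identical in every process and pod.

    Hypothetical helper: the prefix is an arbitrary choice for this example.
    SHA-256 also keeps the raw token out of the cache key itself.
    """
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return f"fxa-introspect:{digest}"
```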

I'm reluctant to fix just this issue, because we don't know what will happen if we start using the cache reliably.

The new implementation uses a cache key that will be consistent across pods. It also has some potential improvements:

Accounts introspection API responses or errors are cached. In the existing implementation, 'write' requests like a POST call the API and do not cache the results.
/api/v1/terms-accepted-user more consistently returns a 401 or 403 response. The Firefox integration is looking for a 401 response or a 403 response. The existing implementation returns 404 in some instances.
The new code returns 502 Service Unavailable when the upstream Accounts API is unavailable. This could be tuned if the Firefox integration handles this poorly.
The new code tracks the time it takes to call the Accounts introspection API and profile API. Tracking these timers will allow us to determine when Accounts is having an issue that affects Relay.
The new FxaTokenAuthentication is based on Django REST Framework's TokenAuthentication, uses permissions classes to add permission checks beyond the token check, and returns token details in request.auth. This makes the authentication look more like other DRF authentication, and allows better code sharing between /api/v1/terms-accepted-user and other endpoints like /api/v1/relayaddresses used by the Firefox integration.
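The timing of upstream calls mentioned in the list above could be captured with a small context manager. The sketch below is only an assumption about shape: `record_duration` and its `emit` callback are hypothetical names, and the real code presumably reports to Relay's metrics client rather than to an arbitrary callable.

```python
import time
from collections.abc import Callable, Iterator
from contextlib import contextmanager


@contextmanager
def record_duration(name: str, emit: Callable[[str, float], None]) -> Iterator[None]:
    """Time the wrapped block and pass the elapsed milliseconds to `emit`.

    The duration is reported even when the block raises, so timeouts and
    other upstream failures are measured too.
    """
    start = time.monotonic()
    try:
        yield
    finally:
        emit(name, (time.monotonic() - start) * 1000.0)
```

A caller would wrap the introspection or profile request in `with record_duration("fxa.introspect", emit):`, where the metric name is again just an example.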

jwhitlock · 2025-11-17T17:06:10Z

@groovecoder After a weekend, I'm still not sure if this is reviewable or if I should start over. The main benefit is the new code is behind an environment flag, so in theory it could be tested on dev and stage before going to prod.

I'm willing to start over with smaller PRs. I think the first would be to remove or fix token introspection caching. If we go that route, it would probably take several weeks to get each change on stage, but will be reviewable bits.

groovecoder

I didn't look at the test code yet, but I want to give these first review comments ASAP to un-block you.

overall question (non-blocking): (If this is already addressed elsewhere and I just haven't gotten to that part of the review, let me know and I'll just keep reading the rest of the PR.) The 2025 code adds quite a bit more type checking and validation, which is overall good. But I worry it might be too brittle and crash out when there are FXA API bugs or changes. (Obviously it's ideal for the API contract between Relay and FXA to be consistent and strict, but we also know that bugs and changes will happen.) How much more time do you think it would take to make an extra pass thru the code looking for places where the type checking and validation of FXA-sourced data could cause the Relay code to break? Is it worth taking that extra time?

groovecoder · 2025-11-17T17:47:22Z

api/authentication.py

+class FxaIntrospectData(TypedDict, total=False):
+    """Keys seen in the JSON returned from a Mozilla Accounts introspection request"""
+
+    active: bool
+    sub: str
+    exp: int
+    error: str


question (blocking): Will this handle a case where Mozilla Accounts introspection request returns JSON that doesn't match this structure? (We were recently stung by a situation where Twilio changed an API call without notice.)

jwhitlock · 2025-11-17

In introspect_token later in this file, there is code to check the structure of the returned data. Current version is at:

fx-private-relay/api/authentication.py

Line 434 in a5dc5cb

def introspect_token(token: str) -> IntrospectionResponse | IntrospectionError:

The checks include:

request timed out

other error making request

response was not JSON

response was JSON but not an object

response had a 401 Unauthorized status code

response had a different status code that wasn't 200 OK

active is not a bool or was False

sub is not a string or is an empty string

It then passes it to IntrospectionResponse, which has further checks that raise ValueError. The current version is at:

fx-private-relay/api/authentication.py

Line 103 in a5dc5cb

class IntrospectionResponse:

The checks include:

active is not a bool or was False (repeats)

sub is not a string or is an empty string (repeats)

exp is present but not integer

Looking at that code, there may be issues if FxA changes exp to a float or a string representation of a float. An unhandled ValueError would be raised. I could add code to handle that case.

On slack, you clarified that if the API changes:

Hrm, I prefer the opposite - let people log in if possible. And log a “handled” exception into Sentry to review.

In that case, I think only these would block authorization:

401 Unauthorized

active is exactly False

sub is not included, so we don't have an FxA ID

That will require some changes.
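A minimal sketch of the "let people log in if possible" rule above, under stated assumptions: the function name `may_authenticate` is hypothetical, a `sub` that is empty or not a string is treated the same as a missing one, and plain logging stands in for whatever handled-exception reporting to Sentry would be used.

```python
import logging
from typing import Any

logger = logging.getLogger("fxa_introspect_sketch")


def may_authenticate(status_code: int, data: dict[str, Any]) -> bool:
    """Block only on the three conditions listed above; log everything else."""
    if status_code == 401:
        return False  # Accounts rejected the token outright
    if data.get("active") is False:
        return False  # exactly False, not merely missing or malformed
    sub = data.get("sub")
    if not isinstance(sub, str) or not sub:
        return False  # no usable FxA ID (assumption: empty counts as missing)

    # Everything below is unexpected but non-blocking: record it and continue.
    if status_code != 200:
        logger.warning("unexpected introspect status code: %s", status_code)
    if not isinstance(data.get("active"), bool):
        logger.warning("introspect 'active' is missing or not a bool")
    if "exp" in data and not isinstance(data["exp"], int):
        logger.warning("introspect 'exp' is not an integer")
    return True
```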

groovecoder · 2025-12-09T21:09:31Z

question (blocking): So if the most likely failure state seems to be FxA changes exp to a string, then an unhandled ValueError is raised. Is it as simple as updating the type declaration to exp: int | str | float for that case?

And then if I remember python type enforcement - these types are enforced in our code by mypy, but not at runtime? So if FxA changes other values in a way that doesn't match this type declaration, it won't break the code at runtime?

jwhitlock · 2025-12-15

I thought I understood these requested changes, but I'm more confused now than I was. Maybe we should talk about this over video or in person?

I think it is really unlikely that the FxA response to /v1/introspect will change, even from an int to a float. Why would it change to a float? Is it now in seconds instead of microseconds? I have no idea how to handle an expiration that changes to "it's fine bro". I think it would make the code worse to do anything but fail authentication if the authentication token changes formats. If you think failing authentication means breaking the code, then we're far from agreement.

Yes, static type enforcement does not check at runtime. That's why introspect_token carefully examines the response from FxA before returning an FxaIntrospectData. mypy gives some assurances we correctly went from an unknown response from FxA to this firmer type definition. It offers a sharper line between "I think I know what this is" and "I'm sure what this is, for the parts that matter to me".
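To make the static-versus-runtime point concrete, the sketch below shows that a TypedDict accepts a wrongly typed value at runtime without complaint, and that an explicit check is what actually draws the line. The `narrow_introspect_data` helper is hypothetical and is not the PR's introspect_token.

```python
from typing import Any, TypedDict


class FxaIntrospectData(TypedDict, total=False):
    active: bool
    sub: str
    exp: int
    error: str


# TypedDict does no runtime validation: this builds a plain dict and does
# not raise, even though mypy would flag the str value for `exp`.
unchecked = FxaIntrospectData(exp="soon")  # type: ignore[typeddict-item]


def narrow_introspect_data(raw: Any) -> FxaIntrospectData:
    """Copy only the keys whose runtime types match the declaration."""
    if not isinstance(raw, dict):
        raise ValueError("introspection response is not a JSON object")
    data: FxaIntrospectData = {}
    if isinstance(raw.get("active"), bool):
        data["active"] = raw["active"]
    if isinstance(raw.get("sub"), str):
        data["sub"] = raw["sub"]
    exp = raw.get("exp")
    if isinstance(exp, int) and not isinstance(exp, bool):
        data["exp"] = exp
    if isinstance(raw.get("error"), str):
        data["error"] = raw["error"]
    return data
```

Widening the annotation to `exp: int | str | float` would change what mypy accepts, but by itself it would not change what happens at runtime.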

jwhitlock · 2025-11-18T23:07:08Z

@groovecoder I've gotten through the feedback so far, ready for the next round:

Wrapping JSON data as base64-encoded to try to make CodeQL happy
Change __repr__ methods to list all the __init__ arguments even if default
Use shlex.quote(str(thing_that_should_already_be_a_string))
If FxA token introspection request is not 200, log but keep trying
If FxA token introspection omits the expiration parameter or changes the format, replace with 60 seconds in the future
Allow expiration to be as much as 60 seconds in the past
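The last two items in the list above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the names are hypothetical, the constants mirror the 60-second values mentioned in the list, and every timestamp is treated as seconds for simplicity, which may not match the unit FxA actually uses.

```python
import logging
import time
from typing import Any, Optional

logger = logging.getLogger("fxa_expiration_sketch")

DEFAULT_LIFETIME = 60  # seconds granted when exp is missing or unusable
GRACE_PERIOD = 60  # seconds a token may be past exp and still be accepted


def effective_exp(raw_exp: Any, now: float) -> float:
    """Use the reported expiration if it is an int, else now + default."""
    if isinstance(raw_exp, bool) or not isinstance(raw_exp, int):
        logger.warning("unusable exp %r, using default lifetime", raw_exp)
        return now + DEFAULT_LIFETIME
    return float(raw_exp)


def is_expired(raw_exp: Any, now: Optional[float] = None) -> bool:
    """Expired only when exp is more than GRACE_PERIOD seconds in the past."""
    current = time.time() if now is None else now
    return effective_exp(raw_exp, current) + GRACE_PERIOD < current
```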

Lots of changes that could have been in multiple commits.

In authentication tests:
* Split AuthenticationMiscellaneous TestCase into IntrospectTokenTests and GetFxaUidFromOauthTokenTests, remove name prefixes, simplify setup.
* Convert _setup_fxa_response to setup_fxa_introspect. It now constructs the payload as well as mocking the response, and returns the mocked response and expected cached data.
* Use self.assertRaisesMessage for consistent exception checking.
* Assert on the mocked response call_count, not the URL-matched count.

In terms_accepted_user tests:
* Add _mock_fxa_profile_response
* Use new setup_fxa_introspect
* Use _setup_client everywhere
* Add mocked response checks
* Add cache value checks, rename incorrect test titles

Reimplement FxaTokenAuthentication on TokenAuthentication, to get the DRF-provided parsing of token authentication headers. This changes the status code for a header of 'Authorization: Bearer ' (token value omitted) from a 400 (Bad Request) to a 401 (Unauthorized).
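The header handling can be pictured with the standalone sketch below. It loosely follows the split-and-count approach of DRF's TokenAuthentication, but it is not DRF code: `AuthenticationFailed` here is a local stand-in class, and `parse_bearer_header` is a hypothetical helper.

```python
from typing import Optional


class AuthenticationFailed(Exception):
    """Stand-in for an error that the view layer would turn into a 401."""

    status_code = 401


def parse_bearer_header(header: str, keyword: str = "Bearer") -> Optional[str]:
    """Return the token, None for other schemes, or raise for a bad header."""
    parts = header.split()
    if not parts or parts[0].lower() != keyword.lower():
        return None  # not a Bearer header, so other authenticators may try
    if len(parts) == 1:
        raise AuthenticationFailed("Invalid token header. No credentials provided.")
    if len(parts) > 2:
        raise AuthenticationFailed(
            "Invalid token header. Token string should not contain spaces."
        )
    return parts[1]
```

With this shape, a header of `Bearer ` with no token raises the 401-style error instead of being treated as a malformed request.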

Bump the year numbers 2024->2025, 2025->2026, to reflect that the existing implementation changed in 2025 and the new implementation probably won't be tested until 2026. Add the existing implementations as new files, matching the test filenames:
* api/authentication_2025.py
* api/views/terms_accepted_user_2025.py (extracted from ./privaterelay.py)

jwhitlock · 2025-11-26T19:16:00Z

I've rebased and adapted to bring in the changes from PR #6049. Changes include:

Year bump. FXA_TOKEN_AUTH_OLD_AND_PROVEN is 2025, and FXA_TOKEN_AUTH_NEW_AND_BUSTED is 2026. That will teach me to use years as identifiers.
On the new code side, a new MissingScope error when the relay scope is not present
On the existing code side, I added two files api/authentication_2025.py and api/views/terms_accepted_user_2025.py to hold the existing implementations. This makes it a little easier to keep these in sync with changes to main.
Some changes to tests to make them clearer or to make rebasing easier.

groovecoder

I went thru this again. I went thru more quickly this time - assuming that weeks-ago @groovecoder was relatively sane at the time of first review.

Just 1 last blocking question about whether FxA response changes will break the Relay app at runtime. If not, I'm good to merge this and use the new environment variable to switch to the new auth mechanism on the dev and/or stage server for first testing.

groovecoder · 2025-12-09T21:03:02Z

api/views/terms_accepted_user_2025.py

question (non-blocking): I assume this code is the unmodified current code?

jwhitlock · 2026-01-07T16:24:16Z

It seems we're stuck again on this change, and should start over with smaller changes. The key change addressing MPP-3505 (IntegrityError: duplicate key value violates unique constraint) appears to be checking if another process created the user and swallowing the error. That could be a code change of 10 or fewer lines, plus tests.

Do you agree @groovecoder, or want to continue with this PR?
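The smaller change described above, tolerating a row that another process created first, can be sketched with the standard library's sqlite3 as a stand-in for the Django ORM. The table, column, and function names are invented for the example; the real fix would act on the Django User and SocialAccount models.

```python
import sqlite3


def create_or_fetch_account(conn: sqlite3.Connection, uid: str) -> tuple[int, bool]:
    """Insert an account row, or return the existing one if the insert loses.

    Returns (row id, created). An IntegrityError from the unique constraint
    is taken to mean that a parallel request created the row first.
    """
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute("INSERT INTO account (uid) VALUES (?)", (uid,))
        return cursor.lastrowid, True
    except sqlite3.IntegrityError:
        row = conn.execute("SELECT id FROM account WHERE uid = ?", (uid,)).fetchone()
        if row is None:
            raise  # a different integrity problem, so do not swallow it
        return row[0], False
```

The second caller gets the existing row instead of an unhandled IntegrityError, which is the behavior the comment describes.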

jwhitlock requested a review from groovecoder November 14, 2025 22:06

github-advanced-security bot found potential problems Nov 14, 2025

api/authentication.py (fixed)

jwhitlock assigned groovecoder and jwhitlock Nov 17, 2025

groovecoder reviewed Nov 17, 2025

github-advanced-security bot found potential problems Nov 18, 2025

api/authentication.py (fixed)

jwhitlock requested a review from groovecoder November 18, 2025 23:07

jwhitlock marked this pull request as draft November 24, 2025 18:38

jwhitlock force-pushed the MPP-3505/account-bearer-token-auth branch 3 times, most recently from f8f8140 to e7c27d4 Compare November 26, 2025 18:41

github-advanced-security bot found potential problems Nov 26, 2025

api/authentication_2025.py (dismissed)

jwhitlock added 15 commits November 26, 2025 13:05

Reproduce IntegrityError on terms_accepted_user

e8d18d4

Reproduce the IntegrityError by creating a matching user and SocialAccount after checking for a matching user by email. This may not be the exact mechanism in production, but it does produce the same traceback.

Extract to _create_socialaccount_from_bearer_token

1b24762

Create the SocialAccount in a new function outside of a try block, to avoid nested exceptions.

Extract _get_fxa_profile_from_bearer_token

3fcf8e8

Look for matching SA after FxA profile fetch

7c72c1e

Because the Mozilla Accounts profile fetch takes a while, this is the likely time for a parallel SocialAccount to be created. Check again before proceeding to create a new one.

Skip coverage for belt-and-suspender code

5d6194d

Skip auth check before logging out user

2eaecf1

The Django logout() command, since at least 1.10, also checks if the user was logged in or not, so our check is redundant (and has missing branch coverage)

Add request timeout tests

c7f1711

Handle profile timeout

4db8fa4

Re-raise introspect timeout

6d2d63f

Fix cache key function

9f68e9b

The results of hash(str) changes between Python instances, so the previous version would lead to many cache misses.

Change setup_fxa_introspect to return FxA data

ec6d537

Return the introspection results instead of the expected cache contents. This may make the cache value change more obvious.

Change cached key from "json" to "data"

3192e43

Move types, data is always a dict

e08a8b1

jwhitlock and others added 24 commits November 26, 2025 13:05

Emit timing metric for introspect success

3983cdd

Update docstring

e5b2e11

Update docstring

66fdab9

Rearrange _get_fxa_profile_from_bearer_token

03e36f1

Change _get_fxa_profile_from_bearer_token return

4fce545

Emit timing metric for profile fetch

f441c79

Add IntrospectionResponse.is_expired

1bd6146

Check token expiration

b191ebc

Add env FXA_TOKEN_AUTH_VERSION

a07e19a

Setting FXA_TOKEN_AUTH_VERSION to 2024 (the default) will use the existing authentication for FxA bearer tokens. Setting it to 2025 will use the new authentication method. When the new authentication is proven, the 2024 version can be deleted.

Add return code for TokenExpired

f6099b6

Be less specific about message

cfdbeba

Fix spelling

fec4a1d

Co-authored-by: luke crouch <[email protected]>

Include all params in repr, even if default

da6f636

Ensure shlex.quote only gets strings

79aecc8

Co-authored-by: luke crouch <[email protected]>

Encode FxA data as base64

624f00f

Encode FxA non-JSON data as base64

ef5bfe7

Encode FxA JSON non-dict data as base64

c6d496f

Move FxA token grace period to settings

12b9be5

If FxA omits or changes exp, log and use default

3bad23d

Attempt to continue on non-200 from introspect

671c4db

Update TermsAcceptedUserViewTest for new errors

3b8ec41

Update existing code (2024 version) from main

c52d32a

Rename to x_2025

a566b7a

jwhitlock force-pushed the MPP-3505/account-bearer-token-auth branch from e7c27d4 to ce6ae11 Compare November 26, 2025 19:05

jwhitlock requested a review from joeherm November 26, 2025 19:23

jwhitlock marked this pull request as ready for review November 26, 2025 19:23

groovecoder requested changes Dec 9, 2025
