@jwhitlock
Member

This is a recreation of PR #5272. It includes the removal of debug logging by PR #5883. I don't quite understand that code, and I am now far enough away from this code that I no longer fully understand it either. If this proves to be unreviewable, I should start over.

The original PR statement:

This PR adds a new implementation of Bearer Token Authentication, for creating a new user or authorizing with a Mozilla account user logged into Firefox. By default, the existing authentication implementation is used. Setting the environment variable FXA_TOKEN_AUTH_VERSION=2025 picks the new implementation.

While working on MPP-3505 (An IntegrityError while setting up a Relay account for a Mozilla account user through Firefox), I found one important bug in the existing implementation. The cache for the Accounts introspection response uses hash(token) as a cache key. This gives a consistent number for a given Python instance, but a different number on a different Python instance, like a different pod or possibly a different gunicorn worker on the same pod. This means these responses were almost never cached.
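
To illustrate the difference, here is a minimal sketch. The helper name `introspect_cache_key` and the `fxa-introspect:` prefix are hypothetical and not taken from the PR. Python salts `hash()` of a string per interpreter process unless `PYTHONHASHSEED` is fixed, while a SHA-256 digest of the same token is identical everywhere.

```python
import hashlib


def introspect_cache_key(token: str) -> str:
    """Return a cache key that is identical in every Python process.

    The function name and the "fxa-introspect:" prefix are hypothetical.
    """
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return f"fxa-introspect:{digest}"


# Built-in hash() of a str is salted per process, so this key can differ
# between pods. It is shown only for contrast.
unstable_key = f"fxa-introspect:{hash('example-token')}"

# The digest-based key depends only on the token bytes.
stable_key = introspect_cache_key("example-token")
```

Hashing the token also keeps the raw bearer token out of the cache key itself.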

I'm reluctant to fix just this issue, because we don't know what will happen if we start using the cache reliably.

The new implementation uses a cache key that will be consistent across pods. It also has some potential improvements:

  • Accounts introspection API responses and errors are cached. In the existing implementation, 'write' requests like a POST call the API and do not cache the results.
  • /api/v1/terms-accepted-user more consistently returns a 401 or 403 response. The Firefox integration is looking for a 401 response or a 403 response. The existing implementation returns 404 in some instances.
  • The new code returns 502 Service Unavailable when the upstream Accounts API is unavailable. This could be tuned if the Firefox integration handles this poorly.
  • The new code tracks the time it takes to call the Accounts introspection API and profile API. Tracking these timers will allow us to determine when Accounts is having an issue that affects Relay.
  • The new FxaTokenAuthentication is based on Django REST Framework's TokenAuthentication, uses permissions classes to add permission checks beyond the token check, and returns token details in request.auth. This makes the authentication look more like other DRF authentication, and allows better code sharing between /api/v1/terms-accepted-user and other endpoints like /api/v1/relayaddresses used by the Firefox integration.
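
As a rough sketch of the timing idea in the list above, assuming a hypothetical `record_duration` helper and an `emit` callback that stands in for whatever metrics client is used (neither comes from the PR):

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def record_duration(name: str, emit: Callable[[str, float], None]) -> Iterator[None]:
    """Report how long the wrapped block took, in milliseconds.

    `emit` is a stand-in for a metrics client. The duration is reported
    even if the block raises, so slow failures are visible too.
    """
    start = time.monotonic()
    try:
        yield
    finally:
        emit(name, (time.monotonic() - start) * 1000.0)


timings: list[tuple[str, float]] = []
with record_duration("fxa.introspect", lambda n, ms: timings.append((n, ms))):
    time.sleep(0.01)  # placeholder for the HTTP request to Accounts
```

Using `time.monotonic()` avoids negative durations if the system clock is adjusted during the call.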

@jwhitlock
Member Author

@groovecoder After a weekend, I'm still not sure if this is reviewable or if I should start over. The main benefit is the new code is behind an environment flag, so in theory it could be tested on dev and stage before going to prod.

I'm willing to start over with smaller PRs. I think the first would be to remove or fix token introspection caching. If we go that route, it would probably take several weeks to get each change on stage, but the pieces will be reviewable.

Member

@groovecoder left a comment

I didn't look at the test code yet, but I want to give these first review comments ASAP to un-block you.

overall question (non-blocking): (If this is already addressed elsewhere and I just haven't gotten to that part of the review, let me know and I'll just keep reading the rest of the PR.) The 2025 code adds quite a bit more type checking and validation, which is overall good. But I worry it might be too brittle and crash out when there are FXA API bugs or changes. (Obviously it's ideal for the API contract between Relay and FXA to be consistent and strict, but we also know that bugs and changes will happen.) How much more time do you think it would take to make an extra pass thru the code looking for places where the type checking and validation of FXA-sourced data could cause the Relay code to break? Is it worth taking that extra time?

Comment on lines 70 to 71
class FxaIntrospectData(TypedDict, total=False):
"""Keys seen in the JSON returned from a Mozilla Accounts introspection request"""

active: bool
sub: str
exp: int
error: str
Member

question (blocking): Will this handle a case where Mozilla Accounts introspection request returns JSON that doesn't match this structure? (We were recently stung by a situation where Twilio changed an API call without notice.)

Member Author

In introspect_token later in this file, there is code to check the structure of the returned data. Current version is at:

def introspect_token(token: str) -> IntrospectionResponse | IntrospectionError:

The checks include:

  • request timed out
  • other error making request
  • response was not JSON
  • response was JSON but not an object
  • response had a 401 Unauthorized status code
  • response had a different status code that wasn't 200 OK
  • active is not a bool or was False
  • sub is not a string or is an empty string

It then passes it to IntrospectionResponse, which has further checks that raise ValueError. The current version is at:

class IntrospectionResponse:

The checks include:

  • active is not a bool or was False (repeats)
  • sub is not a string or is an empty string (repeats)
  • exp is present but not integer

Looking at that code, there may be issues if FxA changes exp to a float or a string representation of a float. An unhandled ValueError would be raised. I could add code to handle that case.
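
For readers without the file open, the field checks listed above could be sketched roughly like this. The function name `check_introspection_payload` is hypothetical and this is not the PR's code, only an illustration of the same kinds of checks on already-decoded JSON:

```python
from typing import Any


def check_introspection_payload(data: Any) -> dict[str, Any]:
    """Validate decoded introspection JSON, raising ValueError on a mismatch."""
    if not isinstance(data, dict):
        raise ValueError("response JSON is not an object")
    active = data.get("active")
    if not isinstance(active, bool) or not active:
        raise ValueError("active is not True")
    sub = data.get("sub")
    if not isinstance(sub, str) or not sub:
        raise ValueError("sub is not a non-empty string")
    if "exp" in data:
        exp = data["exp"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(exp, bool) or not isinstance(exp, int):
            raise ValueError("exp is present but not an integer")
    return data
```

With checks this strict, a float `exp` such as `1.5` raises ValueError, which is the case discussed above.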

On Slack, you clarified that if the API changes:

Hrm, I prefer the opposite - let people log in if possible. And log a “handled” exception into Sentry to review.

In that case, I think only these would block authorization:

  • 401 Unauthorized
  • active is exactly False
  • sub is not included, so we don't have an FxA ID

That will require some changes.
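
A minimal sketch of that more lenient policy, assuming a hypothetical `blocking_reason` helper (not part of the PR). Anything that does not match one of the three blocking cases would be allowed through, with the oddity reported separately, for example as a handled exception in Sentry:

```python
from typing import Any, Optional


def blocking_reason(status_code: int, data: dict[str, Any]) -> Optional[str]:
    """Return why authentication must fail, or None if it may proceed.

    Only the three cases from the discussion block the request. An empty
    or null sub is treated like a missing one, which is an assumption.
    """
    if status_code == 401:
        return "401 from introspection"
    if data.get("active") is False:
        return "token is inactive"
    if not data.get("sub"):
        return "no sub, so no FxA ID"
    return None
```

Note that `data.get("active") is False` only matches the exact boolean, so a missing or oddly typed `active` does not block on its own.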

Member

question (blocking): So the most likely failure state seems to be that FxA changes exp to a string, and then an unhandled ValueError is raised. Is it as simple as updating the type declaration to exp: int | str | float for that case?

And then if I remember python type enforcement - these types are enforced in our code by mypy, but not at runtime? So if FxA changes other values in a way that doesn't match this type declaration, it won't break the code at runtime?

Member Author

I thought I understood these requested changes, but I'm more confused now than I was. Maybe we should talk about this over video or in person?

I think it is really unlikely that the FxA response to /v1/introspect will change, even from an int to a float. Why would it change to a float? Is it now in seconds instead of microseconds? I have no idea how to handle an expiration that changes to "it's fine bro". I think it would make the code worse to do anything but fail authentication if the authentication token changes formats. If you think failing authentication means breaking the code, then we're far from agreement.

Yes, static type enforcement does not check at runtime. That's why introspect_token carefully examines the response from FxA before returning an FxaIntrospectData. mypy gives some assurances we correctly went from an unknown response from FxA to this firmer type definition. It offers a sharper line between "I think I know what this is" and "I'm sure what this is, for the parts that matter to me".
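
A small sketch of that point, using a hypothetical `ExampleIntrospectData` type and `narrow` helper rather than the PR's own names: a TypedDict is a plain dict at runtime, so wrong value types are stored without complaint unless the code checks them itself.

```python
from typing import Any, TypedDict, cast


class ExampleIntrospectData(TypedDict, total=False):
    active: bool
    sub: str
    exp: int


# cast() only informs the type checker. At runtime this is an ordinary
# dict, and the mismatched value types are stored without any error.
untrusted = cast(ExampleIntrospectData, {"active": "yes", "exp": "soon"})


def narrow(raw: Any) -> ExampleIntrospectData:
    """Copy only the keys whose runtime types match the declaration."""
    checked: ExampleIntrospectData = {}
    if isinstance(raw, dict):
        if isinstance(raw.get("active"), bool):
            checked["active"] = raw["active"]
        if isinstance(raw.get("sub"), str):
            checked["sub"] = raw["sub"]
        exp = raw.get("exp")
        if isinstance(exp, int) and not isinstance(exp, bool):
            checked["exp"] = exp
    return checked
```

The runtime checks in `narrow` are what make the static type trustworthy afterwards. The declaration alone does not.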

@jwhitlock
Member Author

@groovecoder I've gotten through the feedback so far, ready for the next round:

  • Wrapping JSON data as base64-encoded to try to make CodeQL happy
  • Change __repr__ methods to list all the __init__ arguments even if default
  • Use shlex.quote(str(thing_that_should_already_be_a_string))
  • If FxA token introspection request is not 200, log but keep trying
  • If FxA token introspection omits the expiration parameter or changes the format, replace with 60 seconds in the future
  • Allow expiration to be as much as 60 seconds in the past
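
The last two items could look roughly like the sketch below. The helper names and the `GRACE` constant are hypothetical, and the sketch assumes `exp` and `now` are in the same unit (seconds here), which may not match what Accounts actually returns:

```python
from typing import Any

GRACE = 60  # assumption: same unit as exp and now (seconds in this sketch)


def effective_expiration(raw_exp: Any, now: int) -> int:
    """Return raw_exp if it is an integer, else a fallback 60 seconds ahead."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(raw_exp, int) and not isinstance(raw_exp, bool):
        return raw_exp
    return now + GRACE


def is_expired(exp: int, now: int) -> bool:
    """Treat a token as expired only once exp is more than 60 seconds past."""
    return exp < now - GRACE
```

The fallback keeps a token with a missing or reformatted expiration usable for a short window instead of failing authentication outright.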

@jwhitlock jwhitlock marked this pull request as draft November 24, 2025 18:38
@jwhitlock jwhitlock force-pushed the MPP-3505/account-bearer-token-auth branch 3 times, most recently from f8f8140 to e7c27d4 Compare November 26, 2025 18:41
Reproduce the IntegrityError by creating a matching user and
SocialAccount after checking for a matching user by email. This may not
be the exact mechanism in production, but it does produce the same
traceback.

Lots of changes that could have been in multiple commits.

In authentication tests:

* Split AuthenticationMiscellaneous TestCase into IntrospectTokenTests
  and GetFxaUidFromOauthTokenTests, remove name prefixes, simplify
  setup.
* Convert _setup_fxa_response to setup_fxa_introspect. It now constructs
  the payload as well as mocking the response, and returns the mocked
  response and expected cached data.
* Use self.assertRaisesMessage for consistent exception checking.
* Assert on the mocked response call_count, not the URL-matched count.

In terms_accepted_user tests:

* Add _mock_fxa_profile_response
* Use new setup_fxa_introspect
* Use _setup_client everywhere
* Add mocked response checks
* Add cache value checks, rename incorrect test titles

Create the SocialAccount in a new function outside of a try block, to
avoid nested exceptions.

Because the Mozilla Accounts profile fetch takes a while, this is the
likely time for a parallel SocialAccount to be created. Check again
before proceeding to create a new one.

The Django logout() function, since at least 1.10, also checks if the
user was logged in or not, so our check is redundant (and has missing
branch coverage).

Reimplement FxaTokenAuthentication on TokenAuthentication, to get the
DRF-provided parsing of token authentication headers. This changes the
status code for a header of 'Authorization: Bearer ' (token value
omitted) from a 400 (Bad Request) to a 401 (Unauthorized).

The result of hash(str) changes between Python instances, so the
previous version would lead to many cache misses.

Return the introspection results instead of the expected cache contents.
This may make the cache value change more obvious.
jwhitlock and others added 24 commits November 26, 2025 13:05
Setting FXA_TOKEN_AUTH_VERSION to 2024 (the default) will use the
existing authentication for FxA bearer tokens. Setting it to 2025 will
use the new authentication method. When the new authentication is
proven, the 2024 version can be deleted.
Co-authored-by: luke crouch <[email protected]>

Bump the year numbers 2024->2025, 2025->2026, to reflect that the
existing implementation changed in 2025 and the new implementation
probably won't be tested until 2026.

Add the existing implementations as new files, matching the test
filenames:

* api/authentication_2025.py
* api/views/terms_accepted_user_2025.py (extracted from ./privaterelay.py)
@jwhitlock jwhitlock force-pushed the MPP-3505/account-bearer-token-auth branch from e7c27d4 to ce6ae11 Compare November 26, 2025 19:05
@jwhitlock
Member Author

I've rebased and adapted to bring in the changes from PR #6049. Changes include:

  • Year bump. FXA_TOKEN_AUTH_OLD_AND_PROVEN is 2025, and FXA_TOKEN_AUTH_NEW_AND_BUSTED is 2026. That will teach me to use years as identifiers.
  • On the new code side, a new MissingScope error when the relay scope is not present
  • On the existing code side, I added two files, api/authentication_2025.py and api/views/terms_accepted_user_2025.py, to hold the existing implementations. This makes it a little easier to keep these in sync with changes to main.
  • Some changes to tests to make them clearer or to make rebasing easier.

@jwhitlock jwhitlock requested a review from joeherm November 26, 2025 19:23
@jwhitlock jwhitlock marked this pull request as ready for review November 26, 2025 19:23
Member

@groovecoder left a comment

I went thru this again. I went thru more quickly this time - assuming that weeks-ago @groovecoder was relatively sane at the time of first review.

Just 1 last blocking question about whether FxA response changes will break the Relay app at runtime. If not, I'm good to merge this and use the new environment variable to switch to the new auth mechanism on the dev and/or stage server for first testing.

Member

question (non-blocking): I assume this code is the unmodified current code?

Member Author

yes

@jwhitlock
Member Author

jwhitlock commented Jan 7, 2026

It seems we're stuck again on this change, and should start over with smaller changes. The key change addressing MPP-3505 (IntegrityError: duplicate key value violates unique constraint) appears to be checking if another process created the user and swallowing the error. That could be a code change of 10 or fewer lines, plus tests.
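
As a rough illustration of that smaller change, here is a sketch using the standard library's sqlite3 instead of the Django ORM. The table, column, and function names are all hypothetical. The pattern is to attempt the insert, and on a unique-constraint violation to fetch the row that the other writer created:

```python
import sqlite3


def create_or_fetch_account(conn: sqlite3.Connection, uid: str) -> int:
    """Insert an account row for uid, or return the existing row's id."""
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute("INSERT INTO account (uid) VALUES (?)", (uid,))
            return cursor.lastrowid
    except sqlite3.IntegrityError:
        # Another writer created the row first, so use theirs.
        row = conn.execute(
            "SELECT id FROM account WHERE uid = ?", (uid,)
        ).fetchone()
        if row is None:
            raise  # the violation was not the duplicate we expected
        return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, uid TEXT UNIQUE NOT NULL)")
first = create_or_fetch_account(conn, "fxa-uid-1")
second = create_or_fetch_account(conn, "fxa-uid-1")  # takes the except path
```

Re-raising when no matching row exists keeps unrelated integrity errors from being silently swallowed.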

Do you agree @groovecoder, or want to continue with this PR?
