-
Notifications
You must be signed in to change notification settings - Fork 242
fix(relay): Create alternate bearer token auth for FxA (MPP-3505) #6049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@groovecoder After a weekend, I'm still not sure if this is reviewable or if I should start over. The main benefit is the new code is behind an environment flag, so in theory it could be tested on dev and stage before going to prod. I'm willing to start over with smaller PRs. I think the first would be to remove or fix token introspection caching. If we go that route, it would probably take several weeks to get each change on stage, but will be reviewable bits. |
groovecoder
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look at the test code yet, but I want to give these first review comments ASAP to un-block you.
overall question (non-blocking): (If this is already addressed elsewhere and I just haven't gotten to that part of the review, let me know and I'll just keep reading the rest of the PR.) The 2025 code adds quite a bit more type checking and validation, which is overall good. But I worry it might be too brittle and crash out when there are FXA API bugs or changes. (Obviously it's ideal for the API contract between Relay and FXA to be consistent and strict, but we also know that bugs and changes will happen.) How much more time do you think it would take to make an extra pass thru the code looking for places where the type checking and validation of FXA-sourced data could cause the Relay code to break? Is it worth taking that extra time?
| class FxaIntrospectData(TypedDict, total=False): | ||
| """Keys seen in the JSON returned from a Mozilla Accounts introspection request""" | ||
|
|
||
| active: bool | ||
| sub: str | ||
| exp: int | ||
| error: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (blocking): Will this handle a case where Mozilla Accounts introspection request returns JSON that doesn't match this structure? (We were recently stung by a situation where Twilio changed an API call without notice.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In introspect_token later in this file, there is code to check the structure of the returned data. Current version is at:
fx-private-relay/api/authentication.py
Line 434 in a5dc5cb
| def introspect_token(token: str) -> IntrospectionResponse | IntrospectionError: |
The checks include:
- request timed out
- other error making request
- response was not JSON
- response was JSON but not an object
- response had a 401 Not Authorized status code
- response had a different status code that wasn't 200 OK
activeis not a bool or was Falsesubis not a string or is an empty string
It then passes it to IntrospectResponse, which has further checks that raise ValueError. The current version is at:
fx-private-relay/api/authentication.py
Line 103 in a5dc5cb
| class IntrospectionResponse: |
The checks include:
activeis not a bool or was False (repeats)subis not a string or is an empty string (repeats)expis present but not integer
Looking at that code, there may be issues if FxA changes exp to a float or a string representation of a float. An unhandled ValueError would be raised. I could add code to handle that case.
On slack, you clarified that if the API changes:
Hrm, I prefer the opposite - let people log in if possible. And log a “handled” exception into Sentry to review.
In that case, I think only these would block authorization:
- 401 Not Authorized
activeis exactlyFalsesubis not included, so we don't have an FxA ID
That will require some changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (blocking): So if the most likely failure state seems to be FxA changes exp to a string, then an unhandled ValueError is raised. Is it as simple as updating the type declaration to exp: int | str | float for that case?
And then if I remember python type enforcement - these types are enforced in our code by mypy, but not at runtime? So if FxA changes other values in a way that doesn't match this type declaration, it won't break the code at runtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I understood these requested changes, but I'm more confused now than I was. Maybe we should talk about this over video or in person?
I think it is really unlikely that the FxA response to /v1/introspect, even from an int to a float. Why would it change to a float? Is it now in seconds instead of microseconds? I have no idea how to handle an expiration that changes to "it's fine bro". I think it would make the code worse to do anything but fail authentication if the authentication token changes formats. If you think failing authentication means breaking the code, then we're far from agreement.
Yes, static type enforcement does not check at runtime. That's why introspect_token carefully examines the response from FxA before returning an FxaIntrospectData. mypy gives some assurances we correctly went from an unknown response from FxA to this firmer type definition. It offers a sharper line between "I think I know what this is" and "I'm sure what this is, for the parts that matter to me".
|
@groovecoder I've gotten through the feedback so far, ready for the next round:
|
f8f8140 to
e7c27d4
Compare
Reproduce the IntegrityError by creating a matching user and SocialAccount after checking for a matching user by email. This may not be the exact mechanism in production, but it does produce the same traceback.
Lots of changes that could have been in multiple commits. In authentication tests: * Split AuthenticationMiscellaneous TestCase into IntrospectTokenTests and GetFxaUidFromOauthTokenTests, remove name prefixes, simplify setup. * Convert _setup_fxa_response to setup_fxa_introspect. It now constructs the payload as well as mocking the response, and returns the mocked response and expected cached data. * Use self.assertRaisesMessage for consistent exception checking. * Assert on the mocked response call_count, not the URL-matched count. In terms_accepted_user tests: * Add _mock_fxa_profile_response * Use new setup_fxa_introspect * Use _setup_client everywhere * Add mocked response checks * Add cache value checks, rename incorrect test titles
Create the SocialAccount in a new function outside of a try block, to avoid nested exceptions.
Because the Mozilla Accounts profile fetch takes a while, this is the likely time for a parallel SocialAccount to be created. Check again before proceeding to create a new one.
The Django logout() command, since at least 1.10, also checks if the user was logged in or not, so our check is redundant (and has missing branch coverage)
Reimplement FxaTokenAuthentication on TokenAuthentication, to get the DRF-provided parsing of token authentication headers. This changes the status code for a header of 'Authorization: Bearer ' (token value ommited) from a 400 (Bad Request) to a 401 (Unauthorized).
The results of hash(str) changes between Python instances, so the previous version would lead to many cache misses.
Return the introspection results instead of the expected cache contents. This may make the cache value change more obvious.
Setting FXA_TOKEN_AUTH_VERSION to 2024 (the default) will use the existing authentication for FxA bearer tokens. Setting it to 2025 will use the new authentication method. When the new authentication is proven, the 2024 version can be deleted.
Co-authored-by: luke crouch <[email protected]>
Co-authored-by: luke crouch <[email protected]>
Bump the year numbers 2024->2025, 2025->2026, to reflect that the existing implementation changed in 2025 and the new implementation probably won't be tested until 2026. Add the existing implementations as new files, matching the test filenames: * api/authentication_2025.py * api/views/terms_accepted_user_2025.py (extracted from ./privaterelay.py)
e7c27d4 to
ce6ae11
Compare
|
I've rebased and adapted to bring in the changes from PR #6049. Changes include:
|
groovecoder
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went thru this again. I went thru more quickly this time - assuming that weeks-ago @groovecoder was relatively sane at the time of first review.
Just 1 last block question about whether FxA response changes will break the Relay app at runtime. If not, I'm good to merge this and use the new environment variable to switch to the new auth mechanism on the dev and/or stage server for first testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (non-blocking): I assume this code is the unmodified current code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
| class FxaIntrospectData(TypedDict, total=False): | ||
| """Keys seen in the JSON returned from a Mozilla Accounts introspection request""" | ||
|
|
||
| active: bool | ||
| sub: str | ||
| exp: int | ||
| error: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (blocking): So if the most likely failure state seems to be FxA changes exp to a string, then an unhandled ValueError is raised. Is it as simple as updating the type declaration to exp: int | str | float for that case?
And then if I remember python type enforcement - these types are enforced in our code by mypy, but not at runtime? So if FxA changes other values in a way that doesn't match this type declaration, it won't break the code at runtime?
|
It seems we're stuck again on this change, and should start over with smaller changes. The key change addressing MPP-3505 ( Do you agree @groovecoder, or want to continue with this PR? |
This is a recreation of PR #5272. It includes the removal of debug logging by PR #5883. I don't quite understand that code, but I also am far away enough from this code that I don't fully understand it anymore. If this proves to be unreviewable, I should start over.
The original PR statement:
This PR adds a new implementation of Bearer Token Authentication, for creating a new user or authorizing with a Mozilla account user logged into Firefox. By default, the existing authentication implementation is used. Setting the environment variable
FXA_TOKEN_AUTH_VERSION=2025picks the new implementation.While working on MPP-3505 (An
IntegrityErrorwhile setting up a Relay account for a Mozilla account user through Firefox), I found one important bug in the existing implementation. The cache for the Accounts introspection response useshash(token)as a cache key. This give a consistent number for a given Python instance, but a different number on a different Python instance, like a different pod or possibly a different gunicorn thread on the same pod. This means these responses were almost never cached.I'm reluctant to fix just this issue, because we don't know what will happen if we start using the cache reliably.
The new implementation uses a cache key that will be consistent across pods. It also has some potential improvements.:
POSTcall the API and do not cache the results./api/v1/terms-accepted-usermore consistently returns a401or403response. The Firefox integration is looking for a 401 response or a 403 response. The existing implementation returns404in some instances.502 Service Unavailablewhen the upstream Accounts API is unavailable. This could be tuned if the Firefox integration handles this poorly.FxaTokenAuthenticationis based on Django REST Framework's TokenAuthentication, uses permissions classes to add permission checks beyond the token check, and returns token details inrequest.auth. This makes the authentication look more like other DRF authentication, and allows better code sharing between/api/v1/terms-accepted-userand other endpoints like/api/v1/relayaddressesused by the Firefox integration.