
THREESCALE-12408: Access tokens protection #4236

Open
jlledom wants to merge 14 commits into master from THREESCALE-12408-acess-tokens-protection

Conversation

@jlledom
Contributor

@jlledom jlledom commented Feb 26, 2026

What this PR does / why we need it:

This adds some protection to access tokens in our DB, so tokens are not exposed if the DB is leaked.

This implements two measures to increase protection:

  1. Increase the token length to 48 bytes (96 hex characters).
  2. Hash the tokens with SHA-384.

Since we want to make this backwards compatible, the current code still allows plain text tokens to exist and be used to authenticate.

The code is designed to be as simple as possible. The day we want to remove the migration code and the support for plain text tokens, we'll only have to remove a few lines in the find_from_value method.

A brief explanation of the new design:

  • The value attribute always holds whatever is in the DB column: plain text if unmigrated, the hash if migrated.
  • A new in-memory attribute plaintext_value holds the plain text value of a hashed token. It's not persisted, so it's lost when the instance is released. This attribute provides a way to return the original value once, in the request response, be it UI or API. The user has that single opportunity to store it somewhere else; after that, we don't hold the original value anywhere.
  • No DB migration: we reuse the same column to hold the new value.
  • Authentication flow:
    1. Hash the received token
    2. Find by value = hash
    3. If nothing is found, find by value = received value
    4. If found, migrate the old value to its hash
  • If the DB is leaked, the hashes could be used directly as tokens, and they would work because we support plain text tokens. To avoid that, we use the length to distinguish between original and hashed values. In our SaaS DB, all tokens are exactly 64 chars long, while from now on all migrated hashes will be 96 chars long.
  • If the received token is 96 chars long, we only look up by hash. Otherwise we perform both lookups, hash and original.
  • As a trade-off, if some on-premises client happens to have an access token exactly 96 characters long, that token will stop working after this is merged. But that's extremely unlikely, since they would have needed to edit the token manually.
  • The hashing algorithm is SHA-384. I discarded SHA-256 because it generates 64-character hashes, so we couldn't distinguish between original and hashed values.
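A minimal sketch of this lookup flow, using an in-memory Hash in place of the DB table (compute_digest, find_from_value and the token values are illustrative stand-ins, not porta's actual code):

```ruby
require "openssl"
require "securerandom"

HASHED_TOKEN_LENGTH = 96 # SHA-384 hex digest length (48 bytes)

def compute_digest(plaintext)
  OpenSSL::Digest::SHA384.hexdigest(plaintext)
end

# `store` maps the DB `value` column to a token record.
def find_from_value(received, store)
  received = received.to_s
  token = store[compute_digest(received)] # fast path: migrated/new tokens
  return token if token

  # Legacy tokens are 64 chars, never 96, so a leaked hash is rejected here
  return nil if received.length == HASHED_TOKEN_LENGTH

  store[received] # slow path: legacy plain text tokens
end

legacy = SecureRandom.hex(32)                 # 64 chars, stored as plain text
fresh  = SecureRandom.hex(48)                 # 96 chars, stored hashed
store  = { legacy => :legacy_token, compute_digest(fresh) => :new_token }

find_from_value(fresh, store)                 # => :new_token
find_from_value(legacy, store)                # => :legacy_token
find_from_value(compute_digest(fresh), store) # => nil (a leaked hash is useless)
```

The migrate-on-use step is omitted here; in the PR, the slow path additionally replaces the stored plain text with its digest after a successful lookup.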

Which issue(s) this PR fixes

https://issues.redhat.com/browse/THREESCALE-12408

Verification steps

  1. Use an old token to perform a request
    • It should work
    • The token should be hashed in DB.
  2. Use a new token
    • It should work
  3. You should be able to create a token via UI and API, and see the original value once.

Special notes for your reviewer:

How to review:

There are 90+ files edited in this PR, but most of them are just the result of replacing access_token.value with access_token.plaintext_value in tests. This PR is better reviewed commit by commit:

  • dc93ca0: Increase the token length to 48 bytes to better resist quantum attacks.
  • d166b4e: The actual migration code.
  • a8af378: Edit the UI so the user can see the token. We render instead of redirecting so we keep the plain text value in memory.
  • 901e603: Same for API endpoints.
  • d523dc2: The seed creates hashed tokens, that way we won't have to worry about this when we decide to remove the migration code.
  • d46fd2c: When receiving a token from apicast, store it hashed. Same reason: so we don't have to worry after removing the migration code.
  • e73ea5f: Add a few tests for the specific features added in this PR.
  • 9d65d4f: The big bunch of changes to make tests send the plain text token when performing requests.

@codecov

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 64.58333% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.30%. Comparing base (230cb62) to head (8df29cf).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...rs/provider/admin/user/access_tokens_controller.rb 29.16% 17 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4236      +/-   ##
==========================================
+ Coverage   88.23%   88.30%   +0.07%     
==========================================
  Files        1765     1765              
  Lines       44344    44361      +17     
  Branches      686      686              
==========================================
+ Hits        39126    39175      +49     
+ Misses       5202     5170      -32     
  Partials       16       16              


@jlledom jlledom self-assigned this Feb 27, 2026
@jlledom jlledom force-pushed the THREESCALE-12408-acess-tokens-protection branch from b2037c0 to 9d65d4f on February 27, 2026 12:56
@jlledom
Contributor Author

jlledom commented Feb 27, 2026

❌ Patch coverage is 82.35294% with 6 lines in your changes missing coverage.

This is not true, those lines are covered by https://github.com/3scale/porta/blob/513293b4e74515fced4aaef41a07e5e442fb6720/features/provider/admin/user/access_tokens.feature. But that's somehow ignored by Codecov. Probably a misconfiguration.

Comment on lines +115 to +135
def self.find_from_value(plaintext_value)
  return nil if plaintext_value.blank?

  scrubbed = plaintext_value.to_s.scrub
  digest = compute_digest(scrubbed)

  # Fast path: find by digest (new/migrated tokens)
  token = find_by(value: digest)
  return token if token

  # Legacy tokens can't be exactly 96 chars (that's our hash length).
  # This prevents using a leaked hash directly as a token.
  return nil if scrubbed.length == HASHED_TOKEN_LENGTH

  # Slow path: find by plaintext (legacy tokens)
  token = find_by(value: scrubbed)
  return nil unless token

  # Migrate on use: replace plaintext with hash
  token.migrate_to_hashed!(scrubbed)
  token
Contributor

Why not make a database migration instead of having another piece of crazy logic to handle legacy and new data, and never being sure whether the data was eventually fully converted, and having to notify customers when we finally decide to remove the legacy support, wait for them, check they actually stopped using it, wait more, etc.?

Contributor Author

I agree on adding a DB migration or a migration script, that was my plan. But I don't think this is crazy logic, or that it can be replaced by a migration. Backwards compatibility is mandatory because otherwise we couldn't deploy without downtime, and the logic is really simple: just a few additional lines in one single method.

The safe path to proceed is to deploy code that is compatible with pre and post migration state, then run the migration at any moment, then in the next deploy, remove the code to support legacy tokens.

It's as simple as creating a jira issue to remove the legacy support and schedule it to the release after the one including this PR. I don't see any drama, just regular work.

Contributor

Backwards compatibility is mandatory because otherwise we couldn't deploy without downtime

if we start with this, then a pure database migration will not really be feasible, as far as I can tell.

I'm not talking about ruby code migration but pure query migration that can execute very fast on the DB server side.

A short downtime is always acceptable. For example, a couple of minutes is totally fine.

A pure Ruby migration might be super slow; on the other hand, it can be executed lazily.

Contributor

But this lazy execution can only reliably be done on clusters we maintain; I'm not sure we can rely on customers running it. That's why I think for on-prem a one-step migration without gimmicks would be much better.

Contributor Author
@jlledom jlledom Mar 2, 2026

A short downtime is always acceptable if it is short. For example a couple of minutes is totally fine.

Well, I think no downtime is even better.

I'm not talking about ruby code migration but pure query migration that can execute very fast on the DB server side.

Like an SQL query that hashes the values? Is there a way to do it reliably on all the DBs, including Internet Explor... I mean Oracle? Is it really so fast?

Everything is a trade-off. I did it this way because, for me, providing an easy and less risky way to migrate is worth the small extra complexity. Forcing the client to have downtime and execute a migration written in SQL that must work on all DBs and versions we support is more risky and complicated for them, I think. And the only advantage would be that it's allegedly less work for us, and maybe not even that.

@mayorova @Madnialihussain Please untie.

Contributor Author

From our discussion, I suggest this approach:

  1. New tokens are created hashed
  2. Hashed tokens are identified by a prefix naming the algorithm
  3. When receiving a request:
    1. Find by hash
    2. If that fails, find by clear text
    3. If that fails, reject
  4. Clear tokens are not migrated on the fly

Deploy paths:

  • SaaS:
    1. Deploy code
    2. Leave it running to ensure we don't rollback
    3. Run the migration
    • Two queries will only happen in the lapse between deploying code and running the migration
  • On premises:
    1. Have downtime
    2. Update the code
    3. Run the migration
    4. Start server
    • They will never run two queries per hash

Result:

One-time deploy, backwards compatibility, no downtime, rollback, leak-safe, no DDL migration, no corner cases.

WDYT?

Contributor Author

In fact, it's almost a one-time deploy, because the second phase is only a cleanup of dead code.

Contributor

I'm fine with whatever you find yourself comfortable with. I'm only not sure about the 2 queries, but provided the key is migrated on first use, so that 2 queries consistently run only for missing keys, it should perhaps be fine.

Contributor Author

In my last suggestion, I added:

Clear tokens are not migrated on the fly

This is to provide a way back. But two queries will only happen until they run the migration.

On the other hand, I think we can actually deploy in two phases for SaaS and a single phase for on-premises, releasing the two phases together in one release, since they will have downtime anyway.

Contributor

It is still not ultimately downgradable, because new tokens may be created while on the new version. Still, much higher downgradability.

@jlledom jlledom marked this pull request as ready for review March 4, 2026 08:34
def create
  @presenter = AccessTokensNewPresenter.new(current_account)
  create! do |success, failure|  # was: create! do |success, _failure|
Contributor

This syntax belongs to inherited_resources 😅

As we're trying to get rid of it, I'd suggest using a standard Rails approach to avoid conflicts.

Contributor Author

Exactly, this is how I learnt about that gem.

As we're trying to get rid of it, I'd suggest using a standard Rails approach to avoid conflicts.

Sure!


def self.random_id
  SecureRandom.hex(48)  # was: SecureRandom.hex(32)
Contributor

Hm, the existing 64-character tokens are represented as a 96-character hash, it seems. Why is it necessary to also increase the generated plain value size? 🤔

While we can do it, I think it's better to avoid it if we can: maybe some customers also have assumptions based on the token size (even though that might not be wise of them).

Contributor Author

Because it's more resistant to quantum computers, if those actually work some day.

Contributor
@akostadinov akostadinov Mar 4, 2026

+1 to generate tokens that contain at least as many random bits as the hash used to store them. This is not expensive, so it shouldn't be noticeable in performance.

@jlledom jlledom force-pushed the THREESCALE-12408-acess-tokens-protection branch from 9d65d4f to bae2dcc on March 10, 2026 13:09
jlledom and others added 11 commits March 10, 2026 14:10
* New tokens are generated already hashed
* Old tokens are hashed the first time they are used

Co-Authored-By: Claude <noreply@anthropic.com>
* Remove dead code
* Render on create instead of redirecting, so the user can see the token
  while still in memory
* New template to show the generated token. It was embedded in the index
  template before
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
@jlledom jlledom force-pushed the THREESCALE-12408-acess-tokens-protection branch from c3be649 to e58951c on March 10, 2026 13:11
@qltysh

qltysh bot commented Mar 10, 2026

❌ 55 blocking issues (55 total)

Tool Category Rule Count
reek Lint AccessToken#generate_value calls 'self.class' 2 times 49
rubocop Lint Avoid using update\_columns because it skips validations. 2
rubocop Lint Action edit should appear before create. 1
reek Lint AccessToken#self.find_from_value has approx 8 statements 1
rubocop Lint Block has too many lines. [44/25] 1
rubocop Lint Class has too many lines. [207/200] 1

@jlledom
Contributor Author

jlledom commented Mar 11, 2026

I made a few changes, my last proposal:

Changes:

  1. New tokens are created hashed
  2. Hashed tokens are identified by a prefix naming the algorithm
  3. When receiving a request:
    1. Find by hash
    2. If that fails, find by clear text
    3. If that fails, reject
  4. Clear tokens are NOT migrated on the fly
  5. Hashing is done via a Rails DB migration using the DB's native hashing functions
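A sketch of this variant, again with an in-memory Hash standing in for the table. The prefix string, the helper names, and the explicit rejection of prefixed input are my reading of the leak-safety requirement, not necessarily the PR's exact code:

```ruby
require "openssl"
require "securerandom"

DIGEST_PREFIX = "SHA384|" # illustrative; any unambiguous marker works

def store_value(plaintext)
  DIGEST_PREFIX + OpenSSL::Digest::SHA384.hexdigest(plaintext)
end

def find_from_value(received, store)
  token = store[store_value(received)] # 1. find by hash
  return token if token

  # A leaked stored value starts with the prefix and must not authenticate
  return nil if received.start_with?(DIGEST_PREFIX)

  store[received] # 2. find by clear text; no on-the-fly migration
end

clear  = SecureRandom.hex(32)
secret = SecureRandom.hex(48)
store  = { clear => :clear_token, store_value(secret) => :hashed_token }

find_from_value(secret, store)              # => :hashed_token
find_from_value(clear, store)               # => :clear_token
find_from_value(store_value(secret), store) # => nil (reject)
```

Because hashed values are self-identifying via the prefix, the migration can run at any time, and unmigrated rows simply keep taking the two-lookup path until it does.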

Deploy paths:

  • SaaS:

    1. Deploy code
    2. Leave it running to ensure we don't rollback
      • Since tokens are not migrated on the fly, we can run the migration or rollback when we want
    3. Run the migration
      • Two queries will only happen in the lapse between deploying code and running the migration
      • Two queries will still happen for invalid tokens, even after the migration
    4. Write a follow up PR to remove the second query, not needed after migration
    5. Merge, deploy again.
  • On premises:

    1. We'll release this PR and the follow-up PR altogether
      • No backwards compatibility.
    2. Have downtime
    3. Update the code
    4. Run the migration
    5. Start server
      • They will never execute two queries per request

Support table:

Environment One time deploy Backwards compatibility No downtime Rollback Leak-safe No DDL Migration No Corner cases
SaaS
On Premises

Performance tests:
I told Claude to write a few scripts to test performance. Take into account that I'm running this on a 4-year-old laptop.

  • SaaS has about 42K tokens in DB. In my benchmark, I insert 75K tokens in the table.
  • I tried this in all three DB engines multiple times.

Results:

  • About 25-28 requests per second before migrating, when all 75k tokens are plain text and two queries are needed to auth.
  • The migration takes about 10 seconds to migrate the 75k tokens.
  • During the migration, most of the time no requests fail. Twice I saw 1 or 2 requests failing; race conditions, probably.
  • A slightly higher rate, about 27-29 RPS, after the migration, when only one query is needed to auth.

@akostadinov @mayorova @Madnialihussain

akostadinov
akostadinov previously approved these changes Mar 12, 2026
Contributor
@akostadinov akostadinov left a comment

Awesome work!

I have a few nitpick comments but overall very solid, thanks!


def self.find_from_value(plaintext_value)
  scrubbed = plaintext_value.to_s.scrub  # was: find_by(value: value.to_s.scrub)
Contributor

do we need scrub?

Just a nitpick for waste of compute resources.

Contributor Author

yeah we need it because we re-use it for the legacy comparison. We can remove it in the follow-up PR.

section id="access-tokens"
h2 Access Tokens
p
' Access tokens are personal tokens that let you authenticate against the Account Management API, the Analytics API and the Billing API through HTTP Basic Auth. You can create multiple access tokens with custom scopes and permissions. We suggest you create tokens with the minimal scopes & permissions needed for the task at hand. Use Access Tokens from within the
Contributor

Can we use i18n for all strings? 👼

Not blocking merge though; it can be done later.


def new
@presenter = AccessTokensNewPresenter.new(current_account)
@access_token = access_tokens.build
Contributor

I assume we can't easily avoid instantiating an object here and generating a key, but then again, generating random data should be fast enough nowadays...

"WHERE value NOT LIKE '#{DIGEST_PREFIX}%' LIMIT #{BATCH_SIZE}"
elsif System::Database.postgres?
"UPDATE access_tokens SET value = '#{DIGEST_PREFIX}' || encode(sha384(value::bytea), 'hex') " \
"WHERE id IN (SELECT id FROM access_tokens WHERE value NOT LIKE '#{DIGEST_PREFIX}%' LIMIT #{BATCH_SIZE})"
Contributor

Now I'm thinking, postgres may potentially not have the crypto extension enabled. Maybe we should have some failsafe solution for this case 😬
But can be discussed after merge to see if we need one and what exactly.

Contributor Author

We don't use the extension; that's for old Postgres versions. In PG 14 there's a builtin alternative, the sha384() function I'm using here.

https://www.postgresql.org/docs/14/functions-binarystring.html
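As a quick sanity check that the Ruby side and the DB-side function agree, OpenSSL's SHA-384 over the well-known NIST test vector "abc" yields the same hex string documented for PG 14's encode(sha384('abc'::bytea), 'hex') (the Postgres equivalence follows from the docs linked above, not re-verified here):

```ruby
require "openssl"

digest = OpenSSL::Digest::SHA384.hexdigest("abc")
# SHA-384 test vector for "abc"
digest == "cb00753f45a35e8bb5a03d699ac65007272c32ab0eded1631a8b605a43ff5bed8086072ba1e7cc2358baeca134c825a7" # => true
digest.length # => 96
```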

assert_equal legacy_value, @token.reload.read_attribute(:value)
end

def test_find_from_value_rejects_leaked_hash_as_token
Contributor

this one seems to be a dup of `authentication with leaked database hash fails`

Contributor Author

Not duplicated: one is a unit test and the other is an integration test. They test different layers.

assert @token.reload.read_attribute(:value).start_with?(AccessToken::DIGEST_PREFIX)
end

def test_find_from_value_finds_legacy_token
Contributor

this seems to be a dup of `authentication with legacy unmigrated token succeeds`

Contributor Author

Same: they test at different levels.

@@ -1,10 +1,14 @@
class AccessToken < ApplicationRecord
DIGEST_PREFIX = 'SHA384|'.freeze
Contributor
@akostadinov akostadinov Mar 13, 2026

Overnight I had an insight that this prefix is not ideal. It will be a constant nuisance to use on the command line:

$ echo SHA384|asdasd
bash: asdasd: command not found...

The | character always needs to be quoted or escaped. It would be much more user-friendly to have a shorter prefix without special characters, for example s2_

$ echo s2_asdasd
s2_asdasd

Contributor Author

The _ isn't good either, because it's a wildcard in MySQL:

https://dev.mysql.com/doc/refman/8.4/en/string-comparison-functions.html#operator_like

Let's use $, which is what bcrypt uses and what we're already using in our DB for passwords. 65a75ea

I hope you can sleep now.
