
feat: store hashed tokens instead of plaintext #8217

Draft
rubenfiszel wants to merge 28 commits into main from store-hash

rubenfiszel (Contributor) commented Mar 4, 2026

Summary

  • Store SHA-256 hashed tokens (token_hash) in the token table instead of plaintext, so a DB compromise doesn't expose usable credentials
  • Add token_prefix (first 10 chars) column for listing/deletion without needing plaintext
  • Auth cache stays keyed by raw token — zero-cost cache hits, only hash on cache miss
  • Two-phase migration: (1) add columns + backfill + build indexes (allows concurrent reads), (2) instant PK swap (~2ms exclusive lock)
  • Plaintext token column is still written alongside the hash until MIN_VERSION_SUPPORTS_TOKEN_HASH (1.650.0) is reached by all workers, then NULL is written instead
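The cache behaviour in the third bullet can be sketched as follows. This is a minimal, std-only model and not the PR's code: `TokenTable`, `AuthCache`, and `stand_in_hash` are hypothetical names, and `DefaultHasher` stands in for SHA-256 purely so the sketch has no external dependencies (it is not cryptographic).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Number of leading characters kept for listing and deletion (from the PR summary).
const TOKEN_PREFIX_LEN: usize = 10;

/// Stand-in digest for illustration only. The PR uses SHA-256; `DefaultHasher`
/// is NOT cryptographic and is used here just to keep the sketch std-only.
fn stand_in_hash(token: &str) -> String {
    let mut h = DefaultHasher::new();
    token.hash(&mut h);
    format!("{:016x}", h.finish())
}

/// Hypothetical in-memory model of the `token` table, keyed by `token_hash`.
#[derive(Default)]
struct TokenTable {
    rows: HashMap<String, (String, String)>, // token_hash -> (token_prefix, email)
}

impl TokenTable {
    fn insert(&mut self, token: &str, email: &str) {
        let prefix = token.get(..TOKEN_PREFIX_LEN).unwrap_or(token).to_string();
        self.rows.insert(stand_in_hash(token), (prefix, email.to_string()));
    }
}

/// Auth cache keyed by the raw token: a hit never computes a hash.
#[derive(Default)]
struct AuthCache {
    by_raw_token: HashMap<String, String>, // raw token -> email
    hashes_computed: usize,                // counts cache misses
}

impl AuthCache {
    fn authenticate(&mut self, table: &TokenTable, token: &str) -> Option<String> {
        if let Some(email) = self.by_raw_token.get(token) {
            return Some(email.clone()); // cache hit: no hashing, no table lookup
        }
        // Cache miss: hash once, look up by hash, then remember the raw token.
        self.hashes_computed += 1;
        let (_, email) = table.rows.get(&stand_in_hash(token))?;
        self.by_raw_token.insert(token.to_string(), email.clone());
        Some(email.clone())
    }
}
```

Authenticating the same valid token twice increments `hashes_computed` only once. An unknown token is hashed on every attempt in this sketch, because failed lookups are not cached.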

TODO for @hugocasademont

Remove get_token_by_prefix (native triggers)

get_token_by_prefix retrieves the plaintext token from the DB so it can be embedded in webhook callback URLs. It's used in two places:

  1. handler.rs:237 — trigger update: already handles None gracefully (deletes old token, creates new one)
  2. google/external.rs:327 (renew_channel): hard-fails on None ("Webhook token not found")

Once MIN_VERSION_SUPPORTS_TOKEN_HASH is met and plaintext stops being written, all new webhook tokens will have token = NULL, breaking Google trigger watch renewal.

Recommended fix — always delete + create a new token on renewal:

  • Add a low-level create_webhook_token_for_trigger(db, trigger) that inserts a token directly using the trigger's email/owner/workspace_id/script_path (no ApiAuthed needed)
  • Update renew_channel to: delete old token by prefix → create new token → use it in the webhook URL → update webhook_token_prefix on the trigger
  • Then get_token_by_prefix can be removed entirely — every renewal just creates a fresh token

This matches the pattern handler.rs already uses on trigger update.
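The recommended renewal flow could look roughly like the sketch below. It is an in-memory model under stated assumptions, not the real implementation: `Trigger`, `TokenStore`, and the URL shape are hypothetical, the store is keyed by prefix only, and the new token is passed in rather than generated so the example stays std-only.

```rust
use std::collections::HashMap;

const TOKEN_PREFIX_LEN: usize = 10;

/// Hypothetical, simplified trigger row; field names follow the PR text.
struct Trigger {
    email: String,
    workspace_id: String,
    webhook_token_prefix: String,
}

/// Hypothetical in-memory stand-in for the `token` table: prefix -> owner email.
#[derive(Default)]
struct TokenStore {
    rows: HashMap<String, String>,
}

impl TokenStore {
    /// Models the proposed `create_webhook_token_for_trigger`:
    /// it needs only fields from the trigger, no `ApiAuthed`.
    fn create_for_trigger(&mut self, trigger: &Trigger, new_token: &str) -> String {
        let prefix = new_token.get(..TOKEN_PREFIX_LEN).unwrap_or(new_token).to_string();
        self.rows.insert(prefix.clone(), trigger.email.clone());
        prefix
    }
}

/// Renewal: delete the old token by prefix, create a new one,
/// point the trigger at it, and build the callback URL.
fn renew_channel(store: &mut TokenStore, trigger: &mut Trigger, new_token: &str) -> String {
    // Removing a prefix that is already gone is a no-op, so a missing token is not an error.
    store.rows.remove(&trigger.webhook_token_prefix);
    let new_prefix = store.create_for_trigger(trigger, new_token);
    trigger.webhook_token_prefix = new_prefix;
    // The URL layout is made up for illustration.
    format!(
        "https://windmill.example/api/w/{}/webhook?token={}",
        trigger.workspace_id, new_token
    )
}
```

Because the plaintext is only ever taken from the freshly created token, nothing in this flow needs to read a stored plaintext value back from the database.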

Test plan

  • cargo check passes (CE + EE + native_trigger features)
  • Migration applies cleanly on fresh and existing DBs
  • Create token via UI → verify token_hash+token_prefix populated, token column has plaintext (pre-1.650.0) or NULL (post-1.650.0)
  • Authenticate with Bearer token → hash-based lookup works
  • List tokens → prefix displays correctly
  • Delete token by prefix → deletion works
  • Login/logout cycle → session tokens work
  • Monitor cleanup of expired tokens uses prefix in logs
  • Native trigger webhook re-creation works

🤖 Generated with Claude Code

Companion EE PR: https://github.com/windmill-labs/windmill-ee-private/pull/437

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

This comment was marked as outdated.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 4, 2026

Deploying windmill with Cloudflare Pages

Latest commit: 978ee56
Status: ✅  Deploy successful!
Preview URL: https://2240a199.windmill.pages.dev
Branch Preview URL: https://store-hash.windmill.pages.dev


CREATE UNIQUE INDEX token_hash_unique ON token (token_hash);

-- Index on prefix for deletion/listing
CREATE INDEX idx_token_prefix ON token (token_prefix);

Non-unique prefix risk: token_prefix is only 10 characters. When creating via rd_string(32), the charset is alphanumeric (62 chars), so the probability of a collision among 10-char prefixes is non-trivial at scale. This is important because the DELETE FROM token WHERE token_prefix = $2 in delete_token could delete multiple tokens with the same prefix.

The delete_token endpoint does scope deletion to email = $1 AND token_prefix = $2, which helps significantly, but there's still a theoretical risk of colliding prefixes for the same user. Consider whether this is acceptable or whether you'd want a UNIQUE(email, token_prefix) constraint. Not blocking, just something to think about at scale.
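To put rough numbers on the collision concern, the birthday bound can be evaluated directly. The sketch below assumes prefixes are uniformly random over a 62-character alphabet, as the comment describes for rd_string; `prefix_collision_probability` is a hypothetical helper, not part of the PR.

```rust
/// Approximate probability that at least two of `n` uniformly random prefixes
/// collide, using the birthday bound p ≈ 1 - exp(-n(n-1) / (2 * space)),
/// where space = alphabet^prefix_len.
fn prefix_collision_probability(n: u64, alphabet: u32, prefix_len: i32) -> f64 {
    let space = f64::from(alphabet).powi(prefix_len);
    let pairs = (n as f64) * (n.saturating_sub(1) as f64) / 2.0;
    // 1 - exp(-x) written as -expm1(-x) to keep precision when x is tiny.
    -(-pairs / space).exp_m1()
}
```

With `alphabet = 62` and `prefix_len = 10`, a table of one million tokens gives a probability of about 6e-7 that any two prefixes collide, and the figure only approaches one half around a billion tokens. Scoping to a single email shrinks `n` to that user's token count, which makes the per-user risk far smaller still.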

Comment on lines +15 to +16
-- Make token NOT NULL again
ALTER TABLE token ALTER COLUMN token SET NOT NULL;

Potential down-migration failure: When rolling back step 2 (token_hash_pk_swap.down.sql), this restores token NOT NULL, but new tokens created after the forward migration will have token = NULL (since the code no longer populates it). This means the rollback will fail with ERROR: column "token" of relation "token" contains null values if any new tokens were created after migration.

You'd need to backfill or drop the new null rows before restoring NOT NULL:

-- Delete tokens that were created after the forward migration (no plaintext token)
DELETE FROM token WHERE token IS NULL;
-- Then restore NOT NULL
ALTER TABLE token ALTER COLUMN token SET NOT NULL;

INSERT INTO token(token, email, label, super_admin) VALUES ('SECRET_TOKEN', 'test@windmill.dev', 'test token', true)
-- Tokens for all users (token_hash = sha256 hex, token_prefix = first 10 chars)
INSERT INTO token(token_hash, token_prefix, token, email, label, super_admin)
VALUES (encode(sha256('SECRET_TOKEN'::bytea), 'hex'), 'SECRET_TOK', 'SECRET_TOKEN', 'test@windmill.dev', 'test token', true)

Note: Test fixtures include token_hash, token_prefix, and the plaintext token column. This is correct for the transitional period (the column still exists, just nullable), but be aware that if the token column is eventually dropped, these fixtures will need updating. Consider adding a comment here for future reference.

rubenfiszel and others added 5 commits March 4, 2026 08:01
- Update all base.sql fixtures to include token_hash/token_prefix columns
- Keep plaintext token for webhook tokens (needed for URL reconstruction)
- Restore get_token_by_prefix to query DB for webhook tokens
- Fix down migration to delete NULL-token rows before restoring NOT NULL
- Update parser fixture standalone schema
- Update EE dedicated_worker_ee.rs to use token_hash/token_prefix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Write to token column alongside token_hash until MIN_VERSION_SUPPORTS_TOKEN_HASH
(1.649.0) is reached. This ensures older workers can still authenticate
during rolling upgrades. Remove the separate UPDATE in new_webhook_token
since create_token_internal now writes plaintext directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check MIN_VERSION_SUPPORTS_TOKEN_HASH at runtime: write plaintext to
token column while old workers exist, switch to NULL once all workers
are >= 1.649.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rubenfiszel
Contributor Author

/ai-fast review this PR

@claude

This comment was marked as outdated.

hugocasa and others added 9 commits March 12, 2026 10:36
Resolve conflicts in monitor.rs and native-triggers/lib.rs,
keeping the token_prefix-based queries from the store-hash branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…OT NULL

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@hugocasa
Collaborator

/ai-fast review this PR

@claude
Contributor

claude bot commented Mar 13, 2026

Claude finished @hugocasa's task in 1m 55s


PR Review: Store hashed tokens instead of plaintext

  • Read key changed files (auth, migrations, native triggers, users)
  • Analyze security and correctness
  • Post review findings

Overall this is a well-designed security improvement. The two-phase migration approach, the MIN_VERSION_SUPPORTS_TOKEN_HASH gating, and the auth cache staying keyed by raw token are all solid choices. Here are the findings:


Bugs / Issues

1. Potential panic on short tokens (auth.rs)

There are 7 instances of token[0..TOKEN_PREFIX_LEN] that will panic if a token shorter than 10 chars is provided. While rd_string(32) always produces 32-char tokens, this could still be triggered by:

  • A malformed Authorization: Bearer xx header
  • Short cookie values from corrupted state

The superadmin path at auth.rs:437 already uses the safe pattern: token.get(..TOKEN_PREFIX_LEN).unwrap_or(token). The other 7 sites should too:

  • auth.rs:284, auth.rs:308, auth.rs:324, auth.rs:373, auth.rs:387, auth.rs:402
  • users.rs:1651

Similarly, mcp/oauth_server.rs:389 and oauth_server.rs:533 use &access_token[..TOKEN_PREFIX_LEN] which could panic, though these are on internally-generated tokens so the risk is very low.
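The safe pattern mentioned above can be wrapped in a small helper so that every call site shares it. This is a minimal sketch: `token_prefix` is a hypothetical name, and `TOKEN_PREFIX_LEN` is set to 10 to match the PR description.

```rust
const TOKEN_PREFIX_LEN: usize = 10;

/// Returns the first `TOKEN_PREFIX_LEN` bytes of `token`, or the whole token
/// when it is shorter. `str::get` returns `None` instead of panicking, both
/// for out-of-range indices and when the cut would split a multi-byte character.
fn token_prefix(token: &str) -> &str {
    token.get(..TOKEN_PREFIX_LEN).unwrap_or(token)
}
```

A header such as `Authorization: Bearer xx` then yields the prefix `xx` rather than a panic. Note that a token whose tenth byte falls inside a multi-byte character also falls back to the full string.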


2. Google watch renewal fails when token is orphaned (google/external.rs:328-334)

renew_channel calls rotate_webhook_token and returns a hard error if it yields None (token deleted or missing). This is fine for now, since there is always a token row during the transition period. It should also remain functional after MIN_VERSION_SUPPORTS_TOKEN_HASH is met and plaintext stops being stored, because rotate_webhook_token looks up by token_hash, not plaintext. However, the error message at line 332 says "no user context to create a fresh one"; consider adding a fallback that creates a fresh token (as the PR description's TODO recommends) rather than failing the renewal entirely.


Migration Concerns

3. Native trigger migration deletes orphaned triggers (20260313000000_native_trigger_token_hash.up.sql:13)

DELETE FROM native_trigger WHERE webhook_token_hash IS NULL;

This silently deletes any native_trigger row whose webhook_token_prefix doesn't match a token.token_prefix. This could happen if:

  • A token was deleted but the trigger wasn't cleaned up
  • Prefix collision caused the JOIN to match a different token

The deletion is the right thing to do for data integrity, but consider logging or counting how many rows are deleted, or adding a DO NOTHING approach (set a sentinel hash value) so the trigger can be detected and re-created rather than silently lost.

4. Native trigger migration backfill uses prefix JOIN (20260313000000_native_trigger_token_hash.up.sql:7-10)

UPDATE native_trigger nt SET webhook_token_hash = t.token_hash
FROM token t WHERE t.token_prefix = nt.webhook_token_prefix;

If two tokens share the same 10-char prefix (unlikely but possible), this JOIN could match the wrong token for a trigger. Since the prefix space is 62^10, the probability is negligible in practice, but it's worth noting. A DISTINCT ON or additional filtering by workspace_id and email would be safer.


Design Observations (Non-blocking)

5. delete_token can delete multiple tokens with same prefix (users.rs:1951-1960)

The DELETE FROM token WHERE email = $1 AND token_prefix = $2 could delete multiple tokens if a user has tokens with colliding 10-char prefixes. The code already returns tokens_deleted.len() in the response message, which is good for visibility. At scale, consider whether a UNIQUE(email, token_prefix) constraint or using token_hash for deletion would be safer.

6. rotate_webhook_token doesn't copy expiration or owner (lib.rs:727-782)

The rotation creates a new token copying label, email, scopes, workspace_id, and super_admin, but doesn't copy expiration or owner. For webhook tokens, having no expiration is probably intentional (they live as long as the trigger), but the missing owner column means the new token won't have an owner set. Verify this is acceptable for token permission checks.

7. token_expiry_notification table stores token_hash as FK (lib.rs:622-631)

INSERT INTO token_expiry_notification (token, expiration) VALUES ($1, $2)

The column is still named token but now stores a hash. This is a naming inconsistency that could confuse future developers. Not blocking, but worth a rename in a follow-up migration (e.g., token_hash).


Summary

The core security goal (removing plaintext tokens from the DB) is well-implemented. The main actionable item is the short token panic (item 1) which could cause a runtime crash on malformed input. Everything else is either already acknowledged in the PR's TODO section or is a low-probability edge case.

Collaborator

@hugocasa hugocasa left a comment


/ai-fast

@hugocasa
Collaborator

/ai-fast

@claude

This comment was marked as outdated.

@claude

This comment was marked as outdated.

github-actions bot and others added 7 commits March 13, 2026 16:19
Replace all `token[0..TOKEN_PREFIX_LEN]` slicing with
`token.get(..TOKEN_PREFIX_LEN).unwrap_or(token)` to prevent
panics when a token shorter than 10 chars is provided (e.g.
malformed Authorization header, short superadmin secret).

Co-authored-by: hugocasa <hugocasa@users.noreply.github.com>
Replace all `token[0..TOKEN_PREFIX_LEN]` with safe
`token.get(..TOKEN_PREFIX_LEN).unwrap_or(token)` to prevent panics
on malformed tokens shorter than 10 characters.

Co-authored-by: hugocasa <hugocasa@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@hugocasa
Collaborator

/ai-fast review this PR

@claude
Contributor

claude bot commented Mar 13, 2026

Claude finished @hugocasa's task in 1m 39s


PR Review: Store hashed tokens instead of plaintext

  • Read key changed files (auth, users, native triggers, migrations, oauth)
  • Analyze security, correctness, and edge cases
  • Check if previous review issues were addressed
  • Post review findings

Overall this is a well-designed security improvement. The two-phase migration, auth cache keyed by raw token, and MIN_VERSION_SUPPORTS_TOKEN_HASH gating are all solid. Here are the findings:


Bugs

1. Panic on short tokens — still present (auth.rs)

The previous review flagged this and it hasn't been fixed. There are 7 instances of token[0..TOKEN_PREFIX_LEN].to_string() in auth.rs that will panic if a token shorter than 10 characters is provided:

  • Lines 284, 308, 324, 373, 387, 402 (regular token auth paths)
  • Line 436 (superadmin secret path — the secret is user-configured and could be shorter than 10 chars)

Additionally users.rs:1651 has the same issue on the login path.

The fix is straightforward — use token.get(..TOKEN_PREFIX_LEN).unwrap_or(token).to_string() everywhere.


2. Google channel renewal hard-fails when token is missing (google/external.rs:328-335)

renew_channel calls rotate_webhook_token and returns a hard error if the old token doesn't exist:

None => {
    return Err(Error::InternalErr(format!(
        "Cannot renew channel {}: webhook token no longer exists and no user context to create a fresh one",
        trigger.external_id
    )));
}

This is acknowledged in the PR description's TODO section, but it's worth noting that this will break Google watch renewals once MIN_VERSION_SUPPORTS_TOKEN_HASH is met and old tokens get cleaned up, since rotate_webhook_token depends on the old token existing. The recommended fix (create a create_webhook_token_for_trigger that doesn't require ApiAuthed) should be implemented before the min version is bumped.


Migration Concerns

3. Native trigger backfill uses prefix JOIN — potential mismatch (20260313000000_native_trigger_token_hash.up.sql:7-10)

UPDATE native_trigger nt SET webhook_token_hash = t.token_hash
FROM token t WHERE t.token_prefix = nt.webhook_token_prefix;

If two tokens share the same 10-char prefix, this JOIN could match the wrong token. Probability is negligible (62^10 space), but adding AND t.workspace_id = nt.workspace_id would make this more robust since both tables have the column.

4. Orphaned trigger deletion is silent (20260313000000_native_trigger_token_hash.up.sql:13)

DELETE FROM native_trigger WHERE webhook_token_hash IS NULL;

This deletes any trigger whose token no longer exists. Consider logging the count or adding a RAISE NOTICE so operators know how many triggers were cleaned up during migration.


Design Notes (Non-blocking)

5. delete_token could delete multiple tokens with colliding prefix (users.rs:1951-1960)

DELETE FROM token WHERE email = $1 AND token_prefix = $2 could affect multiple rows if a user has tokens with the same 10-char prefix. The code already reports tokens_deleted.len() which gives visibility. At scale, consider using token_hash for deletion (but that requires the frontend to know the hash, which is a bigger change).

6. token_expiry_notification.token column naming inconsistency (lib.rs:622-631)

The column is still named token but now stores a hash value. This is a naming inconsistency that could confuse future developers. A follow-up migration renaming it to token_hash would improve clarity.

7. rotate_webhook_token doesn't copy owner or expiration (lib.rs:727-782)

The new token omits the owner column from the old token. For webhook tokens this is likely fine (they're scoped by workspace/email), but worth documenting that owner is intentionally dropped.

8. Down migration handles NULL tokens correctly

The previous review flagged this and it's been addressed — 20260304000001_token_hash_pk_swap.down.sql:16 correctly deletes NULL-token rows before restoring NOT NULL.


Summary

The core design is sound. The main actionable item is the short token panic (item 1) — a malformed Authorization: Bearer x header will crash the server. Everything else is either acknowledged in the TODO or low-probability edge cases.


hugocasa and others added 6 commits March 13, 2026 23:18
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ner/expiration in rotate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…fication column

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@hugocasa
Collaborator

/ai-fast review this PR

@claude
Contributor

claude bot commented Mar 13, 2026

Claude finished @hugocasa's task in 1m 48s


PR Review: Store hashed tokens instead of plaintext

  • Read key changed files (auth, users, native triggers, migrations)
  • Check if previous review issues were addressed
  • Analyze security, correctness, and edge cases
  • Post review findings

This is a well-designed security improvement. The two-phase migration, auth cache keyed by raw token, MIN_VERSION_SUPPORTS_TOKEN_HASH gating, and rotate_webhook_token approach are all solid. Here are the findings from this review:


Bugs

1. Panic on short tokens — still present (7 sites in auth.rs + 1 in users.rs)

This was flagged in both previous reviews and remains unfixed. token[0..TOKEN_PREFIX_LEN] will panic if a token shorter than 10 characters is provided:

  • auth.rs:284, auth.rs:308, auth.rs:324, auth.rs:373, auth.rs:387, auth.rs:402 — regular token auth paths
  • auth.rs:436 — superadmin secret path (the secret is user-configured and could easily be shorter than 10 chars)
  • users.rs:1651 — login path

While rd_string(32) always produces 32-char tokens, a malformed Authorization: Bearer x header or a short superadmin secret will crash the server.

The fix is straightforward — use token.get(..TOKEN_PREFIX_LEN).unwrap_or(token).to_string() everywhere. This is a one-line change at each site.



Previously Flagged Issues — Now Addressed

2. Native trigger orphan handling (migration) — Fixed

The previous reviews flagged that 20260313000000_native_trigger_token_hash.up.sql deleted orphaned triggers silently. This has been fixed — orphaned triggers now get a placeholder hash and an error message ("Webhook token not found during migration — re-create this trigger to fix") instead of being deleted. Good improvement.

3. Down migration NULL token handling — Fixed

20260304000001_token_hash_pk_swap.down.sql:16 correctly deletes NULL-token rows before restoring NOT NULL.

4. token_expiry_notification column naming — Fixed

Migration 20260313000001_rename_token_expiry_notification_column.up.sql correctly converts existing plaintext values to hashes and renames the column to token_hash.

5. rotate_webhook_token copies owner and expiration — Fixed

lib.rs:763-775 now copies old.owner and old.expiration to the new token, addressing the previous review concern.


Design Observations (Non-blocking)

6. token_expiry_notification migration hashes plaintext values (20260313000001)

UPDATE token_expiry_notification
SET token = encode(sha256(token::bytea), 'hex');

This assumes all existing values in token_expiry_notification.token are plaintext. If any rows were already inserted with a hash (e.g., from the new code running after 20260304000000 but before this migration), they'd be double-hashed. Worth verifying the ordering guarantees — since 20260313000001 runs after 20260304000000, and the new code in register_token_expiry_notification already passes token_hash, any rows created between the two migrations would indeed be double-hashed.

However, looking at the code path: register_token_expiry_notification in lib.rs:622 inserts token_hash into a column named token. So between migrations 20260304000000 (which starts writing hashes) and 20260313000001 (which hashes the column), new rows will contain hashes that get re-hashed. The impact is low — these rows just won't match in check_expiring_tokens's JOIN, so some expiry notifications could be missed during the transition window.
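One way to reason about the double-hashing risk is to check whether a stored value already has the shape of a digest before hashing it. The helper below is a hypothetical sketch, not something the PR contains. It assumes plaintext tokens never consist of exactly 64 lowercase hex characters, which holds for the 32-character tokens described earlier in this review.

```rust
/// True if `value` has the shape of a hex-encoded SHA-256 digest:
/// exactly 64 characters, all lowercase hexadecimal. This matches the
/// output format of PostgreSQL's `encode(sha256(...), 'hex')`.
fn looks_like_sha256_hex(value: &str) -> bool {
    value.len() == 64 && value.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}
```

A backfill that skipped rows for which this predicate holds would leave already-hashed values untouched. Whether such a guard is worth adding depends on how many rows can realistically be written in the window between the two migrations.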

7. Google channel renewal still hard-fails on missing token (google/external.rs:328-335)

As acknowledged in the PR description's TODO, renew_channel returns a hard error when rotate_webhook_token returns None. The handler.rs update path (lines 255-273) already has the correct pattern — it falls back to new_webhook_token. The renew_channel path should get the same treatment before MIN_VERSION_SUPPORTS_TOKEN_HASH is bumped to avoid breaking Google watch renewals. Since this is explicitly called out as a TODO for @hugocasademont, it's tracked.

8. Native trigger migration backfill prefix collision risk (very low)

The prefix JOIN at 20260313000000:7-10 could theoretically match the wrong token if two tokens share a 10-char prefix. Adding AND t.workspace_id = nt.workspace_id would make this more robust, though the 62^10 prefix space makes collisions negligible in practice.


Summary

The core security design is sound and most previous review issues have been addressed. The only remaining actionable bug is the short token panic (item 1) — a malformed Authorization: Bearer x header or a short superadmin secret will crash the server. This should be fixed before merge.


@hugocasa
Collaborator

hugocasa commented Mar 13, 2026

Manual Testing Checklist

Token basics

  • Create a token via Settings > Tokens > Add token — verify it works for API calls
  • List tokens — verify the token prefix shows correctly (10 chars)
  • Delete a token by prefix — verify it no longer authenticates
  • Login/logout — verify session token is created on login and invalidated on logout

Backward compatibility

  • Check the token table: newly created tokens should have token_hash and token_prefix populated
  • If MIN_VERSION is below 1.650.0 (rolling upgrade scenario), verify the plaintext token column is also populated
  • If MIN_VERSION is at or above 1.650.0, verify the plaintext token column is NULL
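The gating these checks exercise can be sketched as a pure function. This is an illustration under assumptions, not the PR's code: `parse_version` and `plaintext_column_value` are hypothetical names, the threshold uses the 1.650.0 value from the PR description, and treating an unparseable version as "old" is a conservative choice made for this sketch.

```rust
/// Parse "MAJOR.MINOR.PATCH" (optionally prefixed with 'v') into a tuple
/// that compares in version order.
fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let mut parts = v.trim_start_matches('v').split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Threshold taken from the PR description.
const MIN_VERSION_SUPPORTS_TOKEN_HASH: (u32, u32, u32) = (1, 650, 0);

/// Value to write into the legacy `token` column: `None` (NULL) once every
/// worker is at or above the threshold, the plaintext otherwise.
fn plaintext_column_value<'a>(token: &'a str, min_worker_version: &str) -> Option<&'a str> {
    match parse_version(min_worker_version) {
        Some(v) if v >= MIN_VERSION_SUPPORTS_TOKEN_HASH => None,
        // Older or unknown version: keep writing plaintext so old workers can still authenticate.
        _ => Some(token),
    }
}
```

Here `min_worker_version` stands for the lowest version among running workers. With "1.649.3" the function keeps the plaintext, and with "1.650.0" or later it returns `None`.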

Native triggers (requires Google or Nextcloud integration)

  • Create a native trigger — verify webhook works and receives events
  • Update a native trigger (change the script path) — verify webhook still works with the new token
  • Delete a native trigger — verify the webhook token is also cleaned up from the token table
  • Check native_trigger.webhook_token_hash is populated (not the old webhook_token_prefix)

Migration

  • Verify the migration runs cleanly on a DB with existing tokens
  • Check that token_hash and token_prefix columns are backfilled for all existing tokens
  • Verify the token_expiry_notification.token_hash column was renamed and existing values hashed

Edge cases

  • Verify superadmin secret still works for authentication
  • Verify workspace-scoped tokens still respect workspace boundaries
  • Verify expired tokens are still rejected

@hugocasa hugocasa marked this pull request as draft March 13, 2026 22:41
