CMR-11116 and CMR-11106: Improve error handling for launchpad tokens#2398
CMR-11116 and CMR-11106: Improve error handling for launchpad tokens#2398daniel-zamora merged 6 commits intomasterfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds cache eviction APIs and implementations, extends HTTP error-to-status mappings, refines Launchpad token validation/caching (SHA-256 keys, transient-error non-caching, eviction on expiration), updates mock Echo token behaviors, and adds/adjusts integration and unit tests. Changes
Sequence DiagramsequenceDiagram
participant Client
participant Cache as "Token Cache"
participant EDL as "EDL / Launchpad"
participant Mock as "Mock Echo"
Client->>Cache: validate(token)
alt Cached & valid
Cache-->>Client: return cached user
else Cached & error (non-transient)
Cache-->>Client: return cached error payload
else Not cached or expired
Cache->>Cache: evict(key)
Client->>EDL: request validation(token)
EDL-->>Mock: upstream response (200 / 401 / 408 / 429 / 500 / 502 / 503 / 504)
alt Success (200)
EDL-->>Cache: user data
Cache->>Cache: store user with expiration
Cache-->>Client: return user
else Transient (429,504)
EDL-->>Client: return transient error (do not cache)
else Non-transient (401,408,500,502,503)
EDL-->>Cache: error payload
Cache->>Cache: store error with expiration
Cache-->>Client: return error
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj (1)
98-109: Clarify test intent: verifies re-fetch after cache expiry, not cached error persistence.The test name suggests "Non-transient errors are cached for 5 minutes," but advancing time by 301 seconds exceeds the 5-minute (300s) cache TTL. At that point, the cache entry would have expired, so the second request is actually re-fetching from the mock (which still returns 401).
If the intent is to verify the error remains cached within the 5-minute window, consider advancing time by less than 300 seconds. If the intent is to verify behavior after cache expiry, the test name and comment should reflect that.
💡 If testing cached error persistence within TTL
- (dev-sys-util/advance-time! 301) + (dev-sys-util/advance-time! 200) ;; Still within 5-minute cache TTL🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj` around lines 98 - 109, The test non-transient-errors-are-cached-test is ambiguous: either change the TTL advance to stay within 5 minutes (use dev-sys-util/advance-time! 299) to verify the 401 stays cached and the second ingest/ingest-concept still hits the cache, or keep the 301s advance but rename the test (and any assertion text) to indicate "after cache expiry" and add an assertion using cache-util/list-cache-keys (or check token-key) that the cache entry is gone before the second ingest/ingest-concept; update the test name/comment accordingly and adjust assertions to match the chosen intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@common-lib/src/cmr/common/cache.clj`:
- Around line 41-43: Implement the missing evict method for each CmrCache
implementation: add an evict implementation to FallbackCache that calls evict on
both primary and fallback caches (or on the underlying cache instances
represented in the FallbackCache record), add an evict to
SingleThreadLookupCache that removes the key from its local lookup map and
delegates to the wrapped cache's evict, and add an evict to RedisCache that
performs the Redis deletion logic currently used for put/remove (using the
existing redis client/connection code) so it deletes the key from Redis; ensure
each method signature matches the CmrCache protocol (evict [cache cache-key])
and delegates to the appropriate existing functions/fields in the FallbackCache,
SingleThreadLookupCache, and RedisCache implementations.
In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj`:
- Around line 58-72: When a cached Launchpad entry expires and
get-launchpad-user-fn returns a non-valid result, the code currently throws the
error without re-caching it; change the expired-path so that on a non-valid
fresh-result you set the cache (using cache/set-value with the fresh-result)
before throwing the service error, mirroring the initial cache-miss behavior for
errors (use the same get-launchpad-user-fn handling and caching behavior), and
keep the existing errors/throw-service-error call to raise the error after
caching.
---
Nitpick comments:
In `@system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj`:
- Around line 98-109: The test non-transient-errors-are-cached-test is
ambiguous: either change the TTL advance to stay within 5 minutes (use
dev-sys-util/advance-time! 299) to verify the 401 stays cached and the second
ingest/ingest-concept still hits the cache, or keep the 301s advance but rename
the test (and any assertion text) to indicate "after cache expiry" and add an
assertion using cache-util/list-cache-keys (or check token-key) that the cache
entry is gone before the second ingest/ingest-concept; update the test
name/comment accordingly and adjust assertions to match the chosen intent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2f90ba1e-3569-4914-8eaa-2a22732dffa3
📒 Files selected for processing (10)
common-lib/src/cmr/common/api/errors.cljcommon-lib/src/cmr/common/cache.cljcommon-lib/src/cmr/common/cache/in_memory_cache.cljcommon-lib/test/cmr/common/test/cache/in_memory_cache.cljmock-echo-app/src/cmr/mock_echo/api/routes.cljmock-echo-app/src/cmr/mock_echo/api/urs.cljsystem-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.cljtransmit-lib/src/cmr/transmit/launchpad_user_cache.cljtransmit-lib/src/cmr/transmit/tokens.cljtransmit-lib/src/cmr/transmit/urs.clj
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
common-lib/src/cmr/common/cache.clj (1)
41-43:⚠️ Potential issue | 🟠 MajorFinish the
evictrollout before extendingCmrCache.
FallbackCache,SingleThreadLookupCache, andRedisCachestill implementcache/CmrCachewithout anevictmethod. Anycache/evictdispatch to one of those types will fail, so this protocol change is still incomplete.Run this to confirm every
CmrCacheimplementation picked up the new method. Expected result: noMISSING evictlines.#!/bin/bash set -euo pipefail rg -l --type clojure 'cache/CmrCache' common-lib/src redis-utils-lib/src | while read -r file; do if ! rg -q '\(evict\b' "$file"; then echo "MISSING evict: $file" fi done🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@common-lib/src/cmr/common/cache.clj` around lines 41 - 43, The protocol CmrCache was extended with an evict method but FallbackCache, SingleThreadLookupCache, and RedisCache do not implement it, causing dispatch failures; add an evict implementation to each of these types (implement the protocol method named evict / cache/evict) that matches their semantics: for FallbackCache delegate eviction to both/backing caches or the appropriate inner cache, for SingleThreadLookupCache remove the key from its local lookup store and delegate to its backing cache, and for RedisCache perform the appropriate key removal from Redis; after changes run the provided rg script to verify no "MISSING evict" results.
🧹 Nitpick comments (2)
system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj (1)
98-110: This test still doesn't prove the cached401was refreshed after 5 minutes.The second request returns
401whether the cache expired and revalidated or the stale error was served again. Capturing the cached:expiration-timebefore advancing time and asserting it moves forward afterward would make this test cover the timeout behavior it is naming.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj` around lines 98 - 110, The test non-transient-errors-are-cached-test should capture the cache entry's expiration timestamp before advancing time and assert that it is updated after the second ingest; locate the cache entry using the token-key variable and cache-util/list-cache-keys or a cache-read helper to fetch the cached map (which contains :expiration-time), store that :expiration-time in a local binding, call dev-sys-util/advance-time! 301 and perform the second ingest, then read the cache entry again and assert that the new :expiration-time is greater than the previously captured value to prove the entry was refreshed after expiry.mock-echo-app/src/cmr/mock_echo/api/urs.clj (1)
66-76: Consider using the stored:expires-invalue for consistency.The
:expires-inis stored in the atom on line 75 but never read back. Line 70 always uses the local constantexpires-ininstead of(:expires-in validation-info). While this works correctly for the current test token (always 1600 seconds), it would be more consistent to use the stored value:♻️ Proposed fix to use stored expires-in
(let [expires-in 1600 validation-info (get `@launchpad-token-validations` token)] (if validation-info - (let [elapsed-seconds (t/in-seconds (t/interval (:first-validated-at validation-info) (time-keeper/now)))] - (if (>= elapsed-seconds expires-in) + (let [stored-expires-in (:expires-in validation-info) + elapsed-seconds (t/in-seconds (t/interval (:first-validated-at validation-info) (time-keeper/now)))] + (if (>= elapsed-seconds stored-expires-in) {:status 401 :body {:error (format "Launchpad token (partially redacted) [%s] has expired." (common-util/scrub-token token))}} - {:status 200 :body {:uid "user1" :lp_token_expires_in (- expires-in elapsed-seconds)}})) + {:status 200 :body {:uid "user1" :lp_token_expires_in (- stored-expires-in elapsed-seconds)}}))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mock-echo-app/src/cmr/mock_echo/api/urs.clj` around lines 66 - 76, The code binds a local expires-in constant but then stores :expires-in in launchpad-token-validations and never reads it; update the logic in the branch that reads validation-info to use (:expires-in validation-info) instead of the local expires-in when comparing elapsed-seconds and when computing lp_token_expires_in, so the code consistently uses the stored :expires-in value from validation-info (the atom launchpad-token-validations and the map key :expires-in) rather than the hardcoded expires-in local.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj`:
- Around line 83-96: Update the assertions in transient-errors-not-cached-test
to match the new Launchpad messages emitted by
transmit-lib/src/cmr/transmit/urs.clj: replace the re-find checks for "Rate
limit exceeded" and "Gateway timeout" with patterns that match "Launchpad rate
limit exceeded" and "(gateway timeout)" (or use case-insensitive patterns like
(?i)rate limit exceeded and (?i)gateway timeout) so the ingest/ingest-concept
responses asserted in transient-errors-not-cached-test will match the current
URS messages.
In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj`:
- Around line 56-57: Currently cache-key is created with (keyword (str (hash
token))) which interns a keyword per token and exposes collision/memory-leak
risks; change cache-key to be a plain string (not a keyword) derived from the
token or, preferably, a collision-resistant digest/HMAC (e.g., SHA-256 or
HMAC-SHA256 of token) and use that string when calling cache/get-value with
get-launchpad-user-fn so tokens are not interned as keywords and collisions are
minimized.
- Around line 61-74: After evicting the stale entry, do not write a failing
revalidation result back into the cache; change the logic in the block using
cache/evict, get-launchpad-user-fn, cache/set-value, and
errors/throw-service-error so that if fresh-result is valid you set
cache/set-value and return it, but if fresh-result is invalid you do NOT call
cache/set-value and instead directly raise the error
(errors/throw-service-error) using the existing (:error-type/:error-message)
(with the scrubbed token message fallback); this preserves eviction on failed
revalidation instead of caching the failure.
---
Duplicate comments:
In `@common-lib/src/cmr/common/cache.clj`:
- Around line 41-43: The protocol CmrCache was extended with an evict method but
FallbackCache, SingleThreadLookupCache, and RedisCache do not implement it,
causing dispatch failures; add an evict implementation to each of these types
(implement the protocol method named evict / cache/evict) that matches their
semantics: for FallbackCache delegate eviction to both/backing caches or the
appropriate inner cache, for SingleThreadLookupCache remove the key from its
local lookup store and delegate to its backing cache, and for RedisCache perform
the appropriate key removal from Redis; after changes run the provided rg script
to verify no "MISSING evict" results.
---
Nitpick comments:
In `@mock-echo-app/src/cmr/mock_echo/api/urs.clj`:
- Around line 66-76: The code binds a local expires-in constant but then stores
:expires-in in launchpad-token-validations and never reads it; update the logic
in the branch that reads validation-info to use (:expires-in validation-info)
instead of the local expires-in when comparing elapsed-seconds and when
computing lp_token_expires_in, so the code consistently uses the stored
:expires-in value from validation-info (the atom launchpad-token-validations and
the map key :expires-in) rather than the hardcoded expires-in local.
In `@system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj`:
- Around line 98-110: The test non-transient-errors-are-cached-test should
capture the cache entry's expiration timestamp before advancing time and assert
that it is updated after the second ingest; locate the cache entry using the
token-key variable and cache-util/list-cache-keys or a cache-read helper to
fetch the cached map (which contains :expiration-time), store that
:expiration-time in a local binding, call dev-sys-util/advance-time! 301 and
perform the second ingest, then read the cache entry again and assert that the
new :expiration-time is greater than the previously captured value to prove the
entry was refreshed after expiry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e071585c-c7b1-4cf8-bfc2-3caf1c259ae8
📒 Files selected for processing (10)
common-lib/src/cmr/common/api/errors.cljcommon-lib/src/cmr/common/cache.cljcommon-lib/src/cmr/common/cache/in_memory_cache.cljcommon-lib/test/cmr/common/test/cache/in_memory_cache.cljmock-echo-app/src/cmr/mock_echo/api/routes.cljmock-echo-app/src/cmr/mock_echo/api/urs.cljsystem-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.cljtransmit-lib/src/cmr/transmit/launchpad_user_cache.cljtransmit-lib/src/cmr/transmit/tokens.cljtransmit-lib/src/cmr/transmit/urs.clj
🚧 Files skipped from review as they are similar to previous changes (1)
- common-lib/test/cmr/common/test/cache/in_memory_cache.clj
There was a problem hiding this comment.
🧹 Nitpick comments (1)
system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj (1)
100-112: Consider removing unusedtoken-keyvariable.The
token-keycomputed on line 103 is never used in the test assertions. If no cache key verification is intended, this variable can be removed to improve clarity.♻️ Proposed fix to remove unused variable
(deftest non-transient-errors-are-cached-test (testing "Non-transient errors are cached for 5 minutes" - (let [token "ABC-INV-ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" - token-key (DigestUtils/sha256Hex token)] + (let [token "ABC-INV-ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"] (let [resp (ingest/ingest-concept (data-umm-c/collection-concept {}) {:token token})] (is (= 401 (:status resp))))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj` around lines 100 - 112, The test non-transient-errors-are-cached-test computes token-key using (DigestUtils/sha256Hex token) but never uses it; remove the unused token-key binding to clean up the test, i.e., delete the token-key binding declaration that references DigestUtils/sha256Hex and keep the rest of the test (token, ingest/ingest-concept calls, cache-util/list-cache-keys check, and dev-sys-util/advance-time! invocation) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj`:
- Around line 100-112: The test non-transient-errors-are-cached-test computes
token-key using (DigestUtils/sha256Hex token) but never uses it; remove the
unused token-key binding to clean up the test, i.e., delete the token-key
binding declaration that references DigestUtils/sha256Hex and keep the rest of
the test (token, ingest/ingest-concept calls, cache-util/list-cache-keys check,
and dev-sys-util/advance-time! invocation) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 443d2468-6180-4ffb-89de-9406fc1346c5
📒 Files selected for processing (2)
system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.cljtransmit-lib/src/cmr/transmit/launchpad_user_cache.clj
5ef6ea1 to
4ee6718
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj`:
- Around line 29-31: The transient-error-types set currently only excludes
:too-many-requests and :gateway-timeout but upstream urs/get-launchpad-user can
throw :request-timeout, :internal-server-error, :bad-gateway, and
:service-unavailable; update the transient-error-types definition to include
those four symbols so the cache logic that treats other errors as invalid-token
results (e.g. the invalid-token-timeout/invalid-token caching path) will skip
caching for these transient server-side failures and let callers retry
immediately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0de632c-521c-49f9-9bbe-2174a428b74c
📒 Files selected for processing (10)
common-lib/src/cmr/common/api/errors.cljcommon-lib/src/cmr/common/cache.cljcommon-lib/src/cmr/common/cache/in_memory_cache.cljcommon-lib/test/cmr/common/test/cache/in_memory_cache.cljmock-echo-app/src/cmr/mock_echo/api/routes.cljmock-echo-app/src/cmr/mock_echo/api/urs.cljsystem-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.cljtransmit-lib/src/cmr/transmit/launchpad_user_cache.cljtransmit-lib/src/cmr/transmit/tokens.cljtransmit-lib/src/cmr/transmit/urs.clj
🚧 Files skipped from review as they are similar to previous changes (3)
- common-lib/src/cmr/common/cache/in_memory_cache.clj
- system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj
- common-lib/src/cmr/common/api/errors.clj
4ee6718 to
985b20b
Compare
985b20b to
81088e0
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #2398 +/- ##
==========================================
- Coverage 57.88% 57.84% -0.04%
==========================================
Files 1067 1067
Lines 73351 73440 +89
Branches 2124 2129 +5
==========================================
+ Hits 42462 42485 +23
- Misses 28911 28968 +57
- Partials 1978 1987 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
eereiter
left a comment
There was a problem hiding this comment.
I think the other CmrCaches should be updated with evict, even if they don't do anything. That way we avoid exceptions being thrown. Also there are a couple of other code rabbit suggestions. Please let me know if they are not valid and then I can approve this PR.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
transmit-lib/src/cmr/transmit/launchpad_user_cache.clj (1)
29-31:⚠️ Potential issue | 🟠 MajorExpand transient error types to include all server-side availability errors.
The
transient-error-typesset currently only includes:too-many-requestsand:gateway-timeout, buturs.cljnow throws additional server-side error types (:request-timeout,:internal-server-error,:bad-gateway,:service-unavailable) that represent temporary EDL/Launchpad outages. These should also be treated as transient to avoid caching valid tokens as failures during temporary service disruptions.Suggested fix to include all transient error types
(def transient-error-types "Error types that should not be cached - they are transient and should be retried immediately" - #{:too-many-requests :gateway-timeout}) + #{:request-timeout + :too-many-requests + :internal-server-error + :bad-gateway + :service-unavailable + :gateway-timeout})🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj` around lines 29 - 31, The transient-error-types set currently only contains :too-many-requests and :gateway-timeout; update the var transient-error-types to also include the additional server-side availability keywords thrown by urs.clj—:request-timeout, :internal-server-error, :bad-gateway, and :service-unavailable—so these error kinds are treated as transient (not cached) during EDL/Launchpad outages; modify the literal set in transmit-lib/src/cmr/transmit/launchpad_user_cache.clj to include those symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj`:
- Around line 29-31: The transient-error-types set currently only contains
:too-many-requests and :gateway-timeout; update the var transient-error-types to
also include the additional server-side availability keywords thrown by
urs.clj—:request-timeout, :internal-server-error, :bad-gateway, and
:service-unavailable—so these error kinds are treated as transient (not cached)
during EDL/Launchpad outages; modify the literal set in
transmit-lib/src/cmr/transmit/launchpad_user_cache.clj to include those symbols.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4b11b2ce-999c-4119-80d4-b10627f3bf20
📒 Files selected for processing (6)
common-lib/src/cmr/common/cache/fallback_cache.cljcommon-lib/src/cmr/common/cache/single_thread_lookup_cache.cljmock-echo-app/src/cmr/mock_echo/api/urs.cljredis-utils-lib/src/cmr/redis_utils/redis_cache.cljsystem-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.cljtransmit-lib/src/cmr/transmit/launchpad_user_cache.clj
9d2eacb to
8bbc2aa
Compare
8bbc2aa to
65e6cea
Compare
…2398) * CMR-11116: fixes launchpad invalid token timeout * CMR-11106: adds error propagation from edl for launchpad tokens * CMR-11116: adds tests for launchpad token caching behavior * CMR-11116: pr comments * CMR-11116: pr nitpicks * CMR-11116: pr nitpicks
Overview
What is the objective?
Fix launchpad token timeout behavior and improve error propagation from EDL (Earthdata Login) to provide more actionable error messages to users.
What are the changes?
What areas of the application does this impact?
Required Checklist
Additional Checklist
Summary by CodeRabbit
New Features
Tests