
DRAFT Cmr 10838 erich #2397

Closed
eereiter wants to merge 7 commits into master from CMR-10838-Erich

Conversation

@eereiter
Contributor

@eereiter eereiter commented Mar 9, 2026

Overview

What is the objective?

Please summarize the reason for the changes made in this PR.

What are the changes?

Summarize what you have changed.

What areas of the application does this impact?

List impacted applications

Required Checklist

  • New and existing unit and int tests pass locally and remotely
  • clj-kondo has been run locally and all errors in changed files are corrected
  • I have commented my code, particularly in hard-to-understand areas
  • I have made changes to the documentation (if necessary)
  • My changes generate no new warnings

Additional Checklist

  • I have removed unnecessary/dead code and imports in files I have changed
  • I have cleaned up integration tests by doing one or more of the following:
    • migrated any are2 tests to are3 in files I have changed
    • de-duped, consolidated, removed dead int tests
    • transformed applicable int tests into unit tests
    • reduced the number of system state resets by updating fixtures, e.g., changing (use-fixtures :each (ingest/reset-fixture {})) to use :once instead of :each

Summary by CodeRabbit

  • Refactor

    • Granule counts are now served from a caching layer for faster, more consistent responses; provider filtering is applied after retrieval.
  • Bug Fixes / Security

    • Zip extraction now rejects entries that would escape the target directory (prevents path‑traversal).
    • Upstream/auth error types now map to appropriate HTTP status codes for clearer error responses.
  • Chores

    • Background refresh for the has-granules map standardized as a scheduled job and simplified.
    • Cache implementations gained single-key eviction support for targeted invalidation.

@coderabbitai

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

Removed direct Elasticsearch collection-granule-count retrieval and switched callers to use the granule-counts-cache API; added cache eviction across multiple cache implementations; hardened Launchpad token handling and error mappings; removed split-cluster logging/migration paths; added zip-entry directory validation for shapefile extraction and corresponding tests.

Changes

Cohort / File(s) Summary
Granule counts refactor
search-app/src/cmr/search/data/elastic_search_index.clj, search-app/src/cmr/search/data/granule_counts_cache.clj, search-app/src/cmr/search/services/query_execution/has_granules_or_cwic_results_feature.clj, search-app/src/cmr/search/services/query_execution/has_granules_results_feature.clj, search-app/src/cmr/search/services/query_service.clj, search-app/test/cmr/search/test/unit/data/granule_counts_cache_test.clj
Deleted ES-side get-collection-granule-counts; callers now use granule-counts-cache/get-granule-counts and client-side filtering; updated function arities, simplified cache refresh logging, and adjusted tests to match new callback arity.
Cache protocol + implementations + tests
common-lib/src/cmr/common/cache.clj, common-lib/src/cmr/common/cache/cache_spec.clj, common-lib/src/cmr/common/cache/fallback_cache.clj, common-lib/src/cmr/common/cache/in_memory_cache.clj, common-lib/src/cmr/common/cache/single_thread_lookup_cache.clj, common-lib/test/cmr/common/test/cache/in_memory_cache.clj, redis-utils-lib/src/cmr/redis_utils/redis_cache.clj, redis-utils-lib/test/cmr/redis_utils/test/test_redis_cache.clj, transmit-lib/src/cmr/transmit/cache/consistent_cache.clj
Added evict to CmrCache protocol; implemented evict in multiple cache records (in-memory, fallback, single-thread wrapper, Redis, consistent memory); added/updated unit tests validating eviction behavior.
Launchpad / EDL token handling & caching
transmit-lib/src/cmr/transmit/launchpad_user_cache.clj, transmit-lib/src/cmr/transmit/tokens.clj, transmit-lib/src/cmr/transmit/urs.clj, mock-echo-app/src/cmr/mock_echo/api/urs.clj, mock-echo-app/src/cmr/mock_echo/api/routes.clj, system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj
Changed cache key hashing to SHA-256, introduced transient-error-types, added eviction-on-expiration and non-caching for transient errors, expanded EDL/URS error-type→HTTP mappings including 429 handling, added token validation simulation state and reset helper in mock-echo, and updated integration tests for new behaviors.
Split-cluster/logging and index-set cleanup
elastic-utils-lib/src/cmr/elastic_utils/config.clj, elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj, elastic-utils-lib/src/cmr/elastic_utils/search/es_index.clj, indexer-app/src/cmr/indexer/data/elasticsearch.clj, indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj
Removed split-cluster-log-toggle config and conditional logging; removed legacy get-old-index-set migration path and related transitional branches; minor local rebinding in send-with-retry.
Shapefile unzip safety
search-app/src/cmr/search/services/parameters/converters/shapefile.clj, search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj
Added validate-entry-dir to assert zip entries cannot escape target directory; integrated into unzip-file and added tests covering malicious and valid entries.
API error mapping
common-lib/src/cmr/common/api/errors.clj
Extended type->http-status-code to include :request-timeout (408), :internal-server-error (500), and :bad-gateway (502).
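As a rough illustration of the error-mapping change above, the new entries can be sketched as a plain Java map (a hypothetical analogue; the actual code is a Clojure map named type->http-status-code in cmr.common.api.errors, and only the entries named above are shown):

```java
import java.util.Map;

public class ErrorStatus {
    // Hypothetical Java analogue of the extended Clojure mapping.
    static final Map<String, Integer> TYPE_TO_STATUS = Map.of(
        "request-timeout", 408,       // upstream took too long to respond
        "internal-server-error", 500, // unexpected upstream failure
        "bad-gateway", 502);          // invalid response from upstream

    public static void main(String[] args) {
        System.out.println(TYPE_TO_STATUS.get("bad-gateway")); // prints 502
    }
}
```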

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

hacktoberfest-accepted

Suggested reviewers

  • DuJuan
  • jmaeng72
  • ygliuvt

Poem

🐇 I hopped through caches, hashes, and logs,
Evicted the crumbs, fixed zips and the frogs,
Tokens now age with a sigh and a thump,
Counts come from one place — no more double-jump,
Hooray for small fixes — time for a carrot chomp! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Description check (⚠️ Warning): The description is entirely a template with no concrete information filled in; no objectives, changes, impacted areas, or checklist items are addressed with specific details. Resolution: complete all required sections with specific details: explain the objective (why these changes?), summarize the concrete changes made, list which applications are impacted, and address each checklist item with evidence.
  • Title check (❓ Inconclusive): The title is vague and non-descriptive, using a draft prefix and a generic branch identifier rather than summarizing the actual changes made in the pull request. Resolution: replace the title with a clear, concise description of the main changes; for example, if the primary objective is removing legacy code or refactoring, state that explicitly rather than using draft/branch references.

✅ Passed checks (1 passed)

  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.


@eereiter eereiter changed the title from "Cmr 10838 erich" to "DRAFT Cmr 10838 erich" on Mar 9, 2026

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
search-app/src/cmr/search/data/granule_counts_cache.clj (2)

83-91: Consider using a set for provider-id lookups.

The current some #(= ... %) provider-ids is O(n) per collection entry. Converting provider-ids to a set would make lookups O(1), which could matter if the counts map is large.

♻️ Optional optimization
   ([context provider-ids get-collection-granule-counts-fn]
    (let [cache (cache/context->cache context granule-counts-cache-key)
          counts (cache/get-value cache
                                  granule-counts-cache-key
                                  #(get-collection-granule-counts-fn context))]
      (if (seq provider-ids)
-       (into {} (filter (fn [[k _]]
-                          (some #(= (concepts/concept-id->provider-id k) %) provider-ids))
-                        counts))
+       (let [provider-id-set (set provider-ids)]
+         (into {} (filter (fn [[k _]]
+                            (provider-id-set (concepts/concept-id->provider-id k)))
+                          counts)))
        counts))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-app/src/cmr/search/data/granule_counts_cache.clj` around lines 83 -
91, The provider-ids list is used for repeated membership checks inside the
filter, causing O(n) work per entry; convert provider-ids to a set once (e.g.,
provider-ids-set via set) and replace the some #(=
(concepts/concept-id->provider-id k) %) call with a constant-time membership
test like (contains? provider-ids-set (concepts/concept-id->provider-id k)) in
the filter that produces the returned map (referencing the local symbols
provider-ids, counts, and concepts/concept-id->provider-id).

45-62: Move the extra-collection-count check to execute once per provider, not per collection.

The do block checks extra-collection-count > 0 on every coll-bucket iteration, but the value is constant within a provider. This repeats the check N times (once per collection) instead of once per provider. Move the check into the :let binding immediately after computing extra-collection-count.

♻️ Proposed refactor
     (into {} (for [provider-bucket (get-in results [:aggregations :by-provider :buckets])
-                   :let [extra-collection-count (get-in provider-bucket [:by-collection-id :sum_other_doc_count])]
+                   :let [extra-collection-count (get-in provider-bucket [:by-collection-id :sum_other_doc_count])
+                         provider-id (:key provider-bucket)
+                         _ (when (> extra-collection-count 0)
+                             (e/internal-error!
+                              (format "Provider %s has more collections ([%s]) with granules than we support"
+                                      provider-id extra-collection-count)))]
                    coll-bucket (get-in provider-bucket [:by-collection-id :buckets])
-                   :let [provider-id (:key provider-bucket)
-                         coll-seq-id (:key coll-bucket)
+                   :let [coll-seq-id (:key coll-bucket)
                          num-granules (:doc_count coll-bucket)]]
-               (do
-                 ;; It's possible that there are more collections in the provider than we expected.
-                 ;; :sum_other_doc_count will be greater than 0 in that case.
-                 (when (> extra-collection-count 0)
-                   (e/internal-error!
-                    (format "Provider %s has more collections ([%s]) with granules than we support"
-                            provider-id extra-collection-count)))
-
-                 [(concepts/build-concept-id {:sequence-number coll-seq-id
-                                              :provider-id provider-id
-                                              :concept-type :collection})
-                  num-granules])))))
+               [(concepts/build-concept-id {:sequence-number coll-seq-id
+                                            :provider-id provider-id
+                                            :concept-type :collection})
+                num-granules]))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-app/src/cmr/search/data/granule_counts_cache.clj` around lines 45 -
62, The extra-collection-count check is performed inside the per-collection do
block, causing it to run for every coll-bucket; move that check to run once per
provider by performing it immediately after computing extra-collection-count in
the :let for provider-bucket (instead of inside the inner do/when). Concretely:
remove the when/e/internal-error! from inside the coll-bucket body and add a
single when/e/internal-error! guarded by extra-collection-count in the
provider-level :let (the same scope where extra-collection-count and provider-id
are computed) so the error is raised once per provider; leave the
concepts/build-concept-id and num-granules mapping for each coll-bucket
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e445e653-ea45-4a64-902a-b722b1f08ac2

📥 Commits

Reviewing files that changed from the base of the PR and between 8c8ca5f and 4b727e1.

📒 Files selected for processing (2)
  • search-app/src/cmr/search/data/granule_counts_cache.clj
  • search-app/test/cmr/search/test/unit/data/granule_counts_cache_test.clj

@codecov-commenter

codecov-commenter commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 4.58015% with 125 lines in your changes missing coverage. Please review.
✅ Project coverage is 24.12%. Comparing base (22c44cb) to head (024ff64).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
transmit-lib/src/cmr/transmit/urs.clj 0.00% 35 Missing ⚠️
...h-app/src/cmr/search/data/granule_counts_cache.clj 0.00% 30 Missing ⚠️
...smit-lib/src/cmr/transmit/launchpad_user_cache.clj 8.00% 23 Missing ⚠️
mock-echo-app/src/cmr/mock_echo/api/urs.clj 15.78% 16 Missing ⚠️
...tils-lib/src/cmr/elastic_utils/search/es_index.clj 0.00% 4 Missing ⚠️
...earch/services/parameters/converters/shapefile.clj 20.00% 4 Missing ⚠️
...s/query_execution/has_granules_results_feature.clj 0.00% 3 Missing ⚠️
mock-echo-app/src/cmr/mock_echo/api/routes.clj 0.00% 2 Missing ⚠️
...edis-utils-lib/src/cmr/redis_utils/redis_cache.clj 0.00% 2 Missing ⚠️
...it-lib/src/cmr/transmit/cache/consistent_cache.clj 0.00% 2 Missing ⚠️
... and 4 more

❗ There is a different number of reports uploaded between BASE (22c44cb) and HEAD (024ff64).

HEAD has 23 fewer uploads than BASE:
  • BASE (22c44cb): 24 uploads
  • HEAD (024ff64): 1 upload
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2397       +/-   ##
===========================================
- Coverage   57.88%   24.12%   -33.77%     
===========================================
  Files        1067     1007       -60     
  Lines       73351    69938     -3413     
  Branches     2124      978     -1146     
===========================================
- Hits        42462    16873    -25589     
- Misses      28911    52087    +23176     
+ Partials     1978      978     -1000     


daniel-zamora and others added 4 commits March 19, 2026 08:34
…2398)

* CMR-11116: fixes launchpad invalid token timeout

* CMR-11106: adds error propagation from edl for launchpad tokens

* CMR-11116: adds tests for launchpad token caching behavior

* CMR-11116: pr comments

* CMR-11116: pr nitpicks

* CMR-11116: pr nitpicks
* logic

* add new line

* update test

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
search-app/src/cmr/search/services/parameters/converters/shapefile.clj (1)

65-77: ⚠️ Potential issue | 🟠 Major

Recursively delete the temporary directory when extraction fails.

The File.delete() method (line 76) silently fails on non-empty directories—it returns false rather than throwing an exception. If extraction partially succeeds before an exception is raised, already-extracted files remain behind in the temporary directory, causing a resource leak.

The fix uses file-seq to recursively walk the directory tree and reverse to delete files before their parent directories:

Suggested fix
       (catch Exception e
-        (.delete (.toFile target-dir))
+        (run! io/delete-file (reverse (file-seq (.toFile target-dir))))
         (errors/throw-service-error :bad-request (str "Error while uncompressing zip file: " (.getMessage e))))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-app/src/cmr/search/services/parameters/converters/shapefile.clj`
around lines 65 - 77, The catch block currently calls (.delete (.toFile
target-dir)) which silently fails on non-empty directories; change it to
recursively delete the temporary extraction directory by walking the tree from
(.toFile target-dir) using file-seq, reversing the sequence so files are removed
before their parent directories, and deleting each File (so partially-extracted
files are removed on error). Keep the rest of the try/catch and still throw the
same service error via errors/throw-service-error; reference the existing
symbols target-dir and (.toFile target-dir) and the surrounding extraction logic
(validate-entry-dir, io/copy) when making the change.
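The recursive-delete approach this comment describes (walk the tree, delete deepest paths first so each directory is empty when its turn comes) can be sketched in plain Java; the class and method names here are illustrative, not from the PR:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class RecursiveDelete {
    // Deletes dir and everything under it. Sorting the walked paths in
    // reverse order guarantees files are deleted before their parent
    // directories, so File.delete never hits a non-empty directory.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a partially extracted temp directory, then clean it up.
        Path dir = Files.createTempDirectory("unzip-demo");
        Files.createDirectories(dir.resolve("nested"));
        Files.writeString(dir.resolve("nested/file.txt"), "partial extraction");
        deleteRecursively(dir);
        System.out.println(Files.exists(dir)); // prints false
    }
}
```

This mirrors what the suggested Clojure fix achieves with (reverse (file-seq ...)): file-seq yields parents before children, so reversing it produces the same children-first deletion order.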
🧹 Nitpick comments (2)
redis-utils-lib/src/cmr/redis_utils/redis_cache.clj (1)

88-91: Pass s-key to wcar* here too.

wcar* uses its first argument for per-key logging, and the other write paths in this record pass the serialized Redis key. Using the raw EDN key just for evict makes invalidation logs/metrics inconsistent with set-value and reset.

♻️ Suggested change
   (evict
     [_this key]
     (let [s-key (serialize key)]
-      (wcar* key false primary-connection (carmine/del s-key))))
+      (wcar* s-key false primary-connection (carmine/del s-key))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@redis-utils-lib/src/cmr/redis_utils/redis_cache.clj` around lines 88 - 91,
The evict implementation is calling (wcar* key false primary-connection
(carmine/del s-key)) which passes the raw EDN key for per-key logging; change
the first argument to the serialized key so logging/metrics match other write
paths. In the evict function (method name evict, helper serialize and wcar*),
pass s-key as the first parameter to wcar* while keeping (carmine/del s-key) as
the command so the call becomes wcar* s-key false primary-connection
(carmine/del s-key).
system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj (1)

119-120: Derive the expiry jump from config.

Hard-coding 301 bakes the default five-minute invalid-token timeout into the test. Using transmit-config/invalid-token-timeout keeps this assertion aligned if the cache window is tuned later.

Suggested change
-      (dev-sys-util/advance-time! 301)
+      (dev-sys-util/advance-time!
+       (inc (* 60 (transmit-config/invalid-token-timeout))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj`
around lines 119 - 120, Replace the hard-coded 301 advance in the test by
deriving the jump from the configured timeout: use
transmit-config/invalid-token-timeout to compute the advance amount (e.g.
timeout + 1) and call dev-sys-util/advance-time! with that value so the test
remains correct if invalid-token-timeout changes; update the reference in the
test where dev-sys-util/advance-time! is invoked to use the computed value
instead of 301.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:

In `@elastic-utils-lib/src/cmr/elastic_utils/search/es_index.clj`:
- Around line 179-182: Remove the four debug global defs that leak state (the
lines that do (def context context), (def index-info index-info), (def query
query), (def max-retries max-retries)); instead rely on the function parameters
or create local bindings with let if you need renamed locals, and ensure the
recursive retry call (the function that calls itself on retry) uses the
original/local bindings passed into the function rather than any global
vars—delete the debug def statements to avoid race conditions and namespace
pollution and, if necessary, wrap any temporary names in a let around the
existing logic so retries use the correct per-call values.

In `@search-app/src/cmr/search/services/parameters/converters/shapefile.clj`:
- Around line 55-59: The current validate-entry-dir uses string-based startsWith
on canonical paths which permits sibling escapes; change it to perform a
path-aware containment check using java.nio.file.Path APIs: obtain the canonical
Path for the target directory and the entry (use Path.of / .toPath on the
canonical File values), then verify that entryPath.startsWith(targetPath) (or
use targetPath.relativize and ensure it does not traverse up) before allowing
extraction; update the validate-entry-dir function and the extraction logic that
calls it (the canonical path computation around (.getCanonicalPath (File.
(.toString target-dir) (str entry))) and the check against (.getCanonicalPath
(.toFile target-dir))) to use Path-based containment instead of
String.startsWith.

In
`@search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj`:
- Around line 122-143: The test is asserting the wrong error for absolute paths
and misses a sibling-prefix traversal case; update the second test in
shapefile.clj that calls shapefile/unzip-file to expect the validation error
message "Given zip content is not allowed" (instead of the filesystem "No such
file or directory") and add a new unit test asserting that validate-entry-dir
throws the same validation error for a sibling-prefix traversal entry like
"../<target-dir-name>evil/..." (ensure you call validate-entry-dir directly and
assert thrown-with-msg? Exception #"Given zip content is not allowed") so the
raw .startsWith vulnerability is covered and prevented.

In `@transmit-lib/src/cmr/transmit/cache/consistent_cache.clj`:
- Around line 122-125: The evict implementation should remove the shared hash
entry before the local memory entry to minimize inconsistency window: in the
evict function (consistent_cache.clj) call (c/evict hash-cache
(key->hash-cache-key key)) before calling (c/evict memory-cache key) so the
shared hash is invalidated first; update the evict body in the function named
evict accordingly, preserving existing semantics and error handling.

In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj`:
- Around line 29-31: The transient-error-types set currently only lists
:too-many-requests and :gateway-timeout but must also treat URS upstream
failures as transient; update the definition of transient-error-types (and the
similar set used at lines 47-56) to include :request-timeout,
:internal-server-error, :bad-gateway, and :service-unavailable so that
cmr.transmit.urs/get-launchpad-user and related caching logic will not store a
five-minute invalid-token entry for temporary Launchpad/EDL outages.
- Around line 63-74: After evicting the stale entry, the code currently throws
on a non-transient invalid fresh-result without storing it back into the cache,
causing repeated EDL calls for the same bad token; update the branch in the
block around get-launchpad-user-fn so that when fresh-result is invalid (i.e.,
(:valid fresh-result) is false) you still call cache/set-value with the same
cache and cache-key to record the failure result (fresh-result) before invoking
errors/throw-service-error; reference symbols to update: cache/evict,
get-launchpad-user-fn, fresh-result, cache/set-value,
errors/throw-service-error, common-util/scrub-token, cache-key and token.
- Around line 58-59: The code currently interns SHA-256 hashes by creating
keywords for cache keys (cache-key = (keyword (DigestUtils/sha256Hex token))),
which leaks memory; change cache-key to use the hex string directly (e.g., call
DigestUtils/sha256Hex on token and pass that string into cache/get-value with
get-launchpad-user-fn) so you stop creating keywords for each token and avoid
filling the keyword intern table.
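The Path-based containment check recommended for validate-entry-dir above can be sketched in plain Java (the helper name is hypothetical; the PR's actual implementation is Clojure calling these same java.nio.file APIs). Component-wise Path.startsWith avoids the sibling-prefix escape that raw String.startsWith permits, e.g. /tmp/extractevil passing a check against /tmp/extract:

```java
import java.nio.file.Path;

public class ZipEntryGuard {
    // Returns true only when the zip entry, resolved against the target
    // directory and lexically normalized (".." collapsed), remains inside
    // that directory. Path.startsWith compares whole path components, so
    // a sibling directory sharing a name prefix does not slip through.
    static boolean entryWithinTarget(Path targetDir, String entryName) {
        Path target = targetDir.toAbsolutePath().normalize();
        Path resolved = target.resolve(entryName).normalize();
        return resolved.startsWith(target);
    }

    public static void main(String[] args) {
        Path target = Path.of("/tmp/extract");
        System.out.println(entryWithinTarget(target, "shapes/poly.shp"));  // prints true
        System.out.println(entryWithinTarget(target, "../../etc/passwd")); // prints false
        System.out.println(entryWithinTarget(target, "../extractevil/x")); // prints false
    }
}
```

The last case is the sibling-prefix traversal the test-file comment asks to cover: a string comparison of "/tmp/extractevil/x" against "/tmp/extract" would wrongly succeed, while the component-wise check rejects it.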



ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1267cade-6f57-4aee-a535-00ef0e2a8a3d

📥 Commits

Reviewing files that changed from the base of the PR and between 4b727e1 and 024ff64.

📒 Files selected for processing (23)
  • common-lib/src/cmr/common/api/errors.clj
  • common-lib/src/cmr/common/cache.clj
  • common-lib/src/cmr/common/cache/cache_spec.clj
  • common-lib/src/cmr/common/cache/fallback_cache.clj
  • common-lib/src/cmr/common/cache/in_memory_cache.clj
  • common-lib/src/cmr/common/cache/single_thread_lookup_cache.clj
  • common-lib/test/cmr/common/test/cache/in_memory_cache.clj
  • elastic-utils-lib/src/cmr/elastic_utils/config.clj
  • elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj
  • elastic-utils-lib/src/cmr/elastic_utils/search/es_index.clj
  • indexer-app/src/cmr/indexer/data/elasticsearch.clj
  • indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj
  • mock-echo-app/src/cmr/mock_echo/api/routes.clj
  • mock-echo-app/src/cmr/mock_echo/api/urs.clj
  • redis-utils-lib/src/cmr/redis_utils/redis_cache.clj
  • redis-utils-lib/test/cmr/redis_utils/test/test_redis_cache.clj
  • search-app/src/cmr/search/services/parameters/converters/shapefile.clj
  • search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj
  • system-int-test/test/cmr/system_int_test/misc/launchpad_user_cache.clj
  • transmit-lib/src/cmr/transmit/cache/consistent_cache.clj
  • transmit-lib/src/cmr/transmit/launchpad_user_cache.clj
  • transmit-lib/src/cmr/transmit/tokens.clj
  • transmit-lib/src/cmr/transmit/urs.clj
💤 Files with no reviewable changes (3)
  • elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj
  • elastic-utils-lib/src/cmr/elastic_utils/config.clj
  • indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj
✅ Files skipped from review due to trivial changes (1)
  • redis-utils-lib/test/cmr/redis_utils/test/test_redis_cache.clj

Comment on lines +80 to +83
(evict
[_this key]
(cache/evict primary-cache key)
(cache/evict backup-cache key))

⚠️ Potential issue | 🟠 Major

Evict the backup before the primary.

get-value repopulates primary-cache from backup-cache on a miss. With the current order, a concurrent read between the two calls can miss in primary-cache, read the stale value from backup-cache, and write it back into primary-cache before backup eviction runs. That leaves the entry present again after evict returns.

🐛 Suggested change
   (evict
     [_this key]
-    (cache/evict primary-cache key)
-    (cache/evict backup-cache key))
+    (cache/evict backup-cache key)
+    (cache/evict primary-cache key))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common-lib/src/cmr/common/cache/fallback_cache.clj` around lines 80 - 83, The
current evict implementation in the evict function evicts primary-cache before
backup-cache, which allows a race with get-value that can repopulate primary
from backup; change the eviction order to evict backup-cache first and then
primary-cache (i.e., call cache/evict on backup-cache before calling cache/evict
on primary-cache) to ensure the backup is removed prior to any possible
repopulation into primary.
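The repopulation race is easier to see in runnable form. Below is a minimal Python analogy of the fallback cache (not the CMR Clojure code; `FallbackCache`, `primary`, and `backup` are illustrative names) showing the backup-first eviction order that closes the window:

```python
class FallbackCache:
    """Two-tier cache sketch: a fast primary tier backed by a durable backup tier."""

    def __init__(self):
        self.primary = {}  # fast tier, repopulated on a miss
        self.backup = {}   # durable tier

    def get_value(self, key, lookup_fn=None):
        # On a primary miss, fall back to backup and repopulate primary.
        if key in self.primary:
            return self.primary[key]
        if key in self.backup:
            self.primary[key] = self.backup[key]  # the repopulation path
            return self.primary[key]
        if lookup_fn is not None:
            value = lookup_fn()
            self.primary[key] = value
            self.backup[key] = value
            return value
        return None

    def evict(self, key):
        # Backup first: once it is gone, a concurrent get_value can no
        # longer copy a stale value back into primary mid-eviction.
        self.backup.pop(key, None)
        self.primary.pop(key, None)
```

With the reversed (primary-first) order, a reader running between the two pops could re-insert the stale backup value into `primary` and survive the evict.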

Comment on lines +179 to +182
(def context context)
(def index-info index-info)
(def query query)
(def max-retries max-retries)

⚠️ Potential issue | 🔴 Critical

Critical: Remove debug def statements - causes race conditions and global namespace pollution.

Using def inside a function does not create local bindings—it creates or mutates global vars in the namespace. This is a well-known Clojure anti-pattern that causes:

  1. Race conditions: Concurrent ES queries will overwrite these global vars, corrupting each other's context, index-info, query, and max-retries values.
  2. Namespace pollution: These vars persist globally after the function returns.
  3. Non-deterministic behavior: The recursive retry call at line 201 will use corrupted values if another thread modified the globals.

The PR commit message mentions "Sending up a debug change"—this debug code must be removed before merging.

🐛 Proposed fix: Remove debug statements
 (defn- do-send-with-retry
   "Sends a query to ES, either normal or using a scroll query."
   [context index-info query max-retries]
-  (def context context)
-  (def index-info index-info)
-  (def query query)
-  (def max-retries max-retries)
   (try
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
(def context context)
(def index-info index-info)
(def query query)
(def max-retries max-retries)
(defn- do-send-with-retry
"Sends a query to ES, either normal or using a scroll query."
[context index-info query max-retries]
(try
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elastic-utils-lib/src/cmr/elastic_utils/search/es_index.clj` around lines 179
- 182, Remove the four debug global defs that leak state (the lines that do (def
context context), (def index-info index-info), (def query query), (def
max-retries max-retries)); instead rely on the function parameters or create
local bindings with let if you need renamed locals, and ensure the recursive
retry call (the function that calls itself on retry) uses the original/local
bindings passed into the function rather than any global vars—delete the debug
def statements to avoid race conditions and namespace pollution and, if
necessary, wrap any temporary names in a let around the existing logic so
retries use the correct per-call values.

Comment on lines +55 to +59
(defn validate-entry-dir
  [target-dir entry]
  (let [canonical (.getCanonicalPath (File. (.toString target-dir) (str entry)))]
    (when-not (.startsWith canonical (.getCanonicalPath (.toFile target-dir)))
      (throw (Exception. "Given zip content is not allowed")))))

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -path "*/search-app/src/cmr/search/services/parameters/converters/shapefile.clj" -type f

Repository: nasa/Common-Metadata-Repository

Length of output: 147


🏁 Script executed:

cat -n search-app/src/cmr/search/services/parameters/converters/shapefile.clj | sed -n '45,85p'

Repository: nasa/Common-Metadata-Repository

Length of output: 2103


🏁 Script executed:

# Test the specific vulnerability scenario with Java canonical paths
python3 << 'EOF'
import os
import tempfile
from pathlib import Path

# Simulate the scenario described in the review
# Temp dir: /tmp/Shapes123
# Entry: ../Shapes1234/escape.shp

# Create a test scenario
with tempfile.TemporaryDirectory(prefix="Shapes123_") as tmpdir:
    base_dir = tmpdir  # e.g., /tmp/Shapes123_xyz
    print(f"Base directory: {base_dir}")
    
    # The entry path that attempts to escape
    entry = "../Shapes1234/escape.shp"
    
    # Simulate what happens when constructing a File with base + entry
    # In Java: new File(base_dir, entry)
    test_path = os.path.join(base_dir, entry)
    
    # Normalize it (similar to .getCanonicalPath())
    canonical = os.path.normpath(test_path)
    print(f"Entry: {entry}")
    print(f"Constructed path: {test_path}")
    print(f"Canonical path: {canonical}")
    
    # Check if canonical path is outside base dir
    # This is the vulnerability: string prefix check
    canonical_base = os.path.normpath(base_dir)
    print(f"Canonical base: {canonical_base}")
    print(f"Does canonical start with base string? {canonical.startswith(canonical_base)}")
    
    # But what about the sibling directory issue?
    # If base_dir is /tmp/tmpXXX_Shapes123, and we escape to ../Shapes1234
    # We'd end up in /tmp/Shapes1234 which does NOT start with /tmp/tmpXXX_Shapes123
    
    # Let's test with a more realistic scenario matching the review
    # where the directory names have similar prefixes
    base_similar = "/tmp/Shapes123"
    escaped = os.path.join(base_similar, entry)
    canonical_escaped = os.path.normpath(escaped)
    print(f"\nRealistic scenario:")
    print(f"Base: {base_similar}")
    print(f"Escaped path: {escaped}")
    print(f"Canonical: {canonical_escaped}")
    print(f"Does canonical start with base? {canonical_escaped.startswith(base_similar)}")
    
    # Ah! If we go to /tmp/Shapes1234, it DOES start with /tmp/Shapes
    # But not with /tmp/Shapes123 (with trailing slash issue?)
    print(f"\nPrefix test details:")
    print(f"  {canonical_escaped}.startswith({base_similar}) = {canonical_escaped.startswith(base_similar)}")
    print(f"  {canonical_escaped}.startswith({base_similar + os.sep}) = {canonical_escaped.startswith(base_similar + os.sep)}")

EOF

Repository: nasa/Common-Metadata-Repository

Length of output: 662


🏁 Script executed:

# Read the complete validate-entry-dir function and surrounding context
cat -n search-app/src/cmr/search/services/parameters/converters/shapefile.clj | sed -n '50,80p'

Repository: nasa/Common-Metadata-Repository

Length of output: 1735


The zip-slip guard is bypassable through sibling path directories due to string-based prefix validation.

The current implementation at lines 57-58 uses String.startsWith() on canonical paths, which cannot distinguish between containment and string prefix matches. For example, if the temp directory is /tmp/Shapes123, an entry like ../Shapes1234/escape.shp canonicalizes to /tmp/Shapes1234/escape.shp, which passes the validation because it starts with the string /tmp/Shapes123 but is actually outside the intended extraction root.

This affects the extraction logic at lines 67-73 as well, where validate-entry-dir is called after the file path is already constructed.

Replace the string-based prefix check with path-aware comparison using the java.nio.file.Path API, which correctly handles path containment semantics and prevents sibling directory escapes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-app/src/cmr/search/services/parameters/converters/shapefile.clj`
around lines 55 - 59, The current validate-entry-dir uses string-based
startsWith on canonical paths which permits sibling escapes; change it to
perform a path-aware containment check using java.nio.file.Path APIs: obtain the
canonical Path for the target directory and the entry (use Path.of / .toPath on
the canonical File values), then verify that entryPath.startsWith(targetPath)
(or use targetPath.relativize and ensure it does not traverse up) before
allowing extraction; update the validate-entry-dir function and the extraction
logic that calls it (the canonical path computation around (.getCanonicalPath
(File. (.toString target-dir) (str entry))) and the check against
(.getCanonicalPath (.toFile target-dir))) to use Path-based containment instead
of String.startsWith.
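For illustration, here is a minimal Python sketch of the two checks side by side (the repository code is Clojure on java.io/java.nio; `is_contained_string` and `is_contained_path` are hypothetical names). The raw string check admits the sibling-prefix escape, while the component-wise comparison rejects it:

```python
import os


def is_contained_string(target_dir, entry):
    # Vulnerable: raw string prefix check, mirrors String.startsWith
    # on canonical paths.
    candidate = os.path.normpath(os.path.join(target_dir, entry))
    return candidate.startswith(os.path.normpath(target_dir))


def is_contained_path(target_dir, entry):
    # Path-aware: compares whole path components, so /tmp/Shapes1234
    # is not treated as inside /tmp/Shapes123.
    target = os.path.normpath(target_dir)
    candidate = os.path.normpath(os.path.join(target_dir, entry))
    return os.path.commonpath([target, candidate]) == target
```

The java.nio equivalent is `entryPath.startsWith(targetPath)` on `Path` objects, which performs the same component-wise containment test rather than a character-prefix test.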

Comment on lines +122 to +143
(deftest unzip-file-test
(testing "invalid file path in the source throws error"
(let [invalid-zip (create-single-entry-zip "invalid.zip" "../somewhere_else.txt" "something")]
(try
(is (thrown-with-msg? Exception #"Given zip content is not allowed" (shapefile/unzip-file invalid-zip)))
(finally
(io/delete-file invalid-zip true)))))
(testing "invalid file path within the target dir throws error"
(let [invalid-zip (create-single-entry-zip "wrong-dir.zip" "/BLAH/somewhere_else.txt" "something")]
(try
(is (thrown-with-msg? Exception #"Error while uncompressing zip file.*No such file or directory" (shapefile/unzip-file invalid-zip)))
(finally
(io/delete-file invalid-zip true)))))
(testing "valid zip file returns file obj"
(let [valid-zip (create-single-entry-zip "valid.zip" "somewhere_else.txt" "something")
result (shapefile/unzip-file valid-zip)]
(try
(is (some? result))
(finally
(when result
(run! io/delete-file (reverse (file-seq result))))
(io/delete-file valid-zip true))))))

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "shapefile.clj" -type f

Repository: nasa/Common-Metadata-Repository

Length of output: 374


🏁 Script executed:

find . -name "shapefile.clj" -type f | head -20

Repository: nasa/Common-Metadata-Repository

Length of output: 374


🏁 Script executed:

wc -l ./search-app/src/cmr/search/services/parameters/converters/shapefile.clj

Repository: nasa/Common-Metadata-Repository

Length of output: 151


🏁 Script executed:

cat -n ./search-app/src/cmr/search/services/parameters/converters/shapefile.clj | head -70

Repository: nasa/Common-Metadata-Repository

Length of output: 3254


🏁 Script executed:

cat -n ./search-app/src/cmr/search/services/parameters/converters/shapefile.clj | sed -n '55,85p'

Repository: nasa/Common-Metadata-Repository

Length of output: 1721


🏁 Script executed:

cat -n ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj | sed -n '122,145p'

Repository: nasa/Common-Metadata-Repository

Length of output: 1363


🏁 Script executed:

cat -n ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj | head -40

Repository: nasa/Common-Metadata-Repository

Length of output: 2105


🏁 Script executed:

cat -n ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj | sed -n '100,121p'

Repository: nasa/Common-Metadata-Repository

Length of output: 959


🏁 Script executed:

cat -n ./search-app/src/cmr/search/services/parameters/converters/shapefile.clj | sed -n '61,77p'

Repository: nasa/Common-Metadata-Repository

Length of output: 1096


🏁 Script executed:

python3 << 'EOF'
# Simulate the startsWith vulnerability

# Example 1: Target directory
target_dir = "/tmp/ShapesABC"

# Example 2: Sibling-prefix traversal attempt
entry_path = "../ShapesABCevil/escape.txt"

# When File(target_dir, entry_path) is created:
# It creates: /tmp/ShapesABC/../ShapesABCevil/escape.txt
# Canonical form: /tmp/ShapesABCevil/escape.txt

canonical = "/tmp/ShapesABCevil/escape.txt"
canonical_target = "/tmp/ShapesABC"

# The startsWith check
vulnerable_result = canonical.startswith(canonical_target)
print(f"Canonical path: {canonical}")
print(f"Target dir canonical: {canonical_target}")
print(f"Does canonical.startswith(target)? {vulnerable_result}")
print(f"VULNERABLE: {vulnerable_result}")

# Example 3: Absolute path (what the current test uses)
entry_path2 = "/BLAH/somewhere_else.txt"
canonical2 = "/BLAH/somewhere_else.txt"
result2 = canonical2.startswith(canonical_target)
print(f"\nAbsolute path test: {canonical2}")
print(f"Does canonical.startswith(target)? {result2}")
EOF

Repository: nasa/Common-Metadata-Repository

Length of output: 301


🏁 Script executed:

cat -n ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj | sed -n '113,135p'

Repository: nasa/Common-Metadata-Repository

Length of output: 1373


🏁 Script executed:

python3 << 'EOF'
# Verify the regex matching behavior
import re

# Current test regex
regex_pattern = r"Error while uncompressing zip file.*No such file or directory"

# What the validation error would be
validation_error = "Error while uncompressing zip file: Given zip content is not allowed"

# What a filesystem error would be
filesystem_error = "Error while uncompressing zip file: No such file or directory"

print(f"Current regex: {regex_pattern}")
print(f"Validation error message: {validation_error}")
print(f"Match validation error? {bool(re.search(regex_pattern, validation_error))}")
print()
print(f"Filesystem error message: {filesystem_error}")
print(f"Match filesystem error? {bool(re.search(regex_pattern, filesystem_error))}")
EOF

Repository: nasa/Common-Metadata-Repository

Length of output: 394


🏁 Script executed:

# Check if there's any test that validates the sibling-prefix case
grep -n "sibling\|\.\./" ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj

Repository: nasa/Common-Metadata-Repository

Length of output: 177


🏁 Script executed:

# Let me verify that the error from validation happens before filesystem operations
cat -n ./search-app/src/cmr/search/services/parameters/converters/shapefile.clj | sed -n '61,77p'

Repository: nasa/Common-Metadata-Repository

Length of output: 1096


🏁 Script executed:

# Verify the exact error message behavior by testing with absolute path
python3 << 'EOF'
# When creating File("/tmp/ShapesABC", "/BLAH/somewhere_else.txt")
# The second argument is treated as an absolute path in Java
# File(parent, child) where child is absolute ignores parent
# So it becomes just "/BLAH/somewhere_else.txt"

# Canonical form of "/BLAH/somewhere_else.txt" is itself
canonical = "/BLAH/somewhere_else.txt"
target_dir = "/tmp/ShapesABC"

# startsWith check
result = canonical.startswith(target_dir)
print(f"Absolute path test:")
print(f"  Canonical: {canonical}")
print(f"  Target: {target_dir}")
print(f"  Passes validation? {result}")
print(f"  Should throw? {not result}")
EOF

Repository: nasa/Common-Metadata-Repository

Length of output: 205


🏁 Script executed:

grep -A 20 "deftest.*unzip" ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj | head -50

Repository: nasa/Common-Metadata-Repository

Length of output: 1162


🏁 Script executed:

# Let me also verify the complete imports to confirm ZipEntry is available
grep -i "zipentry" ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj

Repository: nasa/Common-Metadata-Repository

Length of output: 176


🏁 Script executed:

python3 << 'EOF'
# Simulate what happens with absolute path /BLAH/somewhere_else.txt

# When File("/tmp/Shapes123", "/BLAH/somewhere_else.txt") is created:
# In Java, if second arg is absolute, it overrides parent
# So it becomes /BLAH/somewhere_else.txt

# getCanonicalPath() on /BLAH/somewhere_else.txt will return /BLAH/somewhere_else.txt
# (even if directory doesn't exist, canonical path just normalizes)

target_dir_canonical = "/tmp/Shapes123"
entry_path_canonical = "/BLAH/somewhere_else.txt"

# startsWith check
fails_validation = not entry_path_canonical.startswith(target_dir_canonical)
print(f"Validation result for /BLAH/... path:")
print(f"  Entry canonical: {entry_path_canonical}")
print(f"  Target canonical: {target_dir_canonical}")
print(f"  Entry.startswith(target)? {entry_path_canonical.startswith(target_dir_canonical)}")
print(f"  Fails validation? {fails_validation}")
print(f"  Expected error: 'Given zip content is not allowed'")
print(f"  Wrapped as: 'Error while uncompressing zip file: Given zip content is not allowed'")
print()

# Now check sibling-prefix case
print("Vulnerability check - sibling-prefix traversal:")
target_dir_canonical = "/tmp/Shapes123"
entry_path = "../Shapes123evil/escape.txt"
# When resolved: /tmp/Shapes123/../Shapes123evil/escape.txt
entry_path_canonical = "/tmp/Shapes123evil/escape.txt"

passes_validation = entry_path_canonical.startswith(target_dir_canonical)
print(f"  Entry canonical: {entry_path_canonical}")
print(f"  Target canonical: {target_dir_canonical}")
print(f"  Entry.startswith(target)? {passes_validation}")
print(f"  VULNERABLE? {passes_validation} (should be False!)")
EOF

Repository: nasa/Common-Metadata-Repository

Length of output: 592


🏁 Script executed:

grep -A 5 "invalid file path within the target dir" ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj

Repository: nasa/Common-Metadata-Repository

Length of output: 462


🏁 Script executed:

# Check imports in the test file to confirm ZipEntry is available
head -30 ./search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj | grep -i "import\|zipentry"

Repository: nasa/Common-Metadata-Repository

Length of output: 145


🏁 Script executed:

python3 << 'EOF'
# Trace through the logic of unzip-file with absolute path entry

# Line 71: (validate-entry-dir target-dir entry)
# This calls validate-entry-dir which:
# 1. Creates File("/tmp/Shapes123", "/BLAH/somewhere_else.txt")
# 2. In Java, absolute path as second arg overrides parent
# 3. getCanonicalPath() returns "/BLAH/somewhere_else.txt"
# 4. target-dir canonical is "/tmp/Shapes123"
# 5. "/BLAH/somewhere_else.txt".startsWith("/tmp/Shapes123") = false
# 6. Throws Exception("Given zip content is not allowed")

# This exception is caught at line 75 and wrapped:
# (str "Error while uncompressing zip file: " (.getMessage e))
# Result: "Error while uncompressing zip file: Given zip content is not allowed"

# The test expects: #"Error while uncompressing zip file.*No such file or directory"
# This regex would NOT match!

import re
regex = r"Error while uncompressing zip file.*No such file or directory"
actual_message = "Error while uncompressing zip file: Given zip content is not allowed"

print(f"Test expects regex: {regex}")
print(f"Actual error message: {actual_message}")
print(f"Regex matches actual? {bool(re.search(regex, actual_message))}")
print()
print("FINDING: The test at line 132 expects the WRONG error message!")
print("It will fail because validation catches the absolute path before filesystem operations.")
EOF

Repository: nasa/Common-Metadata-Repository

Length of output: 427


Test the validation contract, not the filesystem message.

The test at line 132 expects a regex matching "No such file or directory", but absolute-style paths like /BLAH/somewhere_else.txt actually fail validation (line 71) before any filesystem operations occur, throwing "Given zip content is not allowed" instead. The error message won't match the expected regex.

The test suite also misses the sibling-prefix escape vulnerability (../<target-dir-name>evil/...) that bypasses line 58's raw .startsWith check. This pattern would pass validation and allow directory traversal, but no test covers it.

Update the second test to expect the actual validation error message, and add a test for sibling-prefix traversal directly against validate-entry-dir to ensure the attack vector is blocked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@search-app/test/cmr/search/test/unit/services/parameter_converters/shapefile.clj`
around lines 122 - 143, The test is asserting the wrong error for absolute paths
and misses a sibling-prefix traversal case; update the second test in
shapefile.clj that calls shapefile/unzip-file to expect the validation error
message "Given zip content is not allowed" (instead of the filesystem "No such
file or directory") and add a new unit test asserting that validate-entry-dir
throws the same validation error for a sibling-prefix traversal entry like
"../<target-dir-name>evil/..." (ensure you call validate-entry-dir directly and
assert thrown-with-msg? Exception #"Given zip content is not allowed") so the
raw .startsWith vulnerability is covered and prevented.

Comment on lines +122 to +125
(evict
  [_this key]
  (c/evict memory-cache key)
  (c/evict hash-cache (key->hash-cache-key key)))

⚠️ Potential issue | 🟠 Major

Evict the shared hash before the local entry.

If the hash-cache delete fails after Line 124 removes the local value, other nodes can keep serving the stale entry because the shared hash is still intact. Invalidating the shared hash first narrows that inconsistency window and better matches this cache's consistency-first contract.

Suggested change
   (evict
     [_this key]
-    (c/evict memory-cache key)
-    (c/evict hash-cache (key->hash-cache-key key)))
+    (c/evict hash-cache (key->hash-cache-key key))
+    (c/evict memory-cache key))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
(evict
[_this key]
(c/evict memory-cache key)
(c/evict hash-cache (key->hash-cache-key key)))
(evict
[_this key]
(c/evict hash-cache (key->hash-cache-key key))
(c/evict memory-cache key))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@transmit-lib/src/cmr/transmit/cache/consistent_cache.clj` around lines 122 -
125, The evict implementation should remove the shared hash entry before the
local memory entry to minimize inconsistency window: in the evict function
(consistent_cache.clj) call (c/evict hash-cache (key->hash-cache-key key))
before calling (c/evict memory-cache key) so the shared hash is invalidated
first; update the evict body in the function named evict accordingly, preserving
existing semantics and error handling.

Comment on lines +29 to +31
(def transient-error-types
  "Error types that should not be cached - they are transient and should be retried immediately"
  #{:too-many-requests :gateway-timeout})

⚠️ Potential issue | 🟠 Major

Treat the other URS service failures as transient too.

cmr.transmit.urs/get-launchpad-user also raises :request-timeout, :internal-server-error, :bad-gateway, and :service-unavailable for upstream failures. Leaving them out here turns a temporary Launchpad/EDL outage into a five-minute invalid-token cache entry for an otherwise valid token.

Suggested change
 (def transient-error-types
   "Error types that should not be cached - they are transient and should be retried immediately"
-  #{:too-many-requests :gateway-timeout})
+  #{:request-timeout
+    :too-many-requests
+    :internal-server-error
+    :bad-gateway
+    :service-unavailable
+    :gateway-timeout})

Also applies to: 47-56

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj` around lines 29 - 31,
The transient-error-types set currently only lists :too-many-requests and
:gateway-timeout but must also treat URS upstream failures as transient; update
the definition of transient-error-types (and the similar set used at lines
47-56) to include :request-timeout, :internal-server-error, :bad-gateway, and
:service-unavailable so that cmr.transmit.urs/get-launchpad-user and related
caching logic will not store a five-minute invalid-token entry for temporary
Launchpad/EDL outages.
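The resulting caching policy can be sketched in Python (keyword-style strings stand in for the Clojure keywords; `should_cache_failure` is a hypothetical helper, and the set mirrors the review's suggested expansion):

```python
# Error types treated as transient: retry on the next request rather than
# caching the failure. Mirrors the suggested expanded set.
TRANSIENT_ERROR_TYPES = {
    "request-timeout",
    "too-many-requests",
    "internal-server-error",
    "bad-gateway",
    "service-unavailable",
    "gateway-timeout",
}


def should_cache_failure(error_type):
    # Cache only permanent failures (e.g. a genuinely invalid token);
    # transient upstream outages must not poison the cache.
    return error_type not in TRANSIENT_ERROR_TYPES
```

With the narrower original set, a `:bad-gateway` from a momentary EDL blip would be cached and served as "invalid token" until the entry expired.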

Comment on lines +58 to +59
(let [cache-key (keyword (DigestUtils/sha256Hex token))
      token-info (cache/get-value cache cache-key get-launchpad-user-fn)]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and examine the relevant section
head -100 transmit-lib/src/cmr/transmit/launchpad_user_cache.clj | tail -60

Repository: nasa/Common-Metadata-Repository

Length of output: 3550


🏁 Script executed:

# Find the cache/get-value function definition to understand its key type expectations
rg "def.*get-value|fn.*get-value" --max-count 10 -A 5

Repository: nasa/Common-Metadata-Repository

Length of output: 4125


🏁 Script executed:

# Check the cache module to understand what key types are accepted
fd "cache" -e clj -e cljc | head -20

Repository: nasa/Common-Metadata-Repository

Length of output: 1363


🏁 Script executed:

# Look at the cache module to understand get-value signature
cat -n common-lib/src/cmr/common/cache.clj

Repository: nasa/Common-Metadata-Repository

Length of output: 3526


🏁 Script executed:

# Look at in-memory cache implementation
cat -n common-lib/src/cmr/common/cache/in_memory_cache.clj | head -80

Repository: nasa/Common-Metadata-Repository

Length of output: 2726


🏁 Script executed:

# Continue reading the in-memory cache implementation
cat -n common-lib/src/cmr/common/cache/in_memory_cache.clj | sed -n '80,130p'

Repository: nasa/Common-Metadata-Repository

Length of output: 1842


🏁 Script executed:

# Also check how hash-cache handles this, since there's mention of hash_cache
cat -n common-lib/src/cmr/common/hash_cache.clj | head -100

Repository: nasa/Common-Metadata-Repository

Length of output: 3469


🏁 Script executed:

# Check what type of cache is used for launchpad-user-cache-key
rg "launchpad-user-cache-key|launchpad_user_cache_key" -B 2 -A 2

Repository: nasa/Common-Metadata-Repository

Length of output: 3141


🏁 Script executed:

# Search for other usages of cache keys in similar contexts to see if strings are used elsewhere
rg "cache-key.*DigestUtils|cache-key.*sha256|cache-key.*hash" -B 2 -A 2

Repository: nasa/Common-Metadata-Repository

Length of output: 2396


🏁 Script executed:

# Verify if string keys work with the cache implementation by checking test cases
rg -A 10 "cache/set-value|cache/get-value" common-lib/src/cmr/common/cache/cache_spec.clj | head -50

Repository: nasa/Common-Metadata-Repository

Length of output: 1353


🏁 Script executed:

# Let's verify that get-keys behavior doesn't affect our fix
cat -n common-lib/src/cmr/common/cache/in_memory_cache.clj | sed -n '78,85p'

Repository: nasa/Common-Metadata-Repository

Length of output: 307


🏁 Script executed:

# Check if there are any constraints on key types in the protocol or implementations
rg "get-keys|key.*convert" common-lib/src/cmr/common/cache.clj -B 2 -A 3

Repository: nasa/Common-Metadata-Repository

Length of output: 303


Use the SHA-256 hex string directly as the cache key; avoid interning via keyword.

Line 58 creates a keyword from each SHA-256 token hash. In Clojure, this interns the key in the keyword table, which persists in memory even after cache entries are evicted. Under high-cardinality Launchpad traffic, this accumulates unbounded entries in the keyword intern table. The hex string itself is already a suitable cache key.

Suggested change
-      (let [cache-key (keyword (DigestUtils/sha256Hex token))
+      (let [cache-key (DigestUtils/sha256Hex token)
             token-info (cache/get-value cache cache-key get-launchpad-user-fn)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
(let [cache-key (keyword (DigestUtils/sha256Hex token))
token-info (cache/get-value cache cache-key get-launchpad-user-fn)]
(let [cache-key (DigestUtils/sha256Hex token)
token-info (cache/get-value cache cache-key get-launchpad-user-fn)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj` around lines 58 - 59,
The code currently interns SHA-256 hashes by creating keywords for cache keys
(cache-key = (keyword (DigestUtils/sha256Hex token))), which leaks memory;
change cache-key to use the hex string directly (e.g., call
DigestUtils/sha256Hex on token and pass that string into cache/get-value with
get-launchpad-user-fn) so you stop creating keywords for each token and avoid
filling the keyword intern table.
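A minimal Python analogy of the fix (`hashlib` stands in for commons-codec's `DigestUtils`; `token_cache_key` is a hypothetical name): the hex digest is an ordinary string, so nothing is interned for the life of the process:

```python
import hashlib


def token_cache_key(token):
    # The 64-char hex digest is already a stable, collision-resistant
    # cache key; no keyword/interning step is needed.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()
```

Equal tokens always hash to the same key, so cache lookups behave identically to the keyword version without growing a process-lifetime intern table.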

Comment on lines 63 to +74
(do
  (error (format "Launchpad token (partially redacted) [%s] has expired."
                 (common-util/scrub-token token)))
  (errors/throw-service-error
   :unauthorized
   (format "Launchpad token (partially redacted) [%s] has expired."
           (common-util/scrub-token token))))
(cache/evict cache cache-key)
(let [fresh-result (get-launchpad-user-fn)]
  (if (:valid fresh-result)
    (do
      (cache/set-value cache cache-key fresh-result)
      fresh-result)
    (errors/throw-service-error
     (or (:error-type fresh-result) :unauthorized)
     (or (:error-message fresh-result)
         (format "Invalid Launchpad token (partially redacted) [%s]"
                 (common-util/scrub-token token))))))

⚠️ Potential issue | 🟠 Major

Re-cache repeated permanent failures after revalidation.

After Line 64 evicts the stale entry, a second non-transient failure is thrown without storing fresh-result back. That means the first five-minute backoff only happens once; every later request with the same bad or expired token will call EDL again.

Suggested change
           (do
             (cache/evict cache cache-key)
             (let [fresh-result (get-launchpad-user-fn)]
               (if (:valid fresh-result)
                 (do
                   (cache/set-value cache cache-key fresh-result)
                   fresh-result)
-                (errors/throw-service-error
-                 (or (:error-type fresh-result) :unauthorized)
-                 (or (:error-message fresh-result)
-                     (format "Invalid Launchpad token (partially redacted) [%s]"
-                             (common-util/scrub-token token)))))))
+                (do
+                  (cache/set-value cache cache-key fresh-result)
+                  (errors/throw-service-error
+                   (or (:error-type fresh-result) :unauthorized)
+                   (or (:error-message fresh-result)
+                       (format "Invalid Launchpad token (partially redacted) [%s]"
+                               (common-util/scrub-token token))))))))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
(do
(error (format "Launchpad token (partially redacted) [%s] has expired."
(common-util/scrub-token token)))
(errors/throw-service-error
:unauthorized
(format "Launchpad token (partially redacted) [%s] has expired."
(common-util/scrub-token token))))
(cache/evict cache cache-key)
(let [fresh-result (get-launchpad-user-fn)]
(if (:valid fresh-result)
(do
(cache/set-value cache cache-key fresh-result)
fresh-result)
(errors/throw-service-error
(or (:error-type fresh-result) :unauthorized)
(or (:error-message fresh-result)
(format "Invalid Launchpad token (partially redacted) [%s]"
(common-util/scrub-token token)))))))
(do
(cache/evict cache cache-key)
(let [fresh-result (get-launchpad-user-fn)]
(if (:valid fresh-result)
(do
(cache/set-value cache cache-key fresh-result)
fresh-result)
(do
(cache/set-value cache cache-key fresh-result)
(errors/throw-service-error
(or (:error-type fresh-result) :unauthorized)
(or (:error-message fresh-result)
(format "Invalid Launchpad token (partially redacted) [%s]"
(common-util/scrub-token token))))))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@transmit-lib/src/cmr/transmit/launchpad_user_cache.clj` around lines 63 - 74,
After evicting the stale entry, the code currently throws on a non-transient
invalid fresh-result without storing it back into the cache, causing repeated
EDL calls for the same bad token; update the branch in the block around
get-launchpad-user-fn so that when fresh-result is invalid (i.e., (:valid
fresh-result) is false) you still call cache/set-value with the same cache and
cache-key to record the failure result (fresh-result) before invoking
errors/throw-service-error; reference symbols to update: cache/evict,
get-launchpad-user-fn, fresh-result, cache/set-value,
errors/throw-service-error, common-util/scrub-token, cache-key and token.
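The suggested negative-caching behavior can be sketched in Python (hypothetical names; a plain dict stands in for the CMR cache protocol and `fetch_fn` for the EDL lookup). The failure result is stored on the first fetch, so repeated lookups with the same bad token are served from the cache:

```python
class TokenValidationError(Exception):
    """Raised for a cached or freshly fetched invalid-token result."""


def get_token_info(cache, key, fetch_fn):
    result = cache.get(key)
    if result is None:
        result = fetch_fn()
        # Store successes AND permanent failures, so the next request
        # with the same token hits the cache instead of the upstream.
        cache[key] = result
    if not result.get("valid"):
        raise TokenValidationError(result.get("error-message", "invalid token"))
    return result
```

Without the `cache[key] = result` line on the failure path, every repeat of the same bad token would trigger another upstream call, which is the behavior the review flags.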

@eereiter eereiter closed this Mar 19, 2026