Skip to content

fix(security): sidecar OOM caps, secret-file exclusion, SLSA attestations, cargo-deny#37

Merged
avifenesh merged 8 commits into
mainfrom
fix/security-hardening
Apr 26, 2026
Merged

fix(security): sidecar OOM caps, secret-file exclusion, SLSA attestations, cargo-deny#37
avifenesh merged 8 commits into
mainfrom
fix/security-hardening

Conversation

@avifenesh
Copy link
Copy Markdown
Contributor

@avifenesh avifenesh commented Apr 26, 2026

Summary

Security hardening from an internal audit (2026-04-26). 5 commits, each a single concern:

# Severity Finding Fix
1 HIGH Sidecar parser allocates up to 4 GiB from attacker-controlled u32 lengths (OOM + panic = abort) Cap header_len (1 MiB), vector_count (10M), dim (4096), name_len (1 KiB); 5 reject-path tests
2 HIGH No release attestation — only SHA256 sidecars Publish SLSA build-provenance via `actions/attest-build-provenance`
3 MED Embed/slop walkers include hidden files (`.env`, `.git/`, `.ssh/`, keys) `.hidden(true)` + is_secret_like deny patterns
4 MED No per-file size cap before embedding `max_filesize(5 MiB)` on both walkers
5 MED No cargo-deny config Add `deny.toml` (advisories/licenses/bans/sources)

Test plan

  • `cargo test -p analyzer-embed`: 35 unit + 4 integration, all pass
  • `cargo build --release`: clean
  • `cargo deny check`: all categories clean

Follow-ups (separate issues)

  • Client-side attestation verification in `agent-core`'s `lib/binary/index.js`
  • Add `cargo deny check` CI job (`EmbarkStudios/cargo-deny-action`)
  • Deduplicate `is_secret_like` between analyzer-embed and analyzer-graph

Note

Medium Risk
Moderate risk because it changes low-level sidecar parsing and repo file discovery (hidden/large/secret-like files now skipped), which could affect embedding/analyzer coverage, and it alters release workflow permissions/steps for attestations.

Overview
Adds security hardening across artifact parsing, repo walking, and releases.

The sidecar (repo-intel.embeddings.bin) reader now enforces strict caps and size cross-checks (header length, vector count, dim, string lengths) via Sidecar::from_bytes, with new tests to ensure malformed sidecars fail fast instead of causing OOM/panic.

Repo walkers used by embedding and slop analysis now exclude hidden paths, skip secret-like filenames/extensions/directories, and apply a 5 MiB per-file size cap (with new scan.rs tests).

The release workflow now publishes SLSA build-provenance attestations for per-target archives (OIDC id-token/attestations permissions, pinned actions/attest-build-provenance), and the repo adds a baseline deny.toml policy for cargo-deny (advisories/licenses/sources).

Reviewed by Cursor Bugbot for commit c6dc519. Configure here.

The sidecar binary format is fully attacker-controlled once a repository's
embedding cache is written to disk, and Sidecar::read previously trusted
every length/count field to drive allocations directly:

- header_len (u32) -> vec![0u8; header_len] could request up to 4 GiB.
- header.vector_count drove an unbounded outer loop.
- header.dim drove Vec::with_capacity per vector.
- name_len (u32) drove another unbounded string allocation.

Introduce explicit caps - 1 MiB header, 10M vectors, 4096 dim, 1 KiB
names - and validate each field before any allocation. Add a from_bytes
entry point that additionally cross-checks the declared vector count
against the actual byte buffer so a malformed header cannot claim 10M
vectors in a 100-byte file. read<R: Read> now slurps and delegates.

Unit tests cover each cap and the implied-size check. The existing
round-trip tests continue to pass.
…slop walks

Both the embed scanner and the slop walker built WalkBuilder with
.hidden(false), which meant .git/, .env, .ssh/id_rsa, .aws/credentials
and similar secret-bearing paths were visited and - in the embed case -
hashed and potentially shipped in the vector sidecar.

Flip .hidden(true) in both walkers and add an explicit is_secret_like()
filter that rejects well-known dotfiles (.env, .env.*, .npmrc, .pypirc,
.netrc, .htpasswd), SSH/PGP/cloud config directories (.git/, .ssh/,
.gnupg/, .aws/, .gcloud/, .azure/), SSH private-key filenames
(id_rsa*, id_ed25519*, ...), and credential-like extensions
(.pem/.key/.crt/.p12/.pfx/.jks/.keystore). The helper is duplicated
across the two crates to avoid pulling analyzer-embed into
analyzer-graph for one filter.

Test: walk_excludes_dotfiles_and_secret_patterns constructs a tempdir
populated with src/a.rs plus a set of secret-like files and confirms
only the legitimate source file survives the walk.
Both walkers called fs::read / std::fs::read_to_string on every file
they visited with no size check. A single large vendored bundle or a
hostile 1 GiB file in the repo would force that much RAM per worker.

Use WalkBuilder::max_filesize(Some(5 * 1024 * 1024)) in both the embed
scanner and the slop walker; ignore::WalkBuilder already short-circuits
oversized files before yielding them, so no per-file metadata call is
needed in hot paths. 5 MiB comfortably covers the long tail of real
source files while cutting off the obvious DoS vector.

Test: walk_skips_files_larger_than_cap writes a 10 MiB .rs file
alongside a small one in a tempdir and asserts only the small file is
returned.
Previously the release workflow only published SHA-256 checksums, which
prove integrity but not provenance. Add actions/attest-build-provenance
(pinned to SHA) after archive creation on each matrix job so every
published .tar.gz / .zip gets a signed in-toto attestation traceable
back to the exact workflow run that produced it.

Grant the job id-token: write and attestations: write so the action can
sign via the runner's OIDC identity and post the attestation to the
release.

Note: this is defense in depth - the attestations are published but not
yet verified client-side. The agent-core binary downloader would need
cosign or slsa-verifier to check them; wiring that in is a separate PR.
deny.toml establishes baseline policy for four checks:

- advisories: vulnerability = deny, yanked = deny.
- licenses: allow MIT/Apache-2.0/BSD-{2,3}-Clause/ISC/MPL-2.0/Unicode/
  Zlib/CC0; GPL-*/AGPL-*/SSPL are rejected by omission.
- bans: multiple-versions + wildcards = warn; deny list is empty for
  now, extend as concrete risks surface.
- sources: unknown-registry and unknown-git both deny; crates.io is the
  only allowed registry, agent-sh is the only allowed git org.

Verified locally with cargo deny check - all four checks pass (the only
warnings are "unmatched license allowance" / "multiple-versions", which
are informational).

CI integration (a `cargo deny check` job) is a follow-up.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several security hardening measures, including file size limits and secret-pattern exclusion during repository walks, as well as robust input validation for sidecar parsing to prevent potential memory exhaustion attacks. Additionally, a deny.toml configuration file is added to manage dependency security and licensing. The review feedback suggests refactoring the hardcoded file size limit in slop.rs into a constant for consistency and improving performance by reusing the tree_sitter::Parser instance.

Comment thread crates/analyzer-graph/src/slop.rs Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c6dc519. Configure here.

Comment thread crates/analyzer-embed/src/sidecar.rs
Comment thread crates/analyzer-graph/src/slop.rs Outdated
…aps, deduplicate is_secret_like into analyzer-core

- Introduce analyzer-core::limits with MAX_WALK_FILE_SIZE (5 MiB),
  MAX_NAME_LEN (1 KiB) and MAX_PATH_LEN (4 KiB / PATH_MAX) as the single
  source of truth for defense-in-depth bounds. Replaces the hardcoded
  5*1024*1024 in analyzer-graph::slop::walk_repo_files and the local
  MAX_FILE_BYTES in analyzer-embed::scan.
- Introduce analyzer-core::secrets::is_secret_like as the shared
  deny-list for credential-looking file/dir patterns. analyzer-embed
  and analyzer-graph previously carried byte-identical copies; any
  future addition (say, a new cloud-cred directory) would have had to
  be made in two places. Now there's one.
- Split path vs. name caps in the sidecar reader. `read_string`
  previously used MAX_NAME_LEN (1 KiB) for the length prefix on path
  strings, which rejects legitimate deeply-nested monorepo paths.
  It now uses MAX_PATH_LEN via a new read_bounded_string helper; the
  name_len branch keeps MAX_NAME_LEN for symbol identifiers where a
  tight cap is appropriate.
- Add tests: 5 new is_secret_like cases in analyzer-core, plus
  oversized_path_len_is_rejected and long_path_under_cap_round_trips
  in the sidecar suite to cover the new split.
…Store

The security pass set .hidden(true) for both embed and slop walkers,
but slop's tracked_artifacts() needs to flag .DS_Store, .swp, etc. as
deletable junk. Secret-like paths are still filtered by is_secret_like.

- embed walker: .hidden(true) kept (don't index .env)
- slop walker: .hidden(false) restored (must see OS junk)
@avifenesh avifenesh merged commit 1234c8d into main Apr 26, 2026
4 checks passed
@avifenesh avifenesh deleted the fix/security-hardening branch April 26, 2026 12:35
@avifenesh avifenesh mentioned this pull request Apr 26, 2026
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant