Walkthrough

Adds a Git scraper: API types for Git/file rules, schema and deepcopy updates, a go-git-based scraper that clones repos, builds rule-filtered file trees, detects commit/tag/branch changes with per-file strategies (ignore/track/diff), and test coverage + fixtures for integration and unit tests.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Scraper as Git Scraper
    participant Repo as Git Repo (go-git)
    participant Cache as TempCache
    participant Detector as Change Detector
    Client->>Scraper: Scrape(ctx, ScraperSpec)
    activate Scraper
    Scraper->>Repo: Clone / Open repository
    activate Repo
    Repo-->>Scraper: Repository object (refs, HEAD)
    deactivate Repo
    Scraper->>Cache: Load previous state (branches, tags, SHAs)
    activate Cache
    Cache-->>Scraper: previous state
    deactivate Cache
    Scraper->>Scraper: Build file tree (apply file rules)
    Scraper->>Detector: generateCommitChanges(branchRef, rules)
    activate Detector
    Detector->>Repo: Walk commits, diff trees
    Detector-->>Scraper: []ChangeResult (commits with diffs)
    deactivate Detector
    Scraper->>Detector: generateTagChanges(since)
    activate Detector
    Detector->>Repo: Resolve tags
    Detector-->>Scraper: []ChangeResult (tags)
    deactivate Detector
    Scraper->>Scraper: generateBranchChanges(current, previous)
    Scraper->>Cache: Store current state
    activate Cache
    Cache-->>Scraper: OK
    deactivate Cache
    Scraper-->>Client: ScrapeResults (repositories, branches, changes, files)
    deactivate Scraper
```
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 7
🧹 Nitpick comments (6)
scrapers/git/changes.go (3)
213-224: Consider using the actual tag timestamp for creation events.

For newly detected tags, `CreatedAt` is set to `time.Now()` (line 214), but the actual tag creation time is available in `tag.when`. Using the real timestamp would provide more accurate historical data.

♻️ Proposed fix
```diff
 if _, existed := previousTags[name]; !existed {
-    now := time.Now()
+    createdAt := tag.when
     changes = append(changes, v1.ChangeResult{
         ExternalID:       repoExternalID,
         ConfigType:       "Git::Repository",
         ExternalChangeID: fmt.Sprintf("tag/%s/%s", name, tag.sha),
         ChangeType:       "Tag",
         Summary:          fmt.Sprintf("Tag %s created", name),
         Source:           "git",
-        CreatedAt:        &now,
+        CreatedAt:        &createdAt,
         Details:          map[string]any{"tag": name, "sha": tag.sha},
     })
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/changes.go` around lines 213 - 224, The CreatedAt for new tag change events is using time.Now() which loses the actual tag timestamp; update the code that appends to changes (the v1.ChangeResult for "Tag" created when checking previousTags[name]) to use the tag's real time (tag.when) converted to a *time.Time (or pointer) instead of time.Now(), ensuring CreatedAt is set from tag.when (or its pointer) so the change record reflects the tag's actual creation time.
20-23: Consider logging errors instead of silently returning empty results.

When `repo.Log()` fails, the function silently returns an empty slice, which could hide repository access issues or configuration problems. Consider logging the error or returning it to the caller for better observability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/changes.go` around lines 20 - 23, The function currently swallows errors from repo.Log (the call that assigns iter) and returns an empty changes slice; update the implementation so failures are observable by either (a) logging the error before returning (use the repository/component logger, e.g., log.Printf or your package logger) or (b) changing the function signature to return ([]Change, error) and propagate the error up instead of returning changes; locate the repo.Log call and the local vars iter and changes and implement one of these two fixes (log the error with context mentioning repo.Log/branchRef.Hash() or return the error to the caller).
291-311: Potential performance concern for large repositories.

The `isAncestor` function walks the entire commit history from `head` to find `candidate`. For repositories with long histories, this could be slow. Consider caching or limiting the search depth if this becomes a bottleneck.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/changes.go` around lines 291 - 311, The isAncestor function currently does a full history walk from head which can be very slow on long repos; update isAncestor(repo *git.Repository, candidate, head plumbing.Hash) to either (a) short-circuit with a maxDepth parameter/constant and stop the iterator after that many commits, or (b) implement a simple ancestry cache (map[plumbing.Hash]bool or an LRU) keyed by head/candidate to avoid repeated full walks; locate the logic in isAncestor, add a depth check inside the iter.ForEach loop (or check/populate the cache before walking), and ensure you return false when the depth limit is reached while still allowing cache hits to return true without walking.

scrapers/git/git_suite_test.go (1)
30-32: Directory change assumption may be fragile.

The `os.Chdir("../..")` call assumes tests are run from the `scrapers/git/` directory. This can cause issues when running tests from the repository root or other locations. Consider using a more robust method to locate the repository root.

♻️ Alternative approach using runtime to find the test file location
```go
func init() {
	_, filename, _, _ := runtime.Caller(0)
	dir := filepath.Dir(filename)
	// Navigate to repo root
	repoRoot := filepath.Join(dir, "..", "..")
	os.Chdir(repoRoot)
}
```

Or use an environment variable / build tag to specify the root path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/git_suite_test.go` around lines 30 - 32, Replace the fragile os.Chdir("../..") call in git_suite_test.go with a deterministic repo-root resolution: move the change into an init() or setup helper, use runtime.Caller to get the current test file path, compute filepath.Dir(filename) and then walk/join to the repository root (or respect an env var override) and call os.Chdir to that resolved path; reference os.Chdir, runtime.Caller, filepath.Dir and the init/test setup function to locate and update the code.

fixtures/git.yaml (1)
7-11: Clarify the relationship between `branch` and `branches` fields.

The fixture specifies both `branch: main` (line 8) and `branches: [main]` (lines 10-11). The `branch` field appears to come from `GitConnection` while `branches` is the scraper-specific field for tracking. Consider documenting which field controls what behavior, or removing the redundancy if they serve the same purpose.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fixtures/git.yaml` around lines 7 - 11, The fixture currently defines both branch and branches which are ambiguous; update the code and fixture so their relationship is explicit: decide whether GitConnection.branch (the single-revision field) or the scraper-specific branches list controls checkout behavior, then either remove the redundant field from the fixture or document and enforce precedence (e.g., GitConnection.branch as default, scraper's branches overrides it). Update the schema/docs and add validation in the loader that checks consistency between GitConnection.branch and the scraper's branches field (or auto-populates one from the other) and adjust the fixture to match the chosen behavior; reference the branch and branches fields and the GitConnection struct / scraper config when making changes.scrapers/git/filetree.go (1)
104-109: Per-file history lookup will scale poorly on large repositories.

Running `lastModifyingCommit` for every file can dominate scrape time. Consider a single-pass commit traversal that fills last-modified metadata for unresolved paths instead of N log walks.

Also applies to: 121-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/filetree.go` around lines 104 - 109, Per-file lastModifyingCommit calls (invoked where lastModifyingCommit(repo, commitHash, f.Name) sets entry.LastModified/LastModifiedSHA/LastModifiedBy) will not scale; instead implement a single-pass commit traversal that assigns last-modified metadata for unresolved entries in bulk. Replace the per-file calls with a preprocessing step (e.g., traverseCommitsFillLastModified or fillLastModifiedForEntries) that walks commits once from newest to oldest, checks each commit's tree for paths in the entries collection (use the same key used when calling lastModifyingCommit), and when a path is first seen populate entry.LastModified, entry.LastModifiedSHA and entry.LastModifiedBy; stop updating an entry once populated. Update usages where lastModifyingCommit was called (the spots setting LastModified, LastModifiedSHA, LastModifiedBy) to rely on the populated metadata instead of calling lastModifyingCommit per file.
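The single-pass idea can be sketched independently of go-git: walk commits newest to oldest and record the first commit that touched each wanted path, stopping once every path is resolved. The `commit` type and field names below are illustrative assumptions, not the scraper's actual types:

```go
package main

import "fmt"

// commit is a stand-in for a go-git commit: its SHA plus the paths its
// diff against the parent touched.
type commit struct {
	sha     string
	touched []string
}

// lastModified walks commits once, newest first, and records the first
// (i.e. most recent) commit that touched each wanted path. It stops
// early once every path is resolved, so the whole file tree costs one
// traversal instead of one log walk per file.
func lastModified(commits []commit, paths []string) map[string]string {
	want := make(map[string]bool, len(paths))
	for _, p := range paths {
		want[p] = true
	}
	out := make(map[string]string, len(paths))
	for _, c := range commits {
		for _, p := range c.touched {
			if want[p] {
				out[p] = c.sha // first sighting wins; never overwritten
				delete(want, p)
			}
		}
		if len(want) == 0 {
			break // all paths resolved; skip older history
		}
	}
	return out
}

func main() {
	history := []commit{ // newest first
		{sha: "c3", touched: []string{"README.md"}},
		{sha: "c2", touched: []string{"main.go", "README.md"}},
		{sha: "c1", touched: []string{"main.go"}},
	}
	fmt.Println(lastModified(history, []string{"main.go", "README.md"}))
}
```

In the real scraper the `touched` set would come from diffing each commit's tree against its parent, and the resolved map would populate `LastModified`, `LastModifiedSHA` and `LastModifiedBy` on the file entries.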
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- api/v1/git.go
- api/v1/scrapeconfig_types.go
- api/v1/types.go
- api/v1/zz_generated.deepcopy.go
- fixtures/git.yaml
- go.mod
- scrapers/common.go
- scrapers/git/changes.go
- scrapers/git/filetree.go
- scrapers/git/git_integration_test.go
- scrapers/git/git_suite_test.go
- scrapers/git/scraper.go
- scrapers/git/scraper_test.go
```
github.com/flanksource/is-healthy v1.0.82
github.com/flanksource/ketall v1.1.9
github.com/flanksource/kopper v1.0.13
github.com/go-git/go-git/v5 v5.16.5
```
🧩 Analysis chain
🌐 Web query:
go-git/go-git v5.16.5 latest version releases
💡 Result:
As of March 1, 2026, the latest github.com/go-git/go-git/v5 release is:
- v5.17.0 (released Feb 25, 2026) — marked Latest on GitHub Releases. [1]
The prior patch you mentioned:
- v5.16.5 (released Feb 9, 2026) — includes the fix for GHSA-37cx-329c-33x3 / CVE-2026-25934 (fixed in 5.16.5). [2][3]
To upgrade:
```shell
go get github.com/go-git/go-git/v5@v5.17.0
```

Sources: [1] [2] [3]
🏁 Script executed:
```shell
find . -path "*/scrapers/git/*" -name "*.go" -type f | head -20
```

Repository: flanksource/config-db
Length of output: 246
🏁 Script executed:
rg "github.com/go-git/go-git" --type go -A 2 -B 2Repository: flanksource/config-db
Length of output: 1467
Upgrade to go-git v5.17.0 (latest).
The go-git/go-git/v5 dependency promotion from indirect to direct is correct—the code imports it directly in scrapers/git/. However, v5.17.0 was released on Feb 25, 2026 and is now the latest stable version. v5.16.5 (Feb 9, 2026) remains valid with its security fix (CVE-2026-25934), but upgrading to v5.17.0 is recommended.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go.mod` at line 60, Update the go-git dependency in go.mod from
github.com/go-git/go-git/v5 v5.16.5 to v5.17.0 and ensure module files are
refreshed; specifically, bump the version for the module identifier
github.com/go-git/go-git/v5 (used by code in scrapers/git/) and then run the
module update/tidy step so go.sum is regenerated and builds pick up v5.17.0.
```go
if d.patch != "" {
	diffContent.WriteString(fmt.Sprintf("--- %s ---\n%s\n", d.path, d.patch))
}
```
Fix lint error: Use fmt.Fprintf instead of WriteString(fmt.Sprintf(...)).
The static analysis tool flagged this as QF1012. Using fmt.Fprintf directly is more efficient.
🔧 Proposed fix
if d.patch != "" {
- diffContent.WriteString(fmt.Sprintf("--- %s ---\n%s\n", d.path, d.patch))
+ fmt.Fprintf(&diffContent, "--- %s ---\n%s\n", d.path, d.patch)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if d.patch != "" {
	fmt.Fprintf(&diffContent, "--- %s ---\n%s\n", d.path, d.patch)
}
```
🧰 Tools
🪛 GitHub Actions: Lint
[error] 79-79: QF1012: Use fmt.Fprintf(...) instead of WriteString(fmt.Sprintf(...)) (staticcheck)
🪛 GitHub Check: lint
[failure] 79-79:
QF1012: Use fmt.Fprintf(...) instead of WriteString(fmt.Sprintf(...)) (staticcheck)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapers/git/changes.go` around lines 78 - 80, Replace the
WriteString(fmt.Sprintf(...)) call in the block that appends per-file patches to
diffContent: instead of diffContent.WriteString(fmt.Sprintf("--- %s ---\n%s\n",
d.path, d.patch)), call fmt.Fprintf on diffContent with the same format and
arguments (use diffContent as the io.Writer and format string "--- %s ---\n%s\n"
with d.path and d.patch) so the code uses fmt.Fprintf directly for efficiency;
ensure you import fmt if not already present and keep the same output formatting
and variables (diffContent, d.path, d.patch).
```go
ginkgo.It("should scrape a public git repository", func() {
	scrapeConfig := getConfigSpec("git")
	scraperCtx := api.NewScrapeContext(DefaultContext).WithScrapeConfig(&scrapeConfig)

	scraper := Scraper{}
	results = scraper.Scrape(scraperCtx)
```
This integration test is coupled to live external Git state.
Using a public repo directly makes CI brittle (network outages, branch/history drift, force-pushes). Please gate this as explicit integration-only or run against a deterministic local mirror/fixture repo.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapers/git/git_integration_test.go` around lines 13 - 18, The test "should
scrape a public git repository" (the ginkgo.It block that builds scrapeConfig
via getConfigSpec("git"), creates scraperCtx with
api.NewScrapeContext(DefaultContext).WithScrapeConfig, and calls
Scraper{}.Scrape) depends on live external Git state; change it to either be
gated behind an explicit integration-only flag or to run against a deterministic
local fixture/mirror: add a check at the start of the test that skips unless an
integration environment variable/flag (e.g., TEST_INTEGRATION_GIT) is set, or
replace getConfigSpec("git") with a config pointing to a local test repo/mirror
committed or created in test setup so Scraper.Scrape uses deterministic data.
Ensure the gating flag or local-repo setup is documented in the test comments.
```go
configMap := repo.Config.(map[string]any)
files := configMap["files"].([]FileEntry)
```
Guard type assertions to avoid panic-driven test failures.
Use checked assertions here so failures are reported as expectations, not panics.
Proposed fix
```diff
-configMap := repo.Config.(map[string]any)
-files := configMap["files"].([]FileEntry)
+configMap, ok := repo.Config.(map[string]any)
+Expect(ok).To(BeTrue(), "repo config should be a map")
+files, ok := configMap["files"].([]FileEntry)
+Expect(ok).To(BeTrue(), "files should be a []FileEntry")
```

🤖 Prompt for AI Agents
+ Expect(ok).To(BeTrue(), "files should be a []FileEntry")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapers/git/git_integration_test.go` around lines 119 - 120, The test
currently uses unchecked type assertions on repo.Config and on
configMap["files"], which can panic; change them to checked assertions: first
assert that repo.Config is a map[string]any (e.g. v, ok :=
repo.Config.(map[string]any)) and fail the test with a clear message if not,
then retrieve filesVal, ok := configMap["files"] and assert its presence, then
assert filesSlice, ok := filesVal.([]FileEntry) and fail with t.Fatalf or
require/require.IsType/require.NotNil so failures are reported as test errors
instead of panics; update the uses of configMap and files to use these checked
variables.
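Independent of Gomega, the comma-ok assertion pattern the comment asks for looks like this; the config value below is a made-up stand-in for `repo.Config`:

```go
package main

import "fmt"

func main() {
	var config any = map[string]any{"files": []string{"a.go"}}

	// Checked assertion: ok becomes false instead of panicking on a
	// type mismatch or a missing key.
	configMap, ok := config.(map[string]any)
	if !ok {
		fmt.Println("config is not a map")
		return
	}
	files, ok := configMap["files"].([]string)
	if !ok {
		fmt.Println("files missing or wrong type")
		return
	}
	fmt.Println(files)
}
```

In the test itself each `!ok` branch would report a failure (`Expect(ok).To(BeTrue(), ...)` or `t.Fatalf`) rather than printing.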
```go
func getConfigSpec(name string) v1.ScrapeConfig {
	configs, err := v1.ParseConfigs("fixtures/" + name + ".yaml")
	if err != nil {
		Fail(fmt.Sprintf("Failed to parse config: %v", err))
	}
	return configs[0]
}
```
Add bounds check before accessing configs[0].
The function returns configs[0] without verifying that the slice is non-empty. If the fixture file contains no valid configs, this will cause a panic.
🛡️ Proposed fix
```diff
 func getConfigSpec(name string) v1.ScrapeConfig {
 	configs, err := v1.ParseConfigs("fixtures/" + name + ".yaml")
 	if err != nil {
 		Fail(fmt.Sprintf("Failed to parse config: %v", err))
 	}
+	if len(configs) == 0 {
+		Fail(fmt.Sprintf("No configs found in fixtures/%s.yaml", name))
+	}
 	return configs[0]
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func getConfigSpec(name string) v1.ScrapeConfig {
	configs, err := v1.ParseConfigs("fixtures/" + name + ".yaml")
	if err != nil {
		Fail(fmt.Sprintf("Failed to parse config: %v", err))
	}
	if len(configs) == 0 {
		Fail(fmt.Sprintf("No configs found in fixtures/%s.yaml", name))
	}
	return configs[0]
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapers/git/git_suite_test.go` around lines 39 - 45, getConfigSpec currently
returns configs[0] without ensuring the slice has elements; update getConfigSpec
to check the length of the configs slice returned by v1.ParseConfigs and call
Fail with a clear message if len(configs) == 0 (or propagate an error) before
accessing configs[0]; reference the getConfigSpec function, the configs
variable, and the v1.ParseConfigs call and ensure the error/empty-slice path
uses Fail(fmt.Sprintf(...)) so tests don't panic.
```go
previousTags := loadPreviousTags(ctx, repoExternalID)
previousBranches := loadPreviousBranches(ctx, repoExternalID)
```
Branch lifecycle delta is currently non-stateful.
Because loadPreviousBranches always returns nil, branch changes are recomputed without a real baseline each run.
Proposed fix
```diff
-// Store current tags in repo config for next run
-repoConfig["tags"] = currentTags
+// Store current state in repo config for next run
+repoConfig["tags"] = currentTags
+repoConfig["branches"] = branchNames(trackedBranches)
@@
-func loadPreviousBranches(_ api.ScrapeContext, _ string) []string {
-	// FIXME: implement branch tracking from TempCache
-	return nil
-}
+func loadPreviousBranches(ctx api.ScrapeContext, repoExternalID string) []string {
+	configMap := loadCachedConfig(ctx, repoExternalID, "Git::Repository")
+	if configMap == nil {
+		return nil
+	}
+
+	raw, ok := configMap["branches"]
+	if !ok {
+		return nil
+	}
+
+	switch v := raw.(type) {
+	case []string:
+		return append([]string(nil), v...)
+	case []any:
+		out := make([]string, 0, len(v))
+		for _, item := range v {
+			if s, ok := item.(string); ok {
+				out = append(out, s)
+			}
+		}
+		return out
+	default:
+		return nil
+	}
+}
```

If you want, I can also draft a small regression test that verifies branch create/delete deltas are stable across consecutive scrapes.
Also applies to: 100-103, 118-120, 267-270
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapers/git/scraper.go` around lines 61 - 63, loadPreviousBranches currently
returns nil so branch lifecycle deltas are recomputed with no baseline;
implement loadPreviousBranches to load the persisted branch snapshot (same
pattern as loadPreviousTags) by reading the repo's stored state (DB/metadata)
and returning the slice of previous branch names/objects for the given
repoExternalID, then ensure all callers (the branch delta computation sites in
scraper.go that call loadPreviousBranches) use the returned value to compute
create/delete deltas; also update any other occurrences of the placeholder
implementation (the other call sites flagged in the comment) so branch deltas
are stable across consecutive scrapes.
```go
repoResult.Tags = v1.JSONStringMap{
	"url":           gitClient.URL,
	"defaultBranch": gitClient.Branch,
}
```
Preserve BaseScraper tags instead of replacing them.
NewScrapeResult already carries base tags, but these assignments overwrite the entire map and discard user-configured tags.
Proposed fix
```diff
-repoResult.Tags = v1.JSONStringMap{
-	"url":           gitClient.URL,
-	"defaultBranch": gitClient.Branch,
-}
+if repoResult.Tags == nil {
+	repoResult.Tags = v1.JSONStringMap{}
+}
+repoResult.Tags["url"] = gitClient.URL
+repoResult.Tags["defaultBranch"] = gitClient.Branch
@@
-branchResult.Tags = v1.JSONStringMap{
-	"branch": branchName,
-	"url":    gitClient.URL,
-}
+if branchResult.Tags == nil {
+	branchResult.Tags = v1.JSONStringMap{}
+}
+branchResult.Tags["branch"] = branchName
+branchResult.Tags["url"] = gitClient.URL
```

Also applies to: 144-147
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapers/git/scraper.go` around lines 87 - 90, The code is replacing the
entire repoResult.Tags map (created by NewScrapeResult/BaseScraper) which
discards user-configured tags; instead merge the new keys into the existing
repoResult.Tags map (e.g., set repoResult.Tags["url"]=gitClient.URL and
repoResult.Tags["defaultBranch"]=gitClient.Branch) so you preserve any
pre-existing entries; apply the same merging approach for the other occurrence
that assigns repoResult.Tags at lines around the second block (the assignment
near where repoResult.Tags is set again).
Tags whose target commit is older than the oldest commit in the depth window are now excluded from change detection.
- Use Name <email> format for LastModifiedBy in file tree
- Promote go-git/go-git to direct dependency
- Update integration test assertions for Commit change type
- Add tag and author/committer assertions to integration tests
- Switch test fixture to mission-control repo
Force-pushed c4caa36 to cee32dd.
Actionable comments posted: 5
♻️ Duplicate comments (3)
scrapers/git/scraper.go (2)
87-90: ⚠️ Potential issue | 🟠 Major

Avoid replacing `Tags`; merge into existing map to preserve BaseScraper tags.

These assignments overwrite pre-existing tags and can drop user-configured metadata.
🔧 Proposed fix
```diff
-repoResult.Tags = v1.JSONStringMap{
-	"url":           gitClient.URL,
-	"defaultBranch": gitClient.Branch,
-}
+if repoResult.Tags == nil {
+	repoResult.Tags = v1.JSONStringMap{}
+}
+repoResult.Tags["url"] = gitClient.URL
+repoResult.Tags["defaultBranch"] = gitClient.Branch
@@
-branchResult.Tags = v1.JSONStringMap{
-	"branch": branchName,
-	"url":    gitClient.URL,
-}
+if branchResult.Tags == nil {
+	branchResult.Tags = v1.JSONStringMap{}
+}
+branchResult.Tags["branch"] = branchName
+branchResult.Tags["url"] = gitClient.URL
```

Also applies to: 144-147
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/scraper.go` around lines 87 - 90, The current code assigns repoResult.Tags = v1.JSONStringMap{...}, which overwrites existing BaseScraper tags and loses metadata; instead, merge the new keys into the existing repoResult.Tags map (creating it if nil) and set "url" to gitClient.URL and "defaultBranch" to gitClient.Branch so existing tags are preserved; apply the same change for the other occurrence that sets repoResult.Tags (the block around the second occurrence) so both places merge rather than replace.
61-63: ⚠️ Potential issue | 🟠 Major — Branch lifecycle deltas are still non-stateful.
`loadPreviousBranches` still returns `nil`, and current branches are not persisted in `repoConfig`, so create/delete branch changes are recomputed without a stable baseline.
🔧 Proposed fix
```diff
 // Store current tags in repo config for next run
 repoConfig["tags"] = currentTags
+repoConfig["branches"] = branchNames(trackedBranches)
@@
-func loadPreviousBranches(_ api.ScrapeContext, _ string) []string {
-	// FIXME: implement branch tracking from TempCache
-	return nil
+func loadPreviousBranches(ctx api.ScrapeContext, repoExternalID string) []string {
+	configMap := loadCachedConfig(ctx, repoExternalID, "Git::Repository")
+	if configMap == nil {
+		return nil
+	}
+
+	raw, ok := configMap["branches"]
+	if !ok {
+		return nil
+	}
+
+	switch v := raw.(type) {
+	case []string:
+		return append([]string(nil), v...)
+	case []any:
+		out := make([]string, 0, len(v))
+		for _, item := range v {
+			if s, ok := item.(string); ok {
+				out = append(out, s)
+			}
+		}
+		return out
+	default:
+		return nil
+	}
 }
```
Also applies to: 118-120, 267-270
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/scraper.go` around lines 61 - 63, loadPreviousBranches currently returns nil and current branch state isn't persisted, so branch create/delete deltas are recomputed without a stable baseline; update loadPreviousBranches to read the saved branch list from repoConfig (or storage used by loadPreviousTags), and after computing currentBranches in the scraper persist the current branch list back into repoConfig (the same place loadPreviousBranches reads from) so subsequent runs have a stable baseline; ensure the same fix is applied wherever loadPreviousBranches is called (e.g., alongside loadPreviousTags usage and the other two call sites mentioned) and keep the saved format consistent with loadPreviousTags to avoid schema mismatches.scrapers/git/changes.go (1)
78-80: ⚠️ Potential issue | 🟡 Minor — Use `fmt.Fprintf` instead of `WriteString(fmt.Sprintf(...))` to satisfy lint and avoid an extra allocation.
This is still failing staticcheck (QF1012).
🔧 Proposed fix
```diff
 case v1.GitFileStrategyDiff:
 	diffedFiles = append(diffedFiles, d.path)
 	if d.patch != "" {
-		diffContent.WriteString(fmt.Sprintf("--- %s ---\n%s\n", d.path, d.patch))
+		fmt.Fprintf(&diffContent, "--- %s ---\n%s\n", d.path, d.patch)
 	}
 }
```
```bash
#!/bin/bash
rg -n 'WriteString\(fmt\.Sprintf\('
```
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/changes.go` around lines 78 - 80, Replace the WriteString(fmt.Sprintf(...)) call to avoid the extra allocation and satisfy staticcheck QF1012: change the usage of diffContent.WriteString(fmt.Sprintf("--- %s ---\n%s\n", d.path, d.patch)) to a direct fmt.Fprintf call (e.g. fmt.Fprintf(&diffContent, "--- %s ---\n%s\n", d.path, d.patch)) so that diffContent is written to directly; keep the same format string and use d.path and d.patch as arguments.
🧹 Nitpick comments (4)
scrapers/git/filetree.go (1)
104-109: Per-file commit-history lookup is likely to become expensive on large repositories.
`lastModifyingCommit` is invoked for each file, each with a fresh log traversal. This can heavily degrade scrape latency as the repo size grows.
♻️ Suggested direction
```diff
-	lastCommit, err := lastModifyingCommit(repo, commitHash, f.Name)
+	// TODO: avoid per-file history walks; build once-per-commit metadata cache
+	// for touched paths, or make last-modified enrichment optional/configurable.
+	lastCommit, err := lastModifyingCommit(repo, commitHash, f.Name)
```
scrapers/git/scraper_test.go (1)
323-323: Remove unused variable assignment.
The `_ = dir` assignment on line 323 appears to be a leftover from debugging or refactoring. Since `dir` is not used in the remainder of the test, consider removing this line or using the blank identifier directly in the `initTestRepo` call.
🧹 Proposed cleanup
```diff
-	_ = dir
-
-	changes, currentTags := generateTagChanges(repo, "git/test", map[string]string{}, time.Time{})
+	changes, currentTags := generateTagChanges(repo, "git/test", map[string]string{}, time.Time{})
```
Or modify the `initTestRepo` call:
```diff
-	dir, repo := initTestRepo(t)
+	_, repo := initTestRepo(t)
```
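Returning to the per-file history walk flagged for filetree.go above: the once-per-commit cache idea can be sketched in plain Go, independent of go-git. `commitMeta` and `lastModified` here are illustrative names, not types from this PR — the point is a single newest-first pass that records the most recent commit touching each path, replacing one log traversal per file with one traversal per scrape.

```go
package main

import "fmt"

// commitMeta is a hypothetical, simplified view of a commit:
// its SHA plus the set of paths that commit touched.
type commitMeta struct {
	sha     string
	touched []string
}

// lastModified walks commits once (ordered newest first) and returns
// a path -> SHA map of the most recent commit touching each path.
func lastModified(history []commitMeta) map[string]string {
	out := make(map[string]string)
	for _, c := range history {
		for _, p := range c.touched {
			if _, seen := out[p]; !seen {
				out[p] = c.sha // first hit wins: most recent commit
			}
		}
	}
	return out
}

func main() {
	history := []commitMeta{
		{sha: "c3", touched: []string{"README.md"}},
		{sha: "c2", touched: []string{"main.go", "README.md"}},
		{sha: "c1", touched: []string{"main.go", "go.mod"}},
	}
	fmt.Println(lastModified(history))
	// README.md -> c3, main.go -> c2, go.mod -> c1
}
```

With go-git, the touched-path sets would come from diffing each commit against its parent during the same walk, so the cost stays proportional to history size rather than files × history.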
Verify each finding against the current code and only fix it if needed. In `@scrapers/git/scraper_test.go` at line 323, Remove the no-op assignment "_ = dir" in the test; since the variable dir is never used after initTestRepo, either pass the blank identifier directly to initTestRepo or eliminate the dir declaration entirely and rely on initTestRepo's return being ignored — locate the test where initTestRepo is called and the unused variable dir is declared and delete the "_ = dir" statement (or replace the variable with _ in the initTestRepo call).config/schemas/scrape_config.schema.json (2)
1631-1633: Minor: Missing description for `type` field.
The `type` field in the Git definition lacks a description, while the same field in other scraper definitions (e.g., AWS, Azure) includes the description: "A static value or JSONPath expression to use as the type for the resource."
📝 Suggested fix
```diff
 "type": {
-	"type": "string"
+	"type": "string",
+	"description": "A static value or JSONPath expression to use as the type for the resource."
 },
```
Verify each finding against the current code and only fix it if needed. In `@config/schemas/scrape_config.schema.json` around lines 1631 - 1633, Add a "description" property to the "type" field in the Git scraper definition so it matches other scrapers: set the description to "A static value or JSONPath expression to use as the type for the resource." Locate the "type" field inside the Git definition in config/schemas/scrape_config.schema.json (the property named "type") and add the description string alongside the existing "type": "string" entry.
1761-1776: Consider adding an enum constraint for the `strategy` field.
The `strategy` field accepts any string, but based on the Go type definition, only "track", "diff", and "ignore" are valid values. Adding an enum constraint would provide better validation feedback at the schema level.
📝 Suggested enhancement
```diff
 "strategy": {
-	"type": "string"
+	"type": "string",
+	"enum": ["track", "diff", "ignore"],
+	"description": "File handling strategy: track (metadata only), diff (include content changes), ignore (exclude from scraping)"
 }
```
Verify each finding against the current code and only fix it if needed. In `@config/schemas/scrape_config.schema.json` around lines 1761 - 1776, The GitFileRule schema allows any string for the "strategy" property; restrict it to valid values by adding an enum to the "strategy" property listing "track", "diff", and "ignore". Update the "GitFileRule" object schema so its "properties"."strategy" includes an "enum" array with those three strings and keep "type": "string" and existing required/additionalProperties settings unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1/git.go`:
- Around line 5-16: The API currently allows any string for
GitFileRule.Strategy; add explicit validation to reject unknown values by
implementing a strict unmarshaler/validator for GitFileStrategy that checks
incoming JSON/YAML against the allowed constants (GitFileStrategyIgnore,
GitFileStrategyTrack, GitFileStrategyDiff) and returns an error for anything
else, and ensure GitFileRule uses that by either adding a Validate() method on
GitFileRule or relying on GitFileStrategy.UnmarshalJSON/UnmarshalYAML so parsing
fails on invalid inputs; update callers/tests to expect and handle the
validation error.
In `@api/v1/types.go`:
- Line 70: ScraperSpec.IsEmpty() currently ignores the new Git field so a spec
that only sets Git can be reported empty; update ScraperSpec.IsEmpty() to treat
Git as non-empty when s.Git is non-nil and has elements (e.g., check len(s.Git)
> 0) or, if there is a Git.IsEmpty() helper on the element type, iterate and
treat the spec as non-empty if any Git element is non-empty; modify the IsEmpty
logic in ScraperSpec.IsEmpty() to include this Git presence check alongside the
existing field checks.
In `@config/schemas/config_git.schema.json`:
- Around line 295-297: Replace the loose "type": "string" for
GitFileRule.strategy with a JSON Schema enum of the exact supported strategy
values used by the runtime (e.g., the constants/enum used by the Git file
handling code — reference the runtime symbol that defines allowed strategies
such as GitFileRuleStrategy or the constant list in the codebase) so only valid
strategies are accepted; update the schema to use "type":"string" plus
"enum":[...allowed values...] and add a short "description" noting it must match
the runtime's strategy enum.
In `@config/schemas/config_github.schema.json`:
- Around line 337-343: Update the "description" value for the JSONStringMap
schema entry to correct the typo "defiend" to "defined" (entry symbol:
JSONStringMap); locate the JSONStringMap object and replace the misspelled word
in its "description" field so the sentence reads correctly while leaving the
rest of the schema unchanged.
In `@scrapers/git/changes.go`:
- Around line 207-213: The problem is that currentTags is only populated after
applying the since filter so comparing previousTags against this partial map
produces false "Tag deleted" events; move the assignment currentTags[name] =
tag.sha so it runs for every tag (before the if !since.IsZero() &&
tag.when.Before(since) check) and keep the since filter only for emitting change
events (e.g., skip creating "Tag added" or other events when tag.when is before
since). Apply the same fix to the other similar block (the code around lines
229-243) so both deletion detection and event emission use the full currentTags
map while the since filter only controls which events are produced.
---
Duplicate comments:
In `@scrapers/git/changes.go`:
- Around line 78-80: Replace the WriteString(fmt.Sprintf(...)) call to avoid the
extra allocation and satisfy staticcheck QF1012: change the usage of
diffContent.WriteString(fmt.Sprintf("--- %s ---\n%s\n", d.path, d.patch)) to a
direct fmt.Fprintf call (e.g. fmt.Fprintf(&diffContent, "--- %s ---\n%s\n",
d.path, d.patch)) so that diffContent is written to directly; keep the same
format string and use d.path and d.patch as arguments.
In `@scrapers/git/scraper.go`:
- Around line 87-90: The current code assigns repoResult.Tags =
v1.JSONStringMap{...}, which overwrites existing BaseScraper tags and loses
metadata; instead, merge the new keys into the existing repoResult.Tags map
(creating it if nil) and set "url" to gitClient.URL and "defaultBranch" to
gitClient.Branch so existing tags are preserved; apply the same change for the
other occurrence that sets repoResult.Tags (the block around the second
occurrence) so both places merge rather than replace.
- Around line 61-63: loadPreviousBranches currently returns nil and current
branch state isn't persisted, so branch create/delete deltas are recomputed
without a stable baseline; update loadPreviousBranches to read the saved branch
list from repoConfig (or storage used by loadPreviousTags), and after computing
currentBranches in the scraper persist the current branch list back into
repoConfig (the same place loadPreviousBranches reads from) so subsequent runs
have a stable baseline; ensure the same fix is applied wherever
loadPreviousBranches is called (e.g., alongside loadPreviousTags usage and the
other two call sites mentioned) and keep the saved format consistent with
loadPreviousTags to avoid schema mismatches.
---
Nitpick comments:
In `@config/schemas/scrape_config.schema.json`:
- Around line 1631-1633: Add a "description" property to the "type" field in the
Git scraper definition so it matches other scrapers: set the description to "A
static value or JSONPath expression to use as the type for the resource." Locate
the "type" field inside the Git definition in
config/schemas/scrape_config.schema.json (the property named "type") and add the
description string alongside the existing "type": "string" entry.
- Around line 1761-1776: The GitFileRule schema allows any string for the
"strategy" property; restrict it to valid values by adding an enum to the
"strategy" property listing "track", "diff", and "ignore". Update the
"GitFileRule" object schema so its "properties"."strategy" includes an "enum"
array with those three strings and keep "type": "string" and existing
required/additionalProperties settings unchanged.
In `@scrapers/git/scraper_test.go`:
- Line 323: Remove the no-op assignment "_ = dir" in the test; since the
variable dir is never used after initTestRepo, either pass the blank identifier
directly to initTestRepo or eliminate the dir declaration entirely and rely on
initTestRepo's return being ignored — locate the test where initTestRepo is
called and the unused variable dir is declared and delete the "_ = dir"
statement (or replace the variable with _ in the initTestRepo call).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- api/v1/git.go
- api/v1/scrapeconfig_types.go
- api/v1/types.go
- api/v1/zz_generated.deepcopy.go
- chart/crds/configs.flanksource.com_scrapeconfigs.yaml
- config/schemas/config_git.schema.json
- config/schemas/config_github.schema.json
- config/schemas/config_http.schema.json
- config/schemas/scrape_config.schema.json
- fixtures/git.yaml
- go.mod
- scrapers/aws/cloudtrail_test.go
- scrapers/common.go
- scrapers/git/changes.go
- scrapers/git/filetree.go
- scrapers/git/git_integration_test.go
- scrapers/git/git_suite_test.go
- scrapers/git/scraper.go
- scrapers/git/scraper_test.go
- scrapers/trivy/trivy.go
💤 Files with no reviewable changes (1)
- scrapers/trivy/trivy.go
✅ Files skipped from review due to trivial changes (2)
- scrapers/aws/cloudtrail_test.go
- chart/crds/configs.flanksource.com_scrapeconfigs.yaml
🚧 Files skipped from review as they are similar to previous changes (6)
- go.mod
- scrapers/common.go
- api/v1/scrapeconfig_types.go
- api/v1/zz_generated.deepcopy.go
- scrapers/git/git_suite_test.go
- scrapers/git/git_integration_test.go
```go
type GitFileStrategy string

const (
	GitFileStrategyIgnore GitFileStrategy = "ignore"
	GitFileStrategyTrack  GitFileStrategy = "track"
	GitFileStrategyDiff   GitFileStrategy = "diff"
)

type GitFileRule struct {
	Pattern  string          `json:"pattern" yaml:"pattern"`
	Strategy GitFileStrategy `json:"strategy" yaml:"strategy"`
}
```
Constrain GitFileStrategy values at the API layer.
GitFileRule.Strategy currently accepts arbitrary strings. Invalid values can silently bypass expected file handling logic.
🛡️ Proposed fix
```diff
+// +kubebuilder:validation:Enum=ignore;track;diff
 type GitFileStrategy string
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// +kubebuilder:validation:Enum=ignore;track;diff
type GitFileStrategy string

const (
	GitFileStrategyIgnore GitFileStrategy = "ignore"
	GitFileStrategyTrack  GitFileStrategy = "track"
	GitFileStrategyDiff   GitFileStrategy = "diff"
)

type GitFileRule struct {
	Pattern  string          `json:"pattern" yaml:"pattern"`
	Strategy GitFileStrategy `json:"strategy" yaml:"strategy"`
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/v1/git.go` around lines 5 - 16, The API currently allows any string for
GitFileRule.Strategy; add explicit validation to reject unknown values by
implementing a strict unmarshaler/validator for GitFileStrategy that checks
incoming JSON/YAML against the allowed constants (GitFileStrategyIgnore,
GitFileStrategyTrack, GitFileStrategyDiff) and returns an error for anything
else, and ensure GitFileRule uses that by either adding a Validate() method on
GitFileRule or relying on GitFileStrategy.UnmarshalJSON/UnmarshalYAML so parsing
fails on invalid inputs; update callers/tests to expect and handle the
validation error.
```go
GCP  []GCP  `json:"gcp,omitempty" yaml:"gcp,omitempty"`
AWS  []AWS  `json:"aws,omitempty" yaml:"aws,omitempty"`
File []File `json:"file,omitempty" yaml:"file,omitempty"`
Git  []Git  `json:"git,omitempty" yaml:"git,omitempty"`
```
ScraperSpec.IsEmpty() is now inconsistent with the new Git field.
Git-only scraper specs can be misclassified as empty because IsEmpty() still ignores Git.
🔧 Proposed fix
```diff
 func (c ScraperSpec) IsEmpty() bool {
-	return len(c.AWS) == 0 && len(c.File) == 0
+	return len(c.AWS) == 0 && len(c.File) == 0 && len(c.Git) == 0
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/v1/types.go` at line 70, ScraperSpec.IsEmpty() currently ignores the new
Git field so a spec that only sets Git can be reported empty; update
ScraperSpec.IsEmpty() to treat Git as non-empty when s.Git is non-nil and has
elements (e.g., check len(s.Git) > 0) or, if there is a Git.IsEmpty() helper on
the element type, iterate and treat the spec as non-empty if any Git element is
non-empty; modify the IsEmpty logic in ScraperSpec.IsEmpty() to include this Git
presence check alongside the existing field checks.
```json
"strategy": {
  "type": "string"
}
```
Restrict GitFileRule.strategy to supported values.
Using plain "type": "string" allows invalid strategies that runtime logic won’t handle consistently.
🛡️ Proposed fix
```diff
 "strategy": {
-	"type": "string"
+	"type": "string",
+	"enum": ["ignore", "track", "diff"]
 }
```
📝 Committable suggestion
```json
"strategy": {
  "type": "string",
  "enum": ["ignore", "track", "diff"]
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/schemas/config_git.schema.json` around lines 295 - 297, Replace the
loose "type": "string" for GitFileRule.strategy with a JSON Schema enum of the
exact supported strategy values used by the runtime (e.g., the constants/enum
used by the Git file handling code — reference the runtime symbol that defines
allowed strategies such as GitFileRuleStrategy or the constant list in the
codebase) so only valid strategies are accepted; update the schema to use
"type":"string" plus "enum":[...allowed values...] and add a short "description"
noting it must match the runtime's strategy enum.
```json
"JSONStringMap": {
  "additionalProperties": {
    "type": "string"
  },
  "type": "object",
  "description": "JSONStringMap defiend JSON data type, need to implements driver.Valuer, sql.Scanner interface"
},
```
Minor typo in description.
The description contains a typo: "defiend" should be "defined".
📝 Suggested fix
```diff
 "JSONStringMap": {
 	"additionalProperties": {
 		"type": "string"
 	},
 	"type": "object",
-	"description": "JSONStringMap defiend JSON data type, need to implements driver.Valuer, sql.Scanner interface"
+	"description": "JSONStringMap defined JSON data type, needs to implement driver.Valuer, sql.Scanner interface"
 },
```
📝 Committable suggestion
```json
"JSONStringMap": {
  "additionalProperties": {
    "type": "string"
  },
  "type": "object",
  "description": "JSONStringMap defined JSON data type, needs to implement driver.Valuer, sql.Scanner interface"
},
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/schemas/config_github.schema.json` around lines 337 - 343, Update the
"description" value for the JSONStringMap schema entry to correct the typo
"defiend" to "defined" (entry symbol: JSONStringMap); locate the JSONStringMap
object and replace the misspelled word in its "description" field so the
sentence reads correctly while leaving the rest of the schema unchanged.
```go
if !since.IsZero() && tag.when.Before(since) {
	return nil
}

currentTags[name] = tag.sha

if _, existed := previousTags[name]; !existed {
```
Tag deletion detection is incorrect when since is set.
You only add tags to currentTags after the since filter, then compare previousTags against this partial map. That can emit false “Tag deleted” changes for tags that still exist but are older than since.
🐛 Proposed fix
```diff
 tag, ok := resolveTag(repo, ref)
 if !ok {
 	return nil
 }
+// Always track full current state for accurate deletion detection.
+currentTags[name] = tag.sha
+
 if !since.IsZero() && tag.when.Before(since) {
 	return nil
 }
-currentTags[name] = tag.sha
-
 if _, existed := previousTags[name]; !existed {
 	now := time.Now()
 	changes = append(changes, v1.ChangeResult{
```
Also applies to: 229-243
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapers/git/changes.go` around lines 207 - 213, The problem is that
currentTags is only populated after applying the since filter so comparing
previousTags against this partial map produces false "Tag deleted" events; move
the assignment currentTags[name] = tag.sha so it runs for every tag (before the
if !since.IsZero() && tag.when.Before(since) check) and keep the since filter
only for emitting change events (e.g., skip creating "Tag added" or other events
when tag.when is before since). Apply the same fix to the other similar block
(the code around lines 229-243) so both deletion detection and event emission
use the full currentTags map while the since filter only controls which events
are produced.
Summary by CodeRabbit
New Features
Documentation
Tests
Chores