Skip to content

Full headless mode support#1396

Merged
ehsandeep merged 29 commits intodevfrom
new-full-headless-mode
Dec 31, 2025
Merged

Full headless mode support#1396
ehsandeep merged 29 commits intodevfrom
new-full-headless-mode

Conversation

@ehsandeep
Copy link
Member

@ehsandeep ehsandeep commented Sep 5, 2025

Summary by CodeRabbit

  • New Features

    • Full headless crawler and optional hybrid headless mode; new CLI flags for hybrid, max-failure-count, and diagnostics; browser pooling, element/form interactions, request interception with cookie-consent blocking, diagnostics writer, and live crawl debugger HTTP endpoint.
  • Improvements

    • Enhanced page-ready heuristics, DOM/text normalization pipeline, SimHash near-duplicate detection, navigation-back strategies, safer form-fill handling, and filtered parser accumulation.
  • Bug Fixes

    • Filters out data:, mailto:, and javascript: navigation schemes.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 5, 2025

Walkthrough

Adds a new headless crawling subsystem: CLI/headless flags and options, a Rod-based browser launcher with pooled BrowserPage instances, JS instrumentation and element discovery, DOM/text normalizers with SimHash dedupe, graph-backed page-state management, diagnostics, cookie-consent request blocking, crawl debugger, parser filtering, form-filling, and related tests.

Changes

Cohort / File(s) Summary
CLI & Runner
cmd/katana/main.go, internal/runner/options.go, internal/runner/runner.go, pkg/types/options.go, pkg/types/crawler_options.go
Add -hybrid/-hh flag and HeadlessHybrid option; add --max-failure-count/-mfc and --enable-diagnostics/-ed with MaxFailureCount and EnableDiagnostics; update headless flag text; enforce mutual exclusivity between headless/hybrid; route to headless engine; add CrawlerOptions.Logger and CrawlerOptions.ChromeUser.
Dependencies
go.mod
Update and bump multiple dependencies (notably github.com/go-rod/rod and several indirect/tooling modules).
Headless Orchestrator & API
pkg/engine/headless/headless.go, pkg/engine/headless/debugger.go
New Headless orchestrator: Crawl/Close, scope validation, deduplication, optional CrawlDebugger HTTP tracker for active URLs.
Browser Launcher & Pages
pkg/engine/headless/browser/browser.go
New Launcher and BrowserPage with browser pool, per-page data dirs, launch flags, wait heuristics, dialog/fetch interception, request/response callback plumbing, and lifecycle management.
Element Discovery & JS Injection
pkg/engine/headless/browser/element.go, pkg/engine/headless/js/utils.js, pkg/engine/headless/js/page-init.js, pkg/engine/headless/js/js.go
Add element/form/event-listener extraction, navigation discovery (forms/buttons/links/event listeners), NavigatedLink type, embedded JS bundles and InitJavascriptEnv injection.
Cookie-consent Blocking
pkg/engine/headless/browser/cookie/cookie.go, pkg/engine/headless/browser/cookie/cookie_test.go
Add JSON-driven cookie-consent request-blocking rules via go:embed; ShouldBlockRequest API and unit tests.
Crawler Core & Form Fill
pkg/engine/headless/crawler/crawler.go, pkg/engine/headless/crawler/formfill.go
New queue-driven Crawler supporting Load/Click/FillForm execution, page-state construction, diagnostics hooks, MaxDepth/MaxFailureCount handling, and form-filling/submission logic.
Diagnostics Writer
pkg/engine/headless/crawler/diagnostics/diagnostics.go
Disk-backed Writer interface and concrete writer to persist actions, page states, navigations, screenshots; NewWriter.
Normalizer & SimHash
pkg/engine/headless/crawler/normalizer/dom_utils.go, .../helpers.go, normalizer.go, text_utils.go, .../simhash/simhash.go (+ tests)
DOMNormalizer, TextNormalizer, composite Normalizer, SimHash fingerprinting and Oracle, date/time registry, text patterns, and unit tests.
Crawler State & Graph
pkg/engine/headless/crawler/state.go, pkg/engine/headless/graph/graph.go, pkg/engine/headless/crawler/state_test.go
PageState creation (stripped DOM, SHA-256, SimHash), navigation-back strategies (element reuse/history/shortest-path), CrawlGraph with page-state vertices and action-labeled edges.
Types
pkg/engine/headless/types/types.go
New core types: PageState, Action, HTMLElement, HTMLForm, EventListener, action/navigation enums, hashing helpers, and diagnostic flag.
Parser Filtering & Tests
pkg/engine/parser/parser.go, pkg/engine/parser/parser_test.go
Add appendFiltered and isValidNavigationRequest to reject data:, mailto:, javascript: schemes; add unit test.
Normalizer Helpers & Tests
pkg/engine/headless/crawler/normalizer/helpers.go, .../dom_utils_test.go, .../text_utils_test.go, .../simhash/simhash_test.go
Date/time pattern registry and unit tests validating DOM/text normalization and simhash behavior.
Utils
pkg/utils/extensions/extensions.go
Short-circuit on invalid URL parse in path validator (early return false).
Design Notes
pkg/engine/headless/TODOS.md
New design doc describing state-manager architecture, fingerprinting strategy, WaitForRouteChange, and staged rollout plan.
Minor main.go change
cmd/katana/main.go
Reworked pprof shutdown defer and added a no-op comment about profiling env check.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as cmd/katana
  participant Runner as internal/runner
  participant Headless as headless.Engine
  participant Hybrid as hybrid.Engine
  participant Standard as standard.Engine

  User->>CLI: katana [flags]
  CLI->>Runner: New(options)
  alt options.Headless
    Runner->>Headless: headless.New(crawlerOptions)
    Headless-->>Runner: Headless instance
  else options.HeadlessHybrid
    Runner->>Hybrid: hybrid.New(crawlerOptions)
    Hybrid-->>Runner: Hybrid instance
  else
    Runner->>Standard: standard.New(crawlerOptions)
    Standard-->>Runner: Standard instance
  end
  Runner-->>User: runner instance
Loading
sequenceDiagram
  autonumber
  participant Headless as headless.Headless
  participant Crawler as headless.Crawler
  participant Launcher as browser.Launcher
  participant Page as BrowserPage
  participant JS as JS Env
  participant Diag as Diagnostics
  participant Out as OutputWriter

  Headless->>Crawler: New(opts)
  Crawler->>Launcher: NewLauncher(opts)
  Launcher-->>Crawler: *Launcher
  Headless->>Crawler: Crawl(URL)
  Crawler->>Launcher: GetPageFromPool()
  Launcher-->>Crawler: *BrowserPage
  Crawler->>JS: InitJavascriptEnv(Page)
  Crawler->>Page: Navigate + Wait heuristics
  Crawler->>Diag: LogAction / LogPageState / LogNavigations
  Crawler->>Out: RequestCallback(result)
  Headless->>Headless: performAdditionalAnalysis()
  Headless->>Out: Emit new requests
  Crawler->>Launcher: PutBrowserToPool(Page)
  Headless-->>Caller: done/error
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Poem

I’m a quick-footed rabbit with a dev’s delight,
I sniff the DOM by day and the hashes by night.
I hop through links, block cookie pleas,
Pool browsers, log states, and prune noisy trees.
Diagnostics gleam — hop on, we’ll crawl just right. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.54% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Full headless mode support' directly aligns with the main objective of adding comprehensive headless crawling capabilities to Katana, covering the substantial new headless engine components, options, and browser integration.
✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@Ice3man543
Copy link
Member

@coderabbitai review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 5, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 42

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/runner/runner.go (1)

98-105: Enforce mutual exclusivity of Headless flags
In internal/runner/options.go (e.g. in ValidateOptions), add a check to error when both options.Headless and options.HeadlessHybrid are true:

if options.Headless && options.HeadlessHybrid {
    return fmt.Errorf("flags -hl and -hh are mutually exclusive")
}

This ensures ambiguous flag combinations fail fast rather than silently defaulting to Headless.

🧹 Nitpick comments (71)
.gitignore (2)

10-10: Confirm intent: ignoring .devcontainer hides devcontainer configs repo-wide.
These are typically committed for reproducible dev envs. If configs are intended to be versioned, drop this ignore.

Apply if you want to keep devcontainer configs tracked:

-.devcontainer

5-6: Duplicate ignore entry for katana_*/.
Remove the duplicate to keep the file tidy.

 katana_*/
-katana_*/
pkg/utils/extensions/extensions.go (1)

41-46: Avoid dropping valid relative paths on parse errors; fall back to path.Ext() instead of returning false

Short-circuiting here changes behavior and can filter out legitimate relative paths or slightly malformed-yet-parseable items. Prefer trimming, attempting parse, and on error falling back to extension extraction from the raw item.

Apply:

 func (e *Validator) ValidatePath(item string) bool {
-	var extension string
-	u, err := urlutil.Parse(item)
-	if err != nil {
-		gologger.Warning().Msgf("validatepath: failed to parse url %v got %v", item, err)
-		return false
-	}
+	var extension string
+	item = strings.TrimSpace(item)
+	u, err := urlutil.Parse(item)
+	if err != nil {
+		// noisy in hot paths; consider downgrading to Debug if this is expected
+		gologger.Warning().Msgf("validatepath: failed to parse url %v got %v", item, err)
+		extension = strings.ToLower(path.Ext(item))
+	} else {
+		if u.Path != "" {
+			extension = strings.ToLower(path.Ext(u.Path))
+		} else {
+			extension = strings.ToLower(path.Ext(item))
+		}
+		return e.validateExt(extension)
+	}
-	if u.Path != "" {
-		extension = strings.ToLower(path.Ext(u.Path))
-	} else {
-		extension = strings.ToLower(path.Ext(item))
-	}
+	return e.validateExt(extension)
+}
+
+// small helper to keep main flow simple
+func (e *Validator) validateExt(extension string) bool {
 	if extension == "" && len(e.extensionsMatch) > 0 {
 		return false
 	}
 	if len(e.extensionsMatch) > 0 {
 		if _, ok := e.extensionsMatch[extension]; ok {
 			return true
 		}
 		return false
 	}
 
 	if _, ok := e.extensionsFilter[extension]; ok {
 		return false
 	}
 	return true
 }

If you prefer not to touch structure, minimally replace the return false with extension = strings.ToLower(path.Ext(item)) and let the existing logic proceed.

pkg/types/crawler_options.go (1)

40-44: LGTM: Optional Logger and ChromeUser fields fit the new headless path

No functional risk. Consider wiring sensible defaults (e.g., a text handler at Info when verbose/diagnostics are enabled) in NewCrawlerOptions so downstream consumers don’t need nil checks everywhere.

pkg/engine/parser/parser_test.go (1)

527-566: Great coverage; add uppercase/whitespace variants to lock the filter

Add a few cases (DATA:, MAILTO:, JAVASCRIPT:, with leading spaces) to prevent regressions and validate case-insensitive + trim change.

Apply:

 		htmlContent := `
 			<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==">
 			<a href="data:text/html,<h1>Hello</h1>">Data Link</a>
+			<a href=" DATA:text/html,<h1>WS</h1>">WS Data Link</a>
 			<link href="data:text/css,body{color:red}">
 			<script src="data:application/javascript,console.log('test')"></script>
+			<script src=" JAVASCRIPT:alert(1) "></script>
 			<iframe src="data:text/html,<p>Test</p>"></iframe>
 			<object data="data:application/pdf,test"></object>
 			<embed src="data:application/x-shockwave-flash,test">
 			<audio src="data:audio/mp3,test"></audio>
 			<video src="data:video/mp4,test"></video>
 			<a href="mailto:test@example.com">Email Link</a>
+			<a href="MAILTO:test@example.com">Email Link 2</a>
 			<a href="javascript:alert('test')">JS Link</a>
+			<a href="JavaScript:alert('test')">JS Link 2</a>
 		`
pkg/engine/headless/crawler/normalizer/helpers.go (1)

3-44: Broaden date/time coverage and reduce false positives; consider precompilation

Nice registry. Consider:

  • Add ISO-8601 and ordinal suffixes (st/nd/rd) to improve recall.
  • Use raw-string literals for readability.
  • Precompile once ([]*regexp.Regexp) where consumed to avoid repeated compile cost.

Apply (illustrative additions):

 var dateTimePatterns = []string{
@@
 	/* long months */
 	"[0-9]{1,2} [a-zA-Z]{3,} [0-9]{4}",
-	"[0-9]{1,2}th [a-zA-Z]{3,} [0-9]{4}",
-	"[0-9]{1,2}th [a-zA-Z]{3,}",
+	"[0-9]{1,2}(st|nd|rd|th) [a-zA-Z]{3,} [0-9]{4}",
+	"[0-9]{1,2}(st|nd|rd|th) [a-zA-Z]{3,}",
@@
 	/* Times */
 	"[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}( )?(pm|PM|am|AM)",
 	"[0-9]{1,2}:[0-9]{1,2}( )?(pm|PM|am|AM)",
 	"[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}",
 	"[0-9]{1,2}:[0-9]{1,2}",
+	/* ISO-8601 */
+	"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})?",
 }
pkg/engine/headless/js/page-init.js (4)

59-69: Preserve instanceof and static props for WebSocket/EventSource wrappers.

Reassigning constructors without copying prototype/static fields can break instanceof checks and constants.

-      var oldWebSocket = window.WebSocket;
-      window.WebSocket = function (url, arg) {
+      var oldWebSocket = window.WebSocket;
+      function WebSocketHook(url, arg) {
         window.__navigatedLinks.push({ url: url, source: "websocket" });
         return new oldWebSocket(url, arg);
-      };
+      }
+      WebSocketHook.prototype = oldWebSocket.prototype;
+      Object.setPrototypeOf(WebSocketHook, oldWebSocket);
+      window.WebSocket = WebSocketHook;
@@
-      var oldEventSource = window.EventSource;
-      window.EventSource = function (url, arg) {
+      var oldEventSource = window.EventSource;
+      function EventSourceHook(url, arg) {
         window.__navigatedLinks.push({ url: url, source: "eventsource" });
         return new oldEventSource(url, arg);
-      };
+      }
+      EventSourceHook.prototype = oldEventSource.prototype;
+      Object.setPrototypeOf(EventSourceHook, oldEventSource);
+      window.EventSource = EventSourceHook;

84-91: Blocking form reset/close is intrusive; make it opt-in and reversible.

Hard-disabling form reset and window.close can break sites during crawling.

  • Behind a flag (e.g., opts.preventFormReset/opts.preventClose).
  • Keep originals in closures and expose an unhook() to restore.

Also applies to: 93-98


105-112: Timer acceleration: handle non-numeric/large delays and clamp.

Multiplying undefined/NaN by 0.1 yields NaN; also consider minimum clamping.

-      const speedUpFactor = 0.1; // For example, 10 times faster
+      const speedUpFactor = 0.1; // 10x faster
+      const clamp = (d) => {
+        const n = Number(d);
+        if (!Number.isFinite(n)) return 0;
+        const sped = Math.max(0, Math.floor(n * speedUpFactor));
+        return Math.max(0, sped);
+      };
@@
-      window.setTimeout = function (callback, delay, ...args) {
-        return originalSetTimeout(callback, delay * speedUpFactor, ...args);
+      window.setTimeout = function (callback, delay, ...args) {
+        return originalSetTimeout(callback, clamp(delay), ...args);
       };
-      window.setInterval = function (callback, delay, ...args) {
-        return originalSetInterval(callback, delay * speedUpFactor, ...args);
+      window.setInterval = function (callback, delay, ...args) {
+        return originalSetInterval(callback, clamp(delay), ...args);
       };

155-159: Hooks are disabled; add a flag-driven initializer so callers can opt-in granularly.

Provide a single entry point like pageInitAndHook({ events: true, navigations: true, timers: true, guards: true }).

I can wire a small options parser and idempotent init/unhook helpers if you want.

go.mod (1)

12-12: QUIC/utls/rod upgrades: check minimum Go and transitive constraints.

quic-go and utls occasionally bump min-Go/crypto requirements. Given Go 1.24 here, you should be safe; still watch for CI across platforms.

  • Add a smoke test job that runs “go build ./...” on linux/windows/darwin with Go 1.24.2 to surface toolchain incompatibilities early.

Also applies to: 89-92

pkg/engine/headless/crawler/normalizer/text_utils.go (2)

45-45: Remove redundant shadowing.

pattern := pattern inside loops has no effect.

-    pattern := pattern

Also applies to: 58-58


56-62: Whitespace normalization (optional).

After removals, leftover double spaces/newlines can remain. Optionally collapse whitespace.

I can add a post-pass to compress multiple spaces and trim if desired.

pkg/engine/headless/crawler/normalizer/text_utils_test.go (2)

22-29: Make the test case name descriptive.

Improve signal when it fails.

-            name: "test",
+            name: "removes email inside anchor text",

34-38: Avoid depending on unexported fields in tests.

If TextNormalizer’s internals change, this test breaks. Prefer a constructor or exported option.

I can add a NewTextNormalizer(patterns ...*regexp.Regexp) and update the test—want me to draft it?

pkg/engine/headless/js/utils.js (4)

64-80: This only captures inline on* handlers, not addEventListener listeners.

Name/doc can mislead. Either document the limitation or hook EventTarget.prototype.addEventListener in page-init.js to record listeners (recommended).

I can wire a lightweight hook into page-init.js to track listeners without heavy scans—want a patch?


85-103: Broaden pseudo-form detection (optional).

Consider also [role="form"] and common wrapper patterns to improve coverage.

-      const pseudoForms = document.querySelectorAll("div.form");
+      const pseudoForms = document.querySelectorAll('div.form, [role="form"]');

15-28: Capture checkbox/radio state.

Including checked improves form introspection accuracy.

       window._elementDataFromElement = function (el) {
         return {
@@
-        value: el.value != null ? String(el.value) : '',
+        value: el.value != null ? String(el.value) : '',
+        checked: typeof el.checked === 'boolean' ? el.checked : undefined,

3-12: Make utils initialization idempotent.

Prevent re-definitions when InitJavascriptEnv runs multiple times.

-(function initUtilityFunctions() {
+(function initUtilityFunctions() {
+    if (window.__katanaUtilsInit) { return; }
+    window.__katanaUtilsInit = true;
pkg/engine/headless/js/js.go (1)

18-27: Keep remove handles from EvalOnNewDocument (optional cleanup).

Rod returns a remove func; storing it allows cleanup on page teardown to avoid duplicate injections on reinits.

-func InitJavascriptEnv(page *rod.Page) error {
-    if _, err := page.EvalOnNewDocument(utilsJavascriptBundle); err != nil {
+func InitJavascriptEnv(page *rod.Page) error {
+    if _, err := page.EvalOnNewDocument(utilsJavascriptBundle); err != nil {
         return errors.Wrap(err, "failed to inject utils.js")
     }
-    if _, err := page.EvalOnNewDocument(pageInitJavascriptBundle); err != nil {
+    if _, err := page.EvalOnNewDocument(pageInitJavascriptBundle); err != nil {
         return errors.Wrap(err, "failed to inject page-init.js")
     }
     return nil
 }

FYI: EvalOnNewDocument returns (remove func() error, err error). (github.com)

internal/runner/runner.go (1)

106-108: Fix misleading error message when creating crawler.

The message hardcodes “standard crawler” even for headless/hybrid errors. Prefer a neutral message.

-    return nil, errorutil.NewWithErr(err).Msgf("could not create standard crawler")
+    return nil, errorutil.NewWithErr(err).Msgf("could not create crawler")
internal/runner/options.go (2)

28-33: Consider applying the -aff auto-disable to hybrid as well (if hybrid uses headless navigation).

If hybrid mode executes navigation via a headless browser, the same rationale applies.

-    if options.Headless && options.AutomaticFormFill {
+    if (options.Headless || options.HeadlessHybrid) && options.AutomaticFormFill {
         options.AutomaticFormFill = false
         gologger.Info().Msgf("Automatic form fill (-aff) has been disabled for headless navigation.")
     }

38-42: Validate SystemChromePath is executable (not just present).

Helps fail early on directories/wrong files.

     if options.SystemChromePath != "" {
-        if !fileutil.FileExists(options.SystemChromePath) {
+        if !fileutil.FileExists(options.SystemChromePath) {
             return errorutil.New("specified system chrome binary does not exist")
         }
+        if fi, err := os.Stat(options.SystemChromePath); err == nil && fi.Mode()&0111 == 0 {
+            return errorutil.New("specified system chrome binary is not executable")
+        }
     }
pkg/types/options.go (1)

104-106: Clarify HeadlessHybrid precedence in docs and CLI help: when both --headless and --headless-hybrid are set, --headless takes precedence; update the HeadlessHybrid flag description in pkg/types/options.go (and the generated CLI help) to explain their difference and this precedence.

pkg/engine/headless/browser/cookie/cookie_test.go (1)

9-49: Solid table-driven coverage for consent blocking.

Tests read clearly and validate the core rule-matching paths.

Consider adding cases for:

  • A blocked URL with a non-script resource type to verify resource-type gating.
  • Precedence when both Included and Excluded initiator domains could match.

Example additional case:

@@ func TestShouldBlockRequest(t *testing.T) {
         {
             name:            "should not block non-matching URL",
             url:             "https://example.com/other-path",
             resourceType:    proto.NetworkResourceTypeScript,
             initiatorDomain: "hackerone.com",
             want:            false,
         },
+        {
+            name:            "should not block non-script resource",
+            url:             "https://consent.trustarc.com/notice?domain=hackerone.com",
+            resourceType:    proto.NetworkResourceTypeImage,
+            initiatorDomain: "hackerone.com",
+            want:            false,
+        },
pkg/engine/headless/crawler/normalizer/dom_utils_test.go (1)

65-76: Tiny cleanup: remove unused local ‘d’ variable in subtests.

Not functionally wrong; simplifies the subtest body.

-        t.Run(tt.name, func(t *testing.T) {
-            d := normalizer
-            got, err := d.Apply(tt.args.content)
+        t.Run(tt.name, func(t *testing.T) {
+            got, err := normalizer.Apply(tt.args.content)
pkg/engine/headless/crawler/state_test.go (3)

12-14: Implement or skip the placeholder test

Empty test will silently pass; either implement or mark it skipped to avoid false positives.

Apply this diff:

 func TestPageFingerprint_Stability(t *testing.T) {
-
+	t.Skip("TODO: implement stability test for page fingerprints")
 }

3-10: Drop pkg/errors in tests; use stdlib error wrapping

Avoid extra dependency in tests; use fmt.Errorf with %w.

Apply this diff:

 import (
+	"fmt"
 	"strings"
 	"testing"

-	"github.com/pkg/errors"
 	"github.com/projectdiscovery/katana/pkg/engine/headless/crawler/normalizer/simhash"
 	"github.com/stretchr/testify/assert"
 )
-			return "", errors.Wrap(err, "could not get stripped dom")
+			return "", fmt.Errorf("could not get stripped dom: %w", err)

Also applies to: 84-84


90-104: Parallelize subtests for speed

These table-driven tests are independent; run them in parallel.

Apply this diff:

 		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
 		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()

Also applies to: 145-172

cmd/katana/main.go (3)

75-79: Remove nested defer; call Stop directly inside the wrapper

The inner defer is redundant; simplify for clarity.

Apply this diff:

-	defer func() {
-		if pprofServer != nil {
-			defer pprofServer.Stop()
-		}
-	}()
+	defer func() {
+		if pprofServer != nil {
+			pprofServer.Stop()
+		}
+	}()

49-50: Resolve TODO or delete the comment

Either wire an env-based toggle here or drop the stale comment to avoid confusion.


143-156: Clarify/validate headless vs hybrid flag interplay

If both -headless and -hybrid are set, define precedence or error out. Help strings could also hint at mutual exclusivity.

Add a post-parse validation (outside this hunk), e.g.:

// after flagSet.Parse() and config merge
if options.Headless && options.HeadlessHybrid {
	return nil, errorutil.New("flags -headless and -hybrid are mutually exclusive")
}
pkg/engine/headless/crawler/normalizer/simhash/simhash_test.go (2)

15-27: Avoid asserting undocumented “shingle=0 fallback to 1” behavior

If Fingerprint doesn’t guarantee this, the test is brittle. Either document it in code or adjust the test to just check determinism with a fixed shingle size.

Apply this diff:

-// TestFingerprintDeterministic ensures that hashing the same document twice produces
-// identical fingerprints and that a shingle value of 0 gracefully falls back to 1.
+// TestFingerprintDeterministic ensures that hashing the same document twice produces
+// identical fingerprints.
 func TestFingerprintDeterministic(t *testing.T) {
 	in1 := strings.NewReader(htmlA)
 	in2 := strings.NewReader(htmlA)

-	fp1 := Fingerprint(in1, 0)
-	fp2 := Fingerprint(in2, 1)
+	fp1 := Fingerprint(in1, 3)
+	fp2 := Fingerprint(in2, 3)

18-27: Run tests in parallel

Independent tests; parallelize to reduce runtime.

Apply this diff:

 func TestFingerprintDeterministic(t *testing.T) {
+	t.Parallel()
 func TestFingerprintSimilarity(t *testing.T) {
+	t.Parallel()
 func TestOracle(t *testing.T) {
+	t.Parallel()

Also applies to: 32-44, 48-72

pkg/engine/headless/headless.go (3)

27-44: Consider parameterizing the debugger port

The debugger port is hardcoded to 8089 on line 40. This could cause conflicts if multiple instances are running or if the port is already in use.

-		headless.debugger = NewCrawlDebugger(8089)
+		debuggerPort := 8089 // Consider making this configurable
+		if options.Options.DebuggerPort > 0 {
+			debuggerPort = options.Options.DebuggerPort
+		}
+		headless.debugger = NewCrawlDebugger(debuggerPort)

100-128: TODO comment indicates multi-threading is planned

Line 128 contains a TODO comment about making crawling multi-threaded. Currently, concurrency is hardcoded to 1.

Do you want me to create an issue to track the multi-threaded crawling implementation or generate a solution that implements concurrent crawling?


118-120: Consider making response body clearing configurable

The response body and raw content are unconditionally cleared on lines 118-119. This might not be desirable for all use cases, especially when debugging or when the output writer needs this data.

-			rr.Response.Raw = ""
-			rr.Response.Body = ""
+			if !h.options.Options.KeepResponseBody {
+				rr.Response.Raw = ""
+				rr.Response.Body = ""
+			}
pkg/engine/headless/crawler/state.go (4)

19-20: Consider making SimHash threshold configurable

The SimHash threshold is hardcoded to 2 bits. Different use cases might require different similarity thresholds.

-const simhashThreshold = 2 // Allow up to 2 bits difference
+// DefaultSimHashThreshold is the default threshold for SimHash distance comparison
+const DefaultSimHashThreshold = 2 // Allow up to 2 bits difference

Then use it from configuration if available, falling back to the default.


72-74: Consider handling more blank page patterns

The function only checks for empty URL and "about:blank", but there are other blank page patterns like "chrome://newtab" or "data:," that might need handling.

-	if pageInfo.URL == "" || pageInfo.URL == "about:blank" {
+	blankPages := []string{"", "about:blank", "about:newtab", "chrome://newtab/", "data:,"}
+	for _, blank := range blankPages {
+		if pageInfo.URL == blank {
+			return nil, ErrEmptyPage
+		}
+	}
-		return nil, ErrEmptyPage
-	}

203-205: FIXME comment needs attention

There's a FIXME comment indicating that the origin element ID should be returned for correct graph representation.

The FIXME comment on lines 203-205 indicates a potential issue with graph representation. Would you like me to create an issue to track this or help implement a solution?


223-227: Inefficient loop for browser navigation

The loop navigates back one step at a time. This could be optimized by navigating multiple steps at once if the browser API supports it.

-	for i := 0; i < stepsBack; i++ {
-		if err := page.NavigateBack(); err != nil {
-			return "", err
-		}
-		navigatedSuccessfully = true
-	}
+	// Try to navigate back multiple steps at once if supported
+	if err := page.NavigateBackSteps(stepsBack); err != nil {
+		// Fallback to step-by-step navigation
+		for i := 0; i < stepsBack; i++ {
+			if err := page.NavigateBack(); err != nil {
+				return "", err
+			}
+		}
+	}
+	navigatedSuccessfully = true

Note: This assumes NavigateBackSteps method can be added to the browser API.

pkg/engine/headless/crawler/normalizer/normalizer.go (3)

88-92: Error handling silently returns original on parse failure

When strconv.ParseInt fails in replaceHexEscapeSequence, the function silently returns the original match. This could hide malformed input issues.

Consider logging the error for debugging purposes:

 	value, err := strconv.ParseInt(code, 16, 32)
 	if err != nil {
+		// Log the error for debugging if a logger is available
+		// log.Debug("Failed to parse hex escape sequence", "sequence", match, "error", err)
 		// If there's an error, return the original match
 		return match
 	}

97-98: Consider caching compiled regex patterns

The regex pattern is compiled at package initialization but could benefit from being unexported since it's only used internally.

-// Define the regex pattern to match hexadecimal escape sequences
-var pattern = regexp.MustCompile(`\\x[0-9a-fA-F]{2}`)
+// hexEscapePattern matches hexadecimal escape sequences
+var hexEscapePattern = regexp.MustCompile(`\\x[0-9a-fA-F]{2}`)

And update line 101:

-	return pattern.ReplaceAllStringFunc(input, func(match string) string {
+	return hexEscapePattern.ReplaceAllStringFunc(input, func(match string) string {

45-54: Missing validation for empty input

The Apply method doesn't check if the input text is empty before processing, which could lead to unnecessary processing.

 func (n *Normalizer) Apply(text string) (string, error) {
+	if text == "" {
+		return "", nil
+	}
+	
 	first := normalizeDocument(text)
 
 	firstpass, err := n.dom.Apply(first)
pkg/engine/headless/graph/graph.go (3)

46-48: Fix doc comment to match function name.

Apply:

-// AddNavigation adds a navigation to the graph
+// AddPageState adds a page state (and its incoming navigation edge, if any) to the graph.

34-44: Return empty slice on error instead of nil.

This avoids nil-slice surprises for callers and remains zero-value friendly.

Apply:

 func (g *CrawlGraph) GetVertices() []string {
 	vertices := []string{}
 	adjacencyMap, err := g.graph.AdjacencyMap()
 	if err != nil {
-		return nil
+		return vertices
 	}

102-108: Return by value or document copy semantics.

&pageVertex returns a pointer to a copy, not the stored vertex; callers may expect in-place mutation. Prefer returning by value.

Example:

-func (g *CrawlGraph) GetPageState(id string) (*types.PageState, error) {
+func (g *CrawlGraph) GetPageState(id string) (types.PageState, error) {
 	pageVertex, err := g.graph.Vertex(id)
 	if err != nil {
-		return nil, errors.Wrap(err, "could not get vertex")
+		return types.PageState{}, errors.Wrap(err, "could not get vertex")
 	}
-	return &pageVertex, nil
+	return pageVertex, nil
 }
pkg/engine/headless/crawler/normalizer/simhash/simhash.go (3)

21-26: Cap text feature size to bound memory/CPU.

Long text/comment tokens can balloon feature strings. Cap to a reasonable length.

Apply:

 // Constants for optimization
 const (
 	maxTokens       = 5000
 	featuresBufSize = 1000
+	maxTextLen      = 256
 )

92-127: Trim large text/comment payloads and avoid fmt.Sprintf hot path.

Reduces allocations and GC churn in tight loops.

Apply:

 func extractFeatures(t *html.Token, features *[]string) {
 	// Pre-allocate string builder for efficiency
 	var s string
 
 	switch t.Type {
 	case html.StartTagToken:
 		s = "A:" + t.DataAtom.String()
 	case html.EndTagToken:
 		s = "B:" + t.DataAtom.String()
 	case html.SelfClosingTagToken:
 		s = "C:" + t.DataAtom.String()
 	case html.DoctypeToken:
-		s = "D:" + string(t.Data)
+		td := t.Data
+		if len(td) > maxTextLen {
+			td = td[:maxTextLen]
+		}
+		s = "D:" + td
 	case html.CommentToken:
-		s = "E:" + string(t.Data)
+		td := t.Data
+		if len(td) > maxTextLen {
+			td = td[:maxTextLen]
+		}
+		s = "E:" + td
 	case html.TextToken:
-		s = "F:" + string(t.Data)
+		td := t.Data
+		if len(td) > maxTextLen {
+			td = td[:maxTextLen]
+		}
+		s = "F:" + td
 	case html.ErrorToken:
-		s = "Z:" + string(t.Data)
+		td := t.Data
+		if len(td) > maxTextLen {
+			td = td[:maxTextLen]
+		}
+		s = "Z:" + td
 	default:
 		return
 	}
 
 	*features = append(*features, s)
 
 	// Process attributes
 	for _, attr := range t.Attr {
 		switch attr.Key {
 		case "class", "name", "rel":
-			s = fmt.Sprintf("G:%s:%s:%s", t.DataAtom.String(), attr.Key, attr.Val)
+			s = "G:" + t.DataAtom.String() + ":" + attr.Key + ":" + attr.Val
 		default:
-			s = fmt.Sprintf("G:%s:%s", t.DataAtom.String(), attr.Key)
+			s = "G:" + t.DataAtom.String() + ":" + attr.Key
 		}
 		*features = append(*features, s)
 	}
 }

34-89: Process features in-stream to avoid holding all features in memory.

You can update the fingerprint as features are extracted (only keep a shingle window), eliminating the features slice entirely.

I can provide a patch that folds shingling and vector updates into the tokenization loop if you want to pursue this optimization.

pkg/engine/headless/debugger.go (3)

3-9: Import context for graceful shutdown.

Apply:

 import (
 	"encoding/json"
 	"fmt"
+	"context"
 	"net/http"
 	"sync"
 	"time"
 )

98-102: Avoid double GetActiveURLs call.

Minor efficiency/readability tweak.

Apply:

 func (cd *CrawlDebugger) handleActiveURLs(w http.ResponseWriter, r *http.Request) {
 	w.Header().Set("Content-Type", "application/json")
-	json.NewEncoder(w).Encode(map[string]interface{}{
-		"timestamp":   time.Now().Format(time.RFC3339),
-		"active_urls": cd.GetActiveURLs(),
-		"count":       len(cd.GetActiveURLs()),
-	})
+	urls := cd.GetActiveURLs()
+	_ = json.NewEncoder(w).Encode(map[string]interface{}{
+		"timestamp":   time.Now().Format(time.RFC3339),
+		"active_urls": urls,
+		"count":       len(urls),
+	})
 }

113-122: Use graceful shutdown.

Prefer Shutdown with a timeout over Close to drain in-flight requests.

Apply:

 func (cd *CrawlDebugger) Close() {
 	if cd == nil {
 		return
 	}
 
 	if cd.httpServer != nil {
-		cd.httpServer.Close()
+		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
+		defer cancel()
+		_ = cd.httpServer.Shutdown(ctx)
 	}
 }
pkg/engine/headless/browser/cookie/cookie.go (3)

19-21: Remove unused Action type (or document future use).

Currently unused and confusing in a blocker-only implementation.

Apply:

-type Action struct {
-	Type string `json:"type"`
-}
+// Action type reserved for future rule actions (e.g., block/redirect/modify).
+// Currently unused.

72-83: Use domain suffix match instead of substring for initiators.

Substrings can false-positive on unrelated domains.

Apply:

 	if len(rule.Condition.InitiatorDomains) > 0 {
 		matched := false
 		for _, domain := range rule.Condition.InitiatorDomains {
-			if strings.Contains(initiatorDomain, domain) {
+			if domainMatches(initiatorDomain, domain) {
 				matched = true
 				break
 			}
 		}
 		if !matched {
 			return false
 		}
 	}

Add this helper (outside the shown hunk):

// domainMatches returns true when host equals domain or ends with "."+domain (case-insensitive).
func domainMatches(host, domain string) bool {
	host = strings.ToLower(host)
	domain = strings.ToLower(domain)
	return host == domain || strings.HasSuffix(host, "."+domain)
}

85-91: Apply the same strictness for excluded initiators.

Apply:

 	if len(rule.Condition.ExcludedInitiatorDomains) > 0 {
 		for _, domain := range rule.Condition.ExcludedInitiatorDomains {
-			if strings.Contains(initiatorDomain, domain) {
+			if domainMatches(initiatorDomain, domain) {
 				return false
 			}
 		}
 	}
pkg/engine/headless/crawler/normalizer/dom_utils.go (3)

129-143: Over-aggressive attribute stripping (“href”, “src”, blanket “js”) increases collisions.*

Removing href/src from all elements erases meaningful structure and may make unrelated pages hash-identical. Also, stripping any attribute starting with “js” is too broad.

Consider:

  • Keep href/src but normalize (e.g., drop known tracking params: utm_*, gclid, fbclid, etc.).
  • Narrow the “js” rule to js- prefix only.
 var attributes = []string{
   "class",
   "id",
-  "name", // name tags is often random
-  "href", // remove href
   "style",
   "width",
   "height",
-  "src",
   "nowrap",
   "target",
   "valign",
   "cellpadding",
   "cellspacing",
 }
@@
-      if strings.HasPrefix(attr.Key, "data-") || strings.HasPrefix(attr.Key, "aria-") || strings.HasPrefix(attr.Key, "js") {
+      if strings.HasPrefix(attr.Key, "data-") || strings.HasPrefix(attr.Key, "aria-") || strings.HasPrefix(attr.Key, "js-") {
         s.RemoveAttr(attr.Key)
       }

If you do want to touch href values, add a small normalizer that removes only known volatile query params.

Also applies to: 154-167


115-127: Nil-safety for comment removal.

Guard against nil nodes to avoid panics on unusual selections.

Apply:

 func removeComments(n *html.Node) {
+  if n == nil {
+    return
+  }
   if n.Type == html.CommentNode {
-    n.Parent.RemoveChild(n)
+    if n.Parent != nil {
+      n.Parent.RemoveChild(n)
+    }
   }
   for c := n.FirstChild; c != nil; c = c.NextSibling {
     removeComments(c)
   }
 }
 
 func removeCommentsDomTransformationFunc(s *goquery.Selection) {
-  removeComments(s.Get(0))
+  removeComments(s.Get(0))
 }

82-103: Consider returning a stripped “outerHTML” of only.

Full doc.Html() may include head/meta noise that varies across requests. Hashing only normalized body can improve stability.

If acceptable, switch to serializing the normalized subtree when present.

pkg/engine/headless/browser/element.go (3)

210-245: Use a consistent Eval path (b.Page.Eval) for clarity.

Mixed usage (b.Eval vs b.Page.Eval) is confusing and can hide context differences.

Apply:

-  eventlisteners, err := b.Eval(`() => window.__eventListeners`)
+  eventlisteners, err := b.Page.Eval(`() => window.__eventListeners`)
@@
-  inlineListeners, err := b.Eval(`() => window.getAllElementsWithEventListeners()`)
+  inlineListeners, err := b.Page.Eval(`() => window.getAllElementsWithEventListeners()`)
@@
-  navigatedLinks, err := b.Eval(`() => window.__navigatedLinks`)
+  navigatedLinks, err := b.Page.Eval(`() => window.__navigatedLinks`)

Also applies to: 254-266


246-266: Nit: variable naming in GetNavigatedLinks.

The local slice is named listeners; rename to links for readability.

Apply:

-  listeners := make([]*NavigatedLink, 0)
-  if err := navigatedLinks.Value.Unmarshal(&listeners); err != nil {
+  links := make([]*NavigatedLink, 0)
+  if err := navigatedLinks.Value.Unmarshal(&links); err != nil {
     return nil, err
   }
-  return listeners, nil
+  return links, nil

146-151: Case-insensitive tag check.

Ensure TagName comparisons don’t depend on casing from JS marshalling.

Apply:

-      if element.TagName != "BUTTON" {
+      if !strings.EqualFold(element.TagName, "BUTTON") {
         continue
       }
pkg/engine/headless/TODOS.md (2)

84-86: Standardize horizontal rule to Markdown “---”.

Improves render consistency and satisfies linters.

----------------------------------
---

14-23: Tighten acceptance criteria for “page ready”.

Define concrete defaults (e.g., idle window 300ms, overall timeout 15s) and failure telemetry to aid tuning.

I can draft the exact options surface and metrics if helpful.

pkg/engine/headless/crawler/diagnostics/diagnostics.go (1)

185-195: Mutex unnecessary in LogPageStateScreenshot

The mutex is acquired but only filesystem operations are performed with no shared state access. Unless the directory path collision is a concern (which is already handled by MkdirAll), the mutex serves no purpose here.

Consider removing the unnecessary mutex:

 func (w *diskWriter) LogPageStateScreenshot(pageStateID string, screenshot []byte) error {
-	w.mu.Lock()
-	defer w.mu.Unlock()
-
 	dir := filepath.Join(w.directory, pageStateID)
 	if err := os.MkdirAll(dir, 0755); err != nil {
 		return err
 	}
 	screenshotFile := filepath.Join(dir, "screenshot.png")
 	return os.WriteFile(screenshotFile, screenshot, 0644)
 }
pkg/engine/headless/types/types.go (1)

4-5: Avoid using MD5 for cryptographic purposes

While MD5 is being used here for generating hash identifiers rather than security, it's deprecated and has known vulnerabilities. Consider using a more modern hash function.

Consider using SHA-256 for better future-proofing:

 import (
-	"crypto/md5"
+	"crypto/sha256"
 	"encoding/hex"

And update the hash generation accordingly:

 func (e *HTMLElement) Hash() string {
-	hasher := md5.New()
+	hasher := sha256.New()
pkg/engine/headless/crawler/crawler.go (1)

180-183: Redundant timeout mechanism

Having both context cancellation and time.After for the same timeout creates unnecessary complexity and resource usage. The context cancellation alone is sufficient.

Remove the redundant time.After mechanism:

-	// Retain the legacy time.After guard as a secondary fail-safe but the
-	// context cancellation is what actually stops in-flight rod calls.
-	var crawlTimeout <-chan time.Time
-	if c.options.MaxCrawlDuration > 0 {
-		crawlTimeout = time.After(c.options.MaxCrawlDuration)
-	}
-
 	consecutiveFailures := 0
 
 	for {
 		select {
-		case <-crawlTimeout:
+		case <-ctx.Done():
 			c.logger.Debug("Max crawl duration reached, stopping crawl")
 			return nil
 		default:
pkg/engine/headless/browser/browser.go (4)

329-331: Add browser health check as mentioned in TODO

The TODO comment correctly identifies that the browser might die. Implement a health check before returning the page.

The TODO comment at line 329-330 indicates that browser health checking is needed. Would you like me to generate a solution that implements browser health validation before returning the page from the pool?


494-524: Optimize pool management in PutBrowserToPool

The function performs multiple health checks that could be consolidated, and the page cleanup logic could be more efficient.

 func (l *Launcher) PutBrowserToPool(browser *BrowserPage) {
+	// Helper to cleanup and discard the browser
+	discardBrowser := func() {
+		browser.cancel()
+		browser.CloseBrowserPage()
+	}
+	
 	// Discard pages that hit a deadline or were cancelled to avoid immediately
 	// returning a poisoned page that will fail every subsequent call.
 	if cerr := browser.Page.GetContext().Err(); cerr != nil {
-		browser.cancel()
-		browser.CloseBrowserPage()
+		discardBrowser()
 		return
 	}
 	// If the browser is not connected, close it
 	if !isBrowserConnected(browser.Browser) {
-		browser.cancel()
-		browser.CloseBrowserPage()
+		discardBrowser()
 		return
 	}

 	pages, err := browser.Browser.Pages()
 	if err != nil {
-		browser.cancel()
-		browser.CloseBrowserPage()
+		discardBrowser()
 		return
 	}

 	// Close all pages except the current one
 	currentPageID := browser.Page.TargetID
 	for _, page := range pages {
 		if page.TargetID != currentPageID {
 			_ = page.Close()
 		}
 	}
 	l.browserPool.Put(browser)
 }

88-88: Consider making incognito mode configurable

The browser is hardcoded to use incognito mode, which might not be suitable for all use cases (e.g., when cookie persistence is needed).

Consider adding an Incognito field to LauncherOptions to make this configurable:

 type LauncherOptions struct {
 	ChromiumPath        string
 	MaxBrowsers         int
 	PageMaxTimeout      time.Duration
 	ShowBrowser         bool
 	Proxy               string
 	SlowMotion          bool
 	Trace               bool
 	CookieConsentBypass bool
+	Incognito           bool // default: true
 	ChromeUser          *user.User // optional chrome user to use
 	
 	ScopeValidator  ScopeValidator
 	RequestCallback func(*output.Result)
 }

526-535: Inefficient browser connectivity check

The isBrowserConnected function makes a synchronous RPC call every time it's invoked. Consider caching the connection state or using a more lightweight check.

 func isBrowserConnected(browser *rod.Browser) bool {
+	// First try a lightweight check if the browser context is valid
+	if browser == nil {
+		return false
+	}
+	
 	getVersionResult, err := proto.BrowserGetVersion{}.Call(browser)
 	if err != nil {
 		return false
 	}
 	if getVersionResult == nil || getVersionResult.Product == "" {
 		return false
 	}
 	return true
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 927edaf and 348eba0.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • pkg/engine/headless/browser/cookie/rules.json is excluded by !**/*.json
📒 Files selected for processing (34)
  • .gitignore (1 hunks)
  • cmd/katana/main.go (4 hunks)
  • go.mod (7 hunks)
  • internal/runner/options.go (1 hunks)
  • internal/runner/runner.go (2 hunks)
  • pkg/engine/headless/TODOS.md (1 hunks)
  • pkg/engine/headless/browser/browser.go (1 hunks)
  • pkg/engine/headless/browser/cookie/cookie.go (1 hunks)
  • pkg/engine/headless/browser/cookie/cookie_test.go (1 hunks)
  • pkg/engine/headless/browser/element.go (1 hunks)
  • pkg/engine/headless/crawler/crawler.go (1 hunks)
  • pkg/engine/headless/crawler/diagnostics/diagnostics.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/dom_utils.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/dom_utils_test.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/helpers.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/normalizer.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/simhash/simhash.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/simhash/simhash_test.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/text_utils.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/text_utils_test.go (1 hunks)
  • pkg/engine/headless/crawler/state.go (1 hunks)
  • pkg/engine/headless/crawler/state_test.go (1 hunks)
  • pkg/engine/headless/debugger.go (1 hunks)
  • pkg/engine/headless/graph/graph.go (1 hunks)
  • pkg/engine/headless/headless.go (1 hunks)
  • pkg/engine/headless/js/js.go (1 hunks)
  • pkg/engine/headless/js/page-init.js (1 hunks)
  • pkg/engine/headless/js/utils.js (1 hunks)
  • pkg/engine/headless/types/types.go (1 hunks)
  • pkg/engine/parser/parser.go (1 hunks)
  • pkg/engine/parser/parser_test.go (1 hunks)
  • pkg/types/crawler_options.go (2 hunks)
  • pkg/types/options.go (3 hunks)
  • pkg/utils/extensions/extensions.go (1 hunks)
🧰 Additional context used
🪛 LanguageTool
pkg/engine/headless/TODOS.md

[grammar] ~1-~1: There might be a mistake here.
Context: # 🗺️ Headless-Crawler Road-Map --- ## 🥇 Core Improvements (High-impact / fir...

(QB_NEW_EN)


[grammar] ~4-~4: There might be a mistake here.
Context: ...mprovements (High-impact / first passes) - [ ] After clicking on elements there isn...

(QB_NEW_EN)


[grammar] ~9-~9: There might be a mistake here.
Context: ...6 on stripped DOM for cheap exact-match. - Stage 2: compute 64-bit SimHash/MinHash ...

(QB_NEW_EN)


[grammar] ~10-~10: There might be a mistake here.
Context: ...es equal when Hamming distance ≤ 3 bits. - Stage 3 (optional): if SimHash inconclus...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...-res screenshot and compare pHash/dHash. - Store ExactHash & FuzzyHash; update ...

(QB_NEW_EN)


[grammar] ~22-~22: There might be a mistake here.
Context: ....9)untilscrollHeight` stops growing. - Fallback: IntersectionObserver on a sent...

(QB_NEW_EN)


[grammar] ~27-~27: There might be a mistake here.
Context: ...et, EventSource, Service-Worker scripts. - Feed new, in-scope URLs into the crawl q...

(QB_NEW_EN)


[grammar] ~32-~32: There might be a mistake here.
Context: ...queue (dedup by host/path). --- ## 🥈 Mid-term Enhancements - [ ] **Dynamic form-fill...

(QB_NEW_EN)


[grammar] ~32-~32: There might be a mistake here.
Context: ...ath). --- ## 🥈 Mid-term Enhancements - [ ] Dynamic form-filling - Gener...

(QB_NEW_EN)


[grammar] ~35-~35: There might be a mistake here.
Context: ..., min, max, maxlength, required. - Pluggable ValueProvider interface for ...

(QB_NEW_EN)


[grammar] ~42-~42: There might be a mistake here.
Context: ... Worker pool consuming the action queue. - Use multiple rod.Page instances (share...

(QB_NEW_EN)


[grammar] ~45-~45: There might be a mistake here.
Context: ... - [ ] Smart time-out & retry budgets - Adaptive timeout: first nav longer, late...

(QB_NEW_EN)


[grammar] ~48-~48: There might be a mistake here.
Context: ...d on stall. - [ ] Viewport variants - Crawl again at typical mobile (390×844) ...

(QB_NEW_EN)


[grammar] ~52-~52: There might be a mistake here.
Context: ...* - Close background tabs after use. - If Chrome RSS > threshold, restart brows...

(QB_NEW_EN)


[grammar] ~56-~56: There might be a mistake here.
Context: ... - Spoof fonts, canvas & audio contexts. - Rotate realistic UA strings, languages, ...

(QB_NEW_EN)


[grammar] ~57-~57: There might be a mistake here.
Context: ...rings, languages, hardwareConcurrency. - Optional headful mode via XVFB to enable...

(QB_NEW_EN)


[grammar] ~62-~62: There might be a mistake here.
Context: ...hs. --- ## 🥉 Nice-to-have / Advanced - [ ] Export crawl sessions - HAR ...

(QB_NEW_EN)


[grammar] ~77-~77: There might be a mistake here.
Context: ...rotation. - [ ] Sandboxing & security - Run Chrome under seccomp / user-namespac...

(QB_NEW_EN)


[grammar] ~90-~90: There might be a mistake here.
Context: ... ## 🎯 Critical Bug Fixes & Edge Cases - [ ] Handle iframe content extraction...

(QB_NEW_EN)


[grammar] ~93-~93: There might be a mistake here.
Context: ...oss-origin iframe detection and flagging - Same-origin iframe DOM traversal - Nes...

(QB_NEW_EN)


[grammar] ~94-~94: There might be a mistake here.
Context: ...ing - Same-origin iframe DOM traversal - Nested iframe support (up to N levels) ...

(QB_NEW_EN)


[grammar] ~97-~97: There might be a mistake here.
Context: ... [ ] WebComponent & Shadow DOM support - Detect custom elements with shadow roots...

(QB_NEW_EN)


[grammar] ~107-~107: There might be a mistake here.
Context: ...# 🔐 Authentication & Session Management - [ ] Auth state detection - Detect ...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...n actions - [ ] Multi-step auth flows - OAuth redirect handling - 2FA/MFA dete...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...auth flows** - OAuth redirect handling - 2FA/MFA detection and waiting - SAML/S...

(QB_NEW_EN)


[grammar] ~116-~116: There might be a mistake here.
Context: ...ndling - 2FA/MFA detection and waiting - SAML/SSO flow support - [ ] **Session p...

(QB_NEW_EN)


[grammar] ~124-~124: There might be a mistake here.
Context: ...uts ## 🎪 Advanced Interaction Patterns - [ ] Complex UI interactions - Drag...

(QB_NEW_EN)


[grammar] ~127-~127: There might be a mistake here.
Context: ... - Drag & drop detection and execution - File upload with generated test files ...

(QB_NEW_EN)


[grammar] ~128-~128: There might be a mistake here.
Context: ... - File upload with generated test files - Multi-select and combo-box handling - ...

(QB_NEW_EN)


[grammar] ~129-~129: There might be a mistake here.
Context: ... - Multi-select and combo-box handling - Date/time picker interaction - [ ] **Ke...

(QB_NEW_EN)


[grammar] ~133-~133: There might be a mistake here.
Context: ... support** - Tab-order based discovery - Keyboard shortcut detection (Ctrl+K, etc...

(QB_NEW_EN)


[grammar] ~134-~134: There might be a mistake here.
Context: ...yboard shortcut detection (Ctrl+K, etc.) - Access key enumeration - [ ] **Touch/mo...

(QB_NEW_EN)


[grammar] ~137-~137: There might be a mistake here.
Context: ...umeration - [ ] Touch/mobile gestures - Swipe detection for mobile views - Lon...

(QB_NEW_EN)


[grammar] ~138-~138: There might be a mistake here.
Context: ...s** - Swipe detection for mobile views - Long-press context menus - Pinch-to-zo...

(QB_NEW_EN)


[grammar] ~139-~139: There might be a mistake here.
Context: ...obile views - Long-press context menus - Pinch-to-zoom aware navigation ## 📊 An...

(QB_NEW_EN)


[grammar] ~142-~142: There might be a mistake here.
Context: ...navigation ## 📊 Analytics & Monitoring - [ ] Performance metrics - Page loa...

(QB_NEW_EN)


[grammar] ~145-~145: There might be a mistake here.
Context: ...ce metrics** - Page load time tracking - JavaScript execution overhead - Memory...

(QB_NEW_EN)


[grammar] ~146-~146: There might be a mistake here.
Context: ...acking - JavaScript execution overhead - Memory usage per page state - Network ...

(QB_NEW_EN)


[grammar] ~147-~147: There might be a mistake here.
Context: ...overhead - Memory usage per page state - Network request waterfalls - [ ] **Craw...

(QB_NEW_EN)


[grammar] ~151-~151: There might be a mistake here.
Context: ...y metrics** - Code coverage per domain - Unique vs duplicate state ratio - Acti...

(QB_NEW_EN)


[grammar] ~152-~152: There might be a mistake here.
Context: ...main - Unique vs duplicate state ratio - Action success/failure rates - Depth d...

(QB_NEW_EN)


[grammar] ~153-~153: There might be a mistake here.
Context: ...e ratio - Action success/failure rates - Depth distribution analysis - [ ] **Err...

(QB_NEW_EN)


[grammar] ~157-~157: There might be a mistake here.
Context: ...g** - JavaScript console error capture - Network error categorization - CSP vio...

(QB_NEW_EN)


[grammar] ~158-~158: There might be a mistake here.
Context: ...capture - Network error categorization - CSP violation logging - Failed action ...

(QB_NEW_EN)


[grammar] ~159-~159: There might be a mistake here.
Context: ...categorization - CSP violation logging - Failed action root cause analysis ## 🧠...

(QB_NEW_EN)


[grammar] ~162-~162: There might be a mistake here.
Context: ... analysis ## 🧠 Smart Crawling Features - [ ] ML-based duplicate detection -...

(QB_NEW_EN)


[grammar] ~164-~164: There might be a mistake here.
Context: ...es - [ ] ML-based duplicate detection - Train model on visual similarity - Sem...

(QB_NEW_EN)


[grammar] ~169-~169: There might be a mistake here.
Context: ...rns - [ ] Priority queue optimization - High-value path prediction - Anomaly d...

(QB_NEW_EN)


[grammar] ~170-~170: There might be a mistake here.
Context: ...ization** - High-value path prediction - Anomaly detection for interesting states...

(QB_NEW_EN)


[grammar] ~171-~171: There might be a mistake here.
Context: ...Anomaly detection for interesting states - Dynamic depth adjustment based on yield ...

(QB_NEW_EN)


[grammar] ~174-~174: There might be a mistake here.
Context: ... on yield - [ ] State space reduction - Identify and prune redundant actions -...

(QB_NEW_EN)


[grammar] ~179-~179: There might be a mistake here.
Context: ...ariations) ## 🛡️ Security & Compliance - [ ] CAPTCHA handling - Detection o...

(QB_NEW_EN)


[grammar] ~181-~181: There might be a mistake here.
Context: ...y & Compliance - [ ] CAPTCHA handling - Detection of common CAPTCHA providers ...

(QB_NEW_EN)


[grammar] ~182-~182: There might be a mistake here.
Context: ... - Detection of common CAPTCHA providers - Integration points for solving services ...

(QB_NEW_EN)


[grammar] ~183-~183: There might be a mistake here.
Context: ... Integration points for solving services - Graceful degradation strategies - [ ] *...

(QB_NEW_EN)


[grammar] ~186-~186: There might be a mistake here.
Context: ...gies - [ ] Rate limiting & politeness - Per-domain request throttling - Respec...

(QB_NEW_EN)


[grammar] ~187-~187: There might be a mistake here.
Context: ...ness** - Per-domain request throttling - Respect robots.txt for headless - Adap...

(QB_NEW_EN)


[grammar] ~188-~188: There might be a mistake here.
Context: ...ling - Respect robots.txt for headless - Adaptive delays based on response times ...

(QB_NEW_EN)


[grammar] ~192-~192: There might be a mistake here.
Context: ... compliance** - PII detection in forms - GDPR banner interaction - Data retenti...

(QB_NEW_EN)


[grammar] ~193-~193: There might be a mistake here.
Context: ...ion in forms - GDPR banner interaction - Data retention policies ## 🔌 Integrati...

(QB_NEW_EN)


[grammar] ~196-~196: There might be a mistake here.
Context: ...ion policies ## 🔌 Integration Features - [ ] API extraction - GraphQL query...

(QB_NEW_EN)


[grammar] ~199-~199: There might be a mistake here.
Context: ...n** - GraphQL query/mutation detection - REST endpoint parameter learning - Web...

(QB_NEW_EN)


[grammar] ~200-~200: There might be a mistake here.
Context: ...ion - REST endpoint parameter learning - WebSocket message format detection - [ ...

(QB_NEW_EN)


[grammar] ~204-~204: There might be a mistake here.
Context: ...OpenAPI spec generation from discoveries - Postman collection export - Burp Suite...

(QB_NEW_EN)


[grammar] ~205-~205: There might be a mistake here.
Context: ...iscoveries - Postman collection export - Burp Suite state file compatibility - [...

(QB_NEW_EN)


[grammar] ~209-~209: There might be a mistake here.
Context: ...- Playwright/Puppeteer script generation - Selenium IDE format export - Custom DS...

(QB_NEW_EN)


[grammar] ~210-~210: There might be a mistake here.
Context: ...eneration - Selenium IDE format export - Custom DSL for replay ## 🚀 Performance...

(QB_NEW_EN)


[grammar] ~213-~213: There might be a mistake here.
Context: ... replay ## 🚀 Performance Optimizations - [ ] Rendering optimizations - Disa...

(QB_NEW_EN)


[grammar] ~220-~220: There might be a mistake here.
Context: ...or battery saving - [ ] Caching layer - DOM diff caching - Screenshot perceptu...

(QB_NEW_EN)


[grammar] ~221-~221: There might be a mistake here.
Context: ...] Caching layer - DOM diff caching - Screenshot perceptual hashes - JavaScr...

(QB_NEW_EN)


[grammar] ~222-~222: There might be a mistake here.
Context: ...caching - Screenshot perceptual hashes - JavaScript execution results - [ ] **Di...

(QB_NEW_EN)


[grammar] ~225-~225: There might be a mistake here.
Context: ...on results - [ ] Distributed crawling - Work queue distribution - State synchr...

(QB_NEW_EN)


[grammar] ~226-~226: There might be a mistake here.
Context: ...d crawling** - Work queue distribution - State synchronization protocol - Resul...

(QB_NEW_EN)


[grammar] ~227-~227: There might be a mistake here.
Context: ...ution - State synchronization protocol - Result aggregation pipeline ## 🔧 Devel...

(QB_NEW_EN)


[grammar] ~230-~230: There might be a mistake here.
Context: ...ion pipeline ## 🔧 Developer Experience - [ ] Debug tooling - Live crawl vis...

(QB_NEW_EN)


[grammar] ~233-~233: There might be a mistake here.
Context: ...g tooling** - Live crawl visualization - State graph explorer UI - Action repla...

(QB_NEW_EN)


[grammar] ~234-~234: There might be a mistake here.
Context: ...isualization - State graph explorer UI - Action replay debugger - [ ] **Configur...

(QB_NEW_EN)


[grammar] ~237-~237: There might be a mistake here.
Context: ...bugger - [ ] Configuration management - Per-site config profiles - A/B testing...

(QB_NEW_EN)


[grammar] ~238-~238: There might be a mistake here.
Context: ...anagement** - Per-site config profiles - A/B testing different strategies - Hot...

(QB_NEW_EN)


[grammar] ~239-~239: There might be a mistake here.
Context: ...les - A/B testing different strategies - Hot-reload of site adapters - [ ] **Tes...

(QB_NEW_EN)


[grammar] ~243-~243: There might be a mistake here.
Context: ...ucture** - Headless crawler unit tests - Integration tests with test sites - Re...

(QB_NEW_EN)


[grammar] ~244-~244: There might be a mistake here.
Context: ...ts - Integration tests with test sites - Regression detection suite state.go is...

(QB_NEW_EN)


[grammar] ~259-~259: There might be a mistake here.
Context: ...──────────────────────────────────────── 1. Fingerprint strategy (page → id) ───────...

(QB_NEW_EN)


[grammar] ~260-~260: There might be a mistake here.
Context: ...── 1. Fingerprint strategy (page → id) ────────────────────────────────────────...

(QB_NEW_EN)


[grammar] ~261-~261: There might be a mistake here.
Context: ...──────────────────────────────────────── A. Canonical DOM extraction • Use t...

(QB_NEW_EN)


[grammar] ~262-~262: There might be a mistake here.
Context: ...──────────── A. Canonical DOM extraction • Use the existing domNormalizer (strip ...

(QB_NEW_EN)


[grammar] ~263-~263: There might be a mistake here.
Context: ...trip scripts, styles, dynamic IDs etc.). • Remove all transient event-attributes ...

(QB_NEW_EN)


[grammar] ~264-~264: There might be a mistake here.
Context: ...ttributes (onclick, onmouseover, …). • Collapse whitespace → single space. B...

(QB_NEW_EN)


[grammar] ~267-~267: There might be a mistake here.
Context: ...espace → single space. B. Two-tier hash • ExactHash = SHA-256(strippedDOM). ...

(QB_NEW_EN)


[grammar] ~268-~268: There might be a mistake here.
Context: ... • ExactHash = SHA-256(strippedDOM). • FuzzyHash = SimHash64(4-word shingles...

(QB_NEW_EN)


[grammar] ~269-~269: There might be a mistake here.
Context: ...mHash64(4-word shingles of strippedDOM). • Treat states equal if - Exac...

(QB_NEW_EN)


[grammar] ~270-~270: There might be a mistake here.
Context: ...rippedDOM). • Treat states equal if - ExactHash matches, or - Hamm...

(QB_NEW_EN)


[grammar] ~271-~271: There might be a mistake here.
Context: ...ual if - ExactHash matches, or - Hamming(FuzzyHash, other.FuzzyHash) ≤ ...

(QB_NEW_EN)


[grammar] ~275-~275: There might be a mistake here.
Context: ...FuzzyHash). C. Optional visual fallback • If comparison is inconclusive (≥ 4 bit...

(QB_NEW_EN)


[grammar] ~277-~277: There might be a mistake here.
Context: ...hot, pHash/dHash → same threshold logic. • Executed lazily to avoid perf hit. Re...

(QB_NEW_EN)


[grammar] ~278-~278: There might be a mistake here.
Context: ...eshold logic. • Executed lazily to avoid perf hit. Resulting struct: type Page...

(QB_NEW_EN)


[grammar] ~322-~322: There might be a mistake here.
Context: ...by XPath, ensure Visible & Interactable, plus DOM equality check under the canon...

(QB_NEW_EN)


[grammar] ~323-~323: There might be a mistake here.
Context: ... canonicalizer to avoid false positives. • If match, return targetOriginID. Step...

(QB_NEW_EN)


[grammar] ~330-~330: There might be a mistake here.
Context: ...Change() (new detector described below). • Recompute fingerprint; if equal (exact...

(QB_NEW_EN)


[grammar] ~336-~336: There might be a mistake here.
Context: ...ction; after each, WaitForRouteChange(). • After final step verify state (same eq...

(QB_NEW_EN)


[grammar] ~337-~337: There might be a mistake here.
Context: ...fter each, WaitForRouteChange(). • After final step verify state (same equality ...

(QB_NEW_EN)


[grammar] ~337-~337: There might be a mistake here.
Context: ...y state (same equality logic as Step 2). • Failure → ErrNoNavigationPossible. En...

(QB_NEW_EN)


[grammar] ~340-~340: There might be a mistake here.
Context: ...→ ErrNoNavigationPossible. Enhancements • Cache the computed “distance” between ...

(QB_NEW_EN)


[grammar] ~341-~341: There might be a mistake here.
Context: ...che the computed “distance” between two states; next call can skip graph search. • Re...

(QB_NEW_EN)


[grammar] ~341-~341: There might be a mistake here.
Context: ...ance” between two states; next call can skip graph search. • Record statistics (#n...

(QB_NEW_EN)


[grammar] ~341-~341: There might be a mistake here.
Context: ...states; next call can skip graph search. • Record statistics (#navigationBackSucc...

(QB_NEW_EN)


[grammar] ~351-~351: There might be a mistake here.
Context: ... const idle = () => new Promise(res => { const done = () => { obs.disconnect(); r...

(QB_NEW_EN)


[grammar] ~357-~357: There might be a mistake here.
Context: ...: true, attributes: true}); reset(); }); window.__katanaReady = () => Promis...

(QB_NEW_EN)


[grammar] ~360-~360: There might be a mistake here.
Context: ...ndow.__katanaReady = () => Promise.all([ idle(), new Promise(r => requestIdle...

(QB_NEW_EN)


[grammar] ~362-~362: There might be a mistake here.
Context: ...requestIdleCallback(r, {timeout: 5000})) ]); Go side: func (p *BrowserPage) Wai...

(QB_NEW_EN)


[grammar] ~371-~371: There might be a mistake here.
Context: ...val(ctx, await window.__katanaReady()) }) } Detects route changes, SPA navigat...

(QB_NEW_EN)


[grammar] ~388-~388: There might be a mistake here.
Context: ...imHash library, enable fuzzy comparator in graph. 3. Stage 3: optional pHash pat...

(QB_NEW_EN)


[grammar] ~388-~388: There might be a mistake here.
Context: ...brary, enable fuzzy comparator in graph. 3. Stage 3: optional pHash path guarded by ...

(QB_NEW_EN)


[grammar] ~389-~389: There might be a mistake here.
Context: ...onal pHash path guarded by feature flag. 4. Replace WaitPageLoadHeuristics with Wait...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
pkg/engine/headless/TODOS.md

17-17: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


18-18: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


84-84: Horizontal rule style
Expected: ---; Actual: ----------------------------------

(MD035, hr-style)

🪛 OSV Scanner (2.1.0)
go.mod

[HIGH] 113-113: golang.org/x/oauth2 0.11.0: Unexpected memory consumption during token parsing in golang.org/x/oauth2

(GO-2025-3488)


[HIGH] 113-113: golang.org/x/oauth2 0.11.0: golang.org/x/oauth2 Improper Validation of Syntactic Correctness of Input vulnerability

(GHSA-6v2p-p543-phr9)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review continued from previous batch...

@Ice3man543 Ice3man543 marked this pull request as ready for review September 25, 2025 12:59
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 23

🧹 Nitpick comments (21)
go.mod (3)

14-14: tint addition: OK; ensure graceful fallback for non‑TTY.

If used as a slog handler, confirm it degrades to plain output when not attached to a TTY to avoid control codes in logs (CI, JSON logs).


17-17: SimHash library is unmaintained; consider switching or vendoring.

mfonda/simhash is from 2015 and effectively unmaintained. Prefer a maintained SimHash implementation or vendor it with tests to mitigate supply‑chain and maintenance risk.

If you want, I can propose alternatives (e.g., go-dedup/simhash) and a minimal adapter layer.


66-66: Indirect pprof pinned to a tip pseudo‑version; consider stability.

It’s fine to keep, but if not strictly required, consider letting the resolver pick a stable tag/earlier pseudo‑version to reduce churn in reproducible builds.

pkg/engine/headless/js/utils.js (2)

57-69: Optional: Reduce cost of scanning all elements for inline listeners.

document.querySelectorAll("") can be heavy on large pages; consider narrowing to common interactive elements (a, button, input, select, textarea, [role="button"], [onclick], etc.) or short‑circuit when no “on” properties exist.


229-231: Nit: Avoid deprecated substr().

Use slice(1) for class name trimming.

Apply:

-          result += "." + window.escapeIdentifierIfNeeded(prefixedName.substr(1));
+          result += "." + window.escapeIdentifierIfNeeded(prefixedName.slice(1));
pkg/engine/headless/browser/element.go (1)

69-76: Limit form-button de‑duplication to submit controls.

Currently all BUTTONs inside forms are marked, which can suppress legitimate non-submit button actions. Only mark submit buttons.

Apply:

-        if element.TagName != "BUTTON" {
+        if element.TagName != "BUTTON" {
           continue
         }
-        // TODO: Check if this button is already in the unique map
-        // and if so remove it
-        unique[element.Hash()] = struct{}{}
+        if strings.EqualFold(element.Type, "submit") {
+          unique[element.Hash()] = struct{}{}
+        }
pkg/engine/headless/crawler/formfill.go (1)

225-226: Escape quotes in CSS attribute selector.

Unescaped " in value breaks [value="..."] selector.

Apply:

-                valueSelector := fmt.Sprintf(`[value="%s"]`, value)
+                esc := strings.ReplaceAll(value, `"`, `\"`)
+                valueSelector := fmt.Sprintf(`[value="%s"]`, esc)

Add import:

import "strings"
cmd/katana/main.go (1)

101-110: Simplify pprof shutdown; avoid nested defer.

Call Stop() directly within the deferred closure.

Apply:

-	defer func() {
-		if pprofServer != nil {
-			defer pprofServer.Stop()
-		}
-	}()
+	defer func() {
+		if pprofServer != nil {
+			pprofServer.Stop()
+		}
+	}()
pkg/engine/headless/headless.go (1)

38-41: Debugger port: avoid hard-coded 8089 clashes

Make port configurable (flag/env) or allow 0 to auto-assign and log the chosen port.

-		headless.debugger = NewCrawlDebugger(8089)
+		// TODO: plumb from options; consider 0 for ephemeral port
+		headless.debugger = NewCrawlDebugger(8089)
pkg/engine/headless/crawler/normalizer/normalizer.go (2)

38-41: Doc nit: “Denormalizing” → “Normalize again”

The step is a second normalization pass, not denormalization. Update comment for clarity.

-// - Denormalizing it
+// - Normalizing again for cleanup

68-76: URL decoding may convert '+' to space

url.QueryUnescape treats '+' as a space. If inputs are not query-encoded, consider url.PathUnescape or a guarded decode.

pkg/engine/headless/debugger.go (1)

117-125: Prefer graceful shutdown

Use Shutdown with a timeout to drain connections.

 func (cd *CrawlDebugger) Close() {
 	if cd == nil {
 		return
 	}
 
 	if cd.httpServer != nil {
-		cd.httpServer.Close()
+		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
+		defer cancel()
+		_ = cd.httpServer.Shutdown(ctx)
 	}
 }

Add imports:

import "context"
pkg/engine/headless/browser/cookie/cookie.go (1)

58-66: Guard: URLFilter empty matches all

If any rule has an empty URLFilter, strings.Contains(url, "") will be true and matching relies only on other predicates. If that’s intentional per rules.json, ignore. Otherwise, add an explicit check to avoid accidental global matches.

- if !strings.Contains(url, rule.Condition.URLFilter) {
+ if rule.Condition.URLFilter == "" || !strings.Contains(url, rule.Condition.URLFilter) {
    return false
 }
pkg/engine/headless/crawler/crawler.go (1)

141-150: DOT path uses options directory — verify

After applying the propagation fix, the DOT will land in the same diagnostics dir as the writer. If you prefer isolation, write to that directory unconditionally, not CWD.

pkg/engine/headless/browser/browser.go (1)

563-570: Also cancel context on close

CloseBrowserPage should cancel its context to reliably terminate per-page operations.

 func (b *BrowserPage) CloseBrowserPage() {
+    if b.cancel != nil {
+        b.cancel()
+    }
     b.Page.Close()
     b.Browser.Close()
     // Clean up the temp user data directory
     if b.userDataDir != "" {
         _ = os.RemoveAll(b.userDataDir)
     }
 }
pkg/engine/headless/js/page-init.js (6)

48-55: Also listen to popstate to capture back/forward navigations.

Add popstate so you don’t miss SPA navigations via history traversal.

       // Add event listener for hashchange
       window.addEventListener("hashchange", function () {
         window.__navigatedLinks.push({
           url: document.location.href,
           source: "hashchange",
         });
       });
+      // Also capture history traversal via back/forward
+      window.addEventListener("popstate", function () {
+        try {
+          window.__navigatedLinks.push({
+            url: document.location.href,
+            source: "popstate",
+          });
+        } catch (_) {}
+      });

25-74: Bound growth of __navigatedLinks to avoid memory bloat.

This array can grow unbounded on long crawls. Cap size (e.g., 5k–10k) via a helper.

-      window.__navigatedLinks = [];
+      window.__navigatedLinks = [];
+      const __pushNavigated = (url, source) => {
+        try {
+          const a = window.__navigatedLinks;
+          if (a.length > 5000) a.shift();
+          a.push({ url, source });
+        } catch (_) {}
+      };
@@
-      function __wrappedPushState(a, b, c) {
-        try { window.__navigatedLinks.push({ url: c, source: "history.pushState" }); } catch (_) {}
-        return __origPushState(a, b, c);
-      }
+      function __wrappedPushState(a, b, c) { __pushNavigated(c, "history.pushState"); return __origPushState(a, b, c); }
@@
-      function __wrappedReplaceState(a, b, c) {
-        try { window.__navigatedLinks.push({ url: c, source: "history.replaceState" }); } catch (_) {}
-        return __origReplaceState(a, b, c);
-      }
+      function __wrappedReplaceState(a, b, c) { __pushNavigated(c, "history.replaceState"); return __origReplaceState(a, b, c); }
@@
-      function __wrappedOpen(url, ...rest) {
-        try { window.__navigatedLinks.push({ url, source: "window.open" }); } catch (_) {}
-        return __origOpen(url, ...rest);
-      }
+      function __wrappedOpen(url, ...rest) { __pushNavigated(url, "window.open"); return __origOpen(url, ...rest); }
@@
-      window.addEventListener("hashchange", function () {
-        window.__navigatedLinks.push({
-          url: document.location.href,
-          source: "hashchange",
-        });
-      });
+      window.addEventListener("hashchange", function () { __pushNavigated(document.location.href, "hashchange"); });

68-74: Stabilize and freeze fetch override.

Bind to window and freeze the replacement for consistency with other hooks; guard push in try/catch.

-      var originalFetch = window.fetch;
-      window.fetch = function (...args) {
-        const url = args[0] instanceof Request ? args[0].url : args[0];
-        window.__navigatedLinks.push({ url: url, source: "fetch" });
-        return originalFetch.apply(this, args);
-      };
+      const __origFetch = window.fetch.bind(window);
+      function __wrappedFetch(...args) {
+        const url = (args[0] && typeof args[0] === "object" && "url" in args[0]) ? args[0].url : args[0];
+        try { window.__navigatedLinks.push({ url, source: "fetch" }); } catch (_) {}
+        return __origFetch(...args);
+      }
+      Object.defineProperty(window, "fetch", { value: __wrappedFetch, writable: false, configurable: false });

90-95: Gate window.close block behind an option, keep original.

Avoid surprising site logic that legitimately calls close() (e.g., popups).

-      window.close = function () {
-        console.log("[hook] trying to close page.");
-      };
+      const __origClose = window.close.bind(window);
+      window.close = function (...args) {
+        if (window.__katanaHooksOptions?.preventWindowClose === true) {
+          try { console.log("[hook] trying to close page."); } catch (_) {}
+          return;
+        }
+        return __origClose(...args);
+      };

96-109: Speed-up factor should be configurable and bounded.

Hard-coded 0.1 can break timing-dependent flows. Make it opt-in and clamp to non-negative.

-      const speedUpFactor = 0.1; // For example, 10 times faster
+      const speedUpFactor = (window.__katanaHooksOptions?.timeoutSpeedupFactor ?? 1);
@@
-      window.setTimeout = function (callback, delay, ...args) {
-        return originalSetTimeout(callback, delay * speedUpFactor, ...args);
-      };
+      window.setTimeout = function (callback, delay, ...args) {
+        const d = Math.max(0, Number(delay) * speedUpFactor);
+        return originalSetTimeout(callback, d, ...args);
+      };
       window.setInterval = function (callback, delay, ...args) {
-        return originalSetInterval(callback, delay * speedUpFactor, ...args);
+        const d = Math.max(0, Number(delay) * speedUpFactor);
+        return originalSetInterval(callback, d, ...args);
       };

115-150: Consider extending hook to EventTarget to capture window/document listeners.

Overriding Element.prototype misses addEventListener on Window/Document. If needed, also wrap EventTarget.prototype with care (avoid double-wrapping).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 348eba0 and a228ae5.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (26)
  • cmd/katana/main.go (4 hunks)
  • go.mod (4 hunks)
  • internal/runner/options.go (1 hunks)
  • internal/runner/runner.go (2 hunks)
  • pkg/engine/headless/browser/browser.go (1 hunks)
  • pkg/engine/headless/browser/cookie/cookie.go (1 hunks)
  • pkg/engine/headless/browser/element.go (1 hunks)
  • pkg/engine/headless/crawler/crawler.go (1 hunks)
  • pkg/engine/headless/crawler/diagnostics/diagnostics.go (1 hunks)
  • pkg/engine/headless/crawler/formfill.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/dom_utils.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/normalizer.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/text_utils.go (1 hunks)
  • pkg/engine/headless/crawler/normalizer/text_utils_test.go (1 hunks)
  • pkg/engine/headless/crawler/state.go (1 hunks)
  • pkg/engine/headless/debugger.go (1 hunks)
  • pkg/engine/headless/graph/graph.go (1 hunks)
  • pkg/engine/headless/headless.go (1 hunks)
  • pkg/engine/headless/js/page-init.js (1 hunks)
  • pkg/engine/headless/js/utils.js (1 hunks)
  • pkg/engine/headless/types/types.go (1 hunks)
  • pkg/engine/parser/parser.go (1 hunks)
  • pkg/engine/parser/parser_test.go (1 hunks)
  • pkg/types/crawler_options.go (2 hunks)
  • pkg/types/options.go (3 hunks)
  • pkg/utils/extensions/extensions.go (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pkg/engine/headless/crawler/diagnostics/diagnostics.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • pkg/types/options.go
  • internal/runner/runner.go
  • pkg/engine/headless/crawler/normalizer/text_utils.go
  • pkg/utils/extensions/extensions.go
  • pkg/engine/headless/crawler/normalizer/text_utils_test.go
  • internal/runner/options.go
  • pkg/engine/parser/parser_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-25T12:59:59.470Z
Learnt from: Ice3man543
PR: projectdiscovery/katana#1396
File: pkg/engine/headless/js/utils.js:156-177
Timestamp: 2025-09-25T12:59:59.470Z
Learning: The CSS selector generation code in pkg/engine/headless/js/utils.js (specifically the _cssPathStep function) is copied directly from Chrome DevTools and is correct as implemented. The mix of uppercase and lowercase tag names follows the original Chrome implementation.

Applied to files:

  • pkg/engine/headless/js/utils.js
🧬 Code graph analysis (12)
pkg/engine/parser/parser.go (1)
pkg/navigation/request.go (1)
  • Request (12-25)
cmd/katana/main.go (1)
pkg/engine/headless/headless.go (1)
  • Headless (18-25)
pkg/engine/headless/crawler/normalizer/normalizer.go (2)
pkg/engine/headless/crawler/normalizer/dom_utils.go (2)
  • DOMNormalizer (31-33)
  • NewDOMNormalizer (38-73)
pkg/engine/headless/crawler/normalizer/text_utils.go (2)
  • TextNormalizer (30-33)
  • NewTextNormalizer (39-53)
pkg/engine/headless/crawler/formfill.go (4)
pkg/engine/headless/types/types.go (2)
  • HTMLElement (132-145)
  • HTMLForm (197-209)
pkg/utils/formfill.go (6)
  • FormInput (37-42)
  • FormTextArea (58-61)
  • FormSelect (52-56)
  • SelectOption (45-49)
  • FormFillSuggestions (176-193)
  • FormData (13-13)
pkg/engine/headless/crawler/crawler.go (1)
  • Crawler (28-37)
pkg/engine/headless/browser/browser.go (1)
  • BrowserPage (149-156)
pkg/engine/headless/debugger.go (1)
pkg/navigation/request.go (2)
  • Depth (9-9)
  • Request (12-25)
pkg/engine/headless/browser/element.go (2)
pkg/engine/headless/types/types.go (7)
  • HTMLElement (132-145)
  • Action (36-44)
  • ActionTypeFillForm (89-89)
  • ActionTypeLeftClick (75-75)
  • ActionFromEventListener (95-129)
  • HTMLForm (197-209)
  • EventListener (270-274)
pkg/engine/headless/browser/browser.go (2)
  • BrowserPage (149-156)
  • ScopeValidator (61-61)
pkg/engine/headless/js/page-init.js (1)
pkg/engine/headless/js/utils.js (1)
  • listener (77-77)
pkg/engine/headless/crawler/crawler.go (8)
pkg/engine/headless/browser/browser.go (5)
  • Launcher (37-42)
  • ScopeValidator (61-61)
  • NewLauncher (64-71)
  • LauncherOptions (45-59)
  • BrowserPage (149-156)
pkg/engine/headless/types/types.go (7)
  • Action (36-44)
  • ActionTypeLoadURL (73-73)
  • PageState (20-33)
  • ActionTypeFillForm (89-89)
  • ActionTypeLeftClick (75-75)
  • ActionTypeLeftClickDown (76-76)
  • HTMLElement (132-145)
pkg/engine/headless/graph/graph.go (2)
  • CrawlGraph (15-17)
  • NewCrawlGraph (24-32)
pkg/engine/headless/crawler/normalizer/simhash/simhash.go (2)
  • Oracle (134-137)
  • NewOracle (140-142)
pkg/engine/headless/crawler/diagnostics/diagnostics.go (1)
  • Writer (16-22)
pkg/output/result.go (1)
  • Result (10-15)
pkg/engine/headless/crawler/normalizer/normalizer.go (2)
  • Normalizer (17-20)
  • New (23-33)
pkg/engine/headless/crawler/state.go (1)
  • ErrNoNavigationPossible (118-118)
pkg/engine/headless/headless.go (7)
pkg/types/crawler_options.go (1)
  • CrawlerOptions (23-47)
pkg/engine/headless/debugger.go (2)
  • CrawlDebugger (20-24)
  • NewCrawlDebugger (27-52)
pkg/engine/headless/crawler/crawler.go (2)
  • New (79-126)
  • Options (39-63)
pkg/engine/headless/browser/browser.go (1)
  • ScopeValidator (61-61)
pkg/output/result.go (1)
  • Result (10-15)
pkg/navigation/request.go (1)
  • Request (12-25)
pkg/engine/parser/parser.go (1)
  • NewResponseParser (37-75)
pkg/engine/headless/crawler/state.go (5)
pkg/engine/headless/crawler/crawler.go (1)
  • Crawler (28-37)
pkg/engine/headless/browser/browser.go (1)
  • BrowserPage (149-156)
pkg/engine/headless/types/types.go (3)
  • Action (36-44)
  • PageState (20-33)
  • HTMLElement (132-145)
pkg/engine/headless/crawler/normalizer/simhash/simhash.go (2)
  • Distance (149-151)
  • Fingerprint (130-132)
pkg/engine/headless/crawler/diagnostics/diagnostics.go (1)
  • PreActionPageState (27-27)
pkg/engine/headless/graph/graph.go (1)
pkg/engine/headless/types/types.go (2)
  • PageState (20-33)
  • Action (36-44)
pkg/engine/headless/browser/browser.go (7)
pkg/output/result.go (1)
  • Result (10-15)
pkg/engine/headless/js/js.go (1)
  • InitJavascriptEnv (19-27)
pkg/navigation/request.go (1)
  • Request (12-25)
pkg/navigation/response.go (1)
  • Headers (11-11)
pkg/engine/headless/browser/cookie/cookie.go (1)
  • ShouldBlockRequest (48-56)
pkg/utils/utils.go (1)
  • FlattenHeaders (72-78)
pkg/engine/hybrid/hijack.go (2)
  • FetchContinueRequest (92-97)
  • FetchGetResponseBody (71-89)
🪛 GitHub Check: CodeQL
pkg/engine/parser/parser.go

[failure] 113-113: Incomplete URL scheme check
This check does not consider vbscript.

🪛 GitHub Check: Lint
pkg/engine/headless/debugger.go

[failure] 111-111:
Error return value of (*encoding/json.Encoder).Encode is not checked (errcheck)


[failure] 102-102:
Error return value of (*encoding/json.Encoder).Encode is not checked (errcheck)

pkg/engine/headless/headless.go

[failure] 164-164:
Error return value of h.deduplicator.Set is not checked (errcheck)


[failure] 125-125:
Error return value of h.options.OutputWriter.Write is not checked (errcheck)


[failure] 120-120:
Error return value of h.options.OutputWriter.Write is not checked (errcheck)

pkg/engine/headless/browser/browser.go

[failure] 40-40:
field userDataDir is unused (unused)


[failure] 407-407:
Error return value is not checked (errcheck)


[failure] 330-330:
Error return value of browserPage.handlePageDialogBoxes is not checked (errcheck)

pkg/engine/headless/types/types.go

[failure] 165-165:
var dynamicIDAndClassPattern is unused (unused)

🪛 GitHub Actions: 🔨 Build Test
pkg/engine/headless/browser/browser.go

[error] 330-330: Error return value of browserPage.handlePageDialogBoxes is not checked (errcheck)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: release-test-mac
  • GitHub Check: release-test-linux
  • GitHub Check: Functional Test (ubuntu-latest)
  • GitHub Check: Functional Test (macOS-latest)
  • GitHub Check: Functional Test (windows-latest)
🔇 Additional comments (21)
go.mod (2)

12-12: Rod upgrade + transitive alignment: looks consistent; check CGO/leakless and browser compatibility.

v0.116.2 with got v0.40.0 and leakless v0.9.0 is a sensible set. Please verify:

  • Runtime environment has a compatible Chrome/Chromium for this Rod version.
  • CGO/leakless doesn’t break static or cross builds in your CI (musl/alpine, linux/arm64, windows).

Also applies to: 114-114, 155-155


10-11: Approve dependency additions Verified direct imports for github.com/adrianbrad/queue (pkg/engine/headless/crawler/crawler.go:14) and github.com/dominikbraun/graph (pkg/engine/headless/graph/graph.go:8); dependencies are used.

pkg/engine/headless/js/utils.js (2)

34-40: LGTM: selector guard applied.

The try/catch around querySelectorAll safely handles invalid selectors.


42-51: LGTM: XPath guard applied.

Wrapping document.evaluate prevents crashes on invalid XPath.

pkg/engine/headless/browser/element.go (2)

131-139: LGTM: scheme allowlist for links.

Filtering to http/https before scope check is correct and prevents non-navigable schemes.


111-113: Confirm launcher presence before use.

Ensure b.launcher is always non-nil here; otherwise ScopeValidator() will panic.

If needed, guard:

if b.launcher == nil || b.launcher.ScopeValidator() == nil {
  return navigations, nil
}
pkg/types/crawler_options.go (1)

5-6: LGTM: added Logger and ChromeUser. Verify wiring.

Fields look fine. Confirm they’re set by the runner when headless mode is used.

Also applies to: 43-46

pkg/engine/headless/headless.go (3)

159-169: Dedup race: use atomic SetNX pattern

The Get-then-Set can race under concurrency. Use Set-if-not-exists.

-		if _, ok := h.deduplicator.Get(resp.URL); ok {
-			continue
-		}
-		h.deduplicator.Set(resp.URL, struct{}{})
+		// Atomically insert; if it already existed, skip
+		if !h.deduplicator.SetNX(resp.URL, struct{}{}) {
+			continue
+		}

If SetNX isn’t available in SyncLockMap, add it (or a helper that locks around check+insert) and switch callers.

#!/bin/bash
# Verify whether SetNX exists on SyncLockMap
rg -nP 'type\s+SyncLockMap|func\s+\(.*\*\s*SyncLockMap.*\)\s*SetNX\s*\(' -C3

75-87: Good: safe ScopeManager nil handling

Thanks for adding the nil guard before calling Validate.


127-132: Trace toggled with diagnostics — confirm intent

Trace mirrors EnableDiagnostics. If tracing and diagnostics are different modes, consider distinct flags.

pkg/engine/headless/crawler/state.go (1)

210-235: Stronger element identity matching looks good

Requiring either ID equality or at least two non-empty attribute matches reduces false positives as discussed.

pkg/engine/headless/crawler/normalizer/normalizer.go (1)

14-14: Whitespace regex fixed

Correctly matches actual CR/LF and collapses runs.

pkg/engine/headless/crawler/normalizer/dom_utils.go (1)

105-108: Nice: single-root selection transforms (perf)

Applying selection-based transforms once at the root avoids O(N²) traversal.

pkg/engine/headless/browser/cookie/cookie.go (1)

42-45: Deterministic rule ordering implemented — LGTM

The rules are sorted by priority at init as previously suggested. Good.

pkg/engine/headless/graph/graph.go (2)

85-103: Nil action guard — LGTM

The nil check prevents panics; edge attributes/weighting look consistent.


113-131: ShortestPath: skips about:blank and nil actions — LGTM

Pre-sized slice and nil filtering avoid downstream surprises.

pkg/engine/headless/types/types.go (1)

188-193: Diagnostics: prefer logger over stderr

Prints to stderr from library code are noisy. Use a logger or gate behind a callback.

- if IsDiagnosticEnabled {
-     fmt.Fprintf(os.Stderr, "[diagnostic] Element hash input: %s\n", hashInput)
- }
+ // Optionally log via a provided logger in callers; avoid direct stderr writes here.

Same for lines 236-241.

pkg/engine/headless/browser/browser.go (1)

382-458: Stop leaking EachEvent goroutine

Capture and call the stop func (or cancel context) when closing the page to prevent goroutine leaks.

Would you like me to propose a concrete patch that stores a stop function on BrowserPage and calls it from CloseBrowserPage?

pkg/engine/headless/js/page-init.js (3)

27-39: History wrappers: good fix (wrap + delegate + freeze).

This resolves the earlier concern about stubbing without delegation. Looks correct and safe.


41-46: window.open wrapper: good (delegates and preserves return value).

This addresses the prior breakage by calling the original and freezing replacement.


115-150: Event listener hook: avoid PII, handle object listeners, cap array, and guard utils.

Current code can leak sensitive values, stores full listener source, and grows unbounded. Also utils may be undefined.

       const originalAddEventListener = Element.prototype.addEventListener;
 
       window.__eventListeners = [];
       Element.prototype.addEventListener = function (type, listener, options) {
@@
-        let item = {
+        const isSensitive = (el) => {
+          const t = (el.type || "").toLowerCase();
+          const n = (el.name || "").toLowerCase();
+          return t === "password" || /(pass|token|secret|key|auth|session)/i.test(n);
+        };
+        const listenerStr = (typeof listener === "function")
+          ? String(listener)
+          : (listener && typeof listener.handleEvent === "function" ? String(listener.handleEvent) : "[object Listener]");
+        const isFormCtl = ["INPUT","TEXTAREA","SELECT"].includes(this.tagName);
+        let item = {
           element: {
             tagName: this.tagName,
-            id: this.id,
-            classes: this.className,
-            outerHTML: this.outerHTML.slice(0, 100), // Capture a snippet of the element's outerHTML
-            xpath: window.getXPath(this),
-            cssSelector: window.getCssPath(this),
-            attributes: window.getElementAttributes(this),
-            textContent: this.textContent.trim(),
-            hidden: this.hidden,
-            name: this.name,
-            type: this.type,
-            value: this.value,
+            id: this.id || "",
+            classes: this.className || "",
+            outerHTML: (this.outerHTML || "").slice(0, 100),
+            xpath: typeof window.getXPath === "function" ? window.getXPath(this) : "",
+            cssSelector: typeof window.getCssPath === "function" ? window.getCssPath(this) : "",
+            attributes: typeof window.getElementAttributes === "function" ? window.getElementAttributes(this) : {},
+            textContent: isFormCtl ? "" : ((this.textContent || "").trim().slice(0, 100)),
+            hidden: !!this.hidden,
+            name: this.name,
+            type: this.type,
+            value: isSensitive(this) ? undefined : (typeof this.value === "string" ? this.value.slice(0, 64) : undefined),
           },
           type: type,
-          listener: listener.toString(),
+          listener: listenerStr.slice(0, 500),
           options: options || {},
         };
-        console.log("[hook] got event listener", item);
-        window.__eventListeners.push(item);
+        try {
+          (window.__eventListeners.length > 2000) && window.__eventListeners.shift();
+          window.__eventListeners.push(item);
+        } catch (_) {}
         return originalAddEventListener.call(this, type, listener, options);
       };

Copy link
Member

@dogancanbakir dogancanbakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reviewing but getting into loop when running the following command:

go run . -u vulnerabletarget.com -headless -sb -v

already shared with @Ice3man543

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
pkg/engine/headless/crawler/crawler.go (1)

101-116: Persist diagnostics directory, sanitize timestamp for Windows, and handle Getwd error.

  • The resolved diagnostics directory isn’t written back to opts, so Crawl() writes the DOT under an empty c.options.DiagnosticsDir. This was flagged earlier and remains.
  • RFC3339 includes “:” which is invalid in Windows filenames.
  • os.Getwd() error is ignored.
   var diagnosticsWriter diagnostics.Writer
   if opts.EnableDiagnostics {
     directory := opts.DiagnosticsDir
     if directory == "" {
-      cwd, _ := os.Getwd()
-      directory = filepath.Join(cwd, fmt.Sprintf("katana-diagnostics-%s", time.Now().Format(time.RFC3339)))
+      cwd, err := os.Getwd()
+      if err != nil {
+        cwd = os.TempDir()
+      }
+      // Use a filesystem-safe UTC timestamp (no colons), e.g. 20250930T141953Z
+      ts := time.Now().UTC().Format("20060102T150405Z")
+      directory = filepath.Join(cwd, fmt.Sprintf("katana-diagnostics-%s", ts))
     }
 
     writer, err := diagnostics.NewWriter(directory)
     if err != nil {
       return nil, err
     }
     diagnosticsWriter = writer
+    // Persist resolved directory for later use (e.g., crawl-graph.dot).
+    opts.DiagnosticsDir = directory
     opts.Logger.Info("Diagnostics enabled", slog.String("directory", directory))
   }
🧹 Nitpick comments (2)
pkg/engine/headless/crawler/crawler.go (2)

447-451: Use errors.As for rod.CoveredError.

Consistent with other typed Rod errors, prefer errors.As here.

-      if errors.Is(err, &rod.CoveredError{}) {
+      var ce *rod.CoveredError
+      if errors.As(err, &ce) {
         return ErrElementNotVisible
       }

141-150: Fallback write path for DOT when DiagnosticsDir is empty.

After persisting DiagnosticsDir in New(), this should be covered. If you decide not to persist, consider falling back to CWD or a temp dir here to avoid writing to a relative “crawl-graph.dot”.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a228ae5 and 20e61c9.

📒 Files selected for processing (1)
  • pkg/engine/headless/crawler/crawler.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/engine/headless/crawler/crawler.go (9)
pkg/engine/headless/browser/browser.go (5)
  • Launcher (37-42)
  • ScopeValidator (61-61)
  • NewLauncher (64-71)
  • LauncherOptions (45-59)
  • BrowserPage (149-156)
pkg/utils/queue/queue.go (1)
  • Queue (19-25)
pkg/engine/headless/types/types.go (7)
  • Action (36-44)
  • ActionTypeLoadURL (73-73)
  • PageState (20-33)
  • ActionTypeFillForm (89-89)
  • ActionTypeLeftClick (75-75)
  • ActionTypeLeftClickDown (76-76)
  • HTMLElement (132-145)
pkg/engine/headless/graph/graph.go (2)
  • CrawlGraph (15-17)
  • NewCrawlGraph (24-32)
pkg/engine/headless/crawler/normalizer/simhash/simhash.go (2)
  • Oracle (134-137)
  • NewOracle (140-142)
pkg/engine/headless/crawler/diagnostics/diagnostics.go (3)
  • Writer (16-22)
  • NewWriter (55-66)
  • PostActionPageState (28-28)
pkg/output/result.go (1)
  • Result (10-15)
pkg/engine/headless/crawler/normalizer/normalizer.go (2)
  • Normalizer (17-20)
  • New (23-33)
pkg/engine/headless/crawler/state.go (1)
  • ErrNoNavigationPossible (118-118)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: release-test-linux
  • GitHub Check: Analyze (go)
  • GitHub Check: Functional Test (ubuntu-latest)
  • GitHub Check: Functional Test (windows-latest)
  • GitHub Check: Functional Test (macOS-latest)
🔇 Additional comments (3)
pkg/engine/headless/crawler/crawler.go (3)

69-77: Thanks for removing the startup panic path.

init() now records the normalizer init error and avoids panicking. This unblocks tests and improves startup robustness.


128-135: Diagnostics.Close() error is now handled.

Close() logs a warning on diagnostics writer close failures instead of dropping the error. LGTM.


152-158: Sanity check passed: all referenced symbols are defined.
emptyPageHash (state.go), WaitPageLoadHeurisitics (browser.go), and processForm (formfill.go) are implemented within the package; no missing definitions.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (5)
pkg/engine/headless/js/page-init.js (1)

138-158: Address PII capture, object listener handling, and unbounded growth.

This code segment still has the unresolved issues flagged in past reviews:

  1. PII leakage risk (lines 147, 151): Capturing textContent and value without sanitization can leak passwords, tokens, and sensitive data.
  2. Object listener crash (line 154): listener.toString() will fail or produce "[object Object]" when the listener is an object with a handleEvent method.
  3. Unbounded memory growth (line 158): __eventListeners can grow indefinitely, causing memory issues.
🔎 Proposed fix
         if (this.tagName == "BODY") {
           return originalAddEventListener.call(this, type, listener, options);
         }
+        
+        const isSensitive = (el) => {
+          const t = (el.type || "").toLowerCase();
+          const n = (el.name || "").toLowerCase();
+          return t === "password" || /(pass|token|secret|key|auth|session|credit|ssn)/i.test(n);
+        };
+        const safeValue = isSensitive(this) ? "[REDACTED]" : (this.value || "").slice(0, 64);
+        const safeText = (this.textContent || "").trim().slice(0, 100);
+        const listenerStr = (typeof listener === "function")
+          ? (listener.toString ? String(listener).slice(0, 500) : "[Function]")
+          : (listener && typeof listener.handleEvent === "function" ? "[Object:handleEvent]" : "[Object]");
+        
         let item = {
           element: {
             tagName: this.tagName,
             id: this.id,
             classes: this.className,
             outerHTML: this.outerHTML.slice(0, 100), // Capture a snippet of the element's outerHTML
             xpath: window.getXPath(this),
             cssSelector: window.getCssPath(this),
             attributes: window.getElementAttributes(this),
-            textContent: this.textContent.trim(),
+            textContent: safeText,
             hidden: this.hidden,
             name: this.name,
             type: this.type,
-            value: this.value,
+            value: safeValue,
           },
           type: type,
-          listener: listener.toString(),
+          listener: listenerStr,
           options: options || {},
         };
         console.log("[hook] got event listener", item);
+        if (window.__eventListeners.length > 5000) {
+          window.__eventListeners.shift(); // FIFO ring buffer
+        }
         window.__eventListeners.push(item);
         return originalAddEventListener.call(this, type, listener, options);

Based on past review feedback and privacy best practices.

pkg/engine/headless/crawler/normalizer/dom_utils.go (1)

171-178: Overbroad "js" attribute removal still present.

The condition strings.HasPrefix(attr.Key, "js") will incorrectly match unrelated attributes like json, justify, jsonld, etc. This was flagged in a previous review but remains unfixed.

🔎 Proposed fix
 	for _, node := range s.Nodes {
 		for _, attr := range node.Attr {
 			attr := attr
-			if strings.HasPrefix(attr.Key, "data-") || strings.HasPrefix(attr.Key, "aria-") || strings.HasPrefix(attr.Key, "js") {
+			if strings.HasPrefix(attr.Key, "data-") ||
+				strings.HasPrefix(attr.Key, "aria-") ||
+				attr.Key == "js" || strings.HasPrefix(attr.Key, "js-") {
 				s.RemoveAttr(attr.Key)
 			}
 		}
 	}
pkg/engine/headless/headless.go (1)

179-189: Potential race condition in deduplication logic.

The check-then-set pattern (Get at line 180, then Set at line 183) is not atomic. While SyncLockMap provides thread-safe individual operations, concurrent calls to performAdditionalAnalysis could result in duplicate entries being processed if two goroutines check the same URL simultaneously before either sets it.

Consider using an atomic set-if-not-exists pattern if SyncLockMap supports it, or accept this as a minor edge case given current single-threaded crawling (per the TODO at line 153).

pkg/engine/headless/graph/graph.go (1)

55-83: Bug: Returning on ErrVertexAlreadyExists skips edge creation.

When AddVertex returns ErrVertexAlreadyExists (lines 60-61), the function returns early and never creates the edge for this PageState. This breaks graph connectivity when re-visiting a state from a different origin, as the new edge won't be recorded.

🔎 Proposed fix
 	err := g.graph.AddVertex(n, func(vp *graph.VertexProperties) {
 		vp.Weight = n.Depth
 		vp.Attributes = vertexAttrs
 	})
-	if err != nil {
-		if errors.Is(err, graph.ErrVertexAlreadyExists) {
-			return nil
-		}
+	if err != nil && !errors.Is(err, graph.ErrVertexAlreadyExists) {
 		return errors.Wrap(err, "could not add vertex to graph")
 	}
 
 	if n.NavigationAction != nil {
pkg/engine/headless/browser/browser.go (1)

387-468: Goroutine in EachEvent may outlive BrowserPage lifecycle.

The goroutine started at line 387 with go b.EachEvent(...)() continues running even after the BrowserPage is returned to the pool or closed. This could lead to memory leaks and unexpected behavior if callbacks reference stale state.

Consider storing the stop function returned by EachEvent and calling it in CloseBrowserPage().

🧹 Nitpick comments (7)
pkg/engine/parser/parser.go (1)

337-340: Optional cleanup: redundant data: URI check.

The manual data: check here is now redundant since the centralized isValidNavigationRequest filter (line 82) will catch these URLs. However, given the TODO comment suggests future data URI parsing functionality, you may want to keep this location as a placeholder for that feature.

🔎 Optional: remove redundant check

If data URI parsing is not planned soon, consider removing the redundant check:

 		src, ok := item.Attr("src")
 		if ok && src != "" && src != "#" {
-			if strings.HasPrefix(src, "data:") {
-				// TODO: Add data:uri/data:image parsing
-				return
-			}
 			navigationRequests = append(navigationRequests, navigation.NewNavigationRequestURLFromResponse(src, resp.Resp.Request.URL.String(), "img", "src", resp))
 		}
pkg/engine/headless/js/page-init.js (3)

135-137: Document why BODY tag is excluded from event listener tracking.

The special case for tagName == "BODY" skips tracking without explanation. If this is intentional (e.g., to reduce noise from common document-level events), add a comment explaining the rationale.


113-113: Consider making speedUpFactor configurable.

The hardcoded speedUpFactor = 0.1 could be made configurable via window.__katanaHooksOptions.speedUpFactor to allow runtime tuning of delay acceleration.

🔎 Suggested enhancement
-      const speedUpFactor = 0.1; // For example, 10 times faster
+      const speedUpFactor = window.__katanaHooksOptions?.speedUpFactor ?? 0.1; // Default: 10 times faster

49-54: Wrap hashchange listener in try-catch for consistency.

Other hook registrations are wrapped in try-catch blocks (lines 165-167), but this addEventListener call is not. For consistency and defensive coding, wrap it.

🔎 Proposed enhancement
       // Add event listener for hashchange
-      window.addEventListener("hashchange", function () {
-        window.__navigatedLinks.push({
-          url: document.location.href,
-          source: "hashchange",
-        });
-      });
+      try {
+        window.addEventListener("hashchange", function () {
+          window.__navigatedLinks.push({
+            url: document.location.href,
+            source: "hashchange",
+          });
+        });
+      } catch (_) {}
pkg/engine/headless/browser/element.go (1)

214-250: Improve error handling consistency in GetEventListeners.

At line 219-221, the error from evaluating window.__eventListeners is silently swallowed, but at lines 232-237, errors from getAllElementsWithEventListeners() are returned. This inconsistency could mask issues with the first evaluation.

Consider logging when the first evaluation fails, or unifying the error handling approach.

pkg/engine/headless/crawler/crawler.go (1)

356-375: Screenshot error should not block diagnostics logging.

At lines 363-371, if Screenshot fails, the error is logged but screenshotState could be nil or empty. Then LogPageStateScreenshot is called with potentially invalid data. Consider skipping the screenshot log on error.

🔎 Proposed improvement
 	if c.diagnostics != nil {
 		screenshotState, err := page.Screenshot(false, &proto.PageCaptureScreenshot{
 			Format: proto.PageCaptureScreenshotFormatPng,
 		})
 		if err != nil {
 			c.logger.Error("Failed to take screenshot", slog.String("error", err.Error()))
-		}
-		if err := c.diagnostics.LogPageStateScreenshot(pageState.UniqueID, screenshotState); err != nil {
-			c.logger.Error("Failed to log page state screenshot", slog.String("error", err.Error()))
+		} else {
+			if err := c.diagnostics.LogPageStateScreenshot(pageState.UniqueID, screenshotState); err != nil {
+				c.logger.Error("Failed to log page state screenshot", slog.String("error", err.Error()))
+			}
 		}
 		if err := c.diagnostics.LogNavigations(pageState.UniqueID, navigations); err != nil {
 			c.logger.Error("Failed to log navigations", slog.String("error", err.Error()))
 		}
 	}
pkg/engine/headless/types/types.go (1)

163-189: Consider using structured logging for diagnostics.

The diagnostic output at lines 184-186 uses fmt.Fprintf(os.Stderr, ...) which bypasses any structured logging infrastructure. While this is an improvement over stdout, consider accepting an optional logger or using a package-level diagnostic logger for better integration with the logging subsystem.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20e61c9 and 279eb81.

📒 Files selected for processing (11)
  • pkg/engine/headless/browser/browser.go
  • pkg/engine/headless/browser/element.go
  • pkg/engine/headless/crawler/crawler.go
  • pkg/engine/headless/crawler/normalizer/dom_utils.go
  • pkg/engine/headless/debugger.go
  • pkg/engine/headless/graph/graph.go
  • pkg/engine/headless/headless.go
  • pkg/engine/headless/js/page-init.js
  • pkg/engine/headless/js/utils.js
  • pkg/engine/headless/types/types.go
  • pkg/engine/parser/parser.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/engine/headless/debugger.go
  • pkg/engine/headless/js/utils.js
🧰 Additional context used
🧬 Code graph analysis (4)
pkg/engine/headless/headless.go (4)
pkg/types/crawler_options.go (1)
  • CrawlerOptions (23-47)
pkg/engine/headless/browser/browser.go (1)
  • ScopeValidator (60-60)
pkg/output/result.go (1)
  • Result (10-15)
pkg/navigation/request.go (1)
  • Request (12-25)
pkg/engine/headless/graph/graph.go (1)
pkg/engine/headless/types/types.go (2)
  • PageState (19-32)
  • Action (35-43)
pkg/engine/headless/browser/browser.go (5)
pkg/output/result.go (1)
  • Result (10-15)
pkg/navigation/request.go (1)
  • Request (12-25)
pkg/navigation/response.go (1)
  • Headers (11-11)
pkg/engine/headless/browser/cookie/cookie.go (1)
  • ShouldBlockRequest (48-56)
pkg/utils/utils.go (1)
  • FlattenHeaders (72-78)
pkg/engine/headless/browser/element.go (1)
pkg/engine/headless/types/types.go (6)
  • HTMLElement (131-144)
  • Action (35-43)
  • ActionTypeLeftClick (74-74)
  • ActionFromEventListener (94-128)
  • HTMLForm (192-204)
  • EventListener (265-269)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: release-test-mac
  • GitHub Check: release-test-linux
  • GitHub Check: release-test-windows
  • GitHub Check: Analyze (go)
  • GitHub Check: Functional Test (macOS-latest)
  • GitHub Check: Functional Test (windows-latest)
  • GitHub Check: Functional Test (ubuntu-latest)
🔇 Additional comments (14)
pkg/engine/parser/parser.go (1)

82-115: Excellent security hardening - all previous concerns addressed.

The centralized filtering implementation correctly addresses the earlier feedback:

  • Whitespace trimming prevents bypass attacks
  • Case-insensitive comparison blocks "JavaScript:", "DATA:", etc.
  • All dangerous schemes (data:, mailto:, javascript:, vbscript:) are now blocked

The logic is sound and consistently applied across all parser types.

pkg/engine/headless/crawler/normalizer/dom_utils.go (1)

1-114: LGTM - Core normalization pipeline is well-structured.

The two-phase transformation approach (selector-based removals followed by root-level selection transformations) is clean. The fix to apply selection-based transformations once at the root (lines 106-108) addresses the previous O(N²) concern. The nil guard in removeCommentsDomTransformationFunc is also properly implemented.

pkg/engine/headless/headless.go (2)

27-44: LGTM - Constructor properly initializes components.

The New function correctly initializes the logger, deduplicator, and conditionally enables the debugger. The error handling path is clean.


90-165: Crawl method is well-structured with proper resource management.

The scope validation, crawler options configuration, nil guards, error logging, and deferred cleanup (defer headlessCrawler.Close()) are all properly implemented. The previous concerns about nil checks and error handling have been addressed.

pkg/engine/headless/browser/element.go (1)

59-153: LGTM - Navigation discovery logic is comprehensive.

The flow correctly handles forms, buttons, links, and event listeners with proper deduplication. The scheme validation for links (lines 131-137) addresses the previous concern about non-http(s) URLs. The disabled element check is thorough.

pkg/engine/headless/graph/graph.go (2)

1-44: LGTM - Graph structure and vertex enumeration are well-designed.

The CrawlGraph type properly wraps the directed graph with appropriate traits (directed, rooted, weighted). The navigationHasherFunc provides stable vertex identification via UniqueID.


113-141: LGTM - ShortestPath and DrawGraph implementations.

The ShortestPath method now correctly pre-allocates the slice (line 118) and filters out both about:blank URLs and nil navigation actions (line 125). The DrawGraph method properly handles file creation and cleanup.

pkg/engine/headless/crawler/crawler.go (3)

69-77: LGTM - Graceful initialization error handling.

The init() function now captures initialization errors in initError instead of panicking, and New() checks this at line 80-82. This allows proper error propagation to callers.


242-285: LGTM - Error handling and failure counting logic is correct.

The code properly uses errors.As for Rod error types (lines 249-274), handles recoverable errors with continue, and only counts failures for non-recoverable errors. The consecutiveFailures is correctly reset on success (line 284).


417-476: LGTM - Action execution handles multiple action types correctly.

The executeCrawlStateAction method properly handles LoadURL, FillForm, and click actions with appropriate timeouts, visibility checks, and interactability validation. The ErrElementNotVisible sentinel provides clear error signaling.

pkg/engine/headless/browser/browser.go (2)

266-352: LGTM - Resource cleanup in createBrowserPageFunc is robust.

The deferred cleanup with shouldCleanup flag (lines 269-276, 349-350) ensures the temp directory is cleaned up on any failure path. The strconv.Atoi errors are now properly handled (lines 285-292), and all error paths properly close browser/page resources before returning.


138-145: LGTM - Launcher Close properly cleans up resources.

The Close method cancels contexts, closes browser pages via Cleanup, and closes the pool channel to prevent goroutine leaks from the pool.

pkg/engine/headless/types/types.go (2)

45-53: LGTM - Action.Hash fallback is properly implemented.

The Hash() method now returns a stable key composed of Type|Input|OriginID when neither Element nor Form is present, preventing hash collisions for distinct actions.


146-161: LGTM - String() method no longer mutates state.

The fix uses a local trimmedContent variable (line 156) instead of modifying e.TextContent in place, making the method properly read-only.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
pkg/engine/headless/js/page-init.js (1)

139-159: Event listener hook still captures PII, doesn't handle object listeners, and allows unbounded growth.

The past review flagged these issues, but they remain unaddressed:

  1. Line 152 captures this.value without filtering password/sensitive fields
  2. Line 155 calls listener.toString() which will fail for object listeners with handleEvent methods
  3. Line 159 pushes to window.__eventListeners without any size limit
🔎 Proposed fix
       }
       let item = {
         element: {
           tagName: this.tagName,
           id: this.id,
           classes: this.className,
           outerHTML: this.outerHTML.slice(0, 100), // Capture a snippet of the element's outerHTML
           xpath: window.getXPath(this),
           cssSelector: window.getCssPath(this),
           attributes: window.getElementAttributes(this),
-          textContent: this.textContent.trim(),
+          textContent: (this.textContent || "").trim().slice(0, 100),
           hidden: this.hidden,
           name: this.name,
           type: this.type,
-          value: this.value,
+          value: (this.type === "password" || /pass|secret|token/i.test(this.name || "")) ? undefined : (this.value || "").slice(0, 64),
         },
         type: type,
-        listener: listener.toString(),
+        listener: (typeof listener === "function") ? listener.toString().slice(0, 500) : (listener && typeof listener.handleEvent === "function" ? "[object Listener]" : String(listener).slice(0, 500)),
         options: options || {},
       };
       console.log("[hook] got event listener", item);
+      if (window.__eventListeners.length > 1000) {
+        window.__eventListeners.shift();
+      }
       window.__eventListeners.push(item);
pkg/engine/headless/browser/browser.go (1)

387-470: Goroutine started by EachEvent cannot be stopped, causing resource leak.

The go b.EachEvent(...)() call at line 387 starts a goroutine that continues running after the BrowserPage is closed or returned to the pool. The return value (stop function) from EachEvent() is not captured.

Store the stop function and call it during cleanup:

🔎 Proposed fix

Add a stopEvents field to BrowserPage:

 type BrowserPage struct {
 	*rod.Page
 	Browser     *rod.Browser
 	cancel      context.CancelFunc
 	userDataDir string
+	stopEvents  func()
 
 	launcher *Launcher
 }

Capture and store the stop function in handlePageDialogBoxes:

-	go b.EachEvent(
+	b.stopEvents = b.EachEvent(
 		func(e *proto.PageJavascriptDialogOpening) {
 			_ = proto.PageHandleJavaScriptDialog{
 				Accept:     true,
 				PromptText: xid.New().String(),
 			}.Call(b.Page)
 		},
 
 		func(e *proto.FetchRequestPaused) {
 			// ... existing handler code ...
 		},
-	)()
+	)

Then invoke it during cleanup in CloseBrowserPage:

 func (b *BrowserPage) CloseBrowserPage() {
+	if b.stopEvents != nil {
+		b.stopEvents()
+	}
 	_ = b.Close()
 	_ = b.Browser.Close()
 	if b.userDataDir != "" {
 		_ = os.RemoveAll(b.userDataDir)
 	}
 }
🧹 Nitpick comments (2)
pkg/engine/headless/types/types.go (1)

186-188: Consider using structured logging instead of fmt for diagnostics.

While the diagnostic output now goes to stderr (better than stdout), it still uses fmt.Fprintf rather than a structured logger. Since the project already uses log/slog elsewhere, consider using it here for consistency.

🔎 Proposed fix
 	hashInput := strings.Join(parts, "|")
 	if IsDiagnosticEnabled {
-		fmt.Fprintf(os.Stderr, "[diagnostic] Element hash input: %s\n", hashInput)
+		slog.Debug("Element hash input", "input", hashInput)
 	}
 	hasher.Write([]byte(hashInput))

Apply the same pattern at line 234-236 for form hashing.

pkg/engine/headless/browser/element.go (1)

69-76: Remove misleading TODO comment - deduplication logic is correct.

The TODO comment at lines 73-74 suggests the deduplication logic is incomplete, but the current implementation is actually correct:

  1. Form buttons are marked in the unique map (line 75)
  2. Later, when processing standalone buttons (lines 101-103), they're skipped if already in unique

This prevents duplicate actions for form submit buttons.

🔎 Proposed fix
 	for _, form := range forms {
 		for _, element := range form.Elements {
 			if element.TagName != "BUTTON" {
 				continue
 			}
-			// TODO: Check if this button is already in the unique map
-			// and if so remove it
+			// Mark form buttons as seen to avoid duplicate actions
 			unique[element.Hash()] = struct{}{}
 		}
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 279eb81 and 6857acc.

📒 Files selected for processing (4)
  • pkg/engine/headless/browser/browser.go
  • pkg/engine/headless/browser/element.go
  • pkg/engine/headless/js/page-init.js
  • pkg/engine/headless/types/types.go
🧰 Additional context used
🧬 Code graph analysis (3)
pkg/engine/headless/types/types.go (2)
pkg/engine/headless/browser/cookie/cookie.go (1)
  • Action (20-22)
pkg/navigation/response.go (1)
  • Form (13-18)
pkg/engine/headless/js/page-init.js (2)
pkg/navigation/request.go (1)
  • Request (12-25)
pkg/engine/headless/js/utils.js (1)
  • listener (77-77)
pkg/engine/headless/browser/browser.go (5)
pkg/engine/headless/js/js.go (1)
  • InitJavascriptEnv (19-27)
pkg/navigation/request.go (1)
  • Request (12-25)
pkg/navigation/response.go (1)
  • Headers (11-11)
pkg/engine/headless/browser/cookie/cookie.go (1)
  • ShouldBlockRequest (48-56)
pkg/utils/utils.go (1)
  • FlattenHeaders (72-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Test Builds (ubuntu-latest, 1.24.x)
  • GitHub Check: Test Builds (macOS-latest, 1.24.x)
  • GitHub Check: Test Builds (windows-latest, 1.24.x)
  • GitHub Check: Analyze (go)
  • GitHub Check: release-test-linux
  • GitHub Check: Functional Test (macOS-latest)
  • GitHub Check: Functional Test (windows-latest)
  • GitHub Check: Functional Test (ubuntu-latest)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/engine/headless/crawler/normalizer/dom_utils.go (1)

145-152: Overbroad "js" attribute removal still present.

Line 148 uses strings.HasPrefix(attr.Key, "js") which will incorrectly match unrelated attributes like json, justify, jsonp, etc. This issue was previously flagged but not addressed.

🔎 Proposed fix
-		if strings.HasPrefix(attr.Key, "data-") || strings.HasPrefix(attr.Key, "aria-") || strings.HasPrefix(attr.Key, "js") {
+		if strings.HasPrefix(attr.Key, "data-") || 
+			strings.HasPrefix(attr.Key, "aria-") || 
+			attr.Key == "js" || strings.HasPrefix(attr.Key, "js-") {
 			s.RemoveAttr(attr.Key)
 		}
🧹 Nitpick comments (2)
pkg/engine/headless/crawler/normalizer/dom_utils.go (2)

40-41: Optional: Loop variable capture unnecessary in Go 1.22+.

Lines 41 and 50 use t := t to capture the loop variable for closures. This pattern was necessary in Go < 1.22 but is now redundant, as loop variables are per-iteration by default in Go 1.22+.

Consider removing these lines if your minimum Go version is 1.22 or higher.

Also applies to: 49-50


96-104: Return early after removing comment node to avoid unnecessary iteration.

After removing a comment node on line 98, the function proceeds to iterate its children on lines 101-103. HTML comment nodes don't have children, so this iteration is unnecessary.

🔎 Proposed fix
 func removeComments(n *html.Node) {
 	if n.Type == html.CommentNode {
 		n.Parent.RemoveChild(n)
+		return
 	}
 
 	for c := n.FirstChild; c != nil; c = c.NextSibling {
 		removeComments(c)
 	}
 }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6857acc and 6c760a8.

📒 Files selected for processing (2)
  • pkg/engine/headless/crawler/normalizer/dom_utils.go
  • pkg/engine/headless/crawler/normalizer/normalizer.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/engine/headless/crawler/normalizer/normalizer.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Test Builds (macOS-latest, 1.24.x)
  • GitHub Check: Test Builds (windows-latest, 1.24.x)
  • GitHub Check: Test Builds (ubuntu-latest, 1.24.x)
  • GitHub Check: release-test-windows
  • GitHub Check: Analyze (go)
  • GitHub Check: Functional Test (windows-latest)
  • GitHub Check: Functional Test (macOS-latest)
  • GitHub Check: Functional Test (ubuntu-latest)
🔇 Additional comments (2)
pkg/engine/headless/crawler/normalizer/dom_utils.go (2)

10-20: Verify that broad meta and SVG removal aligns with normalization goals.

Lines 14 and 16 remove all meta[content] and svg elements respectively. This is quite aggressive:

  • meta[content] will remove useful semantic meta tags (description, keywords, Open Graph tags, etc.)
  • svg removes all vector graphics, which may contain meaningful content or navigation elements

If the goal is purely DOM structure fingerprinting for deduplication, this may be acceptable. However, verify these removals don't negatively impact page semantics or cause excessive false-positive deduplication.


106-110: LGTM! Nil guard correctly implemented.

The nil guard on line 107 correctly addresses the previous review concern about potential nil dereference.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/engine/headless/crawler/crawler.go (1)

325-326: Scope restriction TODO identified.

This FIXME indicates incomplete scope validation before proceeding with actions. The current scope check only happens after action execution (lines 349-359), meaning out-of-scope navigations may be executed before being filtered.

Would you like me to help implement pre-action scope validation, or should I open an issue to track this enhancement?

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa13e31 and be42013.

📒 Files selected for processing (1)
  • pkg/engine/headless/crawler/crawler.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/engine/headless/crawler/crawler.go (4)
pkg/engine/headless/types/types.go (7)
  • Action (35-43)
  • ActionTypeLoadURL (72-72)
  • PageState (19-32)
  • ActionTypeFillForm (88-88)
  • ActionTypeLeftClick (74-74)
  • ActionTypeLeftClickDown (75-75)
  • HTMLElement (133-146)
pkg/engine/headless/graph/graph.go (2)
  • CrawlGraph (15-17)
  • NewCrawlGraph (24-32)
pkg/engine/headless/crawler/diagnostics/diagnostics.go (2)
  • Writer (16-22)
  • NewWriter (55-66)
pkg/engine/headless/crawler/state.go (1)
  • ErrNoNavigationPossible (118-118)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Test Builds (ubuntu-latest, 1.24.x)
  • GitHub Check: Test Builds (macOS-latest, 1.24.x)
  • GitHub Check: Test Builds (windows-latest, 1.24.x)
  • GitHub Check: Functional Test (windows-latest)
  • GitHub Check: Functional Test (ubuntu-latest)
  • GitHub Check: Functional Test (macOS-latest)
  • GitHub Check: release-test-linux
  • GitHub Check: Analyze (go)
🔇 Additional comments (5)
pkg/engine/headless/crawler/crawler.go (5)

65-77: LGTM! Graceful initialization error handling.

The use of initError instead of panicking addresses the prior review feedback and allows the caller to handle initialization failures properly.


133-144: LGTM! Proper resource cleanup with error logging.

The Close() method now logs diagnostics close errors as requested in the prior review, ensuring failures are visible without crashing the cleanup flow.


242-290: LGTM! Proper error type matching with errors.As.

The error handling now correctly uses errors.As for Rod's pointer-typed errors (NoPointerEventsError, InvisibleShapeError, NavigationError, MaxSleepCountError) and reserves errors.Is for sentinel errors. The failure counter reset on success (line 289) is correctly placed.


422-481: LGTM! Robust action execution with visibility checks.

The action execution properly validates element visibility and interactability before clicking, handles CoveredError correctly with errors.As, and applies appropriate timeouts to Rod operations.


483-488: LGTM! Logout detection regex corrected.

The logoutPattern regex now correctly includes ausloggen (German for logout), addressing the prior review feedback about the typo.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/engine/headless/browser/browser.go (1)

391-474: Goroutine leak: event listener not stopped when page is closed.

The goroutine started by b.EachEvent(...)() at line 391 continues running after the page is closed or returned to the pool. EachEvent returns a stop function that must be saved and invoked during cleanup.

This issue was flagged in a previous review but remains unaddressed. Each browser page creation leaks one goroutine, which will accumulate over time and consume memory.

🔎 Proposed fix

Update the BrowserPage struct to store the stop function:

 type BrowserPage struct {
 	*rod.Page
 	Browser     *rod.Browser
 	cancel      context.CancelFunc
 	userDataDir string
+	stopEvents  func()
 
 	launcher *Launcher
 }

Save and invoke the stop function:

 func (b *BrowserPage) handlePageDialogBoxes() error {
 	err := proto.FetchEnable{
 		Patterns: []*proto.FetchRequestPattern{
 			{
 				URLPattern:   "*",
 				RequestStage: proto.FetchRequestStageResponse,
 			},
 		},
 	}.Call(b.Page)
 	if err != nil {
 		return errors.Wrap(err, "could not enable fetch domain")
 	}
 
-	go b.EachEvent(
+	b.stopEvents = b.EachEvent(
 		func(e *proto.PageJavascriptDialogOpening) {
 			// ... handlers ...
 		},
 		func(e *proto.FetchRequestPaused) {
 			// ... handlers ...
 		},
-	)()
+	)
 	return nil
 }

Call the stop function in cleanup:

 func (b *BrowserPage) CloseBrowserPage() {
+	if b.stopEvents != nil {
+		b.stopEvents()
+	}
 	_ = b.Close()
 	_ = b.Browser.Close()
 	if b.userDataDir != "" {
 		_ = os.RemoveAll(b.userDataDir)
 	}
 }

Based on past review comments indicating this pattern is required for rod event listeners.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be42013 and 69f98e7.

📒 Files selected for processing (1)
  • pkg/engine/headless/browser/browser.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Test Builds (windows-latest, 1.24.x)
  • GitHub Check: Test Builds (ubuntu-latest, 1.24.x)
  • GitHub Check: Test Builds (macOS-latest, 1.24.x)
  • GitHub Check: Analyze (go)
  • GitHub Check: release-test-linux
  • GitHub Check: Functional Test (ubuntu-latest)
  • GitHub Check: Functional Test (macOS-latest)
  • GitHub Check: Functional Test (windows-latest)

@ehsandeep ehsandeep merged commit a073dc6 into dev Dec 31, 2025
15 of 17 checks passed
@ehsandeep ehsandeep deleted the new-full-headless-mode branch December 31, 2025 10:36
Rhan2020 pushed a commit to Rhan2020/katana that referenced this pull request Feb 11, 2026
* feat: added new full headless mode to katana (wip)

* feat: use requests body + close browser & page

* feat: misc additions to scope

* feat: pass logger from outside + do not use global

* added user to optioanl flags

* misc

* misc changes

* feat: multiple bug fixes + edge cases handlings

* feat: more additions + code, diagnostics, cookie-popups and stealth

* feat: added additional simhash + misc additions and fixes

* feat: fixed panic + parser not working for katana headless

* feat: handle singleton lock issue with multiple instances

* feat: fixed singleton error with multiple processes in linux

* pprof misc fix

* feat: added timeout + fixing stuck on large lists

* misc additions + debug

* misc

* misc

* misc

* feat: addressed review comments + better form fill + misc

* feat: more code review comments

* fix out of scope parsing

* misc review / lint fixes etc

* fix review comments

* tests related fixes + added back text stripping

* misc

* logging changes

* return page

---------

Co-authored-by: Ice3man <nizamulrana@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

Comments