Conversation
WalkthroughAdds a new headless crawling subsystem: CLI/headless flags and options, a Rod-based browser launcher with pooled BrowserPage instances, JS instrumentation and element discovery, DOM/text normalizers with SimHash dedupe, graph-backed page-state management, diagnostics, cookie-consent request blocking, crawl debugger, parser filtering, form-filling, and related tests. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant CLI as cmd/katana
participant Runner as internal/runner
participant Headless as headless.Engine
participant Hybrid as hybrid.Engine
participant Standard as standard.Engine
User->>CLI: katana [flags]
CLI->>Runner: New(options)
alt options.Headless
Runner->>Headless: headless.New(crawlerOptions)
Headless-->>Runner: Headless instance
else options.HeadlessHybrid
Runner->>Hybrid: hybrid.New(crawlerOptions)
Hybrid-->>Runner: Hybrid instance
else
Runner->>Standard: standard.New(crawlerOptions)
Standard-->>Runner: Standard instance
end
Runner-->>User: runner instance
sequenceDiagram
autonumber
participant Headless as headless.Headless
participant Crawler as headless.Crawler
participant Launcher as browser.Launcher
participant Page as BrowserPage
participant JS as JS Env
participant Diag as Diagnostics
participant Out as OutputWriter
Headless->>Crawler: New(opts)
Crawler->>Launcher: NewLauncher(opts)
Launcher-->>Crawler: *Launcher
Headless->>Crawler: Crawl(URL)
Crawler->>Launcher: GetPageFromPool()
Launcher-->>Crawler: *BrowserPage
Crawler->>JS: InitJavascriptEnv(Page)
Crawler->>Page: Navigate + Wait heuristics
Crawler->>Diag: LogAction / LogPageState / LogNavigations
Crawler->>Out: RequestCallback(result)
Headless->>Headless: performAdditionalAnalysis()
Headless->>Out: Emit new requests
Crawler->>Launcher: PutBrowserToPool(Page)
Headless-->>Caller: done/error
Estimated code review effort🎯 5 (Critical) | ⏱️ ~150 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 42
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/runner/runner.go (1)
98-105: Enforce mutual exclusivity of Headless flags
In internal/runner/options.go (e.g. in ValidateOptions), add a check to error when both options.Headless and options.HeadlessHybrid are true:if options.Headless && options.HeadlessHybrid { return fmt.Errorf("flags -hl and -hh are mutually exclusive") }This ensures ambiguous flag combinations fail fast rather than silently defaulting to Headless.
🧹 Nitpick comments (71)
.gitignore (2)
10-10: Confirm intent: ignoring .devcontainer hides devcontainer configs repo-wide.
These are typically committed for reproducible dev envs. If configs are intended to be versioned, drop this ignore.Apply if you want to keep devcontainer configs tracked:
-.devcontainer
5-6: Duplicate ignore entry for katana_*/.
Remove the duplicate to keep the file tidy.katana_*/ -katana_*/pkg/utils/extensions/extensions.go (1)
41-46: Avoid dropping valid relative paths on parse errors; fall back to path.Ext() instead of returning falseShort-circuiting here changes behavior and can filter out legitimate relative paths or slightly malformed-yet-parseable items. Prefer trimming, attempting parse, and on error falling back to extension extraction from the raw item.
Apply:
func (e *Validator) ValidatePath(item string) bool { - var extension string - u, err := urlutil.Parse(item) - if err != nil { - gologger.Warning().Msgf("validatepath: failed to parse url %v got %v", item, err) - return false - } + var extension string + item = strings.TrimSpace(item) + u, err := urlutil.Parse(item) + if err != nil { + // noisy in hot paths; consider downgrading to Debug if this is expected + gologger.Warning().Msgf("validatepath: failed to parse url %v got %v", item, err) + extension = strings.ToLower(path.Ext(item)) + } else { + if u.Path != "" { + extension = strings.ToLower(path.Ext(u.Path)) + } else { + extension = strings.ToLower(path.Ext(item)) + } + return e.validateExt(extension) + } - if u.Path != "" { - extension = strings.ToLower(path.Ext(u.Path)) - } else { - extension = strings.ToLower(path.Ext(item)) - } + return e.validateExt(extension) +} + +// small helper to keep main flow simple +func (e *Validator) validateExt(extension string) bool { if extension == "" && len(e.extensionsMatch) > 0 { return false } if len(e.extensionsMatch) > 0 { if _, ok := e.extensionsMatch[extension]; ok { return true } return false } if _, ok := e.extensionsFilter[extension]; ok { return false } return true }If you prefer not to touch structure, minimally replace the
return falsewithextension = strings.ToLower(path.Ext(item))and let the existing logic proceed.pkg/types/crawler_options.go (1)
40-44: LGTM: Optional Logger and ChromeUser fields fit the new headless pathNo functional risk. Consider wiring sensible defaults (e.g., a text handler at Info when verbose/diagnostics are enabled) in NewCrawlerOptions so downstream consumers don’t need nil checks everywhere.
pkg/engine/parser/parser_test.go (1)
527-566: Great coverage; add uppercase/whitespace variants to lock the filterAdd a few cases (DATA:, MAILTO:, JAVASCRIPT:, with leading spaces) to prevent regressions and validate case-insensitive + trim change.
Apply:
htmlContent := ` <img src=""> <a href="data:text/html,<h1>Hello</h1>">Data Link</a> + <a href=" DATA:text/html,<h1>WS</h1>">WS Data Link</a> <link href="data:text/css,body{color:red}"> <script src="data:application/javascript,console.log('test')"></script> + <script src=" JAVASCRIPT:alert(1) "></script> <iframe src="data:text/html,<p>Test</p>"></iframe> <object data="data:application/pdf,test"></object> <embed src="data:application/x-shockwave-flash,test"> <audio src="data:audio/mp3,test"></audio> <video src="data:video/mp4,test"></video> <a href="mailto:test@example.com">Email Link</a> + <a href="MAILTO:test@example.com">Email Link 2</a> <a href="javascript:alert('test')">JS Link</a> + <a href="JavaScript:alert('test')">JS Link 2</a> `pkg/engine/headless/crawler/normalizer/helpers.go (1)
3-44: Broaden date/time coverage and reduce false positives; consider precompilationNice registry. Consider:
- Add ISO-8601 and ordinal suffixes (st/nd/rd) to improve recall.
- Use raw-string literals for readability.
- Precompile once ([]*regexp.Regexp) where consumed to avoid repeated compile cost.
Apply (illustrative additions):
var dateTimePatterns = []string{ @@ /* long months */ "[0-9]{1,2} [a-zA-Z]{3,} [0-9]{4}", - "[0-9]{1,2}th [a-zA-Z]{3,} [0-9]{4}", - "[0-9]{1,2}th [a-zA-Z]{3,}", + "[0-9]{1,2}(st|nd|rd|th) [a-zA-Z]{3,} [0-9]{4}", + "[0-9]{1,2}(st|nd|rd|th) [a-zA-Z]{3,}", @@ /* Times */ "[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}( )?(pm|PM|am|AM)", "[0-9]{1,2}:[0-9]{1,2}( )?(pm|PM|am|AM)", "[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}", "[0-9]{1,2}:[0-9]{1,2}", + /* ISO-8601 */ + "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})?", }pkg/engine/headless/js/page-init.js (4)
59-69: Preserve instanceof and static props for WebSocket/EventSource wrappers.Reassigning constructors without copying prototype/static fields can break instanceof checks and constants.
- var oldWebSocket = window.WebSocket; - window.WebSocket = function (url, arg) { + var oldWebSocket = window.WebSocket; + function WebSocketHook(url, arg) { window.__navigatedLinks.push({ url: url, source: "websocket" }); return new oldWebSocket(url, arg); - }; + } + WebSocketHook.prototype = oldWebSocket.prototype; + Object.setPrototypeOf(WebSocketHook, oldWebSocket); + window.WebSocket = WebSocketHook; @@ - var oldEventSource = window.EventSource; - window.EventSource = function (url, arg) { + var oldEventSource = window.EventSource; + function EventSourceHook(url, arg) { window.__navigatedLinks.push({ url: url, source: "eventsource" }); return new oldEventSource(url, arg); - }; + } + EventSourceHook.prototype = oldEventSource.prototype; + Object.setPrototypeOf(EventSourceHook, oldEventSource); + window.EventSource = EventSourceHook;
84-91: Blocking form reset/close is intrusive; make it opt-in and reversible.Hard-disabling form reset and window.close can break sites during crawling.
- Behind a flag (e.g., opts.preventFormReset/opts.preventClose).
- Keep originals in closures and expose an unhook() to restore.
Also applies to: 93-98
105-112: Timer acceleration: handle non-numeric/large delays and clamp.Multiplying undefined/NaN by 0.1 yields NaN; also consider minimum clamping.
- const speedUpFactor = 0.1; // For example, 10 times faster + const speedUpFactor = 0.1; // 10x faster + const clamp = (d) => { + const n = Number(d); + if (!Number.isFinite(n)) return 0; + const sped = Math.max(0, Math.floor(n * speedUpFactor)); + return Math.max(0, sped); + }; @@ - window.setTimeout = function (callback, delay, ...args) { - return originalSetTimeout(callback, delay * speedUpFactor, ...args); + window.setTimeout = function (callback, delay, ...args) { + return originalSetTimeout(callback, clamp(delay), ...args); }; - window.setInterval = function (callback, delay, ...args) { - return originalSetInterval(callback, delay * speedUpFactor, ...args); + window.setInterval = function (callback, delay, ...args) { + return originalSetInterval(callback, clamp(delay), ...args); };
155-159: Hooks are disabled; add a flag-driven initializer so callers can opt-in granularly.Provide a single entry point like pageInitAndHook({ events: true, navigations: true, timers: true, guards: true }).
I can wire a small options parser and idempotent init/unhook helpers if you want.
go.mod (1)
12-12: QUIC/utls/rod upgrades: check minimum Go and transitive constraints.quic-go and utls occasionally bump min-Go/crypto requirements. Given Go 1.24 here, you should be safe; still watch for CI across platforms.
- Add a smoke test job that runs “go build ./...” on linux/windows/darwin with Go 1.24.2 to surface toolchain incompatibilities early.
Also applies to: 89-92
pkg/engine/headless/crawler/normalizer/text_utils.go (2)
45-45: Remove redundant shadowing.
pattern := patterninside loops has no effect.- pattern := patternAlso applies to: 58-58
56-62: Whitespace normalization (optional).After removals, leftover double spaces/newlines can remain. Optionally collapse whitespace.
I can add a post-pass to compress multiple spaces and trim if desired.
pkg/engine/headless/crawler/normalizer/text_utils_test.go (2)
22-29: Make the test case name descriptive.Improve signal when it fails.
- name: "test", + name: "removes email inside anchor text",
34-38: Avoid depending on unexported fields in tests.If
TextNormalizer’s internals change, this test breaks. Prefer a constructor or exported option.I can add a
NewTextNormalizer(patterns ...*regexp.Regexp)and update the test—want me to draft it?pkg/engine/headless/js/utils.js (4)
64-80: This only captures inlineon*handlers, notaddEventListenerlisteners.Name/doc can mislead. Either document the limitation or hook
EventTarget.prototype.addEventListenerin page-init.js to record listeners (recommended).I can wire a lightweight hook into page-init.js to track listeners without heavy scans—want a patch?
85-103: Broaden pseudo-form detection (optional).Consider also
[role="form"]and common wrapper patterns to improve coverage.- const pseudoForms = document.querySelectorAll("div.form"); + const pseudoForms = document.querySelectorAll('div.form, [role="form"]');
15-28: Capture checkbox/radio state.Including
checkedimproves form introspection accuracy.window._elementDataFromElement = function (el) { return { @@ - value: el.value != null ? String(el.value) : '', + value: el.value != null ? String(el.value) : '', + checked: typeof el.checked === 'boolean' ? el.checked : undefined,
3-12: Make utils initialization idempotent.Prevent re-definitions when
InitJavascriptEnvruns multiple times.-(function initUtilityFunctions() { +(function initUtilityFunctions() { + if (window.__katanaUtilsInit) { return; } + window.__katanaUtilsInit = true;pkg/engine/headless/js/js.go (1)
18-27: Keep remove handles from EvalOnNewDocument (optional cleanup).Rod returns a remove func; storing it allows cleanup on page teardown to avoid duplicate injections on reinits.
-func InitJavascriptEnv(page *rod.Page) error { - if _, err := page.EvalOnNewDocument(utilsJavascriptBundle); err != nil { +func InitJavascriptEnv(page *rod.Page) error { + if _, err := page.EvalOnNewDocument(utilsJavascriptBundle); err != nil { return errors.Wrap(err, "failed to inject utils.js") } - if _, err := page.EvalOnNewDocument(pageInitJavascriptBundle); err != nil { + if _, err := page.EvalOnNewDocument(pageInitJavascriptBundle); err != nil { return errors.Wrap(err, "failed to inject page-init.js") } return nil }FYI:
EvalOnNewDocumentreturns(remove func() error, err error). (github.com)internal/runner/runner.go (1)
106-108: Fix misleading error message when creating crawler.The message hardcodes “standard crawler” even for headless/hybrid errors. Prefer a neutral message.
- return nil, errorutil.NewWithErr(err).Msgf("could not create standard crawler") + return nil, errorutil.NewWithErr(err).Msgf("could not create crawler")internal/runner/options.go (2)
28-33: Consider applying the -aff auto-disable to hybrid as well (if hybrid uses headless navigation).If hybrid mode executes navigation via a headless browser, the same rationale applies.
- if options.Headless && options.AutomaticFormFill { + if (options.Headless || options.HeadlessHybrid) && options.AutomaticFormFill { options.AutomaticFormFill = false gologger.Info().Msgf("Automatic form fill (-aff) has been disabled for headless navigation.") }
38-42: Validate SystemChromePath is executable (not just present).Helps fail early on directories/wrong files.
if options.SystemChromePath != "" { - if !fileutil.FileExists(options.SystemChromePath) { + if !fileutil.FileExists(options.SystemChromePath) { return errorutil.New("specified system chrome binary does not exist") } + if fi, err := os.Stat(options.SystemChromePath); err == nil && fi.Mode()&0111 == 0 { + return errorutil.New("specified system chrome binary is not executable") + } }pkg/types/options.go (1)
104-106: Clarify HeadlessHybrid precedence in docs and CLI help: when both--headlessand--headless-hybridare set,--headlesstakes precedence; update theHeadlessHybridflag description in pkg/types/options.go (and the generated CLI help) to explain their difference and this precedence.pkg/engine/headless/browser/cookie/cookie_test.go (1)
9-49: Solid table-driven coverage for consent blocking.Tests read clearly and validate the core rule-matching paths.
Consider adding cases for:
- A blocked URL with a non-script resource type to verify resource-type gating.
- Precedence when both Included and Excluded initiator domains could match.
Example additional case:
@@ func TestShouldBlockRequest(t *testing.T) { { name: "should not block non-matching URL", url: "https://example.com/other-path", resourceType: proto.NetworkResourceTypeScript, initiatorDomain: "hackerone.com", want: false, }, + { + name: "should not block non-script resource", + url: "https://consent.trustarc.com/notice?domain=hackerone.com", + resourceType: proto.NetworkResourceTypeImage, + initiatorDomain: "hackerone.com", + want: false, + },pkg/engine/headless/crawler/normalizer/dom_utils_test.go (1)
65-76: Tiny cleanup: remove unused local ‘d’ variable in subtests.Not functionally wrong; simplifies the subtest body.
- t.Run(tt.name, func(t *testing.T) { - d := normalizer - got, err := d.Apply(tt.args.content) + t.Run(tt.name, func(t *testing.T) { + got, err := normalizer.Apply(tt.args.content)pkg/engine/headless/crawler/state_test.go (3)
12-14: Implement or skip the placeholder testEmpty test will silently pass; either implement or mark it skipped to avoid false positives.
Apply this diff:
func TestPageFingerprint_Stability(t *testing.T) { - + t.Skip("TODO: implement stability test for page fingerprints") }
3-10: Drop pkg/errors in tests; use stdlib error wrappingAvoid extra dependency in tests; use fmt.Errorf with %w.
Apply this diff:
import ( + "fmt" "strings" "testing" - "github.com/pkg/errors" "github.com/projectdiscovery/katana/pkg/engine/headless/crawler/normalizer/simhash" "github.com/stretchr/testify/assert" )- return "", errors.Wrap(err, "could not get stripped dom") + return "", fmt.Errorf("could not get stripped dom: %w", err)Also applies to: 84-84
90-104: Parallelize subtests for speedThese table-driven tests are independent; run them in parallel.
Apply this diff:
t.Run(tt.name, func(t *testing.T) { + t.Parallel()t.Run(tt.name, func(t *testing.T) { + t.Parallel()Also applies to: 145-172
cmd/katana/main.go (3)
75-79: Remove nested defer; call Stop directly inside the wrapperThe inner defer is redundant; simplify for clarity.
Apply this diff:
- defer func() { - if pprofServer != nil { - defer pprofServer.Stop() - } - }() + defer func() { + if pprofServer != nil { + pprofServer.Stop() + } + }()
49-50: Resolve TODO or delete the commentEither wire an env-based toggle here or drop the stale comment to avoid confusion.
143-156: Clarify/validate headless vs hybrid flag interplayIf both -headless and -hybrid are set, define precedence or error out. Help strings could also hint at mutual exclusivity.
Add a post-parse validation (outside this hunk), e.g.:
// after flagSet.Parse() and config merge if options.Headless && options.HeadlessHybrid { return nil, errorutil.New("flags -headless and -hybrid are mutually exclusive") }pkg/engine/headless/crawler/normalizer/simhash/simhash_test.go (2)
15-27: Avoid asserting undocumented “shingle=0 fallback to 1” behaviorIf Fingerprint doesn’t guarantee this, the test is brittle. Either document it in code or adjust the test to just check determinism with a fixed shingle size.
Apply this diff:
-// TestFingerprintDeterministic ensures that hashing the same document twice produces -// identical fingerprints and that a shingle value of 0 gracefully falls back to 1. +// TestFingerprintDeterministic ensures that hashing the same document twice produces +// identical fingerprints. func TestFingerprintDeterministic(t *testing.T) { in1 := strings.NewReader(htmlA) in2 := strings.NewReader(htmlA) - fp1 := Fingerprint(in1, 0) - fp2 := Fingerprint(in2, 1) + fp1 := Fingerprint(in1, 3) + fp2 := Fingerprint(in2, 3)
18-27: Run tests in parallelIndependent tests; parallelize to reduce runtime.
Apply this diff:
func TestFingerprintDeterministic(t *testing.T) { + t.Parallel()func TestFingerprintSimilarity(t *testing.T) { + t.Parallel()func TestOracle(t *testing.T) { + t.Parallel()Also applies to: 32-44, 48-72
pkg/engine/headless/headless.go (3)
27-44: Consider parameterizing the debugger portThe debugger port is hardcoded to 8089 on line 40. This could cause conflicts if multiple instances are running or if the port is already in use.
- headless.debugger = NewCrawlDebugger(8089) + debuggerPort := 8089 // Consider making this configurable + if options.Options.DebuggerPort > 0 { + debuggerPort = options.Options.DebuggerPort + } + headless.debugger = NewCrawlDebugger(debuggerPort)
100-128: TODO comment indicates multi-threading is plannedLine 128 contains a TODO comment about making crawling multi-threaded. Currently, concurrency is hardcoded to 1.
Do you want me to create an issue to track the multi-threaded crawling implementation or generate a solution that implements concurrent crawling?
118-120: Consider making response body clearing configurableThe response body and raw content are unconditionally cleared on lines 118-119. This might not be desirable for all use cases, especially when debugging or when the output writer needs this data.
- rr.Response.Raw = "" - rr.Response.Body = "" + if !h.options.Options.KeepResponseBody { + rr.Response.Raw = "" + rr.Response.Body = "" + }pkg/engine/headless/crawler/state.go (4)
19-20: Consider making SimHash threshold configurableThe SimHash threshold is hardcoded to 2 bits. Different use cases might require different similarity thresholds.
-const simhashThreshold = 2 // Allow up to 2 bits difference +// DefaultSimHashThreshold is the default threshold for SimHash distance comparison +const DefaultSimHashThreshold = 2 // Allow up to 2 bits differenceThen use it from configuration if available, falling back to the default.
72-74: Consider handling more blank page patternsThe function only checks for empty URL and "about:blank", but there are other blank page patterns like "chrome://newtab" or "data:," that might need handling.
- if pageInfo.URL == "" || pageInfo.URL == "about:blank" { + blankPages := []string{"", "about:blank", "about:newtab", "chrome://newtab/", "data:,"} + for _, blank := range blankPages { + if pageInfo.URL == blank { + return nil, ErrEmptyPage + } + } - return nil, ErrEmptyPage - }
203-205: FIXME comment needs attentionThere's a FIXME comment indicating that the origin element ID should be returned for correct graph representation.
The FIXME comment on lines 203-205 indicates a potential issue with graph representation. Would you like me to create an issue to track this or help implement a solution?
223-227: Inefficient loop for browser navigationThe loop navigates back one step at a time. This could be optimized by navigating multiple steps at once if the browser API supports it.
- for i := 0; i < stepsBack; i++ { - if err := page.NavigateBack(); err != nil { - return "", err - } - navigatedSuccessfully = true - } + // Try to navigate back multiple steps at once if supported + if err := page.NavigateBackSteps(stepsBack); err != nil { + // Fallback to step-by-step navigation + for i := 0; i < stepsBack; i++ { + if err := page.NavigateBack(); err != nil { + return "", err + } + } + } + navigatedSuccessfully = trueNote: This assumes
NavigateBackStepsmethod can be added to the browser API.pkg/engine/headless/crawler/normalizer/normalizer.go (3)
88-92: Error handling silently returns original on parse failureWhen
strconv.ParseIntfails inreplaceHexEscapeSequence, the function silently returns the original match. This could hide malformed input issues.Consider logging the error for debugging purposes:
value, err := strconv.ParseInt(code, 16, 32) if err != nil { + // Log the error for debugging if a logger is available + // log.Debug("Failed to parse hex escape sequence", "sequence", match, "error", err) // If there's an error, return the original match return match }
97-98: Consider caching compiled regex patternsThe regex pattern is compiled at package initialization but could benefit from being unexported since it's only used internally.
-// Define the regex pattern to match hexadecimal escape sequences -var pattern = regexp.MustCompile(`\\x[0-9a-fA-F]{2}`) +// hexEscapePattern matches hexadecimal escape sequences +var hexEscapePattern = regexp.MustCompile(`\\x[0-9a-fA-F]{2}`)And update line 101:
- return pattern.ReplaceAllStringFunc(input, func(match string) string { + return hexEscapePattern.ReplaceAllStringFunc(input, func(match string) string {
45-54: Missing validation for empty inputThe
Applymethod doesn't check if the input text is empty before processing, which could lead to unnecessary processing.func (n *Normalizer) Apply(text string) (string, error) { + if text == "" { + return "", nil + } + first := normalizeDocument(text) firstpass, err := n.dom.Apply(first)pkg/engine/headless/graph/graph.go (3)
46-48: Fix doc comment to match function name.Apply:
-// AddNavigation adds a navigation to the graph +// AddPageState adds a page state (and its incoming navigation edge, if any) to the graph.
34-44: Return empty slice on error instead of nil.This avoids nil-slice surprises for callers and remains zero-value friendly.
Apply:
func (g *CrawlGraph) GetVertices() []string { vertices := []string{} adjacencyMap, err := g.graph.AdjacencyMap() if err != nil { - return nil + return vertices }
102-108: Return by value or document copy semantics.
&pageVertexreturns a pointer to a copy, not the stored vertex; callers may expect in-place mutation. Prefer returning by value.Example:
-func (g *CrawlGraph) GetPageState(id string) (*types.PageState, error) { +func (g *CrawlGraph) GetPageState(id string) (types.PageState, error) { pageVertex, err := g.graph.Vertex(id) if err != nil { - return nil, errors.Wrap(err, "could not get vertex") + return types.PageState{}, errors.Wrap(err, "could not get vertex") } - return &pageVertex, nil + return pageVertex, nil }pkg/engine/headless/crawler/normalizer/simhash/simhash.go (3)
21-26: Cap text feature size to bound memory/CPU.Long text/comment tokens can balloon feature strings. Cap to a reasonable length.
Apply:
// Constants for optimization const ( maxTokens = 5000 featuresBufSize = 1000 + maxTextLen = 256 )
92-127: Trim large text/comment payloads and avoid fmt.Sprintf hot path.Reduces allocations and GC churn in tight loops.
Apply:
func extractFeatures(t *html.Token, features *[]string) { // Pre-allocate string builder for efficiency var s string switch t.Type { case html.StartTagToken: s = "A:" + t.DataAtom.String() case html.EndTagToken: s = "B:" + t.DataAtom.String() case html.SelfClosingTagToken: s = "C:" + t.DataAtom.String() case html.DoctypeToken: - s = "D:" + string(t.Data) + td := t.Data + if len(td) > maxTextLen { + td = td[:maxTextLen] + } + s = "D:" + td case html.CommentToken: - s = "E:" + string(t.Data) + td := t.Data + if len(td) > maxTextLen { + td = td[:maxTextLen] + } + s = "E:" + td case html.TextToken: - s = "F:" + string(t.Data) + td := t.Data + if len(td) > maxTextLen { + td = td[:maxTextLen] + } + s = "F:" + td case html.ErrorToken: - s = "Z:" + string(t.Data) + td := t.Data + if len(td) > maxTextLen { + td = td[:maxTextLen] + } + s = "Z:" + td default: return } *features = append(*features, s) // Process attributes for _, attr := range t.Attr { switch attr.Key { case "class", "name", "rel": - s = fmt.Sprintf("G:%s:%s:%s", t.DataAtom.String(), attr.Key, attr.Val) + s = "G:" + t.DataAtom.String() + ":" + attr.Key + ":" + attr.Val default: - s = fmt.Sprintf("G:%s:%s", t.DataAtom.String(), attr.Key) + s = "G:" + t.DataAtom.String() + ":" + attr.Key } *features = append(*features, s) } }
34-89: Process features in-stream to avoid holding all features in memory.You can update the fingerprint as features are extracted (only keep a shingle window), eliminating the
featuresslice entirely.I can provide a patch that folds shingling and vector updates into the tokenization loop if you want to pursue this optimization.
pkg/engine/headless/debugger.go (3)
3-9: Import context for graceful shutdown.Apply:
import ( "encoding/json" "fmt" + "context" "net/http" "sync" "time" )
98-102: Avoid double GetActiveURLs call.Minor efficiency/readability tweak.
Apply:
func (cd *CrawlDebugger) handleActiveURLs(w http.ResponseWriter, r *http.Request) { w.Header().Set("Content-Type", "application/json") - json.NewEncoder(w).Encode(map[string]interface{}{ - "timestamp": time.Now().Format(time.RFC3339), - "active_urls": cd.GetActiveURLs(), - "count": len(cd.GetActiveURLs()), - }) + urls := cd.GetActiveURLs() + _ = json.NewEncoder(w).Encode(map[string]interface{}{ + "timestamp": time.Now().Format(time.RFC3339), + "active_urls": urls, + "count": len(urls), + }) }
113-122: Use graceful shutdown.Prefer
Shutdownwith a timeout overCloseto drain in-flight requests.Apply:
func (cd *CrawlDebugger) Close() { if cd == nil { return } if cd.httpServer != nil { - cd.httpServer.Close() + ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second) + defer cancel() + _ = cd.httpServer.Shutdown(ctx) } }pkg/engine/headless/browser/cookie/cookie.go (3)
19-21: Remove unused Action type (or document future use).Currently unused and confusing in a blocker-only implementation.
Apply:
-type Action struct { - Type string `json:"type"` -} +// Action type reserved for future rule actions (e.g., block/redirect/modify). +// Currently unused.
72-83: Use domain suffix match instead of substring for initiators.Substrings can false-positive on unrelated domains.
Apply:
if len(rule.Condition.InitiatorDomains) > 0 { matched := false for _, domain := range rule.Condition.InitiatorDomains { - if strings.Contains(initiatorDomain, domain) { + if domainMatches(initiatorDomain, domain) { matched = true break } } if !matched { return false } }Add this helper (outside the shown hunk):
// domainMatches returns true when host equals domain or ends with "."+domain (case-insensitive). func domainMatches(host, domain string) bool { host = strings.ToLower(host) domain = strings.ToLower(domain) return host == domain || strings.HasSuffix(host, "."+domain) }
85-91: Apply the same strictness for excluded initiators.Apply:
if len(rule.Condition.ExcludedInitiatorDomains) > 0 { for _, domain := range rule.Condition.ExcludedInitiatorDomains { - if strings.Contains(initiatorDomain, domain) { + if domainMatches(initiatorDomain, domain) { return false } } }pkg/engine/headless/crawler/normalizer/dom_utils.go (3)
129-143: Over-aggressive attribute stripping (“href”, “src”, blanket “js”) increases collisions.*Removing href/src from all elements erases meaningful structure and may make unrelated pages hash-identical. Also, stripping any attribute starting with “js” is too broad.
Consider:
- Keep href/src but normalize (e.g., drop known tracking params: utm_*, gclid, fbclid, etc.).
- Narrow the “js” rule to js- prefix only.
var attributes = []string{ "class", "id", - "name", // name tags is often random - "href", // remove href "style", "width", "height", - "src", "nowrap", "target", "valign", "cellpadding", "cellspacing", } @@ - if strings.HasPrefix(attr.Key, "data-") || strings.HasPrefix(attr.Key, "aria-") || strings.HasPrefix(attr.Key, "js") { + if strings.HasPrefix(attr.Key, "data-") || strings.HasPrefix(attr.Key, "aria-") || strings.HasPrefix(attr.Key, "js-") { s.RemoveAttr(attr.Key) }If you do want to touch href values, add a small normalizer that removes only known volatile query params.
Also applies to: 154-167
115-127: Nil-safety for comment removal.Guard against nil nodes to avoid panics on unusual selections.
Apply:
func removeComments(n *html.Node) { + if n == nil { + return + } if n.Type == html.CommentNode { - n.Parent.RemoveChild(n) + if n.Parent != nil { + n.Parent.RemoveChild(n) + } } for c := n.FirstChild; c != nil; c = c.NextSibling { removeComments(c) } } func removeCommentsDomTransformationFunc(s *goquery.Selection) { - removeComments(s.Get(0)) + removeComments(s.Get(0)) }
82-103: Consider returning a stripped “outerHTML” of only.Full doc.Html() may include head/meta noise that varies across requests. Hashing only normalized body can improve stability.
If acceptable, switch to serializing the normalized subtree when present.
pkg/engine/headless/browser/element.go (3)
210-245: Use a consistent Eval path (b.Page.Eval) for clarity.Mixed usage (b.Eval vs b.Page.Eval) is confusing and can hide context differences.
Apply:
- eventlisteners, err := b.Eval(`() => window.__eventListeners`) + eventlisteners, err := b.Page.Eval(`() => window.__eventListeners`) @@ - inlineListeners, err := b.Eval(`() => window.getAllElementsWithEventListeners()`) + inlineListeners, err := b.Page.Eval(`() => window.getAllElementsWithEventListeners()`) @@ - navigatedLinks, err := b.Eval(`() => window.__navigatedLinks`) + navigatedLinks, err := b.Page.Eval(`() => window.__navigatedLinks`)Also applies to: 254-266
246-266: Nit: variable naming in GetNavigatedLinks.The local slice is named listeners; rename to links for readability.
Apply:
- listeners := make([]*NavigatedLink, 0) - if err := navigatedLinks.Value.Unmarshal(&listeners); err != nil { + links := make([]*NavigatedLink, 0) + if err := navigatedLinks.Value.Unmarshal(&links); err != nil { return nil, err } - return listeners, nil + return links, nil
146-151: Case-insensitive tag check.Ensure TagName comparisons don’t depend on casing from JS marshalling.
Apply:
- if element.TagName != "BUTTON" { + if !strings.EqualFold(element.TagName, "BUTTON") { continue }pkg/engine/headless/TODOS.md (2)
84-86: Standardize horizontal rule to Markdown “---”.Improves render consistency and satisfies linters.
---------------------------------- ---
14-23: Tighten acceptance criteria for “page ready”.Define concrete defaults (e.g., idle window 300ms, overall timeout 15s) and failure telemetry to aid tuning.
I can draft the exact options surface and metrics if helpful.
pkg/engine/headless/crawler/diagnostics/diagnostics.go (1)
185-195: Mutex unnecessary in LogPageStateScreenshotThe mutex is acquired but only filesystem operations are performed with no shared state access. Unless the directory path collision is a concern (which is already handled by
MkdirAll), the mutex serves no purpose here.Consider removing the unnecessary mutex:
func (w *diskWriter) LogPageStateScreenshot(pageStateID string, screenshot []byte) error { - w.mu.Lock() - defer w.mu.Unlock() - dir := filepath.Join(w.directory, pageStateID) if err := os.MkdirAll(dir, 0755); err != nil { return err } screenshotFile := filepath.Join(dir, "screenshot.png") return os.WriteFile(screenshotFile, screenshot, 0644) }pkg/engine/headless/types/types.go (1)
4-5: Avoid using MD5 for cryptographic purposesWhile MD5 is being used here for generating hash identifiers rather than security, it's deprecated and has known vulnerabilities. Consider using a more modern hash function.
Consider using SHA-256 for better future-proofing:
import ( - "crypto/md5" + "crypto/sha256" "encoding/hex"And update the hash generation accordingly:
func (e *HTMLElement) Hash() string { - hasher := md5.New() + hasher := sha256.New()pkg/engine/headless/crawler/crawler.go (1)
180-183: Redundant timeout mechanismHaving both context cancellation and
time.Afterfor the same timeout creates unnecessary complexity and resource usage. The context cancellation alone is sufficient.Remove the redundant
time.Aftermechanism:- // Retain the legacy time.After guard as a secondary fail-safe but the - // context cancellation is what actually stops in-flight rod calls. - var crawlTimeout <-chan time.Time - if c.options.MaxCrawlDuration > 0 { - crawlTimeout = time.After(c.options.MaxCrawlDuration) - } - consecutiveFailures := 0 for { select { - case <-crawlTimeout: + case <-ctx.Done(): c.logger.Debug("Max crawl duration reached, stopping crawl") return nil default:pkg/engine/headless/browser/browser.go (4)
329-331: Add browser health check as mentioned in TODOThe TODO comment correctly identifies that the browser might die. Implement a health check before returning the page.
The TODO comment at line 329-330 indicates that browser health checking is needed. Would you like me to generate a solution that implements browser health validation before returning the page from the pool?
494-524: Optimize pool management in PutBrowserToPoolThe function performs multiple health checks that could be consolidated, and the page cleanup logic could be more efficient.
func (l *Launcher) PutBrowserToPool(browser *BrowserPage) { + // Helper to cleanup and discard the browser + discardBrowser := func() { + browser.cancel() + browser.CloseBrowserPage() + } + // Discard pages that hit a deadline or were cancelled to avoid immediately // returning a poisoned page that will fail every subsequent call. if cerr := browser.Page.GetContext().Err(); cerr != nil { - browser.cancel() - browser.CloseBrowserPage() + discardBrowser() return } // If the browser is not connected, close it if !isBrowserConnected(browser.Browser) { - browser.cancel() - browser.CloseBrowserPage() + discardBrowser() return } pages, err := browser.Browser.Pages() if err != nil { - browser.cancel() - browser.CloseBrowserPage() + discardBrowser() return } // Close all pages except the current one currentPageID := browser.Page.TargetID for _, page := range pages { if page.TargetID != currentPageID { _ = page.Close() } } l.browserPool.Put(browser) }
88-88: Consider making incognito mode configurableThe browser is hardcoded to use incognito mode, which might not be suitable for all use cases (e.g., when cookie persistence is needed).
Consider adding an
Incognitofield toLauncherOptionsto make this configurable:type LauncherOptions struct { ChromiumPath string MaxBrowsers int PageMaxTimeout time.Duration ShowBrowser bool Proxy string SlowMotion bool Trace bool CookieConsentBypass bool + Incognito bool // default: true ChromeUser *user.User // optional chrome user to use ScopeValidator ScopeValidator RequestCallback func(*output.Result) }
526-535: Inefficient browser connectivity checkThe
isBrowserConnectedfunction makes a synchronous RPC call every time it's invoked. Consider caching the connection state or using a more lightweight check.func isBrowserConnected(browser *rod.Browser) bool { + // First try a lightweight check if the browser context is valid + if browser == nil { + return false + } + getVersionResult, err := proto.BrowserGetVersion{}.Call(browser) if err != nil { return false } if getVersionResult == nil || getVersionResult.Product == "" { return false } return true }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (2)
go.sumis excluded by!**/*.sumpkg/engine/headless/browser/cookie/rules.jsonis excluded by!**/*.json
📒 Files selected for processing (34)
.gitignore(1 hunks)cmd/katana/main.go(4 hunks)go.mod(7 hunks)internal/runner/options.go(1 hunks)internal/runner/runner.go(2 hunks)pkg/engine/headless/TODOS.md(1 hunks)pkg/engine/headless/browser/browser.go(1 hunks)pkg/engine/headless/browser/cookie/cookie.go(1 hunks)pkg/engine/headless/browser/cookie/cookie_test.go(1 hunks)pkg/engine/headless/browser/element.go(1 hunks)pkg/engine/headless/crawler/crawler.go(1 hunks)pkg/engine/headless/crawler/diagnostics/diagnostics.go(1 hunks)pkg/engine/headless/crawler/normalizer/dom_utils.go(1 hunks)pkg/engine/headless/crawler/normalizer/dom_utils_test.go(1 hunks)pkg/engine/headless/crawler/normalizer/helpers.go(1 hunks)pkg/engine/headless/crawler/normalizer/normalizer.go(1 hunks)pkg/engine/headless/crawler/normalizer/simhash/simhash.go(1 hunks)pkg/engine/headless/crawler/normalizer/simhash/simhash_test.go(1 hunks)pkg/engine/headless/crawler/normalizer/text_utils.go(1 hunks)pkg/engine/headless/crawler/normalizer/text_utils_test.go(1 hunks)pkg/engine/headless/crawler/state.go(1 hunks)pkg/engine/headless/crawler/state_test.go(1 hunks)pkg/engine/headless/debugger.go(1 hunks)pkg/engine/headless/graph/graph.go(1 hunks)pkg/engine/headless/headless.go(1 hunks)pkg/engine/headless/js/js.go(1 hunks)pkg/engine/headless/js/page-init.js(1 hunks)pkg/engine/headless/js/utils.js(1 hunks)pkg/engine/headless/types/types.go(1 hunks)pkg/engine/parser/parser.go(1 hunks)pkg/engine/parser/parser_test.go(1 hunks)pkg/types/crawler_options.go(2 hunks)pkg/types/options.go(3 hunks)pkg/utils/extensions/extensions.go(1 hunks)
🧰 Additional context used
🪛 LanguageTool
pkg/engine/headless/TODOS.md
[grammar] ~1-~1: There might be a mistake here.
Context: # 🗺️ Headless-Crawler Road-Map --- ## 🥇 Core Improvements (High-impact / fir...
(QB_NEW_EN)
[grammar] ~4-~4: There might be a mistake here.
Context: ...mprovements (High-impact / first passes) - [ ] After clicking on elements there isn...
(QB_NEW_EN)
[grammar] ~9-~9: There might be a mistake here.
Context: ...6 on stripped DOM for cheap exact-match. - Stage 2: compute 64-bit SimHash/MinHash ...
(QB_NEW_EN)
[grammar] ~10-~10: There might be a mistake here.
Context: ...es equal when Hamming distance ≤ 3 bits. - Stage 3 (optional): if SimHash inconclus...
(QB_NEW_EN)
[grammar] ~11-~11: There might be a mistake here.
Context: ...-res screenshot and compare pHash/dHash. - Store ExactHash & FuzzyHash; update ...
(QB_NEW_EN)
[grammar] ~22-~22: There might be a mistake here.
Context: ....9)untilscrollHeight` stops growing. - Fallback: IntersectionObserver on a sent...
(QB_NEW_EN)
[grammar] ~27-~27: There might be a mistake here.
Context: ...et, EventSource, Service-Worker scripts. - Feed new, in-scope URLs into the crawl q...
(QB_NEW_EN)
[grammar] ~32-~32: There might be a mistake here.
Context: ...queue (dedup by host/path). --- ## 🥈 Mid-term Enhancements - [ ] **Dynamic form-fill...
(QB_NEW_EN)
[grammar] ~32-~32: There might be a mistake here.
Context: ...ath). --- ## 🥈 Mid-term Enhancements - [ ] Dynamic form-filling - Gener...
(QB_NEW_EN)
[grammar] ~35-~35: There might be a mistake here.
Context: ..., min, max, maxlength, required. - Pluggable ValueProvider interface for ...
(QB_NEW_EN)
[grammar] ~42-~42: There might be a mistake here.
Context: ... Worker pool consuming the action queue. - Use multiple rod.Page instances (share...
(QB_NEW_EN)
[grammar] ~45-~45: There might be a mistake here.
Context: ... - [ ] Smart time-out & retry budgets - Adaptive timeout: first nav longer, late...
(QB_NEW_EN)
[grammar] ~48-~48: There might be a mistake here.
Context: ...d on stall. - [ ] Viewport variants - Crawl again at typical mobile (390×844) ...
(QB_NEW_EN)
[grammar] ~52-~52: There might be a mistake here.
Context: ...* - Close background tabs after use. - If Chrome RSS > threshold, restart brows...
(QB_NEW_EN)
[grammar] ~56-~56: There might be a mistake here.
Context: ... - Spoof fonts, canvas & audio contexts. - Rotate realistic UA strings, languages, ...
(QB_NEW_EN)
[grammar] ~57-~57: There might be a mistake here.
Context: ...rings, languages, hardwareConcurrency. - Optional headful mode via XVFB to enable...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...hs. --- ## 🥉 Nice-to-have / Advanced - [ ] Export crawl sessions - HAR ...
(QB_NEW_EN)
[grammar] ~77-~77: There might be a mistake here.
Context: ...rotation. - [ ] Sandboxing & security - Run Chrome under seccomp / user-namespac...
(QB_NEW_EN)
[grammar] ~90-~90: There might be a mistake here.
Context: ... ## 🎯 Critical Bug Fixes & Edge Cases - [ ] Handle iframe content extraction...
(QB_NEW_EN)
[grammar] ~93-~93: There might be a mistake here.
Context: ...oss-origin iframe detection and flagging - Same-origin iframe DOM traversal - Nes...
(QB_NEW_EN)
[grammar] ~94-~94: There might be a mistake here.
Context: ...ing - Same-origin iframe DOM traversal - Nested iframe support (up to N levels) ...
(QB_NEW_EN)
[grammar] ~97-~97: There might be a mistake here.
Context: ... [ ] WebComponent & Shadow DOM support - Detect custom elements with shadow roots...
(QB_NEW_EN)
[grammar] ~107-~107: There might be a mistake here.
Context: ...# 🔐 Authentication & Session Management - [ ] Auth state detection - Detect ...
(QB_NEW_EN)
[grammar] ~114-~114: There might be a mistake here.
Context: ...n actions - [ ] Multi-step auth flows - OAuth redirect handling - 2FA/MFA dete...
(QB_NEW_EN)
[grammar] ~115-~115: There might be a mistake here.
Context: ...auth flows** - OAuth redirect handling - 2FA/MFA detection and waiting - SAML/S...
(QB_NEW_EN)
[grammar] ~116-~116: There might be a mistake here.
Context: ...ndling - 2FA/MFA detection and waiting - SAML/SSO flow support - [ ] **Session p...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ...uts ## 🎪 Advanced Interaction Patterns - [ ] Complex UI interactions - Drag...
(QB_NEW_EN)
[grammar] ~127-~127: There might be a mistake here.
Context: ... - Drag & drop detection and execution - File upload with generated test files ...
(QB_NEW_EN)
[grammar] ~128-~128: There might be a mistake here.
Context: ... - File upload with generated test files - Multi-select and combo-box handling - ...
(QB_NEW_EN)
[grammar] ~129-~129: There might be a mistake here.
Context: ... - Multi-select and combo-box handling - Date/time picker interaction - [ ] **Ke...
(QB_NEW_EN)
[grammar] ~133-~133: There might be a mistake here.
Context: ... support** - Tab-order based discovery - Keyboard shortcut detection (Ctrl+K, etc...
(QB_NEW_EN)
[grammar] ~134-~134: There might be a mistake here.
Context: ...yboard shortcut detection (Ctrl+K, etc.) - Access key enumeration - [ ] **Touch/mo...
(QB_NEW_EN)
[grammar] ~137-~137: There might be a mistake here.
Context: ...umeration - [ ] Touch/mobile gestures - Swipe detection for mobile views - Lon...
(QB_NEW_EN)
[grammar] ~138-~138: There might be a mistake here.
Context: ...s** - Swipe detection for mobile views - Long-press context menus - Pinch-to-zo...
(QB_NEW_EN)
[grammar] ~139-~139: There might be a mistake here.
Context: ...obile views - Long-press context menus - Pinch-to-zoom aware navigation ## 📊 An...
(QB_NEW_EN)
[grammar] ~142-~142: There might be a mistake here.
Context: ...navigation ## 📊 Analytics & Monitoring - [ ] Performance metrics - Page loa...
(QB_NEW_EN)
[grammar] ~145-~145: There might be a mistake here.
Context: ...ce metrics** - Page load time tracking - JavaScript execution overhead - Memory...
(QB_NEW_EN)
[grammar] ~146-~146: There might be a mistake here.
Context: ...acking - JavaScript execution overhead - Memory usage per page state - Network ...
(QB_NEW_EN)
[grammar] ~147-~147: There might be a mistake here.
Context: ...overhead - Memory usage per page state - Network request waterfalls - [ ] **Craw...
(QB_NEW_EN)
[grammar] ~151-~151: There might be a mistake here.
Context: ...y metrics** - Code coverage per domain - Unique vs duplicate state ratio - Acti...
(QB_NEW_EN)
[grammar] ~152-~152: There might be a mistake here.
Context: ...main - Unique vs duplicate state ratio - Action success/failure rates - Depth d...
(QB_NEW_EN)
[grammar] ~153-~153: There might be a mistake here.
Context: ...e ratio - Action success/failure rates - Depth distribution analysis - [ ] **Err...
(QB_NEW_EN)
[grammar] ~157-~157: There might be a mistake here.
Context: ...g** - JavaScript console error capture - Network error categorization - CSP vio...
(QB_NEW_EN)
[grammar] ~158-~158: There might be a mistake here.
Context: ...capture - Network error categorization - CSP violation logging - Failed action ...
(QB_NEW_EN)
[grammar] ~159-~159: There might be a mistake here.
Context: ...categorization - CSP violation logging - Failed action root cause analysis ## 🧠...
(QB_NEW_EN)
[grammar] ~162-~162: There might be a mistake here.
Context: ... analysis ## 🧠 Smart Crawling Features - [ ] ML-based duplicate detection -...
(QB_NEW_EN)
[grammar] ~164-~164: There might be a mistake here.
Context: ...es - [ ] ML-based duplicate detection - Train model on visual similarity - Sem...
(QB_NEW_EN)
[grammar] ~169-~169: There might be a mistake here.
Context: ...rns - [ ] Priority queue optimization - High-value path prediction - Anomaly d...
(QB_NEW_EN)
[grammar] ~170-~170: There might be a mistake here.
Context: ...ization** - High-value path prediction - Anomaly detection for interesting states...
(QB_NEW_EN)
[grammar] ~171-~171: There might be a mistake here.
Context: ...Anomaly detection for interesting states - Dynamic depth adjustment based on yield ...
(QB_NEW_EN)
[grammar] ~174-~174: There might be a mistake here.
Context: ... on yield - [ ] State space reduction - Identify and prune redundant actions -...
(QB_NEW_EN)
[grammar] ~179-~179: There might be a mistake here.
Context: ...ariations) ## 🛡️ Security & Compliance - [ ] CAPTCHA handling - Detection o...
(QB_NEW_EN)
[grammar] ~181-~181: There might be a mistake here.
Context: ...y & Compliance - [ ] CAPTCHA handling - Detection of common CAPTCHA providers ...
(QB_NEW_EN)
[grammar] ~182-~182: There might be a mistake here.
Context: ... - Detection of common CAPTCHA providers - Integration points for solving services ...
(QB_NEW_EN)
[grammar] ~183-~183: There might be a mistake here.
Context: ... Integration points for solving services - Graceful degradation strategies - [ ] *...
(QB_NEW_EN)
[grammar] ~186-~186: There might be a mistake here.
Context: ...gies - [ ] Rate limiting & politeness - Per-domain request throttling - Respec...
(QB_NEW_EN)
[grammar] ~187-~187: There might be a mistake here.
Context: ...ness** - Per-domain request throttling - Respect robots.txt for headless - Adap...
(QB_NEW_EN)
[grammar] ~188-~188: There might be a mistake here.
Context: ...ling - Respect robots.txt for headless - Adaptive delays based on response times ...
(QB_NEW_EN)
[grammar] ~192-~192: There might be a mistake here.
Context: ... compliance** - PII detection in forms - GDPR banner interaction - Data retenti...
(QB_NEW_EN)
[grammar] ~193-~193: There might be a mistake here.
Context: ...ion in forms - GDPR banner interaction - Data retention policies ## 🔌 Integrati...
(QB_NEW_EN)
[grammar] ~196-~196: There might be a mistake here.
Context: ...ion policies ## 🔌 Integration Features - [ ] API extraction - GraphQL query...
(QB_NEW_EN)
[grammar] ~199-~199: There might be a mistake here.
Context: ...n** - GraphQL query/mutation detection - REST endpoint parameter learning - Web...
(QB_NEW_EN)
[grammar] ~200-~200: There might be a mistake here.
Context: ...ion - REST endpoint parameter learning - WebSocket message format detection - [ ...
(QB_NEW_EN)
[grammar] ~204-~204: There might be a mistake here.
Context: ...OpenAPI spec generation from discoveries - Postman collection export - Burp Suite...
(QB_NEW_EN)
[grammar] ~205-~205: There might be a mistake here.
Context: ...iscoveries - Postman collection export - Burp Suite state file compatibility - [...
(QB_NEW_EN)
[grammar] ~209-~209: There might be a mistake here.
Context: ...- Playwright/Puppeteer script generation - Selenium IDE format export - Custom DS...
(QB_NEW_EN)
[grammar] ~210-~210: There might be a mistake here.
Context: ...eneration - Selenium IDE format export - Custom DSL for replay ## 🚀 Performance...
(QB_NEW_EN)
[grammar] ~213-~213: There might be a mistake here.
Context: ... replay ## 🚀 Performance Optimizations - [ ] Rendering optimizations - Disa...
(QB_NEW_EN)
[grammar] ~220-~220: There might be a mistake here.
Context: ...or battery saving - [ ] Caching layer - DOM diff caching - Screenshot perceptu...
(QB_NEW_EN)
[grammar] ~221-~221: There might be a mistake here.
Context: ...] Caching layer - DOM diff caching - Screenshot perceptual hashes - JavaScr...
(QB_NEW_EN)
[grammar] ~222-~222: There might be a mistake here.
Context: ...caching - Screenshot perceptual hashes - JavaScript execution results - [ ] **Di...
(QB_NEW_EN)
[grammar] ~225-~225: There might be a mistake here.
Context: ...on results - [ ] Distributed crawling - Work queue distribution - State synchr...
(QB_NEW_EN)
[grammar] ~226-~226: There might be a mistake here.
Context: ...d crawling** - Work queue distribution - State synchronization protocol - Resul...
(QB_NEW_EN)
[grammar] ~227-~227: There might be a mistake here.
Context: ...ution - State synchronization protocol - Result aggregation pipeline ## 🔧 Devel...
(QB_NEW_EN)
[grammar] ~230-~230: There might be a mistake here.
Context: ...ion pipeline ## 🔧 Developer Experience - [ ] Debug tooling - Live crawl vis...
(QB_NEW_EN)
[grammar] ~233-~233: There might be a mistake here.
Context: ...g tooling** - Live crawl visualization - State graph explorer UI - Action repla...
(QB_NEW_EN)
[grammar] ~234-~234: There might be a mistake here.
Context: ...isualization - State graph explorer UI - Action replay debugger - [ ] **Configur...
(QB_NEW_EN)
[grammar] ~237-~237: There might be a mistake here.
Context: ...bugger - [ ] Configuration management - Per-site config profiles - A/B testing...
(QB_NEW_EN)
[grammar] ~238-~238: There might be a mistake here.
Context: ...anagement** - Per-site config profiles - A/B testing different strategies - Hot...
(QB_NEW_EN)
[grammar] ~239-~239: There might be a mistake here.
Context: ...les - A/B testing different strategies - Hot-reload of site adapters - [ ] **Tes...
(QB_NEW_EN)
[grammar] ~243-~243: There might be a mistake here.
Context: ...ucture** - Headless crawler unit tests - Integration tests with test sites - Re...
(QB_NEW_EN)
[grammar] ~244-~244: There might be a mistake here.
Context: ...ts - Integration tests with test sites - Regression detection suite state.go is...
(QB_NEW_EN)
[grammar] ~259-~259: There might be a mistake here.
Context: ...──────────────────────────────────────── 1. Fingerprint strategy (page → id) ───────...
(QB_NEW_EN)
[grammar] ~260-~260: There might be a mistake here.
Context: ...── 1. Fingerprint strategy (page → id) ────────────────────────────────────────...
(QB_NEW_EN)
[grammar] ~261-~261: There might be a mistake here.
Context: ...──────────────────────────────────────── A. Canonical DOM extraction • Use t...
(QB_NEW_EN)
[grammar] ~262-~262: There might be a mistake here.
Context: ...──────────── A. Canonical DOM extraction • Use the existing domNormalizer (strip ...
(QB_NEW_EN)
[grammar] ~263-~263: There might be a mistake here.
Context: ...trip scripts, styles, dynamic IDs etc.). • Remove all transient event-attributes ...
(QB_NEW_EN)
[grammar] ~264-~264: There might be a mistake here.
Context: ...ttributes (onclick, onmouseover, …). • Collapse whitespace → single space. B...
(QB_NEW_EN)
[grammar] ~267-~267: There might be a mistake here.
Context: ...espace → single space. B. Two-tier hash • ExactHash = SHA-256(strippedDOM). ...
(QB_NEW_EN)
[grammar] ~268-~268: There might be a mistake here.
Context: ... • ExactHash = SHA-256(strippedDOM). • FuzzyHash = SimHash64(4-word shingles...
(QB_NEW_EN)
[grammar] ~269-~269: There might be a mistake here.
Context: ...mHash64(4-word shingles of strippedDOM). • Treat states equal if - Exac...
(QB_NEW_EN)
[grammar] ~270-~270: There might be a mistake here.
Context: ...rippedDOM). • Treat states equal if - ExactHash matches, or - Hamm...
(QB_NEW_EN)
[grammar] ~271-~271: There might be a mistake here.
Context: ...ual if - ExactHash matches, or - Hamming(FuzzyHash, other.FuzzyHash) ≤ ...
(QB_NEW_EN)
[grammar] ~275-~275: There might be a mistake here.
Context: ...FuzzyHash). C. Optional visual fallback • If comparison is inconclusive (≥ 4 bit...
(QB_NEW_EN)
[grammar] ~277-~277: There might be a mistake here.
Context: ...hot, pHash/dHash → same threshold logic. • Executed lazily to avoid perf hit. Re...
(QB_NEW_EN)
[grammar] ~278-~278: There might be a mistake here.
Context: ...eshold logic. • Executed lazily to avoid perf hit. Resulting struct: type Page...
(QB_NEW_EN)
[grammar] ~322-~322: There might be a mistake here.
Context: ...by XPath, ensure Visible & Interactable, plus DOM equality check under the canon...
(QB_NEW_EN)
[grammar] ~323-~323: There might be a mistake here.
Context: ... canonicalizer to avoid false positives. • If match, return targetOriginID. Step...
(QB_NEW_EN)
[grammar] ~330-~330: There might be a mistake here.
Context: ...Change() (new detector described below). • Recompute fingerprint; if equal (exact...
(QB_NEW_EN)
[grammar] ~336-~336: There might be a mistake here.
Context: ...ction; after each, WaitForRouteChange(). • After final step verify state (same eq...
(QB_NEW_EN)
[grammar] ~337-~337: There might be a mistake here.
Context: ...fter each, WaitForRouteChange(). • After final step verify state (same equality ...
(QB_NEW_EN)
[grammar] ~337-~337: There might be a mistake here.
Context: ...y state (same equality logic as Step 2). • Failure → ErrNoNavigationPossible. En...
(QB_NEW_EN)
[grammar] ~340-~340: There might be a mistake here.
Context: ...→ ErrNoNavigationPossible. Enhancements • Cache the computed “distance” between ...
(QB_NEW_EN)
[grammar] ~341-~341: There might be a mistake here.
Context: ...che the computed “distance” between two states; next call can skip graph search. • Re...
(QB_NEW_EN)
[grammar] ~341-~341: There might be a mistake here.
Context: ...ance” between two states; next call can skip graph search. • Record statistics (#n...
(QB_NEW_EN)
[grammar] ~341-~341: There might be a mistake here.
Context: ...states; next call can skip graph search. • Record statistics (#navigationBackSucc...
(QB_NEW_EN)
[grammar] ~351-~351: There might be a mistake here.
Context: ... const idle = () => new Promise(res => { const done = () => { obs.disconnect(); r...
(QB_NEW_EN)
[grammar] ~357-~357: There might be a mistake here.
Context: ...: true, attributes: true}); reset(); }); window.__katanaReady = () => Promis...
(QB_NEW_EN)
[grammar] ~360-~360: There might be a mistake here.
Context: ...ndow.__katanaReady = () => Promise.all([ idle(), new Promise(r => requestIdle...
(QB_NEW_EN)
[grammar] ~362-~362: There might be a mistake here.
Context: ...requestIdleCallback(r, {timeout: 5000})) ]); Go side: func (p *BrowserPage) Wai...
(QB_NEW_EN)
[grammar] ~371-~371: There might be a mistake here.
Context: ...val(ctx, await window.__katanaReady()) }) } Detects route changes, SPA navigat...
(QB_NEW_EN)
[grammar] ~388-~388: There might be a mistake here.
Context: ...imHash library, enable fuzzy comparator in graph. 3. Stage 3: optional pHash pat...
(QB_NEW_EN)
[grammar] ~388-~388: There might be a mistake here.
Context: ...brary, enable fuzzy comparator in graph. 3. Stage 3: optional pHash path guarded by ...
(QB_NEW_EN)
[grammar] ~389-~389: There might be a mistake here.
Context: ...onal pHash path guarded by feature flag. 4. Replace WaitPageLoadHeuristics with Wait...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
pkg/engine/headless/TODOS.md
17-17: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
18-18: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
84-84: Horizontal rule style
Expected: ---; Actual: ----------------------------------
(MD035, hr-style)
🪛 OSV Scanner (2.1.0)
go.mod
[HIGH] 113-113: golang.org/x/oauth2 0.11.0: Unexpected memory consumption during token parsing in golang.org/x/oauth2
(GO-2025-3488)
[HIGH] 113-113: golang.org/x/oauth2 0.11.0: golang.org/x/oauth2 Improper Validation of Syntactic Correctness of Input vulnerability
… new-full-headless-mode
There was a problem hiding this comment.
Actionable comments posted: 23
🧹 Nitpick comments (21)
go.mod (3)
14-14: tint addition: OK; ensure graceful fallback for non‑TTY.If used as a slog handler, confirm it degrades to plain output when not attached to a TTY to avoid control codes in logs (CI, JSON logs).
17-17: SimHash library is unmaintained; consider switching or vendoring.mfonda/simhash is from 2015 and effectively unmaintained. Prefer a maintained SimHash implementation or vendor it with tests to mitigate supply‑chain and maintenance risk.
If you want, I can propose alternatives (e.g., go-dedup/simhash) and a minimal adapter layer.
66-66: Indirect pprof pinned to a tip pseudo‑version; consider stability.It’s fine to keep, but if not strictly required, consider letting the resolver pick a stable tag/earlier pseudo‑version to reduce churn in reproducible builds.
pkg/engine/headless/js/utils.js (2)
57-69: Optional: Reduce cost of scanning all elements for inline listeners.document.querySelectorAll("") can be heavy on large pages; consider narrowing to common interactive elements (a, button, input, select, textarea, [role="button"], [onclick], etc.) or short‑circuit when no “on” properties exist.
229-231: Nit: Avoid deprecated substr().Use slice(1) for class name trimming.
Apply:
- result += "." + window.escapeIdentifierIfNeeded(prefixedName.substr(1)); + result += "." + window.escapeIdentifierIfNeeded(prefixedName.slice(1));pkg/engine/headless/browser/element.go (1)
69-76: Limit form-button de‑duplication to submit controls.Currently all BUTTONs inside forms are marked, which can suppress legitimate non-submit button actions. Only mark submit buttons.
Apply:
- if element.TagName != "BUTTON" { + if element.TagName != "BUTTON" { continue } - // TODO: Check if this button is already in the unique map - // and if so remove it - unique[element.Hash()] = struct{}{} + if strings.EqualFold(element.Type, "submit") { + unique[element.Hash()] = struct{}{} + }pkg/engine/headless/crawler/formfill.go (1)
225-226: Escape quotes in CSS attribute selector.Unescaped " in value breaks [value="..."] selector.
Apply:
- valueSelector := fmt.Sprintf(`[value="%s"]`, value) + esc := strings.ReplaceAll(value, `"`, `\"`) + valueSelector := fmt.Sprintf(`[value="%s"]`, esc)Add import:
import "strings"cmd/katana/main.go (1)
101-110: Simplify pprof shutdown; avoid nested defer.Call Stop() directly within the deferred closure.
Apply:
- defer func() { - if pprofServer != nil { - defer pprofServer.Stop() - } - }() + defer func() { + if pprofServer != nil { + pprofServer.Stop() + } + }()pkg/engine/headless/headless.go (1)
38-41: Debugger port: avoid hard-coded 8089 clashesMake port configurable (flag/env) or allow 0 to auto-assign and log the chosen port.
- headless.debugger = NewCrawlDebugger(8089) + // TODO: plumb from options; consider 0 for ephemeral port + headless.debugger = NewCrawlDebugger(8089)pkg/engine/headless/crawler/normalizer/normalizer.go (2)
38-41: Doc nit: “Denormalizing” → “Normalize again”The step is a second normalization pass, not denormalization. Update comment for clarity.
-// - Denormalizing it +// - Normalizing again for cleanup
68-76: URL decoding may convert '+' to space
url.QueryUnescapetreats '+' as a space. If inputs are not query-encoded, considerurl.PathUnescapeor a guarded decode.pkg/engine/headless/debugger.go (1)
117-125: Prefer graceful shutdownUse
Shutdownwith a timeout to drain connections.func (cd *CrawlDebugger) Close() { if cd == nil { return } if cd.httpServer != nil { - cd.httpServer.Close() + ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second) + defer cancel() + _ = cd.httpServer.Shutdown(ctx) } }Add imports:
import "context"pkg/engine/headless/browser/cookie/cookie.go (1)
58-66: Guard: URLFilter empty matches allIf any rule has an empty URLFilter, strings.Contains(url, "") will be true and matching relies only on other predicates. If that’s intentional per rules.json, ignore. Otherwise, add an explicit check to avoid accidental global matches.
- if !strings.Contains(url, rule.Condition.URLFilter) { + if rule.Condition.URLFilter == "" || !strings.Contains(url, rule.Condition.URLFilter) { return false }pkg/engine/headless/crawler/crawler.go (1)
141-150: DOT path uses options directory — verifyAfter applying the propagation fix, the DOT will land in the same diagnostics dir as the writer. If you prefer isolation, write to that directory unconditionally, not CWD.
pkg/engine/headless/browser/browser.go (1)
563-570: Also cancel context on closeCloseBrowserPage should cancel its context to reliably terminate per-page operations.
func (b *BrowserPage) CloseBrowserPage() { + if b.cancel != nil { + b.cancel() + } b.Page.Close() b.Browser.Close() // Clean up the temp user data directory if b.userDataDir != "" { _ = os.RemoveAll(b.userDataDir) } }pkg/engine/headless/js/page-init.js (6)
48-55: Also listen to popstate to capture back/forward navigations.Add popstate so you don’t miss SPA navigations via history traversal.
// Add event listener for hashchange window.addEventListener("hashchange", function () { window.__navigatedLinks.push({ url: document.location.href, source: "hashchange", }); }); + // Also capture history traversal via back/forward + window.addEventListener("popstate", function () { + try { + window.__navigatedLinks.push({ + url: document.location.href, + source: "popstate", + }); + } catch (_) {} + });
25-74: Bound growth of __navigatedLinks to avoid memory bloat.This array can grow unbounded on long crawls. Cap size (e.g., 5k–10k) via a helper.
- window.__navigatedLinks = []; + window.__navigatedLinks = []; + const __pushNavigated = (url, source) => { + try { + const a = window.__navigatedLinks; + if (a.length > 5000) a.shift(); + a.push({ url, source }); + } catch (_) {} + }; @@ - function __wrappedPushState(a, b, c) { - try { window.__navigatedLinks.push({ url: c, source: "history.pushState" }); } catch (_) {} - return __origPushState(a, b, c); - } + function __wrappedPushState(a, b, c) { __pushNavigated(c, "history.pushState"); return __origPushState(a, b, c); } @@ - function __wrappedReplaceState(a, b, c) { - try { window.__navigatedLinks.push({ url: c, source: "history.replaceState" }); } catch (_) {} - return __origReplaceState(a, b, c); - } + function __wrappedReplaceState(a, b, c) { __pushNavigated(c, "history.replaceState"); return __origReplaceState(a, b, c); } @@ - function __wrappedOpen(url, ...rest) { - try { window.__navigatedLinks.push({ url, source: "window.open" }); } catch (_) {} - return __origOpen(url, ...rest); - } + function __wrappedOpen(url, ...rest) { __pushNavigated(url, "window.open"); return __origOpen(url, ...rest); } @@ - window.addEventListener("hashchange", function () { - window.__navigatedLinks.push({ - url: document.location.href, - source: "hashchange", - }); - }); + window.addEventListener("hashchange", function () { __pushNavigated(document.location.href, "hashchange"); });
68-74: Stabilize and freeze fetch override.Bind to window and freeze the replacement for consistency with other hooks; guard push in try/catch.
- var originalFetch = window.fetch; - window.fetch = function (...args) { - const url = args[0] instanceof Request ? args[0].url : args[0]; - window.__navigatedLinks.push({ url: url, source: "fetch" }); - return originalFetch.apply(this, args); - }; + const __origFetch = window.fetch.bind(window); + function __wrappedFetch(...args) { + const url = (args[0] && typeof args[0] === "object" && "url" in args[0]) ? args[0].url : args[0]; + try { window.__navigatedLinks.push({ url, source: "fetch" }); } catch (_) {} + return __origFetch(...args); + } + Object.defineProperty(window, "fetch", { value: __wrappedFetch, writable: false, configurable: false });
90-95: Gate window.close block behind an option, keep original.Avoid surprising site logic that legitimately calls close() (e.g., popups).
- window.close = function () { - console.log("[hook] trying to close page."); - }; + const __origClose = window.close.bind(window); + window.close = function (...args) { + if (window.__katanaHooksOptions?.preventWindowClose === true) { + try { console.log("[hook] trying to close page."); } catch (_) {} + return; + } + return __origClose(...args); + };
96-109: Speed-up factor should be configurable and bounded.Hard-coded 0.1 can break timing-dependent flows. Make it opt-in and clamp to non-negative.
- const speedUpFactor = 0.1; // For example, 10 times faster + const speedUpFactor = (window.__katanaHooksOptions?.timeoutSpeedupFactor ?? 1); @@ - window.setTimeout = function (callback, delay, ...args) { - return originalSetTimeout(callback, delay * speedUpFactor, ...args); - }; + window.setTimeout = function (callback, delay, ...args) { + const d = Math.max(0, Number(delay) * speedUpFactor); + return originalSetTimeout(callback, d, ...args); + }; window.setInterval = function (callback, delay, ...args) { - return originalSetInterval(callback, delay * speedUpFactor, ...args); + const d = Math.max(0, Number(delay) * speedUpFactor); + return originalSetInterval(callback, d, ...args); };
115-150: Consider extending hook to EventTarget to capture window/document listeners.Overriding Element.prototype misses addEventListener on Window/Document. If needed, also wrap EventTarget.prototype with care (avoid double-wrapping).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (26)
cmd/katana/main.go(4 hunks)go.mod(4 hunks)internal/runner/options.go(1 hunks)internal/runner/runner.go(2 hunks)pkg/engine/headless/browser/browser.go(1 hunks)pkg/engine/headless/browser/cookie/cookie.go(1 hunks)pkg/engine/headless/browser/element.go(1 hunks)pkg/engine/headless/crawler/crawler.go(1 hunks)pkg/engine/headless/crawler/diagnostics/diagnostics.go(1 hunks)pkg/engine/headless/crawler/formfill.go(1 hunks)pkg/engine/headless/crawler/normalizer/dom_utils.go(1 hunks)pkg/engine/headless/crawler/normalizer/normalizer.go(1 hunks)pkg/engine/headless/crawler/normalizer/text_utils.go(1 hunks)pkg/engine/headless/crawler/normalizer/text_utils_test.go(1 hunks)pkg/engine/headless/crawler/state.go(1 hunks)pkg/engine/headless/debugger.go(1 hunks)pkg/engine/headless/graph/graph.go(1 hunks)pkg/engine/headless/headless.go(1 hunks)pkg/engine/headless/js/page-init.js(1 hunks)pkg/engine/headless/js/utils.js(1 hunks)pkg/engine/headless/types/types.go(1 hunks)pkg/engine/parser/parser.go(1 hunks)pkg/engine/parser/parser_test.go(1 hunks)pkg/types/crawler_options.go(2 hunks)pkg/types/options.go(3 hunks)pkg/utils/extensions/extensions.go(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- pkg/engine/headless/crawler/diagnostics/diagnostics.go
🚧 Files skipped from review as they are similar to previous changes (7)
- pkg/types/options.go
- internal/runner/runner.go
- pkg/engine/headless/crawler/normalizer/text_utils.go
- pkg/utils/extensions/extensions.go
- pkg/engine/headless/crawler/normalizer/text_utils_test.go
- internal/runner/options.go
- pkg/engine/parser/parser_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-25T12:59:59.470Z
Learnt from: Ice3man543
PR: projectdiscovery/katana#1396
File: pkg/engine/headless/js/utils.js:156-177
Timestamp: 2025-09-25T12:59:59.470Z
Learning: The CSS selector generation code in pkg/engine/headless/js/utils.js (specifically the _cssPathStep function) is copied directly from Chrome DevTools and is correct as implemented. The mix of uppercase and lowercase tag names follows the original Chrome implementation.
Applied to files:
pkg/engine/headless/js/utils.js
🧬 Code graph analysis (12)
pkg/engine/parser/parser.go (1)
pkg/navigation/request.go (1)
Request(12-25)
cmd/katana/main.go (1)
pkg/engine/headless/headless.go (1)
Headless(18-25)
pkg/engine/headless/crawler/normalizer/normalizer.go (2)
pkg/engine/headless/crawler/normalizer/dom_utils.go (2)
DOMNormalizer(31-33)NewDOMNormalizer(38-73)pkg/engine/headless/crawler/normalizer/text_utils.go (2)
TextNormalizer(30-33)NewTextNormalizer(39-53)
pkg/engine/headless/crawler/formfill.go (4)
pkg/engine/headless/types/types.go (2)
HTMLElement(132-145)HTMLForm(197-209)pkg/utils/formfill.go (6)
FormInput(37-42)FormTextArea(58-61)FormSelect(52-56)SelectOption(45-49)FormFillSuggestions(176-193)FormData(13-13)pkg/engine/headless/crawler/crawler.go (1)
Crawler(28-37)pkg/engine/headless/browser/browser.go (1)
BrowserPage(149-156)
pkg/engine/headless/debugger.go (1)
pkg/navigation/request.go (2)
Depth(9-9)Request(12-25)
pkg/engine/headless/browser/element.go (2)
pkg/engine/headless/types/types.go (7)
HTMLElement(132-145)Action(36-44)ActionTypeFillForm(89-89)ActionTypeLeftClick(75-75)ActionFromEventListener(95-129)HTMLForm(197-209)EventListener(270-274)pkg/engine/headless/browser/browser.go (2)
BrowserPage(149-156)ScopeValidator(61-61)
pkg/engine/headless/js/page-init.js (1)
pkg/engine/headless/js/utils.js (1)
listener(77-77)
pkg/engine/headless/crawler/crawler.go (8)
pkg/engine/headless/browser/browser.go (5)
Launcher(37-42)ScopeValidator(61-61)NewLauncher(64-71)LauncherOptions(45-59)BrowserPage(149-156)pkg/engine/headless/types/types.go (7)
Action(36-44)ActionTypeLoadURL(73-73)PageState(20-33)ActionTypeFillForm(89-89)ActionTypeLeftClick(75-75)ActionTypeLeftClickDown(76-76)HTMLElement(132-145)pkg/engine/headless/graph/graph.go (2)
CrawlGraph(15-17)NewCrawlGraph(24-32)pkg/engine/headless/crawler/normalizer/simhash/simhash.go (2)
Oracle(134-137)NewOracle(140-142)pkg/engine/headless/crawler/diagnostics/diagnostics.go (1)
Writer(16-22)pkg/output/result.go (1)
Result(10-15)pkg/engine/headless/crawler/normalizer/normalizer.go (2)
Normalizer(17-20)New(23-33)pkg/engine/headless/crawler/state.go (1)
ErrNoNavigationPossible(118-118)
pkg/engine/headless/headless.go (7)
pkg/types/crawler_options.go (1)
CrawlerOptions(23-47)pkg/engine/headless/debugger.go (2)
CrawlDebugger(20-24)NewCrawlDebugger(27-52)pkg/engine/headless/crawler/crawler.go (2)
New(79-126)Options(39-63)pkg/engine/headless/browser/browser.go (1)
ScopeValidator(61-61)pkg/output/result.go (1)
Result(10-15)pkg/navigation/request.go (1)
Request(12-25)pkg/engine/parser/parser.go (1)
NewResponseParser(37-75)
pkg/engine/headless/crawler/state.go (5)
pkg/engine/headless/crawler/crawler.go (1)
Crawler(28-37)pkg/engine/headless/browser/browser.go (1)
BrowserPage(149-156)pkg/engine/headless/types/types.go (3)
Action(36-44)PageState(20-33)HTMLElement(132-145)pkg/engine/headless/crawler/normalizer/simhash/simhash.go (2)
Distance(149-151)Fingerprint(130-132)pkg/engine/headless/crawler/diagnostics/diagnostics.go (1)
PreActionPageState(27-27)
pkg/engine/headless/graph/graph.go (1)
pkg/engine/headless/types/types.go (2)
PageState(20-33)Action(36-44)
pkg/engine/headless/browser/browser.go (7)
pkg/output/result.go (1)
Result(10-15)pkg/engine/headless/js/js.go (1)
InitJavascriptEnv(19-27)pkg/navigation/request.go (1)
Request(12-25)pkg/navigation/response.go (1)
Headers(11-11)pkg/engine/headless/browser/cookie/cookie.go (1)
ShouldBlockRequest(48-56)pkg/utils/utils.go (1)
FlattenHeaders(72-78)pkg/engine/hybrid/hijack.go (2)
FetchContinueRequest(92-97)FetchGetResponseBody(71-89)
🪛 GitHub Check: CodeQL
pkg/engine/parser/parser.go
[failure] 113-113: Incomplete URL scheme check
This check does not consider vbscript.
🪛 GitHub Check: Lint
pkg/engine/headless/debugger.go
[failure] 111-111:
Error return value of (*encoding/json.Encoder).Encode is not checked (errcheck)
[failure] 102-102:
Error return value of (*encoding/json.Encoder).Encode is not checked (errcheck)
pkg/engine/headless/headless.go
[failure] 164-164:
Error return value of h.deduplicator.Set is not checked (errcheck)
[failure] 125-125:
Error return value of h.options.OutputWriter.Write is not checked (errcheck)
[failure] 120-120:
Error return value of h.options.OutputWriter.Write is not checked (errcheck)
pkg/engine/headless/browser/browser.go
[failure] 40-40:
field userDataDir is unused (unused)
[failure] 407-407:
Error return value is not checked (errcheck)
[failure] 330-330:
Error return value of browserPage.handlePageDialogBoxes is not checked (errcheck)
pkg/engine/headless/types/types.go
[failure] 165-165:
var dynamicIDAndClassPattern is unused (unused)
🪛 GitHub Actions: 🔨 Build Test
pkg/engine/headless/browser/browser.go
[error] 330-330: Error return value of browserPage.handlePageDialogBoxes is not checked (errcheck)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: release-test-mac
- GitHub Check: release-test-linux
- GitHub Check: Functional Test (ubuntu-latest)
- GitHub Check: Functional Test (macOS-latest)
- GitHub Check: Functional Test (windows-latest)
🔇 Additional comments (21)
go.mod (2)
12-12: Rod upgrade + transitive alignment: looks consistent; check CGO/leakless and browser compatibility.v0.116.2 with got v0.40.0 and leakless v0.9.0 is a sensible set. Please verify:
- Runtime environment has a compatible Chrome/Chromium for this Rod version.
- CGO/leakless doesn’t break static or cross builds in your CI (musl/alpine, linux/arm64, windows).
Also applies to: 114-114, 155-155
10-11: Approve dependency additions Verified direct imports for github.com/adrianbrad/queue (pkg/engine/headless/crawler/crawler.go:14) and github.com/dominikbraun/graph (pkg/engine/headless/graph/graph.go:8); dependencies are used.pkg/engine/headless/js/utils.js (2)
34-40: LGTM: selector guard applied.The try/catch around querySelectorAll safely handles invalid selectors.
42-51: LGTM: XPath guard applied.Wrapping document.evaluate prevents crashes on invalid XPath.
pkg/engine/headless/browser/element.go (2)
131-139: LGTM: scheme allowlist for links.Filtering to http/https before scope check is correct and prevents non-navigable schemes.
111-113: Confirm launcher presence before use.Ensure b.launcher is always non-nil here; otherwise ScopeValidator() will panic.
If needed, guard:
if b.launcher == nil || b.launcher.ScopeValidator() == nil { return navigations, nil }pkg/types/crawler_options.go (1)
5-6: LGTM: added Logger and ChromeUser. Verify wiring.Fields look fine. Confirm they’re set by the runner when headless mode is used.
Also applies to: 43-46
pkg/engine/headless/headless.go (3)
159-169: Dedup race: use atomic SetNX patternThe Get-then-Set can race under concurrency. Use Set-if-not-exists.
- if _, ok := h.deduplicator.Get(resp.URL); ok { - continue - } - h.deduplicator.Set(resp.URL, struct{}{}) + // Atomically insert; if it already existed, skip + if !h.deduplicator.SetNX(resp.URL, struct{}{}) { + continue + }If
SetNXisn’t available inSyncLockMap, add it (or a helper that locks around check+insert) and switch callers.#!/bin/bash # Verify whether SetNX exists on SyncLockMap rg -nP 'type\s+SyncLockMap|func\s+\(.*\*\s*SyncLockMap.*\)\s*SetNX\s*\(' -C3
75-87: Good: safe ScopeManager nil handlingThanks for adding the nil guard before calling
Validate.
127-132: Trace toggled with diagnostics — confirm intent
TracemirrorsEnableDiagnostics. If tracing and diagnostics are different modes, consider distinct flags.pkg/engine/headless/crawler/state.go (1)
210-235: Stronger element identity matching looks goodRequiring either ID equality or at least two non-empty attribute matches reduces false positives as discussed.
pkg/engine/headless/crawler/normalizer/normalizer.go (1)
14-14: Whitespace regex fixedCorrectly matches actual CR/LF and collapses runs.
pkg/engine/headless/crawler/normalizer/dom_utils.go (1)
105-108: Nice: single-root selection transforms (perf)Applying selection-based transforms once at the root avoids O(N²) traversal.
pkg/engine/headless/browser/cookie/cookie.go (1)
42-45: Deterministic rule ordering implemented — LGTMThe rules are sorted by priority at init as previously suggested. Good.
pkg/engine/headless/graph/graph.go (2)
85-103: Nil action guard — LGTMThe nil check prevents panics; edge attributes/weighting look consistent.
113-131: ShortestPath: skips about:blank and nil actions — LGTMPre-sized slice and nil filtering avoid downstream surprises.
pkg/engine/headless/types/types.go (1)
188-193: Diagnostics: prefer logger over stderrPrints to stderr from library code are noisy. Use a logger or gate behind a callback.
- if IsDiagnosticEnabled { - fmt.Fprintf(os.Stderr, "[diagnostic] Element hash input: %s\n", hashInput) - } + // Optionally log via a provided logger in callers; avoid direct stderr writes here.Same for lines 236-241.
pkg/engine/headless/browser/browser.go (1)
382-458: Stop leaking EachEvent goroutineCapture and call the stop func (or cancel context) when closing the page to prevent goroutine leaks.
Would you like me to propose a concrete patch that stores a stop function on BrowserPage and calls it from CloseBrowserPage?
pkg/engine/headless/js/page-init.js (3)
27-39: History wrappers: good fix (wrap + delegate + freeze).This resolves the earlier concern about stubbing without delegation. Looks correct and safe.
41-46: window.open wrapper: good (delegates and preserves return value).This addresses the prior breakage by calling the original and freezing replacement.
115-150: Event listener hook: avoid PII, handle object listeners, cap array, and guard utils.Current code can leak sensitive values, stores full listener source, and grows unbounded. Also utils may be undefined.
const originalAddEventListener = Element.prototype.addEventListener; window.__eventListeners = []; Element.prototype.addEventListener = function (type, listener, options) { @@ - let item = { + const isSensitive = (el) => { + const t = (el.type || "").toLowerCase(); + const n = (el.name || "").toLowerCase(); + return t === "password" || /(pass|token|secret|key|auth|session)/i.test(n); + }; + const listenerStr = (typeof listener === "function") + ? String(listener) + : (listener && typeof listener.handleEvent === "function" ? String(listener.handleEvent) : "[object Listener]"); + const isFormCtl = ["INPUT","TEXTAREA","SELECT"].includes(this.tagName); + let item = { element: { tagName: this.tagName, - id: this.id, - classes: this.className, - outerHTML: this.outerHTML.slice(0, 100), // Capture a snippet of the element's outerHTML - xpath: window.getXPath(this), - cssSelector: window.getCssPath(this), - attributes: window.getElementAttributes(this), - textContent: this.textContent.trim(), - hidden: this.hidden, - name: this.name, - type: this.type, - value: this.value, + id: this.id || "", + classes: this.className || "", + outerHTML: (this.outerHTML || "").slice(0, 100), + xpath: typeof window.getXPath === "function" ? window.getXPath(this) : "", + cssSelector: typeof window.getCssPath === "function" ? window.getCssPath(this) : "", + attributes: typeof window.getElementAttributes === "function" ? window.getElementAttributes(this) : {}, + textContent: isFormCtl ? "" : ((this.textContent || "").trim().slice(0, 100)), + hidden: !!this.hidden, + name: this.name, + type: this.type, + value: isSensitive(this) ? undefined : (typeof this.value === "string" ? this.value.slice(0, 64) : undefined), }, type: type, - listener: listener.toString(), + listener: listenerStr.slice(0, 500), options: options || {}, }; - console.log("[hook] got event listener", item); - window.__eventListeners.push(item); + try { + (window.__eventListeners.length > 2000) && window.__eventListeners.shift(); + window.__eventListeners.push(item); + } catch (_) {} return originalAddEventListener.call(this, type, listener, options); };
dogancanbakir
left a comment
There was a problem hiding this comment.
Still reviewing but getting into loop when running the following command:
go run . -u vulnerabletarget.com -headless -sb -valready shared with @Ice3man543
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
pkg/engine/headless/crawler/crawler.go (1)
101-116: Persist diagnostics directory, sanitize timestamp for Windows, and handle Getwd error.
- The resolved diagnostics directory isn’t written back to opts, so Crawl() writes the DOT under an empty c.options.DiagnosticsDir. This was flagged earlier and remains.
- RFC3339 includes “:” which is invalid in Windows filenames.
- os.Getwd() error is ignored.
var diagnosticsWriter diagnostics.Writer if opts.EnableDiagnostics { directory := opts.DiagnosticsDir if directory == "" { - cwd, _ := os.Getwd() - directory = filepath.Join(cwd, fmt.Sprintf("katana-diagnostics-%s", time.Now().Format(time.RFC3339))) + cwd, err := os.Getwd() + if err != nil { + cwd = os.TempDir() + } + // Use a filesystem-safe UTC timestamp (no colons), e.g. 20250930T141953Z + ts := time.Now().UTC().Format("20060102T150405Z") + directory = filepath.Join(cwd, fmt.Sprintf("katana-diagnostics-%s", ts)) } writer, err := diagnostics.NewWriter(directory) if err != nil { return nil, err } diagnosticsWriter = writer + // Persist resolved directory for later use (e.g., crawl-graph.dot). + opts.DiagnosticsDir = directory opts.Logger.Info("Diagnostics enabled", slog.String("directory", directory)) }
🧹 Nitpick comments (2)
pkg/engine/headless/crawler/crawler.go (2)
447-451: Use errors.As for rod.CoveredError.Consistent with other typed Rod errors, prefer errors.As here.
- if errors.Is(err, &rod.CoveredError{}) { + var ce *rod.CoveredError + if errors.As(err, &ce) { return ErrElementNotVisible }
141-150: Fallback write path for DOT when DiagnosticsDir is empty.After persisting DiagnosticsDir in New(), this should be covered. If you decide not to persist, consider falling back to CWD or a temp dir here to avoid writing to a relative “crawl-graph.dot”.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/engine/headless/crawler/crawler.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/engine/headless/crawler/crawler.go (9)
pkg/engine/headless/browser/browser.go (5)
Launcher(37-42)ScopeValidator(61-61)NewLauncher(64-71)LauncherOptions(45-59)BrowserPage(149-156)pkg/utils/queue/queue.go (1)
Queue(19-25)pkg/engine/headless/types/types.go (7)
Action(36-44)ActionTypeLoadURL(73-73)PageState(20-33)ActionTypeFillForm(89-89)ActionTypeLeftClick(75-75)ActionTypeLeftClickDown(76-76)HTMLElement(132-145)pkg/engine/headless/graph/graph.go (2)
CrawlGraph(15-17)NewCrawlGraph(24-32)pkg/engine/headless/crawler/normalizer/simhash/simhash.go (2)
Oracle(134-137)NewOracle(140-142)pkg/engine/headless/crawler/diagnostics/diagnostics.go (3)
Writer(16-22)NewWriter(55-66)PostActionPageState(28-28)pkg/output/result.go (1)
Result(10-15)pkg/engine/headless/crawler/normalizer/normalizer.go (2)
Normalizer(17-20)New(23-33)pkg/engine/headless/crawler/state.go (1)
ErrNoNavigationPossible(118-118)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: release-test-linux
- GitHub Check: Analyze (go)
- GitHub Check: Functional Test (ubuntu-latest)
- GitHub Check: Functional Test (windows-latest)
- GitHub Check: Functional Test (macOS-latest)
🔇 Additional comments (3)
pkg/engine/headless/crawler/crawler.go (3)
69-77: Thanks for removing the startup panic path.init() now records the normalizer init error and avoids panicking. This unblocks tests and improves startup robustness.
128-135: Diagnostics.Close() error is now handled.Close() logs a warning on diagnostics writer close failures instead of dropping the error. LGTM.
152-158: Sanity check passed: all referenced symbols are defined.
emptyPageHash (state.go), WaitPageLoadHeurisitics (browser.go), and processForm (formfill.go) are implemented within the package; no missing definitions.
There was a problem hiding this comment.
Actionable comments posted: 6
♻️ Duplicate comments (5)
pkg/engine/headless/js/page-init.js (1)
138-158: Address PII capture, object listener handling, and unbounded growth.This code segment still has the unresolved issues flagged in past reviews:
- PII leakage risk (lines 147, 151): Capturing
textContentandvaluewithout sanitization can leak passwords, tokens, and sensitive data.- Object listener crash (line 154):
listener.toString()will fail or produce"[object Object]"when the listener is an object with ahandleEventmethod.- Unbounded memory growth (line 158):
__eventListenerscan grow indefinitely, causing memory issues.🔎 Proposed fix
if (this.tagName == "BODY") { return originalAddEventListener.call(this, type, listener, options); } + + const isSensitive = (el) => { + const t = (el.type || "").toLowerCase(); + const n = (el.name || "").toLowerCase(); + return t === "password" || /(pass|token|secret|key|auth|session|credit|ssn)/i.test(n); + }; + const safeValue = isSensitive(this) ? "[REDACTED]" : (this.value || "").slice(0, 64); + const safeText = (this.textContent || "").trim().slice(0, 100); + const listenerStr = (typeof listener === "function") + ? (listener.toString ? String(listener).slice(0, 500) : "[Function]") + : (listener && typeof listener.handleEvent === "function" ? "[Object:handleEvent]" : "[Object]"); + let item = { element: { tagName: this.tagName, id: this.id, classes: this.className, outerHTML: this.outerHTML.slice(0, 100), // Capture a snippet of the element's outerHTML xpath: window.getXPath(this), cssSelector: window.getCssPath(this), attributes: window.getElementAttributes(this), - textContent: this.textContent.trim(), + textContent: safeText, hidden: this.hidden, name: this.name, type: this.type, - value: this.value, + value: safeValue, }, type: type, - listener: listener.toString(), + listener: listenerStr, options: options || {}, }; console.log("[hook] got event listener", item); + if (window.__eventListeners.length > 5000) { + window.__eventListeners.shift(); // FIFO ring buffer + } window.__eventListeners.push(item); return originalAddEventListener.call(this, type, listener, options);Based on past review feedback and privacy best practices.
pkg/engine/headless/crawler/normalizer/dom_utils.go (1)
171-178: Overbroad "js" attribute removal still present.The condition
strings.HasPrefix(attr.Key, "js")will incorrectly match unrelated attributes likejson,justify,jsonld, etc. This was flagged in a previous review but remains unfixed.🔎 Proposed fix
for _, node := range s.Nodes { for _, attr := range node.Attr { attr := attr - if strings.HasPrefix(attr.Key, "data-") || strings.HasPrefix(attr.Key, "aria-") || strings.HasPrefix(attr.Key, "js") { + if strings.HasPrefix(attr.Key, "data-") || + strings.HasPrefix(attr.Key, "aria-") || + attr.Key == "js" || strings.HasPrefix(attr.Key, "js-") { s.RemoveAttr(attr.Key) } } }pkg/engine/headless/headless.go (1)
179-189: Potential race condition in deduplication logic.The check-then-set pattern (Get at line 180, then Set at line 183) is not atomic. While
SyncLockMapprovides thread-safe individual operations, concurrent calls toperformAdditionalAnalysiscould result in duplicate entries being processed if two goroutines check the same URL simultaneously before either sets it.Consider using an atomic set-if-not-exists pattern if
SyncLockMapsupports it, or accept this as a minor edge case given current single-threaded crawling (per the TODO at line 153).pkg/engine/headless/graph/graph.go (1)
55-83: Bug: Returning on ErrVertexAlreadyExists skips edge creation.When
AddVertexreturnsErrVertexAlreadyExists(lines 60-61), the function returns early and never creates the edge for thisPageState. This breaks graph connectivity when re-visiting a state from a different origin, as the new edge won't be recorded.🔎 Proposed fix
err := g.graph.AddVertex(n, func(vp *graph.VertexProperties) { vp.Weight = n.Depth vp.Attributes = vertexAttrs }) - if err != nil { - if errors.Is(err, graph.ErrVertexAlreadyExists) { - return nil - } + if err != nil && !errors.Is(err, graph.ErrVertexAlreadyExists) { return errors.Wrap(err, "could not add vertex to graph") } if n.NavigationAction != nil {pkg/engine/headless/browser/browser.go (1)
387-468: Goroutine in EachEvent may outlive BrowserPage lifecycle.The goroutine started at line 387 with
go b.EachEvent(...)()continues running even after theBrowserPageis returned to the pool or closed. This could lead to memory leaks and unexpected behavior if callbacks reference stale state.Consider storing the stop function returned by
EachEventand calling it inCloseBrowserPage().
🧹 Nitpick comments (7)
pkg/engine/parser/parser.go (1)
337-340: Optional cleanup: redundant data: URI check.The manual
data:check here is now redundant since the centralizedisValidNavigationRequestfilter (line 82) will catch these URLs. However, given the TODO comment suggests future data URI parsing functionality, you may want to keep this location as a placeholder for that feature.🔎 Optional: remove redundant check
If data URI parsing is not planned soon, consider removing the redundant check:
src, ok := item.Attr("src") if ok && src != "" && src != "#" { - if strings.HasPrefix(src, "data:") { - // TODO: Add data:uri/data:image parsing - return - } navigationRequests = append(navigationRequests, navigation.NewNavigationRequestURLFromResponse(src, resp.Resp.Request.URL.String(), "img", "src", resp)) }pkg/engine/headless/js/page-init.js (3)
135-137: Document why BODY tag is excluded from event listener tracking.The special case for
tagName == "BODY"skips tracking without explanation. If this is intentional (e.g., to reduce noise from common document-level events), add a comment explaining the rationale.
113-113: Consider making speedUpFactor configurable.The hardcoded
speedUpFactor = 0.1could be made configurable viawindow.__katanaHooksOptions.speedUpFactorto allow runtime tuning of delay acceleration.🔎 Suggested enhancement
- const speedUpFactor = 0.1; // For example, 10 times faster + const speedUpFactor = window.__katanaHooksOptions?.speedUpFactor ?? 0.1; // Default: 10 times faster
49-54: Wrap hashchange listener in try-catch for consistency.Other hook registrations are wrapped in try-catch blocks (lines 165-167), but this
addEventListenercall is not. For consistency and defensive coding, wrap it.🔎 Proposed enhancement
// Add event listener for hashchange - window.addEventListener("hashchange", function () { - window.__navigatedLinks.push({ - url: document.location.href, - source: "hashchange", - }); - }); + try { + window.addEventListener("hashchange", function () { + window.__navigatedLinks.push({ + url: document.location.href, + source: "hashchange", + }); + }); + } catch (_) {}pkg/engine/headless/browser/element.go (1)
214-250: Improve error handling consistency in GetEventListeners.At line 219-221, the error from evaluating
window.__eventListenersis silently swallowed, but at lines 232-237, errors fromgetAllElementsWithEventListeners()are returned. This inconsistency could mask issues with the first evaluation.Consider logging when the first evaluation fails, or unifying the error handling approach.
pkg/engine/headless/crawler/crawler.go (1)
356-375: Screenshot error should not block diagnostics logging.At lines 363-371, if
Screenshotfails, the error is logged butscreenshotStatecould be nil or empty. ThenLogPageStateScreenshotis called with potentially invalid data. Consider skipping the screenshot log on error.🔎 Proposed improvement
if c.diagnostics != nil { screenshotState, err := page.Screenshot(false, &proto.PageCaptureScreenshot{ Format: proto.PageCaptureScreenshotFormatPng, }) if err != nil { c.logger.Error("Failed to take screenshot", slog.String("error", err.Error())) - } - if err := c.diagnostics.LogPageStateScreenshot(pageState.UniqueID, screenshotState); err != nil { - c.logger.Error("Failed to log page state screenshot", slog.String("error", err.Error())) + } else { + if err := c.diagnostics.LogPageStateScreenshot(pageState.UniqueID, screenshotState); err != nil { + c.logger.Error("Failed to log page state screenshot", slog.String("error", err.Error())) + } } if err := c.diagnostics.LogNavigations(pageState.UniqueID, navigations); err != nil { c.logger.Error("Failed to log navigations", slog.String("error", err.Error())) } }pkg/engine/headless/types/types.go (1)
163-189: Consider using structured logging for diagnostics.The diagnostic output at lines 184-186 uses
fmt.Fprintf(os.Stderr, ...)which bypasses any structured logging infrastructure. While this is an improvement over stdout, consider accepting an optional logger or using a package-level diagnostic logger for better integration with the logging subsystem.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
pkg/engine/headless/browser/browser.gopkg/engine/headless/browser/element.gopkg/engine/headless/crawler/crawler.gopkg/engine/headless/crawler/normalizer/dom_utils.gopkg/engine/headless/debugger.gopkg/engine/headless/graph/graph.gopkg/engine/headless/headless.gopkg/engine/headless/js/page-init.jspkg/engine/headless/js/utils.jspkg/engine/headless/types/types.gopkg/engine/parser/parser.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/engine/headless/debugger.go
- pkg/engine/headless/js/utils.js
🧰 Additional context used
🧬 Code graph analysis (4)
pkg/engine/headless/headless.go (4)
pkg/types/crawler_options.go (1)
CrawlerOptions(23-47)pkg/engine/headless/browser/browser.go (1)
ScopeValidator(60-60)pkg/output/result.go (1)
Result(10-15)pkg/navigation/request.go (1)
Request(12-25)
pkg/engine/headless/graph/graph.go (1)
pkg/engine/headless/types/types.go (2)
PageState(19-32)Action(35-43)
pkg/engine/headless/browser/browser.go (5)
pkg/output/result.go (1)
Result(10-15)pkg/navigation/request.go (1)
Request(12-25)pkg/navigation/response.go (1)
Headers(11-11)pkg/engine/headless/browser/cookie/cookie.go (1)
ShouldBlockRequest(48-56)pkg/utils/utils.go (1)
FlattenHeaders(72-78)
pkg/engine/headless/browser/element.go (1)
pkg/engine/headless/types/types.go (6)
HTMLElement(131-144)Action(35-43)ActionTypeLeftClick(74-74)ActionFromEventListener(94-128)HTMLForm(192-204)EventListener(265-269)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: release-test-mac
- GitHub Check: release-test-linux
- GitHub Check: release-test-windows
- GitHub Check: Analyze (go)
- GitHub Check: Functional Test (macOS-latest)
- GitHub Check: Functional Test (windows-latest)
- GitHub Check: Functional Test (ubuntu-latest)
🔇 Additional comments (14)
pkg/engine/parser/parser.go (1)
82-115: Excellent security hardening - all previous concerns addressed.The centralized filtering implementation correctly addresses the earlier feedback:
- Whitespace trimming prevents bypass attacks
- Case-insensitive comparison blocks "JavaScript:", "DATA:", etc.
- All dangerous schemes (data:, mailto:, javascript:, vbscript:) are now blocked
The logic is sound and consistently applied across all parser types.
pkg/engine/headless/crawler/normalizer/dom_utils.go (1)
1-114: LGTM - Core normalization pipeline is well-structured.The two-phase transformation approach (selector-based removals followed by root-level selection transformations) is clean. The fix to apply selection-based transformations once at the root (lines 106-108) addresses the previous O(N²) concern. The nil guard in
removeCommentsDomTransformationFuncis also properly implemented.pkg/engine/headless/headless.go (2)
27-44: LGTM - Constructor properly initializes components.The
Newfunction correctly initializes the logger, deduplicator, and conditionally enables the debugger. The error handling path is clean.
90-165: Crawl method is well-structured with proper resource management.The scope validation, crawler options configuration, nil guards, error logging, and deferred cleanup (
defer headlessCrawler.Close()) are all properly implemented. The previous concerns about nil checks and error handling have been addressed.pkg/engine/headless/browser/element.go (1)
59-153: LGTM - Navigation discovery logic is comprehensive.The flow correctly handles forms, buttons, links, and event listeners with proper deduplication. The scheme validation for links (lines 131-137) addresses the previous concern about non-http(s) URLs. The disabled element check is thorough.
pkg/engine/headless/graph/graph.go (2)
1-44: LGTM - Graph structure and vertex enumeration are well-designed.The
CrawlGraphtype properly wraps the directed graph with appropriate traits (directed, rooted, weighted). ThenavigationHasherFuncprovides stable vertex identification viaUniqueID.
113-141: LGTM - ShortestPath and DrawGraph implementations.The
ShortestPathmethod now correctly pre-allocates the slice (line 118) and filters out bothabout:blankURLs and nil navigation actions (line 125). TheDrawGraphmethod properly handles file creation and cleanup.pkg/engine/headless/crawler/crawler.go (3)
69-77: LGTM - Graceful initialization error handling.The
init()function now captures initialization errors ininitErrorinstead of panicking, andNew()checks this at line 80-82. This allows proper error propagation to callers.
242-285: LGTM - Error handling and failure counting logic is correct.The code properly uses
errors.Asfor Rod error types (lines 249-274), handles recoverable errors withcontinue, and only counts failures for non-recoverable errors. TheconsecutiveFailuresis correctly reset on success (line 284).
417-476: LGTM - Action execution handles multiple action types correctly.The
executeCrawlStateActionmethod properly handlesLoadURL,FillForm, and click actions with appropriate timeouts, visibility checks, and interactability validation. TheErrElementNotVisiblesentinel provides clear error signaling.pkg/engine/headless/browser/browser.go (2)
266-352: LGTM - Resource cleanup in createBrowserPageFunc is robust.The deferred cleanup with
shouldCleanupflag (lines 269-276, 349-350) ensures the temp directory is cleaned up on any failure path. Thestrconv.Atoierrors are now properly handled (lines 285-292), and all error paths properly close browser/page resources before returning.
138-145: LGTM - Launcher Close properly cleans up resources.The
Closemethod cancels contexts, closes browser pages viaCleanup, and closes the pool channel to prevent goroutine leaks from the pool.pkg/engine/headless/types/types.go (2)
45-53: LGTM - Action.Hash fallback is properly implemented.The
Hash()method now returns a stable key composed ofType|Input|OriginIDwhen neitherElementnorFormis present, preventing hash collisions for distinct actions.
146-161: LGTM - String() method no longer mutates state.The fix uses a local
trimmedContentvariable (line 156) instead of modifyinge.TextContentin place, making the method properly read-only.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
pkg/engine/headless/js/page-init.js (1)
139-159: Event listener hook still captures PII, doesn't handle object listeners, and allows unbounded growth.The past review flagged these issues, but they remain unaddressed:
- Line 152 captures
this.valuewithout filtering password/sensitive fields- Line 155 calls
listener.toString()which will fail for object listeners withhandleEventmethods- Line 159 pushes to
window.__eventListenerswithout any size limit🔎 Proposed fix
} let item = { element: { tagName: this.tagName, id: this.id, classes: this.className, outerHTML: this.outerHTML.slice(0, 100), // Capture a snippet of the element's outerHTML xpath: window.getXPath(this), cssSelector: window.getCssPath(this), attributes: window.getElementAttributes(this), - textContent: this.textContent.trim(), + textContent: (this.textContent || "").trim().slice(0, 100), hidden: this.hidden, name: this.name, type: this.type, - value: this.value, + value: (this.type === "password" || /pass|secret|token/i.test(this.name || "")) ? undefined : (this.value || "").slice(0, 64), }, type: type, - listener: listener.toString(), + listener: (typeof listener === "function") ? listener.toString().slice(0, 500) : (listener && typeof listener.handleEvent === "function" ? "[object Listener]" : String(listener).slice(0, 500)), options: options || {}, }; console.log("[hook] got event listener", item); + if (window.__eventListeners.length > 1000) { + window.__eventListeners.shift(); + } window.__eventListeners.push(item);pkg/engine/headless/browser/browser.go (1)
387-470: Goroutine started by EachEvent cannot be stopped, causing resource leak.The
go b.EachEvent(...)()call at line 387 starts a goroutine that continues running after the BrowserPage is closed or returned to the pool. The return value (stop function) fromEachEvent()is not captured.Store the stop function and call it during cleanup:
🔎 Proposed fix
Add a
stopEventsfield toBrowserPage:type BrowserPage struct { *rod.Page Browser *rod.Browser cancel context.CancelFunc userDataDir string + stopEvents func() launcher *Launcher }Capture and store the stop function in
handlePageDialogBoxes:- go b.EachEvent( + b.stopEvents = b.EachEvent( func(e *proto.PageJavascriptDialogOpening) { _ = proto.PageHandleJavaScriptDialog{ Accept: true, PromptText: xid.New().String(), }.Call(b.Page) }, func(e *proto.FetchRequestPaused) { // ... existing handler code ... }, - )() + )Then invoke it during cleanup in
CloseBrowserPage:func (b *BrowserPage) CloseBrowserPage() { + if b.stopEvents != nil { + b.stopEvents() + } _ = b.Close() _ = b.Browser.Close() if b.userDataDir != "" { _ = os.RemoveAll(b.userDataDir) } }
🧹 Nitpick comments (2)
pkg/engine/headless/types/types.go (1)
186-188: Consider using structured logging instead of fmt for diagnostics.While the diagnostic output now goes to stderr (better than stdout), it still uses
fmt.Fprintfrather than a structured logger. Since the project already useslog/slogelsewhere, consider using it here for consistency.🔎 Proposed fix
hashInput := strings.Join(parts, "|") if IsDiagnosticEnabled { - fmt.Fprintf(os.Stderr, "[diagnostic] Element hash input: %s\n", hashInput) + slog.Debug("Element hash input", "input", hashInput) } hasher.Write([]byte(hashInput))Apply the same pattern at line 234-236 for form hashing.
pkg/engine/headless/browser/element.go (1)
69-76: Remove misleading TODO comment - deduplication logic is correct.The TODO comment at lines 73-74 suggests the deduplication logic is incomplete, but the current implementation is actually correct:
- Form buttons are marked in the
uniquemap (line 75)- Later, when processing standalone buttons (lines 101-103), they're skipped if already in
uniqueThis prevents duplicate actions for form submit buttons.
🔎 Proposed fix
for _, form := range forms { for _, element := range form.Elements { if element.TagName != "BUTTON" { continue } - // TODO: Check if this button is already in the unique map - // and if so remove it + // Mark form buttons as seen to avoid duplicate actions unique[element.Hash()] = struct{}{} }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
pkg/engine/headless/browser/browser.gopkg/engine/headless/browser/element.gopkg/engine/headless/js/page-init.jspkg/engine/headless/types/types.go
🧰 Additional context used
🧬 Code graph analysis (3)
pkg/engine/headless/types/types.go (2)
pkg/engine/headless/browser/cookie/cookie.go (1)
Action(20-22)pkg/navigation/response.go (1)
Form(13-18)
pkg/engine/headless/js/page-init.js (2)
pkg/navigation/request.go (1)
Request(12-25)pkg/engine/headless/js/utils.js (1)
listener(77-77)
pkg/engine/headless/browser/browser.go (5)
pkg/engine/headless/js/js.go (1)
InitJavascriptEnv(19-27)pkg/navigation/request.go (1)
Request(12-25)pkg/navigation/response.go (1)
Headers(11-11)pkg/engine/headless/browser/cookie/cookie.go (1)
ShouldBlockRequest(48-56)pkg/utils/utils.go (1)
FlattenHeaders(72-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Test Builds (ubuntu-latest, 1.24.x)
- GitHub Check: Test Builds (macOS-latest, 1.24.x)
- GitHub Check: Test Builds (windows-latest, 1.24.x)
- GitHub Check: Analyze (go)
- GitHub Check: release-test-linux
- GitHub Check: Functional Test (macOS-latest)
- GitHub Check: Functional Test (windows-latest)
- GitHub Check: Functional Test (ubuntu-latest)
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/engine/headless/crawler/normalizer/dom_utils.go (1)
145-152: Overbroad "js" attribute removal still present.Line 148 uses
strings.HasPrefix(attr.Key, "js")which will incorrectly match unrelated attributes likejson,justify,jsonp, etc. This issue was previously flagged but not addressed.🔎 Proposed fix
- if strings.HasPrefix(attr.Key, "data-") || strings.HasPrefix(attr.Key, "aria-") || strings.HasPrefix(attr.Key, "js") { + if strings.HasPrefix(attr.Key, "data-") || + strings.HasPrefix(attr.Key, "aria-") || + attr.Key == "js" || strings.HasPrefix(attr.Key, "js-") { s.RemoveAttr(attr.Key) }
🧹 Nitpick comments (2)
pkg/engine/headless/crawler/normalizer/dom_utils.go (2)
40-41: Optional: Loop variable capture unnecessary in Go 1.22+.Lines 41 and 50 use
t := tto capture the loop variable for closures. This pattern was necessary in Go < 1.22 but is now redundant, as loop variables are per-iteration by default in Go 1.22+.Consider removing these lines if your minimum Go version is 1.22 or higher.
Also applies to: 49-50
96-104: Return early after removing comment node to avoid unnecessary iteration.After removing a comment node on line 98, the function proceeds to iterate its children on lines 101-103. HTML comment nodes don't have children, so this iteration is unnecessary.
🔎 Proposed fix
func removeComments(n *html.Node) { if n.Type == html.CommentNode { n.Parent.RemoveChild(n) + return } for c := n.FirstChild; c != nil; c = c.NextSibling { removeComments(c) } }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pkg/engine/headless/crawler/normalizer/dom_utils.gopkg/engine/headless/crawler/normalizer/normalizer.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/engine/headless/crawler/normalizer/normalizer.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Test Builds (macOS-latest, 1.24.x)
- GitHub Check: Test Builds (windows-latest, 1.24.x)
- GitHub Check: Test Builds (ubuntu-latest, 1.24.x)
- GitHub Check: release-test-windows
- GitHub Check: Analyze (go)
- GitHub Check: Functional Test (windows-latest)
- GitHub Check: Functional Test (macOS-latest)
- GitHub Check: Functional Test (ubuntu-latest)
🔇 Additional comments (2)
pkg/engine/headless/crawler/normalizer/dom_utils.go (2)
10-20: Verify that broad meta and SVG removal aligns with normalization goals.Lines 14 and 16 remove all
meta[content]andsvgelements respectively. This is quite aggressive:
meta[content]will remove useful semantic meta tags (description, keywords, Open Graph tags, etc.)svgremoves all vector graphics, which may contain meaningful content or navigation elementsIf the goal is purely DOM structure fingerprinting for deduplication, this may be acceptable. However, verify these removals don't negatively impact page semantics or cause excessive false-positive deduplication.
106-110: LGTM! Nil guard correctly implemented.The nil guard on line 107 correctly addresses the previous review concern about potential nil dereference.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/engine/headless/crawler/crawler.go (1)
325-326: Scope restriction TODO identified.This FIXME indicates incomplete scope validation before proceeding with actions. The current scope check only happens after action execution (lines 349-359), meaning out-of-scope navigations may be executed before being filtered.
Would you like me to help implement pre-action scope validation, or should I open an issue to track this enhancement?
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/engine/headless/crawler/crawler.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/engine/headless/crawler/crawler.go (4)
pkg/engine/headless/types/types.go (7)
Action(35-43)ActionTypeLoadURL(72-72)PageState(19-32)ActionTypeFillForm(88-88)ActionTypeLeftClick(74-74)ActionTypeLeftClickDown(75-75)HTMLElement(133-146)pkg/engine/headless/graph/graph.go (2)
CrawlGraph(15-17)NewCrawlGraph(24-32)pkg/engine/headless/crawler/diagnostics/diagnostics.go (2)
Writer(16-22)NewWriter(55-66)pkg/engine/headless/crawler/state.go (1)
ErrNoNavigationPossible(118-118)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Test Builds (ubuntu-latest, 1.24.x)
- GitHub Check: Test Builds (macOS-latest, 1.24.x)
- GitHub Check: Test Builds (windows-latest, 1.24.x)
- GitHub Check: Functional Test (windows-latest)
- GitHub Check: Functional Test (ubuntu-latest)
- GitHub Check: Functional Test (macOS-latest)
- GitHub Check: release-test-linux
- GitHub Check: Analyze (go)
🔇 Additional comments (5)
pkg/engine/headless/crawler/crawler.go (5)
65-77: LGTM! Graceful initialization error handling.The use of
initErrorinstead of panicking addresses the prior review feedback and allows the caller to handle initialization failures properly.
133-144: LGTM! Proper resource cleanup with error logging.The
Close()method now logs diagnostics close errors as requested in the prior review, ensuring failures are visible without crashing the cleanup flow.
242-290: LGTM! Proper error type matching witherrors.As.The error handling now correctly uses
errors.Asfor Rod's pointer-typed errors (NoPointerEventsError,InvisibleShapeError,NavigationError,MaxSleepCountError) and reserveserrors.Isfor sentinel errors. The failure counter reset on success (line 289) is correctly placed.
422-481: LGTM! Robust action execution with visibility checks.The action execution properly validates element visibility and interactability before clicking, handles
CoveredErrorcorrectly witherrors.As, and applies appropriate timeouts to Rod operations.
483-488: LGTM! Logout detection regex corrected.The
logoutPatternregex now correctly includesausloggen(German for logout), addressing the prior review feedback about the typo.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/engine/headless/browser/browser.go (1)
391-474: Goroutine leak: event listener not stopped when page is closed.The goroutine started by
b.EachEvent(...)()at line 391 continues running after the page is closed or returned to the pool.EachEventreturns a stop function that must be saved and invoked during cleanup.This issue was flagged in a previous review but remains unaddressed. Each browser page creation leaks one goroutine, which will accumulate over time and consume memory.
🔎 Proposed fix
Update the
BrowserPagestruct to store the stop function:type BrowserPage struct { *rod.Page Browser *rod.Browser cancel context.CancelFunc userDataDir string + stopEvents func() launcher *Launcher }Save and invoke the stop function:
func (b *BrowserPage) handlePageDialogBoxes() error { err := proto.FetchEnable{ Patterns: []*proto.FetchRequestPattern{ { URLPattern: "*", RequestStage: proto.FetchRequestStageResponse, }, }, }.Call(b.Page) if err != nil { return errors.Wrap(err, "could not enable fetch domain") } - go b.EachEvent( + b.stopEvents = b.EachEvent( func(e *proto.PageJavascriptDialogOpening) { // ... handlers ... }, func(e *proto.FetchRequestPaused) { // ... handlers ... }, - )() + ) return nil }Call the stop function in cleanup:
func (b *BrowserPage) CloseBrowserPage() { + if b.stopEvents != nil { + b.stopEvents() + } _ = b.Close() _ = b.Browser.Close() if b.userDataDir != "" { _ = os.RemoveAll(b.userDataDir) } }Based on past review comments indicating this pattern is required for rod event listeners.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/engine/headless/browser/browser.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Test Builds (windows-latest, 1.24.x)
- GitHub Check: Test Builds (ubuntu-latest, 1.24.x)
- GitHub Check: Test Builds (macOS-latest, 1.24.x)
- GitHub Check: Analyze (go)
- GitHub Check: release-test-linux
- GitHub Check: Functional Test (ubuntu-latest)
- GitHub Check: Functional Test (macOS-latest)
- GitHub Check: Functional Test (windows-latest)
* feat: added new full headless mode to katana (wip) * feat: use requests body + close browser & page * feat: misc additions to scope * feat: pass logger from outside + do not use global * added user to optioanl flags * misc * misc changes * feat: multiple bug fixes + edge cases handlings * feat: more additions + code, diagnostics, cookie-popups and stealth * feat: added additional simhash + misc additions and fixes * feat: fixed panic + parser not working for katana headless * feat: handle singleton lock issue with multiple instances * feat: fixed singleton error with multiple processes in linux * pprof misc fix * feat: added timeout + fixing stuck on large lists * misc additions + debug * misc * misc * misc * feat: addressed review comments + better form fill + misc * feat: more code review comments * fix out of scope parsing * misc review / lint fixes etc * fix review comments * tests related fixes + added back text stripping * misc * logging changes * return page --------- Co-authored-by: Ice3man <nizamulrana@gmail.com>
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.