Security tools are systematically false-positived by the moderation pipeline #1131

@Justincredible-tech

Description

Problem

ClawHub's moderation pipeline (static scanner + LLM evaluator) treats security tools identically to potentially malicious skills. Security tools -- scanners, audit tools, IOC databases, threat detectors -- legitimately contain patterns that the static scanner flags as suspicious:

  • Shell execution (child_process) -- security scanners run commands to audit configurations, check file integrity, inspect network state
  • Credential access (process.env + network) -- credential exposure scanners check environment variables specifically to detect leaked secrets
  • Encoded payloads (base64, hex sequences) -- IOC databases contain malware signatures and encoded samples as detection references
  • File read + network send -- reporting scan results to a dashboard or output file is core scanner functionality
  • Non-standard ports -- network scanning tools probe ports outside 80/443
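For illustration, a minimal sketch (hypothetical, not taken from any real tool) of the kind of credential-exposure check described above — it inspects environment variables for secret-like names, which is exactly the process.env access pattern the static scanner flags:

```typescript
// Hypothetical credential-exposure scanner core. In a real skill the env
// argument would be process.env -- the flagged pattern; a parameter is used
// here so the logic is testable in isolation.
const SECRET_KEY_PATTERN = /(KEY|TOKEN|SECRET|PASSWORD)/i;

interface Finding {
  name: string;
  reason: string;
}

function scanEnv(env: Record<string, string | undefined>): Finding[] {
  const findings: Finding[] = [];
  for (const [name, value] of Object.entries(env)) {
    // Flag any set variable whose name looks like a credential.
    if (value && SECRET_KEY_PATTERN.test(name)) {
      findings.push({ name, reason: "secret-like variable name" });
    }
  }
  return findings;
}
```

The point is that this code is benign and its env access is intrinsic to its purpose, yet a pattern-only scanner cannot tell it apart from an exfiltrator.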

The result: any genuinely useful security tool will trigger multiple suspicious.* reason codes and get flagged or hidden, regardless of whether its behaviour is coherent with its stated purpose. There is currently no mechanism for the moderation pipeline to distinguish between a weather app that harvests credentials (incoherent, actually suspicious) and a security scanner that checks for exposed credentials (coherent, expected behaviour).

The LLM evaluator has no concept of a "security tool" category, so it evaluates these patterns without the context that would make them coherent.

This is not a theoretical concern -- we encountered this directly when publishing agent security scanning tools.

Impact

  • Blocks an entire category of useful skills from the ecosystem
  • Forces security tool authors to remove the very patterns that make their tools effective, or to never publish at all
  • As the OpenClaw ecosystem grows, the demand for security tooling will increase (the ClawHavoc incidents demonstrated this clearly)

Proposed solution

A skill category system where publishers declare their skill's purpose via frontmatter metadata:

metadata:
  openclaw:
    category: security

When declared:

  1. The static scanner contextualises expected patterns (re-prefixes suspicious.* to info.security_context.*) so they don't trigger false-positive verdicts
  2. The LLM evaluator receives the category and assesses coherence through the appropriate lens
  3. Malicious reason codes (crypto_mining, install_terminal_payload, known_blocked_signature) are never suppressed, regardless of category
  4. Declaring a false category is treated by the LLM evaluator as misdirection evidence, which makes the skill more suspicious, not less
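A minimal sketch of how steps 1 and 3 could compose, assuming hypothetical reason-code names (the specific suspicious.* and malicious.* strings below are placeholders, not ClawHub's actual codes):

```typescript
// Reason codes that must survive contextualisation no matter what category
// the publisher declares (step 3). Names are illustrative placeholders.
const NEVER_SUPPRESSED = new Set([
  "malicious.crypto_mining",
  "malicious.install_terminal_payload",
  "malicious.known_blocked_signature",
]);

function contextualize(codes: string[], category?: string): string[] {
  // Without a declared security category, codes pass through unchanged.
  if (category !== "security") return codes;
  return codes.map((code) => {
    // Step 3: malicious codes are never downgraded.
    if (NEVER_SUPPRESSED.has(code)) return code;
    // Step 1: re-prefix expected patterns so they inform the verdict
    // instead of triggering a false positive.
    if (code.startsWith("suspicious.")) {
      return "info.security_context." + code.slice("suspicious.".length);
    }
    return code;
  });
}
```

A falsely declared category would still leave the malicious.* codes intact, and per step 4 the declaration itself becomes evidence against the skill.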

The category system is extensible for future use cases beyond security.

We have a PR ready: Caelguard/clawhub#feat/security-tool-category -- 26 new tests, all 846 existing tests pass, zero lint issues.
