security(skills): audit codebase for unhardened XML parsers beyond PowerPoint skill #1056

@WilliamBerryiii

Summary

PR #1053 hardens XML parsing in the PowerPoint skill's extract_content.py by adding resolve_entities=False and no_network=True to lxml XMLParser instances, fixing the XXE vector reported in #1014. However, this fix is scoped to a single skill. Other skills and scripts in the codebase may contain etree.parse, etree.fromstring, etree.XMLParser, or similar XML parsing calls that lack these defenses.

Root Cause

XML External Entity (XXE) injection is a class of vulnerability where an XML parser processes external entity references, potentially allowing:

  • File disclosure — reading local files via file:// entities
  • Server-Side Request Forgery (SSRF) — making outbound HTTP requests via entity resolution
  • Denial of Service — entity expansion attacks (billion laughs)

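In lxml releases before 5.0, lxml.etree resolves external entities by default (resolve_entities=True); lxml 5.0 changed the default to 'internal', which expands only internally defined entities. The fix in PR #1053 sets resolve_entities=False and no_network=True explicitly on XMLParser instances, so the behavior does not depend on the installed lxml version, but this pattern was applied only to files touched by that PR. Any other XML parsing call site in the codebase that processes untrusted or semi-trusted input without these flags remains vulnerable.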
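
The hardened-parser pattern can be sketched roughly as follows, assuming lxml is installed. The helper names make_hardened_parser, parse_untrusted, and demo are hypothetical and are not taken from PR #1053; only the two XMLParser flags come from the fix described above.

```python
import tempfile
from pathlib import Path

from lxml import etree


def make_hardened_parser() -> etree.XMLParser:
    """Build a parser with the two flags named in this issue."""
    return etree.XMLParser(resolve_entities=False, no_network=True)


def parse_untrusted(data: bytes):
    """Parse XML bytes with a fresh hardened parser (hypothetical helper)."""
    return etree.fromstring(data, parser=make_hardened_parser())


def demo() -> bytes:
    """Feed a file:// entity payload through the hardened parser."""
    with tempfile.TemporaryDirectory() as tmp:
        secret = Path(tmp) / "secret.txt"
        secret.write_text("TOP-SECRET")
        payload = (
            f'<!DOCTYPE r [<!ENTITY xxe SYSTEM "{secret.as_uri()}">]>'
            "<r>&xxe;</r>"
        ).encode()
        root = parse_untrusted(payload)
        # The entity reference is left unexpanded, so the file content
        # should not appear in the serialized tree.
        return etree.tostring(root)


if __name__ == "__main__":
    print(demo())
```

A fresh parser is created per call here for simplicity; whether a shared module-level parser is preferable depends on how each skill uses it.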

What Needs to Be Addressed

  1. Comprehensive audit — Every Python file in the repository that imports lxml.etree, xml.etree.ElementTree, xml.sax, xml.dom.minidom, or defusedxml needs review
  2. Call site inventory — Document each etree.parse(), etree.fromstring(), etree.XMLParser(), ElementTree.parse(), and xml.sax.parse() call with its file path, line number, and current parser configuration
  3. Risk assessment — For each call site, determine whether the input can contain untrusted data and whether entity resolution is needed
  4. Remediation — Apply resolve_entities=False and no_network=True (for lxml) or switch to defusedxml (for stdlib xml.*) where untrusted input is possible
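
The call site inventory in item 2 could be gathered with a small script along these lines. This is a sketch, not the project's tooling: XML_MODULES, XML_CALLS, find_call_sites, and inventory are hypothetical names, and the set of call names is an assumption that should be extended to match what the audit turns up. It matches on call names only, so it can report false positives (for example, an unrelated .parse() call in a file that also imports an XML module).

```python
import ast
from pathlib import Path

# Assumed module prefixes and call names worth flagging; adjust as needed.
XML_MODULES = ("lxml", "xml.etree", "xml.sax", "xml.dom", "defusedxml")
XML_CALLS = {"parse", "fromstring", "XMLParser", "parseString", "iterparse", "XML"}


def imports_xml(tree: ast.AST) -> bool:
    """Return True if the module imports any XML library of interest."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            mod = node.module or ""
            names = [mod] + [f"{mod}.{alias.name}" for alias in node.names]
        else:
            continue
        for name in names:
            if any(name == m or name.startswith(m + ".") for m in XML_MODULES):
                return True
    return False


def find_call_sites(source: str, filename: str = "<string>"):
    """Return (filename, line, call expression) for candidate XML parse calls."""
    tree = ast.parse(source, filename=filename)
    if not imports_xml(tree):
        return []
    sites = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name in XML_CALLS:
            sites.append((filename, node.lineno, ast.unparse(func)))
    return sorted(sites, key=lambda site: site[1])


def inventory(roots):
    """Walk the given directories and collect call sites from every .py file."""
    results = []
    for root in roots:
        for path in sorted(Path(root).rglob("*.py")):
            try:
                text = path.read_text(encoding="utf-8")
                results.extend(find_call_sites(text, str(path)))
            except (SyntaxError, UnicodeDecodeError):
                continue  # skip files that cannot be read or parsed
    return results


if __name__ == "__main__":
    for filename, line, call in inventory([".github/skills", "scripts"]):
        print(f"{filename}:{line}: {call}")
```

The output still needs the manual columns from item 2 and item 3 (parser configuration and input trust level); the script only locates candidates.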

How to Address

  1. Grep the codebase for XML parsing patterns:

    grep -rn "etree\.parse\|etree\.fromstring\|etree\.XMLParser\|ElementTree\.parse\|xml\.sax\|minidom\.parse" .github/skills/ scripts/

  2. For each call site, check:

    • Does the parser instance set resolve_entities=False?
    • Does the parser instance set no_network=True?
    • Is input from an external/untrusted source?
    • For stdlib xml.etree.ElementTree: is defusedxml used instead?
  3. Apply fixes per parser library:

    • lxml: Create XMLParser(resolve_entities=False, no_network=True) and pass to parse/fromstring calls
    • stdlib xml.etree: Replace with defusedxml.ElementTree or add defusedxml.defuse_stdlib() at module level
    • xml.sax: Use defusedxml.sax.parse() instead
  4. Validate — Run npm run test:py to confirm no regressions; run npm run lint:py for code quality
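
The first two checks in step 2 can be partly automated for lxml. The sketch below flags XMLParser(...) calls that do not pass both flags as literal keyword arguments; REQUIRED and unhardened_parsers are hypothetical names. It is deliberately conservative: flags supplied through **kwargs or a variable are reported as missing and need manual review, and it does not verify that the hardened parser is actually passed to parse/fromstring.

```python
import ast

# Flags this issue asks for on every lxml XMLParser instance.
REQUIRED = {"resolve_entities": False, "no_network": True}


def unhardened_parsers(source: str):
    """Return (line, missing_flags) for each XMLParser call lacking a flag."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name != "XMLParser":
            continue
        # Only literal keyword values can be checked statically.
        given = {
            kw.arg: kw.value.value
            for kw in node.keywords
            if kw.arg is not None and isinstance(kw.value, ast.Constant)
        }
        missing = [flag for flag, want in REQUIRED.items() if given.get(flag) is not want]
        if missing:
            findings.append((node.lineno, missing))
    return sorted(findings)


if __name__ == "__main__":
    sample = (
        "from lxml import etree\n"
        "a = etree.XMLParser()\n"
        "b = etree.XMLParser(resolve_entities=False, no_network=True)\n"
    )
    for line, missing in unhardened_parsers(sample):
        print(f"line {line}: missing {', '.join(missing)}")
```

A check like this could run alongside npm run lint:py to keep new call sites from regressing, though wiring it into the existing lint setup is left as an assumption.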

Related Issues

Acceptance Criteria

  • Codebase-wide grep completed for all XML parsing patterns across skills and scripts
  • Call site inventory documented with file path, line number, parser config, and input trust level
  • All call sites processing untrusted input have resolve_entities=False and no_network=True (lxml) or use defusedxml (stdlib)
  • No regression in existing tests (npm run test:py)
  • Follow-up issues created for any findings that require larger refactoring

Labels

enhancement, security