-
Notifications
You must be signed in to change notification settings - Fork 132
security(skills): audit codebase for unhardened XML parsers beyond PowerPoint skill #1056
Description
Summary
PR #1053 hardens XML parsing in the PowerPoint skill's extract_content.py by adding resolve_entities=False and no_network=True to lxml XMLParser instances, fixing the XXE vector reported in #1014. However, this fix is scoped to a single skill. Other skills and scripts in the codebase may contain etree.parse, etree.fromstring, etree.XMLParser, or similar XML parsing calls that lack these defenses.
Root Cause
XML External Entity (XXE) injection is a class of vulnerability where an XML parser processes external entity references, potentially allowing:
- File disclosure — reading local files via
file://entities - Server-Side Request Forgery (SSRF) — making outbound HTTP requests via entity resolution
- Denial of Service — entity expansion attacks (billion laughs)
Python's lxml.etree resolves external entities by default. The fix in PR #1053 sets resolve_entities=False and no_network=True on XMLParser instances, but this pattern was applied only to files touched by that PR. Any other XML parsing call site in the codebase that processes untrusted or semi-trusted input without these flags remains vulnerable.
What Needs to Be Addressed
- Comprehensive audit — Every Python file in the repository that imports
lxml.etree,xml.etree.ElementTree,xml.sax,xml.dom.minidom, ordefusedxmlneeds review - Call site inventory — Document each
etree.parse(),etree.fromstring(),etree.XMLParser(),ElementTree.parse(), andxml.sax.parse()call with its file path, line number, and current parser configuration - Risk assessment — For each call site, determine whether the input can contain untrusted data and whether entity resolution is needed
- Remediation — Apply
resolve_entities=Falseandno_network=True(for lxml) or switch todefusedxml(for stdlibxml.*) where untrusted input is possible
How to Address
-
Grep the codebase for XML parsing patterns:
grep -rn "etree\.parse\|etree\.fromstring\|etree\.XMLParser\|ElementTree\.parse\|xml\.sax\|minidom\.parse" .github/skills/ scripts/ -
For each call site, check:
- Does the parser instance set
resolve_entities=False? - Does the parser instance set
no_network=True? - Is input from an external/untrusted source?
- For stdlib
xml.etree.ElementTree: isdefusedxmlused instead?
- Does the parser instance set
-
Apply fixes per parser library:
- lxml: Create
XMLParser(resolve_entities=False, no_network=True)and pass to parse/fromstring calls - stdlib xml.etree: Replace with
defusedxml.ElementTreeor adddefusedxml.defuse_stdlib()at module level - xml.sax: Use
defusedxml.sax.parse()instead
- lxml: Create
-
Validate — Run
npm run test:pyto confirm no regressions; runnpm run lint:pyfor code quality
Related Issues
- security(powerpoint): lxml XXE vector in extract_content.py theme color resolution #1014 (open, v3.2.0) — lxml XXE vector in
extract_content.py_resolve_theme_colors()(PowerPoint skill only, PR fix(skills): harden XML parsing and blob writes in powerpoint extract #1053 in-flight) - security(powerpoint): untrusted blob writes in extract_content.py image extraction #1016 (open, v3.2.0) — Related blob write safety issue (also addressed by PR fix(skills): harden XML parsing and blob writes in powerpoint extract #1053)
- epic: Python Security Testing & Fuzzing Initiative for PowerPoint Skill #1012 (open, v3.2.0) — Python Security Testing & Fuzzing Initiative (parent epic, security findings section)
Acceptance Criteria
- Codebase-wide grep completed for all XML parsing patterns across skills and scripts
- Call site inventory documented with file path, line number, parser config, and input trust level
- All call sites processing untrusted input have
resolve_entities=Falseandno_network=True(lxml) or usedefusedxml(stdlib) - No regression in existing tests (
npm run test:py) - Follow-up issues created for any findings that require larger refactoring