security(skills): audit codebase for unhardened XML parsers beyond PowerPoint skill #1056

## Summary

PR #1053 hardens XML parsing in the PowerPoint skill's `extract_content.py` by adding `resolve_entities=False` and `no_network=True` to lxml `XMLParser` instances, fixing the XXE vector reported in #1014. However, this fix is scoped to a single skill. Other skills and scripts in the codebase may contain `etree.parse`, `etree.fromstring`, `etree.XMLParser`, or similar XML parsing calls that lack these defenses.

## Root Cause

XML External Entity (XXE) injection is a class of vulnerability where an XML parser processes external entity references, potentially allowing:

- **File disclosure** — reading local files via `file://` entities
- **Server-Side Request Forgery (SSRF)** — making outbound HTTP requests via entity resolution
- **Denial of Service** — entity expansion attacks (billion laughs)

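Whether `lxml.etree` resolves external entities by default depends on the installed version: releases before 5.0 default to `resolve_entities=True`, while lxml 5.0 and later default to `'internal'`, which resolves only internally defined entities. The fix in PR #1053 sets `resolve_entities=False` and `no_network=True` explicitly on `XMLParser` instances (`no_network=True` is already the lxml default, so stating it is defense in depth), but this pattern was applied only to files touched by that PR. Any other XML parsing call site in the codebase that processes untrusted or semi-trusted input without these flags may remain vulnerable, particularly where an older lxml or a stdlib parser is in use.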
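
A minimal sketch of the hardening pattern described above, assuming lxml is installed. The helper names `hardened_parser`, `parse_untrusted`, and `xxe_payload` are hypothetical and do not come from PR #1053; the payload builder exists only to exercise the parser in a test.

```python
def hardened_parser():
    """Return an lxml parser that leaves entity references unexpanded."""
    # Imported lazily so this module can be loaded where lxml is absent.
    from lxml import etree

    return etree.XMLParser(resolve_entities=False, no_network=True)


def parse_untrusted(data: bytes):
    """Parse XML bytes from an untrusted source with the hardened parser."""
    from lxml import etree

    return etree.fromstring(data, parser=hardened_parser())


def xxe_payload(target_uri: str) -> bytes:
    """Build a test document that references an external entity (hypothetical helper)."""
    return (
        '<?xml version="1.0"?>'
        f'<!DOCTYPE r [<!ENTITY xxe SYSTEM "{target_uri}">]>'
        "<r>&xxe;</r>"
    ).encode()
```

With this parser, the `&xxe;` reference should stay in the tree as an unexpanded entity node, so the content of the referenced file never reaches the parsed output.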

## What Needs to Be Addressed

1. **Comprehensive audit** — Every Python file in the repository that imports `lxml.etree`, `xml.etree.ElementTree`, `xml.sax`, `xml.dom.minidom`, or `defusedxml` needs review
2. **Call site inventory** — Document each `etree.parse()`, `etree.fromstring()`, `etree.XMLParser()`, `ElementTree.parse()`, and `xml.sax.parse()` call with its file path, line number, and current parser configuration
3. **Risk assessment** — For each call site, determine whether the input can contain untrusted data and whether entity resolution is needed
4. **Remediation** — Apply `resolve_entities=False` and `no_network=True` (for lxml) or switch to `defusedxml` (for stdlib `xml.*`) where untrusted input is possible
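
The audit and inventory steps could be partly automated with Python's `ast` module. The sketch below is one possible approach, not an existing script in the repository: `CallSite`, `inventory_source`, `inventory_tree`, and the `PARSE_CALLS` name list are all assumptions made for illustration. It matches on call names only, so it will over-report in files that import an XML library and also call an unrelated `parse()`, and every hit still needs a manual risk assessment.

```python
import ast
from dataclasses import dataclass
from pathlib import Path

# Call names treated as XML parsing entry points (assumed, not exhaustive).
PARSE_CALLS = {"parse", "fromstring", "XMLParser", "parseString", "iterparse", "XML"}
# Import prefixes that mark a file as in scope for the audit.
XML_MODULES = ("lxml", "xml.etree", "xml.sax", "xml.dom", "defusedxml")


@dataclass
class CallSite:
    path: str
    line: int
    call: str
    keywords: dict


def imports_xml(tree: ast.AST) -> bool:
    """Return True if the module imports any XML library of interest."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            names = [f"{module}.{alias.name}" for alias in node.names]
        else:
            continue
        if any(n == m or n.startswith(m + ".") for n in names for m in XML_MODULES):
            return True
    return False


def inventory_source(source: str, path: str = "<string>") -> list[CallSite]:
    """List candidate XML parsing calls in one source string, ordered by line."""
    tree = ast.parse(source, filename=path)
    if not imports_xml(tree):
        return []
    sites = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name in PARSE_CALLS:
            # Record keyword arguments as source text, e.g. resolve_entities -> "False".
            kwargs = {kw.arg: ast.unparse(kw.value) for kw in node.keywords if kw.arg}
            sites.append(CallSite(path, node.lineno, ast.unparse(func), kwargs))
    return sorted(sites, key=lambda site: site.line)


def inventory_tree(root: str) -> list[CallSite]:
    """Walk a directory and collect call sites from every parseable .py file."""
    sites = []
    for file in sorted(Path(root).rglob("*.py")):
        try:
            sites.extend(inventory_source(file.read_text(encoding="utf-8"), str(file)))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be read or parsed
    return sites
```

Running `inventory_tree(".github/skills")` and printing each `CallSite` gives the file path, line number, and keyword arguments per call. A parser call whose `keywords` lack `resolve_entities` or `no_network` is a candidate for closer review.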

## How to Address

1. **Grep the codebase** for XML parsing patterns:
   ```bash
   grep -rnE --include='*.py' 'etree\.(parse|fromstring|iterparse|XMLParser)|ElementTree\.(parse|fromstring)|xml\.sax|minidom\.parse' .github/skills/ scripts/
   ```

2. **For each call site**, check:
   - Does the parser instance set `resolve_entities=False`?
   - Does the parser instance set `no_network=True`?
   - Is input from an external/untrusted source?
   - For stdlib `xml.etree.ElementTree`: is `defusedxml` used instead?

3. **Apply fixes** per parser library:
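   - **lxml**: Create `XMLParser(resolve_entities=False, no_network=True)` and pass it through the `parser` argument of `etree.parse()` and `etree.fromstring()` calls
   - **stdlib xml.etree**: Replace with `defusedxml.ElementTree`. Calling `defusedxml.defuse_stdlib()` at module level is an alternative, but it monkey-patches the stdlib globally and defusedxml marks it as experimental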
   - **xml.sax**: Use `defusedxml.sax.parse()` instead

4. **Validate** — Run `npm run test:py` to confirm no regressions; run `npm run lint:py` for code quality
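
For stdlib call sites, the swap to `defusedxml` can be sketched as follows, assuming the `defusedxml` package is available as a dependency. The function names are hypothetical. With its default settings, `defusedxml` raises an exception derived from `defusedxml.DefusedXmlException` when a document declares entities, rather than silently ignoring them, so callers need to decide how rejected input is handled.

```python
def parse_untrusted_stdlib(data: bytes):
    """Parse XML with defusedxml in place of xml.etree.ElementTree."""
    # Imported lazily so this module can be loaded where defusedxml is absent.
    from defusedxml import ElementTree as SafeET

    return SafeET.fromstring(data)


def is_rejected(data: bytes) -> bool:
    """Return True if defusedxml refuses the document as unsafe."""
    from defusedxml import DefusedXmlException

    try:
        parse_untrusted_stdlib(data)
    except DefusedXmlException:
        return True
    return False
```

The return value of `parse_untrusted_stdlib` is a regular `xml.etree.ElementTree.Element`, so downstream code that only reads the tree should not need changes.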

## Related Issues

- #1014 (open, v3.2.0) — lxml XXE vector in `extract_content.py` `_resolve_theme_colors()` (PowerPoint skill only, PR #1053 in-flight)
- #1016 (open, v3.2.0) — Related blob write safety issue (also addressed by PR #1053)
- #1012 (open, v3.2.0) — Python Security Testing & Fuzzing Initiative (parent epic, security findings section)

## Acceptance Criteria

- [ ] Codebase-wide grep completed for all XML parsing patterns across skills and scripts
- [ ] Call site inventory documented with file path, line number, parser config, and input trust level
- [ ] All call sites processing untrusted input have `resolve_entities=False` and `no_network=True` (lxml) or use `defusedxml` (stdlib)
- [ ] No regression in existing tests (`npm run test:py`)
- [ ] Follow-up issues created for any findings that require larger refactoring
