@bug-ops bug-ops commented Dec 16, 2025

Summary

This PR implements key features to achieve parity with Python feedparser:

  • RSS 1.0 (RDF) Parser - Full support for RSS 1.0 feeds with Dublin Core namespace
  • Base URL Resolution - xml:base support for relative URL resolution (RFC 3986)
  • Enhanced Encoding Detection - HTTP Content-Type charset extraction with proper priority
  • SSRF Protection - Security hardening for URL validation

Changes

New Files

  • parser/rss10.rs - Complete RSS 1.0 parser implementation (623 lines)
  • util/base_url.rs - URL resolution utilities with SSRF protection (531 lines)

Modified Files

  • util/encoding.rs - Added Content-Type charset extraction (+301 lines)
  • parser/mod.rs - RSS 1.0 routing
  • util/mod.rs - New exports

Features

RSS 1.0 Parser

  • RDF structure support (rdf:RDF root, rdf:about attributes)
  • Dublin Core namespace integration (dc:creator, dc:date, etc.)
  • Image element parsing
  • ParserLimits enforcement (entry count, nesting depth, text length)
  • Tolerant parsing with bozo flag for malformed feeds
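The limit enforcement and bozo behaviour above can be sketched roughly as follows. This is a minimal illustration, not the code in parser/rss10.rs: the `ParserLimits` fields, the `Event` enum, the `Outcome` struct, and the `walk` function are all hypothetical names, and the choice to skip extra entries but stop on excessive depth is an assumption. The text-length limit is left out for brevity.

```rust
/// Hypothetical limits; the real `ParserLimits` fields may differ.
#[derive(Debug, Clone, Copy)]
pub struct ParserLimits {
    pub max_entries: usize,
    pub max_nesting_depth: usize,
}

/// Simplified stand-in for XML parser events (assumption, not the real event type).
#[derive(Debug, Clone, Copy)]
pub enum Event {
    Start,
    End,
    Item,
}

/// What a tolerant parse reports: partial data plus a bozo flag.
#[derive(Debug, Default)]
pub struct Outcome {
    pub entries: usize,
    pub bozo: bool,
    pub bozo_reason: Option<String>,
}

/// Walks the events, enforcing limits without returning a hard error.
pub fn walk(events: &[Event], limits: &ParserLimits) -> Outcome {
    let mut out = Outcome::default();
    let mut depth = 0usize;
    for event in events {
        match event {
            Event::Start => {
                depth += 1;
                if depth > limits.max_nesting_depth {
                    // Assumption: stop parsing, keep what was collected so far.
                    out.bozo = true;
                    out.bozo_reason = Some("nesting depth limit exceeded".to_string());
                    break;
                }
            }
            Event::End => depth = depth.saturating_sub(1),
            Event::Item => {
                if out.entries >= limits.max_entries {
                    // Assumption: skip extra entries and flag the feed.
                    out.bozo = true;
                    out.bozo_reason = Some("entry limit exceeded".to_string());
                } else {
                    out.entries += 1;
                }
            }
        }
    }
    out
}
```

With `max_entries: 2`, a feed containing three items would yield two entries and `bozo == true` in this sketch, rather than a parse failure.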

Base URL Utilities

  • resolve_url() - Resolve relative URLs against base
  • combine_bases() - Handle nested xml:base attributes
  • BaseUrlContext - Track base URL during parsing
  • is_safe_url() - SSRF protection for URL validation
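For readers unfamiliar with xml:base, the following std-only sketch shows the kind of RFC 3986 reference resolution these utilities perform. The function names mirror the ones listed above, but the signatures, the `Option` return type, and the hand-rolled parsing are assumptions; the real util/base_url.rs may differ. The sketch expects a hierarchical base such as `http://host/path` and applies no scheme allow-list or SSRF check.

```rust
/// True if `s` begins with a URI scheme such as "https:" or "mailto:".
fn has_scheme(s: &str) -> bool {
    match s.split_once(':') {
        Some((scheme, _)) => {
            let mut chars = scheme.chars();
            matches!(chars.next(), Some(c) if c.is_ascii_alphabetic())
                && chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'))
        }
        None => false,
    }
}

/// Removes "." and ".." segments from a path that starts with '/'.
fn remove_dot_segments(path: &str) -> String {
    let mut stack: Vec<&str> = Vec::new();
    let mut ends_with_dir = false;
    for seg in path.split('/').skip(1) {
        ends_with_dir = false;
        match seg {
            "." => ends_with_dir = true,
            ".." => {
                stack.pop();
                ends_with_dir = true;
            }
            s => stack.push(s),
        }
    }
    let mut out = String::from("/");
    out.push_str(&stack.join("/"));
    if ends_with_dir && !out.ends_with('/') {
        out.push('/');
    }
    out
}

/// Resolves `reference` against `base` (simplified RFC 3986 section 5.2).
pub fn resolve_url(base: &str, reference: &str) -> Option<String> {
    if has_scheme(reference) {
        return Some(reference.to_string()); // already absolute
    }
    let base = base.split('#').next().unwrap_or(base);
    let (base_main, base_query) = match base.split_once('?') {
        Some((main, query)) => (main, Some(query)),
        None => (base, None),
    };
    let (scheme, rest) = base_main.split_once("://")?;
    let (authority, base_path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    if reference.starts_with("//") {
        return Some(format!("{}:{}", scheme, reference)); // network-path reference
    }
    let split_at = reference
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(reference.len());
    let (ref_path, suffix) = reference.split_at(split_at);
    let (path, suffix) = if ref_path.is_empty() {
        // Same-document reference: keep the base path, and the base query
        // unless the reference supplies its own.
        let suffix = match (suffix.starts_with('?'), base_query) {
            (false, Some(q)) => format!("?{}{}", q, suffix),
            _ => suffix.to_string(),
        };
        (base_path.to_string(), suffix)
    } else if ref_path.starts_with('/') {
        (remove_dot_segments(ref_path), suffix.to_string())
    } else {
        // Merge with the base path's directory part.
        let dir_end = base_path.rfind('/').map_or(0, |i| i + 1);
        let merged = format!("{}{}", &base_path[..dir_end], ref_path);
        (remove_dot_segments(&merged), suffix.to_string())
    };
    Some(format!("{}://{}{}{}", scheme, authority, path, suffix))
}

/// A nested xml:base is itself resolved against the enclosing base.
pub fn combine_bases(parent: Option<&str>, child: &str) -> Option<String> {
    match parent {
        Some(p) => resolve_url(p, child),
        None => Some(child.to_string()),
    }
}
```

For example, `resolve_url("http://example.com/feeds/main.xml", "images/logo.png")` yields `http://example.com/feeds/images/logo.png`, and a nested `xml:base="2025/"` under `http://example.com/blog/` combines to `http://example.com/blog/2025/`.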

Encoding Detection Priority

  1. BOM (UTF-8, UTF-16, UTF-32)
  2. HTTP Content-Type charset
  3. XML declaration encoding
  4. Default to UTF-8
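The priority order above can be illustrated with a small std-only sketch. The names `detect_bom` and `detect_encoding_with_hint` come from this PR, but the signatures, the string return values, and the `xml_declared_encoding` helper are assumptions made for illustration. The XML declaration scan only works for ASCII-compatible input.

```rust
/// Returns the encoding implied by a byte order mark, if any.
pub fn detect_bom(data: &[u8]) -> Option<&'static str> {
    // UTF-32 LE must be tested before UTF-16 LE: FF FE 00 00 also starts with FF FE.
    if data.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        Some("utf-32be")
    } else if data.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        Some("utf-32le")
    } else if data.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some("utf-8")
    } else if data.starts_with(&[0xFE, 0xFF]) {
        Some("utf-16be")
    } else if data.starts_with(&[0xFF, 0xFE]) {
        Some("utf-16le")
    } else {
        None
    }
}

/// Hypothetical helper: reads `encoding="..."` from an XML declaration.
fn xml_declared_encoding(data: &[u8]) -> Option<String> {
    let text = String::from_utf8_lossy(&data[..data.len().min(256)]);
    let head = text.trim_start();
    if !head.starts_with("<?xml") {
        return None;
    }
    let decl = &head[..head.find("?>")?];
    let after = &decl[decl.find("encoding")? + "encoding".len()..];
    let after = after.trim_start().strip_prefix('=')?.trim_start();
    let quote = after.chars().next().filter(|c| *c == '"' || *c == '\'')?;
    let value = &after[1..];
    let end = value.find(quote)?;
    Some(value[..end].to_ascii_lowercase())
}

/// Applies the priority: BOM, then HTTP charset, then XML declaration, then UTF-8.
pub fn detect_encoding_with_hint(data: &[u8], http_charset: Option<&str>) -> String {
    if let Some(enc) = detect_bom(data) {
        return enc.to_string();
    }
    if let Some(cs) = http_charset.map(str::trim).filter(|c| !c.is_empty()) {
        return cs.to_ascii_lowercase();
    }
    if let Some(enc) = xml_declared_encoding(data) {
        return enc;
    }
    "utf-8".to_string()
}
```

In this sketch a UTF-8 BOM wins even when both the HTTP header and the XML declaration claim a different encoding, since the BOM reflects the bytes actually received.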

Security Fixes

| Issue | Severity | Fix |
|---|---|---|
| SSRF in URL resolution | CRITICAL (CVSS 8.6) | Added is_safe_url() validation |
| ftp:// allowed | MEDIUM | Removed from allowed schemes |
| DC tag validation | MEDIUM | Added alphanumeric validation |

Test Plan

  • All 322 tests passing
  • Clippy clean (zero warnings)
  • Documentation builds
  • Security tests for SSRF protection
  • RSS 1.0 parsing with various fixtures
  • Encoding detection edge cases

Implement complete RSS 1.0 parser with:
- RDF root element handling (rdf:RDF and RDF variants)
- Channel metadata parsing with rdf:about attribute
- Item parsing as siblings of channel (RDF structure)
- Dublin Core namespace support (dc:creator, dc:date, etc.)
- Image element parsing
- Entry limits and nesting depth protection
- Tolerant parsing with bozo flag on errors

This closes the most critical feature gap vs Python feedparser.
All 272 tests pass.

Add utilities for relative URL resolution following RFC 3986:
- resolve_url(): resolves relative URLs against a base URL
- combine_bases(): combines nested xml:base values
- BaseUrlContext: tracks base URL state during parsing

This module provides the foundation for xml:base support
in Atom and RSS parsers. Integration with parsers will
follow in subsequent commits.

All 291 tests pass.

Enhance encoding detection with:
- extract_charset_from_content_type(): parse charset from HTTP headers
- detect_encoding_with_hint(): combined detection with priority order
- detect_bom(): separate BOM detection for reuse
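As an illustration of the first helper, a charset parameter can be pulled out of a Content-Type value roughly like this. The function name matches the one in this commit, but the signature and the lower-casing of the result are assumptions, and this sketch does not handle quoted values that contain a semicolon.

```rust
/// Extracts the `charset` parameter from an HTTP Content-Type header value.
pub fn extract_charset_from_content_type(content_type: &str) -> Option<String> {
    // The media type comes first; parameters follow each ';'.
    content_type.split(';').skip(1).find_map(|param| {
        let (name, value) = param.split_once('=')?;
        if !name.trim().eq_ignore_ascii_case("charset") {
            return None;
        }
        // Strip optional surrounding quotes and whitespace.
        let value = value.trim().trim_matches(|c: char| c == '"' || c == '\'');
        if value.is_empty() {
            None
        } else {
            Some(value.to_ascii_lowercase())
        }
    })
}
```

For `application/rss+xml; charset="UTF-8"` this returns `Some("utf-8")`; for a bare `text/html` it returns `None`, so detection falls through to the next priority level.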

Detection priority:
1. BOM (highest - cannot be wrong)
2. HTTP Content-Type charset
3. XML declaration encoding
4. Default to UTF-8

All 311 tests pass.

Apply improvements from code review:
- encoding.rs: Remove BOM detection duplication by using detect_bom()
- rss10.rs: Remove redundant dc:date handling (already in dublin_core)
- rss10.rs: Handle Event::Empty in parse_image()
- rss10.rs: Remove unused parse_date import

All 311 tests pass.

SEC-001: Add comprehensive SSRF protection in base_url.rs
- Implement is_safe_url() function to validate URLs before resolution
- Block file:// and other non-HTTP(S) schemes to prevent local file access
- Reject localhost addresses (127.0.0.1, ::1, localhost) to prevent firewall bypass
- Block private IP ranges (192.168.x.x, 10.x.x.x, 172.16-31.x.x) to prevent internal network access
- Reject cloud metadata endpoints (169.254.169.254, metadata.google.internal) to prevent credential leaks
- Add comprehensive test coverage for all SSRF attack vectors
- Export is_safe_url in util module for public use
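The checks listed for SEC-001 can be sketched with the standard library alone. This is an illustrative approximation, not the function exported from the util module: the signature, the manual host extraction, and the extra rejection of the whole 169.254.0.0/16 link-local range and of 0.0.0.0 are assumptions. It validates only the literal URL text, so it does not cover hostnames that later resolve to internal addresses.

```rust
use std::net::IpAddr;

/// Returns true if `url` is http(s) and does not point at an obviously internal host.
pub fn is_safe_url(url: &str) -> bool {
    let lower = url.trim().to_ascii_lowercase();
    // Only http:// and https:// are allowed; file://, ftp://, etc. are rejected.
    let rest = match lower
        .strip_prefix("https://")
        .or_else(|| lower.strip_prefix("http://"))
    {
        Some(r) => r,
        None => return false,
    };
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Drop any userinfo ("user:pass@").
    let host_port = authority.rsplit('@').next().unwrap_or("");
    // Separate the host from an optional port; IPv6 literals are bracketed.
    let host = if let Some(inner) = host_port.strip_prefix('[') {
        match inner.split_once(']') {
            Some((h, _)) => h,
            None => return false,
        }
    } else {
        host_port.split(':').next().unwrap_or("")
    };
    let host = host.trim_end_matches('.');
    if host.is_empty() {
        return false;
    }
    if host == "localhost" || host.ends_with(".localhost") || host == "metadata.google.internal" {
        return false;
    }
    match host.parse::<IpAddr>() {
        // is_link_local() covers 169.254.169.254 (cloud metadata).
        Ok(IpAddr::V4(ip)) => {
            !(ip.is_loopback() || ip.is_private() || ip.is_link_local() || ip.is_unspecified())
        }
        Ok(IpAddr::V6(ip)) => !(ip.is_loopback() || ip.is_unspecified()),
        // Not an IP literal: treated as an ordinary public hostname here.
        Err(_) => true,
    }
}
```

Under these assumptions `https://example.com/feed.xml` passes, while `file:///etc/passwd`, `http://127.0.0.1/`, `http://10.0.0.5/`, and `http://169.254.169.254/latest/meta-data/` are all rejected.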

SEC-002: Remove ftp:// from allowed URL schemes
- Remove ftp:// support from resolve_url() to reduce attack surface
- Only allow http://, https://, mailto:, and tel: schemes

SEC-003: Add Dublin Core tag name validation in rss10.rs
- Validate dc: namespace tag names to contain only alphanumeric characters and hyphens
- Prevent path traversal and special character injection through malicious XML tags
- Add validation for non-empty tag names
- Include tests for malicious tag name rejection
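The SEC-003 rule is small enough to show in full as a sketch. Both function names below are hypothetical, and restricting the check to ASCII letters and digits is an assumption about what "alphanumeric" means in rss10.rs.

```rust
/// True if `name` is non-empty and contains only ASCII letters, digits, or hyphens.
pub fn is_valid_dc_tag_name(name: &str) -> bool {
    !name.is_empty() && name.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-')
}

/// Returns the validated local name of a `dc:`-prefixed tag, or None.
pub fn dc_local_name(qualified: &str) -> Option<&str> {
    let local = qualified.strip_prefix("dc:")?;
    is_valid_dc_tag_name(local).then_some(local)
}
```

Here `dc:creator` yields `Some("creator")`, while `dc:../../etc/passwd` and the empty `dc:` are rejected.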

All changes maintain backward compatibility for legitimate use cases while
protecting against the Server-Side Request Forgery (SSRF) issue rated CVSS 8.6.

@github-actions github-actions bot added labels on Dec 16, 2025: type: tooling, component: core, component: python, component: node, area: parser, area: rss, area: encoding, lang: rust, size: XXL
codecov-commenter commented Dec 16, 2025

Codecov Report

❌ Patch coverage is 94.55865% with 45 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/feedparser-rs-core/src/parser/rss10.rs | 90.57% | 36 Missing ⚠️ |
| crates/feedparser-rs-core/src/parser/rss.rs | 70.58% | 5 Missing ⚠️ |
| crates/feedparser-rs-core/src/parser/json.rs | 80.00% | 2 Missing ⚠️ |
| crates/feedparser-rs-core/src/parser/mod.rs | 0.00% | 1 Missing ⚠️ |
| crates/feedparser-rs-core/src/util/base_url.rs | 99.57% | 1 Missing ⚠️ |

@@            Coverage Diff             @@
##             main      #18      +/-   ##
==========================================
+ Coverage   85.24%   87.27%   +2.03%     
==========================================
  Files          27       29       +2     
  Lines        4641     5327     +686     
==========================================
+ Hits         3956     4649     +693     
+ Misses        685      678       -7     

| Flag | Coverage Δ |
|---|---|
| rust-core | 87.27% <94.55%> (+2.03%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| crates/feedparser-rs-core/src/parser/atom.rs | 88.82% <100.00%> (+1.75%) ⬆️ |
| crates/feedparser-rs-core/src/parser/common.rs | 73.43% <100.00%> (+6.57%) ⬆️ |
| crates/feedparser-rs-core/src/types/entry.rs | 87.69% <100.00%> (+2.78%) ⬆️ |
| crates/feedparser-rs-core/src/types/feed.rs | 96.36% <100.00%> (+0.44%) ⬆️ |
| crates/feedparser-rs-core/src/util/encoding.rs | 98.29% <100.00%> (+3.09%) ⬆️ |
| crates/feedparser-rs-core/src/util/text.rs | 100.00% <100.00%> (ø) |
| crates/feedparser-rs-core/src/parser/mod.rs | 84.21% <0.00%> (+8.02%) ⬆️ |
| crates/feedparser-rs-core/src/util/base_url.rs | 99.57% <99.57%> (ø) |
| crates/feedparser-rs-core/src/parser/json.rs | 91.51% <80.00%> (-0.12%) ⬇️ |
| crates/feedparser-rs-core/src/parser/rss.rs | 76.45% <70.58%> (+0.67%) ⬆️ |
... and 1 more
- Extract common namespace detection to parser/common.rs:
  - extract_ns_local_name() with security validation
  - is_dc_tag(), is_content_tag(), is_media_tag(), is_itunes_tag()
- Enable check_depth() function for all parsers
- Add set_alternate_link() method to FeedMeta and Entry
- Add truncate_to_length() utility to util/text.rs
- Remove duplicate implementations from rss.rs, rss10.rs, atom.rs, json.rs

Eliminates ~200 lines of duplicate code while maintaining security
validation for namespace tag names.
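As one example of the shared helpers, a length-limiting utility can be written so that it never splits a multi-byte character. The name `truncate_to_length` comes from this commit; the signature and the choice to measure length in bytes rather than characters are assumptions.

```rust
/// Returns the longest prefix of `s` that fits in `max_bytes` and ends on a char boundary.
pub fn truncate_to_length(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Truncating `"héllo"` to 2 bytes gives `"h"` in this sketch, because the two-byte `é` would otherwise be cut in half.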
@github-actions github-actions bot added labels on Dec 16, 2025: area: atom, area: json-feed
@bug-ops bug-ops merged commit e8b720b into main Dec 16, 2025
31 checks passed
@bug-ops bug-ops deleted the feature/feedparser-parity branch December 16, 2025 13:36