feat: RSS 1.0 parser, xml:base support, enhanced encoding detection #18
Merged
Implement complete RSS 1.0 parser with:
- RDF root element handling (rdf:RDF and RDF variants)
- Channel metadata parsing with rdf:about attribute
- Item parsing as siblings of channel (RDF structure)
- Dublin Core namespace support (dc:creator, dc:date, etc.)
- Image element parsing
- Entry limits and nesting depth protection
- Tolerant parsing with bozo flag on errors

This closes the most critical feature gap vs Python feedparser.

All 272 tests pass.
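To illustrate the structural point in this commit (in RSS 1.0, item elements are siblings of channel under rdf:RDF rather than children of it), here is a minimal sketch. It is not the code in parser/rss10.rs: the Event enum, the Feed struct, and the MAX_ENTRIES and MAX_DEPTH values are hypothetical stand-ins, and the input is assumed to be already tokenized by an XML reader.

```rust
// Hypothetical pre-tokenized XML events; a real parser would get these from an XML reader.
enum Event<'a> {
    Start(&'a str),
    Text(&'a str),
    End(&'a str),
}

#[derive(Debug, Default)]
struct Feed {
    title: Option<String>,
    entries: Vec<String>, // entry titles only, to keep the sketch small
    bozo: bool,           // set when the input is malformed or a limit is hit
}

// Assumed limits, chosen only for illustration.
const MAX_ENTRIES: usize = 2;
const MAX_DEPTH: usize = 8;

fn parse_rss10<'a>(events: &[Event<'a>]) -> Feed {
    let mut feed = Feed::default();
    let mut stack: Vec<&'a str> = Vec::new(); // currently open elements
    for event in events {
        match event {
            Event::Start(name) => {
                // Nesting depth protection: stop and flag the feed.
                if stack.len() >= MAX_DEPTH {
                    feed.bozo = true;
                    return feed;
                }
                stack.push(*name);
            }
            Event::End(name) => {
                // A mismatched close tag makes the document malformed.
                if stack.pop() != Some(*name) {
                    feed.bozo = true;
                    return feed;
                }
            }
            Event::Text(text) => match stack.as_slice() {
                // channel and item both sit directly under rdf:RDF.
                ["rdf:RDF", "channel", "title"] => feed.title = Some((*text).to_string()),
                ["rdf:RDF", "item", "title"] if feed.entries.len() < MAX_ENTRIES => {
                    feed.entries.push((*text).to_string());
                }
                _ => {}
            },
        }
    }
    // Unclosed elements at end of input: keep what was parsed, but flag it.
    if !stack.is_empty() {
        feed.bozo = true;
    }
    feed
}
```

The sketch keeps whatever was parsed before an error and only sets the bozo flag, which is one way to read "tolerant parsing"; the real parser may recover differently.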
Add utilities for relative URL resolution following RFC 3986:
- resolve_url(): resolves relative URLs against a base URL
- combine_bases(): combines nested xml:base values
- BaseUrlContext: tracks base URL state during parsing

This module provides the foundation for xml:base support in Atom and RSS parsers. Integration with parsers will follow in subsequent commits.

All 291 tests pass.
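The resolution step can be sketched with the standard library alone. This is a simplified take on RFC 3986 section 5.2, not the code in util/base_url.rs: the argument order, the Option return type, and the helpers split_base, has_scheme, and remove_dot_segments are assumptions. It drops the base URL's query and fragment, and it applies no scheme allowlist.

```rust
/// Splits an absolute base URL into (scheme, authority, path), ignoring query and fragment.
fn split_base(base: &str) -> Option<(&str, &str, &str)> {
    let (scheme, rest) = base.split_once("://")?;
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    let (authority, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    Some((scheme, authority, path))
}

/// True if the reference starts with a URI scheme such as "https:" or "mailto:".
fn has_scheme(s: &str) -> bool {
    match s.find(':') {
        Some(i) if i > 0 => {
            let scheme = &s[..i];
            scheme.starts_with(|c: char| c.is_ascii_alphabetic())
                && scheme
                    .chars()
                    .all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'))
        }
        _ => false,
    }
}

/// Removes "." and ".." segments from an absolute path (RFC 3986, section 5.2.4).
fn remove_dot_segments(path: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    for seg in path.split('/') {
        match seg {
            "." => {}
            ".." => {
                // Never pop the leading empty segment that stands for the root "/".
                if out.len() > 1 {
                    out.pop();
                }
            }
            _ => out.push(seg),
        }
    }
    // A trailing "." or ".." names a directory, so keep the trailing slash.
    if path.ends_with("/.") || path.ends_with("/..") {
        out.push("");
    }
    out.join("/")
}

/// Resolves `reference` against `base`; returns None if `base` is not absolute.
fn resolve_url(base: &str, reference: &str) -> Option<String> {
    let reference = reference.trim();
    if has_scheme(reference) {
        return Some(reference.to_string()); // already absolute
    }
    let (scheme, authority, base_path) = split_base(base)?;
    if let Some(rest) = reference.strip_prefix("//") {
        return Some(format!("{}://{}", scheme, rest)); // scheme-relative
    }
    // Separate the path of the reference from its query/fragment suffix.
    let cut = reference
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(reference.len());
    let (ref_path, suffix) = reference.split_at(cut);
    let path = if ref_path.is_empty() {
        base_path.to_string()
    } else if ref_path.starts_with('/') {
        remove_dot_segments(ref_path)
    } else {
        // Merge: replace everything after the last '/' of the base path.
        let dir_end = base_path.rfind('/').map_or(0, |i| i + 1);
        remove_dot_segments(&format!("{}{}", &base_path[..dir_end], ref_path))
    };
    Some(format!("{}://{}{}{}", scheme, authority, path, suffix))
}

/// Combines a nested xml:base value with the base inherited from the parent element.
fn combine_bases(parent: Option<&str>, child: &str) -> Option<String> {
    match parent {
        Some(p) => resolve_url(p, child),
        None if has_scheme(child) => Some(child.to_string()),
        None => None,
    }
}
```

Under these assumptions, an element with xml:base="2024/" nested inside one whose base is http://example.com/blog/ would resolve its links against http://example.com/blog/2024/.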
Enhance encoding detection with:
- extract_charset_from_content_type(): parse charset from HTTP headers
- detect_encoding_with_hint(): combined detection with priority order
- detect_bom(): separate BOM detection for reuse

Detection priority:
1. BOM (highest - cannot be wrong)
2. HTTP Content-Type charset
3. XML declaration encoding
4. Default to UTF-8

All 311 tests pass.
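The priority order above can be sketched as follows. The function names come from the commit message, but their signatures, the returned encoding labels, and the extract_xml_declaration_encoding helper are assumptions made for illustration; the real code in util/encoding.rs may differ.

```rust
/// Byte-order-mark detection: returns the encoding name and the BOM length in bytes.
fn detect_bom(data: &[u8]) -> Option<(&'static str, usize)> {
    // UTF-32LE is checked before UTF-16LE because its BOM starts with FF FE too.
    if data.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(("UTF-8", 3))
    } else if data.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        Some(("UTF-32LE", 4))
    } else if data.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        Some(("UTF-32BE", 4))
    } else if data.starts_with(&[0xFF, 0xFE]) {
        Some(("UTF-16LE", 2))
    } else if data.starts_with(&[0xFE, 0xFF]) {
        Some(("UTF-16BE", 2))
    } else {
        None
    }
}

/// Pulls the charset parameter out of a Content-Type header value.
fn extract_charset_from_content_type(content_type: &str) -> Option<String> {
    content_type.split(';').skip(1).find_map(|param| {
        let (name, value) = param.split_once('=')?;
        if !name.trim().eq_ignore_ascii_case("charset") {
            return None;
        }
        let value = value.trim().trim_matches('"').trim_matches('\'');
        if value.is_empty() {
            None
        } else {
            Some(value.to_ascii_uppercase())
        }
    })
}

/// Reads the encoding pseudo-attribute from an XML declaration, if present.
/// Hypothetical helper; only handles ASCII-compatible input.
fn extract_xml_declaration_encoding(data: &[u8]) -> Option<String> {
    let head_cow = String::from_utf8_lossy(&data[..data.len().min(200)]);
    let head: &str = &head_cow;
    let decl = &head[..head.find("?>")?];
    if !decl.trim_start().starts_with("<?xml") {
        return None;
    }
    let after = &decl[decl.find("encoding")? + "encoding".len()..];
    let after = after.trim_start().strip_prefix('=')?.trim_start();
    let quote = after.chars().next()?;
    if quote != '"' && quote != '\'' {
        return None;
    }
    let rest = &after[1..];
    let end = rest.find(quote)?;
    Some(rest[..end].to_ascii_uppercase())
}

/// Combined detection: BOM, then HTTP charset, then XML declaration, then UTF-8.
fn detect_encoding_with_hint(data: &[u8], content_type: Option<&str>) -> String {
    if let Some((name, _)) = detect_bom(data) {
        return name.to_string();
    }
    if let Some(charset) = content_type.and_then(extract_charset_from_content_type) {
        return charset;
    }
    extract_xml_declaration_encoding(data).unwrap_or_else(|| "UTF-8".to_string())
}
```

Note that with this ordering an HTTP charset overrides a conflicting XML declaration, while a BOM overrides both.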
Apply improvements from code review:
- encoding.rs: Remove BOM detection duplication by using detect_bom()
- rss10.rs: Remove redundant dc:date handling (already in dublin_core)
- rss10.rs: Handle Event::Empty in parse_image()
- rss10.rs: Remove unused parse_date import

All 311 tests pass.
SEC-001: Add comprehensive SSRF protection in base_url.rs
- Implement is_safe_url() function to validate URLs before resolution
- Block file:// and other non-HTTP(S) schemes to prevent local file access
- Reject localhost addresses (127.0.0.1, ::1, localhost) to prevent firewall bypass
- Block private IP ranges (192.168.x.x, 10.x.x.x, 172.16-31.x.x) to prevent internal network access
- Reject cloud metadata endpoints (169.254.169.254, metadata.google.internal) to prevent credential leaks
- Add comprehensive test coverage for all SSRF attack vectors
- Export is_safe_url in util module for public use

SEC-002: Remove ftp:// from allowed URL schemes
- Remove ftp:// support from resolve_url() to reduce attack surface
- Only allow http://, https://, mailto:, and tel: schemes

SEC-003: Add Dublin Core tag name validation in rss10.rs
- Validate dc: namespace tag names to contain only alphanumeric characters and hyphens
- Prevent path traversal and special character injection through malicious XML tags
- Add validation for non-empty tag names
- Include tests for malicious tag name rejection

All changes maintain backward compatibility for legitimate use cases while protecting against Server-Side Request Forgery (SSRF) attacks with CVSS 8.6 severity.
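The SEC-001 checks might look roughly like the sketch below, which uses only std::net. The name is_safe_url comes from the commit; its signature and the is_blocked_ipv4 and is_blocked_ipv6 helpers are assumptions, and the hand-rolled authority parsing stands in for whatever URL parser the real code uses. It does not resolve DNS names and does not catch alternate IP spellings (decimal integers, IPv4-mapped IPv6), so it should be read as an outline rather than a complete defence.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Loopback, private (10/8, 172.16/12, 192.168/16), link-local (169.254/16), or 0.0.0.0.
fn is_blocked_ipv4(ip: Ipv4Addr) -> bool {
    ip.is_loopback() || ip.is_private() || ip.is_link_local() || ip.is_unspecified()
}

/// Loopback (::1) or unspecified (::).
fn is_blocked_ipv6(ip: Ipv6Addr) -> bool {
    ip.is_loopback() || ip.is_unspecified()
}

/// Returns true only for http(s) URLs whose host is not an obviously internal target.
fn is_safe_url(url: &str) -> bool {
    let (scheme, rest) = match url.trim().split_once("://") {
        Some(parts) => parts,
        None => return false,
    };
    // Only HTTP(S) is allowed; file://, ftp://, etc. are rejected.
    if !scheme.eq_ignore_ascii_case("http") && !scheme.eq_ignore_ascii_case("https") {
        return false;
    }
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Drop any userinfo ("user:pass@").
    let host_port = authority.rsplit('@').next().unwrap_or("");
    // IPv6 literals are bracketed; otherwise the host ends at the port separator.
    let host = if let Some(v6) = host_port.strip_prefix('[') {
        match v6.split_once(']') {
            Some((h, _)) => h,
            None => return false,
        }
    } else {
        host_port.split(':').next().unwrap_or("")
    };
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    if host.is_empty()
        || host == "localhost"
        || host.ends_with(".localhost")
        || host == "metadata.google.internal"
    {
        return false;
    }
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(ip)) => !is_blocked_ipv4(ip),
        Ok(IpAddr::V6(ip)) => !is_blocked_ipv6(ip),
        Err(_) => true, // an ordinary DNS name; not resolved in this sketch
    }
}
```

The metadata address 169.254.169.254 needs no special case here because it falls inside the link-local range that Ipv4Addr::is_link_local() already covers.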
Codecov Report

@@            Coverage Diff             @@
##             main      #18      +/-   ##
==========================================
+ Coverage   85.24%   87.27%   +2.03%
==========================================
  Files          27       29       +2
  Lines        4641     5327     +686
==========================================
+ Hits         3956     4649     +693
+ Misses        685      678       -7
- Extract common namespace detection to parser/common.rs:
  - extract_ns_local_name() with security validation
  - is_dc_tag(), is_content_tag(), is_media_tag(), is_itunes_tag()
- Enable check_depth() function for all parsers
- Add set_alternate_link() method to FeedMeta and Entry
- Add truncate_to_length() utility to util/text.rs
- Remove duplicate implementations from rss.rs, rss10.rs, atom.rs, json.rs

Eliminates ~200 lines of duplicate code while maintaining security validation for namespace tag names.
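A rough idea of what the shared helpers could look like is given below. Only the function names are taken from the commit message; every signature, the DepthExceeded error type, and the choice to truncate by characters rather than bytes are assumptions. The validation rule (non-empty, ASCII alphanumerics and hyphens only) follows the SEC-003 description.

```rust
/// Returns the local name of `tag` if it belongs to `prefix` and passes validation.
/// For example, ("dc:creator", "dc") yields Some("creator").
fn extract_ns_local_name<'a>(tag: &'a str, prefix: &str) -> Option<&'a str> {
    let local = tag.strip_prefix(prefix)?.strip_prefix(':')?;
    // Security validation: non-empty, ASCII alphanumerics and hyphens only.
    let valid = !local.is_empty()
        && local.chars().all(|c| c.is_ascii_alphanumeric() || c == '-');
    if valid {
        Some(local)
    } else {
        None
    }
}

/// Thin wrapper for the Dublin Core prefix; the other is_*_tag helpers would be analogous.
fn is_dc_tag(tag: &str) -> Option<&str> {
    extract_ns_local_name(tag, "dc")
}

/// Hypothetical error type for the nesting-depth guard.
#[derive(Debug, PartialEq)]
struct DepthExceeded {
    depth: usize,
    max: usize,
}

/// Fails once the current nesting depth goes past the configured maximum.
fn check_depth(depth: usize, max: usize) -> Result<(), DepthExceeded> {
    if depth > max {
        Err(DepthExceeded { depth, max })
    } else {
        Ok(())
    }
}

/// Truncates to at most `max_chars` characters without splitting a UTF-8 sequence.
fn truncate_to_length(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((idx, _)) => &s[..idx],
        None => s,
    }
}
```

Keeping the validation inside extract_ns_local_name() means every namespace wrapper inherits it, which matches the stated goal of removing duplicates without losing the tag-name checks.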
Labels
- area: atom (Atom 1.0 support)
- area: encoding (Character encoding detection)
- area: json-feed (JSON Feed support)
- area: parser (Feed parsing logic)
- area: rss (RSS 0.9x, 1.0, 2.0 support)
- component: core (feedparser-rs-core Rust library)
- component: node (Node.js bindings, napi-rs)
- component: python (Python bindings, PyO3)
- lang: rust (Rust code)
- size: XXL (Huge PR, 1000+ lines changed)
- type: tooling (Development tools, CI/CD, or infrastructure)
Summary
This PR implements key features to achieve parity with Python feedparser:
- xml:base support for relative URL resolution (RFC 3986)

Changes

New Files
- parser/rss10.rs - Complete RSS 1.0 parser implementation (623 lines)
- util/base_url.rs - URL resolution utilities with SSRF protection (531 lines)

Modified Files
- util/encoding.rs - Added Content-Type charset extraction (+301 lines)
- parser/mod.rs - RSS 1.0 routing
- util/mod.rs - New exports

Features

RSS 1.0 Parser
- RDF structure (rdf:RDF root, rdf:about attributes)
- Dublin Core elements (dc:creator, dc:date, etc.)

Base URL Utilities
- resolve_url() - Resolve relative URLs against base
- combine_bases() - Handle nested xml:base attributes
- BaseUrlContext - Track base URL during parsing
- is_safe_url() - SSRF protection for URL validation

Encoding Detection Priority
1. BOM
2. HTTP Content-Type charset
3. XML declaration encoding
4. Default to UTF-8

Security Fixes
- SSRF protection via is_safe_url() validation

Test Plan