feat(parser): add feed.published field and xml:base URL resolution #26
- Add published field to FeedMeta for RSS pubDate and Atom published
- Add published_parsed getter to Python bindings returning time.struct_time
- Implement xml:base URL resolution for Atom (feed and entry level)
- Implement implicit base URL from RSS channel link
- Add resolve_safe() method with SSRF protection
- Block dangerous URLs: localhost, private IPs, cloud metadata endpoints
- Add comprehensive test coverage (29 new tests)

Closes Python API parity gap for date fields and relative URL handling.
2811bff to 52543b4
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Coverage Diff | main | #26 | +/- |
|---|---|---|---|
| Coverage | 90.75% | 90.85% | +0.10% |
| Files | 32 | 32 | |
| Lines | 6175 | 6245 | +70 |
| Hits | 5604 | 5674 | +70 |
| Misses | 571 | 571 | |
Pull request overview
This PR adds critical Python API compatibility features to feedparser-rs: support for feed.published_parsed field and xml:base URL resolution with SSRF protection. The changes enable RSS channel <pubDate> and Atom <published> dates to be exposed at the feed level (previously only available for entries), and implement automatic resolution of relative URLs against xml:base attributes or channel links while blocking SSRF attack vectors.
Key Changes
- Feed-level publication dates: RSS `<pubDate>` and Atom `<published>` now populate the `FeedMeta.published` field, exposed in Python bindings as both an RFC3339 string and `time.struct_time`
- URL resolution: Relative URLs in feeds are automatically resolved to absolute URLs using xml:base (Atom) or channel link (RSS) as the base
- SSRF protection: All resolved URLs are validated against localhost, private IPs, cloud metadata endpoints, and dangerous schemes (file://, data://, etc.)
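The URL resolution behaviour described above can be sketched with Python's standard library. This is only an illustration of RFC 3986 relative resolution, not the crate's implementation: the helper names `effective_base` and `resolve` and the example URLs are assumptions made for the sketch.

```python
from urllib.parse import urljoin


def effective_base(*bases):
    """Fold nested base values (outermost first) into one base URL.

    An entry-level xml:base may itself be relative to the feed-level one,
    so each value is resolved against the base accumulated so far.
    """
    current = ""
    for base in bases:
        if base:
            current = urljoin(current, base)
    return current


def resolve(base, href):
    """Resolve href against base; leave href untouched when there is no base."""
    return urljoin(base, href) if base else href


# Atom: feed-level xml:base combined with a relative entry-level xml:base.
feed_base = "http://example.com/blog/"
entry_base = effective_base(feed_base, "2024/")
print(resolve(entry_base, "post.html"))

# RSS: the channel link acts as an implicit base for item links.
print(resolve("http://example.com/", "/items/1"))
```

An absolute `href` passes through `urljoin` unchanged, which is why a resolver also needs the separate safety validation listed in the last bullet.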
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/feedparser-rs-core/src/types/feed.rs | Added `published: Option<DateTime<Utc>>` field to `FeedMeta` struct |
| crates/feedparser-rs-core/src/util/base_url.rs | Added `resolve_safe()` method to `BaseUrlContext` with SSRF validation logic |
| crates/feedparser-rs-core/src/parser/common.rs | Added `extract_xml_base()` helper to extract xml:base attributes from XML elements |
| crates/feedparser-rs-core/src/parser/atom.rs | Integrated xml:base extraction and URL resolution for feed and entry elements |
| crates/feedparser-rs-core/src/parser/rss.rs | Changed RSS `<pubDate>` to populate `feed.published` instead of `feed.updated`; integrated channel link as base URL for item link and enclosure resolution |
| crates/feedparser-rs-py/src/types/feed_meta.rs | Added Python getters for `published` (string) and `published_parsed` (`time.struct_time`) |
| crates/feedparser-rs-py/tests/test_phase1_integration.py | 13 comprehensive Python integration tests covering date parsing and URL resolution |
| crates/feedparser-rs-core/tests/test_url_resolution.rs | 10 integration tests for xml:base URL resolution behavior |
| crates/feedparser-rs-core/tests/test_url_security.rs | 19 SSRF protection tests validating blocking of malicious URLs |
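To make the relationship between the two Python getters concrete, here is a minimal sketch of how an RFC3339 string maps to a UTC `time.struct_time`. The function name `to_struct_time` and the sample timestamps are assumptions for illustration; the bindings themselves perform this conversion on the Rust side.

```python
import time
from datetime import datetime


def to_struct_time(rfc3339):
    """Convert an RFC3339 timestamp with a numeric offset to a UTC struct_time."""
    dt = datetime.fromisoformat(rfc3339)
    # utctimetuple() normalises an aware datetime to UTC and sets tm_isdst to 0.
    return dt.utctimetuple()


parsed = to_struct_time("2024-01-15T12:30:00+02:00")
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday, parsed.tm_hour, parsed.tm_min)
print(isinstance(parsed, time.struct_time))
```

The sketch uses a numeric offset rather than a trailing `Z` because `datetime.fromisoformat` only accepts `Z` from Python 3.11 onward.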
Copilot review identified that absolute malicious URLs like http://localhost/admin in href attributes bypassed SSRF protection. Now resolve_safe() returns empty string for unsafe absolute hrefs.
- Use is_some_and() instead of map().unwrap_or(false) per clippy
- Add test for absolute malicious URL bypass scenario
- Add test for private IP in href blocking
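The bypass described in this commit comes from validating only the base and not the final URL. A hedged sketch of the fixed behaviour, checking the host after resolution, could look like the following. `resolve_safe_sketch` and `host_is_unsafe` are hypothetical names, and the exact rules in the crate's `resolve_safe()` may differ.

```python
import ipaddress
from urllib.parse import urljoin, urlsplit


def host_is_unsafe(host):
    """Return True for hosts a feed should not be able to point at."""
    if host is None:
        # Assumption of this sketch: a URL with no host is rejected outright.
        return True
    if host.lower() == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # An ordinary DNS name; this sketch does not resolve it.
        return False
    # Link-local covers 169.254.169.254, a common cloud metadata address.
    return ip.is_loopback or ip.is_private or ip.is_link_local


def resolve_safe_sketch(base, href):
    """Resolve href against base, returning "" when the result is unsafe."""
    absolute = urljoin(base, href)
    # Validate the resolved URL, so an absolute href cannot skip the check.
    if host_is_unsafe(urlsplit(absolute).hostname):
        return ""
    return absolute


base = "http://example.com/feed/"
print(repr(resolve_safe_sketch(base, "/posts/1")))
print(repr(resolve_safe_sketch(base, "http://localhost/admin")))
print(repr(resolve_safe_sketch(base, "http://192.168.1.1/router")))
```

Because the check runs on the output of `urljoin`, relative and absolute hrefs go through the same validation path.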
Add published field to FeedMeta for API parity with Python bindings. Returns milliseconds since epoch for easy JavaScript Date conversion.
- Add published: Option<i64> to FeedMeta struct
- Update From<CoreFeedMeta> to map DateTime to timestamp_millis
- Add TypeScript type definition
- Add test for feed-level published parsing
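The milliseconds-since-epoch representation is language-neutral, so the arithmetic can be sketched in Python even though this commit concerns the Node.js bindings. The helper names `to_timestamp_millis` and `from_timestamp_millis` are illustrative assumptions; integer timedelta division is used to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_millis(dt):
    """Whole milliseconds between the Unix epoch and an aware datetime."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_timestamp_millis(ms):
    """Inverse conversion, comparable to constructing a JavaScript Date from ms."""
    return EPOCH + timedelta(milliseconds=ms)


published = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
millis = to_timestamp_millis(published)
print(millis)
print(from_timestamp_millis(millis).isoformat())
```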
Address Copilot review feedback:
- Add case-insensitive scheme comparison per RFC 3986
- Prevents bypass via uppercase schemes (FILE://, JAVASCRIPT:)
- Improve code readability with intermediate variables
- Add assertion to verify empty string for blocked URLs
- Add tests for case-insensitive scheme bypass
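RFC 3986 treats schemes as case-insensitive, so a blocklist has to compare against a normalised scheme. The sketch below shows the idea; the contents of `BLOCKED_SCHEMES` and the function name `has_blocked_scheme` are assumptions, and the crate's actual list may be longer.

```python
# Hypothetical blocklist for illustration only.
BLOCKED_SCHEMES = {"file", "javascript", "data"}


def has_blocked_scheme(href):
    """Return True when href starts with a blocked scheme, ignoring case."""
    scheme, sep, _rest = href.partition(":")
    if not sep:
        # No colon at all: a relative reference with no scheme.
        return False
    # Lower-casing before the lookup is what closes the FILE:// bypass.
    return scheme.lower() in BLOCKED_SCHEMES


for candidate in ("FILE:///etc/passwd", "JavaScript:alert(1)",
                  "https://example.com/", "posts/1"):
    print(candidate, has_blocked_scheme(candidate))
```

A comparison against the raw scheme would let `FILE://` through while rejecting `file://`, which is the bypass these tests guard against.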
Pull request overview
Copilot reviewed 12 out of 13 changed files in this pull request and generated no new comments.
Summary
This PR addresses critical API compatibility gaps identified in the feedparser-rs gap analysis:
- `feed.published` / `feed.published_parsed` field - RSS `<pubDate>` and Atom `<published>` at feed level now accessible via Python and Node.js bindings
- URL resolution - relative URLs are resolved against `xml:base` attributes (Atom) or channel link (RSS)

Changes
Core Library
- Added `published: Option<DateTime<Utc>>` to `FeedMeta` struct
- Added `extract_xml_base()` helper for xml:base attribute extraction
- Added `resolve_safe()` method to `BaseUrlContext` with SSRF validation

Python Bindings
- `published` getter returning RFC3339 string
- `published_parsed` getter returning `time.struct_time`

Node.js Bindings
- `published` field returning milliseconds since epoch
- Converts directly to a JavaScript Date: `new Date(feed.feed.published)`

Security
- Case-insensitive scheme comparison prevents bypass via uppercase schemes (`FILE://`, `JAVASCRIPT:`)

Test Plan
- `cargo clippy --all-targets` clean
- `cargo fmt` applied

API Compatibility
Before: 85% Python feedparser parity
After: ~95% parity for modern feeds (RSS 2.0, Atom 1.0, JSON Feed)
No breaking changes to existing API.