diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
new file mode 100644
index 0000000..346c195
--- /dev/null
+++ b/.github/copilot-instructions.md
@@ -0,0 +1,217 @@
+# feedparser-rs GitHub Copilot Instructions
+
+## Project Mission
+
+High-performance RSS/Atom/JSON Feed parser in Rust with Python (PyO3) and Node.js (napi-rs) bindings. This is a drop-in replacement for Python's `feedparser` library with 10-100x performance improvement.
+
+**CRITICAL**: API compatibility with Python feedparser is the #1 priority. Field names, types, and behavior must match exactly.
+
+**MSRV:** Rust 1.88.0 | **Edition:** 2024 | **License:** MIT/Apache-2.0
+
+## Architecture Overview
+
+### Workspace Structure
+- **`crates/feedparser-rs-core`** — Pure Rust parser. All parsing logic lives here. NO dependencies on other workspace crates.
+- **`crates/feedparser-rs-py`** — Python bindings via PyO3/maturin. Depends on core.
+- **`crates/feedparser-rs-node`** — Node.js bindings via napi-rs. Depends on core.
+
+### Parser Pipeline
+1. **Format Detection** (`parser/detect.rs`) — Identifies RSS 0.9x/1.0/2.0, Atom 0.3/1.0, or JSON Feed 1.0/1.1
+2. **Parsing** — Routes to `parser/rss.rs`, `parser/atom.rs`, or `parser/json.rs`
+3. **Namespace Extraction** — Handlers in `namespace/` process iTunes, Dublin Core, Media RSS, Podcast 2.0
+4. **Tolerant Error Handling** — Returns `ParsedFeed` with `bozo` flag set on errors, continues parsing
+
+## Idiomatic Rust & Performance
+
+### Type Safety First
+- Prefer strong types over primitives: `FeedVersion` enum, not `&str`
+- Use `Option` and `Result` — never sentinel values
+- Leverage generics and trait bounds for reusable code:
+```rust
+fn collect_limited<T, I: Iterator<Item = T>>(iter: I, limit: usize) -> Vec<T> {
+    iter.take(limit).collect()
+}
+```
+
+### Zero-Cost Abstractions
+- Use `&str` over `String` in function parameters
+- Prefer iterators over index-based loops: `.iter().filter().map()`
+- Use `Cow<'_, str>` when ownership is conditionally needed
+- Avoid allocations in hot paths — reuse buffers where possible
+
+### Edition 2024 Features
+- Use `gen` blocks for custom iterators where applicable
+- Leverage improved async patterns for HTTP module
+- Apply new lifetime elision rules for cleaner signatures
+
+### Safety Guidelines
+- `#![warn(unsafe_code)]` is enabled — avoid `unsafe` unless absolutely necessary
+- All public APIs must have doc comments (`#![warn(missing_docs)]`)
+- Use `thiserror` for error types with proper `#[error]` attributes
+
+## Critical Conventions
+
+### Error Handling: Bozo Pattern (MANDATORY)
+**Never panic or return errors for malformed input.** Set `bozo = true` and continue:
+```rust
+match parse_date(&text) {
+    Some(dt) => entry.published = Some(dt),
+    None => {
+        feed.bozo = true;
+        feed.bozo_exception = Some(format!("Invalid date: {text}"));
+        // Continue parsing!
+    }
+}
+```
+
+### API Compatibility with Python feedparser
+Field names must match `feedparser` exactly: `feed.title`, `entries[0].summary`, `version` returns `"rss20"`, `"atom10"`
+
+### XML Parsing with quick-xml
+Use tolerant mode — no strict validation:
+```rust
+let mut reader = Reader::from_reader(data);
+reader.config_mut().trim_text(true);
+// Do NOT enable check_end_names — tolerance over strictness
+```
+
+## Development Commands
+
+All automation via `cargo-make`:
+
+| Command | Purpose |
+|---------|---------|
+| `cargo make fmt` | Format with nightly rustfmt |
+| `cargo make clippy` | Lint (excludes py bindings) |
+| `cargo make test-rust` | Rust tests (nextest) |
+| `cargo make pre-commit` | fmt + clippy + test-rust |
+| `cargo make bench` | Criterion benchmarks |
+| `cargo make msrv-check` | Verify MSRV 1.88.0 compatibility |
+
+### Bindings
+```bash
+# Python
+cd crates/feedparser-rs-py && maturin develop && pytest tests/ -v
+
+# Node.js
+cd crates/feedparser-rs-node && pnpm install && pnpm build && pnpm test
+```
+
+## Testing Patterns
+
+Use `include_str!()` for fixtures in `tests/fixtures/`:
+```rust
+#[test]
+fn test_rss20_basic() {
+    let xml = include_str!("../../tests/fixtures/rss/example.xml");
+    let feed = parse(xml.as_bytes()).unwrap();
+    assert!(!feed.bozo);
+}
+```
+
+Always verify malformed feeds set bozo but still parse:
+```rust
+#[test]
+fn test_malformed_sets_bozo() {
+    let xml = b"<rss><channel><title>Broken</title>"; // Unclosed channel/rss
+    let feed = parse(xml).unwrap();
+    assert!(feed.bozo);
+    assert_eq!(feed.feed.title.as_deref(), Some("Broken")); // Still parsed!
+}
+```
+
+## Security Requirements
+
+### SSRF Protection (CRITICAL for HTTP Module)
+Block these URL patterns before fetching:
+- Localhost/loopback: `127.0.0.1`, `[::1]`, `localhost`
+- Private networks: `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`
+- Link-local: `169.254.0.0/16` (AWS/GCP metadata endpoints), `fe80::/10`
+- Special addresses: `0.0.0.0/8`, `255.255.255.255`, `::/128`
+
+Always validate URLs through `is_safe_url()` before HTTP requests.
+
+### XSS Protection (HTML Sanitization)
+Use `ammonia` for HTML content from feeds:
+- Allowed tags: `a, abbr, b, blockquote, br, code, div, em, h1-h6, hr, i, img, li, ol, p, pre, span, strong, ul`
+- Enforce `rel="nofollow noopener"` on links
+- Allow only `http`, `https`, `mailto` URL schemes
+- Never pass raw HTML to Python/Node.js bindings without sanitization
+
+### DoS Protection
+Apply limits via `ParserLimits`:
+- `max_feed_size`: Default 50MB
+- `max_nesting_depth`: Default 100 levels
+- `max_entries`: Default 10,000 items
+- `max_text_length`: Default 1MB per text field
+- `max_attribute_length`: Default 10KB per attribute
+
+## Code Quality Standards
+
+### Function Length Guidelines
+- **Target**: Functions should be <50 lines
+- **Maximum**: NEVER exceed 100 lines
+- **If >50 lines**: Extract inline logic to helper functions
+
+Example refactoring pattern:
+```rust
+// Before: 200+ line function
+fn parse_channel(...) {
+    match tag {
+        b"itunes:category" => { /* 80 lines inline */ }
+        // ...
+    }
+}
+
+// After: Delegate to helpers
+fn parse_channel(...) {
+    match tag {
+        tag if is_itunes_tag_any(tag) => parse_channel_itunes(tag, ...)?,
+        // ...
+    }
+}
+```
+
+### Documentation Requirements
+All public APIs must have doc comments:
+```rust
+/// Parses an RSS/Atom feed from bytes.
+///
+/// # Arguments
+/// * `data` - Raw feed content as bytes
+///
+/// # Returns
+/// Returns `ParsedFeed` with extracted metadata. If parsing encounters errors,
+/// `bozo` flag is set to `true` and `bozo_exception` contains the error description.
+///
+/// # Examples
+/// ```
+/// let xml = b"...";
+/// let feed = parse(xml)?;
+/// assert_eq!(feed.version, FeedVersion::Rss20);
+/// ```
+pub fn parse(data: &[u8]) -> Result<ParsedFeed, FeedError> { ... }
+```
+
+### Inline Comments
+Minimize inline comments. Use comments ONLY for:
+1. **Why** decisions (not **what** the code does)
+2. Non-obvious constraints or workarounds
+3. References to specifications (RFC 4287 section 4.1.2, etc.)
+
+## Commit & Branch Conventions
+- Branch: `feat/`, `fix/`, `docs/`, `refactor/`, `test/`
+- Commits: [Conventional Commits](https://conventionalcommits.org/)
+- Never mention "Claude" or "co-authored" in commit messages
+
+## What NOT to Do
+- ❌ Don't use `.unwrap()` or `.expect()` in parser code — use bozo pattern
+- ❌ Don't add dependencies without workspace-level declaration in root `Cargo.toml`
+- ❌ Don't skip `--exclude feedparser-rs-py` in workspace-wide Rust commands (PyO3 needs special handling)
+- ❌ Don't break API compatibility with Python feedparser field names
+- ❌ Don't panic on malformed feeds — set `bozo = true` and continue parsing
+- ❌ Don't fetch URLs without SSRF validation (`is_safe_url()`)
+- ❌ Don't pass raw HTML to bindings without sanitization (`sanitize_html()`)
+- ❌ Don't create functions >100 lines — extract helpers
+- ❌ Don't use generic names like `utils`, `helpers`, `common` for modules
+- ❌ Don't add emojis to code or comments
diff --git a/.github/instructions/node-bindings.instructions.md b/.github/instructions/node-bindings.instructions.md
new file mode 100644
index 0000000..357a591
--- /dev/null
+++ b/.github/instructions/node-bindings.instructions.md
@@ -0,0 +1,397 @@
+# Node.js Bindings Code Review Instructions
+
+This file contains specific code review rules for the Node.js bindings in `crates/feedparser-rs-node/`.
+
+## Scope
+
+These instructions apply to:
+- `crates/feedparser-rs-node/src/lib.rs`
+- `crates/feedparser-rs-node/build.rs`
+- Any future files in `crates/feedparser-rs-node/src/`
+
+## Overview
+
+The Node.js bindings use **napi-rs** to expose the Rust core parser to JavaScript/TypeScript. The bindings must provide an ergonomic JavaScript API while maintaining security and performance.
+
+## Critical Rules
+
+### 1. Input Validation (CWE-770)
+
+**ALWAYS validate input size BEFORE processing to prevent DoS attacks.**
+
+```rust
+// CORRECT: Validate size before processing
+#[napi]
+pub fn parse(source: Either<Buffer, String>) -> Result<ParsedFeed> {
+    let input_len = match &source {
+        Either::A(buf) => buf.len(),
+        Either::B(s) => s.len(),
+    };
+
+    if input_len > MAX_FEED_SIZE {
+        return Err(Error::from_reason(format!(
+            "Feed size ({} bytes) exceeds maximum allowed ({} bytes)",
+            input_len, MAX_FEED_SIZE
+        )));
+    }
+
+    // Now safe to process
+    let bytes: &[u8] = match &source {
+        Either::A(buf) => buf.as_ref(),
+        Either::B(s) => s.as_bytes(),
+    };
+    // ...
+}
+```
+
+```rust
+// WRONG: No size validation
+#[napi]
+pub fn parse(source: Either<Buffer, String>) -> Result<ParsedFeed> {
+    let bytes: &[u8] = match &source {
+        Either::A(buf) => buf.as_ref(),
+        Either::B(s) => s.as_bytes(),
+    };
+    // ... immediate processing without size check
+}
+```
+
+### 2. Error Handling
+
+**Use `Error::from_reason()` for user-facing errors with clear messages.**
+
+```rust
+// CORRECT: Clear error message with context
+.map_err(|e| Error::from_reason(format!("Parse error: {}", e)))?;
+
+// WRONG: Generic error
+.map_err(|e| Error::from_reason(e.to_string()))?;
+```
+
+**Never expose internal error details that could aid attackers:**
+
+```rust
+// CORRECT: Safe error message
+return Err(Error::from_reason("Feed size exceeds maximum allowed"));
+
+// WRONG: Exposes internal details
+return Err(Error::from_reason(format!("Internal buffer at {:p}", ptr)));
+```
+
+### 3. NAPI Struct Definitions
+
+**Use `#[napi(object)]` for plain data objects:**
+
+```rust
+// CORRECT: Plain data object
+#[napi(object)]
+pub struct ParsedFeed {
+    pub feed: FeedMeta,
+    pub entries: Vec<Entry>,
+    pub bozo: bool,
+    // ...
+}
+```
+
+**Use `#[napi(js_name = "...")]` for JavaScript naming conventions:**
+
+```rust
+// CORRECT: Use js_name for JavaScript conventions
+#[napi(object)]
+pub struct Link {
+    pub href: String,
+    #[napi(js_name = "type")] // 'type' is reserved in JS, use js_name
+    pub link_type: Option<String>,
+}
+```
+
+**Reserved JavaScript keywords that need `js_name`:**
+- `type` -> use field name like `link_type`, `content_type`, `enclosure_type`
+- `class`, `function`, `var`, `let`, `const` (if ever needed)
+
+### 4. Date/Time Handling
+
+**Convert DateTime to milliseconds since epoch for JavaScript compatibility:**
+
+```rust
+// CORRECT: Milliseconds for JavaScript Date compatibility
+pub updated: Option<i64>,
+
+impl From<CoreFeedMeta> for FeedMeta {
+    fn from(core: CoreFeedMeta) -> Self {
+        Self {
+            updated: core.updated.map(|dt| dt.timestamp_millis()),
+            // ...
+        }
+    }
+}
+```
+
+```rust
+// WRONG: Seconds (JavaScript Date uses milliseconds)
+pub updated: Option<i64>,
+updated: core.updated.map(|dt| dt.timestamp()),
+```
+
+### 5. Type Conversions
+
+**Handle potential overflow in u64 to i64 conversions:**
+
+```rust
+// CORRECT: Safe conversion with fallback
+pub length: Option<i64>,
+length: core.length.map(|l| i64::try_from(l).unwrap_or(i64::MAX)),
+
+// WRONG: Unchecked cast
+length: core.length.map(|l| l as i64),
+```
+
+### 6. Input Type Handling
+
+**Accept both Buffer and String using `Either`:**
+
+```rust
+// CORRECT: Accept both types
+#[napi]
+pub fn parse(source: Either<Buffer, String>) -> Result<ParsedFeed> {
+    let bytes: &[u8] = match &source {
+        Either::A(buf) => buf.as_ref(),
+        Either::B(s) => s.as_bytes(),
+    };
+    // ...
+}
+```
+
+### 7. Feature Flags
+
+**Use conditional compilation for optional features:**
+
+```rust
+// CORRECT: Feature-gated HTTP functionality
+#[cfg(feature = "http")]
+#[napi]
+pub fn parse_url(url: String, ...) -> Result<ParsedFeed> {
+    // ...
+}
+
+// CORRECT: Feature-gated fields
+#[napi(object)]
+pub struct ParsedFeed {
+    #[cfg(feature = "http")]
+    pub headers: Option<HashMap<String, String>>,
+}
+```
+
+### 8. Vector Pre-allocation
+
+**Pre-allocate vectors when converting collections:**
+
+```rust
+// CORRECT: Pre-allocate for better performance
+impl From<CoreEntry> for Entry {
+    fn from(core: CoreEntry) -> Self {
+        let links_cap = core.links.len();
+        let content_cap = core.content.len();
+
+        Self {
+            links: {
+                let mut v = Vec::with_capacity(links_cap);
+                v.extend(core.links.into_iter().map(Link::from));
+                v
+            },
+            content: {
+                let mut v = Vec::with_capacity(content_cap);
+                v.extend(core.content.into_iter().map(Content::from));
+                v
+            },
+            // ...
+        }
+    }
+}
+```
+
+```rust
+// ACCEPTABLE for small collections: Direct collect
+links: core.links.into_iter().map(Link::from).collect(),
+```
+
+### 9. Documentation
+
+**All public functions must have JSDoc-compatible documentation:**
+
+```rust
+// CORRECT: Comprehensive documentation
+/// Parse an RSS/Atom/JSON Feed from bytes or string
+///
+/// # Arguments
+///
+/// * `source` - Feed content as Buffer, string, or Uint8Array
+///
+/// # Returns
+///
+/// Parsed feed result with metadata and entries
+///
+/// # Errors
+///
+/// Returns error if input exceeds size limit or parsing fails catastrophically
+#[napi]
+pub fn parse(source: Either<Buffer, String>) -> Result<ParsedFeed> {
+```
+
+**Include JavaScript examples in documentation:**
+
+```rust
+/// # Examples
+///
+/// ```javascript
+/// const feedparser = require('feedparser-rs');
+///
+/// const feed = await feedparser.parseUrl("https://example.com/feed.xml");
+/// console.log(feed.feed.title);
+/// ```
+```
+
+### 10. Struct Field Documentation
+
+**Document all fields for TypeScript type generation:**
+
+```rust
+#[napi(object)]
+pub struct Entry {
+    /// Unique entry identifier
+    pub id: Option<String>,
+    /// Entry title
+    pub title: Option<String>,
+    /// Publication date (milliseconds since epoch)
+    pub published: Option<i64>,
+    // ...
+}
+```
+
+## API Conventions
+
+### Function Naming
+
+Use camelCase for JavaScript function names (napi-rs does this automatically):
+- Rust: `parse_with_options` -> JS: `parseWithOptions`
+- Rust: `detect_format` -> JS: `detectFormat`
+- Rust: `parse_url` -> JS: `parseUrl`
+
+### Optional Parameters
+
+Use `Option<T>` for optional parameters:
+
+```rust
+#[napi]
+pub fn parse_url(
+    url: String,
+    etag: Option<String>,
+    modified: Option<String>,
+    user_agent: Option<String>,
+) -> Result<ParsedFeed>
+```
+
+### Return Types
+
+- Return `Result<T>` for operations that can fail
+- Return `T` directly for infallible operations
+- Never panic in public functions
+
+## Security Requirements
+
+### 1. URL Validation
+
+For HTTP functions, validate URL schemes:
+
+```rust
+// The core crate handles SSRF protection, but document it
+/// # Security
+///
+/// - Only HTTP and HTTPS URLs are accepted
+/// - Private IP addresses and localhost are blocked
+/// - Redirects follow the same security rules
+```
+
+### 2. Size Limits
+
+Always enforce size limits:
+
+```rust
+const DEFAULT_MAX_FEED_SIZE: usize = 100 * 1024 * 1024; // 100 MB
+
+#[napi]
+pub fn parse_with_options(
+    source: Either<Buffer, String>,
+    max_size: Option<u32>, // Allow user to customize
+) -> Result<ParsedFeed>
+```
+
+### 3. No Unsafe Code
+
+The Node.js bindings should NOT contain any `unsafe` code. All unsafe operations should be in the core crate if necessary.
+
+## Testing Requirements
+
+### 1. Test Both Input Types
+
+```javascript
+// Test with Buffer
+const buf = Buffer.from('...');
+const feed1 = feedparser.parse(buf);
+
+// Test with String
+const str = '...';
+const feed2 = feedparser.parse(str);
+```
+
+### 2. Test Size Limits
+
+```javascript
+const largeFeed = 'x'.repeat(200 * 1024 * 1024);
+expect(() => feedparser.parse(largeFeed)).toThrow(/exceeds maximum/);
+```
+
+### 3. Test Feature Flags
+
+Verify HTTP functions are only available when the `http` feature is enabled.
+
+## Common Review Issues
+
+### Issue: Missing size validation
+**Fix:** Add input length check before processing
+
+### Issue: Using `unwrap()` in public functions
+**Fix:** Use `map_err()` with `Error::from_reason()`
+
+### Issue: Date as seconds instead of milliseconds
+**Fix:** Use `timestamp_millis()` not `timestamp()`
+
+### Issue: Unsafe i64 cast
+**Fix:** Use `i64::try_from(l).unwrap_or(i64::MAX)`
+
+### Issue: Missing documentation
+**Fix:** Add `///` documentation with examples
+
+### Issue: Missing `js_name` for reserved keywords
+**Fix:** Add `#[napi(js_name = "type")]` for fields named `type`
+
+## Integration with Core Crate
+
+The Node.js bindings are a thin wrapper around `feedparser-rs-core`. Rules:
+
+1. **No parsing logic** - All parsing is in the core crate
+2. **Type conversions only** - Convert core types to napi-compatible types
+3. **Error mapping** - Map `FeedError` to napi `Error`
+4. **Feature parity** - Keep in sync with Python bindings features
+
+## Checklist for PRs
+
+- [ ] Input size validated before processing
+- [ ] All public functions have documentation
+- [ ] No `unwrap()` or `expect()` in public functions
+- [ ] Dates converted to milliseconds
+- [ ] Large number conversions are safe (u64 -> i64)
+- [ ] Reserved keywords use `js_name`
+- [ ] Feature flags properly applied
+- [ ] No unsafe code
+- [ ] Tests cover both Buffer and String inputs
diff --git a/.github/instructions/parser.instructions.md b/.github/instructions/parser.instructions.md
new file mode 100644
index 0000000..10bd128
--- /dev/null
+++ b/.github/instructions/parser.instructions.md
@@ -0,0 +1,382 @@
+# Parser Module Instructions
+
+**Applies to:** `crates/feedparser-rs-core/src/parser/**`
+
+## Core Principles
+
+### Tolerant Parsing is MANDATORY
+
+**NEVER panic or return errors for malformed feeds.** The bozo pattern is the foundation of this project:
+
+```rust
+// ✅ CORRECT - Set bozo flag and continue
+match reader.read_event_into(&mut buf) {
+    Ok(Event::Start(e)) => { /* process */ }
+    Err(e) => {
+        feed.bozo = true;
+        feed.bozo_exception = Some(e.to_string());
+        // CONTINUE PARSING - don't return error
+    }
+    _ => {}
+}
+
+// ❌ WRONG - Never panic or abort parsing
+match reader.read_event_into(&mut buf) {
+    Ok(Event::Start(e)) => { /* process */ }
+    Err(e) => return Err(e.into()), // NO! This breaks tolerance
+    _ => {}
+}
+```
+
+**Why?** Real-world feeds are often broken:
+- Missing closing tags
+- Invalid dates
+- Malformed XML
+- Wrong encoding declarations
+- Mixed namespaces
+
+Python feedparser handles all these gracefully. We must too.
+
+## Function Length Rules
+
+### CRITICAL: No function >100 lines
+
+Current technical debt in `parser/rss.rs`:
+- `parse_channel` - 280 lines (needs refactoring)
+- `parse_item` - 328 lines (needs refactoring)
+
+**When writing new parser code:**
+1. Keep functions <50 lines (target)
+2. Never exceed 100 lines (hard limit)
+3. Extract inline parsing to separate functions
+
+### Refactoring Pattern
+
+```rust
+// ✅ GOOD - Delegate to specialized functions
+fn parse_channel(reader: &mut Reader<&[u8]>, feed: &mut ParsedFeed, limits: &ParserLimits, depth: &mut usize) -> Result<()> {
+    loop {
+        match reader.read_event_into(&mut buf) {
+            Ok(Event::Start(e) | Event::Empty(e)) => {
+                match e.name().as_ref() {
+                    tag if is_standard_rss_tag(tag) =>
+                        parse_channel_standard(tag, reader, &mut buf, feed, limits)?,
+                    tag if is_itunes_tag_any(tag) =>
+                        parse_channel_itunes(tag, &e, reader, &mut buf, feed, limits, depth)?,
+                    tag if is_podcast_tag(tag) =>
+                        parse_channel_podcast(tag, &e, reader, &mut buf, feed, limits)?,
+                    _ => skip_element(reader, &mut buf, limits, *depth)?
+                }
+            }
+            Ok(Event::End(e)) if e.local_name().as_ref() == b"channel" => break,
+            Err(e) => {
+                feed.bozo = true;
+                feed.bozo_exception = Some(e.to_string());
+            }
+            _ => {}
+        }
+        buf.clear();
+    }
+    Ok(())
+}
+
+// Helper functions (<50 lines each)
+fn parse_channel_standard(...) -> Result<()> { ... }
+fn parse_channel_itunes(...) -> Result<()> { ... }
+fn parse_channel_podcast(...) -> Result<()> { ...
+}
+```
+
+## quick-xml Usage Patterns
+
+### Reader Configuration
+
+```rust
+let mut reader = Reader::from_reader(data);
+reader.config_mut().trim_text(true);
+// DO NOT enable check_end_names - we need tolerance for mismatched tags
+```
+
+### Event Loop Pattern
+
+```rust
+let mut buf = Vec::with_capacity(EVENT_BUFFER_CAPACITY); // Reuse buffer
+let mut depth: usize = 1;
+
+loop {
+    match reader.read_event_into(&mut buf) {
+        Ok(Event::Start(e)) => {
+            depth += 1;
+            check_depth(depth, limits.max_nesting_depth)?;
+            // Process start tag
+        }
+        Ok(Event::End(e)) => {
+            depth = depth.saturating_sub(1);
+            // Check for terminating tag
+            if e.local_name().as_ref() == b"channel" {
+                break;
+            }
+        }
+        Ok(Event::Text(e)) => {
+            // Extract text content
+            let text = e.unescape().unwrap_or_default();
+        }
+        Ok(Event::Empty(e)) => {
+            // Self-closing tag (e.g., <enclosure/>)
+        }
+        Ok(Event::Eof) => break,
+        Err(e) => {
+            feed.bozo = true;
+            feed.bozo_exception = Some(format!("XML error: {e}"));
+            // Continue parsing if possible
+        }
+        _ => {} // Ignore other events (comments, PI, etc.)
+    }
+    buf.clear(); // Reuse buffer allocation
+}
+```
+
+## Format-Specific Rules
+
+### RSS Parsers (`rss.rs`, `rss10.rs`)
+
+1. **Version Detection**: RSS 0.9x, 1.0, 2.0 have different structures
+   - RSS 2.0: `<rss version="2.0"><channel>...</channel></rss>`
+   - RSS 1.0: `<rdf:RDF><channel>...</channel><item>...</item></rdf:RDF>` (items outside channel)
+   - RSS 0.9x: No version attribute
+
+2. **Date Formats**: Use RFC 2822 parser first, fallback to others
+   ```rust
+   let dt = DateTime::parse_from_rfc2822(date_str)
+       .ok()
+       .or_else(|| DateTime::parse_from_rfc3339(date_str).ok())
+       .map(|d| d.with_timezone(&Utc));
+   ```
+
+3. **Namespace Handling**: Extract iTunes, Dublin Core, Media RSS attributes
+   - Use `is_itunes_tag()`, `is_dc_tag()`, etc. helpers
+   - Delegate to namespace-specific parsers
+
+### Atom Parser (`atom.rs`)
+
+1. **Text Constructs**: Atom has three content types
+   ```rust
+   fn parse_text_construct(element: &BytesStart, reader: &mut Reader<&[u8]>, buf: &mut Vec<u8>) -> TextConstruct {
+       let content_type = element.attributes()
+           .flatten()
+           .find(|a| a.key.as_ref() == b"type")
+           .and_then(|a| a.unescape_value().ok())
+           .map(|v| match v.as_ref() {
+               "html" => TextType::Html,
+               "xhtml" => TextType::Xhtml,
+               _ => TextType::Text,
+           })
+           .unwrap_or(TextType::Text);
+       // ...
+   }
+   ```
+
+2. **Links**: Atom supports multiple link relations
+   - `rel="alternate"` → main link
+   - `rel="enclosure"` → media attachments
+   - `rel="self"` → feed URL
+
+3. **Dates**: Use ISO 8601/RFC 3339 parser first
+
+### JSON Feed Parser (`json.rs`)
+
+1. **Use serde_json**: Already structured, no XML complexity
+2. **Version Detection**: Check `version` field ("https://jsonfeed.org/version/1", "https://jsonfeed.org/version/1.1")
+3. **Bozo Pattern**: Set bozo for missing required fields (title, items)
+
+## Depth Checking (DoS Protection)
+
+Always check nesting depth to prevent stack overflow:
+
+```rust
+fn check_depth(current: usize, max: usize) -> Result<()> {
+    if current > max {
+        return Err(FeedError::InvalidFormat(format!(
+            "XML nesting depth {current} exceeds maximum {max}"
+        )));
+    }
+    Ok(())
+}
+```
+
+## Namespace Detection
+
+Use helpers from `parser/namespace_detection.rs`:
+
+```rust
+if let Some(itunes_element) = is_itunes_tag(tag) {
+    // tag is b"itunes:author" or similar
+    let itunes = feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default);
+    // Process iTunes-specific field
+}
+
+if let Some(dc_element) = is_dc_tag(tag) {
+    // Dublin Core namespace
+    dublin_core::handle_feed_element(&dc_element, &text, &mut feed.feed);
+}
+```
+
+## Text Extraction Pattern
+
+```rust
+fn read_text(reader: &mut Reader<&[u8]>, buf: &mut Vec<u8>, limits: &ParserLimits) -> Result<String> {
+    let mut text = String::with_capacity(TEXT_BUFFER_CAPACITY);
+
+    loop {
+        match reader.read_event_into(buf) {
+            Ok(Event::Text(e)) => {
+                append_bytes(&mut text, e.as_ref(), limits.max_text_length)?;
+            }
+            Ok(Event::CData(e)) => {
+                append_bytes(&mut text, e.as_ref(), limits.max_text_length)?;
+            }
+            Ok(Event::End(_) | Event::Eof) => break,
+            Err(e) => return Err(e.into()),
+            _ => {}
+        }
+        buf.clear();
+    }
+
+    Ok(text)
+}
+```
+
+## Date Parsing Delegation
+
+Never inline date parsing. Use `util/date.rs`:
+
+```rust
+use crate::util::date::parse_date;
+
+// ✅ CORRECT
+match parse_date(&text) {
+    Some(dt) => entry.published = Some(dt),
+    None if !text.is_empty() => {
+        feed.bozo = true;
+        feed.bozo_exception = Some("Invalid date format".to_string());
+    }
+    None => {} // Empty text, no error
+}
+
+// ❌ WRONG - Inline date parsing duplicates logic
+let dt = DateTime::parse_from_rfc3339(&text).ok(); // Misses other formats
+```
+
+## Testing Requirements
+
+Every parser function must have:
+1. **Basic test**: Well-formed feed
+2. **Malformed test**: Broken feed sets bozo but still parses
+3. **Edge case tests**: Empty fields, missing required fields, excessive nesting
+
+```rust
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_rss20_valid() {
+        let xml = include_str!("../../../tests/fixtures/rss/basic.xml");
+        let feed = parse_rss20(xml.as_bytes()).unwrap();
+        assert!(!feed.bozo);
+        assert_eq!(feed.version, FeedVersion::Rss20);
+    }
+
+    #[test]
+    fn test_rss20_malformed_sets_bozo() {
+        let xml = b"<rss><channel><title>Test</channel></rss>"; // Missing </title>
+        let feed = parse_rss20(xml).unwrap();
+        assert!(feed.bozo);
+        assert!(feed.bozo_exception.is_some());
+        assert_eq!(feed.feed.title.as_deref(), Some("Test")); // Still extracted!
+    }
+
+    #[test]
+    fn test_rss20_excessive_nesting() {
+        let xml = b"..."; // 100+ levels
+        let result = parse_rss20(xml);
+        assert!(result.is_err() || result.unwrap().bozo);
+    }
+}
+```
+
+## Performance Considerations
+
+1. **Reuse buffers**: `Vec::with_capacity()` + `clear()`, not repeated allocations
+2. **Avoid clones in hot paths**: Use references where possible
+3. **Bounded collections**: Apply limits via `try_push_limited()` helper
+4. **Early termination**: Stop parsing after `max_entries` reached (set bozo flag)
+
+```rust
+// ✅ GOOD - Reuse buffer
+let mut buf = Vec::with_capacity(EVENT_BUFFER_CAPACITY);
+loop {
+    reader.read_event_into(&mut buf)?;
+    // Process
+    buf.clear(); // Reuse allocation
+}
+
+// ❌ BAD - Allocate every iteration
+loop {
+    let mut buf = Vec::new(); // New heap allocation each time
+    reader.read_event_into(&mut buf)?;
+}
+```
+
+## Common Pitfalls
+
+### Don't Skip Elements Without Checking Depth
+
+```rust
+// ✅ CORRECT
+skip_element(reader, buf, limits, depth)?;
+
+// ❌ WRONG - Depth not checked, could overflow stack
+loop {
+    match reader.read_event_into(buf) {
+        Ok(Event::End(_)) => break,
+        _ => {}
+    }
+}
+```
+
+### Don't Use Panic-Happy Methods
+
+```rust
+// ❌ WRONG
+let value = attributes.find(|a| a.key.as_ref() == b"href").unwrap();
+
+// ✅ CORRECT
+if let Some(attr) = attributes.find(|a| a.key.as_ref() == b"href") {
+    if let Ok(value) = attr.unescape_value() {
+        // Use value
+    }
+}
+```
+
+### Don't Ignore Limits
+
+```rust
+// ❌ WRONG - Unbounded growth
+feed.entries.push(entry);
+
+// ✅ CORRECT - Bounded with error handling
+if feed.entries.is_at_limit(limits.max_entries) {
+    feed.bozo = true;
+    feed.bozo_exception = Some(format!("Entry limit exceeded: {}", limits.max_entries));
+    skip_element(reader, buf, limits, depth)?;
+} else {
+    feed.entries.push(entry);
+}
+```
+
+## References
+
+- RSS 2.0 Spec: https://www.rssboard.org/rss-specification
+- RSS 1.0 Spec: https://web.resource.org/rss/1.0/spec
+- Atom 1.0 (RFC 4287): https://www.rfc-editor.org/rfc/rfc4287
+- JSON Feed: https://www.jsonfeed.org/version/1.1/
+- quick-xml docs: https://docs.rs/quick-xml/latest/quick_xml/
diff --git a/.github/instructions/python-bindings.instructions.md b/.github/instructions/python-bindings.instructions.md
new file mode 100644
index 0000000..754eb01
--- /dev/null
+++ b/.github/instructions/python-bindings.instructions.md
@@ -0,0 +1,581 @@
+# Python Bindings Instructions
+
+**Applies to:** `crates/feedparser-rs-py/**`
+
+## Mission-Critical: API Compatibility
+
+These bindings MUST be a drop-in replacement for Python's `feedparser` library. Every function, class, attribute, and return type must match exactly.
+
+**Target API:**
+```python
+import feedparser
+
+# Parse from various sources
+d = feedparser.parse('https://example.com/feed.xml')
+d = feedparser.parse(open('feed.xml').read())
+d = feedparser.parse(b'...')
+
+# Access fields (these names are MANDATORY)
+d.version          # 'rss20', 'atom10', etc.
+d.bozo             # True/False
+d.bozo_exception   # String or None
+d.encoding         # 'utf-8', etc.
+d.feed.title       # Feed title
+d.feed.link        # Feed link
+d.entries[0].title            # Entry title
+d.entries[0].published_parsed # time.struct_time (NOT DateTime!)
+```
+
+## PyO3 Fundamentals
+
+### Module Setup
+
+**Located in:** `src/lib.rs`
+
+```rust
+use pyo3::prelude::*;
+
+#[pymodule]
+fn feedparser_rs(m: &Bound<'_, PyModule>) -> PyResult<()> {
+    m.add_function(wrap_pyfunction!(parse, m)?)?;
+    m.add_function(wrap_pyfunction!(parse_url, m)?)?;
+    m.add_class::<PyParsedFeed>()?;
+    m.add_class::<PyFeedMeta>()?;
+    m.add_class::<PyEntry>()?;
+    // ... other classes
+    Ok(())
+}
+```
+
+**Module name MUST be `feedparser_rs`** (matches PyPI package name)
+
+### Main Parse Function
+
+```rust
+/// Parse an RSS/Atom feed from bytes, string, or URL
+#[pyfunction]
+#[pyo3(signature = (source, /))]
+pub fn parse(py: Python<'_>, source: &Bound<'_, PyAny>) -> PyResult<PyParsedFeed> {
+    let data: Vec<u8> = if let Ok(s) = source.extract::<String>() {
+        // String - could be URL or XML content
+        if s.starts_with("http://") || s.starts_with("https://") {
+            // HTTP fetching (if feature enabled)
+            #[cfg(feature = "http")]
+            {
+                return parse_url_impl(py, &s);
+            }
+            #[cfg(not(feature = "http"))]
+            {
+                return Err(PyNotImplementedError::new_err(
+                    "URL fetching not enabled. Install with 'pip install feedparser-rs[http]'"
+                ));
+            }
+        }
+        s.into_bytes()
+    } else if let Ok(b) = source.extract::<Vec<u8>>() {
+        b
+    } else {
+        return Err(PyTypeError::new_err("source must be str or bytes"));
+    };
+
+    let result = feedparser_rs_core::parse(&data)
+        .map_err(|e| PyValueError::new_err(e.to_string()))?;
+
+    Ok(PyParsedFeed::from(result))
+}
+```
+
+**Rules:**
+1. Accept `str` (URL or XML) and `bytes`
+2. Return `PyResult` (never panic)
+3. Use `PyValueError` for parsing errors (not `RuntimeError`)
+
+## Python Class Mapping
+
+### ParsedFeed (FeedParserDict)
+
+**Located in:** `src/types/parsed_feed.rs`
+
+```rust
+/// Main parsing result (equivalent to feedparser.FeedParserDict)
+#[pyclass(name = "FeedParserDict")]
+#[derive(Clone)]
+pub struct PyParsedFeed {
+    inner: Arc<CoreParsedFeed>, // Use Arc for cheap clones
+}
+
+#[pymethods]
+impl PyParsedFeed {
+    #[getter]
+    fn feed(&self) -> PyFeedMeta {
+        PyFeedMeta {
+            inner: Arc::clone(&self.inner.feed),
+        }
+    }
+
+    #[getter]
+    fn entries(&self) -> Vec<PyEntry> {
+        self.inner
+            .entries
+            .iter()
+            .map(|e| PyEntry {
+                inner: Arc::new(e.clone()),
+            })
+            .collect()
+    }
+
+    #[getter]
+    fn bozo(&self) -> bool {
+        self.inner.bozo
+    }
+
+    #[getter]
+    fn bozo_exception(&self) -> Option<String> {
+        self.inner.bozo_exception.clone()
+    }
+
+    #[getter]
+    fn encoding(&self) -> &str {
+        &self.inner.encoding
+    }
+
+    #[getter]
+    fn version(&self) -> &str {
+        self.inner.version.as_str() // Returns "rss20", "atom10", etc.
+    }
+
+    #[getter]
+    fn namespaces(&self) -> HashMap<String, String> {
+        self.inner.namespaces.clone()
+    }
+
+    // Python repr for debugging
+    fn __repr__(&self) -> String {
+        format!(
+            "FeedParserDict(version={:?}, bozo={}, entries={})",
+            self.version(),
+            self.bozo(),
+            self.entries().len()
+        )
+    }
+}
+```
+
+**CRITICAL**: Class name MUST be `"FeedParserDict"` (matches Python feedparser)
+
+### FeedMeta
+
+**Located in:** `src/types/feed_meta.rs`
+
+```rust
+#[pyclass]
+#[derive(Clone)]
+pub struct PyFeedMeta {
+    inner: Arc<CoreFeedMeta>,
+}
+
+#[pymethods]
+impl PyFeedMeta {
+    #[getter]
+    fn title(&self) -> Option<&str> {
+        self.inner.title.as_deref()
+    }
+
+    #[getter]
+    fn link(&self) -> Option<&str> {
+        self.inner.link.as_deref()
+    }
+
+    #[getter]
+    fn subtitle(&self) -> Option<&str> {
+        self.inner.subtitle.as_deref()
+    }
+
+    #[getter]
+    fn language(&self) -> Option<&str> {
+        self.inner.language.as_deref()
+    }
+
+    // ... other getters
+}
+```
+
+**Rules:**
+1. All getters return Python-compatible types (`Option<&str>`, not `Option<String>`)
+2. Use `as_deref()` for `Option<String>` → `Option<&str>` conversion
+3. Clone only when necessary (prefer references)
+
+## Date Conversion (CRITICAL)
+
+### time.struct_time Requirement
+
+Python feedparser returns `time.struct_time` for `*_parsed` fields. This is MANDATORY for compatibility.
+
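Before looking at the Rust side, it helps to pin down the Python contract being matched. A minimal sketch of the 9-tuple the binding must produce (the sample date is hypothetical; only stdlib `time`/`calendar` are used):

```python
import calendar
import time

# The binding emits the same 9-tuple Python's time.struct_time expects:
# (tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, tm_isdst)
# Hypothetical value for 2024-01-15T10:30:00Z (a Monday, day 15 of the year):
parsed = time.struct_time((2024, 1, 15, 10, 30, 0, 0, 15, 0))

assert parsed.tm_year == 2024
assert parsed.tm_wday == 0  # Monday=0, matching chrono's num_days_from_monday()
# calendar.timegm treats the struct as UTC, which is how feedparser consumers
# typically convert *_parsed fields back to an epoch timestamp
assert calendar.timegm(parsed) == 1705314600
```

Anything that round-trips through `calendar.timegm`/`time.gmtime` this way behaves identically to Python feedparser's `*_parsed` fields.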
+
+```rust
+use pyo3::types::PyTuple;
+use chrono::{Datelike, Timelike};
+
+#[pymethods]
+impl PyEntry {
+    #[getter]
+    fn published(&self) -> Option<String> {
+        self.inner.published.map(|dt| dt.to_rfc3339())
+    }
+
+    #[getter]
+    fn published_parsed(&self, py: Python<'_>) -> PyResult<Option<PyObject>> {
+        match &self.inner.published {
+            Some(dt) => {
+                let time_module = py.import_bound("time")?;
+                let struct_time = time_module.getattr("struct_time")?;
+
+                let tuple = PyTuple::new_bound(
+                    py,
+                    &[
+                        dt.year(),
+                        dt.month() as i32,
+                        dt.day() as i32,
+                        dt.hour() as i32,
+                        dt.minute() as i32,
+                        dt.second() as i32,
+                        dt.weekday().num_days_from_monday() as i32,
+                        dt.ordinal() as i32,
+                        -1, // tm_isdst (daylight saving time flag, -1 = unknown)
+                    ],
+                );
+
+                Ok(Some(struct_time.call1((tuple,))?.into()))
+            }
+            None => Ok(None),
+        }
+    }
+}
+```
+
+**CRITICAL**:
+- `published` → RFC 3339 string
+- `published_parsed` → `time.struct_time` tuple
+- Apply the same pattern to the `updated`, `created`, and `expired` fields
+
+### time.struct_time Format
+
+```python
+# Python time.struct_time fields
+(
+    tm_year,   # e.g., 2024
+    tm_mon,    # 1-12
+    tm_mday,   # 1-31
+    tm_hour,   # 0-23
+    tm_min,    # 0-59
+    tm_sec,    # 0-59
+    tm_wday,   # 0-6 (Monday=0)
+    tm_yday,   # 1-366 (day of year)
+    tm_isdst,  # 0, 1, or -1
+)
+```
+
+**Chrono mapping:**
+- `dt.year()` → `tm_year`
+- `dt.month()` → `tm_mon`
+- `dt.day()` → `tm_mday`
+- `dt.hour()` → `tm_hour`
+- `dt.minute()` → `tm_min`
+- `dt.second()` → `tm_sec`
+- `dt.weekday().num_days_from_monday()` → `tm_wday` (Monday=0)
+- `dt.ordinal()` → `tm_yday` (day of year)
+- `0` or `-1` → `tm_isdst` (DST flag, use -1 for unknown)
+
+## Error Handling
+
+### Exception Mapping
+
+```rust
+use pyo3::exceptions::{PyTypeError, PyValueError, PyNotImplementedError};
+
+// Type errors (wrong input type)
+if source.is_none() {
+    return Err(PyTypeError::new_err("source must be str or bytes"));
+}
+
+// Value errors (malformed content)
+let result = feedparser_rs_core::parse(&data)
+    .map_err(|e| PyValueError::new_err(e.to_string()))?;
+
+// 
Not implemented (missing feature)
+#[cfg(not(feature = "http"))]
+{
+    return Err(PyNotImplementedError::new_err("URL fetching not enabled"));
+}
+```
+
+**Rules:**
+1. Use `PyTypeError` for wrong Python types
+2. Use `PyValueError` for parsing/content errors
+3. Use `PyNotImplementedError` for unimplemented features
+4. NEVER panic (always return `PyResult`)
+
+## Memory Management
+
+### Use Arc for Shared Ownership
+
+```rust
+use std::sync::Arc;
+
+#[pyclass]
+pub struct PyEntry {
+    inner: Arc<Entry>, // Shared ownership, cheap clones
+}
+
+// Cloning is O(1)
+let cloned = PyEntry {
+    inner: Arc::clone(&original.inner),
+};
+```
+
+**Why?** Python objects can be shared/copied freely. Using `Arc` avoids expensive deep clones.
+
+### Avoid Lifetime Parameters
+
+```rust
+// ❌ WRONG - Lifetimes don't work with PyO3
+#[pyclass]
+pub struct PyEntry<'a> {
+    inner: &'a Entry,
+}
+
+// ✅ CORRECT - Owned data
+#[pyclass]
+pub struct PyEntry {
+    inner: Arc<Entry>,
+}
+```
+
+## Testing Python Bindings
+
+### pytest Setup
+
+**Located in:** `tests/test_basic.py`
+
+```python
+import pytest
+import feedparser_rs
+
+def test_parse_rss20():
+    xml = b"""
+    <rss version="2.0">
+      <channel>
+        <title>Test Feed</title>
+        <link>http://example.com</link>
+      </channel>
+    </rss>
+    """
+
+    feed = feedparser_rs.parse(xml)
+
+    assert feed.version == "rss20"
+    assert not feed.bozo
+    assert feed.feed.title == "Test Feed"
+    assert feed.feed.link == "http://example.com"
+
+def test_bozo_flag_on_malformed():
+    xml = b"<rss><channel><title>Broken</channel></rss>"  # Missing </title>
+
+    feed = feedparser_rs.parse(xml)
+
+    assert feed.bozo is True
+    assert feed.bozo_exception is not None
+    assert feed.feed.title == "Broken"  # Still parsed!
+
+def test_published_parsed_returns_struct_time():
+    xml = b"""<rss version="2.0"><channel><item>
+        <pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate>
+    </item></channel></rss>"""
+
+    feed = feedparser_rs.parse(xml)
+    entry = feed.entries[0]
+
+    assert entry.published_parsed is not None
+    assert entry.published_parsed.tm_year == 2024
+    assert entry.published_parsed.tm_mon == 1
+    assert entry.published_parsed.tm_mday == 1
+```
+
+### Compatibility Tests
+
+Test against actual Python feedparser:
+
+```python
+import feedparser  # Original library
+import feedparser_rs
+
+def test_compatibility_with_feedparser():
+    xml = open("tests/fixtures/rss/example.xml").read()
+
+    # Parse with both libraries
+    fp_result = feedparser.parse(xml)
+    fprs_result = feedparser_rs.parse(xml)
+
+    # Compare results
+    assert fprs_result.version == fp_result.version
+    assert fprs_result.feed.title == fp_result.feed.title
+    assert len(fprs_result.entries) == len(fp_result.entries)
+```
+
+## Performance Optimization
+
+### Release GIL for CPU-Intensive Work
+
+```rust
+use pyo3::Python;
+
+#[pyfunction]
+pub fn parse(py: Python<'_>, source: &Bound<'_, PyAny>) -> PyResult<PyParsedFeed> {
+    let data = extract_bytes(source)?;
+
+    // Release GIL during parsing (CPU-intensive)
+    let result = py.allow_threads(|| feedparser_rs_core::parse(&data))
+        .map_err(|e| PyValueError::new_err(e.to_string()))?;
+
+    Ok(PyParsedFeed::from(result))
+}
+```
+
+**Why?** Parsing is CPU-bound. Releasing the GIL allows other Python threads to run.
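+To illustrate the effect, callers can fan parsing out over a thread pool once the GIL is released. The sketch below is self-contained: `fake_parse` is a hypothetical stand-in for `feedparser_rs.parse`, so it shows only the call pattern, not a real parallel speedup.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+def fake_parse(data: bytes) -> dict:
+    # Hypothetical stand-in for feedparser_rs.parse(data);
+    # the real function releases the GIL while parsing.
+    return {"bozo": False, "size": len(data)}
+
+feeds = [b"<rss version='2.0'><channel></channel></rss>"] * 8
+
+# Threads overlap inside parse() because the GIL is released there.
+with ThreadPoolExecutor(max_workers=4) as pool:
+    results = list(pool.map(fake_parse, feeds))
+
+assert len(results) == 8
+assert all(not r["bozo"] for r in results)
+```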
+
+### Lazy Conversion
+
+Don't convert Rust → Python types eagerly:
+
+```rust
+// ❌ BAD - Deep-clone every entry immediately
+#[getter]
+fn entries(&self) -> Vec<PyEntry> {
+    self.inner.entries.iter().map(|e| PyEntry::from(e)).collect()
+}
+
+// ✅ GOOD - Still eager, but Arc makes later clones cheap
+#[getter]
+fn entries(&self) -> Vec<PyEntry> {
+    self.inner.entries.iter().map(|e| PyEntry {
+        inner: Arc::new(e.clone())
+    }).collect()
+}
+```
+
+## maturin Configuration
+
+**Located in:** `pyproject.toml`
+
+```toml
+[build-system]
+requires = ["maturin>=1.7,<2.0"]
+build-backend = "maturin"
+
+[project]
+name = "feedparser-rs"
+version = "0.1.8"
+description = "High-performance RSS/Atom feed parser (drop-in feedparser replacement)"
+readme = "README.md"
+requires-python = ">=3.9"
+classifiers = [
+    "Programming Language :: Rust",
+    "Programming Language :: Python :: 3",
+]
+
+[tool.maturin]
+features = ["pyo3/extension-module"]
+module-name = "feedparser_rs"  # Must match #[pymodule] name
+```
+
+## Development Workflow
+
+```bash
+# Install development environment
+cd crates/feedparser-rs-py
+python -m venv .venv
+source .venv/bin/activate  # or .venv\Scripts\activate on Windows
+pip install maturin pytest
+
+# Build and install locally
+maturin develop --release
+
+# Run tests
+pytest tests/ -v
+
+# Build wheel
+maturin build --release
+```
+
+## Common Pitfalls
+
+### Don't Return Rust Types Directly
+
+```rust
+// ❌ WRONG - Can't return Rust type to Python
+#[pyfunction]
+fn parse(data: &[u8]) -> ParsedFeed {
+    feedparser_rs_core::parse(data).unwrap()
+}
+
+// ✅ CORRECT - Wrap in PyO3 class
+#[pyfunction]
+fn parse(data: &[u8]) -> PyResult<PyParsedFeed> {
+    let result = feedparser_rs_core::parse(data)
+        .map_err(|e| PyValueError::new_err(e.to_string()))?;
+    Ok(PyParsedFeed::from(result))
+}
+```
+
+### Don't Use .unwrap() or .expect()
+
+```rust
+// ❌ WRONG - Panics crash Python
+let dt = parse_date(&text).unwrap();
+
+// ✅ CORRECT - Return PyResult
+let dt = 
parse_date(&text)
+    .ok_or_else(|| PyValueError::new_err("Invalid date"))?;
+```
+
+### Don't Forget #[getter] Attribute
+
+```rust
+// ❌ WRONG - Not accessible from Python
+#[pymethods]
+impl PyEntry {
+    fn title(&self) -> Option<&str> {  // Missing #[getter]
+        self.inner.title.as_deref()
+    }
+}
+
+// ✅ CORRECT
+#[pymethods]
+impl PyEntry {
+    #[getter]
+    fn title(&self) -> Option<&str> {
+        self.inner.title.as_deref()
+    }
+}
+```
+
+### Don't Break Python's __repr__
+
+```rust
+// ✅ GOOD - Provide helpful __repr__
+#[pymethods]
+impl PyEntry {
+    fn __repr__(&self) -> String {
+        format!(
+            "Entry(title={:?}, link={:?})",
+            self.inner.title,
+            self.inner.link
+        )
+    }
+}
+```
+
+## References
+
+- PyO3 documentation: https://pyo3.rs/
+- Python feedparser API: https://feedparser.readthedocs.io/
+- maturin guide: https://www.maturin.rs/
+- time.struct_time: https://docs.python.org/3/library/time.html#time.struct_time
diff --git a/.github/instructions/tests.instructions.md b/.github/instructions/tests.instructions.md
new file mode 100644
index 0000000..b6420f0
--- /dev/null
+++ b/.github/instructions/tests.instructions.md
@@ -0,0 +1,543 @@
+# Testing Guidelines
+
+**Applies to:** `tests/**`, `crates/**/tests/**`, `crates/**/benches/**`
+
+## Testing Philosophy
+
+### Test Pyramid
+
+```
+        /\
+       /  \      E2E (Cross-language integration)
+      /____\
+     /      \    Integration (Full parser tests)
+    /________\
+   /          \  Unit (Individual functions)
+  /____________\
+```
+
+**Distribution:**
+- 70% Unit tests (fast, focused)
+- 25% Integration tests (real feeds)
+- 5% E2E tests (Python/Node.js bindings)
+
+### Tolerant Parsing Validation
+
+**Every parser test must verify bozo flag behavior:**
+
+```rust
+#[test]
+fn test_malformed_sets_bozo() {
+    let xml = b"<rss><channel><title>Test</channel></rss>"; // Missing </title>
+    let feed = parse(xml).unwrap();
+
+    assert!(feed.bozo, "Bozo flag should be set for malformed XML");
+    assert!(feed.bozo_exception.is_some(), "Should have error description");
+    
assert_eq!(feed.feed.title.as_deref(), Some("Test"), "Should still parse partial data");
+}
+```
+
+**Why?** Tolerant parsing is our core value proposition. Tests must verify we continue parsing after errors.
+
+## Test Organization
+
+### Unit Tests (In-Module)
+
+**Located in:** Same file as implementation (`#[cfg(test)] mod tests`)
+
+```rust
+// crates/feedparser-rs-core/src/util/date.rs
+
+use chrono::{DateTime, Datelike, Utc};
+
+pub fn parse_date(input: &str) -> Option<DateTime<Utc>> {
+    // Implementation
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_parse_date_rfc3339() {
+        let date = parse_date("2024-01-15T12:00:00Z");
+        assert!(date.is_some());
+        let dt = date.unwrap();
+        assert_eq!(dt.year(), 2024);
+    }
+
+    #[test]
+    fn test_parse_date_rfc2822() {
+        let date = parse_date("Mon, 15 Jan 2024 12:00:00 GMT");
+        assert!(date.is_some());
+    }
+
+    #[test]
+    fn test_parse_date_invalid_returns_none() {
+        let date = parse_date("not a date");
+        assert!(date.is_none(), "Invalid dates should return None, not panic");
+    }
+}
+```
+
+**Rules:**
+1. Test happy path AND error cases
+2. Verify None/Err returns (no panic)
+3. 
Keep tests focused (one assertion per test when possible) + +### Integration Tests + +**Located in:** `crates/feedparser-rs-core/tests/*.rs` + +```rust +// tests/integration_tests.rs + +use feedparser_rs_core::{parse, FeedVersion}; + +#[test] +fn test_parse_rss20_full() { + let xml = include_str!("fixtures/rss/full_example.xml"); + let feed = parse(xml.as_bytes()).unwrap(); + + assert_eq!(feed.version, FeedVersion::Rss20); + assert!(!feed.bozo); + + // Validate feed metadata + assert_eq!(feed.feed.title.as_deref(), Some("Example Feed")); + assert_eq!(feed.feed.link.as_deref(), Some("http://example.com")); + + // Validate entries + assert_eq!(feed.entries.len(), 3); + assert_eq!(feed.entries[0].title.as_deref(), Some("First Entry")); +} +``` + +**Use fixtures:** Store test feeds in `tests/fixtures/` + +## Test Fixtures Organization + +``` +tests/fixtures/ +├── rss/ +│ ├── rss20_basic.xml # Minimal valid RSS 2.0 +│ ├── rss20_full.xml # All optional fields +│ ├── rss20_podcast.xml # iTunes namespace +│ └── malformed/ +│ ├── missing_closing_tag.xml +│ ├── invalid_date.xml +│ └── unknown_namespace.xml +├── atom/ +│ ├── atom10_basic.xml +│ ├── atom10_full.xml +│ └── malformed/ +├── json/ +│ ├── json11_basic.json +│ └── malformed/ +└── real_world/ # Actual feeds from production + ├── github_commits.atom + ├── nytimes_rss.xml + └── podcast_serial.xml +``` + +### Fixture Loading Pattern + +```rust +#[test] +fn test_rss20_basic() { + let xml = include_str!("fixtures/rss/rss20_basic.xml"); + let feed = parse(xml.as_bytes()).unwrap(); + // Assertions... 
+}
+```
+
+**Why `include_str!`?**
+- Embedded at compile time (no runtime file I/O)
+- Tests work without filesystem dependencies
+- Fast execution
+
+## Test Naming Conventions
+
+```rust
+// Pattern: test_<function>_<scenario>_<expected_outcome>
+
+#[test]
+fn test_parse_rss20_valid_succeeds() { }
+
+#[test]
+fn test_parse_atom_malformed_sets_bozo() { }
+
+#[test]
+fn test_sanitize_html_removes_script_tags() { }
+
+#[test]
+fn test_parse_date_rfc3339_returns_datetime() { }
+
+#[test]
+fn test_parse_date_invalid_returns_none() { }
+```
+
+**Rules:**
+1. Start with `test_`
+2. Include function/module name
+3. Describe scenario
+4. State expected outcome
+
+## Bozo Flag Testing Patterns
+
+### Pattern 1: Malformed XML
+
+```rust
+#[test]
+fn test_missing_closing_tag_sets_bozo() {
+    let xml = b"<rss><channel><title>Test</channel></rss>";
+    let feed = parse(xml).unwrap();
+
+    assert!(feed.bozo);
+    assert!(feed.bozo_exception.as_ref().unwrap().contains("missing"));
+    assert_eq!(feed.feed.title.as_deref(), Some("Test"));
+}
+```
+
+### Pattern 2: Invalid Data Type
+
+```rust
+#[test]
+fn test_invalid_date_sets_bozo() {
+    let xml = b"<rss version=\"2.0\"><channel>
+        <pubDate>not a date</pubDate>
+    </channel></rss>";
+    let feed = parse(xml).unwrap();
+
+    assert!(feed.bozo);
+    assert!(feed.bozo_exception.as_ref().unwrap().contains("date"));
+}
+```
+
+### Pattern 3: Limit Exceeded
+
+```rust
+#[test]
+fn test_excessive_nesting_sets_bozo() {
+    // Generate deeply nested XML (>100 levels)
+    let xml = generate_deep_xml(150);
+    let feed = parse(xml.as_bytes()).unwrap();
+
+    assert!(feed.bozo);
+    assert!(feed.bozo_exception.as_ref().unwrap().contains("nesting"));
+}
+```
+
+## Namespace Testing
+
+Test namespace handlers independently:
+
+```rust
+#[test]
+fn test_itunes_category_parsing() {
+    let xml = br#"<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
+        <channel>
+            <itunes:category text="Technology">
+                <itunes:category text="Software" />
+            </itunes:category>
+        </channel>
+    </rss>"#;
+
+    let feed = 
parse(xml).unwrap();
+    let itunes = feed.feed.itunes.as_ref().unwrap();
+
+    assert_eq!(itunes.categories.len(), 1);
+    assert_eq!(itunes.categories[0].text, "Technology");
+    assert_eq!(itunes.categories[0].subcategory.as_deref(), Some("Software"));
+}
+```
+
+## Python Binding Tests
+
+**Located in:** `crates/feedparser-rs-py/tests/*.py`
+
+### pytest Setup
+
+```python
+# tests/test_basic.py
+
+import pytest
+import feedparser_rs
+
+def test_parse_bytes():
+    xml = b'<rss version="2.0"><channel><title>Test</title></channel></rss>'
+    feed = feedparser_rs.parse(xml)
+    assert feed.feed.title == "Test"
+
+def test_parse_string():
+    xml = '<rss version="2.0"><channel><title>Test</title></channel></rss>'
+    feed = feedparser_rs.parse(xml)
+    assert feed.feed.title == "Test"
+
+def test_bozo_flag():
+    xml = b'<rss><channel><title>Broken</channel></rss>'  # Missing </title>
+    feed = feedparser_rs.parse(xml)
+    assert feed.bozo is True
+    assert feed.bozo_exception is not None
+```
+
+### Compatibility Tests with feedparser
+
+```python
+# tests/test_compatibility.py
+
+import feedparser
+import feedparser_rs
+import pytest
+
+@pytest.mark.parametrize("fixture", [
+    "tests/fixtures/rss/rss20_basic.xml",
+    "tests/fixtures/atom/atom10_basic.xml",
+])
+def test_compatibility_with_feedparser(fixture):
+    with open(fixture) as f:
+        xml = f.read()
+
+    fp = feedparser.parse(xml)
+    fprs = feedparser_rs.parse(xml)
+
+    # Compare key fields
+    assert fprs.version == fp.version
+    assert fprs.feed.title == fp.feed.title
+    assert len(fprs.entries) == len(fp.entries)
+
+    if fprs.entries:
+        assert fprs.entries[0].title == fp.entries[0].title
+```
+
+### time.struct_time Testing
+
+```python
+import time
+
+def test_published_parsed_returns_struct_time():
+    xml = b'''<rss version="2.0"><channel><item>
+        <pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate>
+    </item></channel></rss>'''
+
+    feed = feedparser_rs.parse(xml)
+    entry = feed.entries[0]
+
+    # Verify it's a time.struct_time
+    assert isinstance(entry.published_parsed, time.struct_time)
+
+    # Verify values
+    assert entry.published_parsed.tm_year == 2024
+    assert 
entry.published_parsed.tm_mon == 1
+    assert entry.published_parsed.tm_mday == 1
+    assert entry.published_parsed.tm_hour == 12
+```
+
+## Benchmarking
+
+**Located in:** `crates/feedparser-rs-core/benches/parsing.rs`
+
+```rust
+use criterion::{black_box, criterion_group, criterion_main, Criterion};
+use feedparser_rs_core::parse;
+
+fn bench_rss20_small(c: &mut Criterion) {
+    let xml = include_str!("../tests/fixtures/rss/rss20_basic.xml");
+    c.bench_function("rss20_small", |b| {
+        b.iter(|| parse(black_box(xml.as_bytes())))
+    });
+}
+
+fn bench_rss20_large(c: &mut Criterion) {
+    let xml = include_str!("../tests/fixtures/rss/rss20_1000_entries.xml");
+    c.bench_function("rss20_large", |b| {
+        b.iter(|| parse(black_box(xml.as_bytes())))
+    });
+}
+
+criterion_group!(benches, bench_rss20_small, bench_rss20_large);
+criterion_main!(benches);
+```
+
+### Running Benchmarks
+
+```bash
+# Run all benchmarks
+cargo make bench
+
+# Run specific benchmark
+cargo bench --bench parsing
+```
+
+## Edge Case Testing Checklist
+
+### XML Edge Cases
+- [ ] Empty elements: `<title></title>` vs `<title/>`
+- [ ] CDATA sections: `<![CDATA[content]]>`
+- [ ] HTML entities: `&lt;`, `&gt;`, `&amp;`
+- [ ] Numeric entities: `&#39;`, `&#x27;`
+- [ ] Nested elements (100+ levels)
+- [ ] Very long text (>1MB)
+- [ ] Mixed encodings
+- [ ] BOM markers
+
+### Date Edge Cases
+- [ ] RFC 3339: `2024-01-15T12:00:00Z`
+- [ ] RFC 2822: `Mon, 15 Jan 2024 12:00:00 GMT`
+- [ ] No timezone: `2024-01-15T12:00:00`
+- [ ] Invalid dates: `2024-13-45` (month 13, day 45)
+- [ ] Empty date strings
+- [ ] Whitespace-only dates
+
+### Namespace Edge Cases
+- [ ] Multiple namespaces
+- [ ] Unknown namespaces
+- [ ] Namespace prefix redefinition
+- [ ] Default namespace overrides
+
+### DoS Edge Cases
+- [ ] Feeds >50MB (should reject)
+- [ ] >10,000 entries (should truncate + set bozo)
+- [ ] Deeply nested XML (>100 levels)
+- [ ] Very long attribute values (>10KB)
+
+## Test Coverage Goals
+
+**Minimum coverage:** 80% line coverage
+
+```bash
+# 
Generate coverage report +cargo make coverage + +# View HTML report +open target/llvm-cov/html/index.html +``` + +**Critical paths requiring 100% coverage:** +- `parser/detect.rs` (format detection) +- `http/validation.rs` (SSRF protection) +- `util/sanitize.rs` (XSS protection) +- `limits.rs` (DoS protection) + +## Performance Testing + +### cargo-nextest + +Use nextest for faster test execution: + +```bash +# Run tests with nextest +cargo nextest run + +# Parallel execution (faster) +cargo nextest run --jobs 8 +``` + +### Test Execution Time + +**Target:** All tests should complete in <30 seconds + +If tests are slow: +1. Use smaller fixtures for unit tests +2. Move large feed tests to integration tests +3. Use `#[ignore]` for very slow tests +4. Run slow tests separately: `cargo test -- --ignored` + +## Common Pitfalls + +### Don't Use .unwrap() in Tests + +```rust +// ❌ BAD - Panic message unclear +#[test] +fn test_parse() { + let feed = parse(xml).unwrap(); + assert_eq!(feed.feed.title.unwrap(), "Test"); +} + +// ✅ GOOD - Clear assertion messages +#[test] +fn test_parse() { + let feed = parse(xml).expect("Failed to parse XML"); + assert_eq!( + feed.feed.title.as_deref(), + Some("Test"), + "Feed title should be 'Test'" + ); +} +``` + +### Don't Forget Negative Tests + +```rust +// ✅ GOOD - Test both success and failure +#[test] +fn test_parse_valid_succeeds() { + let result = parse(valid_xml); + assert!(result.is_ok()); +} + +#[test] +fn test_parse_invalid_sets_bozo() { + let result = parse(invalid_xml); + assert!(result.is_ok()); // Should still return Ok with bozo flag + assert!(result.unwrap().bozo); +} +``` + +### Don't Skip Bozo Validation + +```rust +// ❌ BAD - Only tests success case +#[test] +fn test_parse_entry() { + let feed = parse(xml).unwrap(); + assert_eq!(feed.entries.len(), 1); +} + +// ✅ GOOD - Tests tolerant parsing +#[test] +fn test_parse_entry_malformed() { + let feed = parse(broken_xml).unwrap(); + assert!(feed.bozo, "Should set bozo flag 
for malformed XML"); + assert_eq!(feed.entries.len(), 1, "Should still parse partial entry"); +} +``` + +## Documentation Tests + +Use doc comments with examples: + +```rust +/// Parses a date string in various formats. +/// +/// # Examples +/// +/// ``` +/// use feedparser_rs_core::util::date::parse_date; +/// +/// let date = parse_date("2024-01-15T12:00:00Z"); +/// assert!(date.is_some()); +/// ``` +/// +/// Invalid dates return `None`: +/// +/// ``` +/// use feedparser_rs_core::util::date::parse_date; +/// +/// let date = parse_date("not a date"); +/// assert!(date.is_none()); +/// ``` +pub fn parse_date(input: &str) -> Option<DateTime<Utc>> { + // Implementation +} +``` + +**Run doc tests:** +```bash +cargo test --doc +``` + +## References + +- cargo-nextest: https://nexte.st/ +- Criterion benchmarking: https://bheisler.github.io/criterion.rs/book/ +- pytest documentation: https://docs.pytest.org/ +- Python feedparser tests: https://github.com/kurtmckee/feedparser/tree/develop/tests diff --git a/.github/instructions/types.instructions.md b/.github/instructions/types.instructions.md new file mode 100644 index 0000000..6c8c291 --- /dev/null +++ b/.github/instructions/types.instructions.md @@ -0,0 +1,565 @@ +# Type Definitions Instructions + +**Applies to:** `crates/feedparser-rs-core/src/types/**` + +## Core Principles + +### API Compatibility is CRITICAL + +All type definitions MUST match Python feedparser's field names and behavior exactly. This is a drop-in replacement, not a new API. + +**Python feedparser field names → Rust field names (MUST BE IDENTICAL)** + +```python +# Python feedparser +d.feed.title # NOT d.feed.name or d.feed.heading +d.feed.link # NOT d.feed.url +d.entries[0].summary # NOT d.entries[0].description +d.entries[0].published_parsed # NOT d.entries[0].publication_date +``` + +**Always verify against Python feedparser documentation**: https://feedparser.readthedocs.io/ + +### Design Philosophy + +1. 
**Tolerant by Design**: All fields are `Option<T>` or `Vec<T>` (feeds often have missing data) +2. **Flat Structure**: Avoid deep nesting (matches Python feedparser's dict-like access) +3. **Value Types**: Use `String`, not `&str` (owned data, no lifetime complexity) +4. **Standard Types**: Use `chrono::DateTime<Utc>` for dates, not custom types + +## Type Categories + +### Core Feed Types + +**Located in:** `types/mod.rs` + +```rust +/// Main parsing result (equivalent to Python's FeedParserDict) +pub struct ParsedFeed { + pub feed: FeedMeta, // Feed-level metadata + pub entries: Vec<Entry>, // Items/entries + pub bozo: bool, // Error flag (CRITICAL) + pub bozo_exception: Option<String>, // Error description + pub encoding: String, // Character encoding + pub version: FeedVersion, // Format version + pub namespaces: HashMap<String, String>, // XML namespaces +} +``` + +**Rules:** +- NEVER remove `bozo` or `bozo_exception` fields (core to tolerant parsing) +- NEVER rename fields without checking Python feedparser compatibility +- ALL fields must be public (accessed from Python/Node.js bindings) + +### Feed Metadata Types + +**Located in:** `types/mod.rs` + +```rust +pub struct FeedMeta { + // Basic metadata + pub title: Option<String>, + pub title_detail: Option<TextConstruct>, // Type info (text/html/xhtml) + pub link: Option<String>, // Primary link + pub links: Vec<Link>, // All links + pub subtitle: Option<String>, + pub subtitle_detail: Option<TextConstruct>, + + // Dates + pub updated: Option<DateTime<Utc>>, + + // People + pub author: Option<String>, // Primary author name + pub author_detail: Option<Person>, // Full author info + pub authors: Vec<Person>, // All authors + pub contributors: Vec<Person>, + pub publisher: Option<String>, + pub publisher_detail: Option<Person>, + + // Classification + pub tags: Vec<Tag>, + pub language: Option<String>, + pub rights: Option<String>, + pub rights_detail: Option<TextConstruct>, + + // Technical + pub generator: 
Option<String>, + pub generator_detail: Option<Generator>, + pub id: Option<String>, + pub ttl: Option<u32>, + + // Images + pub image: Option<Image>, + pub icon: Option<String>, + pub logo: Option<String>, + + // Namespace extensions + pub itunes: Option<ItunesFeedMeta>, // iTunes podcast metadata + pub podcast: Option<PodcastMeta>, // Podcast 2.0 namespace +} +``` + +**Rules:** +- Use `Option<String>` for optional text fields (NOT `Option<&str>`) +- Use `Vec<T>` for collections (NOT arrays or fixed-size) +- Namespace extensions go in separate fields (`itunes`, `podcast`, etc.) + +### Entry Types + +**Located in:** `types/mod.rs` + +```rust +pub struct Entry { + pub id: Option<String>, + pub title: Option<String>, + pub title_detail: Option<TextConstruct>, + pub link: Option<String>, + pub links: Vec<Link>, + pub summary: Option<String>, + pub summary_detail: Option<TextConstruct>, + pub content: Vec<Content>, // Multiple content blocks + + // Dates + pub published: Option<DateTime<Utc>>, + pub updated: Option<DateTime<Utc>>, + pub created: Option<DateTime<Utc>>, // Dublin Core + pub expired: Option<DateTime<Utc>>, // Rarely used + + // People + pub author: Option<String>, + pub author_detail: Option<Person>, + pub authors: Vec<Person>, + pub contributors: Vec<Person>, + pub publisher: Option<String>, + pub publisher_detail: Option<Person>, + + // Media + pub enclosures: Vec<Enclosure>, // Audio/video attachments + pub tags: Vec<Tag>, + pub comments: Option<String>, + pub source: Option<Source>, + + // Namespace extensions + pub itunes: Option<ItunesEntryMeta>, + pub podcast: Option<PodcastEntryMeta>, +} +``` + +### Common Data Types + +**Located in:** `types/common.rs` + +```rust +/// Text with type information (Atom text constructs) +pub struct TextConstruct { + pub value: String, + pub content_type: TextType, // text, html, xhtml + pub language: Option<String>, + pub base: Option<String>, // xml:base attribute +} + +pub enum TextType { + Text, // Plain text + 
Html, // HTML (needs sanitization) + Xhtml, // XHTML (structured) +} + +/// Link with relationship and metadata +pub struct Link { + pub href: String, + pub rel: Option<String>, // "alternate", "enclosure", "self", etc. + pub link_type: Option<String>, // MIME type (e.g., "text/html") + pub title: Option<String>, + pub length: Option<u64>, // Size in bytes + pub hreflang: Option<String>, // Language +} + +/// Person (author, contributor, etc.) +pub struct Person { + pub name: Option<String>, + pub email: Option<String>, + pub uri: Option<String>, +} + +/// Tag/category +pub struct Tag { + pub term: String, // Tag name (required) + pub scheme: Option<String>, // Taxonomy URL + pub label: Option<String>, // Display name +} + +/// Media enclosure (podcast episode, video, etc.) +pub struct Enclosure { + pub url: String, + pub length: Option<u64>, + pub enclosure_type: Option<String>, // MIME type +} +``` + +### Version Enum + +**Located in:** `types/version.rs` + +**CRITICAL**: `as_str()` must return Python feedparser-compatible strings + +```rust +#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] +pub enum FeedVersion { + Rss090, + Rss091, + Rss092, + Rss10, + Rss20, + Atom03, + Atom10, + JsonFeed10, + JsonFeed11, + Unknown, +} + +impl FeedVersion { + /// Returns feedparser-compatible version string + pub fn as_str(&self) -> &'static str { + match self { + Self::Rss090 => "rss090", // NOT "RSS 0.90" + Self::Rss091 => "rss091", // NOT "RSS 0.91" + Self::Rss092 => "rss092", + Self::Rss10 => "rss10", // NOT "rss1.0" or "RSS 1.0" + Self::Rss20 => "rss20", // NOT "rss2.0" or "RSS 2.0" + Self::Atom03 => "atom03", + Self::Atom10 => "atom10", // NOT "atom" or "Atom 1.0" + Self::JsonFeed10 => "json10", + Self::JsonFeed11 => "json11", + Self::Unknown => "", // Empty string, not "unknown" + } + } +} +``` + +## Namespace Extension Types + +### iTunes Podcast Metadata + +**Located in:** `types/podcast.rs` + +```rust +pub struct ItunesFeedMeta { + pub author: Option<String>, + pub 
owner: Option<ItunesOwner>, + pub image: Option<String>, // URL + pub categories: Vec<ItunesCategory>, + pub explicit: Option<bool>, + pub podcast_type: Option<String>, // "episodic", "serial" + pub keywords: Vec<String>, + pub complete: Option<bool>, + pub new_feed_url: Option<String>, +} + +pub struct ItunesCategory { + pub text: String, + pub subcategory: Option<String>, +} + +pub struct ItunesEntryMeta { + pub title: Option<String>, + pub episode: Option<u32>, + pub season: Option<u32>, + pub episode_type: Option<String>, // "full", "trailer", "bonus" + pub duration: Option<u32>, // Seconds + pub image: Option<String>, + pub explicit: Option<bool>, +} +``` + +### Podcast 2.0 Namespace + +**Located in:** `types/podcast.rs` + +```rust +pub struct PodcastMeta { + pub guid: Option<String>, + pub locked: Option<bool>, + pub funding: Vec<PodcastFunding>, + pub value: Vec<PodcastValue>, +} + +pub struct PodcastEntryMeta { + pub transcript: Vec<PodcastTranscript>, + pub chapters: Option<PodcastChapters>, + pub soundbite: Vec<PodcastSoundbite>, + pub person: Vec<PodcastPerson>, +} + +pub struct PodcastTranscript { + pub url: String, + pub transcript_type: Option<String>, // "text/plain", "text/html", etc. 
+ pub language: Option<String>, + pub rel: Option<String>, +} +``` + +## Design Patterns + +### Option Unwrapping Helpers + +Provide helper methods to avoid verbose unwrapping in parsers: + +```rust +impl FeedMeta { + /// Set title with text construct + pub fn set_title(&mut self, text: TextConstruct) { + self.title = Some(text.value.clone()); + self.title_detail = Some(text); + } + + /// Set alternate link (first alternate link found) + pub fn set_alternate_link(&mut self, href: String, limit: usize) { + if self.link.is_none() { + self.link = Some(href.clone()); + } + self.links.try_push_limited(Link::alternate(href), limit); + } +} +``` + +### Constructor Helpers + +Provide constructors for common patterns: + +```rust +impl Link { + pub fn alternate(href: impl Into<String>) -> Self { + Self { + href: href.into(), + rel: Some("alternate".to_string()), + link_type: None, + title: None, + length: None, + hreflang: None, + } + } + + pub fn enclosure(href: impl Into<String>, mime_type: Option<String>) -> Self { + Self { + href: href.into(), + rel: Some("enclosure".to_string()), + link_type: mime_type, + title: None, + length: None, + hreflang: None, + } + } +} + +impl TextConstruct { + pub fn text(value: impl Into<String>) -> Self { + Self { + value: value.into(), + content_type: TextType::Text, + language: None, + base: None, + } + } + + pub fn html(value: impl Into<String>) -> Self { + Self { + value: value.into(), + content_type: TextType::Html, + language: None, + base: None, + } + } +} +``` + +## Bounded Collections (DoS Protection) + +**Located in:** `types/generics.rs` + +Provide extension trait for bounded growth: + +```rust +pub trait BoundedVec<T> { + fn try_push_limited(&mut self, item: T, limit: usize) -> Result<(), ()>; + fn is_at_limit(&self, limit: usize) -> bool; +} + +impl<T> BoundedVec<T> for Vec<T> { + fn try_push_limited(&mut self, item: T, limit: usize) -> Result<(), ()> { + if self.len() >= limit { + Err(()) + } else { + self.push(item); + Ok(()) 
+        }
+    }
+
+    fn is_at_limit(&self, limit: usize) -> bool {
+        self.len() >= limit
+    }
+}
+```
+
+**Usage in parsers:**
+
+```rust
+// ✅ CORRECT - Bounded growth
+if feed.entries.try_push_limited(entry, limits.max_entries).is_err() {
+    feed.bozo = true;
+    feed.bozo_exception = Some(format!("Entry limit exceeded: {}", limits.max_entries));
+}
+
+// ❌ WRONG - Unbounded growth (DoS risk)
+feed.entries.push(entry);
+```
+
+## Serialization Support
+
+All types should derive `serde::Serialize` for debugging and export:
+
+```rust
+use serde::Serialize;
+
+#[derive(Debug, Clone, Serialize)]
+pub struct ParsedFeed {
+    pub feed: FeedMeta,
+    pub entries: Vec<Entry>,
+    // ...
+}
+```
+
+**But NOT** `Deserialize` for main types (we don't deserialize from JSON into `ParsedFeed`).
+
+## Documentation Requirements
+
+### Every Public Type Must Have Doc Comments
+
+```rust
+/// Represents a parsed RSS/Atom feed with metadata and entries.
+///
+/// This is the main result type returned by `parse()`. It corresponds to
+/// Python feedparser's `FeedParserDict`.
+///
+/// # Fields
+///
+/// * `bozo` - Set to `true` if parsing encountered errors but continued
+/// * `bozo_exception` - Description of the parsing error, if any
+/// * `version` - Detected feed format (e.g., "rss20", "atom10")
+pub struct ParsedFeed {
+    // ...
+}
+```
+
+### Explain Field Semantics
+
+```rust
+/// Link relationship type. 
+/// +/// Common values: +/// * `"alternate"` - Main content link (HTML page) +/// * `"enclosure"` - Media attachment (podcast episode, video) +/// * `"self"` - Feed's canonical URL +/// * `"related"` - Related resource +pub rel: Option<String>, +``` + +## Testing Requirements + +### Provide Test Fixtures + +```rust +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_parsed_feed_default() { + let feed = ParsedFeed { + feed: FeedMeta::default(), + entries: vec![], + bozo: false, + bozo_exception: None, + encoding: "utf-8".to_string(), + version: FeedVersion::Unknown, + namespaces: HashMap::new(), + }; + + assert!(!feed.bozo); + assert_eq!(feed.version.as_str(), ""); + } + + #[test] + fn test_version_strings_match_feedparser() { + assert_eq!(FeedVersion::Rss20.as_str(), "rss20"); + assert_eq!(FeedVersion::Atom10.as_str(), "atom10"); + assert_eq!(FeedVersion::Unknown.as_str(), ""); + } +} +``` + +## Common Pitfalls + +### Don't Break API Compatibility + +```rust +// ❌ WRONG - Field name doesn't match Python feedparser +pub struct Entry { + pub description: Option<String>, // Should be "summary" +} + +// ✅ CORRECT +pub struct Entry { + pub summary: Option<String>, // Matches feedparser +} +``` + +### Don't Use References in Struct Fields + +```rust +// ❌ WRONG - Lifetime complexity, doesn't work with Python bindings +pub struct FeedMeta<'a> { + pub title: Option<&'a str>, +} + +// ✅ CORRECT - Owned data +pub struct FeedMeta { + pub title: Option<String>, +} +``` + +### Don't Use Custom Date Types + +```rust +// ❌ WRONG - Custom type not compatible with chrono ecosystem +pub struct CustomDateTime { + year: i32, + month: u32, + // ... 
+}
+
+// ✅ CORRECT - Use chrono
+use chrono::{DateTime, Utc};
+pub published: Option<DateTime<Utc>>,
+```
+
+### Don't Make Fields Private
+
+```rust
+// ❌ WRONG - Python bindings need access
+pub struct FeedMeta {
+    title: Option<String>, // Private
+}
+
+// ✅ CORRECT
+pub struct FeedMeta {
+    pub title: Option<String>, // Public
+}
+```
+
+## References
+
+- Python feedparser field names: https://feedparser.readthedocs.io/en/latest/reference.html
+- Atom text constructs (RFC 4287 §3.1): https://www.rfc-editor.org/rfc/rfc4287#section-3.1
+- RSS 2.0 element list: https://www.rssboard.org/rss-specification
+- iTunes podcast tags: https://help.apple.com/itc/podcasts_connect/#/itcb54353390
+- Podcast 2.0 namespace: https://github.com/Podcastindex-org/podcast-namespace
diff --git a/crates/feedparser-rs-core/src/parser/rss.rs b/crates/feedparser-rs-core/src/parser/rss.rs
index 92a9749..1341fc2 100644
--- a/crates/feedparser-rs-core/src/parser/rss.rs
+++ b/crates/feedparser-rs-core/src/parser/rss.rs
@@ -15,10 +15,51 @@ use crate::{
 use quick_xml::{Reader, events::Event};
 
 use super::common::{
-    EVENT_BUFFER_CAPACITY, FromAttributes, LimitedCollectionExt, check_depth, init_feed,
-    is_content_tag, is_dc_tag, is_itunes_tag, is_media_tag, read_text, skip_element,
+    EVENT_BUFFER_CAPACITY, LimitedCollectionExt, check_depth, init_feed, is_content_tag, is_dc_tag,
+    is_itunes_tag, is_media_tag, read_text, skip_element,
 };
 
+/// Error message for malformed XML attributes (shared constant)
+const MALFORMED_ATTRIBUTES_ERROR: &str = "Malformed XML attributes";
+
+/// Extract attributes as owned key-value pairs.
+/// Returns an `(attributes, has_errors)` tuple, where `has_errors` indicates
+/// whether any attribute parsing error occurred (used for the bozo flag).
+///
+/// Note: Keys are cloned to `Vec<u8>` because `quick_xml::Attribute` borrows the key
+/// data only for the lifetime of the event, but we need to store attributes across
+/// multiple parsing calls in `parse_enclosure` and other functions.
+#[inline] +fn collect_attributes(e: &quick_xml::events::BytesStart) -> (Vec<(Vec<u8>, String)>, bool) { + let mut has_errors = false; + let attrs = e + .attributes() + .filter_map(|result| { + if let Ok(attr) = result { + if let Ok(v) = attr.unescape_value() { + Some((attr.key.as_ref().to_vec(), v.to_string())) + } else { + has_errors = true; + None + } + } else { + has_errors = true; + None + } + }) + .collect(); + (attrs, has_errors) +} + +/// Find attribute value by key +#[inline] +fn find_attribute<'a>(attrs: &'a [(Vec<u8>, String)], key: &[u8]) -> Option<&'a str> { + attrs + .iter() + .find(|(k, _)| k.as_slice() == key) + .map(|(_, v)| v.as_str()) +} + /// Parse RSS 2.0 feed from raw bytes /// /// Parses an RSS 2.0 feed in tolerant mode, setting the bozo flag @@ -104,57 +145,21 @@ fn parse_channel( *depth += 1; check_depth(*depth, limits.max_nesting_depth)?; + // NOTE: Allocation here is necessary due to borrow checker constraints. + // We need owned tag data to pass &mut buf to helper functions simultaneously. + // Potential future optimization: restructure helpers to avoid this allocation. 
+ let tag = e.name().as_ref().to_vec(); + let (attrs, has_attr_errors) = collect_attributes(&e); + if has_attr_errors { + feed.bozo = true; + feed.bozo_exception = Some(MALFORMED_ATTRIBUTES_ERROR.to_string()); + } + // Use full qualified name to distinguish standard RSS tags from namespaced tags - // (e.g., <image> vs <itunes:image>, <category> vs <itunes:category>) - match e.name().as_ref() { - b"title" => { - feed.feed.title = Some(read_text(reader, &mut buf, limits)?); - } - b"link" => { - let link_text = read_text(reader, &mut buf, limits)?; - feed.feed - .set_alternate_link(link_text, limits.max_links_per_feed); - } - b"description" => { - feed.feed.subtitle = Some(read_text(reader, &mut buf, limits)?); - } - b"language" => { - feed.feed.language = Some(read_text(reader, &mut buf, limits)?); - } - b"pubDate" => { - let text = read_text(reader, &mut buf, limits)?; - match parse_date(&text) { - Some(dt) => feed.feed.updated = Some(dt), - None if !text.is_empty() => { - feed.bozo = true; - feed.bozo_exception = Some("Invalid pubDate format".to_string()); - } - None => {} - } - } - b"managingEditor" => { - feed.feed.author = Some(read_text(reader, &mut buf, limits)?); - } - b"webMaster" => { - feed.feed.publisher = Some(read_text(reader, &mut buf, limits)?); - } - b"generator" => { - feed.feed.generator = Some(read_text(reader, &mut buf, limits)?); - } - b"ttl" => { - let text = read_text(reader, &mut buf, limits)?; - feed.feed.ttl = text.parse().ok(); - } - b"category" => { - let term = read_text(reader, &mut buf, limits)?; - feed.feed.tags.try_push_limited( - Tag { - term, - scheme: None, - label: None, - }, - limits.max_tags, - ); + match tag.as_slice() { + b"title" | b"link" | b"description" | b"language" | b"pubDate" + | b"managingEditor" | b"webMaster" | b"generator" | b"ttl" | b"category" => { + parse_channel_standard(reader, &mut buf, &tag, feed, limits)?; } b"image" => { if let Ok(image) = parse_image(reader, &mut buf, limits, depth) { @@ -167,190 +172,34 
@@ fn parse_channel( } match parse_item(reader, &mut buf, limits, depth) { - Ok(entry) => feed.entries.push(entry), + Ok((entry, has_attr_errors)) => { + if has_attr_errors { + feed.bozo = true; + feed.bozo_exception = + Some(MALFORMED_ATTRIBUTES_ERROR.to_string()); + } + feed.entries.push(entry); + } Err(e) => { feed.bozo = true; feed.bozo_exception = Some(e.to_string()); } } } - tag => { - // Check for iTunes and Podcast 2.0 namespace tags - let handled = if is_itunes_tag(tag, b"author") { - let text = read_text(reader, &mut buf, limits)?; - let itunes = - feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); - itunes.author = Some(text); - true - } else if is_itunes_tag(tag, b"owner") { - let itunes = - feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); - if let Ok(owner) = parse_itunes_owner(reader, &mut buf, limits, depth) { - itunes.owner = Some(owner); - } - true - } else if is_itunes_tag(tag, b"category") { - // Parse category with potential subcategory - let mut category_text = String::new(); - for attr in e.attributes().flatten() { - if attr.key.as_ref() == b"text" - && let Ok(value) = attr.unescape_value() - { - category_text = - truncate_to_length(&value, limits.max_attribute_length); - } - } - - // Parse potential nested subcategory - // We need to read until we find the closing tag for the parent category - let mut subcategory_text = None; - let mut nesting = 0; // Track category nesting level - loop { - match reader.read_event_into(&mut buf) { - Ok(Event::Start(sub_e)) => { - if is_itunes_tag(sub_e.name().as_ref(), b"category") { - nesting += 1; - if nesting == 1 { - // First nested category - this is the subcategory - for attr in sub_e.attributes().flatten() { - if attr.key.as_ref() == b"text" - && let Ok(value) = attr.unescape_value() - { - subcategory_text = Some( - value - .chars() - .take(limits.max_attribute_length) - .collect(), - ); - break; - } - } - } - } - } - Ok(Event::Empty(sub_e)) => { - if 
is_itunes_tag(sub_e.name().as_ref(), b"category") - && subcategory_text.is_none() - { - // Self-closing nested category - for attr in sub_e.attributes().flatten() { - if attr.key.as_ref() == b"text" - && let Ok(value) = attr.unescape_value() - { - subcategory_text = Some( - value - .chars() - .take(limits.max_attribute_length) - .collect(), - ); - break; - } - } - } - } - Ok(Event::End(end_e)) => { - if is_itunes_tag(end_e.name().as_ref(), b"category") { - if nesting == 0 { - // End of the parent category element - break; - } - nesting -= 1; - } - } - Ok(Event::Eof) | Err(_) => break, - _ => {} - } - buf.clear(); - } - - let itunes = - feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); - itunes.categories.push(ItunesCategory { - text: category_text, - subcategory: subcategory_text, - }); - true - } else if is_itunes_tag(tag, b"explicit") { - let text = read_text(reader, &mut buf, limits)?; - let itunes = - feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); - itunes.explicit = parse_explicit(&text); - true - } else if is_itunes_tag(tag, b"image") { - let itunes = - feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); - for attr in e.attributes().flatten() { - if attr.key.as_ref() == b"href" - && let Ok(value) = attr.unescape_value() - { - itunes.image = Some(truncate_to_length( - &value, - limits.max_attribute_length, - )); - } - } - // NOTE: Don't call skip_element - itunes:image is typically self-closing - // and calling skip_element would consume the next tag's end event - true - } else if is_itunes_tag(tag, b"keywords") { - let text = read_text(reader, &mut buf, limits)?; - let itunes = - feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); - itunes.keywords = text - .split(',') - .map(|s| s.trim().to_string()) - .filter(|s| !s.is_empty()) - .collect(); - true - } else if is_itunes_tag(tag, b"type") { - let text = read_text(reader, &mut buf, limits)?; - let itunes = - 
feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); - itunes.podcast_type = Some(text); - true - } else if tag.starts_with(b"podcast:guid") { - let text = read_text(reader, &mut buf, limits)?; - let podcast = - feed.feed.podcast.get_or_insert_with(PodcastMeta::default); - podcast.guid = Some(text); - true - } else if tag.starts_with(b"podcast:funding") { - // Parse funding inline to avoid borrow conflicts - let mut url = String::new(); - for attr in e.attributes().flatten() { - if attr.key.as_ref() == b"url" - && let Ok(value) = attr.unescape_value() - { - url = truncate_to_length(&value, limits.max_attribute_length); - } - } - let message_text = read_text(reader, &mut buf, limits)?; - let message = if message_text.is_empty() { - None - } else { - Some(message_text) - }; - let podcast = - feed.feed.podcast.get_or_insert_with(PodcastMeta::default); - podcast.funding.push(PodcastFunding { url, message }); - true - } else if let Some(dc_element) = is_dc_tag(tag) { - // Dublin Core namespace - let dc_elem = dc_element.to_string(); - let text = read_text(reader, &mut buf, limits)?; - dublin_core::handle_feed_element(&dc_elem, &text, &mut feed.feed); - true - } else if let Some(_content_element) = is_content_tag(tag) { - // Content namespace - typically only used at entry level - skip_element(reader, &mut buf, limits, *depth)?; - true - } else if let Some(_media_element) = is_media_tag(tag) { - // Media RSS namespace - typically only used at entry level - skip_element(reader, &mut buf, limits, *depth)?; - true - } else { - false - }; + _ => { + let mut handled = parse_channel_itunes( + reader, &mut buf, &tag, &attrs, feed, limits, depth, + )?; + if !handled { + handled = parse_channel_podcast( + reader, &mut buf, &tag, &attrs, feed, limits, + )?; + } + if !handled { + handled = parse_channel_namespace( + reader, &mut buf, &tag, feed, limits, *depth, + )?; + } if !handled { skip_element(reader, &mut buf, limits, *depth)?; @@ -372,14 +221,294 @@ fn parse_channel( 
Ok(()) } +/// Parse enclosure element from attributes +#[inline] +fn parse_enclosure(attrs: &[(Vec<u8>, String)], limits: &ParserLimits) -> Option<Enclosure> { + let mut url = String::new(); + let mut length = None; + let mut enc_type = None; + + for (key, value) in attrs { + match key.as_slice() { + b"url" => url = truncate_to_length(value, limits.max_attribute_length), + b"length" => length = value.parse().ok(), + b"type" => enc_type = Some(truncate_to_length(value, limits.max_attribute_length)), + _ => {} + } + } + + if url.is_empty() { + None + } else { + Some(Enclosure { + url, + length, + enclosure_type: enc_type, + }) + } +} + +/// Parse standard RSS 2.0 channel elements +#[inline] +fn parse_channel_standard( + reader: &mut Reader<&[u8]>, + buf: &mut Vec<u8>, + tag: &[u8], + feed: &mut ParsedFeed, + limits: &ParserLimits, +) -> Result<()> { + match tag { + b"title" => { + feed.feed.title = Some(read_text(reader, buf, limits)?); + } + b"link" => { + let link_text = read_text(reader, buf, limits)?; + feed.feed + .set_alternate_link(link_text, limits.max_links_per_feed); + } + b"description" => { + feed.feed.subtitle = Some(read_text(reader, buf, limits)?); + } + b"language" => { + feed.feed.language = Some(read_text(reader, buf, limits)?); + } + b"pubDate" => { + let text = read_text(reader, buf, limits)?; + match parse_date(&text) { + Some(dt) => feed.feed.updated = Some(dt), + None if !text.is_empty() => { + feed.bozo = true; + feed.bozo_exception = Some("Invalid pubDate format".to_string()); + } + None => {} + } + } + b"managingEditor" => { + feed.feed.author = Some(read_text(reader, buf, limits)?); + } + b"webMaster" => { + feed.feed.publisher = Some(read_text(reader, buf, limits)?); + } + b"generator" => { + feed.feed.generator = Some(read_text(reader, buf, limits)?); + } + b"ttl" => { + let text = read_text(reader, buf, limits)?; + feed.feed.ttl = text.parse().ok(); + } + b"category" => { + let term = read_text(reader, buf, limits)?; + 
feed.feed.tags.try_push_limited( + Tag { + term, + scheme: None, + label: None, + }, + limits.max_tags, + ); + } + _ => {} + } + Ok(()) +} + +/// Parse iTunes namespace tags at channel level +/// +/// Returns `Ok(true)` if the tag was recognized and handled, `Ok(false)` if not recognized. +fn parse_channel_itunes( + reader: &mut Reader<&[u8]>, + buf: &mut Vec<u8>, + tag: &[u8], + attrs: &[(Vec<u8>, String)], + feed: &mut ParsedFeed, + limits: &ParserLimits, + depth: &mut usize, +) -> Result<bool> { + if is_itunes_tag(tag, b"author") { + let text = read_text(reader, buf, limits)?; + let itunes = feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); + itunes.author = Some(text); + Ok(true) + } else if is_itunes_tag(tag, b"owner") { + let itunes = feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); + if let Ok(owner) = parse_itunes_owner(reader, buf, limits, depth) { + itunes.owner = Some(owner); + } + Ok(true) + } else if is_itunes_tag(tag, b"category") { + parse_itunes_category(reader, buf, attrs, feed, limits); + Ok(true) + } else if is_itunes_tag(tag, b"explicit") { + let text = read_text(reader, buf, limits)?; + let itunes = feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); + itunes.explicit = parse_explicit(&text); + Ok(true) + } else if is_itunes_tag(tag, b"image") { + if let Some(value) = find_attribute(attrs, b"href") { + let itunes = feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); + itunes.image = Some(truncate_to_length(value, limits.max_attribute_length)); + } + Ok(true) + } else if is_itunes_tag(tag, b"keywords") { + let text = read_text(reader, buf, limits)?; + let itunes = feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); + itunes.keywords = text + .split(',') + .map(|s| s.trim().to_string()) + .filter(|s| !s.is_empty()) + .collect(); + Ok(true) + } else if is_itunes_tag(tag, b"type") { + let text = read_text(reader, buf, limits)?; + let itunes = 
feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); + itunes.podcast_type = Some(text); + Ok(true) + } else { + Ok(false) + } +} + +/// Parse iTunes category with potential subcategory +fn parse_itunes_category( + reader: &mut Reader<&[u8]>, + buf: &mut Vec<u8>, + attrs: &[(Vec<u8>, String)], + feed: &mut ParsedFeed, + limits: &ParserLimits, +) { + let category_text = find_attribute(attrs, b"text") + .map(|v| truncate_to_length(v, limits.max_attribute_length)) + .unwrap_or_default(); + + // Parse potential nested subcategory + let mut subcategory_text = None; + let mut nesting = 0; + loop { + match reader.read_event_into(buf) { + Ok(Event::Start(sub_e)) => { + if is_itunes_tag(sub_e.name().as_ref(), b"category") { + nesting += 1; + if nesting == 1 { + for attr in sub_e.attributes().flatten() { + if attr.key.as_ref() == b"text" + && let Ok(value) = attr.unescape_value() + { + subcategory_text = + Some(value.chars().take(limits.max_attribute_length).collect()); + break; + } + } + } + } + } + Ok(Event::Empty(sub_e)) => { + if is_itunes_tag(sub_e.name().as_ref(), b"category") && subcategory_text.is_none() { + for attr in sub_e.attributes().flatten() { + if attr.key.as_ref() == b"text" + && let Ok(value) = attr.unescape_value() + { + subcategory_text = + Some(value.chars().take(limits.max_attribute_length).collect()); + break; + } + } + } + } + Ok(Event::End(end_e)) => { + if is_itunes_tag(end_e.name().as_ref(), b"category") { + if nesting == 0 { + break; + } + nesting -= 1; + } + } + Ok(Event::Eof) | Err(_) => break, + _ => {} + } + buf.clear(); + } + + let itunes = feed.feed.itunes.get_or_insert_with(ItunesFeedMeta::default); + itunes.categories.push(ItunesCategory { + text: category_text, + subcategory: subcategory_text, + }); +} + +/// Parse Podcast 2.0 namespace tags at channel level +/// +/// Returns `Ok(true)` if the tag was recognized and handled, `Ok(false)` if not recognized. 
+#[inline] +fn parse_channel_podcast( + reader: &mut Reader<&[u8]>, + buf: &mut Vec<u8>, + tag: &[u8], + attrs: &[(Vec<u8>, String)], + feed: &mut ParsedFeed, + limits: &ParserLimits, +) -> Result<bool> { + if tag.starts_with(b"podcast:guid") { + let text = read_text(reader, buf, limits)?; + let podcast = feed.feed.podcast.get_or_insert_with(PodcastMeta::default); + podcast.guid = Some(text); + Ok(true) + } else if tag.starts_with(b"podcast:funding") { + let url = find_attribute(attrs, b"url") + .map(|v| truncate_to_length(v, limits.max_attribute_length)) + .unwrap_or_default(); + let message_text = read_text(reader, buf, limits)?; + let message = if message_text.is_empty() { + None + } else { + Some(message_text) + }; + let podcast = feed.feed.podcast.get_or_insert_with(PodcastMeta::default); + podcast.funding.push(PodcastFunding { url, message }); + Ok(true) + } else { + Ok(false) + } +} + +/// Parse Dublin Core, Content, and Media RSS namespace tags at channel level +#[inline] +fn parse_channel_namespace( + reader: &mut Reader<&[u8]>, + buf: &mut Vec<u8>, + tag: &[u8], + feed: &mut ParsedFeed, + limits: &ParserLimits, + depth: usize, +) -> Result<bool> { + if let Some(dc_element) = is_dc_tag(tag) { + let dc_elem = dc_element.to_string(); + let text = read_text(reader, buf, limits)?; + dublin_core::handle_feed_element(&dc_elem, &text, &mut feed.feed); + Ok(true) + } else if let Some(_content_element) = is_content_tag(tag) { + skip_element(reader, buf, limits, depth)?; + Ok(true) + } else if let Some(_media_element) = is_media_tag(tag) { + skip_element(reader, buf, limits, depth)?; + Ok(true) + } else { + Ok(false) + } +} + /// Parse <item> element (entry) +/// +/// Returns a tuple where: +/// - First element: the parsed `Entry` +/// - Second element: `bool` indicating whether attribute parsing errors occurred (for bozo flag) fn parse_item( reader: &mut Reader<&[u8]>, buf: &mut Vec<u8>, limits: &ParserLimits, depth: &mut usize, -) -> Result<Entry> { +) -> 
Result<(Entry, bool)> { let mut entry = Entry::with_capacity(); + let mut has_attr_errors = false; loop { match reader.read_event_into(buf) { @@ -392,296 +521,47 @@ fn parse_item( *depth += 1; check_depth(*depth, limits.max_nesting_depth)?; + // NOTE: Allocation here is necessary due to borrow checker constraints. + // We need owned tag data to pass &mut buf to helper functions simultaneously. + // Potential future optimization: restructure helpers to avoid this allocation. + let tag = e.name().as_ref().to_vec(); + let (attrs, attr_error) = collect_attributes(e); + if attr_error { + has_attr_errors = true; + } + // Use full qualified name to distinguish standard RSS tags from namespaced tags - match e.name().as_ref() { - b"title" => { - entry.title = Some(read_text(reader, buf, limits)?); - } - b"link" => { - let link_text = read_text(reader, buf, limits)?; - entry.link = Some(link_text.clone()); - entry.links.try_push_limited( - Link { - href: link_text, - rel: Some("alternate".to_string()), - ..Default::default() - }, - limits.max_links_per_entry, - ); - } - b"description" => { - let desc = read_text(reader, buf, limits)?; - entry.summary = Some(desc.clone()); - entry.summary_detail = Some(TextConstruct { - value: desc, - content_type: TextType::Html, - language: None, - base: None, - }); - } - b"guid" => { - entry.id = Some(read_text(reader, buf, limits)?); - } - b"pubDate" => { - let text = read_text(reader, buf, limits)?; - entry.published = parse_date(&text); - } - b"author" => { - entry.author = Some(read_text(reader, buf, limits)?); - } - b"category" => { - let term = read_text(reader, buf, limits)?; - entry.tags.try_push_limited( - Tag { - term, - scheme: None, - label: None, - }, - limits.max_tags, - ); + match tag.as_slice() { + b"title" | b"link" | b"description" | b"guid" | b"pubDate" | b"author" + | b"category" | b"comments" => { + parse_item_standard(reader, buf, &tag, &mut entry, limits)?; } b"enclosure" => { - if let Some(enclosure) = 
parse_enclosure(e, limits) { + if let Some(enclosure) = parse_enclosure(&attrs, limits) { entry .enclosures .try_push_limited(enclosure, limits.max_enclosures); } skip_element(reader, buf, limits, *depth)?; } - b"comments" => { - entry.comments = Some(read_text(reader, buf, limits)?); - } b"source" => { if let Ok(source) = parse_source(reader, buf, limits, depth) { entry.source = Some(source); } } - tag => { - // Check for iTunes and Podcast 2.0 namespace tags - let handled = if is_itunes_tag(tag, b"title") { - let text = read_text(reader, buf, limits)?; - let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default); - itunes.title = Some(text); - true - } else if is_itunes_tag(tag, b"author") { - let text = read_text(reader, buf, limits)?; - let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default); - itunes.author = Some(text); - true - } else if is_itunes_tag(tag, b"duration") { - let text = read_text(reader, buf, limits)?; - let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default); - itunes.duration = parse_duration(&text); - true - } else if is_itunes_tag(tag, b"explicit") { - let text = read_text(reader, buf, limits)?; - let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default); - itunes.explicit = parse_explicit(&text); - true - } else if is_itunes_tag(tag, b"image") { - let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default); - for attr in e.attributes().flatten() { - if attr.key.as_ref() == b"href" - && let Ok(value) = attr.unescape_value() - { - itunes.image = Some(truncate_to_length( - &value, - limits.max_attribute_length, - )); - } - } - // NOTE: Don't call skip_element - itunes:image is typically self-closing - true - } else if is_itunes_tag(tag, b"episode") { - let text = read_text(reader, buf, limits)?; - let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default); - itunes.episode = text.parse().ok(); - true - } else if is_itunes_tag(tag, b"season") { - let text = 
read_text(reader, buf, limits)?; - let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default); - itunes.season = text.parse().ok(); - true - } else if is_itunes_tag(tag, b"episodeType") { - let text = read_text(reader, buf, limits)?; - let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default); - itunes.episode_type = Some(text); - true - } else if tag.starts_with(b"podcast:transcript") { - // Parse Podcast 2.0 transcript inline - let mut url = String::new(); - let mut transcript_type = None; - let mut language = None; - let mut rel = None; - for attr in e.attributes().flatten() { - match attr.key.as_ref() { - b"url" => { - if let Ok(value) = attr.unescape_value() { - url = value - .chars() - .take(limits.max_attribute_length) - .collect(); - } - } - b"type" => { - if let Ok(value) = attr.unescape_value() { - transcript_type = Some( - value - .chars() - .take(limits.max_attribute_length) - .collect(), - ); - } - } - b"language" => { - if let Ok(value) = attr.unescape_value() { - language = Some( - value - .chars() - .take(limits.max_attribute_length) - .collect(), - ); - } - } - b"rel" => { - if let Ok(value) = attr.unescape_value() { - rel = Some( - value - .chars() - .take(limits.max_attribute_length) - .collect(), - ); - } - } - _ => {} - } - } - if !url.is_empty() { - entry.podcast_transcripts.push(PodcastTranscript { - url, - transcript_type, - language, - rel, - }); - } - if !is_empty { - skip_element(reader, buf, limits, *depth)?; - } - true - } else if tag.starts_with(b"podcast:person") { - // Parse Podcast 2.0 person inline - let mut role = None; - let mut group = None; - let mut img = None; - let mut href = None; - for attr in e.attributes().flatten() { - match attr.key.as_ref() { - b"role" => { - if let Ok(value) = attr.unescape_value() { - role = Some( - value - .chars() - .take(limits.max_attribute_length) - .collect(), - ); - } - } - b"group" => { - if let Ok(value) = attr.unescape_value() { - group = Some( - value - .chars() - 
.take(limits.max_attribute_length) - .collect(), - ); - } - } - b"img" => { - if let Ok(value) = attr.unescape_value() { - img = Some( - value - .chars() - .take(limits.max_attribute_length) - .collect(), - ); - } - } - b"href" => { - if let Ok(value) = attr.unescape_value() { - href = Some( - value - .chars() - .take(limits.max_attribute_length) - .collect(), - ); - } - } - _ => {} - } - } - let name = read_text(reader, buf, limits)?; - if !name.is_empty() { - entry.podcast_persons.push(PodcastPerson { - name, - role, - group, - img, - href, - }); - } - true - } else if let Some(dc_element) = is_dc_tag(tag) { - // Dublin Core namespace - let dc_elem = dc_element.to_string(); - let text = read_text(reader, buf, limits)?; - dublin_core::handle_entry_element(&dc_elem, &text, &mut entry); - true - } else if let Some(content_element) = is_content_tag(tag) { - // Content namespace - let content_elem = content_element.to_string(); - let text = read_text(reader, buf, limits)?; - content::handle_entry_element(&content_elem, &text, &mut entry); - true - } else if let Some(media_element) = is_media_tag(tag) { - // Media RSS namespace - handle both text elements and attribute-based elements - if media_element == "thumbnail" { - // media:thumbnail has attributes - if let Some(thumbnail) = MediaThumbnail::from_attributes( - e.attributes().flatten(), - limits.max_attribute_length, - ) { - entry - .media_thumbnails - .try_push_limited(thumbnail, limits.max_enclosures); - } - // Don't call skip_element if it's self-closing - if !is_empty { - skip_element(reader, buf, limits, *depth)?; - } - } else if media_element == "content" { - // media:content has attributes - if let Some(media) = MediaContent::from_attributes( - e.attributes().flatten(), - limits.max_attribute_length, - ) { - entry - .media_content - .try_push_limited(media, limits.max_enclosures); - } - // Don't call skip_element if it's self-closing - if !is_empty { - skip_element(reader, buf, limits, *depth)?; - } - } else 
                     {
-                        // Other media elements (title, description, keywords, category)
-                        let media_elem = media_element.to_string();
-                        let text = read_text(reader, buf, limits)?;
-                        media_rss::handle_entry_element(&media_elem, &text, &mut entry);
-                    }
-                    true
-                } else {
-                    false
-                };
+                _ => {
+                    let mut handled =
+                        parse_item_itunes(reader, buf, &tag, &attrs, &mut entry, limits)?;
+                    if !handled {
+                        handled = parse_item_podcast(
+                            reader, buf, &tag, &attrs, &mut entry, limits, is_empty, *depth,
+                        )?;
+                    }
+                    if !handled {
+                        handled = parse_item_namespace(
+                            reader, buf, &tag, &attrs, &mut entry, limits, is_empty, *depth,
+                        )?;
+                    }
                 if !handled {
                     skip_element(reader, buf, limits, *depth)?;
@@ -700,7 +580,339 @@ fn parse_item(
         buf.clear();
     }
 
-    Ok(entry)
+    Ok((entry, has_attr_errors))
+}
+
+/// Parse standard RSS 2.0 item elements
+#[inline]
+fn parse_item_standard(
+    reader: &mut Reader<&[u8]>,
+    buf: &mut Vec<u8>,
+    tag: &[u8],
+    entry: &mut Entry,
+    limits: &ParserLimits,
+) -> Result<()> {
+    match tag {
+        b"title" => {
+            entry.title = Some(read_text(reader, buf, limits)?);
+        }
+        b"link" => {
+            let link_text = read_text(reader, buf, limits)?;
+            entry.link = Some(link_text.clone());
+            entry.links.try_push_limited(
+                Link {
+                    href: link_text,
+                    rel: Some("alternate".to_string()),
+                    ..Default::default()
+                },
+                limits.max_links_per_entry,
+            );
+        }
+        b"description" => {
+            let desc = read_text(reader, buf, limits)?;
+            entry.summary = Some(desc.clone());
+            entry.summary_detail = Some(TextConstruct {
+                value: desc,
+                content_type: TextType::Html,
+                language: None,
+                base: None,
+            });
+        }
+        b"guid" => {
+            entry.id = Some(read_text(reader, buf, limits)?);
+        }
+        b"pubDate" => {
+            let text = read_text(reader, buf, limits)?;
+            entry.published = parse_date(&text);
+        }
+        b"author" => {
+            entry.author = Some(read_text(reader, buf, limits)?);
+        }
+        b"category" => {
+            let term = read_text(reader, buf, limits)?;
+            entry.tags.try_push_limited(
+                Tag {
+                    term,
+                    scheme: None,
+                    label: None,
+                },
+                limits.max_tags,
+            );
+        }
+        b"comments" => {
+            entry.comments = Some(read_text(reader, buf, limits)?);
+        }
+        _ => {}
+    }
+    Ok(())
+}
+
+/// Parse iTunes namespace tags at item level
+///
+/// Returns `Ok(true)` if the tag was recognized and handled, `Ok(false)` if not recognized.
+#[inline]
+fn parse_item_itunes(
+    reader: &mut Reader<&[u8]>,
+    buf: &mut Vec<u8>,
+    tag: &[u8],
+    attrs: &[(Vec<u8>, String)],
+    entry: &mut Entry,
+    limits: &ParserLimits,
+) -> Result<bool> {
+    if is_itunes_tag(tag, b"title") {
+        let text = read_text(reader, buf, limits)?;
+        let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default);
+        itunes.title = Some(text);
+        Ok(true)
+    } else if is_itunes_tag(tag, b"author") {
+        let text = read_text(reader, buf, limits)?;
+        let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default);
+        itunes.author = Some(text);
+        Ok(true)
+    } else if is_itunes_tag(tag, b"duration") {
+        let text = read_text(reader, buf, limits)?;
+        let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default);
+        itunes.duration = parse_duration(&text);
+        Ok(true)
+    } else if is_itunes_tag(tag, b"explicit") {
+        let text = read_text(reader, buf, limits)?;
+        let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default);
+        itunes.explicit = parse_explicit(&text);
+        Ok(true)
+    } else if is_itunes_tag(tag, b"image") {
+        if let Some(value) = find_attribute(attrs, b"href") {
+            let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default);
+            itunes.image = Some(truncate_to_length(value, limits.max_attribute_length));
+        }
+        Ok(true)
+    } else if is_itunes_tag(tag, b"episode") {
+        let text = read_text(reader, buf, limits)?;
+        let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default);
+        itunes.episode = text.parse().ok();
+        Ok(true)
+    } else if is_itunes_tag(tag, b"season") {
+        let text = read_text(reader, buf, limits)?;
+        let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default);
+        itunes.season = text.parse().ok();
+        Ok(true)
+    } else if is_itunes_tag(tag, b"episodeType") {
+        let text = read_text(reader, buf, limits)?;
+        let itunes = entry.itunes.get_or_insert_with(ItunesEntryMeta::default);
+        itunes.episode_type = Some(text);
+        Ok(true)
+    } else {
+        Ok(false)
+    }
+}
+
+/// Parse Podcast 2.0 namespace tags at item level
+///
+/// Returns `Ok(true)` if the tag was recognized and handled, `Ok(false)` if not recognized.
+///
+/// Note: Uses 8 parameters instead of a context struct due to borrow checker constraints
+/// with multiple simultaneous `&mut` references during parsing.
+#[inline]
+#[allow(clippy::too_many_arguments)]
+fn parse_item_podcast(
+    reader: &mut Reader<&[u8]>,
+    buf: &mut Vec<u8>,
+    tag: &[u8],
+    attrs: &[(Vec<u8>, String)],
+    entry: &mut Entry,
+    limits: &ParserLimits,
+    is_empty: bool,
+    depth: usize,
+) -> Result<bool> {
+    if tag.starts_with(b"podcast:transcript") {
+        parse_podcast_transcript(reader, buf, attrs, entry, limits, is_empty, depth)?;
+        Ok(true)
+    } else if tag.starts_with(b"podcast:person") {
+        parse_podcast_person(reader, buf, attrs, entry, limits)?;
+        Ok(true)
+    } else {
+        Ok(false)
+    }
+}
+
+/// Parse Podcast 2.0 transcript element
+///
+/// Note: Currently always returns `Ok(())` but uses a `Result` return type
+/// for consistency with the other parsers and potential future error handling.
+fn parse_podcast_transcript(
+    reader: &mut Reader<&[u8]>,
+    buf: &mut Vec<u8>,
+    attrs: &[(Vec<u8>, String)],
+    entry: &mut Entry,
+    limits: &ParserLimits,
+    is_empty: bool,
+    depth: usize,
+) -> Result<()> {
+    let url = find_attribute(attrs, b"url")
+        .map(|v| truncate_to_length(v, limits.max_attribute_length))
+        .unwrap_or_default();
+    let transcript_type =
+        find_attribute(attrs, b"type").map(|v| truncate_to_length(v, limits.max_attribute_length));
+    let language = find_attribute(attrs, b"language")
+        .map(|v| truncate_to_length(v, limits.max_attribute_length));
+    let rel =
+        find_attribute(attrs, b"rel").map(|v| truncate_to_length(v, limits.max_attribute_length));
+
+    if !url.is_empty() {
+        entry.podcast_transcripts.push(PodcastTranscript {
+            url,
+            transcript_type,
+            language,
+            rel,
+        });
+    }
+
+    if !is_empty {
+        skip_element(reader, buf, limits, depth)?;
+    }
+
+    Ok(())
+}
+
+/// Parse Podcast 2.0 person element
+fn parse_podcast_person(
+    reader: &mut Reader<&[u8]>,
+    buf: &mut Vec<u8>,
+    attrs: &[(Vec<u8>, String)],
+    entry: &mut Entry,
+    limits: &ParserLimits,
+) -> Result<()> {
+    let role =
+        find_attribute(attrs, b"role").map(|v| truncate_to_length(v, limits.max_attribute_length));
+    let group =
+        find_attribute(attrs, b"group").map(|v| truncate_to_length(v, limits.max_attribute_length));
+    let img =
+        find_attribute(attrs, b"img").map(|v| truncate_to_length(v, limits.max_attribute_length));
+    let href =
+        find_attribute(attrs, b"href").map(|v| truncate_to_length(v, limits.max_attribute_length));
+
+    let name = read_text(reader, buf, limits)?;
+    if !name.is_empty() {
+        entry.podcast_persons.push(PodcastPerson {
+            name,
+            role,
+            group,
+            img,
+            href,
+        });
+    }
+
+    Ok(())
+}
+
+/// Parse Dublin Core, Content, and Media RSS namespace tags at item level
+///
+/// Returns `Ok(true)` if the tag was recognized and handled, `Ok(false)` if not recognized.
+///
+/// Note: Uses 8 parameters instead of a context struct due to borrow checker constraints
+/// with multiple simultaneous `&mut` references during parsing.
+#[inline]
+#[allow(clippy::too_many_arguments)]
+fn parse_item_namespace(
+    reader: &mut Reader<&[u8]>,
+    buf: &mut Vec<u8>,
+    tag: &[u8],
+    attrs: &[(Vec<u8>, String)],
+    entry: &mut Entry,
+    limits: &ParserLimits,
+    is_empty: bool,
+    depth: usize,
+) -> Result<bool> {
+    if let Some(dc_element) = is_dc_tag(tag) {
+        let dc_elem = dc_element.to_string();
+        let text = read_text(reader, buf, limits)?;
+        dublin_core::handle_entry_element(&dc_elem, &text, entry);
+        Ok(true)
+    } else if let Some(content_element) = is_content_tag(tag) {
+        let content_elem = content_element.to_string();
+        let text = read_text(reader, buf, limits)?;
+        content::handle_entry_element(&content_elem, &text, entry);
+        Ok(true)
+    } else if let Some(media_element) = is_media_tag(tag) {
+        parse_item_media(
+            reader,
+            buf,
+            media_element,
+            attrs,
+            entry,
+            limits,
+            is_empty,
+            depth,
+        )?;
+        Ok(true)
+    } else {
+        Ok(false)
+    }
+}
+
+/// Parse Media RSS namespace elements
+#[allow(clippy::too_many_arguments)]
+fn parse_item_media(
+    reader: &mut Reader<&[u8]>,
+    buf: &mut Vec<u8>,
+    media_element: &str,
+    attrs: &[(Vec<u8>, String)],
+    entry: &mut Entry,
+    limits: &ParserLimits,
+    is_empty: bool,
+    depth: usize,
+) -> Result<()> {
+    match media_element {
+        "thumbnail" => {
+            let url = find_attribute(attrs, b"url")
+                .map(|v| truncate_to_length(v, limits.max_attribute_length))
+                .unwrap_or_default();
+            let width = find_attribute(attrs, b"width").and_then(|v| v.parse().ok());
+            let height = find_attribute(attrs, b"height").and_then(|v| v.parse().ok());
+
+            if !url.is_empty() {
+                entry
+                    .media_thumbnails
+                    .try_push_limited(MediaThumbnail { url, width, height }, limits.max_enclosures);
+            }
+            if !is_empty {
+                skip_element(reader, buf, limits, depth)?;
+            }
+        }
+        "content" => {
+            let url = find_attribute(attrs, b"url")
+                .map(|v| truncate_to_length(v, limits.max_attribute_length))
+                .unwrap_or_default();
+            let content_type = find_attribute(attrs, b"type")
+                .map(|v| truncate_to_length(v, limits.max_attribute_length));
+            let filesize = find_attribute(attrs, b"fileSize").and_then(|v| v.parse().ok());
+            let duration = find_attribute(attrs, b"duration").and_then(|v| v.parse().ok());
+            let width = find_attribute(attrs, b"width").and_then(|v| v.parse().ok());
+            let height = find_attribute(attrs, b"height").and_then(|v| v.parse().ok());
+
+            if !url.is_empty() {
+                entry.media_content.try_push_limited(
+                    MediaContent {
+                        url,
+                        content_type,
+                        filesize,
+                        width,
+                        height,
+                        duration,
+                    },
+                    limits.max_enclosures,
+                );
+            }
+            if !is_empty {
+                skip_element(reader, buf, limits, depth)?;
+            }
+        }
+        _ => {
+            let media_elem = media_element.to_string();
+            let text = read_text(reader, buf, limits)?;
+            media_rss::handle_entry_element(&media_elem, &text, entry);
+        }
+    }
+    Ok(())
 }
 
 /// Parse <image> element
@@ -764,11 +976,6 @@ fn parse_image(
     })
 }
 
-#[inline]
-fn parse_enclosure(e: &quick_xml::events::BytesStart, limits: &ParserLimits) -> Option<Enclosure> {
-    Enclosure::from_attributes(e.attributes().flatten(), limits.max_attribute_length)
-}
-
 /// Parse <source> element
 fn parse_source(
     reader: &mut Reader<&[u8]>,
@@ -1219,4 +1426,356 @@ mod tests {
         let feed = parse_rss20(xml).unwrap();
         assert_eq!(feed.feed.title.as_deref(), Some("Test"));
     }
+
+    // PRIORITY 1: iTunes Item-Level Tests (CRITICAL)
+
+    #[test]
+    fn test_parse_rss_itunes_episode_metadata() {
+        let xml = br#"<?xml version="1.0"?>
+        <rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
+          <channel>
+            <title>Test Podcast</title>
+            <item>
+              <title>Standard Title</title>
+              <itunes:title>iTunes Override Title</itunes:title>
+              <itunes:duration>1:23:45</itunes:duration>
+              <itunes:image href="https://example.com/episode-cover.jpg"/>
+              <itunes:explicit>yes</itunes:explicit>
+              <itunes:episode>42</itunes:episode>
+              <itunes:season>3</itunes:season>
+              <itunes:episodeType>full</itunes:episodeType>
+            </item>
+          </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        assert!(
+            !feed.bozo,
+            "Should parse iTunes episode metadata without errors"
+        );
+        assert_eq!(feed.entries.len(), 1);
+
+        let entry = &feed.entries[0];
+        let itunes = entry.itunes.as_ref().unwrap();
+
+        assert_eq!(itunes.title.as_deref(), Some("iTunes Override Title"));
+        assert_eq!(itunes.duration, Some(5025)); // 1:23:45 in seconds
+        assert_eq!(
+            itunes.image.as_deref(),
+            Some("https://example.com/episode-cover.jpg")
+        );
+        assert_eq!(itunes.explicit, Some(true));
+        assert_eq!(itunes.episode, Some(42));
+        assert_eq!(itunes.season, Some(3));
+        assert_eq!(itunes.episode_type.as_deref(), Some("full"));
+    }
+
+    #[test]
+    fn test_parse_rss_itunes_duration_formats() {
+        // Test HH:MM:SS format
+        let xml1 = br#"<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
+        <channel><item>
+        <itunes:duration>1:23:45</itunes:duration>
+        </item></channel>
+        </rss>"#;
+        let feed1 = parse_rss20(xml1).unwrap();
+        assert_eq!(
+            feed1.entries[0].itunes.as_ref().unwrap().duration,
+            Some(5025)
+        );
+
+        // Test MM:SS format
+        let xml2 = br#"<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
+        <channel><item>
+        <itunes:duration>23:45</itunes:duration>
+        </item></channel>
+        </rss>"#;
+        let feed2 = parse_rss20(xml2).unwrap();
+        assert_eq!(
+            feed2.entries[0].itunes.as_ref().unwrap().duration,
+            Some(1425)
+        );
+
+        // Test seconds-only format
+        let xml3 = br#"<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
+        <channel><item>
+        <itunes:duration>3661</itunes:duration>
+        </item></channel>
+        </rss>"#;
+        let feed3 = parse_rss20(xml3).unwrap();
+        assert_eq!(
+            feed3.entries[0].itunes.as_ref().unwrap().duration,
+            Some(3661)
+        );
+
+        // Test invalid format
+        let xml4 = br#"<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
+        <channel><item>
+        <itunes:duration>invalid</itunes:duration>
+        </item></channel>
+        </rss>"#;
+        let feed4 = parse_rss20(xml4).unwrap();
+        assert!(
+            feed4.entries[0].itunes.as_ref().unwrap().duration.is_none(),
+            "Invalid duration should result in None"
+        );
+    }
+
+    #[test]
+    fn test_parse_rss_itunes_nested_categories() {
+        let xml = br#"<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
+        <channel>
+        <title>Test Podcast</title>
+        <itunes:category text="Arts">
+          <itunes:category text="Design"/>
+        </itunes:category>
+        <itunes:category text="Technology">
+          <itunes:category text="Programming"/>
+        </itunes:category>
+        <itunes:category text="News"/>
+        </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        let itunes = feed.feed.itunes.as_ref().unwrap();
+
+        assert_eq!(itunes.categories.len(), 3);
+
+        // First category with subcategory
+        assert_eq!(itunes.categories[0].text, "Arts");
+        assert_eq!(itunes.categories[0].subcategory.as_deref(), Some("Design"));
+
+        // Second category with subcategory
+        assert_eq!(itunes.categories[1].text, "Technology");
+        assert_eq!(
+            itunes.categories[1].subcategory.as_deref(),
+            Some("Programming")
+        );
+
+        // Third category without subcategory
+        assert_eq!(itunes.categories[2].text, "News");
+        assert!(itunes.categories[2].subcategory.is_none());
+    }
+
+    #[test]
+    fn test_parse_rss_itunes_owner_parsing() {
+        let xml = br#"<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
+        <channel>
+        <title>Test Podcast</title>
+        <itunes:owner>
+          <itunes:name>John Smith</itunes:name>
+          <itunes:email>john@example.com</itunes:email>
+        </itunes:owner>
+        </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        let itunes = feed.feed.itunes.as_ref().unwrap();
+        let owner = itunes.owner.as_ref().unwrap();
+
+        assert_eq!(owner.name.as_deref(), Some("John Smith"));
+        assert_eq!(owner.email.as_deref(), Some("john@example.com"));
+    }
+
+    // PRIORITY 2: Podcast 2.0 Tests
+
+    #[test]
+    fn test_parse_rss_podcast_locked_and_guid() {
+        let xml = br#"<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
+        <channel>
+        <title>Test Podcast</title>
+        <podcast:guid>917393e3-1c1e-5d48-8e7f-cc9c0d9f2e95</podcast:guid>
+        </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        assert!(!feed.bozo);
+
+        let podcast = feed.feed.podcast.as_ref().unwrap();
+        assert_eq!(
+            podcast.guid.as_deref(),
+            Some("917393e3-1c1e-5d48-8e7f-cc9c0d9f2e95")
+        );
+    }
+
+    #[test]
+    fn test_parse_rss_podcast_funding() {
+        let xml = br#"<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
+        <channel>
+        <title>Test Podcast</title>
+        <podcast:funding url="https://patreon.com/example">Support on Patreon</podcast:funding>
+        <podcast:funding url="https://buymeacoffee.com/example">Buy Me a Coffee</podcast:funding>
+        </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        let podcast = feed.feed.podcast.as_ref().unwrap();
+
+        assert_eq!(podcast.funding.len(), 2);
+        assert_eq!(podcast.funding[0].url, "https://patreon.com/example");
+        assert_eq!(
+            podcast.funding[0].message.as_deref(),
+            Some("Support on Patreon")
+        );
+        assert_eq!(podcast.funding[1].url, "https://buymeacoffee.com/example");
+    }
+
+    #[test]
+    fn test_parse_rss_podcast_transcript() {
+        let xml = br#"<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
+        <channel>
+        <item>
+        <title>Episode 1</title>
+        <podcast:transcript url="https://example.com/transcripts/ep1.srt" type="application/srt" language="en" rel="captions"/>
+        <podcast:transcript url="https://example.com/transcripts/ep1.vtt" type="text/vtt"/>
+        </item>
+        </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        assert_eq!(feed.entries.len(), 1);
+
+        let transcripts = &feed.entries[0].podcast_transcripts;
+        assert_eq!(transcripts.len(), 2);
+
+        assert_eq!(
+            transcripts[0].url,
+            "https://example.com/transcripts/ep1.srt"
+        );
+        assert_eq!(
+            transcripts[0].transcript_type.as_deref(),
+            Some("application/srt")
+        );
+        assert_eq!(transcripts[0].language.as_deref(), Some("en"));
+        assert_eq!(transcripts[0].rel.as_deref(), Some("captions"));
+
+        assert_eq!(
+            transcripts[1].url,
+            "https://example.com/transcripts/ep1.vtt"
+        );
+        assert_eq!(transcripts[1].transcript_type.as_deref(), Some("text/vtt"));
+    }
+
+    #[test]
+    fn test_parse_rss_podcast_person() {
+        let xml = br#"<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
+        <channel>
+        <item>
+        <title>Episode 1</title>
+        <podcast:person role="host" href="https://example.com/host" img="https://example.com/host.jpg">Jane Doe</podcast:person>
+        <podcast:person role="guest">John Smith</podcast:person>
+        <podcast:person role="editor" group="production">Bob Editor</podcast:person>
+        </item>
+        </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        let persons = &feed.entries[0].podcast_persons;
+
+        assert_eq!(persons.len(), 3);
+
+        assert_eq!(persons[0].name, "Jane Doe");
+        assert_eq!(persons[0].role.as_deref(), Some("host"));
+        assert_eq!(persons[0].href.as_deref(), Some("https://example.com/host"));
+        assert_eq!(
+            persons[0].img.as_deref(),
+            Some("https://example.com/host.jpg")
+        );
+
+        assert_eq!(persons[1].name, "John Smith");
+        assert_eq!(persons[1].role.as_deref(), Some("guest"));
+
+        assert_eq!(persons[2].name, "Bob Editor");
+        assert_eq!(persons[2].role.as_deref(), Some("editor"));
+        assert_eq!(persons[2].group.as_deref(), Some("production"));
+    }
+
+    // PRIORITY 3: Namespace Tests
+
+    #[test]
+    fn test_parse_rss_dublin_core_channel() {
+        let xml = br#"<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
+        <channel>
+        <title>Test Feed</title>
+        <dc:creator>Jane Doe</dc:creator>
+        <dc:publisher>Example Publishing</dc:publisher>
+        <dc:date>2024-12-16T10:00:00Z</dc:date>
+        <dc:rights>CC BY 4.0</dc:rights>
+        <dc:subject>Technology</dc:subject>
+        </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        assert!(!feed.bozo);
+
+        // DC creator should populate author
+        assert_eq!(feed.feed.author.as_deref(), Some("Jane Doe"));
+
+        // DC publisher
+        assert_eq!(feed.feed.publisher.as_deref(), Some("Example Publishing"));
+
+        // DC date should populate updated
+        assert!(feed.feed.updated.is_some());
+
+        // DC rights
+        assert_eq!(feed.feed.rights.as_deref(), Some("CC BY 4.0"));
+
+        // DC subject should add tags
+        assert!(feed.feed.tags.iter().any(|t| t.term == "Technology"));
+    }
+
+    #[test]
+    fn test_parse_rss_content_encoded() {
+        let xml = br#"<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
+        <channel>
+        <item>
+        <title>Test Item</title>
+        <description>Plain text summary</description>
+        <content:encoded><![CDATA[
+        <p>This is HTML content with <a href="https://example.com">links</a></p>
+        <ul>
+          <li>Item 1</li>
+          <li>Item 2</li>
+        </ul>
+        ]]></content:encoded>
+        </item>
+        </channel>
+        </rss>"#;
+
+        let feed = parse_rss20(xml).unwrap();
+        let entry = &feed.entries[0];
+
+        // Summary should be the plain description
+        assert_eq!(entry.summary.as_deref(), Some("Plain text summary"));
+
+        // Content should contain the HTML
+        assert_eq!(entry.content.len(), 1);
+        assert!(entry.content[0].value.contains("HTML content"));
+        assert!(entry.content[0].value.contains("<ul>"));
+    }
 }