| 1 | +# feedparser-rs GitHub Copilot Instructions |
| 2 | + |
| 3 | +## Project Mission |
| 4 | + |
| 5 | +High-performance RSS/Atom/JSON Feed parser in Rust with Python (PyO3) and Node.js (napi-rs) bindings. It is a drop-in replacement for Python's `feedparser` library with a 10-100x performance improvement. |
| 6 | + |
| 7 | +**CRITICAL**: API compatibility with Python feedparser is the #1 priority. Field names, types, and behavior must match exactly. |
| 8 | + |
| 9 | +**MSRV:** Rust 1.88.0 | **Edition:** 2024 | **License:** MIT/Apache-2.0 |
| 10 | + |
| 11 | +## Architecture Overview |
| 12 | + |
| 13 | +### Workspace Structure |
| 14 | +- **`crates/feedparser-rs-core`** — Pure Rust parser. All parsing logic lives here. NO dependencies on other workspace crates. |
| 15 | +- **`crates/feedparser-rs-py`** — Python bindings via PyO3/maturin. Depends on core. |
| 16 | +- **`crates/feedparser-rs-node`** — Node.js bindings via napi-rs. Depends on core. |
| 17 | + |
| 18 | +### Parser Pipeline |
| 19 | +1. **Format Detection** (`parser/detect.rs`) — Identifies RSS 0.9x/1.0/2.0, Atom 0.3/1.0, or JSON Feed 1.0/1.1 |
| 20 | +2. **Parsing** — Routes to `parser/rss.rs`, `parser/atom.rs`, or `parser/json.rs` |
| 21 | +3. **Namespace Extraction** — Handlers in `namespace/` process iTunes, Dublin Core, Media RSS, Podcast 2.0 |
| 22 | +4. **Tolerant Error Handling** — Returns `ParsedFeed` with `bozo` flag set on errors, continues parsing |
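To make the first pipeline step concrete, here is a deliberately naive sketch of format sniffing. The contents of `parser/detect.rs` are not shown in this document, so the `FeedFamily` enum and `sniff_family` function below are hypothetical names, and a real detector would also need to distinguish the individual versions (RSS 0.9x/1.0/2.0, Atom 0.3/1.0, JSON Feed 1.0/1.1).

```rust
/// Hypothetical coarse classification; not the project's actual type.
#[derive(Debug, PartialEq, Eq)]
enum FeedFamily {
    Rss,
    Atom,
    Json,
    Unknown,
}

/// Naive sniffing: JSON feeds start with `{`, XML feeds are recognized
/// by their root element name. Invalid UTF-8 is replaced, never rejected.
fn sniff_family(data: &[u8]) -> FeedFamily {
    let text = String::from_utf8_lossy(data);
    // Skip a UTF-8 byte order mark and leading whitespace.
    let trimmed = text.trim_start_matches('\u{feff}').trim_start();

    if trimmed.starts_with('{') {
        return FeedFamily::Json;
    }
    if trimmed.contains("<rss") || trimmed.contains("<rdf:RDF") {
        return FeedFamily::Rss;
    }
    if trimmed.contains("<feed") {
        return FeedFamily::Atom;
    }
    FeedFamily::Unknown
}
```

Unrecognized input maps to `Unknown` instead of an error, which keeps the sketch in line with the tolerant handling described in step 4.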
| 23 | + |
| 24 | +## Idiomatic Rust & Performance |
| 25 | + |
| 26 | +### Type Safety First |
| 27 | +- Prefer strong types over primitives: `FeedVersion` enum, not `&str` |
| 28 | +- Use `Option<T>` and `Result<T, E>` — never sentinel values |
| 29 | +- Leverage generics and trait bounds for reusable code: |
| 30 | +```rust |
| 31 | +fn collect_limited<T, I: Iterator<Item = T>>(iter: I, limit: usize) -> Vec<T> { |
| 32 | + iter.take(limit).collect() |
| 33 | +} |
| 34 | +``` |
| 35 | + |
| 36 | +### Zero-Cost Abstractions |
| 37 | +- Use `&str` over `String` in function parameters |
| 38 | +- Prefer iterators over index-based loops: `.iter().filter().map()` |
| 39 | +- Use `Cow<'_, str>` when ownership is conditionally needed |
| 40 | +- Avoid allocations in hot paths — reuse buffers where possible |
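As an illustration of the `Cow<'_, str>` guideline, the sketch below borrows the input when it is already clean and allocates only when it has to rewrite it. `collapse_whitespace` is a hypothetical helper invented for this example, not a function known to exist in the crate.

```rust
use std::borrow::Cow;

/// Hypothetical helper: trims the input and collapses internal whitespace
/// runs to a single space, allocating only when a rewrite is required.
fn collapse_whitespace(input: &str) -> Cow<'_, str> {
    let trimmed = input.trim();

    // Fast path: no double spaces and no tabs/newlines, so a borrowed
    // slice of the original input is already the final answer.
    let clean = !trimmed.contains("  ")
        && !trimmed.chars().any(|c| c.is_whitespace() && c != ' ');
    if clean {
        return Cow::Borrowed(trimmed);
    }

    // Slow path: rebuild the string with single spaces between words.
    let mut out = String::with_capacity(trimmed.len());
    for word in trimmed.split_whitespace() {
        if !out.is_empty() {
            out.push(' ');
        }
        out.push_str(word);
    }
    Cow::Owned(out)
}
```

Callers can treat the result as a `&str` through deref and call `.into_owned()` only when they need to store it, so well-formed feeds pay no allocation cost here.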
| 41 | + |
| 42 | +### Edition 2024 Features |
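| 43 | +- `gen` is a reserved keyword in Edition 2024, but `gen` blocks are still unstable (nightly-only); on stable, write custom iterators with `std::iter::from_fn` or an `Iterator` impl |
| 44 | +- Use async closures (`async || { ... }`, stable since Rust 1.85) in the HTTP module where they simplify code |
| 45 | +- Rely on the 2024 `impl Trait` capture rules (return-position `impl Trait` captures all in-scope lifetimes; narrow with `use<..>` when needed) for cleaner signatures |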
| 46 | + |
| 47 | +### Safety Guidelines |
| 48 | +- `#![warn(unsafe_code)]` is enabled — avoid `unsafe` unless absolutely necessary |
| 49 | +- All public APIs must have doc comments (`#![warn(missing_docs)]`) |
| 50 | +- Use `thiserror` for error types with proper `#[error]` attributes |
| 51 | + |
| 52 | +## Critical Conventions |
| 53 | + |
| 54 | +### Error Handling: Bozo Pattern (MANDATORY) |
| 55 | +**Never panic or return errors for malformed input.** Set `bozo = true` and continue: |
| 56 | +```rust |
| 57 | +match parse_date(&text) { |
| 58 | + Some(dt) => entry.published = Some(dt), |
| 59 | + None => { |
| 60 | + feed.bozo = true; |
| 61 | + feed.bozo_exception = Some(format!("Invalid date: {text}")); |
| 62 | + // Continue parsing! |
| 63 | + } |
| 64 | +} |
| 65 | +``` |
| 66 | + |
| 67 | +### API Compatibility with Python feedparser |
| 68 | +Field names must match `feedparser` exactly, for example `feed.title` and `entries[0].summary`; `version` returns strings such as `"rss20"` and `"atom10"`. |
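One way to keep a strongly typed `FeedVersion` while still exposing feedparser-style version strings is a single mapping method. This is a minimal sketch: the variant list is an illustrative subset, and `as_str` plus the `Unknown` variant are assumptions for the example, not the crate's confirmed API.

```rust
use std::fmt;

/// Illustrative subset of feed versions; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FeedVersion {
    Rss10,
    Rss20,
    Atom03,
    Atom10,
    Unknown,
}

impl FeedVersion {
    /// Returns the version string in the style Python feedparser uses.
    pub const fn as_str(self) -> &'static str {
        match self {
            FeedVersion::Rss10 => "rss10",
            FeedVersion::Rss20 => "rss20",
            FeedVersion::Atom03 => "atom03",
            FeedVersion::Atom10 => "atom10",
            // Python feedparser reports an empty string for unknown formats.
            FeedVersion::Unknown => "",
        }
    }
}

impl fmt::Display for FeedVersion {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}
```

The bindings can then expose `version` as a plain string by calling `as_str()`, while Rust callers keep matching on the enum.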
| 69 | + |
| 70 | +### XML Parsing with quick-xml |
| 71 | +Use tolerant mode — no strict validation: |
| 72 | +```rust |
| 73 | +let mut reader = Reader::from_reader(data); |
| 74 | +reader.config_mut().trim_text(true); |
| 75 | +// check_end_names defaults to true in quick-xml; set config_mut().check_end_names = false to accept mismatched end tags |
| 76 | +``` |
| 77 | + |
| 78 | +## Development Commands |
| 79 | + |
| 80 | +All automation via `cargo-make`: |
| 81 | + |
| 82 | +| Command | Purpose | |
| 83 | +|---------|---------| |
| 84 | +| `cargo make fmt` | Format with nightly rustfmt | |
| 85 | +| `cargo make clippy` | Lint (excludes py bindings) | |
| 86 | +| `cargo make test-rust` | Rust tests (nextest) | |
| 87 | +| `cargo make pre-commit` | fmt + clippy + test-rust | |
| 88 | +| `cargo make bench` | Criterion benchmarks | |
| 89 | +| `cargo make msrv-check` | Verify MSRV 1.88.0 compatibility | |
| 90 | + |
| 91 | +### Bindings |
| 92 | +```bash |
| 93 | +# Python |
| 94 | +cd crates/feedparser-rs-py && maturin develop && pytest tests/ -v |
| 95 | + |
| 96 | +# Node.js |
| 97 | +cd crates/feedparser-rs-node && pnpm install && pnpm build && pnpm test |
| 98 | +``` |
| 99 | + |
| 100 | +## Testing Patterns |
| 101 | + |
| 102 | +Use `include_str!()` for fixtures in `tests/fixtures/`: |
| 103 | +```rust |
| 104 | +#[test] |
| 105 | +fn test_rss20_basic() { |
| 106 | + let xml = include_str!("../../tests/fixtures/rss/example.xml"); |
| 107 | + let feed = parse(xml.as_bytes()).unwrap(); |
| 108 | + assert!(!feed.bozo); |
| 109 | +} |
| 110 | +``` |
| 111 | + |
| 112 | +Always verify malformed feeds set bozo but still parse: |
| 113 | +```rust |
| 114 | +#[test] |
| 115 | +fn test_malformed_sets_bozo() { |
| 116 | + let xml = b"<rss><channel><title>Broken</title></rss>"; |
| 117 | + let feed = parse(xml).unwrap(); |
| 118 | + assert!(feed.bozo); |
| 119 | + assert_eq!(feed.feed.title.as_deref(), Some("Broken")); // Still parsed! |
| 120 | +} |
| 121 | +``` |
| 122 | + |
| 123 | +## Security Requirements |
| 124 | + |
| 125 | +### SSRF Protection (CRITICAL for HTTP Module) |
| 126 | +Block these URL patterns before fetching: |
| 127 | +- Localhost/loopback: `127.0.0.0/8` (including `127.0.0.1`), `[::1]`, `localhost` |
| 128 | +- Private networks: `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16` |
| 129 | +- Link-local: `169.254.0.0/16` (AWS/GCP metadata endpoints), `fe80::/10` |
| 130 | +- Special addresses: `0.0.0.0/8`, `255.255.255.255`, `::/128` |
| 131 | + |
| 132 | +Always validate URLs through `is_safe_url()` before HTTP requests. |
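The address ranges above can be checked with the standard library alone. The sketch below covers only the IP-literal part of the problem; `is_blocked_ip` is a hypothetical helper, and the body of the project's `is_safe_url()` is not shown in this document. A complete check would also need to handle the `localhost` name and validate the addresses a hostname resolves to.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Returns true for IPv4 addresses in the blocked ranges listed above.
fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()            // 127.0.0.0/8
        || ip.is_private()      // 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
        || ip.is_link_local()   // 169.254.0.0/16 (cloud metadata endpoints)
        || ip.is_broadcast()    // 255.255.255.255
        || ip.octets()[0] == 0  // 0.0.0.0/8
}

/// Returns true for blocked IPv6 addresses, including IPv4-mapped ones.
fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    // ::ffff:a.b.c.d must not bypass the IPv4 rules.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_blocked_v4(v4);
    }
    ip.is_loopback()                               // ::1
        || ip.is_unspecified()                     // ::
        || (ip.segments()[0] & 0xffc0) == 0xfe80   // fe80::/10
}

/// Hypothetical entry point: true if the address must not be fetched.
pub fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        IpAddr::V6(v6) => is_blocked_v6(v6),
    }
}
```

Running the check on the resolved address rather than on the URL string is what guards against hostnames that point at internal ranges.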
| 133 | + |
| 134 | +### XSS Protection (HTML Sanitization) |
| 135 | +Use `ammonia` for HTML content from feeds: |
| 136 | +- Allowed tags: `a, abbr, b, blockquote, br, code, div, em, h1-h6, hr, i, img, li, ol, p, pre, span, strong, ul` |
| 137 | +- Enforce `rel="nofollow noopener"` on links |
| 138 | +- Allow only `http`, `https`, `mailto` URL schemes |
| 139 | +- Never pass raw HTML to Python/Node.js bindings without sanitization |
| 140 | + |
| 141 | +### DoS Protection |
| 142 | +Apply limits via `ParserLimits`: |
| 143 | +- `max_feed_size`: Default 50MB |
| 144 | +- `max_nesting_depth`: Default 100 levels |
| 145 | +- `max_entries`: Default 10,000 items |
| 146 | +- `max_text_length`: Default 1MB per text field |
| 147 | +- `max_attribute_length`: Default 10KB per attribute |
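The defaults listed above map naturally onto a `Default` implementation. This is a sketch under assumptions: the field names come from the list, but the `usize` types and the use of binary units (1MB = 1024 * 1024 bytes) are guesses, since the document does not specify them.

```rust
/// Sketch of the limits struct; field types and units are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParserLimits {
    /// Maximum size of the whole feed, in bytes.
    pub max_feed_size: usize,
    /// Maximum depth of nested XML elements.
    pub max_nesting_depth: usize,
    /// Maximum number of entries kept per feed.
    pub max_entries: usize,
    /// Maximum length of a single text field, in bytes.
    pub max_text_length: usize,
    /// Maximum length of a single attribute value, in bytes.
    pub max_attribute_length: usize,
}

impl Default for ParserLimits {
    fn default() -> Self {
        Self {
            max_feed_size: 50 * 1024 * 1024, // 50MB
            max_nesting_depth: 100,
            max_entries: 10_000,
            max_text_length: 1024 * 1024,    // 1MB
            max_attribute_length: 10 * 1024, // 10KB
        }
    }
}
```

Callers that need stricter settings can override individual fields with struct update syntax, for example `ParserLimits { max_entries: 100, ..ParserLimits::default() }`.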
| 148 | + |
| 149 | +## Code Quality Standards |
| 150 | + |
| 151 | +### Function Length Guidelines |
| 152 | +- **Target**: Functions should be <50 lines |
| 153 | +- **Maximum**: NEVER exceed 100 lines |
| 154 | +- **If >50 lines**: Extract inline logic to helper functions |
| 155 | + |
| 156 | +Example refactoring pattern: |
| 157 | +```rust |
| 158 | +// Before: 200+ line function |
| 159 | +fn parse_channel(...) { |
| 160 | + match tag { |
| 161 | + b"itunes:category" => { /* 80 lines inline */ } |
| 162 | + // ... |
| 163 | + } |
| 164 | +} |
| 165 | + |
| 166 | +// After: Delegate to helpers |
| 167 | +fn parse_channel(...) { |
| 168 | + match tag { |
| 169 | + tag if is_itunes_tag_any(tag) => parse_channel_itunes(tag, ...)?, |
| 170 | + // ... |
| 171 | + } |
| 172 | +} |
| 173 | +``` |
| 174 | + |
| 175 | +### Documentation Requirements |
| 176 | +All public APIs must have doc comments: |
| 177 | +```rust |
| 178 | +/// Parses an RSS/Atom feed from bytes. |
| 179 | +/// |
| 180 | +/// # Arguments |
| 181 | +/// * `data` - Raw feed content as bytes |
| 182 | +/// |
| 183 | +/// # Returns |
| 184 | +/// Returns `ParsedFeed` with extracted metadata. If parsing encounters errors, |
| 185 | +/// `bozo` flag is set to `true` and `bozo_exception` contains the error description. |
| 186 | +/// |
| 187 | +/// # Examples |
| 188 | +/// ``` |
| 189 | +/// let xml = b"<rss version=\"2.0\">...</rss>"; |
| 190 | +/// let feed = parse(xml)?; |
| 191 | +/// assert_eq!(feed.version, FeedVersion::Rss20); |
| 192 | +/// ``` |
| 193 | +pub fn parse(data: &[u8]) -> Result<ParsedFeed> { ... } |
| 194 | +``` |
| 195 | + |
| 196 | +### Inline Comments |
| 197 | +Minimize inline comments. Use comments ONLY for: |
| 198 | +1. **Why** decisions (not **what** the code does) |
| 199 | +2. Non-obvious constraints or workarounds |
| 200 | +3. References to specifications (RFC 4287 section 4.1.2, etc.) |
| 201 | + |
| 202 | +## Commit & Branch Conventions |
| 203 | +- Branch: `feat/`, `fix/`, `docs/`, `refactor/`, `test/` |
| 204 | +- Commits: [Conventional Commits](https://conventionalcommits.org/) |
| 205 | +- Never mention "Claude" or "co-authored" in commit messages |
| 206 | + |
| 207 | +## What NOT to Do |
| 208 | +- ❌ Don't use `.unwrap()` or `.expect()` in parser code — use bozo pattern |
| 209 | +- ❌ Don't add dependencies without workspace-level declaration in root `Cargo.toml` |
| 210 | +- ❌ Don't skip `--exclude feedparser-rs-py` in workspace-wide Rust commands (PyO3 needs special handling) |
| 211 | +- ❌ Don't break API compatibility with Python feedparser field names |
| 212 | +- ❌ Don't panic on malformed feeds — set `bozo = true` and continue parsing |
| 213 | +- ❌ Don't fetch URLs without SSRF validation (`is_safe_url()`) |
| 214 | +- ❌ Don't pass raw HTML to bindings without sanitization (`sanitize_html()`) |
| 215 | +- ❌ Don't create functions >100 lines — extract helpers |
| 216 | +- ❌ Don't use generic names like `utils`, `helpers`, `common` for modules |
| 217 | +- ❌ Don't add emojis to code or comments |