Commit 681abd8

Merge pull request #19 from bug-ops/refactor/split-long-functions
refactor: split long RSS parser functions for maintainability
2 parents e8b720b + 4ab5538 commit 681abd8

7 files changed, +3757 -513 lines changed

.github/copilot-instructions.md

Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
1+
# feedparser-rs GitHub Copilot Instructions
2+
3+
## Project Mission
4+
5+
High-performance RSS/Atom/JSON Feed parser in Rust with Python (PyO3) and Node.js (napi-rs) bindings. This is a drop-in replacement for Python's `feedparser` library with 10-100x performance improvement.
6+
7+
**CRITICAL**: API compatibility with Python feedparser is the #1 priority. Field names, types, and behavior must match exactly.
8+
9+
**MSRV:** Rust 1.88.0 | **Edition:** 2024 | **License:** MIT/Apache-2.0
10+
11+
## Architecture Overview
12+
13+
### Workspace Structure
14+
- **`crates/feedparser-rs-core`** — Pure Rust parser. All parsing logic lives here. NO dependencies on other workspace crates.
15+
- **`crates/feedparser-rs-py`** — Python bindings via PyO3/maturin. Depends on core.
16+
- **`crates/feedparser-rs-node`** — Node.js bindings via napi-rs. Depends on core.
17+
18+
### Parser Pipeline
19+
1. **Format Detection** (`parser/detect.rs`) — Identifies RSS 0.9x/1.0/2.0, Atom 0.3/1.0, or JSON Feed 1.0/1.1
20+
2. **Parsing** — Routes to `parser/rss.rs`, `parser/atom.rs`, or `parser/json.rs`
21+
3. **Namespace Extraction** — Handlers in `namespace/` process iTunes, Dublin Core, Media RSS, Podcast 2.0
22+
4. **Tolerant Error Handling** — Returns `ParsedFeed` with `bozo` flag set on errors, continues parsing
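
The first two pipeline steps can be sketched as a cheap prefix check followed by routing. This is a minimal illustration, not the contents of `parser/detect.rs`: `FeedKind`, `detect_kind`, and `parser_module` are hypothetical names, and the real detector also distinguishes individual versions (RSS 0.9x/1.0/2.0, Atom 0.3/1.0, JSON Feed 1.0/1.1).

```rust
/// Coarse feed family (hypothetical stand-in for the detector's output).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FeedKind {
    Rss,
    Atom,
    Json,
    Unknown,
}

/// Guesses the feed family from the start of the document.
pub fn detect_kind(data: &[u8]) -> FeedKind {
    // Inspect a bounded prefix only, so detection stays cheap on large feeds.
    let prefix = &data[..data.len().min(1024)];
    let text = String::from_utf8_lossy(prefix);
    // Skip an optional byte-order mark and leading whitespace.
    let trimmed = text.trim_start_matches('\u{feff}').trim_start();

    if trimmed.starts_with('{') {
        return FeedKind::Json;
    }
    // RSS 1.0 uses an <rdf:RDF> root; RSS 0.9x and 2.0 use <rss>.
    if trimmed.contains("<rss") || trimmed.contains("<rdf:RDF") {
        return FeedKind::Rss;
    }
    if trimmed.contains("<feed") {
        return FeedKind::Atom;
    }
    FeedKind::Unknown
}

/// Maps a detected family to the parser module named in the pipeline.
pub fn parser_module(kind: FeedKind) -> Option<&'static str> {
    match kind {
        FeedKind::Rss => Some("parser/rss.rs"),
        FeedKind::Atom => Some("parser/atom.rs"),
        FeedKind::Json => Some("parser/json.rs"),
        FeedKind::Unknown => None,
    }
}
```

An `Unknown` result is not a reason to fail: under the bozo pattern, the caller would set `bozo` and return whatever could be extracted.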
23+
24+
## Idiomatic Rust & Performance
25+
26+
### Type Safety First
27+
- Prefer strong types over primitives: `FeedVersion` enum, not `&str`
28+
- Use `Option<T>` and `Result<T, E>` — never sentinel values
29+
- Leverage generics and trait bounds for reusable code:
```rust
fn collect_limited<T, I: Iterator<Item = T>>(iter: I, limit: usize) -> Vec<T> {
    iter.take(limit).collect()
}
```
35+
36+
### Zero-Cost Abstractions
37+
- Use `&str` over `String` in function parameters
38+
- Prefer iterators over index-based loops: `.iter().filter().map()`
39+
- Use `Cow<'_, str>` when ownership is conditionally needed
40+
- Avoid allocations in hot paths — reuse buffers where possible
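
A small sketch of the `Cow<'_, str>` guideline: borrow when the text needs no change, allocate only when it does. `decode_amp` is a hypothetical helper used for illustration; it handles a single entity and is not the project's entity decoder.

```rust
use std::borrow::Cow;

/// Replaces `&amp;` with `&`, allocating only when a replacement is needed.
pub fn decode_amp(text: &str) -> Cow<'_, str> {
    if text.contains("&amp;") {
        // Slow path: at least one entity is present, so build a new String.
        Cow::Owned(text.replace("&amp;", "&"))
    } else {
        // Fast path: hand the input back without copying.
        Cow::Borrowed(text)
    }
}
```

Most feed text contains no entities, so the common case stays allocation-free, while callers can still treat the result as a `&str` through deref.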
41+
42+
### Edition 2024 Features
43+
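- Treat `gen` as a reserved keyword (Edition 2024 reserves it); `gen` blocks themselves are still unstable, so write custom iterators with `std::iter::from_fn` or explicit `Iterator` impls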
44+
- Leverage improved async patterns (for example, async closures, stable since Rust 1.85) in the HTTP module
45+
- Rely on the Edition 2024 lifetime capture rules for return-position `impl Trait` (all in-scope lifetimes are captured by default) for cleaner signatures
46+
47+
### Safety Guidelines
48+
- `#![warn(unsafe_code)]` is enabled — avoid `unsafe` unless absolutely necessary
49+
- All public APIs must have doc comments (`#![warn(missing_docs)]`)
50+
- Use `thiserror` for error types with proper `#[error]` attributes
51+
52+
## Critical Conventions
53+
54+
### Error Handling: Bozo Pattern (MANDATORY)
55+
**Never panic or return errors for malformed input.** Set `bozo = true` and continue:
```rust
match parse_date(&text) {
    Some(dt) => entry.published = Some(dt),
    None => {
        feed.bozo = true;
        feed.bozo_exception = Some(format!("Invalid date: {text}"));
        // Continue parsing!
    }
}
```
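
A self-contained sketch of the same pattern across several entries. The types here are reduced stand-ins, `parse_year` is a toy replacement for `parse_date`, and keeping only the first error message is an assumed policy, not confirmed project behavior.

```rust
#[derive(Debug, Default)]
pub struct Entry {
    pub published_year: Option<u16>,
}

#[derive(Debug, Default)]
pub struct ParsedFeed {
    pub bozo: bool,
    pub bozo_exception: Option<String>,
    pub entries: Vec<Entry>,
}

impl ParsedFeed {
    /// Marks the feed as malformed; only the first message is kept (assumption).
    pub fn flag_bozo(&mut self, message: String) {
        self.bozo = true;
        if self.bozo_exception.is_none() {
            self.bozo_exception = Some(message);
        }
    }
}

/// Toy stand-in for `parse_date`: accepts a four-character year only.
fn parse_year(text: &str) -> Option<u16> {
    let text = text.trim();
    if text.len() == 4 { text.parse().ok() } else { None }
}

/// Builds one entry per raw date, flagging bad input instead of failing.
pub fn collect_entries(raw_dates: &[&str]) -> ParsedFeed {
    let mut feed = ParsedFeed::default();
    for raw in raw_dates {
        let year = parse_year(raw);
        if year.is_none() {
            feed.flag_bozo(format!("Invalid date: {raw}"));
        }
        // The entry is kept either way; only the bad field stays empty.
        feed.entries.push(Entry { published_year: year });
    }
    feed
}
```

Routing every failure through one method keeps the "flag and continue" behavior uniform across parser functions.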
66+
67+
### API Compatibility with Python feedparser
68+
Field names must match `feedparser` exactly (`feed.title`, `entries[0].summary`), and `version` must return the same strings, such as `"rss20"` or `"atom10"`.
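
One way to reconcile the strongly typed `FeedVersion` enum with feedparser-compatible version strings is a single `as_str` mapping that the bindings call. The sketch below covers only the two strings named above; the `Atom10` and `Unknown` variants and the empty string for unknown feeds are assumptions.

```rust
use std::fmt;

/// Reduced version enum; the real one has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum FeedVersion {
    Rss20,
    Atom10,
    #[default]
    Unknown,
}

impl FeedVersion {
    /// Returns the feedparser-style version string.
    pub const fn as_str(self) -> &'static str {
        match self {
            Self::Rss20 => "rss20",
            Self::Atom10 => "atom10",
            // Assumption: unknown feeds map to an empty string.
            Self::Unknown => "",
        }
    }
}

impl fmt::Display for FeedVersion {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}
```

Keeping the mapping in the core crate means the Python and Node.js bindings cannot drift apart on spelling.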
69+
70+
### XML Parsing with quick-xml
71+
Use tolerant mode — no strict validation:
```rust
let mut reader = Reader::from_reader(data);
reader.config_mut().trim_text(true);
// Do NOT enable check_end_names: tolerance over strictness
```
77+
78+
## Development Commands
79+
80+
All automation via `cargo-make`:
81+
| Command | Purpose |
|---------|---------|
| `cargo make fmt` | Format with nightly rustfmt |
| `cargo make clippy` | Lint (excludes py bindings) |
| `cargo make test-rust` | Rust tests (nextest) |
| `cargo make pre-commit` | fmt + clippy + test-rust |
| `cargo make bench` | Criterion benchmarks |
| `cargo make msrv-check` | Verify MSRV 1.88.0 compatibility |
90+
91+
### Bindings
```bash
# Python
cd crates/feedparser-rs-py && maturin develop && pytest tests/ -v

# Node.js
cd crates/feedparser-rs-node && pnpm install && pnpm build && pnpm test
```
99+
100+
## Testing Patterns
101+
102+
Use `include_str!()` for fixtures in `tests/fixtures/`:
```rust
#[test]
fn test_rss20_basic() {
    let xml = include_str!("../../tests/fixtures/rss/example.xml");
    let feed = parse(xml.as_bytes()).unwrap();
    assert!(!feed.bozo);
}
```
111+
112+
Always verify malformed feeds set bozo but still parse:
```rust
#[test]
fn test_malformed_sets_bozo() {
    let xml = b"<rss><channel><title>Broken</title></rss>";
    let feed = parse(xml).unwrap();
    assert!(feed.bozo);
    assert_eq!(feed.feed.title.as_deref(), Some("Broken")); // Still parsed!
}
```
122+
123+
## Security Requirements
124+
125+
### SSRF Protection (CRITICAL for HTTP Module)
126+
Block these URL patterns before fetching:
127+
- Localhost/loopback: `127.0.0.1`, `[::1]`, `localhost`
128+
- Private networks: `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`
129+
- Link-local: `169.254.0.0/16` (AWS/GCP metadata endpoints), `fe80::/10`
130+
- Special addresses: `0.0.0.0/8`, `255.255.255.255`, `::/128`
131+
132+
Always validate URLs through `is_safe_url()` before HTTP requests.
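
The address ranges listed above can be checked with the standard library alone. This is a sketch of the IP-level part only: `is_blocked_ip` is a hypothetical helper, not the project's `is_safe_url()`, and hostnames such as `localhost` still have to be resolved (or rejected by name) before this check applies.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()            // 127.0.0.0/8
        || ip.is_private()      // 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
        || ip.is_link_local()   // 169.254.0.0/16 (cloud metadata endpoints)
        || ip.is_broadcast()    // 255.255.255.255
        || ip.octets()[0] == 0  // 0.0.0.0/8
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    // IPv4-mapped addresses (::ffff:a.b.c.d) must not bypass the IPv4 rules.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_blocked_v4(v4);
    }
    ip.is_loopback()                              // ::1
        || ip.is_unspecified()                    // ::
        || (ip.segments()[0] & 0xffc0) == 0xfe80  // fe80::/10
}

/// Returns true when the address falls in one of the blocked ranges.
pub fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        IpAddr::V6(v6) => is_blocked_v6(v6),
    }
}
```

The check should run on the address the HTTP client actually connects to. Validating a hostname and then resolving it again later leaves room for DNS rebinding.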
133+
134+
### XSS Protection (HTML Sanitization)
135+
Use `ammonia` for HTML content from feeds:
136+
- Allowed tags: `a, abbr, b, blockquote, br, code, div, em, h1-h6, hr, i, img, li, ol, p, pre, span, strong, ul`
137+
- Enforce `rel="nofollow noopener"` on links
138+
- Allow only `http`, `https`, `mailto` URL schemes
139+
- Never pass raw HTML to Python/Node.js bindings without sanitization
140+
141+
### DoS Protection
142+
Apply limits via `ParserLimits`:
143+
- `max_feed_size`: Default 50MB
144+
- `max_nesting_depth`: Default 100 levels
145+
- `max_entries`: Default 10,000 items
146+
- `max_text_length`: Default 1MB per text field
147+
- `max_attribute_length`: Default 10KB per attribute
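
The defaults above could be expressed as a plain struct with a `Default` impl. The field names come from the list; the `usize` types, the binary interpretation of MB/KB, and the `check_feed_size` helper are assumptions made for this sketch.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParserLimits {
    pub max_feed_size: usize,
    pub max_nesting_depth: usize,
    pub max_entries: usize,
    pub max_text_length: usize,
    pub max_attribute_length: usize,
}

impl Default for ParserLimits {
    fn default() -> Self {
        Self {
            max_feed_size: 50 * 1024 * 1024, // 50MB (binary units assumed)
            max_nesting_depth: 100,
            max_entries: 10_000,
            max_text_length: 1024 * 1024,    // 1MB per text field
            max_attribute_length: 10 * 1024, // 10KB per attribute
        }
    }
}

impl ParserLimits {
    /// Hypothetical guard: rejects input larger than `max_feed_size`.
    pub fn check_feed_size(&self, len: usize) -> Result<(), String> {
        if len > self.max_feed_size {
            Err(format!(
                "feed is {len} bytes, limit is {} bytes",
                self.max_feed_size
            ))
        } else {
            Ok(())
        }
    }
}
```

Callers that need stricter limits can override single fields with struct update syntax, for example `ParserLimits { max_entries: 100, ..ParserLimits::default() }`.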
148+
149+
## Code Quality Standards
150+
151+
### Function Length Guidelines
152+
- **Target**: Functions should be <50 lines
153+
- **Maximum**: NEVER exceed 100 lines
154+
- **If >50 lines**: Extract inline logic to helper functions
155+
156+
Example refactoring pattern:
```rust
// Before: 200+ line function
fn parse_channel(...) {
    match tag {
        b"itunes:category" => { /* 80 lines inline */ }
        // ...
    }
}

// After: Delegate to helpers
fn parse_channel(...) {
    match tag {
        tag if is_itunes_tag_any(tag) => parse_channel_itunes(tag, ...)?,
        // ...
    }
}
```
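
A compilable toy version of the "after" shape, to make the guard-and-delegate structure concrete. Only the helper names `is_itunes_tag_any` and `parse_channel_itunes` come from the pattern above; their bodies, the `Channel` struct, and `parse_channel_tag` are assumptions, and error propagation with `?` is omitted.

```rust
#[derive(Debug, Default)]
pub struct Channel {
    pub title_seen: bool,
    pub itunes_tags: Vec<String>,
}

/// Returns true for any tag carrying the `itunes:` prefix.
pub fn is_itunes_tag_any(tag: &[u8]) -> bool {
    tag.starts_with(b"itunes:")
}

/// Namespace-specific helper; here it only records the tag name.
fn parse_channel_itunes(tag: &[u8], channel: &mut Channel) {
    channel
        .itunes_tags
        .push(String::from_utf8_lossy(tag).into_owned());
}

/// The dispatcher stays short: one arm per tag family, logic in helpers.
pub fn parse_channel_tag(tag: &[u8], channel: &mut Channel) {
    match tag {
        b"title" => channel.title_seen = true,
        tag if is_itunes_tag_any(tag) => parse_channel_itunes(tag, channel),
        // Unknown tags are skipped rather than treated as errors.
        _ => {}
    }
}
```

Each helper can then be unit-tested on its own, which is hard to do for an 80-line inline match arm.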
174+
175+
### Documentation Requirements
176+
All public APIs must have doc comments:
```rust
/// Parses an RSS/Atom feed from bytes.
///
/// # Arguments
/// * `data` - Raw feed content as bytes
///
/// # Returns
/// Returns `ParsedFeed` with extracted metadata. If parsing encounters errors,
/// `bozo` flag is set to `true` and `bozo_exception` contains the error description.
///
/// # Examples
/// ```
/// let xml = b"<rss version=\"2.0\">...</rss>";
/// let feed = parse(xml)?;
/// assert_eq!(feed.version, FeedVersion::Rss20);
/// ```
pub fn parse(data: &[u8]) -> Result<ParsedFeed> { ... }
```
195+
196+
### Inline Comments
197+
Minimize inline comments. Use comments ONLY for:
198+
1. **Why** decisions (not **what** the code does)
199+
2. Non-obvious constraints or workarounds
200+
3. References to specifications (RFC 4287 section 4.1.2, etc.)
201+
202+
## Commit & Branch Conventions
203+
- Branch: `feat/`, `fix/`, `docs/`, `refactor/`, `test/`
204+
- Commits: [Conventional Commits](https://conventionalcommits.org/)
205+
- Never mention "Claude" or "co-authored" in commit messages
206+
207+
## What NOT to Do
208+
- ❌ Don't use `.unwrap()` or `.expect()` in parser code — use bozo pattern
209+
- ❌ Don't add dependencies without workspace-level declaration in root `Cargo.toml`
210+
- ❌ Don't skip `--exclude feedparser-rs-py` in workspace-wide Rust commands (PyO3 needs special handling)
211+
- ❌ Don't break API compatibility with Python feedparser field names
212+
- ❌ Don't panic on malformed feeds — set `bozo = true` and continue parsing
213+
- ❌ Don't fetch URLs without SSRF validation (`is_safe_url()`)
214+
- ❌ Don't pass raw HTML to bindings without sanitization (`sanitize_html()`)
215+
- ❌ Don't create functions >100 lines — extract helpers
216+
- ❌ Don't use generic names like `utils`, `helpers`, `common` for modules
217+
- ❌ Don't add emojis to code or comments
