# CLAUDE.md

Project guidance for Claude Code when working with this repository.

## Project Overview

Rust library implementing the Content Extraction via Text Density (CETD) algorithm, which extracts the main content of web pages by analyzing text density patterns.

## Recent Progress

### ✅ Completed Features
- **Markdown Extraction**: Structured markdown output using CETD density analysis
- **HTTP Client**: Migrated to wreq for browser emulation and TLS fingerprinting
- **Encoding Support**: Full non-UTF-8 encoding support using chardetng

### 🔧 Current Status
- **CLI Tool**: Fully functional with URL/file input and text/markdown output
- **Library API**: Stable with a comprehensive feature set
- **Testing**: Comprehensive test suite

## Architecture

### Core Components
- **`DensityTree`** (`src/cetd.rs`): Main structure for text density analysis of HTML documents
- **`DensityNode`** (`src/cetd.rs`): Individual nodes holding text density metrics (character count, tag count, link density)
- **Tree operations** (`src/tree.rs`): HTML traversal and per-node metrics calculation
- **Unicode handling** (`src/unicode.rs`): Character counting with grapheme clusters and Unicode normalization
- **Utilities** (`src/utils.rs`): Text extraction and link analysis helpers
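Why `src/unicode.rs` counts grapheme clusters rather than `char`s can be shown with the standard library alone (the grapheme count mentioned in the comment comes from the `unicode-segmentation` crate, not std):

```rust
fn main() {
    // "é" written as 'e' followed by COMBINING ACUTE ACCENT (U+0301).
    let s = "e\u{301}";
    assert_eq!(s.chars().count(), 2); // two Unicode scalar values
    assert_eq!(s.len(), 3);           // three UTF-8 bytes
    // A reader sees a single character; unicode-segmentation's
    // s.graphemes(true).count() reports 1, which is why the library
    // counts grapheme clusters for its character metrics.
    println!("chars: {}, bytes: {}", s.chars().count(), s.len());
}
```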
### Algorithm Flow
1. Parse the HTML document with `scraper::Html`
2. Build a density tree mirroring the HTML structure (`DensityTree::from_document`)
3. Calculate text density metrics for each node
4. Compute composite density scores (`calculate_density_sum`)
5. Extract high-density regions as main content
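The metrics behind steps 3–4 can be sketched in plain Rust. This is a simplified illustration of the text-density idea, not the library's actual code; the struct and field names are hypothetical:

```rust
// A node's text density is roughly characters per tag; link-heavy
// nodes (navigation, footers) are flagged via link density.
struct NodeMetrics {
    char_count: u32,      // characters of text under the node
    tag_count: u32,       // number of descendant tags
    link_char_count: u32, // characters that sit inside links
}

impl NodeMetrics {
    /// Text density: characters per tag (tag count floored at 1).
    fn density(&self) -> f64 {
        self.char_count as f64 / self.tag_count.max(1) as f64
    }

    /// Link density: fraction of the node's text that is link text.
    fn link_density(&self) -> f64 {
        if self.char_count == 0 {
            0.0
        } else {
            self.link_char_count as f64 / self.char_count as f64
        }
    }
}

fn main() {
    // An article body: much text, few tags, few linked characters.
    let content = NodeMetrics { char_count: 800, tag_count: 10, link_char_count: 40 };
    // A navigation bar: little text, many tags, mostly links.
    let navbar = NodeMetrics { char_count: 120, tag_count: 30, link_char_count: 110 };
    assert!(content.density() > navbar.density());
    assert!(content.link_density() < navbar.link_density());
    println!("content: density {:.1}, link density {:.2}",
             content.density(), content.link_density());
    // prints "content: density 80.0, link density 0.05"
}
```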

### Binary Tool
The `dce` binary (`src/main.rs`) provides CLI access to the library, supporting file and URL input with text or markdown output.

## Development Commands

```bash
# Build and test
cargo build                    # Build library
cargo build --release          # Optimized build
cargo test                     # Run tests
cargo bench                    # Run benchmarks

# Code quality
cargo fmt                      # Format code (max_width = 84, see rustfmt.toml)
cargo clippy                   # Lint code
cargo tarpaulin                # Coverage report (target: 80%+, see .tarpaulin.toml)

# Examples
cargo run --example check -- lorem-ipsum   # Extract from generated lorem ipsum
cargo run --example check -- test4         # Show highest density node
cargo run --example ce_score               # Benchmark against CleanEval dataset

# CLI usage
cargo run -- --url "https://example.com"            # Extract from URL
cargo run -- --file input.html --output out.txt     # Extract from file
cargo run -- --file input.html --format markdown    # Markdown output
```
## Project Structure
- `src/lib.rs` - Library interface and public API
- `src/cetd.rs` - Core CETD algorithm implementation
- `src/tree.rs` - HTML tree traversal and metrics
- `src/unicode.rs` - Unicode-aware text processing
- `src/utils.rs` - Text extraction utilities
- `src/main.rs` - CLI binary implementation
- `examples/` - Usage examples and benchmarking tools
## Key Dependencies
- `scraper` - HTML parsing and CSS selector support
- `ego-tree` - Tree data structure for density calculations
- `unicode-segmentation` - Unicode grapheme handling
- `chardetng` - Character encoding detection
## Features

### Available Features
- **`cli`** (default): Command-line interface with URL fetching
- **`markdown`** (default): HTML to markdown conversion

### Feature Usage
```bash
cargo build --no-default-features                     # Library only
cargo build --no-default-features --features cli      # CLI only
cargo build --no-default-features --features markdown # Markdown only
cargo build                                           # Default (cli + markdown)
```
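The feature split above would typically be declared in the manifest. A hypothetical sketch of the `[features]` table; the real Cargo.toml's optional dependency lists may differ:

```toml
# Hypothetical layout -- actual optional dependencies may differ.
[features]
default = ["cli", "markdown"]
cli = []                  # gates CLI-only dependencies (wreq, tokio, ...)
markdown = ["dep:htmd"]   # gates src/markdown.rs and the htmd converter
```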
## Markdown Extraction
- Extracts the highest-density content region as structured markdown
- Locates content via `get_max_density_sum_node()` and extracts its HTML with `ElementRef::inner_html()`
- Converts HTML to markdown with `htmd` (script and style tags skipped)
- Feature-gated behind the `markdown` flag

## CLI Tool
- `--format text` (default): Plain text extraction
- `--format markdown`: Structured markdown output (requires the `markdown` feature)
- Supports file and URL input with proper error handling

## HTTP Client Migration (Completed ✅)
Migrated from `reqwest` to `wreq` for browser emulation and TLS fingerprinting:
- Async runtime with `tokio` (`#[tokio::main]`, async `fetch_url()`)
- Chrome 120 browser emulation via `wreq_util::Emulation`
- TLS fingerprinting to avoid bot/scraper detection (BoringSSL backend)
- HTTP/2 support plus cookie store, redirect policies, and proxy support

## Encoding Support (Enhanced ✅)
Fixed non-UTF-8 encoding handling:
- Replaced custom detection with `chardetng`
- Fixed a NaN threshold bug in the extraction algorithm
- Verified with Windows-1251 Russian content
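The exact NaN bug is not documented here; the sketch below only illustrates the general failure mode such a fix addresses (function and names hypothetical): a threshold computed as `0.0 / 0.0` is NaN, every comparison against NaN is false, and extraction silently selects nothing.

```rust
// Hypothetical guarded threshold: averaging zero scores would
// yield 0.0 / 0.0 == NaN without the empty-slice check.
fn threshold(scores: &[f64]) -> f64 {
    if scores.is_empty() {
        return 0.0; // guard against the NaN case
    }
    scores.iter().sum::<f64>() / scores.len() as f64
}

fn main() {
    let unguarded = 0.0_f64 / 0.0; // NaN
    assert!(unguarded.is_nan());
    assert!(!(5.0 > unguarded));   // comparisons with NaN are always false
    assert_eq!(threshold(&[]), 0.0);
    assert_eq!(threshold(&[2.0, 4.0]), 3.0);
}
```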