Commit 028cdbf

Merge pull request #47 from oiwn/dev
Process files with different encoding. Better client for http!
2 parents edd87ac + ac2e91e commit 028cdbf

File tree

11 files changed: +367 −200 lines

.gitignore

Lines changed: 2 additions & 1 deletion

```diff
@@ -7,7 +7,8 @@
 
 /tmp
 /data
+interfax_cb.html
+examples/debug_interfax.rs
 *.profraw
-dom_content_extracton.txt
 .code
 .amc.toml
```

CLAUDE.md

Lines changed: 77 additions & 175 deletions

````diff
@@ -1,212 +1,114 @@
 # CLAUDE.md
 
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+Project guidance for Claude Code when working with this repository.
 
 ## Project Overview
 
-This is a Rust library implementing the Content Extraction via Text Density (CETD) algorithm for extracting main content from web pages. The core concept analyzes text density patterns to distinguish content-rich sections from navigational elements.
+Rust library implementing Content Extraction via Text Density (CETD) algorithm for extracting main content from web pages by analyzing text density patterns.
+
+## Recent Progress
+
+### ✅ Completed Features
+- **Markdown Extraction**: Structured markdown output using CETD density analysis
+- **HTTP Client**: Migrated to wreq for browser emulation and TLS fingerprinting
+- **Encoding Support**: Full non-UTF-8 encoding support using chardetng
+
+### 🔧 Current Status
+- **CLI Tool**: Fully functional with URL/file input, text/markdown output
+- **Library API**: Stable with comprehensive feature set
+- **Testing**: Comprehensive test suite
 
 ## Architecture
 
 ### Core Components
-
-- **`DensityTree`** (`src/cetd.rs`): Main structure representing text density analysis of HTML documents. Contains methods for building density trees, calculating metrics, and extracting content.
-- **`DensityNode`** (`src/cetd.rs`): Individual nodes containing text density metrics (character count, tag count, link density).
-- **Tree operations** (`src/tree.rs`): HTML document traversal and node metrics calculation.
-- **Unicode handling** (`src/unicode.rs`): Proper character counting using grapheme clusters and Unicode normalization.
-- **Utilities** (`src/utils.rs`): Helper functions for text extraction and link analysis.
+- **`DensityTree`** (`src/cetd.rs`): Main structure for text density analysis
+- **`DensityNode`** (`src/cetd.rs`): Individual nodes with text density metrics
+- **Tree operations** (`src/tree.rs`): HTML traversal and metrics calculation
+- **Unicode handling** (`src/unicode.rs`): Proper character counting
+- **Utilities** (`src/utils.rs`): Text extraction and link analysis
 
 ### Algorithm Flow
-
-1. Parse HTML document using `scraper::Html`
-2. Build density tree mirroring HTML structure (`DensityTree::from_document`)
-3. Calculate text density metrics for each node
-4. Compute composite density scores (`calculate_density_sum`)
+1. Parse HTML with `scraper::Html`
+2. Build density tree mirroring HTML structure
+3. Calculate text density metrics per node
+4. Compute composite density scores
 5. Extract high-density regions as main content
 
 ### Binary Tool
-
-The `dce` binary (`src/main.rs`) provides CLI access to the library functionality, supporting both local files and URL fetching.
+`dce` CLI provides file/URL input with text/markdown output options.
 
 ## Development Commands
 
-### Build and Test
 ```bash
+# Build and test
 cargo build # Build library
-cargo build --release # Optimized build
+cargo build --release # Optimized build
 cargo test # Run tests
 cargo bench # Run benchmarks
-```
 
-### Code Quality
-```bash
-cargo fmt # Format code (max_width = 84, see rustfmt.toml)
+# Code quality
+cargo fmt # Format code
 cargo clippy # Lint code
-cargo tarpaulin # Generate coverage report (target: 80%+, see .tarpaulin.toml)
-just coverage # Alternative coverage command (requires just)
-```
+cargo tarpaulin # Coverage report
 
-### Examples
-```bash
-cargo run --example check -- lorem-ipsum # Extract from generated lorem ipsum
-cargo run --example check -- test4 # Show highest density node
-cargo run --example ce_score # Benchmark against CleanEval dataset
-```
+# Examples
+cargo run --example check -- lorem-ipsum # Test extraction
+cargo run --example check -- test4 # Show density nodes
 
-### Binary Usage
-```bash
-cargo run --bin dce -- --url "https://example.com" # Extract from URL
-cargo run --bin dce -- --file input.html --output out.txt # Extract from file
+# CLI usage
+cargo run -- --url "https://example.com" # Extract from URL
+cargo run -- --file input.html --output out.txt # Extract from file
+cargo run -- --file input.html --format markdown # Markdown output
 ```
 
 ## Project Structure
-
-- `src/lib.rs` - Main library interface and public API
-- `src/cetd.rs` - Core CETD algorithm implementation
-- `src/tree.rs` - HTML tree traversal and metrics
-- `src/unicode.rs` - Unicode-aware text processing
-- `src/utils.rs` - Text extraction utilities
-- `src/main.rs` - CLI binary implementation
-- `examples/` - Usage examples and benchmarking tools
+- `src/lib.rs` - Library interface and API
+- `src/cetd.rs` - Core CETD algorithm
+- `src/tree.rs` - HTML traversal
+- `src/unicode.rs` - Unicode handling
+- `src/utils.rs` - Text utilities
+- `src/main.rs` - CLI implementation
+- `examples/` - Usage examples
 
 ## Key Dependencies
-
-- `scraper` - HTML parsing and CSS selector support
-- `ego-tree` - Tree data structure for density calculations
-- `unicode-segmentation` - Proper Unicode grapheme handling
-- `unicode-normalization` - Text normalization for consistent processing
+- `scraper` - HTML parsing
+- `ego-tree` - Tree structure
+- `unicode-segmentation` - Unicode handling
+- `chardetng` - Encoding detection
 
 ## Features
 
-- Default features include CLI functionality (`cli` feature)
-- Library can be used without CLI dependencies by disabling default features
-- Optional `markdown` feature for structured markdown extraction using density analysis
-
-## Markdown Extraction Implementation
-
-**Goal**: Add markdown extraction capability that leverages CETD density analysis to extract main content as structured markdown.
-
-**Approach**:
-- Create completely separate `src/markdown.rs` module (do not modify CETD algorithm)
-- Use existing density analysis to identify high-density content nodes
-- Extract HTML subtrees for those nodes using their NodeIDs
-- Convert HTML to markdown using `htmd` library
-- Add as optional `markdown` feature flag
-
-**Implementation Steps**:
-1. ✅ Add `htmd` dependency with `markdown` feature flag to Cargo.toml
-2. ✅ Create `src/markdown.rs` with main API: `extract_content_as_markdown()`
-3. ✅ Add markdown module to `src/lib.rs` with feature gating
-4. ✅ Mirror logic from `DensityTree::extract_content()` but collect NodeIDs instead of text
-5. ✅ Implement HTML container extraction using scraper's NodeID→HTML mapping
-6. ✅ Integrate `htmd` for HTML→Markdown conversion
-7. ✅ Add error handling and basic tests
-
-**Current Status**: ✅ Implementation complete and working
-
-**Resolution**:
-- Simplified approach: Use `get_max_density_sum_node()` to find highest density content
-- Handle text nodes by walking up the tree to find parent elements
-- Extract HTML using `ElementRef::inner_html()` method
-- Convert to markdown using `htmd::HtmlToMarkdown` with script/style tags skipped
-- Proper error handling following existing patterns
-
-**Key Implementation Details**:
-- Uses `ElementRef::wrap()` to convert scraper nodes to elements
-- Walks up parent tree when max density node is text (whitespace)
-- Returns empty string when no content found (consistent with existing behavior)
-- Trims markdown output for clean results
-
-**Test Results**:
-- ✅ Test `test_extract_content_as_markdown` passes
-- ✅ All existing tests continue to pass
-- ✅ Generated markdown includes proper formatting (headers, paragraphs)
-- ✅ Works with both markdown feature enabled and disabled
-
-## CLI Integration Complete
-
-**Goal**: Add markdown output option to the `dce` CLI tool
-
-**Implementation**:
-- Added `--format` option to CLI with values `text` (default) and `markdown`
-- Modified `process_html()` function to handle both text and markdown formats
-- Added proper feature gating with clear error messages when markdown feature not enabled
-- Maintained backward compatibility with existing text output
-
-**CLI Usage**:
-```bash
-# Extract as text (default)
-cargo run -- --file input.html
-cargo run -- --url "https://example.com"
-
-# Extract as markdown
-cargo run -- --file input.html --format markdown
-cargo run -- --url "https://example.com" --format markdown
-
-# Output to file
-cargo run -- --file input.html --format markdown --output content.md
-```
-
-**Technical Details**:
-- Uses long option `--format` (no short option to avoid conflict with `--file -f`)
-- Proper error handling when markdown feature is not enabled
-- Clean integration with existing density analysis pipeline
-- Coverage exclusion for `src/main.rs` via `.llvm-cov` configuration
+### Available Features
+- **`cli`** (default): Command-line interface with URL fetching
+- **`markdown`** (default): HTML to markdown conversion
 
-**Testing**:
-- ✅ CLI builds successfully with and without markdown feature
-- ✅ Help output shows new `--format` option
-- ✅ Error handling works correctly when markdown requested but feature disabled
-- ✅ Backward compatibility maintained for existing text output
-
-## Current Task: Replace reqwest with wreq for browser-like HTTP requests
-
-**Goal**: Migrate from simple reqwest HTTP client to wreq for advanced browser emulation and TLS fingerprinting capabilities
-
-### Migration Plan
-
-#### 1. Dependency Updates
+### Feature Usage
 ```bash
-# Remove reqwest from Cargo.toml cli features
-# Add wreq and related dependencies
-wreq = "6.0.0-rc.20"
-wreq-util = "3.0.0-rc.3"
-tokio = { version = "1", features = ["full"] }
-```
-
-#### 2. Code Changes (src/main.rs)
-- Add `#[tokio::main]` attribute to main function
-- Convert `fetch_url()` from blocking to async
-- Replace `reqwest::blocking::Client` with `wreq::Client`
-- Add browser emulation configuration using `wreq_util::Emulation`
-- Update error handling for wreq's Result type
-
-#### 3. Browser Emulation Configuration
-```rust
-use wreq::Client;
-use wreq_util::Emulation;
-
-let client = Client::builder()
-    .emulation(Emulation::Chrome120) // Or other browser profiles
-    .build()?;
+cargo build --no-default-features # Library only
+cargo build --no-default-features --features cli # CLI only
+cargo build --no-default-features --features markdown # Markdown only
+cargo build # Default (cli + markdown)
 ```
 
-#### 4. Key Benefits
-- **TLS Fingerprinting**: Avoids detection as bot/scraper
-- **Browser Emulation**: Mimics real browser behavior
-- **HTTP/2 Support**: Modern protocol support
-- **Advanced Features**: Cookie store, redirect policies, rotating proxies
-
-#### 5. Testing Strategy
-- Verify URL fetching still works with various websites
-- Test TLS fingerprinting effectiveness
-- Ensure error handling is robust
-- Maintain backward compatibility with existing CLI interface
-
-#### 6. Technical Considerations
-- **Async Migration**: Move from blocking to async architecture
-- **Error Handling**: wreq uses different error types than reqwest
-- **TLS Backend**: wreq uses BoringSSL instead of system TLS
-- **Dependency Conflicts**: Avoid openssl-sys conflicts
-
-**Status**: Planning phase complete, ready for implementation
+## Markdown Extraction
+- Extracts high-density content as structured markdown
+- Uses `htmd` for HTML to markdown conversion
+- Feature-gated behind `markdown` flag
+
+## CLI Tool
+- `--format text` (default): Plain text extraction
+- `--format markdown`: Structured markdown output
+- Supports file/URL input with proper error handling
+
+## HTTP Client Migration (Completed ✅)
+**Migrated to wreq for browser emulation and TLS fingerprinting:**
+- Async runtime with `tokio`
+- Chrome 120 browser emulation
+- TLS fingerprinting avoidance
+- HTTP/2 support with advanced features
+
+## Encoding Support (Enhanced ✅)
+**Fixed non-UTF-8 encoding handling:**
+- Replaced custom detection with `chardetng`
+- Fixed NaN threshold bug in extraction algorithm
+- Verified with Windows-1251 Russian content
````
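
The removed planning notes above sketch the wreq client setup; a complete async fetch path along those lines might look like the following. This is a hedged sketch only: the `fetch_url` name, the builder calls, and the `Emulation::Chrome120` profile are taken from the plan in the removed CLAUDE.md text, and the exact wreq/wreq-util APIs are not verified against the final code.

```rust
// Hypothetical sketch of the migrated fetch path, based on the plan in the
// removed CLAUDE.md notes (browser emulation + async tokio runtime).
use wreq::Client;
use wreq_util::Emulation;

async fn fetch_url(url: &str) -> Result<String, Box<dyn std::error::Error>> {
    // Browser profile enables TLS fingerprinting to look like Chrome 120.
    let client = Client::builder()
        .emulation(Emulation::Chrome120)
        .build()?;
    // Async request/response, replacing the old reqwest::blocking call.
    let body = client.get(url).send().await?.text().await?;
    Ok(body)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = fetch_url("https://example.com").await?;
    println!("fetched {} bytes", html.len());
    Ok(())
}
```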

Cargo.toml

Lines changed: 15 additions & 6 deletions

```diff
@@ -36,16 +36,19 @@ scraper = "0.24"
 thiserror = "2"
 # binary
 clap = { version = "4.5", features = ["derive"], optional = true }
-reqwest = { version = "0.12", features = ["blocking"], optional = true }
 tempfile = { version = "3.22", optional = true }
 url = { version = "2.5", optional = true }
 anyhow = { version = "1", optional = true }
 unicode-normalization = "0.1"
 unicode-segmentation = "1.12"
 htmd = { version = "0.3", optional = true }
-wreq-util = { version = "2.2", features = ["full"] }
-wreq = { version = "5.3", features = ["full"] }
-tokio = { version = "1.47", features = ["full"] }
+wreq-util = { version = "2.2", features = ["full"], optional = true }
+wreq = { version = "5.3", features = ["full"], optional = true }
+tokio = { version = "1.47", features = ["full"], optional = true }
+encoding_rs = { version = "0.8", optional = true }
+tracing-subscriber = { version = "0.3.20", features = ["env-filter"], optional = true }
+tracing = { version = "0.1.41", optional = true }
+chardetng = { version = "0.1.17", optional = true }
 
 [dev-dependencies]
 criterion = "0.7"
@@ -71,8 +74,14 @@ default = ["cli", "markdown"]
 markdown = ["dep:htmd"]
 cli = [
     "dep:clap",
-    "dep:reqwest",
+    "dep:wreq",
+    "dep:wreq-util",
+    "dep:tokio",
     "dep:tempfile",
     "dep:url",
-    "dep:anyhow"
+    "dep:anyhow",
+    "dep:encoding_rs",
+    "dep:tracing",
+    "dep:tracing-subscriber",
+    "dep:chardetng"
 ]
```

Justfile

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,2 +1,2 @@
-coverage:
-    cargo tarpaulin
+lines:
+    tokei
```

examples/debug_density.rs

Lines changed: 42 additions & 0 deletions

```diff
@@ -0,0 +1,42 @@
+use dom_content_extraction::{DensityTree, get_node_text};
+use scraper::Html;
+use std::fs;
+
+fn main() {
+    let html_content =
+        fs::read_to_string("html/test_1.html").expect("Unable to read file");
+    let document = Html::parse_document(&html_content);
+    let mut dtree = DensityTree::from_document(&document).unwrap();
+    dtree.calculate_density_sum().unwrap();
+
+    println!("Density analysis for test_1.html:");
+    println!("================================");
+
+    // Get nodes sorted by density sum
+    let sorted_nodes = dtree.sorted_nodes();
+
+    for (i, node) in sorted_nodes.iter().enumerate() {
+        if let Ok(text) = get_node_text(node.node_id, &document) {
+            if !text.trim().is_empty() {
+                println!(
+                    "\nNode {} (density_sum: {:.2}):",
+                    i,
+                    node.density_sum.unwrap_or(0.0)
+                );
+                println!("Text: '{}'", text.trim());
+            }
+        }
+    }
+
+    // Show the max density node
+    if let Some(max_node) = dtree.get_max_density_sum_node() {
+        println!("\n=== MAX DENSITY NODE ===");
+        println!(
+            "Density sum: {:.2}",
+            max_node.value().density_sum.unwrap_or(0.0)
+        );
+        if let Ok(text) = get_node_text(max_node.value().node_id, &document) {
+            println!("Content: '{}'", text.trim());
+        }
+    }
+}
```
