
Commit 70ff7c3

Merge pull request #38 from oiwn/dev
coming out with agent, disable text extraction from inside script tags
2 parents 4601d0f + 0c4fda0 commit 70ff7c3

File tree

2 files changed: +108 -6 lines changed


CLAUDE.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a Rust library implementing the Content Extraction via Text Density (CETD) algorithm for extracting main content from web pages. The core idea is to analyze text-density patterns to distinguish content-rich sections from navigational elements.
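The text-density idea can be illustrated with a stdlib-only sketch. The `NodeMetrics` struct and the numbers below are hypothetical stand-ins, not the library's actual types (those live in `src/cetd.rs`): a node with many characters per tag is likely content, while a node whose characters mostly sit inside links is likely navigation.

```rust
/// Simplified text-density metrics, loosely following the CETD idea.
/// Illustrative only; not the library's real `DensityNode`.
struct NodeMetrics {
    char_count: u32,
    tag_count: u32,
    link_char_count: u32,
}

impl NodeMetrics {
    /// Characters per tag: high for article text, low for markup-heavy chrome.
    fn text_density(&self) -> f64 {
        self.char_count as f64 / self.tag_count.max(1) as f64
    }

    /// Fraction of characters inside links: high for nav bars.
    fn link_density(&self) -> f64 {
        self.link_char_count as f64 / self.char_count.max(1) as f64
    }
}

fn main() {
    // A paragraph-like node: lots of text, few tags, few link characters.
    let article = NodeMetrics { char_count: 900, tag_count: 5, link_char_count: 30 };
    // A nav-bar-like node: little text, many tags, almost all of it linked.
    let navbar = NodeMetrics { char_count: 120, tag_count: 24, link_char_count: 110 };

    assert!(article.text_density() > navbar.text_density());
    assert!(article.link_density() < navbar.link_density());
}
```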
## Architecture

### Core Components

- **`DensityTree`** (`src/cetd.rs`): Main structure representing the text-density analysis of an HTML document. Contains methods for building density trees, calculating metrics, and extracting content.
- **`DensityNode`** (`src/cetd.rs`): Individual nodes holding text-density metrics (character count, tag count, link density).
- **Tree operations** (`src/tree.rs`): HTML document traversal and node metrics calculation.
- **Unicode handling** (`src/unicode.rs`): Proper character counting using grapheme clusters and Unicode normalization.
- **Utilities** (`src/utils.rs`): Helper functions for text extraction and link analysis.
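The Unicode-handling component exists because byte and code-point counts misstate visible text length. The crate uses `unicode-segmentation` and `unicode-normalization` for this; the stdlib-only sketch below just demonstrates the pitfall those dependencies address.

```rust
fn main() {
    // "é" written as one precomposed code point vs. e + combining accent.
    let precomposed = "é";       // U+00E9
    let decomposed = "e\u{301}"; // U+0065 U+0301

    // Byte and code-point counts disagree even though both render identically.
    assert_eq!(precomposed.len(), 2);           // UTF-8 bytes
    assert_eq!(decomposed.len(), 3);
    assert_eq!(precomposed.chars().count(), 1); // code points
    assert_eq!(decomposed.chars().count(), 2);

    // Grapheme-cluster counting (what src/unicode.rs provides via
    // unicode-segmentation) would report 1 for both, which is why the
    // library normalizes and counts graphemes instead of chars or bytes.
}
```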
### Algorithm Flow

1. Parse the HTML document using `scraper::Html`
2. Build a density tree mirroring the HTML structure (`DensityTree::from_document`)
3. Calculate text-density metrics for each node
4. Compute composite density scores (`calculate_density_sum`)
5. Extract high-density regions as the main content
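Step 4 can be sketched as a recursive pass over a toy tree. This stdlib-only `Node` and `density_sum` are simplified stand-ins for the real types in `src/cetd.rs` and only mirror the shape of the computation: a node's composite score aggregates its children's densities, so containers of dense children (articles) outscore containers of sparse ones (sidebars).

```rust
/// Toy density node mirroring the tree shape (not the library's type).
struct Node {
    density: f64,
    children: Vec<Node>,
}

/// Simplified composite score: the sum of the children's densities.
/// (A stand-in for what `calculate_density_sum` computes per node.)
fn density_sum(node: &Node) -> f64 {
    node.children.iter().map(|c| c.density).sum()
}

fn main() {
    let article = Node {
        density: 12.0,
        children: vec![
            Node { density: 80.0, children: vec![] }, // dense paragraph
            Node { density: 65.0, children: vec![] }, // dense paragraph
        ],
    };
    let sidebar = Node {
        density: 4.0,
        children: vec![
            Node { density: 2.0, children: vec![] },  // sparse nav list
        ],
    };

    // The content container scores higher than the navigation container.
    assert!(density_sum(&article) > density_sum(&sidebar));
}
```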
### Binary Tool

The `dce` binary (`src/main.rs`) provides CLI access to the library functionality, supporting both local files and URL fetching.

## Development Commands

### Build and Test

```bash
cargo build            # Build library
cargo build --release  # Optimized build
cargo test             # Run tests
cargo bench            # Run benchmarks
```

### Code Quality

```bash
cargo fmt        # Format code (max_width = 84, see rustfmt.toml)
cargo clippy     # Lint code
cargo tarpaulin  # Generate coverage report (target: 80%+, see .tarpaulin.toml)
just coverage    # Alternative coverage command (requires just)
```

### Examples

```bash
cargo run --example check -- lorem-ipsum  # Extract from generated lorem ipsum
cargo run --example check -- test4        # Show highest-density node
cargo run --example ce_score              # Benchmark against CleanEval dataset
```

### Binary Usage

```bash
cargo run --bin dce -- --url "https://example.com"         # Extract from URL
cargo run --bin dce -- --file input.html --output out.txt  # Extract from file
```

## Project Structure

- `src/lib.rs` - Main library interface and public API
- `src/cetd.rs` - Core CETD algorithm implementation
- `src/tree.rs` - HTML tree traversal and metrics
- `src/unicode.rs` - Unicode-aware text processing
- `src/utils.rs` - Text extraction utilities
- `src/main.rs` - CLI binary implementation
- `examples/` - Usage examples and benchmarking tools

## Key Dependencies

- `scraper` - HTML parsing and CSS selector support
- `ego-tree` - Tree data structure for density calculations
- `unicode-segmentation` - Proper Unicode grapheme handling
- `unicode-normalization` - Text normalization for consistent processing

## Features

- Default features include CLI functionality (`cli` feature)
- The library can be used without CLI dependencies by disabling default features
src/utils.rs

Lines changed: 26 additions & 6 deletions
@@ -46,16 +46,36 @@ pub fn get_node_text(
 ) -> Result<String, DomExtractionError> {
     let mut text_fragments: Vec<String> = vec![];
     let root_node = get_node_by_id(node_id, document)?;
-    for node in root_node.descendants() {
-        if let Some(txt) = node.value().as_text() {
+    collect_text_filtered(&root_node, &mut text_fragments);
+    // Use the Unicode join function instead of simple join
+    Ok(crate::unicode::join_text_fragments(text_fragments))
+}
+
+/// Recursively collect text from nodes while filtering out script/style content
+fn collect_text_filtered(node: &ego_tree::NodeRef<'_, scraper::node::Node>, text_fragments: &mut Vec<String>) {
+    match node.value() {
+        scraper::Node::Text(txt) => {
             let clean_text = txt.trim();
             if !clean_text.is_empty() {
                 text_fragments.push(clean_text.to_string());
-            };
-        };
+            }
+        }
+        scraper::Node::Element(elem) => {
+            // Skip script, noscript, and style elements entirely
+            if !matches!(elem.name(), "script" | "noscript" | "style") {
+                // Process children only if this isn't a filtered element
+                for child in node.children() {
+                    collect_text_filtered(&child, text_fragments);
+                }
+            }
+        }
+        _ => {
+            // For other node types, process children
+            for child in node.children() {
+                collect_text_filtered(&child, text_fragments);
+            }
+        }
     }
-    // Use the Unicode join function instead of simple join
-    Ok(crate::unicode::join_text_fragments(text_fragments))
 }
 
 /// Helper function to extract all links (`href` attributes) from a `scraper::Html`
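The effect of this change can be modelled on a toy DOM. In the sketch below, the `Node` enum is a hypothetical stand-in for `scraper::Node`, and the function mirrors the filtering logic added above: recursing with an element-name filter skips entire `script`/`noscript`/`style` subtrees that the old `descendants()` walk collected text from.

```rust
/// Minimal stand-in for an HTML node tree (not scraper's real types).
enum Node {
    Text(String),
    Element { name: String, children: Vec<Node> },
}

/// Collect trimmed, non-empty text while skipping script/noscript/style
/// subtrees entirely, mirroring the filter added to src/utils.rs.
fn collect_text_filtered(node: &Node, out: &mut Vec<String>) {
    match node {
        Node::Text(t) => {
            let clean = t.trim();
            if !clean.is_empty() {
                out.push(clean.to_string());
            }
        }
        Node::Element { name, children } => {
            // Descend only into non-filtered elements.
            if !matches!(name.as_str(), "script" | "noscript" | "style") {
                for child in children {
                    collect_text_filtered(child, out);
                }
            }
        }
    }
}

fn main() {
    let body = Node::Element {
        name: "body".into(),
        children: vec![
            Node::Text("Hello".into()),
            Node::Element {
                name: "script".into(),
                children: vec![Node::Text("var x = 1;".into())],
            },
            Node::Text(" world ".into()),
        ],
    };

    let mut fragments = Vec::new();
    collect_text_filtered(&body, &mut fragments);
    // The script body is gone; surrounding text survives, trimmed.
    assert_eq!(fragments, vec!["Hello", "world"]);
}
```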

0 commit comments
