# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a Rust library implementing the Content Extraction via Text Density (CETD) algorithm for extracting the main content from web pages. The algorithm analyzes text-density patterns to distinguish content-rich sections from navigational and boilerplate elements.

## Architecture

### Core Components

- **`DensityTree`** (`src/cetd.rs`): Main structure representing the text-density analysis of an HTML document. Contains methods for building density trees, calculating metrics, and extracting content.
- **`DensityNode`** (`src/cetd.rs`): Individual node holding text-density metrics (character count, tag count, link density).
- **Tree operations** (`src/tree.rs`): HTML document traversal and node metrics calculation.
- **Unicode handling** (`src/unicode.rs`): Proper character counting using grapheme clusters and Unicode normalization.
- **Utilities** (`src/utils.rs`): Helper functions for text extraction and link analysis.
| 19 | +### Algorithm Flow |
| 20 | + |
| 21 | +1. Parse HTML document using `scraper::Html` |
| 22 | +2. Build density tree mirroring HTML structure (`DensityTree::from_document`) |
| 23 | +3. Calculate text density metrics for each node |
| 24 | +4. Compute composite density scores (`calculate_density_sum`) |
| 25 | +5. Extract high-density regions as main content |
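Step 4 can be sketched with a toy, stdlib-only example. The node layout and the scoring rule here (a node's composite score as the sum of its direct children's text densities, per the CETD paper) are assumptions for illustration, not this crate's actual `calculate_density_sum` implementation.

```rust
// Toy illustration of the composite-score step; structure and naming
// are assumptions, not this crate's API.
struct Node {
    char_count: f32,
    tag_count: f32,
    children: Vec<Node>,
}

impl Node {
    fn density(&self) -> f32 {
        if self.tag_count == 0.0 {
            self.char_count
        } else {
            self.char_count / self.tag_count
        }
    }

    /// Composite score: sum of the direct children's densities.
    /// Subtrees dominated by dense text score highest.
    fn density_sum(&self) -> f32 {
        self.children.iter().map(|c| c.density()).sum()
    }
}

fn main() {
    // A <body> holding a link-heavy nav and a text-heavy article.
    let body = Node {
        char_count: 520.0,
        tag_count: 14.0,
        children: vec![
            Node { char_count: 20.0, tag_count: 8.0, children: vec![] },  // nav
            Node { char_count: 500.0, tag_count: 5.0, children: vec![] }, // article
        ],
    };
    // 20/8 + 500/5 = 2.5 + 100.0 = 102.5
    println!("body density_sum = {}", body.density_sum());
}
```

The extraction step then walks the tree for the subtree with the highest composite score and emits its text.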

### Binary Tool

The `dce` binary (`src/main.rs`) provides CLI access to the library functionality, supporting both local files and URL fetching.

## Development Commands

### Build and Test
```bash
cargo build            # Build library
cargo build --release  # Optimized build
cargo test             # Run tests
cargo bench            # Run benchmarks
```

### Code Quality
```bash
cargo fmt        # Format code (max_width = 84, see rustfmt.toml)
cargo clippy     # Lint code
cargo tarpaulin  # Generate coverage report (target: 80%+, see .tarpaulin.toml)
just coverage    # Alternative coverage command (requires just)
```

### Examples
```bash
cargo run --example check -- lorem-ipsum  # Extract from generated lorem ipsum
cargo run --example check -- test4        # Show highest density node
cargo run --example ce_score              # Benchmark against CleanEval dataset
```

### Binary Usage
```bash
cargo run --bin dce -- --url "https://example.com"         # Extract from URL
cargo run --bin dce -- --file input.html --output out.txt  # Extract from file
```

## Project Structure

- `src/lib.rs` - Main library interface and public API
- `src/cetd.rs` - Core CETD algorithm implementation
- `src/tree.rs` - HTML tree traversal and metrics
- `src/unicode.rs` - Unicode-aware text processing
- `src/utils.rs` - Text extraction utilities
- `src/main.rs` - CLI binary implementation
- `examples/` - Usage examples and benchmarking tools

## Key Dependencies

- `scraper` - HTML parsing and CSS selector support
- `ego-tree` - Tree data structure for density calculations
- `unicode-segmentation` - Proper Unicode grapheme handling
- `unicode-normalization` - Text normalization for consistent processing
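Why grapheme-aware counting matters can be shown with the standard library alone: byte length and code-point count both disagree with what a reader perceives as one character. (The library itself uses `unicode-segmentation` for grapheme clusters; this stdlib sketch only motivates that choice.)

```rust
fn main() {
    // "é" as one precomposed code point vs. "e" + combining acute accent.
    let precomposed = "\u{00e9}";
    let combining = "e\u{0301}";

    // Byte length and code-point count disagree with each other and with
    // what a reader sees (one character) -- hence grapheme-cluster counting
    // and Unicode normalization in src/unicode.rs.
    assert_eq!(precomposed.len(), 2);           // UTF-8 bytes
    assert_eq!(precomposed.chars().count(), 1); // code points
    assert_eq!(combining.len(), 3);             // UTF-8 bytes
    assert_eq!(combining.chars().count(), 2);   // still one grapheme on screen
    println!("byte and char counts differ from grapheme counts");
}
```

Normalization (NFC/NFD) makes the two spellings of "é" compare and count consistently before densities are computed.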

## Features

- Default features include CLI functionality (`cli` feature)
- Library can be used without CLI dependencies by disabling default features
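A consumer disabling the CLI feature might declare the dependency like this. The package name below is a placeholder assumption (check the `[package] name` in this repo's Cargo.toml for the real one):

```toml
# Hypothetical consumer Cargo.toml entry; substitute the crate's
# real package name and version.
[dependencies]
dce-library = { version = "0.1", default-features = false }
```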