diff --git a/.deny.toml b/.deny.toml index 268c830..17fbffa 100644 --- a/.deny.toml +++ b/.deny.toml @@ -6,7 +6,9 @@ allow = [ "MPL-2.0", "ISC", "BSD-3-Clause", - "Zlib" + "Zlib", + "CDLA-Permissive-2.0", + "GPL-3.0" ] [advisories] diff --git a/CLAUDE.md b/CLAUDE.md index 1d43ac1..5c04041 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -122,3 +122,91 @@ cargo run --bin dce -- --file input.html --output out.txt # Extract from file - ✅ All existing tests continue to pass - ✅ Generated markdown includes proper formatting (headers, paragraphs) - ✅ Works with both markdown feature enabled and disabled + +## CLI Integration Complete + +**Goal**: Add markdown output option to the `dce` CLI tool + +**Implementation**: +- Added `--format` option to CLI with values `text` (default) and `markdown` +- Modified `process_html()` function to handle both text and markdown formats +- Added proper feature gating with clear error messages when markdown feature not enabled +- Maintained backward compatibility with existing text output + +**CLI Usage**: +```bash +# Extract as text (default) +cargo run -- --file input.html +cargo run -- --url "https://example.com" + +# Extract as markdown +cargo run -- --file input.html --format markdown +cargo run -- --url "https://example.com" --format markdown + +# Output to file +cargo run -- --file input.html --format markdown --output content.md +``` + +**Technical Details**: +- Uses long option `--format` (no short option to avoid conflict with `--file -f`) +- Proper error handling when markdown feature is not enabled +- Clean integration with existing density analysis pipeline +- Coverage exclusion for `src/main.rs` via `.llvm-cov` configuration + +**Testing**: +- ✅ CLI builds successfully with and without markdown feature +- ✅ Help output shows new `--format` option +- ✅ Error handling works correctly when markdown requested but feature disabled +- ✅ Backward compatibility maintained for existing text output + +## Current Task: Replace reqwest with 
wreq for browser-like HTTP requests + +**Goal**: Migrate from simple reqwest HTTP client to wreq for advanced browser emulation and TLS fingerprinting capabilities + +### Migration Plan + +#### 1. Dependency Updates +```bash +# Remove reqwest from Cargo.toml cli features +# Add wreq and related dependencies +wreq = "6.0.0-rc.20" +wreq-util = "3.0.0-rc.3" +tokio = { version = "1", features = ["full"] } +``` + +#### 2. Code Changes (src/main.rs) +- Add `#[tokio::main]` attribute to main function +- Convert `fetch_url()` from blocking to async +- Replace `reqwest::blocking::Client` with `wreq::Client` +- Add browser emulation configuration using `wreq_util::Emulation` +- Update error handling for wreq's Result type + +#### 3. Browser Emulation Configuration +```rust +use wreq::Client; +use wreq_util::Emulation; + +let client = Client::builder() + .emulation(Emulation::Chrome120) // Or other browser profiles + .build()?; +``` + +#### 4. Key Benefits +- **TLS Fingerprinting**: Avoids detection as bot/scraper +- **Browser Emulation**: Mimics real browser behavior +- **HTTP/2 Support**: Modern protocol support +- **Advanced Features**: Cookie store, redirect policies, rotating proxies + +#### 5. Testing Strategy +- Verify URL fetching still works with various websites +- Test TLS fingerprinting effectiveness +- Ensure error handling is robust +- Maintain backward compatibility with existing CLI interface + +#### 6. 
Technical Considerations +- **Async Migration**: Move from blocking to async architecture +- **Error Handling**: wreq uses different error types than reqwest +- **TLS Backend**: wreq uses BoringSSL instead of system TLS +- **Dependency Conflicts**: Avoid openssl-sys conflicts + +**Status**: Planning phase complete, ready for implementation diff --git a/Cargo.toml b/Cargo.toml index f32b1f1..10eca66 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -43,6 +43,9 @@ anyhow = { version = "1", optional = true } unicode-normalization = "0.1" unicode-segmentation = "1.12" htmd = { version = "0.3", optional = true } +wreq-util = { version = "2.2", features = ["full"] } +wreq = { version = "5.3", features = ["full"] } +tokio = { version = "1.47", features = ["full"] } [dev-dependencies] criterion = "0.7" diff --git a/README.md b/README.md index 3cf28a1..acdb328 100644 --- a/README.md +++ b/README.md @@ -36,10 +36,11 @@ the main content programmatically. This library helps solve this problem by: - Support for nested HTML structures - Efficient processing of large documents - Error handling for malformed HTML +- **Markdown output** (optional feature) - Extract content as structured markdown ## Unicode Support -DOM Content Extraction includes robust Unicode support for handling multilingual content: +DOM Content Extraction includes Unicode support for handling multilingual content: - Proper character counting using Unicode grapheme clusters - Unicode normalization (NFC) for consistent text representation @@ -93,16 +94,40 @@ cargo add dom-content-extraction or add to you `Cargo.toml` -``` +```toml dom-content-extraction = "0.3" ``` +### Optional Features + +To enable markdown output support: + +```toml +dom-content-extraction = { version = "0.3", features = ["markdown"] } +``` + ## Documentation Read the docs! 
[dom-content-extraction documentation](https://docs.rs/dom-content-extraction/latest/dom_content_extraction/) +### Library Usage with Markdown + +```rust +use dom_content_extraction::{DensityTree, extract_content_as_markdown, scraper::Html}; + +let html = "

Title

Content

"; +let document = Html::parse_document(html); +let mut dtree = DensityTree::from_document(&document)?; +dtree.calculate_density_sum()?; + +// Extract as markdown +let markdown = extract_content_as_markdown(&dtree, &document)?; +println!("{}", markdown); +# Ok::<(), dom_content_extraction::DomExtractionError>(()) +``` + ## Run examples Check examples. @@ -113,10 +138,16 @@ This one will extract content from generated "lorem ipsum" page cargo run --example check -- lorem-ipsum ``` -This one print node with highest density: +This one prints node with highest density: ```bash -cargo run --examples check -- test4 +cargo run --example check -- test4 +``` + +Extract content as markdown from lorem ipsum (requires markdown feature): + +```bash +cargo run --example check -- lorem-ipsum-markdown ``` There is scoring example i'm trying to implement scoring. @@ -154,7 +185,9 @@ Overall Performance: ## Binary Usage -The crate includes a command-line binary tool `dce` (DOM Content Extraction) for extracting main content from HTML documents. It supports both local files and remote URLs as input sources. +The crate includes a command-line binary tool `dce` (DOM Content Extraction) for +extracting main content from HTML documents. It supports both local files and +remote URLs as input sources. ### Installation @@ -173,12 +206,27 @@ Options: -u, --url URL to fetch HTML content from -f, --file Local HTML file to process -o, --output Output file (stdout if not specified) + --format Output format [default: text] [possible values: text, markdown] -h, --help Print help -V, --version Print version ``` Note: Either `--url` or `--file` must be specified, but not both. 
+### Markdown Output + +To extract content as markdown format, use the `--format markdown` option: + +```bash +# Extract as markdown from URL +cargo run --bin dce -- --url "https://example.com" --format markdown + +# Extract as markdown from file and save to output +cargo run --bin dce -- --file input.html --format markdown --output content.md +``` + +Note: Markdown output requires the `markdown` feature to be enabled. + ### Features - **URL Fetching**: Automatically downloads HTML content from specified URLs @@ -186,6 +234,7 @@ Note: Either `--url` or `--file` must be specified, but not both. - **Error Handling**: Comprehensive error messages for common failure cases - **Flexible Output**: Write to file or stdout - **Temporary File Management**: Automatic cleanup of downloaded content +- **Markdown Support**: Extract content as structured markdown (requires `markdown` feature) ### Examples @@ -204,16 +253,6 @@ Extract from URL and save directly to file: dce --url "https://example.com/page" --output content.txt ``` -### Error Handling - -The binary provides clear error messages for common scenarios: - -- Invalid URLs -- Network timeouts -- File access issues -- HTML parsing errors -- Content extraction failures - ### Dependencies The binary functionality requires the following additional dependencies: @@ -223,6 +262,8 @@ The binary functionality requires the following additional dependencies: - `tempfile`: Temporary file management - `url`: URL parsing and validation - `anyhow`: Error handling +- `htmd`: HTML to markdown conversion (for markdown feature) -These dependencies are only included when building with the default `cli` feature. +These dependencies are only included when building with the default `cli` +feature. The `markdown` feature requires the `htmd` dependency.
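
A minimal sketch of the `--format` handling this diff describes, using only the standard library. The names `OutputFormat`, `UnknownFormat`, and `check_format` are hypothetical and are not taken from `src/main.rs`; the `markdown_enabled` flag is an assumption that stands in for the compile-time `markdown` feature check, so the fallback error path can be exercised without Cargo features.

```rust
use std::fmt;
use std::str::FromStr;

/// Output formats accepted by the `--format` option (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Text,
    Markdown,
}

/// Error returned for an unrecognised format name (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownFormat(pub String);

impl fmt::Display for UnknownFormat {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unknown format '{}', expected 'text' or 'markdown'", self.0)
    }
}

impl std::error::Error for UnknownFormat {}

impl FromStr for OutputFormat {
    type Err = UnknownFormat;

    /// Map the two documented values onto the enum; reject anything else.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "text" => Ok(OutputFormat::Text),
            "markdown" => Ok(OutputFormat::Markdown),
            other => Err(UnknownFormat(other.to_string())),
        }
    }
}

/// Reject markdown output when markdown support is not available.
/// `markdown_enabled` is an assumption standing in for the feature gate.
pub fn check_format(
    format: OutputFormat,
    markdown_enabled: bool,
) -> Result<OutputFormat, String> {
    match format {
        OutputFormat::Markdown if !markdown_enabled => Err(
            "markdown output requires building with the `markdown` feature".to_string(),
        ),
        other => Ok(other),
    }
}
```

In a real binary the boolean would more likely come from `cfg!(feature = "markdown")`, and the parsing would typically be delegated to the argument parser; the explicit parameter here only keeps the sketch self-contained.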