update README.md

oiwn · oiwn · commit 08d18433d0e1 · 2025-09-21T13:41:33.000+07:00
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -122,3 +122,91 @@ cargo run --bin dce -- --file input.html --output out.txt # Extract from file
 - ✅ All existing tests continue to pass
 - ✅ Generated markdown includes proper formatting (headers, paragraphs)
 - ✅ Works with both markdown feature enabled and disabled
+
+## CLI Integration Complete
+
+**Goal**: Add a markdown output option to the `dce` CLI tool
+
+**Implementation**:
+- Added `--format` option to CLI with values `text` (default) and `markdown`
+- Modified `process_html()` function to handle both text and markdown formats
+- Added feature gating with a clear error message when the `markdown` feature is not enabled
+- Maintained backward compatibility with existing text output
+
+**CLI Usage**:
+```bash
+# Extract as text (default)
+cargo run -- --file input.html
+cargo run -- --url "https://example.com"
+
+# Extract as markdown
+cargo run -- --file input.html --format markdown
+cargo run -- --url "https://example.com" --format markdown
+
+# Output to file
+cargo run -- --file input.html --format markdown --output content.md
+```
+
+**Technical Details**:
+- Uses the long option `--format` only (no short option, to avoid a conflict with `-f` for `--file`)
+- Proper error handling when markdown feature is not enabled
+- Clean integration with existing density analysis pipeline
+- Coverage exclusion for `src/main.rs` via `.llvm-cov` configuration
+
+**Testing**:
+- ✅ CLI builds successfully with and without markdown feature
+- ✅ Help output shows new `--format` option
+- ✅ Error handling works correctly when markdown is requested but the feature is disabled
+- ✅ Backward compatibility maintained for existing text output
+
+## Current Task: Replace reqwest with wreq for browser-like HTTP requests
+
+**Goal**: Migrate from the simple reqwest HTTP client to wreq for browser emulation and TLS fingerprinting capabilities
+
+### Migration Plan
+
+#### 1. Dependency Updates
+```toml
+# Remove reqwest from Cargo.toml cli features
+# Add wreq and related dependencies
+wreq = "6.0.0-rc.20"
+wreq-util = "3.0.0-rc.3"
+tokio = { version = "1", features = ["full"] }
+```
+
+#### 2. Code Changes (src/main.rs)
+- Add the `#[tokio::main]` attribute to `main` and make it an `async fn`
+- Convert `fetch_url()` from blocking to async
+- Replace `reqwest::blocking::Client` with `wreq::Client`
+- Add browser emulation configuration using `wreq_util::Emulation`
+- Update error handling for wreq's Result type
+
+#### 3. Browser Emulation Configuration
+```rust
+use wreq::Client;
+use wreq_util::Emulation;
+
+let client = Client::builder()
+    .emulation(Emulation::Chrome120)  // Or other browser profiles
+    .build()?;
+```
+
+#### 4. Key Benefits
+- **TLS Fingerprinting**: Presents a browser-like TLS fingerprint, reducing the chance of being flagged as a bot/scraper
+- **Browser Emulation**: Mimics the request profile of a real browser (e.g. Chrome 120)
+- **HTTP/2 Support**: Modern protocol support
+- **Advanced Features**: Cookie store, redirect policies, rotating proxies
+
+#### 5. Testing Strategy
+- Verify URL fetching still works with various websites
+- Test TLS fingerprinting effectiveness
+- Ensure error handling is robust
+- Maintain backward compatibility with existing CLI interface
+
+#### 6. Technical Considerations
+- **Async Migration**: Move from blocking to async architecture
+- **Error Handling**: wreq uses different error types than reqwest
+- **TLS Backend**: wreq uses BoringSSL instead of system TLS
+- **Dependency Conflicts**: Avoid openssl-sys conflicts
+
+**Status**: Planning phase complete, ready for implementation
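Review note on the `--format` handling described in the CLAUDE.md hunk above: the default-to-text behaviour and the feature-gated error can be sketched with the standard library alone. This is a minimal sketch, not the code in `src/main.rs`; `OutputFormat`, `resolve_format`, and the `markdown_enabled` parameter are hypothetical names, and the flag stands in for a compile-time check such as `cfg!(feature = "markdown")`.

```rust
use std::str::FromStr;

/// Output formats accepted by the `--format` option (hypothetical type name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Text,
    Markdown,
}

impl FromStr for OutputFormat {
    type Err = String;

    // Accept exactly the two values listed in the notes: `text` and `markdown`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "text" => Ok(OutputFormat::Text),
            "markdown" => Ok(OutputFormat::Markdown),
            other => Err(format!(
                "unknown format '{other}' (expected 'text' or 'markdown')"
            )),
        }
    }
}

/// Resolve the requested format, rejecting markdown when support was not
/// compiled in. `markdown_enabled` is an assumption standing in for
/// `cfg!(feature = "markdown")`.
pub fn resolve_format(
    requested: Option<&str>,
    markdown_enabled: bool,
) -> Result<OutputFormat, String> {
    // No `--format` given: keep the backward-compatible text default.
    let format = match requested {
        Some(value) => value.parse::<OutputFormat>()?,
        None => OutputFormat::Text,
    };
    if format == OutputFormat::Markdown && !markdown_enabled {
        return Err(
            "markdown output requires building with the `markdown` feature".to_string(),
        );
    }
    Ok(format)
}
```

Passing the feature state as a parameter keeps both branches testable in a single build; an argument parser with built-in value validation could replace the manual `FromStr` implementation.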
diff --git a/Cargo.toml b/Cargo.toml
@@ -43,6 +43,9 @@ anyhow = { version = "1", optional = true }
 unicode-normalization = "0.1"
 unicode-segmentation = "1.12"
 htmd = { version = "0.3", optional = true }
+wreq-util = { version = "2.2", features = ["full"] }
+wreq = { version = "5.3", features = ["full"] }
+tokio = { version = "1.47", features = ["full"] }
 
 [dev-dependencies]
 criterion = "0.7"
diff --git a/README.md b/README.md
@@ -36,10 +36,11 @@ the main content programmatically. This library helps solve this problem by:
 - Support for nested HTML structures
 - Efficient processing of large documents
 - Error handling for malformed HTML
+- **Markdown output** (optional feature) - Extract content as structured markdown
 
 ## Unicode Support
 
-DOM Content Extraction includes robust Unicode support for handling multilingual content:
+DOM Content Extraction includes Unicode support for handling multilingual content:
 
 - Proper character counting using Unicode grapheme clusters
 - Unicode normalization (NFC) for consistent text representation
@@ -93,16 +94,40 @@ cargo add dom-content-extraction
 
 or add to you  `Cargo.toml`
 
-```
+```toml
 dom-content-extraction = "0.3"
 ```
 
+### Optional Features
+
+To enable markdown output support:
+
+```toml
+dom-content-extraction = { version = "0.3", features = ["markdown"] }
+```
+
 ## Documentation
 
 Read the docs! 
 
 [dom-content-extraction documentation](https://docs.rs/dom-content-extraction/latest/dom_content_extraction/)
 
+### Library Usage with Markdown
+
+```rust
+use dom_content_extraction::{DensityTree, extract_content_as_markdown, scraper::Html};
+
+let html = "<html><body><article><h1>Title</h1><p>Content</p></article></body></html>";
+let document = Html::parse_document(html);
+let mut dtree = DensityTree::from_document(&document)?;
+dtree.calculate_density_sum()?;
+
+// Extract as markdown
+let markdown = extract_content_as_markdown(&dtree, &document)?;
+println!("{}", markdown);
+# Ok::<(), dom_content_extraction::DomExtractionError>(())
+```
+
 ## Run examples
 
 Check examples.
@@ -113,10 +138,16 @@ This one will extract content from generated "lorem ipsum" page
 cargo run --example check -- lorem-ipsum 
 ```
 
-This one print node with highest density:
+This one prints the node with the highest density:
 
 ```bash
-cargo run --examples check -- test4
+cargo run --example check -- test4
+```
+
+Extract content as markdown from the lorem ipsum page (requires the `markdown` feature):
+
+```bash
+cargo run --example check -- lorem-ipsum-markdown
 ```
 
 There is scoring example i'm trying to implement scoring.
@@ -154,7 +185,9 @@ Overall Performance:
 
 ## Binary Usage
 
-The crate includes a command-line binary tool `dce` (DOM Content Extraction) for extracting main content from HTML documents. It supports both local files and remote URLs as input sources.
+The crate includes a command-line binary tool `dce` (DOM Content Extraction) for
+extracting main content from HTML documents. It supports both local files and
+remote URLs as input sources.
 
 ### Installation
 
@@ -173,19 +206,35 @@ Options:
   -u, --url <URL>        URL to fetch HTML content from
   -f, --file <FILE>      Local HTML file to process
   -o, --output <FILE>    Output file (stdout if not specified)
+      --format <FORMAT>  Output format [default: text] [possible values: text, markdown]
   -h, --help            Print help
   -V, --version         Print version
 ```
 
 Note: Either `--url` or `--file` must be specified, but not both.
 
+### Markdown Output
+
+To extract content as markdown, use the `--format markdown` option:
+
+```bash
+# Extract as markdown from URL
+cargo run --bin dce -- --url "https://example.com" --format markdown
+
+# Extract as markdown from a file and save the result to an output file
+cargo run --bin dce -- --file input.html --format markdown --output content.md
+```
+
+Note: Markdown output requires the `markdown` feature to be enabled.
+
 ### Features
 
 - **URL Fetching**: Automatically downloads HTML content from specified URLs
 - **Timeout Control**: 30-second timeout for URL fetching to prevent hangs
 - **Error Handling**: Comprehensive error messages for common failure cases
 - **Flexible Output**: Write to file or stdout
 - **Temporary File Management**: Automatic cleanup of downloaded content
+- **Markdown Support**: Extract content as structured markdown (requires `markdown` feature)
 
 ### Examples
 
@@ -204,16 +253,6 @@ Extract from URL and save directly to file:
 dce --url "https://example.com/page" --output content.txt
 ```
 
-### Error Handling
-
-The binary provides clear error messages for common scenarios:
-
-- Invalid URLs
-- Network timeouts
-- File access issues
-- HTML parsing errors
-- Content extraction failures
-
 ### Dependencies
 
 The binary functionality requires the following additional dependencies:
@@ -223,6 +262,8 @@ The binary functionality requires the following additional dependencies:
 - `tempfile`: Temporary file management
 - `url`: URL parsing and validation
 - `anyhow`: Error handling
+- `htmd`: HTML to markdown conversion (for markdown feature)
 
-These dependencies are only included when building with the default `cli` feature.
+These dependencies are only included when building with the default `cli`
+feature. The `markdown` feature requires the `htmd` dependency.
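Review note on the README usage text above: the rule that either `--url` or `--file` must be specified, but not both, can be sketched as a single match over the two optional arguments. This is an illustrative sketch only; `InputSource` and `select_input` are hypothetical names and do not come from the crate.

```rust
use std::path::PathBuf;

/// Where the HTML comes from (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum InputSource {
    Url(String),
    File(PathBuf),
}

/// Enforce "either `--url` or `--file`, but not both".
pub fn select_input(url: Option<&str>, file: Option<&str>) -> Result<InputSource, String> {
    match (url, file) {
        // Exactly one source given: accept it.
        (Some(u), None) => Ok(InputSource::Url(u.to_string())),
        (None, Some(f)) => Ok(InputSource::File(PathBuf::from(f))),
        // Both or neither given: report a usage error.
        (Some(_), Some(_)) => Err("specify either --url or --file, not both".to_string()),
        (None, None) => Err("one of --url or --file is required".to_string()),
    }
}
```

Matching on the pair makes all four combinations explicit, so the compiler flags any case that is left unhandled.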