Merged
4 changes: 3 additions & 1 deletion .deny.toml
@@ -6,7 +6,9 @@ allow = [
"MPL-2.0",
"ISC",
"BSD-3-Clause",
"Zlib"
"Zlib",
"CDLA-Permissive-2.0",
"GPL-3.0"
]

[advisories]
88 changes: 88 additions & 0 deletions CLAUDE.md
@@ -122,3 +122,91 @@ cargo run --bin dce -- --file input.html --output out.txt # Extract from file
- ✅ All existing tests continue to pass
- ✅ Generated markdown includes proper formatting (headers, paragraphs)
- ✅ Works with both markdown feature enabled and disabled

## CLI Integration Complete

**Goal**: Add markdown output option to the `dce` CLI tool

**Implementation**:
- Added `--format` option to CLI with values `text` (default) and `markdown`
- Modified `process_html()` function to handle both text and markdown formats
- Added proper feature gating with clear error messages when markdown feature not enabled
- Maintained backward compatibility with existing text output
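
A minimal sketch of how the two `--format` values could be parsed, using only the standard library. The `OutputFormat` enum and its error message are illustrative assumptions, not the types actually used in `src/main.rs`:

```rust
use std::str::FromStr;

/// Hypothetical set of output formats accepted by `--format`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum OutputFormat {
    /// Plain text, the default when `--format` is omitted.
    #[default]
    Text,
    Markdown,
}

impl FromStr for OutputFormat {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "text" => Ok(OutputFormat::Text),
            "markdown" => Ok(OutputFormat::Markdown),
            other => Err(format!(
                "unknown format '{other}', expected 'text' or 'markdown'"
            )),
        }
    }
}

fn main() {
    // Parse a value as it might arrive from the command line.
    let format: OutputFormat = "markdown".parse().expect("valid format");
    println!("{format:?}");
}
```

Deriving `Default` with `Text` as the default variant is one way to keep existing text-only invocations unchanged.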

**CLI Usage**:
```bash
# Extract as text (default)
cargo run -- --file input.html
cargo run -- --url "https://example.com"

# Extract as markdown
cargo run -- --file input.html --format markdown
cargo run -- --url "https://example.com" --format markdown

# Output to file
cargo run -- --file input.html --format markdown --output content.md
```

**Technical Details**:
- Uses the long option `--format` only (no short option, to avoid a conflict with `-f` for `--file`)
- Proper error handling when markdown feature is not enabled
- Clean integration with existing density analysis pipeline
- Coverage exclusion for `src/main.rs` via `.llvm-cov` configuration

**Testing**:
- ✅ CLI builds successfully with and without markdown feature
- ✅ Help output shows new `--format` option
- ✅ Error handling works correctly when markdown requested but feature disabled
- ✅ Backward compatibility maintained for existing text output

## Current Task: Replace reqwest with wreq for browser-like HTTP requests

**Goal**: Migrate from the plain reqwest HTTP client to wreq to gain browser emulation and TLS fingerprinting capabilities

### Migration Plan

#### 1. Dependency Updates
```toml
# Remove reqwest from Cargo.toml cli features
# Add wreq and related dependencies
wreq = "6.0.0-rc.20"
wreq-util = "3.0.0-rc.3"
tokio = { version = "1", features = ["full"] }
```

#### 2. Code Changes (src/main.rs)
- Add `#[tokio::main]` attribute to main function
- Convert `fetch_url()` from blocking to async
- Replace `reqwest::blocking::Client` with `wreq::Client`
- Add browser emulation configuration using `wreq_util::Emulation`
- Update error handling for wreq's Result type

#### 3. Browser Emulation Configuration
```rust
use wreq::Client;
use wreq_util::Emulation;

let client = Client::builder()
    .emulation(Emulation::Chrome120) // Or other browser profiles
    .build()?;
```

#### 4. Key Benefits
- **TLS Fingerprinting**: Presents a browser-like TLS fingerprint, so requests are less likely to be flagged as a bot or scraper
- **Browser Emulation**: Mimics real browser behavior
- **HTTP/2 Support**: Modern protocol support
- **Advanced Features**: Cookie store, redirect policies, rotating proxies

#### 5. Testing Strategy
- Verify URL fetching still works with various websites
- Test TLS fingerprinting effectiveness
- Ensure error handling is robust
- Maintain backward compatibility with existing CLI interface

#### 6. Technical Considerations
- **Async Migration**: Move from blocking to async architecture
- **Error Handling**: wreq uses different error types than reqwest
- **TLS Backend**: wreq uses BoringSSL instead of system TLS
- **Dependency Conflicts**: Avoid openssl-sys conflicts
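
One possible way to handle the differing error types, sketched with the standard library only: wrap whatever error the HTTP client returns in a CLI-level error, so the rest of `main.rs` does not depend on the client crate. `FetchError` and `wrap_fetch_error` are hypothetical names, and `std::io::Error` stands in for the client's real error type:

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical CLI-level error that hides which HTTP client failed.
#[derive(Debug)]
struct FetchError {
    url: String,
    source: Box<dyn Error + Send + Sync>,
}

impl fmt::Display for FetchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to fetch {}: {}", self.url, self.source)
    }
}

impl Error for FetchError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        // Expose the underlying client error for error-chain reporting.
        let inner: &(dyn Error + 'static) = &*self.source;
        Some(inner)
    }
}

/// Wraps any client error (from reqwest, wreq, or a test stub) together with the URL.
fn wrap_fetch_error<E>(url: &str, err: E) -> FetchError
where
    E: Error + Send + Sync + 'static,
{
    FetchError {
        url: url.to_string(),
        source: Box::new(err),
    }
}

fn main() {
    // `std::io::Error` is only a stand-in for the HTTP client's error type.
    let client_err =
        std::io::Error::new(std::io::ErrorKind::TimedOut, "request timed out");
    let err = wrap_fetch_error("https://example.com", client_err);
    eprintln!("{err}");
}
```

With a wrapper like this, swapping the client would only touch the call site that produces the error, not the code that reports it.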

**Status**: Planning phase complete, ready for implementation
3 changes: 3 additions & 0 deletions Cargo.toml
@@ -43,6 +43,9 @@ anyhow = { version = "1", optional = true }
unicode-normalization = "0.1"
unicode-segmentation = "1.12"
htmd = { version = "0.3", optional = true }
wreq-util = { version = "2.2", features = ["full"] }
wreq = { version = "5.3", features = ["full"] }
tokio = { version = "1.47", features = ["full"] }

[dev-dependencies]
criterion = "0.7"
73 changes: 57 additions & 16 deletions README.md
@@ -36,10 +36,11 @@ the main content programmatically. This library helps solve this problem by:
- Support for nested HTML structures
- Efficient processing of large documents
- Error handling for malformed HTML
- **Markdown output** (optional feature) - Extract content as structured markdown

## Unicode Support

DOM Content Extraction includes robust Unicode support for handling multilingual content:
DOM Content Extraction includes Unicode support for handling multilingual content:

- Proper character counting using Unicode grapheme clusters
- Unicode normalization (NFC) for consistent text representation
@@ -93,16 +94,40 @@ cargo add dom-content-extraction

or add to your `Cargo.toml`:

```
```toml
dom-content-extraction = "0.3"
```

### Optional Features

To enable markdown output support:

```toml
dom-content-extraction = { version = "0.3", features = ["markdown"] }
```

## Documentation

Read the docs!

[dom-content-extraction documentation](https://docs.rs/dom-content-extraction/latest/dom_content_extraction/)

### Library Usage with Markdown

```rust
use dom_content_extraction::{DensityTree, extract_content_as_markdown, scraper::Html};

let html = "<html><body><article><h1>Title</h1><p>Content</p></article></body></html>";
let document = Html::parse_document(html);
let mut dtree = DensityTree::from_document(&document)?;
dtree.calculate_density_sum()?;

// Extract as markdown
let markdown = extract_content_as_markdown(&dtree, &document)?;
println!("{}", markdown);
# Ok::<(), dom_content_extraction::DomExtractionError>(())
```

## Run examples

Check examples.
@@ -113,10 +138,16 @@ This one will extract content from generated "lorem ipsum" page
cargo run --example check -- lorem-ipsum
```

This one print node with highest density:
This one prints the node with the highest density:

```bash
cargo run --examples check -- test4
cargo run --example check -- test4
```

Extract content as markdown from lorem ipsum (requires markdown feature):

```bash
cargo run --example check -- lorem-ipsum-markdown
```

There is also a scoring example, where I'm trying to implement scoring.
@@ -154,7 +185,9 @@ Overall Performance:

## Binary Usage

The crate includes a command-line binary tool `dce` (DOM Content Extraction) for extracting main content from HTML documents. It supports both local files and remote URLs as input sources.
The crate includes a command-line binary tool `dce` (DOM Content Extraction) for
extracting main content from HTML documents. It supports both local files and
remote URLs as input sources.

### Installation

@@ -173,19 +206,35 @@ Options:
-u, --url <URL> URL to fetch HTML content from
-f, --file <FILE> Local HTML file to process
-o, --output <FILE> Output file (stdout if not specified)
--format <FORMAT> Output format [default: text] [possible values: text, markdown]
-h, --help Print help
-V, --version Print version
```

Note: Either `--url` or `--file` must be specified, but not both.

### Markdown Output

To extract content as markdown, use the `--format markdown` option:

```bash
# Extract as markdown from URL
cargo run --bin dce -- --url "https://example.com" --format markdown

# Extract as markdown from file and save to output
cargo run --bin dce -- --file input.html --format markdown --output content.md
```

Note: Markdown output requires the `markdown` feature to be enabled.
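
As a rough illustration of this requirement, the sketch below rejects a markdown request when the feature is unavailable. The `check_format` helper, the `OutputFormat` enum, and the error text are assumptions for illustration; in the real binary the flag would presumably come from a compile-time check such as `cfg!(feature = "markdown")` rather than a runtime parameter:

```rust
/// Hypothetical output formats, mirroring the `--format` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputFormat {
    Text,
    Markdown,
}

/// Returns the format if it is usable in this build, or an error message otherwise.
fn check_format(format: OutputFormat, markdown_enabled: bool) -> Result<OutputFormat, String> {
    match format {
        OutputFormat::Markdown if !markdown_enabled => Err(
            "markdown output requires the `markdown` feature; rebuild with `--features markdown`"
                .to_string(),
        ),
        other => Ok(other),
    }
}

fn main() {
    // Assumption: a build without the `markdown` feature.
    let markdown_enabled = false;
    for format in [OutputFormat::Text, OutputFormat::Markdown] {
        match check_format(format, markdown_enabled) {
            Ok(f) => println!("using {f:?}"),
            Err(msg) => eprintln!("error: {msg}"),
        }
    }
}
```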

### Features

- **URL Fetching**: Automatically downloads HTML content from specified URLs
- **Timeout Control**: 30-second timeout for URL fetching to prevent hangs
- **Error Handling**: Comprehensive error messages for common failure cases
- **Flexible Output**: Write to file or stdout
- **Temporary File Management**: Automatic cleanup of downloaded content
- **Markdown Support**: Extract content as structured markdown (requires `markdown` feature)
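
The "Flexible Output" behaviour can be sketched with the standard library alone: pick a file when a path is given, otherwise fall back to stdout. The function names `open_output` and `write_content` are hypothetical and are not taken from the `dce` source:

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Opens the output sink: the given file if a path is provided, otherwise stdout.
fn open_output(path: Option<&Path>) -> io::Result<Box<dyn Write>> {
    match path {
        Some(p) => Ok(Box::new(File::create(p)?)),
        None => Ok(Box::new(io::stdout())),
    }
}

/// Writes the extracted content to the chosen sink and flushes it.
fn write_content(path: Option<&Path>, content: &str) -> io::Result<()> {
    let mut out = open_output(path)?;
    out.write_all(content.as_bytes())?;
    out.flush()
}

fn main() -> io::Result<()> {
    // No path given, so the content goes to stdout.
    write_content(None, "extracted content\n")
}
```

Returning a `Box<dyn Write>` keeps the writing code identical for both destinations, at the cost of one small allocation.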

### Examples

@@ -204,16 +253,6 @@ Extract from URL and save directly to file:
dce --url "https://example.com/page" --output content.txt
```

### Error Handling

The binary provides clear error messages for common scenarios:

- Invalid URLs
- Network timeouts
- File access issues
- HTML parsing errors
- Content extraction failures

### Dependencies

The binary functionality requires the following additional dependencies:
@@ -223,6 +262,8 @@ The binary functionality requires the following additional dependencies:
- `tempfile`: Temporary file management
- `url`: URL parsing and validation
- `anyhow`: Error handling
- `htmd`: HTML to markdown conversion (for markdown feature)

These dependencies are only included when building with the default `cli` feature.
These dependencies are only included when building with the default `cli`
feature. The `markdown` feature requires the `htmd` dependency.
