Commit 08d1843

update README.md
1 parent 6e621ed commit 08d1843

3 files changed

+148
-16
lines changed

CLAUDE.md

Lines changed: 88 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -122,3 +122,91 @@ cargo run --bin dce -- --file input.html --output out.txt # Extract from file
122122
- ✅ All existing tests continue to pass
123123
- ✅ Generated markdown includes proper formatting (headers, paragraphs)
124124
- ✅ Works with both markdown feature enabled and disabled
125+
126+
## CLI Integration Complete
127+
128+
**Goal**: Add markdown output option to the `dce` CLI tool
129+
130+
**Implementation**:
131+
- Added `--format` option to CLI with values `text` (default) and `markdown`
132+
- Modified `process_html()` function to handle both text and markdown formats
133+
- Added feature gating with a clear error message when the markdown feature is not enabled
134+
- Maintained backward compatibility with existing text output
135+
136+
**CLI Usage**:
137+
```bash
138+
# Extract as text (default)
139+
cargo run -- --file input.html
140+
cargo run -- --url "https://example.com"
141+
142+
# Extract as markdown
143+
cargo run -- --file input.html --format markdown
144+
cargo run -- --url "https://example.com" --format markdown
145+
146+
# Output to file
147+
cargo run -- --file input.html --format markdown --output content.md
148+
```
149+
150+
**Technical Details**:
151+
- Uses long option `--format` (no short option to avoid conflict with `--file -f`)
152+
- Proper error handling when markdown feature is not enabled
153+
- Clean integration with existing density analysis pipeline
154+
- Coverage exclusion for `src/main.rs` via `.llvm-cov` configuration
155+
156+
**Testing**:
157+
- ✅ CLI builds successfully with and without markdown feature
158+
- ✅ Help output shows new `--format` option
159+
- ✅ Error handling works correctly when markdown requested but feature disabled
160+
- ✅ Backward compatibility maintained for existing text output
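
The format selection and feature gating described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the code in `src/main.rs`: `OutputFormat`, `FormatError`, and `check_format` are hypothetical names, and the `markdown_enabled` flag stands in for a compile-time check such as `cfg!(feature = "markdown")`.

```rust
use std::fmt;
use std::str::FromStr;

/// Output formats accepted by the `--format` option (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Text,
    Markdown,
}

/// Errors that can occur while selecting a format (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum FormatError {
    /// The value passed to `--format` is not a known format.
    Unknown(String),
    /// Markdown was requested but support is not compiled in.
    MarkdownDisabled,
}

impl fmt::Display for FormatError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FormatError::Unknown(name) => {
                write!(f, "unknown format '{}' (expected text or markdown)", name)
            }
            FormatError::MarkdownDisabled => {
                write!(f, "markdown output requires the `markdown` feature")
            }
        }
    }
}

impl FromStr for OutputFormat {
    type Err = FormatError;

    /// Maps the raw command-line value onto a known format.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "text" => Ok(OutputFormat::Text),
            "markdown" => Ok(OutputFormat::Markdown),
            other => Err(FormatError::Unknown(other.to_string())),
        }
    }
}

/// Rejects markdown when support is disabled; text output always passes.
/// `markdown_enabled` is an assumption standing in for a feature check.
pub fn check_format(
    format: OutputFormat,
    markdown_enabled: bool,
) -> Result<OutputFormat, FormatError> {
    match format {
        OutputFormat::Markdown if !markdown_enabled => Err(FormatError::MarkdownDisabled),
        other => Ok(other),
    }
}
```

Keeping the gate in one function means text output stays untouched when the feature is off, which matches the backward-compatibility item in the list above.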
161+
162+
## Current Task: Replace reqwest with wreq for browser-like HTTP requests
163+
164+
**Goal**: Migrate from simple reqwest HTTP client to wreq for advanced browser emulation and TLS fingerprinting capabilities
165+
166+
### Migration Plan
167+
168+
#### 1. Dependency Updates
169+
```toml
170+
# Remove reqwest from Cargo.toml cli features
171+
# Add wreq and related dependencies
172+
wreq = "6.0.0-rc.20"
173+
wreq-util = "3.0.0-rc.3"
174+
tokio = { version = "1", features = ["full"] }
175+
```
176+
177+
#### 2. Code Changes (src/main.rs)
178+
- Add `#[tokio::main]` attribute to main function
179+
- Convert `fetch_url()` from blocking to async
180+
- Replace `reqwest::blocking::Client` with `wreq::Client`
181+
- Add browser emulation configuration using `wreq_util::Emulation`
182+
- Update error handling for wreq's Result type
183+
184+
#### 3. Browser Emulation Configuration
185+
```rust
186+
use wreq::Client;
187+
use wreq_util::Emulation;
188+
189+
let client = Client::builder()
190+
.emulation(Emulation::Chrome120) // Or other browser profiles
191+
.build()?;
192+
```
193+
194+
#### 4. Key Benefits
195+
- **TLS Fingerprinting**: Avoids detection as bot/scraper
196+
- **Browser Emulation**: Mimics real browser behavior
197+
- **HTTP/2 Support**: Modern protocol support
198+
- **Advanced Features**: Cookie store, redirect policies, rotating proxies
199+
200+
#### 5. Testing Strategy
201+
- Verify URL fetching still works with various websites
202+
- Test TLS fingerprinting effectiveness
203+
- Ensure error handling is robust
204+
- Maintain backward compatibility with existing CLI interface
205+
206+
#### 6. Technical Considerations
207+
- **Async Migration**: Move from blocking to async architecture
208+
- **Error Handling**: wreq uses different error types than reqwest
209+
- **TLS Backend**: wreq uses BoringSSL instead of system TLS
210+
- **Dependency Conflicts**: Avoid openssl-sys conflicts
211+
212+
**Status**: Planning phase complete, ready for implementation

Cargo.toml

Lines changed: 3 additions & 0 deletions
@@ -43,6 +43,9 @@ anyhow = { version = "1", optional = true }
4343
unicode-normalization = "0.1"
4444
unicode-segmentation = "1.12"
4545
htmd = { version = "0.3", optional = true }
46+
wreq-util = { version = "2.2", features = ["full"] }
47+
wreq = { version = "5.3", features = ["full"] }
48+
tokio = { version = "1.47", features = ["full"] }
4649

4750
[dev-dependencies]
4851
criterion = "0.7"

README.md

Lines changed: 57 additions & 16 deletions
@@ -36,10 +36,11 @@ the main content programmatically. This library helps solve this problem by:
3636
- Support for nested HTML structures
3737
- Efficient processing of large documents
3838
- Error handling for malformed HTML
39+
- **Markdown output** (optional feature) - Extract content as structured markdown
3940

4041
## Unicode Support
4142

42-
DOM Content Extraction includes robust Unicode support for handling multilingual content:
43+
DOM Content Extraction includes Unicode support for handling multilingual content:
4344

4445
- Proper character counting using Unicode grapheme clusters
4546
- Unicode normalization (NFC) for consistent text representation
@@ -93,16 +94,40 @@ cargo add dom-content-extraction
9394

9495
or add to your `Cargo.toml`
9596

96-
```
97+
```toml
9798
dom-content-extraction = "0.3"
9899
```
99100

101+
### Optional Features
102+
103+
To enable markdown output support:
104+
105+
```toml
106+
dom-content-extraction = { version = "0.3", features = ["markdown"] }
107+
```
108+
100109
## Documentation
101110

102111
Read the docs!
103112

104113
[dom-content-extraction documentation](https://docs.rs/dom-content-extraction/latest/dom_content_extraction/)
105114

115+
### Library Usage with Markdown
116+
117+
```rust
118+
use dom_content_extraction::{DensityTree, extract_content_as_markdown, scraper::Html};
119+
120+
let html = "<html><body><article><h1>Title</h1><p>Content</p></article></body></html>";
121+
let document = Html::parse_document(html);
122+
let mut dtree = DensityTree::from_document(&document)?;
123+
dtree.calculate_density_sum()?;
124+
125+
// Extract as markdown
126+
let markdown = extract_content_as_markdown(&dtree, &document)?;
127+
println!("{}", markdown);
128+
# Ok::<(), dom_content_extraction::DomExtractionError>(())
129+
```
130+
106131
## Run examples
107132

108133
Check examples.
@@ -113,10 +138,16 @@ This one will extract content from generated "lorem ipsum" page
113138
cargo run --example check -- lorem-ipsum
114139
```
115140

116-
This one print node with highest density:
141+
This one prints the node with the highest density:
117142

118143
```bash
119-
cargo run --examples check -- test4
144+
cargo run --example check -- test4
145+
```
146+
147+
Extract content as markdown from lorem ipsum (requires markdown feature):
148+
149+
```bash
150+
cargo run --example check -- lorem-ipsum-markdown
120151
```
121152

122153
There is also a scoring example, in which I'm trying to implement scoring.
@@ -154,7 +185,9 @@ Overall Performance:
154185

155186
## Binary Usage
156187

157-
The crate includes a command-line binary tool `dce` (DOM Content Extraction) for extracting main content from HTML documents. It supports both local files and remote URLs as input sources.
188+
The crate includes a command-line binary tool `dce` (DOM Content Extraction) for
189+
extracting main content from HTML documents. It supports both local files and
190+
remote URLs as input sources.
158191

159192
### Installation
160193

@@ -173,19 +206,35 @@ Options:
173206
-u, --url <URL> URL to fetch HTML content from
174207
-f, --file <FILE> Local HTML file to process
175208
-o, --output <FILE> Output file (stdout if not specified)
209+
--format <FORMAT> Output format [default: text] [possible values: text, markdown]
176210
-h, --help Print help
177211
-V, --version Print version
178212
```
179213

180214
Note: Either `--url` or `--file` must be specified, but not both.
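
The rule in the note above could be enforced with a small check like the following. It is a sketch only: `InputSource` and `select_input` are hypothetical names, and the real binary may rely on its argument parser for this instead.

```rust
/// Where the HTML comes from (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum InputSource {
    Url(String),
    File(String),
}

/// Accepts exactly one of `--url` or `--file` and rejects every other combination.
pub fn select_input(url: Option<&str>, file: Option<&str>) -> Result<InputSource, String> {
    match (url, file) {
        (Some(u), None) => Ok(InputSource::Url(u.to_string())),
        (None, Some(f)) => Ok(InputSource::File(f.to_string())),
        (Some(_), Some(_)) => Err("--url and --file cannot be used together".to_string()),
        (None, None) => Err("either --url or --file must be specified".to_string()),
    }
}
```

Matching on the pair of options makes all four cases explicit, so the compiler flags any combination that is left unhandled.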
181215

216+
### Markdown Output
217+
218+
To extract content as markdown, use the `--format markdown` option:
219+
220+
```bash
221+
# Extract as markdown from URL
222+
cargo run --bin dce -- --url "https://example.com" --format markdown
223+
224+
# Extract as markdown from file and save to output
225+
cargo run --bin dce -- --file input.html --format markdown --output content.md
226+
```
227+
228+
Note: Markdown output requires the `markdown` feature to be enabled.
229+
182230
### Features
183231

184232
- **URL Fetching**: Automatically downloads HTML content from specified URLs
185233
- **Timeout Control**: 30-second timeout for URL fetching to prevent hangs
186234
- **Error Handling**: Comprehensive error messages for common failure cases
187235
- **Flexible Output**: Write to file or stdout
188236
- **Temporary File Management**: Automatic cleanup of downloaded content
237+
- **Markdown Support**: Extract content as structured markdown (requires `markdown` feature)
189238

190239
### Examples
191240

@@ -204,16 +253,6 @@ Extract from URL and save directly to file:
204253
dce --url "https://example.com/page" --output content.txt
205254
```
206255

207-
### Error Handling
208-
209-
The binary provides clear error messages for common scenarios:
210-
211-
- Invalid URLs
212-
- Network timeouts
213-
- File access issues
214-
- HTML parsing errors
215-
- Content extraction failures
216-
217256
### Dependencies
218257

219258
The binary functionality requires the following additional dependencies:
@@ -223,6 +262,8 @@ The binary functionality requires the following additional dependencies:
223262
- `tempfile`: Temporary file management
224263
- `url`: URL parsing and validation
225264
- `anyhow`: Error handling
265+
- `htmd`: HTML to markdown conversion (for markdown feature)
226266

227-
These dependencies are only included when building with the default `cli` feature.
267+
These dependencies are only included when building with the default `cli`
268+
feature. The `markdown` feature requires the `htmd` dependency.
228269
