Commit edd87ac

Merge pull request #46 from oiwn/dev
update README.md
2 parents 20eb6cb + cdc17b1 commit edd87ac

4 files changed: +151 -17 lines changed

.deny.toml

Lines changed: 3 additions & 1 deletion
@@ -6,7 +6,9 @@ allow = [
 "MPL-2.0",
 "ISC",
 "BSD-3-Clause",
-"Zlib"
+"Zlib",
+"CDLA-Permissive-2.0",
+"GPL-3.0"
 ]

 [advisories]

CLAUDE.md

Lines changed: 88 additions & 0 deletions
@@ -122,3 +122,91 @@ cargo run --bin dce -- --file input.html --output out.txt # Extract from file
 - ✅ All existing tests continue to pass
 - ✅ Generated markdown includes proper formatting (headers, paragraphs)
 - ✅ Works with both markdown feature enabled and disabled
+
+## CLI Integration Complete
+
+**Goal**: Add markdown output option to the `dce` CLI tool
+
+**Implementation**:
+- Added `--format` option to CLI with values `text` (default) and `markdown`
+- Modified `process_html()` function to handle both text and markdown formats
+- Added proper feature gating with clear error messages when markdown feature not enabled
+- Maintained backward compatibility with existing text output
+
+**CLI Usage**:
+```bash
+# Extract as text (default)
+cargo run -- --file input.html
+cargo run -- --url "https://example.com"
+
+# Extract as markdown
+cargo run -- --file input.html --format markdown
+cargo run -- --url "https://example.com" --format markdown
+
+# Output to file
+cargo run -- --file input.html --format markdown --output content.md
+```
+
+**Technical Details**:
+- Uses long option `--format` (no short option to avoid conflict with `--file -f`)
+- Proper error handling when markdown feature is not enabled
+- Clean integration with existing density analysis pipeline
+- Coverage exclusion for `src/main.rs` via `.llvm-cov` configuration
+
+**Testing**:
+- ✅ CLI builds successfully with and without markdown feature
+- ✅ Help output shows new `--format` option
+- ✅ Error handling works correctly when markdown requested but feature disabled
+- ✅ Backward compatibility maintained for existing text output
+
+## Current Task: Replace reqwest with wreq for browser-like HTTP requests
+
+**Goal**: Migrate from simple reqwest HTTP client to wreq for advanced browser emulation and TLS fingerprinting capabilities
+
+### Migration Plan
+
+#### 1. Dependency Updates
+```bash
+# Remove reqwest from Cargo.toml cli features
+# Add wreq and related dependencies
+wreq = "6.0.0-rc.20"
+wreq-util = "3.0.0-rc.3"
+tokio = { version = "1", features = ["full"] }
+```
+
+#### 2. Code Changes (src/main.rs)
+- Add `#[tokio::main]` attribute to main function
+- Convert `fetch_url()` from blocking to async
+- Replace `reqwest::blocking::Client` with `wreq::Client`
+- Add browser emulation configuration using `wreq_util::Emulation`
+- Update error handling for wreq's Result type
+
+#### 3. Browser Emulation Configuration
+```rust
+use wreq::Client;
+use wreq_util::Emulation;
+
+let client = Client::builder()
+    .emulation(Emulation::Chrome120) // Or other browser profiles
+    .build()?;
+```
+
+#### 4. Key Benefits
+- **TLS Fingerprinting**: Avoids detection as bot/scraper
+- **Browser Emulation**: Mimics real browser behavior
+- **HTTP/2 Support**: Modern protocol support
+- **Advanced Features**: Cookie store, redirect policies, rotating proxies
+
+#### 5. Testing Strategy
+- Verify URL fetching still works with various websites
+- Test TLS fingerprinting effectiveness
+- Ensure error handling is robust
+- Maintain backward compatibility with existing CLI interface
+
+#### 6. Technical Considerations
+- **Async Migration**: Move from blocking to async architecture
+- **Error Handling**: wreq uses different error types than reqwest
+- **TLS Backend**: wreq uses BoringSSL instead of system TLS
+- **Dependency Conflicts**: Avoid openssl-sys conflicts
+
+**Status**: Planning phase complete, ready for implementation
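
Review note: the `--format` handling described in the CLAUDE.md changes above (default `text`, optional `markdown`, and a clear error when the feature is not compiled in) could look roughly like the sketch below. This is not the code in `src/main.rs`, which is not part of this diff. The names `OutputFormat`, `FormatError`, and `resolve_format` are hypothetical, and the `markdown_enabled` flag is an assumed stand-in for a `cfg!(feature = "markdown")` check.

```rust
use std::fmt;
use std::str::FromStr;

/// Output formats accepted by the hypothetical `--format` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum OutputFormat {
    #[default]
    Text,
    Markdown,
}

/// Errors this sketch can report while resolving the format.
#[derive(Debug, PartialEq, Eq)]
enum FormatError {
    Unknown(String),
    MarkdownDisabled,
}

impl fmt::Display for FormatError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Unknown(value) => {
                write!(f, "unknown format '{value}' (expected 'text' or 'markdown')")
            }
            Self::MarkdownDisabled => {
                write!(f, "markdown output requires building with the `markdown` feature")
            }
        }
    }
}

impl FromStr for OutputFormat {
    type Err = FormatError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "text" => Ok(Self::Text),
            "markdown" => Ok(Self::Markdown),
            other => Err(FormatError::Unknown(other.to_string())),
        }
    }
}

/// Picks the output format, falling back to text when no flag is given and
/// rejecting markdown when support for it was not compiled in.
fn resolve_format(
    arg: Option<&str>,
    markdown_enabled: bool,
) -> Result<OutputFormat, FormatError> {
    let format = match arg {
        Some(value) => value.parse::<OutputFormat>()?,
        None => OutputFormat::default(),
    };
    if format == OutputFormat::Markdown && !markdown_enabled {
        return Err(FormatError::MarkdownDisabled);
    }
    Ok(format)
}

fn main() {
    // Assumed stand-in for `cfg!(feature = "markdown")` in a feature-gated build.
    let markdown_enabled = false;
    for arg in [None, Some("text"), Some("markdown"), Some("html")] {
        match resolve_format(arg, markdown_enabled) {
            Ok(format) => println!("{arg:?} -> {format:?}"),
            Err(err) => eprintln!("{arg:?} -> error: {err}"),
        }
    }
}
```

Keeping the feature check separate from parsing means the help text can list both values regardless of how the binary was built, while the error for a disabled feature stays distinct from the error for an unknown value.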

Cargo.toml

Lines changed: 3 additions & 0 deletions
@@ -43,6 +43,9 @@ anyhow = { version = "1", optional = true }
 unicode-normalization = "0.1"
 unicode-segmentation = "1.12"
 htmd = { version = "0.3", optional = true }
+wreq-util = { version = "2.2", features = ["full"] }
+wreq = { version = "5.3", features = ["full"] }
+tokio = { version = "1.47", features = ["full"] }

 [dev-dependencies]
 criterion = "0.7"

README.md

Lines changed: 57 additions & 16 deletions
@@ -30,10 +30,11 @@ the main content programmatically. This library helps solve this problem by:
 - Support for nested HTML structures
 - Efficient processing of large documents
 - Error handling for malformed HTML
+- **Markdown output** (optional feature) - Extract content as structured markdown

 ## Unicode Support

-DOM Content Extraction includes robust Unicode support for handling multilingual content:
+DOM Content Extraction includes Unicode support for handling multilingual content:

 - Proper character counting using Unicode grapheme clusters
 - Unicode normalization (NFC) for consistent text representation
@@ -87,16 +88,40 @@ cargo add dom-content-extraction

 or add to you `Cargo.toml`

-```
+```toml
 dom-content-extraction = "0.3"
 ```

+### Optional Features
+
+To enable markdown output support:
+
+```toml
+dom-content-extraction = { version = "0.3", features = ["markdown"] }
+```
+
 ## Documentation

 Read the docs!

 [dom-content-extraction documentation](https://docs.rs/dom-content-extraction/latest/dom_content_extraction/)

+### Library Usage with Markdown
+
+```rust
+use dom_content_extraction::{DensityTree, extract_content_as_markdown, scraper::Html};
+
+let html = "<html><body><article><h1>Title</h1><p>Content</p></article></body></html>";
+let document = Html::parse_document(html);
+let mut dtree = DensityTree::from_document(&document)?;
+dtree.calculate_density_sum()?;
+
+// Extract as markdown
+let markdown = extract_content_as_markdown(&dtree, &document)?;
+println!("{}", markdown);
+# Ok::<(), dom_content_extraction::DomExtractionError>(())
+```
+
 ## Run examples

 Check examples.
@@ -107,10 +132,16 @@ This one will extract content from generated "lorem ipsum" page
 cargo run --example check -- lorem-ipsum
 ```

-This one print node with highest density:
+This one prints node with highest density:

 ```bash
-cargo run --examples check -- test4
+cargo run --example check -- test4
+```
+
+Extract content as markdown from lorem ipsum (requires markdown feature):
+
+```bash
+cargo run --example check -- lorem-ipsum-markdown
 ```

 There is scoring example i'm trying to implement scoring.
@@ -148,7 +179,9 @@ Overall Performance:

 ## Binary Usage

-The crate includes a command-line binary tool `dce` (DOM Content Extraction) for extracting main content from HTML documents. It supports both local files and remote URLs as input sources.
+The crate includes a command-line binary tool `dce` (DOM Content Extraction) for
+extracting main content from HTML documents. It supports both local files and
+remote URLs as input sources.

 ### Installation

@@ -167,19 +200,35 @@ Options:
 -u, --url <URL> URL to fetch HTML content from
 -f, --file <FILE> Local HTML file to process
 -o, --output <FILE> Output file (stdout if not specified)
+--format <FORMAT> Output format [default: text] [possible values: text, markdown]
 -h, --help Print help
 -V, --version Print version
 ```

 Note: Either `--url` or `--file` must be specified, but not both.

+### Markdown Output
+
+To extract content as markdown format, use the `--format markdown` option:
+
+```bash
+# Extract as markdown from URL
+cargo run --bin dce -- --url "https://example.com" --format markdown
+
+# Extract as markdown from file and save to output
+cargo run --bin dce -- --file input.html --format markdown --output content.md
+```
+
+Note: Markdown output requires the `markdown` feature to be enabled.
+
 ### Features

 - **URL Fetching**: Automatically downloads HTML content from specified URLs
 - **Timeout Control**: 30-second timeout for URL fetching to prevent hangs
 - **Error Handling**: Comprehensive error messages for common failure cases
 - **Flexible Output**: Write to file or stdout
 - **Temporary File Management**: Automatic cleanup of downloaded content
+- **Markdown Support**: Extract content as structured markdown (requires `markdown` feature)

 ### Examples

@@ -198,16 +247,6 @@ Extract from URL and save directly to file:
 dce --url "https://example.com/page" --output content.txt
 ```

-### Error Handling
-
-The binary provides clear error messages for common scenarios:
-
-- Invalid URLs
-- Network timeouts
-- File access issues
-- HTML parsing errors
-- Content extraction failures
-
 ### Dependencies

 The binary functionality requires the following additional dependencies:
@@ -217,6 +256,8 @@ The binary functionality requires the following additional dependencies:
 - `tempfile`: Temporary file management
 - `url`: URL parsing and validation
 - `anyhow`: Error handling
+- `htmd`: HTML to markdown conversion (for markdown feature)

-These dependencies are only included when building with the default `cli` feature.
+These dependencies are only included when building with the default `cli`
+feature. The `markdown` feature requires the `htmd` dependency.
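
Review note: the README states that either `--url` or `--file` must be given, but not both. A minimal sketch of that rule follows, assuming the two options arrive as optional strings. `InputSource` and `select_input` are hypothetical names used only for illustration; the actual binary may enforce this through its argument parser instead.

```rust
/// Where the HTML comes from in this sketch.
#[derive(Debug, PartialEq, Eq)]
enum InputSource {
    Url(String),
    File(String),
}

/// Accepts exactly one of the two inputs and rejects every other combination.
fn select_input(url: Option<&str>, file: Option<&str>) -> Result<InputSource, String> {
    match (url, file) {
        (Some(u), None) => Ok(InputSource::Url(u.to_string())),
        (None, Some(f)) => Ok(InputSource::File(f.to_string())),
        (Some(_), Some(_)) => Err("specify either --url or --file, not both".to_string()),
        (None, None) => Err("one of --url or --file is required".to_string()),
    }
}

fn main() {
    // One valid call and one invalid call, printed with their debug output.
    println!("{:?}", select_input(Some("https://example.com"), None));
    println!("{:?}", select_input(None, None));
}
```

Matching on the pair makes all four combinations explicit, so the compiler flags any case that is left unhandled.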
