Skip to content

Commit 39ff3f9

Browse files
committed
feat: add remote batching pipeline with buffered per-destination queues
1 parent 2d0c767 commit 39ff3f9

File tree

3 files changed

+280
-61
lines changed

3 files changed

+280
-61
lines changed

README.md

Lines changed: 27 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -6,24 +6,35 @@ core runtime.
66

77
Current fastest speed, with default controls of `max-depth` 4, `max-links-per-page` 16, `politeness-ms` 250,
88
`partition-strategy` 'wiki-prefix' (instead of 'hash'), and `duration-secs` 4 (it crawls for 4 seconds, but any enqued
9-
link is still awaited, so it runs for approximately 30 sec.) is **68.98 pages/sec**.
9+
link is still awaited, so it runs for approximately 30 sec.) is **71.95 pages/sec**.
1010

1111
## Metrics
1212

13-
When running `cargo run --example wiki --features multi_thread -- --duration-secs 4 --partition wiki-prefix`:
13+
When running
1414

1515
```
16-
--- crawl metrics (29.43s) ---
17-
pages fetched: 2030
18-
urls fetched/sec: 68.98
19-
urls discovered: 4844
20-
urls enqueued: 2026
21-
duplicate skips: 2818
16+
cargo run --example wiki --features multi_thread -- \
17+
--duration-secs 4 \
18+
--partition wiki-prefix \
19+
--partition-namespace \
20+
--partition-buckets 26 \
21+
--remote-batch-size 32
22+
```
23+
24+
The crawl metrics were:
25+
26+
```
27+
--- crawl metrics (26.46s) ---
28+
pages fetched: 1904
29+
urls fetched/sec: 71.95
30+
urls discovered: 4351
31+
urls enqueued: 1900
32+
duplicate skips: 2451
2233
frontier rejects: 0
2334
http errors: 0
2435
url parse errors: 0
25-
local shard enqueues: 9315
26-
remote shard links: 3039 (batches 1442)
36+
local shard enqueues: 7870
37+
remote shard links: 2718 (batches 331)
2738
```
2839

2940
## Highlights
@@ -73,14 +84,19 @@ cargo run --example wiki --features multi_thread -- --duration-secs 60
7384
- Cross-shard discoveries are routed through bounded mpsc channels, so enqueue contention happens on a single consumer
7485
instead of every worker.
7586
- Pass `--partition wiki-prefix` (default: `hash`) to keep Wikipedia articles with similar prefixes on the same shard.
87+
- Use `--partition-buckets <n>` (default `0`, meaning shard count) to control how many alphabetical buckets feed into
88+
the shards, and `--partition-namespace` to keep namespaces like `Talk:` or `Help:` on stable shards.
89+
- Tune `--remote-batch-size <n>` (default 8) to control how many cross-shard links get buffered before the router sends
90+
them; higher values reduce channel wakeups at the cost of slightly delayed enqueues on the destination shard.
7691

7792
## Customizing Crawls
7893

7994
- `CrawlControls` (exposed via CLI/env vars) manage maximum depth, per-domain filters, link-per-page caps, politeness
8095
delays, run duration, and more. See `src/controls.rs` for every option.
8196
- `UrlFilter` lets you inject arbitrary site-specific logic—`examples/wiki.rs` filters out non-article namespaces and
8297
query patterns.
83-
- Metrics live in `src/runtime.rs` and can be extended if you need additional counters or telemetry sinks.
98+
- Metrics live in `src/runtime.rs` and can be extended if you need additional counters or telemetry sinks. Multi-thread
99+
runs also report `local shard enqueues` vs `remote shard links (batches)` so you can gauge partition efficiency.
84100

85101
## LLM-Oriented Next Steps
86102

src/controls.rs

Lines changed: 36 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -95,6 +95,21 @@ pub struct Cli {
9595
#[cfg(feature = "multi_thread")]
9696
#[arg(long, env = "FASTCRAWL_PARTITION", default_value = "hash")]
9797
pub partition: PartitionStrategyArg,
98+
99+
/// Number of wiki prefix buckets (0 = auto = shard count)
100+
#[cfg(feature = "multi_thread")]
101+
#[arg(long, env = "FASTCRAWL_PARTITION_BUCKETS", default_value_t = 0)]
102+
pub partition_buckets: usize,
103+
104+
/// Treat namespaces (e.g., Talk:, Help:) as part of the partition key
105+
#[cfg(feature = "multi_thread")]
106+
#[arg(long, env = "FASTCRAWL_PARTITION_NAMESPACE", default_value_t = false)]
107+
pub partition_namespace: bool,
108+
109+
/// Maximum remote links to buffer before flushing to another shard (0 = default)
110+
#[cfg(feature = "multi_thread")]
111+
#[arg(long, env = "FASTCRAWL_REMOTE_BATCH_SIZE", default_value_t = 0)]
112+
pub remote_batch_size: usize,
98113
}
99114

100115
impl Cli {
@@ -123,8 +138,13 @@ impl Cli {
123138

124139
#[cfg(feature = "multi_thread")]
125140
/// Returns the requested sharding strategy when multi-threading is enabled.
126-
pub fn partition_strategy(&self) -> PartitionStrategyArg {
127-
self.partition
141+
pub fn partition_settings(&self) -> PartitionSettings {
142+
PartitionSettings {
143+
strategy: self.partition,
144+
wiki_bucket_count: (self.partition_buckets > 0).then_some(self.partition_buckets),
145+
wiki_include_namespace: self.partition_namespace,
146+
remote_batch_size: (self.remote_batch_size > 0).then_some(self.remote_batch_size),
147+
}
128148
}
129149
}
130150

@@ -137,3 +157,17 @@ pub enum PartitionStrategyArg {
137157
/// Use Wikipedia-style namespace/title prefixes to keep related pages together.
138158
WikiPrefix,
139159
}
160+
161+
#[cfg(feature = "multi_thread")]
162+
#[derive(Copy, Clone, Debug)]
163+
/// Parsed partition configuration used by the runtime.
164+
pub struct PartitionSettings {
165+
/// Selected strategy variant.
166+
pub strategy: PartitionStrategyArg,
167+
/// Optional bucket count override for wiki prefix strategy (defaults to shard count when `None`).
168+
pub wiki_bucket_count: Option<usize>,
169+
/// Whether to incorporate namespace prefixes (e.g., `Talk:`) into the partition key.
170+
pub wiki_include_namespace: bool,
171+
/// Optional remote batch size override.
172+
pub remote_batch_size: Option<usize>,
173+
}

0 commit comments

Comments
 (0)