This paper presents a novel approach to web crawling that adaptively determines when sufficient information has been gathered to answer a given query. Unlike traditional exhaustive crawling methods, our Progressive Information Sufficiency (PIS) framework uses statistical measures to balance information completeness against crawling efficiency. We introduce a multi-strategy architecture supporting pure statistical, embedding-enhanced, and LLM-assisted approaches, with theoretical guarantees on convergence and practical evaluation methods using synthetic datasets.
Traditional web crawling approaches follow predetermined patterns (breadth-first, depth-first) without consideration for information sufficiency. This work addresses the fundamental question: "When do we have enough information to answer a query and similar queries in its domain?"
We formalize this as an optimal stopping problem in information foraging, introducing metrics for coverage, consistency, and saturation that enable crawlers to make intelligent decisions about when to stop crawling and which links to follow.
Let:
- K = {d₁, d₂, ..., dₙ} be the current knowledge base (crawled documents)
- Q be the user query
- L = {l₁, l₂, ..., lₘ} be available links with preview metadata
- θ be the confidence threshold for information sufficiency
- Minimize |K| (number of crawled pages)
- Maximize P(answers(Q) | K) (probability of answering Q given K)
- Ensure coverage of Q's domain (similar queries)
We define Information Sufficiency as:
IS(K, Q) = min(Coverage(K, Q), Consistency(K, Q), 1 - Redundancy(K)) × DomainCoverage(K, Q)
Coverage measures how well current knowledge covers query terms and related concepts:
Coverage(K, Q) = Σ(t ∈ Q) log(df(t, K) + 1) × idf(t) / |Q|
Where:
- df(t, K) = document frequency of term t in knowledge base K
- idf(t) = inverse document frequency weight
Consistency measures information coherence across documents:
Consistency(K, Q) = 1 - Var(answers from random subsets of K)
This captures the principle that sufficient knowledge should provide stable answers regardless of document subset.
Saturation detects diminishing returns:
Saturation(K) = 1 - (ΔInfo(Kₙ) / ΔInfo(K₁))
Where ΔInfo represents marginal information gain from the nth crawl.
Expected information gain from uncrawled links:
ExpectedGain(l) = Relevance(l, Q) × Novelty(l, K) × Authority(l)
Components:
- Relevance: BM25(preview_text, Q)
- Novelty: 1 - max_similarity(preview, K)
- Authority: f(url_structure, domain_metrics)
Algorithm: ProgressiveCrawl(start_url, query, θ)
K ← ∅
crawled ← {start_url}
pending ← extract_links(crawl(start_url))
while IS(K, Q) < θ and |crawled| < max_pages:
candidates ← rank_by_expected_gain(pending, Q, K)
if max(ExpectedGain(candidates)) < min_gain:
break // Diminishing returns
to_crawl ← top_k(candidates)
new_docs ← parallel_crawl(to_crawl)
K ← K ∪ new_docs
crawled ← crawled ∪ to_crawl
pending ← extract_new_links(new_docs) - crawled
return K
Crawling terminates when:
- IS(K, Q) ≥ θ (sufficient information)
- d(IS)/d(crawls) < ε (plateau reached)
- |crawled| ≥ max_pages (resource limit)
- max(ExpectedGain) < min_gain (no promising links)
AbstractStrategy
├── StatisticalStrategy (no LLM, no embeddings)
├── EmbeddingStrategy (with semantic similarity)
└── LLMStrategy (with language model assistance)
Pure statistical approach using:
- BM25 for relevance scoring
- Term frequency analysis for coverage
- Graph structure for authority
- No external models required
Advantages: Fast, no API costs, works offline Best for: Technical documentation, specific terminology
Semantic understanding through embeddings:
- Query expansion into semantic variations
- Coverage mapping in embedding space
- Gap-driven link selection
- Validation-based stopping criteria
Mathematical Framework:
Coverage(K, Q) = mean(max_similarity(q, K) for q in Q_expanded)
Gap(q) = 1 - max_similarity(q, K)
LinkScore(l) = Σ(Gap(q) × relevance(l, q)) × (1 - redundancy(l, K))
Key Parameters:
embedding_k_exp: Exponential decay factor for distance-to-score mappingembedding_coverage_radius: Distance threshold for query coverageembedding_min_confidence_threshold: Minimum relevance threshold
Advantages: Semantic understanding, handles ambiguity, detects irrelevance Best for: Research queries, conceptual topics, diverse content
- Level 0: Statistical only (implemented)
- Level 1: + Embeddings for semantic similarity (implemented)
- Level 2: + LLM for query understanding (future)
Using LLM to create evaluation data:
def generate_synthetic_dataset(domain_url):
# 1. Fully crawl domain
full_knowledge = exhaustive_crawl(domain_url)
# 2. Generate answerable queries
queries = llm_generate_queries(full_knowledge)
# 3. Create query variations
for q in queries:
variations = generate_variations(q) # synonyms, sub/super queries
return queries, variations, full_knowledge- Efficiency: Information gained / Pages crawled
- Completeness: Answerable queries / Total queries
- Redundancy: 1 - (Unique information / Total information)
- Convergence Rate: Pages to 95% completeness
- Impact of each score component (coverage, consistency, saturation)
- Sensitivity to threshold parameters
- Performance across different domain types
Theorem: For finite websites, ProgressiveCrawl converges to IS(K, Q) ≥ θ or exhausts all reachable pages.
Proof sketch: IS(K, Q) is monotonically non-decreasing with each crawl, bounded above by 1.
Under certain assumptions about link preview accuracy:
- Expected crawls ≤ 2 × optimal_crawls
- Approximation ratio improves with preview quality
- CrawlState: Maintains crawl history and metrics
- AdaptiveConfig: Configuration parameters
- CrawlStrategy: Pluggable strategy interface
- AdaptiveCrawler: Main orchestrator
- Wraps existing AsyncWebCrawler
- Leverages link preview functionality
- Maintains backward compatibility
Knowledge base serialization for:
- Resumable crawls
- Knowledge sharing
- Offline analysis
- Temporal information value
- Multi-query optimization
- Active learning from user feedback
- Collaborative knowledge building
- Federated information sufficiency
- Transfer learning across domains
- Meta-learning for threshold selection
Progressive crawling with adaptive information foraging provides a principled approach to efficient web information extraction. By combining coverage, consistency, and saturation metrics, we can determine information sufficiency without ground truth labels. The multi-strategy architecture allows graceful enhancement from pure statistical to LLM-assisted approaches based on requirements and resources.
-
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
-
Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
-
Pirolli, P., & Card, S. (1999). Information Foraging. Psychological Review, 106(4), 643-675.
-
Dasgupta, S. (2005). Analysis of a greedy active learning strategy. Advances in Neural Information Processing Systems.
class StatisticalStrategy:
def calculate_confidence(self, state):
coverage = self.calculate_coverage(state)
consistency = self.calculate_consistency(state)
saturation = self.calculate_saturation(state)
return min(coverage, consistency, saturation)
def calculate_coverage(self, state):
# BM25-based term coverage
term_scores = []
for term in state.query.split():
df = state.document_frequencies.get(term, 0)
idf = self.idf_cache.get(term, 1.0)
term_scores.append(log(df + 1) * idf)
return mean(term_scores) / max_possible_score
def rank_links(self, state):
scored_links = []
for link in state.pending_links:
relevance = self.bm25_score(link.preview_text, state.query)
novelty = self.calculate_novelty(link, state.knowledge_base)
authority = self.url_authority(link.href)
score = relevance * novelty * authority
scored_links.append((link, score))
return sorted(scored_links, key=lambda x: x[1], reverse=True)-
Dataset Creation:
- Select diverse domains (documentation, blogs, e-commerce)
- Generate 100 queries per domain using LLM
- Create query variations (5-10 per query)
-
Baseline Comparisons:
- BFS crawler (depth-limited)
- DFS crawler (depth-limited)
- Random crawler
- Oracle (knows relevant pages)
-
Metrics Collection:
- Pages crawled vs query answerability
- Time to sufficient confidence
- False positive/negative rates
-
Statistical Analysis:
- ANOVA for strategy comparison
- Regression for parameter sensitivity
- Bootstrap for confidence intervals