
Commit 0c8bb74

Release v0.7.0-r1: The Adaptive Intelligence Update
- Bump version to 0.7.0
- Add release notes and demo files
- Update README with v0.7.0 features
- Update Docker configurations for v0.7.0-r1
- Move v0.7.0 demo files to releases_review
- Fix BM25 scoring bug in URLSeeder

Major features:
- Adaptive Crawling with pattern learning
- Virtual Scroll support for infinite pages
- Link Preview with 3-layer scoring
- Async URL Seeder for massive discovery
- Performance optimizations
1 parent ba2ed53 commit 0c8bb74

11 files changed: +1307 -89 lines


Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 FROM python:3.12-slim-bookworm AS build

 # C4ai version
-ARG C4AI_VER=0.6.0
+ARG C4AI_VER=0.7.0-r1
 ENV C4AI_VERSION=$C4AI_VER
 LABEL c4ai.version=$C4AI_VER

README.md

Lines changed: 67 additions & 6 deletions
@@ -26,9 +26,9 @@

 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

-[✨ Check out latest update v0.6.0](#-recent-updates)
+[✨ Check out latest update v0.7.0](#-recent-updates)

-🎉 **Version 0.6.0 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
+🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://docs.crawl4ai.com/blog/release-v0.7.0)

 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -274,8 +274,8 @@ The new Docker implementation includes:

 ```bash
 # Pull and run the latest release candidate
-docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
-docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
+docker pull unclecode/crawl4ai:0.7.0
+docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0

 # Visit the playground at http://localhost:11235/playground
 ```
@@ -518,7 +518,69 @@ async def test_news_crawl():

 ## ✨ Recent Updates

-### Version 0.6.0 Release Highlights
+### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
+
+- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
+  ```python
+  config = AdaptiveConfig(
+      confidence_threshold=0.7,
+      max_history=100,
+      learning_rate=0.2
+  )
+
+  result = await crawler.arun(
+      "https://news.example.com",
+      config=CrawlerRunConfig(adaptive_config=config)
+  )
+  # Crawler learns patterns and improves extraction over time
+  ```
+
+- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
+  ```python
+  scroll_config = VirtualScrollConfig(
+      container_selector="[data-testid='feed']",
+      scroll_count=20,
+      scroll_by="container_height",
+      wait_after_scroll=1.0
+  )
+
+  result = await crawler.arun(url, config=CrawlerRunConfig(
+      virtual_scroll_config=scroll_config
+  ))
+  ```
+
+- **🔗 Intelligent Link Analysis**: 3-layer scoring system for smart link prioritization:
+  ```python
+  link_config = LinkPreviewConfig(
+      query="machine learning tutorials",
+      score_threshold=0.3,
+      concurrent_requests=10
+  )
+
+  result = await crawler.arun(url, config=CrawlerRunConfig(
+      link_preview_config=link_config,
+      score_links=True
+  ))
+  # Links ranked by relevance and quality
+  ```
+
+- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
+  ```python
+  seeder = AsyncUrlSeeder(SeedingConfig(
+      source="sitemap+cc",
+      pattern="*/blog/*",
+      query="python tutorials",
+      score_threshold=0.4
+  ))
+
+  urls = await seeder.discover("https://example.com")
+  ```
+
+- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
+
+Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+
+### Previous Version: 0.6.0 Release Highlights

 - **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
 ```python
@@ -588,7 +650,6 @@ async def test_news_crawl():

 - **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements

-Read the full details in our [0.6.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

 ### Previous Version: 0.5.0 Major Release Highlights

crawl4ai/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # crawl4ai/__version__.py

 # This is the version that will be used for stable releases
-__version__ = "0.6.3"
+__version__ = "0.7.0"

 # For nightly builds, this gets set during build process
 __nightly_version__ = None

crawl4ai/async_configs.py

Lines changed: 50 additions & 15 deletions
@@ -1659,22 +1659,57 @@ class SeedingConfig:
     """
     def __init__(
         self,
-        source: str = "sitemap+cc",  # Options: "sitemap", "cc", "sitemap+cc"
-        pattern: Optional[str] = "*",  # URL pattern to filter discovered URLs (e.g., "*example.com/blog/*")
-        live_check: bool = False,  # Whether to perform HEAD requests to verify URL liveness
-        extract_head: bool = False,  # Whether to fetch and parse <head> section for metadata
-        max_urls: int = -1,  # Maximum number of URLs to discover (default: -1 for no limit)
-        concurrency: int = 1000,  # Maximum concurrent requests for live checks/head extraction
-        hits_per_sec: int = 5,  # Rate limit in requests per second
-        force: bool = False,  # If True, bypasses the AsyncUrlSeeder's internal .jsonl cache
-        base_directory: Optional[str] = None,  # Base directory for UrlSeeder's cache files (.jsonl)
-        llm_config: Optional[LLMConfig] = None,  # Forward LLM config for future use (e.g., relevance scoring)
-        verbose: Optional[bool] = None,  # Override crawler's general verbose setting
-        query: Optional[str] = None,  # Search query for relevance scoring
-        score_threshold: Optional[float] = None,  # Minimum relevance score to include URL (0.0-1.0)
-        scoring_method: str = "bm25",  # Scoring method: "bm25" (default), future: "semantic"
-        filter_nonsense_urls: bool = True,  # Filter out utility URLs like robots.txt, sitemap.xml, etc.
+        source: str = "sitemap+cc",
+        pattern: Optional[str] = "*",
+        live_check: bool = False,
+        extract_head: bool = False,
+        max_urls: int = -1,
+        concurrency: int = 1000,
+        hits_per_sec: int = 5,
+        force: bool = False,
+        base_directory: Optional[str] = None,
+        llm_config: Optional[LLMConfig] = None,
+        verbose: Optional[bool] = None,
+        query: Optional[str] = None,
+        score_threshold: Optional[float] = None,
+        scoring_method: str = "bm25",
+        filter_nonsense_urls: bool = True,
     ):
+        """
+        Initialize URL seeding configuration.
+
+        Args:
+            source: Discovery source(s) to use. Options: "sitemap", "cc" (Common Crawl),
+                or "sitemap+cc" (both). Default: "sitemap+cc"
+            pattern: URL pattern to filter discovered URLs (e.g., "*example.com/blog/*").
+                Supports glob-style wildcards. Default: "*" (all URLs)
+            live_check: Whether to perform HEAD requests to verify URL liveness.
+                Default: False
+            extract_head: Whether to fetch and parse <head> section for metadata extraction.
+                Required for BM25 relevance scoring. Default: False
+            max_urls: Maximum number of URLs to discover. Use -1 for no limit.
+                Default: -1
+            concurrency: Maximum concurrent requests for live checks/head extraction.
+                Default: 1000
+            hits_per_sec: Rate limit in requests per second to avoid overwhelming servers.
+                Default: 5
+            force: If True, bypasses the AsyncUrlSeeder's internal .jsonl cache and
+                re-fetches URLs. Default: False
+            base_directory: Base directory for UrlSeeder's cache files (.jsonl).
+                If None, uses default ~/.crawl4ai/. Default: None
+            llm_config: LLM configuration for future features (e.g., semantic scoring).
+                Currently unused. Default: None
+            verbose: Override crawler's general verbose setting for seeding operations.
+                Default: None (inherits from crawler)
+            query: Search query for BM25 relevance scoring (e.g., "python tutorials").
+                Requires extract_head=True. Default: None
+            score_threshold: Minimum relevance score (0.0-1.0) to include URL.
+                Only applies when query is provided. Default: None
+            scoring_method: Scoring algorithm to use. Currently only "bm25" is supported.
+                Future: "semantic". Default: "bm25"
+            filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
+                ads.txt, favicon.ico, etc. Default: True
+        """
         self.source = source
         self.pattern = pattern
         self.live_check = live_check
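
The docstring above defines the contract; to show how the pieces fit together, here is a minimal usage sketch assembled from the README example earlier in this commit. Treat it as illustrative rather than canonical: the domain, query, and numeric values are placeholders, and the top-level import path is assumed.

```python
# Minimal sketch of SeedingConfig in use, mirroring the README example.
# The domain, query, and thresholds below are illustrative placeholders.
import asyncio

from crawl4ai import AsyncUrlSeeder, SeedingConfig  # assumed import path


async def main() -> None:
    config = SeedingConfig(
        source="sitemap+cc",        # discover via sitemaps and Common Crawl
        pattern="*/blog/*",         # keep only blog-style URLs
        extract_head=True,          # required for BM25 relevance scoring
        query="python tutorials",   # relevance query
        score_threshold=0.4,        # drop URLs scoring below 0.4
        max_urls=500,               # cap discovery; -1 means no limit
    )
    seeder = AsyncUrlSeeder(config)
    urls = await seeder.discover("https://example.com")
    for entry in urls[:10]:
        print(entry["url"], entry.get("relevance_score"))


asyncio.run(main())
```

Note that `query` only takes effect with `extract_head=True`, since relevance is computed from `<head>` metadata; the seeder diff below shows that scoring now happens in a single collective pass after discovery.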

crawl4ai/async_url_seeder.py

Lines changed: 31 additions & 53 deletions
@@ -424,10 +424,21 @@ async def worker(res_list: List[Dict[str, Any]]):
         self._log("info", "Finished URL seeding for {domain}. Total URLs: {count}",
                   params={"domain": domain, "count": len(results)}, tag="URL_SEED")

-        # Sort by relevance score if query was provided
+        # Apply BM25 scoring if query was provided
         if query and extract_head and scoring_method == "bm25":
-            results.sort(key=lambda x: x.get(
-                "relevance_score", 0.0), reverse=True)
+            # Apply collective BM25 scoring across all documents
+            results = await self._apply_bm25_scoring(results, config)
+
+            # Filter by score threshold if specified
+            if score_threshold is not None:
+                original_count = len(results)
+                results = [r for r in results if r.get("relevance_score", 0) >= score_threshold]
+                if original_count > len(results):
+                    self._log("info", "Filtered {filtered} URLs below score threshold {threshold}",
+                              params={"filtered": original_count - len(results), "threshold": score_threshold}, tag="URL_SEED")
+
+            # Sort by relevance score
+            results.sort(key=lambda x: x.get("relevance_score", 0.0), reverse=True)
             self._log("info", "Sorted {count} URLs by relevance score for query: '{query}'",
                       params={"count": len(results), "query": query}, tag="URL_SEED")
         elif query and not extract_head:
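
The significant change in this hunk is architectural: instead of scoring each URL in isolation inside `_validate`, the seeder now collects all entries first and hands them to `_apply_bm25_scoring` in one pass, so BM25's corpus statistics (IDF, average document length) are computed over the whole discovered set. The body of `_apply_bm25_scoring` is not part of this diff; the sketch below outlines what such a collective pass looks like, assuming `rank_bm25` (suggested by the `bm25.get_scores` call further down) and using illustrative names throughout.

```python
# Sketch of a collective BM25 pass over seeded URL entries.
# rank_bm25 is assumed from the bm25.get_scores call in
# _calculate_bm25_score; field and function names are illustrative.
from rank_bm25 import BM25Okapi


def score_entries(entries: list[dict], query: str, score_threshold: float | None) -> list[dict]:
    if not entries:
        return entries

    # One corpus for all documents, so IDF and average document length
    # reflect the whole crawl instead of a single page at a time.
    corpus = [entry.get("text", "").lower().split() for entry in entries]
    scores = BM25Okapi(corpus).get_scores(query.lower().split())

    # Min-max normalize into [0, 1]; raw BM25 scores can be negative.
    lo, hi = min(scores), max(scores)
    for entry, raw in zip(entries, scores):
        entry["relevance_score"] = 0.5 if hi == lo else float((raw - lo) / (hi - lo))

    # Filter below-threshold entries, then sort best-first,
    # mirroring the control flow in the hunk above.
    if score_threshold is not None:
        entries = [e for e in entries if e["relevance_score"] >= score_threshold]
    entries.sort(key=lambda e: e["relevance_score"], reverse=True)
    return entries
```
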
@@ -982,28 +993,6 @@ async def _validate(self, url: str, res_list: List[Dict[str, Any]], live: bool,
                 "head_data": head_data,
             }

-            # Apply BM25 scoring if query is provided and head data exists
-            if query and ok and scoring_method == "bm25" and head_data:
-                text_context = self._extract_text_context(head_data)
-                if text_context:
-                    # Calculate BM25 score for this single document
-                    # scores = self._calculate_bm25_score(query, [text_context])
-                    scores = await asyncio.to_thread(self._calculate_bm25_score, query, [text_context])
-                    relevance_score = scores[0] if scores else 0.0
-                    entry["relevance_score"] = float(relevance_score)
-                else:
-                    # No text context, use URL-based scoring as fallback
-                    relevance_score = self._calculate_url_relevance_score(
-                        query, entry["url"])
-                    entry["relevance_score"] = float(relevance_score)
-            elif query:
-                # Query provided but no head data - we reject this entry
-                self._log("debug", "No head data for {url}, using URL-based scoring",
-                          params={"url": url}, tag="URL_SEED")
-                return
-                # relevance_score = self._calculate_url_relevance_score(query, entry["url"])
-                # entry["relevance_score"] = float(relevance_score)
-
         elif live:
             self._log("debug", "Performing live check for {url}", params={
                 "url": url}, tag="URL_SEED")
@@ -1013,35 +1002,13 @@ async def _validate(self, url: str, res_list: List[Dict[str, Any]], live: bool,
                       params={"status": status.upper(), "url": url}, tag="URL_SEED")
             entry = {"url": url, "status": status, "head_data": {}}

-            # Apply URL-based scoring if query is provided
-            if query:
-                relevance_score = self._calculate_url_relevance_score(
-                    query, url)
-                entry["relevance_score"] = float(relevance_score)
-
         else:
             entry = {"url": url, "status": "unknown", "head_data": {}}

-            # Apply URL-based scoring if query is provided
-            if query:
-                relevance_score = self._calculate_url_relevance_score(
-                    query, url)
-                entry["relevance_score"] = float(relevance_score)
-
-        # Now decide whether to add the entry based on score threshold
-        if query and "relevance_score" in entry:
-            if score_threshold is None or entry["relevance_score"] >= score_threshold:
-                if live or extract:
-                    await self._cache_set(cache_kind, url, entry)
-                res_list.append(entry)
-            else:
-                self._log("debug", "URL {url} filtered out with score {score} < {threshold}",
-                          params={"url": url, "score": entry["relevance_score"], "threshold": score_threshold}, tag="URL_SEED")
-        else:
-            # No query or no scoring - add as usual
-            if live or extract:
-                await self._cache_set(cache_kind, url, entry)
-            res_list.append(entry)
+        # Add entry to results (scoring will be done later)
+        if live or extract:
+            await self._cache_set(cache_kind, url, entry)
+        res_list.append(entry)

     async def _head_ok(self, url: str, timeout: int) -> bool:
         try:
@@ -1436,8 +1403,19 @@ def _calculate_bm25_score(self, query: str, documents: List[str]) -> List[float]
             scores = bm25.get_scores(query_tokens)

             # Normalize scores to 0-1 range
-            max_score = max(scores) if max(scores) > 0 else 1.0
-            normalized_scores = [score / max_score for score in scores]
+            # BM25 can return negative scores, so we need to handle the full range
+            if len(scores) == 0:
+                return []
+
+            min_score = min(scores)
+            max_score = max(scores)
+
+            # If all scores are the same, return 0.5 for all
+            if max_score == min_score:
+                return [0.5] * len(scores)
+
+            # Normalize to 0-1 range using min-max normalization
+            normalized_scores = [(score - min_score) / (max_score - min_score) for score in scores]

             return normalized_scores
         except Exception as e:
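
This hunk is the BM25 scoring bug called out in the commit message. `bm25.get_scores` can return negative values (the Okapi IDF term goes negative for terms appearing in more than half the corpus), so dividing by `max(scores)` either leaves negative scores negative or, when no score is positive, divides by 1.0 and changes nothing. Either way the result can fall outside the 0 to 1 range that `score_threshold` assumes. A small numeric illustration with made-up scores:

```python
# Why max-only normalization (the old code) misbehaves on negative
# BM25 scores, and how min-max normalization fixes it.
# The score values below are made up for illustration.
scores = [-1.2, 0.0, 2.4]

# Old approach: divide by max when max > 0. Negative inputs stay
# negative, so a threshold such as 0.4 compares against values
# that were never mapped into [0, 1].
max_score = max(scores) if max(scores) > 0 else 1.0
print([s / max_score for s in scores])         # [-0.5, 0.0, 1.0]

# New approach: min-max normalization always lands in [0, 1].
lo, hi = min(scores), max(scores)
print([(s - lo) / (hi - lo) for s in scores])  # [0.0, 0.333..., 1.0]
```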

deploy/docker/README.md

Lines changed: 10 additions & 8 deletions
@@ -58,13 +58,15 @@ Pull and run images directly from Docker Hub without building locally.

 #### 1. Pull the Image

-Our latest release candidate is `0.6.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+
+> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.

 ```bash
-# Pull the release candidate (recommended for latest features)
-docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
+# Pull the release candidate (for testing new features)
+docker pull unclecode/crawl4ai:0.7.0-r1

-# Or pull the latest stable version
+# Or pull the current stable version (0.6.0)
 docker pull unclecode/crawl4ai:latest
 ```

@@ -99,7 +101,7 @@ EOL
   -p 11235:11235 \
   --name crawl4ai \
   --shm-size=1g \
-  unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
+  unclecode/crawl4ai:0.7.0-r1
 ```

 * **With LLM support:**
@@ -110,7 +112,7 @@ EOL
   --name crawl4ai \
   --env-file .llm.env \
   --shm-size=1g \
-  unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
+  unclecode/crawl4ai:0.7.0-r1
 ```

 > The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
@@ -124,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
 #### Docker Hub Versioning Explained

 * **Image Name:** `unclecode/crawl4ai`
-* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r1`)
+* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
 * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
 * `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
 * **`latest` Tag:** Points to the most recent stable version
@@ -160,7 +162,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach
 ```bash
 # Pulls and runs the release candidate from Docker Hub
 # Automatically selects the correct architecture
-IMAGE=unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number docker compose up -d
+IMAGE=unclecode/crawl4ai:0.7.0-r1 docker compose up -d
 ```

 * **Build and Run Locally:**