
Commit 0c8bb74

Release v0.7.0-r1: The Adaptive Intelligence Update
- Bump version to 0.7.0
- Add release notes and demo files
- Update README with v0.7.0 features
- Update Docker configurations for v0.7.0-r1
- Move v0.7.0 demo files to releases_review
- Fix BM25 scoring bug in URLSeeder

Major features:
- Adaptive Crawling with pattern learning
- Virtual Scroll support for infinite pages
- Link Preview with 3-layer scoring
- Async URL Seeder for massive discovery
- Performance optimizations
1 parent ba2ed53 commit 0c8bb74

11 files changed: +1307 -89 lines


Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 FROM python:3.12-slim-bookworm AS build

 # C4ai version
-ARG C4AI_VER=0.6.0
+ARG C4AI_VER=0.7.0-r1
 ENV C4AI_VERSION=$C4AI_VER
 LABEL c4ai.version=$C4AI_VER

README.md

Lines changed: 67 additions & 6 deletions
@@ -26,9 +26,9 @@

 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

-[✨ Check out latest update v0.6.0](#-recent-updates)
+[✨ Check out latest update v0.7.0](#-recent-updates)

-🎉 **Version 0.6.0 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
+🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://docs.crawl4ai.com/blog/release-v0.7.0)

 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -274,8 +274,8 @@ The new Docker implementation includes:

 ```bash
 # Pull and run the latest release candidate
-docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
-docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
+docker pull unclecode/crawl4ai:0.7.0
+docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0

 # Visit the playground at http://localhost:11235/playground
 ```
@@ -518,7 +518,69 @@ async def test_news_crawl():

 ## ✨ Recent Updates

-### Version 0.6.0 Release Highlights
+### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
+
+- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
+  ```python
+  config = AdaptiveConfig(
+      confidence_threshold=0.7,
+      max_history=100,
+      learning_rate=0.2
+  )
+
+  result = await crawler.arun(
+      "https://news.example.com",
+      config=CrawlerRunConfig(adaptive_config=config)
+  )
+  # Crawler learns patterns and improves extraction over time
+  ```
+
+- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
+  ```python
+  scroll_config = VirtualScrollConfig(
+      container_selector="[data-testid='feed']",
+      scroll_count=20,
+      scroll_by="container_height",
+      wait_after_scroll=1.0
+  )
+
+  result = await crawler.arun(url, config=CrawlerRunConfig(
+      virtual_scroll_config=scroll_config
+  ))
+  ```
+
+- **🔗 Intelligent Link Analysis**: 3-layer scoring system for smart link prioritization:
+  ```python
+  link_config = LinkPreviewConfig(
+      query="machine learning tutorials",
+      score_threshold=0.3,
+      concurrent_requests=10
+  )
+
+  result = await crawler.arun(url, config=CrawlerRunConfig(
+      link_preview_config=link_config,
+      score_links=True
+  ))
+  # Links ranked by relevance and quality
+  ```
+
+- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
+  ```python
+  seeder = AsyncUrlSeeder(SeedingConfig(
+      source="sitemap+cc",
+      pattern="*/blog/*",
+      query="python tutorials",
+      score_threshold=0.4
+  ))
+
+  urls = await seeder.discover("https://example.com")
+  ```
+
+- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
+
+Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+
+### Previous Version: 0.6.0 Release Highlights

 - **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
 ```python
@@ -588,7 +650,6 @@ async def test_news_crawl():

 - **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements

-Read the full details in our [0.6.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

 ### Previous Version: 0.5.0 Major Release Highlights

crawl4ai/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # crawl4ai/__version__.py

 # This is the version that will be used for stable releases
-__version__ = "0.6.3"
+__version__ = "0.7.0"

 # For nightly builds, this gets set during build process
 __nightly_version__ = None

crawl4ai/async_configs.py

Lines changed: 50 additions & 15 deletions
@@ -1659,22 +1659,57 @@ class SeedingConfig:
     """
     def __init__(
         self,
-        source: str = "sitemap+cc",  # Options: "sitemap", "cc", "sitemap+cc"
-        pattern: Optional[str] = "*",  # URL pattern to filter discovered URLs (e.g., "*example.com/blog/*")
-        live_check: bool = False,  # Whether to perform HEAD requests to verify URL liveness
-        extract_head: bool = False,  # Whether to fetch and parse <head> section for metadata
-        max_urls: int = -1,  # Maximum number of URLs to discover (default: -1 for no limit)
-        concurrency: int = 1000,  # Maximum concurrent requests for live checks/head extraction
-        hits_per_sec: int = 5,  # Rate limit in requests per second
-        force: bool = False,  # If True, bypasses the AsyncUrlSeeder's internal .jsonl cache
-        base_directory: Optional[str] = None,  # Base directory for UrlSeeder's cache files (.jsonl)
-        llm_config: Optional[LLMConfig] = None,  # Forward LLM config for future use (e.g., relevance scoring)
-        verbose: Optional[bool] = None,  # Override crawler's general verbose setting
-        query: Optional[str] = None,  # Search query for relevance scoring
-        score_threshold: Optional[float] = None,  # Minimum relevance score to include URL (0.0-1.0)
-        scoring_method: str = "bm25",  # Scoring method: "bm25" (default), future: "semantic"
-        filter_nonsense_urls: bool = True,  # Filter out utility URLs like robots.txt, sitemap.xml, etc.
+        source: str = "sitemap+cc",
+        pattern: Optional[str] = "*",
+        live_check: bool = False,
+        extract_head: bool = False,
+        max_urls: int = -1,
+        concurrency: int = 1000,
+        hits_per_sec: int = 5,
+        force: bool = False,
+        base_directory: Optional[str] = None,
+        llm_config: Optional[LLMConfig] = None,
+        verbose: Optional[bool] = None,
+        query: Optional[str] = None,
+        score_threshold: Optional[float] = None,
+        scoring_method: str = "bm25",
+        filter_nonsense_urls: bool = True,
     ):
+        """
+        Initialize URL seeding configuration.
+
+        Args:
+            source: Discovery source(s) to use. Options: "sitemap", "cc" (Common Crawl),
+                or "sitemap+cc" (both). Default: "sitemap+cc"
+            pattern: URL pattern to filter discovered URLs (e.g., "*example.com/blog/*").
+                Supports glob-style wildcards. Default: "*" (all URLs)
+            live_check: Whether to perform HEAD requests to verify URL liveness.
+                Default: False
+            extract_head: Whether to fetch and parse <head> section for metadata extraction.
+                Required for BM25 relevance scoring. Default: False
+            max_urls: Maximum number of URLs to discover. Use -1 for no limit.
+                Default: -1
+            concurrency: Maximum concurrent requests for live checks/head extraction.
+                Default: 1000
+            hits_per_sec: Rate limit in requests per second to avoid overwhelming servers.
+                Default: 5
+            force: If True, bypasses the AsyncUrlSeeder's internal .jsonl cache and
+                re-fetches URLs. Default: False
+            base_directory: Base directory for UrlSeeder's cache files (.jsonl).
+                If None, uses default ~/.crawl4ai/. Default: None
+            llm_config: LLM configuration for future features (e.g., semantic scoring).
+                Currently unused. Default: None
+            verbose: Override crawler's general verbose setting for seeding operations.
+                Default: None (inherits from crawler)
+            query: Search query for BM25 relevance scoring (e.g., "python tutorials").
+                Requires extract_head=True. Default: None
+            score_threshold: Minimum relevance score (0.0-1.0) to include URL.
+                Only applies when query is provided. Default: None
+            scoring_method: Scoring algorithm to use. Currently only "bm25" is supported.
+                Future: "semantic". Default: "bm25"
+            filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
+                ads.txt, favicon.ico, etc. Default: True
+        """
         self.source = source
         self.pattern = pattern
         self.live_check = live_check
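
The docstring above defines the contract; to show how the pieces fit together, here is a minimal usage sketch assembled from the README example earlier in this commit. Treat it as illustrative rather than canonical: the domain, query, and numeric values are placeholders, and the top-level import path is assumed.

```python
# Minimal sketch of SeedingConfig in use, mirroring the README example.
# The domain, query, and thresholds below are illustrative placeholders.
import asyncio

from crawl4ai import AsyncUrlSeeder, SeedingConfig  # assumed import path


async def main() -> None:
    config = SeedingConfig(
        source="sitemap+cc",        # discover via sitemaps and Common Crawl
        pattern="*/blog/*",         # keep only blog-style URLs
        extract_head=True,          # required for BM25 relevance scoring
        query="python tutorials",   # relevance query
        score_threshold=0.4,        # drop URLs scoring below 0.4
        max_urls=500,               # cap discovery; -1 means no limit
    )
    seeder = AsyncUrlSeeder(config)
    urls = await seeder.discover("https://example.com")
    for entry in urls[:10]:
        print(entry["url"], entry.get("relevance_score"))


asyncio.run(main())
```

Note that `query` only takes effect with `extract_head=True`, since relevance is computed from `<head>` metadata; the seeder diff below shows that scoring now happens in a single collective pass after discovery.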

crawl4ai/async_url_seeder.py

Lines changed: 31 additions & 53 deletions
@@ -424,10 +424,21 @@ async def worker(res_list: List[Dict[str, Any]]):
         self._log("info", "Finished URL seeding for {domain}. Total URLs: {count}",
                   params={"domain": domain, "count": len(results)}, tag="URL_SEED")

-        # Sort by relevance score if query was provided
+        # Apply BM25 scoring if query was provided
         if query and extract_head and scoring_method == "bm25":
-            results.sort(key=lambda x: x.get(
-                "relevance_score", 0.0), reverse=True)
+            # Apply collective BM25 scoring across all documents
+            results = await self._apply_bm25_scoring(results, config)
+
+            # Filter by score threshold if specified
+            if score_threshold is not None:
+                original_count = len(results)
+                results = [r for r in results if r.get("relevance_score", 0) >= score_threshold]
+                if original_count > len(results):
+                    self._log("info", "Filtered {filtered} URLs below score threshold {threshold}",
+                              params={"filtered": original_count - len(results), "threshold": score_threshold}, tag="URL_SEED")
+
+            # Sort by relevance score
+            results.sort(key=lambda x: x.get("relevance_score", 0.0), reverse=True)
             self._log("info", "Sorted {count} URLs by relevance score for query: '{query}'",
                       params={"count": len(results), "query": query}, tag="URL_SEED")
         elif query and not extract_head:
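
The significant change in this hunk is architectural: instead of scoring each URL in isolation inside `_validate`, the seeder now collects all entries first and hands them to `_apply_bm25_scoring` in one pass, so BM25's corpus statistics (IDF, average document length) are computed over the whole discovered set. The body of `_apply_bm25_scoring` is not part of this diff; the sketch below outlines what such a collective pass looks like, assuming `rank_bm25` (suggested by the `bm25.get_scores` call further down) and using illustrative names throughout.

```python
# Sketch of a collective BM25 pass over seeded URL entries.
# rank_bm25 is assumed from the bm25.get_scores call in
# _calculate_bm25_score; field and function names are illustrative.
from rank_bm25 import BM25Okapi


def score_entries(entries: list[dict], query: str, score_threshold: float | None) -> list[dict]:
    if not entries:
        return entries

    # One corpus for all documents, so IDF and average document length
    # reflect the whole crawl instead of a single page at a time.
    corpus = [entry.get("text", "").lower().split() for entry in entries]
    scores = BM25Okapi(corpus).get_scores(query.lower().split())

    # Min-max normalize into [0, 1]; raw BM25 scores can be negative.
    lo, hi = min(scores), max(scores)
    for entry, raw in zip(entries, scores):
        entry["relevance_score"] = 0.5 if hi == lo else float((raw - lo) / (hi - lo))

    # Filter below-threshold entries, then sort best-first,
    # mirroring the control flow in the hunk above.
    if score_threshold is not None:
        entries = [e for e in entries if e["relevance_score"] >= score_threshold]
    entries.sort(key=lambda e: e["relevance_score"], reverse=True)
    return entries
```
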
@@ -982,28 +993,6 @@ async def _validate(self, url: str, res_list: List[Dict[str, Any]], live: bool,
                 "head_data": head_data,
             }

-            # Apply BM25 scoring if query is provided and head data exists
-            if query and ok and scoring_method == "bm25" and head_data:
-                text_context = self._extract_text_context(head_data)
-                if text_context:
-                    # Calculate BM25 score for this single document
-                    # scores = self._calculate_bm25_score(query, [text_context])
-                    scores = await asyncio.to_thread(self._calculate_bm25_score, query, [text_context])
-                    relevance_score = scores[0] if scores else 0.0
-                    entry["relevance_score"] = float(relevance_score)
-                else:
-                    # No text context, use URL-based scoring as fallback
-                    relevance_score = self._calculate_url_relevance_score(
-                        query, entry["url"])
-                    entry["relevance_score"] = float(relevance_score)
-            elif query:
-                # Query provided but no head data - we reject this entry
-                self._log("debug", "No head data for {url}, using URL-based scoring",
-                          params={"url": url}, tag="URL_SEED")
-                return
-                # relevance_score = self._calculate_url_relevance_score(query, entry["url"])
-                # entry["relevance_score"] = float(relevance_score)
-
         elif live:
             self._log("debug", "Performing live check for {url}", params={
                 "url": url}, tag="URL_SEED")
@@ -1013,35 +1002,13 @@ async def _validate(self, url: str, res_list: List[Dict[str, Any]], live: bool,
                       params={"status": status.upper(), "url": url}, tag="URL_SEED")
             entry = {"url": url, "status": status, "head_data": {}}

-            # Apply URL-based scoring if query is provided
-            if query:
-                relevance_score = self._calculate_url_relevance_score(
-                    query, url)
-                entry["relevance_score"] = float(relevance_score)
-
         else:
             entry = {"url": url, "status": "unknown", "head_data": {}}

-            # Apply URL-based scoring if query is provided
-            if query:
-                relevance_score = self._calculate_url_relevance_score(
-                    query, url)
-                entry["relevance_score"] = float(relevance_score)
-
-        # Now decide whether to add the entry based on score threshold
-        if query and "relevance_score" in entry:
-            if score_threshold is None or entry["relevance_score"] >= score_threshold:
-                if live or extract:
-                    await self._cache_set(cache_kind, url, entry)
-                res_list.append(entry)
-            else:
-                self._log("debug", "URL {url} filtered out with score {score} < {threshold}",
-                          params={"url": url, "score": entry["relevance_score"], "threshold": score_threshold}, tag="URL_SEED")
-        else:
-            # No query or no scoring - add as usual
-            if live or extract:
-                await self._cache_set(cache_kind, url, entry)
-            res_list.append(entry)
+        # Add entry to results (scoring will be done later)
+        if live or extract:
+            await self._cache_set(cache_kind, url, entry)
+        res_list.append(entry)

     async def _head_ok(self, url: str, timeout: int) -> bool:
         try:
@@ -1436,8 +1403,19 @@ def _calculate_bm25_score(self, query: str, documents: List[str]) -> List[float]
             scores = bm25.get_scores(query_tokens)

             # Normalize scores to 0-1 range
-            max_score = max(scores) if max(scores) > 0 else 1.0
-            normalized_scores = [score / max_score for score in scores]
+            # BM25 can return negative scores, so we need to handle the full range
+            if len(scores) == 0:
+                return []
+
+            min_score = min(scores)
+            max_score = max(scores)
+
+            # If all scores are the same, return 0.5 for all
+            if max_score == min_score:
+                return [0.5] * len(scores)
+
+            # Normalize to 0-1 range using min-max normalization
+            normalized_scores = [(score - min_score) / (max_score - min_score) for score in scores]

             return normalized_scores
         except Exception as e:
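
This hunk is the BM25 scoring bug called out in the commit message. `bm25.get_scores` can return negative values (the Okapi IDF term goes negative for terms appearing in more than half the corpus), so dividing by `max(scores)` either leaves negative scores negative or, when no score is positive, divides by 1.0 and changes nothing. Either way the result can fall outside the 0 to 1 range that `score_threshold` assumes. A small numeric illustration with made-up scores:

```python
# Why max-only normalization (the old code) misbehaves on negative
# BM25 scores, and how min-max normalization fixes it.
# The score values below are made up for illustration.
scores = [-1.2, 0.0, 2.4]

# Old approach: divide by max when max > 0. Negative inputs stay
# negative, so a threshold such as 0.4 compares against values
# that were never mapped into [0, 1].
max_score = max(scores) if max(scores) > 0 else 1.0
print([s / max_score for s in scores])         # [-0.5, 0.0, 1.0]

# New approach: min-max normalization always lands in [0, 1].
lo, hi = min(scores), max(scores)
print([(s - lo) / (hi - lo) for s in scores])  # [0.0, 0.333..., 1.0]
```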

deploy/docker/README.md

Lines changed: 10 additions & 8 deletions
@@ -58,13 +58,15 @@ Pull and run images directly from Docker Hub without building locally.

 #### 1. Pull the Image

-Our latest release candidate is `0.6.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+
+> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.

 ```bash
-# Pull the release candidate (recommended for latest features)
-docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
+# Pull the release candidate (for testing new features)
+docker pull unclecode/crawl4ai:0.7.0-r1

-# Or pull the latest stable version
+# Or pull the current stable version (0.6.0)
 docker pull unclecode/crawl4ai:latest
 ```

@@ -99,7 +101,7 @@ EOL
   -p 11235:11235 \
   --name crawl4ai \
   --shm-size=1g \
-  unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
+  unclecode/crawl4ai:0.7.0-r1
 ```

 * **With LLM support:**
@@ -110,7 +112,7 @@ EOL
   --name crawl4ai \
   --env-file .llm.env \
   --shm-size=1g \
-  unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
+  unclecode/crawl4ai:0.7.0-r1
 ```

 > The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
@@ -124,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
 #### Docker Hub Versioning Explained

 * **Image Name:** `unclecode/crawl4ai`
-* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r1`)
+* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
 * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
 * `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
 * **`latest` Tag:** Points to the most recent stable version
@@ -160,7 +162,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach
 ```bash
 # Pulls and runs the release candidate from Docker Hub
 # Automatically selects the correct architecture
-IMAGE=unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number docker compose up -d
+IMAGE=unclecode/crawl4ai:0.7.0-r1 docker compose up -d
 ```

 * **Build and Run Locally:**