README.md (0 additions, 1 deletion)

@@ -185,7 +185,6 @@ MySpider().start()
 <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
 <a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a><a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
-<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
 <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
agent-skill/Scrapling-Skill/references/fetching/dynamic.md (1 addition, 1 deletion)

@@ -44,7 +44,7 @@ Instead of launching a browser locally (Chromium/Google Chrome), you can connect
 
 **Notes:**
 * There was a `stealth` option here, but it was moved to the `StealthyFetcher` class, as explained on the next page, with additional features since version 0.3.13.
-* This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](fetching/stealthy.md).
+* This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](stealthy.md).
 
 ## Full list of arguments
 All arguments for `DynamicFetcher` and its session classes:
agent-skill/Scrapling-Skill/references/spiders/architecture.md (5 additions, 5 deletions)

@@ -11,8 +11,8 @@ Here's what happens step by step when you run a spider:
 1. The **Spider** produces the first batch of `Request` objects. By default, it creates one request for each URL in `start_urls`, but you can override `start_requests()` for custom logic.
 2. The **Scheduler** receives requests and places them in a priority queue, and creates fingerprints for them. Higher-priority requests are dequeued first.
 3. The **Crawler Engine** asks the **Scheduler** to dequeue the next request, respecting concurrency limits (global and per-domain) and download delays. Once the **Crawler Engine** receives the request, it passes it to the **Session Manager**, which routes it to the correct session based on the request's `sid` (session ID).
-4. The **session** fetches the page and returns a [Response](fetching/choosing.md#response-object) object to the **Crawler Engine**. The engine records statistics and checks for blocked responses. If the response is blocked, the engine retries the request up to `max_blocked_retries` times. Of course, the blocking detection and the retry logic for blocked requests can be customized.
-5. The **Crawler Engine** passes the [Response](fetching/choosing.md#response-object) to the request's callback. The callback either yields a dictionary, which gets treated as a scraped item, or a follow-up request, which gets sent to the scheduler for queuing.
+4. The **session** fetches the page and returns a [Response](../fetching/choosing.md#response-object) object to the **Crawler Engine**. The engine records statistics and checks for blocked responses. If the response is blocked, the engine retries the request up to `max_blocked_retries` times. Of course, the blocking detection and the retry logic for blocked requests can be customized.
+5. The **Crawler Engine** passes the [Response](../fetching/choosing.md#response-object) to the request's callback. The callback either yields a dictionary, which gets treated as a scraped item, or a follow-up request, which gets sent to the scheduler for queuing.
 6. The cycle repeats from step 2 until the scheduler is empty and no tasks are active, or the spider is paused.
 7. If `crawldir` is set while starting the spider, the **Crawler Engine** periodically saves a checkpoint (pending requests + seen URLs set) to disk. On graceful shutdown (Ctrl+C), a final checkpoint is saved. The next time the spider runs with the same `crawldir`, it resumes from where it left off — skipping `start_requests()` and restoring the scheduler state.
 
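The dedup-and-priority behavior described in steps 2 and 3 of this hunk can be sketched with a toy scheduler. This is an illustration only; `ToyScheduler` and its method names are hypothetical and not Scrapling's actual `Scheduler` API.

```python
import hashlib
import heapq
from itertools import count

class ToyScheduler:
    """Toy priority queue with URL deduplication (not Scrapling's Scheduler)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # preserves FIFO order among equal priorities

    @staticmethod
    def fingerprint(url: str) -> str:
        # A fingerprint lets the scheduler recognize a URL it has queued before
        return hashlib.sha1(url.encode()).hexdigest()

    def enqueue(self, url: str, priority: int = 0) -> bool:
        fp = self.fingerprint(url)
        if fp in self._seen:
            return False  # duplicate request is dropped
        self._seen.add(fp)
        # heapq is a min-heap, so negate the priority to pop high-priority first
        heapq.heappush(self._heap, (-priority, next(self._tie), url))
        return True

    def dequeue(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

sched = ToyScheduler()
sched.enqueue("https://example.com/a", priority=1)
sched.enqueue("https://example.com/b", priority=5)
sched.enqueue("https://example.com/a", priority=9)  # duplicate, ignored
print(sched.dequeue())  # → https://example.com/b (higher priority first)
```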
@@ -50,9 +50,9 @@ A priority queue with built-in URL deduplication. Requests are fingerprinted bas
 
 Manages one or more named session instances. Each session is one of:
 
-- [FetcherSession](fetching/static.md)
-- [AsyncDynamicSession](fetching/dynamic.md)
-- [AsyncStealthySession](fetching/stealthy.md)
+- [FetcherSession](../fetching/static.md)
+- [AsyncDynamicSession](../fetching/dynamic.md)
+- [AsyncStealthySession](../fetching/stealthy.md)
 
 When a request comes in, the Session Manager routes it to the correct session based on the request's `sid` field. Sessions can be started with the spider start (default) or lazily (started on the first use).
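The `sid`-based routing this file describes can be sketched as follows. `ToySessionManager` and the stand-in session callables are hypothetical names for illustration, not Scrapling's Session Manager API.

```python
import asyncio

class ToySessionManager:
    """Routes each request to a named session by its `sid` field (sketch only)."""

    def __init__(self, sessions: dict, default: str):
        self._sessions = sessions
        self._default = default

    async def fetch(self, request: dict) -> str:
        # Fall back to the default session when no `sid` is given
        sid = request.get("sid") or self._default
        session = self._sessions[sid]
        return await session(request["url"])

async def fast_http(url: str) -> str:      # stand-in for an HTTP session
    return f"http:{url}"

async def browser(url: str) -> str:        # stand-in for a browser session
    return f"browser:{url}"

manager = ToySessionManager({"http": fast_http, "browser": browser}, default="http")
print(asyncio.run(manager.fetch({"url": "https://example.com", "sid": "browser"})))
# → browser:https://example.com
```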
agent-skill/Scrapling-Skill/references/spiders/getting-started.md (1 addition, 1 deletion)

@@ -25,7 +25,7 @@ Every spider needs three things:
 2. **`start_urls`** — A list of URLs to start crawling from.
 3. **`parse()`** — An async generator method that processes each response and yields results.
 
-The `parse()` method processes each response. You use the same selection methods you'd use with Scrapling's [Selector](parsing/main_classes.md#selector)/[Response](fetching/choosing.md#response-object), and `yield` dictionaries to output scraped items.
+The `parse()` method processes each response. You use the same selection methods you'd use with Scrapling's [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object), and `yield` dictionaries to output scraped items.
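The async-generator shape of `parse()` can be sketched with plain Python. The `parse` function and the fake response below are illustrative stand-ins, not Scrapling's `Response` object or spider classes.

```python
import asyncio

async def parse(response: dict):
    """Toy async generator in the shape described above: yield dicts for
    scraped items, and other objects to represent follow-up requests.
    (Illustration only; `response` is a plain dict here.)"""
    for title in response["titles"]:
        yield {"title": title}                    # scraped item
    if response.get("next_page"):
        yield ("follow", response["next_page"])   # follow-up request

async def main():
    fake_response = {"titles": ["a", "b"], "next_page": "https://example.com/2"}
    # A real engine would consume the generator; here we just collect the yields
    return [item async for item in parse(fake_response)]

print(asyncio.run(main()))
# → [{'title': 'a'}, {'title': 'b'}, ('follow', 'https://example.com/2')]
```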
agent-skill/Scrapling-Skill/references/spiders/sessions.md (4 additions, 4 deletions)

@@ -6,14 +6,14 @@ A spider can use multiple fetcher sessions simultaneously — for example, a fas
 
 A session is a pre-configured fetcher instance that stays alive for the duration of the crawl. Instead of creating a new connection or browser for every request, the spider reuses sessions, which is faster and more resource-efficient.
 
-By default, every spider creates a single [FetcherSession](fetching/static.md). You can add more sessions or swap the default by overriding the `configure_sessions()` method, but you have to use the async version of each session only, as the table shows below:
+By default, every spider creates a single [FetcherSession](../fetching/static.md). You can add more sessions or swap the default by overriding the `configure_sessions()` method, but you have to use the async version of each session only, as the table shows below:
docs/README_AR.md (0 additions, 1 deletion)

@@ -180,7 -180,6 @@ MySpider().start()
 <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
 <a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a><a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
-<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
 <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
docs/README_CN.md (0 additions, 1 deletion)

@@ -180,7 -180,6 @@ MySpider().start()
 <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
 <a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a><a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
-<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
 <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
docs/README_DE.md (0 additions, 1 deletion)

@@ -180,7 -180,6 @@ MySpider().start()
 <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
 <a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a><a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
-<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
 <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
docs/README_ES.md (0 additions, 1 deletion)

@@ -180,7 -180,6 @@ MySpider().start()
 <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
 <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
 <a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a><a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
-<a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
 <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>