feat(request-queue): Extend rq documentation with rqv2 features (#904)
This is part of announcing request queue V2. The features described here were already implemented on the backend but not promoted yet.
I will create one more PR adding a new tutorial under the [Advanced web scraping academy](https://docs.apify.com/academy/advanced-web-scraping) on how to scrape one request queue from multiple Actor runs.
---------
Co-authored-by: Michał Olender <[email protected]>
Co-authored-by: František Nesveda <[email protected]>
File changed: `sources/platform/storage/request_queue.md` (+188 lines)
For further details and a breakdown of each storage API endpoint, refer to the [API documentation](/api/v2#/reference/key-value-stores).
## Features {#features}
Request queue is a storage type built with scraping in mind, enabling developers to write scraping logic efficiently and scalably.
The Apify tooling, including [Crawlee](https://crawlee.dev/), [Apify SDK for JavaScript](https://docs.apify.com/sdk/js/), and [Apify SDK for Python](https://docs.apify.com/sdk/python/), incorporates all these features, enabling users to leverage them effortlessly without extra configuration.

In the following section, we will discuss each of the main features in depth.
### Persistence and retention
Request queues prioritize persistence: requests in named request queues are retained indefinitely, while requests in unnamed request queues are kept for the data retention period of your subscription.
This capability facilitates incremental crawling, where you can append new URLs to the queue and resume from where you stopped in subsequent Actor runs.
Consider the scenario of scraping an e-commerce website with thousands of products. Incremental scraping allows you to scrape only the products added since the last product discovery.
In the following code example, we demonstrate how to use the Apify SDK and Crawlee to create an incremental crawler that saves the title of each newly found page in the Apify documentation to a dataset.
By running this Actor multiple times, you can incrementally crawl the source website and save only pages added since the last crawl, as reusing a single request queue ensures that only URLs not yet visited are processed.
```ts
// Basic example of incremental crawling with Crawlee.
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

interface Input {
    startUrls: string[];
    persistRequestQueueName: string;
}

await Actor.init();

// Structure of input is defined in input_schema.json
const {
    startUrls = ['https://docs.apify.com/'],
    persistRequestQueueName = 'persist-request-queue',
} = await Actor.getInput<Input>() ?? {} as Input;

// Open or create the request queue for the incremental scrape.
// By opening the same named request queue, the crawler continues where it left off and skips already visited URLs.
const requestQueue = await Actor.openRequestQueue(persistRequestQueueName);

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();

        // Enqueue links found on the page; URLs already handled in previous runs are skipped.
        await enqueueLinks();

        log.info(`New page with ${title}`, { url: request.loadedUrl });

        // Save the URL and title of the loaded page to the output dataset.
        await Dataset.pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(startUrls);

await Actor.exit();
```
### Batch operations {#batch-operations}
Request queues support batch operations on requests to enqueue or retrieve multiple requests in bulk, cutting down on network latency and enabling easier parallel processing of requests.
You can find the batch operations in the [Apify API](https://docs.apify.com/api/v2#/reference/request-queues/batch-request-operations), as well as in the Apify API client for [JavaScript](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#batchAddRequests) and [Python](https://docs.apify.com/api/client/python/reference/class/RequestQueueClient#batch_add_requests).
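As an illustration, a minimal sketch of batch enqueueing with the Apify API client for JavaScript might look as follows (the queue name and URLs are placeholders):

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Open (or create) an illustrative named queue.
const { id: queueId } = await client.requestQueues().getOrCreate('example-queue');

// Enqueue several requests in a single API call instead of one call per request.
await client.requestQueue(queueId).batchAddRequests([
    { url: 'https://example.com/page-1', uniqueKey: 'https://example.com/page-1' },
    { url: 'https://example.com/page-2', uniqueKey: 'https://example.com/page-2' },
]);
```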
### Distributed locking {#distributed-locking}

Request queue includes a locking mechanism to avoid concurrent processing of one request by multiple clients (for example, Actor runs).
You can lock a request so that no other clients receive it when they fetch the queue head, with an expiration period on the lock so that requests which fail processing are eventually unlocked and retried.
This feature is seamlessly integrated into Crawlee, requiring minimal extra setup. By default, requests are locked for the same duration as the timeout for processing requests in the crawler ([`requestHandlerTimeoutSecs`](https://crawlee.dev/api/next/basic-crawler/interface/BasicCrawlerOptions#requestHandlerTimeoutSecs)).
If the Actor processing the request fails, the lock expires, and the request is processed again eventually. For more details, refer to the [Crawlee documentation](https://crawlee.dev/docs/next/experiments/experiments-request-locking).
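As a minimal sketch, opting into the experiment through the crawler options might look like this (the timeout value is illustrative):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Requests stay locked for the duration of the request handler timeout (illustrative value).
    requestHandlerTimeoutSecs: 60,
    // Opt in to the experimental request locking behavior.
    experiments: {
        requestLocking: true,
    },
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```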
In the following example, we demonstrate how to use the locking mechanism to avoid concurrent processing of the same request.
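A condensed sketch of that flow, using the Apify API client for JavaScript, could look like this (the queue name, client key, and lock durations are illustrative):

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const { id: queueId } = await client.requestQueues().getOrCreate('example-locking-queue');

// Each client (e.g. each Actor run) identifies itself with a unique clientKey.
const queue = client.requestQueue(queueId, { clientKey: 'consumer-1' });

// Fetch the head of the queue and lock the returned requests for 60 seconds.
// While locked, other clients fetching the queue head will not receive them.
const { items } = await queue.listAndLockHead({ limit: 10, lockSecs: 60 });

for (const request of items) {
    // If processing takes longer than expected, the lock can be prolonged...
    await queue.prolongRequestLock(request.id, { lockSecs: 60 });

    // ...or the lock can be deleted, making the request available to other clients again.
    await queue.deleteRequestLock(request.id);
}
```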
A detailed tutorial on how to crawl one request queue from multiple Actor runs can be found in the [Advanced web scraping academy](https://docs.apify.com/academy/advanced-web-scraping/multiple-runs-scrape).
## Sharing {#sharing}
You can grant [access rights](../collaboration/index.md) to your request queue through the **Share** button under the **Actions** menu. For more details, check the [full list of permissions](../collaboration/list_of_permissions.md).