
Commit 9abdcb1

drobnikj, TC-MO, and fnesveda authored
feat(request-queue): Extend rq documentation with rqv2 features (#904)
This is part of announcing request queue V2. The features described here were already implemented on the backend but not promoted yet. I will create one more PR adding a new tutorial under the [Advanced web scraping academy](https://docs.apify.com/academy/advanced-web-scraping) on how to scrape one request queue from multiple Actor runs.

---------

Co-authored-by: Michał Olender <[email protected]>
Co-authored-by: František Nesveda <[email protected]>
1 parent 347acb6 commit 9abdcb1

File tree

1 file changed: +188 -0 lines changed


sources/platform/storage/request_queue.md

Lines changed: 188 additions & 0 deletions
@@ -274,6 +274,194 @@ Example payload:
For further details and a breakdown of each storage API endpoint, refer to the [API documentation](/api/v2#/reference/request-queues).

## Features {#features}

Request queue is a storage type built with scraping in mind, enabling developers to write scraping logic efficiently and at scale.
The Apify tooling, including [Crawlee](https://crawlee.dev/), [Apify SDK for JavaScript](https://docs.apify.com/sdk/js/), and [Apify SDK for Python](https://docs.apify.com/sdk/python/), incorporates all these features, enabling users to leverage them effortlessly without extra configuration.

In the following sections, we will discuss each of the main features in depth.

### Persistence and retention

Request queues prioritize persistence: requests in named request queues are retained indefinitely, while requests in unnamed request queues are retained for the data retention period of your subscription.
This capability facilitates incremental crawling, where you can append new URLs to the queue and resume from where you stopped in subsequent Actor runs.
Consider the scenario of scraping an e-commerce website with thousands of products: incremental scraping allows you to scrape only the products added since the last product discovery.

In the following code example, we demonstrate how to use the Apify SDK and Crawlee to create an incremental crawler that saves the title of each newly found page in the Apify Docs to a dataset.
By running this Actor multiple times, you can incrementally crawl the source website and save only pages added since the last crawl, as reusing a single request queue ensures that only URLs not yet visited are processed.

```ts
// Basic example of incremental crawling with Crawlee.
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

interface Input {
    startUrls: string[];
    persistRequestQueueName: string;
}

await Actor.init();

// Structure of input is defined in input_schema.json
const {
    startUrls = ['https://docs.apify.com/'],
    persistRequestQueueName = 'persist-request-queue',
} = await Actor.getInput<Input>() ?? {} as Input;

// Open or create the request queue used for the incremental scrape.
// By opening the same request queue, the crawler continues where it left off and skips URLs it has already visited.
const requestQueue = await Actor.openRequestQueue(persistRequestQueueName);

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue, // Pass the incremental request queue to the crawler.
    requestHandler: async ({ enqueueLinks, request, $, log }) => {
        log.info('enqueueing new URLs');
        await enqueueLinks();

        // Extract the title from the page.
        const title = $('title').text();
        log.info(`New page with title: ${title}`, { url: request.loadedUrl });

        // Save the URL and title of the loaded page to the output dataset.
        await Dataset.pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(startUrls);

await Actor.exit();
```

### Batch operations {#batch-operations}

Request queues support batch operations on requests to enqueue or retrieve multiple requests in bulk, to cut down on network latency and enable easier parallel processing of requests.
You can find the batch operations in the [Apify API](https://docs.apify.com/api/v2#/reference/request-queues/batch-request-operations), as well as in the Apify API client for [JavaScript](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#batchAddRequests) and [Python](https://docs.apify.com/api/client/python/reference/class/RequestQueueClient#batch_add_requests).

<Tabs groupId="main">
<TabItem value="JavaScript" label="JavaScript">

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'MY-APIFY-TOKEN',
});

const requestQueueClient = client.requestQueue('my-queue-id');

// Add multiple requests to the queue
await requestQueueClient.batchAddRequests([
    { url: 'http://example.com/foo', uniqueKey: 'http://example.com/foo', method: 'GET' },
    { url: 'http://example.com/bar', uniqueKey: 'http://example.com/bar', method: 'GET' },
]);

// Remove multiple requests from the queue
await requestQueueClient.batchDeleteRequests([
    { uniqueKey: 'http://example.com/foo' },
    { uniqueKey: 'http://example.com/bar' },
]);
```

</TabItem>
<TabItem value="Python" label="Python">

```python
from apify_client import ApifyClient

apify_client = ApifyClient('MY-APIFY-TOKEN')

request_queue_client = apify_client.request_queue('my-queue-id')

# Add multiple requests to the queue
request_queue_client.batch_add_requests([
    {'url': 'http://example.com/foo', 'uniqueKey': 'http://example.com/foo', 'method': 'GET'},
    {'url': 'http://example.com/bar', 'uniqueKey': 'http://example.com/bar', 'method': 'GET'},
])

# Remove multiple requests from the queue
request_queue_client.batch_delete_requests([
    {'uniqueKey': 'http://example.com/foo'},
    {'uniqueKey': 'http://example.com/bar'},
])
```

</TabItem>
</Tabs>

### Distributivity {#distributivity}

Request queue includes a locking mechanism to avoid concurrent processing of one request by multiple clients (for example, Actor runs).
You can lock a request so that no other clients receive it when they fetch the queue head, with an expiration period on the lock so that requests which fail processing are eventually unlocked and retried.

This feature is seamlessly integrated into Crawlee, requiring minimal extra setup. By default, requests are locked for the same duration as the timeout for processing requests in the crawler ([`requestHandlerTimeoutSecs`](https://crawlee.dev/api/next/basic-crawler/interface/BasicCrawlerOptions#requestHandlerTimeoutSecs)).
If the Actor processing the request fails, the lock expires, and the request is eventually processed again. For more details, refer to the [Crawlee documentation](https://crawlee.dev/docs/next/experiments/experiments-request-locking).

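As an illustration, here is a minimal sketch, assuming a Crawlee `CheerioCrawler`, of how the request handler timeout is configured; per the behavior described above, it also determines how long a fetched request stays locked before other clients can pick it up again.

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Illustrative value: requests fetched by this crawler stay locked for roughly this long;
    // if the handler fails or times out, the lock expires and the request can be retried.
    requestHandlerTimeoutSecs: 60,
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://docs.apify.com/']);
```
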
In the following example, we demonstrate how to use the locking mechanism to avoid concurrent processing of the same request.

```js
import { Actor, ApifyClient } from 'apify';

await Actor.init();

const client = new ApifyClient({
    token: 'MY-APIFY-TOKEN',
});

// Creates a new request queue.
const requestQueue = await client.requestQueues().getOrCreate('example-queue');

// Creates two clients with different keys for the same request queue.
const requestQueueClientOne = client.requestQueue(requestQueue.id, { clientKey: 'requestqueueone' });
const requestQueueClientTwo = client.requestQueue(requestQueue.id, { clientKey: 'requestqueuetwo' });

// Adds multiple requests to the queue.
await requestQueueClientOne.batchAddRequests([
    { url: 'http://example.com/foo', uniqueKey: 'http://example.com/foo', method: 'GET' },
    { url: 'http://example.com/bar', uniqueKey: 'http://example.com/bar', method: 'GET' },
    { url: 'http://example.com/baz', uniqueKey: 'http://example.com/baz', method: 'GET' },
    { url: 'http://example.com/qux', uniqueKey: 'http://example.com/qux', method: 'GET' },
]);

// Locks the first two requests at the head of the queue.
const processingRequestsClientOne = await requestQueueClientOne.listAndLockHead({
    limit: 2,
    lockSecs: 60,
});

// Other clients cannot list and lock these requests; the listAndLockHead call returns other requests from the queue.
const processingRequestsClientTwo = await requestQueueClientTwo.listAndLockHead({
    limit: 2,
    lockSecs: 60,
});

// Checks when the lock will expire. The locked request will have a lockExpiresAt attribute.
const theFirstRequestLockedByClientOne = processingRequestsClientOne.items[0];
const requestLockedByClientOne = await requestQueueClientOne.getRequest(theFirstRequestLockedByClientOne.id);
console.log(`Request locked until ${requestLockedByClientOne?.lockExpiresAt}`);

// Other clients cannot modify the lock; attempting to do so will throw an error.
try {
    await requestQueueClientTwo.prolongRequestLock(theFirstRequestLockedByClientOne.id, { lockSecs: 60 });
} catch (err) {
    // The lock is held by the first client, so this call throws.
}

// The client holding the lock can prolong it or delete it to unlock the request.
await requestQueueClientOne.prolongRequestLock(theFirstRequestLockedByClientOne.id, { lockSecs: 60 });
await requestQueueClientOne.deleteRequestLock(theFirstRequestLockedByClientOne.id);

// Cleans up the queue.
await requestQueueClientOne.delete();

await Actor.exit();
```


A detailed tutorial on how to crawl one request queue from multiple Actor runs can be found in the [Advanced web scraping academy](https://docs.apify.com/academy/advanced-web-scraping/multiple-runs-scrape).

## Sharing {#sharing}
You can grant [access rights](../collaboration/index.md) to your request queue through the **Share** button under the **Actions** menu. For more details check the [full list of permissions](../collaboration/list_of_permissions.md).
