Commit 9c046e9

B4nan and claude authored

docs: add Crawlee v3.16 release blog post (apify#3390)

## Summary

- Adds blog post for the Crawlee v3.16 release at `website/blog/2026/02-06/index.md`
- Covers StagehandCrawler (AI-powered browser automation), async iterators for Dataset & KeyValueStore, `discoverValidSitemaps` utility, and improved Cloudflare challenge handling
- Adds `B4nan` author entry to `authors.yml`

## Test plan

- [x] Verify blog post renders correctly on the website
- [ ] Check all internal doc links resolve
- [ ] Check anchor links in the intro work from both blog listing and post page

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 0ddf67f commit 9c046e9

File tree

5 files changed: +184 -10 lines changed

docs/guides/stagehand_crawler.mdx

Lines changed: 5 additions & 5 deletions
````diff
@@ -31,16 +31,16 @@ StagehandCrawler extends <ApiLink to="browser-crawler/class/BrowserCrawler">`Bro
 
 ```
 ┌─────────────────────────────────────────────────────────┐
-│ StagehandCrawler
+│ StagehandCrawler                                        │
 ├─────────────────────────────────────────────────────────┤
-│ BrowserPool (manages browser lifecycle & concurrency)
+│ BrowserPool (manages browser lifecycle & concurrency)   │
 ├─────────────────────────────────────────────────────────┤
-│ Stagehand Instance
+│ Stagehand Instance                                      │
 │ ├── Launches Chromium browser                           │
 │ ├── Provides CDP endpoint                               │
 │ └── Handles AI operations (act/extract/observe)         │
 ├─────────────────────────────────────────────────────────┤
-│ Playwright (connected via CDP)
+│ Playwright (connected via CDP)                          │
 │ └── Standard page operations (goto, click, type, etc.)  │
 └─────────────────────────────────────────────────────────┘
 ```
@@ -121,7 +121,7 @@ StagehandCrawler requires an API key for the AI model provider. The recommended
 const crawler = new StagehandCrawler({
     stagehandOptions: {
         model: 'openai/gpt-4.1-mini',
-        apiKey: 'sk-...', // Your OpenAI API key
+        apiKey: 'your-api-key', // Your OpenAI API key
     },
 });
 ```
````

packages/stagehand-crawler/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@ The `apiKey` option is interpreted based on the `env` setting:
 const crawler = new StagehandCrawler({
     stagehandOptions: {
         model: 'openai/gpt-4.1-mini',
-        apiKey: 'sk-...', // LLM API key for LOCAL env
+        apiKey: 'your-api-key', // LLM API key for LOCAL env
     },
     // ...
 });
```

packages/stagehand-crawler/src/internals/stagehand-crawler.ts

Lines changed: 2 additions & 2 deletions
```diff
@@ -60,13 +60,13 @@ export interface StagehandOptions {
  * stagehandOptions: {
  *   env: 'LOCAL',
  *   model: 'openai/gpt-4.1-mini',
- *   apiKey: 'sk-...',
+ *   apiKey: 'your-api-key',
  * }
  *
  * // Browserbase cloud
  * stagehandOptions: {
  *   env: 'BROWSERBASE',
- *   apiKey: 'bb-...',
+ *   apiKey: 'your-browserbase-api-key',
  *   projectId: 'proj-...',
  * }
  * ```
```

website/blog/2026/02-06/index.md

Lines changed: 166 additions & 0 deletions
---
slug: crawlee-v3-16
title: "Crawlee v3.16: AI-Powered Crawling with StagehandCrawler"
description: "Crawlee v3.16 introduces StagehandCrawler for AI-powered browser automation, async iterators for Dataset and KeyValueStore, sitemap discovery, and improved Cloudflare handling."
authors: [B4nan]
---

Crawlee v3.16 is here, and the headline feature is the new `StagehandCrawler` — an AI-powered crawler that lets you interact with web pages using natural language instead of CSS selectors. On top of that, we've added async iterators for `Dataset` and `KeyValueStore`, a new `discoverValidSitemaps` utility, and made `handleCloudflareChallenge` more configurable.

Here's what's new:

- [StagehandCrawler — AI-powered browser automation](/blog/crawlee-v3-16#stagehandcrawler--ai-powered-browser-automation)
- [Async iterators for Dataset and KeyValueStore](/blog/crawlee-v3-16#async-iterators-for-dataset-and-keyvaluestore)
- [discoverValidSitemaps utility](/blog/crawlee-v3-16#discovervalidsitemaps-utility)
- [Improved Cloudflare challenge handling](/blog/crawlee-v3-16#improved-cloudflare-challenge-handling)

<!-- truncate -->

## StagehandCrawler — AI-powered browser automation

The new [`@crawlee/stagehand`](https://crawlee.dev/js/api/stagehand-crawler) package integrates [Browserbase's Stagehand](https://github.com/browserbase/stagehand) with Crawlee's crawling infrastructure. Instead of writing brittle CSS selectors or XPath expressions, you describe what you want in plain English and let the AI figure out the rest.

The enhanced page object provides four AI methods:

- **`page.act(instruction)`** — perform actions described in natural language (e.g., "Click the 'Load More' button")
- **`page.extract(instruction, schema)`** — extract structured data from the page using Zod schemas for type safety
- **`page.observe()`** — discover available actions on the current page
- **`page.agent(config)`** — create an autonomous agent for complex multi-step workflows

Since [`StagehandCrawler`](https://crawlee.dev/js/api/stagehand-crawler/class/StagehandCrawler) extends [`BrowserCrawler`](https://crawlee.dev/js/api/browser-crawler/class/BrowserCrawler), you get all the standard Crawlee features out of the box — [request queues](https://crawlee.dev/js/docs/guides/request-storage), [proxy rotation](https://crawlee.dev/js/docs/guides/proxy-management), [autoscaling](https://crawlee.dev/js/api/core/class/AutoscaledPool), [session management](https://crawlee.dev/js/docs/guides/session-management), and [browser fingerprinting](https://crawlee.dev/js/docs/guides/avoid-blocking). It's not a separate tool you have to wire up manually; it's a full Crawlee crawler with AI superpowers.

Here's a basic example showing how to interact with a page and extract structured data:

```typescript
import { StagehandCrawler } from '@crawlee/stagehand';
import { z } from 'zod';

const crawler = new StagehandCrawler({
    stagehandOptions: {
        model: 'openai/gpt-4.1-mini',
        apiKey: 'your-api-key', // Your OpenAI API key (or use OPENAI_API_KEY env var)
    },
    async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url}`);

        // Use natural language to interact with the page
        await page.act('Click the "Load More" button');

        // Extract structured data with AI
        const data = await page.extract(
            'Get all product names and prices',
            z.object({
                products: z.array(z.object({
                    name: z.string(),
                    price: z.number(),
                })),
            }),
        );

        log.info(`Found ${data.products.length} products`);
    },
});

await crawler.run(['https://example.com']);
```

The `StagehandCrawler` is especially useful for websites with complex or frequently changing layouts where traditional selectors are hard to maintain. If the target website has a stable structure, [`PlaywrightCrawler`](https://crawlee.dev/js/api/playwright-crawler/class/PlaywrightCrawler) remains the better choice — it's faster and doesn't require AI API keys.

**Installation:**

```bash
npm install @crawlee/stagehand @browserbasehq/stagehand
```

For a deeper dive into the architecture, all four AI methods, configuration options, and more examples, check out the [StagehandCrawler guide](https://crawlee.dev/js/docs/guides/stagehand-crawler-guide).

## Async iterators for Dataset and KeyValueStore

Previously, iterating over all items in a [`Dataset`](https://crawlee.dev/js/api/core/class/Dataset) or all keys in a [`KeyValueStore`](https://crawlee.dev/js/api/core/class/KeyValueStore) required manual pagination with `getData()` or `forEachKey()`. This release adds `for await...of` support, making iteration straightforward and memory-efficient.

Both `Dataset` and `KeyValueStore` now support direct iteration as well as `values()`, `entries()`, and `keys()` methods:

```typescript
import { Dataset, KeyValueStore } from 'crawlee';

// Dataset — iterate over all items
const dataset = await Dataset.open();

for await (const item of dataset) {
    console.log(item);
}

// Or use values()/entries() for more control
for await (const [index, item] of dataset.entries()) {
    console.log(`Item #${index}:`, item);
}

// KeyValueStore — iterate over entries
const kvs = await KeyValueStore.open();

for await (const [key, value] of kvs) {
    console.log(key, value);
}

// Or iterate over just keys or values
for await (const key of kvs.keys()) {
    console.log(key);
}

for await (const value of kvs.values()) {
    console.log(value);
}
```

The iteration handles pagination internally, so you don't have to worry about offsets or cursors. Existing code that uses `await` on `listItems()` or `listKeys()` continues to work unchanged — the methods now return hybrid objects that support both `await` and `for await...of`.
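The hybrid shape is simply a promise that also carries an async iterator. As a rough illustration of the pattern in general, here is a minimal sketch built around a hypothetical `hybridList` helper over pre-fetched pages (not Crawlee's actual implementation, which pulls pages lazily from storage):

```typescript
// Hypothetical sketch of the "awaitable + async-iterable" pattern.
// Crawlee's real return values fetch pages lazily; here the pages
// are simply passed in up front to keep the example self-contained.
function hybridList<T>(pages: T[][]): Promise<T[]> & AsyncIterable<T> {
    // `await list` resolves to the fully collected array
    const promise = Promise.resolve().then(() => pages.flat());
    // `for await...of` streams items page by page instead
    return Object.assign(promise, {
        async *[Symbol.asyncIterator]() {
            for (const page of pages) {
                yield* page;
            }
        },
    });
}

// The same value supports both consumption styles:
void (async () => {
    const list = hybridList([[1, 2], [3]]);
    console.log(await list); // [ 1, 2, 3 ]
    for await (const item of list) {
        console.log(item); // 1, then 2, then 3
    }
})();
```

The `await` path materializes everything at once, while the iterator path never holds more than one page in memory, which is the trade-off the new API lets you choose per call site.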
## discoverValidSitemaps utility

The new [`discoverValidSitemaps`](https://crawlee.dev/js/api/utils/function/discoverValidSitemaps) async generator in `@crawlee/utils` takes a list of URLs and automatically discovers sitemap files for those domains. It checks `robots.txt` for sitemap declarations, then tries common paths like `/sitemap.xml`, `/sitemap.txt`, and `/sitemap_index.xml`.

```typescript
import { discoverValidSitemaps } from '@crawlee/utils';

for await (const sitemapUrl of discoverValidSitemaps(['https://example.com'])) {
    console.log('Found sitemap:', sitemapUrl);
}
```

This is handy when you want to seed a crawl from sitemaps without knowing the exact sitemap URL upfront.
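The fallback step amounts to expanding each input URL's origin into those common candidate paths. A minimal sketch of just that expansion, using a hypothetical `sitemapCandidates` helper (the real utility also reads `robots.txt` first and verifies that each candidate exists and parses as a sitemap):

```typescript
// Hypothetical sketch of the candidate-expansion step only. The real
// discoverValidSitemaps checks robots.txt declarations first and
// yields only candidates that actually resolve to valid sitemaps.
const COMMON_SITEMAP_PATHS = ['/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml'];

function sitemapCandidates(urls: string[]): string[] {
    // Deduplicate by origin so several URLs from one site are probed once
    const origins = [...new Set(urls.map((url) => new URL(url).origin))];
    return origins.flatMap((origin) =>
        COMMON_SITEMAP_PATHS.map((path) => origin + path),
    );
}

console.log(sitemapCandidates(['https://example.com/a', 'https://example.com/b']));
// three candidates for the single deduplicated origin
```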
## Improved Cloudflare challenge handling

The [`handleCloudflareChallenge`](https://crawlee.dev/js/api/playwright-crawler/namespace/playwrightUtils) helper now accepts configuration callbacks for more control over how Cloudflare challenges are detected and solved. The new options include:

- **`clickPositionCallback`** — override how the checkbox click position is calculated
- **`clickCallback`** — override the actual checkbox clicking logic
- **`isChallengeCallback`** — customize detection of Cloudflare challenge pages
- **`isBlockedCallback`** — customize detection of Cloudflare block pages
- **`preChallengeSleepSecs`** — add a delay before the first click attempt (defaults to 1s)

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    postNavigationHooks: [
        async ({ handleCloudflareChallenge }) => {
            await handleCloudflareChallenge({
                // Custom click position for environments where the
                // default detection doesn't work
                clickPositionCallback: async (page) => {
                    const box = await page.locator('iframe').first().boundingBox();
                    return box ? { x: box.x + 25, y: box.y + 25 } : null;
                },
                preChallengeSleepSecs: 2,
            });
        },
    ],
    // ...
});
```

These options are particularly useful when running in environments where the default checkbox detection needs adjustment.

---

That's a wrap for Crawlee v3.16! For the full list of changes, check out the [changelog on GitHub](https://github.com/apify/crawlee/blob/master/CHANGELOG.md). If you have questions or feedback, [open a GitHub discussion](https://github.com/apify/crawlee/discussions) or [join our Discord community](https://apify.com/discord).

website/blog/authors.yml

Lines changed: 10 additions & 2 deletions
```diff
@@ -72,12 +72,20 @@ VladaD:
   image_url: https://avatars.githubusercontent.com/u/25082181?v=4
   socials:
     github: vdusek
-
+
 RadoC:
   name: Radoslav Chudovský
   title: Web Automation Engineer
   url: https://github.com/chudovskyr
   image_url: https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512
   socials:
     github: chudovskyr
-
+
+B4nan:
+  name: Martin Adámek
+  title: Crawlee Maintainer
+  url: https://github.com/B4nan
+  image_url: https://avatars1.githubusercontent.com/u/615580?s=460&v=4
+  socials:
+    x: B4nan
+    github: B4nan
```
