Commit 5b51214

Merge branch 'v4' into copilot/add-abortsignal-option-basehttpclient

2 parents: 3a5a416 + 9ae2994

32 files changed: +503174 -396619 lines

docs/guides/http-clients.mdx

Lines changed: 5 additions & 5 deletions

@@ -49,7 +49,7 @@ BaseHttpClient --|> GotScrapingHttpClient
 
 ## Switching between HTTP clients
 
-Crawlee currently provides two main HTTP clients: <ApiLink to="core/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink>, which uses the `got-scraping` library, and <ApiLink to="impit-client/class/ImpitHttpClient">`ImpitHttpClient`</ApiLink>, which uses the `impit` library. You can switch between them by setting the `BasehttpClient` parameter when initializing a crawler class. The default HTTP client is <ApiLink to="core/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink>. For more details on anti-blocking features, see our [avoid getting blocked guide](./avoid-blocking).
+Crawlee currently provides two main HTTP clients: <ApiLink to="got-scraping-client/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink>, which uses the `got-scraping` library, and <ApiLink to="impit-client/class/ImpitHttpClient">`ImpitHttpClient`</ApiLink>, which uses the `impit` library. You can switch between them by setting the `BasehttpClient` parameter when initializing a crawler class. The default HTTP client is <ApiLink to="got-scraping-client/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink>. For more details on anti-blocking features, see our [avoid getting blocked guide](./avoid-blocking).
 
 Below are examples of how to configure the HTTP client for the <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink>:
 
@@ -68,7 +68,7 @@ Below are examples of how to configure the HTTP client for the <ApiLink to="chee
 
 ## Installation requirements
 
-Since <ApiLink to="core/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink> is the default HTTP client, it's included with the base Crawlee installation and requires no additional packages.
+Since <ApiLink to="got-scraping-client/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink> is the default HTTP client, it's included with the base Crawlee installation and requires no additional packages.
 
 For <ApiLink to="impit-client/class/ImpitHttpClient">`ImpitHttpClient`</ApiLink>, you need to install a separate `@crawlee/impit-client` package:
 
@@ -78,7 +78,7 @@ npm i @crawlee/impit-client
 
 ## Creating custom HTTP clients
 
-Crawlee provides an interface, <ApiLink to="core/interface/BaseHttpClient">`BaseHttpClient`</ApiLink>, which defines the interface that all HTTP clients must implement. This allows you to create custom HTTP clients tailored to your specific requirements.
+Crawlee provides an interface, <ApiLink to="types/interface/BaseHttpClient">`BaseHttpClient`</ApiLink>, which defines the interface that all HTTP clients must implement. This allows you to create custom HTTP clients tailored to your specific requirements.
 
 HTTP clients are responsible for several key operations:
 
@@ -88,10 +88,10 @@ HTTP clients are responsible for several key operations:
 - managing proxy configurations,
 - connection pooling with timeout management.
 
-To create a custom HTTP client, you need to implement the <ApiLink to="core/interface/BaseHttpClient">`BaseHttpClient`</ApiLink> interface. Your implementation must be async-compatible and include proper cleanup and resource management to work seamlessly with Crawlee's concurrent processing model.
+To create a custom HTTP client, you need to implement the <ApiLink to="types/interface/BaseHttpClient">`BaseHttpClient`</ApiLink> interface. Your implementation must be async-compatible and include proper cleanup and resource management to work seamlessly with Crawlee's concurrent processing model.
 
 ## Conclusion
 
-This guide introduced you to the HTTP clients available in Crawlee and demonstrated how to switch between them, including their installation requirements and usage examples. You also learned about the responsibilities of HTTP clients and how to implement your own custom HTTP client by inheriting from the <ApiLink to="core/interface/BaseHttpClient">`BaseHttpClient`</ApiLink> base class.
+This guide introduced you to the HTTP clients available in Crawlee and demonstrated how to switch between them, including their installation requirements and usage examples. You also learned about the responsibilities of HTTP clients and how to implement your own custom HTTP client by inheriting from the <ApiLink to="types/interface/BaseHttpClient">`BaseHttpClient`</ApiLink> base class.
 
 If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

docs/guides/http-clients/cheerio-got-scraping-example.ts

Lines changed: 2 additions & 1 deletion

@@ -1,4 +1,5 @@
-import { CheerioCrawler, GotScrapingHttpClient } from 'crawlee';
+import { CheerioCrawler } from 'crawlee';
+import { GotScrapingHttpClient } from '@crawlee/got-scraping-client';
 
 const crawler = new CheerioCrawler({
     httpClient: new GotScrapingHttpClient(),

docs/guides/impit-http-client/basic-usage.ts

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ const crawler = new BasicCrawler({
     }),
     async requestHandler({ sendRequest, log }) {
         const response = await sendRequest();
-        log.info('Received response', { statusCode: response.statusCode });
+        log.info('Received response', { status: response.status });
     },
 });
 
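The `statusCode` to `status` rename above reflects a Fetch-style response shape (`status`) rather than got's (`statusCode`). As a minimal sketch of handling both shapes during a migration like this one, assuming only the numeric code is needed (the interfaces and the `getStatus` helper below are illustrative, not Crawlee APIs):

```typescript
// Hypothetical response shapes; not Crawlee's actual types.
interface GotStyleResponse { statusCode: number }
interface FetchStyleResponse { status: number }

// Read whichever status field the client happens to expose.
const getStatus = (res: Partial<GotStyleResponse & FetchStyleResponse>): number | undefined =>
    res.status ?? res.statusCode;

console.log(getStatus({ statusCode: 200 })); // 200
console.log(getStatus({ status: 404 })); // 404
```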

package.json

Lines changed: 1 addition & 0 deletions

@@ -102,6 +102,7 @@
         "globby": "^15.0.0",
         "got": "^14.4.7",
         "husky": "^9.1.7",
+        "iconv-lite": "^0.7.2",
         "is-ci": "^4.1.0",
         "lerna": "^9.0.0",
         "lint-staged": "^16.0.0",

packages/http-crawler/package.json

Lines changed: 1 addition & 1 deletion

@@ -56,7 +56,7 @@
         "@types/content-type": "^1.1.8",
         "cheerio": "^1.0.0",
         "content-type": "^1.0.5",
-        "iconv-lite": "^0.7.0",
+        "iconv-lite": "^0.7.2",
         "mime-types": "^3.0.1",
         "ow": "^2.0.0",
         "tslib": "^2.8.1",

packages/http-crawler/src/internals/http-crawler.ts

Lines changed: 3 additions & 5 deletions

@@ -722,11 +722,9 @@ export class HttpCrawler<
         if (Buffer.isEncoding(encoding)) return { response, encoding };
 
         // Try to re-encode a variety of unsupported encodings to utf-8
-        if (iconv.default.encodingExists(encoding)) {
-            const encodeStream = iconv.default.encodeStream(utf8);
-            const decodeStream = iconv.default
-                .decodeStream(encoding)
-                .on('error', (err) => encodeStream.emit('error', err));
+        if (iconv.encodingExists(encoding)) {
+            const encodeStream = iconv.encodeStream(utf8);
+            const decodeStream = iconv.decodeStream(encoding).on('error', (err) => encodeStream.emit('error', err));
             const reencodedBody = response.body
                 ? Readable.toWeb(
                       Readable.from(
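The change above drops the `.default` indirection, consistent with the `iconv-lite` 0.7.x bump elsewhere in this commit. As a rough illustration of what the guarded re-encode accomplishes, here is a sketch using only Node's built-in `TextDecoder`; the `reencodeToUtf8` helper is invented for this example, and the real crawler code streams through iconv-lite rather than buffering the whole body:

```typescript
// Illustrative only: decode bytes from a legacy encoding, then emit UTF-8.
// TextDecoder stands in for iconv-lite here because it ships with Node and
// understands many of the same encoding labels.
const reencodeToUtf8 = (body: Buffer, encoding: string): Buffer => {
    // TextDecoder throws on unknown labels, playing the role of the
    // encodingExists() guard in the diff above.
    const decoder = new TextDecoder(encoding);
    return Buffer.from(decoder.decode(body), 'utf8');
};

const latin1 = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" in ISO-8859-1
console.log(reencodeToUtf8(latin1, 'iso-8859-1').toString('utf8')); // café
```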

packages/stagehand-crawler/src/internals/stagehand-plugin.ts

Lines changed: 3 additions & 2 deletions

@@ -1,12 +1,13 @@
-import log from '@apify/log';
 import type { Stagehand, V3Options } from '@browserbasehq/stagehand';
 import type { BrowserController, BrowserPluginOptions, LaunchContext } from '@crawlee/browser-pool';
 import { anonymizeProxySugar, BrowserPlugin } from '@crawlee/browser-pool';
-import type { BrowserType, LaunchOptions, Browser as PlaywrightBrowser } from 'playwright';
+import type { Browser as PlaywrightBrowser, BrowserType, LaunchOptions } from 'playwright';
 // Stagehand is built on CDP (Chrome DevTools Protocol), which only works with Chromium-based browsers.
 // Firefox and WebKit are not supported by Stagehand.
 import { chromium } from 'playwright';
 
+import log from '@apify/log';
+
 import { StagehandController } from './stagehand-controller';
 import type { StagehandOptions } from './stagehand-crawler';
 

test/core/crawlers/playwright_crawler.test.ts

Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@ import os from 'node:os';
 
 import type { PlaywrightCrawlingContext, PlaywrightGotoOptions, Request } from '@crawlee/playwright';
 import { PlaywrightCrawler, RequestList } from '@crawlee/playwright';
+import type { Cheerio, CheerioAPI, CheerioRoot, Element } from '@crawlee/utils';
 import express from 'express';
 import playwright from 'playwright';
 import { MemoryStorageEmulator } from 'test/shared/MemoryStorageEmulator.js';

test/core/crawlers/rendering_type_predictor.test.ts

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@ import { Request } from '@crawlee/core';
 import { RenderingTypePredictor } from '@crawlee/playwright';
 import { afterEach, beforeEach, describe, expect, it } from 'vitest';
 
-import { MemoryStorageEmulator } from '../../shared/MemoryStorageEmulator';
+import { MemoryStorageEmulator } from '../../shared/MemoryStorageEmulator.js';
 
 describe('RenderingTypePredictor', () => {
     const localStorageEmulator = new MemoryStorageEmulator();
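The `.js` extension added to the relative import above follows Node's ES module resolution rules: relative specifiers must name the emitted file exactly, so TypeScript source importing a sibling `.ts` file still writes the compiled `.js` path. This behavior is enforced by compiler settings along these lines (a sketch of the relevant options, not necessarily this repository's exact `tsconfig.json`):

```json
{
    "compilerOptions": {
        "module": "NodeNext",
        "moduleResolution": "NodeNext"
    }
}
```

Under these settings an extensionless relative import is a compile error, while `'../../shared/MemoryStorageEmulator.js'` resolves to the compiled output even though the source file on disk is `MemoryStorageEmulator.ts`.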

test/stagehand-crawler/stagehand-controller.test.ts

Lines changed: 2 additions & 2 deletions

@@ -1,7 +1,7 @@
 import log from '@apify/log';
 
-import { StagehandController } from '../../packages/stagehand-crawler/src/internals/stagehand-controller';
-import type { StagehandPlugin } from '../../packages/stagehand-crawler/src/internals/stagehand-plugin';
+import { StagehandController } from '../../packages/stagehand-crawler/src/internals/stagehand-controller.js';
+import type { StagehandPlugin } from '../../packages/stagehand-crawler/src/internals/stagehand-plugin.js';
 
 describe('StagehandController', () => {
     let mockPlugin: StagehandPlugin;
