Commit ace4b36
docs: port the Python HTTP clients guide to JS (#3301)
Ports https://crawlee.dev/python/docs/guides/http-clients to the JS docs for Crawlee JS. Closes #3299
1 parent efac644 commit ace4b36

File tree

7 files changed: +1367 -109 lines changed

docs/guides/http-clients.mdx

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
---
id: http-clients
title: HTTP clients
description: Learn about Crawlee's HTTP client architecture, how to switch between different implementations, and create custom HTTP clients for specialized web scraping needs.
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import CheerioGotScrapingExample from '!!raw-loader!roa-loader!./http-clients/cheerio-got-scraping-example.ts';
import CheerioImpitExample from '!!raw-loader!roa-loader!./http-clients/cheerio-impit-example.ts';

HTTP clients are used by HTTP-based crawlers (e.g., <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink>) to communicate with web servers. They rely on external HTTP libraries such as [`impit`](https://apify.github.io/impit/) or [`got-scraping`](https://github.com/apify/got-scraping/) rather than a browser. After retrieving page content, an HTML parsing library such as [Cheerio](https://cheerio.js.org/), [`jsdom`](https://github.com/jsdom/jsdom) or [`linkedom`](https://github.com/WebReflection/linkedom) is typically used to extract data. These crawlers are faster than browser-based crawlers but generally cannot execute client-side JavaScript.

```mermaid
---
config:
  class:
    hideEmptyMembersBox: true
---

classDiagram

%% ========================
%% Abstract classes
%% ========================

class BaseHttpClient {
    <<abstract>>
}

%% ========================
%% Specific classes
%% ========================

class ImpitHttpClient

class GotScrapingHttpClient

%% ========================
%% Inheritance arrows
%% ========================

ImpitHttpClient --|> BaseHttpClient
GotScrapingHttpClient --|> BaseHttpClient
```
## Switching between HTTP clients

Crawlee currently provides two main HTTP clients: <ApiLink to="core/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink>, which uses the `got-scraping` library, and <ApiLink to="impit-client/class/ImpitHttpClient">`ImpitHttpClient`</ApiLink>, which uses the `impit` library. You can switch between them by setting the `httpClient` option when initializing a crawler class. The default HTTP client is <ApiLink to="core/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink>. For more details on anti-blocking features, see our [avoid getting blocked guide](./avoid-blocking).

Below are examples of how to configure the HTTP client for the <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink>:

<Tabs>
    <TabItem value="GotScrapingHttpClientExample" label="CheerioCrawler with got-scraping">
        <RunnableCodeBlock className="language-typescript" language="typescript">
            {CheerioGotScrapingExample}
        </RunnableCodeBlock>
    </TabItem>
    <TabItem value="ImpitHttpClientExample" label="CheerioCrawler with impit">
        <RunnableCodeBlock className="language-typescript" language="typescript">
            {CheerioImpitExample}
        </RunnableCodeBlock>
    </TabItem>
</Tabs>

## Installation requirements

Since <ApiLink to="core/class/GotScrapingHttpClient">`GotScrapingHttpClient`</ApiLink> is the default HTTP client, it's included with the base Crawlee installation and requires no additional packages.

For <ApiLink to="impit-client/class/ImpitHttpClient">`ImpitHttpClient`</ApiLink>, you need to install the separate `@crawlee/impit-client` package:

```sh
npm i @crawlee/impit-client
```

## Creating custom HTTP clients

Crawlee defines the <ApiLink to="core/interface/BaseHttpClient">`BaseHttpClient`</ApiLink> interface, which every HTTP client must implement. This allows you to create custom HTTP clients tailored to your specific requirements.

HTTP clients are responsible for several key operations:

- sending HTTP requests and receiving responses,
- managing cookies and sessions,
- handling headers and authentication,
- managing proxy configurations,
- connection pooling with timeout management.

To create a custom HTTP client, implement the <ApiLink to="core/interface/BaseHttpClient">`BaseHttpClient`</ApiLink> interface. Your implementation must be async-compatible and include proper cleanup and resource management to work seamlessly with Crawlee's concurrent processing model.
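
To make the required surface concrete, here is a minimal sketch of a client built on the global `fetch` API. It is an illustration only: the `SimpleHttpRequest`/`SimpleHttpResponse` shapes below are simplified stand-ins for Crawlee's real types, and a production client would also cover streaming, redirects, proxies, and session-aware cookies as described above. Consult the `BaseHttpClient` API reference for the exact method signatures.

```typescript
// Simplified stand-ins for Crawlee's request/response types (illustration only;
// the real BaseHttpClient interface uses richer types and additional methods).
interface SimpleHttpRequest {
    url: string;
    method?: string;
    headers?: Record<string, string>;
    payload?: string;
}

interface SimpleHttpResponse {
    statusCode: number;
    headers: Record<string, string>;
    body: string;
}

// A minimal fetch-based client: it only translates between these shapes and
// the WHATWG fetch API. A real implementation would also wire in proxy
// configuration, cookie persistence, and timeout/connection management.
class FetchHttpClient {
    async sendRequest(request: SimpleHttpRequest): Promise<SimpleHttpResponse> {
        const response = await fetch(request.url, {
            method: request.method ?? 'GET',
            headers: request.headers,
            body: request.payload,
        });
        return {
            statusCode: response.status,
            headers: Object.fromEntries(response.headers.entries()),
            body: await response.text(),
        };
    }
}
```

Once the remaining interface members are implemented against Crawlee's actual types, such a class could be passed to a crawler via the same `httpClient` option used by the built-in clients shown earlier.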

## Conclusion

This guide introduced the HTTP clients available in Crawlee and demonstrated how to switch between them, including their installation requirements and usage examples. You also learned about the responsibilities of HTTP clients and how to build your own by providing a custom implementation of the <ApiLink to="core/interface/BaseHttpClient">`BaseHttpClient`</ApiLink> interface.

If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

docs/guides/http-clients/cheerio-got-scraping-example.ts

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
import { CheerioCrawler, GotScrapingHttpClient } from 'crawlee';

const crawler = new CheerioCrawler({
    httpClient: new GotScrapingHttpClient(),
    async requestHandler() {
        /* ... */
    },
});

docs/guides/http-clients/cheerio-impit-example.ts

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
import { CheerioCrawler } from 'crawlee';
import { ImpitHttpClient } from '@crawlee/impit-client';

const crawler = new CheerioCrawler({
    httpClient: new ImpitHttpClient({
        // Setup options for the impit library
        ignoreTlsErrors: true,
        browser: 'firefox',
    }),
    async requestHandler() {
        /* ... */
    },
});

website/docusaurus.config.js

Lines changed: 6 additions & 0 deletions
@@ -16,6 +16,7 @@ const packages = [
     'memory-storage',
     'utils',
     'types',
+    'impit-client',
 ];
 const packagesOrder = [
     '@crawlee/core',
@@ -31,6 +32,7 @@ const packagesOrder = [
     '@crawlee/browser-pool',
     '@crawlee/utils',
     '@crawlee/types',
+    '@crawlee/impit-client',
 ];

 /** @type {Partial<import('@docusaurus/types').DocusaurusConfig>} */
@@ -53,10 +55,14 @@ module.exports = {
     },
     onBrokenLinks: 'throw',
     markdown: {
+        mermaid: true,
         hooks: {
             onBrokenMarkdownLinks: 'throw',
         },
     },
+    themes: [
+        '@docusaurus/theme-mermaid',
+    ],
     future: {
         experimental_faster: {
             // ssgWorkerThreads: true,

website/package.json

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@
     "@docusaurus/plugin-content-docs": "3.9.2",
     "@docusaurus/preset-classic": "3.9.2",
     "@docusaurus/theme-common": "3.9.2",
+    "@docusaurus/theme-mermaid": "^3.9.2",
     "@giscus/react": "^3.0.0",
     "@mdx-js/react": "^3.0.1",
     "@signalwire/docusaurus-plugin-llms-txt": "^1.2.1",

website/sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ module.exports = {
     items: [
         'guides/request-storage',
         'guides/result-storage',
+        'guides/http-clients',
         'guides/configuration',
         'guides/cheerio-crawler-guide',
         'guides/javascript-rendering',
