After connecting the app, you can use one of the two modules as native scrapers to extract website content.
### Standard Settings module
The Standard Settings module lets you quickly extract content from websites using optimized default settings. This module is ideal for extracting content from blogs, documentation, and knowledge bases to feed into AI models.
#### How it works
The crawler starts with one or more URLs. It then crawls these initial URLs and discovers links to other pages on the same site, which it adds to a queue. The crawler will recursively follow these links as long as they are under the same path as the start URL. You can customize this behavior by defining specific URL patterns for inclusion or exclusion. To ensure efficiency, the crawler automatically skips any duplicate pages it encounters. A variety of settings are available to fine-tune the crawling process, including the crawler type, the maximum number of pages to crawl, the crawl depth, and concurrency.
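As a rough illustration, these scoping settings map onto the module's input fields along the following lines. This is a hedged sketch: the URLs are placeholders, and field names such as `startUrls`, `includeUrlGlobs`, `excludeUrlGlobs`, `maxCrawlDepth`, `maxCrawlPages`, and `maxConcurrency` should be checked against the Website Content Crawler input schema shown in the module:

```json
{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "includeUrlGlobs": [{ "glob": "https://docs.example.com/**" }],
  "excludeUrlGlobs": [{ "glob": "https://docs.example.com/changelog/**" }],
  "maxCrawlDepth": 3,
  "maxCrawlPages": 100,
  "maxConcurrency": 10
}
```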
Once a page is loaded, the Actor processes its HTML to extract high-quality content. It can be configured to wait for dynamic content to load and can scroll the page to trigger the loading of additional content. To access information hidden in interactive sections, the crawler can be set up to expand clickable elements. It also cleans the HTML by removing irrelevant DOM nodes, such as navigation bars, headers, and footers, and can be configured to keep only the content that matches specific CSS selectors. The crawler also handles cookie warnings automatically and transforms the page to extract the main content.
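For example, the cleanup behavior described above corresponds to input fields along these lines. The selector values below are placeholders, and names such as `removeElementsCssSelector`, `clickElementsCssSelector`, and `htmlTransformer` are assumptions to verify against the Actor's input schema:

```json
{
  "crawlerType": "playwright:firefox",
  "removeElementsCssSelector": "nav, header, footer",
  "keepElementsCssSelector": "article",
  "clickElementsCssSelector": "[aria-expanded=\"false\"]",
  "removeCookieWarnings": true,
  "htmlTransformer": "readableText"
}
```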
#### Output data
For each crawled web page, you'll receive the extracted content and metadata as a structured JSON object.
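Here is a trimmed, illustrative example of a single result item. The field set shown (`url`, `crawl`, `metadata`, `text`) is an assumption based on typical Website Content Crawler output; consult the Actor's documentation for the authoritative shape:

```json
{
  "url": "https://docs.example.com/getting-started",
  "crawl": {
    "loadedUrl": "https://docs.example.com/getting-started",
    "depth": 1,
    "httpStatusCode": 200
  },
  "metadata": {
    "title": "Getting started",
    "description": "An example documentation page",
    "languageCode": "en"
  },
  "text": "Getting started\n\nWelcome to the documentation..."
}
```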
### Advanced Settings module
The Advanced Settings module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. This module is ideal for complex websites, JavaScript-heavy applications, or when you need precise control over content extraction.
#### How it works
The Advanced Settings module provides granular control over the entire crawling process. For _Crawler selection_, you can choose from Playwright (Firefox/Chrome) or Cheerio, depending on the complexity of the target website. _URL management_ allows you to define the crawling scope with include and exclude URL patterns. You can also exercise precise _DOM manipulation_ by controlling which HTML elements to keep or remove. To ensure the best results, you can apply specialized algorithms for _Content transformation_ and select from various _Output formatting_ options for better AI model compatibility.
Advanced Settings offers a wide range of configuration options. You can select the _Crawler type_ by choosing the rendering engine (browser or HTTP client) and pick the _Content extraction algorithm_ from multiple HTML transformers. _Element selectors_ let you specify which elements to keep, remove, or click, while _URL patterns_ define inclusion and exclusion rules with glob syntax. You can also set _Crawling parameters_ such as concurrency, depth, timeouts, and retries. For robust crawling, you can adjust the _Proxy configuration_ and choose among _Output options_ for content formats and storage.
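To make this concrete, here is a sketch of what a fuller Advanced Settings input might look like. Field names such as `proxyConfiguration`, `maxRequestRetries`, `requestTimeoutSecs`, and `saveMarkdown` are assumptions based on the Actor's input schema; verify them in the module before relying on this:

```json
{
  "startUrls": [{ "url": "https://blog.example.com/" }],
  "crawlerType": "cheerio",
  "htmlTransformer": "readableText",
  "includeUrlGlobs": [{ "glob": "https://blog.example.com/posts/**" }],
  "maxCrawlDepth": 2,
  "maxCrawlPages": 50,
  "maxConcurrency": 20,
  "maxRequestRetries": 3,
  "requestTimeoutSecs": 60,
  "proxyConfiguration": { "useApifyProxy": true },
  "saveMarkdown": true,
  "saveHtml": false,
  "saveScreenshots": false
}
```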
#### Output data
In addition to the standard output fields, this module provides:
- _Multiple format options_: Content in Markdown, HTML, or plain text
- _Debug information_: Detailed extraction diagnostics and snapshots
- _HTML transformations_: Results from different content extraction algorithms
- _File storage options_: Flexible storage for HTML, screenshots, or downloaded files
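For instance, with Markdown and HTML output enabled, a result item might gain fields along these lines. This is an illustrative sketch only; the `markdown`, `html`, and `debug` field names are assumptions to check against an actual run's dataset:

```json
{
  "url": "https://blog.example.com/posts/hello-world",
  "text": "Hello world...",
  "markdown": "# Hello world\n...",
  "html": "<article><h1>Hello world</h1>...</article>",
  "debug": { "htmlTransformer": "readableText" }
}
```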
You can access any of the thousands of scrapers on Apify Store by using the [general Apify app](https://n8n.io/integrations/apify).
## Usage as an AI Agent Tool
You can set up Apify's Website Content Crawler app as a tool for your AI Agents.

### Dynamic URL crawling
In the Website Content Crawler module, you can set the **Start URLs** to be filled in dynamically by your AI Agent. This lets the Agent decide which pages to scrape.
We recommend using the Advanced Settings module with your AI Agent. Two key parameters to set are **Max crawling depth** and **Max pages**. Remember that the scraping results are passed into the AI Agent’s context, so using smaller values helps stay within context limits.
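A minimal sketch of such a tool configuration, assuming n8n's `$fromAI()` expression is used to let the Agent supply the start URL. Both the expression syntax and the `maxCrawlDepth`/`maxCrawlPages` field names are assumptions to verify in your n8n instance and the Actor's input schema:

```json
{
  "startUrls": [
    { "url": "{{ $fromAI('url', 'The web page to crawl') }}" }
  ],
  "maxCrawlDepth": 1,
  "maxCrawlPages": 5
}
```

Keeping both limits this small means each tool call returns only a handful of pages, which helps the results fit within the Agent's context window.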

### Example usage
Here, the agent was used to find information about Apify's latest blog post. It correctly filled in the URL for the blog and summarized its content.