
Commit 7c7aec9

protoss70 and TC-MO authored
feat: improve wording
Co-authored-by: Michał Olender <[email protected]>
1 parent 313cebd · commit 7c7aec9

1 file changed: +18 -47 lines changed

sources/platform/integrations/workflows-and-notifications/n8n/ai-crawling.md

Lines changed: 18 additions & 47 deletions
@@ -59,30 +59,15 @@ Once connected, you can build workflows to automate website extraction and integ

After connecting the app, you can use one of the two modules as native scrapers to extract website content.

-### Standard Settings Module
+### Standard Settings module

-The Standard Settings module is a streamlined component of the Website Content Crawler that allows you to quickly extract content from websites using optimized default settings. This module is perfect for extracting content from blogs, documentation sites, knowledge bases, or any text-rich website to feed into AI models.
+The Standard Settings module lets you quickly extract content from websites using optimized default settings. This module is ideal for extracting content from blogs, documentation, and knowledge bases to feed into AI models.

#### How it works

-The crawler starts with one or more **Start URLs** you provide, typically the top-level URL of a documentation site, blog, or knowledge base. It then:
+The crawler starts with one or more URLs. It then crawls these initial URLs and discovers links to other pages on the same site, which it adds to a queue. The crawler will recursively follow these links as long as they are under the same path as the start URL. You can customize this behavior by defining specific URL patterns for inclusion or exclusion. To ensure efficiency, the crawler automatically skips any duplicate pages it encounters. A variety of settings are available to fine-tune the crawling process, including the crawler type, the maximum number of pages to crawl, the crawl depth, and concurrency.

-- Crawls these start URLs
-- Finds links to other pages on the site
-- Recursively crawls those pages as long as their URL is under the start URL
-- Respects URL patterns for inclusion/exclusion
-- Automatically skips duplicate pages with the same canonical URL
-- Provides various settings to customize crawling behavior (crawler type, max pages, depth, concurrency, etc.)
-
-Once a web page is loaded, the Actor processes its HTML to ensure quality content extraction:
-
-- Waits for dynamic content to load if using a headless browser
-- Can scroll to a certain height to ensure all page content is loaded
-- Can expand clickable elements to reveal hidden content
-- Removes DOM nodes matching specific CSS selectors (like navigation, headers, footers)
-- Optionally keeps only content matching specific CSS selectors
-- Removes cookie warnings using browser extensions
-- Transforms the page using the selected HTML transformer to extract the main content
+Once a page is loaded, the Actor processes its HTML to extract high-quality content. It can be configured to wait for dynamic content to load and can scroll the page to trigger the loading of additional content. To access information hidden in interactive sections, the crawler can be set up to expand clickable elements. It also cleans the HTML by removing irrelevant DOM nodes, such as navigation bars, headers, and footers, and can be configured to keep only the content that matches specific CSS selectors. The crawler also handles cookie warnings automatically and transforms the page to extract the main content.
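
To make the crawling behavior described above concrete, here is a minimal, hypothetical input sketch for such a crawl. The field names and value shapes (for example `startUrls`, `includeUrlGlobs`, `maxCrawlPages`) are illustrative assumptions, not taken from this commit, so check the module's actual parameter labels in n8n before reusing them.

```json
{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "includeUrlGlobs": ["https://docs.example.com/**"],
  "excludeUrlGlobs": ["https://docs.example.com/changelog/**"],
  "crawlerType": "playwright:firefox",
  "maxCrawlDepth": 3,
  "maxCrawlPages": 100,
  "maxConcurrency": 10,
  "removeElementsCssSelector": "nav, header, footer"
}
```

With settings along these lines, the crawler stays under the documentation path, skips the changelog, and strips navigation chrome before extracting the main content.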

#### Output data

@@ -115,7 +100,7 @@ For each crawled web page, you'll receive:
}
```

-### Advanced Settings Module
+### Advanced Settings module

The Advanced Settings module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. This module is ideal for complex websites, JavaScript-heavy applications, or when you need precise control over content extraction.

@@ -132,53 +117,39 @@ The Advanced Settings module provides complete control over the content extracti

#### How it works

-The Advanced Settings module provides granular control over the entire crawling process:
-
-1. _Crawler Selection_: Choose from Playwright (Firefox/Chrome), or Cheerio based on website complexity
-2. _URL Management_: Define precise scoping with include/exclude URL patterns
-3. _DOM Manipulation_: Control which HTML elements to keep or remove
-4. _Content Transformation_: Apply specialized algorithms for content extraction
-5. _Output Formatting_: Select from multiple formats for AI model compatibility
+The Advanced Settings module provides granular control over the entire crawling process. For _Crawler selection_, you can choose from Playwright (Firefox/Chrome) or Cheerio, depending on the complexity of the target website. _URL management_ allows you to define the crawling scope with include and exclude URL patterns. You can also exercise precise _DOM manipulation_ by controlling which HTML elements to keep or remove. To ensure the best results, you can apply specialized algorithms for _Content transformation_ and select from various _Output formatting_ options for better AI model compatibility.

#### Configuration options

-Advanced Settings offers numerous configuration options, including:
-
-- _Crawler Type_: Select the rendering engine (browser or HTTP client)
-- _Content Extraction Algorithm_: Choose from multiple HTML transformers
-- _Element Selectors_: Specify which elements to keep, remove, or click
-- _URL Patterns_: Define URL inclusion/exclusion patterns with glob syntax
-- _Crawling Parameters_: Set concurrency, depth, timeouts, and retries
-- _Proxy Configuration_: Configure proxy settings for robust crawling
-- _Output Options_: Select content formats and storage options
+Advanced Settings offers a wide range of configuration options. You can select the _Crawler type_ by choosing the rendering engine (browser or HTTP client) and the _Content extraction algorithm_ from multiple HTML transformers. _Element selectors_ allow you to specify which elements to keep, remove, or click, while _URL patterns_ let you define inclusion and exclusion rules with glob syntax. You can also set _Crawling parameters_ like concurrency, depth, timeouts, and retries. For robust crawling, you can adjust the _Proxy configuration_ and select from various _Output options_ for content formats and storage.
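
As a sketch of how these options might fit together, the hypothetical configuration below combines a lightweight HTTP crawler with selector-based cleanup, glob-scoped URLs, retries, a proxy, and Markdown output. The field names (`htmlTransformer`, `keepElementsCssSelector`, `proxyConfiguration`, and so on) are assumptions for illustration and may not match the exact labels exposed by the n8n module.

```json
{
  "crawlerType": "cheerio",
  "htmlTransformer": "readableText",
  "keepElementsCssSelector": "main, article",
  "removeElementsCssSelector": "nav, header, footer, .cookie-banner",
  "includeUrlGlobs": ["https://example.com/docs/**"],
  "maxConcurrency": 20,
  "requestTimeoutSecs": 60,
  "maxRequestRetries": 3,
  "proxyConfiguration": { "useApifyProxy": true },
  "saveMarkdown": true
}
```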

#### Output data

-In addition to the standard output fields, Advanced Settings provides:
+In addition to the standard output fields, this module provides:

-- _Multiple Format Options_: Content in Markdown, HTML, or plain text
-- _Debug Information_: Detailed extraction diagnostics and snapshots
-- _HTML Transformations_: Results from different content extraction algorithms
-- _File Storage Options_: Flexible storage for HTML, screenshots, or downloaded files
+- _Multiple format options_: Content in Markdown, HTML, or plain text
+- _Debug information_: Detailed extraction diagnostics and snapshots
+- _HTML transformations_: Results from different content extraction algorithms
+- _File storage options_: Flexible storage for HTML, screenshots, or downloaded files

-You can access any of our 6,000+ scrapers on Apify Store by using the [general Apify app](https://n8n.io/integrations/apify).
+You can access thousands of our scrapers on Apify Store by using the [general Apify app](https://n8n.io/integrations/apify).

## Usage as an AI Agent Tool

-You can setup Apify's Website Content Crawler app as a tool for your AI Agents. Below is a very simple configuration for your agents.
+You can set up Apify's Website Content Crawler app as a tool for your AI Agents.

![Setup AI Agent](./images/setup.png)

-### Dynamic url crawling
+### Dynamic URL crawling

-In the Website Content Crawler module you can set the **Start URLs** to be filled in by your AI Agent dynamically as shown in the image below. This allows the Agent to decide on which pages to scrape off the internet.
+In the Website Content Crawler module, you can set the **Start URLs** to be filled in by your AI Agent dynamically. This allows the Agent to decide which pages to scrape from the internet.

-We recommend using the **Advanced options** module with your AI Agent. Two key parameters in the Advanced module to set are **Max crawling depth** and **Max pages**. Remember that the scraping results are passed into the AI Agent’s context, so using smaller values for these parameters helps stay within context limits.
+We recommend using the Advanced Settings module with your AI Agent. Two key parameters to set are **Max crawling depth** and **Max pages**. Remember that the scraping results are passed into the AI Agent’s context, so using smaller values helps stay within context limits.
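
For example, a deliberately conservative limit such as the following keeps the scraped text small enough to fit in the agent's context window. The JSON field names are illustrative; in the n8n module these correspond to the **Max crawling depth** and **Max pages** fields, and the **Start URLs** field is left for the AI Agent to fill in dynamically, as described above.

```json
{
  "maxCrawlDepth": 1,
  "maxCrawlPages": 5
}
```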

![Config Apify](./images/config.png)

### Example usage

-Here I used it to find information about the latest blog post of Apify and its content. As you can see the AI Agent correctly filled the url for Apify's blog and summarized it's content
+Here, the agent was used to find information about Apify's latest blog post. It correctly filled in the URL for the blog and summarized its content.

![Scraping Results](./images/result.png)
