Commit e2a7544
feat: n8n user docs for WCC actor app

---
title: N8N - AI crawling Actor integration
description: Learn about AI Crawling scraper modules.
sidebar_label: AI Crawling
sidebar_position: 6
slug: /integrations/n8n/ai-crawling
toc_max_heading_level: 4
---

## Apify Scraper for AI Crawling

Apify Scraper for AI Crawling from [Apify](https://apify.com/) lets you extract text content from websites to feed AI models, LLM applications, vector databases, or Retrieval Augmented Generation (RAG) pipelines. It supports rich formatting using Markdown, cleans the HTML of irrelevant elements, downloads linked files, and integrates with AI ecosystems like LangChain, LlamaIndex, and other LLM frameworks.

To use these modules, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token in the [Apify Console](https://console.apify.com/) under **Settings > Integrations**. After connecting, you can automate content extraction at scale and incorporate the results into your AI workflows.

## Connect Apify Scraper for AI Crawling

1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account.

   ![Sign up page](images/ai-crawling/wcc-signup.png)

1. To connect your Apify account to n8n, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console.

   ![Apify Console API token](images/Apify_Console_token_for_Make.png)

1. Find your token under the **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking **+ Create a new token**.

1. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your n8n workflow interface.

   ![Apify API token](images/Apify_token_on_Make.png)

1. In n8n, click **Create new credential** for the chosen Apify Scraper module.

1. In the **API key** field, paste the API token you copied from Apify and click **Save**.

IMG

Once connected, you can build workflows to automate website extraction and integrate results into your AI applications.

## Apify Scraper for Website Content modules

After connecting the app, you can use one of the two modules as native scrapers to extract website content.

### Standard Settings Module

The Standard Settings module is a streamlined component of the Website Content Crawler that allows you to quickly extract content from websites using optimized default settings. This module is perfect for extracting content from blogs, documentation sites, knowledge bases, or any text-rich website to feed into AI models.

#### How it works

The crawler starts with one or more **Start URLs** you provide, typically the top-level URL of a documentation site, blog, or knowledge base. It then:

- Crawls these start URLs
- Finds links to other pages on the site
- Recursively crawls those pages as long as their URL is under the start URL
- Respects URL patterns for inclusion/exclusion
- Automatically skips duplicate pages with the same canonical URL
- Provides various settings to customize crawling behavior (crawler type, max pages, depth, concurrency, etc.)

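For illustration, the crawl scoping described above maps onto the Actor's run input along these lines. This is a sketch only: the field names and value formats shown (such as `includeUrlGlobs` taking objects with a `glob` key) are assumptions about the input schema, and the URLs are hypothetical, so verify against the Actor's input tab in Apify Console before copying.

```json
{
  "startUrls": [{ "url": "https://docs.apify.com/academy" }],
  "maxCrawlDepth": 5,
  "maxCrawlPages": 100,
  "includeUrlGlobs": [{ "glob": "https://docs.apify.com/academy/**" }],
  "excludeUrlGlobs": [{ "glob": "https://docs.apify.com/academy/legacy/**" }],
  "maxConcurrency": 10
}
```

With a scope like this, the crawler stays under the start URL, skips the excluded subtree, and stops after the page or depth limit is reached.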
Once a web page is loaded, the Actor processes its HTML to ensure quality content extraction:

- Waits for dynamic content to load if using a headless browser
- Can scroll to a certain height to ensure all page content is loaded
- Can expand clickable elements to reveal hidden content
- Removes DOM nodes matching specific CSS selectors (like navigation, headers, footers)
- Optionally keeps only content matching specific CSS selectors
- Removes cookie warnings using browser extensions
- Transforms the page using the selected HTML transformer to extract the main content

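As a rough sketch, these HTML-processing steps correspond to input settings like the following. The field names and values here are illustrative approximations of the Actor's input schema, not guaranteed names; check the Actor's input tab for the exact options.

```json
{
  "dynamicContentWaitSecs": 10,
  "clickElementsCssSelector": "[aria-expanded=\"false\"]",
  "removeElementsCssSelector": "nav, header, footer, .cookie-banner",
  "keepElementsCssSelector": "main, article",
  "htmlTransformer": "readableText"
}
```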
#### Output data

For each crawled web page, you'll receive:

- _Page metadata_: URL, title, description, canonical URL
- _Cleaned text content_: The main article content with irrelevant elements removed
- _Markdown formatting_: Structured content with headers, lists, links, and other formatting preserved
- _Crawl information_: Loaded URL, referrer URL, timestamp, HTTP status
- _Optional file downloads_: PDFs, DOCs, and other linked documents

```json title="Sample output (shortened)"
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "crawl": {
    "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "loadedTime": "2025-04-22T14:33:20.514Z",
    "referrerUrl": "https://docs.apify.com/academy",
    "depth": 1,
    "httpStatusCode": 200
  },
  "metadata": {
    "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "title": "Web scraping for beginners | Apify Documentation",
    "description": "Learn the basics of web scraping with a step-by-step tutorial and practical exercises.",
    "languageCode": "en",
    "markdown": "# Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\n## What is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\n## Why learn web scraping?\n\n- **Data collection**: Gather information for research, analysis, or business intelligence\n- **Automation**: Save time by automating repetitive data collection tasks\n- **Integration**: Connect web data with your applications or databases\n- **Monitoring**: Track changes on websites automatically\n\n## Getting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n...",
    "text": "Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\nWhat is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\nWhy learn web scraping?\n\n- Data collection: Gather information for research, analysis, or business intelligence\n- Automation: Save time by automating repetitive data collection tasks\n- Integration: Connect web data with your applications or databases\n- Monitoring: Track changes on websites automatically\n\nGetting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n..."
  }
}
```

### Advanced Settings Module

The Advanced Settings module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. This module is ideal for complex websites, JavaScript-heavy applications, or when you need precise control over content extraction.

#### Key features

- _Multiple Crawler Options_: Choose between headless browsers (Playwright) or faster HTTP clients (Cheerio)
- _Custom Content Selection_: Specify exactly which elements to keep or remove
- _Advanced Navigation Control_: Set crawling depth, scope, and URL patterns
- _Dynamic Content Handling_: Wait for JavaScript-rendered content to load
- _Interactive Element Support_: Click expandable sections to reveal hidden content
- _Multiple Output Formats_: Save content as Markdown, HTML, or plain text
- _Proxy Configuration_: Use proxies to handle geo-restrictions or avoid IP blocks
- _Content Transformation Options_: Multiple algorithms for optimal content extraction

#### How it works

The Advanced Settings module provides granular control over the entire crawling process:

1. _Crawler Selection_: Choose from Playwright (Firefox/Chrome) or Cheerio based on website complexity
2. _URL Management_: Define precise scoping with include/exclude URL patterns
3. _DOM Manipulation_: Control which HTML elements to keep or remove
4. _Content Transformation_: Apply specialized algorithms for content extraction
5. _Output Formatting_: Select from multiple formats for AI model compatibility

#### Configuration options

Advanced Settings offers numerous configuration options, including:

- _Crawler Type_: Select the rendering engine (browser or HTTP client)
- _Content Extraction Algorithm_: Choose from multiple HTML transformers
- _Element Selectors_: Specify which elements to keep, remove, or click
- _URL Patterns_: Define URL inclusion/exclusion patterns with glob syntax
- _Crawling Parameters_: Set concurrency, depth, timeouts, and retries
- _Proxy Configuration_: Configure proxy settings for robust crawling
- _Output Options_: Select content formats and storage options

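Putting several of these options together, an Advanced Settings run input might look roughly like this. It is a sketch only: the field names, allowed values, and the example URL are assumptions to verify against the Actor's input schema before use.

```json
{
  "crawlerType": "cheerio",
  "startUrls": [{ "url": "https://example.com/blog" }],
  "includeUrlGlobs": [{ "glob": "https://example.com/blog/**" }],
  "maxCrawlDepth": 3,
  "maxRequestRetries": 3,
  "requestTimeoutSecs": 60,
  "proxyConfiguration": { "useApifyProxy": true },
  "saveMarkdown": true,
  "saveHtml": false,
  "saveFiles": false
}
```

Choosing the Cheerio crawler trades JavaScript rendering for speed, so it suits static sites; switch to a Playwright option for JavaScript-heavy pages.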
#### Output data

In addition to the standard output fields, Advanced Settings provides:

- _Multiple Format Options_: Content in Markdown, HTML, or plain text
- _Debug Information_: Detailed extraction diagnostics and snapshots
- _HTML Transformations_: Results from different content extraction algorithms
- _File Storage Options_: Flexible storage for HTML, screenshots, or downloaded files

You can access any of our 6,000+ scrapers on Apify Store by using the [general Apify app](https://n8n.io/integrations/apify).