Commit 8a8309f

docs: added ollama

1 parent bffed08

14 files changed: +387 −31 lines

docs/cn/guide/crawl-openai-custom.md

Lines changed: 22 additions & 1 deletion

````diff
@@ -1,6 +1,27 @@
 # User-defined AI functions {#user-defined-ai-fuctions}

-To meet the personalized needs of different users, x-crawl also provides user-defined AI functions. The openai instance is exposed, which means you can customize and optimize the AI to your own needs so that it better fits your crawling work.
+To meet the personalized needs of different users, x-crawl also provides user-defined AI functions. The ai instance is exposed, which means you can customize and optimize the AI to your own needs so that it better fits your crawling work.
+
+## Ollama
+
+Use the custom() method of the AI application instance.
+
+Example:
+
+```js{8}
+import { createCrawlOllama } from 'x-crawl'
+
+const crawlOllamaApp = createCrawlOllama({
+  model: 'your model',
+  clientOptions: { ... }
+})
+
+const Ollama = crawlOllamaApp.custom()
+```
+
+For the Ollama instance returned by custom(), see https://github.com/ollama/ollama-js?tab=readme-ov-file#custom-client . It is essentially the same instance you would get from new Ollama() in that example; the difference is that x-crawl passes the clientOptions you supplied when creating the AI application instance to new Ollama(). You get the Ollama instance intact; x-crawl does not override it.
+
+## Openai

 Use the [custom()](/cn/api/custom#custom) method of the AI application instance.

````

docs/cn/guide/crawl-openai-help.md

Lines changed: 39 additions & 0 deletions

````diff
@@ -2,6 +2,45 @@

 It can provide you with intelligent answers and suggestions. Whether your question concerns crawling strategy, anti-crawling techniques, or data processing, you can ask the AI, and it will draw on its powerful learning and reasoning capabilities to give you professional answers and suggestions that help you complete your crawling tasks.

+## Ollama
+
+Use the help() method of the AI application instance.
+
+Example:
+
+```js{8,16}
+import { createCrawlOllama } from 'x-crawl'
+
+const crawlOllamaApp = createCrawlOllama({
+  model: 'your model',
+  clientOptions: { ... }
+})
+
+crawlOllamaApp.help('What is x-crawl').then((res) => {
+  console.log(res)
+  /*
+    res:
+    x-crawl is a flexible Node.js AI-assisted crawler library. It provides powerful AI assistance that helps developers crawl the web more efficiently, intelligently, and conveniently. You can find more details and usage instructions on GitHub: https://github.com/coder-hxl/x-crawl.
+  */
+})
+
+crawlOllamaApp.help('Three major things to note about crawlers').then((res) => {
+  console.log(res)
+  /*
+    res:
+    When doing crawler work, there are three important things to pay attention to:
+
+    1. **Follow site rules and the law**: When crawling data, always follow the rules in the site's robots.txt file and do not violate any applicable laws or regulations. Respecting the site owner's wishes and data ownership is very important.
+
+    2. **Avoid putting excessive load on the site**: A crawler consumes the site's bandwidth and resources, and overly frequent visits can put the site under pressure or even bring it down. Set a reasonable request frequency and avoid placing an excessive load on the site.
+
+    3. **Legality of data handling and privacy protection**: The crawled data may contain users' private information, so collecting, storing, and using it must comply with applicable privacy laws, and the data must not be abused. When processing the data, also ensure its accuracy and reliability to avoid misunderstandings or harm caused by improper handling.
+  */
+})
+```
+
+## Openai
+
 Use the [help()](/cn/api/help#help) method of the AI application instance.

 Example:
````

docs/cn/guide/create-ai-application.md

Lines changed: 16 additions & 1 deletion

````diff
@@ -1,6 +1,21 @@
 # Create an AI application {#create-ai-application}

-Currently, x-crawl's AI assistance relies on OpenAI and requires an OpenAI API Key. Other AI providers may be added later.
+## Ollama
+
+Create a new **application instance** via [createCrawlOllama()](/cn/api/create-crawl-ollama#createxcrawlollama):
+
+```js
+import { createCrawlOllama } from 'x-crawl'
+
+const crawlOllamaApp = createCrawlOllama({
+  model: 'your model',
+  clientOptions: { ... }
+})
+```
+
+## Openai
+
+Requires an OpenAI API Key.

 Create a new **application instance** via [createCrawlOpenAI()](/cn/api/create-crawl-openai#createxcrawlopenai):

````
docs/cn/guide/get-element-selectors.md

Lines changed: 46 additions & 0 deletions

````diff
@@ -2,6 +2,50 @@

 It can help us quickly locate specific elements on a page. Simply feed the HTML code to the AI and tell it which elements' selectors you want; the AI will automatically generate suitable selectors based on the page structure, greatly simplifying the tedious process of working out selectors.

+## Ollama
+
+Use the getElementSelectors() method of the AI application instance.
+
+Example:
+
+```js{24}
+import { createCrawlOllama } from 'x-crawl'
+
+const crawlOllamaApp = createCrawlOllama({
+  model: 'your model',
+  clientOptions: { ... }
+})
+
+const HTMLContent = `
+  <div class="scroll-list">
+    <div class="list-item">Women's hooded sweatshirt</div>
+    <div class="list-item">Men's sweatshirt</div>
+    <div class="list-item">Women's sweatshirt</div>
+    <div class="list-item">Men's hooded sweatshirt</div>
+  </div>
+  <div class="scroll-list">
+    <div class="list-item">Men's cotton short sleeve</div>
+    <div class="list-item">Men's cotton short sleeve</div>
+    <div class="list-item">Women's cotton short sleeve</div>
+    <div class="list-item">Men's ice-silk short sleeve</div>
+    <div class="list-item">Men's crew-neck short sleeve</div>
+  </div>
+`
+
+crawlOllamaApp.getElementSelectors(HTMLContent, "Get all women's clothing").then((res) => {
+  console.log(res)
+  /*
+    res:
+    {
+      selectors: '.scroll-list:nth-child(1) .list-item:nth-of-type(1), .scroll-list:nth-child(1) .list-item:nth-of-type(3), .scroll-list:nth-child(2) .list-item:nth-of-type(3)',
+      type: 'single'
+    }
+  */
+})
+```
+
+## Openai
+
 Use the [getElementSelectors()](/cn/api/get-element-selectors#getelementselectors) method of the AI application instance.

 Example:
@@ -41,4 +85,6 @@ crawlOpenAIApp.getElementSelectors(HTMLContent, "Get all women's clothing").then
 })
 ```

+---
+
 You can also pass the entire HTML to the AI, but that consumes more Tokens, and OpenAI charges by Tokens.
````
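The selectors string that getElementSelectors() returns in the example above is a single comma-separated CSS selector list. As a minimal sketch under that assumption, the list can be split into individual selectors locally (the result object below is copied from the example; the splitting step is our own illustration, not part of x-crawl):

```javascript
// Hypothetical result shape, copied from the getElementSelectors() example.
const res = {
  selectors:
    '.scroll-list:nth-child(1) .list-item:nth-of-type(1), .scroll-list:nth-child(2) .list-item:nth-of-type(3)',
  type: 'single'
}

// Split the comma-separated list into individual selectors,
// e.g. to query them one at a time with page.$$() or querySelectorAll().
const selectorList = res.selectors.split(',').map((s) => s.trim())

console.log(selectorList)
// → [ '.scroll-list:nth-child(1) .list-item:nth-of-type(1)',
//     '.scroll-list:nth-child(2) .list-item:nth-of-type(3)' ]
```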

docs/cn/guide/index.md

Lines changed: 21 additions & 19 deletions

````diff
@@ -7,13 +7,13 @@ x-crawl is a flexible Node.js AI-assisted crawler library.
 It consists of two parts:

 - Crawler: made up of the crawler API and various features; it works even without relying on AI.
-- AI: currently based on the large AI models provided by OpenAI, letting AI simplify many tedious operations.
+- AI: integrates ollama and openai, letting AI simplify many tedious operations.

 > If you find x-crawl helpful, or if you like x-crawl, you can give the [x-crawl repository](https://github.com/coder-hxl/x-crawl) a star on GitHub. Your support is our motivation to keep improving! Thank you for your support!

 ## Features {#features}

-- **🤖 AI assistance** - Powerful AI assistance makes crawling more efficient, intelligent, and convenient.
+- **🤖 AI assistance** - Integrates ollama and openai; powerful AI assistance makes crawling more efficient, intelligent, and convenient.
 - **🖋️ Flexible usage** - A single crawling API fits multiple configurations, and each configuration has its own advantages.
 - **⚙️ Multiple uses** - Supports crawling dynamic pages, static pages, interface data, and file data.
 - **⚒️ Page control** - Crawling dynamic pages supports automated operations, keyboard input, event operations, and more.
@@ -61,28 +61,30 @@ const crawlOpenAIApp = createCrawlOpenAI({
 })

 // crawlPage is used to crawl pages
-crawlApp.crawlPage('https://www.example.cn/s/select_homes').then(async (res) => {
-  const { page, browser } = res.data
+crawlApp
+  .crawlPage('https://www.example.cn/s/select_homes')
+  .then(async (res) => {
+    const { page, browser } = res.data

-  // Wait for the element to appear on the page, then get its HTML
-  const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
-  await page.waitForSelector(targetSelector)
-  const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML)
+    // Wait for the element to appear on the page, then get its HTML
+    const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
+    await page.waitForSelector(targetSelector)
+    const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML)

-  // Let the AI get the image links and deduplicate them (the more detailed the description, the better)
-  const srcResult = await crawlOpenAIApp.parseElements(
-    highlyHTML,
-    'Get the image links, not the ones inside source, and deduplicate them'
-  )
+    // Let the AI get the image links and deduplicate them (the more detailed the description, the better)
+    const srcResult = await crawlOpenAIApp.parseElements(
+      highlyHTML,
+      'Get the image links, not the ones inside source, and deduplicate them'
+    )

-  browser.close()
+    browser.close()

-  // crawlFile is used to crawl file resources
-  crawlApp.crawlFile({
-    targets: srcResult.elements.map((item) => item.src),
-    storeDirs: './upload'
+    // crawlFile is used to crawl file resources
+    crawlApp.crawlFile({
+      targets: srcResult.elements.map((item) => item.src),
+      storeDirs: './upload'
+    })
   })
-})
 ```

 Run:
````
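The prompt in the example above asks the AI to deduplicate the image links. The same deduplication can be sketched locally with a Set, as a fallback or sanity check; the sample data below is hypothetical, not produced by the library:

```javascript
// Hypothetical parseElements()-style result with a duplicate src.
const elements = [
  { src: 'https://www.example.cn/a.jpg' },
  { src: 'https://www.example.cn/b.jpg' },
  { src: 'https://www.example.cn/a.jpg' }
]

// A Set keeps only the first occurrence of each link, preserving order.
const uniqueSrcs = [...new Set(elements.map((item) => item.src))]

console.log(uniqueSrcs)
// → [ 'https://www.example.cn/a.jpg', 'https://www.example.cn/b.jpg' ]
```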

docs/cn/guide/parse-elements.md

Lines changed: 52 additions & 0 deletions

````diff
@@ -2,6 +2,56 @@

 There is no need to manually analyze the HTML page structure and then extract the element attributes or values you need. Just feed the HTML code to the AI and tell it which elements' information you want; the AI will automatically analyze the page structure and extract the corresponding element attributes or values.

+## Ollama
+
+Use the parseElements() method of the AI application instance.
+
+Example:
+
+```js{24}
+import { createCrawlOllama } from 'x-crawl'
+
+const crawlOllamaApp = createCrawlOllama({
+  model: 'your model',
+  clientOptions: { ... }
+})
+
+const HTMLContent = `
+  <div class="scroll-list">
+    <div class="list-item">Women's hooded sweatshirt</div>
+    <div class="list-item">Men's sweatshirt</div>
+    <div class="list-item">Women's sweatshirt</div>
+    <div class="list-item">Men's hooded sweatshirt</div>
+  </div>
+  <div class="scroll-list">
+    <div class="list-item">Men's cotton short sleeve</div>
+    <div class="list-item">Men's cotton short sleeve</div>
+    <div class="list-item">Women's cotton short sleeve</div>
+    <div class="list-item">Men's ice-silk short sleeve</div>
+    <div class="list-item">Men's crew-neck short sleeve</div>
+  </div>
+`
+
+crawlOllamaApp.parseElements(HTMLContent, "Get men's clothing, deduplicated").then((res) => {
+  console.log(res)
+  /*
+    res:
+    {
+      elements: [
+        { class: 'list-item', text: "Men's sweatshirt" },
+        { class: 'list-item', text: "Men's hooded sweatshirt" },
+        { class: 'list-item', text: "Men's cotton short sleeve" },
+        { class: 'list-item', text: "Men's ice-silk short sleeve" },
+        { class: 'list-item', text: "Men's crew-neck short sleeve" }
+      ],
+      type: 'multiple'
+    }
+  */
+})
+```
+
+## Openai
+
 Use the [parseElements()](/cn/api/parse-elements#parseelements) method of the AI application instance.

 Example:
@@ -47,4 +97,6 @@ crawlOpenAIApp.parseElements(HTMLContent, "Get men's clothing, deduplicated").th
 })
 ```

+---
+
 You can also pass the entire HTML to the AI, but that consumes more Tokens, and OpenAI charges by Tokens.
````
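The elements array returned by parseElements() in the example above can be post-processed like any plain array. A minimal sketch that repeats the prompt's filter-and-deduplicate step locally, using a shortened hypothetical copy of the example result:

```javascript
// Shortened hypothetical parseElements() result.
const res = {
  elements: [
    { class: 'list-item', text: "Men's sweatshirt" },
    { class: 'list-item', text: "Women's sweatshirt" },
    { class: 'list-item', text: "Men's sweatshirt" }
  ],
  type: 'multiple'
}

// Keep only men's items and drop duplicate texts, mirroring the prompt.
// seen.add() returns the Set itself (truthy), so first occurrences pass.
const seen = new Set()
const mens = res.elements.filter(
  (el) => el.text.startsWith("Men's") && !seen.has(el.text) && seen.add(el.text)
)

console.log(mens.map((el) => el.text))
// → [ "Men's sweatshirt" ]
```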

docs/cn/index.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -19,7 +19,7 @@ hero:

 features:
   - title: 🤖 AI assistance
-    details: Powerful AI assistance makes crawling more efficient, intelligent, and convenient.
+    details: Integrates ollama and openai; powerful AI assistance makes crawling more efficient, intelligent, and convenient.
   - title: 🖋️ Flexible usage
     details: A single crawling API fits multiple configurations, and each configuration has its own advantages.
   - title: ⚙️ Rich features
````

docs/guide/crawl-openai-custom.md

Lines changed: 23 additions & 2 deletions

````diff
@@ -1,12 +1,33 @@
 # User defined AI functions

-In order to meet the personalized needs of different users, x-crawl also provides user-customized AI functions. Exposing the openai instance means you can tailor and optimize the AI to your needs to better suit your crawling efforts.
+In order to meet the personalized needs of different users, x-crawl also provides user-customized AI functions. Exposing the ai instance means you can tailor and optimize the AI to your needs to better suit your crawling efforts.
+
+## Ollama
+
+Use the custom() method of the AI application instance.
+
+Example:
+
+```js{8}
+import { createCrawlOllama } from 'x-crawl'
+
+const crawlOllamaApp = createCrawlOllama({
+  model: 'Your model',
+  clientOptions: { ... }
+})
+
+const Ollama = crawlOllamaApp.custom()
+```
+
+For the Ollama instance returned by custom(), see https://github.com/ollama/ollama-js?tab=readme-ov-file#custom-client . It is essentially the same instance you would get from new Ollama() in that example; the difference is that x-crawl passes the clientOptions supplied when the AI application instance was created to new Ollama(). You get the Ollama instance intact; x-crawl does not rewrite it.
+
+## Openai

 Use the [custom()](/api/custom#custom) method of the AI application instance.

 Example:

-```js
+```js{7}
 import { createXCrawlOpenAI } from 'x-crawl'

 const xCrawlOpenAIApp = createXCrawlOpenAI({
````

docs/guide/crawl-openai-help.md

Lines changed: 44 additions & 1 deletion

````diff
@@ -2,11 +2,54 @@

 It can provide you with intelligent answers and suggestions. Whether it is about crawling strategies, anti-crawling techniques, or data processing, you can ask the AI questions, and the AI will provide professional answers and suggestions based on its powerful learning and reasoning capabilities to help you better complete your crawling tasks.

+## Ollama
+
+Use the help() method of the AI application instance.
+
+Example:
+
+```js{8,17}
+import { createXCrawlOllama } from 'x-crawl'
+
+const xCrawlOllamaApp = createXCrawlOllama({
+  model: 'Your model',
+  clientOptions: { ... }
+})
+
+xCrawlOllamaApp.help('What is x-crawl').then((res) => {
+  console.log(res)
+  /*
+    res:
+    x-crawl is a flexible Node.js AI-assisted web crawling library. It offers powerful AI-assisted features that make web crawling more efficient, intelligent, and convenient. You can find more information and the source code on x-crawl's GitHub page: https://github.com/coder-hxl/x-crawl.
+  */
+})
+
+xCrawlOllamaApp
+  .help('Three major things to note about crawlers')
+  .then((res) => {
+    console.log(res)
+    /*
+      res:
+      There are several important aspects to consider when working with crawlers:
+
+      1. **Robots.txt:** It's important to respect the rules set in a website's robots.txt file. This file specifies which parts of a website can be crawled by search engines and other bots. Not following these rules can lead to your crawler being blocked or even legal issues.
+
+      2. **Crawl Delay:** It's a good practice to implement a crawl delay between your requests to a website. This helps to reduce the load on the server and also shows respect for the server resources.
+
+      3. **User-Agent:** Always set a descriptive User-Agent header for your crawler. This helps websites identify your crawler and allows them to contact you if there are any issues. Using a generic or misleading User-Agent can also lead to your crawler being blocked.
+
+      By keeping these points in mind, you can ensure that your crawler operates efficiently and ethically.
+    */
+  })
+```
+
+## Openai
+
 Use the [help()](/api/help#help) method of the AI application instance.

 Example:

-```js
+```js{7,16}
 import { createXCrawlOpenAI } from 'x-crawl'

 const xCrawlOpenAIApp = createXCrawlOpenAI({
````