Commit c0be268

Docs

1 parent 1031680 commit c0be268
File tree

2 files changed: +476 −6 lines changed


README.md

Lines changed: 238 additions & 3 deletions
@@ -1,5 +1,240 @@
-# x-crawl
+# <div id="en">x-crawl</div>

-XCrawl is a Nodejs multifunctional crawler library, which can crawl HTML, JSON or files in batches by providing configuration.
+English | <a href="#cn" style="text-decoration: none">简体中文</a>

-XCrawl is a Nodejs multifunctional crawler library; providing a configuration is enough to batch crawl HTML, JSON, or files.
+x-crawl is a Nodejs multifunctional crawler library. Providing a configuration is all it takes to batch fetch HTML, JSON, images, and more.

---

# <div id="cn">x-crawl</div>

<a href="#en" style="text-decoration: none">English</a> | 简体中文

x-crawl is a Nodejs multifunctional crawler library. Providing a configuration is all it takes to batch fetch HTML, JSON, images, and more.

## Installation

Take NPM as an example:

```shell
npm install x-crawl
```

## Example

Take fetching the title of https://docs.github.com/zh/get-started as an example:

```js
// Import the module (ES/CJS)
import XCrawl from 'x-crawl'

// Create a crawler instance
const docsXCrawl = new XCrawl({
  baseUrl: 'https://docs.github.com',
  timeout: 10000,
  intervalTime: { max: 2000, min: 1000 }
})

// Call the fetchHTML API to crawl the page
docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
  console.log(jsdom.window.document.querySelector('title')?.textContent)
})
```
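
fetchHTML returns a Promise, so failures can be handled with ordinary Promise chaining. A minimal sketch; the catch branch is added here for illustration and is not part of the original example:

```ts
docsXCrawl
  .fetchHTML('/zh/get-started')
  .then((jsdom) => {
    console.log(jsdom.window.document.querySelector('title')?.textContent)
  })
  .catch((error) => {
    // Reached when the request cannot be completed (e.g. a network error)
    console.error('fetchHTML failed:', error)
  })
```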

## Core concepts

### XCrawl

Create a crawler instance via new XCrawl.

* Type

```ts
class XCrawl {
  private readonly baseConfig
  constructor(baseConfig?: IXCrawlBaseConifg)
  fetch<T = any>(config: IFetchConfig): Promise<T>
  fetchFile(config: IFetchFileConfig): Promise<IFetchFile>
  fetchHTML(url: string): Promise<JSDOM>
}
```

* <div id="myXCrawl" style="text-decoration: none">Example</div>

myXCrawl is the crawler instance used in the examples that follow.

```js
const myXCrawl = new XCrawl({
  baseUrl: 'https://xxx.com',
  timeout: 10000,
  // Acts as the default interval before the next request; only takes effect when there are multiple requests
  intervalTime: {
    max: 2000,
    min: 1000
  }
})
```

### fetch

fetch is a method of the <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance above. It is usually used to crawl APIs and obtain data such as JSON.

* Type

```ts
function fetch<T = any>(config: IFetchConfig): Promise<T>
```

* Example

```js
const requestConifg = [
  { url: '/xxxx', method: 'GET' },
  { url: '/xxxx', method: 'GET' },
  { url: '/xxxx', method: 'GET' }
]

myXCrawl.fetch({
  requestConifg, // request config, can be an Array or an Object
  intervalTime: 800 // interval before the next request; only takes effect when there are multiple requests
}).then(res => {
  console.log(res)
})
```
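
Since requestConifg can also be a single object (see IFetchBaseConifg in the Types section), a one-off request can be written as shown below. A sketch with placeholder URL and query parameters; the params field comes from the IRequestConfig type:

```ts
myXCrawl.fetch({
  // A single request: an object instead of an array
  requestConifg: {
    url: '/xxxx',
    method: 'GET',
    params: { page: 1 } // extra request parameters, per IRequestConfig
  }
}).then((res) => {
  console.log(res)
})
```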

### fetchFile

fetchFile is a method of the <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance above. It is usually used to crawl files, such as images, PDF files, and so on.

* Type

```ts
function fetchFile(config: IFetchFileConfig): Promise<IFetchFile>
```

* Example

```js
const requestConifg = [
  { url: '/xxxx', method: 'GET' },
  { url: '/xxxx', method: 'GET' },
  { url: '/xxxx', method: 'GET' }
]

myXCrawl.fetchFile({
  requestConifg, // request config, can be an Array or an Object
  fileConfig: {
    storeDir: path.resolve(__dirname, './upload') // folder where the files are stored
  }
}).then(fileInfos => {
  console.log(fileInfos)
})
```
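
The resolved value is an IFetchFile array (see the Types section), so the stored files can be inspected after the crawl. A minimal sketch that reuses the requestConifg and path setup from the example above:

```ts
myXCrawl.fetchFile({
  requestConifg,
  fileConfig: {
    storeDir: path.resolve(__dirname, './upload') // same storage folder as above
  }
}).then((fileInfos) => {
  // Each entry describes one stored file: fileName, mimeType, size, filePath
  fileInfos.forEach(({ fileName, size, filePath }) => {
    console.log(`${fileName}: ${size} bytes -> ${filePath}`)
  })
})
```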

### fetchHTML

fetchHTML is a method of the <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance above. It is usually used to crawl HTML.

* Type

```ts
function fetchHTML(url: string): Promise<JSDOM>
```

* Example

```js
myXCrawl.fetchHTML('/xxx').then((jsdom) => {
  console.log(jsdom.window.document.querySelector('title')?.textContent)
})
```
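
Because fetchHTML resolves to a JSDOM instance, the full DOM API is available, not just querySelector. For example, collecting every link on the page; a sketch where '/xxx' is the same placeholder path as above:

```ts
myXCrawl.fetchHTML('/xxx').then((jsdom) => {
  // Standard DOM traversal through jsdom's window.document
  const hrefs = Array.from(jsdom.window.document.querySelectorAll('a')).map(
    (a) => a.href
  )

  console.log(hrefs)
})
```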

## Types

* IAnyObject

```ts
interface IAnyObject extends Object {
  [key: string | number | symbol]: any
}
```

* IMethod

```ts
export type IMethod = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTIONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
```

* IRequestConfig

```ts
export interface IRequestConfig {
  url: string
  method?: IMethod
  headers?: IAnyObject
  params?: IAnyObject
  data?: any
  timeout?: number
}
```

* IIntervalTime

```ts
type IIntervalTime = number | {
  max: number
  min?: number
}
```

* IFetchBaseConifg

```ts
interface IFetchBaseConifg {
  requestConifg: IRequestConfig | IRequestConfig[]
  intervalTime?: IIntervalTime
}
```

* IFetchFile

```ts
type IFetchFile = {
  fileName: string
  mimeType: string
  size: number
  filePath: string
}[]
```

* IXCrawlBaseConifg

```ts
interface IXCrawlBaseConifg {
  baseUrl?: string
  timeout?: number
  intervalTime?: IIntervalTime
}
```

* IFetchConfig

```ts
interface IFetchConfig extends IFetchBaseConifg {
}
```

* IFetchFileConfig

```ts
interface IFetchFileConfig extends IFetchBaseConifg {
  fileConfig: {
    storeDir: string
  }
}
```
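
Taken together, these types describe what the crawler accepts and returns. A hypothetical TypeScript sketch that ties them to the fetch generic; the /api/articles URL and the IArticle shape are made up for illustration and are not part of x-crawl:

```ts
// Hypothetical shape of the JSON the API returns; not defined by x-crawl
interface IArticle {
  id: number
  title: string
}

// The request object conforms to IRequestConfig: url is required, the rest optional.
// The generic parameter types the resolved value of fetch<T>.
myXCrawl.fetch<IArticle[]>({
  requestConifg: {
    url: '/api/articles',
    method: 'GET',
    params: { page: 1 }
  },
  intervalTime: 800 // IIntervalTime: a fixed number or { max, min? }
}).then((articles) => {
  articles.forEach((article) => console.log(article.id, article.title))
})
```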

## More

If you have any **questions** or **needs**, please file an **Issue** at https://github.com/coder-hxl/x-crawl .
