
Commit 4771ddb

Merge pull request #38 from marcelovicentegc/main
Make GPT Crawler a CLI
2 parents 9e8c329 + e928bd6 commit 4771ddb

File tree

10 files changed: +1206 -1181 lines


.gitignore

Lines changed: 3 additions & 0 deletions

@@ -6,3 +6,6 @@ node_modules
 apify_storage
 crawlee_storage
 storage
+
+# any output from the crawler
+.json

Dockerfile

Lines changed: 1 addition & 1 deletion

@@ -48,4 +48,4 @@ COPY --chown=myuser . ./
 
 # Run the image. If you know you won't need headful browsers,
 # you can remove the XVFB start script for a micro perf gain.
-CMD ./start_xvfb_and_run_cmd.sh && npm run start:prod --silent
+CMD ./start_xvfb_and_run_cmd.sh && npm run start:prod --silent

README.md

Lines changed: 48 additions & 23 deletions

@@ -1,51 +1,63 @@
-# GPT Crawler
+<!-- Markdown written with https://marketplace.visualstudio.com/items?itemName=yzhang.markdown-all-in-one -->
+
+# GPT Crawler <!-- omit from toc -->
 
 Crawl a site to generate knowledge files to create your own custom GPT from one or multiple URLs
 
 ![Gif showing the crawl run](https://github.com/BuilderIO/gpt-crawler/assets/844291/feb8763a-152b-4708-9c92-013b5c70d2f2)
 
+- [Example](#example)
+- [Get started](#get-started)
+  - [Running locally](#running-locally)
+    - [Clone the repository](#clone-the-repository)
+    - [Install dependencies](#install-dependencies)
+    - [Configure the crawler](#configure-the-crawler)
+    - [Run your crawler](#run-your-crawler)
+  - [Alternative methods](#alternative-methods)
+    - [Running in a container with Docker](#running-in-a-container-with-docker)
+    - [Running as a CLI](#running-as-a-cli)
+      - [Development](#development)
+  - [Upload your data to OpenAI](#upload-your-data-to-openai)
+    - [Create a custom GPT](#create-a-custom-gpt)
+    - [Create a custom assistant](#create-a-custom-assistant)
+- [Contributing](#contributing)
 
 ## Example
 
-[Here is a custom GPT](https://chat.openai.com/g/g-kywiqipmR-builder-io-assistant) that I quickly made to help answer questions about how to use and integrate [Builder.io](https://www.builder.io) by simply providing the URL to the Builder docs.
+[Here is a custom GPT](https://chat.openai.com/g/g-kywiqipmR-builder-io-assistant) that I quickly made to help answer questions about how to use and integrate [Builder.io](https://www.builder.io) by simply providing the URL to the Builder docs.
 
 This project crawled the docs and generated the file that I uploaded as the basis for the custom GPT.
 
-[Try it out yourself](https://chat.openai.com/g/g-kywiqipmR-builder-io-assistant) by asking questions about how to integrate Builder.io into a site.
+[Try it out yourself](https://chat.openai.com/g/g-kywiqipmR-builder-io-assistant) by asking questions about how to integrate Builder.io into a site.
 
 > Note that you may need a paid ChatGPT plan to access this feature
 
 ## Get started
 
-### Prerequisites
+### Running locally
 
-Be sure you have Node.js >= 16 installed
+#### Clone the repository
 
-### Clone the repo
+Be sure you have Node.js >= 16 installed.
 
 ```sh
 git clone https://github.com/builderio/gpt-crawler
 ```
 
-### Install Dependencies
+#### Install dependencies
 
 ```sh
 npm i
 ```
 
-If you do not have Playwright installed:
-```sh
-npx playwright install
-```
-
-### Configure the crawler
+#### Configure the crawler
 
 Open [config.ts](config.ts) and edit the `url` and `selectors` properties to match your needs.
 
 E.g. to crawl the Builder.io docs to make our custom GPT you can use:
 
 ```ts
-export const config: Config = {
+export const defaultConfig: Config = {
   url: "https://www.builder.io/c/docs/developers",
   match: "https://www.builder.io/c/docs/**",
   selector: `.docs-builder-container`,

@@ -69,23 +81,41 @@ type Config = {
   /** File name for the finished data */
   outputFileName: string;
   /** Optional cookie to be set. E.g. for Cookie Consent */
-  cookie?: {name: string; value: string}
+  cookie?: { name: string; value: string };
   /** Optional function to run for each page found */
   onVisitPage?: (options: {
     page: Page;
     pushData: (data: any) => Promise<void>;
   }) => Promise<void>;
-  /** Optional timeout for waiting for a selector to appear */
-  waitForSelectorTimeout?: number;
+  /** Optional timeout for waiting for a selector to appear */
+  waitForSelectorTimeout?: number;
 };
 ```
 
-### Run your crawler
+#### Run your crawler
 
 ```sh
 npm start
 ```
 
+### Alternative methods
+
+#### [Running in a container with Docker](./containerapp/README.md)
+
+To obtain `output.json` from a containerized run, go into the `containerapp` directory and modify `config.ts` as described above; `output.json` will be generated in the `data` folder. Note: the `outputFileName` property in the `containerapp` folder's `config.ts` is configured to work with the container.
+
+#### Running as a CLI
+
+<!-- TODO: Needs to be actually published -->
+
+##### Development
+
+To run the CLI locally while developing it:
+
+```sh
+npm run start:cli --url https://www.builder.io/c/docs/developers --match https://www.builder.io/c/docs/** --selector .docs-builder-container --maxPagesToCrawl 50 --outputFileName output.json
+```
+
 ### Upload your data to OpenAI
 
 The crawl will generate a file called `output.json` at the root of this project. Upload that [to OpenAI](https://platform.openai.com/docs/assistants/overview) to create your custom assistant or custom GPT.

@@ -105,7 +135,6 @@ Use this option for UI access to your generated knowledge that you can easily sh
 
 ![Gif of how to upload a custom GPT](https://github.com/BuilderIO/gpt-crawler/assets/844291/22f27fb5-6ca5-4748-9edd-6bcf00b408cf)
 
-
 #### Create a custom assistant
 
 Use this option for API access to your generated knowledge that you can integrate into your product.

@@ -116,10 +145,6 @@ Use this option for API access to your generated knowledge that you can integrat
 
 ![Gif of how to upload to an assistant](https://github.com/BuilderIO/gpt-crawler/assets/844291/06e6ad36-e2ba-4c6e-8d5a-bf329140de49)
 
-## (Alternate method) Running in a container with Docker
-To obtain the `output.json` with a containerized execution. Go into the `containerapp` directory. Modify the `config.ts` same as above, the `output.json` file should be generated in the data folder. Note: the `outputFileName` property in the `config.ts` file in containerapp folder is configured to work with the container.
-
-
 ## Contributing
 
 Know how to make this project better? Send a PR!
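
The `npm run start:cli` invocation in the README above passes each `Config` field as a flag. As a rough illustration of how such flags map onto the config object — this is a hypothetical sketch, not the parser this commit actually ships (`parseArgs` and `CliConfig` are invented names here):

```typescript
// Hypothetical sketch of a flag-to-config mapping for the CLI shown above.
// The commit's real CLI may differ; parseArgs/CliConfig are invented names.
type CliConfig = {
  url: string;
  match: string;
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
};

function parseArgs(argv: string[]): CliConfig {
  // Collect "--flag value" pairs into a plain map.
  const opts: Record<string, string | undefined> = {};
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    if (!flag.startsWith("--")) throw new Error(`Unexpected argument: ${flag}`);
    opts[flag.slice(2)] = argv[i + 1];
  }
  // Fall back to the defaults documented in config.ts.
  return {
    url: opts.url ?? "",
    match: opts.match ?? "",
    selector: opts.selector ?? "",
    maxPagesToCrawl: Number(opts.maxPagesToCrawl ?? 50),
    outputFileName: opts.outputFileName ?? "output.json",
  };
}

// Mirrors the npm run start:cli invocation from the README.
const config = parseArgs([
  "--url", "https://www.builder.io/c/docs/developers",
  "--match", "https://www.builder.io/c/docs/**",
  "--selector", ".docs-builder-container",
  "--maxPagesToCrawl", "50",
  "--outputFileName", "output.json",
]);
```

A real implementation would more likely lean on an argument-parsing library, but the mapping from flags to the `Config` fields is the same idea.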

config.ts

Lines changed: 27 additions & 8 deletions

@@ -1,14 +1,33 @@
 import { Page } from "playwright";
-type Config = {
-  /** URL to start the crawl */
+
+export type Config = {
+  /**
+   * URL to start the crawl
+   * @example "https://www.builder.io/c/docs/developers"
+   * @default ""
+   */
   url: string;
-  /** Pattern to match against for links on a page to subsequently crawl */
+  /**
+   * Pattern to match against for links on a page to subsequently crawl
+   * @example "https://www.builder.io/c/docs/**"
+   * @default ""
+   */
   match: string | string[];
-  /** Selector to grab the inner text from */
+  /**
+   * Selector to grab the inner text from
+   * @example ".docs-builder-container"
+   * @default ""
+   */
   selector: string;
-  /** Don't crawl more than this many pages */
+  /**
+   * Don't crawl more than this many pages
+   * @default 50
+   */
   maxPagesToCrawl: number;
-  /** File name for the finished data */
+  /**
+   * File name for the finished data
+   * @default "output.json"
+   */
   outputFileName: string;
   /** Optional cookie to be set. E.g. for Cookie Consent */
   cookie?: { name: string; value: string };

@@ -21,10 +40,10 @@ type Config = {
   waitForSelectorTimeout?: number;
 };
 
-export const config: Config = {
+export const defaultConfig: Config = {
   url: "https://www.builder.io/c/docs/developers",
   match: "https://www.builder.io/c/docs/**",
   selector: `.docs-builder-container`,
   maxPagesToCrawl: 50,
-  outputFileName: "output.json",
+  outputFileName: "../output.json",
 };
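
With `config.ts` now exporting `defaultConfig` (and, per the first hunk, the `Config` type as well), a caller can spread the defaults and override only the fields it cares about. A minimal self-contained sketch — the `Config` shape here is trimmed to the fields visible in this diff, and `docs.example.com` is a placeholder, not a URL from the project:

```typescript
// Sketch only: building a custom config from the renamed defaultConfig export.
// Config is abbreviated to the fields shown in the diff; the URL is a placeholder.
type Config = {
  url: string;
  match: string | string[];
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
  cookie?: { name: string; value: string };
};

const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "../output.json",
};

// Override just the crawl target; every other field keeps its default.
const myConfig: Config = {
  ...defaultConfig,
  url: "https://docs.example.com",
  match: "https://docs.example.com/**",
};
```

Renaming the export from `config` to `defaultConfig` is what makes this pattern read naturally: the module provides defaults, and the CLI (or any caller) supplies overrides.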
