The file is referenced from the main [Actor file (.actor/actor.json)](ACTOR_FILE.md) using the `input` directive,
and it is typically stored in `.actor/input_schema.json`.

The file is a JSON schema with our extensions describing a single Actor input object
and its properties, including documentation, default value, and user interface definition.

**For full reference, see [Input schema specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1) in Apify documentation.**

<!-- TODO: Move the full specs including JSON meta schema to this repo -->
<!-- TODO: Consider renaming "editor" values to camelCase, for consistency -->
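For context, a minimal `.actor/actor.json` that references the input schema via the `input` directive might look like this (a sketch only; see the [Actor file](ACTOR_FILE.md) page for the authoritative format):

```jsonc
{
    "actorSpecification": 1,
    "name": "my-actor",
    "version": "0.1",
    // Relative path to the input schema file described on this page
    "input": "./input_schema.json"
}
```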
## Example Actor input schema

```jsonc
{
    "actorInputSchemaVersion": 1,

    "title": "Input schema for an Actor",
    "description": "Enter the start URL(s) of the website(s) to crawl, configure other optional settings, and run the Actor to crawl the pages and extract their text content.",
    "type": "object",

    "properties": {

        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "One or more URLs of the pages where the crawler will start. Note that the Actor will additionally only crawl sub-pages of these URLs. For example, for the start URL `https://www.example.com/blog`, it will crawl pages like `https://example.com/blog/article-1`, but will skip `https://example.com/docs/something-else`.",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://docs.apify.com/" }]
        },

        // The input value is another Dataset. The system can generate a UI to make it easy to select the dataset.
        "processDatasetId": {
            "title": "Input dataset",
            "type": "string",
            "resourceType": "dataset",
            "description": "Dataset to be processed by the Actor",
            // Optional link to dataset schema, used by the system to validate the input dataset
            // ...
        },

        // ...

        "crawlerType": {
            "title": "Crawler type",
            "type": "string",
            "editor": "select",
            "enum": ["playwright:chrome", "cheerio", "jsdom"],
            "enumTitles": ["Headless web browser (Chrome+Playwright)", "Raw HTTP client (Cheerio)", "Raw HTTP client with JS execution (JSDOM) (experimental!)"],
            "description": "Select the crawling engine:\n- **Headless web browser** (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.\n- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.",
            "default": "playwright:chrome"
        },

        "maxCrawlDepth": {
            "title": "Max crawling depth",
            "type": "integer",
            "description": "The maximum number of links starting from the start URL that the crawler will recursively descend. The start URLs have a depth of 0, the pages linked directly from the start URLs have a depth of 1, and so on.\n\nThis setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl start URLs.",
            "minimum": 0,
            "default": 20
        },

        "maxCrawlPages": {
            "title": "Max pages",
            "type": "integer",
            "description": "The maximum number of pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.",
            "minimum": 0,
            "default": 9999999
        }
    }
}
```
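To illustrate how a platform might consume such a schema, here is a rough sketch of filling missing input fields from the schema's `default` values. This is a hypothetical helper for illustration only, not Apify's actual implementation:

```python
def apply_defaults(schema: dict, user_input: dict) -> dict:
    """Fill in missing input fields from the schema's `default` values."""
    result = dict(user_input)
    for name, prop in schema.get("properties", {}).items():
        if name not in result and "default" in prop:
            result[name] = prop["default"]
    return result


# Trimmed-down version of the example schema above
schema = {
    "actorInputSchemaVersion": 1,
    "type": "object",
    "properties": {
        "maxCrawlDepth": {"type": "integer", "minimum": 0, "default": 20},
        "maxCrawlPages": {"type": "integer", "minimum": 0, "default": 9999999},
    },
}

print(apply_defaults(schema, {"maxCrawlDepth": 5}))
# → {'maxCrawlDepth': 5, 'maxCrawlPages': 9999999}
```

Note that the Apify-specific keywords (`editor`, `prefill`, `enumTitles`, `resourceType`) are extensions on top of plain JSON Schema, so a generic JSON Schema validator would simply ignore them.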

## Random notes

We could also add an `actor` resource type. The use case could be, for example, a testing Actor with three inputs:

- Actor to be tested
- test function containing, for example, a Jest unit test over the output
- input for the Actor

...and the testing Actor would call the given Actor with the given input and, at the end, execute the tests to check that the results are correct.
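A hypothetical shape for such a property (a sketch only; `"resourceType": "actor"` is a proposal, not part of the current specification, and the property name is made up for illustration):

```jsonc
"testedActorId": {
    "title": "Actor to be tested",
    "type": "string",
    // Proposed value; only storage resource types such as "dataset" exist in the spec today
    "resourceType": "actor",
    "description": "Actor that the testing Actor will call with the given input"
}
```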