
Commit 1e6ff46

Merge pull request #73 from apify/feat/output

Finishing output schema

2 parents dd90063 + 1fc6af8, commit 1e6ff46

4 files changed: +159 −168 lines

README.md (6 additions & 15 deletions)

````diff
@@ -1826,7 +1826,8 @@ For details, see the [Actor file specification](./pages/ACTOR_FILE.md) page.
   "title": "Screenshotter",
   "description": "Take a screenshot of any URL",
   "version": "0.0",
-  "input": "./input_schema.json",
+  "inputSchema": "./input_schema.json",
+  "outputSchema": "./output_schema.json",
   "dockerfile": "./Dockerfile"
 }
 ```
@@ -2013,21 +2014,11 @@ This is an example of the output schema file for the `bob/screenshotter` Actor:
   "title": "Output schema for Screenshotter Actor",
   "description": "The URL to the resulting screenshot",
   "properties": {
-
-    "currentProducts": {
-      "type": "$defaultDataset",
-      "views": ["productVariants"]
-    },
-
     "screenshotUrl": {
-      "type": "$defaultKeyValueStore",
-      "keys": ["screenshot.png"],
-      "title": "Product page screenshot"
-    },
-
-    "productExplorer": {
-      "type": "$defaultWebServer",
-      "title": "API server"
+      "type": "string",
+      "title": "Web page screenshot",
+      "resourceType": "file",
+      "template": "{{actorRun.defaultKeyValueStoreUrl}}/screenshot.png"
     }
   }
 }
````
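The new output schema builds the output value from a `template` string with run-context placeholders such as `{{actorRun.defaultKeyValueStoreUrl}}`. A rough sketch of how such a template could be expanded; the resolver function and the context layout below are illustrative assumptions, not part of the specification:

```python
import re


def resolve_template(template: str, context: dict) -> str:
    """Replace {{dotted.path}} placeholders with values looked up in a context dict.

    Hypothetical resolver; the platform's real template expansion may differ.
    """
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).strip().split("."):
            value = value[part]  # descend one level per dotted segment
        return str(value)

    return re.sub(r"\{\{(.*?)\}\}", lookup, template)


# Context mirroring the schema's {{actorRun.defaultKeyValueStoreUrl}} placeholder
context = {"actorRun": {"defaultKeyValueStoreUrl": "https://api.example.com/kvs/abc123"}}
print(resolve_template("{{actorRun.defaultKeyValueStoreUrl}}/screenshot.png", context))
# → https://api.example.com/kvs/abc123/screenshot.png
```

With the run's default key-value store URL known, the template resolves to a direct link to the stored `screenshot.png` record.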

pages/ACTOR_FILE.md (33 additions & 30 deletions)

````diff
@@ -2,17 +2,17 @@
 
 This JSON file must be present at `.actor/actor.json` and defines core properties of a single web Actor.
 
-The file has the following structure:
+The file contains a single JSON object with the following properties:
 
 ```jsonc
 {
-  // Required, indicates that this is an Actor definition file and the specific version of the Actor specification.
+  // Required field, indicates that this is an Actor definition file and the specific version of the Actor specification.
   "actorSpecification": 1,
-
-  // Required "technical" name of the Actor, must be a DNS-friendly text
+
+  // Required "technical" name of the Actor, must be a DNS hostname-friendly text.
   "name": "google-search-scraper",
 
-  // Human-friendly name and description of the Actor
+  // Human-friendly name and description of the Actor.
   "title": "Google Search Scraper",
   "description": "A 200-char description",
 
@@ -22,56 +22,59 @@ The file has the following structure:
 
   // Optional tag that is applied to the builds of this Actor. If omitted, it defaults to "latest".
   "buildTag": "latest",
-
-  // An object with environment variables expected by the Actor.
+
+  // An optional object with environment variables expected by the Actor.
   // Secret values are prefixed by @ and their actual values need to be registered with the CLI, for example:
   // $ apify secrets add mySecretPassword pwd1234
   "environmentVariables": {
     "MYSQL_USER": "my_username",
     "MYSQL_PASSWORD": "@mySecretPassword"
   },
-
-  // If true, the Actor indicates it can be run in the Standby mode,
+
+  // Optional field. If true, the Actor indicates it can be run in the Standby mode,
   // to get started and be kept alive by the system to handle incoming HTTP REST requests by the Actor's web server.
   "usesStandbyMode": true,
-
-  // A metadata object enabling implementations to pass arbitrary additional properties.
+
+  // An optional metadata object enabling implementations to pass arbitrary additional properties.
+  // The properties and their values must be strings.
   "labels": {
     "something": "bla bla"
   },
 
   // Optional minimum and maximum memory for running the Actor.
   "minMemoryMbytes": 128,
   "maxMemoryMbytes": 4096,
-
-  // Link to the Actor Dockerfile. If omitted, the system looks for "./Dockerfile" or "../Dockerfile"
+
+  // Optional link to the Actor Dockerfile.
+  // If omitted, the system looks for "./Dockerfile" or "../Dockerfile"
   "dockerfile": "./Dockerfile",
-
-  // Link to the Actor README file in Markdown format. If omitted, the system looks for "./ACTOR.md" and "../README.md"
+
+  // Optional link to the Actor README file in Markdown format.
+  // If omitted, the system looks for "./ACTOR.md" and "../README.md"
   "readme": "./README.md",
 
   // Optional link to the Actor changelog file in Markdown format.
   "changelog": "../../../shared/CHANGELOG.md",
-
-  // Links to input/output extened JSON schema files or inlined objects.
-  // COMPATIBILITY: This used to be called "input", all implementations should support it
+
+  // Optional link to Actor input or output schema file, or inlined schema object,
+  // which is a JSON schema with our extensions. For details see ./INPUT_SCHEMA.md or ./OUTPUT_SCHEMA.md, respectively.
+  // BACKWARDS COMPATIBILITY: "inputSchema" used to be called "input", all implementations should support this.
   "inputSchema": "./input_schema.json",
   "outputSchema": "./output_schema.json",
-
-  // Links to storages schema files, or inlined schema objects.
-  // These aren't standard JSON schema files, but our own format. See ./DATASET_SCHEMA.md
-  // COMPATIBILITY: This used to be "storages.keyValueStore", all implementations should support it
+
+  // Optional path to Dataset or Key-value Store schema file or inlined schema object for the Actor's default dataset or key-value store.
+  // For details, see ./DATASET_SCHEMA.md or ./KEY_VALUE_STORE_SCHEMA.md, respectively.
+  // BACKWARDS COMPATIBILITY: "datasetSchema" used to be the "storages.keyValueStore" sub-object, all implementations should support this.
   "datasetSchema": "../shared_schemas/generic_dataset_schema.json",
-
   "keyValueStoreSchema": "./key_value_store_schema.json",
-
-  // Optional link to an OpenAPI definition file or inlined object describing the Actor web server API
-  "webServerOpenapi": "./web_server_openapi.json",
-
-  // Optional URL path to the Model Context Protocol (MCP) server exposed on the Actor web server.
+
+  // Optional path or inlined schema object of the Actor's web server in OpenAPI format.
+  "webServerSchema": "./web_server_openapi.json",
+
+  // Optional URL path and query parameters to the Model Context Protocol (MCP) server exposed by the Actor web server.
   // If present, the system knows the Actor provides an MCP server, which can be used by the platform
-  // and integrations to integrate the Actor from AI/LLM systems.
-  "webServerMcpPath": "/mcp?someVar=1",
+  // and integrations to integrate the Actor with various AI/LLM systems.
+  "webServerMcpPath": "/mcp?version=2",
 
   // Scripts can be used by tools like the CLI to do certain actions based on the commands you run.
   // The presence of this object in your Actor config is optional, but we recommend always defining at least the `run` key.
````
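The BACKWARDS COMPATIBILITY notes in this diff imply that tooling reading `actor.json` must accept both the old and the new key names. A minimal normalization sketch; the exact legacy layout of the `storages` object (a `dataset`/`keyValueStore` sub-object) is an assumption here, not spelled out in the diff:

```python
def normalize_actor_config(config: dict) -> dict:
    """Map legacy actor.json keys to their current names (illustrative sketch).

    Assumed legacy shape: top-level "input" and a "storages" object with
    "dataset"/"keyValueStore" keys; real implementations may differ.
    """
    config = dict(config)  # shallow copy, don't mutate the caller's object
    if "input" in config and "inputSchema" not in config:
        config["inputSchema"] = config.pop("input")
    storages = config.pop("storages", {})
    if "dataset" in storages and "datasetSchema" not in config:
        config["datasetSchema"] = storages["dataset"]
    if "keyValueStore" in storages and "keyValueStoreSchema" not in config:
        config["keyValueStoreSchema"] = storages["keyValueStore"]
    return config


legacy = {
    "actorSpecification": 1,
    "name": "google-search-scraper",
    "input": "./input_schema.json",
    "storages": {"keyValueStore": "./kv_schema.json"},
}
print(normalize_actor_config(legacy))  # legacy keys replaced by inputSchema / keyValueStoreSchema
```

New keys win when both forms are present, so already-migrated files pass through unchanged.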

pages/INPUT_SCHEMA.md (73 additions & 75 deletions)

````diff
@@ -5,99 +5,97 @@ Actor (see [Input](../README.md#input) for details).
 The file is referenced from the main [Actor file (.actor/actor.json)](ACTOR_FILE.md) using the `input` directive,
 and it is typically stored in `.actor/input_schema.json`.
 
-The file is a JSON schema with our extensions,
-which defines input properties for an Actor, including documentation, default value, and user interface definition.
+The file is a JSON schema with our extensions describing a single Actor input object
+and its properties, including documentation, default value, and user interface definition.
 
 **For full reference, see [Input schema specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1) in Apify documentation.**
 
-<!-- TODO: Move the full specs to this repo -->
+<!-- TODO: Move the full specs including JSON meta schema to this repo -->
+<!-- TODO: Consider renaming "editor" values to camelCase, for consistency -->
 
 ## Example Actor input schema
 
 ```jsonc
 {
-  "actorInputSchemaVersion": 1,
-  "title": "Input schema for Website Content Crawler",
-  "description": "Enter the start URL(s) of the website(s) to crawl, configure other optional settings, and run the Actor to crawl the pages and extract their text content.",
-  "type": "object",
-  "properties": {
-    "startUrls": {
-      "title": "Start URLs",
-      "type": "array",
-      "description": "One or more URLs of the pages where the crawler will start. Note that the Actor will additionally only crawl sub-pages of these URLs. For example, for the start URL `https://www.example.com/blog`, it will crawl pages like `https://example.com/blog/article-1`, but will skip `https://example.com/docs/something-else`.",
-      "editor": "requestListSources",
-      "prefill": [{ "url": "https://docs.apify.com/" }]
-    },
-    "crawlerType": {
-      "sectionCaption": "Crawler settings",
-      "title": "Crawler type",
-      "type": "string",
-      "enum": ["playwright:chrome", "cheerio", "jsdom"],
-      "enumTitles": ["Headless web browser (Chrome+Playwright)", "Raw HTTP client (Cheerio)", "Raw HTTP client with JS execution (JSDOM) (experimental!)"],
-      "description": "Select the crawling engine:\n- **Headless web browser** (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.\n- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.",
-      "default": "playwright:chrome"
-    },
-    "maxCrawlDepth": {
-      "title": "Max crawling depth",
-      "type": "integer",
-      "description": "The maximum number of links starting from the start URL that the crawler will recursively descend. The start URLs have a depth of 0, the pages linked directly from the start URLs have a depth of 1, and so on.\n\nThis setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl start URLs.",
-      "minimum": 0,
-      "default": 20
-    },
-    "maxCrawlPages": {
-      "title": "Max pages",
-      "type": "integer",
-      "description": "The maximum number pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.",
-      "minimum": 0,
-      "default": 9999999
-    },
-    // ...
+  "actorInputSchemaVersion": 1,
+
+  "title": "Input schema for an Actor",
+  "description": "Enter the start URL(s) of the website(s) to crawl, configure other optional settings, and run the Actor to crawl the pages and extract their text content.",
+  "type": "object",
+
+  "properties": {
+
+    "startUrls": {
+      "title": "Start URLs",
+      "type": "array",
+      "description": "One or more URLs of the pages where the crawler will start. Note that the Actor will additionally only crawl sub-pages of these URLs. For example, for the start URL `https://www.example.com/blog`, it will crawl pages like `https://example.com/blog/article-1`, but will skip `https://example.com/docs/something-else`.",
+      "editor": "requestListSources",
+      "prefill": [{ "url": "https://docs.apify.com/" }]
+    },
+
+    // The input value is another Dataset. The system can generate a UI to make it easy to select the dataset.
+    "processDatasetId": {
+      "title": "Input dataset",
+      "type": "string",
+      "resourceType": "dataset",
+      "description": "Dataset to be processed by the Actor",
+      // Optional link to dataset schema, used by the system to validate the input dataset
+      "schema": "./input_dataset_schema.json"
+    },
+
+    "screenshotsKeyValueStoreId": {
+      "title": "Screenshots to process",
+      "type": "string",
+      "resourceType": "keyValueStore",
+      "description": "Screenshots to be compressed",
+      "schema": "./input_key_value_store_schema.json"
+    },
+
+    "singleFileUrl": {
+      "title": "Some file",
+      "type": "string",
+      "editor": "fileupload",
+      "description": "Screenshots to be compressed",
+      "schema": "./input_key_value_store_schema.json"
+    },
+
+    "crawlerType": {
+      "sectionCaption": "Crawler settings",
+      "title": "Crawler type",
+      "type": "string",
+      "enum": ["playwright:chrome", "cheerio", "jsdom"],
+      "enumTitles": ["Headless web browser (Chrome+Playwright)", "Raw HTTP client (Cheerio)", "Raw HTTP client with JS execution (JSDOM) (experimental!)"],
+      "description": "Select the crawling engine:\n- **Headless web browser** (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.\n- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.",
+      "default": "playwright:chrome"
+    },
+
+    "maxCrawlDepth": {
+      "title": "Max crawling depth",
+      "type": "integer",
+      "description": "The maximum number of links starting from the start URL that the crawler will recursively descend. The start URLs have a depth of 0, the pages linked directly from the start URLs have a depth of 1, and so on.\n\nThis setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl start URLs.",
+      "minimum": 0,
+      "default": 20
+    },
+
+    "maxCrawlPages": {
+      "title": "Max pages",
+      "type": "integer",
+      "description": "The maximum number pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.",
+      "minimum": 0,
+      "default": 9999999
    }
+
+  }
 }
 ```
 
 ## Random notes
 
-To make Actors easier to pipeline, we could add e.g.
-`dataset`, `keyValueStore` and `requestQueue` types, each optionally
-restricted by the referenced schema to make sure that selected storage is compatible.
 
-Another idea is to add type `actor`. The use case could be for example a testing Actor with 3 inputs:
+We could also add an `actor` resource type. The use case could be for example a testing Actor with three inputs:
 - Actor to be tested
 - test function containing for example Jest unit test over the output
 - input for the Actor
 
 ...and the testing Actor would call the given Actor with a given output and in the end execute tests if the results are correct.
 
-
-
-For example:
-
-```jsonc
-"inputDataset": {
-  "title": "Input dataset",
-  "type": "string",
-  "resourceType": "Dataset",
-  "schema": "./input_dataset_schema.json",
-  "description": "Dataset to be processed",
-},
-
-"inputScreenshots": {
-  "title": "Input screenshots",
-  "type": "string",
-  "resourceType": "KeyValueStore",
-  "description": "Screenshots to be compressed",
-  "schema": "./input_key_value_store_schema.json",
-  // Specify records groups from the schema that Actor is interested in.
-  // Note that a recordGroup can be a single file too!
-  "recordGroups": ["screenshots", "images"]
-}
-```
-
-This example would be rendered in Input UI as a search/dropdown that would only list named
-datasets or key-value stores with matching schema. This feature will make it easy to integrate Actors,
-and pipe results from one to another.
-Note from Franta: It would be cool to have an option in the dropdown to create a
-new dataset/key-value store with the right schema,
-if it's the first time you're running some Actor,
-and then in the next runs you could reuse it.
````
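Beyond UI hints like `editor` and `sectionCaption`, the example schema carries ordinary JSON-schema constraints (`type`, `minimum`, `default`) that a runner can apply before starting the Actor. A stdlib-only sketch of filling defaults and checking `minimum`; the helper name and its behavior are illustrative, not the platform's actual validation:

```python
def apply_input_defaults(schema: dict, actor_input: dict) -> dict:
    """Fill missing input properties from schema defaults and check integer minimums.

    Illustrative sketch only; a real validator would cover all JSON-schema keywords.
    """
    result = dict(actor_input)
    for name, prop in schema.get("properties", {}).items():
        if name not in result and "default" in prop:
            result[name] = prop["default"]  # apply the schema default
        if name in result and prop.get("type") == "integer":
            if "minimum" in prop and result[name] < prop["minimum"]:
                raise ValueError(f"{name} must be >= {prop['minimum']}")
    return result


schema = {
    "type": "object",
    "properties": {
        "maxCrawlDepth": {"type": "integer", "minimum": 0, "default": 20},
        "maxCrawlPages": {"type": "integer", "minimum": 0, "default": 9999999},
    },
}
print(apply_input_defaults(schema, {"maxCrawlDepth": 3}))
# → {'maxCrawlDepth': 3, 'maxCrawlPages': 9999999}
```

The `resourceType` properties would need an extra step on top of this: resolving the given ID to a platform resource and validating it against the referenced `schema` file.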
