Commit a32c56e

Merge branch 'master' into feat/kick-off-code-samples
2 parents 64df1b7 + 6873e01 commit a32c56e

22 files changed

+529
-18
lines changed

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
1+
type: object
2+
properties:
3+
error:
4+
type: object
5+
properties:
6+
type:
7+
type: string
8+
description: The type of the error.
9+
example: "schema-validation-error"
10+
message:
11+
type: string
12+
description: A human-readable message describing the error.
13+
example: "Schema validation failed"
14+
data:
15+
type: object
16+
properties:
17+
invalidItems:
18+
type: array
19+
description: A list of invalid items in the received array of items.
20+
items:
21+
type: object
22+
properties:
23+
itemPosition:
24+
type: number
25+
description: The position of the invalid item in the array.
26+
example: 2
27+
validationErrors:
28+
type: array
29+
description: A complete list of AJV validation error objects for the invalid item.
30+
items:
31+
type: object
32+
properties:
33+
instancePath:
34+
type: string
35+
description: The path to the instance being validated.
36+
schemaPath:
37+
type: string
38+
description: The path to the schema that failed the validation.
39+
keyword:
40+
type: string
41+
description: The validation keyword that caused the error.
42+
message:
43+
type: string
44+
description: A message describing the validation error.
45+
params:
46+
type: object
47+
description: Additional parameters specific to the validation error.
48+
required:
49+
- invalidItems
50+
required:
51+
- type
52+
- message
53+
- data
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
title: PutItemResponseError
2+
required:
3+
- error
4+
type: object
5+
properties:
6+
error:
7+
allOf:
8+
- $ref: ./DatasetSchemaValidationError.yaml
9+
- {}

apify-api/openapi/paths/datasets/datasets.yaml

Lines changed: 1 addition & 0 deletions
@@ -106,6 +106,7 @@ post:
106106
Keep in mind that data stored under unnamed dataset follows [data retention period](https://docs.apify.com/platform/storage#data-retention).
107107
It creates a dataset with the given name if the parameter name is used.
108108
If a dataset with the given name already exists then returns its object.
109+
109110
operationId: datasets_post
110111
parameters:
111112
- name: name

apify-api/openapi/paths/datasets/datasets@{datasetId}@items.yaml

Lines changed: 31 additions & 0 deletions
@@ -478,6 +478,10 @@ post:
478478
The POST payload is a JSON object or a JSON array of objects to save into the dataset.
479479
480480
481+
If the data you attempt to store in the dataset is invalid (meaning any of the items received by the API fails the validation), the whole request is discarded and the API will return a response with status code 400.
482+
For more information about dataset schema validation, see [Dataset schema](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema/validation).
483+
484+
481485
**IMPORTANT:** The limit of request payload size for the dataset is 5 MB. If the array exceeds the size, you'll need to split it into a number of smaller arrays.
482486
operationId: dataset_items_post
483487
parameters:
@@ -523,6 +527,33 @@ post:
523527
type: object
524528
example: {}
525529
example: {}
530+
'400':
531+
description: ''
532+
headers: {}
533+
content:
534+
application/json:
535+
schema:
536+
allOf:
537+
- $ref: >-
538+
../../components/schemas/datasets/PutItemResponseError.yaml
539+
- example:
540+
error:
541+
type: schema-validation-error
542+
message: Schema validation failed
543+
example:
544+
error:
545+
type: schema-validation-error
546+
message: Schema validation failed
547+
data:
548+
invalidItems:
549+
- itemPosition: 2
550+
validationErrors:
551+
- instancePath: /1/stringField
552+
schemaPath: /items/properties/stringField/type
553+
keyword: type
554+
params:
555+
type: string
556+
message: 'must be string'
526557
deprecated: false
527558
x-legacy-doc-urls:
528559
- https://docs.apify.com/api/v2#/reference/datasets/item-collection/put-items
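A client could handle the new `400` response along the following lines. This is only a sketch under stated assumptions: `postDatasetItems` is a hypothetical helper, the URL follows the path of this file (`datasets/{datasetId}/items`), the token is assumed to be sent as a bearer token, and `fetchImpl` is injected so the function can run against a stub instead of the real API.

```javascript
// Sketch: POST items to a dataset and surface schema validation failures.
// `fetchImpl` defaults to the global fetch (Node.js 18+); datasetId and token
// are placeholders supplied by the caller.
async function postDatasetItems(datasetId, token, items, fetchImpl = fetch) {
    const response = await fetchImpl(`https://api.apify.com/v2/datasets/${datasetId}/items`, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            Authorization: `Bearer ${token}`,
        },
        body: JSON.stringify(items),
    });

    if (response.status === 400) {
        const { error } = await response.json();
        // Only the documented validation error is handled here.
        if (error?.type === 'schema-validation-error') {
            return { ok: false, invalidItems: error.data.invalidItems };
        }
    }
    if (!response.ok) throw new Error(`Unexpected status ${response.status}`);
    return { ok: true, invalidItems: [] };
}
```

Returning the `invalidItems` array instead of throwing is a design choice for this sketch; it lets the caller decide whether to fix, drop, or log the rejected items.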

package-lock.json

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

package.json

Lines changed: 1 addition & 1 deletion
@@ -111,5 +111,5 @@
111111
"engines": {
112112
"node": ">=18.0.0"
113113
},
114-
"packageManager": "[email protected].1"
114+
"packageManager": "[email protected].2"
115115
}

sources/platform/actors/development/actor_definition/output_schema.md renamed to sources/platform/actors/development/actor_definition/dataset_schema/index.md

Lines changed: 1 addition & 1 deletion
@@ -118,7 +118,7 @@ The template above defines the configuration for the default dataset output view
118118

119119
The default behavior of the Output tab UI table is to display all fields from `transformation.fields` in the specified order. You can customize the display properties for specific formats or column labels if needed.
120120

121-
![Output tab UI](./images/output-schema-example.png)
121+
![Output tab UI](../images/output-schema-example.png)
122122

123123
## Structure
124124

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
1+
---
2+
title: Dataset validation
3+
description: Specify the dataset schema in your Actor so you can add monitoring and validation at the field level.
4+
slug: /actors/development/actor-definition/dataset-schema/validation
5+
---
6+
7+
**Specify the dataset schema in your Actor so you can add monitoring and validation at the field level.**
8+
9+
---
10+
11+
To define a schema for the default dataset of an Actor run, set the `fields` property in the dataset schema.
12+
13+
:::info
14+
15+
The schema defines a single item in the dataset. Be careful not to define the schema as an array; it always needs to be the schema of an object.
16+
17+
Schema configuration is not available for named datasets or dataset views.
18+
19+
:::
20+
21+
You can do that either directly in `actor.json`:
22+
23+
```json title=".actor.json"
24+
{
25+
"actorSpecification": 1,
26+
"storages": {
27+
"dataset": {
28+
"actorSpecification": 1,
29+
"fields": {
30+
"$schema": "http://json-schema.org/draft-07/schema#",
31+
"type": "object",
32+
"properties": {
33+
"name": {
34+
"type": "string"
35+
}
36+
},
37+
"required": ["name"]
38+
},
39+
"views": {}
40+
}
41+
}
42+
}
43+
```
44+
45+
Or in a separate file linked from `.actor.json`:
46+
47+
```json title=".actor.json"
48+
{
49+
"actorSpecification": 1,
50+
"storages": {
51+
"dataset": "./dataset_schema.json"
52+
}
53+
}
54+
```
55+
56+
```json title="dataset_schema.json"
57+
{
58+
"actorSpecification": 1,
59+
"fields": {
60+
"$schema": "http://json-schema.org/draft-07/schema#",
61+
"type": "object",
62+
"properties": {
63+
"name": {
64+
"type": "string"
65+
}
66+
},
67+
"required": ["name"]
68+
},
69+
"views": {}
70+
}
71+
```
72+
73+
:::important
74+
75+
The dataset schema must be a valid JSON Schema draft-07 document, so the `$schema` line must either have exactly this value or be omitted:
76+
77+
`"$schema": "http://json-schema.org/draft-07/schema#"`
78+
79+
:::
80+
81+
## Dataset validation
82+
83+
When you define a schema for your default dataset, every insert into the dataset is validated against that schema (we use [AJV](https://ajv.js.org/)).
84+
85+
If validation succeeds, nothing changes from the current behavior: the data is stored and an empty response with status code `201` is returned.
86+
87+
If the data you attempt to store in the dataset is _invalid_ (meaning any of the items received by the API fails validation), _the entire request is discarded_. The API returns a response with status code `400` and the following JSON body:
88+
89+
```json
90+
{
91+
"error": {
92+
"type": "schema-validation-error",
93+
"message": "Schema validation failed",
94+
"data": {
95+
"invalidItems": [{
96+
"itemPosition": "<array index in the received array of items>",
97+
"validationErrors": "<Complete list of AJV validation error objects>"
98+
}]
99+
}
100+
}
101+
}
102+
```
103+
104+
The type of the AJV validation error object is defined [in the AJV source](https://github.com/ajv-validator/ajv/blob/master/lib/types/index.ts#L86).
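
Each AJV error object carries fields such as `instancePath`, `keyword`, and `message`. As a rough sketch, assuming the response shape shown above, a hypothetical helper like `describeInvalidItems` (not part of any Apify library) could flatten `invalidItems` into readable lines. The sample values are illustrative only.

```javascript
// Hypothetical helper: turns the `invalidItems` array from the 400 response
// into one readable line per AJV validation error.
function describeInvalidItems(invalidItems) {
    const lines = [];
    for (const { itemPosition, validationErrors } of invalidItems) {
        for (const { instancePath, keyword, message } of validationErrors) {
            // instancePath may be an empty string, e.g. for errors on the root value.
            const where = instancePath || '(root)';
            lines.push(`item ${itemPosition}: ${where} ${message} [${keyword}]`);
        }
    }
    return lines;
}

// Illustrative payload shaped like the documented response.
const sampleInvalidItems = [
    {
        itemPosition: 2,
        validationErrors: [
            {
                instancePath: '/name',
                schemaPath: '#/properties/name/type',
                keyword: 'type',
                params: { type: 'string' },
                message: 'must be string',
            },
        ],
    },
];

console.log(describeInvalidItems(sampleInvalidItems).join('\n'));
```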
105+
106+
If you use the Apify JS client or Apify SDK and call the `pushData` function, you can access the validation errors in a `try...catch` block like this:
107+
108+
```javascript
109+
try {
110+
await Actor.pushData(items);
111+
} catch (error) {
112+
if (!error.data?.invalidItems) throw error;
113+
error.data.invalidItems.forEach((item) => {
114+
const { itemPosition, validationErrors } = item;
115+
});
116+
}
117+
```
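
Because the whole request is discarded, one possible follow-up is to resend only the items that passed validation. The sketch below assumes that `itemPosition` is a zero-based index into the array you sent and that the response lists every invalid item. `partitionByValidity` and `pushValidItemsOnly` are hypothetical helpers, and `pushData` is passed in as a parameter standing in for `Actor.pushData` so the code runs without the SDK.

```javascript
// Hypothetical helper: split items into those the API rejected and the rest.
// Assumes itemPosition is a zero-based index into the array that was sent.
function partitionByValidity(items, invalidItems) {
    const invalidPositions = new Set(invalidItems.map((entry) => entry.itemPosition));
    const valid = [];
    const invalid = [];
    items.forEach((item, index) => {
        (invalidPositions.has(index) ? invalid : valid).push(item);
    });
    return { valid, invalid };
}

// `pushData` stands in for Actor.pushData and is injected by the caller.
async function pushValidItemsOnly(pushData, items) {
    try {
        await pushData(items);
        return { stored: items, rejected: [] };
    } catch (error) {
        // Rethrow anything that is not a schema validation failure.
        if (!error.data?.invalidItems) throw error;
        const { valid, invalid } = partitionByValidity(items, error.data.invalidItems);
        if (valid.length > 0) await pushData(valid);
        return { stored: valid, rejected: invalid };
    }
}
```

Whether silently dropping rejected items is acceptable depends on the Actor; logging or storing them elsewhere may be the safer choice.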
118+
119+
## Examples of common types of validation
120+
121+
Optional field (price is optional in this case):
122+
123+
```json
124+
{
125+
"$schema": "http://json-schema.org/draft-07/schema#",
126+
"type": "object",
127+
"properties": {
128+
"name": {
129+
"type": "string"
130+
},
131+
"price": {
132+
"type": "number"
133+
}
134+
},
135+
"required": ["name"]
136+
}
137+
```
138+
139+
Field with multiple types:
140+
141+
```json
142+
{
143+
"price": {
144+
"type": ["string", "number"]
145+
}
146+
}
147+
```
148+
149+
Field with type `any`:
150+
151+
```json
152+
{
153+
"price": {
154+
"type": ["string", "number", "object", "array", "boolean"]
155+
}
156+
}
157+
```
158+
159+
Enabling fields to be `null`:
160+
161+
```json
162+
{
163+
"name": {
164+
"type": "string",
165+
"nullable": true
166+
}
167+
}
168+
```
169+
170+
Define the type of objects in an array:
171+
172+
```json
173+
{
174+
"comments": {
175+
"type": "array",
176+
"items": {
177+
"type": "object",
178+
"properties": {
179+
"author_name": {
180+
"type": "string"
181+
}
182+
}
183+
}
184+
}
185+
}
186+
```
187+
188+
Define specific fields, but allow anything else to be added to the item:
189+
190+
```json
191+
{
192+
"$schema": "http://json-schema.org/draft-07/schema#",
193+
"type": "object",
194+
"properties": {
195+
"name": {
196+
"type": "string"
197+
}
198+
},
199+
"additionalProperties": true
200+
}
201+
```
202+
203+
See the [JSON Schema reference](https://json-schema.org/understanding-json-schema/reference) for additional options.
204+
205+
You can also use [conversion tools](https://www.liquid-technologies.com/online-json-to-schema-converter) to convert an existing JSON document into its JSON schema.
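
To see how the `required` and `type` keywords from the examples above behave, here is a deliberately simplified checker. It is a stand-in written for illustration, not AJV and not how the platform validates: it only looks at top-level properties and ignores every other keyword (including `integer`, `nullable`, and `additionalProperties`). The names `jsonTypeOf` and `checkItem` are hypothetical.

```javascript
// Map a JavaScript value to a JSON Schema type name (simplified: no 'integer').
function jsonTypeOf(value) {
    if (value === null) return 'null';
    if (Array.isArray(value)) return 'array';
    return typeof value; // 'string', 'number', 'boolean', or 'object' for JSON data
}

// Check only `required` and `type` on top-level properties of one item.
function checkItem(schema, item) {
    const problems = [];
    for (const key of schema.required ?? []) {
        if (!(key in item)) problems.push(`missing required property '${key}'`);
    }
    for (const [key, rule] of Object.entries(schema.properties ?? {})) {
        if (!(key in item) || rule.type === undefined) continue;
        // `type` may be a single name or a list of allowed names.
        const allowed = Array.isArray(rule.type) ? rule.type : [rule.type];
        const actual = jsonTypeOf(item[key]);
        if (!allowed.includes(actual)) {
            problems.push(`'${key}' must be ${allowed.join(' or ')}, got ${actual}`);
        }
    }
    return problems;
}

// Schema combining the optional-field and multiple-types examples.
const productSchema = {
    type: 'object',
    properties: {
        name: { type: 'string' },
        price: { type: ['string', 'number'] },
    },
    required: ['name'],
};

console.log(checkItem(productSchema, { name: 'Laptop' }));
console.log(checkItem(productSchema, { price: true }));
```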
206+
207+
## Dataset field statistics
208+
209+
When you configure the dataset fields schema, we generate a field list and measure the following statistics:
210+
211+
- **Null count:** how many items in the dataset have the field set to null
212+
- **Empty count:** how many items in the dataset have the field `undefined`; an empty string, for example, is not considered empty
213+
- **Minimum and maximum**
214+
- For numbers, this is calculated directly
215+
- For strings, this field tracks string length
216+
- For arrays, this field tracks the number of items in the array
217+
- For objects, this tracks the number of keys
218+
- For booleans, this tracks whether the boolean was set to true. Minimum is always 0, but maximum can be either 1 or 0 based on whether at least one item in the dataset has the boolean field set to true.
219+
220+
221+
You can use these statistics in [monitoring](../../../../monitoring#alert-configuration).
222+
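
The rules listed above can be sketched in a few lines of JavaScript. This is an illustration of the definitions only, not the platform's implementation; `measure` and `fieldStatistics` are hypothetical names, and skipping `null` and `undefined` values when computing the minimum and maximum is an assumption.

```javascript
// Size used for min/max, following the rules listed above.
// Only called for values that are neither null nor undefined.
function measure(value) {
    if (typeof value === 'number') return value;
    if (typeof value === 'string') return value.length;
    if (typeof value === 'boolean') return value ? 1 : 0;
    if (Array.isArray(value)) return value.length;
    return Object.keys(value).length; // plain object: number of keys
}

function fieldStatistics(items, fieldName) {
    const stats = { nullCount: 0, emptyCount: 0, min: undefined, max: undefined };
    const record = (size) => {
        stats.min = stats.min === undefined ? size : Math.min(stats.min, size);
        stats.max = stats.max === undefined ? size : Math.max(stats.max, size);
    };
    for (const item of items) {
        const value = item[fieldName];
        if (value === null) { stats.nullCount += 1; continue; }
        if (value === undefined) { stats.emptyCount += 1; continue; } // '' is not empty
        if (typeof value === 'boolean') record(0); // minimum is always 0 for booleans
        record(measure(value));
    }
    return stats;
}

// Illustrative data only.
const sampleItems = [
    { name: 'Laptop', price: 999, inStock: true },
    { name: '', price: null },
    { name: 'Phone', inStock: true },
];

for (const field of ['name', 'price', 'inStock']) {
    console.log(field, fieldStatistics(sampleItems, field));
}
```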
