Commit 9ecd6ef

docs: add dataset schema validation
1 parent bc9a6e8 commit 9ecd6ef

3 files changed: +324 −2 lines changed

sources/platform/actors/development/actor_definition/output_schema.md renamed to sources/platform/actors/development/actor_definition/dataset_schema/index.md

Lines changed: 1 addition & 1 deletion
@@ -118,7 +118,7 @@ The template above defines the configuration for the default dataset output view
The default behavior of the Output tab UI table is to display all fields from `transformation.fields` in the specified order. You can customize the display properties for specific formats or column labels if needed.

-![Output tab UI](./images/output-schema-example.png)
+![Output tab UI](../images/output-schema-example.png)

## Structure

sources/platform/actors/development/actor_definition/dataset_schema/validation.md

Lines changed: 312 additions & 0 deletions
@@ -0,0 +1,312 @@
---
title: Dataset validation
description: Specify the dataset schema within your Actors so you can add monitoring and validation down to the field level.
slug: /actors/development/actor-definition/dataset-schema/validation
---

**Specify the dataset schema within your Actors so you can add monitoring and validation down to the field level.**

---

To define a schema for the default dataset of an Actor run, you need to set the `fields` property in the dataset schema. It is currently not possible to set a schema for a named dataset (the same applies to dataset views).

:::info

The schema defines a single item in the dataset. Be careful not to define the schema as an array; it always needs to be the schema of an object.

:::

You can either do that directly in `.actor.json` like this:

```json title=".actor.json"
{
    "actorSpecification": 1,
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "fields": {
                "$schema": "http://json-schema.org/draft-07/schema#",
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string"
                    }
                },
                "required": ["name"]
            },
            "views": {}
        }
    }
}
```

Or in a separate file like this:

```json title=".actor.json"
{
    "actorSpecification": 1,
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
```

```json title="dataset_schema.json"
{
    "actorSpecification": 1,
    "fields": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "name": {
                "type": "string"
            }
        },
        "required": ["name"]
    },
    "views": {}
}
```

:::important

The `$schema` line must be exactly this value, or it must be omitted entirely:

`"$schema": "http://json-schema.org/draft-07/schema#"`

:::

## Dataset validation

When you define a schema for your default dataset, the schema is always used to validate the data you insert into the dataset (we use [AJV](https://ajv.js.org/)).

If the validation succeeds, nothing changes from the current behavior: the data is stored and an empty response with status code 201 is returned.

**If the data you attempt to store in the dataset is invalid** (meaning any of the items received by the API fails the validation), **the whole request is discarded**, and the API returns a response with status code 400 and the following JSON body:

```json
{
    "error": {
        "type": "schema-validation-error",
        "message": "Schema validation failed",
        "data": {
            "invalidItems": [{
                "itemPosition": "<array index in the received array of items>",
                "validationErrors": "<Complete list of AJV validation error objects>"
            }]
        }
    }
}
```

The type of the AJV validation error object is documented [here](https://github.com/ajv-validator/ajv/blob/master/lib/types/index.ts#L86).

If you use the Apify JS client or the Apify SDK and call the `pushData` function, you can access the validation errors in a `try...catch` block like this:
```javascript
import { Actor } from 'apify';

try {
    // `items` is the array of results you are about to store.
    await Actor.pushData(items);
} catch (error) {
    // Re-throw anything that is not a schema validation error.
    if (!error.data?.invalidItems) throw error;
    error.data.invalidItems.forEach((item) => {
        // `itemPosition` is the index within the pushed array,
        // `validationErrors` is the list of AJV error objects for that item.
        const { itemPosition, validationErrors } = item;
    });
}
```
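Because the platform validates items with AJV against a draft-07 schema, you can run the same check locally before calling `pushData` to catch invalid items early. Below is a minimal sketch of that idea, assuming the `ajv` package is installed; `datasetFieldsSchema` and `items` are illustrative placeholders rather than part of the Apify API:

```javascript
// Minimal sketch: pre-validate items locally with AJV before pushing them.
// `datasetFieldsSchema` mirrors the `fields` object from the dataset schema.
import Ajv from 'ajv';

const datasetFieldsSchema = {
    $schema: 'http://json-schema.org/draft-07/schema#',
    type: 'object',
    properties: {
        name: { type: 'string' },
        price: { type: 'number' },
    },
    required: ['name'],
};

const ajv = new Ajv();
const validateItem = ajv.compile(datasetFieldsSchema);

// Illustrative items: the second one is missing the required `name` field.
const items = [
    { name: 'Pixel 8', price: 499 },
    { price: 499 },
];

for (const item of items) {
    if (!validateItem(item)) {
        // `validateItem.errors` holds AJV error objects, the same shape
        // the API returns in `validationErrors`.
        console.warn('Invalid item, not pushing:', validateItem.errors);
    }
}
```

Keep in mind this is only a convenience check; the API response described above remains the authoritative validation.
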
## Examples

An optional field (`price` is optional in this case):

```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        },
        "price": {
            "type": "number"
        }
    },
    "required": ["name"]
}
```

A field with multiple types:

```json
{
    "price": {
        "type": ["string", "number"]
    }
}
```

A field with type `any`:

```json
{
    "price": {
        "type": ["string", "number", "object", "array", "boolean"]
    }
}
```

Enabling fields to be `null`:

```json
{
    "name": {
        "type": "string",
        "nullable": true
    }
}
```

Defining the type of objects in an array:

```json
{
    "comments": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "author_name": {
                    "type": "string"
                }
            }
        }
    }
}
```

Defining specific fields, but allowing anything else to be added to the item:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        }
    },
    "additionalProperties": true
}
```

See the [JSON schema reference](https://json-schema.org/understanding-json-schema/reference) for additional options.

You can find an example schema generator [here](https://www.liquid-technologies.com/online-json-to-schema-converter).

## Dataset field statistics

When you have the dataset fields schema set up, we use the schema to generate a list of fields and measure statistics for these fields.

The measured statistics are the following:

- **Null count:** how many items in the dataset have the field set to `null`.
- **Empty count:** how many items in the dataset have the field `undefined`, which means that, for example, an empty string is not counted as empty.
- **Minimum and maximum** (see the sketch after this list):
  - For numbers, this is calculated directly.
  - For strings, this field tracks the string length.
  - For arrays, this field tracks the number of items in the array.
  - For objects, this tracks the number of keys.

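
To make the per-type semantics concrete, here is a small illustrative sketch (not the platform's actual implementation) of how a single field value could be reduced to the number that the minimum and maximum statistics track:

```javascript
// Illustrative only: reduce a field value to the number tracked by min/max.
const measure = (value) => {
    if (typeof value === 'number') return value; // numbers: the value itself
    if (typeof value === 'string') return value.length; // strings: string length
    if (Array.isArray(value)) return value.length; // arrays: number of items
    if (value !== null && typeof value === 'object') {
        return Object.keys(value).length; // objects: number of keys
    }
    return undefined; // null or undefined values are not measured
};

console.log(measure(42)); // 42
console.log(measure('Pixel 8')); // 7
console.log(measure([1, 2, 3])); // 3
console.log(measure({ width: 10, height: 20 })); // 2
```
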
:::note

Currently, you cannot view these statistics. We will add an API endpoint for them soon, but you can already use them in monitoring.

:::

## Examples

For this schema:
```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        },
        "description": {
            "type": "string"
        },
        "dimensions": {
            "type": "object",
            "nullable": true,
            "properties": {
                "width": {
                    "type": "number"
                },
                "height": {
                    "type": "number"
                }
            },
            "required": ["width", "height"]
        },
        "price": {
            "type": ["string", "number"]
        }
    },
    "required": ["name", "price"]
}
```
The stored statistics and fields in the database look like this:

```json
{
    "_id": "1lVGVBkWIhSYPY1dD",
    "fields": [
        "name",
        "description",
        "dimensions",
        "dimensions/width",
        "dimensions/height",
        "price"
    ],
    "stats": {
        "description": {
            "emptyCount": 105,
            "max": 19,
            "min": 19
        },
        "dimensions": {
            "emptyCount": 144,
            "max": 2,
            "min": 2,
            "nullCount": 86
        },
        "dimensions/height": {
            "emptyCount": 230,
            "max": 992,
            "min": 18
        },
        "dimensions/width": {
            "emptyCount": 230,
            "max": 977,
            "min": 4
        },
        "name": {
            "max": 13,
            "min": 11
        },
        "price": {
            "max": 999,
            "min": 1
        }
    }
}
```

:::note

If you want to see it for yourself, check the `datasetStatistics` collection. The IDs correspond to the IDs of the datasets.

:::

sources/platform/monitoring/index.md

Lines changed: 11 additions & 1 deletion
@@ -41,12 +41,22 @@ Currently, the monitoring option offers the following features:
### Alert configuration

-When you set up an alert, you have two choices for how you want the metrics to be evaluated. And depending on your choices, the alerting system will behave differently:
+When you set up an alert, you have three choices for how you want the metrics to be evaluated, and depending on your choice, the alerting system will behave differently:

1. **Alert, when the metric is lower than** - This type of alert is checked after the run finishes. If the metric is lower than the value you set, the alert will be triggered and you will receive a notification.

2. **Alert, when the metric is higher than** - This type of alert is checked both during the run and after the run finishes. During the run, we do periodic checks (approximately every 5 minutes) so that we can notify you as soon as possible if the metric is higher than the value you set. After the run finishes, we do a final check to make sure that the metric does not go over the limit in the last few minutes of the run.

+3. **Alert, when run status is one of following** - This type of alert is checked only after the run finishes. It makes it possible to track the status of your finished runs and send an alert if a run finishes in a state you do not expect. If your Actor runs very often and suddenly starts failing, you will receive a single alert within 1 minute of the first failed run, and then an aggregated alert every 15 minutes.
+
+4. **Alert for dataset field statistics** - If you have a [dataset schema](../actors/development/actor_definition/dataset_schema/validation.md) set up, you can use the field statistics to set up an alert. For example, you can use field statistics to track whether some field is filled in for all records, whether some numeric value is too low or too high (for example, when tracking the price of a product across multiple sources), or whether the number of items in an array is too low or too high (for example, to alert on an Instagram Actor when a post has a lot of comments), and many other tasks like these.
+
+:::important
+
+Available dataset fields are taken from the last successful build of the monitored Actor. If different versions have different fields, the solution currently displays only those from the default version.
+
+:::
+

![Metric condition configuration](./images/metric-options.png)

You can get notified by email, Slack, or in Apify Console. If you use Slack, we suggest using Slack notifications instead of email because they are more reliable, and you can also get notified quicker.
