
Commit 74d072f

data schema
1 parent f11177d commit 74d072f

File tree: 5 files changed, +112 -36 lines

Binary file (21.7 KB) not shown.

md-docs/user_guide/data_schema.md

Lines changed: 76 additions & 22 deletions
@@ -1,24 +1,78 @@
 # Data Schema
 
-Data schema is the core attribute of a Machine Learning **Task**.
-It specifies the name of the quantities, their data type and role.
-A Data Schema is a collection of *Column* entity that has the following attributes:
-
-- `name`: name of the quantity. For Tabular data is the name of the column in the table.
-- `data type`: as the name suggests, it is the type data of the quantity: float, integer, categorical, array, ...
-- `nullable`: if the quantity is allowed to be missing
-- `role`: specifies what this quantity represents, it can be:
-    - INPUT
-    - PREDICTION
-    - TARGET
-    - ID
-    - TIME_ID
-- `dims`: if the data type is array, it represents its dimensions, for instance an image is an ARRAY_3 with dimensions (1920, 1080, 3)
-
-The data schema must always have:
-
-- one Column with role ID that is the unique identifier of samples used to recognize them
-- one Column with role TIME_ID that is used to temporally order the samples
-- at least one Column with role INPUT
-- one Column with role TARGET
-- one Column for each model created in the Task (this Column is added automatically by the application and have the following name `MODEL_NAME@MODEL_VERSION`)
+The Data Schema contains all the information about the data in the [Task]. It is created at the beginning and is immutable.
+
+!!! tip
+    A Data Schema can easily be created starting from a template in the Web App. After creating a Task, go to the Data Schema page to see a precompiled version of the Data Schema, then update it and insert new Columns to create your custom version.
+
+A Data Schema is composed of a list of objects named _Column_ that represent each data entity in the Task.
+The number and type of Column objects depend on the Task type and Task data structure.
+
+A Column object has some mandatory attributes and others that depend on its role or data type:
+
+| Attribute | Description | Mandatory |
+| --------- | ------- | ------ |
+| Name | Name of the entity, used to read it from raw data. For instance, in Tabular tasks it is the name of the column in the CSV file. | Mandatory |
+| Data type | Data format of the entity. Possible values are <br><ul><li>Float: numeric value</li><li>Categorical: entity that can assume only a specified set of values. A Categorical Column requires the attribute _possible_values_ to be specified.</li><li>String: generic textual data like text input or a customer id. Not to be used for categorical columns.</li><li>Array 1: one-dimensional array. Requires the _dims_ attribute, a list of 1 element \[n\] that specifies the number of elements of the array.</li><li>Array 2: two-dimensional array. Requires the _dims_ attribute, a list of 2 elements \[n, m\] that specifies the number of elements of each dimension of the array.</li><li>Array 3: three-dimensional array. Requires the _dims_ attribute, a list of 3 elements \[n, m, k\] that specifies the number of elements of each dimension of the array.</li></ul> | Mandatory |
+| Role | Defines the role the Column object has in the Task. Depending on the Task type, some roles are required and others are not allowed. More information in the following sections. | Mandatory |
+| Subrole | Additional specification of the role in the Task. Some entities belong to the same Role but have different meanings; the Subrole distinguishes between them. More information in the following sections. | Depends on Task Type |
+| Is Nullable | Whether the entity allows missing values. | Mandatory |
+| Dims | List with the number of elements each dimension of the array has. The value -1 indicates that a dimension can have an arbitrary number of elements. | Required when Data Type is Array |
+| Possible values | List of values the categorical variable can assume. They can be either strings or numbers. When Task Type is Classification Multilabel and Role is Target, possible values must be \[0, 1\], indicating the presence or absence of that class. | Mandatory when Column Data Type is Categorical |
+| Classes Names | Names of the classes in the Task. The length of this list must match the Dims of the array. | Required when Column Role is Target and Task Type is Classification Multilabel. |
+| Image Mode | Type of image; it can be RGB, RGBA or GRAYSCALE. It also determines the Data Type, which is Array 3 for RGB and RGBA and Array 2 for GRAYSCALE. | Required when Column Role is Input and Data Structure is Image. |
+
+
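To make the attribute table concrete, here is a minimal sketch of a Data Schema for a tabular binary classification task, written as plain Python dictionaries. The field names and values are hypothetical and only mirror the attributes above; they are not the actual ML cube Platform SDK objects.

```python
# Hypothetical, plain-dict sketch of the Column attributes described above;
# not the actual ML cube Platform SDK representation.
data_schema = [
    {"name": "sample_id", "data_type": "string", "role": "id", "is_nullable": False},
    {"name": "timestamp", "data_type": "float", "role": "time_id", "is_nullable": False},
    {"name": "age", "data_type": "float", "role": "input", "is_nullable": True},
    {
        "name": "segment",
        "data_type": "categorical",
        "role": "input",
        "is_nullable": False,
        "possible_values": ["retail", "business"],  # mandatory for categorical columns
    },
    {
        "name": "churned",
        "data_type": "categorical",
        "role": "target",
        "is_nullable": False,
        "possible_values": [0, 1],  # binary target: exactly two possible values
    },
]
```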
+## Role
+
+The Role defines what the Column object represents for the Task.
+Roles are used by ML cube Platform to make correct use of the provided data.
+Some Roles are needed to uniquely identify a sample, others to retrieve the correct information.
+Moreover, some Roles must be inserted by you when first creating the Data Schema, while others, like the model predictions, are created automatically by ML cube Platform.
+
+User-defined roles are:
+
+| Role | Data Type | Description | Mandatory |
+|--|--|--|--|
+| ID | String | Unique identifier of the sample. It is used during data validation to avoid duplicated data and to exchange information about the data with you without sending the actual data. | It must always be present when sending data to ML cube Platform. |
+| Time ID | Float | Timestamp of the sample expressed in seconds (for that reason it is a Float). It is used to temporally order samples, maintaining coherence in the analyses of ML cube Platform. | It must always be present when sending data to ML cube Platform. |
+| Input | Any available Data Type | Represents input data, like a single feature in Tabular tasks, the image in Image tasks or the text in Text tasks. | According to the Task Type, the number of Input Column objects varies from one to unlimited. See Section [Data schema templates](data_schema.md#data-schema-templates). |
+| Target | Any available Data Type. It must be coherent with the Task Type. | Represents the true value of the sample in supervised tasks. | It is mandatory for supervised tasks. |
+| Input additional embedding | Array 1 | Embedding vector of the Input Column. It is allowed only when the Data Structure of the Task is Image or Text. When this Column object is present, ML cube Platform uses it as the numerical representation of the data; otherwise, it uses an internal embedding algorithm. | It is optional, since it depends on your choice to share this type of data with ML cube Platform. |
+| Target additional embedding | Array 1 | Embedding vector of the Target Column. It is allowed only when the Task Type is RAG. When this Column object is present, ML cube Platform uses it as the numerical representation of the data; otherwise, it uses an internal embedding algorithm. | It is optional, since it depends on your choice to share this type of data with ML cube Platform. |
+
+Roles defined by ML cube Platform are:
+
+| Role | Data Type | Description |
+| -- | -- | -- |
+| Prediction | Same Data Type as the Target Column | Prediction Column object automatically created when a Task [Model] is created. The name has the fixed template: <MODEL_NAME\>\@<MODEL_VERSION\>. |
+| Prediction additional embedding | Array 1 | Embedding vector of the Prediction Column. It is allowed only when the Task Type is RAG. It is created automatically by ML cube Platform if a Column object with Role Target additional embedding is present, and it is used as the numerical representation of the data; otherwise, an internal embedding algorithm is used. The name has the fixed template: <MODEL_NAME\>_embeddings\@<MODEL_VERSION\>. |
+
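The fixed naming templates for these platform-created columns can be illustrated with a short snippet; the model name and version below are made up:

```python
# Made-up model identifiers, used only to show the fixed naming templates
# for the columns that ML cube Platform creates automatically.
model_name = "churn_model"
model_version = "v1"

prediction_column = f"{model_name}@{model_version}"            # Prediction role
embedding_column = f"{model_name}_embeddings@{model_version}"  # Prediction additional embedding

print(prediction_column)  # churn_model@v1
print(embedding_column)   # churn_model_embeddings@v1
```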
+## Subrole
+
+Some Tasks can have different data entities for the same Role; the Column object's Subrole attribute helps to specify the correct type of data.
+
+| Subrole | Associated Role | Data Type | Description |
+| --|--|--|--|
+| RAG User Input | INPUT | String | In RAG Tasks, it is the user query submitted to the system. |
+| RAG Retrieved Context | INPUT | String | In RAG Tasks, it contains the retrieved contexts (separated with the Task attribute *context separator*) that the retrieval system has selected to answer the query. |
+| Model probability | PREDICTION | Depends on Task Type:<br><ul><li>RAG: Array 1</li><li>Classification Binary: Float</li><li>Classification Multiclass: Array 1</li><li>Classification Multilabel: Array 1</li></ul> | It is automatically created by ML cube Platform when the created Model has the flag additional probabilistic output set to True. The name has the fixed template: <MODEL_NAME\>_probability\@<MODEL_VERSION\>. |
+| Object detection prediction label | PREDICTION | Array 1 | It is automatically created when the Task Type is Object detection. It is an array with length equal to the number of predicted bounding boxes, where each element contains the class label assigned to the bounding box. The name has the fixed template: <MODEL_NAME\>_predicted_labels\@<MODEL_VERSION\>. |
+| Object detection target label | TARGET | Array 1 | It is mandatory when the Task Type is Object detection. It is an array with length equal to the number of actual bounding boxes, where each element contains the class label assigned to the bounding box. |
+
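The Model probability and object detection Subroles above translate into concrete per-sample shapes. The sketch below uses made-up numbers and assumes a three-class multiclass task, a three-label multilabel task and an object detection task:

```python
import numpy as np

# Made-up values illustrating the Model probability Data Types per Task Type.
binary_probability = 0.83                           # Classification Binary: Float
multiclass_probability = np.array([0.1, 0.7, 0.2])  # Array 1, one entry per class
multilabel_probability = np.array([0.9, 0.2, 0.6])  # Array 1, one entry per label

# Multiclass probabilities sum to 1; multilabel entries are independent.
assert np.isclose(multiclass_probability.sum(), 1.0)

# Object detection label Subroles: one class label per bounding box.
predicted_labels = np.array([2, 0, 0])  # three predicted boxes
target_labels = np.array([2, 0])        # two ground-truth boxes
```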
+## Data schema constraints
+
+Each combination of Task Type and Data Structure leads to different Data Schema requirements that must be satisfied when the Data Schema is created for the Task.
+For instance, an image binary classification task requires exactly one Input Column object with image data type, and its Target Column object must be categorical with only two possible values.
+
+Here is the list of constraints on quantities for each Role:
+
+{{ read_excel('../tables/data schema validation.xlsx', engine='openpyxl', sheet_name='qts') }}
+
+Here is the list of constraints on Data Types for each Role:
+
+{{ read_excel('../tables/data schema validation.xlsx', engine='openpyxl', sheet_name='types') }}
+
+
+[Task]: task.md
+[Model]: model.md
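The two `read_excel` calls above are rendered by the table-reader and macros plugins this commit adds to mkdocs.yml and pyproject.toml (see the diffs below). As a rough way to preview the same tables outside the docs build, one could read them with plain pandas, reusing the path and sheet names from the macro calls:

```python
import pandas as pd

# Preview the sheets that the {{ read_excel(...) }} macros render in the docs.
# Path and sheet names are taken verbatim from the macro calls above.
for sheet in ("qts", "types"):
    table = pd.read_excel(
        "../tables/data schema validation.xlsx",
        engine="openpyxl",
        sheet_name=sheet,
    )
    print(table.to_markdown(index=False))
```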

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ plugins:
       minify_html: true
   - glightbox
   - table-reader
+  - macros
 
 # Extensions
 markdown_extensions:

pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -18,6 +18,9 @@ dependencies = [
     "mkdocs-minify-plugin>=0.7.1",
     "mkdocs-glightbox>=0.3.4",
     "mkdocs-table-reader-plugin>=2.0.1",
+    "mkdocs-macros-plugin",
+    "openpyxl",
+    "pandas",
 ]
 
 [project.optional-dependencies]

requirements.txt

Lines changed: 32 additions & 14 deletions
@@ -14,6 +14,8 @@ anyio==4.4.0
     # via
     #   httpx
     #   jupyter-server
+appnope==0.1.4
+    # via ipykernel
 argon2-cffi==23.1.0
     # via jupyter-server
 argon2-cffi-bindings==21.2.0
@@ -49,12 +51,7 @@ charset-normalizer==3.3.2
 click==8.1.7
     # via mkdocs
 colorama==0.4.6
-    # via
-    #   click
-    #   ipython
-    #   mkdocs
-    #   mkdocs-material
-    #   tqdm
+    # via mkdocs-material
 comm==0.2.2
     # via
     #   ipykernel
@@ -76,6 +73,8 @@ dill==0.3.7
     #   datasets
     #   evaluate
     #   multiprocess
+et-xmlfile==1.1.0
+    # via openpyxl
 evaluate==0.4.2
     # via ml3-platform-docs (pyproject.toml)
 executing==2.0.1
@@ -103,6 +102,10 @@ ghp-import==2.1.0
     # via mkdocs
 h11==0.14.0
     # via httpcore
+hjson==3.1.0
+    # via
+    #   mkdocs-macros-plugin
+    #   super-collections
 htmlmin2==0.1.13
     # via mkdocs-minify-plugin
 httpcore==1.0.5
@@ -149,6 +152,7 @@ jinja2==3.1.3
     #   jupyterlab
     #   jupyterlab-server
     #   mkdocs
+    #   mkdocs-macros-plugin
     #   mkdocs-material
     #   nbconvert
     #   torch
@@ -234,11 +238,14 @@ mistune==3.0.2
 mkdocs==1.5.3
     # via
     #   ml3-platform-docs (pyproject.toml)
+    #   mkdocs-macros-plugin
     #   mkdocs-material
     #   mkdocs-minify-plugin
     #   mkdocs-table-reader-plugin
 mkdocs-glightbox==0.3.7
     # via ml3-platform-docs (pyproject.toml)
+mkdocs-macros-plugin==1.3.6
+    # via ml3-platform-docs (pyproject.toml)
 mkdocs-material==9.5.17
     # via ml3-platform-docs (pyproject.toml)
 mkdocs-material-extensions==1.3.1
@@ -294,6 +301,8 @@ numpy==1.26.4
     #   sentence-transformers
     #   torchvision
     #   transformers
+openpyxl==3.1.5
+    # via ml3-platform-docs (pyproject.toml)
 overrides==7.7.0
     # via jupyter-server
 packaging==24.0
@@ -307,6 +316,7 @@ packaging==24.0
     #   jupyterlab
     #   jupyterlab-server
     #   mkdocs
+    #   mkdocs-macros-plugin
     #   nbconvert
     #   qtconsole
     #   qtpy
@@ -315,6 +325,7 @@ paginate==0.5.6
     # via mkdocs-material
 pandas==2.2.1
     # via
+    #   ml3-platform-docs (pyproject.toml)
     #   datasets
     #   evaluate
     #   mkdocs-table-reader-plugin
@@ -323,7 +334,11 @@ pandocfilters==1.5.1
 parso==0.8.4
     # via jedi
 pathspec==0.12.1
-    # via mkdocs
+    # via
+    #   mkdocs
+    #   mkdocs-macros-plugin
+pexpect==4.9.0
+    # via ipython
 pillow==10.3.0
     # via
     #   ml3-platform-docs (pyproject.toml)
@@ -345,6 +360,10 @@ psutil==5.9.8
     # via
     #   accelerate
     #   ipykernel
+ptyprocess==0.7.0
+    # via
+    #   pexpect
+    #   terminado
 pure-eval==0.2.2
     # via stack-data
 pyarrow==16.1.0
@@ -374,25 +393,20 @@ python-dateutil==2.9.0.post0
     #   arrow
     #   ghp-import
     #   jupyter-client
+    #   mkdocs-macros-plugin
     #   pandas
 python-json-logger==2.0.7
     # via jupyter-events
 pytz==2024.1
     # via pandas
-pywin32==306
-    # via jupyter-core
-pywinpty==2.0.13
-    # via
-    #   jupyter-server
-    #   jupyter-server-terminals
-    #   terminado
 pyyaml==6.0.1
     # via
     #   accelerate
     #   datasets
     #   huggingface-hub
     #   jupyter-events
     #   mkdocs
+    #   mkdocs-macros-plugin
     #   mkdocs-table-reader-plugin
     #   pymdown-extensions
     #   pyyaml-env-tag
@@ -473,12 +487,16 @@ soupsieve==2.5
     # via beautifulsoup4
 stack-data==0.6.3
     # via ipython
+super-collections==0.5.3
+    # via mkdocs-macros-plugin
 sympy==1.12.1
     # via torch
 tabulate==0.9.0
     # via
     #   mkdocs-table-reader-plugin
     #   ml3-platform-sdk
+termcolor==2.5.0
+    # via mkdocs-macros-plugin
 terminado==0.18.1
     # via
     #   jupyter-server
