# Preprocess with Analysis Result Design

This document describes the design of how to use analysis results while
preprocessing feature inputs.

## Motivation

Before preprocessing the feature inputs, we need to analyze the training
dataset to collect feature statistics. For example, we need the mean and
standard deviation to normalize a numeric value, a `vocabulary` to map a
string value to an integer id, and a `boundary` to discretize a numeric value.
With SQLFlow, the training dataset is usually a table stored in MySQL,
MaxCompute, or another database, so we can use SQL to analyze the training
table. In the
[data transformation pipeline](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/data_transform.md),
we may launch a pod to analyze the training table and then submit the
ElasticDL training job. So, this design addresses how to pass the analysis
results into the pods of an ElasticDL training job.

The SQL expressions in this section use MaxCompute SQL syntax; `PERCENTILE` is
a function in MaxCompute SQL.
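
For instance, a query along the following lines (a sketch only; the exact
queries in this design may differ) could collect the numeric statistics and
percentile boundaries for `age`:

```sql
SELECT
    MIN(age) AS age_min,
    MAX(age) AS age_max,
    AVG(age) AS age_mean,
    STDDEV(age) AS age_std_dev,
    PERCENTILE(age, 0.25) AS age_p25,
    PERCENTILE(age, 0.50) AS age_p50,
    PERCENTILE(age, 0.75) AS age_p75
FROM ${training_table};
```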

For a feature to hash with a bucket size, we can get the count of distinct
values by

```sql
SELECT COUNT(DISTINCT(marital)) AS marital_distinct_count FROM ${training_table}
```

For a feature to look up with a vocabulary, we can get the vocabulary by

```sql
SELECT value FROM (
    SELECT education AS value, COUNT(education) AS _count
    FROM ${training_table}
    GROUP BY education
)
WHERE _count >= {threshold};
```

The `WHERE _count >= {threshold}` clause filters out the values whose count is
less than the threshold, to avoid overfitting on rare values.

Besides the vocabularies, the other analysis results are a single number or a
list of numbers, such as bucket boundaries. So we can save them into a table
like:

| feature_stats | value |
| ---- | --- |
| age_min | 10 |
| age_max | 90 |
| age_mean | 44.75 |
| age_std_dev | 56.6875 |
| age_bucket_boundaries | 30,40,50 |
| marital-status-count | 2 |

Because the vocabulary size may be huge, we cannot save it into a record like:

| feature_stats | value |
| ---- | --- |
| education_vocab | Master,Doctor,Bachelor |
| marital_vocab | Divorced,Never-married |

So, we save each vocabulary into its own column, with one element per record,
like

| education | marital |
| ---- | --- |
| Master | Divorced |
| Doctor | Never-married |
| Bachelor | |

After analysis, we get two tables with the analysis results. One is the
statistics table, which saves the mean, standard deviation, bucket boundaries,
and distinct counts. The other is the vocabulary table, which saves the
vocabularies.

### Pass analysis results to build a model in training pods

For the values in the statistics table, we can write them into environment
variables for the training pod to use when building the model. For example:
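
Below is a sketch of what those environment variables might look like; the
variable names are illustrative, derived from the statistics table above:

```shell
export _AGE_MEAN=44.75
export _AGE_STD_DEV=56.6875
export _AGE_BUCKET_BOUNDARIES=30,40,50
export _MARITAL_DISTINCT_COUNT=2
```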

Using the default values in `analyzer_utils`, users can debug the model
without analysis. So, we can define the preprocessing layers like:

```python
import os

from elasticdl_preprocessing.layers import Discretization, Hashing

# NOTE: the middle of this snippet is a reconstruction. The environment
# variable names and the `bins`/`num_bins` arguments are assumptions; the
# defaults mirror the statistics table above so the model also runs
# without analysis.
discretize_layer = Discretization(
    bins=[
        float(b)
        for b in os.getenv("_AGE_BUCKET_BOUNDARIES", "30,40,50").split(",")
    ]
)

hash_layer = Hashing(
    num_bins=int(os.getenv("_MARITAL_DISTINCT_COUNT", "2"))
)
```

For the values in the vocabulary table, we cannot put the vocabularies in
environment variables because a vocabulary may be huge. Instead, we can save
each vocabulary in shared storage such as GlusterFS and write its path into
the environment variables of a training pod. After the training job completes,
we can clear the vocabulary files from the storage.
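
For example, the pod environment might carry only the file paths (the paths
and variable names here are illustrative):

```shell
export _EDUCATION_VOCAB_PATH=/mnt/glusterfs/${job_name}/education_vocab.txt
export _MARITAL_VOCAB_PATH=/mnt/glusterfs/${job_name}/marital_vocab.txt
```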