# Preprocess with Analysis Result Design

This document describes the design for using analysis results while preprocessing feature inputs.

## Motivation

Before preprocessing the feature inputs, we need to analyze the training dataset to collect statistical results for the features. For example, we need the mean and standard deviation to normalize a numeric value, a `vocabulary` to look up a string value and map it to an integer id, and a `boundary` to discretize a numeric value. Using SQLFlow, the training dataset is usually a table saved in MySQL, MaxCompute, or another database, so we can use SQL to analyze the training table. In the [data transformation pipeline](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/data_transform.md), we may launch a pod to analyze the training table and then submit the ElasticDL training job. So, this design solves how to pass the analysis results into the pods of an ElasticDL training job.

## Define preprocess layers with analysis result

### 1. Persist the analysis result collected in the analysis pod

For a MySQL or MaxCompute table, we can use SQL to analyze each column. For example, the table is

| age | education | marital |
| ---- | --- | --- |
| 42 | Bachelor | Never-married |
| 49 | Bachelor | Divorced |

For a numeric column, we can get the min, max, mean, standard deviation, and bucket boundaries using

```sql
SELECT
    MIN(age) AS age_min,
    MAX(age) AS age_max,
    AVG(age) AS age_avg,
    STDDEV(age) AS age_stddev,
    PERCENTILE(age, ARRAY(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) AS age_boundaries
FROM ${training_table}
```

The SQL expression uses MaxCompute SQL syntax; `PERCENTILE` is a MaxCompute SQL function.

For a feature to hash into a number of buckets, we can get the count of distinct values by

```sql
SELECT COUNT(DISTINCT marital) AS marital_distinct_count FROM ${training_table}
```

For a feature to look up with a vocabulary, we can get the vocabulary by

```sql
SELECT value FROM (
    SELECT education AS value, COUNT(education) AS _count
    FROM ${training_table}
    GROUP BY education
)
WHERE _count >= {threshold};
```

The `WHERE _count >= {threshold}` clause filters out values whose count is less than the threshold, to avoid overfitting on rare values.

Besides the vocabulary, the other analysis results are a single number or a list of numbers, like the bucket boundaries. So we can save them into a table like:

| feature_stats | value |
| ---- | --- |
| age_min | 10 |
| age_max | 90 |
| age_mean | 44.75 |
| age_std_dev | 56.6875 |
| age_bucket_boundaries | 30,40,50 |
| marital_distinct_count | 2 |
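
As an illustration of this step, the analysis pod could write each statistic as one row of the statistics table. The sketch below uses `pymysql` against a MySQL backend; the connection settings and the `feature_statistics` table name are hypothetical, and a MaxCompute backend would use its own client:

```python
import pymysql

# Hypothetical connection settings for the database holding analysis results.
conn = pymysql.connect(
    host="mysql-host", user="root", password="", database="analysis"
)

# The statistics computed by the SQL queries above.
stats = {
    "age_min": "10",
    "age_max": "90",
    "age_mean": "44.75",
    "age_std_dev": "56.6875",
    "age_bucket_boundaries": "30,40,50",
    "marital_distinct_count": "2",
}

with conn.cursor() as cursor:
    # One row per statistic, matching the (feature_stats, value) schema above.
    for name, value in stats.items():
        cursor.execute(
            "INSERT INTO feature_statistics (feature_stats, value) VALUES (%s, %s)",
            (name, value),
        )
conn.commit()
```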

Because the vocabulary size may be huge, we cannot save it into a record like:

| feature_stats | value |
| ---- | --- |
| education_vocab | Master,Doctor,Bachelor |
| marital_vocab | Divorced,Never-married |

So, we save the vocabulary into a column, where each record holds one element, like

| education | marital |
| ---- | --- |
| Master | Divorced |
| Doctor | Never-married |
| Bachelor | |

After analysis, we get two tables with the analysis results: a statistics table, which saves the mean, standard deviation, bucket boundaries, and distinct count, and a vocabulary table, which saves the vocabulary.

### 2. Pass analysis results to build a model in training pods

For the values in the statistics table, we can write them into the environment variables of the training pod to build the model. For example:

```bash
envs='_age_mean=44.75,_age_std=56.6875,_age_boundaries="30,40,50"'
```
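
To make this step concrete, here is a minimal sketch of how the job submitter could turn rows of the statistics table into environment variables for the training pod spec; `build_stats_envs` and the `stats_rows` data are hypothetical names introduced for illustration:

```python
def build_stats_envs(stats_rows):
    """Turn (feature_stats, value) rows from the statistics table into
    environment variables, prefixing each name with an underscore."""
    return {"_" + name: str(value) for name, value in stats_rows}


# Hypothetical rows queried from the statistics table above.
stats_rows = [
    ("age_mean", "44.75"),
    ("age_std", "56.6875"),
    ("age_boundaries", "30,40,50"),
]

envs = build_stats_envs(stats_rows)
# The submitter would attach `envs` to the training pod spec; the
# preprocessing layers then read them back with os.getenv.
print(envs)
# {'_age_mean': '44.75', '_age_std': '56.6875', '_age_boundaries': '30,40,50'}
```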

In preprocessing layers, we can get the statistics from environment variables like:

```python
import os

from elasticdl_preprocessing.layers import Discretization

# Parse the "_age_boundaries=30,40,50" value written by the submitter.
age_boundaries = list(
    map(float, os.getenv("_age_boundaries").split(","))
)
layer = Discretization(bins=age_boundaries)
```

Further, we can provide an `analyzer_utils` module in `elasticdl_preprocessing` to get the statistics from environment variables like:

```python
import os


def get_bucket_boundaries(feature_name, default_value):
    # Read the boundaries written by the submitter, e.g.
    # "_age_boundaries=30,40,50", falling back to the default.
    env_name = "_" + feature_name + "_boundaries"
    boundaries = os.getenv(env_name)
    if boundaries is None:
        return default_value
    else:
        return list(map(float, boundaries.split(",")))


def get_distinct_count(feature_name, default_value):
    env_name = "_" + feature_name + "_distinct_count"
    count = os.getenv(env_name)
    if count is None:
        return default_value
    else:
        return int(count)
```

Using the default values in `analyzer_utils`, users can debug the model without running analysis. So, we can define the preprocessing layers like:

```python
from elasticdl_preprocessing import analyzer_utils
from elasticdl_preprocessing.layers import Discretization, Hashing

# Fall back to the default values if analysis has not run.
age_boundaries = analyzer_utils.get_bucket_boundaries(
    "age", default_value=[30, 40, 50]
)
discretize_layer = Discretization(bins=age_boundaries)

marital_count = analyzer_utils.get_distinct_count(
    "marital", default_value=2
)
hash_layer = Hashing(
    num_bins=marital_count
)
```

For the values in the vocabulary table, we cannot save the vocabulary into environment variables, because the vocabulary size may be huge. However, we can save the vocabulary into shared storage like GlusterFS and write the file path into the environment variables of a training pod. After the training job completes, we can clear the vocabulary files from the storage.

```bash
envs="_education_vocab=/testdata/elasticdl/vocabulary/education.txt"
```
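
A minimal sketch of the reading side, assuming the training pod mounts the shared storage at the path above; `get_vocabulary` is a hypothetical helper, and `IndexLookup` stands in for whatever string-to-id lookup layer `elasticdl_preprocessing` provides rather than a confirmed API:

```python
import os

# Assumed lookup layer name; see the lead-in above.
from elasticdl_preprocessing.layers import IndexLookup


def get_vocabulary(feature_name, default_value):
    """Load the vocabulary file whose path the submitter wrote into the
    environment, falling back to a default so the model runs without
    analysis."""
    path = os.getenv("_" + feature_name + "_vocab")
    if path is None:
        return default_value
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


education_vocab = get_vocabulary(
    "education", default_value=["Master", "Doctor", "Bachelor"]
)
lookup_layer = IndexLookup(vocabulary=education_vocab)
```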