articles/machine-learning/feature-set-specification-transformation-concepts.md
9 additions & 9 deletions
@@ -8,15 +8,15 @@ ms.topic: how-to
 ms.author: franksolomon
 author: fbsolo-ms1
 ms.reviewer: yogipandey
-ms.date: 12/06/2023
+ms.date: 01/23/2025
 ms.custom: template-concept
 ---
 
 # Feature transformation and best practices
 
-This article describes feature set specifications, the different kinds of transformations that can be used with it, and related best practices.
+This article describes feature set specifications, the different kinds of transformations that can be used with them, and related best practices.
 
-A feature set is a collection of features generated by source data transformations. A feature set specification is a self-contained definition for feature set development and local testing. After its development and local testing, you can register that feature set as a feature set asset with the feature store. You then have versioning and materialization available as managed capabilities.
+A feature set is a collection of features generated by source data transformations. A feature set specification is a self-contained definition for feature set development and local testing. After development and local testing of a feature set, you can register that feature set as a feature set asset with the feature store. You then have versioning and materialization available as managed capabilities.
 
 ## Define a feature set
 
@@ -85,7 +85,7 @@ The calculation happens in these steps:
 - Apply the feature transformer, defined by `feature_transformation.transformation_code`, on the data, and get the calculated features
 - Filter the feature values to return only those feature records within the feature window `[feature_window_start_ts, feature_window_end_ts)`
 
-In this code sample, the feature store API computes the features:
+In this code sample, the feature store API calculates the features:
 
 ```python
 # define the source data time window according to feature window
@@ -137,7 +137,7 @@ class UserTotalSpendProfileTransformer(Transformer):
-This feature set has three features, with data types as shown:
+The feature set has three features, with data types as shown:
 
 - `total_spend`: double
 - `is_high_spend_user`: bool
@@ -153,9 +153,9 @@ This shows the calculated feature values:
 
 ### Sliding window aggregation
 
-Sliding window aggregation can help handle feature values that present statistics (for example, sum, average, etc.) that accumulate over time. The SparkSQL `Window` function defines a sliding window around each row in the data, is useful in these cases.
+Sliding window aggregation can help handle feature values that present statistics (for example, sum, average, etc.) that accumulate over time. The SparkSQL `Window` function defines a sliding window around each row in the data, which is useful in these cases.
 
-For each row, the `Window` object can look into both future and past. In the context of machine learning features, you should define the `Window` object to look only the past, for each row. Visit the [Best Practice](#prevent-data-leakage-in-feature-transformation) section for more details.
+For each row, the `Window` object can look into both the future and the past. In the context of machine learning features, you should define the `Window` object to look only in the past, for each row. Visit the [Best Practice](#prevent-data-leakage-in-feature-transformation) section for more information.
 
 Start with this source data:
 
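
As a side note on the sliding window paragraphs in the hunk above: the following is a minimal pure-Python sketch of a window that looks only into the past for each row. It is not the SparkSQL `Window` API. The sample rows, the `rolling_sum_past_only` helper, the three-day window size, and the choice to include the current row in its own window are all illustrative assumptions, not taken from the article.

```python
from datetime import datetime, timedelta


def rolling_sum_past_only(rows, window):
    """For each (timestamp, value) row, sum the values of rows whose
    timestamp falls in (ts - window, ts]: the current row plus earlier
    rows only. Rows later than ts are never read."""
    ordered = sorted(rows, key=lambda r: r[0])
    result = []
    for ts, _ in ordered:
        total = sum(v for t, v in ordered if ts - window < t <= ts)
        result.append((ts, total))
    return result


# Hypothetical sample data: one transaction per day.
rows = [
    (datetime(2023, 1, 1), 10.0),
    (datetime(2023, 1, 2), 20.0),
    (datetime(2023, 1, 3), 30.0),
    (datetime(2023, 1, 4), 40.0),
]

rolling = rolling_sum_past_only(rows, timedelta(days=3))
for ts, total in rolling:
    print(ts.date(), total)
```

Because the upper bound of the window is the row's own timestamp, the value for a given day cannot change when later rows arrive, which is the property the changed paragraph asks for.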
@@ -329,7 +329,7 @@ Data leakage in the feature transformation definition can lead to these problems
 
 ### Set proper `source_lookback`
 
-For time-series (sliding/tumbling/stagger window aggregation) data aggregations, properly set the `source_lookback` property. This diagram shows the relationship between the source data window and the feature window in the feature (set) calculation:
+For time-series (sliding/tumbling/stagger window aggregation) data aggregations, set the `source_lookback` property correctly. This diagram shows the relationship between the source data window and the feature window in the feature (set) calculation:
 
 :::image type="content" source="./media/feature-set-specification-transformation-concepts/illustration-source-lookback.png" lightbox="./media/feature-set-specification-transformation-concepts/illustration-source-lookback.png" alt-text="Illustration showing the concept of source_lookback.":::
 
@@ -338,7 +338,7 @@ Define `source_lookback` as a time delta value, which presents the range of sour
 | Transformation type | `source_lookback` |
 |---|---|
 | Row-level transformation | 0 (default) |
-| Sliding window | size of the largest window range in the transformer.<br> e.g.<br> `source_lookback` = 3 days when the feature set defines 3 day rolling features <br> `source_lookback` = 7 days when the feature set defines both 3 day and 7 day rolling features |
+| Sliding window | size of the largest window range in the transformer.<br> e.g.<br> `source_lookback` = 3 days when the feature set defines three day rolling features <br> `source_lookback` = 7 days when the feature set defines both three day and seven day rolling features |
 | Tumbling/stagger window | value of `windowDuration` in `window` definition. e.g. source_lookback = 1day when using `window("timestamp", windowDuration="1 day",slideDuration="6 hours)` |
 
 Incorrect `source_lookback` settings can lead to incorrect calculated/materialized feature values.
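
To make the effect of `source_lookback` concrete, here is a small sketch in plain Python rather than the feature store API. It assumes that the source data window starts `source_lookback` before the feature window start and ends at the feature window end, which is how the hunks above describe the relationship. The helper names, the daily spend data, and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def source_window(feature_window_start, feature_window_end, source_lookback):
    """Return the [start, end) source data window assumed to be needed for
    the [feature_window_start, feature_window_end) feature window."""
    return feature_window_start - source_lookback, feature_window_end


# Hypothetical daily spend: 10.0 on Jan 1, 20.0 on Jan 2, ... 70.0 on Jan 7.
spend = {datetime(2023, 1, d): 10.0 * d for d in range(1, 8)}


def rolling_3day_sum(src_start, src_end, feat_start, feat_end):
    """3-day rolling sum calculated only from rows inside the source window,
    then filtered to rows inside the feature window."""
    visible = {t: v for t, v in spend.items() if src_start <= t < src_end}
    out = {}
    for ts in visible:
        if feat_start <= ts < feat_end:
            out[ts] = sum(
                v for t, v in visible.items()
                if ts - timedelta(days=3) < t <= ts
            )
    return out


feat_start, feat_end = datetime(2023, 1, 5), datetime(2023, 1, 7)

# Lookback that matches the 3-day rolling feature.
good = rolling_3day_sum(
    *source_window(feat_start, feat_end, timedelta(days=3)), feat_start, feat_end
)
# Lookback of 0 (the row-level default) hides the earlier source rows.
bad = rolling_3day_sum(
    *source_window(feat_start, feat_end, timedelta(0)), feat_start, feat_end
)
print(good[feat_start], bad[feat_start])
```

With a lookback of 0, the rolling sum for the first day of the feature window sees only that day's row, so the calculated value is silently too small. This is one way an incorrect `source_lookback` can produce incorrect feature values.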