Commit 6bb5c84

Merge pull request #3 from shivamsanju/feature-notebook-lineage
Moved callback to constructor
2 parents 4394440 + 8ad30ee commit 6bb5c84

22 files changed: +929 −242 lines changed

docs/README.md

Lines changed: 8 additions & 9 deletions
@@ -28,7 +28,6 @@ Feathr automatically computes your feature values and joins them to your trainin
 - **Native cloud integration** with simplified and scalable architecture, which is illustrated in the next section.
 - **Feature sharing and reuse made easy:** Feathr has built-in feature registry so that features can be easily shared across different teams and boost team productivity.
 
-
 ## Running Feathr on Azure with 3 Simple Steps
 
 Feathr has native cloud integration. To use Feathr on Azure, you only need three steps:
@@ -50,7 +49,7 @@ Feathr has native cloud integration. To use Feathr on Azure, you only need three
 If you are not using the above Jupyter Notebook and want to install Feathr client locally, use this:
 
 ```bash
-pip install -U feathr
+pip install feathr
 ```
 
 Or use the latest code from GitHub:
@@ -126,31 +125,30 @@ Read the [Streaming Source Ingestion Guide](https://linkedin.github.io/feathr/ho
 
 Read [Point-in-time Correctness and Point-in-time Join in Feathr](https://linkedin.github.io/feathr/concepts/point-in-time-join.html) for more details.
 
-
 ## Running Feathr Examples
 
-Follow the [quick start Jupyter Notebook](./feathr_project/feathrcli/data/feathr_user_workspace/product_recommendation_demo.ipynb) to try it out. There is also a companion [quick start guide](https://linkedin.github.io/feathr/quickstart.html) containing a bit more explanation on the notebook.
-
+Follow the [quick start Jupyter Notebook](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/product_recommendation_demo.ipynb) to try it out.
+There is also a companion [quick start guide](https://linkedin.github.io/feathr/quickstart_synapse.html) containing a bit more explanation on the notebook.
 
 ## Cloud Architecture
 
 Feathr has native integration with Azure and other cloud services, and here's the high-level architecture to help you get started.
 ![Architecture](images/architecture.png)
 
-# Next Steps
+## Next Steps
 
-## Quickstart
+### Quickstart
 
 - [Quickstart for Azure Synapse](quickstart_synapse.md)
 
-## Concepts
+### Concepts
 
 - [Feature Definition](concepts/feature-definition.md)
 - [Feature Generation](concepts/feature-generation.md)
 - [Feature Join](concepts/feature-join.md)
 - [Point-in-time Correctness](concepts/point-in-time-join.md)
 
-## How-to-guides
+### How-to-guides
 
 - [Azure Deployment](how-to-guides/azure-deployment.md)
 - [Local Feature Testing](how-to-guides/local-feature-testing.md)
@@ -159,4 +157,5 @@ Feathr has native integration with Azure and other cloud services, and here's th
 - [Feathr Job Configuration](how-to-guides/feathr-job-configuration.md)
 
 ## API Documentation
+
 - [Python API Documentation](https://feathr.readthedocs.io/en/latest/)
Lines changed: 66 additions & 30 deletions
@@ -1,16 +1,19 @@
 ---
 layout: default
-title: Feathr Feature Generation
+title: Feature Generation and Materialization
 parent: Feathr Concepts
 ---
 
-# Feature Generation
-Feature generation is the process to create features from raw source data into a certain persisted storage.
+# Feature Generation and Materialization
 
-User could utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused(usually in offline setting). Feature generation is also useful in generating embedding features. Embedding distill information from large data and it is usually more compact.
+Feature generation (also known as feature materialization) is the process to create features from raw source data into a certain persisted storage in either offline store (for further reuse), or online store (for online inference).
+
+User can utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in offline setting). Feature generation is also useful in generating embedding features, where those embeddings distill information from large data and is usually more compact.
 
 ## Generating Features to Online Store
-When we need to serve the models online, we also need to serve the features online. We provide APIs to generate features to online storage for future consumption. For example:
+
+When the models are served in an online environment, we also need to serve the corresponding features in the same online environment as well. Feathr provides APIs to generate features to online storage for future consumption. For example:
+
 ```python
 client = FeathrClient()
 redisSink = RedisSink(table_name="nycTaxiDemoFeature")
@@ -21,12 +24,16 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
 client.materialize_features(settings)
 ```
 
-([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
-[RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink)
+More reference on the APIs:
+
+- [MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
+- [RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink)
 
 In the above example, we define a Redis table called `nycTaxiDemoFeature` and materialize two features called `f_location_avg_fare` and `f_location_max_fare` to Redis.
 
-It is also possible to backfill the features for a previous time range, like below. If the `BackfillTime` part is not specified, it's by default to `now()` (i.e. if not specified, it's equivilant to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).
+## Feature Backfill
+
+It is also possible to backfill the features for a particular time range, like below. If the `BackfillTime` part is not specified, it's by default to `now()` (i.e. if not specified, it's equivalent to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).
 
 ```python
 client = FeathrClient()
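The online-store flow in the hunk above (define a sink, pick feature names, materialize) can be illustrated with a plain dict standing in for Redis. This is a toy conceptual sketch, not the Feathr implementation; every name in it is made up for illustration:

```python
# Toy sketch: "materialize" selected precomputed feature values into a
# dict that stands in for an online store such as Redis.
precomputed = {
    # entity key -> feature name -> value (already computed offline)
    "265": {"f_location_avg_fare": 12.5, "f_location_max_fare": 52.0, "f_unused": 1.0},
}

def materialize(table_name, feature_names, store):
    """Write only the requested features into the online store table."""
    table = store.setdefault(table_name, {})
    for key, feats in precomputed.items():
        table[key] = {name: feats[name] for name in feature_names}
    return store

online_store = {}
materialize("nycTaxiDemoFeature", ["f_location_avg_fare", "f_location_max_fare"], online_store)
print(online_store["nycTaxiDemoFeature"]["265"])
```

Only the listed feature names land in the table; anything else stays offline, which is the point of naming features explicitly in `MaterializationSettings`.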
@@ -39,29 +46,34 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
 client.materialize_features(settings)
 ```
 
-([BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime),
-[client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features))
+Note that if you don't have features available in `now`, you'd better specify a `BackfillTime` range where you have features.
 
-## Consuming the online features
+Also, Feathr will submit a materialization job for each of the steps for performance reasons. I.e. if you have
+`BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))`, Feathr will submit 20 jobs to run in parallel for maximum performance.
 
-```python
-client.wait_job_to_finish(timeout_sec=600)
+More reference on the APIs:
+
+- [BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime)
+- [client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features)
 
-res = client.get_online_features('nycTaxiDemoFeature', '265', [
-    'f_location_avg_fare', 'f_location_max_fare'])
+
+
+## Consuming features in online environment
+
+After the materialization job is finished, we can get the online features by querying the `feature table`, corresponding `entity key` and a list of `feature names`. In the example below, we query the online features called `f_location_avg_fare` and `f_location_max_fare`, and query with a key `265` (which is the location ID).
+
+```python
+res = client.get_online_features('nycTaxiDemoFeature', '265', ['f_location_avg_fare', 'f_location_max_fare'])
 ```
 
-([client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features))
+More reference on the APIs:
+- [client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features)
 
-After we finish running the materialization job, we can get the online features by querying the feature name, with the
-corresponding keys. In the example above, we query the online features called `f_location_avg_fare` and
-`f_location_max_fare`, and query with a key `265` (which is the location ID).
+## Materializing Features to Offline Store
 
-## Generating Features to Offline Store
+This is useful when the feature transformation is compute intensive and features can be re-used. For example, you have a feature that needs more than 24 hours to compute and the feature can be reused by more than one model training pipeline. In this case, you should consider generating features to offline.
 
-This is a useful when the feature transformation is computation intensive and features can be re-used. For example, you
-have a feature that needs more than 24 hours to compute and the feature can be reused by more than one model training
-pipeline. In this case, you should consider generate features to offline. Here is an API example:
+The API call is very similar to materializing features to online store, and here is an API example:
 
 ```python
 client = FeathrClient()
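The one-job-per-step claim in the hunk above can be checked with plain `datetime` arithmetic. This is an independent illustration of the counting, not Feathr code; the helper name is made up:

```python
from datetime import datetime, timedelta

def backfill_steps(start, end, step):
    """Enumerate the backfill cutoff times between start and end, inclusive."""
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += step
    return times

# Feb 1 through Feb 20 with a daily step: 20 cutoffs, hence 20 parallel jobs.
steps = backfill_steps(datetime(2022, 2, 1), datetime(2022, 2, 20), timedelta(days=1))
print(len(steps))
```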
@@ -73,14 +85,14 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
 client.materialize_features(settings)
 ```
 
-This will generate features on latest date(assuming it's `2022/05/21`) and output data to the following path:
+This will generate features on latest date (assuming it's `2022/05/21`) and output data to the following path:
 `abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2022/05/21`
 
-You can also specify a BackfillTime so the features will be generated for those dates. For example:
+You can also specify a `BackfillTime` so the features will be generated only for those dates. For example:
 
 ```Python
 backfill_time = BackfillTime(start=datetime(
-    2020, 5, 20), end=datetime(2020, 5, 20), step=timedelta(days=1))
+    2020, 5, 10), end=datetime(2020, 5, 20), step=timedelta(days=1))
 offline_sink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/")
 settings = MaterializationSettings("nycTaxiTable",
                                    sinks=[offline_sink],
@@ -89,8 +101,32 @@ settings = MaterializationSettings("nycTaxiTable",
                                    backfill_time=backfill_time)
 ```
 
-This will generate features only for 2020/05/20 for me and it will be in folder:
-`abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`
+This will generate features from `2020/05/10` to `2020/05/20` and the output will have 11 folders, from
+`abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/10` to `abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`. Note that currently Feathr only supports materializing data in daily step (i.e. even if you specify an hourly step, the generated features in offline store will still be presented in a daily hierarchy).
+
+You can also specify the format of the materialized features in the offline store by using `execution_configurations` like below. Please refer to the [documentation](../how-to-guides/feathr-job-configuration.md) here for those configuration details.
+
+```python
+from feathr import HdfsSink
+offlineSink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_data/")
+# Materialize two features into an offline store.
+settings = MaterializationSettings("nycTaxiMaterializationJob",
+                                   sinks=[offlineSink],
+                                   feature_names=["f_location_avg_fare", "f_location_max_fare"])
+client.materialize_features(settings, execution_configurations={"spark.feathr.outputFormat": "parquet"})
+```
+
+For reading those materialized features, Feathr has a convenient helper function called `get_result_df` to help you view the data. For example, you can use the sample code below to read from the materialized result in offline store:
+
+```python
+path = "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20/"
+res = get_result_df(client=client, format="parquet", res_url=path)
+```
+
+More reference on the APIs:
 
-([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
-[HdfsSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSink))
+- [MaterializationSettings API](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
+- [HdfsSink API](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSource)
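The 11-folder layout described in the hunk above follows mechanically from the daily partitioning. A small sketch of how the `daily/yyyy/MM/dd` suffixes enumerate; the base path comes from the doc, while the helper function is hypothetical:

```python
from datetime import date, timedelta

BASE = "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily"

def daily_paths(start, end):
    """One output folder per day between start and end, inclusive."""
    d, out = start, []
    while d <= end:
        out.append(f"{BASE}/{d.year:04d}/{d.month:02d}/{d.day:02d}")
        d += timedelta(days=1)
    return out

paths = daily_paths(date(2020, 5, 10), date(2020, 5, 20))
print(len(paths))  # 11 folders
```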

docs/how-to-guides/azure_resource_provision.json

Lines changed: 99 additions & 0 deletions
@@ -40,6 +40,52 @@
         "description": "Whether or not to deploy eventhub provision script"
       }
     },
+    "databaseServerName": {
+      "type": "string",
+      "defaultValue": "[concat('server-', uniqueString(resourceGroup().id, deployment().name))]",
+      "metadata": {
+        "description": "Specifies the name for the SQL server"
+      }
+    },
+    "databaseName": {
+      "type": "string",
+      "defaultValue": "[concat('db-', uniqueString(resourceGroup().id, deployment().name), '-1')]",
+      "metadata": {
+        "description": "Specifies the name for the SQL database under the SQL server"
+      }
+    },
+    "location": {
+      "type": "string",
+      "defaultValue": "[resourceGroup().location]",
+      "metadata": {
+        "description": "Specifies the location for server and database"
+      }
+    },
+    "adminUser": {
+      "type": "string",
+      "metadata": {
+        "description": "Specifies the username for admin"
+      }
+    },
+    "adminPassword": {
+      "type": "securestring",
+      "metadata": {
+        "description": "Specifies the password for admin"
+      }
+    },
+    "storageAccountKey": {
+      "type": "string",
+      "metadata": {
+        "description": "Specifies the key of the storage account where the BACPAC file is stored."
+      }
+    },
+    "bacpacUrl": {
+      "type": "string",
+      "defaultValue": "https://azurefeathrstorage.blob.core.windows.net/public/feathr-registry-schema.bacpac",
+      "metadata": {
+        "description": "This is the pre-created BACPAC file that contains required schemas by the registry server."
+      }
+    },
     "dockerImage": {
       "defaultValue": "blrchen/feathr-sql-registry",
       "type": "String",
@@ -393,6 +439,59 @@
         "principalId": "[parameters('principalId')]",
         "scope": "[resourceGroup().id]"
       }
+    },
+    {
+      "type": "Microsoft.Sql/servers",
+      "apiVersion": "2021-11-01-preview",
+      "name": "[parameters('databaseServerName')]",
+      "location": "[parameters('location')]",
+      "properties": {
+        "administratorLogin": "[parameters('adminUser')]",
+        "administratorLoginPassword": "[parameters('adminPassword')]",
+        "version": "12.0"
+      },
+      "resources": [
+        {
+          "type": "firewallrules",
+          "apiVersion": "2021-11-01-preview",
+          "name": "AllowAllAzureIps",
+          "location": "[parameters('location')]",
+          "dependsOn": [
+            "[parameters('databaseServerName')]"
+          ],
+          "properties": {
+            "startIpAddress": "0.0.0.0",
+            "endIpAddress": "0.0.0.0"
+          }
+        }
+      ]
+    },
+    {
+      "type": "Microsoft.Sql/servers/databases",
+      "apiVersion": "2021-11-01-preview",
+      "name": "[concat(string(parameters('databaseServerName')), '/', string(parameters('databaseName')))]",
+      "location": "[parameters('location')]",
+      "dependsOn": [
+        "[concat('Microsoft.Sql/servers/', parameters('databaseServerName'))]"
+      ],
+      "resources": [
+        {
+          "type": "extensions",
+          "apiVersion": "2021-11-01-preview",
+          "name": "Import",
+          "dependsOn": [
+            "[resourceId('Microsoft.Sql/servers/databases', parameters('databaseServerName'), parameters('databaseName'))]"
+          ],
+          "properties": {
+            "storageKeyType": "StorageAccessKey",
+            "storageKey": "[parameters('storageAccountKey')]",
+            "storageUri": "[parameters('bacpacUrl')]",
+            "administratorLogin": "[parameters('adminUser')]",
+            "administratorLoginPassword": "[parameters('adminPassword')]",
+            "operationMode": "Import"
+          }
+        }
+      ]
     }
   ],
   "outputs": {}

docs/how-to-guides/client-callback-function.md

Lines changed: 10 additions & 8 deletions
@@ -10,27 +10,29 @@ A callback function is a function that is sent to another function as an argumen
 
 ## How to use callback functions
 
-Currently the below functions in feathr client support passing a callback as an argument:
+We can pass a callback function when initializing the feathr client.
+
+```python
+client = FeathrClient(config_path, callback)
+```
+
+The below functions accept an optional parameter named **params**. params is a dictionary where user can pass the arguments for the callback function.
 
 - get_online_features
 - multi_get_online_features
 - get_offline_features
 - monitor_features
 - materialize_features
 
-These functions accept two optional parameters named **callback** and **params**.
-callback is of type function and params is a dictionary where user can pass the arguments for the callback function.
-
 An example on how to use it:
 
 ```python
 # inside notebook
-client = FeathrClient(config_path)
-client.get_offline_features(observation_settings,feature_query,output_path, callback, params)
-
-# users can define their own callback function and params
+client = FeathrClient(config_path, callback)
 params = {"param1":"value1", "param2":"value2"}
+client.get_offline_features(observation_settings,feature_query,output_path, params)
 
+# users can define their own callback function
 async def callback(params):
     import httpx
     async with httpx.AsyncClient() as requestHandler:
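The constructor-callback pattern this commit introduces can be exercised without Feathr or httpx. Below is a minimal stand-in client that stores the callback at construction and awaits it with `params` after an operation; every name here is illustrative, and a plain coroutine replaces the httpx call:

```python
import asyncio

class MiniClient:
    """Stand-in for FeathrClient: the callback is supplied at construction."""
    def __init__(self, callback=None):
        self.callback = callback

    def get_offline_features(self, params=None):
        # After the real work would finish, invoke the user's callback.
        if self.callback is not None:
            asyncio.run(self.callback(params))

received = {}

async def callback(params):
    # A real callback might POST these params somewhere with httpx.
    received.update(params or {})

client = MiniClient(callback)
client.get_offline_features(params={"param1": "value1", "param2": "value2"})
print(received)
```

The design point mirrors the PR title ("Moved callback to constructor"): the callback is bound once to the client rather than threaded through every method call, so per-call sites only supply `params`.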
