
Commit 7411adf

Merge remote-tracking branch 'upstream/master'

2 parents 8c454d8 + ea53b2b

File tree

13 files changed: +101 −33 lines changed


docs/SUMMARY.md

Lines changed: 5 additions & 5 deletions
@@ -36,7 +36,7 @@
   * [Offline store](getting-started/components/offline-store.md)
   * [Online store](getting-started/components/online-store.md)
   * [Feature server](getting-started/components/feature-server.md)
-  * [Batch Materialization Engine](getting-started/components/batch-materialization-engine.md)
+  * [Compute Engine](getting-started/components/compute-engine.md)
   * [Provider](getting-started/components/provider.md)
   * [Authorization Manager](getting-started/components/authz_manager.md)
   * [OpenTelemetry Integration](getting-started/components/open-telemetry.md)
@@ -139,10 +139,10 @@
   * [Google Cloud Platform](reference/providers/google-cloud-platform.md)
   * [Amazon Web Services](reference/providers/amazon-web-services.md)
   * [Azure](reference/providers/azure.md)
-* [Batch Materialization Engines](reference/batch-materialization/README.md)
-  * [Snowflake](reference/batch-materialization/snowflake.md)
-  * [AWS Lambda (alpha)](reference/batch-materialization/lambda.md)
-  * [Spark (contrib)](reference/batch-materialization/spark.md)
+* [Compute Engines](reference/compute-engine/README.md)
+  * [Snowflake](reference/compute-engine/snowflake.md)
+  * [AWS Lambda (alpha)](reference/compute-engine/lambda.md)
+  * [Spark (contrib)](reference/compute-engine/spark.md)
 * [Feature repository](reference/feature-repository/README.md)
   * [feature\_store.yaml](reference/feature-repository/feature-store-yaml.md)
   * [.feastignore](reference/feature-repository/feast-ignore.md)

docs/getting-started/architecture/write-patterns.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ There are two ways a client (or Data Producer) can *_send_* data to the online s
   - Using a synchronous API call for a small number of entities or a single entity (e.g., using the [`push` or `write_to_online_store` methods](../../reference/data-sources/push.md#pushing-data)) or the Feature Server's [`push` endpoint](../../reference/feature-servers/python-feature-server.md#pushing-features-to-the-online-and-offline-stores))
 2. Asynchronously
   - Using an asynchronous API call for a small number of entities or a single entity (e.g., using the [`push` or `write_to_online_store` methods](../../reference/data-sources/push.md#pushing-data)) or the Feature Server's [`push` endpoint](../../reference/feature-servers/python-feature-server.md#pushing-features-to-the-online-and-offline-stores))
-  - Using a "batch job" for a large number of entities (e.g., using a [batch materialization engine](../components/batch-materialization-engine.md))
+  - Using a "batch job" for a large number of entities (e.g., using a [compute engine](../components/compute-engine.md))

 Note, in some contexts, developers may "batch" a group of entities together and write them to the online store in a
 single API call. This is a common pattern when writing data to the online store to reduce write loads but we would
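For orientation, here is a minimal sketch of the synchronous write path described in the hunk above, assuming a local feature repository with a push source; the push source name, columns, and values are illustrative, not part of this commit:

```python
import pandas as pd
from feast import FeatureStore
from feast.data_source import PushMode

# Load feature_store.yaml from the current directory (path is illustrative).
store = FeatureStore(repo_path=".")

# A single-entity event; column names must match the push source's schema.
event_df = pd.DataFrame.from_records([
    {
        "driver_id": 1001,
        "conv_rate": 0.85,
        "event_timestamp": pd.Timestamp.utcnow(),
    },
])

# Synchronous path: push one small batch straight to the online store.
store.push("driver_stats_push_source", event_df, to=PushMode.ONLINE)
```

The batch-job path in the same list is what the compute engine handles during materialization, shown later in this commit.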

docs/getting-started/components/compute-engine.md

Lines changed: 53 additions & 5 deletions
@@ -1,4 +1,4 @@
-# Compute Engine (Batch Materialization Engine)
+# Compute Engine

 Note: The materialization is now constructed via unified compute engine interface.

@@ -20,8 +20,9 @@ engines.
 ```markdown
 | Compute Engine | Description | Supported | Link |
 |-------------------------|-------------------------------------------------------------------------------------------------|------------|------|
-| LocalComputeEngine | Runs on Arrow + Pandas/Polars/Dask etc., designed for light weight transformation. | ✅ | |
+| LocalComputeEngine | Runs on Arrow + Pandas/Polars/Dask etc., designed for light weight transformation. | ✅ | |
 | SparkComputeEngine | Runs on Apache Spark, designed for large-scale distributed feature generation. | ✅ | |
+| SnowflakeComputeEngine | Runs on Snowflake, designed for scalable feature generation using Snowflake SQL. | ✅ | |
 | LambdaComputeEngine | Runs on AWS Lambda, designed for serverless feature generation. | ✅ | |
 | FlinkComputeEngine | Runs on Apache Flink, designed for stream processing and real-time feature generation. | ❌ | |
 | RayComputeEngine | Runs on Ray, designed for distributed feature generation and machine learning workloads. | ❌ | |
@@ -31,7 +32,7 @@ engines.
 Batch Engine Config can be configured in the `feature_store.yaml` file, and it serves as the default configuration for all materialization and historical retrieval tasks. The `batch_engine` config in BatchFeatureView. E.g
 ```yaml
 batch_engine:
-  type: SparkComputeEngine
+  type: spark.engine
   config:
     spark_master: "local[*]"
     spark_app_name: "Feast Batch Engine"
```
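A rough sketch of how the `batch_engine` configured above is exercised from Python; the repo path and date range are illustrative assumptions, not part of the commit:

```python
from datetime import datetime, timedelta

from feast import FeatureStore

# Loads feature_store.yaml (including the batch_engine section) from the repo path.
store = FeatureStore(repo_path=".")

# Materialize the last day of data; the configured compute engine
# (e.g. spark.engine above) runs the underlying materialization job.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=1),
    end_date=datetime.utcnow(),
)
```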
@@ -59,7 +60,7 @@ Then, when you materialize the feature view, it will use the batch_engine config
 Stream Engine Config can be configured in the `feature_store.yaml` file, and it serves as the default configuration for all stream materialization and historical retrieval tasks. The `stream_engine` config in FeatureView. E.g
 ```yaml
 stream_engine:
-  type: SparkComputeEngine
+  type: spark.engine
   config:
     spark_master: "local[*]"
     spark_app_name: "Feast Stream Engine"
```
@@ -108,4 +109,51 @@ defined in the DAG. It handles the execution of transformations, aggregations, j

 The Feature resolver is the core component of the compute engine that constructs the execution plan for feature
 generation. It takes the definitions from feature views and builds a directed acyclic graph (DAG) of operations that
-need to be performed to generate the features.
+need to be performed to generate the features.
+
+#### DAG
+The DAG represents the directed acyclic graph of operations that need to be performed to generate the features. It
+contains nodes for each operation, such as transformations, aggregations, joins, and filters. The DAG is built by the
+Feature Resolver and executed by the Feature Builder.
+
+DAG nodes are defined as follows:
+```
+  +---------------------+
+  |   SourceReadNode    |  <- Read data from offline store (e.g. Snowflake, BigQuery, etc. or custom source)
+  +---------------------+
+            |
+            v
+  +--------------------------------------+
+  |  TransformationNode / JoinNode (*)   |  <- Merge data sources, custom transformations by user, or default join
+  +--------------------------------------+
+            |
+            v
+  +---------------------+
+  |     FilterNode      |  <- used for point-in-time filtering
+  +---------------------+
+            |
+            v
+  +---------------------+
+  | AggregationNode (*) |  <- only if aggregations are defined
+  +---------------------+
+            |
+            v
+  +---------------------+
+  |  DeduplicationNode  |  <- used if no aggregation and for history
+  +---------------------+     retrieval
+            |
+            v
+  +---------------------+
+  | ValidationNode (*)  |  <- optional validation checks
+  +---------------------+
+            |
+            v
+       +----------+
+       |  Output  |
+       +----------+
+        /        \
+       v          v
+ +------------------+  +-------------------+
+ | OnlineStoreWrite |  | OfflineStoreWrite |
+ +------------------+  +-------------------+
+```
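To make the node structure above concrete, a hypothetical and heavily simplified sketch follows; the class and field names are illustrative only and do not reflect Feast's actual internal API:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DAGNode:
    """Hypothetical DAG node: one operation in the feature-generation plan."""
    name: str
    inputs: List["DAGNode"] = field(default_factory=list)


@dataclass
class SourceReadNode(DAGNode):
    """Reads raw rows from the offline store or a custom source."""
    source: Optional[str] = None


@dataclass
class AggregationNode(DAGNode):
    """Applies aggregations; present only when aggregations are defined."""
    functions: List[str] = field(default_factory=list)


# A toy plan mirroring the top of the diagram: read -> aggregate.
read = SourceReadNode(name="read_driver_stats", source="driver_hourly_stats")
agg = AggregationNode(name="agg_trips", inputs=[read], functions=["sum", "avg"])
```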

docs/getting-started/components/overview.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ A complete Feast deployment contains the following components:
   * Retrieve online features.
 * **Feature Server:** The Feature Server is a REST API server that serves feature values for a given entity key and feature reference. The Feature Server is designed to be horizontally scalable and can be deployed in a distributed manner.
 * **Stream Processor:** The Stream Processor can be used to ingest feature data from streams and write it into the online or offline stores. Currently, there's an experimental Spark processor that's able to consume data from Kafka.
-* **Batch Materialization Engine:** The [Batch Materialization Engine](batch-materialization-engine.md) component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
+* **Compute Engine:** The [Compute Engine](compute-engine.md) component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
 * **Online Store:** The online store is a database that stores only the latest feature values for each entity. The online store is either populated through materialization jobs or through [stream ingestion](../../reference/data-sources/push.md).
 * **Offline Store:** The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. For feature retrieval and materialization, Feast does not manage the offline store directly, but runs queries against it. However, offline stores can be configured to support writes if Feast configures logging functionality of served features.
 * **Authorization Manager**: The authorization manager detects authentication tokens from client requests to Feast servers and uses this information to enforce permission policies on the requested services.

docs/getting-started/genai.md

Lines changed: 1 addition & 1 deletion
@@ -162,4 +162,4 @@ For more detailed information and examples:
 * [MCP Feature Server Reference](../reference/feature-servers/mcp-feature-server.md)
 * [Spark Data Source](../reference/data-sources/spark.md)
 * [Spark Offline Store](../reference/offline-stores/spark.md)
-* [Spark Batch Materialization](../reference/batch-materialization/spark.md)
+* [Spark Compute Engine](../reference/compute-engine/spark.md)

docs/how-to-guides/running-feast-in-production.md

Lines changed: 2 additions & 2 deletions
@@ -57,8 +57,8 @@ To keep your online store up to date, you need to run a job that loads feature d
 Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.

 This approach may not scale to large amounts of data, which users of Feast may be dealing with in production.
-In this case, we recommend using one of the more [scalable materialization engines](./scaling-feast.md#scaling-materialization), such as [Snowflake Materialization Engine](../reference/batch-materialization/snowflake.md).
-Users may also need to [write a custom materialization engine](../how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md) to work on their existing infrastructure.
+In this case, we recommend using one of the more [scalable compute engines](./scaling-feast.md#scaling-materialization), such as [Snowflake Compute Engine](../reference/compute-engine/snowflake.md).
+Users may also need to [write a custom compute engine](../how-to-guides/customizing-feast/creating-a-custom-compute-engine.md) to work on their existing infrastructure.


 ### 2.2 Scheduled materialization with Airflow
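As a rough sketch of the step a scheduler such as Airflow would invoke for the scheduled materialization this hunk introduces, incremental materialization can be driven from Python; the repository path below is an illustrative assumption:

```python
from datetime import datetime

from feast import FeatureStore

# Point at the production feature repository (path is illustrative).
store = FeatureStore(repo_path="/opt/feast/feature_repo")

# Materialize everything between the last materialized timestamp and now;
# the configured compute engine performs the actual load into the online store.
store.materialize_incremental(end_date=datetime.utcnow())
```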

docs/how-to-guides/scaling-feast.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ The recommended solution in this case is to use the [SQL based registry](../tuto
 The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store.
 However, this process does not scale for large data sets, since it's executed on a single-process.

-Feast supports pluggable [Materialization Engines](../getting-started/components/batch-materialization-engine.md), that allow the materialization process to be scaled up.
+Feast supports pluggable [Compute Engines](../getting-started/components/compute-engine.md), that allow the materialization process to be scaled up.
 Aside from the local process, Feast supports a [Lambda-based materialization engine](https://rtd.feast.dev/en/master/#alpha-lambda-based-engine), and a [Bytewax-based materialization engine](https://rtd.feast.dev/en/master/#bytewax-engine).

 Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.

docs/reference/batch-materialization/README.md

Lines changed: 0 additions & 11 deletions
This file was deleted.

docs/reference/codebase-structure.md

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ There are also several important submodules:
 * `ui/` contains the embedded Web UI, to be launched on the `feast ui` command.

 Of these submodules, `infra/` is the most important.
-It contains the interfaces for the [provider](getting-started/components/provider.md), [offline store](getting-started/components/offline-store.md), [online store](getting-started/components/online-store.md), [batch materialization engine](getting-started/components/batch-materialization-engine.md), and [registry](getting-started/components/registry.md), as well as all of their individual implementations.
+It contains the interfaces for the [provider](getting-started/components/provider.md), [offline store](getting-started/components/offline-store.md), [online store](getting-started/components/online-store.md), [compute engine](getting-started/components/compute-engine.md), and [registry](getting-started/components/registry.md), as well as all of their individual implementations.

 ```
 $ tree --dirsfirst -L 1 infra

docs/reference/compute-engine/README.md

Lines changed: 17 additions & 0 deletions
@@ -48,18 +48,35 @@ An example of built output from FeatureBuilder:

 ## ✨ Available Engines

+
 ### 🔥 SparkComputeEngine

+{% page-ref page="spark.md" %}
+
 - Distributed DAG execution via Apache Spark
 - Supports point-in-time joins and large-scale materialization
 - Integrates with `SparkOfflineStore` and `SparkMaterializationJob`

 ### 🧪 LocalComputeEngine

+{% page-ref page="local.md" %}
+
 - Runs on Arrow + Specified backend (e.g., Pandas, Polars)
 - Designed for local dev, testing, or lightweight feature generation
 - Supports `LocalMaterializationJob` and `LocalHistoricalRetrievalJob`

+### 🧊 SnowflakeComputeEngine
+
+- Runs entirely in Snowflake
+- Supports Snowflake SQL for feature transformations and aggregations
+- Integrates with `SnowflakeOfflineStore` and `SnowflakeMaterializationJob`
+
+{% page-ref page="snowflake.md" %}
+
+### LambdaComputeEngine
+
+{% page-ref page="lambda.md" %}
+
 ---

 ## 🛠️ Feature Builder Flow
