
Visual Snowflake ML Plugin

(Diagram: Visual Snowflake ML plugin overview)

With this plugin, you can train machine learning models and then use them to score new records, all within your Snowflake environment. This no-code UI allows data scientists and domain experts to quickly train models, track experiments, and visualize model performance.

Capabilities

  • No-code ML model training using Snowflake compute, either on a Snowpark-optimized warehouse with Snowflake’s snowflake-ml-python package, or on a Snowpark Container Services compute pool with Snowflake ML Jobs
    • Two-class classification, multi-class classification, and regression tasks on tabular data stored in a Snowflake table
    • Hyperparameter tuning using Random Search on certain algorithm parameters
    • Track hyperparameter tuning and model performance through Dataiku’s Experiment Tracking MLflow integration
    • Output the best trained model to a Dataiku Saved Model in the flow, and deploy the model to a Snowflake Model Registry
  • No-code ML batch scoring in Snowflake using a trained model (from the training recipe of this plugin)
  • Macro to clean up models from the Snowflake Model Registry that have been deleted from the Dataiku project

Limitations

  • If doing two-class or multi-class classification, convert your target column to numeric (0,1) or (0, 1, 2, 3, 4) before using this plugin (this is a SnowparkML requirement)
  • Integer-typed columns cannot contain missing values. If an int column has missing values, convert its type to double in a prior recipe (this is an MLflow requirement; see the sketch after this list)
  • If you want to treat a numeric column as categorical, change its storage type to string in a prior recipe
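
For example, a minimal Snowpark sketch of these conversions, run in a Python recipe upstream of the training recipe (the connection, table, and column names are placeholders, and the session is obtained through Dataiku’s Snowpark integration):

from dataiku.snowpark import DkuSnowpark
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import DoubleType, StringType

session = DkuSnowpark().get_session("my_snowflake_connection")  # placeholder connection name
df = session.table("MY_TRAINING_TABLE")                         # placeholder table name

# Int column with missing values -> double (MLflow requirement)
df = df.with_column("AGE", col("AGE").cast(DoubleType()))
# Numeric column to be treated as categorical -> string
df = df.with_column("ZIP_CODE", col("ZIP_CODE").cast(StringType()))

df.write.save_as_table("MY_TRAINING_TABLE_PREPARED", mode="overwrite")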

Snowflake Resources and Permissions

  • Must have a Snowflake connection; the plugin recipe’s Input + Output tables must be in the same Snowflake connection
  • The plugin uses the Snowflake role of the Input + Output tables’ connection to access all other Snowflake resources
  • The plugin uses a backend runtime environment of a Snowpark Container Services compute pool or Snowpark-optimized warehouse
    • Snowpark Container Services Compute Pool: Snowflake role must have the USAGE permission on the compute pool
    • Snowpark-optimized Warehouse: Snowflake role must have the USAGE permission on the warehouse
  • Snowflake role must have the CREATE MODEL privilege on the schema used in the Snowflake connection for Input + Output tables (see the grant sketch after this list)
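
A rough sketch of the corresponding grants, executed here through a Snowpark session (the role, compute pool, warehouse, and schema names are placeholders; run these with a role that is allowed to manage grants):

from snowflake.snowpark import Session

# Placeholder connection parameters for an admin session
session = Session.builder.configs({"account": "<account>", "user": "<admin_user>",
                                   "password": "<password>", "role": "SECURITYADMIN"}).create()

role = "DATAIKU_ROLE"  # the role used by the Dataiku Snowflake connection (placeholder)
for statement in [
    f"GRANT USAGE ON COMPUTE POOL MY_COMPUTE_POOL TO ROLE {role}",   # Container Runtime backend
    f"GRANT USAGE ON WAREHOUSE MY_SNOWPARK_OPT_WH TO ROLE {role}",   # warehouse backend
    f"GRANT CREATE MODEL ON SCHEMA MY_DB.MY_SCHEMA TO ROLE {role}",  # Model Registry deployment
]:
    session.sql(statement).collect()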

Other Requirements

  • Python 3.10 available on the instance

Setup

Build the plugin code environment

Right after installing the plugin, you will need to build its code environment. Note that this plugin requires Python version 3.10 and that conda is not supported.

Build ANOTHER Python 3.10 code environment

Name it “py_310_snowpark”. Under “Core packages versions”, choose Pandas 2.3. Add these packages, then update the environment:

snowflake-ml-python==1.20.0
numpy==1.26.4
mlflow==2.18.0
scikit-learn==1.7.2
xgboost==3.1.2
lightgbm==4.6.0
statsmodels==0.14.6
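
Once the environment has built, a quick sanity check you could run in a notebook using this environment, to confirm the pinned versions resolved:

import sys
from importlib.metadata import version

print("python", sys.version.split()[0])  # expect 3.10.x
for pkg in ["snowflake-ml-python", "numpy", "mlflow", "scikit-learn",
            "xgboost", "lightgbm", "statsmodels"]:
    print(pkg, version(pkg))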

Usage

Training models with Snowflake ML

Create the train plugin recipe and outputs

Click once on the input dataset (with known, labeled target values), then find the Visual Snowflake ML plugin:

Click the train recipe:

Create two output Snowflake tables to hold the generated output train/test sets, and one managed folder to hold saved models (the folder’s connection doesn’t matter):

Design your ML training process and run the recipe

Make sure you fill out all required fields

Target

  • Prediction type: two-class classification, multi-class classification, or regression
  • Target: the name of your target column to predict
  • Class weights: choose to enable (recommended) or disable class weights. Class weights are row weights inversely proportional to the number of rows in each row’s target class; they help with class imbalance issues (see the sketch after this list)
  • Model name: the name of your model. This will be the name of the model created in your Dataiku project flow after running the train recipe. The best trained model will be registered in Snowflake model registry as DATAIKU_PROJECT_ID_MODEL_NAME.
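
To illustrate the class weights option, a small scikit-learn sketch of inverse-frequency (“balanced”) row weights on a toy imbalanced target (the data is made up; the plugin computes the equivalent weights for you):

import pandas as pd
from sklearn.utils.class_weight import compute_sample_weight

y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # toy target: 8 negatives, 2 positives
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights)   # minority-class rows get larger weights: 0.625 for class 0, 2.5 for class 1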

Train/Test Set

  • Time ordering: order the train and test sets by a datetime column (the test set will contain more recent timestamps than the train set; see the split sketch after this list)
  • Train ratio: the fraction of rows assigned to the train set, with the remainder going to the test set. 0.8 is a good start
  • Splitting random seed: set this (to any integer) to maintain consistent train/test sets over multiple training runs
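
A pandas sketch of what a time-ordered split with a 0.8 train ratio amounts to (the column names and data are made up; the plugin performs the split in Snowflake):

import pandas as pd

df = pd.DataFrame({"ORDER_DATE": pd.date_range("2024-01-01", periods=10, freq="D"),
                   "TARGET": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]})
train_ratio = 0.8
df_sorted = df.sort_values("ORDER_DATE")      # time ordering
split_at = int(len(df_sorted) * train_ratio)
train_df = df_sorted.iloc[:split_at]          # older rows
test_df = df_sorted.iloc[split_at:]           # the most recent rows become the test set
print(len(train_df), len(test_df))            # 8 2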

Metrics

  • Optimize model hyperparameters for: metric to optimize model hyperparameters for while training

Features handling

For each column in the input training dataset, choose whether to include the column as an input feature for the model training process, and how to handle that feature. Note: if you want to treat a numeric column as categorical, change its storage type to string in a prior recipe. A preprocessing sketch follows the list below.

  • Status: whether to include or exclude the feature
  • Encoding / Rescaling: choose how to encode categorical features, and rescale numeric features
  • Missing values: choose how to deal with missing values
  • Constant: if “Constant” chosen for missingness imputation, the value to impute
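
For reference, a scikit-learn sketch of the kind of preprocessing these settings map to (the column names and choices are hypothetical; the plugin assembles the equivalent pipeline from your selections):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["AGE", "INCOME"]          # included numeric columns (hypothetical)
categorical_features = ["REGION", "SEGMENT"]  # included categorical columns (hypothetical)

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),     # missing values: median
                      ("scale", StandardScaler())]),                    # rescaling: standard
     numeric_features),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),  # dummy encoding
     categorical_features),
])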

Algorithms

  • Select each algorithm you’d like to train
  • For each algorithm, enter min and max values for each hyperparameter

Hyperparameters

This training recipe will kick off a Randomized Search process with 3-fold cross-validation in Snowflake to find the best hyperparameter combination for each algorithm selected (see the sketch below).

  • Search space limit: the number of hyperparameter combinations to try for each algorithm within the min/max values chosen
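
To make the search concrete, a scikit-learn sketch of randomized search with 3-fold cross-validation over min/max hyperparameter ranges (the plugin runs the equivalent search in Snowflake; the data, ranges, and metric here are made up):

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)   # stand-in training data

param_distributions = {                     # min/max values become sampling ranges
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.29),   # roughly [0.01, 0.30]
}
search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=4,           # "Search space limit": combinations tried per algorithm
    cv=3,               # 3-fold cross-validation
    scoring="roc_auc",  # "Optimize model hyperparameters for"
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)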

Runtime environment

  • Compute Backend: whether to use Snowflake Container Runtime (Snowpark Container Services) or a Snowflake warehouse for this ML training job
  • (If Container Runtime) Compute Pool: the Snowpark Container Services compute pool to use for ML training.
  • (If Container Runtime) Snowflake Stage: the Snowflake Stage where model training functions will be uploaded prior to execution on the compute pool.
  • (If Warehouse) Snowflake Warehouse: the warehouse to use for ML training. You must use a Snowpark-optimized Snowflake warehouse. A multi-cluster warehouse will allow for parallelized hyperparameter tuning.
  • Model Registry: deploy the best trained model to a Snowflake ML Model Registry (in the same database and schema as the input and output datasets; see the Snowflake Resources and Permissions section above). This is required in order to run a subsequent Visual Snowpark ML Score recipe, which runs batch inference in Snowpark using the deployed model. A registry sketch follows this list.
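
For reference, registering a model with snowflake-ml-python’s Registry API looks roughly like the sketch below; the plugin does this for you when the Model Registry option is enabled (session is assumed to be an existing Snowpark session, best_model an already-fitted pipeline, and the other names are placeholders):

from snowflake.ml.registry import Registry

registry = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")
model_version = registry.log_model(
    best_model,                                  # the best trained pipeline (assumed to exist)
    model_name="DATAIKU_MYPROJECT_MY_MODEL",     # DATAIKU_PROJECT_ID_MODEL_NAME
    version_name="V1",
)
print(registry.show_models())                    # pandas DataFrame of registered models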

Outputs

After running the train recipe successfully, you can find all model training and hyperparameter tuning information, including model performance metrics, in Dataiku’s Experiment Tracking tab.

The best model will be deployed to the flow. If you selected “Deploy to Snowflake ML Model Registry”, the model will also be deployed to Snowflake’s Model Registry.

Scoring New Records with your Trained Model and Snowpark ML

Note: you can use a regular Dataiku Score recipe with the Snowpark ML trained model; however, the inference will happen in a local Python kernel, not in Snowflake. For classification models where you did NOT disable class weights, you’ll need to add a SAMPLE_WEIGHTS column to your input dataset before a regular Dataiku Score recipe (this column can have all empty values).

In order to run batch inference in Snowpark, use this plugin's Visual Snowpark ML Score recipe. You must have checked “Deploy to Snowflake ML Model Registry” when training the model for this to work.
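
Under the hood, batch scoring against the registry looks roughly like this snowflake-ml-python sketch (session is assumed to be an existing Snowpark session; the model, table, and database/schema names are placeholders):

from snowflake.ml.registry import Registry

registry = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")
model_version = registry.get_model("DATAIKU_MYPROJECT_MY_MODEL").default   # or .version("V1")

input_df = session.table("NEW_RECORDS_TO_SCORE")                  # Snowpark DataFrame to score
scored_df = model_version.run(input_df, function_name="predict")  # inference runs in Snowflake
scored_df.write.save_as_table("SCORED_RECORDS", mode="overwrite")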

Create the score plugin recipe and outputs

Click once on the trained model and input dataset you’d like to make predictions for:

Click the score recipe:

Make sure you’ve selected the trained model and Snowflake table for scoring as inputs. Then create one output Snowflake table to hold the scored dataset. Then click “Create”:

Optionally type the name of a Snowpark-optimized Snowflake warehouse to use for scoring. Leave empty to use the Snowflake connection’s default warehouse. Click “Run”.

Your flow should look like this, and the output scored dataset should have prediction column(s):

Clear SnowparkML Registry Models Macro

When deploying trained models to a Snowflake Model Registry, we want to ensure that any trained models deleted from the Dataiku UI (by deleting a green diamond saved model in the flow, or a saved model version underneath it) are also deleted from the Snowflake Model Registry.

This macro checks for any models in the Snowflake Model Registry that carry the Dataiku tag and the current project key tag; if a model has been deleted from the Dataiku project, the macro deletes it from the Snowflake Model Registry.

You can list the models the macro would delete by unchecking the “Perform deletion” box.

The macro will show the model name and version deleted (or simulated) in the resulting list.
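
The macro’s logic is roughly the following sketch (session is assumed to be an existing Snowpark session; the project prefix and the set of surviving models stand in for the actual Dataiku tag checks):

from snowflake.ml.registry import Registry

registry = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")
models_still_in_project = {"DATAIKU_MYPROJECT_CHURN_MODEL"}   # hypothetical surviving models
perform_deletion = False                                      # mirrors the "Perform deletion" box

for name in registry.show_models()["name"]:
    if name.startswith("DATAIKU_MYPROJECT") and name not in models_still_in_project:
        print("Deleting" if perform_deletion else "Would delete", name)
        if perform_deletion:
            registry.delete_model(name)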

Release Notes

See the changelog for a history of notable changes to this plugin.

License

This plugin is distributed under the Apache License version 2.0.
