
Visual Snowflake ML Plugin

(Diagram: Visual Snowflake ML plugin overview)

With this plugin, you can train machine learning models and then use them to score new records, all within your Snowflake environment. This no-code UI allows data scientists and domain experts to quickly train models, track experiments, and visualize model performance.

Capabilities

  • No-code ML model training using Snowflake compute, either on a Snowpark-optimized warehouse with Snowflake’s snowflake-ml-python package, or on a Snowpark Container Services compute pool with Snowflake ML Jobs
    • Two-class classification, multi-class classification, and regression tasks on tabular data stored in a Snowflake table
    • Hyperparameter tuning using Random Search on certain algorithm parameters
    • Track hyperparameter tuning and model performance through Dataiku’s Experiment Tracking MLflow integration
    • Output the best trained model to a Dataiku Saved Model in the flow, and deploy the model to a Snowflake Model Registry
  • No-code ML batch scoring in Snowflake using a trained model (from the training recipe of this plugin)
  • Macro to clean up models from the Snowflake Model Registry that have been deleted from the Dataiku project

Limitations

  • If doing two-class or multi-class classification, convert your target column to numeric (0,1) or (0, 1, 2, 3, 4) before using this plugin (this is a SnowparkML requirement)
  • Integer-typed columns cannot contain missing values. If an int column has missing values, convert its type to double in a prior recipe (this is an MLflow requirement; see the sketch after this list)
  • If you want to treat a numeric column as categorical, change its storage type to string in a prior recipe
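
For example, a minimal Snowpark sketch of these conversions, run in a Python recipe upstream of the training recipe (the connection, table, and column names are placeholders, and the session is obtained through Dataiku’s Snowpark integration):

from dataiku.snowpark import DkuSnowpark
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import DoubleType, StringType

session = DkuSnowpark().get_session("my_snowflake_connection")  # placeholder connection name
df = session.table("MY_TRAINING_TABLE")                         # placeholder table name

# Int column with missing values -> double (MLflow requirement)
df = df.with_column("AGE", col("AGE").cast(DoubleType()))
# Numeric column to be treated as categorical -> string
df = df.with_column("ZIP_CODE", col("ZIP_CODE").cast(StringType()))

df.write.save_as_table("MY_TRAINING_TABLE_PREPARED", mode="overwrite")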

Snowflake Resources and Permissions

  • Must have a Snowflake connection; the plugin recipe’s Input + Output tables must be in the same Snowflake connection
  • The plugin uses the Snowflake role of the Input + Output tables’ connection to access all other Snowflake resources
  • The plugin uses a backend runtime environment of a Snowpark Container Services compute pool or Snowpark-optimized warehouse
    • Snowpark Container Services Compute Pool: Snowflake role must have the USAGE permission on the compute pool
    • Snowpark-optimized Warehouse: Snowflake role must have the USAGE permission on the warehouse
  • Snowflake role must have the CREATE MODEL privilege on the schema used in the Snowflake connection for Input + Output tables (see the grant sketch after this list)
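
A rough sketch of the corresponding grants, executed here through a Snowpark session (the role, compute pool, warehouse, and schema names are placeholders; run these with a role that is allowed to manage grants):

from snowflake.snowpark import Session

# Placeholder connection parameters for an admin session
session = Session.builder.configs({"account": "<account>", "user": "<admin_user>",
                                   "password": "<password>", "role": "SECURITYADMIN"}).create()

role = "DATAIKU_ROLE"  # the role used by the Dataiku Snowflake connection (placeholder)
for statement in [
    f"GRANT USAGE ON COMPUTE POOL MY_COMPUTE_POOL TO ROLE {role}",   # Container Runtime backend
    f"GRANT USAGE ON WAREHOUSE MY_SNOWPARK_OPT_WH TO ROLE {role}",   # warehouse backend
    f"GRANT CREATE MODEL ON SCHEMA MY_DB.MY_SCHEMA TO ROLE {role}",  # Model Registry deployment
]:
    session.sql(statement).collect()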

Other Requirements

  • Python 3.10 available on the instance

Setup

Build the plugin code environment

Right after installing the plugin, you will need to build its code environment. Note that this plugin requires Python version 3.10 and that conda is not supported.

Build ANOTHER Python 3.10 code environment

Name it “py_310_snowpark”. Under “Core packages versions”, choose Pandas 2.3. Add these packages, then update the environment:

snowflake-ml-python==1.20.0
numpy==1.26.4
mlflow==2.18.0
scikit-learn==1.7.2
xgboost==3.1.2
lightgbm==4.6.0
statsmodels==0.14.6
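
Once the environment has built, a quick sanity check you could run in a notebook using this environment, to confirm the pinned versions resolved:

import sys
from importlib.metadata import version

print("python", sys.version.split()[0])  # expect 3.10.x
for pkg in ["snowflake-ml-python", "numpy", "mlflow", "scikit-learn",
            "xgboost", "lightgbm", "statsmodels"]:
    print(pkg, version(pkg))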

Usage

Training models with Snowflake ML

Create the train plugin recipe and outputs

Click once on the input dataset (with known, labeled target values), then find the Visual Snowflake ML plugin:

Click the train recipe:

Create two output Snowflake tables to hold the generated output train/test sets, and one managed folder to hold saved models (the folder’s connection doesn’t matter):

Design your ML training process and run the recipe

Make sure you fill out all required fields

Target

  • Prediction type: two-class classification, multi-class classification, or regression
  • Target: the name of your target column to predict
  • Class weights: choose to enable (recommended) or disable class weights. Class weights are row weights inversely proportional to the number of rows in each row’s target class; they help with class imbalance issues (see the sketch after this list)
  • Model name: the name of your model. This will be the name of the model created in your Dataiku project flow after running the train recipe. The best trained model will be registered in Snowflake model registry as DATAIKU_PROJECT_ID_MODEL_NAME.
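
To illustrate the class weights option, a small scikit-learn sketch of inverse-frequency (“balanced”) row weights on a toy imbalanced target (the data is made up; the plugin computes the equivalent weights for you):

import pandas as pd
from sklearn.utils.class_weight import compute_sample_weight

y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # toy target: 8 negatives, 2 positives
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights)   # minority-class rows get larger weights: 0.625 for class 0, 2.5 for class 1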

Train/Test Set

  • Time ordering: order the train and test sets by a datetime column (the test set will contain more recent timestamps than the train set; see the split sketch after this list)
  • Train ratio: the fraction of rows assigned to the train set, with the remainder going to the test set. 0.8 is a good start
  • Splitting random seed: set this (to any integer) to maintain consistent train/test sets over multiple training runs
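
A pandas sketch of what a time-ordered split with a 0.8 train ratio amounts to (the column names and data are made up; the plugin performs the split in Snowflake):

import pandas as pd

df = pd.DataFrame({"ORDER_DATE": pd.date_range("2024-01-01", periods=10, freq="D"),
                   "TARGET": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]})
train_ratio = 0.8
df_sorted = df.sort_values("ORDER_DATE")      # time ordering
split_at = int(len(df_sorted) * train_ratio)
train_df = df_sorted.iloc[:split_at]          # older rows
test_df = df_sorted.iloc[split_at:]           # the most recent rows become the test set
print(len(train_df), len(test_df))            # 8 2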

Metrics

  • Optimize model hyperparameters for: metric to optimize model hyperparameters for while training

Features handling

For each column in the input training dataset, choose whether to include the column as an input feature for the model training process, and how to handle that feature. Note: if you want to treat a numeric column as categorical, change its storage type to string in a prior recipe. A preprocessing sketch follows the list below.

  • Status: whether to include or exclude the feature
  • Encoding / Rescaling: choose how to encode categorical features, and rescale numeric features
  • Missing values: choose how to deal with missing values
  • Constant: if “Constant” chosen for missingness imputation, the value to impute
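
For reference, a scikit-learn sketch of the kind of preprocessing these settings map to (the column names and choices are hypothetical; the plugin assembles the equivalent pipeline from your selections):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["AGE", "INCOME"]          # included numeric columns (hypothetical)
categorical_features = ["REGION", "SEGMENT"]  # included categorical columns (hypothetical)

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),     # missing values: median
                      ("scale", StandardScaler())]),                    # rescaling: standard
     numeric_features),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),  # dummy encoding
     categorical_features),
])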

Algorithms

  • Select each algorithm you’d like to train
  • For each algorithm, enter min and max values for each hyperparameter

Hyperparameters

This training recipe will kick off a Randomized Search process with 3-fold cross-validation in Snowflake to find the best hyperparameter combination for each algorithm selected (see the sketch below).

  • Search space limit: the number of hyperparameter combinations to try for each algorithm within the min/max values chosen
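
To make the search concrete, a scikit-learn sketch of randomized search with 3-fold cross-validation over min/max hyperparameter ranges (the plugin runs the equivalent search in Snowflake; the data, ranges, and metric here are made up):

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)   # stand-in training data

param_distributions = {                     # min/max values become sampling ranges
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.29),   # roughly [0.01, 0.30]
}
search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=4,           # "Search space limit": combinations tried per algorithm
    cv=3,               # 3-fold cross-validation
    scoring="roc_auc",  # "Optimize model hyperparameters for"
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)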

Runtime environment

  • Compute Backend: whether to use Snowflake Container Runtime (Snowpark Container Services) or a Snowflake warehouse for this ML training job
  • (If Container Runtime) Compute Pool: the Snowpark Container Services compute pool to use for ML training.
  • (If Container Runtime) Snowflake Stage: the Snowflake Stage where model training functions will be uploaded prior to execution on the compute pool.
  • (If Warehouse) Snowflake Warehouse: the warehouse to use for ML training. You must use a Snowpark-optimized Snowflake warehouse. A multi-cluster warehouse will allow for parallelized hyperparameter tuning.
  • Model Registry: deploy the best trained model to a Snowflake ML Model Registry (in the same database and schema as the input and output datasets; see the Snowflake Resources and Permissions section above). This is required in order to run a subsequent Visual Snowpark ML Score recipe, which runs batch inference in Snowpark using the deployed model. A registry sketch follows this list.
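
For reference, registering a model with snowflake-ml-python’s Registry API looks roughly like the sketch below; the plugin does this for you when the Model Registry option is enabled (session is assumed to be an existing Snowpark session, best_model an already-fitted pipeline, and the other names are placeholders):

from snowflake.ml.registry import Registry

registry = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")
model_version = registry.log_model(
    best_model,                                  # the best trained pipeline (assumed to exist)
    model_name="DATAIKU_MYPROJECT_MY_MODEL",     # DATAIKU_PROJECT_ID_MODEL_NAME
    version_name="V1",
)
print(registry.show_models())                    # pandas DataFrame of registered models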

Outputs

After running the train recipe successfully, you can find all model training and hyperparameter tuning information, including model performance metrics, in Dataiku’s Experiment Tracking tab.

The best model will be deployed to the flow. If you selected “Deploy to Snowflake ML Model Registry”, the model will also be deployed to Snowflake’s Model Registry.

Scoring New Records with your Trained Model and Snowpark ML

Note: you can use a regular Dataiku Score recipe with the Snowpark ML trained model; however, the inference will happen in a local Python kernel, not in Snowflake. For classification models where you did NOT disable class weights, you’ll need to add a SAMPLE_WEIGHTS column to your input dataset before a regular Dataiku Score recipe (this column can have all empty values).

In order to run batch inference in Snowpark, use this plugin's Visual Snowpark ML Score recipe. You must have checked “Deploy to Snowflake ML Model Registry” when training the model for this to work.
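
Under the hood, batch scoring against the registry looks roughly like this snowflake-ml-python sketch (session is assumed to be an existing Snowpark session; the model, table, and database/schema names are placeholders):

from snowflake.ml.registry import Registry

registry = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")
model_version = registry.get_model("DATAIKU_MYPROJECT_MY_MODEL").default   # or .version("V1")

input_df = session.table("NEW_RECORDS_TO_SCORE")                  # Snowpark DataFrame to score
scored_df = model_version.run(input_df, function_name="predict")  # inference runs in Snowflake
scored_df.write.save_as_table("SCORED_RECORDS", mode="overwrite")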

Create the score plugin recipe and outputs

Click once on the trained model and input dataset you’d like to make predictions for:

Click the score recipe:

Make sure you’ve selected the trained model and Snowflake table for scoring as inputs. Then create one output Snowflake table to hold the scored dataset. Then click “Create”:

Optionally type the name of a Snowpark-optimized Snowflake warehouse to use for scoring. Leave empty to use the Snowflake connection’s default warehouse. Click “Run”.

Your flow should look like this, and the output scored dataset should have prediction column(s):

Clear SnowparkML Registry Models Macro

When deploying trained models to a Snowflake Model Registry, we want to ensure that any trained models deleted from the Dataiku UI (by deleting a green diamond saved model in the flow, or a saved model version underneath it) are also deleted from the Snowflake Model Registry.

This macro checks for any models in the Snowflake Model Registry that carry the Dataiku tag and the current project key tag; if a model has been deleted from the Dataiku project, the macro deletes it from the Snowflake Model Registry.

You can list the models the macro would delete by unchecking the “Perform deletion” box.

The macro will show the model name and version deleted (or simulated) in the resulting list.
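
The macro’s logic is roughly the following sketch (session is assumed to be an existing Snowpark session; the project prefix and the set of surviving models stand in for the actual Dataiku tag checks):

from snowflake.ml.registry import Registry

registry = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")
models_still_in_project = {"DATAIKU_MYPROJECT_CHURN_MODEL"}   # hypothetical surviving models
perform_deletion = False                                      # mirrors the "Perform deletion" box

for name in registry.show_models()["name"]:
    if name.startswith("DATAIKU_MYPROJECT") and name not in models_still_in_project:
        print("Deleting" if perform_deletion else "Would delete", name)
        if perform_deletion:
            registry.delete_model(name)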

Release Notes

See the changelog for a history of notable changes to this plugin.

License

This plugin is distributed under the Apache License version 2.0.
