With this plugin, you can train machine learning models and then use them to score new records, all within your Snowflake environment. This no-code UI allows data scientists and domain experts to quickly train models, track experiments, and visualize model performance.
- No-code ML model training using Snowflake compute, either using a Snowpark optimized warehouse with Snowflake’s snowflake-ml-python package, or using a Snowpark Container Services compute pool with Snowflake ML Jobs
- Two-class classification, multi-class classification, and regression tasks on tabular data stored in a Snowflake table
- Hyperparameter tuning using Random Search on certain algorithm parameters
- Track hyperparameter tuning and model performance through Dataiku’s Experiment Tracking MLflow integration
- Output the best trained model to a Dataiku Saved Model in the flow, and deploy the model to a Snowflake Model Registry
- No-code ML batch scoring in Snowflake using a trained model (from the training recipe of this plugin)
- Macro to clean up models from the Snowflake Model Registry that have been deleted from the Dataiku project
- If doing two-class or multi-class classification, convert your target column to numeric (0, 1) or (0, 1, 2, 3, 4) before using this plugin (this is a SnowparkML requirement)
- Int-type columns cannot have missing values. If you have an int column with missing values, convert its type to double before this recipe (this is an MLflow requirement)
- If you want to treat a numeric column as categorical, change its storage type to string in a prior recipe (a small preparation sketch follows this list)
- Must have a Snowflake connection. Plugin recipe Input + Output tables must be in the same Snowflake connection.
- The plugin uses the Snowflake role used for the Input + Output tables to access all other Snowflake resources
- The plugin uses a backend runtime environment of a Snowpark Container Services compute pool or Snowpark-optimized warehouse
- Snowpark Container Services Compute Pool: Snowflake role must have the USAGE permission on the compute pool
- Snowpark-optimized Warehouse: Snowflake role must have the USAGE permission on the warehouse
- Snowflake role must have the CREATE MODEL privilege on the schema used in the Snowflake connection for Input + Output tables
- Python 3.10 available on the instance
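For example, a small Python (pandas) recipe placed before this plugin's train recipe could handle the data-preparation requirements above; the dataset and column names below are placeholders:

```python
# Hypothetical Python recipe preparing a dataset for the train recipe
# (dataset and column names are placeholders).
import dataiku

df = dataiku.Dataset("labeled_input").get_dataframe()

# Two-class target as numeric 0/1 (SnowparkML requirement)
df["TARGET"] = df["TARGET"].map({"no": 0, "yes": 1})

# Int column with missing values -> double (MLflow requirement)
df["AGE"] = df["AGE"].astype("float64")

# Numeric column to be treated as categorical -> string
df["ZIP_CODE"] = df["ZIP_CODE"].astype(str)

dataiku.Dataset("labeled_input_prepared").write_with_schema(df)
```

The same casts can also be done in a visual Prepare recipe or a SQL recipe instead.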
Right after installing the plugin, you will need to build its code environment. Note that this plugin requires Python version 3.10 and that conda is not supported.
Name it “py_310_snowpark”. Under “Core packages versions”, choose Pandas 2.3. Add these packages, then update the environment:
snowflake-ml-python==1.20.0
numpy==1.26.4
mlflow==2.18.0
scikit-learn==1.7.2
xgboost==3.1.2
lightgbm==4.6.0
statsmodels==0.14.6
Click once on the input dataset (with known, labeled target values), then find the Visual Snowflake ML plugin:
Click the train recipe:
Create two output Snowflake tables to hold the generated output train/test sets, and one managed folder to hold saved models (connection doesn’t matter):
Make sure you fill out all required fields
Target
- Prediction type: two-class classification, multi-class classification, or regression
- Target: the name of your target column to predict
- Class weights: choose to enable (recommended) or disable class weights. Class weights are row weights inversely proportional to the frequency of each row’s target class; they help with class imbalance issues (a small illustration follows this list)
- Model name: the name of your model. This will be the name of the model created in your Dataiku project flow after running the train recipe. The best trained model will be registered in Snowflake model registry as DATAIKU_PROJECT_ID_MODEL_NAME.
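To illustrate what “inversely proportional to class frequency” means in practice (this is an illustration of the concept only, not the plugin’s internal code):

```python
# Illustration only: per-row weights inversely proportional to the frequency
# of each row's target class (sklearn-style "balanced" weights).
import pandas as pd

def class_weight_column(target: pd.Series) -> pd.Series:
    counts = target.value_counts()
    weights = len(target) / (len(counts) * counts)
    return target.map(weights)

y = pd.Series([0] * 90 + [1] * 10)       # 90/10 imbalanced binary target
w = class_weight_column(y)
print(w.groupby(y).first())              # class 0 -> ~0.56, class 1 -> 5.0
```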
Train/Test Set
- Time ordering: order the train and test sets by a datetime column (the test set will contain more recent timestamps than the train set)
- Train ratio: the fraction of rows assigned to the train set (the rest go to the test set); 0.8 is a good starting point
- Splitting random seed: set this (to any integer) to keep the train/test split consistent across multiple training runs (both split options are sketched below)
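The two split strategies can be pictured like this (an illustrative sketch with hypothetical column names, not the recipe’s internal code):

```python
# Illustrative sketch of the two split strategies (column names are hypothetical).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "EVENT_TS": pd.date_range("2024-01-01", periods=100, freq="D"),
    "FEATURE": range(100),
    "TARGET": [0, 1] * 50,
})

# Time ordering: the most recent 20% of rows become the test set
df_sorted = df.sort_values("EVENT_TS")
cutoff = int(len(df_sorted) * 0.8)
train_time, test_time = df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]

# Random split with a fixed seed: reproducible across training runs
train_rand, test_rand = train_test_split(df, train_size=0.8, random_state=42)
```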
Metrics
- Optimize model hyperparameters for: the metric to optimize during hyperparameter tuning
Features handling
For each column in the input training dataset, choose whether to include the column as an input feature for model training, and how to handle it. Note: if you want to treat a numeric column as categorical, change its storage type to string in a prior recipe. A rough scikit-learn equivalent of these choices is sketched after this list.
- Status: whether to include or exclude the feature
- Encoding / Rescaling: choose how to encode categorical features, and rescale numeric features
- Missing values: choose how to deal with missing values
- Constant: if “Constant” chosen for missingness imputation, the value to impute
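A rough scikit-learn equivalent of these per-feature choices (illustrative only; the plugin builds the preprocessing pipeline for you, and the column names below are hypothetical):

```python
# Rough scikit-learn equivalent of the per-feature choices above
# (illustrative only; column names are hypothetical).
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["AGE", "INCOME"]
categorical_features = ["REGION"]

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),       # "Missing values" choice
    ("rescale", StandardScaler()),                       # "Rescaling" choice
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),  # "Constant" imputation
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # "Encoding" choice
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])
```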
Algorithms
- Select each algorithm you’d like to train
- For each algorithm, enter min and max values for each hyperparameter
Hyperparameters
This training recipe kicks off a Randomized Search with 3-fold cross-validation in Snowflake to find the best hyperparameter combination for each selected algorithm (a minimal illustration follows below)
- Search space limit: the number of hyperparameter combinations to try for each algorithm within the min/max values chosen
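Conceptually, this is similar to the following scikit-learn setup (an illustrative sketch only; the plugin runs the equivalent search inside Snowflake with snowflake-ml-python):

```python
# Illustrative sketch: randomized search with 3-fold CV over bounded ranges
# (the plugin runs the equivalent inside Snowflake; values here are examples).
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={
        "n_estimators": randint(50, 500),   # min/max entered in the recipe UI
        "max_depth": randint(3, 12),
    },
    n_iter=10,          # "Search space limit": combinations tried per algorithm
    cv=3,               # 3-fold cross-validation
    scoring="roc_auc",  # "Optimize model hyperparameters for"
)
# search.fit(X_train, y_train) would return the best combination found
```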
Runtime environment
- Compute Backend: whether to use Snowflake Container Runtime (Snowpark Container Services) or a Snowflake warehouse for this ML training job
- (If Container Runtime) Compute Pool: the Snowpark Container Services compute pool to use for ML training.
- (If Container Runtime) Snowflake Stage: the Snowflake Stage where model training functions will be uploaded prior to execution on the compute pool.
- (If Warehouse) Snowflake Warehouse: the warehouse to use for ML training. You must use a Snowpark-optimized Snowflake warehouse. A multi-cluster warehouse will allow for parallelized hyperparameter tuning.
- Model Registry: deploy the best trained model to a Snowflake ML Model Registry (in the same database and schema as the input and output datasets; see Snowflake access requirements here). This is required in order to run a subsequent Visual Snowpark ML Score recipe, which performs batch inference in Snowpark using the deployed model. See the inspection sketch below.
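If you want to double-check what was deployed, you can inspect the registry with snowflake-ml-python; the following is an illustrative sketch with placeholder connection parameters and names:

```python
# Illustrative sketch: inspect the Snowflake Model Registry with snowflake-ml-python
# (connection parameters and names are placeholders).
from snowflake.ml.registry import Registry
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "...", "user": "...", "password": "...",
    "database": "MY_DB", "schema": "MY_SCHEMA", "warehouse": "MY_WH",
}).create()

reg = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")
print(reg.show_models())                             # models registered in this schema
model = reg.get_model("DATAIKU_MYPROJECT_MY_MODEL")  # DATAIKU_<PROJECT_KEY>_<MODEL_NAME>
print(model.show_versions())
```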
After running the train recipe successfully, you can find all model training and hyperparameter tuning information, including model performance metrics, in Dataiku’s Experiment Tracking tab.
The best model will be deployed to the flow. If you selected “Deploy to Snowflake ML Model Registry”, the model will also be deployed to Snowflake’s Model Registry.
Note: you can use a regular Dataiku Score recipe with the Snowpark ML trained model; however, the inference will happen in a local Python kernel, not in Snowflake. Also note that for classification models where you did NOT disable class weights, you'll need to add a SAMPLE_WEIGHTS column to your input dataset before a regular Dataiku Score recipe (this column can have all empty values), as sketched below.
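For example, a minimal Python recipe step that adds the (empty) SAMPLE_WEIGHTS column could look like this; the dataset names are placeholders:

```python
# Hypothetical Python recipe: add an all-empty SAMPLE_WEIGHTS column so a regular
# Dataiku Score recipe accepts the dataset (dataset names are placeholders).
import dataiku
import numpy as np

df = dataiku.Dataset("records_to_score").get_dataframe()
df["SAMPLE_WEIGHTS"] = np.nan   # values can stay empty; the column only needs to exist
dataiku.Dataset("records_to_score_prepared").write_with_schema(df)
```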
In order to run batch inference in Snowpark, use this plugin's Visual Snowpark ML Score recipe. You must have checked “Deploy to Snowflake ML Model Registry” when training the model for this to work.
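For reference, batch inference against a model registered in the Snowflake Model Registry looks roughly like the following with snowflake-ml-python. This is an illustrative sketch with placeholder names, not the Score recipe’s actual implementation, and it assumes an existing Snowpark `session` as in the registry sketch above:

```python
# Illustrative sketch of Snowpark batch inference against a registered model
# (placeholder names; assumes an existing snowflake.snowpark.Session `session`).
from snowflake.ml.registry import Registry

reg = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")
model_version = reg.get_model("DATAIKU_MYPROJECT_MY_MODEL").default   # default model version

input_df = session.table("MY_DB.MY_SCHEMA.RECORDS_TO_SCORE")          # Snowpark DataFrame
predictions = model_version.run(input_df, function_name="predict")    # runs inside Snowflake
predictions.write.save_as_table("MY_DB.MY_SCHEMA.RECORDS_SCORED", mode="overwrite")
```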
Click once on the trained model and input dataset you’d like to make predictions for:
Click the score recipe:
Make sure you’ve selected the trained model and the Snowflake table to score as inputs. Then create one output Snowflake table to hold the scored dataset and click “Create”:
Optionally type the name of a Snowpark-optimized Snowflake warehouse to use for scoring. Leave empty to use the Snowflake connection’s default warehouse. Click “Run”.
Your flow should look like this, and the output scored dataset should have prediction column(s):
When deploying trained models to a Snowflake Model Registry, we want to ensure that any trained models deleted from the Dataiku UI (by deleting a full green diamond saved model, or a saved model version underneath it) are also deleted from the Snowflake Model Registry.
This macro checks for any models in the Snowflake Model Registry that have the Dataiku tag and the current project key tag; if a model has been deleted from the Dataiku project, the macro deletes it from the Snowflake Model Registry.
You can list the models the macro would delete by un-checking the “Perform deletion” box.
The macro will show the model name and version deleted (or simulated) in the resulting list.
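Under the hood, listing and deleting registry models with snowflake-ml-python looks roughly like this (an illustrative sketch with placeholder names, not the macro’s actual implementation; it assumes an existing Snowpark `session`):

```python
# Illustrative sketch: list models in the registry and delete one
# (placeholder names; assumes an existing snowflake.snowpark.Session `session`).
from snowflake.ml.registry import Registry

reg = Registry(session=session, database_name="MY_DB", schema_name="MY_SCHEMA")

for name in reg.show_models()["name"]:
    print("registered model:", name)

perform_deletion = False          # mirrors the macro's "Perform deletion" checkbox
if perform_deletion:
    reg.delete_model("DATAIKU_MYPROJECT_OLD_MODEL")
```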
See the changelog for a history of notable changes to this plugin.
This plugin is distributed under the Apache License version 2.0.






