Add data source integrations with Snowflake #330

Merged
sfc-gh-mwyatt merged 14 commits into snowflakedb:main from sfc-gh-dhung:dhung-snow-data-integration
Jan 7, 2026

@sfc-gh-dhung commented Dec 17, 2025

Add Snowflake data source integration

This PR adds native support for loading training data directly from Snowflake, enabling users to train models on data stored in their Snowflake data warehouse.

Features

Unified snowflake data source type with three mutually exclusive modes:

  • SQL Query (sql): Execute arbitrary SQL queries against Snowflake
  • Table (table_name): Load data directly from a Snowflake table (auto-generates a SELECT * FROM <table_name> query)
  • Dataset (dataset_uri): Load from versioned Snowflake Datasets using the snow:// URI format
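
The "mutually exclusive" rule above can be illustrated with a minimal sketch. The class name SnowflakeSourceSketch and its query() helper are hypothetical and are not the classes added by this PR; only the three field names (sql, table_name, dataset_uri) come from the description. The sketch assumes exactly one field must be set and that table mode expands to a SELECT * FROM query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnowflakeSourceSketch:
    """Hypothetical stand-in for the snowflake source config; not the PR's class."""

    sql: Optional[str] = None
    table_name: Optional[str] = None
    dataset_uri: Optional[str] = None

    def __post_init__(self) -> None:
        # Collect which of the three mode fields the user actually set.
        chosen = [
            name
            for name in ("sql", "table_name", "dataset_uri")
            if getattr(self, name) is not None
        ]
        # Assumption: exactly one mode must be selected.
        if len(chosen) != 1:
            raise ValueError(
                f"expected exactly one of sql, table_name, dataset_uri; got {chosen}"
            )
        self.mode = chosen[0]

    def query(self) -> Optional[str]:
        """Return the SQL to run, or None in dataset mode (no query involved)."""
        if self.mode == "table_name":
            # Table mode: build the query from the table name.
            return f"SELECT * FROM {self.table_name}"
        return self.sql
```

With this sketch, SnowflakeSourceSketch(table_name="DB.SCHEMA.T").query() yields "SELECT * FROM DB.SCHEMA.T", while setting two fields at once raises ValueError.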

Additional options:

  • column_mapping: Rename columns from source to target format
  • limit: Cap the number of rows loaded
  • batch_size: Configure batch size for data retrieval
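
As a rough illustration of how the three options could interact, the sketch below applies them to an in-memory iterable of row dictionaries. The function load_rows and the default batch size of 1024 are assumptions made for illustration; the PR may apply limit and batch_size elsewhere, for example inside the query or the connector.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


def load_rows(
    rows: Iterable[Row],
    column_mapping: Optional[Dict[str, str]] = None,
    limit: Optional[int] = None,
    batch_size: int = 1024,
) -> Iterator[List[Row]]:
    """Yield batches of rows with renamed columns (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    mapping = column_mapping or {}
    # limit: stop after at most `limit` rows in total.
    capped = islice(rows, limit) if limit is not None else iter(rows)
    while True:
        # batch_size: pull the next chunk of rows from the shared iterator.
        batch = [
            # column_mapping: rename source columns, keep unmapped ones as-is.
            {mapping.get(key, key): value for key, value in row.items()}
            for row in islice(capped, batch_size)
        ]
        if not batch:
            return
        yield batch
```

For five rows of the form {"TEXT": ...}, a mapping of {"TEXT": "text"}, limit=3, and batch_size=2, this yields one batch of two renamed rows followed by one batch of one.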

Configuration

Snowflake credentials can be provided via:

  • Environment variables (SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, etc.)
  • Snowflake connections config file (~/.snowflake/connections.toml)
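
A minimal sketch of resolving credentials from these two places follows. The function resolve_credentials, the precedence of environment variables over the file, and the connection name "default" are all assumptions; the PR does not state how the two sources are combined. The sketch collects any variable with the SNOWFLAKE_ prefix rather than naming specific ones, and it assumes the TOML file has one table per connection name.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

DEFAULT_CONFIG = Path.home() / ".snowflake" / "connections.toml"


def resolve_credentials(
    env: Optional[Mapping[str, str]] = None,
    config_path: Path = DEFAULT_CONFIG,
    connection_name: str = "default",  # hypothetical connection name
) -> Dict[str, str]:
    """Return connection parameters from env vars, else from the config file."""
    env = os.environ if env is None else env
    prefix = "SNOWFLAKE_"
    # SNOWFLAKE_ACCOUNT -> "account", SNOWFLAKE_USER -> "user", and so on.
    from_env = {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }
    if from_env:
        # Assumption: environment variables win over the config file.
        return from_env
    if config_path.is_file():
        import tomllib  # standard library from Python 3.11 onward

        with config_path.open("rb") as handle:
            connections = tomllib.load(handle)
        return dict(connections.get(connection_name, {}))
    return {}
```

The returned dictionary could then be passed as keyword arguments to whatever connection call the project uses; that step is left out because the PR description does not show it.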

Installation

pip install 'arctic_training[snowflake]'

Examples

data:
  sources:
    # From SQL
    - type: snowflake
      sql: "SELECT TEXT FROM ARCTIC_TRAINING.CAUSAL_DEMO.GUTENBERG_100"
      column_mapping: {"TEXT": "text"}
      # limit: 50
      # batch_size: 1024
    # From Snowflake Table
    - type: snowflake
      table_name: ARCTIC_TRAINING.CAUSAL_DEMO.GUTENBERG_100
      column_mapping: {"TEXT": "text"}
    # From Snowflake Dataset
    - type: snowflake
      dataset_uri: "snow://dataset/ARCTIC_TRAINING.CAUSAL_DEMO.GUTENBERG_DATASET/versions/v1"
      column_mapping: {"TEXT": "text"}

Examples are also provided in projects/causal_snowflake.

- Renamed SnowflakeTableSourceConfig to SnowflakeSqlSourceConfig and updated its documentation.
- Introduced SnowflakeTableSourceConfig that inherits from SnowflakeSqlSourceConfig, auto-generating SQL queries from table names.
- Updated SnowflakeSqlDataSource to utilize SQL queries instead of table names.
- Enhanced tests for SnowflakeSqlSourceConfig and SnowflakeSqlDataSource to validate new functionality and configurations.
…e data sources

- Introduced README.md detailing project overview, Snowflake data source types, prerequisites, and setup instructions.
- Added setup_snowflake.py script to populate Snowflake with training data from HuggingFace, including database and schema creation.
- Created YAML configuration files for training using SQL queries, tables, and datasets from Snowflake.
- Provided troubleshooting tips for common connection and resource issues.
@sfc-gh-mwyatt sfc-gh-mwyatt merged commit 5c6a984 into snowflakedb:main Jan 7, 2026
8 of 9 checks passed
3 participants