Commit b77dc73

SNOW-2220946: Add support for unstructured data engineering in Snowpark (#3775)
1 parent 22d1174 commit b77dc73

14 files changed: +6705 -284 lines changed

CHANGELOG.md

Lines changed: 23 additions & 0 deletions
@@ -1,11 +1,34 @@
 # Release History

+## 1.40.0 (YYYY-MM-DD)
+
+### Snowpark Python API Updates
+
+#### New Features
+
 ## 1.39.0 (YYYY-MM-DD)

 ### Snowpark Python API Updates

 #### New Features

+- Added support for unstructured data engineering in Snowpark, powered by Snowflake AISQL and Cortex functions:
+- `DataFrame.ai.complete`: Generate per-row LLM completions from prompts built over columns and files.
+- `DataFrame.ai.filter`: Keep rows where an AI classifier returns TRUE for the given predicate.
+- `DataFrame.ai.agg`: Reduce a text column into one result using a natural-language task description.
+- `RelationalGroupedDataFrame.ai_agg`: Perform the same natural-language aggregation per group.
+- `DataFrame.ai.classify`: Assign single or multiple labels from given categories to text or images.
+- `DataFrame.ai.similarity`: Compute cosine-based similarity scores between two columns via embeddings.
+- `DataFrame.ai.sentiment`: Extract overall and aspect-level sentiment from text into JSON.
+- `DataFrame.ai.embed`: Generate VECTOR embeddings for text or images using configurable models.
+- `DataFrame.ai.summarize_agg`: Aggregate and produce a single comprehensive summary over many rows.
+- `DataFrame.ai.transcribe`: Transcribe audio files to text with optional timestamps and speaker labels.
+- `DataFrame.ai.parse_document`: OCR/layout-parse documents or images into structured JSON.
+- `DataFrame.ai.extract`: Pull structured fields from text or files using a response schema.
+- `DataFrame.ai.count_tokens`: Estimate token usage for a given model and input text per row.
+- `DataFrame.ai.split_text_markdown_header`: Split Markdown into hierarchical header-aware chunks.
+- `DataFrame.ai.split_text_recursive_character`: Split text into size-bounded chunks using recursive separators.
+- `DataFrameReader.file`: Create a DataFrame containing all files from a stage as FILE data type for downstream unstructured data processing.
 - Added a new datatype `YearMonthIntervalType` that allows users to create intervals for datetime operations.
 - Added a new function `interval_year_month_from_parts` that allows users to easily create `YearMonthIntervalType` without using SQL.
 - Added a new datatype `DayTimeIntervalType` that allows users to create intervals for datetime operations.
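
Reviewer note: the changelog entry for `DataFrame.ai.similarity` describes cosine-based similarity over embeddings. The sketch below illustrates only that metric in plain Python. It is not Snowpark's or Cortex's implementation, it does not call any Snowflake API, and `cosine_similarity` is a hypothetical helper name introduced here for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors.

    Hypothetical helper for illustration only; not part of Snowpark.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Two stand-in "embeddings"; real embeddings would come from a model.
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Scores near 1 indicate vectors pointing the same way and scores near 0 indicate unrelated directions, which is why the metric is commonly used to compare embeddings.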

docs/source/snowpark/dataframe.rst

Lines changed: 16 additions & 0 deletions
@@ -13,6 +13,7 @@ DataFrame
 DataFrameNaFunctions
 DataFrameStatFunctions
 DataFrameAnalyticsFunctions
+DataFrameAIFunctions

 .. rubric:: Methods

@@ -120,6 +121,20 @@ DataFrame
 DataFrameAnalyticsFunctions.compute_lag
 DataFrameAnalyticsFunctions.compute_lead
 DataFrameAnalyticsFunctions.time_series_agg
+DataFrameAIFunctions.agg
+DataFrameAIFunctions.classify
+DataFrameAIFunctions.complete
+DataFrameAIFunctions.count_tokens
+DataFrameAIFunctions.embed
+DataFrameAIFunctions.extract
+DataFrameAIFunctions.filter
+DataFrameAIFunctions.parse_document
+DataFrameAIFunctions.sentiment
+DataFrameAIFunctions.similarity
+DataFrameAIFunctions.split_text_markdown_header
+DataFrameAIFunctions.split_text_recursive_character
+DataFrameAIFunctions.summarize_agg
+DataFrameAIFunctions.transcribe
 dataframe.map
 dataframe.map_in_pandas

@@ -133,6 +148,7 @@ DataFrame
 .. autosummary::
 :toctree: api/

+DataFrame.ai
 DataFrame.columns
 DataFrame.na
 DataFrame.queries

docs/source/snowpark/grouping.rst

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Grouping
 :toctree: api/

 RelationalGroupedDataFrame.agg
+RelationalGroupedDataFrame.ai_agg
 RelationalGroupedDataFrame.apply_in_pandas
 RelationalGroupedDataFrame.applyInPandas
 RelationalGroupedDataFrame.avg

src/snowflake/snowpark/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@
 "DataFrameStatFunctions",
 "DataFrameAnalyticsFunctions",
 "DataFrameNaFunctions",
+"DataFrameAIFunctions",
 "DataFrameWriter",
 "DataFrameReader",
 "GroupingSets",
@@ -54,6 +55,7 @@
 from snowflake.snowpark.column import CaseExpr, Column
 from snowflake.snowpark.stored_procedure_profiler import StoredProcedureProfiler
 from snowflake.snowpark.dataframe import DataFrame
+from snowflake.snowpark.dataframe_ai_functions import DataFrameAIFunctions
 from snowflake.snowpark.dataframe_analytics_functions import DataFrameAnalyticsFunctions
 from snowflake.snowpark.dataframe_na_functions import DataFrameNaFunctions
 from snowflake.snowpark.dataframe_reader import DataFrameReader
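
Reviewer note: this commit exports `DataFrameAIFunctions` and documents a `DataFrame.ai` attribute next to `DataFrame.na`, which suggests a namespaced accessor. The toy sketch below shows that general pattern in plain Python. Whether Snowpark implements `ai` as a property in this way is an assumption, and `ToyDataFrame`, `ToyAIFunctions`, and `word_counts` are hypothetical names that do not exist in Snowpark.

```python
from typing import List


class ToyAIFunctions:
    """Hypothetical stand-in for an accessor class such as DataFrameAIFunctions."""

    def __init__(self, dataframe: "ToyDataFrame") -> None:
        # The accessor keeps a reference to the frame it operates on.
        self._dataframe = dataframe

    def word_counts(self) -> List[int]:
        # Illustrative per-row operation: count whitespace-separated words.
        return [len(row.split()) for row in self._dataframe.rows]


class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame holding one text value per row."""

    def __init__(self, rows: List[str]) -> None:
        self.rows = rows

    @property
    def ai(self) -> ToyAIFunctions:
        # Grouping related methods under one attribute keeps the main class small.
        return ToyAIFunctions(self)


frame = ToyDataFrame(["hello world", "snowpark"])
counts = frame.ai.word_counts()
```

Grouping methods behind a single attribute keeps the main class's surface area small while making the related operations easy to discover as `frame.ai.<method>`.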
