
Commit aae870a

Merge branch 'main' into not-implemented-args
2 parents: 7c92e5a + 33a8ab6 (commit aae870a)

Some file contents are hidden by default for this large commit; only a subset of the changed files is shown below.

60 files changed: +1644, -294 lines

CHANGELOG.md

Lines changed: 32 additions & 1 deletion
@@ -58,14 +58,20 @@
 - `st_geometryfromwkt`
 - `try_to_geography`
 - `try_to_geometry`
-
+- Added a parameter to enable and disable automatic column name aliasing for `interval_day_time_from_parts` and `interval_year_month_from_parts` functions.

 #### Bug Fixes

 - Fixed a bug that `DataFrameReader.xml` fails to parse XML files with undeclared namespaces when `ignoreNamespace` is `True`.
 - Added a fix for floating point precision discrepancies in `interval_day_time_from_parts`.
 - Fixed a bug where writing Snowpark pandas dataframes on the pandas backend with a column multiindex to Snowflake with `to_snowflake` would raise `KeyError`.
 - Fixed a bug that `DataFrameReader.dbapi` (PuPr) is not compatible with oracledb 3.4.0.
+- Fixed a bug where `modin` would unintentionally be imported during session initialization in some scenarios.
+- Fixed a bug where `session.udf|udtf|udaf|sproc.register` failed when an extra session argument was passed. These methods do not expect a session argument; please remove it if provided.
+
+#### Improvements
+
+- The default maximum length for inferred StringType columns during schema inference in `DataFrameReader.dbapi` is now increased from 16MB to 128MB in parquet file based ingestion.

 #### Dependency Updates

@@ -74,7 +80,10 @@
 ### Snowpark pandas API Updates

 #### New Features
+
 - Added support for the `dtypes` parameter of `pd.get_dummies`
+- Added support for `nunique` in `df.pivot_table`, `df.agg` and other places where aggregate functions can be used.
+- Added support for `DataFrame.interpolate` and `Series.interpolate` with the "linear", "ffill"/"pad", and "backfill"/"bfill" methods. These use the SQL `INTERPOLATE_LINEAR`, `INTERPOLATE_FFILL`, and `INTERPOLATE_BFILL` functions (PuPr).

 #### Improvements

@@ -132,6 +141,28 @@
 - `drop`
 - `invert`
 - `duplicated`
+- `iloc`
+- `head`
+- `columns` (e.g., df.columns = ["A", "B"])
+- `agg`
+- `min`
+- `max`
+- `count`
+- `sum`
+- `mean`
+- `median`
+- `std`
+- `var`
+- `groupby.agg`
+- `groupby.min`
+- `groupby.max`
+- `groupby.count`
+- `groupby.sum`
+- `groupby.mean`
+- `groupby.median`
+- `groupby.std`
+- `groupby.var`
+- `drop_duplicates`
 - Reuse row count from the relaxed query compiler in `get_axis_len`.

 #### Bug Fixes
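
To illustrate the `session.udf|udtf|udaf|sproc.register` fix called out above, here is a minimal hedged sketch of the expected call pattern; the `plus_one` function, the UDF name, and `connection_parameters` are made up for illustration.

    from snowflake.snowpark import Session
    from snowflake.snowpark.types import IntegerType

    # `connection_parameters` is assumed to hold your account credentials.
    session = Session.builder.configs(connection_parameters).create()

    def plus_one(x: int) -> int:
        return x + 1

    # register() already runs against `session`, so no extra session argument is passed.
    plus_one_udf = session.udf.register(
        plus_one,
        return_type=IntegerType(),
        input_types=[IntegerType()],
        name="PLUS_ONE_UDF",  # hypothetical name
        replace=True,
    )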

docs/source/modin/supported/agg_supp.rst

Lines changed: 3 additions & 0 deletions
@@ -38,6 +38,9 @@ methods ``pd.pivot_table``, ``DataFrame.pivot_table``, and ``pd.crosstab``.
 | ``median`` | ``Y`` for ``axis=0``. | ``Y`` | ``Y`` | ``Y`` | ``Y`` |
 | | ``N`` for ``axis=1``. | | | | |
 +-----------------------------+-------------------------------------+----------------------------------+--------------------------------------------+-----------------------------------------+-----------------------------------------+
+| ``nunique`` | ``Y`` for ``axis=0``. | ``Y`` | ``Y`` | ``Y`` | ``Y`` |
+| | ``N`` for ``axis=1``. | | | | |
++-----------------------------+-------------------------------------+----------------------------------+--------------------------------------------+-----------------------------------------+-----------------------------------------+
 | ``size`` | ``Y`` for ``axis=0``. | ``Y`` | ``Y`` | ``Y`` | ``N`` |
 | | ``N`` for ``axis=1``. | | | | |
 +-----------------------------+-------------------------------------+----------------------------------+--------------------------------------------+-----------------------------------------+-----------------------------------------+
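
As a hedged illustration of the new ``nunique`` row, a minimal Snowpark pandas sketch; the sample data and `connection_parameters` are made up, and the imports follow the Snowpark pandas documentation.

    import modin.pandas as pd
    import snowflake.snowpark.modin.plugin  # noqa: F401 -- enables the Snowflake backend
    from snowflake.snowpark import Session

    # `connection_parameters` is assumed to hold your account credentials.
    Session.builder.configs(connection_parameters).create()

    df = pd.DataFrame({"grp": ["a", "a", "b"], "val": [1, 1, 2]})
    df.agg("nunique")                                             # per-column distinct counts (axis=0)
    df.pivot_table(index="grp", values="val", aggfunc="nunique")  # nunique as a pivot aggregation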

docs/source/modin/supported/dataframe_supported.rst

Lines changed: 5 additions & 1 deletion
@@ -227,7 +227,11 @@ Methods
 +-----------------------------+---------------------------------+----------------------------------+----------------------------------------------------+
 | ``insert`` | Y | | |
 +-----------------------------+---------------------------------+----------------------------------+----------------------------------------------------+
-| ``interpolate`` | N | | |
+| ``interpolate`` | P | | ``N`` if ``axis == 1``, ``limit`` is set, |
+| | | | ``limit_area`` is "outside", or ``method`` is not |
+| | | | "linear", "bfill", "backfill", "ffill", or "pad". |
+| | | | ``limit_area="inside"`` is supported only when |
+| | | | ``method`` is ``linear``. |
 +-----------------------------+---------------------------------+----------------------------------+----------------------------------------------------+
 | ``isetitem`` | N | | |
 +-----------------------------+---------------------------------+----------------------------------+----------------------------------------------------+

docs/source/modin/supported/series_supported.rst

Lines changed: 5 additions & 1 deletion
@@ -243,7 +243,11 @@ Methods
 | ``info`` | D | | Different Index types are used in pandas but not |
 | | | | in Snowpark pandas |
 +-----------------------------+---------------------------------+----------------------------------+----------------------------------------------------+
-| ``interpolate`` | N | | |
+| ``interpolate`` | P | | ``N`` if ``limit`` is set, |
+| | | | ``limit_area`` is "outside", or ``method`` is not |
+| | | | "linear", "bfill", "backfill", "ffill", or "pad". |
+| | | | ``limit_area="inside"`` is supported only when |
+| | | | ``method`` is ``linear``. |
 +-----------------------------+---------------------------------+----------------------------------+----------------------------------------------------+
 | ``isin`` | Y | | Snowpark pandas deviates with respect to handling |
 | | | | NA values |
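
A similar hedged sketch for the now partially supported ``interpolate`` (same assumptions as the ``nunique`` sketch above; the data is made up).

    import modin.pandas as pd
    import snowflake.snowpark.modin.plugin  # noqa: F401 -- enables the Snowflake backend

    # Assumes an active Snowpark session, as in the `nunique` sketch above.
    s = pd.Series([1.0, None, 3.0, None])
    s.interpolate(method="linear")   # supported; backed by INTERPOLATE_LINEAR
    s.interpolate(method="ffill")    # supported; backed by INTERPOLATE_FFILL
    # s.interpolate(method="polynomial")  # outside the supported methods, per the table above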

src/snowflake/snowpark/_internal/data_source/drivers/base_driver.py

Lines changed: 26 additions & 6 deletions
@@ -11,6 +11,7 @@
     Connection,
     Cursor,
 )
+from snowflake.snowpark._internal.server_connection import MAX_STRING_SIZE
 from snowflake.snowpark._internal.utils import (
     get_sorted_key_for_version,
     measure_time,
@@ -27,6 +28,7 @@
     BinaryType,
     DateType,
     BooleanType,
+    StringType,
 )
 import snowflake.snowpark
 import logging
@@ -103,7 +105,16 @@ def infer_schema_from_description(
         query_input_alias: str,
     ) -> StructType:
         self.get_raw_schema(table_or_query, cursor, is_query, query_input_alias)
-        return self.to_snow_type(self.raw_schema)
+        generated_schema = self.to_snow_type(self.raw_schema)
+        # snowflake will default string length to 128MB in the bundle which will be enabled in 2026-01
+        # https://docs.snowflake.com/en/release-notes/bcr-bundles/2025_07_bundle
+        # here we prematurely make the change to default string to
+        # 1. align the string length with UDTF based ingestion
+        # 2. avoid the BCR impact to dbapi feature
+        for field in generated_schema.fields:
+            if isinstance(field.datatype, StringType) and field.datatype.length is None:
+                field.datatype.length = MAX_STRING_SIZE
+        return generated_schema

     def infer_schema_from_description_with_error_control(
         self, table_or_query: str, is_query: bool, query_input_alias: str
@@ -177,13 +188,17 @@ def udtf_ingestion(
             packages=packages or UDTF_PACKAGE_MAP.get(self.dbms_type),
             imports=imports,
             statement_params=statement_params,
+            _emit_ast=_emit_ast,  # internal function call, _emit_ast will be set to False by the caller
         )
         logger.debug(f"register ingestion udtf takes: {udtf_register_time()} seconds")
         call_udtf_sql = f"""
         select * from {partition_table}, table({udtf_name}({PARTITION_TABLE_COLUMN_NAME}))
         """
         res = session.sql(call_udtf_sql, _emit_ast=_emit_ast)
-        return self.to_result_snowpark_df_udtf(res, schema, _emit_ast=_emit_ast)
+        return BaseDriver.keep_nullable_attributes(
+            self.to_result_snowpark_df_udtf(res, schema, _emit_ast=_emit_ast),
+            schema,
+        )

     def udtf_class_builder(
         self,
@@ -283,6 +298,14 @@ def to_result_snowpark_df(
     ) -> "DataFrame":
         return session.table(table_name, _emit_ast=_emit_ast)

+    @staticmethod
+    def keep_nullable_attributes(
+        selected_df: "DataFrame", schema: StructType
+    ) -> "DataFrame":
+        for attr, source_field in zip(selected_df._plan.attributes, schema.fields):
+            attr.nullable = source_field.nullable
+        return selected_df
+
     @staticmethod
     def to_result_snowpark_df_udtf(
         res_df: "DataFrame",
@@ -293,10 +316,7 @@ def to_result_snowpark_df_udtf(
             res_df[field.name].cast(field.datatype).alias(field.name)
             for field in schema.fields
         ]
-        selected_df = res_df.select(cols, _emit_ast=_emit_ast)
-        for attr, source_field in zip(selected_df._plan.attributes, schema.fields):
-            attr.nullable = source_field.nullable
-        return selected_df
+        return res_df.select(cols, _emit_ast=_emit_ast)

     def get_server_cursor_if_supported(self, conn: "Connection") -> "Cursor":
         """

src/snowflake/snowpark/_internal/server_connection.py

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@
 PARAM_INTERNAL_APPLICATION_NAME = "internal_application_name"
 PARAM_INTERNAL_APPLICATION_VERSION = "internal_application_version"
 DEFAULT_STRING_SIZE = 16777216
+MAX_STRING_SIZE = 134217728


 def _build_target_path(stage_location: str, dest_prefix: str = "") -> str:
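
For reference, the two constants line up with the 16MB-to-128MB change noted in the CHANGELOG:

    # Plain arithmetic check of the constants above.
    assert 16 * 1024 * 1024 == 16777216     # DEFAULT_STRING_SIZE, i.e. 16MB
    assert 128 * 1024 * 1024 == 134217728   # MAX_STRING_SIZE, i.e. 128MB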

src/snowflake/snowpark/_internal/udf_utils.py

Lines changed: 3 additions & 7 deletions
@@ -1134,7 +1134,7 @@ def resolve_imports_and_packages(
     skip_upload_on_content_match: bool = False,
     is_permanent: bool = False,
     force_inline_code: bool = False,
-    **kwargs,
+    _suppress_local_package_warnings: bool = False,
 ) -> Tuple[
     Optional[str],
     Optional[str],
@@ -1168,9 +1168,7 @@
             packages,
             include_pandas=is_pandas_udf,
             statement_params=statement_params,
-            _suppress_local_package_warnings=kwargs.get(
-                "_suppress_local_package_warnings", False
-            ),
+            _suppress_local_package_warnings=_suppress_local_package_warnings,
         )
         if packages is not None
         else session._resolve_packages(
@@ -1179,9 +1177,7 @@
             validate_package=False,
             include_pandas=is_pandas_udf,
             statement_params=statement_params,
-            _suppress_local_package_warnings=kwargs.get(
-                "_suppress_local_package_warnings", False
-            ),
+            _suppress_local_package_warnings=_suppress_local_package_warnings,
         )
     )
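
The move from ``**kwargs`` to an explicit keyword parameter is a general Python pattern rather than anything Snowpark-specific; a generic sketch of the difference (``resolve`` is a made-up stand-in):

    # With an explicit keyword-only parameter, a misspelled or unexpected option
    # fails loudly instead of being silently swallowed by a **kwargs catch-all.
    def resolve(*, _suppress_local_package_warnings: bool = False) -> bool:
        return _suppress_local_package_warnings

    resolve(_suppress_local_package_warnings=True)   # OK
    # resolve(_supress_local_package_warnings=True)  # TypeError: unexpected keyword argument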

src/snowflake/snowpark/async_job.py

Lines changed: 27 additions & 3 deletions
@@ -284,10 +284,34 @@ def cancel(self) -> None:
                 "ENABLE_ASYNC_QUERY_IN_PYTHON_STORED_PROCS", False
             )
         ):
-            cancel_resp = self._session._conn._conn.cancel_query(self.query_id)
-            if not cancel_resp.get("success", False):
+            import _snowflake
+            import json
+            import uuid
+
+            try:
+                uuid.UUID(self.query_id)
+            except ValueError:
+                raise ValueError(f"Invalid UUID: '{self.query_id}'")
+
+            raw_cancel_resp = _snowflake.cancel_query(self.query_id)
+
+            # Set failure_response when
+            # - success != True in the response or
+            # - cannot parse the response at all.
+            failure_response = None
+            try:
+                parsed_cancel_resp = json.loads(raw_cancel_resp)
+                if not parsed_cancel_resp.get("success", False):
+                    failure_response = parsed_cancel_resp
+            except (TypeError, json.JSONDecodeError) as e:
+                failure_response = {
+                    "success": False,
+                    "error": f"Error parsing response: {e}",
+                }
+
+            if failure_response:
                 raise DatabaseError(
-                    f"Failed to cancel query. Returned response: {cancel_resp}"
+                    f"Failed to cancel query. Returned response: {failure_response}"
                 )
         else:
             self._cursor.execute(f"select SYSTEM$CANCEL_QUERY('{self.query_id}')")
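
A hedged usage sketch of the ``cancel`` path being changed; ``collect_nowait`` and ``cancel`` are public Snowpark APIs, while the query and ``connection_parameters`` are made up.

    from snowflake.snowpark import Session

    # `connection_parameters` is assumed to hold your account credentials.
    session = Session.builder.configs(connection_parameters).create()

    # Start a long-running query asynchronously, then cancel it.
    job = session.sql(
        "select seq4() from table(generator(rowcount => 1000000000))"
    ).collect_nowait()
    job.cancel()  # raises DatabaseError if the cancellation response does not report success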

src/snowflake/snowpark/dataframe_reader.py

Lines changed: 18 additions & 12 deletions
@@ -1707,18 +1707,24 @@ def dbapi(
         Reads data from a database table or query into a DataFrame using a DBAPI connection,
         with support for optional partitioning, parallel processing, and query customization.

-        There are multiple methods to partition data and accelerate ingestion.
-        These methods can be combined to achieve optimal performance:
-
-        1.Use column, lower_bound, upper_bound and num_partitions at the same time when you need to split large tables into smaller partitions for parallel processing.
-        These must all be specified together, otherwise error will be raised.
-        2.Set max_workers to a proper positive integer.
-        This defines the maximum number of processes and threads used for parallel execution.
-        3.Adjusting fetch_size can optimize performance by reducing the number of round trips to the database.
-        4.Use predicates to defining WHERE conditions for partitions,
-        predicates will be ignored if column is specified to generate partition.
-        5.Set custom_schema to avoid snowpark infer schema, custom_schema must have a matched
-        column name with table in external data source.
+        Usage Notes:
+            - Ingestion performance tuning:
+                - **Partitioning**: Use ``column``, ``lower_bound``, ``upper_bound``, and ``num_partitions``
+                  together to split large tables into smaller partitions for parallel processing.
+                  All four parameters must be specified together, otherwise an error will be raised.
+                - **Parallel execution**: Set ``max_workers`` to control the maximum number of processes
+                  and threads used for parallel execution.
+                - **Fetch optimization**: Adjust ``fetch_size`` to optimize performance by reducing
+                  the number of round trips to the database.
+                - **Partition filtering**: Use ``predicates`` to define WHERE conditions for partitions.
+                  Note that ``predicates`` will be ignored if ``column`` is specified for partitioning.
+                - **Schema specification**: Set ``custom_schema`` to skip schema inference. The custom schema
+                  must have matching column names with the table in the external data source.
+            - Execution timing and error handling:
+                - **UDTF Ingestion**: Uses lazy evaluation. Errors are reported as ``SnowparkSQLException``
+                  during DataFrame actions (e.g., ``DataFrame.collect()``).
+                - **Local Ingestion**: Uses eager execution. Errors are reported immediately as
+                  ``SnowparkDataFrameReaderException`` when this method is called.

         Args:
             create_connection: A callable that returns a DB-API compatible database connection.
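
A hedged example of the tuning knobs described in the rewritten docstring. The partitioning, worker, and fetch parameter names come from the notes above; ``table=`` is assumed from the method description, and ``create_oracle_connection``, the table name, the DSN, and ``connection_parameters`` are made up.

    import oracledb
    from snowflake.snowpark import Session

    # `connection_parameters` is assumed to hold your account credentials.
    session = Session.builder.configs(connection_parameters).create()

    def create_oracle_connection():
        # Any DB-API compatible connection factory works here; the DSN is made up.
        return oracledb.connect(user="scott", password="tiger", dsn="dbhost/orclpdb")

    df = session.read.dbapi(
        create_oracle_connection,
        table="ORDERS",          # hypothetical source table
        column="ORDER_ID",       # partitioning column ...
        lower_bound=0,           # ... with bounds and a partition count,
        upper_bound=1_000_000,   # all four specified together
        num_partitions=8,
        max_workers=4,           # parallel processes/threads
        fetch_size=10_000,       # rows per database round trip
    )
    df.collect()  # with UDTF ingestion, errors surface lazily here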
