CHANGELOG.md (4 additions, 0 deletions)
@@ -67,6 +67,10 @@
 - Fixed a bug where writing Snowpark pandas dataframes on the pandas backend with a column multiindex to Snowflake with `to_snowflake` would raise `KeyError`.
 - Fixed a bug that `DataFrameReader.dbapi` (PuPr) is not compatible with oracledb 3.4.0.
 
+#### Improvements
+
+- The default maximum length for inferred StringType columns during schema inference in `DataFrameReader.dbapi` has been increased from 16MB to 128MB for parquet-file-based ingestion.
+
 #### Dependency Updates
 
 - Updated dependency of `snowflake-connector-python>=3.17,<5.0.0`.
src/snowflake/snowpark/dataframe_reader.py (18 additions, 12 deletions)
@@ -1707,18 +1707,24 @@ def dbapi(
         Reads data from a database table or query into a DataFrame using a DBAPI connection,
         with support for optional partitioning, parallel processing, and query customization.
 
-        There are multiple methods to partition data and accelerate ingestion.
-        These methods can be combined to achieve optimal performance:
-
-        1.Use column, lower_bound, upper_bound and num_partitions at the same time when you need to split large tables into smaller partitions for parallel processing.
-        These must all be specified together, otherwise error will be raised.
-        2.Set max_workers to a proper positive integer.
-        This defines the maximum number of processes and threads used for parallel execution.
-        3.Adjusting fetch_size can optimize performance by reducing the number of round trips to the database.
-        4.Use predicates to defining WHERE conditions for partitions,
-        predicates will be ignored if column is specified to generate partition.
-        5.Set custom_schema to avoid snowpark infer schema, custom_schema must have a matched
-        column name with table in external data source.
+        Usage Notes:
+            - Ingestion performance tuning:
+                - **Partitioning**: Use ``column``, ``lower_bound``, ``upper_bound``, and ``num_partitions``
+                  together to split large tables into smaller partitions for parallel processing.
+                  All four parameters must be specified together; otherwise an error will be raised.
+                - **Parallel execution**: Set ``max_workers`` to control the maximum number of processes
+                  and threads used for parallel execution.
+                - **Fetch optimization**: Adjust ``fetch_size`` to optimize performance by reducing
+                  the number of round trips to the database.
+                - **Partition filtering**: Use ``predicates`` to define WHERE conditions for partitions.
+                  Note that ``predicates`` will be ignored if ``column`` is specified for partitioning.
+                - **Schema specification**: Set ``custom_schema`` to skip schema inference. The custom schema
+                  must have matching column names with the table in the external data source.
+            - Execution timing and error handling:
+                - **UDTF Ingestion**: Uses lazy evaluation. Errors are reported as ``SnowparkSQLException``
+                  during DataFrame actions (e.g., ``DataFrame.collect()``).
+                - **Local Ingestion**: Uses eager execution. Errors are reported immediately as
+                  ``SnowparkDataFrameReaderException`` when this method is called.
 
         Args:
             create_connection: A callable that returns a DB-API compatible database connection.
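The usage notes in the new docstring map onto a single reader call. Below is a minimal sketch in Python of a partitioned `dbapi` ingestion, assuming an Oracle source reached through `oracledb`, an already-configured Snowpark `Session`, and a `table` argument naming the source table; the connection details, the `ORDERS` table, and its columns are illustrative assumptions, not part of this diff.

```python
# Hypothetical sketch of a partitioned DataFrameReader.dbapi ingestion,
# following the usage notes in the docstring above. The connection details,
# the ORDERS table, and its columns are illustrative assumptions.
import oracledb

from snowflake.snowpark import Session
from snowflake.snowpark.exceptions import SnowparkSQLException
from snowflake.snowpark.types import LongType, StringType, StructField, StructType


def create_oracle_connection():
    # dbapi expects a callable that returns a DB-API compatible connection.
    return oracledb.connect(user="scott", password="...", dsn="dbhost:1521/orclpdb")


# Assumes a default Snowflake connection is already configured.
session = Session.builder.getOrCreate()

df = session.read.dbapi(
    create_oracle_connection,
    table="ORDERS",
    # Partitioning: all four parameters must be specified together.
    column="ORDER_ID",
    lower_bound=0,
    upper_bound=1_000_000,
    num_partitions=8,
    # Parallel execution: cap the processes/threads used for ingestion.
    max_workers=4,
    # Fetch optimization: fewer round trips to the source database.
    fetch_size=10_000,
    # Schema specification: skips inference; names must match the source table.
    custom_schema=StructType(
        [
            StructField("ORDER_ID", LongType()),
            StructField("STATUS", StringType()),
        ]
    ),
)

try:
    # UDTF ingestion is lazy, so errors surface at action time.
    rows = df.collect()
except SnowparkSQLException as exc:
    print(f"Ingestion failed during collect(): {exc}")
```

Because `column`-based partitioning is specified, `predicates` is deliberately omitted (per the docstring, it would be ignored). Under local (eager) ingestion the failure mode differs: a `SnowparkDataFrameReaderException` would be raised by the `dbapi` call itself rather than at `collect()`.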