Commit 5177a8a

Swap default Apache Arrow type to large_string and large_binary.
1 parent dc161f8 commit 5177a8a

10 files changed (+216, -158 lines changed)

doc/src/api_manual/dataframe.rst

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ DataFrame Class
 Each column in a DataFrame exposes an `Apache Arrow PyCapsule
 <https://arrow.apache.org/docs/format/CDataInterface/
 PyCapsuleInterface.html>`__ interface, giving access to the underlying
-Arrow array.
+Apache Arrow array.

 .. dbapiobjectextension::

doc/src/release_notes.rst

Lines changed: 5 additions & 0 deletions
@@ -51,6 +51,11 @@ Common Changes
 - Added parameter ``requested_schema`` to :meth:`Connection.fetch_df_all()`
 and :meth:`Connection.fetch_df_batches()` to support type mapping when
 querying.
+- The large variants for strings and binary values that use 64-bit offsets
+are now used by default in order to avoid the limits imposed by using
+32-bit offsets. Use ``requested_schema`` if the smaller offset size is
+desired
+(`issue 536 <https://github.com/oracle/python-oracledb/issues/536>`__).
 - Added support for all of the signed and unsigned fixed width integer
 types when ingesting data frames supporting the Arrow PyCapsule
 interface. Previously only ``int64`` was supported.
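
Reviewer note: the "limits imposed by using 32-bit offsets" mentioned in the release note can be made concrete with a little arithmetic. In the Apache Arrow columnar format, a STRING or BINARY array stores signed 32-bit offsets into one contiguous values buffer, while LARGE_STRING and LARGE_BINARY store signed 64-bit offsets. The following is a minimal sketch, not code from this commit; the helper names are illustrative and are not part of python-oracledb.

```python
# Illustrative helpers only; these names are not part of python-oracledb.

INT32_MAX = 2**31 - 1  # largest offset a STRING/BINARY array can store
INT64_MAX = 2**63 - 1  # largest offset a LARGE_STRING/LARGE_BINARY array can store


def total_value_bytes(values):
    """Return the number of bytes the values occupy in the Arrow values buffer.

    Strings are counted as UTF-8 bytes, bytes objects by their length, and
    None (a null) contributes nothing.
    """
    return sum(
        len(v.encode("utf-8")) if isinstance(v, str) else len(v)
        for v in values
        if v is not None
    )


def fits_in_32bit_offsets(values):
    """Return True if the final offset still fits in a signed 32-bit integer."""
    return total_value_bytes(values) <= INT32_MAX


rows = ["alpha", "beta", None, "gamma"]
print(total_value_bytes(rows))      # 14
print(fits_in_32bit_offsets(rows))  # True
```

Because the offsets index into the concatenated bytes of a whole array, what matters is the total byte length of the values in one fetched array, not only the row count.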

doc/src/user_guide/dataframes.rst

Lines changed: 115 additions & 56 deletions
@@ -123,53 +123,53 @@ objects. Querying any other data types from Oracle Database will result in an
 exception. :ref:`Output type handlers <outputtypehandlers>` cannot be used to
 map data types.

-.. list-table-with-summary:: Mapping from Oracle Database to Arrow data types
+.. list-table-with-summary:: Mapping from Oracle Database to Apache Arrow data types
 :header-rows: 1
 :class: wy-table-responsive
 :widths: 1 1
 :width: 100%
 :align: left
-:summary: The first column is the Oracle Database type. The second column is the Arrow data type used in the python-oracledb DataFrame object.
+:summary: The first column is the Oracle Database type. The second column is the Apache Arrow data type used in the python-oracledb DataFrame object.

 * - Oracle Database Type
-- Arrow Data Type
-* - DB_TYPE_BINARY_DOUBLE
+- Apache Arrow Data Type
+* - :attr:`DB_TYPE_BINARY_DOUBLE`
 - DOUBLE
-* - DB_TYPE_BINARY_FLOAT
+* - :attr:`DB_TYPE_BINARY_FLOAT`
 - FLOAT
-* - DB_TYPE_BLOB
+* - :attr:`DB_TYPE_BLOB`
 - LARGE_BINARY
-* - DB_TYPE_BOOLEAN
+* - :attr:`DB_TYPE_BOOLEAN`
 - BOOLEAN
-* - DB_TYPE_CHAR
-- STRING
-* - DB_TYPE_CLOB
+* - :attr:`DB_TYPE_CHAR`
 - LARGE_STRING
-* - DB_TYPE_DATE
+* - :attr:`DB_TYPE_CLOB`
+- LARGE_STRING
+* - :attr:`DB_TYPE_DATE`
 - TIMESTAMP
-* - DB_TYPE_LONG
+* - :attr:`DB_TYPE_LONG`
 - LARGE_STRING
-* - DB_TYPE_LONG_RAW
+* - :attr:`DB_TYPE_LONG_RAW`
 - LARGE_BINARY
-* - DB_TYPE_NCHAR
-- STRING
-* - DB_TYPE_NCLOB
+* - :attr:`DB_TYPE_NCHAR`
+- LARGE_STRING
+* - :attr:`DB_TYPE_NCLOB`
 - LARGE_STRING
-* - DB_TYPE_NUMBER
+* - :attr:`DB_TYPE_NUMBER`
 - DECIMAL128, INT64, or DOUBLE
-* - DB_TYPE_NVARCHAR
-- STRING
-* - DB_TYPE_RAW
-- BINARY
-* - DB_TYPE_TIMESTAMP
+* - :attr:`DB_TYPE_NVARCHAR`
+- LARGE_STRING
+* - :attr:`DB_TYPE_RAW`
+- LARGE_BINARY
+* - :attr:`DB_TYPE_TIMESTAMP`
 - TIMESTAMP
-* - DB_TYPE_TIMESTAMP_LTZ
+* - :attr:`DB_TYPE_TIMESTAMP_LTZ`
 - TIMESTAMP
-* - DB_TYPE_TIMESTAMP_TZ
+* - :attr:`DB_TYPE_TIMESTAMP_TZ`
 - TIMESTAMP
-* - DB_TYPE_VARCHAR
-- STRING
-* - DB_TYPE_VECTOR
+* - :attr:`DB_TYPE_VARCHAR`
+- LARGE_STRING
+* - :attr:`DB_TYPE_VECTOR`
 - List or struct with DOUBLE, FLOAT, INT8, or UINT8 values

 **Numbers**
@@ -178,15 +178,25 @@ When converting Oracle Database NUMBERs:

 - If the column has been created without a precision and scale, or you are
 querying an expression that results in a number without precision or scale,
-then the Arrow data type will be DOUBLE.
+then the Apache Arrow data type will be DOUBLE.

 - If :attr:`oracledb.defaults.fetch_decimals <Defaults.fetch_decimals>` is set
-to *True*, then the Arrow data type is DECIMAL128.
+to *True*, then the Apache Arrow data type is DECIMAL128.

 - If the column has been created with a scale of *0*, and a precision value
-that is less than or equal to *18*, then the Arrow data type is INT64.
+that is less than or equal to *18*, then the Apache Arrow data type is INT64.
+
+- In all other cases, the Apache Arrow data type is DOUBLE.
+
+**Strings**
+
+When converting Oracle Database character types:

-- In all other cases, the Arrow data type is DOUBLE.
+- If the number of records being fetched by :meth:`Connection.fetch_df_all()`,
+or fetched in each batch by :meth:`Connection.fetch_df_batches()`, can be
+handled by 32-bit offsets, you can use an :ref:`explicit mapping
+<explicitmapping>` to fetch as STRING instead of the default
+LARGE_STRING. This will save 4 bytes per record.

 **Vectors**
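
Reviewer note: the NUMBER rules in the hunk above can be read as a small decision procedure. The sketch below is an assumption-laden illustration, not the driver's implementation: it applies the four documented rules in the order they are listed, and it uses a precision of 0 to stand for "no precision and scale declared". The function name is hypothetical.

```python
def arrow_type_for_number(precision, scale, fetch_decimals=False):
    """Return the name of the Apache Arrow type for an Oracle NUMBER column.

    Assumptions of this sketch: precision == 0 means the column (or
    expression) has no declared precision or scale, and the documented
    rules are evaluated in the order in which they are listed.
    """
    if precision == 0:
        # No precision or scale: always DOUBLE.
        return "DOUBLE"
    if fetch_decimals:
        # oracledb.defaults.fetch_decimals is True.
        return "DECIMAL128"
    if scale == 0 and precision <= 18:
        # Integer column small enough for a signed 64-bit value.
        return "INT64"
    # All other cases.
    return "DOUBLE"


print(arrow_type_for_number(9, 0))         # INT64
print(arrow_type_for_number(10, 2))        # DOUBLE
print(arrow_type_for_number(10, 2, True))  # DECIMAL128
```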

@@ -211,10 +221,10 @@ When converting Oracle Database VECTORs:
 :class: wy-table-responsive
 :widths: 1 1
 :align: left
-:summary: The first column is the Oracle Database VECTOR format. The second column is the resulting Arrow data type in the list.
+:summary: The first column is the Oracle Database VECTOR format. The second column is the resulting Apache Arrow data type in the list.

 * - Oracle Database VECTOR format
-- Arrow data type
+- Apache Arrow data type
 * - FLOAT64
 - DOUBLE
 * - FLOAT32
@@ -226,33 +236,52 @@ When converting Oracle Database VECTORs:

 See :ref:`dfvector` for more information.

-**LOBs**
+**CLOB and NCLOB**

-When converting Oracle Database CLOBs and BLOBs:
+When converting Oracle Database CLOBs and NCLOBs:

-- The LOBs must be no more than 1 GB in length.
+- LOBs must be no more than 1 GB in length.
+
+- If the number of records being fetched by :meth:`Connection.fetch_df_all()`,
+or fetched in each batch by :meth:`Connection.fetch_df_batches()`, can be
+handled by 32-bit offsets, you can use an :ref:`explicit mapping
+<explicitmapping>` to fetch CLOBs and NCLOBs as STRING instead of the default
+LARGE_STRING. This will save 4 bytes per record.
+
+**BLOB**
+
+When converting Oracle Database BLOBs:
+
+- LOBs must be no more than 1 GB in length.
+
+- If the number of records being fetched by :meth:`Connection.fetch_df_all()`,
+or fetched in each batch by :meth:`Connection.fetch_df_batches()`, can be
+handled by 32-bit offsets, you can use an :ref:`explicit mapping
+<explicitmapping>` to fetch BLOBs as BINARY instead of the default
+LARGE_BINARY. This will save 4 bytes per record.

 **Dates and Timestamps**

 When converting Oracle Database DATEs and TIMESTAMPs:

-- Arrow TIMESTAMPs will not have timezone data.
+- Apache Arrow TIMESTAMPs will not have timezone data.

-- For Oracle Database DATE columns, the Arrow TIMESTAMP will have a time unit
-of "seconds".
+- For Oracle Database DATE columns, the Apache Arrow TIMESTAMP will have a time
+unit of "seconds".

-- For Oracle Database TIMESTAMP types, the Arrow TIMESTAMP time unit depends on
-the Oracle type's fractional precision as shown in the table below:
+- For Oracle Database TIMESTAMP types, the Apache Arrow TIMESTAMP time unit
+depends on the Oracle type's fractional precision as shown in the table
+below:

 .. list-table-with-summary::
 :header-rows: 1
 :class: wy-table-responsive
 :widths: 1 1
 :align: left
-:summary: The first column is the Oracle Database TIMESTAMP-type fractional second precision. The second column is the resulting Arrow TIMESTAMP time unit.
+:summary: The first column is the Oracle Database TIMESTAMP-type fractional second precision. The second column is the resulting Apache Arrow TIMESTAMP time unit.

 * - Oracle Database TIMESTAMP fractional second precision range
-- Arrow TIMESTAMP time unit
+- Apache Arrow TIMESTAMP time unit
 * - 0
 - seconds
 * - 1 - 3
@@ -262,6 +291,8 @@ When converting Oracle Database DATEs and TIMESTAMPs:
 * - 7 - 9
 - nanoseconds

+.. _explicitmapping:
+
 Explicit Data Frame Type Mapping
 ++++++++++++++++++++++++++++++++

@@ -338,22 +369,50 @@ requested schema type.
 :class: wy-table-responsive
 :widths: 1 1
 :align: left
-:summary: The first column is the Oracle Database data type. The second column shows supported Arrow data types.
+:summary: The first column is the Oracle Database data type. The second column shows supported Apache Arrow data types.

 * - Oracle Database Type
-- Arrow Data Types
-* - DB_TYPE_NUMBER
-- INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, DECIMAL128(p, s), DOUBLE, FLOAT
-* - DB_TYPE_RAW, DB_TYPE_LONG_RAW
-- BINARY, FIXED SIZE BINARY, LARGE BINARY
-* - DB_TYPE_BOOLEAN
+- Apache Arrow Data Types
+* - :attr:`DB_TYPE_NUMBER`
+- DECIMAL128(p, s)
+DOUBLE
+FLOAT
+INT8
+INT16
+INT32
+INT64
+UINT8,
+UINT16
+UINT32
+UINT64
+* - :attr:`DB_TYPE_BLOB`
+:attr:`DB_TYPE_LONG_RAW`
+:attr:`DB_TYPE_RAW`
+- BINARY
+FIXED SIZE BINARY
+LARGE_BINARY
+* - :attr:`DB_TYPE_BOOLEAN`
 - BOOLEAN
-* - DB_TYPE_DATE, DB_TYPE_TIMESTAMP, DB_TYPE_TIMESTAMP_LTZ, DB_TYPE_TIMESTAMP_TZ
-- DATE32, DATE64, TIMESTAMP
-* - DB_TYPE_BINARY_DOUBLE, DB_TYPE_BINARY_FLOAT
-- DOUBLE, FLOAT
-* - DB_TYPE_VARCHAR, DB_TYPE_CHAR, DB_TYPE_LONG, DB_TYPE_NVARCHAR, DB_TYPE_NCHAR, DB_TYPE_LONG_NVARCHAR
-- STRING, LARGE_STRING
+* - :attr:`DB_TYPE_DATE`
+:attr:`DB_TYPE_TIMESTAMP`
+:attr:`DB_TYPE_TIMESTAMP_LTZ`
+:attr:`DB_TYPE_TIMESTAMP_TZ`
+- DATE32
+DATE64
+TIMESTAMP
+* - :attr:`DB_TYPE_BINARY_DOUBLE`
+:attr:`DB_TYPE_BINARY_FLOAT`
+- DOUBLE
+FLOAT
+* - :attr:`DB_TYPE_CHAR`
+:attr:`DB_TYPE_CLOB`
+:attr:`DB_TYPE_LONG`
+:attr:`DB_TYPE_NCHAR`
+:attr:`DB_TYPE_NCLOB`
+:attr:`DB_TYPE_NVARCHAR`
+:attr:`DB_TYPE_VARCHAR`
+- LARGE_STRING
+STRING

 .. _convertingodf:
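
Reviewer note: the "4 bytes per record" figure quoted in the Strings, CLOB and BLOB sections of this file follows from the width of the offsets buffer. The Apache Arrow columnar format stores one more offset than there are records, each 4 bytes wide for STRING/BINARY and 8 bytes wide for LARGE_STRING/LARGE_BINARY. A rough sketch of the arithmetic follows; the function names are illustrative, and buffer padding and alignment are ignored.

```python
def offsets_buffer_bytes(num_records, large):
    """Approximate size in bytes of the offsets buffer of a variable-length
    Arrow array holding num_records values (padding is ignored)."""
    width = 8 if large else 4          # int64 vs int32 offsets
    return (num_records + 1) * width   # Arrow stores num_records + 1 offsets


def bytes_saved_by_small_offsets(num_records):
    """Bytes saved by requesting STRING/BINARY instead of the large variants."""
    return (offsets_buffer_bytes(num_records, large=True)
            - offsets_buffer_bytes(num_records, large=False))


print(bytes_saved_by_small_offsets(1_000_000))  # 4000004
```

For a one million row fetch the saving is therefore about 4 MB per string or binary column, which is small next to the values buffer unless the values themselves are very short.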

src/oracledb/connection.py

Lines changed: 16 additions & 19 deletions
@@ -1115,14 +1115,13 @@ def fetch_df_all(
 the ``fetch_df_all()``'s :attr:`Cursor.prefetchrows` size is always set
 to the value of the explicit or default ``arraysize`` parameter value.

-The ``fetch_decimals`` parameter specifies whether to return
-decimal values when fetching columns of type ``NUMBER`` that are
-capable of being represented in Arrow Decimal128 format. The default
-value is
-:data:`oracledb.defaults.fetch_decimals <Defaults.fetch_decimals>`.
+The ``fetch_decimals`` parameter specifies whether to return decimal
+values when fetching columns of type ``NUMBER`` that are capable of
+being represented in Apache Arrow Decimal128 format. The default value
+is :data:`oracledb.defaults.fetch_decimals <Defaults.fetch_decimals>`.

 The ``requested_schema`` parameter specifies an object that implements
-the Arrow PyCapsule schema interface. The DataFrame returned by
+the Apache Arrow PyCapsule schema interface. The DataFrame returned by
 ``fetch_df_all()`` will have the data types and names of the schema.

 Any LOB fetched must be less than 1 GB.
@@ -1170,14 +1169,13 @@ def fetch_df_batches(
 :attr:`Cursor.prefetchrows` sizes are always set to the value of the
 explicit or default ``size`` parameter value.

-The ``fetch_decimals`` parameter specifies whether to return
-decimal values when fetching columns of type ``NUMBER`` that are
-capable of being represented in Arrow Decimal128 format. The default
-value is
-:data:`oracledb.defaults.fetch_decimals <Defaults.fetch_decimals>`.
+The ``fetch_decimals`` parameter specifies whether to return decimal
+values when fetching columns of type ``NUMBER`` that are capable of
+being represented in Apache Arrow Decimal128 format. The default value
+is :data:`oracledb.defaults.fetch_decimals <Defaults.fetch_decimals>`.

 The ``requested_schema`` parameter specifies an object that implements
-the Arrow PyCapsule schema interface. The DataFrame returned by
+the Apache Arrow PyCapsule schema interface. The DataFrame returned by
 ``fetch_df_all()`` will have the data types and names of the schema.

 Any LOB fetched must be less than 1 GB.
@@ -2450,14 +2448,13 @@ async def fetch_df_all(
 the ``fetch_df_all()``'s :attr:`Cursor.prefetchrows` size is always set
 to the value of the explicit or default ``arraysize`` parameter value.

-The ``fetch_decimals`` parameter specifies whether to return
-decimal values when fetching columns of type ``NUMBER`` that are
-capable of being represented in Arrow Decimal128 format. The default
-value is
-:data:`oracledb.defaults.fetch_decimals <Defaults.fetch_decimals>`.
+The ``fetch_decimals`` parameter specifies whether to return decimal
+values when fetching columns of type ``NUMBER`` that are capable of
+being represented in Apache Arrow Decimal128 format. The default value
+is :data:`oracledb.defaults.fetch_decimals <Defaults.fetch_decimals>`.

 The ``requested_schema`` parameter specifies an object that implements
-the Arrow PyCapsule schema interface. The DataFrame returned by
+the Apache Arrow PyCapsule schema interface. The DataFrame returned by
 ``fetch_df_all()`` will have the data types and names of the schema.
 """
 cursor = self.cursor()
@@ -2510,7 +2507,7 @@ async def fetch_df_batches(
 :data:`oracledb.defaults.fetch_decimals <Defaults.fetch_decimals>`.

 The ``requested_schema`` parameter specifies an object that implements
-the Arrow PyCapsule schema interface. The DataFrame returned by
+the Apache Arrow PyCapsule schema interface. The DataFrame returned by
 ``fetch_df_all()`` will have the data types and names of the schema.
 """
 cursor = self.cursor()

src/oracledb/impl/base/metadata.pyx

Lines changed: 3 additions & 3 deletions
@@ -46,15 +46,15 @@ cdef class OracleMetadata:
                     and self.precision > 0 and self.precision <= 38:
                 arrow_type = NANOARROW_TYPE_DECIMAL128
             elif py_type_num == PY_TYPE_NUM_STR:
-                arrow_type = NANOARROW_TYPE_STRING
+                arrow_type = NANOARROW_TYPE_LARGE_STRING
             elif py_type_num == PY_TYPE_NUM_INT and self.scale == 0 \
                     and 0 < self.precision <= 18:
                 arrow_type = NANOARROW_TYPE_INT64
             else:
                 arrow_type = NANOARROW_TYPE_DOUBLE
         elif db_type_num in (DB_TYPE_NUM_CHAR, DB_TYPE_NUM_VARCHAR,
                              DB_TYPE_NUM_NCHAR, DB_TYPE_NUM_NVARCHAR):
-            arrow_type = NANOARROW_TYPE_STRING
+            arrow_type = NANOARROW_TYPE_LARGE_STRING
         elif db_type_num == DB_TYPE_NUM_BINARY_FLOAT:
             arrow_type = NANOARROW_TYPE_FLOAT
         elif db_type_num == DB_TYPE_NUM_BINARY_DOUBLE:
@@ -78,7 +78,7 @@ cdef class OracleMetadata:
                              DB_TYPE_NUM_LONG_NVARCHAR):
             arrow_type = NANOARROW_TYPE_LARGE_STRING
         elif db_type_num == DB_TYPE_NUM_RAW:
-            arrow_type = NANOARROW_TYPE_BINARY
+            arrow_type = NANOARROW_TYPE_LARGE_BINARY
         elif db_type_num == DB_TYPE_NUM_VECTOR:
             if self.vector_flags & VECTOR_META_FLAG_SPARSE_VECTOR:
                 arrow_type = NANOARROW_TYPE_STRUCT
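
Reviewer note: the net effect of the two hunks in this file on the default type mapping can be summarised in a few lines of plain Python. This is only an illustration built from the lines visible in the diff; the dictionary and function names are hypothetical and do not exist in python-oracledb. NUMBER columns fetched as strings change from STRING to LARGE_STRING in the same way.

```python
# Defaults before this commit, taken from the removed lines in the diff.
OLD_DEFAULTS = {
    "DB_TYPE_CHAR": "STRING",
    "DB_TYPE_VARCHAR": "STRING",
    "DB_TYPE_NCHAR": "STRING",
    "DB_TYPE_NVARCHAR": "STRING",
    "DB_TYPE_RAW": "BINARY",
}

# Each 32-bit-offset type is replaced by its 64-bit-offset counterpart.
WIDENED = {"STRING": "LARGE_STRING", "BINARY": "LARGE_BINARY"}

NEW_DEFAULTS = {db_type: WIDENED[arrow] for db_type, arrow in OLD_DEFAULTS.items()}


def describe_changes():
    """Return one 'type: old -> new' line per Oracle type, sorted by name."""
    return [
        f"{db_type}: {OLD_DEFAULTS[db_type]} -> {NEW_DEFAULTS[db_type]}"
        for db_type in sorted(OLD_DEFAULTS)
    ]


for line in describe_changes():
    print(line)
```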
