
Commit 2c4cfc3

Fix UTF-8 expansion/truncation error during fetch
On IBM i, we bind character strings to UTF-8, which breaks the code's assumption that 1 character = 1 code unit. LUW uses UTF-16 instead, which as far as I can tell doesn't have this problem (I can't find any encoding that maps fewer than 4 bytes to a Unicode code point above U+FFFF).

As a result, the buffer may be too small to hold the entire converted value and the value is truncated; however, the indicator is still set to the length of the total data. To make matters worse, the code assumes the indicator value is less than the size of the buffer and reads that many bytes. When truncation occurs, this assumption is wrong, causing a buffer over-read and an attempt to decode arbitrary bytes as UTF-8. If those bytes are not valid UTF-8, this can cause an error similar to the following:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 22: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/QOpenSys/pkgs/lib/python3.6/site-packages/ibm_db_dbi.py", line 1472, in _fetch_helper
    row = ibm_db.fetch_tuple(self.stmt_handler)
SystemError: <built-in function fetch_tuple> returned a result with an error set

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 14, in <module>
    cur.fetchone()
  File "/QOpenSys/pkgs/lib/python3.6/site-packages/ibm_db_dbi.py", line 1492, in fetchone
    row_list = self._fetch_helper(1)
  File "/QOpenSys/pkgs/lib/python3.6/site-packages/ibm_db_dbi.py", line 1476, in _fetch_helper
    raise self.messages[-1]
ibm_db_dbi.Error: ibm_db_dbi::Error: SystemError('<built-in function fetch_tuple> returned a result with an error set',)
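For illustration only, here is a standalone toy C sketch of the failure mode described above; it is not part of the driver, and the 5-character column value and byte counts are hypothetical. A value whose UTF-8 form is longer than the column's declared character count gets truncated into a buffer sized at one byte per character, while the indicator still reports the full converted length; reading "indicator" bytes then runs past the buffer, whereas stopping at the NUL terminator stays inside it:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical 5-character column value whose UTF-8 conversion takes
     * 7 bytes ("héllö": é and ö are 2 bytes each in UTF-8). */
    const char *converted = "h\xC3\xA9ll\xC3\xB6";
    size_t indicator = strlen(converted);   /* 7: the length of the total data */

    /* Old sizing: one byte per declared character, plus a terminator. */
    char buf[5 + 1];
    size_t fits = sizeof(buf) - 1;          /* only 5 bytes fit: truncation */
    memcpy(buf, converted, fits);
    buf[fits] = '\0';

    /* Decoding 'indicator' (7) bytes from 'buf' would read past the end of
     * the allocation and hand arbitrary bytes to the UTF-8 decoder; decoding
     * only up to the NUL terminator cannot over-read. */
    printf("indicator=%zu, bytes actually in buffer=%zu\n",
           indicator, strlen(buf));
    return 0;
}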
1 parent 8bc7086 commit 2c4cfc3

File tree: 1 file changed (+20, -1 lines)

IBM_DB/ibm_db/ibm_db.c

Lines changed: 20 additions & 1 deletion
@@ -1056,7 +1056,26 @@ static int _python_ibm_db_bind_column_helper(stmt_handle *stmt_res)
 		case SQL_GRAPHIC:
 		case SQL_VARGRAPHIC:
 		case SQL_LONGVARGRAPHIC:
+#ifndef __PASE__
+			// Assume that no matter the source encoding, a
+			// character encoded in fewer than 4 bytes will map to
+			// a Unicode code point below U+10000 and thus maps to
+			// 2-bytes in UTF-16. A source character encoded in
+			// 4 bytes may map to a Unicode code point above U+FFFF,
+			// leading to a UTF-16 surrogate pair, but this would
+			// not mean any expansion.
 			in_length = stmt_res->column_info[i].size+1;
+#else
+			// Assume the worst-case of 1 byte in the source
+			// encoding maps to 4-bytes encoded in UTF-8.
+			//
+			// NOTE: We could do some heuristics to limit the amount
+			// of memory we allocate, but the maximum record length
+			// is 32KiB, so the max we could allocate for all
+			// columns would not exceed 128KiB, which is tiny and
+			// not worth bothering with.
+			in_length = stmt_res->column_info[i].size*4 + 1;
+#endif
 			row_data->w_val = (SQLTCHAR *) ALLOC_N(SQLTCHAR, in_length);
 			rc = SQLBindCol((SQLHSTMT)stmt_res->hstmt, (SQLUSMALLINT)(i+1),
 				SQL_C_TCHAR, row_data->w_val, in_length * sizeof(SQLTCHAR),

@@ -8357,7 +8376,7 @@ static PyObject *_python_ibm_db_bind_fetch_helper(PyObject *args, int op)
 		case SQL_VARGRAPHIC:
 		case SQL_LONGVARGRAPHIC:
 			tmp_length = stmt_res->column_info[column_number].size;
-			value = getSQLTCharAsPyUnicodeObject(row_data->w_val, out_length);
+			value = getSQLTCharAsPyUnicodeObject(row_data->w_val, SQL_NTS);
 			break;

 		case SQL_LONGVARCHAR:
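A note on the second hunk: SQL_NTS is the standard CLI/ODBC length sentinel meaning "the string is null-terminated, measure it yourself". As far as one can tell from the diff, passing it to getSQLTCharAsPyUnicodeObject makes the helper derive the length from the buffer contents rather than trust the (possibly too-large) indicator. A minimal sketch of that decision, using a hypothetical helper and a stand-in constant rather than the driver's actual code:

#include <stdio.h>
#include <string.h>

#define EXAMPLE_SQL_NTS (-3L)   /* stand-in for the CLI's SQL_NTS sentinel */

/* How many bytes of 'val' to decode. Trusting a caller-supplied length
 * (such as the truncation indicator) can exceed what the buffer holds;
 * EXAMPLE_SQL_NTS limits the decode to the data actually present. */
static size_t units_to_decode(const char *val, long length)
{
    if (length == EXAMPLE_SQL_NTS)
        return strlen(val);
    return (size_t)length;
}

int main(void)
{
    char buf[6] = "h\xC3\xA9ll";   /* truncated copy: 5 bytes plus NUL */
    printf("trusting the indicator: %zu bytes\n", units_to_decode(buf, 7));
    printf("trusting the buffer:    %zu bytes\n",
           units_to_decode(buf, EXAMPLE_SQL_NTS));
    return 0;
}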
