|
| 1 | +.. Licensed to the Apache Software Foundation (ASF) under one |
| 2 | +.. or more contributor license agreements. See the NOTICE file |
| 3 | +.. distributed with this work for additional information |
| 4 | +.. regarding copyright ownership. The ASF licenses this file |
| 5 | +.. to you under the Apache License, Version 2.0 (the |
| 6 | +.. "License"); you may not use this file except in compliance |
| 7 | +.. with the License. You may obtain a copy of the License at |
| 8 | +
|
| 9 | +.. http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | +.. Unless required by applicable law or agreed to in writing, |
| 12 | +.. software distributed under the License is distributed on an |
| 13 | +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 14 | +.. KIND, either express or implied. See the License for the |
| 15 | +.. specific language governing permissions and limitations |
| 16 | +.. under the License. |
| 17 | +
|
| 18 | +.. default-domain:: cpp |
| 19 | + |
| 20 | +.. _cpp-security: |
| 21 | + |
| 22 | +======================= |
| 23 | +Security Considerations |
| 24 | +======================= |
| 25 | + |
| 26 | +.. important:: |
| 27 | + This document describes the security model for using the Arrow C++ APIs. |
| 28 | + To better understand this document, we recommend that you first read |
| 29 | + the :ref:`overall security model <format_security>` for the Arrow project. |
| 30 | + |
| 31 | +Parameter mismatch |
| 32 | +================== |
| 33 | + |
| 34 | +Many Arrow C++ APIs report errors using the :class:`arrow::Status` and |
| 35 | +:class:`arrow::Result` classes. Such APIs can be assumed to detect common errors in the |
| 36 | +provided arguments. However, there are also often implicit pre-conditions that |
| 37 | +have to be upheld; these can usually be deduced from the semantics of an API |
| 38 | +as described by its documentation. |
| 39 | + |
| 40 | +.. seealso:: Arrow C++ :ref:`cpp-conventions` |
| 41 | + |
| 42 | +Pointer validity |
| 43 | +---------------- |
| 44 | + |
| 45 | +Pointers are always assumed to be valid and point to memory of the size required |
| 46 | +by the API. In particular, it is *forbidden to pass a null pointer* except where |
| 47 | +the API documentation explicitly says otherwise. |
| 48 | + |
| 49 | +Type restrictions |
| 50 | +----------------- |
| 51 | + |
| 52 | +Some APIs are specified to operate on specific Arrow data types and may not |
| 53 | +verify that their arguments conform to the expected data types. Passing the |
| 54 | +wrong kind of data as input may lead to undefined behavior. |
| 55 | + |
| 56 | +.. _cpp-valid-data: |
| 57 | + |
| 58 | +Data validity |
| 59 | +------------- |
| 60 | + |
| 61 | +Arrow data, for example passed as :class:`arrow::Array` or :class:`arrow::Table`, |
| 62 | +is always assumed to be :ref:`valid <format-invalid-data>`. If your program may |
| 63 | +encounter invalid data, it must explicitly check its validity by calling one of |
| 64 | +the following validation APIs. |
| 65 | + |
| 66 | +Structural validity |
| 67 | +''''''''''''''''''' |
| 68 | + |
| 69 | +The ``Validate`` methods exposed on various Arrow C++ classes perform relatively |
| 70 | +inexpensive validity checks that the data is structurally valid. This implies |
| 71 | +checking the number of buffers, child arrays, and other similar conditions. |
| 72 | + |
| 73 | +* :func:`arrow::Array::Validate` |
| 74 | +* :func:`arrow::RecordBatch::Validate` |
| 75 | +* :func:`arrow::ChunkedArray::Validate` |
| 76 | +* :func:`arrow::Table::Validate` |
| 77 | +* :func:`arrow::Scalar::Validate` |
| 78 | + |
| 79 | +These checks are typically constant-time with respect to the number of rows in the data, |
| 80 | +but linear in the number of descendant fields. They can be good enough to detect |
| 81 | +potential bugs in your own code. However, they are not enough to detect all classes of |
| 82 | +invalid data, and they won't protect against all kinds of malicious payloads. |
| 83 | + |
| 84 | +Full validity |
| 85 | +''''''''''''' |
| 86 | + |
| 87 | +The ``ValidateFull`` methods exposed by the same classes perform the same validity |
| 88 | +checks as the ``Validate`` methods, but they also check the data extensively for |
| 89 | +any non-conformance to the Arrow spec. In particular, they check all the offsets |
| 90 | +of variable-length data types, which is of fundamental importance when ingesting |
| 91 | +untrusted data from sources such as the IPC format (otherwise the variable-length |
| 92 | +offsets could point outside of the corresponding data buffer). They also check |
| 93 | +for invalid values, such as invalid UTF-8 strings or decimal values out of range |
| 94 | +for the advertised precision. |
| 95 | + |
| 96 | +* :func:`arrow::Array::ValidateFull` |
| 97 | +* :func:`arrow::RecordBatch::ValidateFull` |
| 98 | +* :func:`arrow::ChunkedArray::ValidateFull` |
| 99 | +* :func:`arrow::Table::ValidateFull` |
| 100 | +* :func:`arrow::Scalar::ValidateFull` |
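
To make the offset check described above more concrete, here is a minimal
standalone sketch of the invariants a full validation pass has to confirm for a
variable-length array with 32-bit offsets. It is not Arrow's implementation:
``OffsetsAreValid`` is a hypothetical helper, and treating an empty offsets
buffer as invalid is a simplifying assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow). Returns true if `offsets` could
// safely index a data buffer of `data_size` bytes:
//   * the first offset is non-negative,
//   * offsets never decrease,
//   * the last offset does not point past the end of the data buffer.
bool OffsetsAreValid(const std::vector<int32_t>& offsets, int64_t data_size) {
  if (offsets.empty()) {
    // Simplifying assumption: this sketch requires at least one offset.
    return false;
  }
  if (offsets.front() < 0) {
    return false;
  }
  // This loop visits every offset, which is why such a check is linear
  // in the number of rows rather than constant-time.
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] < offsets[i - 1]) {
      return false;
    }
  }
  return static_cast<int64_t>(offsets.back()) <= data_size;
}
```

Code that slices the data buffer using offsets that fail any of these
conditions would read out of bounds, which is the risk ``ValidateFull`` is
meant to rule out before the data is used.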
| 101 | + |
| 102 | +"Safe" and "unsafe" APIs |
| 103 | +------------------------ |
| 104 | + |
| 105 | +Some APIs are exposed in both "safe" and "unsafe" variants. The naming convention |
| 106 | +for such pairs varies: sometimes the former has a ``Safe`` suffix (for example |
| 107 | +``SliceSafe`` vs. ``Slice``), sometimes the latter has an ``Unsafe`` prefix or |
| 108 | +suffix (for example ``Append`` vs. ``UnsafeAppend``). |
| 109 | + |
| 110 | +In all cases, the "unsafe" API is intended as a more efficient API that |
| 111 | +eschews some of the checks that the "safe" API performs. It is then up to the |
| 112 | +caller to ensure that the preconditions are met, otherwise undefined behavior |
| 113 | +may ensue. |
| 114 | + |
| 115 | +The API documentation usually spells out the differences between "safe" and "unsafe" |
| 116 | +variants, but these typically fall into two categories: |
| 117 | + |
| 118 | +* structural checks, such as passing the right Arrow data type or numbers of buffers; |
| 119 | +* allocation size checks, such as having preallocated enough data for the given input |
| 120 | + arguments (this is typical of the :ref:`array builders <cpp-api-array-builders>` |
| 121 | + and :ref:`buffer builders <cpp-api-array-builders>`). |
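
The allocation-size category can be illustrated with a toy builder. This is a
sketch only: ``ToyInt32Builder`` is a hypothetical class, not an Arrow builder,
and its "safe" ``Append`` returns ``void`` for brevity, whereas real Arrow
"safe" variants report failures through :class:`arrow::Status`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy builder (not an Arrow class) showing how a "safe" and an
// "unsafe" append can differ only in who is responsible for capacity.
class ToyInt32Builder {
 public:
  // Ensures there is room for `additional` more values.
  void Reserve(std::size_t additional) {
    if (length_ + additional > storage_.size()) {
      storage_.resize(length_ + additional);
    }
  }

  // "Safe" variant: makes sure there is room, then writes.
  void Append(int32_t value) {
    Reserve(1);
    UnsafeAppend(value);
  }

  // "Unsafe" variant: precondition is length() < capacity().
  // Calling it without enough reserved space is undefined behavior.
  void UnsafeAppend(int32_t value) { storage_[length_++] = value; }

  std::size_t length() const { return length_; }
  std::size_t capacity() const { return storage_.size(); }
  int32_t Value(std::size_t i) const { return storage_[i]; }

 private:
  std::vector<int32_t> storage_;
  std::size_t length_ = 0;
};
```

A caller that knows the number of values in advance can call ``Reserve`` once
and then use ``UnsafeAppend`` in a loop, skipping the per-value capacity check.
The caller must then guarantee that it never appends more values than it
reserved.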
| 122 | + |
| 123 | +Ingesting untrusted data |
| 124 | +======================== |
| 125 | + |
| 126 | +As an exception to the above (see :ref:`cpp-valid-data`), some APIs support ingesting |
| 127 | +untrusted, potentially malicious data. These are: |
| 128 | + |
| 129 | +* the :ref:`IPC reader <cpp-ipc-reading>` APIs |
| 130 | +* the :ref:`Parquet reader <cpp-parquet-reading>` APIs |
| 131 | +* the :ref:`CSV reader <cpp-csv-reading>` APIs |
| 132 | + |
| 133 | +However, you must not assume that they will always return valid Arrow data. The reason |
| 134 | +for not validating data automatically is that validation can be expensive but |
| 135 | +unnecessary when reading from trusted data sources. |
| 136 | + |
| 137 | +Instead, when using these APIs with potentially invalid data (such as data coming |
| 138 | +from an untrusted source), you **must** follow these steps: |
| 139 | + |
| 140 | +1. Check any error returned by the API, as with any other API |
| 141 | +2. If the API returned successfully, validate the returned Arrow data in full |
| 142 | + (see "Full validity" above) |