Commit 10d4a2d

GH-49274: [Doc][C++] Document security model for Arrow C++
1 parent 5f12de2 commit 10d4a2d

11 files changed: 193 additions & 6 deletions

cpp/src/arrow/array/data.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -501,6 +501,7 @@ struct ARROW_EXPORT BufferSpan {
501501
}
502502
};
503503

504+
/// \class ArraySpan
504505
/// \brief EXPERIMENTAL: A non-owning array data container
505506
///
506507
/// Unlike ArrayData, this class doesn't own its referenced data type nor data buffers.

docs/source/cpp/api/array.rst

Lines changed: 13 additions & 2 deletions
@@ -92,8 +92,8 @@ Extension arrays
9292
.. doxygenclass:: arrow::ExtensionArray
9393
:members:
9494

95-
Run-End Encoded Array
96-
---------------------
95+
Run-end encoded
96+
---------------
9797

9898
.. doxygenclass:: arrow::RunEndEncodedArray
9999
:members:
@@ -116,6 +116,17 @@ Chunked Arrays
116116
:project: arrow_cpp
117117
:members:
118118

119+
Non-owning data class
120+
=====================
121+
122+
.. warning::
123+
As this class doesn't keep alive the objects and data it points to, their
124+
lifetime must be ensured separately. We recommend using :class:`arrow::ArrayData`
125+
instead.
126+
127+
.. doxygenclass:: arrow::ArraySpan
128+
:members:
129+
119130
Utilities
120131
=========
121132

docs/source/cpp/api/builder.rst

Lines changed: 4 additions & 0 deletions
@@ -15,10 +15,14 @@
1515
.. specific language governing permissions and limitations
1616
.. under the License.
1717
18+
.. _cpp-api-array-builders:
19+
1820
==============
1921
Array Builders
2022
==============
2123

24+
.. seealso:: :ref:`cpp-api-buffer-builders` for direct construction of array buffers
25+
2226
.. doxygenclass:: arrow::ArrayBuilder
2327
:members:
2428

docs/source/cpp/api/memory.rst

Lines changed: 12 additions & 0 deletions
@@ -55,6 +55,16 @@ Buffers
5555
.. doxygenclass:: arrow::ResizableBuffer
5656
:members:
5757

58+
Non-owning Buffer
59+
-----------------
60+
61+
.. warning::
62+
This class is exposed solely as a building block for :class:`arrow::ArraySpan`.
63+
For any other purpose, please use :class:`arrow::Buffer`.
64+
65+
.. doxygenclass:: arrow::BufferSpan
66+
:members:
67+
5868
Memory Pools
5969
------------
6070

@@ -91,6 +101,8 @@ Slicing
91101
.. doxygengroup:: buffer-slicing-functions
92102
:content-only:
93103

104+
.. _cpp-api-buffer-builders:
105+
94106
Buffer Builders
95107
---------------
96108

docs/source/cpp/conventions.rst

Lines changed: 6 additions & 1 deletion
@@ -20,6 +20,8 @@
2020

2121
.. cpp:namespace:: arrow
2222

23+
.. _cpp-conventions:
24+
2325
Conventions
2426
===========
2527

@@ -43,6 +45,10 @@ Safe pointers
4345
Arrow objects are usually passed and stored using safe pointers -- most of
4446
the time :class:`std::shared_ptr` but sometimes also :class:`std::unique_ptr`.
4547

48+
Non-owning alternatives exist for the rare situations where the overhead of
49+
a safe pointer is considered unacceptable: :class:`ArraySpan` and :class:`BufferSpan`.
50+
Their usage in third-party code is not recommended.
51+
4652
Immutability
4753
------------
4854

@@ -104,4 +110,3 @@ For example::
104110

105111
.. seealso::
106112
:doc:`API reference for error reporting <api/support>`
107-

docs/source/cpp/csv.rst

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,8 @@ to create Arrow Tables or a stream of Arrow RecordBatches.
3030
.. seealso::
3131
:ref:`CSV reader/writer API reference <cpp-api-csv>`.
3232

33+
.. _cpp-csv-reading:
34+
3335
Reading CSV files
3436
=================
3537

docs/source/cpp/ipc.rst

Lines changed: 2 additions & 0 deletions
@@ -33,6 +33,8 @@ lower level input/output, handled through the :doc:`IO interfaces <io>`.
3333
For reading, there is also an event-driven API that enables feeding
3434
arbitrary data into the IPC decoding layer asynchronously.
3535

36+
.. _cpp-ipc-reading:
37+
3638
Reading IPC streams and files
3739
=============================
3840

docs/source/cpp/parquet.rst

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ is a space-efficient columnar storage format for complex data. The Parquet
3232
C++ implementation is part of the Apache Arrow project and benefits
3333
from tight integration with the Arrow C++ classes and facilities.
3434

35+
.. _cpp-parquet-reading:
36+
3537
Reading Parquet files
3638
=====================
3739

docs/source/cpp/security.rst

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
.. Licensed to the Apache Software Foundation (ASF) under one
2+
.. or more contributor license agreements. See the NOTICE file
3+
.. distributed with this work for additional information
4+
.. regarding copyright ownership. The ASF licenses this file
5+
.. to you under the Apache License, Version 2.0 (the
6+
.. "License"); you may not use this file except in compliance
7+
.. with the License. You may obtain a copy of the License at
8+
9+
.. http://www.apache.org/licenses/LICENSE-2.0
10+
11+
.. Unless required by applicable law or agreed to in writing,
12+
.. software distributed under the License is distributed on an
13+
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
.. KIND, either express or implied. See the License for the
15+
.. specific language governing permissions and limitations
16+
.. under the License.
17+
18+
.. default-domain:: cpp
19+
20+
.. _cpp-security:
21+
22+
=======================
23+
Security Considerations
24+
=======================
25+
26+
.. important::
27+
This document describes the security model for using the Arrow C++ APIs.
28+
To better understand this document, we recommend that you first read
29+
the :ref:`overall security model <format_security>` for the Arrow project.
30+
31+
Parameter mismatch
32+
==================
33+
34+
Many Arrow C++ APIs report errors using :class:`arrow::Status` and
35+
:class:`arrow::Result`. Such APIs can be assumed to detect common errors in the
36+
provided arguments. However, there are also often implicit pre-conditions that
37+
have to be upheld; these can usually be deduced from the semantics of an API
38+
as described by its documentation.
39+
40+
.. seealso:: Arrow C++ :ref:`cpp-conventions`
41+
42+
Pointer validity
43+
----------------
44+
45+
Pointers are always assumed to be valid and point to memory of the size required
46+
by the API. In particular, it is *forbidden to pass a null pointer* except where
47+
the API documentation explicitly says otherwise.
48+
49+
Type restrictions
50+
-----------------
51+
52+
Some APIs are specified to operate on specific Arrow data types and may not
53+
verify that their arguments conform to the expected data types. Passing the
54+
wrong kind of data as input may lead to undefined behavior.
55+
56+
.. _cpp-valid-data:
57+
58+
Data validity
59+
-------------
60+
61+
Arrow data, for example passed as :class:`arrow::Array` or :class:`arrow::Table`,
62+
is always assumed to be :ref:`valid <format-invalid-data>`. If your program may
63+
encounter invalid data, it must explicitly check its validity by calling one of
64+
the following validation APIs.
65+
66+
Structural validity
67+
'''''''''''''''''''
68+
69+
The ``Validate`` methods exposed on various Arrow C++ classes perform relatively
70+
inexpensive validity checks that the data is structurally valid. This implies
71+
checking the number of buffers, child arrays, and other similar conditions.
72+
73+
* :func:`arrow::Array::Validate`
74+
* :func:`arrow::RecordBatch::Validate`
75+
* :func:`arrow::ChunkedArray::Validate`
76+
* :func:`arrow::Table::Validate`
77+
* :func:`arrow::Scalar::Validate`
78+
79+
These checks are typically constant-time with respect to the number of rows in the data,
80+
but linear in the number of descendant fields. They can be good enough to detect
81+
potential bugs in your own code. However, they are not enough to detect all classes of
82+
invalid data, and they won't protect against all kinds of malicious payloads.
83+
84+
Full validity
85+
'''''''''''''
86+
87+
The ``ValidateFull`` methods exposed by the same classes perform the same validity
88+
checks as the ``Validate`` methods, but they also check the data extensively for
89+
any non-conformance to the Arrow spec. In particular, they check all the offsets
90+
of variable-length data types, which is of fundamental importance when ingesting
91+
untrusted data from sources such as the IPC format (otherwise the variable-length
92+
offsets could point outside of the corresponding data buffer). They also check
93+
for invalid values, such as invalid UTF-8 strings or decimal values out of range
94+
for the advertised precision.
95+
96+
* :func:`arrow::Array::ValidateFull`
97+
* :func:`arrow::RecordBatch::ValidateFull`
98+
* :func:`arrow::ChunkedArray::ValidateFull`
99+
* :func:`arrow::Table::ValidateFull`
100+
* :func:`arrow::Scalar::ValidateFull`
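
To illustrate why the offsets check matters, here is a minimal standalone sketch of the kind of test a full validation pass has to perform on a variable-length column. It is not Arrow's implementation: ``OffsetsAreValid`` is a hypothetical helper, and it assumes 32-bit offsets and at least one offset entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical helper (not an Arrow API): returns true when `offsets`
// describes in-bounds, non-decreasing slices of `data`.
bool OffsetsAreValid(const std::vector<std::int32_t>& offsets,
                     std::string_view data) {
  // Simplifying assumption: an offsets buffer holds (length + 1) entries,
  // so it is never empty in this sketch.
  if (offsets.empty()) return false;
  if (offsets.front() < 0) return false;
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    // A decreasing offset would yield a negative slice length.
    if (offsets[i] < offsets[i - 1]) return false;
  }
  // The last offset must not point past the end of the data buffer.
  return static_cast<std::size_t>(offsets.back()) <= data.size();
}
```

The loop is linear in the number of values, which is why this kind of check belongs to the more expensive full validation and not to the structural one.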
101+
102+
"Safe" and "unsafe" APIs
103+
------------------------
104+
105+
Some APIs are exposed in both "safe" and "unsafe" variants. The naming convention
106+
for such pairs varies: sometimes the former has a ``Safe`` suffix (for example
107+
``SliceSafe`` vs. ``Slice``), sometimes the latter has an ``Unsafe`` prefix or
108+
suffix (for example ``Append`` vs. ``UnsafeAppend``).
109+
110+
In all cases, the "unsafe" API is intended as a more efficient API that
111+
eschews some of the checks that the "safe" API performs. It is then up to the
112+
caller to ensure that the preconditions are met, otherwise undefined behavior
113+
may ensue.
114+
115+
The API documentation usually spells out the differences between "safe" and "unsafe"
116+
variants, but these typically fall into two categories:
117+
118+
* structural checks, such as passing the right Arrow data type or numbers of buffers;
119+
* allocation size checks, such as having preallocated enough data for the given input
120+
arguments (this is typical of the :ref:`array builders <cpp-api-array-builders>`
121+
and :ref:`buffer builders <cpp-api-buffer-builders>`).
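
The allocation size check can be pictured with a toy builder. This is a sketch for illustration only: ``ToyInt32Builder`` is hypothetical and deliberately simpler than Arrow's builders (for instance, its ``Append`` returns ``void`` where Arrow's returns a ``Status``).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>

// Toy builder for illustration only; this is NOT arrow::ArrayBuilder.
class ToyInt32Builder {
 public:
  // Ensure there is room for at least `additional` more values.
  void Reserve(std::size_t additional) {
    if (size_ + additional <= capacity_) return;
    const std::size_t new_capacity =
        std::max(size_ + additional, capacity_ * 2);
    auto new_data = std::make_unique<std::int32_t[]>(new_capacity);
    std::copy(data_.get(), data_.get() + size_, new_data.get());
    data_ = std::move(new_data);
    capacity_ = new_capacity;
  }

  // "Safe" variant: performs the allocation size check itself.
  void Append(std::int32_t value) {
    Reserve(1);
    UnsafeAppend(value);
  }

  // "Unsafe" variant: the caller must already have reserved enough space.
  // Violating this precondition is undefined behavior in release builds.
  void UnsafeAppend(std::int32_t value) {
    assert(size_ < capacity_);  // debug-only check
    data_[size_++] = value;
  }

  std::size_t size() const { return size_; }
  std::int32_t Get(std::size_t i) const { return data_[i]; }

 private:
  std::unique_ptr<std::int32_t[]> data_;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```

A caller that knows how many values are coming can call ``Reserve(n)`` once and then ``UnsafeAppend`` in a tight loop, paying for the capacity check once instead of on every value.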
122+
123+
Ingesting untrusted data
124+
========================
125+
126+
As an exception to the above (see :ref:`cpp-valid-data`), some APIs support ingesting
127+
untrusted, potentially malicious data. These are:
128+
129+
* the :ref:`IPC reader <cpp-ipc-reading>` APIs
130+
* the :ref:`Parquet reader <cpp-parquet-reading>` APIs
131+
* the :ref:`CSV reader <cpp-csv-reading>` APIs
132+
133+
You must not assume that they will always return valid Arrow data. The reason
134+
for not validating data automatically is that validation can be expensive but
135+
unnecessary when reading from trusted data sources.
136+
137+
Instead, when using these APIs with potentially invalid data (such as data coming
138+
from an untrusted source), you **must** follow these steps:
139+
140+
1. Check any error returned by the API, as with any other API
141+
2. If the API returned successfully, validate the returned Arrow data in full
142+
(see "Full validity" above)

docs/source/cpp/user_guide.rst

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ User Guide
3939
json
4040
dataset
4141
flight
42+
security
4243
gdb
4344
threading
4445
opentelemetry
