Commit 5b2afe7

committed
Refinements
1 parent 5cdfe9f commit 5b2afe7

1 file changed

+26
-14
lines changed

doc/source/user_guide/text.rst

Lines changed: 26 additions & 14 deletions
@@ -35,14 +35,21 @@ for many reasons:
 3. When reading code, the contents of an ``object`` dtype array is less clear
 than ``'string'``.
 
-Currently, the performance of ``object`` dtype arrays of strings and
-:class:`arrays.StringArray` are about the same. We expect future enhancements
+When using :class:`StringDtype` with PyArrow as the storage (see below),
+users will see large performance improvements in memory as well as time
+for certain operations when compared to ``object`` dtype arrays. When
+not using PyArrow as the storage, the performance of :class:`StringDtype`
+is about the same as that of ``object``. We expect future enhancements
 to significantly increase the performance and lower the memory overhead of
-:class:`~arrays.StringArray`.
+:class:`StringDtype` in this case.
 
 .. versionchanged:: 3.0
 
-The default when pandas infers the dtype of a collection of strings is to use ``dtype='str'``.
+The default when pandas infers the dtype of a collection of
+strings is to use ``dtype='str'``. This will use ``np.nan``
+as its NA value and be backed by a PyArrow string array when
+PyArrow is installed, or backed by NumPy ``object`` array
+when PyArrow is not installed.
 
 .. ipython:: python
 
@@ -51,15 +58,17 @@ to significantly increase the performance and lower the memory overhead of
 Specifying :class:`StringDtype` explicitly
 ==========================================
 
-When it is desired to explicitly specify the dtype, we generally recommend using the alias ``dtype="str"``.
+When it is desired to explicitly specify the dtype, we generally recommend
+using the alias ``dtype="str"`` if you desire to have ``np.nan`` as the NA
+value or the alias ``dtype="string"`` if you desire to have ``pd.NA`` as
+the NA value.
 
 .. ipython:: python
 
-pd.Series(["a", "b", "c"], dtype="str")
+pd.Series(["a", "b", None], dtype="str")
+pd.Series(["a", "b", None], dtype="string")
 
-However there are four distinct :class:`StringDtype` variants that may be utilized.
-You can also use :class:`StringDtype`/``"str"``/``"string"`` as the dtype
-on non-string data and it will be converted to strings:
+Specifying either alias will also convert non-string data to strings:
 
 .. ipython:: python
 
@@ -73,10 +82,12 @@ or convert from existing pandas data:
 
 s1 = pd.Series([1, 2, pd.NA], dtype="Int64")
 s1
-s2 = s1.astype("str")
+s2 = s1.astype("string")
 s2
 type(s2[0])
 
+However there are four distinct :class:`StringDtype` variants that may be utilized.
+
 Python storage with ``np.nan`` values
 -------------------------------------
 
@@ -184,15 +195,16 @@ Behavior differences
 s.str.isdigit()
 s.str.match("a")
 
-2. Some string methods, like :meth:`Series.str.decode` because the underlying
-array can only contain strings, not bytes.
+2. Some string methods, like :meth:`Series.str.decode`, are not
+available because the underlying array can only contain
+strings, not bytes.
 3. Comparison operations will return a NumPy array with dtype bool. Missing
-values will always compare as unequal just as :attr:`numpy.nan` does.
+values will always compare as unequal just as :attr:`np.nan` does.
 
 ``StringDtype`` with ``pd.NA`` NA values
 ----------------------------------------
 
-1. For ``StringDtype``, :ref:`string accessor methods<api.series.str>`
+1. :ref:`String accessor methods<api.series.str>`
 that return **integer** output will always return a nullable integer dtype,
 rather than either int or float dtype (depending on the presence of NA values).
 Methods returning **boolean** output will return a nullable boolean dtype.
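
The ``pd.NA`` behavior described in the last hunk can be checked with a minimal sketch like the one below. It is not part of the commit: the sample values and variable names are illustrative assumptions, and it exercises only the ``dtype="string"`` alias, because what ``dtype="str"`` means depends on running pandas 3.0 or later.

```python
import pandas as pd

# Explicitly request the pd.NA-backed string dtype via the "string" alias.
# The sample values are illustrative, not taken from the commit.
s = pd.Series(["a", "bb", None], dtype="string")

# The None entry is stored as a missing value.
missing_mask = s.isna().tolist()

# Accessor methods returning integers use a nullable integer dtype.
lengths = s.str.len()

# Accessor methods returning booleans use a nullable boolean dtype.
digits = s.str.isdigit()

# Converting existing nullable-integer data yields Python str elements,
# mirroring the s1.astype("string") example in the diff.
s2 = pd.Series([1, 2, pd.NA], dtype="Int64").astype("string")

print(missing_mask, lengths.dtype, digits.dtype, type(s2[0]).__name__)
```

Printing the dtypes rather than the values keeps the check independent of whether the underlying storage is Python objects or PyArrow.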
