@@ -35,14 +35,21 @@ for many reasons:
 3. When reading code, the contents of an ``object`` dtype array is less clear
    than ``'string'``.

-Currently, the performance of ``object`` dtype arrays of strings and
-:class:`arrays.StringArray` are about the same. We expect future enhancements
+When using :class:`StringDtype` with PyArrow as the storage (see below),
+users will see large performance improvements in memory as well as time
+for certain operations when compared to ``object`` dtype arrays. When
+not using PyArrow as the storage, the performance of :class:`StringDtype`
+is about the same as that of ``object``. We expect future enhancements
 to significantly increase the performance and lower the memory overhead of
-:class:`~arrays.StringArray`.
+:class:`StringDtype` in this case.

 .. versionchanged:: 3.0

-   The default when pandas infers the dtype of a collection of strings is to use ``dtype='str'``.
+   The default when pandas infers the dtype of a collection of
+   strings is to use ``dtype='str'``. This will use ``np.nan``
+   as its NA value and be backed by a PyArrow string array when
+   PyArrow is installed, or backed by a NumPy ``object`` array
+   when PyArrow is not installed.

 .. ipython:: python

@@ -51,15 +58,17 @@ to significantly increase the performance and lower the memory overhead of
 Specifying :class:`StringDtype` explicitly
 ==========================================

-When it is desired to explicitly specify the dtype, we generally recommend using the alias ``dtype="str"``.
+When it is desired to explicitly specify the dtype, we generally recommend
+using the alias ``dtype="str"`` if you desire to have ``np.nan`` as the NA
+value or the alias ``dtype="string"`` if you desire to have ``pd.NA`` as
+the NA value.

 .. ipython:: python

-   pd.Series(["a", "b", "c"], dtype="str")
+   pd.Series(["a", "b", None], dtype="str")
+   pd.Series(["a", "b", None], dtype="string")

-However there are four distinct :class:`StringDtype` variants that may be utilized.
-You can also use :class:`StringDtype`/``"str"``/``"string"`` as the dtype
-on non-string data and it will be converted to strings:
+Specifying either alias will also convert non-string data to strings:

 .. ipython:: python

@@ -73,10 +82,12 @@ or convert from existing pandas data:

    s1 = pd.Series([1, 2, pd.NA], dtype="Int64")
    s1
-   s2 = s1.astype("str")
+   s2 = s1.astype("string")
    s2
    type(s2[0])

+However there are four distinct :class:`StringDtype` variants that may be utilized.
+
 Python storage with ``np.nan`` values
 -------------------------------------

@@ -184,15 +195,16 @@ Behavior differences
    s.str.isdigit()
    s.str.match("a")

-2. Some string methods, like :meth:`Series.str.decode` because the underlying
-   array can only contain strings, not bytes.
+2. Some string methods, like :meth:`Series.str.decode`, are not
+   available because the underlying array can only contain
+   strings, not bytes.
 3. Comparison operations will return a NumPy array with dtype bool. Missing
-   values will always compare as unequal just as :attr:`numpy.nan` does.
+   values will always compare as unequal just as :attr:`np.nan` does.

 ``StringDtype`` with ``pd.NA`` NA values
 ----------------------------------------

-1. For ``StringDtype``, :ref:`string accessor methods<api.series.str>`
+1. :ref:`String accessor methods<api.series.str>`
    that return **integer** output will always return a nullable integer dtype,
    rather than either int or float dtype (depending on the presence of NA values).
    Methods returning **boolean** output will return a nullable boolean dtype.
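
The ``pd.NA`` behaviour described in item 1 of the last hunk can be sketched as follows. This is a minimal illustration rather than part of the patch: the sample values are made up, and it assumes a pandas version in which ``dtype="string"`` selects the ``pd.NA``-based :class:`StringDtype`.

```python
import pandas as pd

# Assumption: "string" is the alias for the StringDtype variant that
# uses pd.NA as its NA value. The sample values are hypothetical.
s = pd.Series(["a", "12", None], dtype="string")

# Integer-returning accessor method: the result keeps a nullable
# integer dtype even though one entry is missing.
lengths = s.str.len()

# Boolean-returning accessor method: the result uses the nullable
# boolean dtype, and the missing entry stays missing.
is_digit = s.str.isdigit()

print(lengths.dtype)
print(is_digit.dtype)
print(s.isna().tolist())
```

Printing the dtypes rather than the values keeps the focus on the point the text makes: the result dtype does not depend on whether NA values happen to be present.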