Commit a95d84a

davclark authored and jreback committed
Some more clarification, particularly regexps
1 parent fe924b4 commit a95d84a

5 files changed: +120 -43 lines changed

doc/source/10min.rst

Lines changed: 6 additions & 1 deletion
@@ -433,7 +433,12 @@ See more at :ref:`Histogramming and Discretization <basics.discretization>`
 String Methods
 ~~~~~~~~~~~~~~
 
-See more at :ref:`Vectorized String Methods <text.string_methods>`
+Series is equipped with a set of string processing methods in the `str`
+attribute that make it easy to operate on each element of the array, as in the
+code snippet below. Note that pattern-matching in `str` generally uses `regular
+expressions <https://docs.python.org/2/library/re.html>`__ by default (and in
+some cases always uses them). See more at :ref:`Vectorized String Methods
+<text.string_methods>`.
 
 .. ipython:: python

doc/source/api.rst

Lines changed: 1 addition & 1 deletion
@@ -1410,7 +1410,7 @@ Computations / Descriptive Stats
    GroupBy.mean
    GroupBy.median
    GroupBy.min
-   GroupBy.nth
+   GroupBy.nth
    GroupBy.ohlc
    GroupBy.prod
    GroupBy.size

doc/source/basics.rst

Lines changed: 22 additions & 0 deletions
@@ -1159,6 +1159,28 @@ The ``.dt`` accessor works for period and timedelta dtypes.
 
 ``Series.dt`` will raise a ``TypeError`` if you access with a non-datetimelike values
 
+Vectorized string methods
+-------------------------
+
+Series is equipped with a set of string processing methods that make it easy to
+operate on each element of the array. Perhaps most importantly, these methods
+exclude missing/NA values automatically. These are accessed via the Series's
+``str`` attribute and generally have names matching the equivalent (scalar)
+built-in string methods. For example:
+
+.. ipython:: python
+
+   s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
+   s.str.lower()
+
+Powerful pattern-matching methods are provided as well, but note that
+pattern-matching generally uses `regular expressions
+<https://docs.python.org/2/library/re.html>`__ by default (and in some cases
+always uses them).
+
+Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
+description.
+
 .. _basics.sorting:
 
 Sorting by index and value
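
The NA-skipping behaviour described in the added ``basics.rst`` section can be illustrated without pandas. The following is only a plain-Python sketch of the semantics: ``map_str`` is a hypothetical helper written for this illustration, not a pandas function, and it makes no claim about how ``Series.str`` is implemented.

```python
import math


def map_str(values, func):
    """Apply func to each string; pass missing values (None/NaN) through."""
    out = []
    for v in values:
        is_missing = v is None or (isinstance(v, float) and math.isnan(v))
        out.append(v if is_missing else func(v))
    return out


# A shortened version of the sample data used in the diff.
data = ['A', 'Aaba', float('nan'), 'dog']
lowered = map_str(data, str.lower)
print(lowered)
```

The missing entry stays missing instead of raising an error, which is the property the new paragraph highlights for the ``str`` methods.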

doc/source/text.rst

Lines changed: 90 additions & 41 deletions
@@ -21,10 +21,7 @@ Series is equipped with a set of string processing methods
 that make it easy to operate on each element of the array. Perhaps most
 importantly, these methods exclude missing/NA values automatically. These are
 accessed via the Series's ``str`` attribute and generally have names matching
-the equivalent (scalar) build-in string methods:
-
-Splitting and Replacing Strings
--------------------------------
+the equivalent (scalar) built-in string methods:
 
 .. ipython:: python
 
@@ -33,21 +30,33 @@ Splitting and Replacing Strings
    s.str.upper()
    s.str.len()
 
+Splitting and Replacing Strings
+-------------------------------
+
+.. _text.split:
+
 Methods like ``split`` return a Series of lists:
 
 .. ipython:: python
 
    s2 = Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
    s2.str.split('_')
 
+Easy to expand this to return a DataFrame
+
+.. ipython:: python
+
+   s2.str.split('_').apply(Series)
+
 Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
 
 .. ipython:: python
 
    s2.str.split('_').str.get(1)
    s2.str.split('_').str[1]
 
-Methods like ``replace`` and ``findall`` take regular expressions, too:
+Methods like ``replace`` and ``findall`` take `regular expressions
+<https://docs.python.org/2/library/re.html>`__, too:
 
 .. ipython:: python
 
@@ -56,12 +65,49 @@ Methods like ``replace`` and ``findall`` take regular expressions, too:
    s3
    s3.str.replace('^.a|dog', 'XX-XX ', case=False)
 
+Some caution must be taken to keep regular expressions in mind! For example, the
+following code will cause trouble because of the regular expression meaning of
+`$`:
+
+.. ipython:: python
+
+   # Consider the following badly formatted financial data
+   dollars = Series(['12', '-$10', '$10,000'])
+
+   # This does what you'd naively expect:
+   dollars.str.replace('$', '')
+
+   # But this doesn't:
+   dollars.str.replace('-$', '-')
+
+   # We need to escape the special character (for >1 len patterns)
+   dollars.str.replace(r'-\$', '-')
+
+Indexing with ``.str``
+----------------------
+
+.. _text.indexing:
+
+You can use ``[]`` notation to directly index by position locations. If you index past the end
+of the string, the result will be a ``NaN``.
+
+
+.. ipython:: python
+
+   s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
+               'CABA', 'dog', 'cat'])
+
+   s.str[0]
+   s.str[1]
+
 Extracting Substrings
 ---------------------
 
-The method ``extract`` (introduced in version 0.13) accepts regular expressions
-with match groups. Extracting a regular expression with one group returns
-a Series of strings.
+.. _text.extract:
+
+The method ``extract`` (introduced in version 0.13) accepts `regular expressions
+<https://docs.python.org/2/library/re.html>`__ with match groups. Extracting a
+regular expression with one group returns a Series of strings.
 
 .. ipython:: python
 
@@ -136,46 +182,49 @@ Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
    s4 = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
    s4.str.contains('A', na=False)
 
-.. csv-table::
-    :header: "Method", "Description"
-    :widths: 20, 80
-
-    ``cat``,Concatenate strings
-    ``split``,Split strings on delimiter
-    ``get``,Index into each element (retrieve i-th element)
-    ``join``,Join strings in each element of the Series with passed separator
-    ``contains``,Return boolean array if each string contains pattern/regex
-    ``replace``,Replace occurrences of pattern/regex with some other string
-    ``repeat``,Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
-    ``pad``,"Add whitespace to left, right, or both sides of strings"
-    ``center``,Equivalent to ``pad(side='both')``
-    ``wrap``,Split long strings into lines with length less than a given width
-    ``slice``,Slice each string in the Series
-    ``slice_replace``,Replace slice in each string with passed value
-    ``count``,Count occurrences of pattern
-    ``startswith``,Equivalent to ``str.startswith(pat)`` for each element
-    ``endswith``,Equivalent to ``str.endswith(pat)`` for each element
-    ``findall``,Compute list of all occurrences of pattern/regex for each string
-    ``match``,"Call ``re.match`` on each element, returning matched groups as list"
-    ``extract``,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
-    ``len``,Compute string lengths
-    ``strip``,Equivalent to ``str.strip``
-    ``rstrip``,Equivalent to ``str.rstrip``
-    ``lstrip``,Equivalent to ``str.lstrip``
-    ``lower``,Equivalent to ``str.lower``
-    ``upper``,Equivalent to ``str.upper``
-
-
-Getting indicator variables from separated strings
---------------------------------------------------
+Creating Indicator Variables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 You can extract dummy variables from string columns.
 For example if they are separated by a ``'|'``:
 
 .. ipython:: python
 
-   s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
+   s = Series(['a', 'a|b', np.nan, 'a|c'])
    s.str.get_dummies(sep='|')
 
 See also :func:`~pandas.get_dummies`.
 
+Method Summary
+--------------
+
+.. _text.summary:
+
+.. csv-table::
+    :header: "Method", "Description"
+    :widths: 20, 80
+
+    :meth:`~core.strings.StringMethods.cat`,Concatenate strings
+    :meth:`~core.strings.StringMethods.split`,Split strings on delimiter
+    :meth:`~core.strings.StringMethods.get`,Index into each element (retrieve i-th element)
+    :meth:`~core.strings.StringMethods.join`,Join strings in each element of the Series with passed separator
+    :meth:`~core.strings.StringMethods.contains`,Return boolean array if each string contains pattern/regex
+    :meth:`~core.strings.StringMethods.replace`,Replace occurrences of pattern/regex with some other string
+    :meth:`~core.strings.StringMethods.repeat`,Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
+    :meth:`~core.strings.StringMethods.pad`,"Add whitespace to left, right, or both sides of strings"
+    :meth:`~core.strings.StringMethods.center`,Equivalent to ``pad(side='both')``
+    :meth:`~core.strings.StringMethods.wrap`,Split long strings into lines with length less than a given width
+    :meth:`~core.strings.StringMethods.slice`,Slice each string in the Series
+    :meth:`~core.strings.StringMethods.slice_replace`,Replace slice in each string with passed value
+    :meth:`~core.strings.StringMethods.count`,Count occurrences of pattern
+    :meth:`~core.strings.StringMethods.startswith`,Equivalent to ``str.startswith(pat)`` for each element
+    :meth:`~core.strings.StringMethods.endswith`,Equivalent to ``str.endswith(pat)`` for each element
+    :meth:`~core.strings.StringMethods.findall`,Compute list of all occurrences of pattern/regex for each string
+    :meth:`~core.strings.StringMethods.match`,"Call ``re.match`` on each element, returning matched groups as list"
+    :meth:`~core.strings.StringMethods.extract`,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
+    :meth:`~core.strings.StringMethods.len`,Compute string lengths
+    :meth:`~core.strings.StringMethods.strip`,Equivalent to ``str.strip``
+    :meth:`~core.strings.StringMethods.rstrip`,Equivalent to ``str.rstrip``
+    :meth:`~core.strings.StringMethods.lstrip`,Equivalent to ``str.lstrip``
+    :meth:`~core.strings.StringMethods.lower`,Equivalent to ``str.lower``
+    :meth:`~core.strings.StringMethods.upper`,Equivalent to ``str.upper``
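
The ``$`` pitfall documented in the ``text.rst`` changes follows from how regular expressions treat ``$`` as an end-of-string anchor. The sketch below reproduces that with the standard-library ``re`` module rather than pandas, so it shows the regex semantics only. It does not reproduce the special handling of single-character patterns that the diff's comment ("for >1 len patterns") implies for ``str.replace``. The variable names are chosen for this illustration.

```python
import re

# Same sample values as the `dollars` Series in the text.rst diff.
values = ['12', '-$10', '$10,000']

# Unescaped: '$' is an end-of-string anchor, so '-$' only matches a
# trailing '-', and none of these strings ends with '-'.
unescaped = [re.sub('-$', '-', v) for v in values]

# Escaped: r'\$' matches a literal dollar sign.
escaped = [re.sub(r'-\$', '-', v) for v in values]

# re.escape builds an escaped pattern from a literal string.
via_escape = [re.sub(re.escape('-$'), '-', v) for v in values]

print(unescaped)
print(escaped)
print(via_escape)
```

Using ``re.escape`` avoids escaping special characters by hand when the pattern is meant to be matched literally.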

doc/source/v0.15.0.txt

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ users upgrade to this version.
 - New scalar type ``Timedelta``, and a new index type ``TimedeltaIndex``, see :ref:`here <whatsnew_0150.timedeltaindex>`
 - New datetimelike properties accessor ``.dt`` for Series, see :ref:`Datetimelike Properties <whatsnew_0150.dt>`
 - Split indexing documentation into :ref:`Indexing and Selecting Data <indexing>` and :ref:`MultiIndex / Advanced Indexing <advanced>`
+- Split out string methods documentation into :ref:`Working with Text Data <text>`
 - ``read_csv`` will now by default ignore blank lines when parsing, see :ref:`here <whatsnew_0150.blanklines>`
 - API change in using Indexes in set operations, see :ref:`here <whatsnew_0150.index_set_ops>`
 - Internal refactoring of the ``Index`` class to no longer sub-class ``ndarray``, see :ref:`Internal Refactoring <whatsnew_0150.refactoring>`
