Commit fe924b4

DOC: create text.rst with string methods (GH8416)
1 parent 5cfc9cf commit fe924b4

6 files changed: +185 -170 lines changed

doc/source/10min.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -433,7 +433,7 @@ See more at :ref:`Histogramming and Discretization <basics.discretization>`
433433
String Methods
434434
~~~~~~~~~~~~~~
435435

436-
See more at :ref:`Vectorized String Methods <basics.string_methods>`
436+
See more at :ref:`Vectorized String Methods <text.string_methods>`
437437

438438
.. ipython:: python
439439

doc/source/basics.rst

Lines changed: 0 additions & 167 deletions
Original file line numberDiff line numberDiff line change
@@ -1159,173 +1159,6 @@ The ``.dt`` accessor works for period and timedelta dtypes.
11591159

11601160
``Series.dt`` will raise a ``TypeError`` if you access it with non-datetimelike values
11611161

1162-
.. _basics.string_methods:
1163-
1164-
Vectorized string methods
1165-
-------------------------
1166-
1167-
Series is equipped (as of pandas 0.8.1) with a set of string processing methods
1168-
that make it easy to operate on each element of the array. Perhaps most
1169-
importantly, these methods exclude missing/NA values automatically. These are
1170-
accessed via the Series's ``str`` attribute and generally have names matching
1171-
the equivalent (scalar) built-in string methods:
1172-
1173-
Splitting and Replacing Strings
1174-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1175-
1176-
.. ipython:: python
1177-
1178-
s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
1179-
s.str.lower()
1180-
s.str.upper()
1181-
s.str.len()
1182-
1183-
Methods like ``split`` return a Series of lists:
1184-
1185-
.. ipython:: python
1186-
1187-
s2 = Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
1188-
s2.str.split('_')
1189-
1190-
Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
1191-
1192-
.. ipython:: python
1193-
1194-
s2.str.split('_').str.get(1)
1195-
s2.str.split('_').str[1]
1196-
1197-
Methods like ``replace`` and ``findall`` take regular expressions, too:
1198-
1199-
.. ipython:: python
1200-
1201-
s3 = Series(['A', 'B', 'C', 'Aaba', 'Baca',
1202-
'', np.nan, 'CABA', 'dog', 'cat'])
1203-
s3
1204-
s3.str.replace('^.a|dog', 'XX-XX ', case=False)
1205-
1206-
Extracting Substrings
1207-
~~~~~~~~~~~~~~~~~~~~~
1208-
1209-
The method ``extract`` (introduced in version 0.13) accepts regular expressions
1210-
with match groups. Extracting a regular expression with one group returns
1211-
a Series of strings.
1212-
1213-
.. ipython:: python
1214-
1215-
Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
1216-
1217-
Elements that do not match return ``NaN``. Extracting a regular expression
1218-
with more than one group returns a DataFrame with one column per group.
1219-
1220-
.. ipython:: python
1221-
1222-
Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
1223-
1224-
Elements that do not match return a row filled with ``NaN``.
1225-
Thus, a Series of messy strings can be "converted" into a
1226-
like-indexed Series or DataFrame of cleaned-up or more useful strings,
1227-
without necessitating ``get()`` to access tuples or ``re.match`` objects.
1228-
1229-
The results dtype always is object, even if no match is found and the result
1230-
only contains ``NaN``.
1231-
1232-
Named groups like
1233-
1234-
.. ipython:: python
1235-
1236-
Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)')
1237-
1238-
and optional groups like
1239-
1240-
.. ipython:: python
1241-
1242-
Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
1243-
1244-
can also be used.
1245-
1246-
Testing for Strings that Match or Contain a Pattern
1247-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1248-
1249-
You can check whether elements contain a pattern:
1250-
1251-
.. ipython:: python
1252-
1253-
pattern = r'[a-z][0-9]'
1254-
Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
1255-
1256-
or match a pattern:
1257-
1258-
1259-
.. ipython:: python
1260-
1261-
Series(['1', '2', '3a', '3b', '03c']).str.match(pattern, as_indexer=True)
1262-
1263-
The distinction between ``match`` and ``contains`` is strictness: ``match``
1264-
relies on strict ``re.match``, while ``contains`` relies on ``re.search``.
1265-
1266-
.. warning::
1267-
1268-
In previous versions, ``match`` was for *extracting* groups,
1269-
returning a not-so-convenient Series of tuples. The new method ``extract``
1270-
(described in the previous section) is now preferred.
1271-
1272-
This old, deprecated behavior of ``match`` is still the default. As
1273-
demonstrated above, use the new behavior by setting ``as_indexer=True``.
1274-
In this mode, ``match`` is analogous to ``contains``, returning a boolean
1275-
Series. The new behavior will become the default behavior in a future
1276-
release.
1277-
1278-
Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
1279-
an extra ``na`` argument so missing values can be considered True or False:
1280-
1281-
.. ipython:: python
1282-
1283-
s4 = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
1284-
s4.str.contains('A', na=False)
1285-
1286-
.. csv-table::
1287-
:header: "Method", "Description"
1288-
:widths: 20, 80
1289-
1290-
``cat``,Concatenate strings
1291-
``split``,Split strings on delimiter
1292-
``get``,Index into each element (retrieve i-th element)
1293-
``join``,Join strings in each element of the Series with passed separator
1294-
``contains``,Return boolean array if each string contains pattern/regex
1295-
``replace``,Replace occurrences of pattern/regex with some other string
1296-
``repeat``,Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
1297-
``pad``,"Add whitespace to left, right, or both sides of strings"
1298-
``center``,Equivalent to ``pad(side='both')``
1299-
``wrap``,Split long strings into lines with length less than a given width
1300-
``slice``,Slice each string in the Series
1301-
``slice_replace``,Replace slice in each string with passed value
1302-
``count``,Count occurrences of pattern
1303-
``startswith``,Equivalent to ``str.startswith(pat)`` for each element
1304-
``endswith``,Equivalent to ``str.endswith(pat)`` for each element
1305-
``findall``,Compute list of all occurrences of pattern/regex for each string
1306-
``match``,"Call ``re.match`` on each element, returning matched groups as list"
1307-
``extract``,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
1308-
``len``,Compute string lengths
1309-
``strip``,Equivalent to ``str.strip``
1310-
``rstrip``,Equivalent to ``str.rstrip``
1311-
``lstrip``,Equivalent to ``str.lstrip``
1312-
``lower``,Equivalent to ``str.lower``
1313-
``upper``,Equivalent to ``str.upper``
1314-
1315-
1316-
Getting indicator variables from separated strings
1317-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1318-
1319-
You can extract dummy variables from string columns.
1320-
For example if they are separated by a ``'|'``:
1321-
1322-
.. ipython:: python
1323-
1324-
s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
1325-
s.str.get_dummies(sep='|')
1326-
1327-
See also :func:`~pandas.get_dummies`.
1328-
13291162
.. _basics.sorting:
13301163

13311164
Sorting by index and value

doc/source/index.rst.template

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -122,6 +122,7 @@ See the package overview for more detail about what's in the library.
122122
cookbook
123123
dsintro
124124
basics
125+
text
125126
options
126127
indexing
127128
advanced

doc/source/text.rst

Lines changed: 181 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,181 @@
1+
.. currentmodule:: pandas
2+
.. _text:
3+
4+
.. ipython:: python
5+
:suppress:
6+
7+
import numpy as np
8+
from pandas import *
9+
randn = np.random.randn
10+
np.set_printoptions(precision=4, suppress=True)
11+
from pandas.compat import lrange
12+
options.display.max_rows=15
13+
14+
======================
15+
Working with Text Data
16+
======================
17+
18+
.. _text.string_methods:
19+
20+
Series is equipped with a set of string processing methods
21+
that make it easy to operate on each element of the array. Perhaps most
22+
importantly, these methods exclude missing/NA values automatically. These are
23+
accessed via the Series's ``str`` attribute and generally have names matching
24+
the equivalent (scalar) built-in string methods:
25+
26+
Splitting and Replacing Strings
27+
-------------------------------
28+
29+
.. ipython:: python
30+
31+
s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
32+
s.str.lower()
33+
s.str.upper()
34+
s.str.len()
35+
36+
Methods like ``split`` return a Series of lists:
37+
38+
.. ipython:: python
39+
40+
s2 = Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
41+
s2.str.split('_')
42+
43+
Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
44+
45+
.. ipython:: python
46+
47+
s2.str.split('_').str.get(1)
48+
s2.str.split('_').str[1]
49+
50+
Methods like ``replace`` and ``findall`` take regular expressions, too:
51+
52+
.. ipython:: python
53+
54+
s3 = Series(['A', 'B', 'C', 'Aaba', 'Baca',
55+
'', np.nan, 'CABA', 'dog', 'cat'])
56+
s3
57+
s3.str.replace('^.a|dog', 'XX-XX ', case=False)
58+
59+
Extracting Substrings
60+
---------------------
61+
62+
The method ``extract`` (introduced in version 0.13) accepts regular expressions
63+
with match groups. Extracting a regular expression with one group returns
64+
a Series of strings.
65+
66+
.. ipython:: python
67+
68+
Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)')
69+
70+
Elements that do not match return ``NaN``. Extracting a regular expression
71+
with more than one group returns a DataFrame with one column per group.
72+
73+
.. ipython:: python
74+
75+
Series(['a1', 'b2', 'c3']).str.extract(r'([ab])(\d)')
76+
77+
Elements that do not match return a row filled with ``NaN``.
78+
Thus, a Series of messy strings can be "converted" into a
79+
like-indexed Series or DataFrame of cleaned-up or more useful strings,
80+
without necessitating ``get()`` to access tuples or ``re.match`` objects.
81+
82+
The result's dtype is always object, even if no match is found and the result
83+
contains only ``NaN``.
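
The per-element behavior described above can be approximated in plain Python. This is only a rough sketch, not pandas' implementation: the helper name ``extract_groups`` is hypothetical, ``None`` stands in for ``NaN``, and it follows the method table in this document in assuming each element is tested with ``re.match``.

```python
import re


def extract_groups(values, pattern):
    """Return one tuple of captured groups per element.

    Hypothetical helper: elements that do not match (or are not strings)
    yield a tuple of None, mirroring the "row filled with NaN" behavior.
    """
    regex = re.compile(pattern)
    n_groups = regex.groups  # number of capture groups in the pattern
    rows = []
    for value in values:
        m = regex.match(value) if isinstance(value, str) else None
        rows.append(m.groups() if m else (None,) * n_groups)
    return rows


rows = extract_groups(['a1', 'b2', 'c3'], r'([ab])(\d)')
print(rows)  # [('a', '1'), ('b', '2'), (None, None)]
```

Each tuple corresponds to one row of the result and each position to one column, which is why a pattern with several groups maps naturally onto a DataFrame.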
84+
85+
Named groups like
86+
87+
.. ipython:: python
88+
89+
Series(['a1', 'b2', 'c3']).str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
90+
91+
and optional groups like
92+
93+
.. ipython:: python
94+
95+
Series(['a1', 'b2', '3']).str.extract(r'(?P<letter>[ab])?(?P<digit>\d)')
96+
97+
can also be used.
98+
99+
Testing for Strings that Match or Contain a Pattern
100+
---------------------------------------------------
101+
102+
You can check whether elements contain a pattern:
103+
104+
.. ipython:: python
105+
106+
pattern = r'[a-z][0-9]'
107+
Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
108+
109+
or match a pattern:
110+
111+
112+
.. ipython:: python
113+
114+
Series(['1', '2', '3a', '3b', '03c']).str.match(pattern, as_indexer=True)
115+
116+
The distinction between ``match`` and ``contains`` is strictness: ``match``
117+
relies on strict ``re.match``, while ``contains`` relies on ``re.search``.
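
The difference between the two underlying functions can be seen with the standard ``re`` module alone. The sample strings below are illustrative assumptions, chosen so that the two functions disagree on one of them:

```python
import re

pattern = r'[a-z][0-9]'
values = ['a1', '3a1', '03c']

# re.match only succeeds if the pattern matches at the start of the string.
matched = [re.match(pattern, v) is not None for v in values]

# re.search succeeds if the pattern matches anywhere in the string.
contained = [re.search(pattern, v) is not None for v in values]

print(matched)    # [True, False, False]
print(contained)  # [True, True, False]
```

``'3a1'`` contains ``a1`` but does not start with it, so only ``re.search`` reports a hit.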
118+
119+
.. warning::
120+
121+
In previous versions, ``match`` was for *extracting* groups,
122+
returning a not-so-convenient Series of tuples. The new method ``extract``
123+
(described in the previous section) is now preferred.
124+
125+
This old, deprecated behavior of ``match`` is still the default. As
126+
demonstrated above, use the new behavior by setting ``as_indexer=True``.
127+
In this mode, ``match`` is analogous to ``contains``, returning a boolean
128+
Series. The new behavior will become the default behavior in a future
129+
release.
130+
131+
Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
132+
an extra ``na`` argument so missing values can be considered True or False:
133+
134+
.. ipython:: python
135+
136+
s4 = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
137+
s4.str.contains('A', na=False)
138+
139+
.. csv-table::
140+
:header: "Method", "Description"
141+
:widths: 20, 80
142+
143+
``cat``,Concatenate strings
144+
``split``,Split strings on delimiter
145+
``get``,Index into each element (retrieve i-th element)
146+
``join``,Join strings in each element of the Series with passed separator
147+
``contains``,Return boolean array if each string contains pattern/regex
148+
``replace``,Replace occurrences of pattern/regex with some other string
149+
``repeat``,Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
150+
``pad``,"Add whitespace to left, right, or both sides of strings"
151+
``center``,Equivalent to ``pad(side='both')``
152+
``wrap``,Split long strings into lines with length less than a given width
153+
``slice``,Slice each string in the Series
154+
``slice_replace``,Replace slice in each string with passed value
155+
``count``,Count occurrences of pattern
156+
``startswith``,Equivalent to ``str.startswith(pat)`` for each element
157+
``endswith``,Equivalent to ``str.endswith(pat)`` for each element
158+
``findall``,Compute list of all occurrences of pattern/regex for each string
159+
``match``,"Call ``re.match`` on each element, returning matched groups as list"
160+
``extract``,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
161+
``len``,Compute string lengths
162+
``strip``,Equivalent to ``str.strip``
163+
``rstrip``,Equivalent to ``str.rstrip``
164+
``lstrip``,Equivalent to ``str.lstrip``
165+
``lower``,Equivalent to ``str.lower``
166+
``upper``,Equivalent to ``str.upper``
167+
168+
169+
Getting indicator variables from separated strings
170+
--------------------------------------------------
171+
172+
You can extract dummy variables from string columns.
173+
For example, if the values are separated by a ``'|'``:
174+
175+
.. ipython:: python
176+
177+
s = Series(['a', 'a|b', np.nan, 'a|c'])
178+
s.str.get_dummies(sep='|')
179+
180+
See also :func:`~pandas.get_dummies`.
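
To make the idea of indicator variables concrete, here is a minimal plain-Python sketch of the same transformation. It is not how pandas implements ``get_dummies``: the function name ``dummies`` is hypothetical, and ``None`` is used in place of ``np.nan`` for the missing value.

```python
def dummies(values, sep='|'):
    """Return (columns, rows) of 0/1 indicators for separated strings.

    Hypothetical helper: a missing value (None) produces a row of zeros.
    """
    # Column labels are the sorted distinct tokens across all elements.
    columns = sorted({tok for v in values if v is not None
                      for tok in v.split(sep)})
    rows = [[int(v is not None and tok in v.split(sep)) for tok in columns]
            for v in values]
    return columns, rows


columns, rows = dummies(['a', 'a|b', None, 'a|c'])
print(columns)  # ['a', 'b', 'c']
print(rows)     # [[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 1]]
```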
181+

doc/source/v0.8.1.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ processing functionality and a series of new plot types and options.
1010
New features
1111
~~~~~~~~~~~~
1212

13-
- Add :ref:`vectorized string processing methods <basics.string_methods>`
13+
- Add :ref:`vectorized string processing methods <text.string_methods>`
1414
accessible via Series.str (:issue:`620`)
1515
- Add option to disable adjustment in EWMA (:issue:`1584`)
1616
- :ref:`Radviz plot <visualization.radviz>` (:issue:`1566`)

doc/source/v0.9.0.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ New features
1818
~~~~~~~~~~~~
1919

2020
- Add ``encode`` and ``decode`` for unicode handling to :ref:`vectorized
21-
string processing methods <basics.string_methods>` in Series.str (:issue:`1706`)
21+
string processing methods <text.string_methods>` in Series.str (:issue:`1706`)
2222
- Add ``DataFrame.to_latex`` method (:issue:`1735`)
2323
- Add convenient expanding window equivalents of all rolling_* ops (:issue:`1785`)
2424
- Add Options class to pandas.io.data for fetching options data from Yahoo!
