Some more clarification, particularly regexps

davclark · jreback · commit a95d84a79875 · 2014-09-30T10:10:16.000-04:00
diff --git a/doc/source/10min.rst b/doc/source/10min.rst
@@ -433,7 +433,12 @@ See more at :ref:`Histogramming and Discretization <basics.discretization>`
 String Methods
 ~~~~~~~~~~~~~~
 
-See more at :ref:`Vectorized String Methods <text.string_methods>`
+Series is equipped with a set of string processing methods in the ``str``
+attribute that make it easy to operate on each element of the array, as in the
+code snippet below. Note that pattern-matching in ``str`` generally uses `regular
+expressions <https://docs.python.org/2/library/re.html>`__ by default (and in
+some cases always uses them). See more at :ref:`Vectorized String Methods
+<text.string_methods>`.
 
 .. ipython:: python
 
diff --git a/doc/source/api.rst b/doc/source/api.rst
@@ -1410,7 +1410,7 @@ Computations / Descriptive Stats
    GroupBy.mean
    GroupBy.median
    GroupBy.min
-   GroupBy.nth 
+   GroupBy.nth
    GroupBy.ohlc
    GroupBy.prod
    GroupBy.size
diff --git a/doc/source/basics.rst b/doc/source/basics.rst
@@ -1159,6 +1159,28 @@ The ``.dt`` accessor works for period and timedelta dtypes.
 
    ``Series.dt`` will raise a ``TypeError`` if you access with a non-datetimelike values
 
+Vectorized string methods
+-------------------------
+
+Series is equipped with a set of string processing methods that make it easy to
+operate on each element of the array. Perhaps most importantly, these methods
+exclude missing/NA values automatically. These are accessed via the Series's
+``str`` attribute and generally have names matching the equivalent (scalar)
+built-in string methods. For example:
+
+ .. ipython:: python
+
+  s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
+  s.str.lower()
+
+Powerful pattern-matching methods are provided as well, but note that
+pattern-matching generally uses `regular expressions
+<https://docs.python.org/2/library/re.html>`__ by default (and in some cases
+always uses them).
+
+Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
+description.
+
 .. _basics.sorting:
 
 Sorting by index and value
diff --git a/doc/source/text.rst b/doc/source/text.rst
@@ -21,10 +21,7 @@ Series is equipped with a set of string processing methods
 that make it easy to operate on each element of the array. Perhaps most
 importantly, these methods exclude missing/NA values automatically. These are
 accessed via the Series's ``str`` attribute and generally have names matching
-the equivalent (scalar) build-in string methods:
-
-Splitting and Replacing Strings
--------------------------------
+the equivalent (scalar) built-in string methods:
 
 .. ipython:: python
 
@@ -33,21 +30,33 @@ Splitting and Replacing Strings
    s.str.upper()
    s.str.len()
 
+Splitting and Replacing Strings
+-------------------------------
+
+.. _text.split:
+
 Methods like ``split`` return a Series of lists:
 
 .. ipython:: python
 
    s2 = Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
    s2.str.split('_')
 
+It is easy to expand this to return a DataFrame:
+
+.. ipython:: python
+
+   s2.str.split('_').apply(Series)
+
 Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
 
 .. ipython:: python
 
    s2.str.split('_').str.get(1)
    s2.str.split('_').str[1]
 
-Methods like ``replace`` and ``findall`` take regular expressions, too:
+Methods like ``replace`` and ``findall`` take `regular expressions
+<https://docs.python.org/2/library/re.html>`__, too:
 
 .. ipython:: python
 
@@ -56,12 +65,49 @@ Methods like ``replace`` and ``findall`` take regular expressions, too:
    s3
    s3.str.replace('^.a|dog', 'XX-XX ', case=False)
 
+Take care to remember that patterns are generally treated as regular
+expressions. For example, the following code will cause trouble because ``$``
+has a special meaning in a regular expression:
+
+.. ipython:: python
+
+   # Consider the following badly formatted financial data
+   dollars = Series(['12', '-$10', '$10,000'])
+
+   # This does what you'd naively expect:
+   dollars.str.replace('$', '')
+
+   # But this doesn't:
+   dollars.str.replace('-$', '-')
+
+   # For patterns longer than one character, escape the special character
+   dollars.str.replace(r'-\$', '-')
+
+Indexing with ``.str``
+----------------------
+
+.. _text.indexing:
+
+You can use ``[]`` notation to index directly by position. If you index past the end
+of the string, the result will be ``NaN``.
+
+
+.. ipython:: python
+
+   s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
+               'CABA', 'dog', 'cat'])
+
+   s.str[0]
+   s.str[1]
+
 Extracting Substrings
 ---------------------
 
-The method ``extract`` (introduced in version 0.13) accepts regular expressions
-with match groups. Extracting a regular expression with one group returns
-a Series of strings.
+.. _text.extract:
+
+The method ``extract`` (introduced in version 0.13) accepts `regular expressions
+<https://docs.python.org/2/library/re.html>`__ with match groups. Extracting a
+regular expression with one group returns a Series of strings.
 
 .. ipython:: python
 
@@ -136,46 +182,49 @@ Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
    s4 = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
    s4.str.contains('A', na=False)
 
-.. csv-table::
-    :header: "Method", "Description"
-    :widths: 20, 80
-
-    ``cat``,Concatenate strings
-    ``split``,Split strings on delimiter
-    ``get``,Index into each element (retrieve i-th element)
-    ``join``,Join strings in each element of the Series with passed separator
-    ``contains``,Return boolean array if each string contains pattern/regex
-    ``replace``,Replace occurrences of pattern/regex with some other string
-    ``repeat``,Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
-    ``pad``,"Add whitespace to left, right, or both sides of strings"
-    ``center``,Equivalent to ``pad(side='both')``
-    ``wrap``,Split long strings into lines with length less than a given width
-    ``slice``,Slice each string in the Series
-    ``slice_replace``,Replace slice in each string with passed value
-    ``count``,Count occurrences of pattern
-    ``startswith``,Equivalent to ``str.startswith(pat)`` for each element
-    ``endswith``,Equivalent to ``str.endswith(pat)`` for each element
-    ``findall``,Compute list of all occurrences of pattern/regex for each string
-    ``match``,"Call ``re.match`` on each element, returning matched groups as list"
-    ``extract``,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
-    ``len``,Compute string lengths
-    ``strip``,Equivalent to ``str.strip``
-    ``rstrip``,Equivalent to ``str.rstrip``
-    ``lstrip``,Equivalent to ``str.lstrip``
-    ``lower``,Equivalent to ``str.lower``
-    ``upper``,Equivalent to ``str.upper``
-
-
-Getting indicator variables from separated strings
---------------------------------------------------
+Creating Indicator Variables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 You can extract dummy variables from string columns.
 For example if they are separated by a ``'|'``:
 
   .. ipython:: python
 
-      s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
+      s = Series(['a', 'a|b', np.nan, 'a|c'])
       s.str.get_dummies(sep='|')
 
 See also :func:`~pandas.get_dummies`.
 
+Method Summary
+--------------
+
+.. _text.summary:
+
+.. csv-table::
+    :header: "Method", "Description"
+    :widths: 20, 80
+
+    :meth:`~core.strings.StringMethods.cat`,Concatenate strings
+    :meth:`~core.strings.StringMethods.split`,Split strings on delimiter
+    :meth:`~core.strings.StringMethods.get`,Index into each element (retrieve i-th element)
+    :meth:`~core.strings.StringMethods.join`,Join strings in each element of the Series with passed separator
+    :meth:`~core.strings.StringMethods.contains`,Return boolean array if each string contains pattern/regex
+    :meth:`~core.strings.StringMethods.replace`,Replace occurrences of pattern/regex with some other string
+    :meth:`~core.strings.StringMethods.repeat`,Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
+    :meth:`~core.strings.StringMethods.pad`,"Add whitespace to left, right, or both sides of strings"
+    :meth:`~core.strings.StringMethods.center`,Equivalent to ``pad(side='both')``
+    :meth:`~core.strings.StringMethods.wrap`,Split long strings into lines with length less than a given width
+    :meth:`~core.strings.StringMethods.slice`,Slice each string in the Series
+    :meth:`~core.strings.StringMethods.slice_replace`,Replace slice in each string with passed value
+    :meth:`~core.strings.StringMethods.count`,Count occurrences of pattern
+    :meth:`~core.strings.StringMethods.startswith`,Equivalent to ``str.startswith(pat)`` for each element
+    :meth:`~core.strings.StringMethods.endswith`,Equivalent to ``str.endswith(pat)`` for each element
+    :meth:`~core.strings.StringMethods.findall`,Compute list of all occurrences of pattern/regex for each string
+    :meth:`~core.strings.StringMethods.match`,"Call ``re.match`` on each element, returning matched groups as list"
+    :meth:`~core.strings.StringMethods.extract`,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
+    :meth:`~core.strings.StringMethods.len`,Compute string lengths
+    :meth:`~core.strings.StringMethods.strip`,Equivalent to ``str.strip``
+    :meth:`~core.strings.StringMethods.rstrip`,Equivalent to ``str.rstrip``
+    :meth:`~core.strings.StringMethods.lstrip`,Equivalent to ``str.lstrip``
+    :meth:`~core.strings.StringMethods.lower`,Equivalent to ``str.lower``
+    :meth:`~core.strings.StringMethods.upper`,Equivalent to ``str.upper``
diff --git a/doc/source/v0.15.0.txt b/doc/source/v0.15.0.txt
@@ -19,6 +19,7 @@ users upgrade to this version.
   - New scalar type ``Timedelta``, and a new index type ``TimedeltaIndex``, see :ref:`here <whatsnew_0150.timedeltaindex>`
   - New datetimelike properties accessor ``.dt`` for Series, see :ref:`Datetimelike Properties <whatsnew_0150.dt>`
   - Split indexing documentation into :ref:`Indexing and Selecting Data <indexing>` and :ref:`MultiIndex / Advanced Indexing <advanced>`
+  - Split out string methods documentation into :ref:`Working with Text Data <text>`
   - ``read_csv`` will now by default ignore blank lines when parsing, see :ref:`here <whatsnew_0150.blanklines>`
   - API change in using Indexes in set operations, see :ref:`here <whatsnew_0150.index_set_ops>`
   - Internal refactoring of the ``Index`` class to no longer sub-class ``ndarray``, see :ref:`Internal Refactoring <whatsnew_0150.refactoring>`
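
Reviewer note: the ``$`` caveat documented in the text.rst hunk can be reproduced without pandas, using only the standard ``re`` module. The following is a minimal sketch under that assumption; the list ``dollars`` mirrors the example data in the diff, while the names ``unescaped``, ``escaped`` and ``auto_escaped`` are illustrative only and do not appear in the commit. It shows plain regular-expression behaviour, not the pandas implementation.

```python
import re

# Same badly formatted values as in the documentation example
dollars = ['12', '-$10', '$10,000']

# Unescaped: '$' anchors the end of the string, so '-$' only matches
# a '-' that is the final character. None of the values end in '-'.
unescaped = [re.sub('-$', '-', value) for value in dollars]

# Escaped: r'-\$' matches the literal two-character sequence '-$'.
escaped = [re.sub(r'-\$', '-', value) for value in dollars]

# re.escape builds the escaped pattern for an arbitrary literal string.
auto_escaped = [re.sub(re.escape('-$'), '-', value) for value in dollars]

print(unescaped)
print(escaped)
print(auto_escaped)
```

Using ``re.escape`` may be preferable when the literal text to replace comes from user input, since it avoids escaping each special character by hand.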