Commit 756d4db

Minor tweaks to the partition docs page
1 parent 709a866 commit 756d4db

14 files changed, with 235 additions and 222 deletions

advertools/partition.py

Lines changed: 97 additions & 54 deletions
@@ -1,94 +1,130 @@
 """
-Introduction to partitioning text
-===================================
+Text partitioning with Python
+=============================
+
+The ``partition`` function in ``advertools`` provides a powerful way to partition a string
+based on a regular expression. Unlike typical string splitting methods that only return
+the text *between* delimiters, ``partition`` includes the delimiters themselves in the
+result list. This is particularly useful for tasks where the delimiters are as important
+as the content they separate.
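To make the contrast concrete, here is a minimal sketch that uses only Python's standard ``re`` module and does not call ``advertools`` at all. Plain splitting discards the delimiters, while wrapping the pattern in a capturing group keeps them, which is the kind of result the paragraph above describes for ``partition``.

```python
import re

text = "abc123def456ghi"

# Plain splitting returns only the text between the delimiters.
between_only = re.split(r"\d+", text)

# A capturing group makes re.split keep the matched delimiters as well.
with_delimiters = re.split(r"(\d+)", text)

print(between_only)     # ['abc', 'def', 'ghi']
print(with_delimiters)  # ['abc', '123', 'def', '456', 'ghi']
```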
+
+
+
+What is partitioning?
+---------------------
+
+It is the process of converting a string of characters into a list, while preserving all
+characters in the input string.
+In other words, you should be able to do a "round trip" from string to partitioned
+string, and back to the original string.
+
+The ``partition`` function does this, except that it strips whitespace, so the round trip
+is nearly, but not exactly, lossless.
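The round-trip idea can be sketched with the standard library alone. This illustrates the concept and is not the code inside ``partition``; the sample strings are made up for the example.

```python
import re

text = ",a,,b,"

# re.split with a capturing group keeps every character, so joining the
# pieces reproduces the input exactly (a lossless round trip).
pieces = re.split(r"(,)", text)
assert "".join(pieces) == text

# Dropping empty pieces still round-trips, because empty strings add nothing.
non_empty = [piece for piece in pieces if piece]
assert "".join(non_empty) == text

# Stripping whitespace is what makes a round trip lossy.
spaced = "a , b"
stripped = [piece.strip() for piece in re.split(r"(,)", spaced)]
print("".join(stripped))  # a,b
```

The last step shows why a function that strips whitespace can only promise an approximate round trip: the spaces around the comma cannot be recovered from the list.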
+
+
+Partitioning using a regular expression
+---------------------------------------
+
+An important feature of this function is that it enables you to partition using a regex
+pattern, and not just a fixed sequence of characters. You can partition a markdown
+string into headings and regular text for example, and use only "#", "##", and "###" for
+the partitioning.
+
+It also provides a ``flags`` parameter, in case you want to utilize Python's various options
+like ``re.IGNORECASE``, ``re.DOTALL``, or ``re.MULTILINE``.
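As a small illustration of why flags matter for a heading pattern, the sketch below uses the standard ``re`` module directly. The sample text and the pattern are assumptions chosen for the example; the same flag values are what you would pass through the ``flags`` parameter.

```python
import re

markdown = "# Title\nIntro text.\n## Section\nMore text."
pattern = r"^#+ .*$"

# Without re.MULTILINE, ^ and $ anchor only at the start and end of the
# whole string, so no single line can satisfy the pattern here.
without_flag = re.findall(pattern, markdown)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
with_flag = re.findall(pattern, markdown, flags=re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['# Title', '## Section']
```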
+
 
-The ``partition`` function in ``advertools`` provides a powerful way to split a string
-based on a regular expression. Unlike typical string splitting methods that only return the text *between* delimiters, ``partition`` includes the delimiters themselves in the result list. This is particularly useful for tasks where the delimiters are as important as the content they separate.
 
 Core Functionality
 ------------------
 
-The function takes a ``text`` string, a ``regex`` pattern, and optional ``flags`` from the ``re`` module. It returns a list of strings, alternating between the substrings and the matches.
+The function takes a ``text`` string, a ``regex`` pattern, and optional ``flags`` from
+the ``re`` module. It returns a list of strings, alternating between the substrings and
+the matches.
 
 **Key Features:**
 
 * **Includes Delimiters:** The matched delimiters are part of the output list.
 * **Regex Powered:** Leverages the full power of regular expressions for defining separators.
-* **Handles Edge Cases:** Correctly processes matches at the beginning or end of the string, and consecutive matches, by including empty strings to represent zero-length parts.
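The behaviour described here can be approximated in a few lines. The function below, ``partition_sketch``, is a hypothetical stand-in written for this illustration and is not the ``advertools`` implementation. It assumes, based on the updated examples in this commit, that pieces are stripped of whitespace and that empty pieces are dropped.

```python
import re


def partition_sketch(text, regex, flags=0):
    """Hypothetical approximation of the documented behaviour of partition."""
    if text == "":
        return [""]
    parts = []
    last_end = 0
    for match in re.finditer(regex, text, flags=flags):
        # Text between the previous match and this one.
        before = text[last_end:match.start()].strip()
        if before:
            parts.append(before)
        # The delimiter itself is kept in the result.
        delimiter = match.group(0).strip()
        if delimiter:
            parts.append(delimiter)
        last_end = match.end()
    # Whatever follows the final match.
    tail = text[last_end:].strip()
    if tail:
        parts.append(tail)
    return parts


print(partition_sketch("abc123def456ghi", r"\d+"))
print(partition_sketch(",a,,b,", r","))
print(partition_sketch("TestData", r"t", flags=re.IGNORECASE))
```

Under these assumptions the three calls print the same lists as the updated doctests in this commit. A consequence of dropping empty pieces is that the result does not strictly alternate between text and matches when delimiters are adjacent.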
+
 
 Examples
 --------
 
 Let's explore some practical examples:
 
-**1. Basic Splitting by Numbers:**
+**1. Basic splitting by numbers:**
 
 .. code-block:: python
 
-    import advertools as adv
+    >>> import advertools as adv
 
-    text = "abc123def456ghi"
-    regex = r"\\d+"
-    result = adv.partition(text, regex)
-    print(result)
-    # Output: ['abc', '123', 'def', '456', 'ghi']
+    >>> text = "abc123def456ghi"
+    >>> regex = r"\\d+"
+    >>> adv.partition(text, regex)
+    ['abc', '123', 'def', '456', 'ghi']
 
-**2. No Match Found:**
+**2. No match found:**
 
 If the regex pattern doesn't find any matches, the original string is returned as a single-element list.
 
 .. code-block:: python
 
-    import advertools as adv
+    >>> import advertools as adv
 
-    text = "test"
-    regex = r"X"
-    result = adv.partition(text, regex)
-    print(result)
-    # Output: ['test']
+    >>> text = "test"
+    >>> regex = r"X"
+    >>> adv.partition(text, regex)
+    ['test']
 
-**3. Handling Consecutive Delimiters and Edge Matches:**
+**3. Handling consecutive delimiters and edge matches:**
 
 This example shows how ``partition`` handles cases where delimiters are at the start/end or appear consecutively.
 
 .. code-block:: python
 
-    import advertools as adv
+    >>> import advertools as adv
 
-    text = ",a,,b,"
-    regex = r","
-    result = adv.partition(text, regex)
-    print(result)
-    # Output: ['', ',', 'a', ',', '', ',', 'b', ',', '']
+    >>> text = ",a,,b,"
+    >>> regex = r","
+    >>> adv.partition(text, regex)
+    [',', 'a', ',', ',', 'b', ',']
 
-**4. Case-Insensitive Partitioning:**
+**4. Case-insensitive partitioning:**
 
 You can use regex flags, like ``re.IGNORECASE``, for more flexible matching.
 
 .. code-block:: python
 
-    import advertools as adv
-    import re
+    >>> import advertools as adv
+    >>> import re
 
-    text = "TestData"
-    regex = r"t"
-    result = adv.partition(text, regex, flags=re.IGNORECASE)
-    print(result)
-    # Output: ['', 'T', 'es', 't', 'Data']
+    >>> text = "TestData"
+    >>> regex = r"t"
+    >>> adv.partition(text, regex, flags=re.IGNORECASE)
+    ['T', 'es', 't', 'Da', 't', 'a']
 
-Connecting to Other Use Cases
+Connecting to other use cases
 -----------------------------
 
-While ``partition`` is a general-purpose string manipulation tool, its ability to retain delimiters makes it valuable in various contexts. For instance, if you were working with a function that processes Markdown documents (let's imagine a hypothetical ``generate_markdown_chunks`` function), ``partition`` could be used to split a Markdown document by specific structural elements (e.g., headings, code blocks, lists).
+While ``partition`` is a general-purpose string manipulation tool, its ability to retain
+delimiters makes it valuable in various contexts. For instance, if you were working with
+a function that processes Markdown documents (using the ``adv.crawlytics.generate_markdown``
+function),
+``partition`` could be used to split a Markdown document by specific structural elements
+(e.g., headings, code blocks, lists).
+
 
-Imagine you want to break down a Markdown document into chunks based on heading levels (e.g., ``## ``, ``### ``). The ``partition`` function could be used to identify these headings and the content between them.
+Imagine you want to break down a Markdown document into chunks based on heading levels
+(e.g., ``##``, ``###``). The ``partition`` function could be used to identify these
+headings and the content between them.
 
 .. code-block:: python
 
-    import advertools as adv
-    import re
+    >>> import advertools as adv
+    >>> import re
 
-    markdown_text = '''
+    >>> markdown_text = '''
     # Document Title
 
     Some introductory text.
@@ -105,24 +141,31 @@
 
     Content for section 2.
     '''
-    # Regex to match markdown headings (##, ###, etc.)
-    # Matches lines starting with one or more '#' followed by a space
-    heading_regex = r"^#+\\s"
 
+    >>> heading_regex = r"^#+ .*?$"
 
     # Partition the markdown text by headings
     # Note: This is a simplified example. A robust markdown parser would be more complex.
-    chunks = adv.partition(markdown_text, heading_regex, flags=re.MULTILINE)
+    >>> chunks = adv.partition(markdown_text, heading_regex, flags=re.MULTILINE)
 
     # The 'chunks' list would contain alternating text blocks and the matched headings,
     # allowing further processing of each part of the document.
-    for i, chunk in enumerate(chunks):
-        if re.match(heading_regex, chunk):
-            print(f"Heading: {chunk.strip()}")
-        else:
-            print(f"Content Block {i // 2 + 1}:\\n{chunk.strip()}\\n")
-
-This demonstrates how ``partition`` can be a foundational tool for more complex text processing tasks, such as breaking down structured documents into manageable pieces.
+    >>> print(*chunks, sep="\\n----\\n")
+    # Document Title
+    ----
+    Some introductory text.
+    ----
+    ## Section 1
+    ----
+    Content for section 1.
+    ----
+    ### Subsection 1.1
+    ----
+    Details for subsection 1.1.
+    ----
+    ## Section 2
+    ----
+    Content for section 2.
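One possible next step is to pair each heading with the content that follows it. The sketch below starts from the chunk list exactly as printed in the output above, so it runs without ``advertools``. The ``sections`` structure and its keys are assumptions made for this illustration.

```python
import re

# The chunks as printed in the example output above.
chunks = [
    "# Document Title",
    "Some introductory text.",
    "## Section 1",
    "Content for section 1.",
    "### Subsection 1.1",
    "Details for subsection 1.1.",
    "## Section 2",
    "Content for section 2.",
]

heading_pattern = re.compile(r"^(#+) (.*)$")

sections = []
for chunk in chunks:
    match = heading_pattern.match(chunk)
    if match:
        # A heading starts a new section; its level is the number of '#' signs.
        sections.append(
            {"level": len(match.group(1)), "title": match.group(2), "content": ""}
        )
    elif sections:
        # Any other chunk is the content of the most recent heading.
        sections[-1]["content"] = chunk

for section in sections:
    print(section["level"], section["title"], "->", section["content"])
```

Because the headings are kept in the list, the heading level and title are available without parsing the original document a second time.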
 
 """
 
@@ -167,9 +210,9 @@ def partition(text, regex, flags=0):
     >>> partition("startmiddleend", r"middle")
     ['start', 'middle', 'end']
     >>> partition("delimtextdelim", r"delim")
-    ['', 'delim', 'text', 'delim', '']
+    ['delim', 'text', 'delim']
     >>> partition("TestData", r"t", flags=re.IGNORECASE)
-    ['', 'T', 'es', 't', 'Data']
+    ['T', 'es', 't', 'Da', 't', 'a']
     """
     if text == "":
         return [""]
-9.39 KB
Binary file not shown.
317 Bytes
Binary file not shown.

docs/_build/html/_sources/advertools.partition.rst.txt

Lines changed: 0 additions & 3 deletions
@@ -1,6 +1,3 @@
-advertools.partition module
-===========================
-
 .. automodule:: advertools.partition
    :members:
    :undoc-members:

docs/_build/html/advertools.html

Lines changed: 5 additions & 6 deletions
@@ -230,12 +230,11 @@ <h2>Submodules<a class="headerlink" href="#submodules" title="Link to this headi
 </ul>
 </li>
 <li class="toctree-l1"><a class="reference internal" href="advertools.logs.html#parse-and-analyze-crawl-logs-in-a-dataframe">Parse and Analyze Crawl Logs in a Dataframe</a></li>
-<li class="toctree-l1"><a class="reference internal" href="advertools.partition.html">advertools.partition module</a><ul>
-<li class="toctree-l2"><a class="reference internal" href="advertools.partition.html#text-partitioning-with-python">Text partitioning with Python</a><ul>
-<li class="toctree-l3"><a class="reference internal" href="advertools.partition.html#core-functionality">Core Functionality</a></li>
-<li class="toctree-l3"><a class="reference internal" href="advertools.partition.html#connecting-to-other-use-cases">Connecting to Other Use Cases</a></li>
-</ul>
-</li>
+<li class="toctree-l1"><a class="reference internal" href="advertools.partition.html">Text partitioning with Python</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="advertools.partition.html#what-is-partitioning">What is partitioning?</a></li>
+<li class="toctree-l2"><a class="reference internal" href="advertools.partition.html#partitioning-using-a-regular-expression">Partitioning using a regular expression</a></li>
+<li class="toctree-l2"><a class="reference internal" href="advertools.partition.html#core-functionality">Core Functionality</a></li>
+<li class="toctree-l2"><a class="reference internal" href="advertools.partition.html#connecting-to-other-use-cases">Connecting to other use cases</a></li>
 </ul>
 </li>
 <li class="toctree-l1"><a class="reference internal" href="advertools.regex.html">Regular Expressions for Extracting Structured Entities</a></li>
