|
1 | 1 | """ |
2 | | -Introduction to partitioning text |
3 | | -=================================== |
| 2 | +Text partitioning with Python |
| 3 | +============================= |
| 4 | +
|
| 5 | +The ``partition`` function in ``advertools`` provides a powerful way to partition a string |
| 6 | +based on a regular expression. Unlike typical string splitting methods that only return |
| 7 | +the text *between* delimiters, ``partition`` includes the delimiters themselves in the |
| 8 | +result list. This is particularly useful for tasks where the delimiters are as important |
| 9 | +as the content they separate. |
| 10 | +
|
| 11 | +
|
| 12 | +
|
| 13 | +What is partitioning? |
| 14 | +--------------------- |
| 15 | +
|
| 16 | +It is the process of converting a string of characters into a list, while preserving all |
| 17 | +characters in the input string. |
| 18 | +In other words, you should be able to do a "round trip" from string to partitioned |
| 19 | +string, and back to the original string. |
| 20 | +
|
| 21 | +``partition`` does this, with one caveat: it strips whitespace, so the round trip is |
| 22 | +close to exact but not character for character. |
| 23 | +
|
| 24 | +
|
| 25 | +Partitioning using a regular expression |
| 26 | +--------------------------------------- |
| 27 | +
|
| 28 | +An important feature in this function is that it enables you to partition using a regex |
| 29 | +pattern, and not just a fixed sequence of characters. You can partition a markdown |
| 30 | +string into headings and regular text for example, and use only "#", "##", and "###" for |
| 31 | +the partitioning. |
| 32 | +
|
| 33 | +It also provides a ``flags`` parameter, in case you want to use Python's regex options |
| 34 | +like ``re.IGNORECASE``, ``re.DOTALL``, or ``re.MULTILINE``. |
| 35 | +
|
4 | 36 |
|
5 | | -The ``partition`` function in ``advertools`` provides a powerful way to split a string |
6 | | -based on a regular expression. Unlike typical string splitting methods that only return the text *between* delimiters, ``partition`` includes the delimiters themselves in the result list. This is particularly useful for tasks where the delimiters are as important as the content they separate. |
7 | 37 |
|
8 | 38 | Core Functionality |
9 | 39 | ------------------ |
10 | 40 |
|
11 | | -The function takes a ``text`` string, a ``regex`` pattern, and optional ``flags`` from the ``re`` module. It returns a list of strings, alternating between the substrings and the matches. |
| 41 | +The function takes a ``text`` string, a ``regex`` pattern, and optional ``flags`` from |
| 42 | +the ``re`` module. It returns a list of strings containing the matches and the |
| 43 | +substrings between them, in their original order. |
12 | 44 |
|
13 | 45 | **Key Features:** |
14 | 46 |
|
15 | 47 | * **Includes Delimiters:** The matched delimiters are part of the output list. |
16 | 48 | * **Regex Powered:** Leverages the full power of regular expressions for defining separators. |
17 | | -* **Handles Edge Cases:** Correctly processes matches at the beginning or end of the string, and consecutive matches, by including empty strings to represent zero-length parts. |
| 49 | +
|
18 | 50 |
|
19 | 51 | Examples |
20 | 52 | -------- |
21 | 53 |
|
22 | 54 | Let's explore some practical examples: |
23 | 55 |
|
24 | | -**1. Basic Splitting by Numbers:** |
| 56 | +**1. Basic splitting by numbers:** |
25 | 57 |
|
26 | 58 | .. code-block:: python |
27 | 59 |
|
28 | | - import advertools as adv |
| 60 | + >>> import advertools as adv |
29 | 61 |
|
30 | | - text = "abc123def456ghi" |
31 | | - regex = r"\\d+" |
32 | | - result = adv.partition(text, regex) |
33 | | - print(result) |
34 | | - # Output: ['abc', '123', 'def', '456', 'ghi'] |
| 62 | + >>> text = "abc123def456ghi" |
| 63 | + >>> regex = r"\\d+" |
| 64 | + >>> adv.partition(text, regex) |
| 65 | + ['abc', '123', 'def', '456', 'ghi'] |
35 | 66 |
|
36 | | -**2. No Match Found:** |
| 67 | +**2. No match found:** |
37 | 68 |
|
38 | 69 | If the regex pattern doesn't find any matches, the original string is returned as a single-element list. |
39 | 70 |
|
40 | 71 | .. code-block:: python |
41 | 72 |
|
42 | | - import advertools as adv |
| 73 | + >>> import advertools as adv |
43 | 74 |
|
44 | | - text = "test" |
45 | | - regex = r"X" |
46 | | - result = adv.partition(text, regex) |
47 | | - print(result) |
48 | | - # Output: ['test'] |
| 75 | + >>> text = "test" |
| 76 | + >>> regex = r"X" |
| 77 | + >>> adv.partition(text, regex) |
| 78 | + ['test'] |
49 | 79 |
|
50 | | -**3. Handling Consecutive Delimiters and Edge Matches:** |
| 80 | +**3. Handling consecutive delimiters and edge matches:** |
51 | 81 |
|
52 | 82 | This example shows how ``partition`` handles cases where delimiters are at the start/end or appear consecutively. |
53 | 83 |
|
54 | 84 | .. code-block:: python |
55 | 85 |
|
56 | | - import advertools as adv |
| 86 | + >>> import advertools as adv |
57 | 87 |
|
58 | | - text = ",a,,b," |
59 | | - regex = r"," |
60 | | - result = adv.partition(text, regex) |
61 | | - print(result) |
62 | | - # Output: ['', ',', 'a', ',', '', ',', 'b', ',', ''] |
| 88 | + >>> text = ",a,,b," |
| 89 | + >>> regex = r"," |
| 90 | + >>> adv.partition(text, regex) |
| 91 | + [',', 'a', ',', ',', 'b', ','] |
63 | 92 |
|
64 | | -**4. Case-Insensitive Partitioning:** |
| 93 | +**4. Case-insensitive partitioning:** |
65 | 94 |
|
66 | 95 | You can use regex flags, like ``re.IGNORECASE``, for more flexible matching. |
67 | 96 |
|
68 | 97 | .. code-block:: python |
69 | 98 |
|
70 | | - import advertools as adv |
71 | | - import re |
| 99 | + >>> import advertools as adv |
| 100 | + >>> import re |
72 | 101 |
|
73 | | - text = "TestData" |
74 | | - regex = r"t" |
75 | | - result = adv.partition(text, regex, flags=re.IGNORECASE) |
76 | | - print(result) |
77 | | - # Output: ['', 'T', 'es', 't', 'Data'] |
| 102 | + >>> text = "TestData" |
| 103 | + >>> regex = r"t" |
| 104 | + >>> adv.partition(text, regex, flags=re.IGNORECASE) |
| 105 | + ['T', 'es', 't', 'Da', 't', 'a'] |
78 | 106 |
|
79 | | -Connecting to Other Use Cases |
| 107 | +Connecting to other use cases |
80 | 108 | ----------------------------- |
81 | 109 |
|
82 | | -While ``partition`` is a general-purpose string manipulation tool, its ability to retain delimiters makes it valuable in various contexts. For instance, if you were working with a function that processes Markdown documents (let's imagine a hypothetical ``generate_markdown_chunks`` function), ``partition`` could be used to split a Markdown document by specific structural elements (e.g., headings, code blocks, lists). |
| 110 | +While ``partition`` is a general-purpose string manipulation tool, its ability to retain |
| 111 | +delimiters makes it valuable in various contexts. For instance, if you were working with |
| 112 | +Markdown documents (for example, via the ``adv.crawlytics.generate_markdown`` |
| 113 | +function), |
| 114 | +``partition`` could be used to split a Markdown document by specific structural elements |
| 115 | +(e.g., headings, code blocks, lists). |
| 116 | +
|
83 | 117 |
|
84 | | -Imagine you want to break down a Markdown document into chunks based on heading levels (e.g., ``## ``, ``### ``). The ``partition`` function could be used to identify these headings and the content between them. |
| 118 | +Imagine you want to break down a Markdown document into chunks based on heading levels |
| 119 | +(e.g., ``##``, ``###``). The ``partition`` function could be used to identify these |
| 120 | +headings and the content between them. |
85 | 121 |
|
86 | 122 | .. code-block:: python |
87 | 123 |
|
88 | | - import advertools as adv |
89 | | - import re |
| 124 | + >>> import advertools as adv |
| 125 | + >>> import re |
90 | 126 |
|
91 | | - markdown_text = ''' |
| 127 | + >>> markdown_text = ''' |
92 | 128 | # Document Title |
93 | 129 |
|
94 | 130 | Some introductory text. |
|
105 | 141 |
|
106 | 142 | Content for section 2. |
107 | 143 | ''' |
108 | | - # Regex to match markdown headings (##, ###, etc.) |
109 | | - # Matches lines starting with one or more '#' followed by a space |
110 | | - heading_regex = r"^#+\\s" |
111 | 144 |
|
| 145 | + >>> heading_regex = r"^#+ .*?$" |
112 | 146 |
|
113 | 147 | # Partition the markdown text by headings |
114 | 148 | # Note: This is a simplified example. A robust markdown parser would be more complex. |
115 | | - chunks = adv.partition(markdown_text, heading_regex, flags=re.MULTILINE) |
| 149 | + >>> chunks = adv.partition(markdown_text, heading_regex, flags=re.MULTILINE) |
116 | 150 |
|
117 | 151 | # The 'chunks' list would contain alternating text blocks and the matched headings, |
118 | 152 | # allowing further processing of each part of the document. |
119 | | - for i, chunk in enumerate(chunks): |
120 | | - if re.match(heading_regex, chunk): |
121 | | - print(f"Heading: {chunk.strip()}") |
122 | | - else: |
123 | | - print(f"Content Block {i // 2 + 1}:\\n{chunk.strip()}\\n") |
124 | | -
|
125 | | -This demonstrates how ``partition`` can be a foundational tool for more complex text processing tasks, such as breaking down structured documents into manageable pieces. |
| 153 | + >>> print(*chunks, sep="\\n----\\n") |
| 154 | + # Document Title |
| 155 | + ---- |
| 156 | + Some introductory text. |
| 157 | + ---- |
| 158 | + ## Section 1 |
| 159 | + ---- |
| 160 | + Content for section 1. |
| 161 | + ---- |
| 162 | + ### Subsection 1.1 |
| 163 | + ---- |
| 164 | + Details for subsection 1.1. |
| 165 | + ---- |
| 166 | + ## Section 2 |
| 167 | + ---- |
| 168 | + Content for section 2. |
126 | 169 |
|
127 | 170 | """ |
128 | 171 |
|
@@ -167,9 +210,9 @@ def partition(text, regex, flags=0): |
167 | 210 | >>> partition("startmiddleend", r"middle") |
168 | 211 | ['start', 'middle', 'end'] |
169 | 212 | >>> partition("delimtextdelim", r"delim") |
170 | | - ['', 'delim', 'text', 'delim', ''] |
| 213 | + ['delim', 'text', 'delim'] |
171 | 214 | >>> partition("TestData", r"t", flags=re.IGNORECASE) |
172 | | - ['', 'T', 'es', 't', 'Data'] |
| 215 | + ['T', 'es', 't', 'Da', 't', 'a'] |
173 | 216 | """ |
174 | 217 | if text == "": |
175 | 218 | return [""] |
|