Commit cc635c9
⚡️ Speed up function `under_non_alpha_ratio` by 76% (#4079)
### 📄 76% (0.76x) speedup for ***`under_non_alpha_ratio` in `unstructured/partition/text_type.py`***
⏱️ Runtime : **`9.53 milliseconds`** **→** **`5.41 milliseconds`** (best of `91` runs)
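The percentages in this report appear to follow the usual "original over optimized, minus one" convention; that reading is an inference from the numbers shown, not something the report states. A minimal sketch, with the hypothetical helper name `percent_faster`:

```python
def percent_faster(original: float, optimized: float) -> float:
    """Return how much faster `optimized` is than `original`, in percent.

    Both runtimes must be expressed in the same unit.
    Assumption: speedup = original / optimized - 1, inferred from the report.
    """
    if optimized <= 0:
        raise ValueError("optimized runtime must be positive")
    return (original / optimized - 1.0) * 100.0


# Headline runtimes from this report, both in milliseconds.
headline = percent_faster(9.53, 5.41)
print(f"{headline:.1f}% faster")  # roughly 76%
```

Because the displayed runtimes are rounded, recomputing a percentage from them can differ from the reported figure in the last digit.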
### 📝 Explanation and details
Here's an optimized version of your function.
Major improvements:
- Only **one pass** through the text string instead of two list comprehensions, which saves both memory and CPU time.
- No lists are constructed; only two integer counters are kept.
- `char.strip()` was only used to check for non-whitespace characters; `char.isspace()` checks that explicitly.
Here's the optimized code with all original comments retained.
This approach processes the string only **once** and uses **O(1) memory** (just two ints). `char.isspace()` is a fast way to check for all Unicode whitespace, just as `char.strip()` did before. This significantly speeds up the function and removes almost all of the time spent in the original two list comprehensions.
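The optimized code itself is not reproduced in this report. The following is a minimal sketch reconstructed from the description above, not the committed implementation: both function names are hypothetical, and the two-list baseline is likewise inferred from the explanation.

```python
def baseline_under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """Two-pass version: builds two temporary lists (reconstructed from the description)."""
    if len(text) == 0:
        return False
    alpha_count = len([char for char in text if char.strip() and char.isalpha()])
    total_count = len([char for char in text if char.strip()])
    if total_count == 0:
        return False
    return alpha_count / total_count < threshold


def one_pass_under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """Single-pass version: two integer counters, no intermediate lists."""
    if not text:
        return False
    alpha_count = 0
    total_count = 0
    for char in text:
        if char.isspace():
            continue  # whitespace is excluded from both counts
        total_count += 1
        if char.isalpha():
            alpha_count += 1
    if total_count == 0:
        return False  # whitespace-only input
    return alpha_count / total_count < threshold


# A few inputs taken from the generated tests in this report.
SAMPLES = ["", " ", "HelloWorld", "a1b2", "a b c 1 2 3", "éüß1!。", "a😀b😀", "\n\t \t"]
for sample in SAMPLES:
    for threshold in (0.0, 0.5, 0.51, 1.0):
        assert one_pass_under_non_alpha_ratio(sample, threshold) == (
            baseline_under_non_alpha_ratio(sample, threshold)
        )
```

The loop at the end only spot-checks that the two sketches agree on a handful of inputs; it is not a substitute for the regression and replay tests listed in this report.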
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **80 Passed** |
| ⏪ Replay Tests | ✅ **594 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `partition/test_text_type.py::test_under_non_alpha_ratio_zero_divide` | 1.14μs | 991ns | ✅15.1% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 820μs | 412μs | ✅98.8% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 5.62ms | 3.21ms | ✅75.3% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 1.95ms | 1.09ms | ✅79.2% |
</details>
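The report quotes a best-of-N runtime. How those measurements were actually collected is not shown here, so the following is only a sketch of one common approach using the standard library's `timeit.repeat`. The helper `best_of` and the two counting functions are hypothetical stand-ins, not the project's code.

```python
import timeit
from typing import Callable

TEXT = "a1" * 500  # 1000 non-whitespace characters


def best_of(func: Callable[[], object], runs: int = 5, loops: int = 1000) -> float:
    """Return the best per-call time in seconds across `runs` repetitions."""
    timings = timeit.repeat(func, repeat=runs, number=loops)
    return min(timings) / loops


def count_two_lists() -> int:
    # Builds a temporary list just to take its length.
    return len([char for char in TEXT if char.strip()])


def count_one_loop() -> int:
    # Keeps a single integer counter instead of a list.
    total = 0
    for char in TEXT:
        if not char.isspace():
            total += 1
    return total


if __name__ == "__main__":
    for func in (count_two_lists, count_one_loop):
        print(f"{func.__name__}: {best_of(func) * 1e6:.2f} μs per call")
```

Taking the minimum rather than the mean reduces the influence of scheduling noise, which is why best-of figures are a common choice for micro-benchmarks. Absolute numbers will vary by machine and Python version.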
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio

# unit tests

# -------------------------
# BASIC TEST CASES
# -------------------------


def test_all_alpha_below_threshold():
    # All alphabetic, so ratio is 1.0, which is not < threshold (default 0.5)
    codeflash_output = not under_non_alpha_ratio("HelloWorld")  # 2.48μs -> 1.63μs (52.1% faster)


def test_all_alpha_above_threshold():
    # All alphabetic, but threshold is 1.1, so ratio is < threshold
    codeflash_output = under_non_alpha_ratio("HelloWorld", threshold=1.1)  # 2.19μs -> 1.47μs (49.1% faster)


def test_all_non_alpha():
    # All non-alpha, so ratio is 0, which is < threshold
    codeflash_output = under_non_alpha_ratio("1234567890!@#$%^&*()_+-=[]{}|;':,.<>/?", threshold=0.5)  # 3.38μs -> 2.00μs (68.8% faster)


def test_mixed_alpha_non_alpha_below_threshold():
    # 4 alpha, 6 non-alpha (excluding spaces): ratio = 4/10 = 0.4 < 0.5
    codeflash_output = under_non_alpha_ratio("a1b2c3d4!!", threshold=0.5)  # 2.16μs -> 1.44μs (50.7% faster)


def test_mixed_alpha_non_alpha_above_threshold():
    # 6 alpha, 2 non-alpha: ratio = 6/8 = 0.75 > 0.5, so not under threshold
    codeflash_output = not under_non_alpha_ratio("abCD12ef", threshold=0.5)  # 2.04μs -> 1.43μs (42.9% faster)


def test_spaces_are_ignored():
    # Only 'a', 'b', 'c', '1', '2', '3' are counted (spaces ignored)
    # 3 alpha, 3 non-alpha: ratio = 3/6 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a b c 1 2 3", threshold=0.5)  # 2.16μs -> 1.39μs (55.3% faster)
    # If threshold is 0.6, ratio 0.5 < 0.6, so True
    codeflash_output = under_non_alpha_ratio("a b c 1 2 3", threshold=0.6)  # 1.25μs -> 705ns (77.7% faster)


def test_threshold_edge_case_exact():
    # 2 alpha, 2 non-alpha: ratio = 2/4 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a1b2", threshold=0.5)  # 1.76μs -> 1.26μs (39.2% faster)
    # If threshold is 0.51, ratio 0.5 < 0.51, so True
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.51)  # 781ns -> 541ns (44.4% faster)


# -------------------------
# EDGE TEST CASES
# -------------------------


def test_empty_string():
    # Empty string should always return False
    codeflash_output = under_non_alpha_ratio("")  # 450ns -> 379ns (18.7% faster)


def test_only_spaces():
    # Only spaces, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio(" ")  # 1.24μs -> 822ns (50.9% faster)


def test_only_newlines_and_tabs():
    # Only whitespace, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio("\n\t \t")  # 1.16μs -> 745ns (55.7% faster)


def test_only_one_alpha():
    # Single alpha, total_count == 1, ratio == 1.0
    codeflash_output = not under_non_alpha_ratio("A")  # 1.43μs -> 1.06μs (35.5% faster)
    codeflash_output = under_non_alpha_ratio("A", threshold=1.1)  # 689ns -> 589ns (17.0% faster)


def test_only_one_non_alpha():
    # Single non-alpha, total_count == 1, ratio == 0.0
    codeflash_output = under_non_alpha_ratio("1")  # 1.27μs -> 937ns (35.4% faster)
    codeflash_output = under_non_alpha_ratio("!", threshold=0.1)  # 775ns -> 550ns (40.9% faster)


def test_unicode_alpha_and_non_alpha():
    # Unicode alpha: 'é', 'ü', 'ß' are isalpha()
    # Unicode non-alpha: '1', '!', '。'
    # 3 alpha, 3 non-alpha, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio("éüß1!。", threshold=0.5)  # 3.03μs -> 2.21μs (37.3% faster)
    codeflash_output = under_non_alpha_ratio("éüß1!。", threshold=0.6)  # 1.11μs -> 789ns (40.2% faster)


def test_mixed_with_whitespace():
    # Alpha: a, b, c; Non-alpha: 1, 2, 3; Spaces ignored
    codeflash_output = under_non_alpha_ratio(" a 1 b 2 c 3 ", threshold=0.6)  # 2.46μs -> 1.60μs (54.0% faster)


def test_threshold_zero():
    # Any non-zero alpha ratio is not < 0, so always False unless all non-alpha
    codeflash_output = not under_non_alpha_ratio("abc123", threshold=0.0)  # 2.19μs -> 1.49μs (46.5% faster)
    # All non-alpha: ratio = 0, not < 0, so False
    codeflash_output = not under_non_alpha_ratio("123", threshold=0.0)  # 859ns -> 594ns (44.6% faster)


def test_threshold_one():
    # Any ratio < 1.0 should return True if not all alpha
    codeflash_output = under_non_alpha_ratio("abc123", threshold=1.0)  # 1.94μs -> 1.24μs (56.4% faster)
    # All alpha: ratio = 1.0, not < 1.0, so False
    codeflash_output = not under_non_alpha_ratio("abcdef", threshold=1.0)  # 1.17μs -> 602ns (93.7% faster)


def test_leading_trailing_whitespace():
    # Whitespace should be ignored
    codeflash_output = under_non_alpha_ratio(" a1b2c3 ", threshold=0.6)  # 2.32μs -> 1.44μs (61.3% faster)


def test_only_symbols():
    # Only symbols, ratio = 0, so < threshold
    codeflash_output = under_non_alpha_ratio("!@#$%^&*", threshold=0.5)  # 1.89μs -> 1.15μs (65.0% faster)


def test_long_string_all_spaces_and_newlines():
    # All whitespace, should return False
    codeflash_output = under_non_alpha_ratio(" \n " * 100)  # 11.4μs -> 3.65μs (213% faster)


def test_single_space():
    # Single space, should return False
    codeflash_output = under_non_alpha_ratio(" ")  # 1.04μs -> 741ns (39.8% faster)


def test_non_ascii_non_alpha():
    # Non-ASCII, non-alpha (emoji)
    codeflash_output = under_non_alpha_ratio("😀😀😀", threshold=0.5)  # 2.42μs -> 1.79μs (34.9% faster)


def test_mixed_emojis_and_alpha():
    # 2 alpha, 2 emoji: ratio = 2/4 = 0.5
    codeflash_output = not under_non_alpha_ratio("a😀b😀", threshold=0.5)  # 2.24μs -> 1.50μs (49.3% faster)
    codeflash_output = under_non_alpha_ratio("a😀b😀", threshold=0.6)  # 948ns -> 658ns (44.1% faster)


# -------------------------
# LARGE SCALE TEST CASES
# -------------------------


def test_large_all_alpha():
    # 1000 alpha, ratio = 1.0
    s = "a" * 1000
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5)  # 47.9μs -> 33.4μs (43.4% faster)


def test_large_all_non_alpha():
    # 1000 non-alpha, ratio = 0.0
    s = "1" * 1000
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5)  # 40.9μs -> 24.9μs (64.3% faster)


def test_large_mixed_half_and_half():
    # 500 alpha, 500 non-alpha, ratio = 0.5
    s = "a" * 500 + "1" * 500
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5)  # 45.2μs -> 28.4μs (59.0% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6)  # 43.3μs -> 27.8μs (55.9% faster)


def test_large_with_spaces_ignored():
    # 400 alpha, 400 non-alpha, 200 spaces (should be ignored)
    s = "a" * 400 + "1" * 400 + " " * 200
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5)  # 43.5μs -> 24.5μs (77.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6)  # 41.8μs -> 23.8μs (75.6% faster)


def test_large_unicode_mixed():
    # 300 unicode alpha, 300 unicode non-alpha, 400 ascii alpha
    s = "é" * 300 + "😀" * 300 + "a" * 400
    # alpha: 300 (é) + 400 (a) = 700, non-alpha: 300 (😀), total = 1000
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.8)  # 65.3μs -> 36.8μs (77.6% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.71)  # 59.0μs -> 34.8μs (69.8% faster)
    # ratio = 700/1000 = 0.7


def test_large_threshold_zero_one():
    # All alpha, threshold=0.0, should be False
    s = "b" * 999
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.0)  # 46.2μs -> 33.1μs (39.4% faster)
    # All non-alpha, threshold=1.0, should be True
    s = "!" * 999
    codeflash_output = under_non_alpha_ratio(s, threshold=1.0)  # 40.4μs -> 24.3μs (66.0% faster)


def test_large_string_with_whitespace_only():
    # 1000 spaces, should return False
    s = " " * 1000
    codeflash_output = under_non_alpha_ratio(s)  # 33.7μs -> 10.0μs (236% faster)


def test_large_string_with_mixed_whitespace_and_chars():
    # 333 alpha, 333 non-alpha, 334 whitespace (ignored)
    s = "a" * 333 + "1" * 333 + " " * 334
    # total_count = 666, alpha = 333, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5)  # 41.2μs -> 21.8μs (89.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.51)  # 40.1μs -> 21.0μs (90.8% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

# NOTE: a second generated test module appears to start here (function names repeat);
# a `from __future__` import must be the first statement of its own file.
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio

# unit tests

# --- Basic Test Cases ---


def test_all_alpha_default_threshold():
    # All alphabetic, should be False (ratio = 1.0, not under 0.5)
    codeflash_output = under_non_alpha_ratio("HelloWorld")  # 3.09μs -> 2.04μs (51.3% faster)


def test_all_non_alpha_default_threshold():
    # All non-alpha (punctuation), should be True (ratio = 0.0)
    codeflash_output = under_non_alpha_ratio("!!!???---")  # 2.02μs -> 1.24μs (63.3% faster)


def test_mixed_alpha_non_alpha_default_threshold():
    # 5 alpha, 5 non-alpha, ratio = 0.5, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("abc12!@#de")  # 2.20μs -> 1.37μs (61.1% faster)


def test_mixed_alpha_non_alpha_just_under_threshold():
    # 2 alpha, 3 non-alpha, ratio = 0.4, should be True (under threshold)
    codeflash_output = under_non_alpha_ratio("a1!b2")  # 1.67μs -> 1.19μs (40.9% faster)


def test_spaces_are_ignored():
    # Spaces should not count toward total_count
    # 3 alpha, 2 non-alpha, 2 spaces; ratio = 3/5 = 0.6, should be False
    codeflash_output = under_non_alpha_ratio("a b! c?")  # 1.88μs -> 1.19μs (57.4% faster)


def test_threshold_parameter():
    # 2 alpha, 3 non-alpha, total=5, ratio=0.4
    # threshold=0.3 -> False (not under), threshold=0.5 -> True (under)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.3)  # 1.71μs -> 1.29μs (32.4% faster)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.5)  # 900ns -> 536ns (67.9% faster)


# --- Edge Test Cases ---


def test_empty_string():
    # Empty string should return False
    codeflash_output = under_non_alpha_ratio("")  # 425ns -> 367ns (15.8% faster)


def test_only_spaces():
    # Only spaces, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio(" ")  # 1.16μs -> 776ns (49.9% faster)


def test_only_alpha_with_spaces():
    # Only alpha and spaces, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("a b c d e")  # 1.79μs -> 1.24μs (44.3% faster)


def test_only_non_alpha_with_spaces():
    # Only non-alpha and spaces, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("! @ # $ %")  # 1.73μs -> 1.08μs (59.6% faster)


def test_single_alpha():
    # Single alpha, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("A")  # 1.38μs -> 1.03μs (33.9% faster)


def test_single_non_alpha():
    # Single non-alpha, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("?")  # 1.23μs -> 978ns (26.1% faster)


def test_single_space():
    # Single space, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio(" ")  # 1.00μs -> 729ns (37.7% faster)


def test_all_digits():
    # All digits, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("1234567890")  # 2.00μs -> 1.21μs (65.4% faster)


def test_unicode_alpha():
    # Unicode alphabetic characters (e.g. accented letters)
    # 3 alpha, 2 non-alpha, ratio=0.6, should be False
    codeflash_output = under_non_alpha_ratio("éàü!!")  # 2.19μs -> 1.58μs (39.3% faster)


def test_unicode_non_alpha():
    # Unicode non-alpha (emoji, symbols)
    # 2 non-alpha, 2 alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("a😀b!")  # 2.46μs -> 1.69μs (45.9% faster)


def test_threshold_1_0():
    # threshold=1.0, any string with <100% alpha should return True
    # 2 alpha, 2 non-alpha, ratio=0.5 < 1.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=1.0)  # 1.73μs -> 1.36μs (27.4% faster)


def test_threshold_0_0():
    # threshold=0.0: no ratio can be below it, so the result is always False
    # 2 alpha, 2 non-alpha, ratio=0.5 > 0.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.0)  # 1.73μs -> 1.29μs (33.7% faster)
    # All non-alpha, ratio=0.0 == 0.0, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("1234", threshold=0.0)  # 883ns -> 653ns (35.2% faster)


def test_threshold_exactly_equal():
    # Ratio equals threshold: should return False (not under threshold)
    # 2 alpha, 2 non-alpha, ratio=0.5 == threshold
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.5)  # 1.60μs -> 1.20μs (33.1% faster)


def test_tabs_and_newlines_ignored():
    # Tabs and newlines are whitespace, so ignored
    # 2 alpha, 1 non-alpha, 2 whitespace, ratio=2/3 ~ 0.667, should be False
    codeflash_output = under_non_alpha_ratio("a\tb\n!")  # 1.80μs -> 1.16μs (55.6% faster)


def test_long_repeated_pattern():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = "a1" * 500
    codeflash_output = under_non_alpha_ratio(s)  # 45.3μs -> 29.1μs (55.5% faster)
    # 499 alpha, 501 non-alpha, ratio=499/1000=0.499, should be True
    s2 = "a1" * 499 + "1!"
    codeflash_output = under_non_alpha_ratio(s2)  # 43.8μs -> 28.0μs (56.3% faster)


# --- Large Scale Test Cases ---


def test_large_all_alpha():
    # 1000 alphabetic characters, ratio=1.0, should be False
    s = "a" * 1000
    codeflash_output = under_non_alpha_ratio(s)  # 46.5μs -> 32.9μs (41.2% faster)


def test_large_all_non_alpha():
    # 1000 non-alpha characters, ratio=0.0, should be True
    s = "!" * 1000
    codeflash_output = under_non_alpha_ratio(s)  # 41.5μs -> 24.6μs (68.4% faster)


def test_large_half_alpha_half_non_alpha():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = "a!" * 500
    codeflash_output = under_non_alpha_ratio(s)  # 44.6μs -> 28.6μs (56.1% faster)


def test_large_sparse_alpha():
    # 10 alpha, 990 non-alpha, ratio=0.01, should be True
    s = "a" + "!" * 99
    s = s * 10  # 10 alpha, 990 non-alpha
    codeflash_output = under_non_alpha_ratio(s)  # 41.4μs -> 25.0μs (65.6% faster)


def test_large_sparse_non_alpha():
    # 990 alpha, 10 non-alpha, ratio=0.99, should be False
    s = "a" * 99 + "!"  # 99 alpha, 1 non-alpha
    s = s * 10  # 990 alpha, 10 non-alpha
    codeflash_output = under_non_alpha_ratio(s)  # 46.2μs -> 33.0μs (39.9% faster)


def test_large_with_spaces():
    # 500 alpha, 500 non-alpha, 100 spaces (should be ignored)
    s = ("a!" * 500) + (" " * 100)
    codeflash_output = under_non_alpha_ratio(s)  # 47.7μs -> 29.7μs (60.6% faster)


def test_large_thresholds():
    # 600 alpha, 400 non-alpha, ratio=0.6
    s = "a" * 600 + "!" * 400
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5)  # 44.5μs -> 29.4μs (51.1% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.7)  # 42.6μs -> 28.5μs (49.4% faster)


# --- Additional Robustness Tests ---


def test_mixed_case_and_symbols():
    # Mixed uppercase, lowercase, digits, symbols
    # 3 alpha, 3 non-alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("A1b2C!")  # 1.79μs -> 1.17μs (53.7% faster)


def test_realistic_sentence():
    # Realistic sentence, mostly alpha, some punctuation
    # 24 alpha, 2 non-alpha (comma, period), ratio=24/26 ~ 0.923, should be False
    codeflash_output = under_non_alpha_ratio("Hello, this is a test sentence.")  # 3.39μs -> 1.94μs (74.3% faster)


def test_realistic_break_line():
    # Typical break line, mostly non-alpha
    # 5 alpha, 8 non-alpha, ratio=5/13 ~ 0.385, should be True
    codeflash_output = under_non_alpha_ratio("----BREAK----")  # 2.08μs -> 1.32μs (57.9% faster)


def test_space_heavy_string():
    # Spaces should be ignored, only non-space chars count
    # 2 alpha, 2 non-alpha, spaces ignored, ratio=2/4=0.5, should be False
    codeflash_output = under_non_alpha_ratio(" a ! b ? ")  # 2.25μs -> 1.29μs (74.3% faster)


def test_only_whitespace_variety():
    # Only tabs, spaces, newlines, should return False
    codeflash_output = under_non_alpha_ratio(" \t\n\r")  # 1.08μs -> 714ns (51.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
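Several of the generated tests above probe boundary behavior: the comparison is a strict `<`, whitespace-only input returns `False`, and a threshold of `0.0` can never be undercut. The sketch below restates those cases as explicit assertions against a compact reference function. `ratio_under` is a hypothetical stand-in written for this illustration, not the library function.

```python
def ratio_under(text: str, threshold: float = 0.5) -> bool:
    """Reference semantics: alpha share of non-whitespace characters, strict '<'."""
    counted = [char for char in text if not char.isspace()]
    if not counted:
        return False  # empty or whitespace-only input
    alpha = sum(1 for char in counted if char.isalpha())
    return alpha / len(counted) < threshold


# (text, threshold, expected) triples drawn from the tests in this report.
CASES = [
    ("a1b2", 0.5, False),          # ratio 0.5 equals the threshold, so not under it
    ("a1b2", 0.51, True),          # ratio 0.5 is just below 0.51
    (" \t\n\r", 0.5, False),       # whitespace only: nothing to count
    ("123", 0.0, False),           # ratio 0.0 is not < 0.0
    ("HelloWorld", 1.1, True),     # ratio 1.0 is below a threshold above 1
    ("----BREAK----", 0.5, True),  # 5 letters out of 13 characters
]

for text, threshold, expected in CASES:
    assert ratio_under(text, threshold) is expected, (text, threshold)
```

Writing the expectations as `assert` statements makes the intended result of each case explicit, whereas the `codeflash_output = ...` lines only record a value for comparison between the original and optimized runs.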
To edit these changes, run `git checkout codeflash/optimize-under_non_alpha_ratio-mcgm6dor` and push.
---------
Signed-off-by: Saurabh Misra <[email protected]>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>

Commit cc635c9 (#4079), 1 parent 76d7a5c. 2 files changed: +12, -3 lines.