Commit bb72982
perf: Optimize file I/O, regex compilation, and logging (#207)
* Initial plan
* perf: Optimize file I/O, regex patterns, and hash functions
- Use context managers for file operations to ensure proper resource cleanup
- Precompile regex patterns for repeated use (whitespace, URL extraction, sentence splitting)
- Replace expensive SHA256 hash with native Python hash for PydanticModel
- Optimize logging serialization to handle bool type explicitly
- Improve DataFrame sorting to avoid work on empty dataframes
- Reduce redundant os.path.abspath calls in warnings_to_logger
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
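The regex bullet above can be illustrated with a minimal sketch. The pattern names, the patterns themselves, and the helper functions are hypothetical and are not taken from the wurzel source; they only show the general idea of compiling once at module level and reusing the compiled object.

```python
import re

# Hypothetical module-level patterns, compiled once at import time
# instead of being re-specified on every call.
WHITESPACE_RE = re.compile(r"\s+")
URL_RE = re.compile(r"https?://\S+")


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return WHITESPACE_RE.sub(" ", text).strip()


def extract_urls(text: str) -> list[str]:
    """Return all http(s) URLs found in the text, in order of appearance."""
    return URL_RE.findall(text)
```

The `re` module already caches recently used patterns, so the saving is mostly the per-call cache lookup; the gain is likely to matter only in hot loops.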
* fix: Correct import order for regex pattern definitions
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
* style: Apply ruff formatting
* fix: Address code review feedback
- Fix regex pattern to properly match newlines (not escaped backslash)
- Maintain backward compatibility in logging structure (nested extra dict)
- Ensure log parsing systems continue to work with existing format
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
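The newline fix above most likely concerns the difference between the two raw-string patterns shown here. This is a standalone sketch of that difference, not the actual pattern from the changed file.

```python
import re

text = "first line\nsecond line"

# r"\\n" is the regex for a literal backslash followed by the letter "n".
escaped_backslash = re.compile(r"\\n")
# r"\n" is the regex escape for an actual newline character.
newline = re.compile(r"\n")

# The first pattern finds nothing to split on; the second splits at the newline.
escaped_split = escaped_backslash.split(text)
newline_split = newline.split(text)
```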
* fix: Maintain deterministic hash for compatibility
- Keep SHA256-based hash for determinism (Python's hash() is randomized)
- Optimize by building tuple first instead of multiple string concatenations
- This ensures hash values are consistent across Python sessions
- Fixes test failure in test_metadata_field_metadata
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
* perf: Further optimize hash function per code review
- Move hashlib import to module level to avoid repeated import overhead
- Join generator directly instead of building intermediate tuple
- Maintains deterministic SHA256 hash for backward compatibility
- Addresses code review suggestions for performance
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
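The hash approach described in the two commits above (later reverted from this PR) can be sketched roughly as follows. The function name, the `|` separator, and the use of `str()` on each field are assumptions for illustration; the real `__hash__` of PydanticModel may differ.

```python
import hashlib  # imported once at module level, not inside the function


def stable_hash(*fields: object) -> int:
    """Return a hash that is identical across interpreter sessions.

    Python's built-in hash() is randomized per process for str and bytes
    (unless PYTHONHASHSEED is fixed), so it cannot be used when hash
    values must stay stable between runs.
    """
    # Join a generator directly; no intermediate tuple or repeated
    # string concatenation is needed.
    joined = "|".join(str(field) for field in fields)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return int(digest, 16)
```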
* revert: Remove hash function optimization from PR
Reverted __hash__ method to original implementation as requested.
The hash function changes will be addressed in a separate PR.
File I/O improvements with context managers remain in place.
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
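The context-manager change that remains in place follows a standard Python pattern. A minimal sketch, with hypothetical helper names and a JSON payload chosen only for illustration:

```python
import json
import tempfile
from pathlib import Path


def write_json(path: Path, payload: dict) -> None:
    """Write payload as JSON; the file is closed even if dump() raises."""
    with path.open("w", encoding="utf-8") as handle:
        json.dump(payload, handle)


def read_json(path: Path) -> dict:
    """Read JSON; the file is closed as soon as the with block exits."""
    with path.open("r", encoding="utf-8") as handle:
        return json.load(handle)


# Round trip in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example.json"
    write_json(target, {"step": "embedding"})
    loaded = read_json(target)
```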
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
1 parent f29b6ea commit bb72982
File tree
5 files changed
+31 -17 lines changed
- wurzel
- datacontract
- step_executor
- steps/embedding
- utils
- splitters