Replaced lazy all_loaded scans with an _n_lazy counter #103
milosobral wants to merge 12 commits into main from milo/lazy-attributes
Conversation
…oved benchmark to really show that change
📝 Walkthrough
A per-instance lazy-count
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant LazyObj as Lazy Object
    participant AttrHandler as __getattribute__
    participant HDF5 as HDF5 Dataset
    participant Converter as Promotion Logic
    participant EagerObj as Eager Object
    Client->>LazyObj: access attribute (e.g., "start" or "target_pos_i")
    LazyObj->>AttrHandler: intercept access
    AttrHandler->>HDF5: detect h5py.Dataset (lazy)
    HDF5-->>AttrHandler: dataset reference
    AttrHandler->>AttrHandler: materialize -> NumPy array
    AttrHandler->>LazyObj: store array in __dict__
    AttrHandler->>LazyObj: decrement _n_lazy
    LazyObj->>LazyObj: check _n_lazy
    alt _n_lazy == 0
        LazyObj->>Converter: trigger promotion
        Converter->>EagerObj: convert instance, remove lazy state
        EagerObj-->>Client: return eager attribute/object
    else _n_lazy > 0
        LazyObj-->>Client: return materialized attribute (object remains lazy)
    end
```
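The flow in the diagram above can be sketched in plain Python. This is a minimal illustration, not the library's actual implementation: `LazySlot` is a hypothetical stand-in for `h5py.Dataset` so the example runs without HDF5, and `LazyDict`/`EagerDict` stand in for the PR's `LazyArrayDict`/`ArrayDict`.

```python
# Minimal sketch of counter-based lazy promotion. "LazySlot" is a
# hypothetical stand-in for h5py.Dataset so this runs without HDF5.
import numpy as np


class LazySlot:
    """Placeholder for an on-disk dataset that materializes on demand."""

    def __init__(self, data):
        self._data = data

    def load(self):
        return np.asarray(self._data)


class EagerDict:
    """Target class once every lazy field has been materialized."""


class LazyDict(EagerDict):
    def __init__(self, **fields):
        self.__dict__.update(fields)
        # Count how many public fields are still lazy.
        self.__dict__["_n_lazy"] = sum(
            isinstance(v, LazySlot) for v in fields.values()
        )

    def __getattribute__(self, name):
        value = object.__getattribute__(self, name)
        if isinstance(value, LazySlot):
            value = value.load()              # materialize -> NumPy array
            self.__dict__[name] = value       # store; field is now eager
            self.__dict__["_n_lazy"] -= 1
            if self.__dict__["_n_lazy"] == 0:
                del self.__dict__["_n_lazy"]
                self.__class__ = EagerDict    # promote: no lazy state left
        return value


d = LazyDict(start=LazySlot([0.0, 1.0]), end=LazySlot([0.5, 1.5]))
_ = d.start
print(type(d).__name__)  # still LazyDict: one field remains lazy
_ = d.end
print(type(d).__name__)  # EagerDict: promotion happened
```

Access costs a constant-time counter decrement instead of scanning every field for remaining datasets, which is the point of the `_n_lazy` change.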
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 4
🧹 Nitpick comments (1)
benchmarks/benchmark.py (1)
52-53: Remove or implement the unused `span` parameter.
`span` on line 52 is never used, so the helper API is misleading. If it's not needed, remove it.
Proposed simplification
```diff
-def _make_disjoint_intervals(
-    n, span=10_000, min_gap=1.0, min_dur=0.5, max_dur=2.0, seed=42
-):
+def _make_disjoint_intervals(
+    n, min_gap=1.0, min_dur=0.5, max_dur=2.0, seed=42
+):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/benchmark.py` around lines 52 - 53, The function signature in benchmarks/benchmark.py declares a parameter span (n, span=10_000, min_gap=1.0, min_dur=0.5, max_dur=2.0, seed=42) that is never used; either remove span from the signature and any callers, or implement its intended behavior where the function (and helper functions like any generator or sampler used inside) constrains/uses the overall span value. Update the signature and all call sites to match, and run tests to ensure no references to span remain. Ensure you adjust the docstring and parameter list (and any default) for the function to reflect the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/benchmark.py`:
- Around line 300-335: The temp file created in bench_lazy_interval_access
(tmpfile/path) is only removed at the end and can leak on exceptions; wrap the
setup and benchmark execution in a try/finally (or use the same pattern as
bench_data_slice_lazy) so os.unlink(path) is always called in the finally block,
ensuring the temporary file is removed on both success and failure; locate
function bench_lazy_interval_access and move the cleanup into a finally that
always executes after the h5py usage and _bench call.
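The cleanup pattern this comment asks for can be shown with a self-contained sketch. The names and file contents here are illustrative, not taken from the PR; plain `open()` stands in for `h5py.File`.

```python
# Sketch of temp-file cleanup in a finally block: the file is unlinked
# even when the benchmark body raises. Names are illustrative.
import os
import tempfile


def bench_with_tempfile():
    fd, path = tempfile.mkstemp(suffix=".h5")
    os.close(fd)
    try:
        with open(path, "w") as f:   # stands in for h5py.File(path, "w")
            f.write("payload")
        with open(path) as f:        # stands in for the benchmark body
            return f.read(), path
    finally:
        if os.path.exists(path):
            os.unlink(path)          # always runs, success or failure


data, tmp_path = bench_with_tempfile()
print(data, os.path.exists(tmp_path))
```

Even the early `return` inside the `try` passes through the `finally`, so no code path leaks the file.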
In `@benchmarks/compare.py`:
- Around line 33-38: Replace raw subprocess.run invocations in
benchmarks/compare.py with hardened calls: resolve the git executable via
shutil.which("git") and use that absolute path instead of relying on PATH, add a
reasonable timeout value to prevent hanging, and enable check=True so failures
raise exceptions you can handle. Update each subprocess.run call (the ones
creating result and similar calls around the previous invocations) to use the
resolved git path, timeout, and check parameters and handle
subprocess.CalledProcessError where appropriate.
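The hardening steps listed above can be collected into one small wrapper. This is a sketch of the suggested pattern, not code from `benchmarks/compare.py`; the 30-second timeout is an illustrative choice.

```python
# Hardened git invocation: absolute executable path via shutil.which,
# an explicit timeout, and check=True so failures raise.
import shutil
import subprocess


def run_git(*args, timeout=30):
    """Run git with hardening; return stdout, or None on failure/absence."""
    git = shutil.which("git")  # resolve once instead of trusting PATH at call time
    if git is None:
        return None
    try:
        result = subprocess.run(
            [git, *args],
            capture_output=True,
            text=True,
            timeout=timeout,   # prevents hanging forever
            check=True,        # nonzero exit raises CalledProcessError
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip()


print(run_git("--version"))
```

Returning `None` here is one way to surface failure; re-raising with context is equally valid depending on how `compare.py` wants to report errors.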
In `@tests/test_arraydict.py`:
- Around line 196-200: The test uses bare attribute access expressions on the
Data object (data.unit_id, data.brain_region, data.waveform_mean) to drive lazy
loading and triggers Ruff B018; update these to explicitly assign the accessed
values to a throwaway variable (e.g., use "_" ) so the side-effect intent is
clear and the linter is satisfied while keeping the existing asserts that check
data.__dict__["_n_lazy"].
In `@tests/test_interval.py`:
- Around line 269-275: The test uses bare attribute accesses (data.start,
data.end, data.go_cue_time, data.drifting_gratings_dir and the similar access at
line 307) solely to trigger lazy materialization, which Ruff flags as B018;
change each bare access to an explicit discard assignment (e.g., _ = data.start)
so the side effect is preserved while satisfying the linter, updating
occurrences for the attributes referenced above.
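The discard-assignment fix both test comments describe can be illustrated with a hypothetical property that counts accesses, standing in for lazy materialization:

```python
# Ruff B018 flags a bare attribute expression as useless; assigning to
# "_" keeps the side effect and makes the intent explicit.
class Probe:
    def __init__(self):
        self.loads = 0

    @property
    def start(self):
        self.loads += 1          # stands in for lazy materialization
        return [0.0, 1.0]


data = Probe()
_ = data.start                   # side effect still fires; linter is satisfied
print(data.loads)
```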
---
Nitpick comments:
In `@benchmarks/benchmark.py`:
- Around line 52-53: The function signature in benchmarks/benchmark.py declares
a parameter span (n, span=10_000, min_gap=1.0, min_dur=0.5, max_dur=2.0,
seed=42) that is never used; either remove span from the signature and any
callers, or implement its intended behavior where the function (and helper
functions like any generator or sampler used inside) constrains/uses the overall
span value. Update the signature and all call sites to match, and run tests to
ensure no references to span remain. Ensure you adjust the docstring and
parameter list (and any default) for the function to reflect the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8c5b0d6c-011f-46c8-b6b4-86ce7cd0a2fc
📒 Files selected for processing (9)
- CHANGELOG.md
- benchmarks/benchmark.py
- benchmarks/compare.py
- temporaldata/arraydict.py
- temporaldata/interval.py
- temporaldata/irregular_ts.py
- temporaldata/regular_ts.py
- tests/test_arraydict.py
- tests/test_interval.py
Benchmark comparison vs main
Baseline:
vinamarora8
left a comment
@milosobral Leaving some comments!
Also, curious as to why no tests are added for IrregularTimeSeries?
Can you also merge main into this pls.
```python
# store it, now the array is loaded
self.__dict__[name] = out
self.__dict__["_n_lazy"] -= 1
```
Why not just `self._n_lazy -= 1`?
```python
obj = cls.__new__(cls)
for key, value in file.items():
    obj.__dict__[key] = value

obj._unicode_keys = file.attrs["_unicode_keys"].astype(str).tolist()
obj._lazy_ops = {}
obj._n_lazy = len(file)
```
Is this going to be safe always? Maybe we should do this?

```python
for key, value in file.items():
    obj.__dict__[key] = value
    if isinstance(value, h5py.Dataset):
        obj._n_lazy += 1
```

…n_lazy. - Added a new test to verify that _n_lazy reflects the number of datasets in LazyArrayDict.
…nto milo/lazy-attributes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@temporaldata/arraydict.py`:
- Around line 367-386: The object can become permanently “lazy” because _n_lazy
is not updated when callers set or delete public fields; override
LazyArrayDict.__setattr__ and __delattr__ to keep _n_lazy in sync with actual
dataset-valued entries: when setting an attribute (in __setattr__) check the
previous value from self.__dict__ and the incoming value and increment/decrement
_n_lazy if a h5py.Dataset (or whatever lazy marker you use) is being added or
removed (or replaced by a non-dataset), then delegate to ArrayDict.__setattr__
or assign into __dict__; when deleting an attribute (in __delattr__) decrement
_n_lazy if the removed value was a dataset, then perform deletion and run the
same promotion logic used in __getattribute__ (flip to ArrayDict and del
_lazy_ops/_unicode_keys/_n_lazy when _n_lazy reaches 0). Ensure you reference
__getattribute__, __setattr__, __delattr__, _n_lazy, LazyArrayDict, ArrayDict
and to_hdf5 when implementing these changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 131a3d36-e278-4a0c-9683-e54f4f2c8f6c
📒 Files selected for processing (5)
- CHANGELOG.md
- benchmarks/benchmark.py
- temporaldata/arraydict.py
- tests/test_arraydict.py
- tests/test_interval.py
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_arraydict.py
- tests/test_interval.py
Actionable comments posted: 1
♻️ Duplicate comments (2)
benchmarks/benchmark.py (1)
339-356: ⚠️ Potential issue | 🟡 Minor
Move temp-file cleanup into a `finally`.
The trailing `os.unlink(path)` still won't run if setup or the benchmark body raises, so failed benchmark runs can leak temp files.
Possible fix
```diff
-    with h5py.File(path, "w") as f:
-        iv.to_hdf5(f)
-
-    results = None
-    with h5py.File(path, "r") as f:
-
-        def go():
-            lazy = LazyInterval.from_hdf5(f)
-            _ = lazy.start
-            _ = lazy.end
-
-            for i in range(num_attrs):
-                _ = getattr(lazy, f"target_pos_{i}")
-
-        results = _bench(f"LazyInterval access ({num_attrs} attrs)", go, number=25)
-
-    os.unlink(path)
-    return results
+    try:
+        with h5py.File(path, "w") as f:
+            iv.to_hdf5(f)
+
+        with h5py.File(path, "r") as f:
+
+            def go():
+                lazy = LazyInterval.from_hdf5(f)
+                _ = lazy.start
+                _ = lazy.end
+
+                for i in range(num_attrs):
+                    _ = getattr(lazy, f"target_pos_{i}")
+
+            return _bench(f"LazyInterval access ({num_attrs} attrs)", go, number=25)
+    finally:
+        if os.path.exists(path):
+            os.unlink(path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/benchmark.py` around lines 339 - 356, The temp file at variable path can leak if an exception occurs before the final os.unlink(path); wrap the file-creation and benchmark logic (the with h5py.File(path, "w") as f: ... the with h5py.File(path, "r") as f: ... and the call to _bench inside go which uses LazyInterval.from_hdf5) in a try/finally and move os.unlink(path) into the finally block so the file is always removed even on errors; ensure any early returns still let the finally execute and reference path, LazyInterval.from_hdf5, and _bench unchanged.
temporaldata/arraydict.py (1)
342-386: ⚠️ Potential issue | 🟠 Major
Counter-based promotion still needs mutation hooks.
Now that promotion depends entirely on `_n_lazy`, replacing or deleting a still-lazy public field can leave the instance stuck as `LazyArrayDict` even after every dataset is gone. Please keep the counter in sync in `__setattr__`/`__delattr__` and mirror the same fix in the other lazy containers added in this PR.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@temporaldata/arraydict.py` around lines 342 - 386, The lazy-promotion counter _n_lazy is not updated when public fields are set or deleted, so implement hooks in __setattr__ and __delattr__ to keep it in sync: in __setattr__(self, name, value) if name is a public key (not starting with "_") adjust _n_lazy up when assigning an h5py.Dataset (or other lazy sentinel) and adjust down when replacing an existing h5py.Dataset with a non-lazy value; in __delattr__(self, name) if deleting a public key decrement _n_lazy if the deleted value is an h5py.Dataset; after adjustments, check the same promotion condition (if self._n_lazy == 0 then promote by setting __class__ = ArrayDict and deleting _lazy_ops/_unicode_keys/_n_lazy) to mirror the behavior in __getattribute__; apply the same pattern to the other lazy container classes introduced in this PR so all lazy containers update their _n_lazy consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@temporaldata/arraydict.py`:
- Around line 455-458: The loop that assigns HDF5 group members into
obj.__dict__ is inserting non-dataset members (e.g., groups) which breaks
LazyArrayDict assumptions and _n_lazy counting; change the logic in the loop
inside class LazyArrayDict (the for key, value in file.items() block) to skip
any value that is not an instance of h5py.Dataset before assigning to
obj.__dict__ and before incrementing _n_lazy (i.e., only set obj.__dict__[key]
and increment _n_lazy when isinstance(value, h5py.Dataset)); alternatively, if
you prefer strictness, raise a TypeError for non-dataset members instead of
silently skipping—ensure keys(), __len__ and masking only see dataset entries.
---
Duplicate comments:
In `@benchmarks/benchmark.py`:
- Around line 339-356: The temp file at variable path can leak if an exception
occurs before the final os.unlink(path); wrap the file-creation and benchmark
logic (the with h5py.File(path, "w") as f: ... the with h5py.File(path, "r") as
f: ... and the call to _bench inside go which uses LazyInterval.from_hdf5) in a
try/finally and move os.unlink(path) into the finally block so the file is
always removed even on errors; ensure any early returns still let the finally
execute and reference path, LazyInterval.from_hdf5, and _bench unchanged.
In `@temporaldata/arraydict.py`:
- Around line 342-386: The lazy-promotion counter _n_lazy is not updated when
public fields are set or deleted, so implement hooks in __setattr__ and
__delattr__ to keep it in sync: in __setattr__(self, name, value) if name is a
public key (not starting with "_") adjust _n_lazy up when assigning an
h5py.Dataset (or other lazy sentinel) and adjust down when replacing an existing
h5py.Dataset with a non-lazy value; in __delattr__(self, name) if deleting a
public key decrement _n_lazy if the deleted value is an h5py.Dataset; after
adjustments, check the same promotion condition (if self._n_lazy == 0 then
promote by setting __class__ = ArrayDict and deleting
_lazy_ops/_unicode_keys/_n_lazy) to mirror the behavior in __getattribute__;
apply the same pattern to the other lazy container classes introduced in this PR
so all lazy containers update their _n_lazy consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 82eb43df-8459-4e3a-bab2-cb843c52a1ab
📒 Files selected for processing (8)
- CHANGELOG.md
- benchmarks/benchmark.py
- temporaldata/arraydict.py
- temporaldata/interval.py
- temporaldata/irregular_ts.py
- temporaldata/regular_ts.py
- tests/test_arraydict.py
- tests/test_interval.py
✅ Files skipped from review due to trivial changes (1)
- tests/test_interval.py
🚧 Files skipped from review as they are similar to previous changes (3)
- CHANGELOG.md
- tests/test_arraydict.py
- temporaldata/regular_ts.py
```python
for key, value in file.items():
    obj.__dict__[key] = value
    if isinstance(value, h5py.Dataset):
        n_lazy += 1
```
Skip unexpected non-dataset HDF5 members here.
This now excludes non-datasets from _n_lazy, but it still inserts them into __dict__. A root group then shows up in keys() and breaks __len__ / masking because LazyArrayDict assumes every public entry is array-like. Either ignore or reject non-h5py.Dataset members before assignment.
Possible fix
```diff
 obj = cls.__new__(cls)
 n_lazy = 0
 for key, value in file.items():
-    obj.__dict__[key] = value
-    if isinstance(value, h5py.Dataset):
-        n_lazy += 1
+    if not isinstance(value, h5py.Dataset):
+        continue
+    obj.__dict__[key] = value
+    n_lazy += 1
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for key, value in file.items():
    if not isinstance(value, h5py.Dataset):
        continue
    obj.__dict__[key] = value
    n_lazy += 1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@temporaldata/arraydict.py` around lines 455 - 458, The loop that assigns HDF5
group members into obj.__dict__ is inserting non-dataset members (e.g., groups)
which breaks LazyArrayDict assumptions and _n_lazy counting; change the logic in
the loop inside class LazyArrayDict (the for key, value in file.items() block)
to skip any value that is not an instance of h5py.Dataset before assigning to
obj.__dict__ and before incrementing _n_lazy (i.e., only set obj.__dict__[key]
and increment _n_lazy when isinstance(value, h5py.Dataset)); alternatively, if
you prefer strictness, raise a TypeError for non-dataset members instead of
silently skipping—ensure keys(), __len__ and masking only see dataset entries.
…assignment and deletion. Added tests to ensure correct promotion to ArrayDict after modifying datasets.
♻️ Duplicate comments (1)
temporaldata/arraydict.py (1)
490-498: ⚠️ Potential issue | 🟡 Minor
Skip non-dataset HDF5 members during `from_hdf5` loading.
The loop stores all HDF5 items directly into `__dict__`, including groups, but only counts datasets for `_n_lazy`. When `select_by_mask()` iterates `keys()` and applies masking with `value[mask]`, h5py.Group objects would fail since they don't support boolean indexing. HDF5 groups are created in production code (e.g., data.py, regular_ts.py, irregular_ts.py), so this issue would manifest in real usage.
🔧 Suggested fix: skip non-datasets before assignment

```diff
 obj = cls.__new__(cls)
 n_lazy = 0
 for key, value in file.items():
+    if not isinstance(value, h5py.Dataset):
+        continue
     obj.__dict__[key] = value
-    if isinstance(value, h5py.Dataset):
-        n_lazy += 1
+    n_lazy += 1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@temporaldata/arraydict.py` around lines 490 - 498, The from_hdf5 loader is assigning all HDF5 members (including h5py.Group) into obj.__dict__ but only counts datasets in _n_lazy, causing select_by_mask to later attempt boolean indexing on groups; modify the loop in from_hdf5 to only assign and count members when isinstance(value, h5py.Dataset) (skip non-dataset members), leaving _lazy_ops, _unicode_keys and _n_lazy behavior unchanged so select_by_mask and other code that expects datasets won't receive groups.
🧹 Nitpick comments (1)
tests/test_arraydict.py (1)
216-235: Strengthen test to verify non-dataset members are excluded from `keys()`.
The test validates that `_n_lazy` only counts datasets, but doesn't verify that the nested group `"nested_metadata"` is excluded from `lazy.keys()`. This would catch the issue where non-datasets are still inserted into `__dict__`.
✨ Suggested enhancement

```diff
 with h5py.File(test_filepath, "r") as f:
     assert len(f) == 4
     lazy = LazyArrayDict.from_hdf5(f)
     assert lazy.__dict__["_n_lazy"] == 3
+    # Verify non-dataset members are not in keys
+    assert "nested_metadata" not in lazy.keys()
+    assert len(lazy.keys()) == 3
     _ = lazy.unit_id
     _ = lazy.brain_region
     _ = lazy.waveform_mean
     assert lazy.__class__ == ArrayDict
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_arraydict.py` around lines 216 - 235, Add an assertion that the non-dataset group "nested_metadata" is not treated as a key by the lazy loader: after creating lazy via LazyArrayDict.from_hdf5(f) (variable name lazy), assert that "nested_metadata" is not in lazy.keys() (or alternatively not in lazy.__dict__) to ensure only datasets are represented as keys and non-dataset groups aren’t inserted into the object's state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@temporaldata/arraydict.py`:
- Around line 490-498: The from_hdf5 loader is assigning all HDF5 members
(including h5py.Group) into obj.__dict__ but only counts datasets in _n_lazy,
causing select_by_mask to later attempt boolean indexing on groups; modify the
loop in from_hdf5 to only assign and count members when isinstance(value,
h5py.Dataset) (skip non-dataset members), leaving _lazy_ops, _unicode_keys and
_n_lazy behavior unchanged so select_by_mask and other code that expects
datasets won't receive groups.
---
Nitpick comments:
In `@tests/test_arraydict.py`:
- Around line 216-235: Add an assertion that the non-dataset group
"nested_metadata" is not treated as a key by the lazy loader: after creating
lazy via LazyArrayDict.from_hdf5(f) (variable name lazy), assert that
"nested_metadata" is not in lazy.keys() (or alternatively not in lazy.__dict__)
to ensure only datasets are represented as keys and non-dataset groups aren’t
inserted into the object's state.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 88e78040-f20f-4902-8a2c-c4e82bd7989b
📒 Files selected for processing (4)
- temporaldata/arraydict.py
- tests/test_arraydict.py
- tests/test_irregular_ts.py
- tests/test_regular_ts.py
After some back and forth, we (@vinamarora8 and I) have decided that, although this does speed up the code quite significantly, the added complexity and potential for dangerous bugs are not worth it for this specific optimization.
This optimization makes quite a small difference for small objects but does make a bigger difference for larger ones with more attributes.
Benchmark results:
Summary by CodeRabbit
New Features
Benchmarks
Tests
Changelog