Merged
27 changes: 27 additions & 0 deletions NEWS.md
@@ -1,3 +1,30 @@
-------
stemflow version 1.1.6
-------
**Oct, 2025**

Fixed several issues: a prediction bug and a lazy-loading bug; updated the plotting function and docs. #82. Also fixed a previous bug where, after getting an attribute of a `LazyLoadingEstimator` object, the model was not auto-dumped.


-------
stemflow version 1.1.5
-------
**Oct, 2025**

This is a large update.

Features:
1. The major change is that the `AdaSTEM` class now supports `duckdb` and `parquet` file paths as input. This allows users to pass in large datasets without duplicating the pandas DataFrame across processors when working with `n_jobs>1` parallel computing. See the new Jupyter notebooks for details. #76
2. Lazy loading is no longer realized by the `LazyLoadingEnsemble` class. Instead, it is realized by `LazyLoadingEstimator`. This allows each model to be dumped as soon as its training/prediction is finished, so models (and hence memory) no longer accumulate until training is finished for the whole ensemble. This greatly reduces memory use. See the new Jupyter notebooks for details. #77
3. `n_jobs > ensemble_folds` is no longer supported, for user-end clarity. Jobs are parallelized across ensemble folds, so `n_jobs > ensemble_folds` is meaningless. We do not want to mislead users into thinking that a 10-ensemble model will train faster with `n_jobs=20` than with `n_jobs=10`.
4. These features will not be available in `SphereAdaSTEM` due to the negligible user base and the negligible advantages. #75
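The dump-once-finished behavior behind feature 2 can be sketched in a few lines. The class below is a simplified stand-in, not stemflow's actual `LazyLoadingEstimator` (which additionally handles locking, auto-load/auto-dump flags, and joblib compression); all names here are illustrative.

```python
import pickle
from pathlib import Path


class DumpAfterUse:
    """Simplified stand-in for LazyLoadingEstimator: keep at most one
    fitted model in memory, dumping it to disk once it is finished."""

    def __init__(self, estimator, dump_dir, filename="model.pkl"):
        self.estimator = estimator
        self.path = Path(dump_dir) / filename

    def dump(self):
        # Persist the fitted estimator, then free the memory it occupied.
        with open(self.path, "wb") as f:
            pickle.dump(self.estimator, f)
        self.estimator = None
        return self.path

    def load(self):
        # Reload from disk on demand (e.g. right before prediction).
        if self.estimator is None:
            with open(self.path, "rb") as f:
                self.estimator = pickle.load(f)
        return self.estimator


class MeanModel:
    # Toy estimator so the sketch is self-contained and picklable.
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self):
        return self.mean_
```

Because each wrapped model is dumped as soon as it finishes, only roughly one model per ensemble needs to live in memory at any time, instead of the whole trained ensemble.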

Major bugs fixed:
1. Previously, models were stored in `self.model_dict` dynamically during the parallel ensemble training process, meaning the dictionary was being altered while the ensemble-level training function, which takes `self` as an input argument, was being serialized. This is not ideal, since the object being serialized should not be changing. This is fixed by assigning `model_dict` to `self` only after all training is finished.
2. Also fixed #74
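The fix in point 1 follows a general pattern: never mutate the object that is being serialized for the workers; collect the workers' results and attach them to `self` only after everything has finished. A minimal sketch of the pattern (illustrative names only, with a thread pool standing in for stemflow's joblib process-based backend):

```python
from concurrent.futures import ThreadPoolExecutor


def train_one_fold(fold_id):
    # Stand-in for ensemble-level training; returns (fold id, "model").
    return fold_id, f"model_{fold_id}"


class Ensemble:
    def fit(self, ensemble_folds, n_jobs=2):
        # self is never touched while workers run, so a process-based
        # backend could safely serialize it at any point.
        with ThreadPoolExecutor(max_workers=n_jobs) as executor:
            results = list(executor.map(train_one_fold, range(ensemble_folds)))
        # Assign to self only once all folds have finished training.
        self.model_dict = dict(results)
        return self
```

With a process-based backend, each worker pickles its arguments; keeping `self` immutable during the map avoids serializing an object that is concurrently being modified.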



-------
stemflow version 1.1.3
-------
4 changes: 2 additions & 2 deletions docs/Examples/08.Lazy_loading.ipynb
@@ -1560,7 +1560,7 @@
"metadata": {},
"source": [
"From the results we can clearly see the trade-off. Using lazy-loading:\n",
"1. It is so interesting that lazy-loading seems to even reduce the prediction time... Maybe because it does not load unnecessary models an only focus on certain stixels that cover the needed points.\n",
"1. It is so interesting that lazy-loading seems to even reduce the prediction time. This could be due to two reasons: (1) A lazy-loading model does not load unnecessary models an only focus on certain stixels that cover the needed points, and (2) joblib does not need to serialize a huge amount of data (models) which saves so much time.\n",
"2. Has large impact on testing (prediction) speed. The time for prediction is more than doubled in our case.\n",
"3. Lazy-loading will maintain memory-use stable and unchanged as ensemble fold increases (maintaining ~ 3GB in our case), while non-lazy-loading will have linear memory consumption growth."
]
@@ -2200,7 +2200,7 @@
"Still, the memory use will proportionally increase when n_jobs increase. That is because\n",
"1. Your data is being copied n_jobs times -- once for each processor, because data cannot be shared among processors. This problem cannot be solved by lazy loading, but can be solved by using database query (see the other notebook for how to use duckdb as input).\n",
"2. The trained models also cost memory. For non-lazy loading, all trained models are saved in memory, so a 10-ensemble model means 10 times more models, therefore memory, than a 1-ensemble model. Despite that, lazy-loading still managed to reduce this memory load by only allowing ~1 models in memory per ensemble (so still proportional to the number of ensembles), and ask that if the model has finished training or predicting, auto-dump itself to disk.\n",
"3. It is still surprising that prediction is so much faster when using lazy loading..."
"3. Lazy-loading also seems to dramatically reduce the prediction time. This means that avoiding serializing huge amount of data (models) with joblib is more important than I/O overhead in single model reading/dumping."
]
},
{
3 changes: 2 additions & 1 deletion requirements.txt
@@ -13,4 +13,5 @@ scipy>=1.10.1
setuptools>=68.2.0
tqdm>=4.65.0
duckdb>=1.1.3
pyarrow>=17.0.0
pyarrow>=17.0.0
cartopy>=0.22
23 changes: 14 additions & 9 deletions stemflow/lazyloading/lazyloading.py
@@ -41,7 +41,7 @@ def __init__(
estimator: Optional[BaseEstimator],
dump_dir: Optional[Path | str] = None,
filename: Optional[str] = None,
compress: Any = 3,
compress: Any = 0,
auto_load: bool = True,
auto_dump: bool = False,
keep_loaded: bool = False,
@@ -91,14 +91,16 @@ def __getattr__(self, name):
# Try autoloading and then delegate
if name.startswith("__"): # avoid dunder recursion
raise AttributeError(name)
with self._lock:
if self.estimator is None and self.auto_load:
self._load_inplace()
if self.estimator is not None and hasattr(self.estimator, name):
return getattr(self.estimator, name)
# Fallback to default behavior
raise AttributeError(f"{type(self).__name__} has no attribute '{name}'")

if self.estimator is None and not self.auto_load:
raise AttributeError("Trying to get an attribute of the estimator, but the estimator cannot be auto-loaded from disk because auto_load=False.")

with self._loaded_estimator() as est:
if hasattr(est, name):
return getattr(est, name)
else:
raise AttributeError(f"{type(est).__name__} has no attribute '{name}'")

# ---------- Persistence helpers ----------
def _resolve_path(self) -> Path:
if self.dump_dir is None:
@@ -133,6 +135,7 @@ def dump(self) -> Path:
shutil.move(str(tmp_path), str(path))
# Free memory
self.estimator = None

finally:
# Best-effort cleanup
try:
@@ -143,8 +146,10 @@ def dump(self) -> Path:
tmp_dir.rmdir()
except Exception:
pass

return path


def load(self, path: Optional[Path | str] = None) -> "LazyLoadingEstimator":
"""
Load the inner estimator from disk into this wrapper (in-place).
@@ -159,7 +164,7 @@ def load_from_dir(
cls,
dump_dir: Path | str,
filename: Optional[str] = None,
compress: Any = 3,
compress: Any = 0,
**kwargs,
) -> "LazyLoadingEstimator":
dump_dir = Path(dump_dir)