Merged
27 changes: 27 additions & 0 deletions NEWS.md
@@ -1,3 +1,30 @@
-------
stemflow version 1.1.6
-------
**Oct, 2025**

Fixed several issues: a prediction bug and a lazy-loading bug; updated the plotting function and docs. #82. Also fixed a previous bug where, after getting an attribute of a `LazyLoadingEstimator` object, the model was not auto-dumped.


-------
stemflow version 1.1.5
-------
**Oct, 2025**

This is a large update.

Features:
1. The major change is that the `AdaSTEM` class now supports `duckdb` and `parquet` file paths as input. This allows users to pass in large datasets without duplicating the pandas DataFrame across processors when working with `n_jobs>1` parallel computing. See the new Jupyter notebooks for details. #76
2. Lazy loading is no longer realized by the `LazyLoadingEnsemble` class. Instead, it is realized by `LazyLoadingEstimator`. This allows each model to be dumped as soon as its training/prediction is finished, so models (and hence memory) no longer accumulate until training is finished for the whole ensemble. This greatly reduces memory use. See the new Jupyter notebooks for details. #77
3. `n_jobs > ensemble_folds` is no longer supported, for user-end clarity. Jobs are parallelized across ensemble folds, so `n_jobs > ensemble_folds` is meaningless. We do not want to mislead users into thinking that a 10-ensemble model will train faster with `n_jobs=20` than with `n_jobs=10`.
4. These features will not be available in `SphereAdaSTEM` due to the negligible user base and the negligible advantages. #75
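The dump-once-finished behavior behind feature 2 can be sketched in a few lines. The class below is a simplified stand-in, not stemflow's actual `LazyLoadingEstimator` (which additionally handles locking, auto-load/auto-dump flags, and joblib compression); all names here are illustrative.

```python
import pickle
from pathlib import Path


class DumpAfterUse:
    """Simplified stand-in for LazyLoadingEstimator: keep at most one
    fitted model in memory, dumping it to disk once it is finished."""

    def __init__(self, estimator, dump_dir, filename="model.pkl"):
        self.estimator = estimator
        self.path = Path(dump_dir) / filename

    def dump(self):
        # Persist the fitted estimator, then free the memory it occupied.
        with open(self.path, "wb") as f:
            pickle.dump(self.estimator, f)
        self.estimator = None
        return self.path

    def load(self):
        # Reload from disk on demand (e.g. right before prediction).
        if self.estimator is None:
            with open(self.path, "rb") as f:
                self.estimator = pickle.load(f)
        return self.estimator


class MeanModel:
    # Toy estimator so the sketch is self-contained and picklable.
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self):
        return self.mean_
```

Because each wrapped model is dumped as soon as it finishes, only roughly one model per ensemble needs to live in memory at any time, instead of the whole trained ensemble.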

Major bugs fixed:
1. Previously, models were stored in `self.model_dict` dynamically during the parallel ensemble training process, meaning the dictionary was being altered while the ensemble-level training function, which takes `self` as an input argument, was being serialized. This is not ideal, since the object being serialized should not be changing. This is fixed by assigning `model_dict` to `self` only after all training is finished.
2. Also fixed #74
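The fix in point 1 follows a general pattern: never mutate the object that is being serialized for the workers; collect the workers' results and attach them to `self` only after everything has finished. A minimal sketch of the pattern (illustrative names only, with a thread pool standing in for stemflow's joblib process-based backend):

```python
from concurrent.futures import ThreadPoolExecutor


def train_one_fold(fold_id):
    # Stand-in for ensemble-level training; returns (fold id, "model").
    return fold_id, f"model_{fold_id}"


class Ensemble:
    def fit(self, ensemble_folds, n_jobs=2):
        # self is never touched while workers run, so a process-based
        # backend could safely serialize it at any point.
        with ThreadPoolExecutor(max_workers=n_jobs) as executor:
            results = list(executor.map(train_one_fold, range(ensemble_folds)))
        # Assign to self only once all folds have finished training.
        self.model_dict = dict(results)
        return self
```

With a process-based backend, each worker pickles its arguments; keeping `self` immutable during the map avoids serializing an object that is concurrently being modified.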



-------
stemflow version 1.1.3
-------
4 changes: 2 additions & 2 deletions docs/Examples/08.Lazy_loading.ipynb
@@ -1560,7 +1560,7 @@
"metadata": {},
"source": [
"From the results we can clearly see the trade-off. Using lazy-loading:\n",
"1. It is so interesting that lazy-loading seems to even reduce the prediction time... Maybe because it does not load unnecessary models an only focus on certain stixels that cover the needed points.\n",
"1. It is so interesting that lazy-loading seems to even reduce the prediction time. This could be due to two reasons: (1) A lazy-loading model does not load unnecessary models an only focus on certain stixels that cover the needed points, and (2) joblib does not need to serialize a huge amount of data (models) which saves so much time.\n",
"2. Has large impact on testing (prediction) speed. The time for prediction is more than doubled in our case.\n",
"3. Lazy-loading will maintain memory-use stable and unchanged as ensemble fold increases (maintaining ~ 3GB in our case), while non-lazy-loading will have linear memory consumption growth."
]
@@ -2200,7 +2200,7 @@
"Still, the memory use will proportionally increase when n_jobs increase. That is because\n",
"1. Your data is being copied n_jobs times -- once for each processor, because data cannot be shared among processors. This problem cannot be solved by lazy loading, but can be solved by using database query (see the other notebook for how to use duckdb as input).\n",
"2. The trained models also cost memory. For non-lazy loading, all trained models are saved in memory, so a 10-ensemble model means 10 times more models, therefore memory, than a 1-ensemble model. Despite that, lazy-loading still managed to reduce this memory load by only allowing ~1 models in memory per ensemble (so still proportional to the number of ensembles), and ask that if the model has finished training or predicting, auto-dump itself to disk.\n",
"3. It is still surprising that prediction is so much faster when using lazy loading..."
"3. Lazy-loading also seems to dramatically reduce the prediction time. This means that avoiding serializing huge amount of data (models) with joblib is more important than I/O overhead in single model reading/dumping."
]
},
{
3 changes: 2 additions & 1 deletion requirements.txt
@@ -13,4 +13,5 @@ scipy>=1.10.1
setuptools>=68.2.0
tqdm>=4.65.0
duckdb>=1.1.3
pyarrow>=17.0.0
pyarrow>=17.0.0
cartopy>=0.22
23 changes: 14 additions & 9 deletions stemflow/lazyloading/lazyloading.py
@@ -41,7 +41,7 @@ def __init__(
estimator: Optional[BaseEstimator],
dump_dir: Optional[Path | str] = None,
filename: Optional[str] = None,
compress: Any = 3,
compress: Any = 0,
auto_load: bool = True,
auto_dump: bool = False,
keep_loaded: bool = False,
@@ -91,14 +91,16 @@ def __getattr__(self, name):
# Try autoloading and then delegate
if name.startswith("__"): # avoid dunder recursion
raise AttributeError(name)
with self._lock:
if self.estimator is None and self.auto_load:
self._load_inplace()
if self.estimator is not None and hasattr(self.estimator, name):
return getattr(self.estimator, name)
# Fallback to default behavior
raise AttributeError(f"{type(self).__name__} has no attribute '{name}'")

if self.estimator is None and not self.auto_load:
raise AttributeError("Trying to get an attribute of the estimator, but the estimator cannot be auto-loaded from disk because auto_load=False.")

with self._loaded_estimator() as est:
if hasattr(est, name):
return getattr(est, name)
else:
raise AttributeError(f"{type(est).__name__} has no attribute '{name}'")

# ---------- Persistence helpers ----------
def _resolve_path(self) -> Path:
if self.dump_dir is None:
@@ -133,6 +135,7 @@ def dump(self) -> Path:
shutil.move(str(tmp_path), str(path))
# Free memory
self.estimator = None

finally:
# Best-effort cleanup
try:
@@ -143,8 +146,10 @@ def dump(self) -> Path:
tmp_dir.rmdir()
except Exception:
pass

return path


def load(self, path: Optional[Path | str] = None) -> "LazyLoadingEstimator":
"""
Load the inner estimator from disk into this wrapper (in-place).
@@ -159,7 +164,7 @@ def load_from_dir(
cls,
dump_dir: Path | str,
filename: Optional[str] = None,
compress: Any = 3,
compress: Any = 0,
**kwargs,
) -> "LazyLoadingEstimator":
dump_dir = Path(dump_dir)