
Commit 9d30dfa

v1.4.1: updates to imports, TabM, scheduler
1 parent 763d9b5 commit 9d30dfa

File tree

11 files changed: +253 -83 lines


README.md

Lines changed: 8 additions & 1 deletion
@@ -170,6 +170,14 @@ and https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
 
 ## Releases (see git tags)
 
+- v1.4.1:
+  - moved dill to optional dependencies
+  - updated TabM code to a newer version:
+    added option share_training_batches=False (old version: True),
+    exclude certain parameters from weight decay.
+  - added [documentation](https://pytabkit.readthedocs.io/en/latest/bench/using_the_scheduler.html) for using the scheduler with custom jobs.
+  - fixed bug in RealMLP refitting.
+  - updated process start method for scheduler to speed up benchmarking
 - v1.4.0:
   - moved some imports to the new `models` optional dependencies
     to have a more light-weight RealMLP installation
@@ -237,4 +245,3 @@ and https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
 - v0.0.1: First release for arXiv v1.
   Code and data are archived at [DaRUS](https://doi.org/10.18419/darus-4255).
 
-
docs/source/bench/using_the_scheduler.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
# Using the scheduler

`pytabkit` includes a flexible scheduler that can schedule jobs within Python using `ray` and `multiprocessing`.
Essentially, it is a much fancier version of `multiprocessing.Pool`.
Custom jobs need to provide an estimate of their required resources. The scheduler will
- run as many jobs in parallel as possible on the current hardware while respecting the RAM and resource constraints
- try to run the slowest jobs first, to avoid waiting for a few slow jobs at the end
- measure free CPU RAM in the beginning, and add the fixed RAM that a CPU process uses to the requested RAM.
  For processes requesting a GPU, the fixed RAM used by a process using torch CUDA will be added to the requested RAM.
- print info including remaining-time estimates after each newly started job, after failed jobs, etc.
  (unless the jobs run so fast that multiple ones are started at once).
  The time estimates are based on the time estimates provided by the jobs,
  but they are adapted by a factor learned from the actual time taken by already finished jobs.
  Hence, the time estimate is only accurate after a few jobs have finished.
  It often underestimates the actually needed time to some extent.
  (This is probably also due to selection bias, since the estimated longest jobs are run first.)

The scheduler also works on multi-GPU systems,
and it even works on multi-node systems thanks to `ray`'s multi-node support.
See [`ray_slurm_launch.py`](https://github.com/dholzmueller/pytabkit/blob/main/scripts/ray_slurm_launch.py)
and [`ray_slurm_template.sh`](https://github.com/dholzmueller/pytabkit/blob/main/scripts/ray_slurm_template.sh).
To use the scheduler, install `pytabkit[models,bench]`.

Here is some example code:

```python
from pytabkit.models.alg_interfaces.base import RequiredResources
from pytabkit.bench.scheduling.execution import RayJobManager
from pytabkit.bench.scheduling.jobs import AbstractJob
from pytabkit.bench.scheduling.resources import NodeResources
from pytabkit.bench.scheduling.schedulers import SimpleJobScheduler


class CustomJob(AbstractJob):
    def get_group(self):
        # group name; for all jobs with the same group name,
        # one joint time multiplier will be fitted in the scheduler
        return 'default'

    def get_desc(self) -> str:
        return 'CustomJob'  # name used for displaying

    def __call__(self, assigned_resources: NodeResources) -> bool:
        # the main job; it should only use the assigned resources
        print(f'Running job with {assigned_resources.get_n_threads()} threads', flush=True)
        return True  # job finished successfully

    def get_required_resources(self) -> RequiredResources:
        # Return the resources requested by this job (RAM should be an upper bound, time doesn't need to be)
        return RequiredResources(time_s=1.0, n_threads=1, cpu_ram_gb=0.1, n_gpus=0, gpu_ram_gb=0.0, gpu_usage=1.0)


sched = SimpleJobScheduler(RayJobManager(available_gpu_ram_multiplier=0.7))
sched.add_jobs([CustomJob() for _ in range(1000)])
sched.run()
```
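For jobs that need a GPU, the same `RequiredResources` fields are used; as described above, the fixed RAM of a torch CUDA process is then added to the request automatically. The sketch below builds on the `CustomJob` class from the example above; the concrete numbers are purely illustrative assumptions, not recommendations from the package:

```python
from pytabkit.models.alg_interfaces.base import RequiredResources


class CustomGPUJob(CustomJob):  # CustomJob as defined in the example above
    def get_required_resources(self) -> RequiredResources:
        # illustrative upper bounds: 4 threads, up to 2 GB of CPU RAM,
        # and one GPU with up to 4 GB of GPU RAM (time_s is only an estimate)
        return RequiredResources(time_s=120.0, n_threads=4, cpu_ram_gb=2.0,
                                 n_gpus=1, gpu_ram_gb=4.0, gpu_usage=1.0)
```

GPU jobs can be added to the same scheduler run as CPU-only jobs.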

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ Tabular benchmarking using pytabkit.bench
    bench/03_code
    bench/download_results
    bench/refine_then_calibrate
+   bench/using_the_scheduler
 
 
 
pyproject.toml

Lines changed: 3 additions & 1 deletion
@@ -38,7 +38,6 @@ dependencies = [
     # can also install the newer lightning package with more dependencies instead, it will be prioritized
     "pytorch_lightning>=2.0",
     "psutil>=5.0",  # used for getting logical CPU count in the sklearn base and for getting process RAM usage
-    "dill",  # more powerful pickle, used for file-saving and multiprocessing
 ]
 
 [project.optional-dependencies]
@@ -62,6 +61,9 @@ models = [
     # not necessary unless these things are actually used
     "probmetrics>=0.0.1",
 
+    # more powerful pickle, used for file-saving and multiprocessing.
+    # Unfortunately it can't save certain torch objects
+    "dill",
     # saving objects in yaml/msgpack
     # needed if used in utils.serialize() / deserialize()
     "pyyaml>=5.0",

pytabkit/__about__.py

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
 #
 # SPDX-License-Identifier: Apache-2.0
 
-__version__ = "1.4.0"
+__version__ = "1.4.1"

pytabkit/bench/scheduling/execution.py

Lines changed: 3 additions & 2 deletions
@@ -5,7 +5,6 @@
 import traceback
 from typing import Tuple, Optional, List
 
-import dill
 import numpy as np
 
 from pytabkit.bench.scheduling.jobs import JobRunner
@@ -69,7 +68,7 @@ def measure_node_resources(node_id: int) -> Tuple[NodeResources, NodeResources]:
 
 
 def node_runner(feedback_queue, job_queue, node_id: int):
-    mp.set_start_method('spawn', force=True)
+    mp.set_start_method('fork', force=True)
 
     # get resources in separate process so CUDA runtime is shut down when the process is terminated
     # this means that this process will not use up CUDA memory all the time
@@ -96,6 +95,7 @@ def node_runner(feedback_queue, job_queue, node_id: int):
             # cannot use None as termination signal since that is already the timeout signal
            return  # or check if processes are still running?
 
+        import dill
         job_data = dill.loads(job_str)
         # print(f'DEBUG: got job data', flush=True)
         processes.append(FunctionProcess(JobRunner(*job_data)).start())
@@ -193,6 +193,7 @@ def get_resource_manager(self) -> ResourceManager:
         return self.resource_manager
 
     def submit_job(self, job_info: JobInfo) -> None:
+        import dill
         if self.resource_manager is None:
             raise RuntimeError('called submit_job() before start()')
         job = job_info.job
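The change from 'spawn' to 'fork' above corresponds to the release note about speeding up benchmarking: on Linux, a forked child inherits the parent process's memory and already-imported modules, while 'spawn' starts a fresh interpreter and re-imports everything. A minimal standalone sketch (not from the package) comparing the startup cost of the two start methods:

```python
import multiprocessing as mp
import time


def child() -> None:
    pass  # trivial worker; we only measure process startup cost


if __name__ == '__main__':
    for method in ('fork', 'spawn'):  # 'fork' is only available on Unix
        ctx = mp.get_context(method)
        t0 = time.perf_counter()
        p = ctx.Process(target=child)
        p.start()
        p.join()
        print(f'{method}: {time.perf_counter() - t0:.3f} s')
```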

pytabkit/models/alg_interfaces/tabm_interface.py

Lines changed: 22 additions & 11 deletions
@@ -18,7 +18,7 @@
 from pytabkit.models.nn_models import rtdl_num_embeddings
 from pytabkit.models.nn_models.base import Fitter
 from pytabkit.models.nn_models.models import PreprocessingFactory
-from pytabkit.models.nn_models.tabm import Model
+from pytabkit.models.nn_models.tabm import Model, make_parameter_groups
 from pytabkit.models.training.logging import Logger
 
 
@@ -56,6 +56,8 @@ def fit(self, ds: DictDataset, idxs_list: List[SplitIdxs], interface_resources:
         allow_amp = self.config.get('allow_amp', False)
         n_blocks = self.config.get('n_blocks', 'auto')
         num_emb_n_bins = self.config.get('num_emb_n_bins', 48)
+        # the newer TabM version defaults to share_training_batches=False (the old version used True)
+        share_training_batches = self.config.get("share_training_batches", False)
 
         weight_decay = self.config.get('weight_decay', 0.0)
         gradient_clipping_norm = self.config.get('gradient_clipping_norm', None)
@@ -180,8 +182,9 @@
             ),
             arch_type=arch_type,
             k=tabm_k,
+            share_training_batches=share_training_batches,
         ).to(device)
-        optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
+        optimizer = torch.optim.AdamW(make_parameter_groups(model), lr=lr, weight_decay=weight_decay)
 
 
         if compile_model:
@@ -210,8 +213,11 @@ def loss_fn(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
             # TabM produces k predictions per object. Each of them must be trained separately.
             # (regression)     y_pred.shape == (batch_size, k)
             # (classification) y_pred.shape == (batch_size, k, n_classes)
-            k = y_pred.shape[-1 if task_type == 'regression' else -2]
-            return base_loss_fn(y_pred.flatten(0, 1), y_true.repeat_interleave(k))
+            k = y_pred.shape[1]
+            return base_loss_fn(
+                y_pred.flatten(0, 1),
+                y_true.repeat_interleave(k) if model.share_training_batches else y_true.squeeze(-1),
+            )
 
         @evaluation_mode()
         def evaluate(part: str) -> float:
@@ -270,17 +276,22 @@ def evaluate(part: str) -> float:
             if self.config.get('verbosity', 0) >= 1:
                 from tqdm.std import tqdm
             else:
-                tqdm = lambda arr, desc, total: arr
+                tqdm = lambda arr, desc: arr
         except ImportError:
-            tqdm = lambda arr, desc, total: arr
+            tqdm = lambda arr, desc: arr
 
         logger.log(1, '-' * 88 + '\n')
         for epoch in range(n_epochs):
-            for batch_idx in tqdm(
-                    torch.randperm(len(data['train']['y']), device=device).split(batch_size),
-                    desc=f'Epoch {epoch}',
-                    total=epoch_size,
-            ):
+            batches = (
+                torch.randperm(n_train, device=device).split(batch_size)
+                if model.share_training_batches
+                else [
+                    x.transpose(0, 1).flatten()
+                    for x in torch.rand((model.k, n_train), device=device).argsort(dim=1).split(batch_size, dim=1)
+                ]
+            )
+
+            for batch_idx in tqdm(batches, desc=f"Epoch {epoch}"):
                 model.train()
                 optimizer.zero_grad()
                 loss = loss_fn(apply_model('train', batch_idx), Y_train[batch_idx])
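To clarify the new batch construction above: when share_training_batches is False, each of the k ensemble members gets its own random permutation of the training indices, and the chunks are interleaved so that consecutive groups of k indices in a batch hold one index per member. Here is a minimal standalone sketch (with made-up sizes, independent of pytabkit):

```python
import torch

k, n_train, batch_size = 3, 8, 4  # made-up sizes for illustration

# one independent random permutation of the training indices per ensemble member
perms = torch.rand((k, n_train)).argsort(dim=1)  # shape (k, n_train)

# split along the sample dimension and interleave, as in the diff above
batches = [
    x.transpose(0, 1).flatten()              # shape (chunk_size * k,)
    for x in perms.split(batch_size, dim=1)  # chunks of shape (k, chunk_size)
]

for batch_idx in batches:
    # each row holds one index per ensemble member for the same batch position
    print(batch_idx.reshape(-1, k))
```

This also explains why the loss above uses `y_true.squeeze(-1)` instead of `repeat_interleave` in this case: the batch already contains one target per ensemble member.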
