Use Gloo PG if available for both restore and restore_with_id methods (#897)
Summary:
Pull Request resolved: #897
Use Gloo PG if available for both restore and restore_with_id methods.
This diff moves the Gloo process group logic into restore_with_id, which the restore method calls, so the change takes effect on both code paths.
Reviewed By: JKSenthil
Differential Revision: D62539308
fbshipit-source-id: bb37c2ce0e33027967c7ef5727ca09c3ec491fc6
# destroy gloo pg if created, its sole purpose was for checkpoint restore
-        if gloo_pg_created:
-            dist.destroy_process_group(pg)
-
     @staticmethod
     def restore_with_id(
         checkpoint_id: Union[int, str],
@@ -284,15 +264,32 @@ def restore_with_id(
             checkpoint_id: Checkpoint id. It can be the path of the snapshot to restore.
             unit: An instance of :class:`~torchtnt.framework.unit.TrainUnit`, :class:`~torchtnt.framework.unit.EvalUnit`, or :class:`~torchtnt.framework.unit.PredictUnit` containing states to restore.
             train_dataloader: An optional train dataloader to restore.
-            process_group: The process group on which the ranks will communicate on. default: ``None`` (the entire world) Note:
-                If torch.distributed is available and a process group is initialized, dcp assumes the intention is to save/load checkpoints in distributed fashion.
+            process_group: The process group on which the ranks will communicate on. default: ``None`` (the entire world)
+                If not Gloo, a Gloo process group is created.
+                Note: If torch.distributed is available and a process group is initialized, dcp assumes the intention is to save/load checkpoints in distributed fashion.
             restore_options: Controls what to filter when restoring the state.
             knob_options: Additional keyword options for StorageWriter and StorageReader
             planner: Instance of LoadPlanner. If this is not specificed, the default planner will be used. (Default: ``None``)
             storage_reader: Instance of StorageReader used to perform reads. If this is not specified, it will automatically infer
                 the reader based on the checkpoint_id. If checkpoint_id is also None, an exception will be raised. (Default: ``None``)
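
The docstring change above says that a Gloo process group is created when the supplied group is not Gloo, and the removed lines show that such a group is destroyed once the restore is done. The following is a minimal sketch of that control flow, not the code from this commit. The names `gloo_pg_for_restore`, `FakeDist` and `dry_run` are hypothetical. The `dist` argument is assumed to be the `torch.distributed` module, of which only `get_backend`, `new_group` and `destroy_process_group` are used; it is passed in as a parameter so that the flow can be exercised in a single process with a stand-in object.

```python
from contextlib import contextmanager


@contextmanager
def gloo_pg_for_restore(dist, process_group=None):
    """Yield a process group suitable for a checkpoint restore.

    `dist` is assumed to be the torch.distributed module, or any object
    offering the same get_backend / new_group / destroy_process_group calls.
    """
    pg = process_group
    gloo_pg_created = False
    # Only create a new group when the current backend is not Gloo.
    if dist.get_backend(pg) != "gloo":
        pg = dist.new_group(backend="gloo")
        gloo_pg_created = True
    try:
        yield pg
    finally:
        # The temporary group exists only for the restore, so tear it down.
        if gloo_pg_created:
            dist.destroy_process_group(pg)


class FakeDist:
    """Hypothetical stand-in for torch.distributed, used for a dry run."""

    def __init__(self, backend):
        self.backend = backend
        self.calls = []

    def get_backend(self, group=None):
        return self.backend

    def new_group(self, backend=None):
        self.calls.append(("new_group", backend))
        return "temporary-gloo-group"

    def destroy_process_group(self, group=None):
        self.calls.append(("destroy_process_group", group))


def dry_run(backend):
    """Run the context manager against FakeDist and report what happened."""
    fake = FakeDist(backend)
    with gloo_pg_for_restore(fake) as pg:
        used = pg  # a real caller would run the restore here
    return used, fake.calls
```

In a real job the caller would pass `torch.distributed` itself and perform the restore inside the `with` block. Placing the cleanup in `finally` is a design choice of this sketch: it releases the temporary group even if the restore raises.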