Conversation
This PR addresses: #88

@zhenghh04 I can confirm with the profiler that this change to checkpointing accurately represents the checkpointing in DeepSpeed. Additionally, indexed_binary and mmap_indexed_binary are the two modes used in Megatron-DeepSpeed for data reading, and the calls are accurate. You can merge this if it looks good to you.
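For context, a rough sketch of the difference between the two read modes, using NumPy directly (the file name, dtype, and sample size are illustrative, not the benchmark's actual reader code):

import numpy as np

# indexed_binary: an explicit read that copies the sample into memory.
sample = np.fromfile("part-0.bin", dtype=np.uint8, count=4096, offset=0)

# mmap_indexed_binary: the file is memory-mapped and pages are
# faulted in lazily as the sample is touched.
mm = np.memmap("part-0.bin", dtype=np.uint8, mode="r")
sample = np.array(mm[0:4096])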
Force-pushed from faee51e to 87c195b
Force-pushed from 8a1fb5a to 8537d35
Force-pushed from 8537d35 to 03796ad
@hariharan-devarajan could you take a look at the conflict, and make sure that the checkpointing writes are performed with the storage API functions, which also apply to S3 storage.
train:
  epochs: 1
  computation_time: 0.064296
Is the computation time from running the real workload?
This is based on the configuration used in PR #88.
We have to validate this after merging the PR.
if self.model_state:
    fname = os.path.join(self.checkpoint_folder, f"model-{epoch}-{step_number}-{my_rank}.pt")
    with open(fname, "wb") as f:
        torch.save(self.model_state, f)
if self.optimization_state:
    fname = os.path.join(self.checkpoint_folder, f"optimizer-{epoch}-{step_number}-{my_rank}.pt")
    with open(fname, "wb") as f:
        torch.save(self.optimization_state, f)
if self.layer_state and self.args.num_layers > 0:
    for layer in range(self.args.num_layers):
        fname = os.path.join(self.checkpoint_folder, f"layer-{layer}-{epoch}-{step_number}-{my_rank}.pt")
        with open(fname, "wb") as f:
            torch.save(self.layer_state, f)
Make sure the conflict is resolved.
@zhenghh04 The original code uses TensorFlow and PyTorch APIs to save. This is needed as we are storing complex tensors. How would this work with S3? I think we need an fsspec-style interface for abstracting storage, not a manual abstraction. Thoughts?
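A minimal sketch of what an fsspec-based save could look like (assuming the fsspec package, plus s3fs for the s3:// protocol; the bucket and object names are placeholders):

import fsspec
import torch

# Placeholder state; in the benchmark this would be self.model_state.
model_state = {"a": torch.zeros(1024)}

# torch.save accepts any writable file-like object, so an fsspec
# handle works for local paths as well as object stores.
with fsspec.open("s3://my-bucket/checkpoints/model-0-0-0.pt", "wb") as f:
    torch.save(model_state, f)

A local path goes through the same call, so the checkpoint writer would not need a separate branch per storage backend.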
zhenghh04 left a comment
Thank you for all the implementation. The feature implemented here is very useful. Please address the issues raised above.
from enum import Enum


class CheckpointType(Enum):
Maybe rename this to IOType instead of CheckpointType? CheckpointType sounds like it should distinguish different kinds of checkpoints; we could use it, for example, to checkpoint only the model, only the optimization state, etc.
How about CheckpointIOType? Just IOType might be confused with reading.
I named it CheckpointLocationType, with values RANK_ZERO and ALL_RANKS.
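For reference, a sketch of what the resulting enum could look like (member names are taken from the comment above; the values are illustrative):

from enum import Enum

class CheckpointLocationType(Enum):
    # Rank zero gathers and writes the full state.
    RANK_ZERO = "rank_zero"
    # Every rank writes its own share of the state.
    ALL_RANKS = "all_ranks"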
| """ | ||
| super().generate() | ||
| np.random.seed(10) | ||
| GB=1024**3 |
| sample_size = dim1 * dim2 | ||
| total_size = sample_size * self.num_samples | ||
| write_size = total_size | ||
| MEMORY_SIZE = 2*GB |
Should we allow the user to configure this using an environment variable, with a default value of 2 GB?
Under dataset, I will add a configuration called generation_buffer_size. What do you think?
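A sketch of how the lookup could be layered, assuming a hypothetical generation_buffer_size key under dataset and a hypothetical DLIO_GENERATION_BUFFER_SIZE environment variable, with a 2 GB default:

import os

GB = 1024 ** 3

def get_generation_buffer_size(dataset_config):
    # Hypothetical precedence: config key, then environment variable,
    # then the 2 GB default discussed above.
    if "generation_buffer_size" in dataset_config:
        return int(dataset_config["generation_buffer_size"])
    return int(os.environ.get("DLIO_GENERATION_BUFFER_SIZE", 2 * GB))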
if self.args.checkpoint_type == CheckpointType.COLLECTIVE:
    rank_to_checkpoint = 0
if rank_to_checkpoint == self.args.my_rank:
    num_ranks = 1
    if self.args.checkpoint_type == CheckpointType.COLLECTIVE:
        num_ranks = self.args.comm_size
What does COLLECTIVE mean? Is it every rank writing data?
Lines 62-63 and lines 58-59 are inconsistent with each other.
In the context of checkpointing, collective basically means that all data is collected by rank zero and written. I am open to a better word to describe it. Maybe Aggregated and Per-Process?
Called it RANK_ZERO.
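With that rename, the intended semantics of the snippet below could be expressed roughly like this (a sketch using the CheckpointLocationType enum sketched earlier; the function and attribute names are illustrative):

def resolve_checkpoint_ranks(location, my_rank, comm_size):
    """Return (writes_on_this_rank, num_ranks_of_state_to_write).

    RANK_ZERO: only rank 0 writes, and it writes comm_size ranks'
    worth of state. ALL_RANKS: every rank writes its own share.
    """
    if location == CheckpointLocationType.RANK_ZERO:
        return my_rank == 0, comm_size
    return True, 1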
if self.args.checkpoint_type == CheckpointType.COLLECTIVE:
    num_ranks = self.args.comm_size
if self.args.model_size > 0:
    self.model_state = {"a": self._get_tensor(self.args.model_size*num_ranks)}
model_size is the size of the model, right? It is confusing to have model_size * num_ranks there.
model_size is the size of the model per GPU. We could instead define it as the absolute model size of the application, in which case we would need to divide it for the per-GPU case. If it stays per GPU, then we have to multiply it for the collective case.
Explained correctly in the doc. Changed GB to an absolute value.
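A worked example of the per-GPU convention, with illustrative numbers:

GB = 1024 ** 3
comm_size = 8
model_size = 2 * GB        # per-GPU model size, now an absolute byte value

# RANK_ZERO: rank 0 writes the aggregated state of all ranks.
rank_zero_bytes = model_size * comm_size    # 16 GiB written by rank 0

# ALL_RANKS: each rank writes only its own share.
per_rank_bytes = model_size                 # 2 GiB written by every rank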
Force-pushed from b83819e to ddc92ff
Force-pushed from ddc92ff to 0c058ce
Force-pushed from 6dfa60e to 3f28662
Args model size
Force-pushed from b3f4427 to 3727e5a
zhenghh04 left a comment
This PR looks good now.
But we need to validate the DLRM and Megatron-DeepSpeed config files. I'll create two issues to keep track of this.
Changes to support Microsoft's Megatron-DeepSpeed.