feat: refactor main_ds.py (1/n) Model class #572
mergify[bot] merged 1 commit into instructlab:main
Conversation
```
class ModelTypes(Enum):
    LIGER = "Liger"
    CAUSALLM = "Causallm"
    DOLOMITE = "Dolomite"
```
We've dropped dolomite, no need to include this.
@RobotSail Interesting! What does it mean exactly? If I grep through the code, I still see hits for dolomite, including the mandatory dependency on instructlab-dolomite. Was some decision made to drop it? Should we clean these remnants from the tree then?
This pull request has merge conflicts that must be resolved before it can be merged.
booxter left a comment:

I haven't reviewed tests or the Accelerator class in detail. I need to step off this PR. Posting questions and concerns I have collected so far.
```
parser.add_argument(
    "--model-class",
    type=str,
    default=ModelTypes.CAUSALLM.value,
```
nit: you can use `choices=[x.value for x in <enum>]` to avoid listing them below
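The suggestion could look like the sketch below (enum values taken from the diff; the argparse keyword is `choices`, and the accepted values are derived from the enum rather than hand-listed):

```python
import argparse
from enum import Enum

class ModelTypes(Enum):
    LIGER = "Liger"
    CAUSALLM = "Causallm"

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model-class",
    type=str,
    choices=[x.value for x in ModelTypes],  # allowed values come from the enum
    default=ModelTypes.CAUSALLM.value,
)

args = parser.parse_args([])  # no flag given, so the default applies
```

With `choices` set, argparse also rejects unknown values at parse time with a helpful error message.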
src/instructlab/training/config.py
Outdated
```
sharding_strategy: ShardingStrategies = ShardingStrategies.HYBRID_SHARD


class Optimizers(Enum):
```
(No action required, observation) I think it's more common to name enums in the singular, not the plural. But it's a matter of habit, of course.
```
    from deepspeed.ops.adam import DeepSpeedCPUAdam
except ImportError:
    DeepSpeedCPUAdam = None
local_rank = int(os.getenv("LOCAL_RANK", "0"))
```
(No action required) I know it was done in main_ds so you are not introducing anything new here, but consider not running code / issuing warnings when importing the module. An import should not, generally, produce side effects of this sort, especially in a library. Consider warning later when the missing class is actually referred to / used.
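The deferred-failure pattern being suggested could be sketched as follows. The import still degrades gracefully, but the warning/error only fires when the optional dependency is actually used; `make_cpu_offload_optimizer` is a hypothetical helper name, not the project's actual API:

```python
try:
    from deepspeed.ops.adam import DeepSpeedCPUAdam  # optional dependency
except ImportError:
    DeepSpeedCPUAdam = None  # no warning at import time

def make_cpu_offload_optimizer(params, lr: float):
    """Raise only when the missing dependency is actually needed."""
    if DeepSpeedCPUAdam is None:
        raise RuntimeError(
            "CPU-offload optimizer requested, but deepspeed is not installed"
        )
    return DeepSpeedCPUAdam(params, lr=lr)
```

This keeps `import instructlab.training` silent for library consumers who never touch the DeepSpeed path.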
```
output_dir: str,
distributed_framework: DistributedBackend,
model_type: ModelTypes,
noise_alpha: Optional[float],
```
nit: use `type | None` instead of `Optional`
src/instructlab/training/model.py
Outdated
```
)
self.model.config.eos_token_id = self.tokenizer.eos_token_id

if "ForCausalLM" not in self.model.__class__.__name__:
```
this is fragile; can you think of a more robust way of checking it? If not, maybe the Model class could have a helper method to hide the check?
This is inherited from main (src/instructlab/training/main_ds.py, line 229 at ccac4fd). I will refactor it into a helper and we can investigate a better solution if there is one.
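The helper the author proposes could be sketched roughly like this; `Model` here is a stand-in for the PR's class, and the check inside is the same string match inherited from main_ds.py, just hidden behind one method:

```python
class Model:
    def __init__(self, model):
        self.model = model

    def is_causal_lm(self) -> bool:
        # Fragile class-name check inherited from main_ds.py, now in one
        # place so a more robust test can be swapped in later.
        return "ForCausalLM" in type(self.model).__name__
```

A more robust alternative might be an `isinstance` check against transformers' generation mixin, but that is an assumption about the library, not what the PR does.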
src/instructlab/training/model.py
Outdated
```
from .utils import add_noisy_embeddings, convert_loss_to_reduce_sum

self.model = convert_loss_to_reduce_sum(
    self.model, use_dolomite=(self.model_type == "dolomite")
```
incorrect enum == str check
fixed with the child classes I created, I think
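To illustrate why the original comparison is incorrect: a plain `Enum` member never compares equal to its string value, so `self.model_type == "dolomite"` is always `False` (and the case doesn't match the diff's `"Dolomite"` value either):

```python
from enum import Enum

class ModelTypes(Enum):
    DOLOMITE = "Dolomite"

# A plain Enum member is never equal to a string, so this is always False:
broken = ModelTypes.DOLOMITE == "dolomite"

# Compare against the member, or against .value, instead:
fixed = ModelTypes.DOLOMITE.value == "Dolomite"
```

Another common fix is to declare `class ModelTypes(str, Enum)`, which makes the members compare equal to their string values directly.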
```
"""Check if a GPU supports FlashAttention."""
major, minor = torch.cuda.get_device_capability(device_id)
# Check if the GPU architecture is Ampere (SM 8.x) or newer (SM 9.0)
is_sm8x = major == 8 and minor >= 0
```
(No action required) Could be:

```
if ...:
    return True
if ...:
    return True
if ...:
    return True
return False
```
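Filled in against the capability check from the diff, the early-return style would look roughly like this (a sketch operating on the `(major, minor)` pair from `torch.cuda.get_device_capability`; the function name is illustrative):

```python
def supports_flash_attention(major: int, minor: int) -> bool:
    """Early-return version of the SM-capability check."""
    if major == 8:   # Ampere, any SM 8.x
        return True
    if major >= 9:   # SM 9.0 and newer
        return True
    return False
```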
@booxter thanks for the review. I actually meant to remove that. In regard to most other comments, a lot of them are inherited from the existing code or mis-steps by me when splitting out my mega PR (I forgot to take my changes from utils.py, for example). Will take another pass here. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
e2e workflow succeeded on this PR: View run, congrats!
booxter left a comment:

The bnb question should be addressed before merging. Do we need it? Is it ok to drop it here?
```
base_model_args = {
    "pretrained_model_name_or_path": args.model_name_or_path,
    "torch_dtype": torch.bfloat16,
    "quantization_config": bnb_config,
```
Do you have an answer to this? Should the drop be included here?
src/instructlab/training/model.py
Outdated
```
self.reconcile_tokenizer()
if self.lora_config:
    # First Party
```
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
I changed
booxter left a comment:

This looks reasonable. It's hard to review a large patch line by line through multiple iterations, so this follow-up review focused on the high-level question of whether my prior feedback was addressed. I think it was (bnb restored; logging module used; duplicate functions cleaned up; accelerator class removed; etc.)
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
Introduce a new design for key components of main_ds.py, namely splitting Model initialization, Accelerator initialization, Optimizer initialization, and Checkpoint saving initialization into classes. This commit introduces the Model class.

NOTE: a follow-up to this work will be to introduce classes/structure for the DataLoader, Sampler, etc. This was left out of this PR given the already large scope of change.

The Model class wraps the various AutoModel classes we support, and aims to be a lightweight wrapper to help with usability of the library with different model types. setup_optimizer resides within the Model class and returns one of the optimizer types we support.

These classes are one of a few steps needed to "SDK-ify" the training library.

Adding structure to code via classes can be either someone's favorite or least favorite thing, so I figured I'd explain myself before continuing. Here is my rationale:

Classes provide logical structure to code, especially code meant to be a publicly consumable SDK, and allow you to associate related objects and methods with one another. Grouping functionality under the Model, Accelerator, and Checkpointer classes inherently reduces code complexity and duplication. Storing things like self.distributed_framework, self.lora_config, etc. so that they are accessible from different methods within the class drastically reduces the number of arguments per method, as well as complex return values. Simpler methods and argument/return values allow for simpler testing of code.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
```
    lora_config: Optional[LoraConfig] = None,
    lora_quant_bits: int = 0,
):
    self.lora_config = lora_config
```
I think lora_config should not be put inside the Model class; it should act as a wrapper to our model. We can deliberate this in a further issue/PR.
holding for the L40s test to pass
fynnsu left a comment:

I support moving quickly with these PRs so that we can start to refine the final shape of the new SDK-style codebase. This is reasonable for now, pending future PRs to update the other components.
e2e workflow succeeded on this PR: View run, congrats!