Conversation

@kmehant (Collaborator) commented Feb 18, 2025

Extend support to GraniteMoeSharedForCausalLM for all the fast MoE features (fast kernels and expert parallelism, EP) as well as padding-free.

Summary of changes

  • extend support to GraniteMoeSharedForCausalLM
  • update a100_80gb_moe.csv | bench for Granite MoE (MoE and MoE-shared Granite models)
  • update a100_80gb_moe.csv | bench for Mixtral
  • update requirements_moe.txt
  • regression test for the above runs.

Final regression plots, including Granite MoE and Mixtral.

[Plots: compare-mem_nvidia_mem_reserved (memory increased), compare-train_loss (loss holds parity), compare-train_tokens_per_second (throughput increased)]

Outliers

[Screenshot: outlier entries, 2025-02-21 12:48 PM]

OOM experiments

Mixtral OOM due to extra memory consumption

Run configuration (metric columns were empty because the run OOMed):

  model_name_or_path: mistralai/Mixtral-8x7B-Instruct-v0.1
  framework_config: none
  num_gpus: 8
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  torch_dtype: bfloat16
  mem_nvidia_mem_reserved: 78783.5
  epoch, train_loss, train_runtime, train_samples_per_second, train_steps_per_second, train_tokens_per_second: (empty)

Also note:

  • ibm-research/moe-7b-1b-active-shared-experts has a number of experts that is not divisible by 4, so ep4 (expert parallelism of degree 4) does not apply; see the sketch below
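For illustration, a minimal sketch of the divisibility constraint (hypothetical expert counts, not the model's actual configuration):

# The expert-parallel degree must evenly divide the number of experts so that
# every expert-parallel rank owns the same number of experts.
def ep_degree_applies(num_experts: int, ep_degree: int) -> bool:
    return num_experts % ep_degree == 0

print(ep_degree_applies(8, 4))   # True  -> ep4 applies
print(ep_degree_applies(10, 4))  # False -> ep4 does not apply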

Reference issue on regression - #128

@kmehant kmehant force-pushed the support-sharedmoe-granite branch from 321f6e9 to acd6d19 on February 18, 2025 13:35
@kmehant kmehant marked this pull request as ready for review February 18, 2025 14:02
@kmehant kmehant requested a review from fabianlim as a code owner February 18, 2025 14:02
@fabianlim fabianlim requested a review from willmj February 18, 2025 14:03
@kmehant kmehant force-pushed the support-sharedmoe-granite branch 2 times, most recently from 270123d to 0615a81 on February 18, 2025 15:34
Signed-off-by: Mehant Kammakomati <[email protected]>
@kmehant kmehant force-pushed the support-sharedmoe-granite branch from 0615a81 to 353b623 on February 18, 2025 15:45
logic="APPEND",
),
),
ModelPatcherRule(
Contributor:

Refer to how Granite does it.

packing: False
adam_epsilon: 1e-8
model_name_or_path:
- 'ibm-research/moe-7b-1b-active-shared-experts'
Contributor:

You shouldn't copy and paste this; just add to model_name_or_path in the existing scenarios:

model_name_or_path:
  - 'ibm-granite/granite-3.0-3b-a800m-instruct'
  - 'ibm-research/moe-7b-1b-active-shared-experts'

- moe-scattermoe-granite-ep4-padding-free-foak
arguments:
learning_rate: 5e-5
torch_dtype: bfloat16
Contributor:

Same here, don't copy and paste; if it's all the same arguments there is no need.

I don't understand why you need this different bench.

Collaborator Author:

I wanted to run only the MoE shared model, so I had to copy and paste. Is there a way I can sub-select a model along with a scenario using scenariofilter? Apologies if I have missed that.

Contributor:

Sorry, you can't. If you want to do ad hoc testing, just comment out the other models you don't want to test. For the official bench we need to update all models; this is because we version only one set of requirements for reproducibility, and we can't have partial benches running, otherwise there will be inconsistency.

[
"--gradient_accumulation_steps",
str(effective_batch_size // num_gpus // pdtbs),
str(1 if gas == 0 else gas),
Contributor:

OK, this works, but I don't understand why you need it, because your benches use the same parameters as the existing ones, and we don't run into this issue.
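For context, a minimal sketch of why the value is clamped to 1 (hypothetical helper name and numbers, not the benchmark code itself):

# Integer division rounds down to 0 when the effective batch size is smaller
# than num_gpus * per_device_train_batch_size, so the result is clamped to 1.
def gradient_accumulation_steps(effective_batch_size: int, num_gpus: int, pdtbs: int) -> int:
    gas = effective_batch_size // num_gpus // pdtbs
    return 1 if gas == 0 else gas

assert gradient_accumulation_steps(128, 8, 4) == 4  # 128 // 8 // 4
assert gradient_accumulation_steps(8, 8, 4) == 1    # would be 0, clamped to 1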

Collaborator Author:

One thing I noticed is that the experiments continue silently even when some of them have failed, and the benchmark report gets generated regardless.

Contributor:

Each job is independent. The bench will run all jobs, and the jobs that fail will have empty reports.

@willmj (Collaborator) left a comment:

Once Fabian's comments have been addressed, this looks good to me except for a minor nit; I would be interested to see benchmarks as well if available.

Signed-off-by: Mehant Kammakomati <[email protected]>
@fabianlim (Contributor) commented:

@kmehant yes, the code changes look good; what is missing now is the updating of the benches. So you need to run the full bench for scenarios-moe.yaml now in order to update a100_80gb_moe.csv and requirements_moe.txt.

  • Please run the bench using the documented instructions; running the actual bench in ad hoc ways causes inconsistencies.
  • There are two suites, accelerated-moe-scatter and accelerated-moe-scatter-mixtral; their commands are separate, and then the results are stitched together. If this is too troublesome, you can split them: take out the Mixtral benches and put them in a separate requirements and CSV file (maybe call them moe_mixtral.csv, requirements_moe_mixtral.txt, etc.).
  • Then run one full bench for accelerated-moe-scatter for both models and update accordingly.

Signed-off-by: Mehant Kammakomati <[email protected]>
@kmehant (Collaborator Author) commented Feb 20, 2025

  • update a100_80gb_moe.csv | bench for Granite MoE (MoE and MoE-shared Granite models)
  • update a100_80gb_moe.csv | bench for Mixtral
  • update requirements_moe.txt
  • regression test for the above runs.

@fabianlim (Contributor) commented Feb 20, 2025

@willmj and @anhuong, can either of you help guide @kmehant on how to produce the regression plots?

@kmehant when updating the benches we want to make sure we do not suffer any worsening in performance compared to the current bench, so we have some convenience scripts to produce plots like those you see in the PR. We will then record these regressions as a form of record keeping.
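For illustration, a minimal sketch (hypothetical file names and metric columns, not the repository's actual convenience script) of how a reference benchmark CSV could be compared against a new one to produce such plots:

# Minimal sketch: plot reference vs. new benchmark values for a few metrics.
# File names below are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

ref = pd.read_csv("a100_80gb_moe_reference.csv")
new = pd.read_csv("a100_80gb_moe.csv")

for metric in ["mem_nvidia_mem_reserved", "train_loss", "train_tokens_per_second"]:
    fig, ax = plt.subplots()
    ax.plot(ref[metric].values, marker="o", label="reference")
    ax.plot(new[metric].values, marker="x", label="new")
    ax.set_title(f"compare-{metric}")
    ax.set_xlabel("experiment index")
    ax.set_ylabel(metric)
    ax.legend()
    fig.savefig(f"compare-{metric}.png")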

@kmehant (Collaborator Author) commented Feb 20, 2025

#126 (comment)

Got it @fabianlim, do you have those scripts handy?

@kmehant (Collaborator Author) commented Feb 20, 2025

@fabianlim

Memory use increased: [plot compare-mem_nvidia_mem_reserved]

Train loss parity holds: [plot compare-train_loss]

Throughput increased in the new benchmark: [plot compare-train_tokens_per_second]

@kmehant (Collaborator Author) commented Feb 20, 2025

Outliers

[Screenshot: outlier entries, 2025-02-20 2:31 PM]

@fabianlim (Contributor) commented:

@kmehant please take note of my comment here:

  • you must preserve the Mixtral bench, you cannot delete it

@kmehant (Collaborator Author) commented Feb 20, 2025

@fabianlim apologies for the confusion. I was sharing intermediate results and regressions. I am tracking it here; Mixtral is currently running, and I will update the regression and results once done. Thank you.

Signed-off-by: Mehant Kammakomati <[email protected]>
@kmehant (Collaborator Author) commented Feb 21, 2025

Final regression plots, including Mixtral.

[Plots: compare-mem_nvidia_mem_reserved (memory increased), compare-train_loss (loss holds parity), compare-train_tokens_per_second (throughput increased)]

Outliers

[Screenshot: outlier entries, 2025-02-21 12:48 PM]

Comment on lines 16 to 21
-e git+https://github.com/kmehant/fms-acceleration.git@d0560e0c652dc2f72fdad07a2fb7af8fe67bbf08#egg=fms_acceleration&subdirectory=plugins/framework
-e git+https://github.com/kmehant/fms-acceleration.git@d0560e0c652dc2f72fdad07a2fb7af8fe67bbf08#egg=fms_acceleration_aadp&subdirectory=plugins/attention-and-distributed-packing
-e git+https://github.com/kmehant/fms-acceleration.git@d0560e0c652dc2f72fdad07a2fb7af8fe67bbf08#egg=fms_acceleration_foak&subdirectory=plugins/fused-ops-and-kernels
-e git+https://github.com/kmehant/fms-acceleration.git@d0560e0c652dc2f72fdad07a2fb7af8fe67bbf08#egg=fms_acceleration_moe&subdirectory=plugins/accelerated-moe
-e git+https://github.com/kmehant/fms-acceleration.git@d0560e0c652dc2f72fdad07a2fb7af8fe67bbf08#egg=fms_acceleration_peft&subdirectory=plugins/accelerated-peft
fms-hf-tuning @ git+https://github.com/foundation-model-stack/fms-hf-tuning.git@fdc7527510692ada03e4303df1549cebc5139b31
Collaborator Author:

I did a pip freeze to update this; however, it would show my forks of fms-acceleration. Should we keep it as is?

Contributor:

No, please delete.

Collaborator Author:

Updated the requirements file, thanks.

tox.ini Outdated

# install the flash attn at the last
pip install flash-attn
pip install flash-attn --no-build-isolation
Contributor:

Does this really need to change?

Collaborator Author:

I had a "torch module not found" issue with tox. I had to do this.

@fabianlim (Contributor) commented Feb 21, 2025:

I'm a bit confused, because I believe torch is installed with fms_hf_tuning and our tox file installs that, and tox is completely reproducible. So if you had seen this, we must have seen it before.

Contributor:

How about this: maybe you can leave out this commit and make an issue saying you want to add it in.

Collaborator Author:

Sure I can do that.

@fabianlim (Contributor) left a comment:

@kmehant can you put your final regression plots into the main description? Otherwise I think it looks good. Let's have @willmj take one look at it before merge.

@kmehant kmehant force-pushed the support-sharedmoe-granite branch from 69a684f to 0a04dcc on February 21, 2025 07:34
@kmehant kmehant force-pushed the support-sharedmoe-granite branch from 0a04dcc to d4256f6 on February 21, 2025 07:35
Signed-off-by: Mehant Kammakomati <[email protected]>
@kmehant kmehant requested a review from willmj February 21, 2025 07:37
@willmj (Collaborator) left a comment:

LGTM

@kmehant (Collaborator Author) commented Feb 21, 2025

@fabianlim @willmj requesting merge.

@fabianlim (Contributor) commented:

@kmehant sorry, I just noticed you only have 2 Mixtral entries, but we used to have 3. Can you check this?
[Screenshot of the previous Mixtral bench entries]

@kmehant (Collaborator Author) commented Feb 21, 2025

#126 (comment)

The FSDP variant which was previously part of the bench got OOMed, @fabianlim.

@fabianlim (Contributor) commented:

All the OOMs need to be recorded; can you do the same in the main comment @kmehant, and please open an issue and link it here for tracking.

@kmehant (Collaborator Author) commented Feb 22, 2025

#126 (comment)

@fabianlim completed!

@fabianlim fabianlim merged commit c959c6b into foundation-model-stack:main Feb 24, 2025
7 checks passed