[QEff. Finetuning]: Adding tests for PP in HF trainer stack #817
quic-swatia wants to merge 3 commits into quic:ft_experimental
Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
quic-akuruvil
left a comment
Please correct the lint error.
QEfficient/finetune/experimental/tests/test_pipeline_parallelism.py
    assert not overlap, f"Stages {s_idx} and {t_idx} share layers {overlap} – stages must be disjoint."

    # --- 5. Balance: each stage has base or base+1 layers -----------------
    base, remainder = divmod(num_layers, pp_degree)
How is balancing ensured here?
What is the strategy/logic used for splitting model layers across the devices?
Consider an example: num_layers = 22, num_stages = 5
With 'base, remainder = divmod(num_layers, pp_degree)', base and remainder turn out to be 4 and 2, respectively.
Lines #134-139, with 'expected_count = base + (1 if stage_idx < remainder else 0)', check that each of the first two (#remainder) devices has 5 (base + 1) layers and each of the last 3 (num_stages - remainder) devices has 4 (base) layers. This ensures the layers are balanced among the devices.
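The arithmetic in this example can be checked with a small standalone sketch. The function name `expected_layers_per_stage` is hypothetical and is not taken from the test file. Only the `divmod` split and the `expected_count` expression come from the snippet quoted above.

```python
from typing import List


def expected_layers_per_stage(num_layers: int, pp_degree: int) -> List[int]:
    """Hypothetical helper: expected layer count for each pipeline stage.

    The first `remainder` stages get `base + 1` layers and the remaining
    stages get `base` layers, mirroring the expression in the test.
    """
    base, remainder = divmod(num_layers, pp_degree)
    return [base + (1 if stage_idx < remainder else 0) for stage_idx in range(pp_degree)]


counts = expected_layers_per_stage(num_layers=22, pp_degree=5)
print(counts)  # [5, 5, 4, 4, 4]

# Every layer is assigned exactly once, and no two stages differ by more than one layer.
assert sum(counts) == 22
assert max(counts) - min(counts) <= 1
```

Because stage sizes differ by at most one layer, this is the most even split possible when num_layers is not divisible by the number of stages.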
quic-akuruvil
left a comment
Are all these test cases passing locally?
quic-akuruvil
left a comment
The test cases look good, with extensive coverage.
QEfficient/finetune/experimental/tests/test_pipeline_parallelism.py
QAIC_VISIBLE_DEVICES=0 python -m pytest QEfficient/finetune/experimental/tests/
This is how we currently run the existing tests. Based on this, please mention how to run the PP tests, and include sample commands in the docs.
Ideally, all tests in the project should run when the above command is executed, so let's keep it that way.
Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
Yes, all of them are passing locally.
Two tests in this file require at least 2 visible devices. If only 1 device is passed, those 2 tests are skipped. These two tests also run when QAIC_VISIBLE_DEVICES is not set at all and the machine has >= 2 devices.
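The skip behavior described above could be sketched roughly as follows. This is an illustration only, not the logic in the test file: the helper names are hypothetical, and it assumes that QAIC_VISIBLE_DEVICES holds a comma-separated list of device IDs, which this thread does not confirm.

```python
import os
from typing import Mapping, Optional


def visible_device_count(physical_devices: int,
                         env: Optional[Mapping[str, str]] = None) -> int:
    """Hypothetical helper: how many devices the tests may use.

    Assumption: QAIC_VISIBLE_DEVICES is a comma-separated list of device IDs.
    If the variable is unset, all physical devices count as visible.
    """
    env = os.environ if env is None else env
    raw = env.get("QAIC_VISIBLE_DEVICES")
    if raw is None:
        return physical_devices
    ids = [tok for tok in raw.split(",") if tok.strip()]
    return min(len(ids), physical_devices)


def should_skip_pp_test(physical_devices: int,
                        env: Optional[Mapping[str, str]] = None,
                        required: int = 2) -> bool:
    """Skip when fewer than `required` devices are visible."""
    return visible_device_count(physical_devices, env) < required


# One visible device: the two multi-device tests would be skipped.
print(should_skip_pp_test(4, {"QAIC_VISIBLE_DEVICES": "0"}))  # True
# Variable unset on a machine with 2 devices: the tests would run.
print(should_skip_pp_test(2, {}))  # False
```

In a pytest suite, a predicate like this would typically feed `pytest.mark.skipif`, so that running the whole test directory with a single visible device still passes.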
Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
No description provided.