[QEff. Finetuning]: Adding tests for PP in HF trainer stack#817

Open
quic-swatia wants to merge 3 commits into quic:ft_experimental from quic-swatia:pp-test

Conversation

@quic-swatia
Contributor

No description provided.

Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
Contributor

@quic-akuruvil quic-akuruvil left a comment


Please correct the lint error.

assert not overlap, f"Stages {s_idx} and {t_idx} share layers {overlap} – stages must be disjoint."

# --- 5. Balance: each stage has base or base+1 layers -----------------
base, remainder = divmod(num_layers, pp_degree)
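The disjointness assertion quoted above can be exercised in isolation. The sketch below is illustrative only (the `stages` layout is a made-up example, not the actual test data):

```python
# Minimal illustration of the pairwise disjointness check quoted above.
stages = [[0, 1, 2], [3, 4], [5, 6]]  # hypothetical layer indices per stage

for s_idx in range(len(stages)):
    for t_idx in range(s_idx + 1, len(stages)):
        overlap = set(stages[s_idx]) & set(stages[t_idx])
        assert not overlap, (
            f"Stages {s_idx} and {t_idx} share layers {overlap} - "
            "stages must be disjoint."
        )
```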
Contributor


How is balancing ensured here?

Contributor


What is the strategy/logic used for splitting model layers across the devices?

Contributor Author


Consider an example: num_layers = 22, num_stages = 5.
With 'base, remainder = divmod(num_layers, pp_degree)', base and remainder turn out to be 4 and 2, respectively.

Lines #134-139 ('expected_count = base + (1 if stage_idx < remainder else 0)') check that each of the first two (remainder) devices has 5 (base + 1) layers, and each of the last 3 (num_stages - remainder) devices has 4 (base) layers, hence ensuring balancing amongst devices.
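The balance logic described above can be sketched as follows (a minimal standalone illustration, not the actual test code):

```python
# Worked example of the balance check: 22 layers over 5 pipeline stages.
num_layers, pp_degree = 22, 5
base, remainder = divmod(num_layers, pp_degree)  # base=4, remainder=2

# First `remainder` stages get base + 1 layers; the rest get base layers.
counts = [base + (1 if stage_idx < remainder else 0)
          for stage_idx in range(pp_degree)]

assert counts == [5, 5, 4, 4, 4]
assert sum(counts) == num_layers  # every layer is assigned exactly once
```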


@quic-akuruvil quic-akuruvil left a comment


All these test cases are passing locally?


@quic-akuruvil quic-akuruvil left a comment


Test cases look good, with extensive coverage.


@quic-akuruvil quic-akuruvil left a comment


QAIC_VISIBLE_DEVICES=0 python -m pytest QEfficient/finetune/experimental/tests/
Currently this is how we run the existing tests. Based on this, mention in the docs how to run the PP tests, and include sample commands too.

Ideally, all tests in the project should run when the above command is executed, so let's keep it like that.

Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>
@quic-swatia
Contributor Author

> All these test cases are passing locally?

Yes, all of them are passing locally.

@quic-swatia
Contributor Author

quic-swatia commented Mar 25, 2026

> QAIC_VISIBLE_DEVICES=0 python -m pytest QEfficient/finetune/experimental/tests/ Currently this is how we run the existing tests. So based on this mention how to run PP tests. Include sample commands too, in docs
>
> Ideally all tests in the project should run when above command is executed. So lets keep it like that

2 tests in this file require # visible devices = 2. If 1 is passed, those 2 tests are skipped. These two tests also run when QAIC_VISIBLE_DEVICES is not set at all and the machine has >= 2 devices.
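The skip behavior described above could be implemented along these lines. This is a hypothetical sketch: the helper name `get_num_visible_qaic_devices` and the fallback value are illustrative assumptions, not the actual test code, and the resulting flag would typically feed a `pytest.mark.skipif` marker:

```python
import os

def get_num_visible_qaic_devices(default: int = 2) -> int:
    # Hypothetical helper: count devices listed in QAIC_VISIBLE_DEVICES
    # (e.g. "0" or "0,1"). When the variable is unset, fall back to the
    # machine's own device count; `default` stands in for that probe here.
    visible = os.environ.get("QAIC_VISIBLE_DEVICES")
    if visible is None:
        return default
    return len([d for d in visible.split(",") if d.strip()])

# The two PP tests would be skipped when fewer than 2 devices are visible.
skip_pp_tests = get_num_visible_qaic_devices() < 2
```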

Signed-off-by: Swati Allabadi <sallabad@qti.qualcomm.com>