Why can't TP be used with pure DP?

According to [this check](https://github.com/huggingface/accelerate/blob/b9ca0de682f25f15357a3f9f1a4d94374a1d451d/src/accelerate/parallelism_config.py#L332), TP cannot be used together with pure DP (DDP). We also need to shard the model across additional nodes by specifying `dp_shard_size`. Why does this limitation exist? Is it only a software limitation?
Please share any documentation, code references, and justification for it.
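To make the question concrete, here is a minimal sketch of the restriction as I understand it. `MeshSizes` and its methods are hypothetical names used only for illustration and are not accelerate's API. The field names `dp_replicate_size` and `tp_size` are assumed to mirror the config alongside `dp_shard_size`, and the world size being the product of the three is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class MeshSizes:
    """Hypothetical stand-in for the parallelism sizes; not accelerate's class."""

    dp_replicate_size: int = 1  # pure DP (DDP) replicas
    dp_shard_size: int = 1      # sharded DP group size
    tp_size: int = 1            # tensor-parallel group size

    def world_size(self) -> int:
        # Assumed: total ranks is the product of all mesh dimensions.
        return self.dp_replicate_size * self.dp_shard_size * self.tp_size

    def is_tp_with_pure_dp(self) -> bool:
        # The combination in question: TP plus replication, with no sharding.
        return (
            self.tp_size > 1
            and self.dp_replicate_size > 1
            and self.dp_shard_size == 1
        )


# TP=2 with 4 pure DP replicas: the layout that appears to be rejected.
rejected = MeshSizes(dp_replicate_size=4, tp_size=2)

# TP=2 with a sharded DP group of 4: the layout that appears to be allowed.
accepted = MeshSizes(dp_shard_size=4, tp_size=2)
```

Both layouts use the same number of ranks, so the difference is only in how the data-parallel dimension is handled, which is why I am asking whether the restriction is fundamental.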

What do we need to do in order to run TP+DP?

Why TP can't be used with pure DP? #3876
