Hi @lucidrains, thanks for this implementation.
I was wondering whether you use distributed training in your experiments. If so, do you scale the learning rate by the number of processes (GPUs), as suggested in Accelerate's docs, on top of the downscaling recommended for the Lion optimizer (even if you're not using Accelerate itself)?
If you don't scale learning rate, do you recommend doing so?
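For concreteness, here's a minimal sketch of the scaling I'm asking about, assuming the `lion-pytorch` API (`Lion(params, lr=..., weight_decay=...)`) and Accelerate's linear-scaling suggestion; the base LR value and the toy model are just placeholders:

```python
import torch
from accelerate import Accelerator
from lion_pytorch import Lion

accelerator = Accelerator()

# Stand-in model, just for illustration.
model = torch.nn.Linear(10, 10)

# Base lr already downscaled ~3-10x vs. a typical AdamW lr, per the Lion paper.
base_lr = 1e-4

# Accelerate's docs suggest scaling the lr linearly with the number of
# processes, since the effective batch size grows with each added GPU.
scaled_lr = base_lr * accelerator.num_processes

optimizer = Lion(model.parameters(), lr=scaled_lr, weight_decay=1e-2)
model, optimizer = accelerator.prepare(model, optimizer)
```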