Hi @lucidrains, thanks for this implementation.
I was wondering whether you use distributed training in your experiments. If so, do you scale the learning rate by the number of processes (GPUs), as suggested in Accelerate's docs, on top of the downscaling recommended for the Lion optimizer (even if you're not using Accelerate itself)?
If you don't scale learning rate, do you recommend doing so?
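For concreteness, here's a minimal sketch of the scaling I'm asking about, assuming the `lion-pytorch` API (`Lion(params, lr=..., weight_decay=...)`) and Accelerate's linear-scaling suggestion; the base LR value and the toy model are just placeholders:

```python
import torch
from accelerate import Accelerator
from lion_pytorch import Lion

accelerator = Accelerator()

# Stand-in model, just for illustration.
model = torch.nn.Linear(10, 10)

# Base lr already downscaled ~3-10x vs. a typical AdamW lr, per the Lion paper.
base_lr = 1e-4

# Accelerate's docs suggest scaling the lr linearly with the number of
# processes, since the effective batch size grows with each added GPU.
scaled_lr = base_lr * accelerator.num_processes

optimizer = Lion(model.parameters(), lr=scaled_lr, weight_decay=1e-2)
model, optimizer = accelerator.prepare(model, optimizer)
```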