Can we use a custom kernel with atomics for [`∇getindex!(dx::AbstractGPUArray, dy, inds...)`](https://github.com/JuliaDiff/ChainRules.jl/blob/2db48540b2fd9943c4e6db92dba1ee1e8f7f8550/src/rulesets/Base/indexing.jl#L183) instead of copying everything to the CPU? That way we'd avoid synchronizations, and the kernel could be added via an extension.
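
To make the idea concrete, here is a rough sketch of what such a kernel might look like, assuming KernelAbstractions.jl and Atomix.jl are acceptable extension dependencies. The names `scatter_add_kernel!` and `scatter_add!` are hypothetical, and it only covers a 1-D `dx` indexed by a vector of integers, not the general `inds...` that `∇getindex!` accepts:

```julia
using KernelAbstractions
using Atomix

# Hypothetical kernel: each work-item adds one element of dy into dx.
# The atomic update makes repeated indices accumulate instead of racing.
@kernel function scatter_add_kernel!(dx, @Const(dy), @Const(inds))
    i = @index(Global, Linear)
    Atomix.@atomic dx[inds[i]] += dy[i]
end

# Hypothetical wrapper, restricted to vector-of-integer indices (an assumption).
function scatter_add!(dx::AbstractVector, dy::AbstractVector,
                      inds::AbstractVector{<:Integer})
    backend = get_backend(dx)  # CPU() for Array, the GPU backend for GPU arrays
    scatter_add_kernel!(backend)(dx, dy, inds; ndrange = length(inds))
    return dx
end

# Runs on the CPU backend too, which is handy for checking correctness.
dx = zeros(Float32, 4)
scatter_add!(dx, Float32[1, 2, 3], [1, 1, 3])
KernelAbstractions.synchronize(get_backend(dx))  # only needed before reading dx on the host
```

The launch itself is asynchronous on GPU backends, so the gradient accumulation would not force a device-to-host round trip. Whether this is fast enough for heavily repeated indices (where atomics contend) would need benchmarking.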