Enable autotuning for distributed kernels launched via torchrun. An event-based benchmarker is available at `autotuner/benchmarker.py` in PR https://github.com/pytorch/helion/pull/393.

## How to enable

1. Make sure all torchrun workers benchmark the same configs in the same order.
2. The master rank (rank 0) decides on the best config and communicates it to all processes.
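The two requirements above can be sketched as a small protocol, independent of any particular collective library. This is a conceptual sketch, not the PR's implementation: `config_order`, `autotune`, and the `broadcast_from_rank0` callback are hypothetical names, and the fake in-process "broadcast" below stands in for a real collective such as `torch.distributed.broadcast_object_list`.

```python
def config_order(configs):
    # Requirement 1: every rank must benchmark the same configs in the
    # same order. Sorting by a stable key (here: repr) makes the order
    # deterministic regardless of how each rank enumerated the space.
    return sorted(configs, key=repr)

def autotune(rank, configs, benchmark, broadcast_from_rank0):
    ordered = config_order(configs)
    timings = [benchmark(c) for c in ordered]  # all ranks benchmark in lockstep
    # Requirement 2: only rank 0 picks the winner...
    best = min(range(len(ordered)), key=timings.__getitem__) if rank == 0 else None
    # ...and broadcasts its index so every rank ends up with the same config.
    best = broadcast_from_rank0(best)
    return ordered[best]

# Single-process simulation of a 2-rank job: a shared dict plays the
# role of the collective broadcast (rank 0 writes, others read).
shared = {}
def make_broadcast(rank):
    def bcast(value):
        if rank == 0:
            shared["best"] = value
        return shared["best"]
    return bcast

configs = [{"block": 64}, {"block": 128}, {"block": 32}]
fake_benchmark = lambda c: abs(c["block"] - 64)  # pretend block=64 is fastest
picked_rank0 = autotune(0, configs, fake_benchmark, make_broadcast(0))
picked_rank1 = autotune(1, configs, fake_benchmark, make_broadcast(1))
assert picked_rank0 == picked_rank1 == {"block": 64}
```

In a real torchrun job the broadcast callback would wrap a collective over the process group, so the decision made on rank 0 reaches every worker before any rank compiles with the chosen config.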