We are seeing ~2x worse performance with `ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE` than with `ZE_FLAT_DEVICE_HIERARCHY=FLAT`.

<img width="722" height="434" alt="Image" src="https://github.com/user-attachments/assets/36d710b8-79a6-4ffb-a973-66d98f229cc1" />

See suggestions for other environment variables [here](https://github.com/alan-turing-institute/aurora-hpc/issues/17#issuecomment-3070014507).

TODO:
- [x] Try `I_MPI_OFFLOAD`
- [x] Try `I_MPI_OFFLOAD_SYMMETRIC`
- [ ] Dump `xpu-smi` stats when running jobs
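
To make the COMPOSITE vs FLAT comparison easier to repeat, here is a minimal sketch that runs the same command under each mode and reports wall-clock time. It assumes the variable in question is Level Zero's `ZE_FLAT_DEVICE_HIERARCHY`; the helper names `build_env` and `time_command` are hypothetical, and the benchmark command is whatever job you already run.

```python
import os
import subprocess
import sys
import time

# Assumption: the hierarchy setting discussed in this issue is controlled by
# Level Zero's ZE_FLAT_DEVICE_HIERARCHY environment variable.
HIERARCHY_VAR = "ZE_FLAT_DEVICE_HIERARCHY"


def build_env(mode, base_env=None):
    """Return a copy of the environment with the hierarchy mode set."""
    env = dict(os.environ if base_env is None else base_env)
    env[HIERARCHY_VAR] = mode
    return env


def time_command(cmd, mode):
    """Run cmd once under the given hierarchy mode; return wall-clock seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(cmd, env=build_env(mode), check=True)
    return time.perf_counter() - start
```

For example, `for mode in ("COMPOSITE", "FLAT"): print(mode, time_command(cmd, mode))`, where `cmd` is the benchmark's argument list. A single run per mode is noisy, so repeating each mode a few times would give a fairer comparison.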