It would be quite useful if RGA could spit out the number of cycles per functional group (SMEM, VMEM, SALU, etc.). Of course, no one can predict the actual cycle count of each instruction, but an approximation would also work. Maybe the cycles reported by RGA could be modeled as if all memory accesses hit the cache, or something similar. It doesn't need to be accurate.
Something like this, for example: