Investigate and fix the TP=8 GSM8K accuracy issue for the nano3 model. See https://github.com/NVIDIA/TensorRT-LLM/pull/8744 and the comment at https://github.com/NVIDIA/TensorRT-LLM/pull/8744#issuecomment-3489456944