Skip node output dump for MemcpyToHost (microsoft#25651)

tianleiwu · gedoensmax · commit 6958ef1db6c6 · 2025-09-02T11:22:17.000+02:00
Fix the node output dump for MemcpyToHost. The dumped statistics are
incorrect because the data may not have been copied to the CPU yet:

```
MemcpyToHost node: Memcpy_token_232
Input 0 Name: /model/layers.6/moe/router/Add/output_0_CUDAExecutionProvider
 Shape: {1,1,32}
OrtMemoryInfo:[name:Cuda OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 VendorId:4318 DeviceId:0 Alignment:0]]
Min=-2.5136719,Max=1.6914062
-----------
Output 0 Name: /model/layers.6/moe/router/Add/output_0
 Shape: {1,1,32}
Min=-4888,Max=6672,NaN=2
```

With this fix, the output dump (or statistics) is skipped for MemcpyToHost, and the output looks like:
```
-----------
Output 0 Name: /model/layers.6/moe/router/Add/output_0
 Shape: {1,1,32}
 is same as input.
```
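
The skip logic can be illustrated outside of onnxruntime with a minimal sketch. `FakeNode` and `DumpOutputSketch` are hypothetical names invented for this example, not part of the onnxruntime API, and `host_data` only stands in for an output buffer that may still hold stale values before the copy is synchronized:

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node; NOT the onnxruntime Node class.
struct FakeNode {
  std::string op_type;
  std::string output_name;
};

// Builds the dump text for one output. `host_data` plays the role of the
// output buffer as seen from the CPU; for an unsynchronized device-to-host
// copy it may not contain the final values yet.
std::string DumpOutputSketch(const FakeNode& node,
                             const std::vector<float>& host_data) {
  std::ostringstream os;
  os << "Output 0 Name: " << node.output_name << "\n";

  // Same idea as the guard in the diff: never read the buffer for
  // MemcpyToHost, since the input tensor was already dumped.
  if (node.op_type == "MemcpyToHost") {
    os << " is same as input.\n";
    return os.str();
  }

  // For every other op type, compute simple statistics over the data.
  if (!host_data.empty()) {
    const auto mm = std::minmax_element(host_data.begin(), host_data.end());
    os << "Min=" << *mm.first << ",Max=" << *mm.second << "\n";
  }
  return os.str();
}
```

In this sketch, a node with op type `MemcpyToHost` never has its buffer inspected, so stale values cannot reach the statistics; any other op type is dumped as before.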
diff --git a/onnxruntime/core/framework/debug_node_inputs_outputs_utils.cc b/onnxruntime/core/framework/debug_node_inputs_outputs_utils.cc
@@ -667,6 +667,13 @@ void DumpNodeOutputs(
             const bool is_shape_set = (dump_options.dump_flags & NodeDumpOptions::DumpFlags::Shape) != 0;
             PrintIf(is_shape_set, MakeString(" Shape: ", shape, "\n"));
 
+            // For MemcpyToHost, the memory copy has not been synchronized so the data is not ready to read yet.
+            // Here we skip it since it is just a copy of input tensor (or output of previous node) which has been dumped.
+            if (node.OpType() == "MemcpyToHost") {
+              std::cout << " is same as input.\n";
+              continue;
+            }
+
             if ((dump_options.dump_flags & NodeDumpOptions::DumpFlags::OutputData) != 0 || check_half_overflow) {
               tensor_metadata.name = output_defs[i]->Name();
               tensor_metadata.step = dump_context.iteration;