when training deepseek v3 model using cuda_graph, it reports the following error:
Checkpointing is not compatible with .grad(), please use .backward() if possible
setting the following args:
--recompute-granularity selective
--recompute-modules mla_up_proj mlp
--cuda-graph-impl transformer_engine
--cuda-graph-scope full