Replies: 3 comments 1 reply
-
@guangyey, I'm not sure whether it is a bug. It is suspicious. We need to trace the memory footprint to understand the logic.
-
In summary, during the fine-tuning process XPU reserved about 8 GB more VRAM than CUDA, which makes it impossible to port the workload to BMG-12GB. This might not be a bug, so I am posting it here for discussion.
Steps to reproduce:
What I have tried:
I registered a hook during backward; a sketch of this kind of instrumentation is shown below.
I also printed the memory stats between steps.
The log shows ~8 GB already allocated even before backward reaches the first operator.
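A minimal sketch of this kind of instrumentation, assuming a recent PyTorch XPU build where torch.xpu.memory_allocated() / torch.xpu.memory_reserved() mirror the torch.cuda API (the exact hook form and names here are illustrative, not my literal code):

```python
import torch
import torch.nn as nn

def log_xpu_memory(tag: str) -> None:
    # Mirrors the torch.cuda memory API on the XPU backend.
    alloc = torch.xpu.memory_allocated() / 1024**2
    reserved = torch.xpu.memory_reserved() / 1024**2
    print(f"[{tag}] allocated={alloc:.1f} MiB  reserved={reserved:.1f} MiB")

def attach_backward_hooks(model: nn.Module) -> None:
    # Log the allocator state every time a module's backward runs, so the
    # first module that sees the extra ~8 GB stands out in the log.
    def make_hook(name: str):
        def hook(module, grad_input, grad_output):
            log_xpu_memory(f"backward:{name}")
        return hook

    for name, module in model.named_modules():
        module.register_full_backward_hook(make_hook(name))

# In the training loop (model, batch, compute_loss are placeholders):
#   attach_backward_hooks(model)
#   log_xpu_memory("before forward")
#   loss = compute_loss(model, batch)
#   log_xpu_memory("before backward")
#   loss.backward()
#   log_xpu_memory("after backward")
```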
I modified the XPU allocator to record every memory allocation:
Indeed, the extra ~8 GB of allocations is recorded, but the sizes do not look reasonable.
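The allocator patch itself is not shown here. As a lighter-weight cross-check that does not require rebuilding PyTorch, one could diff torch.xpu.memory_stats() snapshots between points in the training step and print any counter that jumps by hundreds of MiB. This is only a sketch and assumes torch.xpu.memory_stats() returns the same flat counter dictionary as torch.cuda.memory_stats():

```python
import torch

_prev_stats: dict = {}

def diff_xpu_stats(tag: str, min_delta: int = 256 * 1024**2) -> None:
    # Print any allocator counter that moved by more than min_delta bytes since
    # the previous call; a single jump of ~4006 or ~4008 MiB should stand out.
    global _prev_stats
    stats = torch.xpu.memory_stats()
    for key, value in stats.items():
        if not isinstance(value, int):
            continue
        delta = value - _prev_stats.get(key, 0)
        if abs(delta) >= min_delta:
            print(f"[{tag}] {key}: {delta / 1024**2:+.1f} MiB -> {value / 1024**2:.1f} MiB")
    _prev_stats = dict(stats)

# e.g. diff_xpu_stats("after forward"); diff_xpu_stats("after backward"); ...
```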
Note that 4202692608 is 4008*1024**2 and 4200595456 is 4006*1024**2, and no tensor in the model has a shape that is a divisor or multiple of 4006 or 4008.
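A quick check of that arithmetic; the two blocks together come to roughly 7.8 GiB, about the size of the unexplained gap:

```python
# Verify the byte sizes quoted above.
MiB = 1024**2
assert 4008 * MiB == 4202692608
assert 4006 * MiB == 4200595456
total_gib = (4202692608 + 4200595456) / 1024**3
print(f"two suspicious blocks together: {total_gib:.2f} GiB")  # ~7.83 GiB
```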
You can see that no extra memory is reserved before backward reaches the first layer.