
Commit b07738d

Author: ssjia (committed)
Update on "[ET-VK] Allocate memory for weight and activation tensors lazily"
Summary:

* Allocate memory for weight tensors right before the prepacking shader is dispatched, rather than while building the graph
* Move allocation of shared objects (i.e. memory for intermediate tensors) to occur after prepacking

## Motivation

Prevent screen blackout (Llama 3.2 1B) / device crash (Llama 3.2 3B) when running Llama 3.2 models on a Samsung Galaxy S24. This behaviour is related to high peak memory usage while loading the model.

## Full Context

During model loading, the Vulkan delegate needs to store 3 copies of the constant data in memory at various points:

* source data obtained from loading the model
* staging buffer
* GPU texture/buffer

The general rationale of this change is to allocate memory for each copy only when necessary, to minimize the "overlap" during which all 3 exist at once.

### Current order of operations

Legend:

* `W` represents total weight nbytes
* `w` represents weight nbytes for one tensor
* `A` represents total activation nbytes
* `M` represents an approximation of the total memory footprint

First, the model file is loaded. Then, while building the compute graph, for each weight tensor:

1. Weight data is loaded from the NamedDataMap (`M = W`)
2. The GPU texture/buffer for the weight is initialized and memory is allocated (`M = 2W`)
3. After building the graph, `graph->prepare()` is called, which currently allocates memory for the activation tensors as well (`M = 2W + A`)

Then, during the prepacking stage, each weight tensor is copied individually:

1. The staging buffer is initialized (`M = 2W + A + w`)
2. CPU weight data is copied to staging, then the CPU weight data is freed (`M = 2W + A`)
3. A compute shader is dispatched to copy staging to the GPU texture/buffer, then the staging buffer is freed (`M = 2W + A - w`)

The peak usage in mainline is therefore `M = 2W + A + w`.

### Revised order of operations

This change revises the order of operations:

1. Weight data is loaded from the NamedDataMap (`M = W`)
2. The GPU texture/buffer for the weight is initialized, but **memory is not allocated** (`M = W`)

Then, during the prepacking stage, each weight tensor is copied individually:

1. The staging buffer is initialized (`M = W + w`)
2. **Memory is allocated for the GPU texture/buffer** (`M = W + 2w`)
3. CPU weight data is copied to staging, then the CPU weight data is freed (`M = W + w`)
4. A compute shader is dispatched to copy staging to the GPU texture/buffer, then the staging buffer is freed (`M = W`)

**Only after all prepacking operations complete is activation memory allocated** (`M = W + A`)

Under this scheme, peak memory is reduced to `M = W + A` (or `M = W + 2w` if `2w > A`), which is, or is at least very close to, the theoretical minimum.

Test Plan:

## Logging Memory Usage

Using

```cpp
uint64_t getVmRssInKB() {
  std::ifstream statusFile("/proc/self/status");
  std::string l, num;
  while (std::getline(statusFile, l)) {
    if (l.substr(0, 5) == "VmRSS") {
      size_t pos = l.find_first_of("0123456789");
      num = l.substr(pos);
      break;
    }
  }
  uint64_t vmRssInKB = std::stoi(num);
  return vmRssInKB;
}

uint64_t getVmaStatsInKB() {
  auto stats =
      vkcompute::api::context()->adapter_ptr()->vma().get_memory_statistics();
  uint64_t vmaBlockInKB = stats.total.statistics.blockBytes >> 10;
  return vmaBlockInKB;
}
```

to log the memory footprint at various points of inference when running the llama_runner binary with Llama 3.2 1B, we can compare the memory footprint with and without these changes.
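The memory accounting above can be replayed with a small model (a hypothetical illustration, not code from this diff): each function walks one order of operations and reports the peak value of `M`.

```python
# Replay the two orders of operations and return the peak footprint M.
# ws: list of per-tensor weight nbytes (w); A: total activation nbytes.

def peak_old(ws, A):
    """Mainline order: all GPU weight memory and activations up front."""
    W = sum(ws)
    M = peak = 2 * W + A                # graph built + graph->prepare()
    for w in ws:
        M += w; peak = max(peak, M)     # staging buffer initialized
        M -= w                          # CPU weight copy freed after copy to staging
        M -= w                          # staging freed after shader dispatch
    return peak

def peak_new(ws, A):
    """Revised order: GPU weight memory allocated lazily, activations last."""
    W = sum(ws)
    M = peak = W                        # only CPU weight copies exist
    for w in ws:
        M += w; peak = max(peak, M)     # staging buffer initialized
        M += w; peak = max(peak, M)     # GPU texture/buffer memory allocated
        M -= w                          # CPU weight copy freed
        M -= w                          # staging freed
    M += A; peak = max(peak, M)         # activations allocated after prepacking
    return peak
```

With equal-sized tensors this reproduces the claimed peaks: `2W + A + w` for mainline and `max(W + 2w, W + A)` for the revised scheme.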
With changes: P1908051860 (Meta only)

```
Memory usage before model compilation: 1115760 KB (VmRSS), 0 KB (VMA)
Memory usage after graph building: 1924832 KB (VmRSS), 17920 KB (VMA)
Memory usage after graph preparation: 1935312 KB (VmRSS), 17920 KB (VMA)
Memory usage prepack start: 1935312 KB, VMA Block: 17920 KB
Memory usage after prepack operations: 1372376 KB (VmRSS), 2330528 KB (VMA)
Memory usage before execute: 1372804 KB (VmRSS), 2330528 KB (VMA)
Memory usage at end of execute: 1376916 KB (VmRSS), 2330528 KB (VMA)
```

Without changes: P1908054759 (Meta only)

```
Memory usage before model compilation: 1114784 KB (VmRSS), 0 KB (VMA)
Memory usage after graph building: 1924432 KB (VmRSS), 962464 KB (VMA)
Memory usage after graph preparation: 1922916 KB (VmRSS), 2326432 KB (VMA)
Memory usage prepack start: 1922916 KB, VMA Block: 2326432 KB
Memory usage after prepack operations: 1359180 KB (VmRSS), 2330528 KB (VMA)
Memory usage before execute: 1359492 KB (VmRSS), 2330528 KB (VMA)
Memory usage at end of execute: 1363636 KB (VmRSS), 2330528 KB (VMA)
```

These logs show how the changes reduce peak memory: with the changes, the VMA footprint grows gradually while the model loads, as VmRSS gradually shrinks. Without the changes, the VMA footprint reaches its peak immediately after the graph is initialized. Visually, it can also be verified that the Samsung Galaxy S24's screen no longer blacks out while loading the model.

Differential Revision: [D80460033](https://our.internmc.facebook.com/intern/diff/D80460033)

[ghstack-poisoned]
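As a side note, the log lines above are regular enough to diff programmatically. A small hypothetical parser (not part of this diff; the `VMA Block:` variant line is intentionally left unparsed):

```python
import re

# Parse one "Memory usage ..." log line into (label, vmrss_kb, vma_kb).
LOG_RE = re.compile(
    r"Memory usage (?P<label>.+?): (?P<vmrss>\d+) KB \(VmRSS\), (?P<vma>\d+) KB \(VMA\)"
)

def parse_log_line(line):
    m = LOG_RE.match(line.strip())
    if m is None:
        return None  # e.g. the "prepack start ... VMA Block:" line uses a different shape
    return m.group("label"), int(m.group("vmrss")), int(m.group("vma"))
```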
2 parents 0392ea4 + 15eb5f7 commit b07738d

File tree

8 files changed: +77 / -19 lines changed


backends/vulkan/runtime/VulkanBackend.cpp

Lines changed: 1 addition & 1 deletion
```diff
@@ -22,7 +22,7 @@
 #include <executorch/runtime/core/event_tracer_hooks_delegate.h>
 #endif // ET_EVENT_TRACER_ENABLED
 #include <executorch/runtime/core/exec_aten/util/tensor_util.h>
-#include <executorch/runtime/executor/pte_data_map.h>
+#include <executorch/runtime/core/named_data_map.h>
 #include <executorch/runtime/platform/compiler.h>
 #include <executorch/runtime/platform/profiler.h>
```

backends/vulkan/runtime/graph/ComputeGraph.cpp

Lines changed: 18 additions & 3 deletions
```diff
@@ -958,9 +958,24 @@ void ComputeGraph::prepack() {
   staging_nbytes_in_cmd_ = 0;

   // Initialize allocations for intermediate tensors
-  for (SharedObject& shared_object : shared_objects_) {
-    shared_object.allocate(this);
-    shared_object.bind_users(this);
+
+  // If shared objects are used, then that implies memory planning was
+  // performed. Memory for intermediate tensors can be allocated by allocating
+  // the shared objects. Assume that no intermediate tensors use dedicated
+  // allocations.
+  if (shared_objects_.size() > 0) {
+    for (SharedObject& shared_object : shared_objects_) {
+      shared_object.allocate(this);
+      shared_object.bind_users(this);
+    }
+  }
+  // Otherwise, intermediate tensors likely use dedicated allocations.
+  else {
+    for (int i = 0; i < values_.size(); i++) {
+      if (values_.at(i).isTensor()) {
+        create_dedicated_allocation_for(i);
+      }
+    }
   }
 }
```
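The branch added above can be summarized in a language-neutral sketch (Python used for illustration; `Graph`, the value tuples, and the allocation labels are stand-ins, not the actual C++ API):

```python
# Sketch of the allocation strategy in ComputeGraph::prepack():
# shared objects imply memory planning was performed, so allocate through
# them; otherwise fall back to a dedicated allocation per tensor value.

class Graph:
    def __init__(self, shared_objects, values):
        self.shared_objects = shared_objects  # memory-planned regions
        self.values = values                  # list of (kind, name) pairs
        self.allocated = []

    def allocate_intermediates(self):
        if self.shared_objects:
            for so in self.shared_objects:
                self.allocated.append(("shared", so))
        else:
            for kind, name in self.values:
                if kind == "tensor":
                    self.allocated.append(("dedicated", name))
```

For example, a graph with no shared objects allocates dedicated memory only for its tensor values and skips non-tensor values.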

backends/vulkan/runtime/vk_api/memory/Buffer.cpp

Lines changed: 21 additions & 5 deletions
```diff
@@ -20,6 +20,7 @@ VulkanBuffer::VulkanBuffer()
       allocator_(VK_NULL_HANDLE),
       memory_{},
       owns_memory_(false),
+      memory_bundled_(false),
       is_copy_(false),
       handle_(VK_NULL_HANDLE) {}

@@ -33,6 +34,7 @@ VulkanBuffer::VulkanBuffer(
       allocator_(vma_allocator),
       memory_{},
       owns_memory_(allocate_memory),
+      memory_bundled_(allocate_memory),
       is_copy_(false),
       handle_(VK_NULL_HANDLE) {
   // If the buffer size is 0, allocate a buffer with a size of 1 byte. This is
@@ -77,6 +79,7 @@ VulkanBuffer::VulkanBuffer(
       allocator_(other.allocator_),
       memory_(other.memory_),
       owns_memory_(false),
+      memory_bundled_(false),
       is_copy_(true),
       handle_(other.handle_) {
   // TODO: set the offset and range appropriately
@@ -91,6 +94,7 @@ VulkanBuffer::VulkanBuffer(VulkanBuffer&& other) noexcept
       allocator_(other.allocator_),
       memory_(std::move(other.memory_)),
       owns_memory_(other.owns_memory_),
+      memory_bundled_(other.memory_bundled_),
       is_copy_(other.is_copy_),
       handle_(other.handle_) {
   other.handle_ = VK_NULL_HANDLE;
@@ -99,16 +103,19 @@ VulkanBuffer::VulkanBuffer(VulkanBuffer&& other) noexcept
 VulkanBuffer& VulkanBuffer::operator=(VulkanBuffer&& other) noexcept {
   VkBuffer tmp_buffer = handle_;
   bool tmp_owns_memory = owns_memory_;
+  bool tmp_memory_bundled = memory_bundled_;

   buffer_properties_ = other.buffer_properties_;
   allocator_ = other.allocator_;
   memory_ = std::move(other.memory_);
   owns_memory_ = other.owns_memory_;
+  memory_bundled_ = other.memory_bundled_;
   is_copy_ = other.is_copy_;
   handle_ = other.handle_;

   other.handle_ = tmp_buffer;
   other.owns_memory_ = tmp_owns_memory;
+  other.memory_bundled_ = tmp_memory_bundled;

   return *this;
 }
@@ -119,14 +126,22 @@ VulkanBuffer::~VulkanBuffer() {
   // ownership of the underlying resource.
   if (handle_ != VK_NULL_HANDLE && !is_copy_) {
     if (owns_memory_) {
-      vmaDestroyBuffer(allocator_, handle_, memory_.allocation);
+      if (memory_bundled_) {
+        vmaDestroyBuffer(allocator_, handle_, memory_.allocation);
+        // Prevent the underlying memory allocation from being freed; it was
+        // freed by vmaDestroyBuffer
+        memory_.allocation = VK_NULL_HANDLE;
+      } else {
+        vkDestroyBuffer(this->device(), handle_, nullptr);
+        // Allow the underlying memory allocation to be freed by the
+        // destructor of the Allocation class
+      }
     } else {
       vkDestroyBuffer(this->device(), handle_, nullptr);
+      // Prevent the underlying memory allocation from being freed since this
+      // object doesn't own it
+      memory_.allocation = VK_NULL_HANDLE;
     }
-    // Prevent the underlying memory allocation from being freed; it was either
-    // freed by vmaDestroyBuffer, or this resource does not own the underlying
-    // memory
-    memory_.allocation = VK_NULL_HANDLE;
   }
 }

@@ -151,6 +166,7 @@ void VulkanBuffer::bind_allocation(const Allocation& memory) {
 void VulkanBuffer::acquire_allocation(Allocation&& memory) {
   bind_allocation_impl(memory);
   memory_ = std::move(memory);
+  owns_memory_ = true;
 }

 VkMemoryRequirements VulkanBuffer::get_memory_requirements() const {
```
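The destructor logic above reduces to a small decision table. A hypothetical sketch of who frees what (the function and return labels are illustrative, not the VMA API):

```python
def destroy_action(owns_memory, memory_bundled):
    """Mirror VulkanBuffer's destructor: decide how the buffer and its memory die."""
    if owns_memory and memory_bundled:
        # Allocation was created together with the buffer via vmaCreateBuffer,
        # so vmaDestroyBuffer frees both; clear the handle to avoid double-free.
        return ("vmaDestroyBuffer", "freed_with_buffer")
    if owns_memory:
        # Memory was acquired later (lazy allocation path): destroy only the
        # buffer handle and let the Allocation destructor free the memory.
        return ("vkDestroyBuffer", "freed_by_allocation_dtor")
    # Not owned at all: destroy the buffer handle, leave the memory untouched.
    return ("vkDestroyBuffer", "not_freed_here")
```

The `memory_bundled_` flag is what distinguishes the first two rows; before this change only the first and third existed.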

backends/vulkan/runtime/vk_api/memory/Buffer.h

Lines changed: 4 additions & 0 deletions
```diff
@@ -100,6 +100,10 @@ class VulkanBuffer final {
   Allocation memory_;
   // Indicates whether the underlying memory is owned by this resource
   bool owns_memory_;
+  // Indicates whether the allocation for the buffer was created with the buffer
+  // via vmaCreateBuffer; if this is false, the memory is owned but was bound
+  // separately via vmaBindBufferMemory
+  bool memory_bundled_;
   // Indicates whether this VulkanBuffer was copied from another VulkanBuffer,
   // thus it does not have ownership of the underlying VKBuffer
   bool is_copy_;
```

backends/vulkan/runtime/vk_api/memory/Image.cpp

Lines changed: 22 additions & 5 deletions
```diff
@@ -99,6 +99,7 @@ VulkanImage::VulkanImage()
       allocator_(VK_NULL_HANDLE),
       memory_{},
       owns_memory_(false),
+      memory_bundled_(false),
       owns_view_(false),
       is_copy_(false),
       handles_{
@@ -125,6 +126,7 @@ VulkanImage::VulkanImage(
       allocator_(vma_allocator),
       memory_{},
       owns_memory_{allocate_memory},
+      memory_bundled_(allocate_memory),
       owns_view_(false),
       is_copy_(false),
       handles_{
@@ -195,6 +197,7 @@ VulkanImage::VulkanImage(
       allocator_(VK_NULL_HANDLE),
       memory_{},
       owns_memory_(false),
+      memory_bundled_(false),
       is_copy_(false),
       handles_{
           image,
@@ -224,6 +227,7 @@ VulkanImage::VulkanImage(VulkanImage&& other) noexcept
       allocator_(other.allocator_),
       memory_(std::move(other.memory_)),
       owns_memory_(other.owns_memory_),
+      memory_bundled_(other.memory_bundled_),
       owns_view_(other.owns_view_),
       is_copy_(other.is_copy_),
       handles_(other.handles_),
@@ -232,12 +236,14 @@ VulkanImage::VulkanImage(VulkanImage&& other) noexcept
   other.handles_.image_view = VK_NULL_HANDLE;
   other.handles_.sampler = VK_NULL_HANDLE;
   other.owns_memory_ = false;
+  other.memory_bundled_ = false;
 }

 VulkanImage& VulkanImage::operator=(VulkanImage&& other) noexcept {
   VkImage tmp_image = handles_.image;
   VkImageView tmp_image_view = handles_.image_view;
   bool tmp_owns_memory = owns_memory_;
+  bool tmp_memory_bundled = memory_bundled_;

   device_ = other.device_;
   image_properties_ = other.image_properties_;
@@ -246,13 +252,15 @@ VulkanImage& VulkanImage::operator=(VulkanImage&& other) noexcept {
   allocator_ = other.allocator_;
   memory_ = std::move(other.memory_);
   owns_memory_ = other.owns_memory_;
+  memory_bundled_ = other.memory_bundled_;
   is_copy_ = other.is_copy_;
   handles_ = other.handles_;
   layout_ = other.layout_;

   other.handles_.image = tmp_image;
   other.handles_.image_view = tmp_image_view;
   other.owns_memory_ = tmp_owns_memory;
+  other.memory_bundled_ = tmp_memory_bundled;

   return *this;
 }
@@ -271,14 +279,22 @@ VulkanImage::~VulkanImage() {

   if (handles_.image != VK_NULL_HANDLE) {
     if (owns_memory_) {
-      vmaDestroyImage(allocator_, handles_.image, memory_.allocation);
+      if (memory_bundled_) {
+        vmaDestroyImage(allocator_, handles_.image, memory_.allocation);
+        // Prevent the underlying memory allocation from being freed; it was
+        // freed by vmaDestroyImage
+        memory_.allocation = VK_NULL_HANDLE;
+      } else {
+        vkDestroyImage(this->device(), handles_.image, nullptr);
+        // Allow the underlying memory allocation to be freed by the
+        // destructor of the Allocation class
+      }
     } else {
       vkDestroyImage(this->device(), handles_.image, nullptr);
+      // Prevent the underlying memory allocation from being freed since this
+      // object doesn't own it
+      memory_.allocation = VK_NULL_HANDLE;
     }
-    // Prevent the underlying memory allocation from being freed; it was either
-    // freed by vmaDestroyImage, or this resource does not own the underlying
-    // memory
-    memory_.allocation = VK_NULL_HANDLE;
   }
 }

@@ -341,6 +357,7 @@ void VulkanImage::bind_allocation(const Allocation& memory) {
 void VulkanImage::acquire_allocation(Allocation&& memory) {
   bind_allocation_impl(memory);
   memory_ = std::move(memory);
+  owns_memory_ = true;
 }

 VkMemoryRequirements VulkanImage::get_memory_requirements() const {
```

backends/vulkan/runtime/vk_api/memory/Image.h

Lines changed: 4 additions & 0 deletions
```diff
@@ -156,6 +156,10 @@ class VulkanImage final {
   Allocation memory_;
   // Indicates whether the underlying memory is owned by this resource
   bool owns_memory_;
+  // Indicates whether the allocation for the image was created with the image
+  // via vmaCreateImage; if this is false, the memory is owned but was bound
+  // separately via vmaBindImageMemory
+  bool memory_bundled_;
   // In some cases, a VulkanImage may be a copy of another VulkanImage but still
   // own a unique view of the VkImage.
   bool owns_view_;
```

backends/vulkan/serialization/vulkan_graph_serialize.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -201,11 +201,11 @@ def serialize_constant_tensors(
                     named_key=named_key,
                 )
             )
-        elif tensor is None or tensor.numel() == 0:
-            assert isinstance(tensor, torch.Tensor)
+        elif tensor is None or (
+            isinstance(tensor, torch.Tensor) and tensor.numel() == 0
+        ):
             vk_graph.constants.append(VkBytes(current_offset, 0))
-        else:
-            assert isinstance(tensor, torch.Tensor)
+        elif isinstance(tensor, torch.Tensor):
             array_type = ctypes.c_char * tensor.untyped_storage().nbytes()
             array = ctypes.cast(
                 tensor.untyped_storage().data_ptr(),
@@ -219,6 +219,8 @@

             vk_graph.constants.append(VkBytes(current_offset, len(tensor_bytes)))
             current_offset += aligned_size(len(tensor_bytes))
+        else:
+            raise ValueError(f"Unsupported constant tensor type: {type(tensor)}")


 def serialize_custom_shaders(
```
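The restructured `elif` chain above can be exercised with a minimal stand-in (a plain class instead of `torch.Tensor`, and the named-key branch omitted, so this is a shape sketch rather than the real serializer):

```python
class FakeTensor:
    """Stand-in for torch.Tensor: carries only a byte payload."""
    def __init__(self, data: bytes):
        self.data = data

    def numel(self):
        return len(self.data)

def classify_constant(tensor):
    """Mirror the new branch order: empty entry, real payload, or error."""
    if tensor is None or (isinstance(tensor, FakeTensor) and tensor.numel() == 0):
        # None and zero-element tensors both serialize as a zero-length entry,
        # which the old code asserted on instead of handling.
        return ("empty", 0)
    elif isinstance(tensor, FakeTensor):
        return ("bytes", tensor.numel())
    else:
        raise ValueError(f"Unsupported constant tensor type: {type(tensor)}")
```

The key behavioral change is that `None` no longer reaches `tensor.numel()`, and unexpected types raise a descriptive `ValueError` instead of failing an assert.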

backends/vulkan/targets.bzl

Lines changed: 1 addition & 1 deletion
```diff
@@ -305,7 +305,7 @@ def define_common_targets(is_fbcode = False):
             "//executorch/backends/vulkan/serialization:vk_delegate_schema",
             "//executorch/runtime/core:event_tracer",
             "//executorch/runtime/core/exec_aten/util:tensor_util",
-            "//executorch/runtime/executor:pte_data_map",
+            "//executorch/runtime/core:named_data_map",
         ],
         define_static_target = True,
         # VulkanBackend.cpp needs to compile with executor as whole
```
