
Commit 356277b

Update on "[ET-VK][ez] Add support for buffer backed qparams in int4 linear + add checks for physical limits when allocating"
## Context

Currently, the groupwise quantized int4 linear op implementation forces the scales and zeros tensor to be a `Texture3D`. However, for models such as transformers that have a logit linear layer, the image extents required may exceed the maximum image extents available on the device.

## Changes

* Add support for the scales and zeros tensor being a `Buffer` instead of a `Texture3D`.
* Add checks when allocating buffers or images for tensors to verify that the requested resource fits within the physical device limits.

Differential Revision: [D72662176](https://our.internmc.facebook.com/intern/diff/D72662176/)

[ghstack-poisoned]
2 parents 39e52d6 + da3c415 commit 356277b
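To make the sizing problem in the Context section concrete, the sketch below (not part of this commit) estimates the texel width that a width-packed qparams texture would need for a transformer logit linear layer and compares it against a typical 3D image limit. The vocabulary size, hidden dimension, group size, and device limit are all assumed values for illustration.

```cpp
// Illustration only: the dimensions and the device limit are assumptions,
// not values taken from this commit or from any particular device.
#include <cstdint>
#include <iostream>

int main() {
  const int64_t N = 128000;      // output features (vocab size), assumed
  const int64_t K = 4096;        // input features, assumed
  const int64_t group_size = 64; // quantization group size, assumed

  // Scales and zeros are stored per quantization group, giving a tensor of
  // roughly (K / group_size) x N values.
  const int64_t num_groups = K / group_size;

  // A width-packed texture holds 4 values per texel, so the texel width
  // needed along the N dimension is about N / 4.
  const int64_t required_width = N / 4;

  // maxImageDimension3D is frequently much smaller than maxImageDimension2D;
  // 2048 is used here purely as an assumed example value.
  const int64_t assumed_max_image_dim_3d = 2048;

  std::cout << "qparams rows (groups): " << num_groups
            << ", required texel width: " << required_width
            << ", assumed maxImageDimension3D: " << assumed_max_image_dim_3d
            << '\n';
  // 32000 far exceeds 2048, so a Buffer is the safer storage choice for the
  // qparams in this scenario.
  return 0;
}
```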

File tree

30 files changed: +145 -77 lines changed


backends/vulkan/runtime/api/containers/Tensor.cpp

Lines changed: 18 additions & 3 deletions
@@ -260,10 +260,25 @@ vkapi::VulkanImage allocate_image(
     return vkapi::VulkanImage();
   }
 
-  utils::uvec3 max_extents = adapter_ptr->max_texture_extents();
+  // TODO(ssjia): change to always check that the image extents do not exceed
+  // physical limits. Adding the check now based on `maxImageDimension3D` will
+  // cause some existing models to break. Anecdotally, on Adreno and
+  // SwiftShader devices, using 3D textures that exceed `maxImageDimension3D`
+  // appears to be ok. So we need to figure out if it is undefined behaviour
+  // or if there's a better way to figure out what the limit is. For now, only
+  // check during debug build so that we can detect when exceeding physical
+  // limits could be a potential cause for model outputs to be wrong. In the
+  // meantime, the threshold for using texture storage can be configured at
+  // export time.
+#ifdef VULKAN_DEBUG
+  uint32_t max_extent = storage_type == utils::kTexture3D
+      ? adapter_ptr->max_texture3d_dim()
+      : adapter_ptr->max_texture2d_dim();
+
   VK_CHECK_COND(
-      image_extents[0] <= max_extents[0] &&
-      image_extents[1] <= max_extents[1] && image_extents[2] <= max_extents[2]);
+      image_extents[0] <= max_extent && image_extents[1] <= max_extent &&
+      image_extents[2] <= max_extent);
+#endif
 
   VkSampler sampler = adapter_ptr->sampler_cache().retrieve(sampler_props);
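The `max_texture2d_dim()` / `max_texture3d_dim()` values checked above come from the Vulkan physical-device limits. As a minimal standalone sketch of where those numbers originate (plain Vulkan C++, independent of the ExecuTorch Adapter class; picking the first enumerated device is an arbitrary assumption):

```cpp
// Minimal sketch: query maxImageDimension2D / maxImageDimension3D from the
// first physical device. Error handling is mostly omitted for brevity.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
  VkApplicationInfo app_info{VK_STRUCTURE_TYPE_APPLICATION_INFO};
  app_info.apiVersion = VK_API_VERSION_1_1;

  VkInstanceCreateInfo instance_info{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
  instance_info.pApplicationInfo = &app_info;

  VkInstance instance = VK_NULL_HANDLE;
  if (vkCreateInstance(&instance_info, nullptr, &instance) != VK_SUCCESS) {
    return 1;
  }

  uint32_t device_count = 0;
  vkEnumeratePhysicalDevices(instance, &device_count, nullptr);
  std::vector<VkPhysicalDevice> devices(device_count);
  vkEnumeratePhysicalDevices(instance, &device_count, devices.data());

  if (!devices.empty()) {
    VkPhysicalDeviceProperties props{};
    vkGetPhysicalDeviceProperties(devices[0], &props);
    std::printf(
        "maxImageDimension2D = %u, maxImageDimension3D = %u\n",
        props.limits.maxImageDimension2D,
        props.limits.maxImageDimension3D);
  }

  vkDestroyInstance(instance, nullptr);
  return 0;
}
```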

backends/vulkan/runtime/graph/ops/glsl/pack_int4_linear_weight_transposed_interleaved.glsl

Lines changed: 2 additions & 2 deletions
@@ -131,6 +131,6 @@ void main() {
     t_qmat2[packed_pos.y * stride + packed_pos.x] = out_tex_1;
     t_qmat2[(packed_pos.y + 1) * stride + packed_pos.x] = out_tex_2;
   $else:
-    imageStore(t_qmat2, ivec3(packed_pos.xy, 0), out_tex_1);
-    imageStore(t_qmat2, ivec3(packed_pos.x, packed_pos.y + 1, 0), out_tex_2);
+    imageStore(t_qmat2, packed_pos.xy, out_tex_1);
+    imageStore(t_qmat2, ivec2(packed_pos.x, packed_pos.y + 1), out_tex_2);
 }

backends/vulkan/runtime/graph/ops/glsl/pack_int4_linear_weight_transposed_interleaved.yaml

Lines changed: 6 additions & 4 deletions
@@ -6,8 +6,10 @@
 
 pack_int4_linear_weight_transposed_interleaved:
   parameter_names_with_default_values:
-    STORAGE: texture3d
+    STORAGE: texture2d
+  generate_variant_forall:
+    STORAGE:
+      - VALUE: texture2d
+      - VALUE: buffer
   shader_variants:
-    - NAME: pack_int4_linear_weight_transposed_interleaved_texture3d
-    - NAME: pack_int4_linear_weight_transposed_interleaved_buffer
-      STORAGE: buffer
+    - NAME: pack_int4_linear_weight_transposed_interleaved

backends/vulkan/runtime/graph/ops/glsl/q_4w_linear.glsl

Lines changed: 2 additions & 2 deletions
@@ -21,7 +21,7 @@ layout(std430) buffer;
 ${layout_declare_tensor(B, "w", "t_out", DTYPE, OUT_STORAGE, is_scalar_array=False)}
 ${layout_declare_tensor(B, "r", "t_mat1", DTYPE, IN_STORAGE, is_scalar_array=False)}
 ${layout_declare_tensor(B, "r", "t_qmat2", "uint8", WEIGHT_STORAGE, is_scalar_array=False)}
-${layout_declare_tensor(B, "r", "t_qparams", DTYPE, PARAMS_STORAGE, is_scalar_array=False)}
+${layout_declare_tensor(B, "r", "t_qparams", DTYPE, "buffer", is_scalar_array=False)}
 
 layout(push_constant) uniform restrict Block {
   ivec4 out_sizes;
@@ -111,7 +111,7 @@ void main() {
   $else:
     const uvec4 packed_weight_tex = texelFetch(
         t_qmat2,
-        ivec3(gl_GlobalInvocationID.x, k + comp, 0),
+        ivec2(gl_GlobalInvocationID.x, k + comp),
         0);
 
     const uvec4 weight_tex_1 = (packed_weight_tex & 0xF0) >> 4;

backends/vulkan/runtime/graph/ops/glsl/q_4w_linear.yaml

Lines changed: 6 additions & 6 deletions
@@ -9,14 +9,14 @@ q_4w_linear:
     DTYPE: float
     OUT_STORAGE: texture3d
     IN_STORAGE: texture3d
-    WEIGHT_STORAGE: texture3d
-    PARAMS_STORAGE: texture3d
+    WEIGHT_STORAGE: texture2d
+    PARAMS_STORAGE: buffer
   shader_variants:
-    - NAME: q_4w_linear_texture3d_texture3d_texture3d_texture3d_float
-    - NAME: q_4w_linear_buffer_buffer_texture3d_texture3d_float
+    - NAME: q_4w_linear_texture3d_texture3d_texture2d_float
+    - NAME: q_4w_linear_buffer_buffer_texture2d_float
       OUT_STORAGE: buffer
       IN_STORAGE: buffer
-    - NAME: q_4w_linear_buffer_buffer_texture3d_buffer_float
+    - NAME: q_4w_linear_buffer_buffer_buffer_float
       OUT_STORAGE: buffer
       IN_STORAGE: buffer
-      PARAMS_STORAGE: buffer
+      WEIGHT_STORAGE: buffer

backends/vulkan/runtime/graph/ops/impl/QuantizedLinearGroupwiseInt4.cpp

Lines changed: 4 additions & 14 deletions
@@ -83,10 +83,9 @@ ValueRef prepack_int4_linear_weight_transposed_interleaved(
   const int64_t N = qmat2_orig_sizes.at(ndim - 2);
   const int64_t N_div2 = N / int64_t(2);
 
-  utils::StorageType storage_type = utils::kTexture3D;
-  utils::uvec3 max_extents =
-      graph.context()->adapter_ptr()->max_texture_extents();
-  if (N_div2 > max_extents[0] * 4 || K > max_extents[1]) {
+  utils::StorageType storage_type = utils::kTexture2D;
+  uint32_t max_extent = graph.context()->adapter_ptr()->max_texture2d_dim();
+  if (N_div2 > max_extent * 4 || K > max_extent) {
     storage_type = utils::kBuffer;
   }
 
@@ -132,22 +131,13 @@ void add_q_4w_linear_node(
   ValueRef mat2 =
       prepack_int4_linear_weight_transposed_interleaved(graph, mat2_data);
 
-  utils::StorageType qparams_storage_type = utils::kTexture3D;
-  utils::uvec3 max_extents =
-      graph.context()->adapter_ptr()->max_texture_extents();
-  if (graph.size_at<uint32_t>(-2, scales_and_zeros_data) > max_extents[0] * 4 ||
-      graph.size_at<uint32_t>(-3, scales_and_zeros_data) > max_extents[2]) {
-    qparams_storage_type = utils::kBuffer;
-  }
-
   ValueRef scales_and_zeros = prepack_standard_hw_transposed(
-      graph, scales_and_zeros_data, qparams_storage_type, utils::kWidthPacked);
+      graph, scales_and_zeros_data, utils::kBuffer, utils::kWidthPacked);
 
   std::string kernel_name = "q_4w_linear";
   add_storage_type_suffix(kernel_name, graph.storage_type_of(out));
   add_storage_type_suffix(kernel_name, graph.storage_type_of(mat1));
   add_storage_type_suffix(kernel_name, graph.storage_type_of(mat2));
-  add_storage_type_suffix(kernel_name, qparams_storage_type);
   add_dtype_suffix(kernel_name, graph.dtype_of(out));
 
   const uint32_t group_size_val = graph.extract_scalar<uint32_t>(group_size);
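The `add_storage_type_suffix` / `add_dtype_suffix` calls above pick one of the variant names declared in `q_4w_linear.yaml` (for example `q_4w_linear_texture3d_texture3d_texture2d_float`). Below is a rough sketch of that naming convention, using a stand-in enum and helper rather than the actual ExecuTorch types, which may differ in detail:

```cpp
// Simplified illustration of assembling a shader variant name from storage
// types and dtype. The enum and helper are stand-ins, not ExecuTorch APIs.
#include <iostream>
#include <string>

enum class StorageType { kBuffer, kTexture2D, kTexture3D };

std::string storage_suffix(StorageType s) {
  switch (s) {
    case StorageType::kBuffer:
      return "_buffer";
    case StorageType::kTexture2D:
      return "_texture2d";
    case StorageType::kTexture3D:
      return "_texture3d";
  }
  return "";
}

int main() {
  std::string kernel_name = "q_4w_linear";
  kernel_name += storage_suffix(StorageType::kTexture3D); // output storage
  kernel_name += storage_suffix(StorageType::kTexture3D); // input storage
  kernel_name += storage_suffix(StorageType::kTexture2D); // weight storage
  kernel_name += "_float";                                // dtype suffix
  // With this change the qparams storage no longer contributes a suffix,
  // since the scales and zeros are always buffer-backed.
  std::cout << kernel_name << '\n';
  // Prints: q_4w_linear_texture3d_texture3d_texture2d_float
  return 0;
}
```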

backends/vulkan/runtime/vk_api/Adapter.h

Lines changed: 6 additions & 5 deletions
@@ -211,11 +211,12 @@ class Adapter final {
     return physical_device_.min_ubo_alignment;
   }
 
-  inline utils::uvec3 max_texture_extents() const {
-    return {
-        physical_device_.properties.limits.maxImageDimension1D,
-        physical_device_.properties.limits.maxImageDimension2D,
-        physical_device_.properties.limits.maxImageDimension3D};
+  inline uint32_t max_texture2d_dim() const {
+    return physical_device_.properties.limits.maxImageDimension2D;
+  }
+
+  inline uint32_t max_texture3d_dim() const {
+    return physical_device_.properties.limits.maxImageDimension3D;
   }
 
   inline uint32_t max_buffer_numel() const {

backends/xnnpack/test/ops/test_check_quant_params.py

Lines changed: 1 addition & 2 deletions
@@ -52,7 +52,7 @@ def _test_check_quant_message(self, ep_modifier, expected_message):
         torch._dynamo.reset()
         mod = torch.nn.Linear(10, 10)
         quantizer = XNNPACKQuantizer()
-        captured = export_for_training(mod, (torch.randn(1, 10),)).module()
+        captured = export_for_training(mod, (torch.randn(1, 10),), strict=True).module()
         quantizer.set_global(get_symmetric_quantization_config(is_per_channel=True))
         prepared = prepare_pt2e(captured, quantizer)
 
@@ -68,7 +68,6 @@ def _test_check_quant_message(self, ep_modifier, expected_message):
         self.assertEquals(str(context.exception), expected_message)
 
     def test_in_per_tensor_quant(self):
-
         for invalid_scale in [
             float("nan"),
             float("inf"),

docs/source/intro-how-it-works.md

Lines changed: 3 additions & 3 deletions
@@ -17,10 +17,10 @@ ExecuTorch provides the following benefits to engineers who need to deploy machi
 
 * **Export that is robust and powerful.** Export uses [`torch.export()`](https://pytorch.org/docs/main/export.html), which uses the same technology used in PyTorch 2.x to capture PyTorch programs for fast execution. While eager mode is flexible and allows experimentation in Python, it may not work well if Python isn't available or cannot deliver efficient execution. The _Export Intermediate Representation (Export IR)_ that export flow generates can describe a wide range of dynamism in PyTorch models, including control flow and dynamic shapes, which makes it a powerful tool for fully capturing existing PyTorch models with little effort.
 * **Operator standardization.** During the graph export process, the nodes in the graph represent operators such as addition, multiplication, or convolution. These operators are part of a small standardized list called the [Core ATen Op set](https://pytorch.org/docs/main/torch.compiler_ir.html#core-aten-ir). Most PyTorch programs can be decomposed into a graph using this small set of operators during export. Small list of standardized operators reduces the surface, needed to be covered, by third-party operator libraries as well as accelerator backends, in order to run models exported for ExecuTorch. ExecuTorch runtime ships with one such library, called portable operator library, that implements core ATen opset.
-* **Standardization for compiler interfaces (aka delegates) and the OSS ecosystem.** In addition to the _Operator standardization_ above, ExecuTorch has a standardized interface for delegation to compilers. This allows third-party vendors and compilers to implement interfaces and API entry points for compilation and execution of (either partial or full) graphs targeting their specialized hardware. This provides greater flexibility in terms of hardware support and performance optimization, as well as easier integration with the PyTorch open source ecosystem for on-device AI.
-* **First-party SDK and toolchain.** Due to the above standardization efforts, it was possible to build a unified first-party SDK for ExecuTorch, where developers can export, compile, and deploy to a wide range of target devices--such as iOS, Android, and microcontrollers--using the same SDK, streamlining the process and gaining productivity. Additionally, the SDK provides profiling and debugging functionality to easily inspect intermediate states, which are core parts of most developer workflows.
+* **Standardization for compiler interfaces (aka delegates) and the OSS ecosystem.** In addition to the _Operator standardization_ above, ExecuTorch has a [standardized interface](./compiler-delegate-and-partitioner.md) for delegation to compilers. This allows third-party vendors and compilers to implement interfaces and API entry points for compilation and execution of (either partial or full) graphs targeting their specialized hardware. This provides greater flexibility in terms of hardware support and performance optimization, as well as easier integration with the PyTorch open source ecosystem for on-device AI.
+* **First-party Developer Tools.** Due to the above standardization efforts, it was possible to build unified first-party [developer tools](./devtools-overview.md) for ExecuTorch, where developers can export, compile, and deploy to a wide range of target devices--such as iOS, Android, and microcontrollers--using the same APIs, streamlining the process and increasing productivity. Additionally, ExecuTorch provides profiling and debugging functionality to easily inspect intermediate states, which are core parts of most developer workflows.
 * **No intermediate conversions necessary.** ExecuTorch's main design principle is to allow developers to run their models on target devices without the need for converting to third-party intermediate representations. This eliminates a number of problems that on-device developers typically face when working with these conversion steps, such as lack of debuggability and profiling, the need to familiarize themselves with hardware-specific tools, and models not being able to run due to conversion steps failing.
-* **Ease of customization.** Developers can optimize their deployment for even better performance gains on the target architecture by applying custom techniques, such as linking with high-performance operator implementations or customizing memory planning based on storage and latency trade-offs. This level of customization is made possible through the standardization of the compiler pass interface and registration APIs on exported graphs.
+* **Ease of customization.** Developers can optimize their deployment for even better performance gains on the target architecture by applying custom techniques, such as [linking with high-performance operator implementations](./kernel-library-custom-aten-kernel.md) or [customizing memory planning](./compiler-memory-planning.md) based on storage and latency trade-offs. This level of customization is made possible through the standardization of the [compiler pass interface](./compiler-custom-compiler-passes.md) and registration APIs on exported graphs.
 * **Low overhead runtime and execution.** The ExecuTorch runtime, written in C++, is highly efficient and can run on a wide range of architectures, including Linux, iOS, Android, embedded systems, and bare metal hardware, with little additional setup or configuration. It is capable of linking in only those operators needed for the model, resulting in a minimal runtime binary size. It is also able to run at low latency because of ahead-of-time compilation and memory planning stages, with the runtime responsible only for execution (e.g., call operator `conv` and save the result in memory location X).
 
 The above highlights the key advantages of ExecuTorch across three main categories: portability, productivity, and performance. We consider it to be an ideal choice for enabling on-device AI across mobile and edge computing platforms.

examples/llm_manual/export_nanogpt.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
 # The torch.no_grad() call tells PyTorch to exclude training-specific logic.
 with sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
     m = export_for_training(
-        model, example_inputs, dynamic_shapes=dynamic_shape
+        model, example_inputs, dynamic_shapes=dynamic_shape, strict=True
     ).module()
     traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape, strict=True)
 
