
Commit 05f99cb

Author: ssjia
[ET-VK] Miscellaneous fixes
Collecting fixes for various models/ops in this diff/PR. They have all been squashed into this single change to make it easier to cherry-pick.

# Fixes

## Wav2Letter

Type: Output correctness failure

This is caused by a bug in SwiftShader and is not reproducible on any other platform. Specifically, the issue is in the softmax shader; the exact cause is unknown, but it is related to using shared memory within shaders. The workaround is to use separate shared memory arrays for the shared max and the shared sum.

## ConvNeXT

Type: Exception during runtime

This is caused by an incompatible memory layout being used for mean2d; more technically, the packed dimension of the tensor cannot be one of the dims being reduced. The operator registry system had no way to select valid tensor representations based on the actual arguments of an op, so this change introduces a mechanism for ops to specify valid representations once a node's arguments are known. Once the model is exported with a supported memory layout, the model test passes.

## Inception_V3/ViT

Type: Exception during runtime

The root cause was an interaction between the fuse batch norm pass and how `vulkan_preprocess.py` was applying passes. Essentially, the fuse batch norm pass creates a new param node for the fused weight, but after the pass is applied, `_copy_module` is used to copy the transformed graph back into the ExportedProgram. However, `_copy_module` appears to lowercase the node names without updating the exported program's graph signature, so subsequent passes couldn't recognize the weight tensor of convolution ops as a constant/parameter node. The solution was to migrate `vulkan_preprocess.py` to the `_transform()` API instead of `_copy_module`.

## DenseNet 161 (w/ dynamic shapes)

Type: Output mismatch

Cause: the native_batch_norm op doesn't support dynamic shapes, but the backend test runner doesn't set the compile option that filters out ops without dynamic shape support.

Differential Revision: [D83703496](https://our.internmc.facebook.com/intern/diff/D83703496/)

[ghstack-poisoned]
1 parent 07dcd95 commit 05f99cb

8 files changed (+99 −65 lines)

.github/workflows/pull.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -970,7 +970,7 @@ jobs:
       PYTHON_EXECUTABLE=python bash backends/vulkan/test/scripts/test_model.sh --build

       # Test models serially
-      models="mv2 mv3 edsr resnet18 resnet50 dl3"
+      models="mv2 mv3 edsr resnet18 resnet50 dl3 w2l ic3"
       for model in $models; do
         python -m examples.vulkan.export --model_name=$model --test
       done
```

backends/vulkan/op_registry.py

Lines changed: 45 additions & 19 deletions
```diff
@@ -48,6 +48,9 @@ class OpFeatures:
         # Optional check function used during partitioning to determine if a node's
         # inputs are supported by the operator implementation.
         "are_node_inputs_supported_fn",
+        # Optional function to determine valid representation sets for input and outputs
+        # once a node's actual inputs are known.
+        "pick_io_storage_fn",
     ]

     def __init__(
@@ -61,6 +64,7 @@ def __init__(
         supports_resize: bool = False,
         supports_prepacking: bool = False,
         are_node_inputs_supported_fn: Optional[Callable] = allow_node,
+        pick_io_storage_fn: Optional[Callable] = None,
     ):
         self.inputs_storage: utils.TensorRepSetList = utils.TensorRepSetList(
             inputs_storage if inputs_storage is not None else []
@@ -77,14 +81,22 @@ def __init__(
         self.supports_prepacking = supports_prepacking

         self.are_node_inputs_supported_fn = are_node_inputs_supported_fn
+        self.pick_io_storage_fn = pick_io_storage_fn

     def make_op_repsets(
         self,
         op_node: torch.fx.Node,
         texture_limits: utils.ImageExtents = utils.DEFAULT_TEXTURE_LIMITS,
     ) -> utils.OpRepSets:
+        inputs_storage = self.inputs_storage
+        outputs_storage = self.outputs_storage
+        if self.pick_io_storage_fn is not None:
+            i_storage, o_storage = self.pick_io_storage_fn(op_node)
+            inputs_storage = utils.TensorRepSetList(i_storage)
+            outputs_storage = utils.TensorRepSetList(o_storage)
+
         return utils.OpRepSets(
-            self.inputs_storage, self.outputs_storage, op_node, texture_limits
+            inputs_storage, outputs_storage, op_node, texture_limits
         )


@@ -411,27 +423,10 @@ def register_softmax_op():
 def register_reduce_op():
     def check_reduce_node(node: torch.fx.Node) -> bool:
         dim_list = node.args[1]
+        # Only 1D and 2D reductions are supported at the moment.
         if isinstance(dim_list, list) and len(dim_list) > 2:
             return False

-        if isinstance(dim_list, list) and len(dim_list) == 2:
-            # Try to get the memory layout for this node
-            try:
-                memory_layout = utils.get_node_memory_layout(node)
-
-                # If we have memory layout information, check if any dimension in dim_list corresponds to a packed dimension
-                if (
-                    memory_layout is not None
-                    and memory_layout != VkMemoryLayout.DEFAULT_LAYOUT
-                ):
-                    # For now only default layout is supported for 2D reduction.
-                    # Because we can't determine if the input is NCHW or NHWC here,
-                    # assume the reduction dimension is packed so we cannot support it.
-                    return False
-            except (AssertionError, KeyError, AttributeError):
-                # If we can't get memory layout information, we'll assume the dims aren't packed
-                pass
-
         def try_find_keepdim_arg(node: torch.fx.Node) -> bool:
             for arg in node.args:
                 if isinstance(arg, bool):
@@ -446,10 +441,41 @@ def try_find_keepdim_arg(node: torch.fx.Node) -> bool:

         return True

+    def pick_io_storage_for_reduce(node: torch.fx.Node):
+        inputs_storage = utils.ANY_TEXTURE
+        outputs_storage = utils.ANY_TEXTURE
+
+        input_tensor = node.args[0]
+        ndim = input_tensor.meta["val"].ndim
+        dim_list = node.args[1]
+        if isinstance(dim_list, list) and len(dim_list) == 2:
+            reduce_dim1_whcn = utils.nchw_dim_to_whcn_dim(dim_list[0], ndim)
+            reduce_dim2_whcn = utils.nchw_dim_to_whcn_dim(dim_list[1], ndim)
+
+            possible_packed_dims = {0, 1, 2}
+            possible_packed_dims.discard(reduce_dim1_whcn)
+            possible_packed_dims.discard(reduce_dim2_whcn)
+
+            packed_dim = possible_packed_dims.pop()
+            assert packed_dim in [0, 1, 2]
+
+            if packed_dim == 0:
+                inputs_storage = utils.WIDTH_PACKED_TEXTURE
+                outputs_storage = utils.WIDTH_PACKED_TEXTURE
+            elif packed_dim == 1:
+                inputs_storage = utils.HEIGHT_PACKED_TEXTURE
+                outputs_storage = utils.HEIGHT_PACKED_TEXTURE
+            else:
+                inputs_storage = utils.CHANNELS_PACKED_TEXTURE
+                outputs_storage = utils.CHANNELS_PACKED_TEXTURE
+
+        return inputs_storage, outputs_storage
+
     return OpFeatures(
         inputs_storage=utils.ANY_TEXTURE,
         supports_resize=True,
         are_node_inputs_supported_fn=check_reduce_node,
+        pick_io_storage_fn=pick_io_storage_for_reduce,
     )
```
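For illustration, here is a minimal standalone sketch (not ExecuTorch code — the helper below is a stand-in copy of `utils.nchw_dim_to_whcn_dim`, and the returned integer stands in for the `*_PACKED_TEXTURE` repsets) of how `pick_io_storage_for_reduce` narrows the packed dimension for a 2D reduction:

```python
# Hypothetical standalone model of pick_io_storage_for_reduce: for a 2D
# reduction, the packed dim must be the one WHCN texture dim NOT being reduced.
def nchw_dim_to_whcn_dim(nchw_dim: int, ndim: int) -> int:
    if nchw_dim < 0:
        nchw_dim += ndim
    return (ndim - 1) - nchw_dim

def pick_packed_dim(dim_list, ndim):
    remaining = {0, 1, 2}  # WHCN texture dims: W=0, H=1, C=2
    for d in dim_list:
        remaining.discard(nchw_dim_to_whcn_dim(d, ndim))
    return remaining.pop()  # exactly one dim survives a 2D reduction

# mean2d over H and W (NCHW dims 2, 3): only channels-packed is valid.
assert pick_packed_dim([2, 3], ndim=4) == 2  # 2 -> CHANNELS_PACKED_TEXTURE
```

Because exactly two of the three WHCN texture dims are reduced, the single remaining dim is the only one that can be packed, which is why the registration can commit to one memory layout as soon as the node's `dim_list` is known.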

backends/vulkan/runtime/graph/ops/glsl/full.yaml

Lines changed: 1 addition & 0 deletions
```diff
@@ -14,5 +14,6 @@ full:
   DTYPE:
     - VALUE: half
     - VALUE: float
+    - VALUE: int32
   shader_variants:
     - NAME: full
```

backends/vulkan/runtime/graph/ops/glsl/softmax.glsl

Lines changed: 14 additions & 13 deletions
```diff
@@ -42,7 +42,8 @@ layout(constant_id = 5) const int group_dim = 1;
 // work group will write into its assigned element in the shared array.
 #define MAX_NTHREADS 16

-shared vec4 shared_vecs[MAX_NTHREADS];
+shared vec4 shared_max[MAX_NTHREADS];
+shared vec4 shared_sum[MAX_NTHREADS];

 #include "indexing_utils.h"

@@ -102,13 +103,13 @@ void softmax_nonpacked_dim(const ivec2 tid, ivec3 scan_pos) {
        i += NWORKERS, scan_pos[reduce_dim] += NWORKERS) {
     max_elements = max(max_elements, load_texel(tin, scan_pos));
   }
-  shared_vecs[smi] = max_elements;
+  shared_max[smi] = max_elements;
   barrier();
   // Iterate over the partial maximums to obtain the overall maximum
   group_i = tid.y * NWORKERS;
-  max_elements = shared_vecs[group_i++];
+  max_elements = shared_max[group_i++];
   for (int i = 1; i < NWORKERS; ++i, group_i++) {
-    max_elements = max(max_elements, shared_vecs[group_i]);
+    max_elements = max(max_elements, shared_max[group_i]);
   }

   scan_pos[reduce_dim] = tid.x;
@@ -118,13 +119,13 @@ void softmax_nonpacked_dim(const ivec2 tid, ivec3 scan_pos) {
        i += NWORKERS, scan_pos[reduce_dim] += NWORKERS) {
     denominators += exp(load_texel(tin, scan_pos) - max_elements);
   }
-  shared_vecs[smi] = denominators;
+  shared_sum[smi] = denominators;
   barrier();
   // Iterate over the partial sums to obtain the overall sum
   group_i = tid.y * NWORKERS;
-  denominators = shared_vecs[group_i++];
+  denominators = shared_sum[group_i++];
   for (int i = 1; i < NWORKERS; ++i, group_i++) {
-    denominators += shared_vecs[group_i];
+    denominators += shared_sum[group_i];
   }

   // Determine if there are any padding elements in the final texel of the
@@ -184,13 +185,13 @@ void softmax_packed_dim(const ivec2 tid, ivec3 scan_pos) {
       max_elements.x = max(intex[i], max_elements.x);
     }
   }
-  shared_vecs[smi] = max_elements;
+  shared_max[smi] = max_elements;
   barrier();
   // Iterate over the partial maximums to obtain the overall maximum
   group_i = tid.y * NWORKERS;
-  max_elements = shared_vecs[group_i++];
+  max_elements = shared_max[group_i++];
   for (int i = 1; i < NWORKERS; ++i, group_i++) {
-    max_elements = max(max_elements, shared_vecs[group_i]);
+    max_elements = max(max_elements, shared_max[group_i]);
   }
   // Each element of the texel is itself a partial maximum; iterate over the
   // texel to find the actual maximum
@@ -214,13 +215,13 @@ void softmax_packed_dim(const ivec2 tid, ivec3 scan_pos) {
       denominators.x += exp(intex[i] - max_element);
     }
   }
-  shared_vecs[smi] = denominators;
+  shared_sum[smi] = denominators;
   barrier();
   // Iterate over the partial sums to obtain the overall sum
   group_i = tid.y * NWORKERS;
-  denominators = shared_vecs[group_i++];
+  denominators = shared_sum[group_i++];
   for (int i = 1; i < NWORKERS; ++i, group_i++) {
-    denominators += shared_vecs[group_i];
+    denominators += shared_sum[group_i];
   }
   // Reduce over the accumulated texel to find the overall sum
   float denominator = 0;
```
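As a minimal model of what the shader computes (a Python sketch, not the GLSL itself): each of the NWORKERS lanes reduces a strided slice into its own slot of a shared array, and after a barrier every lane folds the partials together. The fix keeps the max-phase and sum-phase partials in separate buffers:

```python
# Illustrative Python model of the shader's two-phase reduction. After the
# SwiftShader workaround, the max phase and the sum phase write to distinct
# buffers (shared_max / shared_sum) instead of reusing one shared array.
import math

NWORKERS = 4
data = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5, -2.0, 0.25]

# Phase 1: partial maxima, one slot per worker (maps to shared_max[]).
shared_max = [max(data[w::NWORKERS]) for w in range(NWORKERS)]
overall_max = max(shared_max)  # each lane re-reads all partials after barrier()

# Phase 2: partial sums of exp(x - max) in a *separate* buffer (shared_sum[]).
shared_sum = [
    sum(math.exp(x - overall_max) for x in data[w::NWORKERS])
    for w in range(NWORKERS)
]
denominator = sum(shared_sum)

softmax = [math.exp(x - overall_max) / denominator for x in data]
assert abs(sum(softmax) - 1.0) < 1e-9
```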

backends/vulkan/test/utils.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -90,7 +90,9 @@ def export_model_to_vulkan(
     qmode=QuantizationMode.NONE,
 ):
     compile_options = {}
-    exported_graph = get_exported_graph(model, sample_inputs, qmode=qmode)
+    exported_graph = get_exported_graph(
+        model, sample_inputs, dynamic_shapes=dynamic_shapes, qmode=qmode
+    )
     program = export(
         exported_graph,
         sample_inputs,
```
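For context, a hedged sketch of the standard `torch.export` dynamic-shapes plumbing that `get_exported_graph` presumably forwards to; the module and dimension names here are hypothetical:

```python
# Sketch of standard torch.export dynamic-shape usage (TinyModel and the
# "batch" dim are made up for illustration; Dim/export are real torch APIs).
import torch
from torch.export import Dim, export

class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.mean(dim=-1)

batch = Dim("batch", min=1, max=8)
exported = export(
    TinyModel(),
    (torch.randn(2, 16),),
    dynamic_shapes={"x": {0: batch}},  # dim 0 of input "x" is dynamic
)
```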

backends/vulkan/utils.py

Lines changed: 11 additions & 1 deletion
```diff
@@ -128,7 +128,7 @@ def is_param_node(program: ExportedProgram, node: torch.fx.Node) -> bool:
         is_get_attr_node(node)
         or is_param(program, node)
         or is_buffer(program, node)
-        or is_constant(program, node)
+        or is_lifted_tensor_constant(program, node)
     )


@@ -1228,6 +1228,16 @@ def is_in_8bit_range(tensor: torch.Tensor) -> bool:
 ##


+def nchw_dim_to_whcn_dim(nchw_dim: int, ndim: int) -> int:
+    # Handle negative indices for nchw_dim
+    if nchw_dim < 0:
+        nchw_dim += ndim
+
+    assert nchw_dim >= 0 and nchw_dim < ndim
+    whcn_dim = (ndim - 1) - nchw_dim
+    return whcn_dim
+
+
 def get_tensor_val_str(tensor_val: FakeTensor) -> str:
     return f"{tensor_val.dtype}: {tensor_val.shape}"
```
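A quick worked example of the mapping: WHCN is simply the NCHW dim order reversed.

```python
# For a rank-4 NCHW tensor, dim i maps to WHCN index (ndim - 1) - i.
ndim = 4
for name, nchw_dim in [("N", 0), ("C", 1), ("H", 2), ("W", 3)]:
    print(name, "->", (ndim - 1) - nchw_dim)
# N -> 3, C -> 2, H -> 1, W -> 0  (hence the name WHCN)
```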

backends/vulkan/vulkan_preprocess.py

Lines changed: 24 additions & 21 deletions
```diff
@@ -8,7 +8,7 @@

 from functools import partial

-from typing import Any, Dict, final, List
+from typing import Any, Callable, Dict, final, List

 import executorch.backends.vulkan.utils as utils

@@ -56,7 +56,9 @@

 from executorch.exir.passes.sym_shape_eval_pass import ConstraintBasedSymShapeEvalPass

-from executorch.exir.program._program import _copy_module
+from executorch.exir.program._program import _copy_module, _transform
+
+from torch._export.verifier import Verifier

 from torch.export._remove_auto_functionalized_pass import (
     unsafe_remove_auto_functionalized_pass,
@@ -65,28 +67,24 @@
 DEFAULT_DEBUG_HANDLE = 65535


+class _any_op(Verifier):
+    dialect = "ANY_OP"
+
+    def allowed_op_types(self):
+        return (Callable,)
+
+
 # pyre-ignore
 def apply_passes(program: ExportedProgram, passes) -> ExportedProgram:
     for p in passes:
-        if issubclass(type(p), ExportPass) or issubclass(type(p), PassBase):
-            new_gm = program.graph_module
-            # This is a workaround to allow the memory planning pass to work without
-            # having to first apply ToOutVarPass(). See the `greedy()` function in
-            # `exir.memory_planning`; if this attribute isn't set, assertions in
-            # `collect_spec_from_nodes()` will fail.
-            if isinstance(p, MemoryPlanningPass):
-                new_gm.encounter_to_out_var_failure = True
-
-            new_gm_res = p(new_gm)
-            assert new_gm_res is not None
-            new_gm = new_gm_res.graph_module
-
+        if isinstance(p, MemoryPlanningPass) and hasattr(p, "run"):
+            p.run(program.graph_module)
+        elif issubclass(type(p), ExportPass) or issubclass(type(p), PassBase):
+            program = _transform(program, p, override_verifiers=[_any_op])
             # See the application of this function in exir/program/_program.py for more
             # details on why this step is necessary.
             if isinstance(p, SpecPropPass):
-                p.update_placeholder_tensor_specs(program, new_gm)
-
-            _copy_module(program.graph_module, new_gm)
+                p.update_placeholder_tensor_specs(program, program.graph_module)
         else:
             program = p(program)

@@ -159,17 +157,17 @@ def preprocess(  # noqa: C901
     program = apply_passes(
         program,
         [
+            FuseBatchNormPass(program),
             FusePatternsPass(program),
-            RemoveRedundantOpsTransform(),
+            FuseClampPass(),
             AddmmToLinearTransform(),
+            RemoveRedundantOpsTransform(),
             FuseQuantizedOpsTransform(program),
             ReplaceQDQPass(),
             FoldQDQPass(program),
             SqueezeUnsqueezeInputs(),
             FuseViewCopyTransform(),
             ViewCopyToSqueezeUnsqueezePass(),
-            FuseBatchNormPass(program),
-            FuseClampPass(),
         ],
     )

@@ -215,6 +213,11 @@ def preprocess(  # noqa: C901
     mem_planning_suite = MemoryPlanningAlgorithmSuite(
         algo_list=[greedy_memory_planning]
     )
+    # This is a workaround to allow the memory planning pass to work without having
+    # to first apply ToOutVarPass(). See the `greedy()` function in
+    # `exir.memory_planning`; if this attribute isn't set, assertions in
+    # `collect_spec_from_nodes()` will fail.
+    program.graph_module.encounter_to_out_var_failure = True
     program = apply_passes(
         program,
         [
```
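A toy illustration (not ExecuTorch code) of the failure mode the commit message describes: when a copy step renames nodes but the graph signature keeps the old names, parameter lookups miss and the fused weight looks like a plain input.

```python
# Toy model of the _copy_module issue: node names get lowercased while the
# graph signature still maps the original names, so lookups fail afterwards.
graph_signature = {"conv_Weight": "conv.weight"}  # node name -> parameter FQN

def is_param_node(name: str) -> bool:
    return name in graph_signature

copied_name = "conv_Weight".lower()      # what the copy step effectively does
assert is_param_node("conv_Weight")      # before: recognized as a parameter
assert not is_param_node(copied_name)    # after: "conv_weight" misses the map
```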

examples/vulkan/export.py

Lines changed: 0 additions & 9 deletions
```diff
@@ -139,11 +139,6 @@ def main() -> None:
     if args.force_fp16:
         compile_options["force_fp16"] = True

-    # Configure Edge compilation
-    edge_compile_config = EdgeCompileConfig(
-        _skip_dim_order=False,  # Proper handling for Vulkan memory format
-    )
-
     logging.info(f"Exporting model {args.model_name} with Vulkan delegate")

     # Export the model using torch.export
@@ -157,10 +152,6 @@ def main() -> None:
     # Transform and lower with Vulkan partitioner
     edge_program = to_edge_transform_and_lower(
         program,
-        compile_config=edge_compile_config,
-        transform_passes=[
-            I64toI32(edge_compile_config._skip_dim_order),
-        ],
         partitioner=[VulkanPartitioner(compile_options)],
         generate_etrecord=args.etrecord,
     )
```
