
Commit 4328bb9

Merge branch 'main' into docs/jhelsby/new-contributor-guide-update
2 parents b5e7a75 + d9c31fa commit 4328bb9

File tree

8 files changed (+97, -130 lines)

backends/apple/coreml/README.md

Lines changed: 1 addition & 106 deletions
@@ -1,8 +1,7 @@
 # ExecuTorch Core ML Delegate
 
-
 This subtree contains the Core ML Delegate implementation for ExecuTorch.
-Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices.
+Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends-coreml.md).
 
 ## Layout
 - `compiler/` : Lowers a module to Core ML backend.
@@ -19,110 +18,6 @@
 - `workspace` : Xcode workspace for the runtime.
 - `third-party/`: External dependencies.
 
-## Partition and Delegation
-
-To delegate a Program to the **Core ML** backend, the client must call `to_backend` with the **CoreMLPartitioner**.
-
-```python
-import torch
-import executorch.exir
-
-from executorch.backends.apple.coreml.compiler import CoreMLBackend
-from executorch.backends.apple.coreml.partition import CoreMLPartitioner
-
-class Model(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, x):
-        return torch.sin(x)
-
-source_model = Model()
-example_inputs = (torch.ones(1), )
-
-# Export the source model to Edge IR representation
-aten_program = torch.export.export(source_model, example_inputs)
-edge_program_manager = executorch.exir.to_edge(aten_program)
-
-# Delegate to Core ML backend
-delegated_program_manager = edge_program_manager.to_backend(CoreMLPartitioner())
-
-# Serialize delegated program
-executorch_program = delegated_program_manager.to_executorch()
-with open("model.pte", "wb") as f:
-    f.write(executorch_program.buffer)
-```
-
-The module will be fully or partially delegated to **Core ML**, depending on whether all or part of ops are supported by the **Core ML** backend. User may force skip certain ops by `CoreMLPartitioner(skip_ops_for_coreml_delegation=...)`
-
-The `to_backend` implementation is a thin wrapper over [coremltools](https://apple.github.io/coremltools/docs-guides/), `coremltools` is responsible for converting an **ExportedProgram** to a **MLModel**. The converted **MLModel** data is saved, flattened, and returned as bytes to **ExecuTorch**.
-
-## Quantization
-
-To quantize a Program in a Core ML favored way, the client may utilize **CoreMLQuantizer**.
-
-```python
-import torch
-import executorch.exir
-
-from torch.export import export_for_training
-from torch.ao.quantization.quantize_pt2e import (
-    convert_pt2e,
-    prepare_pt2e,
-    prepare_qat_pt2e,
-)
-
-from executorch.backends.apple.coreml.quantizer import CoreMLQuantizer
-from coremltools.optimize.torch.quantization.quantization_config import (
-    LinearQuantizerConfig,
-    QuantizationScheme,
-)
-
-class Model(torch.nn.Module):
-    def __init__(self) -> None:
-        super().__init__()
-        self.conv = torch.nn.Conv2d(
-            in_channels=3, out_channels=16, kernel_size=3, padding=1
-        )
-        self.relu = torch.nn.ReLU()
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        a = self.conv(x)
-        return self.relu(a)
-
-source_model = Model()
-example_inputs = (torch.randn((1, 3, 256, 256)), )
-
-pre_autograd_aten_dialect = export_for_training(source_model, example_inputs).module()
-
-quantization_config = LinearQuantizerConfig.from_dict(
-    {
-        "global_config": {
-            "quantization_scheme": QuantizationScheme.symmetric,
-            "activation_dtype": torch.quint8,
-            "weight_dtype": torch.qint8,
-            "weight_per_channel": True,
-        }
-    }
-)
-quantizer = CoreMLQuantizer(quantization_config)
-
-# For post-training quantization, use `prepare_pt2e`
-# For quantization-aware trainin,g use `prepare_qat_pt2e`
-prepared_graph = prepare_pt2e(pre_autograd_aten_dialect, quantizer)
-
-prepared_graph(*example_inputs)
-converted_graph = convert_pt2e(prepared_graph)
-```
-
-The `converted_graph` is the quantized torch model, and can be delegated to **Core ML** similarly through **CoreMLPartitioner**
-
-## Runtime
-
-To execute a Core ML delegated program, the application must link to the `coremldelegate` library. Once linked there are no additional steps required, ExecuTorch when running the program would call the Core ML runtime to execute the Core ML delegated part of the program.
-
-Please follow the instructions described in the [Core ML setup](/backends/apple/coreml/setup.md) to link the `coremldelegate` library.
-
 ## Help & Improvements
 If you have problems or questions or have suggestions for ways to make
 implementation and testing better, please create an issue on [github](https://www.github.com/pytorch/executorch/issues).

backends/cadence/aot/compiler.py

Lines changed: 6 additions & 3 deletions
@@ -31,11 +31,11 @@
     EdgeProgramManager,
     ExecutorchBackendConfig,
     ExecutorchProgramManager,
-    to_edge,
 )
 from executorch.exir.pass_base import PassResult
 from executorch.exir.passes import ToOutVarPass
 from executorch.exir.passes.sym_shape_eval_pass import HintBasedSymShapeEvalPass
+from executorch.exir.program._program import to_edge_with_preserved_ops
 from torch._inductor.decomposition import remove_decompositions
 from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
 
@@ -80,6 +80,7 @@ def convert_pt2(
         torch.ops.aten.layer_norm.default,
         torch.ops.aten.linear.default,
         torch.ops.aten.matmul.default,
+        torch.ops.aten.rms_norm.default,
     ]
     # Remove decompositions for the ops we want to keep
     # pyre-fixme[6]: For 1st argument expected `Dict[typing.Callable[..., typing.Any
@@ -201,9 +202,9 @@ def lower_ep_to_edge(
     """
     Lower an ExportedProgram to an EdgeProgramManager (in edge IR).
     """
-    # Call to_edge to convert the graph to edge IR.
+    # Call to_edge_with_preserved_ops to convert the graph to edge IR.
     # Note: dim_order is skipped (https://github.com/pytorch/executorch/issues/3704)
-    edge_prog_manager = to_edge(
+    edge_prog_manager = to_edge_with_preserved_ops(
         expo_program,
         compile_config=EdgeCompileConfig(
             _skip_dim_order=True,
@@ -216,9 +217,11 @@
                 torch.ops.aten.linalg_vector_norm.default,
                 torch.ops.aten.unfold.default,
                 torch.ops.aten.angle.default,
+                torch.ops.aten.rms_norm.default,
             ],
         ),
         constant_methods=constant_methods,
+        preserve_ops=(torch.ops.aten.rms_norm.default,),
     )
 
     if dump_graphs:
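
For orientation, here is a rough sketch of what the new lowering path looks like from a caller's side. It assumes a PyTorch build that provides `torch.nn.functional.rms_norm` and `torch.ops.aten.rms_norm.default`; the `Norm` module, input shape, and omission of `constant_methods` are illustrative rather than taken from this commit.

```python
# Hedged sketch (not part of this commit): preserve aten.rms_norm through edge
# lowering so a backend can match the op directly instead of its decomposition.
import torch

from executorch.exir import EdgeCompileConfig
from executorch.exir.program._program import to_edge_with_preserved_ops


class Norm(torch.nn.Module):
    def forward(self, x):
        # Assumes torch.nn.functional.rms_norm is available in this PyTorch build.
        return torch.nn.functional.rms_norm(x, (x.shape[-1],))


ep = torch.export.export(Norm(), (torch.randn(2, 8),))

# preserve_ops keeps aten.rms_norm.default from being decomposed during edge
# lowering, mirroring the call made in lower_ep_to_edge above.
edge = to_edge_with_preserved_ops(
    ep,
    compile_config=EdgeCompileConfig(_skip_dim_order=True),
    preserve_ops=(torch.ops.aten.rms_norm.default,),
)
```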

backends/cadence/aot/ops_registrations.py

Lines changed: 0 additions & 4 deletions
@@ -139,7 +139,6 @@
     "int in_zero_point, bool channel_last=False) -> (Tensor out)"
 )
 lib.define("linalg_vector_norm(Tensor X) -> (Tensor Y)")
-lib.define("rms_norm(Tensor X, float eps, Tensor W) -> (Tensor Y)")
 lib.define(
     "transposed_im2row(Tensor input, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride, "
     "int[2] output_padding, Tensor in_zero_point, bool channel_last=False) -> (Tensor out)"
@@ -211,9 +210,6 @@
     "fully_connected.out(Tensor input, Tensor weight, Tensor? bias=None, *, Tensor(a!) out) -> Tensor(a!)"
 )
 lib.define("linalg_vector_norm.out(Tensor X, *, Tensor(a!) out) -> Tensor(a!)")
-lib.define(
-    "rms_norm.out(Tensor X, float eps, Tensor W, *, Tensor(a!) out) -> Tensor(a!)"
-)
 lib.define(
     "quantized_fully_connected.out(Tensor src, Tensor weight, Tensor bias, int src_zero_point, "
     "Tensor weight_zero_point, Tensor out_multiplier, Tensor out_shift, int out_zero_point, Tensor? offset, *, Tensor(a!) out) -> Tensor(a!)"

exir/emit/_emitter.py

Lines changed: 19 additions & 6 deletions
@@ -1640,13 +1640,26 @@ def placeholder( # noqa: C901
                 else:
                     spec.extra_tensor_info.fully_qualified_name = fqn
                     spec.extra_tensor_info.location = TensorDataLocation.EXTERNAL
-            if self.emitter_state.emit_mutable_buffer_names and is_mutable_buffer:
-                if spec.extra_tensor_info is None:
-                    spec.extra_tensor_info = ExtraTensorInfo(
-                        fully_qualified_name=fqn, location=TensorDataLocation.SEGMENT
+
+            if is_mutable_buffer:
+                # Emit names if we are supposed to.
+                if self.emitter_state.emit_mutable_buffer_names:
+                    if spec.extra_tensor_info is None:
+                        spec.extra_tensor_info = ExtraTensorInfo(
+                            fully_qualified_name=fqn,
+                            location=TensorDataLocation.SEGMENT,
+                        )
+                    else:
+                        spec.extra_tensor_info.fully_qualified_name = fqn
+                # if We aren't emitting the name then it needs to be memory planned.
+                elif spec.mem_id is None or spec.mem_offset is None:
+                    raise InternalError(
+                        self._emit_node_specific_error(
+                            self.node,
+                            # [2:] to remove the b_ prefix buffers get
+                            f'Mutable buffer "{target[2:]}" must have a memory id and offset if we are emitting it without a name. Please either memory plan your mutable buffers or call to_executorch with config=ExecutorchBackendConfig(emit_mutable_buffer_names=True)',
+                        )
                     )
-                else:
-                    spec.extra_tensor_info.fully_qualified_name = fqn
 
             # From the fqn find the corresponding tensor
             real_tensor = None
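
In short, the emitter now enforces one rule for mutable buffers: either memory plan them (the default) or emit their fully qualified names. A hedged sketch of the second option follows; the `Counter` module is illustrative and not part of this commit.

```python
# Hedged sketch (not part of this commit): emit mutable buffer names so the
# buffer does not need planned storage. With neither a plan nor a name, the
# emitter now raises InternalError.
import torch

from executorch.exir import ExecutorchBackendConfig, to_edge


class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("step", torch.zeros(1))

    def forward(self, x):
        self.step.add_(1)
        return x + self.step


edge = to_edge(torch.export.export(Counter(), (torch.randn(1),)))
prog = edge.to_executorch(
    config=ExecutorchBackendConfig(emit_mutable_buffer_names=True)
)
```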

exir/emit/test/test_emit.py

Lines changed: 33 additions & 1 deletion
@@ -1838,8 +1838,40 @@ def forward(self, x):
         ep = to_edge(ep)
         # Lower the graph to executorch.
         ep = ep.to_executorch(
-            config=ExecutorchBackendConfig(emit_mutable_buffer_names=True)
+            config=ExecutorchBackendConfig(
+                emit_mutable_buffer_names=True,
+                memory_planning_pass=MemoryPlanningPass(alloc_mutable_buffers=False),
+            )
         )
         for val in ep.executorch_program.execution_plan[0].values:
             if isinstance(val, Tensor) and val.extra_tensor_info:
                 self.assertEqual(val.extra_tensor_info.fully_qualified_name, "buffer")
+                self.assertEqual(val.allocation_info, None)
+
+    def test_emit_mutable_buffer_names_fails(self) -> None:
+        class Net(nn.Module):
+            def __init__(self):
+                super().__init__()
+                self.linear = nn.Linear(2, 2)
+                self.register_buffer("buffer", torch.zeros(1, 2))
+
+            def forward(self, x):
+                self.buffer.add_(1)
+                return self.linear(x) + self.buffer
+
+        net = Net()
+
+        ep = export(net, (torch.randn(1, 2),), strict=True)
+        # Lower the graph to edge dialect.
+        ep = to_edge(ep)
+        # Lower the graph to executorch.
+        # Must emit mutable buffer names if we don't allocate mutable buffers
+        with self.assertRaises(InternalError):
+            ep.to_executorch(
+                config=ExecutorchBackendConfig(
+                    emit_mutable_buffer_names=False,
+                    memory_planning_pass=MemoryPlanningPass(
+                        alloc_mutable_buffers=False
+                    ),
+                )
+            )

exir/memory_planning.py

Lines changed: 10 additions & 2 deletions
@@ -44,12 +44,14 @@ def __init__(
         graph_module: torch.fx.GraphModule,
         alloc_graph_input: bool,
         alloc_graph_output: bool,
+        alloc_mutable_buffers: bool,
         graph_signature: Optional[ExportGraphSignature] = None,
     ) -> None:
         self.graph_module = graph_module
         self.graph_signature = graph_signature
         self.alloc_graph_input = alloc_graph_input
         self.alloc_graph_output = alloc_graph_output
+        self.alloc_mutable_buffers = alloc_mutable_buffers
 
     @classmethod
     def mem_obj_id_match(
@@ -149,6 +151,7 @@ def verify_storage_reuse(
             ignore_const=True,
             ignore_graph_input=not self.alloc_graph_input,
             ignore_graph_output=not self.alloc_graph_output,
+            ignore_mutable_buffers=not self.alloc_mutable_buffers,
             do_assertion=False,
             ignore_out_var_node=False,
             dedup=True,
@@ -374,6 +377,7 @@ def collect_specs_from_nodes( # noqa: C901
     graph_signature: Optional[ExportGraphSignature] = None,
     ignore_graph_input: bool = False,
     ignore_graph_output: bool = False,
+    ignore_mutable_buffers: bool = False,
     ignore_const: bool = True,
     ignore_out_var_node: bool = True,
     dedup: bool = True,
@@ -414,6 +418,9 @@ def collect_specs_from_nodes( # noqa: C901
         if _is_inplace_node(node):
             continue
 
+        if _is_mutable_buffer(node, graph_signature) and ignore_mutable_buffers:
+            continue
+
         if do_assertion:
             internal_assert(
                 node.op in ("placeholder", "output")
@@ -469,6 +476,7 @@ def update_all_tensors_lifetime(
     Set the lifetime for all the tensors encountered in the Fx graph.
     """
     specs = set()
+
     for node_idx, node in enumerate(graph_module.graph.nodes):
         for spec in collect_specs_from_nodes(
             filter_nodes(itertools.chain([node], node.args, node.kwargs.values())),
@@ -1053,6 +1061,7 @@ def apply_algo(
     graph_signature: Optional[ExportGraphSignature] = None,
     alloc_graph_input: bool = True,
     alloc_graph_output: bool = True,
+    alloc_mutable_buffers: bool = True,
 ) -> List[int]:
     """
     Recursively apply algo to graph_module and its submodules for control flow.
@@ -1065,19 +1074,18 @@
     storage with tensors in the outer module.
     TODO: make these optimizations once we have some baseline working.
     """
-
     # Extract the nodes and their lifespans from the graph_module
     # Difficult to just filter the list of specs returned by this due to
     # how we flag trainable weights.
     _ = update_all_tensors_lifetime(graph_module, graph_signature)
-
     # Filter specs based on alloc_graph_input and alloc_graph_output
     specs = collect_specs_from_nodes(
         graph_module.graph.nodes,
         graph_signature,
         do_assertion=False,
         ignore_graph_input=not alloc_graph_input,
         ignore_graph_output=not alloc_graph_output,
+        ignore_mutable_buffers=not alloc_mutable_buffers,
     )
 
     # Get extra padding for XNNPACK if needed
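
The new flag defaults to True everywhere, so existing callers keep planning storage for mutable buffers; it only changes behavior when a caller opts out. A hedged sketch of that opt-out, mirroring the configuration used in the test above (the pass constructor argument is wired through in the next file):

```python
# Hedged sketch (not part of this commit): skip memory planning for mutable
# buffers. collect_specs_from_nodes then ignores their specs, so the buffers
# must be emitted by name per the emitter change earlier in this commit.
from executorch.exir import ExecutorchBackendConfig
from executorch.exir.passes.memory_planning_pass import MemoryPlanningPass

config = ExecutorchBackendConfig(
    emit_mutable_buffer_names=True,
    memory_planning_pass=MemoryPlanningPass(alloc_mutable_buffers=False),
)
```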

exir/passes/memory_planning_pass.py

Lines changed: 7 additions & 2 deletions
@@ -44,6 +44,7 @@ def __init__(
         allow_lifetime_and_storage_overlap: bool = False,
         alloc_graph_input: bool = True,
         alloc_graph_output: bool = True,
+        alloc_mutable_buffers: bool = True,
         alignment: int = ALIGNMENT,
     ) -> None:
         r"""
@@ -54,10 +55,11 @@
         """
         if memory_planning_algo is None:
             memory_planning_algo = MemoryPlanningAlgorithmSuite()
-        self.memory_planning_algo = memory_planning_algo
+        self.memory_planning_algo: Callable[..., List[int]] = memory_planning_algo
         self.allow_lifetime_and_storage_overlap = allow_lifetime_and_storage_overlap
         self.alloc_graph_input = alloc_graph_input
         self.alloc_graph_output = alloc_graph_output
+        self.alloc_mutable_buffers = alloc_mutable_buffers
         self.alignment = alignment
 
     def _set_alloc_node_spec(self, graph_module: torch.fx.GraphModule) -> None:
@@ -124,13 +126,15 @@ def run(
         # customized fields. Using the graph_module object to convey information across
         # passes/stages is quite natural and avoid yet another 'context' data structure
         # to do the job.
+
         _ = apply_algo(
-            self.memory_planning_algo,  # pyre-ignore[6]
+            self.memory_planning_algo,
             graph_module,
             self.alignment,
             graph_signature,
             self.alloc_graph_input,
             self.alloc_graph_output,
+            self.alloc_mutable_buffers,
         )
 
         # TODO: make the verifier do the work recursively to handle
@@ -139,6 +143,7 @@
             graph_module,
             self.alloc_graph_input,
             self.alloc_graph_output,
+            self.alloc_mutable_buffers,
             graph_signature,
         )
 
0 commit comments

Comments
 (0)