Commit 26ad146
Update base for Update on "add module level benchmark for gemma3 model"

This diff adds a module-level benchmark for the Gemma3 model. It also introduces multimodal_benchmark.cpp, which replaces the original voxtral_runner.cpp and benchmarks both the Gemma3 and Voxtral models at the module level.

Differential Revision: [D84958564](https://our.internmc.facebook.com/intern/diff/D84958564/)

[ghstack-poisoned]
2 parents 2cacc74 + baa41c6 commit 26ad146
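The new multimodal_benchmark.cpp itself is not reproduced in this view. For orientation only, below is a minimal sketch of what a module-level benchmark can look like with the ExecuTorch Module API; the model path, dummy input, benchmarked method, and iteration count are placeholder assumptions, not code taken from this diff.

```cpp
// Minimal, illustrative module-level benchmark sketch (not the actual
// multimodal_benchmark.cpp from this commit). Assumes an exported model.pte
// whose "forward" method accepts a single float tensor.
#include <chrono>
#include <iostream>

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace ::executorch::extension;

int main(int argc, char** argv) {
  // Load the exported program; argv[1] is assumed to be the .pte path.
  Module module(argc > 1 ? argv[1] : "model.pte");

  // Placeholder input; a real multimodal benchmark would feed model-specific
  // inputs (token ids, audio features, or image embeddings).
  auto input = make_tensor_ptr({1, 4}, {1.0f, 2.0f, 3.0f, 4.0f});

  constexpr int kIterations = 10;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIterations; ++i) {
    const auto result = module.forward(input);
    if (!result.ok()) {
      std::cerr << "forward() failed on iteration " << i << std::endl;
      return 1;
    }
  }
  const auto end = std::chrono::steady_clock::now();
  const auto total_ms =
      std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
          .count();
  std::cout << "average latency: " << (total_ms / kIterations) << " ms"
            << std::endl;
  return 0;
}
```

A real module-level benchmark for Gemma3 or Voxtral would presumably also load the exported aoti_cuda_blob.ptd alongside the .pte and time each model method separately rather than a single generic forward call.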

43 files changed (+1216, -763 lines)
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-44d8d54e38c0258357d4e92e1fefe21e845947a3
+09fdbd0a0639b128f712a4f5202ed42ca4c60957

.github/workflows/cuda.yml

Lines changed: 30 additions & 8 deletions
@@ -88,14 +88,26 @@ jobs:
         PYTHON_EXECUTABLE=python source .ci/scripts/test_model.sh "${{ matrix.model }}" cmake cuda
 
   export-voxtral-cuda-artifact:
-    name: export-voxtral-cuda-artifact
+    name: export-voxtral-cuda-${{ matrix.quant.name }}
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     permissions:
       id-token: write
       contents: read
     secrets: inherit
     strategy:
       fail-fast: false
+      matrix:
+        quant:
+          - name: "non-quantized"
+            artifact: "voxtral-cuda-export"
+            extra_args: ""
+          - name: "quantized-int4-tile-packed"
+            artifact: "voxtral-cuda-quantized-int4-tile-packed"
+            extra_args: "--qlinear 4w --qlinear_encoder 4w --qlinear_packing_format tile_packed_to_4d --qlinear_encoder_packing_format tile_packed_to_4d"
+          - name: "quantized-int4-weight-only"
+            artifact: "voxtral-cuda-quantized-int4-weight-only"
+            # TODO: adding "--qlinear 4w" produces invalid results. Need further investigation.
+            extra_args: "--qlinear_encoder 4w"
     with:
       timeout: 90
       secrets-env: EXECUTORCH_HF_TOKEN
@@ -104,7 +116,7 @@ jobs:
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
       submodules: recursive
-      upload-artifact: voxtral-cuda-export
+      upload-artifact: ${{ matrix.quant.artifact }}
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
       script: |
         set -eux
@@ -122,14 +134,16 @@ jobs:
         pip list
         echo "::endgroup::"
 
-        echo "::group::Export Voxtral"
+        echo "::group::Export Voxtral (${{ matrix.quant.name }})"
+        EXTRA_ARGS="${{ matrix.quant.extra_args }}"
         optimum-cli export executorch \
           --model "mistralai/Voxtral-Mini-3B-2507" \
           --task "multimodal-text-to-text" \
           --recipe "cuda" \
           --dtype bfloat16 \
           --device cuda \
           --max_seq_len 1024 \
+          ${EXTRA_ARGS} \
           --output_dir ./
         python -m executorch.extension.audio.mel_spectrogram \
           --feature_size 128 \
@@ -142,7 +156,7 @@ jobs:
         test -f voxtral_preprocessor.pte
         echo "::endgroup::"
 
-        echo "::group::Store Voxtral Artifacts"
+        echo "::group::Store Voxtral Artifacts (${{ matrix.quant.name }})"
         mkdir -p "${RUNNER_ARTIFACT_DIR}"
         cp model.pte "${RUNNER_ARTIFACT_DIR}/"
         cp aoti_cuda_blob.ptd "${RUNNER_ARTIFACT_DIR}/"
@@ -201,22 +215,30 @@ jobs:
         echo "::endgroup::"
 
   test-voxtral-cuda-e2e:
-    name: test-voxtral-cuda-e2e
+    name: test-voxtral-cuda-e2e-${{ matrix.format.name }}
    needs: export-voxtral-cuda-artifact
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     permissions:
       id-token: write
       contents: read
     strategy:
       fail-fast: false
+      matrix:
+        format:
+          - name: "non-quantized"
+            artifact: "voxtral-cuda-export"
+          - name: "quantized-int4-tile-packed"
+            artifact: "voxtral-cuda-quantized-int4-tile-packed"
+          - name: "quantized-int4-weight-only"
+            artifact: "voxtral-cuda-quantized-int4-weight-only"
     with:
       timeout: 90
       runner: linux.g5.4xlarge.nvidia.gpu
       gpu-arch-type: cuda
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
       submodules: recursive
-      download-artifact: voxtral-cuda-export
+      download-artifact: ${{ matrix.format.artifact }}
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
       script: |
         set -eux
@@ -226,7 +248,7 @@ jobs:
         pip list
         echo "::endgroup::"
 
-        echo "::group::Prepare Voxtral Artifacts"
+        echo "::group::Prepare Voxtral Artifacts (${{ matrix.format.name }})"
         cp "${RUNNER_ARTIFACT_DIR}/model.pte" .
         cp "${RUNNER_ARTIFACT_DIR}/aoti_cuda_blob.ptd" .
         cp "${RUNNER_ARTIFACT_DIR}/voxtral_preprocessor.pte" .
@@ -255,7 +277,7 @@ jobs:
         cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
         echo "::endgroup::"
 
-        echo "::group::Run Voxtral Runner"
+        echo "::group::Run Voxtral Runner (${{ matrix.format.name }})"
         set +e
         export LD_LIBRARY_PATH=/opt/conda/lib:$LD_LIBRARY_PATH
         OUTPUT=$(cmake-out/examples/models/voxtral/voxtral_runner \

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
@@ -24,8 +24,8 @@ For Apple, please refer to the [iOS documentation](docs/source/using-executorch-
 executorch
 ├── <a href="backends">backends</a> - Backend delegate implementations for various hardware targets. Each backend uses partitioner to split the graph into subgraphs that can be executed on specific hardware, quantizer to optimize model precision, and runtime components to execute the graph on target hardware. For details refer to the <a href="docs/source/backend-delegates-integration.md">backend documentation</a> and the <a href="docs/source/using-executorch-export.md">Export and Lowering tutorial</a> for more information.
 │   ├── <a href="backends/apple">apple</a> - Apple-specific backends.
-│   │   ├── <a href="backends/apple/coreml">coreml</a> - CoreML backend for Apple devices. See <a href="docs/source/backends-coreml.md">doc</a>.
-│   │   └── <a href="backends/apple/mps">mps</a> - Metal Performance Shaders backend for Apple devices. See <a href="docs/source/backends-mps.md">doc</a>.
+│   │   ├── <a href="backends/apple/coreml">coreml</a> - CoreML backend for Apple devices. See <a href="docs/source/backends/coreml/coreml-overview.md">doc</a>.
+│   │   └── <a href="backends/apple/mps">mps</a> - Metal Performance Shaders backend for Apple devices. See <a href="docs/source/backends/mps/mps-overview.md">doc</a>.
 │   ├── <a href="backends/arm">arm</a> - ARM architecture backends. See <a href="docs/source/backends-arm-ethos-u.md">doc</a>.
 │   ├── <a href="backends/cadence">cadence</a> - Cadence-specific backends. See <a href="docs/source/backends-cadence.md">doc</a>.
 │   ├── <a href="backends/example">example</a> - Example backend implementations.

README-wheel.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ The prebuilt `executorch.runtime` module included in this package provides a way
 to run ExecuTorch `.pte` files, with some restrictions:
 * Only [core ATen operators](docs/source/ir-ops-set-definition.md) are linked into the prebuilt module
 * Only the [XNNPACK backend delegate](docs/source/backends-xnnpack.md) is linked into the prebuilt module.
-* \[macOS only] [Core ML](docs/source/backends-coreml.md) and [MPS](docs/source/backends-mps.md) backend
+* \[macOS only] [Core ML](docs/source/backends/coreml/coreml-overview.md) and [MPS](docs/source/backends/mps/mps-overview.md) backend
   are also linked into the prebuilt module.
 
 Please visit the [ExecuTorch website](https://pytorch.org/executorch) for

backends/apple/coreml/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # ExecuTorch Core ML Delegate
 
 This subtree contains the Core ML Delegate implementation for ExecuTorch.
-Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends-coreml.md).
+Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends/coreml/coreml-overview.md).
 
 ## Layout
 - `compiler/` : Lowers a module to Core ML backend.

backends/cadence/aot/ops_registrations.py

Lines changed: 0 additions & 1 deletion
@@ -60,7 +60,6 @@ def _validate_ref_impl_exists() -> None:
         "cadence::quantized_softmax.per_tensor",
         "cadence::quantized_conv2d_nchw",  # We should only support per_tensor variant, should remove
         "cadence::quantized_relu",  # We should only support per_tensor variant, should remove
-        "cadence::linalg_svd",
         "cadence::quantized_conv2d_nhwc",  # We should only support per_tensor variant, should remove
         "cadence::quantized_softmax",
         "cadence::quantized_w8a32_gru",
