
Commit 7f80aa0

Update on "gemma3 e2e runner on cuda"
This diff introduces an e2e runner for the gemma3 model on CUDA, delegating via the AOTI library, and guards it with CI. It also includes the other infrastructure updates needed to build and run the `gemma3 e2e runner` on CUDA devices.

Differential Revision: [D85087532](https://our.internmc.facebook.com/intern/diff/D85087532/)

[ghstack-poisoned]
2 parents 05ff26e + 436bf3d commit 7f80aa0


43 files changed: +1216 −763 lines
Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-44d8d54e38c0258357d4e92e1fefe21e845947a3
+09fdbd0a0639b128f712a4f5202ed42ca4c60957

.github/workflows/cuda.yml

Lines changed: 30 additions & 8 deletions

@@ -88,14 +88,26 @@ jobs:
         PYTHON_EXECUTABLE=python source .ci/scripts/test_model.sh "${{ matrix.model }}" cmake cuda
 
   export-voxtral-cuda-artifact:
-    name: export-voxtral-cuda-artifact
+    name: export-voxtral-cuda-${{ matrix.quant.name }}
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     permissions:
       id-token: write
       contents: read
     secrets: inherit
     strategy:
       fail-fast: false
+      matrix:
+        quant:
+          - name: "non-quantized"
+            artifact: "voxtral-cuda-export"
+            extra_args: ""
+          - name: "quantized-int4-tile-packed"
+            artifact: "voxtral-cuda-quantized-int4-tile-packed"
+            extra_args: "--qlinear 4w --qlinear_encoder 4w --qlinear_packing_format tile_packed_to_4d --qlinear_encoder_packing_format tile_packed_to_4d"
+          - name: "quantized-int4-weight-only"
+            artifact: "voxtral-cuda-quantized-int4-weight-only"
+            # TODO: adding "--qlinear 4w" produces invalid results. Need further investigation.
+            extra_args: "--qlinear_encoder 4w"
     with:
       timeout: 90
       secrets-env: EXECUTORCH_HF_TOKEN

@@ -104,7 +116,7 @@ jobs:
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
       submodules: recursive
-      upload-artifact: voxtral-cuda-export
+      upload-artifact: ${{ matrix.quant.artifact }}
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
       script: |
         set -eux

@@ -122,14 +134,16 @@ jobs:
         pip list
         echo "::endgroup::"
 
-        echo "::group::Export Voxtral"
+        echo "::group::Export Voxtral (${{ matrix.quant.name }})"
+        EXTRA_ARGS="${{ matrix.quant.extra_args }}"
         optimum-cli export executorch \
           --model "mistralai/Voxtral-Mini-3B-2507" \
           --task "multimodal-text-to-text" \
           --recipe "cuda" \
           --dtype bfloat16 \
           --device cuda \
           --max_seq_len 1024 \
+          ${EXTRA_ARGS} \
           --output_dir ./
         python -m executorch.extension.audio.mel_spectrogram \
           --feature_size 128 \

@@ -142,7 +156,7 @@ jobs:
         test -f voxtral_preprocessor.pte
         echo "::endgroup::"
 
-        echo "::group::Store Voxtral Artifacts"
+        echo "::group::Store Voxtral Artifacts (${{ matrix.quant.name }})"
         mkdir -p "${RUNNER_ARTIFACT_DIR}"
         cp model.pte "${RUNNER_ARTIFACT_DIR}/"
         cp aoti_cuda_blob.ptd "${RUNNER_ARTIFACT_DIR}/"

@@ -320,22 +334,30 @@ jobs:
         echo "::endgroup::"
 
   test-voxtral-cuda-e2e:
-    name: test-voxtral-cuda-e2e
+    name: test-voxtral-cuda-e2e-${{ matrix.format.name }}
     needs: export-voxtral-cuda-artifact
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     permissions:
       id-token: write
       contents: read
     strategy:
       fail-fast: false
+      matrix:
+        format:
+          - name: "non-quantized"
+            artifact: "voxtral-cuda-export"
+          - name: "quantized-int4-tile-packed"
+            artifact: "voxtral-cuda-quantized-int4-tile-packed"
+          - name: "quantized-int4-weight-only"
+            artifact: "voxtral-cuda-quantized-int4-weight-only"
     with:
       timeout: 90
       runner: linux.g5.4xlarge.nvidia.gpu
       gpu-arch-type: cuda
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
       submodules: recursive
-      download-artifact: voxtral-cuda-export
+      download-artifact: ${{ matrix.format.artifact }}
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
       script: |
         set -eux

@@ -345,7 +367,7 @@ jobs:
         pip list
         echo "::endgroup::"
 
-        echo "::group::Prepare Voxtral Artifacts"
+        echo "::group::Prepare Voxtral Artifacts (${{ matrix.format.name }})"
         cp "${RUNNER_ARTIFACT_DIR}/model.pte" .
         cp "${RUNNER_ARTIFACT_DIR}/aoti_cuda_blob.ptd" .
         cp "${RUNNER_ARTIFACT_DIR}/voxtral_preprocessor.pte" .

@@ -374,7 +396,7 @@ jobs:
         cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
         echo "::endgroup::"
 
-        echo "::group::Run Voxtral Runner"
+        echo "::group::Run Voxtral Runner (${{ matrix.format.name }})"
         set +e
         export LD_LIBRARY_PATH=/opt/conda/lib:$LD_LIBRARY_PATH
         OUTPUT=$(cmake-out/examples/models/voxtral/voxtral_runner \
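
For reference, here is roughly what the export step resolves to for one of the new matrix entries. This is a sketch assembled from the hunks above, with the `quantized-int4-tile-packed` value of `matrix.quant.extra_args` substituted for `${EXTRA_ARGS}`; it is not an additional command introduced by this commit.

# Sketch only: the export step above with EXTRA_ARGS expanded for the
# "quantized-int4-tile-packed" matrix entry (assembled from this diff).
optimum-cli export executorch \
  --model "mistralai/Voxtral-Mini-3B-2507" \
  --task "multimodal-text-to-text" \
  --recipe "cuda" \
  --dtype bfloat16 \
  --device cuda \
  --max_seq_len 1024 \
  --qlinear 4w --qlinear_encoder 4w \
  --qlinear_packing_format tile_packed_to_4d \
  --qlinear_encoder_packing_format tile_packed_to_4d \
  --output_dir ./

The matching `test-voxtral-cuda-e2e` job then downloads the `voxtral-cuda-quantized-int4-tile-packed` artifact via `matrix.format.artifact` and runs the same runner binary against it.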

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions

@@ -24,8 +24,8 @@ For Apple, please refer to the [iOS documentation](docs/source/using-executorch-
 executorch
 ├── <a href="backends">backends</a> - Backend delegate implementations for various hardware targets. Each backend uses partitioner to split the graph into subgraphs that can be executed on specific hardware, quantizer to optimize model precision, and runtime components to execute the graph on target hardware. For details refer to the <a href="docs/source/backend-delegates-integration.md">backend documentation</a> and the <a href="docs/source/using-executorch-export.md">Export and Lowering tutorial</a> for more information.
 │ ├── <a href="backends/apple">apple</a> - Apple-specific backends.
-│ │ ├── <a href="backends/apple/coreml">coreml</a> - CoreML backend for Apple devices. See <a href="docs/source/backends-coreml.md">doc</a>.
-│ │ └── <a href="backends/apple/mps">mps</a> - Metal Performance Shaders backend for Apple devices. See <a href="docs/source/backends-mps.md">doc</a>.
+│ │ ├── <a href="backends/apple/coreml">coreml</a> - CoreML backend for Apple devices. See <a href="docs/source/backends/coreml/coreml-overview.md">doc</a>.
+│ │ └── <a href="backends/apple/mps">mps</a> - Metal Performance Shaders backend for Apple devices. See <a href="docs/source/backends/mps/mps-overview.md">doc</a>.
 │ ├── <a href="backends/arm">arm</a> - ARM architecture backends. See <a href="docs/source/backends-arm-ethos-u.md">doc</a>.
 │ ├── <a href="backends/cadence">cadence</a> - Cadence-specific backends. See <a href="docs/source/backends-cadence.md">doc</a>.
 │ ├── <a href="backends/example">example</a> - Example backend implementations.

README-wheel.md

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ The prebuilt `executorch.runtime` module included in this package provides a way
 to run ExecuTorch `.pte` files, with some restrictions:
 * Only [core ATen operators](docs/source/ir-ops-set-definition.md) are linked into the prebuilt module
 * Only the [XNNPACK backend delegate](docs/source/backends-xnnpack.md) is linked into the prebuilt module.
-* \[macOS only] [Core ML](docs/source/backends-coreml.md) and [MPS](docs/source/backends-mps.md) backend
+* \[macOS only] [Core ML](docs/source/backends/coreml/coreml-overview.md) and [MPS](docs/source/backends/mps/mps-overview.md) backend
 are also linked into the prebuilt module.
 
 Please visit the [ExecuTorch website](https://pytorch.org/executorch) for

backends/apple/coreml/README.md

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 # ExecuTorch Core ML Delegate
 
 This subtree contains the Core ML Delegate implementation for ExecuTorch.
-Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends-coreml.md).
+Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends/coreml/coreml-overview.md).
 
 ## Layout
 - `compiler/` : Lowers a module to Core ML backend.

backends/cadence/aot/ops_registrations.py

Lines changed: 0 additions & 1 deletion

@@ -60,7 +60,6 @@ def _validate_ref_impl_exists() -> None:
         "cadence::quantized_softmax.per_tensor",
         "cadence::quantized_conv2d_nchw",  # We should only support per_tensor variant, should remove
         "cadence::quantized_relu",  # We should only support per_tensor variant, should remove
-        "cadence::linalg_svd",
         "cadence::quantized_conv2d_nhwc",  # We should only support per_tensor variant, should remove
         "cadence::quantized_softmax",
         "cadence::quantized_w8a32_gru",
