
Commit 436bf3d

Update base for Update on "gemma3 e2e runner on cuda"
This diff introduces an end-to-end (e2e) runner for the gemma3 model on CUDA, delegated via the AOTI library and guarded by CI. It also includes the infrastructure updates needed to build and run the `gemma3 e2e runner` on CUDA devices. Differential Revision: [D85087532](https://our.internmc.facebook.com/intern/diff/D85087532/) [ghstack-poisoned]
2 parents 96704b5 + baa41c6 commit 436bf3d

43 files changed: +1216 −763 lines
Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-44d8d54e38c0258357d4e92e1fefe21e845947a3
+09fdbd0a0639b128f712a4f5202ed42ca4c60957

.github/workflows/cuda.yml

Lines changed: 30 additions & 8 deletions

@@ -88,14 +88,26 @@ jobs:
       PYTHON_EXECUTABLE=python source .ci/scripts/test_model.sh "${{ matrix.model }}" cmake cuda

   export-voxtral-cuda-artifact:
-    name: export-voxtral-cuda-artifact
+    name: export-voxtral-cuda-${{ matrix.quant.name }}
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     permissions:
       id-token: write
       contents: read
     secrets: inherit
     strategy:
       fail-fast: false
+      matrix:
+        quant:
+          - name: "non-quantized"
+            artifact: "voxtral-cuda-export"
+            extra_args: ""
+          - name: "quantized-int4-tile-packed"
+            artifact: "voxtral-cuda-quantized-int4-tile-packed"
+            extra_args: "--qlinear 4w --qlinear_encoder 4w --qlinear_packing_format tile_packed_to_4d --qlinear_encoder_packing_format tile_packed_to_4d"
+          - name: "quantized-int4-weight-only"
+            artifact: "voxtral-cuda-quantized-int4-weight-only"
+            # TODO: adding "--qlinear 4w" produces invalid results. Need further investigation.
+            extra_args: "--qlinear_encoder 4w"
     with:
       timeout: 90
       secrets-env: EXECUTORCH_HF_TOKEN
@@ -104,7 +116,7 @@
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
       submodules: recursive
-      upload-artifact: voxtral-cuda-export
+      upload-artifact: ${{ matrix.quant.artifact }}
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
       script: |
         set -eux
@@ -122,14 +134,16 @@
         pip list
         echo "::endgroup::"

-        echo "::group::Export Voxtral"
+        echo "::group::Export Voxtral (${{ matrix.quant.name }})"
+        EXTRA_ARGS="${{ matrix.quant.extra_args }}"
         optimum-cli export executorch \
             --model "mistralai/Voxtral-Mini-3B-2507" \
             --task "multimodal-text-to-text" \
             --recipe "cuda" \
             --dtype bfloat16 \
             --device cuda \
             --max_seq_len 1024 \
+            ${EXTRA_ARGS} \
             --output_dir ./
         python -m executorch.extension.audio.mel_spectrogram \
             --feature_size 128 \
@@ -142,7 +156,7 @@
         test -f voxtral_preprocessor.pte
         echo "::endgroup::"

-        echo "::group::Store Voxtral Artifacts"
+        echo "::group::Store Voxtral Artifacts (${{ matrix.quant.name }})"
         mkdir -p "${RUNNER_ARTIFACT_DIR}"
         cp model.pte "${RUNNER_ARTIFACT_DIR}/"
         cp aoti_cuda_blob.ptd "${RUNNER_ARTIFACT_DIR}/"
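
For reference, here is roughly how the "quantized-int4-tile-packed" matrix entry expands once EXTRA_ARGS is substituted into the export step above — a sketch assembled from the flags in this hunk, not a verbatim CI transcript:

    # Export Voxtral for CUDA with int4 tile-packed quantization
    # (EXTRA_ARGS from the matrix entry substituted inline).
    optimum-cli export executorch \
        --model "mistralai/Voxtral-Mini-3B-2507" \
        --task "multimodal-text-to-text" \
        --recipe "cuda" \
        --dtype bfloat16 \
        --device cuda \
        --max_seq_len 1024 \
        --qlinear 4w \
        --qlinear_encoder 4w \
        --qlinear_packing_format tile_packed_to_4d \
        --qlinear_encoder_packing_format tile_packed_to_4d \
        --output_dir ./

The "quantized-int4-weight-only" entry passes only `--qlinear_encoder 4w`, per the TODO noting that `--qlinear 4w` currently produces invalid results for that variant.
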
@@ -306,22 +320,30 @@
         echo "::endgroup::"

   test-voxtral-cuda-e2e:
-    name: test-voxtral-cuda-e2e
+    name: test-voxtral-cuda-e2e-${{ matrix.format.name }}
     needs: export-voxtral-cuda-artifact
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     permissions:
       id-token: write
       contents: read
     strategy:
       fail-fast: false
+      matrix:
+        format:
+          - name: "non-quantized"
+            artifact: "voxtral-cuda-export"
+          - name: "quantized-int4-tile-packed"
+            artifact: "voxtral-cuda-quantized-int4-tile-packed"
+          - name: "quantized-int4-weight-only"
+            artifact: "voxtral-cuda-quantized-int4-weight-only"
     with:
       timeout: 90
       runner: linux.g5.4xlarge.nvidia.gpu
       gpu-arch-type: cuda
       gpu-arch-version: 12.6
       use-custom-docker-registry: false
       submodules: recursive
-      download-artifact: voxtral-cuda-export
+      download-artifact: ${{ matrix.format.artifact }}
       ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
       script: |
         set -eux
@@ -331,7 +353,7 @@
         pip list
         echo "::endgroup::"

-        echo "::group::Prepare Voxtral Artifacts"
+        echo "::group::Prepare Voxtral Artifacts (${{ matrix.format.name }})"
         cp "${RUNNER_ARTIFACT_DIR}/model.pte" .
         cp "${RUNNER_ARTIFACT_DIR}/aoti_cuda_blob.ptd" .
         cp "${RUNNER_ARTIFACT_DIR}/voxtral_preprocessor.pte" .
@@ -360,7 +382,7 @@
         cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
         echo "::endgroup::"

-        echo "::group::Run Voxtral Runner"
+        echo "::group::Run Voxtral Runner (${{ matrix.format.name }})"
         set +e
         export LD_LIBRARY_PATH=/opt/conda/lib:$LD_LIBRARY_PATH
         OUTPUT=$(cmake-out/examples/models/voxtral/voxtral_runner \
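
Taken together, the e2e hunks above amount to roughly the following local approximation for a single matrix entry. This is a sketch of the steps visible in this diff: it assumes the chosen artifact (e.g. voxtral-cuda-quantized-int4-tile-packed) was already downloaded into ${RUNNER_ARTIFACT_DIR}, the cmake configure step is outside this diff, and the runner's command-line arguments are truncated here, so they are left elided.

    # Stage the exported artifacts next to the build tree.
    cp "${RUNNER_ARTIFACT_DIR}/model.pte" .
    cp "${RUNNER_ARTIFACT_DIR}/aoti_cuda_blob.ptd" .
    cp "${RUNNER_ARTIFACT_DIR}/voxtral_preprocessor.pte" .
    # Build only the runner target (configure step not shown in this diff).
    cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
    # Run the runner; its arguments are cut off in this diff and omitted here.
    export LD_LIBRARY_PATH=/opt/conda/lib:$LD_LIBRARY_PATH
    cmake-out/examples/models/voxtral/voxtral_runner ...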

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions

@@ -24,8 +24,8 @@ For Apple, please refer to the [iOS documentation](docs/source/using-executorch-
 executorch
 ├── <a href="backends">backends</a> - Backend delegate implementations for various hardware targets. Each backend uses partitioner to split the graph into subgraphs that can be executed on specific hardware, quantizer to optimize model precision, and runtime components to execute the graph on target hardware. For details refer to the <a href="docs/source/backend-delegates-integration.md">backend documentation</a> and the <a href="docs/source/using-executorch-export.md">Export and Lowering tutorial</a> for more information.
 │ ├── <a href="backends/apple">apple</a> - Apple-specific backends.
-│ │ ├── <a href="backends/apple/coreml">coreml</a> - CoreML backend for Apple devices. See <a href="docs/source/backends-coreml.md">doc</a>.
-│ │ └── <a href="backends/apple/mps">mps</a> - Metal Performance Shaders backend for Apple devices. See <a href="docs/source/backends-mps.md">doc</a>.
+│ │ ├── <a href="backends/apple/coreml">coreml</a> - CoreML backend for Apple devices. See <a href="docs/source/backends/coreml/coreml-overview.md">doc</a>.
+│ │ └── <a href="backends/apple/mps">mps</a> - Metal Performance Shaders backend for Apple devices. See <a href="docs/source/backends/mps/mps-overview.md">doc</a>.
 │ ├── <a href="backends/arm">arm</a> - ARM architecture backends. See <a href="docs/source/backends-arm-ethos-u.md">doc</a>.
 │ ├── <a href="backends/cadence">cadence</a> - Cadence-specific backends. See <a href="docs/source/backends-cadence.md">doc</a>.
 │ ├── <a href="backends/example">example</a> - Example backend implementations.

README-wheel.md

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ The prebuilt `executorch.runtime` module included in this package provides a way
 to run ExecuTorch `.pte` files, with some restrictions:
 * Only [core ATen operators](docs/source/ir-ops-set-definition.md) are linked into the prebuilt module
 * Only the [XNNPACK backend delegate](docs/source/backends-xnnpack.md) is linked into the prebuilt module.
-* \[macOS only] [Core ML](docs/source/backends-coreml.md) and [MPS](docs/source/backends-mps.md) backend
+* \[macOS only] [Core ML](docs/source/backends/coreml/coreml-overview.md) and [MPS](docs/source/backends/mps/mps-overview.md) backend
   are also linked into the prebuilt module.

 Please visit the [ExecuTorch website](https://pytorch.org/executorch) for

backends/apple/coreml/README.md

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 # ExecuTorch Core ML Delegate

 This subtree contains the Core ML Delegate implementation for ExecuTorch.
-Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends-coreml.md).
+Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends/coreml/coreml-overview.md).

 ## Layout
 - `compiler/` : Lowers a module to Core ML backend.

backends/cadence/aot/ops_registrations.py

Lines changed: 0 additions & 1 deletion

@@ -60,7 +60,6 @@ def _validate_ref_impl_exists() -> None:
     "cadence::quantized_softmax.per_tensor",
     "cadence::quantized_conv2d_nchw",  # We should only support per_tensor variant, should remove
     "cadence::quantized_relu",  # We should only support per_tensor variant, should remove
-    "cadence::linalg_svd",
     "cadence::quantized_conv2d_nhwc",  # We should only support per_tensor variant, should remove
     "cadence::quantized_softmax",
     "cadence::quantized_w8a32_gru",
