Commit 6a4879f

ssjia committed
Update on "[ET-VK] Add kInt8x4 dtype and GPUMemoryLayouts for packed quantized tensors"
## Motivation

Lay the foundations for executing statically quantized CNNs with ET-VK. Unlike dynamic quantization, static quantization allows the output of quantized operators to stay in integer representation and be fed directly to the next quantized operator.

## Context

Typically, int8 quantized tensors can be represented by simply having the tensor use the int8 data type. While this is possible in ET-VK, in practice quantized operators expect int8 quantized tensors to be packed so that 16 8-bit values are packed into each `ivec4`, such that quantized int8 tensors load/store with a granularity of 16 elements. The reason for this is twofold:

* Support for the shader int8 / storage buffer int8 extensions is not guaranteed, meaning some devices do not allow using int8 types in shaders
* We have found that load/store from storage buffers/textures that use int8 data types sometimes results in worse memory load performance, due to vectorized load/store instructions not being used.

Therefore, in ET-VK we need a way to mark that a quantized tensor should

1. Use int32 as the underlying data type for the storage buffer/texture
2. Account for the block-packing that may be used

## Changes

First, introduce the `Int8x4` dtype that can be used for packed int8 tensors. This dtype is functionally the same as `Int`, but denotes that each int32 actually contains 4 packed 8-bit values.

Second, introduce new memory layouts: `kPackedInt8_4W4C` and `kPackedInt8_4H4W`. The former will be used for convolution, while the latter will be used for matrix multiplication. See the inline comments for more details about these memory layouts.

Then, update `QuantizedConvolution.cpp` and `QuantizedLinear.cpp` to use the new data type and memory layouts for the packed int8 input tensor.

Differential Revision: [D82542336](https://our.internmc.facebook.com/intern/diff/D82542336/)

[ghstack-poisoned]
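As a rough illustration of the packing described above, the sketch below packs four signed 8-bit values into one 32-bit integer and sixteen values into four such integers, analogous to one `ivec4`. The helper names (`pack_int8x4`, `unpack_int8x4`, `pack_block16`) are hypothetical, and the little-endian byte order within each int32 is an assumption; the commit message does not specify how ET-VK orders the bytes.

```python
import struct


def pack_int8x4(values):
    """Pack four signed 8-bit ints into one signed 32-bit int.

    Assumption: the first value lands in the least significant byte.
    """
    if len(values) != 4:
        raise ValueError("expected exactly 4 values")
    # "<4b" = four signed bytes, "<i" = one little-endian signed int32.
    return struct.unpack("<i", struct.pack("<4b", *values))[0]


def unpack_int8x4(word):
    """Recover the four signed 8-bit ints from a packed int32."""
    return list(struct.unpack("<4b", struct.pack("<i", word)))


def pack_block16(values):
    """Pack 16 int8 values into 4 int32 words (analogous to one ivec4)."""
    if len(values) != 16:
        raise ValueError("expected exactly 16 values")
    return [pack_int8x4(values[i:i + 4]) for i in range(0, 16, 4)]


if __name__ == "__main__":
    block = pack_block16(list(range(-8, 8)))
    print(block)
    print([unpack_int8x4(w) for w in block])
```

With this scheme a buffer declared with a 32-bit integer element type can carry int8 data without requiring int8 support in the shader, at the cost of unpacking with shifts and masks.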
2 parents bb7d44c + e26ce58 commit 6a4879f

576 files changed, +20685 -9293 lines changed

.ci/docker/build.sh

Lines changed: 2 additions & 2 deletions

@@ -54,13 +54,13 @@ case "${IMAGE_NAME}" in
   executorch-ubuntu-22.04-mediatek-sdk)
     MEDIATEK_SDK=yes
     CLANG_VERSION=12
-    ANDROID_NDK_VERSION=r27b
+    ANDROID_NDK_VERSION=r28c
     ;;
   executorch-ubuntu-22.04-clang12-android)
     LINTRUNNER=""
     CLANG_VERSION=12
     # From https://developer.android.com/ndk/downloads
-    ANDROID_NDK_VERSION=r27b
+    ANDROID_NDK_VERSION=r28c
     ;;
   *)
     echo "Invalid image name ${IMAGE_NAME}"

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-40b02a2dc61bbf901a2df91719f47c98d65368ec
+bd06b54e627fbfd354a2cffa4c80fb21883209a9

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-4d4abec80f03cd8fdefe1d9cb3a60d3690cd777e
+53a2908a10f414a2f85caa06703a26a40e873869

.ci/scripts/setup-samsung-linux-deps.sh

Lines changed: 2 additions & 2 deletions

@@ -11,7 +11,7 @@ set -ex
 
 download_ai_lite_core() {
   API_BASE="https://soc-developer.semiconductor.samsung.com/api/v1/resource/ai-litecore/download"
-  API_KEY="kn10SoSY3hkC-9Qny5TqD2mnqVrlupv3krnjLeBt5cY"
+  API_KEY=$SAMSUNG_AI_LITECORE_KEY
 
   VERSION="0.5"
   OS_NAME="Ubuntu 22.04"
@@ -52,7 +52,7 @@ download_ai_lite_core() {
 install_enn_backend() {
   NDK_INSTALLATION_DIR=/opt/ndk
   rm -rf "${NDK_INSTALLATION_DIR}" && sudo mkdir -p "${NDK_INSTALLATION_DIR}"
-  ANDROID_NDK_VERSION=r27b
+  ANDROID_NDK_VERSION=r28c
 
   # build Exynos backend
   export ANDROID_NDK_ROOT=${ANDROID_NDK_ROOT:-/opt/ndk}

.ci/scripts/test-cuda-build.sh

Lines changed: 95 additions & 0 deletions

@@ -0,0 +1,95 @@
+#!/bin/bash
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+set -exu
+
+CUDA_VERSION=${1:-"12.6"}
+
+echo "=== Testing ExecuTorch CUDA ${CUDA_VERSION} Build ==="
+
+# Function to build and test ExecuTorch with CUDA support
+test_executorch_cuda_build() {
+    local cuda_version=$1
+
+    echo "Building ExecuTorch with CUDA ${cuda_version} support..."
+    echo "ExecuTorch will automatically detect CUDA and install appropriate PyTorch wheel"
+
+    # Check available resources before starting
+    echo "=== System Information ==="
+    echo "Available memory: $(free -h | grep Mem | awk '{print $2}')"
+    echo "Available disk space: $(df -h . | tail -1 | awk '{print $4}')"
+    echo "CPU cores: $(nproc)"
+    echo "CUDA version check:"
+    nvcc --version || echo "nvcc not found"
+    nvidia-smi || echo "nvidia-smi not found"
+
+    # Set CMAKE_ARGS to enable CUDA build - ExecuTorch will handle PyTorch installation automatically
+    export CMAKE_ARGS="-DEXECUTORCH_BUILD_CUDA=ON"
+
+    echo "=== Starting ExecuTorch Installation ==="
+    # Install ExecuTorch with CUDA support with timeout and error handling
+    timeout 5400 ./install_executorch.sh || {
+        local exit_code=$?
+        echo "ERROR: install_executorch.sh failed with exit code: $exit_code"
+        if [ $exit_code -eq 124 ]; then
+            echo "ERROR: Installation timed out after 90 minutes"
+        fi
+        exit $exit_code
+    }
+
+    echo "SUCCESS: ExecuTorch CUDA build completed"
+
+    # Verify the installation
+    echo "=== Verifying ExecuTorch CUDA Installation ==="
+
+    # Test that ExecuTorch was built successfully
+    python -c "
+import executorch
+print('SUCCESS: ExecuTorch imported successfully')
+"
+
+    # Test CUDA availability and show details
+    python -c "
+try:
+    import torch
+    print('INFO: PyTorch version:', torch.__version__)
+    print('INFO: CUDA available:', torch.cuda.is_available())
+
+    if torch.cuda.is_available():
+        print('SUCCESS: CUDA is available for ExecuTorch')
+        print('INFO: CUDA version:', torch.version.cuda)
+        print('INFO: GPU device count:', torch.cuda.device_count())
+        print('INFO: Current GPU device:', torch.cuda.current_device())
+        print('INFO: GPU device name:', torch.cuda.get_device_name())
+
+        # Test basic CUDA tensor operation
+        device = torch.device('cuda')
+        x = torch.randn(10, 10).to(device)
+        y = torch.randn(10, 10).to(device)
+        z = torch.mm(x, y)
+        print('SUCCESS: CUDA tensor operation completed on device:', z.device)
+        print('INFO: Result tensor shape:', z.shape)
+
+        print('SUCCESS: ExecuTorch CUDA integration verified')
+    else:
+        print('WARNING: CUDA not detected, but ExecuTorch built successfully')
+        exit(1)
+except Exception as e:
+    print('ERROR: ExecuTorch CUDA test failed:', e)
+    exit(1)
+"
+
+    echo "SUCCESS: ExecuTorch CUDA ${cuda_version} build and verification completed successfully"
+}
+
+# Main execution
+echo "Current working directory: $(pwd)"
+echo "Directory contents:"
+ls -la
+
+# Run the CUDA build test
+test_executorch_cuda_build "${CUDA_VERSION}"

.ci/scripts/test_huggingface_optimum_model.py

Lines changed: 114 additions & 8 deletions

@@ -43,7 +43,9 @@ def cli_export(command, model_dir):
 
 
 def check_causal_lm_output_quality(
-    model_id: str, generated_tokens: List[int], max_perplexity_threshold: float = 100.0
+    model_id: str,
+    generated_tokens: List[int],
+    max_perplexity_threshold: float = 100.0,
 ):
     """
     Evaluates the quality of text generated by a causal language model by calculating its perplexity.
@@ -58,12 +60,24 @@ def check_causal_lm_output_quality(
     """
     logging.info(f"Starting perplexity check with model '{model_id}' ...")
     # Load model
-    model = AutoModelForCausalLM.from_pretrained(
-        model_id,
-        low_cpu_mem_usage=True,
-        use_cache=False,
-        torch_dtype=torch.bfloat16,
-    )
+    cls_name = AutoModelForCausalLM
+    if "llava" in model_id:
+        from transformers import LlavaForConditionalGeneration
+
+        cls_name = LlavaForConditionalGeneration
+    try:
+        model = cls_name.from_pretrained(
+            model_id,
+            low_cpu_mem_usage=True,
+            use_cache=False,
+            torch_dtype=torch.bfloat16,
+        )
+    except TypeError:
+        model = cls_name.from_pretrained(
+            model_id,
+            low_cpu_mem_usage=True,
+            torch_dtype=torch.bfloat16,
+        )
 
     with torch.no_grad():
         outputs = model(input_ids=generated_tokens, labels=generated_tokens)
@@ -156,6 +170,86 @@ def test_text_generation(model_id, model_dir, recipe, *, quantize=True, run_only
     assert check_causal_lm_output_quality(model_id, generated_tokens) is True
 
 
+def test_llm_with_image_modality(
+    model_id, model_dir, recipe, *, quantize=True, run_only=False
+):
+    command = [
+        "optimum-cli",
+        "export",
+        "executorch",
+        "--model",
+        model_id,
+        "--task",
+        "multimodal-text-to-text",
+        "--recipe",
+        recipe,
+        "--output_dir",
+        model_dir,
+        "--use_custom_sdpa",
+        "--use_custom_kv_cache",
+        "--qlinear",
+        "8da4w",
+        "--qembedding",
+        "8w",
+    ]
+    if not run_only:
+        cli_export(command, model_dir)
+
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    tokenizer.save_pretrained(model_dir)
+
+    # input
+    processor = AutoProcessor.from_pretrained(model_id)
+    image_url = "https://llava-vl.github.io/static/images/view.jpg"
+    conversation = [
+        {
+            "role": "system",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.",
+                }
+            ],
+        },
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "url": image_url},
+                {
+                    "type": "text",
+                    "text": "What are the things I should be cautious about when I visit here?",
+                },
+            ],
+        },
+    ]
+    inputs = processor.apply_chat_template(
+        conversation,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+    )
+
+    from executorch.extension.llm.runner import GenerationConfig, MultimodalRunner
+
+    runner = MultimodalRunner(f"{model_dir}/model.pte", f"{model_dir}/tokenizer.model")
+    generated_text = runner.generate_text_hf(
+        inputs,
+        GenerationConfig(max_new_tokens=128, temperature=0, echo=False),
+        processor.image_token_id,
+    )
+    print(f"\nGenerated text:\n\t{generated_text}")
+    # Free memory before loading eager for quality check
+    del runner
+    gc.collect()
+    assert (
+        check_causal_lm_output_quality(
+            model_id, tokenizer.encode(generated_text, return_tensors="pt")
+        )
+        is True
+    )
+
+
 def test_fill_mask(model_id, model_dir, recipe, *, quantize=True, run_only=False):
     command = [
         "optimum-cli",
@@ -353,6 +447,9 @@ def test_vit(model_id, model_dir, recipe, *, quantize=False, run_only=False):
         required=False,
         help="When provided, write the pte file to this directory. Otherwise, a temporary directory is created for the test.",
     )
+    parser.add_argument(
+        "--run_only", action="store_true", help="Skip export and only run the test"
+    )
     args = parser.parse_args()
 
     _text_generation_mapping = {
@@ -384,8 +481,16 @@ def test_vit(model_id, model_dir, recipe, *, quantize=False, run_only=False):
         "vit": ("google/vit-base-patch16-224", test_vit),
     }
 
+    _multimodal_model_mapping = {
+        "gemma3-4b": ("google/gemma-3-4b-it", test_llm_with_image_modality),
+        "llava": ("llava-hf/llava-1.5-7b-hf", test_llm_with_image_modality),
+    }
+
     model_to_model_id_and_test_function = (
-        _text_generation_mapping | _mask_fill_mapping | _misc_model_mapping
+        _text_generation_mapping
+        | _mask_fill_mapping
+        | _misc_model_mapping
+        | _multimodal_model_mapping
     )
 
     if args.model not in model_to_model_id_and_test_function:
@@ -400,4 +505,5 @@ def test_vit(model_id, model_dir, recipe, *, quantize=False, run_only=False):
             model_dir=tmp_dir if args.model_dir is None else args.model_dir,
             recipe=args.recipe,
             quantize=args.quantize,
+            run_only=args.run_only,
         )
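
The hunks above do not show how `check_causal_lm_output_quality` turns the model output into a perplexity value. As a minimal sketch of the general technique, assuming the usual definition (the exponential of the mean negative log-likelihood per token), the check could look like the following. The function names and the strict `<` comparison are hypothetical; only the default threshold of 100.0 is taken from the diff.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens.

    token_log_probs: natural-log probabilities the model assigned to each
    token of the generated sequence.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def passes_quality_check(token_log_probs, max_perplexity_threshold=100.0):
    """Hypothetical pass/fail rule: perplexity must stay below the threshold."""
    return perplexity(token_log_probs) < max_perplexity_threshold


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    uniform = [math.log(0.25)] * 8
    print(perplexity(uniform))
    print(passes_quality_check(uniform))
```

A causal LM called with `labels` set typically returns the mean cross-entropy loss, so exponentiating that loss gives the same quantity without handling per-token probabilities explicitly.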

.ci/scripts/test_llama.sh

Lines changed: 1 addition & 0 deletions

@@ -159,6 +159,7 @@ cmake_install_executorch_libraries() {
     -DCMAKE_INSTALL_PREFIX=cmake-out \
     -DCMAKE_BUILD_TYPE="$CMAKE_BUILD_TYPE" \
     -DEXECUTORCH_BUILD_QNN="$QNN" \
+    -DEXECUTORCH_ENABLE_LOGGING=ON \
     -DQNN_SDK_ROOT="$QNN_SDK_ROOT"
   cmake --build cmake-out -j9 --target install --config "$CMAKE_BUILD_TYPE"
 }

.ci/scripts/test_model.sh

Lines changed: 3 additions & 3 deletions

@@ -131,13 +131,13 @@ test_model_with_xnnpack() {
     return 0
   fi
 
-  # Delegation
+  # Delegation and test with pybindings
   if [[ ${WITH_QUANTIZATION} == true ]]; then
     SUFFIX="q8"
-    "${PYTHON_EXECUTABLE}" -m examples.xnnpack.aot_compiler --model_name="${MODEL_NAME}" --delegate --quantize
+    "${PYTHON_EXECUTABLE}" -m examples.xnnpack.aot_compiler --model_name="${MODEL_NAME}" --delegate --quantize --test_after_export
   else
     SUFFIX="fp32"
-    "${PYTHON_EXECUTABLE}" -m examples.xnnpack.aot_compiler --model_name="${MODEL_NAME}" --delegate
+    "${PYTHON_EXECUTABLE}" -m examples.xnnpack.aot_compiler --model_name="${MODEL_NAME}" --delegate --test_after_export
   fi
 
   OUTPUT_MODEL_PATH="${MODEL_NAME}_xnnpack_${SUFFIX}.pte"

.ci/scripts/test_wheel_package_qnn.sh

Lines changed: 1 addition & 0 deletions

@@ -145,6 +145,7 @@ run_core_tests () {
   echo "=== [$LABEL] Import smoke tests ==="
   "$PYBIN" -c "import executorch; print('executorch imported successfully')"
   "$PYBIN" -c "import executorch.backends.qualcomm; print('executorch.backends.qualcomm imported successfully')"
+  "$PYBIN" -c "from executorch.export.target_recipes import get_android_recipe; recipe = get_android_recipe('android-arm64-snapdragon-fp16'); print(f'executorch.export.target_recipes imported successfully: {recipe}')"
 
   echo "=== [$LABEL] List installed executorch/backends/qualcomm/python ==="
   local SITE_DIR

.ci/scripts/unittest-buck2.sh

Lines changed: 12 additions & 1 deletion

@@ -15,7 +15,8 @@ buck2 query "//backends/apple/... + //backends/arm: + //backends/arm/debug/... +
 //backends/arm/_passes/... + //backends/arm/runtime/... + //backends/arm/tosa/... \
 + //backends/example/... + \
 //backends/mediatek/... + //backends/transforms/... + \
-//backends/xnnpack/... + //configurations/... + //extension/flat_tensor: + \
+//backends/xnnpack/... + //codegen/tools/... + \
+//configurations/... + //extension/flat_tensor: + \
 //extension/llm/runner: + //kernels/aten/... + //kernels/optimized/... + \
 //kernels/portable/... + //kernels/quantized/... + //kernels/test/... + \
 //runtime/... + //schema/... + //test/... + //util/..."
@@ -34,7 +35,17 @@ BUILDABLE_KERNELS_PRIM_OPS_TARGETS=$(buck2 query //kernels/prim_ops/... | grep -
 for op in "build" "test"; do
   buck2 $op $BUILDABLE_OPTIMIZED_OPS \
     //examples/selective_build:select_all_dtype_selective_lib_portable_lib \
+    //extension/llm/custom_ops/spinquant/test:fast_hadamard_transform_test \
+    //extension/llm/runner/test:test_multimodal_input \
+    //extension/llm/runner/test:test_generation_config \
     //kernels/portable/... \
     $BUILDABLE_KERNELS_PRIM_OPS_TARGETS //runtime/backend/... //runtime/core/... \
     //runtime/executor: //runtime/kernel/... //runtime/platform/...
 done
+
+# Build only without testing
+buck2 build //codegen/tools/... \
+  //extension/llm/runner/io_manager:io_manager \
+  //extension/llm/modules/... \
+  //extension/llm/runner:multimodal_runner_lib \
+  //extension/llm/runner:text_decoder_runner