Closed
Changes from 5 of 67 commits
0f4caa1
[flamingo] Update preproc imports (#5160)
lucylq Sep 9, 2024
2dee34e
Refactor namespace usage in module tests.
shoumikhin Sep 9, 2024
647bfd4
Add an overload to skip dtype and sizes.
shoumikhin Sep 9, 2024
b52d4b6
Enable Llama3 Multi-turn conversation
cmodi-meta Sep 9, 2024
cd9d536
Make convert to linear an export pass
mcr229 Sep 9, 2024
b69ae0c
Hide and simplify operator registry internals
dbort Sep 9, 2024
6b1e328
[ExecuTorch] Support BFloat16 in CPUBlas gemm
swolchok Sep 9, 2024
aee0d82
RFC: quantized fast hadamard transform
swolchok Sep 9, 2024
eca9ed5
q to s start ops | add dim order sanity check
Gasoonjia Sep 9, 2024
85410e4
Qualcomm AI Engine Direct - Optimization and fix mutable buffer issue…
shewu-quic Sep 9, 2024
58a9f17
Update base for Update on "RFC: quantized fast hadamard transform"
swolchok Sep 9, 2024
449860f
Update on "RFC: quantized fast hadamard transform"
swolchok Sep 9, 2024
d2014e3
Add a target rule for ops_registrations (#5083)
LeeOHzzZ Sep 9, 2024
b23ee01
Register LLM prefill native method in JNI
kirklandsign Sep 9, 2024
28beeff
Clean up devtools/etdump
dbort Sep 9, 2024
90eaa1f
Update base for Update on "RFC: quantized fast hadamard transform"
swolchok Sep 9, 2024
e7fffa4
Update on "RFC: quantized fast hadamard transform"
swolchok Sep 9, 2024
6ce9f52
t to z start ops | add dim order sanity check
Gasoonjia Sep 9, 2024
542ecb5
Add Echo parameter to multimodal runner (llava) and jni layer (#5181)
cmodi-meta Sep 9, 2024
59d9bad
Use c++17 for size test
lucylq Sep 9, 2024
7650667
Add a default delegate time scale converter
Olivia-liu Sep 10, 2024
f412630
Qualcomm AI Engine Direct - Uplevel QNN version for ci test (#5174)
shewu-quic Sep 10, 2024
c5a385e
Update schema to include infinity for double values
lucylq Sep 10, 2024
f471556
Partition Mutable Buffer as Core ML State (#5165)
YifanShenSZ Sep 10, 2024
67ae762
Qualcomm AI Engine Direct - Add the argument to specify soc model (#5…
shewu-quic Sep 10, 2024
63e794a
Add pass to convert special case of mean.dim to averagepool2d
per Sep 10, 2024
370f304
Add slice_scatter test: large end value
manuelcandales Sep 10, 2024
083b9e6
[ET-VK] Fix gpuinfo CI
junpi3 Sep 10, 2024
1eeded1
Let the app check "aatp/data" subdir for AWS.
shoumikhin Sep 10, 2024
126abb5
Update the API of registering fake kernels to new standard (#5084)
LeeOHzzZ Sep 10, 2024
657789e
Qualcomm AI Engine Direct - Apply spin quant R1 and R2 (#5175)
shewu-quic Sep 10, 2024
549f14b
Restore constant segment
lucylq Sep 10, 2024
e826de3
Add Half/BFloat16 tests for op_mul
manuelcandales Sep 10, 2024
43e2f2d
Qualcomm AI Engine Direct - support skip quantization (#5070)
haowhsu-quic Sep 10, 2024
30acae5
Switch over backend tests to export_for_training
tarun292 Sep 10, 2024
db34239
[LLava] Fix stats for C++ runner
digantdesai Sep 10, 2024
02304d7
Update bundled_program to use new namespace
dbort Sep 10, 2024
c76b22f
Qualcomm AI Engine Direct - Fixed the order of the transforms for lla…
shewu-quic Sep 10, 2024
d38ca81
Android refactor cmake build
kirklandsign Sep 10, 2024
a4d67e2
Android: Leverage prefillPrompt and prefillImage on Llava
Riandy Sep 10, 2024
b54206d
Update the minimum C++ version to C++17
dbort Sep 10, 2024
4ce0f9d
Introduce PlatformMemoryAllocator
manuelcandales Sep 10, 2024
2b50c76
Use dynamic bound by default.
shoumikhin Sep 10, 2024
ced40f4
Fix models in benchinfra (#5226)
guangy10 Sep 10, 2024
e245590
App side change
kirklandsign Sep 10, 2024
4cce620
Minor fix: Create root dir when it doesn't exist. (#5075)
freddan80 Sep 10, 2024
ab6d91c
Fix internal executorch_llama_jni
kirklandsign Sep 10, 2024
f07e4d5
Update setup-with-qnn.sh with runner util flag (#5210)
WuhanMonkey Sep 10, 2024
cac2c05
[ET-VK] Integrate axis mapping into optimized matrix multiplication s…
SS-JIA Sep 10, 2024
cba5bee
fbshipit-source-id: f63634ba171da01328849d84552b125b829403e8
facebook-github-bot Sep 11, 2024
ca889fb
Minibench use model_dir instead (#5250)
kirklandsign Sep 11, 2024
e4d72ce
Update setup.sh for LlamaDemo (#5235)
kirklandsign Sep 11, 2024
d423131
Android app UI/flow improvements (#5241)
Riandy Sep 11, 2024
7942d2c
Allow core aten op exception list (#5237)
larryliu0820 Sep 11, 2024
69aed24
link whole quantized_ops_lib (#5253)
kirklandsign Sep 11, 2024
41bc1ce
spinquant in eager mode (#5125)
Sep 11, 2024
d7a7ec6
Updated the workflow to upload models to S3 (#5232)
Sep 11, 2024
7e374d7
Add model execution scripts and runner (#5217)
neuropilot-captain Sep 11, 2024
af80804
Debug event populates event name (#5142)
Olivia-liu Sep 11, 2024
68397af
Optimized op_mm using CPUBlas gemm (#5242)
swolchok Sep 11, 2024
d73a653
Add optimized op_linear (#5243)
swolchok Sep 11, 2024
3171ede
Add scalar tensor tests. (#5260)
shoumikhin Sep 11, 2024
4da3c5d
Add CoreML Quantize (#5228)
Sep 11, 2024
d6b800b
Add helper function to create empty, full, ones and zeros tensors. (#…
shoumikhin Sep 11, 2024
75a56a2
Add helper function to create random tensors. (#5266)
shoumikhin Sep 11, 2024
315abf8
Update base for Update on "RFC: quantized fast hadamard transform"
swolchok Sep 11, 2024
dd0db34
Update on "RFC: quantized fast hadamard transform"
swolchok Sep 11, 2024
87 changes: 82 additions & 5 deletions extension/llm/custom_ops/spinquant/fast_hadamard_transform.h
@@ -1,9 +1,18 @@
/*
* Copyright (c) Meta Platforms, Inc. and affiliates.
* All rights reserved.
*
* This source code is licensed under the BSD-style license found in the
* LICENSE file in the root directory of this source tree.
*/

// (c) Meta Platforms, Inc. and affiliates.
#pragma once

#include <cassert>
#include <cmath>
#include <cstdint>
#include <memory>

#include "fast_hadamard_transform_special.h"

@@ -24,9 +33,7 @@ T fast_sqrt_of_power_of_2(int log2_n) {
}
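The hunk above reformats `fast_sqrt_of_power_of_2`, which computes sqrt(2^n) as 2^(n/2) times an extra factor of sqrt(2) when n is odd. A minimal Python sketch of that identity (illustrative only, not the shipped C++):

```python
import math

def fast_sqrt_of_power_of_2(log2_n: int) -> float:
    # sqrt(2**n) == 2**(n // 2) when n is even; an odd n leaves one
    # residual factor of sqrt(2).
    result = float(1 << (log2_n // 2))
    if log2_n % 2 != 0:
        result *= math.sqrt(2.0)
    return result

assert fast_sqrt_of_power_of_2(4) == 4.0  # sqrt(16)
assert abs(fast_sqrt_of_power_of_2(5) - math.sqrt(32.0)) < 1e-12
```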

template <typename T>
void normalize_after_fht(
T* out,
int log2_vec_size) {
void normalize_after_fht(T* out, int log2_vec_size) {
const T inv_sqrt = T(1) / fast_sqrt_of_power_of_2<T>(log2_vec_size);
const int vec_size = 1 << log2_vec_size;
for (int ii = 0; ii < vec_size; ++ii) {
@@ -35,7 +42,7 @@ void normalize_after_fht(
}

template <typename T>
void fast_hadamard_transform_simple_impl(
void fast_hadamard_transform_unnormalized_simple_impl(
T* vec,
int log2_vec_size) {
if (log2_vec_size == 0) {
@@ -55,7 +62,11 @@ void fast_hadamard_transform_simple_impl(
}
step *= 2;
}
}

template <typename T>
void fast_hadamard_transform_simple_impl(T* vec, int log2_vec_size) {
fast_hadamard_transform_unnormalized_simple_impl(vec, log2_vec_size);
normalize_after_fht(vec, log2_vec_size);
}

@@ -66,7 +77,73 @@ void fast_hadamard_transform_simple_impl(
// of vec, which must be of length (1 << log2_vec_size).
template <typename T>
void fast_hadamard_transform(T* vec, int log2_vec_size) {
internal::fast_hadamard_transform_simple_impl(vec, log2_vec_size);
internal::fast_hadamard_transform_simple_impl(vec, log2_vec_size);
}
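The renamed `fast_hadamard_transform_unnormalized_simple_impl` performs log2_vec_size rounds of add/subtract butterflies with doubling stride; the `_simple_impl` wrapper then rescales by 1/sqrt(N). A rough Python model of the unnormalized pass (a sketch under the assumption that the C++ loop structure matches the visible hunk):

```python
def fht_unnormalized(vec, log2_vec_size):
    # In-place fast Walsh-Hadamard transform: log2_vec_size rounds of
    # (a, b) -> (a + b, a - b) butterflies, doubling the stride each round.
    step = 1
    for _ in range(log2_vec_size):
        for start in range(0, len(vec), step * 2):
            for i in range(start, start + step):
                x, y = vec[i], vec[i + step]
                vec[i], vec[i + step] = x + y, x - y
        step *= 2

v = [1.0, 0.0, 0.0, 0.0]
fht_unnormalized(v, 2)
assert v == [1.0, 1.0, 1.0, 1.0]  # impulse -> all-ones before 1/sqrt(N)
```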

// Compute a quantized fast Walsh-Hadamard transform of vec, which
// must be of length (1 << log2_vec_size) and symmetrically quantized.
//
// Note that we do not need to know the quantization scale, because
// the Fast Hadamard transform is a series of additions and
// subtractions with a final multiplication step, and we have the
// following trivial identities:
//
// scale * a + scale * b = scale * (a + b) (addition doesn't need the scale)
// alpha * (scale * a) = scale * (alpha * a) (multiplication doesn't need the
// scale)
void fast_hadamard_transform_symmetric_quantized_s16(
int16_t* vec,
int log2_vec_size) {
if (log2_vec_size == 0) {
return;
}

const int vec_size = 1 << log2_vec_size;
// We perform log2_vec_size rounds where each round's maximum output
// is at most double the maximum input, so we can at most multiply
// the maximum input by vec_size. Performing intermediate arithmetic
// in 32-bit precision should prevent overflow, since 16 +
// log2_vec_size should be much less than 32.
auto tmp = std::make_unique<int32_t[]>(vec_size);
std::copy(vec, vec + vec_size, tmp.get());

// Per the function-level comment above, we can ignore the
// quantization scale, so we just delegate to the usual unnormalized
// implementation.
// NOTE: if we need this to be fast on CPU, we can use FFHT to
// generate fht_uint32 similar to fht_float.
internal::fast_hadamard_transform_unnormalized_simple_impl(
tmp.get(), log2_vec_size);

  // Normalization step: divide by sqrt(1 << log2_vec_size). Similar
  // to fast_sqrt above, if log2_vec_size is even, then the
  // maximum-precision way to do this is right-shift by
  // log2_vec_size / 2. If log2_vec_size is odd, we still do the
  // right-shift, and then we have an extra division by sqrt(2) that
  // we perform by making use of a sufficiently accurate rational
  // approximation. Our initial idea was to divide by sqrt(2) by
  // adjusting the quantization scale, but that would cause this
  // function to tend to increase the magnitude of the elements of
  // vec, which would result in clipping and therefore accuracy
  // loss, especially compounded over 30+ transformer layers.
const int log2_sqrt_vec_size = log2_vec_size / 2;
constexpr int32_t qmin = -(1 << 15) + 1;
constexpr int32_t qmax = -qmin;
if (log2_vec_size % 2 != 0) {
    // 408 / 577 - 1.0 / sqrt(2) ~= -1.06e-06, which should be close enough.
static const int32_t inv_sqrt_2_numerator = 408;
static const int32_t inv_sqrt_2_denominator = 577;
for (int ii = 0; ii < vec_size; ++ii) {
const auto val_over_sqrt_vec_size =
(tmp[ii] * inv_sqrt_2_numerator / inv_sqrt_2_denominator) >>
log2_sqrt_vec_size;
vec[ii] = std::clamp(val_over_sqrt_vec_size, qmin, qmax);
}
} else {
for (int ii = 0; ii < vec_size; ++ii) {
vec[ii] = std::clamp(tmp[ii] >> log2_sqrt_vec_size, qmin, qmax);
}
}
return;
}
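The normalization in `fast_hadamard_transform_symmetric_quantized_s16` replaces a floating-point divide by sqrt(N) with a right shift, plus an integer 408/577 multiply-divide for the residual 1/sqrt(2) when log2_vec_size is odd. A hedged Python sketch of that step (C's truncating division and arithmetic shift agree with Python's floor-based operators only for non-negative values, so this sketch assumes values >= 0):

```python
import math

QMIN, QMAX = -(1 << 15) + 1, (1 << 15) - 1

def normalize_quantized(tmp, log2_vec_size):
    # Divide by sqrt(2**log2_vec_size): right-shift by log2_vec_size // 2,
    # plus a 408/577 ~ 1/sqrt(2) factor when log2_vec_size is odd, then
    # clamp back into the symmetric int16 range.
    shift = log2_vec_size // 2
    out = []
    for x in tmp:
        if log2_vec_size % 2 != 0:
            x = (x * 408 // 577) >> shift
        else:
            x >>= shift
        out.append(max(QMIN, min(QMAX, x)))
    return out

assert abs(408 / 577 - 1 / math.sqrt(2)) < 2e-6  # ~1.06e-06 error
assert normalize_quantized([1024], 4) == [256]   # >> 2 for N = 16
```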

// Like fast_hadamard_transform, but vec must be of length 28 * (1 <<
@@ -1,5 +1,4 @@

// This file is auto-generated. See "special_hadamard_code_gen.py"
// @generated by special_hadamard_code_gen.py strided_cpu


#pragma once
59 changes: 40 additions & 19 deletions extension/llm/custom_ops/spinquant/special_hadamard_code_gen.py
Original file line number Diff line number Diff line change
@@ -176,12 +176,12 @@
had_strings = [had_12, had_20_will, had_28_will, had_40_tpal]

header = """
// This file is auto-generated. See "special_hadamard_code_gen.py"\n

#pragma once

"""


TEMPLATE = """
__device__ __forceinline__ void hadamard_mult_thread_{N}(float x[{N}]) {{
float out[{N}];
@@ -220,8 +220,13 @@

def string_to_array(string):
# Convert strings of + and - to bool arrays
string = string.strip().replace('+', '1').replace('-', '-1').split()
return np.stack([np.fromstring(" ".join(string[i]), dtype=np.int32, sep=' ') for i in range(len(string))])
string = string.strip().replace("+", "1").replace("-", "-1").split()
return np.stack(
[
np.fromstring(" ".join(string[i]), dtype=np.int32, sep=" ")
for i in range(len(string))
]
)
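The reformatted `string_to_array` turns rows of `+`/`-` characters into a matrix of ±1 entries. A pure-Python equivalent of the parse, without NumPy, as a sketch:

```python
def string_to_matrix(s: str):
    # Each whitespace-separated token is a row; '+' -> 1, '-' -> -1.
    return [[1 if ch == "+" else -1 for ch in row] for row in s.split()]

assert string_to_matrix("++\n+-") == [[1, 1], [1, -1]]  # 2x2 Hadamard
```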


def strided_load_code_gen(N):
@@ -233,28 +238,44 @@ def array_code_gen(arr, template):
assert arr.shape[0] == arr.shape[1]
out = []
for i in range(N):
out.append(f"out[{i}] = " + " ".join([f"{'+' if arr[i, j] == 1 else '-'} x[{j}]" for j in range(N)]) + ";")
return template.format(N=str(N), code='\n '.join(out), strided_load_code = strided_load_code_gen(N))


def main(template = TEMPLATE):
output_dir = Path(__file__).parent / "fast_hadamard_transform_special.h"
output_dir.write_text(header + ''.join(array_code_gen(string_to_array(s), template) for s in had_strings))
out.append(
f"out[{i}] = "
+ " ".join([f"{'+' if arr[i, j] == 1 else '-'} x[{j}]" for j in range(N)])
+ ";"
)
return template.format(
N=str(N), code="\n ".join(out), strided_load_code=strided_load_code_gen(N)
)


OPTION_TO_TEMPLATE = {
'cuda': TEMPLATE,
'cpu': CPU_TEMPLATE,
'strided_cpu': STRIDED_CPU_TEMPLATE,
"cuda": TEMPLATE,
"cpu": CPU_TEMPLATE,
"strided_cpu": STRIDED_CPU_TEMPLATE,
}


if __name__ == '__main__':
def main(option="cuda"):
try:
template = OPTION_TO_TEMPLATE[option]
except KeyError:
raise Exception(
f"bad target option {option}; options are {', '.join(OPTION_TO_TEMPLATE.keys())}"
)
output_dir = Path(__file__).parent / "fast_hadamard_transform_special.h"
generated_line = f"// @{'generated'} by special_hadamard_code_gen.py {option}\n"

output_dir.write_text(
generated_line
+ header
+ "".join(array_code_gen(string_to_array(s), template) for s in had_strings)
)


if __name__ == "__main__":
import sys
template = TEMPLATE

option = "cuda"
if len(sys.argv) > 1:
option = sys.argv[1]
if option not in OPTION_TO_TEMPLATE:
raise Exception(f"bad target option {option}; options are {', '.join(OPTION_TO_TEMPLATE.keys())}")
template = OPTION_TO_TEMPLATE[option]
main(template)
main(option)
3 changes: 2 additions & 1 deletion extension/llm/custom_ops/spinquant/targets.bzl
@@ -8,8 +8,9 @@ def define_common_targets():
"""
runtime.cxx_library(
name = "fast_hadamard_transform",
headers = [
exported_headers = [
"fast_hadamard_transform.h",
"fast_hadamard_transform_special.h",
],
visibility = ["@EXECUTORCH_CLIENTS"],
)