-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Add 256-bit AVX support #22430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add 256-bit AVX support #22430
Conversation
system/include/compat/avxintrin.h
Outdated
| }; | ||
|
|
||
| #define UNIMPLEMENTED(name) \ | ||
| emscripten_err("warning: unsupported avx intrinsic: " #name "\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a compiler error instead?
If not that perhaps abort()..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A compile time diagnostic would be nice (ideally one that could be tuned to be a warning or an error or muted, although I think we currently don't have such facilities possible).
If compile time diagnostic is not possible, then at runtime it would be good to not be an abort, but only a warning. This is because if it was an aborting error, it would throw people into a repeated find-first-error, fix, rebuild, retest loop to find and fix all sources of unused avx intrinsic use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can the program meaningfully continue after trying to execute a missing intrinsic? Is there some way to gracefully fall back to something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By continuing to produce garbage (as the missing functions are no-ops) and emitting warnings from subsequent missing fucntions as it executes.
That way a developer would be able to see warning messages of multiple functions missing at one go, rather than having to do that tedious/neverending loop of fixing first error at abort, then rebuilding and retesting, to find the next such source of unimplemented function being executed. It is the same iteration problem that the safe_heap mode suffers from.
That being said, the only places where that UNIMPLEMENTED macro is used is in functions _mm256_zeroall and _mm256_zeroupper, and those can safely be no-op functions anyway, so this whole UNIMPLEMENTED macro can then be removed altogether.
system/include/compat/avxintrin.h
Outdated
| __m128i v1; | ||
| } __m256i; | ||
|
|
||
| typedef long long __m128i_u; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not use long long type here, but use an explicit width type. In particular, long long is currently a 64-bit type in Emscripten. (even in Wasm64)
| ret.v0 = _mm_add_pd(__a.v0, __b.v0); | ||
| ret.v1 = _mm_add_pd(__a.v1, __b.v1); | ||
| return ret; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of a silly note, though I see for other *mmintrin.h functions the visual style
static __inline__ __m128d __attribute__((__always_inline__, __nodebug__))
_mm_add_pd(__m128d __a, __m128d __b)
{
return (__m128d)wasm_f64x2_add((v128_t)__a, (v128_t)__b);
}is used, which avoids use of a define __DEFAULT_FN_ATTRS. I would lean to having consistency, i.e. either we'd use the #define everywhere, or nowhere. I'd probably lean towards spelling the macro out, although ok to leave like this as well to avoid unnecessary review churn.
system/include/compat/avxintrin.h
Outdated
| return tmp; | ||
| } | ||
|
|
||
| static inline __m128i select4i(__m256i __a, __m256i __b, const int imm8) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid pollution, these would probably be good to be named something like __avx_select4i, __avx_select4, __avx_select4d in case a user codebase might have a function named e.g. select4.
| } | ||
| if (imm8 & 0x8) { | ||
| tmp = (__m128d)wasm_i64x2_const_splat(0); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though Intel intrinsics guide does not well-define what should happen if imm8 has value of 4,5,6 or 7 (making these functions return an uninitialized tmp), I think it would be good to include that in testing so that in our unit test suite, the practical behavior will align with what a native compiled AVX intrinsics using program would do under these inputs.
Also, while having an implementation that diligently follows the reference is nice, I would recommend fusing the switch and if statements into a single check, i.e. try something like
static __inline__ __m128d __attribute__((__always_inline__, __nodebug__))
__avx_select4d(__m256d __a, __m256d __b, const int imm8)
{
switch(index & 0x11) {
case 0: return __a.v0;
case 1: return __a.v1;
case 2: return __b.v0;
case 3: return __b.v1;
default: return (__m128d)wasm_i64x2_const_splat(0);
}and similar for the other select4 variants. And tune the cases depending on what native AVX code does if an imm8 of 4,5,6 or 7 is passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit 3 and bit 0/1 have different priorities. if bit 3 is set, select4 return zero regardless of what the other bits are, so we cannot fusing the switch and if statements. I will pull if before switch, which may make this more clear.
In select4, Only bit0/1/3 are used, for some intrinsic, I forgot mask the imm8 before use it. I will fix this issue and add more test with different imm8 input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit 3 and bit 0/1 have different priorities. if bit 3 is set, select4 return zero regardless of what the other bits are, so we cannot fusing the switch and if statements.
I thought the above switch-case takes that into account? If bit 3 is set, then index & 0x11 is >= 8, so it will always go down the default: path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, thanks! New commit is uploaded, address all the review comments.
| if ((__imm) == _CMP_NLE_US || (__imm) == _CMP_NLE_UQ) \ | ||
| __ret = _mm_cmpnle_ss((__a), (__b)); \ | ||
| __ret; \ | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above #defines seem to have gotten rewritten. What happened there? Are these whitespace relayouted, or are there functional changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relayout, the unchanged code is relative small, so the whole file of avxintrin.h and test_avx.cpp are formatted by clang-format, while only the changed lines of test_sse.h are formatted.
| ((N) & 3) < 2 \ | ||
| ? _mm256_set_m128i((X).v1, _mm_insert_epi64((X).v0, (I), (N) & 1)) \ | ||
| : _mm256_set_m128i(_mm_insert_epi64((X).v1, (I), (N) & 1), (X).v0); \ | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be better to use the __attribute__((__always_inline__, __nodebug__)) style of function representation for these? Or was there a particular reason to have these as defines, but some of the previous functions as individual functions?
(iirc the cmp was special cased to make it forced-obvious to the compiler to inline and discard dead code on the compare branches that aren't taken, but I think that __attribute__((__always_inline__)) would achieve the same effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's special, because parameter of some intrinsic need to be compile time constant, constexpr in C++ is suitable for this, but in this header should be work both in C and C++, so some intrinsic is written in macro.
e.g. wasm_i32x4_extract_lane, the __i must be __builtin_constant_p().
static inline int32_t __DEFAULT_FN_ATTRS wasm_i32x4_extract_lane(v128_t __a,
int __i)
__REQUIRE_CONSTANT(__i) {
return ((__i32x4)__a)[__i];
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, that makes sense.
system/include/compat/avxintrin.h
Outdated
| } | ||
|
|
||
| static __inline void __DEFAULT_FN_ATTRS _mm256_zeroupper(void) { | ||
| UNIMPLEMENTED("_mm256_zeroupper"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the above two functions can safely be empty functions, there shouldn't be any drawbacks since when porting any assembly code that would have calls to these functions around, that assembly code in the first place will not compile.
After developer removes all uses and calls to manual x86 assembly code, the lack of action in zeroall and zeroupper functions will not be observable, and can safely do nothing.
|
Very nice and massive body of work overall! This definitely will help compile more code to Wasm. 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good other than this one comment, but it's a rubberstamp for the actual correctness of the operations and tests.
system/include/compat/avxintrin.h
Outdated
| }) | ||
|
|
||
| #define _mm256_insert_epi16(X, I, N) \ | ||
| ({ \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use __extension__ before all of the statement expressions to avoid compiler errors in pedantic mode. Since we're using statement expressions anyway, should we also assign non-constant parameters to variables to prevent them from being evaluated more than once? It would also be good to add -Wpedantic to the compilation of these files to test that it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I forgot to add extension to some macro, it's fixed now.
Assign non-constant parameters to variables is also done, -Wpedantic is added to test_avx compilation.
For the build-test failure, one failure is too many locals in generated test_avx.wasm when build with some option, e.g. -O0, it comes from the test case itself, I want to skip some test config, and another failure comes from test-browser, seems the failure case is not directly related to this PR, and I can't reproduce the failure case when run the failure case alone, may need more time to debug this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too many locals
This error should be possible to fix easily by moving some of the individual function tests into their own functions, and adding a __attribute__((noinline)) into those test functions. This way the optimizer won't inline these functions into a single massive function that would have way too many locals inline to run in a VM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try e.g.
void __attribute__((noinline)) test_arithmetic(void) {in test_avx.cpp or splitting the individual sections there into test_arithmetic_part1, test_arithmetic_part2 etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I tried to add attribute((noinline)) to each test_* function, but this error pop up even when running a single test case, e.g. Ret_M256d_M256d_Tint_5bits(__m256d, _mm256_cmp_pd); there are about 1600 locals for one specific constant, unrolling 32 times will exceed the limit 50000(1600 * 32 = 51200), split function is easier, but split macro directly doesn't change the locals, so I have to add some function to wrap the splitted macro.
test/sse/test_sse.h
Outdated
| fcastu(-0.2f), fcastu(-FLT_MIN), 0xF9301AB9, 0x0039AB12, 0x19302BCD, | ||
| fcastu(1.401298464e-45f), fcastu(FLT_MIN), fcastu(0.3f), fcastu(0.5f), fcastu(0.8f), fcastu(1.0f), fcastu(1.5f), | ||
| fcastu(2.5f), fcastu(3.5f), fcastu(3.6f), fcastu(FLT_MAX), fcastu(INFINITY), fcastu(NAN) }; | ||
| float interesting_floats_[] __attribute__((aligned(32))) = {-INFINITY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would these read/align better with __attribute__((aligned(32))) on a line by itself before the declaration?
e.g.
__attribute__((aligned(32)))
float interesting_floats_[] = { ... }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
system/include/compat/avxintrin.h
Outdated
| #endif | ||
|
|
||
| #include <emmintrin.h> | ||
| #include <emscripten/console.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
system/include/compat/avxintrin.h
Outdated
| }) | ||
|
|
||
| #define _mm_permute_ps(__a, __imm) \ | ||
| __extension__({ \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should drop the extra scope and the __extension__ when its not needed?
|
I had address the comments, but still failed in some test, I don't have permission to rerun the test, and haven't reproduced the failure on local machine successfully. Any suggestion on debugging this? Thanks!
|
Is this locally, or on the builder? I'll take a look.
This test is known to be flaky. I will re-run that builder.
This seems likely related your change, right? |
|
For the codesize test failure its normally just a question to rebasing/merging the main branch. Can you try that? |
Yes, but I can't reproduce it locally, and the builder reported twice in two different commit, the output of circuleci doesn't contain enough information, this test just compare the output of intrinsic with different input, I can't see the diff from the output of circleci. |
|
@sbc100 , Thanks! after merging the main branch, the codesize cases passed. So how to reproduce the builder locally? or Can I create a new draft PR for debugging purpose? Thanks! |
Are you sure you can't reproduce is the issue running If the only way to reproduce the output is in the bot (seems unlikely, but possible), then you have a couple of options:
|
|
By the way, the test runner should always show the last lines of the test output. Does that now show you the failure? Perhaps the test should be "fail fast" by default such that the last line of the output is always the failure. Seems to me that this is good default. |
ed5ccd4 to
ed3d54d
Compare
|
@sbc100, I dump the output of native and wasm, they are same, so it's not related to the test_avx itself, and the failure msg shows JS subprocess return -9 (SIG_KILL), maybe it's killed by OOM killer or others. JS subprocess failed (/root/emsdk/node/18.20.3_64bit/bin/node --stack-trace-limit=50 --trace-uncaught /root/project/out/test/test_avx.js): -9 (expected=0). I tried to dump dmesg, but failed, I see you skipped some test due to OOM, so could you give some suggestion on this? I didn't see resource usage information and any logs related to process kills, Thanks! dmesg: read kernel buffer failed: Operation not permitted |
Does OOM seem likely? Does this test use a lot of memory? Or perhaps there are functions that really huge and the v8 jit is running out of memory compiling them? That would fit with the fact that it only happens at If its a really big difference then I think we can say that is the issue, and skip running that test at |
125cdbc to
5a29552
Compare
It's caused by test_compare function, test_compare is a big function unrolled many times. I checked the output of ps, core1.test_avx takes <100MB RSS, while core0.test_avx takes > 4GB with node v18.20.3 and the RSS field is much lower under latest nodejs(v22.9.0), core0.test_avx takes less than 1GB. |
Is there any way to avoid this? If not, then feel free to |
I tried to profile the heap usage with Massif, and found most of the memory is allocated when compiling the wasm module, in BuildTFGraph phase, and about 1/3 is allocated in DecodeBrIf, in original _mm_cmp_pd/_mm_cmp_ps implementation, multiple if statements are used to compare the imm8 parameter, I rewrote the implementation, replaced the if chain with switch, saw about 1/3 RSS reduction, with this change, the core0.test_avx can run successfully without being killed due to OOM. |
|
Before we land this, I wonder if this change warrants ChangeLog entry? |
Since Webassembly only supports 128-bit fixed vector length, one 256-bit AVX intrinsic is emulated by two 128-bit intrinsics.
* Empty _mm256_zeroall and _mm256_zeroupper * Remove __DEFAULT_FN_ATTRS macro * Rewrite keyword __inline to __inline__ * Rewrite select4* function * Rewrite _mm256_cmp_pd and _mm256_cmp_ps macro into function
* remove unused header * put __attribute__((aligned(32))) before the declaration * drop the extra scope and __extension__ in some macro
Replace if chain with switch statement, about 1/3 RSS(resident set size) reduction is observed in core0.test_avx
d3e4cd3 to
365e5a1
Compare
Sure, ChangeLog is updated, Thanks! |
|
After this change, For now I think it's fine if we just disable this test in UBSan mode to make the bots happy again. |
This is needed after #22430 expanded the test.
|
I think it's more reasonable to enforce the limit on the output side, maybe most users don't care the intermediate results as long as the final modules is valid. Further more, the core spec allows up to 2^32 locals, the current limitation is much smaller, matching the JS embedding. |
256-bit AVX support added in PR emscripten-core#22430 is not compatible with C compiler. Add the `union` keyword to the declarations of the internally defined `m256_data` so that use of avxintrin.h does not fail with the C compiler. Also adds `__` prefix to make type `__m256_data` to avoid polluting C++ namespace.
An internal union defined to implement 256-bit AVX support added in PR emscripten-core#22430 is missing `union` keywords at declarations so causes errors when used with the C compiler. Add the `union` keyword to the declarations of the `m256_data` union in avxintrin.h. Also adds `__` prefix to make type `__m256_data` to avoid further polluting global namespace.
An internal union defined to implement 256-bit AVX support added in PR emscripten-core#22430 is missing `union` keywords at declarations so causes errors when used with the C compiler. Add the `union` keyword to the declarations of the `m256_data` union in avxintrin.h. Also adds `__` prefix to make type `__m256_data` to avoid further polluting global namespace. Also remove headers from test exception list as C++ only.
An internal union defined to implement 256-bit AVX support added in PR emscripten-core#22430 is missing `union` keywords at declarations so causes errors when used with the C compiler. Add the `union` keyword to the declarations of the `m256_data` union in avxintrin.h. Also adds `__` prefix to make type `__m256_data` to avoid further polluting global namespace. Also remove headers from test exception list as C++ only.
An internal union defined to implement 256-bit AVX support added in PR #22430 is missing `union` keywords at declarations so causes errors when used with the C compiler. Add the `union` keyword to the declarations of the `m256_data` union in avxintrin.h. Also adds `__` prefix to make type `__m256_data` to avoid further polluting global namespace.
An internal union defined to implement 256-bit AVX support added in PR emscripten-core#22430 is missing `union` keywords at declarations so causes errors when used with the C compiler. Add the `union` keyword to the declarations of the `m256_data` union in avxintrin.h. Also adds `__` prefix to make type `__m256_data` to avoid further polluting global namespace.
Followup to #22430. Each 256-bit AVX2 intrinsic is emulated on top of 128-bit intrinsics that wasm supports directly.
Currently only 128-bit subset of the AVX intrinsic are supported, this patch add 256-bit AVX intrinsic.
Since WebAssembly only supports 128-bit fixed vector length, one 256-bit AVX intrinsic is emulated by two 128-bit intrinsics.