Skip to content

Conversation

@jiepan-intel
Copy link
Contributor

Currently only 128-bit subset of the AVX intrinsic are supported, this patch add 256-bit AVX intrinsic.
Since WebAssembly only supports 128-bit fixed vector length, one 256-bit AVX intrinsic is emulated by two 128-bit intrinsics.

};

#define UNIMPLEMENTED(name) \
emscripten_err("warning: unsupported avx intrinsic: " #name "\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a compiler error instead?

If not that perhaps abort()..?

Copy link
Collaborator

@juj juj Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A compile time diagnostic would be nice (ideally one that could be tuned to be a warning or an error or muted, although I think we currently don't have such facilities possible).

If compile time diagnostic is not possible, then at runtime it would be good to not be an abort, but only a warning. This is because if it was an aborting error, it would throw people into a repeated find-first-error, fix, rebuild, retest loop to find and fix all sources of unused avx intrinsic use.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can the program meaningfully continue after trying to execute a missing intrinsic? Is there some way to gracefully fall back to something else?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By continuing to produce garbage (as the missing functions are no-ops) and emitting warnings from subsequent missing fucntions as it executes.

That way a developer would be able to see warning messages of multiple functions missing at one go, rather than having to do that tedious/neverending loop of fixing first error at abort, then rebuilding and retesting, to find the next such source of unimplemented function being executed. It is the same iteration problem that the safe_heap mode suffers from.

That being said, the only places where that UNIMPLEMENTED macro is used is in functions _mm256_zeroall and _mm256_zeroupper, and those can safely be no-op functions anyway, so this whole UNIMPLEMENTED macro can then be removed altogether.

@sbc100 sbc100 requested review from juj and tlively August 22, 2024 04:27
__m128i v1;
} __m256i;

typedef long long __m128i_u;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use long long type here, but use an explicit width type. In particular, long long is currently a 64-bit type in Emscripten. (even in Wasm64)

ret.v0 = _mm_add_pd(__a.v0, __b.v0);
ret.v1 = _mm_add_pd(__a.v1, __b.v1);
return ret;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a silly note, though I see for other *mmintrin.h functions the visual style

static __inline__ __m128d __attribute__((__always_inline__, __nodebug__))
_mm_add_pd(__m128d __a, __m128d __b)
{
  return (__m128d)wasm_f64x2_add((v128_t)__a, (v128_t)__b);
}

is used, which avoids use of a define __DEFAULT_FN_ATTRS. I would lean to having consistency, i.e. either we'd use the #define everywhere, or nowhere. I'd probably lean towards spelling the macro out, although ok to leave like this as well to avoid unnecessary review churn.

return tmp;
}

static inline __m128i select4i(__m256i __a, __m256i __b, const int imm8) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid pollution, these would probably be good to be named something like __avx_select4i, __avx_select4, __avx_select4d in case a user codebase might have a function named e.g. select4.

}
if (imm8 & 0x8) {
tmp = (__m128d)wasm_i64x2_const_splat(0);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though Intel intrinsics guide does not well-define what should happen if imm8 has value of 4,5,6 or 7 (making these functions return an uninitialized tmp), I think it would be good to include that in testing so that in our unit test suite, the practical behavior will align with what a native compiled AVX intrinsics using program would do under these inputs.

Also, while having an implementation that diligently follows the reference is nice, I would recommend fusing the switch and if statements into a single check, i.e. try something like

static __inline__ __m128d __attribute__((__always_inline__, __nodebug__))
__avx_select4d(__m256d __a, __m256d __b, const int imm8)
{
  switch(index & 0x11) {
  case 0: return __a.v0;
  case 1: return __a.v1;
  case 2: return __b.v0;
  case 3: return __b.v1;
  default: return (__m128d)wasm_i64x2_const_splat(0);
}

and similar for the other select4 variants. And tune the cases depending on what native AVX code does if an imm8 of 4,5,6 or 7 is passed.

Copy link
Contributor Author

@jiepan-intel jiepan-intel Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit 3 and bit 0/1 have different priorities. if bit 3 is set, select4 return zero regardless of what the other bits are, so we cannot fusing the switch and if statements. I will pull if before switch, which may make this more clear.

In select4, Only bit0/1/3 are used, for some intrinsic, I forgot mask the imm8 before use it. I will fix this issue and add more test with different imm8 input.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit 3 and bit 0/1 have different priorities. if bit 3 is set, select4 return zero regardless of what the other bits are, so we cannot fusing the switch and if statements.

I thought the above switch-case takes that into account? If bit 3 is set, then index & 0x11 is >= 8, so it will always go down the default: path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks! New commit is uploaded, address all the review comments.

if ((__imm) == _CMP_NLE_US || (__imm) == _CMP_NLE_UQ) \
__ret = _mm_cmpnle_ss((__a), (__b)); \
__ret; \
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above #defines seem to have gotten rewritten. What happened there? Are these whitespace relayouted, or are there functional changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relayout, the unchanged code is relative small, so the whole file of avxintrin.h and test_avx.cpp are formatted by clang-format, while only the changed lines of test_sse.h are formatted.

((N) & 3) < 2 \
? _mm256_set_m128i((X).v1, _mm_insert_epi64((X).v0, (I), (N) & 1)) \
: _mm256_set_m128i(_mm_insert_epi64((X).v1, (I), (N) & 1), (X).v0); \
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to use the __attribute__((__always_inline__, __nodebug__)) style of function representation for these? Or was there a particular reason to have these as defines, but some of the previous functions as individual functions?

(iirc the cmp was special cased to make it forced-obvious to the compiler to inline and discard dead code on the compare branches that aren't taken, but I think that __attribute__((__always_inline__)) would achieve the same effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's special, because parameter of some intrinsic need to be compile time constant, constexpr in C++ is suitable for this, but in this header should be work both in C and C++, so some intrinsic is written in macro.

e.g. wasm_i32x4_extract_lane, the __i must be __builtin_constant_p().

static inline int32_t __DEFAULT_FN_ATTRS wasm_i32x4_extract_lane(v128_t __a,
int __i)
__REQUIRE_CONSTANT(__i) {
return ((__i32x4)__a)[__i];
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, that makes sense.

}

static __inline void __DEFAULT_FN_ATTRS _mm256_zeroupper(void) {
UNIMPLEMENTED("_mm256_zeroupper");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the above two functions can safely be empty functions, there shouldn't be any drawbacks since when porting any assembly code that would have calls to these functions around, that assembly code in the first place will not compile.

After developer removes all uses and calls to manual x86 assembly code, the lack of action in zeroall and zeroupper functions will not be observable, and can safely do nothing.

@juj
Copy link
Collaborator

juj commented Aug 22, 2024

Very nice and massive body of work overall! This definitely will help compile more code to Wasm. 👍

Copy link
Member

@tlively tlively left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good other than this one comment, but it's a rubberstamp for the actual correctness of the operations and tests.

})

#define _mm256_insert_epi16(X, I, N) \
({ \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use __extension__ before all of the statement expressions to avoid compiler errors in pedantic mode. Since we're using statement expressions anyway, should we also assign non-constant parameters to variables to prevent them from being evaluated more than once? It would also be good to add -Wpedantic to the compilation of these files to test that it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I forgot to add extension to some macro, it's fixed now.
Assign non-constant parameters to variables is also done, -Wpedantic is added to test_avx compilation.

For the build-test failure, one failure is too many locals in generated test_avx.wasm when build with some option, e.g. -O0, it comes from the test case itself, I want to skip some test config, and another failure comes from test-browser, seems the failure case is not directly related to this PR, and I can't reproduce the failure case when run the failure case alone, may need more time to debug this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many locals

This error should be possible to fix easily by moving some of the individual function tests into their own functions, and adding a __attribute__((noinline)) into those test functions. This way the optimizer won't inline these functions into a single massive function that would have way too many locals inline to run in a VM.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try e.g.

void __attribute__((noinline)) test_arithmetic(void) {

in test_avx.cpp or splitting the individual sections there into test_arithmetic_part1, test_arithmetic_part2 etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I tried to add attribute((noinline)) to each test_* function, but this error pop up even when running a single test case, e.g. Ret_M256d_M256d_Tint_5bits(__m256d, _mm256_cmp_pd); there are about 1600 locals for one specific constant, unrolling 32 times will exceed the limit 50000(1600 * 32 = 51200), split function is easier, but split macro directly doesn't change the locals, so I have to add some function to wrap the splitted macro.

fcastu(-0.2f), fcastu(-FLT_MIN), 0xF9301AB9, 0x0039AB12, 0x19302BCD,
fcastu(1.401298464e-45f), fcastu(FLT_MIN), fcastu(0.3f), fcastu(0.5f), fcastu(0.8f), fcastu(1.0f), fcastu(1.5f),
fcastu(2.5f), fcastu(3.5f), fcastu(3.6f), fcastu(FLT_MAX), fcastu(INFINITY), fcastu(NAN) };
float interesting_floats_[] __attribute__((aligned(32))) = {-INFINITY,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would these read/align better with __attribute__((aligned(32))) on a line by itself before the declaration?

e.g.

__attribute__((aligned(32)))
float interesting_floats_[] = { ... }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

#endif

#include <emmintrin.h>
#include <emscripten/console.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

})

#define _mm_permute_ps(__a, __imm) \
__extension__({ \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should drop the extra scope and the __extension__ when its not needed?

@jiepan-intel
Copy link
Contributor Author

I had address the comments, but still failed in some test, I don't have permission to rerun the test, and haven't reproduced the failure on local machine successfully. Any suggestion on debugging this? Thanks!
Three failure:

  1. Generated html doesn't match the expected size $./test/runner other skip:other.test_native_link_error_message
  2. Timeout in wasm64.test_memorygrowth_linear_step
  3. Fail in core0.test_avx

@sbc100
Copy link
Collaborator

sbc100 commented Sep 10, 2024

I had address the comments, but still failed in some test, I don't have permission to rerun the test, and haven't reproduced the failure on local machine successfully. Any suggestion on debugging this? Thanks! Three failure:

  1. Generated html doesn't match the expected size $./test/runner other skip:other.test_native_link_error_message

Is this locally, or on the builder? I'll take a look.

  1. Timeout in wasm64.test_memorygrowth_linear_step

This test is known to be flaky. I will re-run that builder.

  1. Fail in core0.test_avx

This seems likely related your change, right?

@sbc100
Copy link
Collaborator

sbc100 commented Sep 10, 2024

For the codesize test failure its normally just a question to rebasing/merging the main branch. Can you try that?

@jiepan-intel
Copy link
Contributor Author

I had address the comments, but still failed in some test, I don't have permission to rerun the test, and haven't reproduced the failure on local machine successfully. Any suggestion on debugging this? Thanks! Three failure:

  1. Generated html doesn't match the expected size $./test/runner other skip:other.test_native_link_error_message

Is this locally, or on the builder? I'll take a look.
it's on the builder, I will try to rebase

  1. Timeout in wasm64.test_memorygrowth_linear_step

This test is known to be flaky. I will re-run that builder.
Thanks, I reproduced it locally once, but it's seldom, hard to verify and debug.

  1. Fail in core0.test_avx

This seems likely related your change, right?

Yes, but I can't reproduce it locally, and the builder reported twice in two different commit, the output of circuleci doesn't contain enough information, this test just compare the output of intrinsic with different input, I can't see the diff from the output of circleci.

@jiepan-intel
Copy link
Contributor Author

@sbc100 , Thanks! after merging the main branch, the codesize cases passed.
the remaining failure is test_avx, I tried to run circleci locally, but didn't success, so run the steps of circleci(without docker) locally, I still can't reproduce it. the output of the builder doesn't contains usefully information due to limit_size, and not in verbose mode
(not all output shown. See limit_size)

So how to reproduce the builder locally? or Can I create a new draft PR for debugging purpose? Thanks!

@sbc100
Copy link
Collaborator

sbc100 commented Sep 18, 2024

@sbc100 , Thanks! after merging the main branch, the codesize cases passed. the remaining failure is test_avx, I tried to run circleci locally, but didn't success, so run the steps of circleci(without docker) locally, I still can't reproduce it. the output of the builder doesn't contains usefully information due to limit_size, and not in verbose mode (not all output shown. See limit_size)

So how to reproduce the builder locally? or Can I create a new draft PR for debugging purpose? Thanks!

Are you sure you can't reproduce is the issue running core0.test_avx locally?

If the only way to reproduce the output is in the bot (seems unlikely, but possible), then you have a couple of options:

  1. Temporarily modify the limit_size function to just return its argument
  2. Update the test so it doesn't generate so much output by default (for example only generate high level output, except in the case of failure).

@sbc100
Copy link
Collaborator

sbc100 commented Sep 18, 2024

By the way, the test runner should always show the last lines of the test output. Does that now show you the failure? Perhaps the test should be "fail fast" by default such that the last line of the output is always the failure. Seems to me that this is good default.

@jiepan-intel
Copy link
Contributor Author

@sbc100, I dump the output of native and wasm, they are same, so it's not related to the test_avx itself, and the failure msg shows JS subprocess return -9 (SIG_KILL), maybe it's killed by OOM killer or others.

JS subprocess failed (/root/emsdk/node/18.20.3_64bit/bin/node --stack-trace-limit=50 --trace-uncaught /root/project/out/test/test_avx.js): -9 (expected=0).

I tried to dump dmesg, but failed, I see you skipped some test due to OOM, so could you give some suggestion on this? I didn't see resource usage information and any logs related to process kills, Thanks!

dmesg: read kernel buffer failed: Operation not permitted

@sbc100
Copy link
Collaborator

sbc100 commented Sep 20, 2024

@sbc100, I dump the output of native and wasm, they are same, so it's not related to the test_avx itself, and the failure msg shows JS subprocess return -9 (SIG_KILL), maybe it's killed by OOM killer or others.

JS subprocess failed (/root/emsdk/node/18.20.3_64bit/bin/node --stack-trace-limit=50 --trace-uncaught /root/project/out/test/test_avx.js): -9 (expected=0).

I tried to dump dmesg, but failed, I see you skipped some test due to OOM, so could you give some suggestion on this? I didn't see resource usage information and any logs related to process kills, Thanks!

dmesg: read kernel buffer failed: Operation not permitted

Does OOM seem likely? Does this test use a lot of memory? Or perhaps there are functions that really huge and the v8 jit is running out of memory compiling them? That would fit with the fact that it only happens at -O0. You could confirm by comparing the memory usage of node with the -O0 vs the -O1 build.

If its a really big difference then I think we can say that is the issue, and skip running that test at -O0 in CI. (i.e. at skip:core0.test_avx to the test-core0 runner in .circleci/config.yml.

@jiepan-intel jiepan-intel force-pushed the avx_to_wasm branch 2 times, most recently from 125cdbc to 5a29552 Compare September 23, 2024 08:31
@jiepan-intel
Copy link
Contributor Author

@sbc100, I dump the output of native and wasm, they are same, so it's not related to the test_avx itself, and the failure msg shows JS subprocess return -9 (SIG_KILL), maybe it's killed by OOM killer or others.
JS subprocess failed (/root/emsdk/node/18.20.3_64bit/bin/node --stack-trace-limit=50 --trace-uncaught /root/project/out/test/test_avx.js): -9 (expected=0).
I tried to dump dmesg, but failed, I see you skipped some test due to OOM, so could you give some suggestion on this? I didn't see resource usage information and any logs related to process kills, Thanks!
dmesg: read kernel buffer failed: Operation not permitted

Does OOM seem likely? Does this test use a lot of memory? Or perhaps there are functions that really huge and the v8 jit is running out of memory compiling them? That would fit with the fact that it only happens at -O0. You could confirm by comparing the memory usage of node with the -O0 vs the -O1 build.

If its a really big difference then I think we can say that is the issue, and skip running that test at -O0 in CI. (i.e. at skip:core0.test_avx to the test-core0 runner in .circleci/config.yml.

It's caused by test_compare function, test_compare is a big function unrolled many times. I checked the output of ps, core1.test_avx takes <100MB RSS, while core0.test_avx takes > 4GB with node v18.20.3

and the RSS field is much lower under latest nodejs(v22.9.0), core0.test_avx takes less than 1GB.

@sbc100
Copy link
Collaborator

sbc100 commented Sep 23, 2024

It's caused by test_compare function, test_compare is a big function unrolled many times.

Is there any way to avoid this? If not, then feel free to self.skipTest() if not self.is_optimizing(), along with a comment as to why.

@jiepan-intel
Copy link
Contributor Author

It's caused by test_compare function, test_compare is a big function unrolled many times.

Is there any way to avoid this? If not, then feel free to self.skipTest() if not self.is_optimizing(), along with a comment as to why.

I tried to profile the heap usage with Massif, and found most of the memory is allocated when compiling the wasm module, in BuildTFGraph phase, and about 1/3 is allocated in DecodeBrIf, in original _mm_cmp_pd/_mm_cmp_ps implementation, multiple if statements are used to compare the imm8 parameter, I rewrote the implementation, replaced the if chain with switch, saw about 1/3 RSS reduction, with this change, the core0.test_avx can run successfully without being killed due to OOM.

@sbc100
Copy link
Collaborator

sbc100 commented Sep 24, 2024

Before we land this, I wonder if this change warrants ChangeLog entry?

Since Webassembly only supports 128-bit fixed vector length, one 256-bit AVX intrinsic is
emulated by two 128-bit intrinsics.
* Empty _mm256_zeroall and _mm256_zeroupper
* Remove  __DEFAULT_FN_ATTRS macro
* Rewrite keyword __inline to __inline__
* Rewrite select4* function
* Rewrite _mm256_cmp_pd and _mm256_cmp_ps macro into function
* remove unused header
* put __attribute__((aligned(32))) before the declaration
* drop the extra scope and __extension__ in some macro
Replace if chain with switch statement, about 1/3 RSS(resident set size)
reduction is observed in core0.test_avx
@jiepan-intel
Copy link
Contributor Author

Before we land this, I wonder if this change warrants ChangeLog entry?

Sure, ChangeLog is updated, Thanks!

@sbc100 sbc100 merged commit 77e24ae into emscripten-core:main Sep 25, 2024
28 checks passed
@dschuff
Copy link
Member

dschuff commented Sep 26, 2024

After this change, test_avx fails to build in UBSan mode because there are too many locals in the function ([parse exception: too many locals (at 0:622961)], see https://logs.chromium.org/logs/emscripten-releases/buildbucket/cr-buildbucket/8735743902421274257/+/u/Emscripten_testsuite__UBSan_/stdout). Specifically, wasm-emscripten-finalize is rejecting it as invalid when it runs.

For now I think it's fine if we just disable this test in UBSan mode to make the bots happy again.
But it did make me think: On one hand it makes sense for wasm-opt to reject invalid modules, but on the other hand, the whole point of wasm-opt is to optimize (and hopefully reduce the number of locals). I wonder if it makes sense not to enforce these limits on the input side, and maybe enforce (or just warn?) on the output side?

dschuff added a commit that referenced this pull request Sep 26, 2024
This is needed after #22430 expanded the test.
@jiepan-intel
Copy link
Contributor Author

I think it's more reasonable to enforce the limit on the output side, maybe most users don't care the intermediate results as long as the final modules is valid. Further more, the core spec allows up to 2^32 locals, the current limitation is much smaller, matching the JS embedding.

jkl1337 added a commit to jkl1337/emscripten that referenced this pull request Nov 5, 2024
256-bit AVX support added in PR emscripten-core#22430 is not compatible with C
compiler.

Add the `union` keyword to the declarations of the internally
defined `m256_data` so that use of avxintrin.h does not fail with
the C compiler. Also adds `__` prefix to make type `__m256_data` to
avoid polluting C++ namespace.
jkl1337 added a commit to jkl1337/emscripten that referenced this pull request Nov 5, 2024
An internal union defined to implement 256-bit AVX support added in
PR emscripten-core#22430 is missing `union` keywords at
declarations so causes errors when used with the C compiler.

Add the `union` keyword to the declarations of the `m256_data` union in
avxintrin.h. Also adds `__` prefix to make type `__m256_data` to
avoid further polluting global namespace.
@jiepan-intel jiepan-intel mentioned this pull request Nov 29, 2024
jkl1337 added a commit to jkl1337/emscripten that referenced this pull request Dec 16, 2024
An internal union defined to implement 256-bit AVX support added in
PR emscripten-core#22430 is missing `union` keywords at
declarations so causes errors when used with the C compiler.

Add the `union` keyword to the declarations of the `m256_data` union in
avxintrin.h. Also adds `__` prefix to make type `__m256_data` to
avoid further polluting global namespace.

Also remove headers from test exception list as C++ only.
jkl1337 added a commit to jkl1337/emscripten that referenced this pull request Dec 16, 2024
An internal union defined to implement 256-bit AVX support added in
PR emscripten-core#22430 is missing `union` keywords at
declarations so causes errors when used with the C compiler.

Add the `union` keyword to the declarations of the `m256_data` union in
avxintrin.h. Also adds `__` prefix to make type `__m256_data` to
avoid further polluting global namespace.

Also remove headers from test exception list as C++ only.
sbc100 pushed a commit that referenced this pull request Dec 16, 2024
An internal union defined to implement 256-bit AVX support added in PR
#22430 is missing `union` keywords at
declarations so causes errors when used with the C compiler.

Add the `union` keyword to the declarations of the `m256_data` union in
avxintrin.h. Also adds `__` prefix to make type `__m256_data` to avoid
further polluting global namespace.
hedwigz pushed a commit to hedwigz/emscripten that referenced this pull request Dec 18, 2024
An internal union defined to implement 256-bit AVX support added in PR
emscripten-core#22430 is missing `union` keywords at
declarations so causes errors when used with the C compiler.

Add the `union` keyword to the declarations of the `m256_data` union in
avxintrin.h. Also adds `__` prefix to make type `__m256_data` to avoid
further polluting global namespace.
kripken pushed a commit that referenced this pull request Jan 23, 2025
Followup to #22430. Each 256-bit AVX2 intrinsic is emulated on top of
128-bit intrinsics that wasm supports directly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants