Conversation
```c
// Bitmask is slow on AArch64, any_true is much faster.
if (wasm_v128_any_true(cmp)) {
```
Not sure about this: using `wasm_i8x16_bitmask` directly here is a better lowering for x64 than `wasm_v128_any_true`.
Yes. This is a mitigation for AArch64. We don't have the luxury of knowing the final architecture, much less the CPU, or how good the runtime is at optimizing (e.g., peephole optimizations of) the final generated assembly.
But I'm pretty sure I measured, and at least for large buffers (and wazero) this was a significant improvement on AArch64, for a pretty insignificant cost on 3 different x86-64 CPUs.
I tried this again with bench.c with wasmtime on my Xeon W-2135, and... it's hard to measure. I'm not saying it's not slower (it may be), but it's close enough that between processors, VMs, and lengths, I'm not sure which is better.
So, where do I put something like bench.c, and how do we settle the matter?
It seems like this kind of thing could fit in sightglass, even though this is a bit of a micro-benchmark. You could take a look at this example of how the blake3 benchmark was added: benchmark.c. In the Dockerfile that builds a benchmark you could add the special "build wasi-libc with SIMD enabled" logic.
But you don't have to put it there. I think we could probably settle this using the bench.c you provided. I'd probably be comfortable merging this without the special aarch64 optimization now and then submitting that as a second PR once I have a chance to measure a few things. Let me make sure I understand what you're saying precisely: (a) you can't detect a difference using bitmask or any_true with the x64 CPUs you tested but (b) it still makes a very big difference for aarch64?
On hold due to #593. I don't understand the implications of the UB claim for the pointer arithmetic needed to implement it. Do feel free to update the PR as required.
Updated with the inline assembly and the suggestions in #593 (comment); @alexcrichton, please validate. Also tweaked naming and formatting.
Continuing #580, this implements `strspn` and `strcspn`. It follows the same general structure as #586, #592 and #594, but uses a somewhat more complicated algorithm, described [here](http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html). I used the Geoff Langdale alternative implementation (the tweet has since disappeared), which is correctly described there but has a subtle bug in the implementation: WojciechMula/simd-byte-lookup#2. Since the complexity needed for `__wasm_v128_bitmap256_t` is shared by both `strspn` and `strcspn`, I moved the implementation to a common file used when SIMD is enabled. The tests follow a similar structure to the previous ones, and cover the bug, which was found through fuzzing.
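The shared idea behind `__wasm_v128_bitmap256_t` can be sketched in scalar C (names here are illustrative, not the actual types from the PR; the SIMD version tests 16 bytes at a time via the nibble-lookup trick from the linked article): one bit per possible byte value, set for every byte in the set, then `strspn`/`strcspn` just test bits while scanning.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t bits[32]; } bitmap256_t; /* 256 bits, one per byte value */

static void bitmap256_build(bitmap256_t *bm, const char *set) {
  memset(bm->bits, 0, sizeof bm->bits);
  for (const unsigned char *p = (const unsigned char *)set; *p; p++)
    bm->bits[*p >> 3] |= (uint8_t)(1u << (*p & 7));
}

static int bitmap256_test(const bitmap256_t *bm, unsigned char c) {
  return (bm->bits[c >> 3] >> (c & 7)) & 1;
}

/* strspn in terms of the bitmap; strcspn is the same loop with the
 * bitmap test negated. */
static size_t my_strspn(const char *s, const char *set) {
  bitmap256_t bm;
  bitmap256_build(&bm, set);
  size_t n = 0;
  while (s[n] && bitmap256_test(&bm, (unsigned char)s[n]))
    n++;
  return n;
}
```

Building the bitmap once is why sharing the type between the two functions pays off: only the per-byte test differs.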
Continuing #580, followup to #586.
Chose `memchr` because it's somewhat similar to `strlen`, but also because it is the basis for `strnlen` (and, in that capacity, for `strndup` and `strlcat`) and is also used by `strstr` and `fnmatch`.
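The "basis for `strnlen`" relationship is direct: `strnlen(s, n)` is a bounded search for the NUL byte, so it reduces to one `memchr` call (a sketch with an illustrative name, not the wasi-libc source):

```c
#include <stddef.h>
#include <string.h>

/* strnlen expressed in terms of memchr: find the NUL within maxlen
 * bytes, or report maxlen if none is found. */
static size_t my_strnlen(const char *s, size_t maxlen) {
  const char *nul = memchr(s, '\0', maxlen);
  return nul ? (size_t)(nul - s) : maxlen;
}
```

So a faster `memchr` immediately speeds up `strnlen` and, through it, the functions built on top.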