StringZilla 4 CUDA 🥳 #201
Merged
Co‑authored‑by: Mark Reed <mark@chattyr.com>
Co‑authored‑by: Mark Reed <5108907+MarkReedZ@users.noreply.github.com>
Sadly, the code bloat of maintaining SWAR on s390x is too high, so it's wiser to avoid it for now.
This PR refactors the entire codebase, splitting the single-header implementation into multiple headers. Moreover, it brings faster kernels for:
And more community contributions:
basic_charset initialization are discarded #200
HashMap traits for Rust #215
Huge thanks to our partners at Nebius for their continued support and endless stream of GPU installations for the most demanding computational workloads in both AI and beyond!
Why Split the Files? Matching SimSIMD Design
Sadly, most modern software development tooling is subpar. VS Code is just as slow and unresponsive as the older Atom and the other web-based technologies, while LSP implementations for C++ are equally slow and completely mess up code highlighting for files over 5,000 Lines Of Code (LOCs). So, I've unbundled the single-header solution into multiple headers, similar to SimSIMD.
Also, similar to SimSIMD, CPU feature detection has been reworked to separate serial implementations, Haswell, Skylake, Ice Lake, NEON, and SVE.
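As a rough illustration of what separating those tiers can look like at compile time, here is a minimal sketch. The function name `best_compiled_backend` and the exact macro-to-tier mapping are assumptions made for this example, not StringZilla's actual detection logic; only the compiler-predefined macros themselves (`__AVX2__`, `__AVX512F__`, `__AVX512VBMI2__`, `__ARM_NEON`, `__ARM_FEATURE_SVE`) are real.

```cpp
#include <string_view>

// Hypothetical helper: reports the most capable tier the compiler was allowed to target.
// The mapping below (AVX2 ~ Haswell, AVX-512F ~ Skylake, AVX-512 VBMI2 ~ Ice Lake) is an
// assumption for illustration, not the library's own dispatch table.
constexpr std::string_view best_compiled_backend() {
#if defined(__ARM_FEATURE_SVE)
    return "sve";
#elif defined(__ARM_NEON)
    return "neon";
#elif defined(__AVX512VBMI2__)
    return "ice";
#elif defined(__AVX512F__)
    return "skylake";
#elif defined(__AVX2__)
    return "haswell";
#else
    return "serial"; // Portable fallback when no SIMD target is enabled
#endif
}
```

Keeping each tier behind its own guard is what lets every backend live in a separate header: a translation unit only needs to include the tiers its target can actually execute.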
Faster Sequence Alignment & Scoring on GPUs
Biology is set to be one of the driving forces of the 21st century, and biological DNA/RNA/protein data is already one of the fastest-growing data modalities, outpacing Moore's Law. Still, most bioinformatics software today is flawed and pretty slow. Last year, I helped several BioTech and Pharma companies scale up their data-processing capacity with specialized optimizations for various niche use cases. Beyond that, I wanted to include baseline kernels for the most crucial algorithms in StringZilla, covering:
These kernels are hardly state-of-the-art at this point, but should provide a good baseline, ensuring correctness and equivalent outputs across different CPU & GPU brands.
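To make the idea of a baseline alignment kernel concrete, here is a minimal serial sketch. The list of covered algorithms is not reproduced above, so as an assumption this example uses unit-cost Levenshtein distance, the simplest linear-gap scoring. It walks the dynamic-programming matrix along anti-diagonals and keeps only three of them in memory; cells on one anti-diagonal do not depend on each other, which is the property parallel hardware exploits. The function name `levenshtein_antidiagonal` is hypothetical and not part of the library's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: unit-cost edit distance evaluated along anti-diagonals.
// Diagonal `d` holds the cells (i, j) with i + j == d, stored at index `i`.
std::size_t levenshtein_antidiagonal(std::string_view a, std::string_view b) {
    std::size_t const n = a.size(), m = b.size();
    std::vector<std::size_t> prev2(n + 1), prev1(n + 1), curr(n + 1);
    for (std::size_t d = 0; d <= n + m; ++d) {
        std::size_t const i_min = d > m ? d - m : 0;
        std::size_t const i_max = std::min(d, n);
        // Every cell of this diagonal only reads from the two previous diagonals.
        for (std::size_t i = i_min; i <= i_max; ++i) {
            std::size_t const j = d - i;
            if (i == 0) { curr[i] = j; }      // First row: j insertions
            else if (j == 0) { curr[i] = i; } // First column: i deletions
            else {
                std::size_t const substitution = prev2[i - 1] + (a[i - 1] != b[j - 1]); // cell (i-1, j-1)
                std::size_t const deletion = prev1[i - 1] + 1;                          // cell (i-1, j)
                std::size_t const insertion = prev1[i] + 1;                             // cell (i, j-1)
                curr[i] = std::min({substitution, deletion, insertion});
            }
        }
        // Rotate the buffers: the oldest diagonal becomes scratch space.
        std::swap(prev2, prev1);
        std::swap(prev1, curr);
    }
    return prev1[n]; // The last diagonal contains only the cell (n, m)
}
```

For example, `levenshtein_antidiagonal("kitten", "sitting")` evaluates to 3. A simple reference like this is also handy for checking that faster CPU and GPU variants produce equivalent outputs.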
Faster Sorting
Our old algorithm didn't perform any memory allocations and tried to fit too much into the provided buffers. The new breaking change in the API allows passing a memory allocator, making the implementation more flexible. It now works fine on both 32-bit and 64-bit systems.
The new serial algorithm is often 5 times faster than the std::sort function in the C++ Standard Template Library for a vector of strings. It's also typically 10 times faster than the qsort_r function in the GNU C library. There are even quicker versions available for Ice Lake CPUs with AVX-512 and Arm CPUs with SVE.
Faster Hashing Algorithms
Our old algorithm was a variation of the Karp-Rabin hash and was designed more for rolling hashing workloads. Sadly, such hashing schemes don't pass SMHasher and similar hash-testing suites, and a better solution was needed. For years, I have been contemplating designing a general-purpose hash function based on AES instructions, which have been implemented in hardware for several CPU generations now. As discussed with @jandrewrogers, and as can be seen in his AquaHash project, those instructions provide an almost unique amount of mixing logic per CPU cycle of latency.
Many popular hash libraries, such as AHash in the Rust ecosystem, cleverly combine AES instructions with 8-bit shuffles and 64-bit additions. However, they rarely harness the full power of the CPU due to the constraints of Rust tooling and the complexity of using masked x86 AVX-512 and predicated Arm SVE2 instructions. StringZilla does that and ticks a few more boxes:
--extra tests.
Implementing this logic, which provides both fast and high-quality hashes, often capable of computing four hashes simultaneously, made these kernels handy not only for hashing itself, but also for higher-level operations like database-style hash joins and set intersections, as well as advanced sequence alignment algorithms for bioinformatics.
Fingerprinting, Sketching, and Min-Hashing... using 52-bit Floats?!
Despite deprecating Rabin-Karp rolling hashes for general-purpose workloads, it's hard to argue with their usability in "fingerprinting" or "sketching" tasks, where a fixed-size feature vector is needed to compare contents of variable-length strings. The features should be as different as possible, covering various substring lengths, so polynomial rolling hashes fit nicely!
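A minimal sketch of that idea, under stated assumptions: each feature of the fingerprint is the minimum Rabin-Karp rolling hash over all windows of one particular width, so different features cover different substring lengths. The function name `rolling_min_fingerprint`, the base 257, and the modulus 2^31 - 1 are all choices made for this example and are not taken from StringZilla.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>
#include <vector>

// Hypothetical helper: one min-hash feature per window width.
// Base and modulus are illustrative; the small modulus keeps every product within 64 bits.
std::vector<std::uint64_t> rolling_min_fingerprint(std::string_view text, std::vector<std::size_t> const &widths) {
    constexpr std::uint64_t modulus = 2147483647ull; // 2^31 - 1, a Mersenne prime
    constexpr std::uint64_t base = 257;
    std::vector<std::uint64_t> features;
    for (std::size_t width : widths) {
        // Sentinel for widths that don't fit into the text
        std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
        if (width != 0 && width <= text.size()) {
            // Weight of the outgoing character: base^(width - 1) mod modulus
            std::uint64_t highest_power = 1;
            for (std::size_t i = 1; i < width; ++i) highest_power = highest_power * base % modulus;
            // Hash of the first window
            std::uint64_t hash = 0;
            for (std::size_t i = 0; i < width; ++i)
                hash = (hash * base + static_cast<unsigned char>(text[i])) % modulus;
            best = hash;
            // Slide the window: drop the oldest character, then append the newest
            for (std::size_t i = width; i < text.size(); ++i) {
                std::uint64_t const outgoing = static_cast<unsigned char>(text[i - width]) * highest_power % modulus;
                hash = (hash + modulus - outgoing) % modulus;
                hash = (hash * base + static_cast<unsigned char>(text[i])) % modulus;
                best = std::min(best, hash);
            }
        }
        features.push_back(best);
    }
    return features;
}
```

Calling `rolling_min_fingerprint(text, {3, 5, 7})` yields a three-element vector regardless of the text length, which is what makes such fingerprints directly comparable across variable-length strings.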
That said, implementing modular arithmetic over 64-bit integers is extremely expensive on Intel CPUs:
VPMULLQ (ZMM, ZMM, ZMM) for _mm512_mullo_epi64:
VPMULLD (ZMM, ZMM, ZMM) for _mm512_mullo_epi32:
VPMULLW (ZMM, ZMM, ZMM) for _mm512_mullo_epi16:
VPMADD52LUQ (ZMM, ZMM, ZMM) for _mm512_madd52lo_epu64 for 52-bit multiplication:
It's even pricier on Nvidia GPUs, but as you can see above, we may have a way out! x86 has a cheap instruction for 52-bit integer multiplication & addition. Moreover, 52 is precisely the number of bits we can safely use to exactly store an integer inside a double-precision floating-point number, opening doors to a really weird implementation of rolling hash functions across CPUs and GPUs, built on 64-bit floats and Barrett reduction for modular arithmetic, to avoid division!
Next Steps
linear_score_on_each_cuda_warp_ and affine_score_on_each_cuda_warp_, which scale to 16 blocks of (228 - 1) KB shared memory each, totaling 3.5 MB of shared memory that can be used for alignments. With u32 scores and 3 DP diagonals (in case of non-affine gaps) that should significantly accelerate the alignment of ~300K-long strings. But keep in mind that cluster.sync() is still very expensive - 1300 cycles - only 40% less than 2200 cycles for grid.sync().
stringzilla_bare builds on MSVC, if needed.
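The shared-memory budget quoted in the first item can be checked with a few lines of compile-time arithmetic. This is only a back-of-the-envelope sketch: the constant names are made up for this example, and it assumes that all of the shared memory goes to the three diagonals, ignoring the space needed for the strings themselves and any other buffers.

```cpp
#include <cstddef>
#include <cstdint>

// All names are illustrative; the figures come from the text above.
constexpr std::size_t blocks_per_cluster = 16;
constexpr std::size_t shared_bytes_per_block = (228 - 1) * 1024; // 227 KB usable per block
constexpr std::size_t total_shared_bytes = blocks_per_cluster * shared_bytes_per_block;

constexpr std::size_t diagonals = 3;                          // Non-affine gaps need 3 DP diagonals
constexpr std::size_t bytes_per_score = sizeof(std::uint32_t); // u32 scores
constexpr std::size_t max_diagonal_length = total_shared_bytes / (diagonals * bytes_per_score);

// Relative saving of cluster.sync() over grid.sync(), using the cycle counts quoted above
constexpr double sync_saving = 1.0 - 1300.0 / 2200.0;

static_assert(total_shared_bytes == 3719168, "16 x 227 KB is roughly 3.5 MB");
static_assert(max_diagonal_length == 309930, "roughly 300K cells per diagonal");
```

The result, about 310 thousand cells per diagonal, matches the ~300K string length mentioned above, and `sync_saving` comes out at about 0.41, in line with the quoted 40%.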