The messy reality of SIMD (vector) functions

mfiguiere 117 points johnnysswlab.com

−

exDM69

This right here illustrates why I think there should be better first class SIMD in languages and why intrinsics are limited.

When using GCC/clang SIMD extensions in C (or Rust nightly), the implementation of sin4f and sin8f are line by line equal, with the exception of types. You can work around this with templates/generics.

The sin function is entirely basic arithmetic operations, no fancy instructions are needed (at least for the "computer graphics quality" 32 bit sine function I am using).

Contrast this with intrinsics where the programmer needs to explicitly choose the mm128 or mm256 instruction even for trivial stuff like addition and other arithmetic.

Similarly, a 4x4 matrix multiplication function is the exact same code for 64 bit double and 32 bit float if you're using built in SIMD. A bit of generics and no duplication is needed. Where as intrinsics again needs two separate implementations.

I understand that there are cases where intrinsics are required, or can deliver better performance but both C/C++ and Rust have zero cost fallback to intrinsics. You can "convert" between f32x4 and mm128 at zero cost (no instructions emitted, just compiler type information).

I do use some intrinsics in my SIMD code this way (rsqrt, rcp, ...). The CPU specific code is just a few percent of the overall lines of code, and that's for Arm and x86 combined.

The killer feature is that my code will compile into x86_64/SSE and Aarch64/neon. And I can use wider vectors than the CPU actually supports, the compiler knows how to break it down to what the target CPU supports.

I'm hoping that Rust std::simd would get stabilized soon, I've used it for many years and it works great. And when it doesn't I have a zero cost fallback to intrinsics.

Some very respected people have the opinion that std::simd or its C equivalent suffer from a "least common denominator problem". I don't disagree with the issue but I don't think it really matters when we have a zero cost fallback available.

−

camel-cdr

My personal gripe with Rust's std::simd in its current form is that it makes writing portable SIMD hard while making non-portable SIMD easy.^[0]

the implementation of sin4f and sin8f are line by line equal, with the exception of types. You can work around this with templates/generics

This is true, I think most SIMD algorithms can be written in such a vector length-agnostic way, however almost all code using std::simd specifies a specific lane count instead of using the native vector length. This is because the API favors the use of fixed-size types (e.g. f32x4), which are exclusively used in all documentation and example code.

If I search github for `f32x4 language:Rust` I get 6.4k results, with `"Simd<f32," language:Rust NOT "Simd<f32, 2" NOT "Simd<f32, 4" NOT "Simd<f32, 8"` I get 209.

I'm not even aware of a way to detect the native vector length using std::simd. You have to use the target-feature or multiversion crate, as shown as the last part of the rust-simd-book^[1]. Well, kind of like that, because their suggestion using "suggested_vector_width", which doesn't exist. I could only find a suggested_simd_width.

Searching for "suggested_simd_width language:Rust", we are now down to 8 results, 3 of which are from the target-feature/multiversion crates.

---

What I'm trying to say is that, while being able to specify a fixed SIMD width can be useful, the encouraged default should be "give me a SIMD vector of the specified type corresponding to the SIMD register size". If your problem can only be solved with a specific vector length, great, then hard-code the lane count, but otherwise don't.

See^[0] for more examples of this.

[0] https://github.com/rust-lang/portable-simd/issues/364#issuec...

[1] https://calebzulawski.github.io/rust-simd-book/4.2-native-ve...

−

exDM69

The binary portability issue is not specific to Rust or std::simd. You would have to solve the same problems even if you use intrinsics or C++. If you use 512 but vectors in the code, you will need to check if the CPU supports it or add multiversioning dispatch or you will get a SIGILL.

I have written both type generic (f32 vs f64) and width generic (f32x4 vs f32x8) SIMD code with Rust std::simd.

And I agree it's not very pretty. I had to resort to having a giant where clause for the generic functions, explicitly enumerating the required std::ops traits. C++ templates don't have this particular issue, and I've used those for the same purpose too.

But even though the implementation of the generic functions is quite ugly indeed, using the functions once implemented is not ugly at all. It's just the "primitive" code that is hairy.

I think this was a huge missed opportunity in the core language, there should've been a core SIMD type with special type checking rules (when unifying) for this.

However, I still think std::simd is miles better than intrinsics for 98% of the SIMD code I write.

The other 1% (times two for two instruction sets) is just as bad as it is in any other language with intrinsics.

The native vector width and target-feature multiversioning dispatch are quite hairy. Adding some dynamic dispatch in the middle of your hot loops can also have disastrous performance implications because they tend to kill other optimizations and make the cpu do indirect jumps.

Have you tried just using the widest possible vector size? e.g. f64x64 or something like it. The compiler can split these to the native vector width of the compiler target. This happens at compile time so it is not suitable if you want to run your code on CPUs with different native SIMD widths. I don't have this problem with the hardware I am targeting.

Rust std::simd docs aren't great and there have been some breaking changes in the few years I've used it. There is certainly more work on that front. But it would be great if at least the basic stuff would get stabilized soon.

−

MangoToupe

there should've been a core SIMD type with special type checking rules

What does this mean?

−

dzaima

Overly-wide vectors I'd say are a pretty poor choice in general.

If you're using shuffles at times, you must use native-width vectors to be able to apply them.

If you're doing early-exit loops, you also want the vector width to be quite small to not do useless work.

f64x64 is presumably an exaggeration, but an important note is that overly long vectors will result in overflowing the register file and thus will make tons of stack spills. A single f64x64 takes up the entire AVX2 or ARM NEON register file! There's not really much room for a "widest" vector - SSE only has a tiny 2048-bit register file, the equivalent of just four AVX-512 registers, 1/8th of its register file.

And then there's the major problem that using fixed-width vectors will end up very badly for scalable vector architectures, i.e. ARM SVE and RISC-V RVV; of course not a big issue if you do a native build or do dynamic dispatch, but SVE and RVV are specifically made such that you do not have to do a native build nor duplicate code for different hardware vector widths.

And for things that don't do fancy control flow or use specialized instructions, autovectorization should cover you pretty well anyway; if you have some gathers or many potentially-aliasing memory ranges, on clang & gcc you can _Pragma("clang loop vectorize(assume_safety)") _Pragma("GCC ivdep") to tell the compiler to ignore aliasing and vectorize anyway.

−

exDM69

f64x64 is presumably an exaggeration

It's not. IIRC, 64 elements wide vectors are the widest that LLVM (or Rust, not sure) can work with. It will happily compile code that uses wider vectors than the target CPU has and split accordingly.

That doesn't necessarily make it a good idea.

Autovectorization works great for simple stuff and has improved a lot in the past decade (e.g. SIMD gather loads).

It doesn't work great for things like converting a matrix to quaternion (or vice versa), and then doing that in a loop. But if you write the inner primitive ops with SIMD you get all the usual compiler optimizations in the outer loop.

You should not unroll the outer loop like in the Quake 3 days. The compiler knows better how many times it should be unrolled.

I chose this example because I recently ported the Quake 3 quaternion math routines to Rust for a hobby project and un-unrolled the loops. It was a lot faster than the unrolled original (thanks to LLVM, same would apply to Clang).

−

dzaima

I think LLVM should be able to handle up to less than 65536-bit vectors? At least I saw some llvm issue noting that that's where some things broke down; so a 32kbit vector would be f64x512. But I meant an exaggeration as far as actually using it goes, f64x64 is hilariously overkill even on AVX-512, and utterly awful on pre-AVX512.

−

exDM69

It's actually Rust that is setting the 64 wide limit (see SupportedLaneCount), not LLVM.

I agree, f64x64 is probably a very bad idea.

But something like f32x8 would probably still be "fast enough" on old/mobile CPUs without 256 wide vectors (but good 128 bit SIMD ALU).

I did something like this when using a u16x16 bitmask fit the problem domain. Most of my target CPUs have 256 wide registers but on mobile ARM land they don't. This wasn't particularly performance sensitive code so I just used 256 bit wide vectors anyway. It wasn't worth it trying to optimize for the old CPUs separately.

−

MangoToupe

portable SIMD

this seems like an oxymoron

−

mort96

Why? There's a ton of commonality between SIMD instruction sets. As long as you have a model for SIMD where you can avoid hard-coding a vector width and you use a relatively constrained set of vector instructions, there's no fundamental reason why the same source code shouldn't be able to compile down to AVX-512, AVX2, SSE, ARM NEON, ARM SVE and RVV instructions. For most use cases, we're doing the same basic set of operations: copy data from memory into vector registers (maybe with some transformation, like copy 8-bit ints from memory into a vector of u16), do math operations on vector registers, copy back to memory.

−

MangoToupe

there's no fundamental reason why the same source code shouldn't be able to compile down to

Well if you don't describe your code and dataflow in a way that caters to the shape of the SIMD, it seems ridiculous to expect.

But yes, as always, compilers could be magic programs that transform code into perfectly optimal programs. They could also fix bugs for us, too.

−

mort96

I'm saying that the shape of the SIMD is pretty much the same across platforms. Vector width differs between architectures, and whether the vector width is determined at compile time or at runtime differs between architectures, but you'll have to convince me that the vector width is such an essential component of the abstract description of the computation that you fundamentally can't abstract it away. (In fact, the success of RVV and ARM SVE should tell us that we can describe SIMD computation in a vector width-independent way.)

All vector instruction sets offer things like "multiply/add/subtract/divide the elements in two vector registers", so that is clearly not the part that's impossible to describe portably.

−

anonymoushn

It's really not. As an example, for string processing tasks (including codecs which various server software spends a significant percentage of its runtime on), NEON includes a deinterleaving load into 4 registers and byte-wise shuffles that accept 2, 3, or 4 registers worth of lookup table. These primitives are quite different from those available on AVX2 or AVX-512, and the fact that they are available and cheap to use means you end up with somewhat different algorithms for the two types of targets. Even the practice of using the toys available in AVX2 well for this sort of task is somewhat obscure. Folks who have worked on codec-type stuff but primarily used AVX-512 often have trouble figuring out how to do most of the same things in similar instruction counts if masked versions of the instructions are not available.

−

janwas

I made the same argument a while ago but a coworker changed my mind.

Can you afford to write and maintain a codepath per ISA (knowing that more keep coming, including RVV, LASX and HVX), to squeeze out the last X%? Is there no higher-impact use of developer time? If so, great.

If not, what's the alternative - scalar code? I'd think decent portable SIMD code is still better than nothing, and nothing (scalar) is all we have for new ISAs which have not yet been hand-optimized. So it seems we should anyway have a generic SIMD path, in addition to any hand-optimized specializations.

BTW, Highway indeed provides decent emulations of LD2..4, and at least 2-table lookups. Note that some Arm uarchs are anyway slow with 3 and 4.

−

anonymoushn

For now, at work, it's just some parts with AVX-512, some parts with AVX-512 that we can't really use, so we should use AVX2, and some parts with NEON and SVE. So the implementations for SSE basically are a courtesy to outside users of the libraries, and there are no RVV implementations.

If we were already depending on highway or eve, I would think it's great to ship the generic SIMD version instead of the SSE version, which probably compiles down to the same thing on the relevant targets. This way, if future maintainers need to make changes and don't want to deal with the several implementations I have left behind, the presence of the generic implementation would allow them to delete them rather than making the same changes a bunch of times.

−

janwas

Makes sense :) Generic or fallback versions are also useful for correctness testing and benchmarking.

−

exDM69

if you don't describe your code and dataflow in a way that caters to the shape of the SIMD

But when I do describe code, dataflow and memory layout in a SIMD friendly way it's pretty much the same for x86_64 and ARM.

Then I can just use `a + b` and `f32x4` (or its C equivalent) instead of `_mm_add_ps` and `_mm128` (x86_64) or `vaddq_f32` and `float32x4_t` (ARM).

Portable SIMD means I don't need to write this code twice and memorize arcane runes for basic arithmetic operations.

For more specialized stuff you have intrinsics.

−

owlbite

So we write a lot of code in this agnostic fashion using typedef's and clang's vector attribute support, along with __builtin_shufflevector for all the permutations (something along similar lines to Apple's simd.h). It works pretty well in terms of not needing to memorize/lookup all the mnemonic intrinsics for a given platform, and letting regular arithmetic operations exist.

However, we still end up writing different code for different target SOCs, as the microarchitecture is different, and we want to maximize our throughput and take advantage of any ISA support for dedicated instructions / type support.

One big challenge is targeting in-order cores the compiler often does a terrible job of register allocation (we need to use pretty much all the architectural registers to allow for vector instruction latencies), so we find the model breaks down somewhat there as we have to drop to inline assembly.

−

exDM69

Your experience matches mine, you can get a lot done with the portable SIMD in Clang/GCC/Rust but you can't avoid the platform specific stuff when you need specialized instructions.

Depends on the domain you work in how much you need to resort to platform specific intrinsics. For me dabbling in computer graphics and game physics, almost all of the code is portable except for some rare specialized instructions here and there.

For someone working in specialized domains (like video codecs) or hardware (HPC super computers) the balance might be the other way around.

−

jandrewrogers

SIMD instructions are used for much more than parallelizing math on primitive types. I think this assumption is whence disagreement on this originates.

It is all of this "other" code where the differences and incompatibilities between SIMD instruction sets become painfully obvious. There are many cases where a good mapping from one SIMD ISA to another doesn't really exist, even between nominally related instruction sets like AVX-512 and AVX2.

The common denominator is sufficiently limited that it doesn't help that much in many cases and in fact may be worse than a use case specific SIMD abstraction that carefully navigates these differences but doesn't generalize.

−

MangoToupe

first class SIMD in languages

People have said this for longer than I've been alive. I don't think it's a meaningful concept.

−

Sharlin

It’s such a ridiculous situation we’re in. Just about every consumer CPU of the past 20 years packs an extra order of magnitude or two of punch for data processing workloads, but to not let it go to waste you have to resort to writing your inner loops using low-level nonportable intrinsics that are just a step above assembly. Or pray that the gods of autovectorization are on your side.

−

jltsiren

Adding parallelism is much easier on the hardware side than the software side. We've kind of figured out the easy cases, such as independent tasks with limited interactions with each other, and made them practical for the average developer. But nobody understands the harder cases (such as SIMD) well enough to create useful abstractions that don't constrain hardware development too much.

−

smallmancontrov

Except for GPU manufacturers, who figured out the right way to do it 20 years ago.

20 years ago, it was extremely obvious to anyone who had to write forward/backward compatible parallelism that the-thing-nvidia-calls-SIMT was the correct approach. I thought CPU hardware manufacturers and language/compiler writers were so excessively stubborn that it would take them a decade to catch up. I was wrong. 20 years on, they still refuse to copy what works.

−

jandrewrogers

Intel actually built and launched this 15 years ago. A GPU-like barrel processor with tons of memory bandwidth and wide vector instructions that ran x86. In later versions they even addressed the critical weakness of GPUs for many use cases (poor I/O bandwidth).

It went nowhere, aside from Intel doing Intel things, because most programmers struggle to write good code for those types of architectures so all of that potential was wasted.

−

dzaima

That's because the things GPUs do just isn't what CPUs do. GPUs don't have to deal with ad-hoc 10..100-char strings. They don't have to deal with 20-iteration loops with necessarily-serial dependencies between invocations of the small loops. They don't have to deal with parallelizing mutable hashmap probing operations.

Indeed what GPUs do is good for what GPUs do. But we have a tool for doing things that GPUs do well - it's called, uhh, what's it, uhh... oh yeah, GPUs. Copying that into CPUs is somewhere between just completely unnecessary, and directly harmful to things that CPUs are actually supposed to be used for.

The GPU approach has pretty big downsides for anything other than the most embarassingly-parallel code on very massive inputs; namely, anything non-trivial (sorting, prefix sum) will typically require log(n) iterations, and somewhere between twice as much, and O(n*log(n)) memory access (and even summing requires stupid things like using memory for an accumulator instead of being able to just use vector registers), compared to the CPU SIMD approach of doing a single pass with some shuffles. GPUs handle this via trading off memory latency for more bandwidth, but any CPU that did that would immediately go right in the thrash because that'd utterly kill scalar code performance.

−

smallmancontrov

This reads like you haven't tried CUDA. The whole point of CUDA is that your CUDA code has single-threaded semantics. The problem you assumed it has is the problem it doesn't have, and the fact that it doesn't have this problem is the reason why it works and the reason why it wins.

EDIT: ok, now you're talking about the hardware difference between CPUs and GPUs. This is relevant for the types of programs that each can accelerate -- barrel processors are uniquely suited to embarrassingly parallel problems, obviously -- but it is not relevant for the question of "how to write code that is generic across block size, boundary conditions, and thread divergence." CUDA figured this out, non-embarassingly-parallel programs still have this problem, and they should copy what works. The best time to copy what works was 20 years ago, but the second best time is today.

−

dzaima

It has single-threaded semantics per element. Which is fine for anything that does completely independent computation for each element, but is quite annoying for everything else, requiring major algorithmic changes. And CPU SIMD is used for a lot of such things.

−

smallmancontrov

"Completely independent" except for anything that can be expressed using branches, queues, and locks. Which is everything. Again, are you sure you've tried CUDA? Past, like, the first tutorial?

−

dzaima

I haven't used CUDA (don't have nvidia gpu), but I've looked at examples of code before. And it doesn't look any more simple than CPU SIMD for anything non-trivial.

And pleasepleaseplease don't have locks in something operating over a 20-element array, I'm pretty damn sure that's just simply gonna be a suboptimal approach in any scenario. (even if those are just "software" locks for forcing serialized computation that don't actually end up in any memory atomics or otherwise more instructions, as just needing such is hinting to me of awful approaches like log(n) loops over a n=20-element array, or some in-memory accumulators, or something awful)

As an extreme case of something I've had to do in CPU SIMD that I don't think would be sane in any other way:

How would I in CUDA implement code that does elementwise 32-bit integer addition of two input arrays into a third array (which may be one of the inputs), checking for overflow, and, in the case of any addition overflowing (ideally early-exiting on such to not do useless work), report in some way how much was processed, such that further code could do the addition with a wider result type, being able to compute the full final wider result array even in the case where some of the original inputs aren't available due to the input overlapping the output (which is fine as for those the computed results can be used directly)?

This is a pretty trivial CPU SIMD loop consisting of maybe a dozen intrinsics (even easily doable via any of the generalized arch-independent SIMD libraries!), but I'm pretty sure it'd require a ton of synchronization in anything CUDA-like, and probably being forced to do early-exiting in way larger blocks, and probably having to return a bitmask of which threads wrote their results, as opposed to the SIMD loop having a trivial guarantee of the processed and unprocessed elements being split exactly on where the loop stopped.

(for addition specifically you can also undo the addition to recover the input array, but that gets way worse for multiplication as the inverse there is division; and perhaps in the CUDA approach you might also want to split into separately checking for overflow and writing the results, but that's an inefficient two passes over memory just to split out a store of something the first part already computes)

−

dzaima

re edit - While the hardware differences are significant, some by necessity, some by tradition, it's not my point.

My specific point is that "how to write code that is generic across block size, boundary conditions, and thread divergence." is just not the correct question to ask for many CPU SIMD use-cases. Many of those Just Do Not Fit That Paradigm. If you think you can just squeeze CPU SIMD usage into that box then I don't think you've actually done any CPU SIMD beyond very trivial things (see my example problem in the other thread).

You want to take advantage of block size on CPUs. It's sad that GPU programmers don't get to. In other places I've seen multiple GPU programmers annoyed at not being able to do the CPU SIMD programming paradigm of explicit registers on GPUs. And doing anything about thread divergence on CPUs is just not gonna go well due to the necessary focus on high clock rates (and as such having branch mispredictions be relatively ridiculously expensive).

You of course don't need any of fancy anything if you have a pure embarassingly-parallel problem, for which GPUs are explicitly made. But for these autovectorization does actually already work, given hardware that has necessary instructions (memory gathers/scatters, masked loads/stores if necessary; and of course no programming paradigm would magically make it work for hardware that doesn't). At worst you may need to add a _Pragma to tell the compiler to ignore memory aliasing, at which point the loop body is exactly the same programming paradigm as CUDA (with thread synchronization being roughly "} for (...) {", but you gain better control over how things happen).

−

MangoToupe

Well, yea. You need to describe your data flow in a way the CPU can take advantage of it. Compilers aren't magic.

−

Sharlin

That’s like saying that you have to describe your data flow in terms of gotos because the CPU doesn’t understand for loops and compilers aren’t magic. I don’t mean that autovectorization should just work (tm), I just mean that reasonable portable SIMD abstractions should not be this hard.

−

tomsmeding

I just mean that reasonable portable SIMD abstractions should not be this hard.

Morally, no, it really ought to not be this hard, we need this. Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. X86 and ARM have completely different sets of things that they have instructions for, and even within the X86 family, even within a particular product class, things are inconsistent:

- On normal words, one has lzcnt (leading-zero count) and tzcnt (trailing-zero count), but on SIMD vectors there is only lzcnt. And you get lzcnt only on AVX512, the latest-and-greatest in X86.

- You have horizontal adds (adding adjacent cells in a vector) for 16-bit ints, 32-bit ints, floats and doubles, and saturating horizontal add for 16-bit ints. https://www.intel.com/content/www/us/en/docs/intrinsics-guid... Where are horizontal adds for 8-bit or 64-bit ints, or any other saturating instructions?

- Since AVX-512 filled up a bunch of gaps in the instruction set, you have absolute value instructions on 8, 16, 32 and 64 bit ints in 128, 256 and 512 bit vectors. But absolute value on floats only exists on 512-bit vectors.

These are just the ones that I could find now, there is more. With this kind of inconsistency, any portable SIMD abstraction will be difficult to efficiently compile for the majority of CPUs, negating part of the advantage.

−

exDM69

Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. X86 and ARM have completely different sets of things that they have instructions for

Not disagreeing it's a mess, but there's also quite a big common subset containing all the basic arithmetic ops and some specialized ones rsqrt, rcp, dot product, etc.

These should be easier to use without having to write the code for each instruction set. And they are with C vector extensions or Rust std::simd.

Some of the inconsistencies you mention are less of a problem in portable simd, taking Rust for example:

- lzcnt and tzcnt: std::simd::SimdInt has both leading_zeros and trailing_zeros (also leading/trailing_ones) for every integer size and vector width.

- horizontal adds: notably missing from std::simd (gotta use intrinsics if you want it), but there is reduce_sum (although it compiles to add and swizzle). Curiously LLVM does not compile `x + simd_swizzle!(x, [1, 0, 3, 2])` into haddps

- absolute values for iBxN and fBxN out of the box.

Also these have fallback code (which is mostly reasonable, but not always) when your target CPU doesn't have the instruction. You'll need to enable the features you want at compile time (-C target-features=+avx2).

With this kind of inconsistency, any portable SIMD abstraction will be difficult to efficiently compile for the majority of CPUs, negating part of the advantage.

I agree it negates a part of the advantage. But only a part, and for that you have zero cost fallback to intrinsics. And in my projects that part has been tiny compared to the overall amount of SIMD code I've written.

For basic arithmetic ops it's a huge win to have to write the code only once, and use normal math operations (+, -, *, /) instead of memorizing the per-CPU intrinsics for two (or more) CPU vendors.

−

dzaima

If by that absolute value thing you mean _mm512_abs_pd, that's a pseudoinstruction for 'and'ing via a mask that zeroes out the top bit, which can be done equally as well on 128/256-bit vectors, just without an intrinsic for some arbitrary reason. But yeah the gaps are super annoying. Some of my personal picks:

- There's only 8- and 16-bit integer saturating add/subtract, even on AVX-512

- No 8-bit shifts anywhere either; AVX2 only has 32- and 64-bit dynamic shifts (and ≥16-bit constant shifts; no 64-bit arithmetic shift right though!), AVX-512 adds dynamic 16-bit shifts, still no 8-bit shifts (though with some GFNI magic you can emulate constant 8-bit shifts)

- Narrowing integer types pre-AVX-512 is rather annoying, taking multiple instructions. And even though AVX-512 has instructions for narrowing vectors, you're actually better off using multiple-table-input permute instructions and narrowing multiple vectors at the same time.

- Multiplies on x86 are extremely funky (there's a 16-bit high half instr, but no other width; a 32×32→64-bit instr, but no other doubling width instr; proper 32-bit multiply is only from AVX2, proper 64-bit only in AVX-512). ARM NEON doesn't have 64-bit multiplication.

- Extracting a single bit from each element (movemask/movmsk) exists for 8-/32-/64-bit elements, but not 16-bit on x86 pre-AVX512; ARM NEON has none of those, requiring quite long instruction sequences to do so (and you quite benefit from unrolling and packing multiple vectors together, or even doing structure loads to do some of the rearranging)

- No 64-bit int min/max nor 16-bit element top-bit dynamic blend pre-AVX512

−

janwas

I know dzaima is aware, but for all the other posters who might not be, our Highway library provides all these missing instructions, via emulation if required.

I do not understand why folks are still making do with direct use of intrinsics or compiler builtins. Having a library centralize workarounds (such an an MSAN compiler change which hit us last week) seems like an obvious win.

−

Earw0rm

There's different ways of approaching it which have different performance consequences. Which is why accelerated libraries are common, but if you want accelerated primitives, you kinda have to roll your own.

−

nwallin

this turned into a larger rant than I wanted it to be. But I need to let it out every now and then. Feel free tos kip it.

IMHO programming languages are, for the most part, designed such that a compiler's job is easy if you want to compile to scalar code, but damned near impossible if you want it to compile to vectorized code.

So I have a point type, right? 'struct point3f { float x,y,z; };'. Easy peasy lemon squeezy. I add a bunch of member functions for normal stuff like addition, scalar multiplication, dot/cross product, etc. I write a triangle type: 'struct triangle { point3f a,b,c; };'. I write a bunch of functions for geometry stuff, intersections with rays, normals, etc.

Then I make an array of triangles. I have an origin point and a look ray. I want to iterate over each triangle and figure out which triangles intersect my origin/ray. Now I'm stuck. This is a perfect use case for vectorization, I can trivially get 4x/8x/16x speedup with SSE/AVX/AVX512, but the compiler can't do it. The data's in the wrong layout. It's in the correct layout for scalar code, but the wrong format for vector code. If you want to write vector code, your data has to be in a struct of arrays (SoA) layout.

There ought to exist a programming language that will, by default, automagically convert everything to SoA layout, unless you flag your array as AoS or your class as non-SoA-able. And ranged for loops are, by default, unsequenced.

This will make autovectorization an order of magnitude easier, and will enable vectorization on complicated stuff that might be fiendishly difficult to vectorize, even by hand.

This isn't trivial, and it can't simply be tacked on to a language later on. SIMD needs to be a day-0 priority. Everything else needs to be in support of that.

Until then, until this happens, we will either be leaving 60% of our CPU's silicon idle 99% of the time, or scrubs like me will continue to write SIMD intrinsic laden code with lots of manual bullshit to do what the compiler/language design should be doing for me for free.

Rant over. Thank you for coming to my TED talk.

−

mort96

But why do you need to describe that data flow using non-portable assembly?

−

the__alchemist

Evidence point towards the autovectorization gods being dead, false, or too weak. I hear, but don't believe their prophets.

−

the__alchemist

Here is how I would have it:

Instead of writing:

  fn do_athing(val: f32x69) -> f32x69 {
  }

Instead you do:

  fn do_athing(val: fx) -> fx {
  }

And it just works. f32, f64, as wide as your CPU supports.

−

jiggawatts

The Vector<T> type in dotnet core works precisely like this. Just yesterday I tried it for some spatial map aggregation filters and calculations, and it worked great.

On my older PC it used AVX2 instructions and 256-bits at a time processing, on my newer laptop it automatically switched to AVX-512 and its speed improved 4x without a single line of code having to change!

I.e.: Vector<int> has 8x "int" elements on older processors, but it has 16x elements on newer ones. You can retrieve the count with Vector<int>.Count and use that to split your data.

The only issue is that swizzles and scatter/gather is unavailable for these generic types.

−

exDM69

Here we are discussing the merits of built in SIMD facilities of not one but two programming languages. Waiting for the Zig guys to chime in to make it a three.

Four if you include LLVM IR (I don't).

No reason to be dismissive about it.

−

the__alchemist

I've been dodging the f32/f64-specificity in rust using macros, but I don't love it. I do it because I'm not sure what else to do.

I think `core::simd` is probably not near, but 512-bit AVX SIMD will be out in a month or two! So, you could use f64x8, f32x16.

I've built my own f32xY, Vec3xY etc structs, since I want to use stable rust, but don't want to wait for core::SIMD. These all have syntaxes that mimic core::SIMD. I set up pack/unpack functions too, but they are still a hassle compared to non-SIMD operations.

THis begs the question: If core::simd doesn't end up ideal (For flexibility etc), how far can we get with a thin-wrapper lib? The ideal API (imo) is transparent, and supports the widest instructions available on your architecture, falling back to smaller ones or non-SIMD.

It also begs the question of if you should stop worrying and love the macro... We are talking f32/f64. Different widths. Different architectures. And now conflated with len 3 vecs, len 4 vecs etc (Where each vector/tensor item it a SIMD intrinsic). How does every vector (Not in the SIMD sense; the most popular one uses AoS SIMD which is IMO a mistake) handle this? Macros. This leads me to think we macro this too.

−

exDM69

Here's a toy example of f32 vs. f64 generic Rust std::simd code without macros. It aint't pretty but it works.

https://play.rust-lang.org/?version=nightly&mode=debug&editi...

In my projects, I've put a lot of related SIMD math code in a trait. It saves duplicating the monstrous `where` clause in every function declaration. Additionally it allows me to specialize so that `recip_fast` on f32 uses `__mm_rcp_ps` on x86 or `vrecpeq_f32`/`vrecpsq_f32` on ARM (fast reciprocal intrinsic function) but for f64 it's just `x.recip()` (which uses a division). If compiled on other than ARM or x86, it'll also fall back to `x.recip()` for portability (I haven't actually tested this).

The ergonomics here could be better but at least it compiles to exactly the assembly instructions I want it to and using this code isn't ugly at all. Just `x.recip_fast()` instead of `x.recip()`.

−

ethan_smith

Modern compilers (especially Clang 16+/GCC 13+) have become remarkably good at auto-vectorizing regular scalar code with -O3 -march=native, often matching hand-written SIMD without the maintenance burden.

−

vlovich123

Depends on the specific loop and that’s where the problem is. For many cases, “good enough performance” is ok. However, in other cases like graphics hotspots, video decoders, LLMs etc, performance is actually a requirement and the compiler being free to not do the right thing is a problem. Also, even if it vectorizes it might not vectorize correctly which still violates the contract and there’s no way to have that contract with the compiler is a problem (ie have it fail to compile instead of producing the wrong code)

−

ozgrakkurt

std::simd in rust had atrocious compile time last time I tried it, is it fixed already?

−

mattmaynes

This is where the power and expressiveness of kdb+ shines. It has SIMD primitives out of the box and can optimize your code based on data types to take advantage of it. https://kx.com/blog/what-makes-time-series-database-kdb-so-f...

−

MangoToupe

Time series is vector processing on easy mode, though. The hard part is applying SIMD to problems that aren't shaped to be easily processed in parallel.

−

jgalt212

Fine. What's canonical or most basic example of where SIMD should be applied, but isn't because it's too tricky to do so?

In our shop, we never look to vectorize any function or process unless it's called inside a loop many times.

−

ashvardanian

Text processing. Loading/branching/storing content 1 byte at a time is the CPUs worst nightmare, but most text processing is quite tricky in SIMD.

−

MangoToupe

where SIMD should be applied

That seems to disqualify your example

−

AlotOfReading

There can be huge advantages to text processing in SIMD if you figure it out. Example, simdjson: https://github.com/simdjson/simdjson

−

ashvardanian

Maybe even better examples would be <https://github.com/intel/hyperscan>, <https://github.com/simdutf/simdutf>, and my own <https://github.com/ashvardanian/StringZilla> :)

−

MangoToupe

What's canonical or most basic example of where SIMD should be applied, but isn't because it's too tricky to do so?

There is none. That's a contradiction in terms. SIMD either fits the shape or it doesn't.

−

krapht

Variable length parallelism is hard. You can go to highload.fun (SIMD competition site) for problems that are only parallelized after significant effort.

Try problem #1, parsing numbers.

−

kookamamie

I don't think the native C++, even when bundled with OMP, goes far enough.

In my experience, ISPC and Google's Highway project lead to better results in practice - this mostly due to their dynamic dispatching features.

−

William_BB

Could you elaborate on the dynamic dispatching features a bit more? Is that for portability?

−

camel-cdr

Here is an example using google highway: https://godbolt.org/z/Y8vsonTb8

See how the code has only been written once, but multiple versions of the same functions where generated targeting different hardware features (e.g. SSE, AVX, AVX512). Then `HWY_DYNAMIC_DISPATCH` can be used to dynamically call the fastest one matching your CPU at runtime.

−

William_BB

Thank you so much, this explains it well. I was initially afraid that the dispatch would be costly, but from what I understand it's (almost) zero cost after the first call.

I only code for x86 with vectorclass library, so I never had to worry about portability. In practice, is it really possible to write generic SIMD code like the example using Highway? Or could you often find optimization opportunities if you targeted a particular architecture?

−

ashvardanian

You can go quite far with such libraries if you only perform data-parallel numerics on the CPU. However, if you work on complex algorithms or exotic data structures, there's almost always more upside in avoiding them and writing specialized code for each platform of interest.

−

janwas

I don't understand why it helps to "avoid them" entirely. For the (in my experience) >90% of shared code, we can gain the convenience of the wrapper library. For the rest, Highway allows target-specific specializations amidst your otherwise portable code: `#if HWY_TARGET == HWY_AVX2 ...`

−

jeffreygoesto

Nice. First time that I saw this dynamic dispatch was in FFTW.

−

ashvardanian

Here's an explanation from one of my repos: <https://github.com/ashvardanian/simsimd?tab=readme-ov-file#d...>

−

fancyfredbot

CUDA got writing SIMD functions right. You never have to say "_m1024ps" in CUDA, you just say float. CUDA has been a huge success but somehow nobody has really copied this paradigm. It's just weird.

−

smallmancontrov

Yes!!!

They search every corner of the earth for a clue, from the sulfur vents at the bottom of the ocean to tallest mountains, all very impressive as feats of exploration -- but they are still suffering for want of a clue when clue city is right there next to them, bustling with happy successful inhabitants, and they refuse to look at it. Look, guys, I'm glad you gave a chance to alternatives, sometimes they just need a bit of love to bloom, but you gave them that love, they didn't bloom, and it's time to move on. Do what works and spend your creative energy on a different problem, of which there are plenty.

−

vlovich123

Because SIMT is not a general programming framework like CPUs are. It’s a technique for a dedicated accelerator for a specific kind of problem. SIMD on the other hand lets you get meaningful speed up inline with traditional code.

−

smallmancontrov

No no no! The programming model "meets you where you are" in exactly the way that an auto-vectorizer does. You write unconstrained single-threaded code, the compiler tries to make it parallel, and if it fails your code still works, just slowly. The difference is a few abstractions and social contract tweaks to make the auto-vectorizer reliable and easy to think about. These tweaks "smell like" hacks, but CPU folks have spent 20 years trying to do better and their auto-vectorizers are still failing at the basics so it's past time to copy what works and move on.

−

vlovich123

I think you maybe misunderstood what I was trying to say.

A model like CUDA only works well for the problems it works well on. It requires both HW designed for these kinds of problems, a SW stack that can use it, and problems that fit well within that paradigm. It does not work well for problems that aren’t embarrassingly parallel, where you process a little bit of data, make a decision, process a little bit more etc. As an example, go try to write a TCP stack in CUDA vs a normal language to understand the inherent difficulty of such an approach.

And when I say “hw designed for this class of problems” I mean it. Why does the GPU have so much compute? It throws away HW blocks that modern CPUs have that help with “normal” code. Like speculative execution hardware, thread synchronization, etc.

It’s an tradeoffs and there’s no easy answers.

−

fancyfredbot

I'm so glad someone else gets it. We don't want an auto-vectorizer, it doesn't work, just give us a trivial way to vectorise the easy parts and leave the difficult parts to be difficult. We're better at the difficult stuff than your compiler.

−

janwas

If SIMT is so obviously the right path, why have just about all GPU vendors and standards reinvented SIMD, calling it subgroups (Vulkan), __shfl_sync (CUDA), work group/sub-group (OpenCL), wave intrinsics (HLSL), I think also simdgroup (Metal)?

−

dwattttt

Function calls also have that negative property that the compiler doesn’t know what happens after calling them so it needs to assume the worst happens. And by the worst, it has to assume that the function can change any memory location and optimize for such a case. So it omits many useful compiler optimizations.

This is not the case in C. It might be technically possible for a function to modify any memory, but it wouldn't be legal, and compilers don't need to optimise for the illegal cases.

−

RossBencina

Sounds like the author hasn't heard of full program optimisation. EDIT: except they explicitly mention LTO near the end.

−

forrestthewoods

Not many compilers support vector functions

Hrm. Anyone who has spent time writing SIMD optimizations know not to trust the compiler.

If you want to write SIMD then… just use intrinsics? It’s a bit annoying to create a wrapper that supports x64 SSE/AVX and also Arm NEON. But that’s what we’ve been doing for decades.

−

janwas

Mature such wrappers exist, for example our Highway library :)

−

kragen

This is pretty good stuff. I've only done the barest minimum of SIMD programming, so things like the variable/uniform/linear trichotomy and the discussion of how OpenMP relates to GCC function attributes are very valuable to me.