FMA4 Instruction Set Is Hidden but Still Working on AMD Zen Processors
In an interesting find, it has been discovered that AMD processors based on the Zen architecture still execute instructions from the FMA4 instruction set, even though it is no longer officially supported. The theory was that FMA4 had been disabled for unknown reasons; as it turns out, it is at the very least partially working and active. With its “Zen” CPU microarchitecture, AMD removed support for the FMA4 instruction set on paper while retaining FMA3. Level1Techs discovered that “Zen” CPUs do support FMA4 instructions, even though the instruction set is not exposed to the operating system. FMA, or fused multiply-add, computes a × b + c in one instruction with a single rounding step, which makes it an efficient building block for linear algebra. FMA3 and FMA4 are not generations of the instruction set (unlike SSE3 and SSE4); rather, the digit denotes the number of operands per instruction. FMA3 overwrites one of its three source operands with the result, whereas FMA4 writes the result to a separate fourth operand. AMD introduced FMA4 with its “Bulldozer” FX-series processors in 2011 and added FMA3 with “Piledriver” in 2012, while Intel added FMA3 support in 2013 with “Haswell.”
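To make the "fused" part and the operand counts concrete, here is a minimal Python sketch. It does not execute real FMA instructions: the `fma` helper emulates single rounding with exact rational arithmetic, and the register dictionary plus the `fma3_213` and `fma4` functions are hypothetical toy models of the three- and four-operand forms, not anything taken from the Level1Techs test.

```python
from fractions import Fraction

def fma(a, b, c):
    """Emulate a fused multiply-add: compute a*b + c exactly, round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def unfused(a, b, c):
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

# Toy models of the two encodings (hypothetical, for illustration only).
def fma3_213(regs, op1, op2, op3):
    # Three operands: op1 is both a source and the destination.
    regs[op1] = fma(regs[op2], regs[op1], regs[op3])

def fma4(regs, dst, src1, src2, src3):
    # Four operands: the destination is separate, so no source is overwritten.
    regs[dst] = fma(regs[src1], regs[src2], regs[src3])

# Inputs chosen so that the exact product (1 - 2**-54) is not representable
# as a double and rounds to 1.0 when the multiply is done on its own.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print(unfused(a, b, c))  # 0.0: the small term is lost in the first rounding
print(fma(a, b, c))      # -2**-54: kept, because rounding happens only once

regs = {"xmm0": a, "xmm1": b, "xmm2": c, "xmm3": 0.0}
fma4(regs, "xmm3", "xmm0", "xmm1", "xmm2")   # xmm0 keeps its value
fma3_213(regs, "xmm0", "xmm1", "xmm2")       # xmm0 is overwritten
```

The four-operand form avoids an extra register copy when all three inputs are needed afterwards, which is one plausible reason it can be more convenient for hand-tuned kernels.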
Level1Techs tested this on Zen processors by running an adapted script that sends FMA4 instructions to the processor. Surprisingly, the FMA4 workload was not rejected and executed successfully. Meanwhile, CPUID still reports FMA4 as not supported.
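For context on what "CPUID reports" means here: software normally checks feature bits before using an instruction set. FMA3 is advertised in CPUID leaf 0x00000001, ECX bit 12, and FMA4 in AMD's extended leaf 0x80000001, ECX bit 16. The sketch below only decodes register values passed in by the caller, since Python cannot issue CPUID itself. The helper names and the sample values are assumptions chosen to illustrate a Zen-like result, not measured data.

```python
FMA3_BIT = 12  # CPUID leaf 0x00000001, ECX bit 12
FMA4_BIT = 16  # CPUID leaf 0x80000001, ECX bit 16 (AMD extended leaf)

def bit_set(reg, bit):
    """Return True if the given bit is set in a 32-bit register value."""
    return bool((reg >> bit) & 1)

def advertised_fma(leaf1_ecx, ext_leaf1_ecx):
    """Decode which FMA variants the CPUID feature bits advertise."""
    return {
        "fma3": bit_set(leaf1_ecx, FMA3_BIT),
        "fma4": bit_set(ext_leaf1_ecx, FMA4_BIT),
    }

def flags_from_cpuinfo(text):
    """Extract the feature flags from /proc/cpuinfo-style text (Linux)."""
    for line in text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

# Hypothetical register values: FMA3 bit set, FMA4 bit clear.
zen_like = advertised_fma(leaf1_ecx=1 << FMA3_BIT, ext_leaf1_ecx=0)
print(zen_like)  # {'fma3': True, 'fma4': False}

# Hypothetical, shortened flags line in the format Linux exposes.
sample = "flags\t\t: fpu sse2 avx avx2 fma\n"
print("fma4" in flags_from_cpuinfo(sample))  # False
```

A program that relies on these bits would never take its FMA4 code path on such a CPU, which is why the instructions only surfaced when a test issued them directly.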
The exact reasons why AMD deprecated FMA4 with “Zen” are unknown, but some developers speculate it is because AMD’s implementation of FMA4 is buggy, even though it is said to be more efficient (33% more throughput). Intel’s adoption of FMA3 made it more popular, and hence more stable over the years. Level1Techs used an OpenBLAS FMA4 test program to confirm that feeding “Zen” processors FMA4 instructions does not return an “illegal instruction” error; instead, the processor goes ahead and completes the operation. This is interesting because FMA4 isn’t exposed as a CPUID bit, so the operating system has no idea the processor even supports the instruction. For linear algebra, FMA4 has proven more efficient than AVX in both single- and double-precision.