rerdavies 4 days ago | next |

In my experience, based on profiling and optimizing ML-based guitar amp models in the PiPedal project (https://rerdavies.github.io/pipedal/), when using only NEON instructions, performance is almost completely constrained by L2 memory bandwidth. Compute costs almost completely disappear while waiting for memory loads and stores.

So, although these devices have ferociously impressive FLOP rates, I'm extremely curious as to how the cost of memory loads and stores is going to work.

I can very well imagine that having large local tile buffers is going to dramatically improve performance. But I'm curious how much. No matter how fast the compute is, it seems to me that performance of these sorts of devices in practice is going to be constrained by memory transfer rates, and perhaps by L1 caches in the tile compute unit that are better optimized for tile computation than the L1 cache on a general-purpose CPU.

My current expectation: performance of matrix multiplies increases linearly with respect to tile dimension. I.e., a tile size of 8x8 floats will perform twice as fast as a matrix multiplier with a tile size of 4x4, since doubling the tile dimension halves the required transfers to and from L2.

So, compared to basic A72 ARM NEON (effectively, a 4x8 tile size), I would expect about a 4x improvement by virtue of the fact that the tile size is larger on the Apple tile processor, with both otherwise limited entirely by the cost of L2 memory loads and stores. And maybe another 2x or 3x improvement because the tile processor's L1 caches (tile buffers) are tuned for tile multiply/accumulate operations.
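To spell out that arithmetic, here's a minimal sketch (made-up tile sizes, not measurements): for a TxT register tile, each step along the K dimension loads one column of A and one row of B (2T floats) but performs T^2 multiply-accumulates, so L2 traffic per FLOP falls linearly as T grows.

```c
/* Back-of-the-envelope check of the tile-size argument above.
 * Illustrative only: per k-step, a TxT register tile loads 2*T floats
 * and performs T*T multiply-accumulates (2 FLOPs each). */
#include <stdio.h>

int main(void)
{
    const int tiles[] = {4, 8, 16, 32};
    for (int i = 0; i < 4; ++i) {
        int T = tiles[i];
        double flops_per_step = 2.0 * T * T;               /* mul + add per MAC */
        double bytes_per_step = 2.0 * T * sizeof(float);   /* one A column, one B row */
        printf("T=%2d: %.3f bytes moved per FLOP\n", T, bytes_per_step / flops_per_step);
    }
    return 0;   /* prints 1.000, 0.500, 0.250, 0.125: halves as T doubles */
}
```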

Could somebody comment on how these devices actually perform on real matrix multiplies? It seems inconceivable to me that these devices will actually achieve peak FLOP rates in anything but meaningless test cases. And also somewhat of a meaningless exercise to measure peak performance using test cases that are designed to completely eliminate L2 memory transfers.

dividuum 5 days ago | prev | next |

> Although Apple has included a matrix accelerator in its devices since 2019, it used a proprietary instruction set inaccessible to developers, who officially could only use Apple-provided numerical libraries.

How does that work? Does the hardware throw some kind of fault when using those instructions? Or are they merely undocumented and you could use them if you figure out how they work? I guess the second, as hinted by the "officially"?

bee_rider 5 days ago | root | parent | prev | next |

As others have said, just undocumented.

IIRC there was a BLIS fork that used AMX instructions. I think it was unofficial though(?). It is hard to do science without properly documented tools.

freeqaz 7 days ago | prev | next |

Any comparison with how much faster this is compared with the previous way of doing things on the CPU?

svnt 7 days ago | root | parent |

Based on my understanding from the description, it is ~8x faster (250 GFLOPS) for vector ops (vs. SVE mode at 31 GFLOPS, which is CPU-ish) and 60-100 times faster (e.g. 2005 GFLOPS) for single-precision matrix multiplication.

jandrese 5 days ago | root | parent | next |

That's alright, but not mindblowing. How does it compare to doing the same work on a GPU? Is there a particular set of tasks that GPUs struggle with that would be well suited for this? Or is this more a fig leaf over lousy GPU compute support in Apple land?

huijzer 5 days ago | root | parent | next |

60 times faster could mean 2 minutes instead of 2 hours, or 2 seconds instead of 2 minutes. How is that not mind blowing, or at least very useful (for specific uses)?

bee_rider 5 days ago | root | parent | prev |

Apple should mostly care about power-efficient inference I think, right? Not training. Spinning up a GPU seems like something to avoid.

I mean, I wonder how this thing compares to a gemm using all the cores in a CPU cluster. They might be OK with not even meeting that performance, if the accelerator can avoid hogging all the cores and power.

At least that’s what my uninformed gut says. The workload for these things is like: little AI enhancements inside conventional apps, I think.

lxgr 4 days ago | root | parent |

> Spinning up a GPU seems like something to avoid.

You can do inference on GPUs as well, and for anything other than very small/lightweight models, such as noise cancellation or maybe speech recognition, it's probably worth the initial overhead.

I believe CoreML already splits workloads between CPU, NPU, and GPU as appropriate.

jhugo 4 days ago | root | parent |

It’s likely not worth the additional energy usage though, at least when running on battery.

bee_rider 4 days ago | root | parent |

Yeah, this is what I was getting at. In some sense, the list of “capabilities which don’t require spinning up the GPU” is expanded. Whether something could be done by spinning up the GPU is beside the point.

TinkersW 4 days ago | root | parent | prev |

What? The article says this thing does 2005 GFLOPS, aka 2 TFLOPS, which is decent, but we have had CPUs that could do more than this for a long time now. My Zen 2 12-core does about 3 TFLOPS, and a modern 16-core Zen 5 can do 8-10 TFLOPS (I'm unsure what clock speed it can maintain with all cores engaged). And that is general-purpose SIMD, not specialized matrix stuff (which is less generally useful).

Apple CPUs kinda suck at vector ops, but they aren't that bad; this thing is only mildly better. I would guess power savings are a big part of why they use this SVE streaming matrix mode.

mmoskal 4 days ago | root | parent |

IIUC this is the CPU in an iPad. The Pro/Max versions would be more appropriate to compare against the Zen parts when they are released.

nxobject 5 days ago | prev | next |

If Apple’s going for one SME accelerator per base M4 chiplet, it’ll be interesting to see how to program scalably for Pro/Max/Ultra variants.

wtallis 5 days ago | root | parent |

You should be thinking in terms of CPU clusters, not chiplets. The Ultra is the only one with multiple chiplets, but all of their processors have multiple CPU clusters, and so far it's one AMX/SME per cluster.

nxobject 4 days ago | root | parent | next |

Ah, thank you! That's the right word. They're on the same die, no? So "chiplet" isn't the appropriate word.

bee_rider 5 days ago | root | parent | prev |

I guess the CPU/cluster and cluster/chiplet ratios change from generation to generation?

wtallis 5 days ago | root | parent |

They're not constant even within a generation. The M3, M3 Pro, and M3 Max are each monolithic SoCs of different sizes (no chiplets) with different CPU cluster configurations, and the phone chip of the same generation is yet another configuration.

astrange 4 days ago | root | parent |

This isn't hard to deal with because it's just an evolution of having to check the # of CPU cores to know how many worker threads to start.

But there are a few more problems because of cache hierarchies; touching the same memory from different CPU clusters at once can be extra slow, possibly even slower than fetching it from DRAM.

This is called NUMA (which is ironic for a unified memory SoC.)
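As a sketch of that sizing step (assuming the hw.perflevel* sysctl keys Apple exposes on Apple Silicon; cluster topology itself isn't reported this directly):

```c
/* Sketch: query per-performance-level core counts on Apple Silicon to size
 * worker pools. hw.perflevel0 = performance cores, hw.perflevel1 = efficiency. */
#include <stdio.h>
#include <sys/sysctl.h>

static int query(const char *name)
{
    int value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0)
        return -1;                          /* key not available on this system */
    return value;
}

int main(void)
{
    printf("P cores: %d\n", query("hw.perflevel0.logicalcpu"));
    printf("E cores: %d\n", query("hw.perflevel1.logicalcpu"));
    return 0;
}
```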

kjkjadksj 5 days ago | prev | next |

I wish they made computers that ran software like games again. Seems like the last few iterations they’ve been working hard on making computers that are able to run ai models a little faster. Are people really asking for that? I would think far more people would like to play a video game over rolling their own matrix multiplication, but I guess that’s why they pay the people at apple the big bucks because they must know best.

jwells89 5 days ago | root | parent | next |

Overall, GPU strength is the best it's ever been in portable Apple devices by a significant margin. The problem isn't the hardware, it's that game developers are reticent to support anything that's not x86 Windows+DirectX or one of the consoles.

It's often said that macOS/iOS supporting Vulkan would help and while I think that's true to an extent, native Vulkan support is still rare enough that it's not going to change all that much in terms of ease of porting. It might improve things on the front of running games through WINE (DirectX → Vulkan translation), but unless developers produce ARM builds of their games there's always going to be the overhead of being run through an x86 translator, which varies depending on how CPU heavy the game is.

wkat4242 3 days ago | root | parent |

Anything other than Metal would be a big win. Metal is just a non-starter for developers of desktop-quality games. Mobile ports, sure.

But Apple has never really cared about gaming. During the PowerPC era they had a short phase of paying Aspyr to make some ports (most notably CoD 4: Modern Warfare and some Battlefield ports), but it was over within a year.

Then about a decade later they had a phase around the 320M chipset where they promoted game releases. And again, within a year they dropped the effort and also let their OpenGL support go totally stagnant. That caused Elite Dangerous, for example, to drop Mac support.

Now we're stuck with Metal. Apple is just too small in gaming for desktop game devs to bother with Metal. I'm not sure Vulkan would be the best choice, but Metal surely isn't. And the added complexity of building for ARM doesn't help either (ARM on Windows is non-existent on any hardware aimed at gaming).

I don't think Apple and Mac gaming will ever really become serious. I'm sure that if they do partner with studios like the last few times, they'll just abandon the effort like they always have.

I think the biggest problem is just Apple's total lack of interest (save for the few half-hearted efforts above) in making Mac gaming real.

jwells89 3 days ago | root | parent |

I think a bit in the first sentence in your first post is key.

They don't want devs to think of Macs and iDevices as separate targets, but rather as one big platform. They don't want Mac ports, they want Apple-platform ports.

It makes some amount of sense. App Store revenue split aside, iDevices massively outnumber Macs and the gap in graphics horsepower between Macs and iDevices shrinks every year.

Devs, and to a lesser extent users, don't really think that way, though.

wkat4242 3 days ago | root | parent |

It doesn't really make sense. iOS games are built to be played directly on the touchscreen. That rules out a lot of types of games (imagine playing WoW without a keyboard). It's not just about horsepower.

I do think Apple thinks that way but there's a good reason for Devs not doing so.

aseipp 5 days ago | root | parent | prev | next |

You can spend a small amount of die space on something that will yield 10x performance benefits for some things, and you can spend a lot of die space on something that will only yield a general 5% improvement. Which you choose depends on a lot of factors. In other words, the relationship between the "things on the chip" and general performance, or specific application performance, is not strictly linear.

The 20 series Nvidia GPUs with RTX were a good example. RT cores were added and took up significant die space, people said "why not more CUDA cores", but given the design of consumer GPUs it's extremely unlikely that just replacing those with more CUDA cores would have had a proportional uplift. In Nvidia's case, they realized RT cores were a better bet and served their customer bases (industrial graphics, gaming) better than just more raw numbers.

As it stands, specialization like this is a key element of new designs on leading edge processes. You're going to see more of it, not less.

> I guess that’s why they pay the people at apple the big bucks because they must know best.

Well, I don't know about "best", but they almost certainly know ~infinitely more about their customers and workloads than random people like us do; I can at least say that much.

wmf 5 days ago | root | parent |

> The 20 series Nvidia GPUs with RTX were a good example. RT cores were added and took up significant die space, people said "why not more CUDA cores"

Or they could have had the same number of CUDA cores without RT at a lower price (the fabled "1180")...

fragmede 5 days ago | root | parent | prev | next |

They are! The graphics for video games are just a series of matrix multiplications. Before anything gets shown on screen, the scene is a bunch of triangles whose vertices are represented by matrices, and in order to do anything in-game, those matrices need to be multiplied to move things around in 3D space before being rendered out to the screen. Making computers better at matrix math means better rendering for video games.
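As a toy illustration (hypothetical helper types, nothing SME-specific), this is the 4x4-transform-times-vertex operation a renderer applies to every vertex, every frame:

```c
/* Illustrative only: applying a 4x4 transform (rotation/translation/projection)
 * to one vertex. A game does this millions of times per frame. */
typedef struct { float m[4][4]; } mat4;          /* row-major, hypothetical type */
typedef struct { float x, y, z, w; } vec4;

static vec4 mat4_mul_vec4(const mat4 *M, vec4 v)
{
    vec4 r;
    r.x = M->m[0][0]*v.x + M->m[0][1]*v.y + M->m[0][2]*v.z + M->m[0][3]*v.w;
    r.y = M->m[1][0]*v.x + M->m[1][1]*v.y + M->m[1][2]*v.z + M->m[1][3]*v.w;
    r.z = M->m[2][0]*v.x + M->m[2][1]*v.y + M->m[2][2]*v.z + M->m[2][3]*v.w;
    r.w = M->m[3][0]*v.x + M->m[3][1]*v.y + M->m[3][2]*v.z + M->m[3][3]*v.w;
    return r;
}
```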

kjkjadksj 3 days ago | root | parent |

Apple's issue is not whether the hardware is powerful enough to run games, but software compatibility with modern games.

lxgr 4 days ago | root | parent | prev | next |

Are you implying that recent Apple SoCs can't run games?

While there's the ML-centric "Neural Engine", the GPU really isn't stagnating by any means: Just in the iPhone 16 presentation this week, ray tracing and a 20% faster GPU were among the headline features. Gaming got its own section in the video presentation!

The fastest GPU I own is in my Mac; the second fastest is in my iPhone. My dedicated (last-gen) game consoles are a distant third and fourth, respectively.

naming_the_user 5 days ago | root | parent | prev | next |

Modern Apple Silicon based laptops have fantastic graphics performance, manufacturers just aren't that interested in supporting them.

It's probably a bit of a chicken-and-egg thing at this point, plus the fact that most "serious" gamers are going to have desktop PCs anyway.

lxgr 4 days ago | root | parent | next |

> manufacturers just aren't that interested in supporting them

AAA games are starting to show up on Steam for macOS these days. Baldur's Gate 3 runs pretty well, for example!

The real shame is that some older indie games are disappearing just as easily, given Apple's deprecation strategy – while Microsoft basically never breaks backwards compatibility, Apple recently cut off 32 bit games (killing about half my Steam library), and presumably Intel-only binaries are next.

wtallis 4 days ago | root | parent |

Apple dropped support for 32-bit Mac applications five years ago; recent only by comparison to Microsoft's theoretical backwards compatibility. Apple dropped support for 32-bit Mac hardware, firmware, and drivers in 2012, so there was a period of seven years where game developers had every reason to make their Mac releases 64-bit, but to a disappointingly large degree they didn't.

This was probably due in large part to a lack of pressure on the Windows side. It was absolutely absurd that even a big budget (and memory-hungry) game like Skyrim was released in 2011 as a 32-bit only game, and didn't get a 64-bit release until 2016.

I didn't enjoy macOS killing compatibility with so much of my Steam library either, but I do at least respect that Apple had some solid reasons, and save some of my ire for the game devs that shipped outdated binaries.

diebeforei485 4 days ago | root | parent |

Dropping 32bit support was the right decision. Most of those games work fine in emulation on Apple Silicon if they are single player, or alternatively in a cloud gaming service like Nvidia's that you can use to access your Steam library directly.

lxgr 4 days ago | root | parent |

Well, as I said, about half my library is gone due to the lack of 32 bit support. The entire Orange Box by Valve, a few indie games...

Not sure if many of them are even available in emulators, and a cloud gaming service for a 2D indie game seems like overkill.

And yes, I generally agree with Apple deprecating technologies after a while (sometimes it's better to make a clear cut by forcing a minimum API version, CPU architecture etc. rather than to have compatibility be hit and miss for really old things), but in the case of gaming specifically, I sometimes prefer Microsoft's approach.

diebeforei485 15 hours ago | root | parent |

There are also tools like CrossOver[1] or even free tools like Wine that work reasonably well; 2D games should not have an issue. People have played TF2 in Wine on Apple Silicon. So while half your Steam library won't run directly, it's not like it's gone forever. Parallels is also an option.

https://www.codeweavers.com/crossover

Detrytus 5 days ago | root | parent | prev |

I thought one of the reasons to bring Apple Silicon to Mac was that all the iPhone games can now be easily ported?

astrange 4 days ago | root | parent | prev | next |

Higher AI performance actually means higher game performance, because you can render the game at lower resolution and use ML upscaling. Very popular technique now, especially because people prefer higher frame rates over higher resolutions.

samatman 5 days ago | root | parent | prev |

Are you under the impression that fast matrix operations in the CPU are useless for... games?

Where did you get that idea?

kjkjadksj 3 days ago | root | parent |

What matters for games is Apple rebuilding bridges with game devs and creating a developer environment that supports game dev on the platform once again.

ein0p 7 days ago | prev | next |

I’m not sure why they added this feature. All Apple SoCs have far more energy efficient compute than the CPU. This would only make sense for really tiny models which need extremely quick forward pass. For such models the overhead of a GPU or Neural Engine kernel launch would be quite noticeable. But for those the old NEON was already OK, and if not, there also is a dedicated matrix unit there called AMX. Seems kinda random to me.

adrian_b 7 days ago | root | parent | next |

This replaces AMX; it is its successor.

The older Apple CPUs implemented a custom form of AMX that was not standardized by Arm.

Presumably as a result of cooperation with Apple, the Arm ISA now includes a set of instructions with the same purpose as the original Apple AMX.

The newer Apple CPUs have been updated to use the standard Arm ISA, instead of their older proprietary ISA.

In the Apple CPUs, the former AMX and the current SME provide a much higher throughput than the CPU cores, even if lower than the GPU, and a much lower latency than the GPU, even if higher than the CPU cores.

AMX/SME is implemented as a separate accelerator, distinct from the CPU cores, because this saves power and area in comparison with implementing such instructions in each CPU core. The Apple CPUs do not attempt to compete in high-performance computing applications, so the extra throughput provided by a separate shared matrix operation accelerator is good enough for them.
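To make that concrete, the core SME operation (FMOPA) is an outer-product accumulate into a tile register (ZA), so one instruction updates SVL x SVL accumulators from two vectors. A plain-C model of the semantics (not the real intrinsics; SVL and the names below are illustrative):

```c
/* Conceptual model of SME's fp32 FMOPA: a rank-1 update of an entire ZA tile.
 * SVL here assumes a 512-bit streaming vector length (16 fp32 lanes); the
 * real value is implementation-defined. */
#define SVL 16

static void fmopa_model(float za[SVL][SVL],
                        const float zn[SVL], const float zm[SVL])
{
    for (int i = 0; i < SVL; ++i)
        for (int j = 0; j < SVL; ++j)
            za[i][j] += zn[i] * zm[j];   /* SVL*SVL MACs per "instruction" */
}
```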

saagarjha 5 days ago | root | parent |

This has both actually.

adrian_b 4 days ago | root | parent |

That must be for preserving compatibility with older software versions.

The same happens in x86, where there are hundreds of obsolete instructions, which have been replaced by better instructions, but which are still supported to allow the execution of old programs.

Both the new Arm SME instructions and the old Apple AMX instructions are executed by the same hardware matrix operation accelerator.

Previously, Arm extended the AArch64 ISA with the SVE instructions in order to support the Fujitsu supercomputer.

Then, not satisfied with the original SVE, they extended it into SVE2.

I suppose that something similar happened with SME. Apple must have negotiated with Arm the inclusion of matrix and vector operations implemented by a separate shared accelerator. The result was SME, which differs from the original AMX either because Apple thought of improvements based on experience with the first instruction set, or because Arm wanted some changes to the Apple proposal.

GeekyBear 5 days ago | root | parent | prev | next |

Matrix multiplication is very commonly used in science and engineering, not just machine learning.

The neural engine is optimized for machine learning use cases.

This standardized successor to AMX is more general purpose than the neural engine and has much improved matrix multiplication performance vs NEON.

As a bonus, since this is no longer just an experimental implementation of a matrix unit, you get documented access to the new ARM standardized low level instruction set.

brigade 5 days ago | root | parent | prev | next |

The neural engine by design cannot handle all possible kernels, and the GPU is significantly slower for integer math, and cannot do fp64. Then for the iPhone SoCs with 4 or 5 core GPUs, the GPU is a bit slower for fp16 and fp32 too.

Archit3ch 5 days ago | root | parent | prev | next |

Dedicated FP64 is great for real-time audio processing. Like an included DSP chip.

phkahler 5 days ago | root | parent |

Isn't FP32 sufficient for audio processing? Even though we have 24-bit DACs and ADCs these days, I feel like 16-bit was really good enough. FP32, with its 24-bit mantissa, should avoid rounding errors at the 16-bit level, right?

Archit3ch 5 days ago | root | parent | next |

It depends on the application.

16bit is enough for representation.

24bit is enough for recording (some leeway because recording levels won't be ideal).

FP32 for processing with simple effects (e.g. mixer, some EQs). If that's enough for your needs, you can SIMD/GPU to your heart's content.

FP64 for high Q filters, phasors, LU decompositions.
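To illustrate the high-Q case with a small hypothetical demo: the coefficients of a low-frequency, high-Q biquad (RBJ cookbook style, 40 Hz at Q=30 here) involve differences of numbers very close to 1, and float32 loses most of them to cancellation.

```c
/* Sketch of why high-Q, low-frequency IIR filters are the classic FP64 case.
 * Illustrative numbers only. Compile with -lm. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void)
{
    const double fs = 48000.0, f0 = 40.0, Q = 30.0;   /* low frequency, high Q */
    const double w0 = 2.0 * M_PI * f0 / fs;
    const double alpha = sin(w0) / (2.0 * Q);

    /* RBJ-style denominator coefficients, normalized by a0 = 1 + alpha */
    const double a1 = -2.0 * cos(w0) / (1.0 + alpha);
    const double a2 = (1.0 - alpha) / (1.0 + alpha);

    /* The filter's low-frequency behavior hinges on tiny quantities like
     * 1 + a1 + a2; in float32, catastrophic cancellation leaves only a
     * couple of significant digits of it. */
    const double exact = 1.0 + a1 + a2;
    const float  f32   = 1.0f + (float)a1 + (float)a2;

    printf("1 + a1 + a2: double = %.12e  float = %.12e\n", exact, (double)f32);
    return 0;
}
```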

rerdavies 4 days ago | root | parent | prev |

##### Is 16 bits good enough?

Amp models need all the precision they can get. The effective range of output signals is greatly compressed because the signal is soft-clipped by the amplifier's non-linear response. Real guitar amplifiers have significant levels of noise in their output signals; digital simulations of guitar amplifiers are typically even more sensitive to noise in their input signals. Currently, probably the easiest way to tell the difference between recordings of real guitar amps and neural-model simulations of them is how they respond to noise in their input signals.

A really good ADC may have a 24-bit representation, but it will only deliver an 18- to 20-bit signal-to-noise ratio. Cheap audio adapters (pretty much all the audio adapters costing less than $100) will happily deliver you an input signal in 24-bit (or even 32-bit) representation, but will have less than 16 bits of signal above the noise floor.

For example, I have an M-AUDIO Fast Track USB audio adapter ($50) that provides 24-bit input but only has 12 bits of signal above the noise floor, even with levels meticulously set. Guitar amp models sound horrible when using this device. But when I use my MOTU M2 (~$200), which probably provides a full 20 bits of signal above the noise floor, the same models sound faaabulous!

Those extra bits of S/N are precious. An amp simulation of an input signal from a cheap ADC sounds noticeably "fizzier" than an amp simulation of the same input signal from an ADC with 19 significant bits of actual signal above the noise floor.

So 16 bits is not good enough. And 24 bits does make a difference (even if it's only 19 bits of actual difference).

##### Would FP64 be better?

Currently, machine-learning models of guitar amps use FP32, because they are extremely compute-intensive when running in realtime (and extremely compute-intensive when training the models).

Would FP64 calculations improve the quality of amp simulations? That would depend on how much precision gets lost while performing the ML simulation. Probably a fair bit of precision does get lost, between the massive matrix multiplies that are involved and the calculation of non-linear activation functions (typically atan functions in current ML guitar models).

Roughly, I think the answer goes like this. We have an input signal with 19 bits of precision, and the 19th bit seems to make a difference. FP32 provides 24 bits of precision -- 5 extra bits -- to absorb rounding errors while calculating massive matrix multiplies and at least two rounds of atan activation functions (some of which are in a feedback loop). Are those five extra bits of guard precision being consumed during processing? Heck yes!

I'm almost certain that the quality of amp models would improve if the models were trained in FP64, and am reasonably certain that quality would improve if realtime calculations were performed in FP64 as well.

But on a Raspberry Pi (and probably on an x64 device as well), neural models cannot be run with FP64 precision in realtime. An ML-based amp model consumes about 45% of available CPU bandwidth running with FP32 precision. Running with FP64 precision would at least quadruple that.

As a point of interest, matrix multiplies running on a Raspberry Pi 4's Arm Cortex-A72 are almost completely limited by memory bandwidth to L2 cache and main memory. And that performance is (mostly) constrained by the tile size used in the matrix multiplies, which (when using A72 NEON registers) is constrained by the number of NEON registers available. I believe that performance would increase roughly linearly as a function of available tile size. Whether it's actually linear depends a bit on how well matrix units deal with Nx1 matrices (vectors): although the time spent on NxM matrix multiplies dominates, a significant amount of execution time also goes to Nx1 and/or vector processing. Whether the corresponding performance boost is good enough to allow realtime audio processing at FP64... the only way to find out would be to do it.
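For reference, this is roughly the shape of register-tiled kernel being described (a from-scratch sketch, not PiPedal's actual code): a 4x8 fp32 tile held in eight NEON accumulators, assuming pre-packed A (columns of 4 contiguous) and B (rows of 8 contiguous).

```c
/* Sketch of a 4x8 fp32 GEMM micro-kernel with NEON intrinsics. The output
 * tile lives entirely in registers; all L2 traffic is streaming A and B. */
#include <arm_neon.h>
#include <stddef.h>

void kernel_4x8(const float *A, const float *B, float *C, size_t K, size_t ldc)
{
    /* Eight accumulator registers hold the whole 4x8 output tile. */
    float32x4_t c0a = vdupq_n_f32(0), c0b = vdupq_n_f32(0);
    float32x4_t c1a = vdupq_n_f32(0), c1b = vdupq_n_f32(0);
    float32x4_t c2a = vdupq_n_f32(0), c2b = vdupq_n_f32(0);
    float32x4_t c3a = vdupq_n_f32(0), c3b = vdupq_n_f32(0);

    for (size_t k = 0; k < K; ++k) {
        float32x4_t a  = vld1q_f32(A + 4 * k);       /* 4 elements of an A column */
        float32x4_t b0 = vld1q_f32(B + 8 * k);       /* 8 elements of a B row     */
        float32x4_t b1 = vld1q_f32(B + 8 * k + 4);

        c0a = vfmaq_laneq_f32(c0a, b0, a, 0);  c0b = vfmaq_laneq_f32(c0b, b1, a, 0);
        c1a = vfmaq_laneq_f32(c1a, b0, a, 1);  c1b = vfmaq_laneq_f32(c1b, b1, a, 1);
        c2a = vfmaq_laneq_f32(c2a, b0, a, 2);  c2b = vfmaq_laneq_f32(c2b, b1, a, 2);
        c3a = vfmaq_laneq_f32(c3a, b0, a, 3);  c3b = vfmaq_laneq_f32(c3b, b1, a, 3);
    }

    vst1q_f32(C + 0 * ldc, c0a);  vst1q_f32(C + 0 * ldc + 4, c0b);
    vst1q_f32(C + 1 * ldc, c1a);  vst1q_f32(C + 1 * ldc + 4, c1b);
    vst1q_f32(C + 2 * ldc, c2a);  vst1q_f32(C + 2 * ldc + 4, c2b);
    vst1q_f32(C + 3 * ldc, c3a);  vst1q_f32(C + 3 * ldc + 4, c3b);
}
```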

* Results based on extensive optimization and profiling of Toob ML and TooB Neural Amp Modeler guitar effects hosted by [PiPedal](https://rerdavies.github.io/pipedal/)

lxgr 4 days ago | root | parent | prev |

> if not, there also is a dedicated matrix unit there called AMX

This seems to be the successor to AMX.

DanielLee5 4 days ago | prev | next |

Great review.
