
Asahi Linux Apple Neural Engine (ANE) support #1021


Open

eiln wants to merge 2 commits into master

Conversation


@eiln eiln commented Jun 16, 2023

Hi. I'm working on Apple Neural Engine (ANE) support for Asahi Linux.
See my driver/userspace repo: https://github.com/eiln/ane.

Attached is a POC of running the whisper encoder on the ANE (on Asahi).

Total time roughly halves. The encoder speeds up from 630.75 ms/run on the CPU to 117.54 ms/run on the ANE.

Load time is minimal because my framework pre-compiles the model; in fact, it's about 20 ms less than the CPU path here. Again, the middleman is excised, so I guarantee load times are shorter than with the CoreML/Xcode behemoth, though I really can't be bothered to testify with exact figures.

Both runs are on Asahi, kernel 6.3.0-asahi-8-1-ARCH, J293AP (2020 M1 MacBook Pro).

CPU:

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:10.500]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.

whisper_print_timings:     load time =    73.23 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    96.77 ms
whisper_print_timings:   sample time =     8.17 ms /    25 runs (    0.33 ms per run)
whisper_print_timings:   encode time =   630.75 ms /     1 runs (  630.75 ms per run)
whisper_print_timings:   decode time =   138.37 ms /    25 runs (    5.53 ms per run)
whisper_print_timings:    total time =   955.54 ms

vs. ANE:

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:10.500]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.

whisper_print_timings:     load time =    55.54 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   112.75 ms
whisper_print_timings:   sample time =     9.22 ms /    25 runs (    0.37 ms per run)
whisper_print_timings:   encode time =   117.54 ms /     1 runs (  117.54 ms per run)
whisper_print_timings:   decode time =   131.69 ms /    25 runs (    5.27 ms per run)
whisper_print_timings:    total time =   446.05 ms

I need more test cases before upstreaming the driver, and the userspace side needs work too, as the hard-coded home directory in the Makefile should make evident.

If Asahi support interests you, I'd be happy to work things out.

@ggerganov
Member

Very interesting work. Yes, I'm interested in Asahi support, and I believe other people would also be interested.

Would this driver also work on macOS, or is it only possible on Asahi?

Reading https://github.com/eiln/ane/tree/main/docs, can you ELI5: if the ANE can only do batched multiply-adds, what are the limitations compared to the instruction sets we are used to? Is there something that you cannot compute on the ANE, or is it just a matter of finding a way to express it as FMAs?

Does it support integer operations or just F16 / F32?

Do you think the Decoder can also be implemented in a similar way?

@eiln
Author

eiln commented Jun 19, 2023

Response from 2 days ago:

The driver is Linux-specific and sits at the firmware/kernel level. It won't work on OSX unless someone makes an XNU port of those Linux syscalls (if kexts are still allowed). Also, the iBoot-loaded firmware is inactivated for various reasons, so I'd imagine the anti-firmware driver and the firmware clashing without explicit ownership of power domains and without shutting down the coprocessor (if an Apple guru is reading this...). OSX poses far too many low-level limitations for this to work, so my focus is Linux. I can assist on a port later.

Sorry, those docs are shitty and outdated. The ANE has no ISA, so it cannot execute instructions (e.g. shaders or OpenCL). All actions must be known and hard-coded at compile time as "configuration states", the poor man's instructions. The entire computation sequence must be encoded in the microcode (model.hwx), including DMA memory movement. Think of it as a stream of circuit actions triggered by one big red nuclear button (which actually exists). I guess the ELI5 is: with static comes speed. Notably, flexible/dynamic shapes are impossible. At best, shapes can be padded and computed in full; the Stable Diffusion repo does that. High-precision math may be limited (e.g. trig, discussed below), but I wouldn't worry about expressing operations as multiply-adds; the compiler handles approximately 97% of these.
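To make the "pad and compute in full" workaround concrete, here's a tiny PyTorch sketch of my own (not from eiln's repo); the fixed context length of 1500 and state size of 384 are illustrative values borrowed from the tiny-model encoder output discussed below.

import torch
import torch.nn.functional as F

FIXED_CTX = 1500  # compile-time sequence length baked into the ANE graph (assumed)

def pad_to_fixed(x: torch.Tensor) -> torch.Tensor:
    """Zero-pad a (batch, ctx, state) tensor along ctx up to the compiled shape."""
    pad = FIXED_CTX - x.shape[1]
    assert pad >= 0, "input longer than the compiled shape"
    # F.pad pads from the last dim backwards: (state_lo, state_hi, ctx_lo, ctx_hi)
    return F.pad(x, (0, 0, 0, pad))

x = torch.randn(1, 1200, 384)  # a shorter, "dynamic" input
y = pad_to_fixed(x)            # (1, 1500, 384): static shape, compute in full, ignore the tail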

The ANE is IEEE 754 F16. There's no F32 mode. I (hastily) convert F32 to F16 using ggml there. An INT8 mode exists, but I've never been able to trigger it. I haven't looked at INT8 further, but nothing changes at the driver level to support it (e.g. the ANE-specific NCHW tiling is precision-independent). Apple doesn't like admitting its low-precision hardware limitations, so they'll only accept F32 params and sneakily downcast underneath.
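To get a feel for what that downcast costs, here's a minimal numpy sketch of my own (the actual conversion in this PR goes through ggml's F32 -> F16 helpers):

import numpy as np

w32 = np.random.randn(384, 384).astype(np.float32)  # example F32 weights
w16 = w32.astype(np.float16)                        # what actually reaches the ANE

# round-trip error gives a rough sense of the F16 precision loss
err = np.abs(w32 - w16.astype(np.float32)).max()
print(f"max |f32 - f16 roundtrip| = {err:.2e}")     # roughly 1e-3 for unit-scale values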

It's a matter of passing the PyTorch model through the converter. This means doing surgery on any of the pitfalls described above. From what I can tell, the decoder's token_embedding shape branching might be the problem. I'll look into it.

Update on decoder:

See added asahi/decoder.py

I tried to convert the decoder directly from the openai-whisper module. It converts after changing the audio_data dimensions to (1, 1500, hparams.n_audio_state), the same as the encoder output. The issue is that the converted model doesn't fully run on the ANE due to an unresolved intermediate CPU layer: the input token_data must index into the token embeddings, but arbitrary memory movements won't work on the ANE; there simply isn't a way to guarantee the input lies within n_vocab bounds, or to set a fallback for it. CoreML (using convert_to="neural_network", the default option, not "mlprogram"; never use that) always prioritizes ANE time, so it will internally schedule unresolved layers on the CPU to still run whatever it can on the ANE.
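For anyone who wants to reproduce that conversion step, a rough sketch along these lines should do it. This is my own untested approximation of what asahi/decoder.py does, not the PR's actual script; the coremltools input dtypes in particular may need tweaking.

import numpy as np
import torch
import coremltools as ct
import whisper

model = whisper.load_model("tiny")
decoder = model.decoder.eval()
dims = model.dims

# static shapes only: one token step, audio_data fixed to the encoder
# output shape (1, 1500, n_audio_state) as described above
tokens = torch.zeros((1, 1), dtype=torch.long)
audio = torch.zeros((1, 1500, dims.n_audio_state), dtype=torch.float32)

traced = torch.jit.trace(decoder, (tokens, audio))

mlmodel = ct.convert(
    traced,
    convert_to="neuralnetwork",  # the classic format, not "mlprogram"
    inputs=[
        ct.TensorType(name="token_data", shape=tokens.shape, dtype=np.int32),
        ct.TensorType(name="audio_data", shape=audio.shape, dtype=np.float32),
    ],
)
# the embedding lookup on token_data is the layer CoreML schedules on the CPU
mlmodel.save("decoder.mlmodel")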

Since computation units are determined at runtime at the driver level, I had to fish the ANE portion of the model (model.hwx) out of its coprocessor virtual space (under the fantastic m1n1 hypervisor), calculate the microcode iova offset from logs enabled with special boot-args in a development kernel, and then reverse-engineer the exact pass in which the ANE portion begins/ends by matching DMA'd tiles against a shit ton of tensor print statements (it starts at y = xA^T in the midst of the first cross_attn_ln() layer norm, before the bias is added!). But yeah, I got the decoder to (at least) partially run on the ANE.

I'm already getting a 7x speedup compared to plain CPU PyTorch: 6x on startup, rising to 7-9x after some warmup. Precision loss compared to CPU F32 is in the 0.01-0.09 range, which isn't bad. I'm sure the multi-head attention and the logit matmul can be optimized to exploit the ANE further.

ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=384, n_audio_head=6, n_audio_layer=4, n_vocab=51865, n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4)
LIBANE: LOG: loaded model @ 0xaaabd8ceea50!
ANE: tensor([[[10.5773,  6.5641, 11.9055,  ..., 14.1438, 13.2985, 12.3015]]])
ANE: 0.00923562049865722656
CPU: tensor([[[10.5582,  6.5746, 11.8969,  ..., 14.1543, 13.3140, 12.3116]]])
CPU: 0.07123231887817382812
speedup: 7.71278106203371471850
diff: tensor([[[-0.0190,  0.0106, -0.0085,  ...,  0.0105,  0.0155,  0.0100]]])

I don't know of a way to fetch these CPU-intermediate models without the hypervisor. I could either figure out more Apple-isms or host the compiled/fetched models. Model binaries are shared across all M1 ANEs, and besides, dealing with Xcode invokes an irrational visceral rage in me (enough to have started all this), so I'm leaning towards the latter. Integrating into whisper.cpp isn't going to be the prettiest either, having to split graph passes on top of all that, but it seems promising.

I forgot to mention earlier, but these are all "tiny" models for the sake of quick testing. Gains grow exponentially with size.

@AIWintermuteAI
Contributor

Hi @eiln ! Thank you for your contribution.
Are you still working on this? This looks very promising for Dockerizing llama.cpp on a Mac M1 - currently only CPU inference is possible this way, and it is painfully slow.

@aep

aep commented Dec 28, 2023

Dockerizing llama.cpp on Mac M1

Exactly our use case (with Asahi as the host), and we've got funding to spend on this, but no relevant skills on the team.

@marcan

I don't expect any of this to be releasable on macOS. Replacing their kernel driver is a non-starter: it'll break the webcam on M2 machines, for one, and probably other stuff; it requires an installation process similar to the Asahi Linux install process itself, only worse; and it probably breaks macOS updates (if not immediately, it will definitely break with updates). And if you try to use Apple's underlying APIs with their driver, you run into the same issue as with the GPU: they are undocumented and probably unstable, and any such usage is likely to break with macOS upgrades, with nothing you can do about that.

Realistically, if you want to use @eiln's work, it's going to have to be on a native bare-metal install of Asahi Linux.

@AndreasKunar

Hi @eiln ! Thank you for your contribution. Are you still working on this? This looks very promising for Dockerizing llama.cpp on a Mac M1 - currently only CPU inference is possible this way, and it is painfully slow.

Container support with Apple silicon GPU acceleration via MoltenVK/Venus/Vulkan is available in podman via krunkit; see https://sinrega.org/2024-03-06-enabling-containers-gpu-macos/. With this I was able to run the llama.cpp/llama2 benchmarks partly at 2x the speed of plain (CPU-only) containers on an M2 Max (see my postings in the discussions). Sadly it's still awfully slow...
