TL;DR: use `navigator.hardwareConcurrency / 2` in `main-worker.js`.
I maintain two open-source Flutter libraries for cross-platform ML (all platforms: macOS, iOS, Android, Windows, Linux).
FONNX wraps the ONNX runtime.
FLLAMA wraps llama.cpp, except on web.
I saw your post in /r/localllama a few days ago (I'm refulgentis).
Today I looked at the code: it's the first project to run llama.cpp on WASM in many months. Excellent work.
Also this week, I updated FLLAMA's llama.cpp version and hit a really interesting issue on Android: loading a 3B model took 3 minutes, where it used to take 15 seconds. The cause turned out to be setting the number of threads equal to the number of CPU cores. Simply changing it from 4 to 2 fixed the load time and made inference much faster too.
After playing around with this project for an hour trying to speed it up, I realized the same trick worked.
It may seem hacky, but I recommend changing the use of `navigator.hardwareConcurrency` to `navigator.hardwareConcurrency / 2`:
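In `main-worker.js`, the change would look roughly like this (a minimal sketch; the variable name is mine, not necessarily what the project uses):

```js
// Before (assumed): one worker thread per logical core, which tends to
// oversubscribe the CPU under WASM and slow both model load and inference.
// const numThreads = navigator.hardwareConcurrency;

// After: half the logical cores, floored, never below 1.
const numThreads = Math.max(1, Math.floor(navigator.hardwareConcurrency / 2));
```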
I don't 100% understand why it helps so much, other than the general reasons (threads can get starved for data, etc.).
In my experience, it's also best practice for ML on the web generally; approximately every ONNX web implementation I've seen does the same thing.
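For example, onnxruntime-web exposes a thread-count knob, and capping it the same way is common. A sketch:

```js
import * as ort from 'onnxruntime-web';

// Cap the WASM backend's thread pool at half the logical cores.
// This must be set before creating any InferenceSession.
ort.env.wasm.numThreads = Math.max(1, Math.floor(navigator.hardwareConcurrency / 2));

// Sessions created afterwards use the capped thread pool, e.g.:
// const session = await ort.InferenceSession.create('model.onnx');
```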
Part of me thinks it has something to do with a change in llama.cpp, because my slow Android load appeared sometime between llama.cpp commits ceebbb5b21b971941b2533210b74bf359981006c and 7930a8a6e89a04c77c51e3ae5dc1cd8e845b6b8f. But that seems unlikely: the Android problem was an extremely slow model load, while inference speed stayed the same.
Benchmarks (M2 Max/Ultra/whatever MacBook Pro), number of threads vs. tokens per second:

Phi-2:

| Threads | Tokens/s | Note |
| --- | --- | --- |
| 1 | 1.2 | |
| 6 | 6.5 | 1/2 hardware concurrency |
| 8 | 7.4 | |
| 12 | 3.6 | hardware concurrency |
Mistral:

| Threads | Tokens/s | Note |
| --- | --- | --- |
| 1 | 13 | |
| 6 | 62 | 1/2 hardware concurrency |
| 8 | 65 | |
| 12 | 26 | hardware concurrency |