2x speed-up via 1/2 hardware concurrency #1

Open
jpohhhh opened this issue Feb 17, 2024 · 1 comment

jpohhhh commented Feb 17, 2024

TL;DR: use navigator.hardwareConcurrency / 2 in main-worker.js

I maintain two open source Flutter libraries for cross-platform ML, covering all platforms (macOS, iOS, Android, Windows, Linux):
FONNX wraps the ONNX runtime.
FLLAMA wraps llama.cpp - except on web.

I saw your post in /r/localllama a few days ago (I'm refulgentis).
Today, I looked at the code: it is the first project to run llama.cpp on WASM in many months. Excellent work.

Also this week, I updated FLLAMA's llama.cpp version and hit a really interesting issue on Android: loading a 3B model took 3 minutes where it used to take 15 seconds. It turned out the problem was setting the number of threads equal to the number of CPU cores. Simply changing it from 4 to 2 fixed the load time and made inference much faster too.

After playing around with this project for an hour trying to speed it up, I realized the same trick worked.

It may seem hacky, but I recommend changing the use of navigator.hardwareConcurrency to navigator.hardwareConcurrency / 2 (see the sketch after this list):

  • I don't 100% understand why it helps so much, other than the general reasons (threads can get starved for data, etc.)
  • In my experience, it is also best practice for ML on the web generally. Nearly all the ONNX web implementations I've seen do the same thing.
  • Part of me thinks it has something to do with a change in llama.cpp, because my slow Android load appeared sometime between llama.cpp commits ceebbb5b21b971941b2533210b74bf359981006c and 7930a8a6e89a04c77c51e3ae5dc1cd8e845b6b8f. But that is unlikely: the Android problem was an extremely slow model load, while inference speed stayed the same.

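For concreteness, here is a minimal sketch of the change in main-worker.js. The initialization call and its parameter name (`loadLlamaModule`, `n_threads`) are placeholders for however this project actually hands the thread count to the llama.cpp WASM module; the only point is the halved, clamped value.

```js
// Sketch: use half the reported logical cores, clamped to at least 1.
// hardwareConcurrency counts logical cores (hyperthreads / efficiency cores
// included), so saturating it can leave compute threads starved for data.
const nThreads = Math.max(1, Math.floor((navigator.hardwareConcurrency || 4) / 2));

// Placeholder call site: pass nThreads wherever the thread count is currently
// taken straight from navigator.hardwareConcurrency.
const llama = await loadLlamaModule({ n_threads: nThreads });
```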
Benchmarks (M2 Max/Ultra MacBook Pro), threads vs. tokens per second:

| Threads | Phi-2 (tok/s) | Mistral (tok/s) | Note |
| --- | --- | --- | --- |
| 1 | 1.2 | 13 | |
| 6 | 6.5 | 62 | 1/2 hardware concurrency |
| 8 | 7.4 | 65 | |
| 12 | 3.6 | 26 | hardware concurrency |
