TL;DR: use `navigator.hardwareConcurrency / 2` in `main-worker.js`.
I maintain two open-source Flutter libraries for cross-platform ML (all platforms: macOS, iOS, Android, Windows, Linux).
FONNX wraps the ONNX runtime.
FLLAMA wraps llama.cpp, except on web.
I saw your post in /r/localllama a few days ago (I'm refulgentis).
Today I looked at the code: it's the first project to run llama.cpp on WASM in many months. Excellent work.
Also this week, I updated FLLAMA's llama.cpp version and hit a really interesting issue on Android: loading a 3B model took 3 minutes, where it used to take 15 seconds. The cause turned out to be setting the number of threads equal to the number of CPU cores. Simply changing it from 4 to 2 fixed the load time and made inference much faster too.
After playing around with this project for an hour trying to speed it up, I realized the same trick worked.
It may seem hacky, but I recommend changing the use of `navigator.hardwareConcurrency` to `navigator.hardwareConcurrency / 2`:
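In `main-worker.js`, the change would look roughly like this (a minimal sketch; the variable name is mine, not necessarily what the project uses):

```js
// Before (assumed): one worker thread per logical core, which tends to
// oversubscribe the CPU under WASM and slow both model load and inference.
// const numThreads = navigator.hardwareConcurrency;

// After: half the logical cores, floored, never below 1.
const numThreads = Math.max(1, Math.floor(navigator.hardwareConcurrency / 2));
```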
I don't 100% understand why it helps so much, other than the general reasons (threads can get starved for data, etc.).
In my experience, it's also best practice for ML on the web generally; approximately every ONNX web implementation I've seen does the same thing.
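For example, onnxruntime-web exposes a thread-count knob, and capping it the same way is common. A sketch:

```js
import * as ort from 'onnxruntime-web';

// Cap the WASM backend's thread pool at half the logical cores.
// This must be set before creating any InferenceSession.
ort.env.wasm.numThreads = Math.max(1, Math.floor(navigator.hardwareConcurrency / 2));

// Sessions created afterwards use the capped thread pool, e.g.:
// const session = await ort.InferenceSession.create('model.onnx');
```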
Part of me thinks it has something to do with a change in llama.cpp, because my slow Android load appeared sometime between llama.cpp commits ceebbb5b21b971941b2533210b74bf359981006c and 7930a8a6e89a04c77c51e3ae5dc1cd8e845b6b8f. But that seems unlikely: the Android problem was an extremely slow model load, while inference speed stayed the same.
Benchmarks (M2 Max/Ultra/whatever MacBook Pro), number of threads vs. tokens per second:

Phi-2:

| Threads | Tokens/s | Note |
| --- | --- | --- |
| 1 | 1.2 | |
| 6 | 6.5 | 1/2 hardware concurrency |
| 8 | 7.4 | |
| 12 | 3.6 | hardware concurrency |
Mistral:

| Threads | Tokens/s | Note |
| --- | --- | --- |
| 1 | 13 | |
| 6 | 62 | 1/2 hardware concurrency |
| 8 | 65 | |
| 12 | 26 | hardware concurrency |