Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage #1632
949dfc1 to 30abf88
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 32. Check the log or trigger a new build to see more.
d53f410 to 201d5ff
The CMake build does not work; perhaps mulmat-tune.[c,h] should be moved to the root dir.
I was thinking recently that better threading would be nice to have. Anyways, I didn't yet look at the PR in detail but I can already give you feedback regarding the way you represent your data to make it easier to understand:
Regarding the contents of the README: unless I'm misunderstanding something, at one point you are talking about doing dequantization on the CPU and then doing the actual matrix multiplication on the GPU. This is not a viable approach. The weights are very large and become even larger when dequantized. Transferring that much data between CPU and GPU is very slow, slower than just doing everything on the CPU. My implementation only works because the weights are stored in VRAM and thus don't need to be copied to the GPU.
We started out that way: at first cuBLAS was used without the custom kernels. It did work but was obviously much slower than it is now.
I think so. It seems to be another part of ggml, so I would rename them to
@JohannesGaessler feedback from you and others corrected my misunderstandings. I managed to improve the README file a bit for now: fixed wrong terms, no longer use an image, pasted some example results. I will keep updating it. As for the term: in this PR, the bench result is tightly bound to a specific implementation, so I named several backend vendors for validating the loaded bench file. Now I read the backend as "mixed implementation on top of hardware and software library spec", so I use it to control explicitly which part of the code to run. I'm aware that your PR Cuda refactor, multi GPU support #1670 is ready to merge, congratulations! Thanks!
I'll try to fix the CMake build. I'm not familiar with it, so I will use the ggml-opencl configuration as a reference.
Is it optional? Because ggml-opencl is optional. Otherwise you can just add the files to the ggml library target.
As far as I know, llama will pass I'm anticipating that in the future the choice of whether to use mulmat tune or not will be controlled by two command line options: I'm doubting the usefulness of Thanks for the tip!
1701eeb to 9306367
…che thread safe; fixed shape comparison
…sk runner and profile id, many changes, see the f codes
* removed ggml_task_backend, in favour of ggml_task_profile.runner and newly added id and name.
* extracted mul_mat blas codes into ggml_compute_forward_mul_mat_blas, thus align with CUDA/CL a bit more and make it easier to fix profile and run tune.
* rewrote task profile and update/add some cuda/cl codes, finally made CL GPU offloading work.
* misc minor fix/update to tune, the data format was changed.
5923518 to 0ec4dab
…pool at session level
…ock in windows AVX
Oh man, after all that work. Hopefully you at least learned some useful stuff that will help you in future projects. (Also, unfortunately I wasn't going to be able to provide any further CUDA testing help since my GPU got fried by lightning.)
What's the current state of overhauling threading in llama.cpp? If no one else is working on it I'll maybe take a crack at it once I'm done with my current objectives.
It's hard to say - it seems there could be improvements made to the threading, but it is not very clear what exactly.
Do you have something specific in mind?
Thread pools may become very important in the future for mixed GPU/CPU computations with graphs, to allow keeping the k/v cache on the CPU while still running the feed-forward parts of the layer on the GPU. Essentially, to support this we will need a
What I'm thinking about primarily is this: currently, when you offload all layers to the GPU using CUDA, you get better performance when you set the number of threads to 1. Presumably either the overhead from creating threads or the CPU load from the constantly spinning threads is the problem. To me this suggests that the performance of partial offloading could be improved if you had a thread pool and controlled the worker threads via wait/notify. It would also eliminate the need for users to manually set the optimal number of threads, since waiting threads that are created only once should not have a performance impact.
I think that's most definitely caused by the threads constantly spinning. It is also an issue when using BLAS, because it forces us to set the number of threads to 1 so as not to interfere with the BLAS library, but that also means that operations other than matrix multiplication run in only 1 thread. This will not be as much of an issue when offloading at the graph level, since only one compute backend will be running at a time, but it should be fixed nonetheless because it is very inefficient.
Coarse-grained wait/broadcast is not that difficult to implement. One thing to consider is the wait/broadcast time. I wrote a test, https://github.com/mqy/compute.cpp/blob/main/testing/test_wait.c, which only works on *nix. The actual response time may not be that small and wakeup may
Of course it's just a naive test and may not match the actual situation.
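To make the coarse-grained wait/broadcast idea from this thread concrete, here is a minimal sketch in Python. It is not the code from this PR (which is C inside ggml and mixes spinning with wait/notify); the class name `WorkerPool` and the helper `parallel_sum` are hypothetical, and the sketch only shows the wait/notify half: workers are created once and sleep on a condition variable until the main thread broadcasts a task.

```python
import threading


class WorkerPool:
    """Hypothetical pool: workers are created once and sleep between tasks."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.cond = threading.Condition()
        self.generation = 0   # bumped by the main thread for every new task
        self.pending = 0      # workers that have not finished the current task
        self.task = None
        self.stop = False
        self.threads = [
            threading.Thread(target=self._worker, args=(i,), daemon=True)
            for i in range(n_workers)
        ]
        for t in self.threads:
            t.start()

    def _worker(self, ith):
        seen = 0
        while True:
            with self.cond:
                # Idle wait: the thread sleeps instead of spinning.
                self.cond.wait_for(lambda: self.stop or self.generation != seen)
                if self.stop:
                    return
                seen = self.generation
                task = self.task
            task(ith, self.n_workers)
            with self.cond:
                self.pending -= 1
                if self.pending == 0:
                    self.cond.notify_all()  # wake the main thread

    def run(self, task):
        """Broadcast one task to all workers and block until they finish."""
        with self.cond:
            self.task = task
            self.pending = self.n_workers
            self.generation += 1
            self.cond.notify_all()
            self.cond.wait_for(lambda: self.pending == 0)

    def close(self):
        with self.cond:
            self.stop = True
            self.cond.notify_all()
        for t in self.threads:
            t.join()


def parallel_sum(data, n_workers=4):
    """Usage example: each worker sums its own strided slice of `data`."""
    partial = [0] * n_workers
    pool = WorkerPool(n_workers)

    def task(ith, nth):
        partial[ith] = sum(data[ith::nth])

    pool.run(task)
    pool.close()
    return sum(partial)
```

The trade-off discussed above still applies: a sleeping worker costs no CPU while idle, but waking it takes longer than releasing a spinning thread, which is why the wakeup latency is worth measuring before choosing between the two.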
Introduction
MUL_MAT takes most of the compute time (about 95%). So to speed up llama, we have to focus on MUL_MAT.
BLAS, as one of the fastest MUL_MAT solutions on CPU, is typically efficient at computing large matrix multiplications and tends to be very slow when run in parallel in multiple OS threads. Accelerate is the native BLAS implementation on macOS, and it has exactly these problems. OpenBLAS and BLIS are a bit slower than Accelerate; their authors claim that they support multi-threading, but I did not test that. So I assume that for the big matrix sizes in llama, multi-threaded BLAS does not run faster than single-threaded BLAS.
We have three kinds of MUL_MAT to compute:
For every kind of MUL_MAT, we have a pure CPU solution which has an optional INIT stage and a COMPUTE stage.
There are also optional solutions: CUDA/CL, which run on the GPU, and BLAS, which runs on the CPU.
As for BLAS, there are three known problems to solve:
The typical mul_mat time when N/K >= 4096 ranges from several ms to hundreds of ms. Given n_threads > 1, when BLAS runs in the main thread, the worker threads have nothing to do and thus keep spinning. The spinning overhead is not acceptable.
Given M/N/K, n_threads (and even the src0 type), and due to the diversity of matrix dimensions and hardware/software stacks, we are not sure which of the solutions is the fastest. At present, the master branch applies this rule: run CUDA/CL/BLAS in a single OS thread when both src0 and src1 are contiguous and M >= 32 && N >= 32 && K >= 32. For the llama model, this rule is almost equal to M >= 32 && N >= 4096 && K >= 4096.
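The rule quoted above can be written down as a tiny predicate. This is only a sketch of the rule as described in the text, not a copy of the ggml source; the function and parameter names are hypothetical.

```python
def use_blas_by_master_rule(m, n, k, src0_contiguous=True, src1_contiguous=True,
                            min_dim=32):
    """Fixed-threshold rule: offload to CUDA/CL/BLAS only for contiguous
    inputs whose three dimensions all reach `min_dim`."""
    return (src0_contiguous and src1_contiguous
            and m >= min_dim and n >= min_dim and k >= min_dim)
```

Because N and K are at least 4096 in the llama shapes mentioned above, the outcome in practice depends on M alone, which is why the rest of this description concentrates on the M = 32 boundary.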
Solutions
This PR tries to solve the above problems. They are tightly coupled, so it's hard to solve just one without touching the others.
1. A new threading infrastructure that supports spin + wait/notify
Typical usages are:
* `idle wait`: it issues a `wait_now` command; workers get this command almost at once, then go to wait.
* `wait_on_task_done`: that means we can look ahead a few future task stages to see if there are no immediate multi-thread needs. If there are none, we tell the workers to go waiting after finishing the task. The optimization benefits energy saving, but is hard to implement correctly and efficiently. In addition to a mutex, I have to use a spin lock.
2. A way to configure how to run task stage.
I want to explicitly define: which part of the code to run, single thread or multi-thread, and whether workers should go to idle wait or not. This is not new, but it introduces the `idle wait` and makes the configuration more explicit. With this we can run the bench at will; it frees us from the current implicit `#if defined(xxx)` and allows us to build with all kinds of solutions. I formally defined task profiles for the three kinds of mul_mat. This took quite a lot of code, but it is very important for the whole solution.
3. A flexible tune(bench) tool to generate bench data
This tool has the following features/benefits:
* Analyze bench data for n_threads. The output is CSV blocks, so it can be easily visualized.
* `M`s range from 1 up to 512 in 3 passes with 1 thread. A one-pass bench takes about 35 seconds with 1 thread; with 4 threads, 1 pass and max-M 128 it takes about 13 s.
The current speed is not good enough for running the bench at program startup.
4. Adapt llama and ggml to schedule with bench
After the bench data is loaded into the program, graph computing first matches the shape by the given N/K, then estimates the time for every profile that this shape supports, and finally selects the fastest profile. Since in practice we only bench a limited set of `M`s (10s or so), we have to leverage some magic to estimate the time for any M. Due to the near-linear nature of the M-time curve, I use `interpolate`. This is not very cool, but it is the best affordable way I can think of. Non-contiguous matrices are not suitable to run in BLAS, so they are scheduled to the pure CPU profile. If both src0 and src1 are contiguous, but no bench is loaded, or for some unknown reason or bug we cannot find the corresponding shape for the given N/K, or we are unable to estimate, we fall back to the traditional logic: M >= 32 && N >= 32 && K >= 32. This is totally unfortunate because the estimation bias around 32 is highly sensitive to performance. You will see this in the following section.
5. Split single thread BLAS
I separated de-quantization from the combined de-quantization + mul_mat `for loops`. Thus I can create the third task profile for the q_f32 `use BLAS` solution: run de-quantization in the INIT stage with multiple threads, run mul_mat with BLAS in a single thread, and let the workers idle wait.
Results
Due to the nature of predicting, it's a bit hard for me to bench end to end. I wrote a bench tool named `prompt.sh` to ask llama questions like this: `0+0=0.1+1=1.2+2=`. Although it is easy to construct a prompt of almost any approximate size this way, this kind of question is likely to take llama too much time to `think`, resulting in unusual bench times that may be longer than those of normal questions. I have to say that I don't know how to run the end-to-end bench efficiently and correctly at all. Anyway, I did run `examples/chat.sh` with 4 threads many times. I often observed the prompt time decrease by about 35%, sometimes over 40%, compared to master.
So, let me explain in a stricter but perhaps more easily understood way with a bunch of images.
First of all let's remember several tokens that will be used to identify the task stages for the three q_f32 profiles.
Where stage 0 is the `INIT` stage and stage 1 is the `COMPUTE` stage.
The values of n_threads are typical because:
All data in the following images are created from llama 7B. I will not show you all models because that's too lengthy and I can only run 7B/13B. Instead I'll try Q4_0, Q5_0 and Q8_0 because they are enough for us to catch the points.
I ran bench/analyze on my MacBook Pro 2018 with 32 GB 2400 MHz DDR4 memory and a 2.6 GHz 6-core Intel Core i7-8850H.
The data are all plotted as 2-D lines, where the x-axis is M and the y-axis is the per-thread execution time in `ms`.
4096x4096, Q4_0
The M >=32 rule and bias
The next diagram shows the execution time of profile-0 at stage-0 and stage-1. The axis scale is logarithmic. The stage-0 time is very small and is negligible compared to that of stage-1. We can anticipate that:
* With n_threads = `n`, the per-thread execution time should be 1/n of the single-thread time.
The next diagram shows the execution time of profile-1 at stage-1 (BLAS). The axis scale is logarithmic. The time is nearly constant when M <= 64; beyond that, Δt/ΔM keeps growing until the time finally becomes linear in M. I guess the time increases so much when M > 64 because 4096x4096x64 is about 1 billion float32 values in total, which I put at 32 GiB of memory to allocate, identical to my device memory. When that exceeds the maximum memory, the OS has to compress memory or use swap, which greatly hurts performance.
The next picture is used to explain the bias ranges in the current master code. Let's first find the points where the blue line intersects the other lines. The blue line represents the overall execution time for profile-2, whereas the other 4 lines represent the overall execution time for profile-0 at each n_threads. Every line for profile-0 intersects the line for profile-2 at some point. So given n_threads and M, we can easily determine the fastest profile (line) by simply glancing at the intersection point. For those `M`s not on the x-axis, we can easily estimate the corresponding time.
Now let's focus on the vertical line at M=32. Given n_threads, we can find the corresponding lines for profile-0 and profile-2.
Let's recall the default profile selecting policy in master code: M >=32 && N >= 32 && K >=32. This means: for NxK= 4096x4096, when M <32 we follow the line for profile-0, otherwise follow the line for profile-2.
This is ideal when the two lines intersect at M=32; otherwise the estimation bias shows up for those `M`s between the intersection point and 32. We can see that for any line of profile-0, the bias goes up from 0 (at the intersection point) to |t0-t2| (at M=32), where t0 is the profile-0 time and t2 is the profile-2 time. The max bias is so large that it may reach up to 30% for n_threads=1 and 2, and up to 60% for n_threads=4 or 6. Of course, as n_threads increases, the spinning and memory contention or cache misses cause a certain performance degradation, so the per-thread average time does not reach that ideal (small) value.
As I said before, M is the token size. Since white spaces and stems are also counted in the token size, for any typical question or statement the corresponding prompt token size is likely to get close to 32.
Anyway, nowadays personal computers tend to have big memory and fast CPUs, so the bias may go unnoticed or be tolerable.
Parallel de-quantizing
The next two pictures show the trend of the de-quantization time at the INIT stage as a percentage of the whole execution time. In theory, the de-quantization (INIT) time is determined by N/K only, so it can be seen as a constant. But the BLAS time increases after M>64.
The important thing to learn from this plot is: the INIT time is close to or bigger than the COMPUTE time over a pretty large M range, up to 128! It is about 1/3 of the overall time even at M=256. So if we run INIT with multiple threads, we can get far better performance than with a single thread. Ideally, we can speed up by over 50% when M <= 64, and by 30% ~ 40% when M is between 64 and 128.
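A back-of-the-envelope model helps to see where the gain from a multi-threaded INIT stage comes from. This sketch assumes an idealized split (INIT scales perfectly with the thread count, COMPUTE stays single-threaded in BLAS, no scheduling overhead); the function names are hypothetical and the model is mine, not taken from the PR code.

```python
def split_blas_time(t_init, t_compute, n_threads):
    """Idealized wall time: de-quantization (INIT) is divided across
    n_threads, the BLAS mul_mat (COMPUTE) still runs in one thread."""
    return t_init / n_threads + t_compute


def time_saved(t_init, t_compute, n_threads):
    """Fraction of the single-threaded INIT + COMPUTE time that is saved."""
    before = t_init + t_compute
    return 1.0 - split_blas_time(t_init, t_compute, n_threads) / before
```

In this model the saving equals f * (1 - 1/n), where f is the INIT share of the total time. For example, `time_saved(10.0, 10.0, 4)` is 0.375 when INIT and COMPUTE take equal time, and `time_saved(1.0, 2.0, 4)` is 0.25 when INIT is one third of the total. Real measurements will differ because of memory contention and wakeup costs.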
Finally, I show you the multi-threaded plot; for simplicity I just show nth=1 and nth=4. From this picture we can see that M at the intersection point increases with n_threads. I've seen that there is no intersection point at all when n_threads=8: that means the pure CPU solution always runs faster than the BLAS solution, even if both run with multiple threads.
With fine tuning, given the model, type, M, N, K and n_threads, we will be able to select the correct profile.
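The selection step can be sketched as follows: estimate the time of each profile at the requested M by linear interpolation between benched points, then pick the smallest estimate. This is a simplified illustration, not the PR's C implementation; the function names, the data layout and the sample numbers in the usage note are all hypothetical.

```python
from bisect import bisect_left


def estimate_time(points, m):
    """points: list of (M, time_ms) sorted by M, with at least two entries.
    Interpolates linearly; outside the benched range the nearest segment
    is extended."""
    ms = [p[0] for p in points]
    i = bisect_left(ms, m)
    if i < len(ms) and ms[i] == m:
        return points[i][1]
    hi = min(max(i, 1), len(points) - 1)
    (m0, t0), (m1, t1) = points[hi - 1], points[hi]
    return t0 + (t1 - t0) * (m - m0) / (m1 - m0)


def select_profile(bench_by_profile, m):
    """Return (name of the fastest profile at this M, all estimates)."""
    estimates = {name: estimate_time(pts, m)
                 for name, pts in bench_by_profile.items()}
    return min(estimates, key=estimates.get), estimates
```

With made-up bench data `{"cpu": [(1, 1.0), (64, 64.0)], "blas": [(1, 30.0), (64, 30.0)]}`, `select_profile(bench, 16)` picks `"cpu"` and `select_profile(bench, 48)` picks `"blas"`. Linear interpolation is only reasonable here because the M-time curves are described above as nearly linear between bench points.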
Other images
I will not explain them. I list these images mainly to show the similarities and minor differences.
How to evaluate
Build with make or CMake
Make sure one of the BLAS vendors is enabled and compiled into the program.
Evaluate:
NOTE: when GPU offloading is enabled (-ngl > 0), mul_mat tuning is disabled automatically.
Have a look at examples/mulmat-tune/README.md for details
Conclusion
Software systems are complicated. It's hard to optimize when target platforms vary widely. I'm certain that the q_f32 speedup would not have become reality without the new threading infrastructure, the task config profile and the mulmat tune tool. I'm happy that after so long I am finally able to show you the working code. Enjoy!
@ggerganov @SlyEcho @0cc4m @JohannesGaessler @zenixls2 @slaren
EDITED on Jun 18
EDITED ON Jun 26
I haven't updated this PR for a few days, for the following reasons, I think:
Great thanks to @KerfuffleV2 for help testing and all of you who took time on this PR.
I'm sorry, @ggerganov, that this took your time to review. So should I close this PR?