Will setting different `lmul` result in obvious performance difference? #237
Comments
Should be fixed by #240
After retesting, it is verified that:

- ratio of difference (more workload between load & store) = 12.6% < ratio of difference (simple AXpY) = 34.6%
- ratio of difference (before) = (total_cycles(lmul=1) - total_cycles(lmul=8)) / total_cycles(lmul=8) * 100% = 28.4%
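For reference, the "ratio of difference" figures in this thread all appear to use the same relative cycle-count gap between two `lmul` settings, where $T$ denotes total cycles:

$$\text{ratio of difference} = \frac{T_{\texttt{lmul}=a} - T_{\texttt{lmul}=b}}{T_{\texttt{lmul}=b}} \times 100\%$$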
The retest results seem promising. The procedures now run faster than before, and the gap between the total cycles of different `lmul` settings has narrowed (except for the case of `lmul=1`). As for `lmul=1`, do you have any insight into it? @sequencer

- ratio of difference (excluding lmul=1) = (total_cycles(lmul=8) - total_cycles(lmul=2)) / total_cycles(lmul=2) * 100% = 0.3%, which can be safely ignored.
- ratio of difference = (total_cycles(lmul=1) - total_cycles(lmul=2)) / total_cycles(lmul=2) * 100% = 13.0%, which is relatively noticeable.
- ratio of difference (excluding lmul=1) = (total_cycles(lmul=2) - total_cycles(lmul=4)) / total_cycles(lmul=4) * 100% = 6.0%, which I assume to be reasonable given its low computational demand.
It is possibly caused by a VFU hazard; asking @SharzyL to confirm.
In RVV, `lmul` stands for "Vector Register Group Multiplier"; it specifies how many vector registers are grouped together. Different `lmul` settings result in different numbers of instructions being generated for the same procedure. For example, setting `lmul=4` requires twice as many instructions as setting `lmul=8` when executing the same procedure. Although the additional instructions cause extra fetching and decoding overhead, in theory this shouldn't have much impact on overall performance.

However, experiments on an AXpY case based on the stripmining method seem to demonstrate that varying `lmul` does make a difference. As shown in the table below, non-computing overheads such as fetching and decoding account for about 4.0% ((663,647 - 638,047) / 638,047) of the total cycles, while the performance gap caused by changing `lmul=8` to `lmul=1` reaches 28.4% ((819,295 - 638,047) / 638,047), about seven times the non-computing overhead measured with `lmul=8`.
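To make the instruction-count argument concrete: with `vlen=1024` and `sew=32`, one stripmining iteration covers at most `lmul * vlen / sew` elements, i.e. 256 elements at `lmul=8` but only 32 at `lmul=1`, so the `lmul=1` version executes roughly eight times as many loop iterations (and scalar bookkeeping instructions) for the same input. A minimal stripmined AXpY loop in RVV assembly, following the textbook pattern rather than the exact code generated for this experiment, might look like this (register assignments are illustrative):

```asm
# saxpy: y[i] += a * x[i], stripmined over n elements
# a0 = n, fa0 = a, a1 = &x, a2 = &y   (illustrative register assignment)
saxpy:
    vsetvli   t0, a0, e32, m8, ta, ma  # vl = min(n, VLMAX); with m8, VLMAX = 8*vlen/sew = 256
    vle32.v   v0, (a1)                 # load x[0..vl)
    vle32.v   v8, (a2)                 # load y[0..vl)
    vfmacc.vf v8, fa0, v0              # y += a * x
    vse32.v   v8, (a2)                 # store y[0..vl)
    sub       a0, a0, t0               # n -= vl
    slli      t0, t0, 2                # vl * sizeof(float)
    add       a1, a1, t0               # advance x pointer
    add       a2, a2, t0               # advance y pointer
    bnez      a0, saxpy                # loop until all elements processed
    ret
```

Changing `m8` to `m1` in the `vsetvli` leaves the loop body identical, but each iteration then covers only 32 elements, so the scalar overhead (pointer updates, branch, `vsetvli`) is paid about eight times as often; in theory the vector work itself stays the same.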
Will setting a different `lmul` result in an obvious performance difference? Do we need further analysis on that?

The AXpY case is as follows (sew=32, vlen=1024, and input length = 262145):
axpy.mlir: