Added test for MetalWalls (mw) #164
I tried only single-node runs for now; they seem to complete just fine (11/12 completed, benchmark5 is still running on a 192-core node). I left a couple of small comments, please have a look at those.
I'll do runs at all scales next.
Another thing I noticed is that in the performance metrics, I get e.g.:
[ OK ] (11/12) EESSI_MetalWalls_MW %benchmark_info=hackathonGPU/benchmark5 %scale=1_node %module_name=MetalWalls/21.06.1-foss-2023a %compute_device=cpu /04d3c17b @snellius:rome+default
P: total_elapsed_time: 819.429 s (r:0, l:None, u:None)
P: extract_time: 0 s (r:0, l:None, u:None)
I'm not sure what that extract_time is, but from your PR to the hpctestlib I have the feeling it should list a couple of concrete extract times (and they probably shouldn't be 0?), if I understand this line correctly?
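A zero-valued metric like the one above is easy to miss when scanning many test results by eye. The following is a minimal sketch, not part of the PR, of how the `P:` lines from the ReFrame output could be parsed to flag metrics that report exactly 0. The helper names `parse_perf_lines` and `zero_metrics` are hypothetical, and the regular expression assumes the line layout shown in the output above.

```python
import re

# Matches lines such as "P: extract_time: 0 s (r:0, l:None, u:None)".
# The layout is assumed from the output quoted in this thread.
PERF_LINE = re.compile(
    r"^\s*P:\s*(?P<name>\w+):\s*(?P<value>[-+0-9.eE]+)\s*(?P<unit>\S*)\s*\("
)


def parse_perf_lines(text):
    """Return {metric_name: (value, unit)} for every 'P:' line in text."""
    metrics = {}
    for line in text.splitlines():
        match = PERF_LINE.match(line)
        if match:
            metrics[match.group("name")] = (
                float(match.group("value")),
                match.group("unit"),
            )
    return metrics


def zero_metrics(metrics):
    """Return the sorted names of metrics whose value is exactly zero."""
    return sorted(name for name, (value, _unit) in metrics.items() if value == 0.0)


sample = """\
P: total_elapsed_time: 819.429 s (r:0, l:None, u:None)
P: extract_time: 0 s (r:0, l:None, u:None)
"""
metrics = parse_perf_lines(sample)
print(zero_metrics(metrics))  # ['extract_time']
```

A check like this only reports that a value is zero; whether zero is actually wrong for extract_time still depends on how the upstream test extracts it.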
I think it was due to |
From a quick test I did previously on CSCS, I think benchmark5 tends to get significantly slower if too many cores are given. (I also think the scalability of all the other benchmarks tends to plateau at 64~300 cores, depending on the test.) |
Yeah, I figured as much, as I saw that the 192 core run took longer than the 128 core run. A bit longer is not an issue (as long as it stays within the 30 min walltime), but if it becomes unreasonable, it's better to skip that test instance. Since the core count depends on the type of node the test is scheduled for, it is only known after |
Just also ran it. Most passed without a problem.
|
Thanks @laraPPr. In general I've seen slowdowns only with |
I originally ran it on our debug cluster, and it seems like I shot myself in the foot by doing that. I just did a CPU-only run on Hortense and it ran all the tests in under an hour. It seems like all the multinode tests of Almost all the And the timewall was also hit with I will now have a closer look at the output to see why the ones that did not hit the timewall failed. |
The error that I found in the output and stderr of the
|
This error is interesting, as it seems that giving too many cores causes some math routines to fail. For now I think I will follow @casparvl's advice and add a skip based on the core count if there are too many cores (>~ 256). |
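The skip described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code that ended up in the PR: the helper `skip_reason` and the constant `MAX_CORES` are hypothetical, and the cut-off of 256 is simply the ">~ 256" estimate from the comment.

```python
# Assumed cut-off, taken from the ">~ 256" estimate in the comment above.
MAX_CORES = 256


def skip_reason(num_tasks, cpus_per_task=1, max_cores=MAX_CORES):
    """Return a skip message if the run would use too many cores, else None."""
    total_cores = num_tasks * cpus_per_task
    if total_cores > max_cores:
        return f"{total_cores} cores exceeds the assumed limit of {max_cores}"
    return None


# In a ReFrame test, the result could be passed to skip_if() once the core
# count is known, for example (untested sketch):
#   reason = skip_reason(self.num_tasks, self.num_cpus_per_task)
#   self.skip_if(reason is not None, reason)
for tasks in (128, 192, 384):
    print(tasks, skip_reason(tasks))
```

Keeping the decision in a plain function makes the threshold easy to test and adjust without scheduling a job.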
Co-authored-by: Caspar van Leeuwen <[email protected]>
I've run all tests again, on two of our architectures.
Two failures were the single core runs of Two other runs failed only on our Genoa partition. These were Also, I noticed the upstream PR to the HPCtestlib is merged. So from my point of view, the walltime would be the final thing to change - though I've asked @laraPPr to also rerun it on her system to check if she still hits other issues (get one more data point :)) |
Do you think 1h of time limit would be enough or should I set it higher? |
I ran it on Hortense and hit the timewall on two tests. I think setting the limit to an hour is good.
|
I've increased the base timeout to 60m. Hopefully, together with the 120m limit for the large benchmark with few cores, this will work. |
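The two-tier timeout described above might look something like this sketch. The 60m and 120m values come from the comment; everything else is an assumption made for illustration. In particular, the thread does not say which benchmark counts as "large" or how many cores count as "few", so `LARGE_BENCHMARKS` and `FEW_CORES` are hypothetical placeholders, as is the helper `pick_time_limit`.

```python
BASE_TIME_LIMIT = "60m"    # base timeout mentioned in the comment above
LARGE_TIME_LIMIT = "120m"  # extended timeout for the large benchmark on few cores

# Hypothetical placeholders: the thread does not define "large" or "few".
LARGE_BENCHMARKS = {"hackathonGPU/benchmark5"}
FEW_CORES = 16


def pick_time_limit(benchmark, total_cores):
    """Return the time limit string for a benchmark at a given core count."""
    if benchmark in LARGE_BENCHMARKS and total_cores <= FEW_CORES:
        return LARGE_TIME_LIMIT
    return BASE_TIME_LIMIT


print(pick_time_limit("hackathonGPU/benchmark5", 4))    # 120m
print(pick_time_limit("hackathonGPU/benchmark5", 128))  # 60m
print(pick_time_limit("hackathonGPU/benchmark2", 4))    # 60m
```

A string of this form could then be assigned to the test's time limit before the job is submitted.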
Ah, that one actually passed. This is the only one that hit the timewall on Hortense cpu_rome. I don't know if this is a large or small benchmark, but 2 nodes should be enough cores, right?
|
Lgtm!
This test requires reframe-hpc/reframe#3233 to be merged first in the ReFrame repo.
The test will run the 6 benchmarks in the MetalWalls repo under hackathonGPU.
Test runtime has been tested with 4 cores on an i9-13900K workstation, with the CI test benchmark taking 77 seconds and the longest, benchmark2, taking 340 seconds.
A detailed breakdown of the time taken per routine can be enabled by setting the ReFrame variable debug_metrics to True in the test.