test: long-running nexmark on madsim #5170
Comments
For the set of computations without any side effects, i.e. pure expression evaluation, can they be executed concurrently on multiple cores?
I missed it. Sorry for the late response.
Yes. But unfortunately, it's hard for madsim to identify which tasks are pure computation. In practice, tasks without any side effects hardly exist. More or less, they interact with each other through channels or shared state. Once that happens, we have to determine the order of the two tasks, otherwise determinism will be broken. If we could intervene every time they make a side effect, then parallel execution seems possible. But I feel that it would take a lot of effort, the determinism would be hard to guarantee, and I'm afraid it could not be parallelized well given the ubiquitous dependencies. 🥹

Thinking from the other side, simply speeding up the execution may not be the right direction for this problem. Concurrency bugs usually have a small depth, which means they can happen within a few steps if you carefully construct the schedule sequence. So they should be found quickly by massive simulations with different seeds. If they aren't, the reason could be that some conditions are not satisfied. For example, the storage data is not large enough to trigger compaction. The only way to meet this condition from scratch is to run data ingestion for a long time.

However, why do we have to run from scratch? If our simulator supports loading from a checkpoint, we can prepare a large dataset in advance and start directly from there. That's what we plan to do next.
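To make the "small depth, many seeds" point concrete, here is a toy sketch in plain Python. It is not madsim's API: `increment`, `run`, and `find_failing_seeds` are hypothetical names, and the scheduler is just a seeded random choice over generator-based tasks. It assumes two tasks that each do a non-atomic read-modify-write on a shared counter, so the lost-update bug needs only one unlucky interleaving.

```python
import random


def increment(state):
    """Toy task: non-atomic read-modify-write on shared state."""
    local = state["counter"]        # read
    yield                           # scheduling point: another task may run here
    state["counter"] = local + 1    # write back (may overwrite a concurrent update)


def run(seed, n_tasks=2):
    """Run the tasks under a schedule that is fully determined by `seed`."""
    rng = random.Random(seed)       # the only source of randomness
    state = {"counter": 0}
    tasks = [increment(state) for _ in range(n_tasks)]
    while tasks:
        task = rng.choice(tasks)    # the scheduler decides who runs next
        try:
            next(task)              # run the task up to its next scheduling point
        except StopIteration:
            tasks.remove(task)      # task finished
    return state["counter"]


def find_failing_seeds(n_seeds=100):
    """Sweep seeds and return those whose schedule loses an update."""
    return [seed for seed in range(n_seeds) if run(seed) != 2]
```

Because the schedule depends only on the seed, any seed returned by `find_failing_seeds()` can be passed back to `run(seed)` to replay the same interleaving. A bug that needs a precondition such as a large dataset would not show up in a sweep like this no matter how many seeds are tried, which is the case the checkpoint idea is meant to address.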
Thanks for the detailed explanation!
It makes sense!
Any updates?
After some rethinking, I decided to make this issue low-priority, since a long-running test is also slow to reproduce. We can't benefit much from it compared with the existing longevity test. Instead, I have been adding more short-term fault injection tests (e.g. #7623) so that problems can be found more efficiently.
We hope to run nexmark for a long period in deterministic simulation to find more stability issues.
Some potential challenges: