Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

liveness: allow testing main heartbeat loop deterministically #107452

Open
tbg opened this issue Jul 24, 2023 · 0 comments
Open

liveness: allow testing main heartbeat loop deterministically #107452

tbg opened this issue Jul 24, 2023 · 0 comments
Labels
A-testing Testing tools and infrastructure C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@tbg
Copy link
Member

tbg commented Jul 24, 2023

Liveness starts a goroutine that heartbeats the own liveness record periodically. This means that lots of tests that interact with liveness need to exert control over this goroutine. It would be easier to test NodeLiveness if the concurrency were externalized. In other words, rather than NodeLiveness.Start spawning a goroutine that loops and runs code that can only be reached by that goroutine, the goroutine should be a method on NodeLiveness that can be invoked manually, and Start should take a suitably defined abstraction over a "periodic runner". In prod, the periodic runner would be the familiar async task with a for loop. In testing, it may just be a no-op, and the harness can call the method directly whenever it wants to pretend the ping interval elapsed.

It's a bit tricky to get this right but there is probably an abstraction here that applies similarly to many other subsystems in CRL that start auxiliary goroutines (which are then difficult to test against). We should architect to give tests as much control over concurrency as possible.

Jira issue: CRDB-30050

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure T-kv KV Team labels Jul 24, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jul 24, 2023
I think there was a bug here. This method was previously invoked in
`updateLiveness`, but that method is the general workhorse for updating
anyone's liveness. In particular, it is called by `IncrementEpoch`. So we were
invoking `onSelfHeartbeat` when we would increment other nodes' epochs. This
doesn't seem great.

Additionally, the code was trying to avoid invoking this callback before
liveness was officially "started". Heartbeating yourself before liveness is
started is unfortunately a thing due to the tangled start-up initialization
sequence; we may see heartbeats triggered by lease requests.

Avoid both complications by invoking `onSelfCallback` from the actual main
heartbeat loop, whose only job is to heartbeat the own liveness record.

I tried to adopt `TestNodeHeartbeatCallback` to give better coverage, but
it's a yak shave. A deterministic node liveness (i.e. a way to invoke the
main heartbeat loop manually) would make this a lot simpler. I filed an
issue to that effect:

cockroachdb#107452
@andrewbaptist andrewbaptist added the C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. label Jul 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
None yet
Development

No branches or pull requests

2 participants