fix(queue): Reset ackAttempts on successful SQL ack #4479
base: master
Conversation
Related to and the cause of spinnaker/spinnaker#6597
This looks like a SQL-specific change. Mind updating the commit message to indicate that? Or perhaps implement it for redis as well :)
force-pushed from 092caf8 to 95b371d
I've updated the commit message to mention that this only affects the SQL implementation. I'm not really sure what can be done with Redis; the docs are correct to say that significant data loss occurs when the Redis backend goes away. If it's even fixable, it will be a much more complex change than what is being proposed here.
Thanks @nicolasff!
What's involved in writing an automated test to exercise this functionality?
I've been wondering how to write a test for this. The way I went about validating the change involved manual fault injection within Orca in specific places, followed by breakpoints in other places to observe the […]. To trigger this condition and cover the […]
I'm going to look into implementing this today.
In the SQL queue implementation, reset ackAttempts in the message metadata when a message is ack'd. This avoids having this number grow over time for very long-lived messages that keep getting re-enqueued, with the occasional failure causing a message to not be acknowledged and eventually get dropped once 5 acks have been missed in total.
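To make the commit message concrete for readers of this thread, here is a minimal, self-contained sketch of the idea. The `Message` type, the `ACK_ATTEMPTS` key, and the function names are illustrative stand-ins, not the actual keiko `SqlQueue` internals.

```kotlin
// Illustrative sketch only -- not the real SqlQueue code. It models the
// proposed behavior: a successful ack clears the missed-ack counter, so a
// message is dropped only after maxRetries *consecutive* missed acks.
data class Message(
  val payload: String,
  val metadata: MutableMap<String, Int> = mutableMapOf()
)

const val ACK_ATTEMPTS = "ackAttempts"   // hypothetical metadata key
const val MAX_RETRIES = 5                // mirrors Queue.maxRetries

// Called when a message is acknowledged successfully.
fun onSuccessfulAck(message: Message) {
  // Reset the counter so earlier, scattered failures no longer count
  // toward the drop threshold.
  message.metadata[ACK_ATTEMPTS] = 0
}

// Called by the retry/cleanup path when an ack was never received in time.
// Returns true when the message should be abandoned.
fun onMissedAck(message: Message): Boolean {
  val attempts = (message.metadata[ACK_ATTEMPTS] ?: 0) + 1
  message.metadata[ACK_ATTEMPTS] = attempts
  return attempts >= MAX_RETRIES
}
```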
force-pushed from b57dd91 to 9340a52
Latest push: added a Spek test for both values of `resetAttemptsOnAck`.
```diff
@@ -842,25 +844,62 @@ class SqlQueue(
     }
   }

-  private fun ackMessage(fingerprint: String) {
+  private fun ackMessage(fingerprint: String, message: Message) {
     if (log.isDebugEnabled) {
```
Pardon my ignorance, is this required? If debug logging isn't enabled, `log.debug` won't do anything anyway, right?
If `fingerprint` is big, we'll still spend non-trivial compute resources building the string to log, even though `log.debug` doesn't actually log it.
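A small illustration of that point, assuming an SLF4J-style logger; the function and message text below are made up for the example, not the actual `SqlQueue` code.

```kotlin
import org.slf4j.Logger
import org.slf4j.LoggerFactory

private val log: Logger = LoggerFactory.getLogger("SqlQueueLoggingExample")

fun ackDebugLog(fingerprint: String, messageJson: String) {
  // Without the guard, the string template below is built on every call,
  // even when DEBUG is disabled; the cost is paid before log.debug() gets
  // a chance to discard the message.
  if (log.isDebugEnabled) {
    log.debug("Acking message fingerprint=$fingerprint payload=$messageJson")
  }

  // Alternative: SLF4J placeholders defer the final string assembly,
  // though the argument values themselves are still evaluated.
  log.debug("Acking message fingerprint={}", fingerprint)
}
```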
```kotlin
import java.util.*

@RunWith(JUnitPlatform::class)
```
With PRs like #4484, we're trying to modernize junit things. I tried on my local machine and the tests still execute without this line. Here are the diffs I used:
```diff
diff --git a/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt b/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
index e7bb78f82..0ff10042a 100644
--- a/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
+++ b/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
@@ -29,12 +29,8 @@ import org.jetbrains.spek.api.dsl.describe
 import org.jetbrains.spek.api.dsl.given
 import org.jetbrains.spek.api.dsl.it
 import org.jetbrains.spek.api.dsl.on
-import org.junit.platform.runner.JUnitPlatform
-import org.junit.runner.RunWith
 import java.util.*
-
-@RunWith(JUnitPlatform::class)
 class SqlAckQueueTest : Spek({
   describe("both values of resetAttemptsOnAck") {
     // map of resetAttemptsOnAck to expected number of ackAttempts still on the message after ack
```
```kotlin
import org.jetbrains.spek.api.dsl.on
import org.junit.platform.runner.JUnitPlatform
import org.junit.runner.RunWith
import java.util.*
```
please avoid star imports
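For illustration only, the wildcard would be replaced with the specific names the file actually uses; the classes below are placeholders, since the real list depends on the test code.

```kotlin
// Instead of: import java.util.*
// import only what the test really needs (placeholder names):
import java.util.UUID
import java.util.Optional
```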
```kotlin
// check both values of resetAttemptsOnAck
flagToAckAttempts.forEach { resetFlag, expectedAckAttempts ->
  val testDescription = "SqlQueue with resetAttemptsOnAck = $resetFlag"
```
@dbyron-sf thanks for all the feedback, I'll make the changes you suggested and will look into this discrepancy with the HTML test report.
Sounds great. The more coverage the better :)
@dbyron-sf Any thoughts about this? Looks like the requested changes didn't happen but it's potentially a useful fix?
Reset `ackAttempts` in the message metadata when a message is ack'd. This avoids having this number grow over time for very long-lived messages that keep getting re-enqueued, with the occasional failure causing a message to not be acknowledged and eventually get dropped once 5 acks have been missed in total.
More detailed explanation
A long-running Orca Task in a Stage will cause the same `RunTask` message to be de-queued and re-enqueued over and over with all its attributes as it returns `RUNNING` to indicate it has not yet finished executing. If any part of the queue management code fails because Orca is too busy or even crashes right after de-queuing a message, the message will have its “un-acknowledged” counter incremented by one (`ackAttempts++`). For long-running stages, it is possible for a task to eventually reach `ackAttempts = 5` (`Queue.maxRetries`, link), which will cause `SqlQueue` in Orca to abandon the message and effectively stop processing its branch of the pipeline. When a long-running pipeline execution is still marked as `RUNNING` but no longer has any messages on the queue, it becomes a “zombie” that can’t be canceled by regular users.

The proposed fix is to reset `ackAttempts` to zero when a message is processed successfully, as would happen repeatedly with a long-lived stage. Instead of dropping messages when they reach 5 missed acknowledgments in total, we’ll now drop them only if they miss 5 in a row, which gives us a clear indication that the message just cannot be processed at this time.

Consider the analogy of a `ping` sequence used to monitor a host's uptime: if we leave `ping` running for 2 weeks monitoring a remote host, do we mark the host as down once it has missed 5 pings in total over these 2 weeks, or when it has missed 5 in a row?
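To make the “5 in total vs. 5 in a row” distinction concrete, here is a small self-contained sketch (not the actual keiko code; `dropsMessage` and its parameters are illustrative) comparing the two drop policies:

```kotlin
// Toy model of the drop decision. Each list element is true when that
// redelivery cycle ended with a missed ack, false when the ack succeeded.
fun dropsMessage(cycles: List<Boolean>, resetOnAck: Boolean, maxRetries: Int = 5): Boolean {
  var ackAttempts = 0
  for (missed in cycles) {
    if (missed) {
      ackAttempts++
      if (ackAttempts >= maxRetries) return true   // abandon the message
    } else if (resetOnAck) {
      ackAttempts = 0                              // successful ack clears the counter
    }
  }
  return false
}

fun main() {
  // Five scattered misses over a long-lived message: the old behavior
  // eventually drops it, the proposed behavior keeps it alive.
  val scattered = List(1000) { it % 200 == 0 }          // one miss every 200 cycles
  println(dropsMessage(scattered, resetOnAck = false))   // true  -> dropped
  println(dropsMessage(scattered, resetOnAck = true))    // false -> kept

  // Five consecutive misses: dropped under either policy.
  val consecutive = List(5) { true }
  println(dropsMessage(consecutive, resetOnAck = true))  // true
}
```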