BZPop bug #3276

Closed
adiholden opened this issue Jul 7, 2024 · 10 comments
Labels: bug (Something isn't working), Next Up (task that is ready to be worked on and should be added to working queue)

Comments

@adiholden
Collaborator

adiholden commented Jul 7, 2024

E20240706 06:12:17.647733 2255 zset_family.cc:1343] BUG: Callback didn't run! BLOCK//claimed=1 coord_state=6 local_res=0/
E20240706 06:12:11.367316 2255 zset_family.cc:1343] BUG: Callback didn't run! BLOCK//claimed=1 coord_state=6 local_res=0/

@adiholden added the bug label on Jul 7, 2024
@romange
Collaborator

romange commented Jul 7, 2024

We should not put private links in our public repo (for security considerations).

@adiholden added the Next Up label on Jul 9, 2024
@BorysTheDev
Contributor

@adiholden @romange I have found only one possible issue so far. We have 2 steps:

  1. Check key presence
  2. trans->execute()

During the 2nd step we lock db_slice::cb_mu_ and can be preempted, so if the key is removed in the meantime we will execute the transaction with a stale iterator.
This bug should be fixed by the db_slice and journal RegisterOnChange method refactoring in #3229.
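
A minimal sketch of the check-then-execute hazard described above, assuming a toy `std::map` in place of the real db_slice. `ToyTable`, `PopAfterYield`, and `RunScenario` are hypothetical names used only for illustration; they are not Dragonfly APIs. The sketch shows one way to stay safe across a preemption point: look the key up again in step 2 rather than reuse an iterator captured in step 1.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy stand-in for a shard's key table. This is NOT Dragonfly's DbSlice.
using ToyTable = std::map<std::string, int>;

// Step 2 done defensively: perform a fresh lookup after the possible
// preemption point instead of reusing an iterator obtained in step 1.
inline std::optional<int> PopAfterYield(ToyTable& table, const std::string& key) {
  auto it = table.find(key);
  if (it == table.end()) return std::nullopt;  // key vanished while we yielded
  int value = it->second;
  table.erase(it);
  return value;
}

// Simulates step 1 (presence check), a yield during which another fiber may
// remove the key, and then step 2.
inline std::optional<int> RunScenario(bool erase_during_yield) {
  ToyTable table{{"zkey", 42}};
  if (table.count("zkey") == 0) return std::nullopt;  // step 1: presence check
  if (erase_during_yield) table.erase("zkey");        // hypothetical concurrent removal
  return PopAfterYield(table, "zkey");                // step 2: re-validated access
}
```

With this shape, `RunScenario(true)` yields an empty optional instead of touching an invalidated iterator. Whether #3229 takes this approach or a different one is not stated in this thread.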

@BorysTheDev
Contributor

BorysTheDev commented Jul 11, 2024

I haven't found any use of cluster or replication, but a save operation happened on July 10, 2024 2:59:31 PM (GMT).

@BorysTheDev
Contributor

I've found one more bug, involving the KEY_MOVE error, and have fixed it in #3334.

But it looks like we have at least one more.

@BorysTheDev
Contributor

BorysTheDev commented Jul 19, 2024

I haven't found the real cause of this bug, but it looks like the execution order was as follows:

  1. run trans->WaitOnWatch(...)
  2. run trans->CancelBlocking(...)
    2.1) Transaction::coordinator_state_ |= COORD_CANCELLED;
    2.2) Transaction::local_result_ = status; // where status != OK
  3. Transaction::local_result_ is somehow set back to OK
  4. trans->WaitOnWatch(...) returns Transaction::local_result_ // which is already OK

The problem is that I don't see how the 3rd step can happen.
Also, it looks like the 2nd step was done from the connection closing code:

  owner->RegisterBreakHook([res, this](uint32_t) {
    if (res->transaction)
      res->transaction->CancelBlocking(nullptr);
    this->server_family().BreakOnShutdown();
  });
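
The ordering in this comment can be modelled with a small toy, assuming that `local_res=0` in the log means OK and that the "Callback didn't run" message fires when the wait returns OK without the pop callback having run. `ToyTransaction`, `ToyStatus`, `kToyCoordCancelled`, and `LooksLikeReportedBug` are hypothetical names, and the flag value is an assumption; none of this is Dragonfly's actual Transaction class.

```cpp
#include <cassert>
#include <cstdint>

enum class ToyStatus { OK, CANCELLED };

// Assumed bit value for the cancel flag; the real constant may differ.
constexpr uint8_t kToyCoordCancelled = 1 << 2;

struct ToyTransaction {
  uint8_t coordinator_state = 0;
  ToyStatus local_result = ToyStatus::OK;
  bool callback_ran = false;

  // Step 2: mark the transaction cancelled and record a non-OK status.
  void CancelBlocking(ToyStatus status) {
    coordinator_state |= kToyCoordCancelled;  // 2.1
    local_result = status;                    // 2.2
  }

  // Steps 1 and 4: the wait simply reports whatever local_result holds.
  ToyStatus WaitOnWatch() const { return local_result; }
};

// True for the combination seen in the log: cancelled, wait returned OK,
// yet the callback never ran.
inline bool LooksLikeReportedBug(const ToyTransaction& tx) {
  return (tx.coordinator_state & kToyCoordCancelled) != 0 &&
         tx.WaitOnWatch() == ToyStatus::OK && !tx.callback_ran;
}
```

In this model, steps 1, 2, and 4 alone never produce the logged state, because `CancelBlocking` leaves a non-OK result behind. Only an extra write that resets `local_result` to OK (the unexplained step 3) makes `LooksLikeReportedBug` return true.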

@romange
Collaborator

romange commented Jul 19, 2024

ok, how can you prove that 2 has happened, and it was due to RegisterBreakHook?

@BorysTheDev
Contributor

BorysTheDev commented Jul 22, 2024

coord_state=6 means that the operation was cancelled. I suspect RegisterBreakHook because we have neither replication nor a cluster, so it is the only option left.
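
Since `coordinator_state_` is updated with `|=`, `coord_state=6` can be read as a bitmask. The sketch below decodes it under the assumption that COORD_CANCELLED occupies bit 2 (value 4); the thread does not state the flag values, and the names given to the other two bits are placeholders. `DescribeCoordState` is a hypothetical helper, not part of Dragonfly.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed layout: only the CANCELLED name comes from the thread, and its
// value here is a guess. The other two names are placeholders.
constexpr uint8_t kFlag0x1 = 1;
constexpr uint8_t kFlag0x2 = 1 << 1;
constexpr uint8_t kCancelled = 1 << 2;

// Renders the set bits of a coordinator state as a '|'-separated string.
inline std::string DescribeCoordState(uint8_t state) {
  std::string out;
  auto add = [&](uint8_t flag, const char* name) {
    if ((state & flag) == 0) return;
    if (!out.empty()) out += '|';
    out += name;
  };
  add(kFlag0x1, "FLAG_0x1");
  add(kFlag0x2, "FLAG_0x2");
  add(kCancelled, "CANCELLED");
  return out.empty() ? "NONE" : out;
}
```

Under these assumptions, `DescribeCoordState(6)` gives `FLAG_0x2|CANCELLED`: the cancel bit is set together with one other flag.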

@dranikpg
Contributor

Based on the logs... isn't the key an empty string?

@BorysTheDev
Contributor

> Based on the logs... isn't the key an empty string?

I took that into account; that is why I wrote about the 3rd step.

@dranikpg
Contributor

Fixed with #3371
