Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dangling transactions after 'failed to deallocate cached statement' errors #2100

Open
Valodim opened this issue Aug 1, 2024 · 3 comments
Open
Labels

Comments

@Valodim
Copy link

Valodim commented Aug 1, 2024

Hey there. First of all, thank you for all your work on this library 👍

Describe the bug

We are using pgx in a fairly typical backend application, that serves requests to clients via grpc. We use pgx as a driver via sqlx. Starting with version 5.5.4, we started to notice some problems:

  • Our production logs contained lines "failed to deallocate cached statement: conn closed". Roughly it was a couple hundred times a day, for a couple hundred thousand requests a day where each request is a transaction.
  • This happened predominantly (but not exclusively) in streaming requests. Those are ended when cancelled by the user, so we assume it's some race condition around cancelled context behavior.
  • The logged error happens from a call to tx.Rollback(), which is called from a defer that is set up with the transaction and run at the end of the cancelled rpc method.
  • We also started to see dangling transactions left in an idle in transaction state, indefinitely (hours), causing issues particularly when they held on to locks. While this is not definitely related to the log entries, it started happening at the same time and the log entries indicate an error during transaction cleanup.

It seems like something goes wrong around transaction cleanup, but we can't tell what's going on at all.

To Reproduce

Unfortunately, we were unable to reproduce this bug outside our production environment, which also isn't source available. :( Rolling back to 5.5.3 "fixed" the issue for us, so it might be caused by 046f497. We tried some things in our codebase around transaction behavior and made sure we begin/end transactions consistently throughout, to no avail. If you have any ideas, we'd be happy to try.

Version

  • Go: $ go version -> go version go1.21.5 linux/amd64
  • PostgreSQL: $ psql --no-psqlrc --tuples-only -c 'select version()' -> PostgreSQL 15.7 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309, 64-bit
  • pgx: $ grep 'github.com/jackc/pgx/v[0-9]' go.mod -> v5.5.4
@Valodim Valodim added the bug label Aug 1, 2024
@Valodim Valodim changed the title dangling transactions after f'failed to deallocate cached statement' errors dangling transactions after 'failed to deallocate cached statement' errors Aug 1, 2024
@jackc
Copy link
Owner

jackc commented Aug 3, 2024

I do not think that failed to deallocate cached statement: conn closed is directly connected to dangling transactions -- after all the connection is closed -- it can't be holding open a connection. I suppose it could be worth checking that the connection actually is closed on the server though.

I suppose it's possible this has something to do with the context cancellation. But because of the difficulty in reliably interrupting a query and recovering the connection, pgx simply closes the connection when a query is interrupted. So it would seem there's not much that could go wrong there.

@xocasdashdash
Copy link

👋
I'm seeing some similar errors when using a "child" context to query some rows for locking.

So i'm doing basically this:

tctx, cancel := context.WithTimeout(ctx, 10 * time.Second)
defer cancel()

txx, err := s.sqlClient.BeginTxx(tctx, nil)

defer func() {
   // handle recover
   fn := txx.Commit
   if err != nil {
     fn = txx.Rollback
    }
    funcErr := fn()
    if funErr != nil {
       log.Error("error applying transaction %+v", funcErr)
        // Here i see the error with `failed to deallocate cached statement: conn closed `
       err = fmt.Errorf("failed to apply transaction %w", funErr)
     }
     return err
}

childCtx, cancelChildren := context.WithTimeout(tctx, 1 * time.Secoond)
defer cancelChildren()

var ids []string
err := s.sqlClient.ExecCtx(childCtx, &ids, "select id from users where age > 99 for update"); // this takes more than 1 second
if err != nil {
 return err
}
// use the ids to run an update
// loop through users and run an update
// commit transaction

After the "lock" queries started hitting their context timeout, i started seeing this errors on the application logs. Is it the case that this "child" context being cancelled is closing down the original transaction connection ?

I'm trying to understand where in the code does this happen if there's something I can do to better handle this (I already want to refactor how the transaction is committed/rollbacked to avoid failing)

@jackc
Copy link
Owner

jackc commented Oct 5, 2024

@xocasdashdash

Is it the case that this "child" context being cancelled is closing down the original transaction connection ?

Yes. Context cancellation closes the connection in most cases because it is very difficult or impossible to reliably recover from an interrupted query.

But beyond that, the fundamental means of interrupting a query in PostgreSQL causes the interrupted query to return an error. If a query returns an error the wrapping transaction is places in an error state. So even at the PostgreSQL level the only way to handle this would be to use savepoints (pseudo nested transactions) around every query you might want to cancel and rollback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants