Query execution stuck until idle-in-transaction timeout #2119

sergeydobrodey opened this issue Sep 4, 2024 · 5 comments

sergeydobrodey commented Sep 4, 2024

Query execution gets stuck until the idle-in-transaction timeout is reached and the application receives the error:

FATAL: terminating connection due to idle-in-transaction timeout (SQLSTATE 25P03)

Library stack trace:

goroutine 11794184 [IO wait, 4 minutes]:
internal/poll.runtime_pollWait(0x7f98fab5fda8, 0x72)
	runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc00bca4d00?, 0xc004116000?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00bca4d00, {0xc004116000, 0x2000, 0x2000})
	internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc00bca4d00, {0xc004116000?, 0xce20c48?, 0xc0043ef670?})
	net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc002f02e40, {0xc004116000?, 0x300e060?, 0xc0043ef6b0?})
	net/net.go:179 +0x45
github.com/jackc/pgx/v5/pgconn/internal/bgreader.(*BGReader).Read(0xc007450880, {0xc004116000, 0x2000, 0x2000})
	github.com/jackc/pgx/[email protected]/pgconn/internal/bgreader/bgreader.go:100 +0xd7
io.ReadAtLeast({0x2120140, 0xc007450880}, {0xc004116000, 0x2000, 0x2000}, 0x5)
	io/io.go:335 +0x90
github.com/jackc/pgx/v5/pgproto3.(*chunkReader).Next(0xc0067c6ff0, 0x5)
	github.com/jackc/pgx/[email protected]/pgproto3/chunkreader.go:80 +0x291
github.com/jackc/pgx/v5/pgproto3.(*Frontend).Receive(0xc00bda2000)
	github.com/jackc/pgx/[email protected]/pgproto3/frontend.go:307 +0x3f
github.com/jackc/pgx/v5/pgconn.(*PgConn).peekMessage(0xc00bca8480)
	github.com/jackc/pgx/[email protected]/pgconn/pgconn.go:524 +0x14a
github.com/jackc/pgx/v5/pgconn.(*PgConn).receiveMessage(0xc00bca8480)
	github.com/jackc/pgx/[email protected]/pgconn/pgconn.go:544 +0x1c
github.com/jackc/pgx/v5/pgconn.(*MultiResultReader).receiveMessage(0xc00bca8590)
	github.com/jackc/pgx/[email protected]/pgconn/pgconn.go:1413 +0x25
github.com/jackc/pgx/v5/pgconn.(*MultiResultReader).NextResult(0xc00bca8590)
	github.com/jackc/pgx/[email protected]/pgconn/pgconn.go:1441 +0x4b
github.com/jackc/pgx/v5.(*Conn).execSimpleProtocol(0xc009cf50e0?, {0x2134fe8?, 0xc00446a4b0?}, {0xc00dbf9a40?, 0x0?}, {0xc009b315c0?, 0x196e820?, 0x1923f80?})
	github.com/jackc/pgx/[email protected]/conn.go:544 +0xb0
github.com/jackc/pgx/v5.(*Conn).exec(0xc009cf50e0, {0x2134fe8, 0xc00446a4b0}, {0xc00dbf9a40, 0x126}, {0xc009b315c0?, 0xc009b315c0?, 0x6?})
	github.com/jackc/pgx/[email protected]/conn.go:529 +0x558
github.com/jackc/pgx/v5.(*Conn).Exec(0xc009cf50e0, {0x2134fe8?, 0xc00446a480?}, {0xc00dbf9a40, 0x126}, {0xc009b315c0, 0x6, 0x6})
	github.com/jackc/pgx/[email protected]/conn.go:448 +0x12f
github.com/jackc/pgx/v5/stdlib.(*Conn).ExecContext(0xc00af702c0, {0x2134fe8, 0xc00446a480}, {0xc00dbf9a40, 0x126}, {0xc004659ef0, 0x6, 0xc000017e40?})
	github.com/jackc/pgx/[email protected]/stdlib/sql.go:485 +0x245
github.com/XSAM/otelsql.(*otConn).ExecContext(0xc002c1b600, {0x2134fe8, 0xc00446a3f0}, {0xc00dbf9a40, 0x126}, {0xc004659ef0, 0x6, 0x6})
	github.com/XSAM/[email protected]/conn.go:109 +0x451
database/sql.ctxDriverExec({0x2134fe8?, 0xc00446a3f0?}, {0x7f98faf88718?, 0xc002c1b600?}, {0x0?, 0x0?}, {0xc00dbf9a40?, 0x1b12f40?}, {0xc004659ef0, 0x6, ...})
	database/sql/ctxutil.go:31 +0xd7
database/sql.(*DB).execDC.func2()
	database/sql/sql.go:1675 +0x165
database/sql.withLock({0x21289f0, 0xc009c78f30}, 0xc0043f0288)
	database/sql/sql.go:3502 +0x82
database/sql.(*DB).execDC(0xbee82a?, {0x2134fe8?, 0xc00446a3f0}, 0xc009c78f30, 0x1bf96e0?, {0xc00dbf9a40, 0x126}, {0xc009b31560, 0x6, 0x6})
	database/sql/sql.go:1670 +0x24f
database/sql.(*Tx).ExecContext(0xc00e020f80, {0x2134fe8, 0xc00446a3f0}, {0xc00dbf9a40, 0x126}, {0xc009b31560, 0x6, 0x6})
	database/sql/sql.go:2478 +0xad

pg_stat_activity shows the previous query within the transaction. The actual (hung) query is an insert into a different table: we exec an INSERT INTO changelog query, with search_path configured via the connection string.

datid,datname,pid,leader_pid,usesysid,usename,application_name,client_addr,client_hostname,client_port,backend_start,xact_start,query_start,state_change,wait_event_type,wait_event,state,backend_xid,backend_xmin,query_id,query,backend_type
17747,example,380476,,16391,username,"",172.22.254.100,,56947,2024-09-04 10:30:22.571011+03,2024-09-04 10:30:37.258628+03,2024-09-04 10:30:37.264224+03,2024-09-04 10:30:37.26514+03,Client,ClientRead,idle in transaction,2363397245,,-8112338289380391596,"UPDATE example_schema.customers SET name =  'User Name'",client backend

pg_locks

locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid,virtualtransaction,pid,mode,granted,fastpath,waitstart
relation,17747,342277,,,,,,,,500/3386930,380476,RowExclusiveLock,t,t,
relation,17747,342275,,,,,,,,500/3386930,380476,RowExclusiveLock,t,t,
relation,17747,45981109,,,,,,,,500/3386930,380476,AccessShareLock,t,t,
relation,17747,45981109,,,,,,,,500/3386930,380476,RowExclusiveLock,t,t,
relation,17747,42632125,,,,,,,,500/3386930,380476,AccessShareLock,t,t,
relation,17747,42632125,,,,,,,,500/3386930,380476,RowExclusiveLock,t,t,
relation,17747,18332,,,,,,,,500/3386930,380476,AccessShareLock,t,t,
relation,17747,18332,,,,,,,,500/3386930,380476,RowExclusiveLock,t,t,
relation,17747,18194,,,,,,,,500/3386930,380476,AccessShareLock,t,t,
relation,17747,18194,,,,,,,,500/3386930,380476,RowExclusiveLock,t,t,
virtualxid,,,,,500/3386930,,,,,500/3386930,380476,ExclusiveLock,t,t,

Version

  • Go: go1.21.8
  • PostgreSQL: PostgreSQL 15.6 (Ubuntu 15.6-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit
  • pgx: github.com/jackc/pgx/v5 v5.5.5
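
For reference, the call path that hangs corresponds roughly to the following (a minimal sketch, not our actual code: the DSN, the changelog column, and the inserted values are placeholders; the two statements mirror the stack trace and the pg_stat_activity output above):

package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver for database/sql
)

func main() {
	// search_path is set via the connection string, as in the report; the DSN is a placeholder.
	db, err := sql.Open("pgx", "postgres://user:pass@host:5432/example?search_path=example_schema")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()

	// First statement completes; this is the query pg_stat_activity still shows for the backend.
	if _, err := tx.ExecContext(ctx, "UPDATE example_schema.customers SET name = 'User Name'"); err != nil {
		log.Fatal(err)
	}

	// The next statement is where the goroutine blocks in net.(*conn).Read until the
	// server's idle-in-transaction timeout terminates the backend.
	if _, err := tx.ExecContext(ctx, "INSERT INTO changelog (note) VALUES ('x')"); err != nil {
		log.Fatal(err)
	}

	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
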

rf commented Sep 12, 2024

I recently switched a service to pgx from lib/pq because of a suspiciously similar issue. I'm only able to reproduce it by hammering my service with a large number of requests. A connection that is idle in transaction according to Postgres will, on the client side, still be waiting for data. This stack trace is from lib/pq; the same conditions do not reproduce the problem with pgx, but the issue is similar enough that I thought I'd add this here.

internal/poll.runtime_pollWait+0x84                   /usr/local/go/src/runtime/netpoll.go:345
internal/poll.(*pollDesc).wait+0x26                   /usr/local/go/src/internal/poll/fd_poll_runtime.go:84
internal/poll.(*pollDesc).waitRead+0x279              /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read+0x267                        /usr/local/go/src/internal/poll/fd_unix.go:164
net.(*netFD).Read+0x24                                /usr/local/go/src/net/fd_posix.go:55
net.(*conn).Read+0x44                                 /usr/local/go/src/net/net.go:185
bufio.(*Reader).Read+0x196                            /usr/local/go/src/bufio/bufio.go:241
io.ReadAtLeast+0x8f                                   /usr/local/go/src/io/io.go:335
io.ReadFull+0xc4                                      /usr/local/go/src/io/io.go:354
github.com/lib/pq.(*conn).recvMessage+0xa6            /src/vendor/github.com/lib/pq/conn.go:1018
github.com/lib/pq.(*conn).recv1Buf+0x27               /src/vendor/github.com/lib/pq/conn.go:1073
github.com/lib/pq.(*conn).recv1+0x44                  /src/vendor/github.com/lib/pq/conn.go:1100
github.com/lib/pq.(*conn).readParseResponse+0x17      /src/vendor/github.com/lib/pq/conn.go:1823
github.com/lib/pq.(*conn).query+0x13b                 /src/vendor/github.com/lib/pq/conn.go:909
github.com/lib/pq.(*conn).QueryContext+0x126          /src/vendor/github.com/lib/pq/conn_go18.go:24
database/sql.ctxDriverQuery+0xd6                      /usr/local/go/src/database/sql/ctxutil.go:48
database/sql.(*DB).queryDC.func1+0x158                /usr/local/go/src/database/sql/sql.go:1776
database/sql.withLock+0x81                            /usr/local/go/src/database/sql/sql.go:3530
database/sql.(*DB).queryDC+0x1dd                      /usr/local/go/src/database/sql/sql.go:1771
database/sql.(*Tx).QueryContext+0xc4                  /usr/local/go/src/database/sql/sql.go:2525

Postgres 14.12


rf commented Oct 25, 2024

Coming back to this, we've now seen the issue with pgx as well:

internal/poll.runtime_pollWait+0x84                                            /usr/local/go/src/runtime/netpoll.go:345
internal/poll.(*pollDesc).wait+0x26                                            /usr/local/go/src/internal/poll/fd_poll_runtime.go:84
internal/poll.(*pollDesc).waitRead+0x279                                       /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read+0x267                                                 /usr/local/go/src/internal/poll/fd_unix.go:164
net.(*netFD).Read+0x24                                                         /usr/local/go/src/net/fd_posix.go:55
net.(*conn).Read+0x44                                                          /usr/local/go/src/net/net.go:185
github.com/jackc/pgx/v5/pgconn/internal/bgreader.(*BGReader).Read+0xd6         /src/vendor/github.com/jackc/pgx/v5/pgconn/internal/bgreader/bgreader.go:100
io.ReadAtLeast+0x8f                                                            /usr/local/go/src/io/io.go:335
github.com/jackc/pgx/v5/pgproto3.(*chunkReader).Next+0x290                     /src/vendor/github.com/jackc/pgx/v5/pgproto3/chunkreader.go:80
github.com/jackc/pgx/v5/pgproto3.(*Frontend).Receive+0x3b                      /src/vendor/github.com/jackc/pgx/v5/pgproto3/frontend.go:307
github.com/jackc/pgx/v5/pgconn.(*PgConn).peekMessage+0x149                     /src/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:547
github.com/jackc/pgx/v5/pgconn.(*ResultReader).readUntilRowDescription+0x2c    /src/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:1635
github.com/jackc/pgx/v5/pgconn.(*PgConn).execExtendedSuffix+0x18a              /src/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:1220
github.com/jackc/pgx/v5/pgconn.(*PgConn).ExecPrepared+0x164                    /src/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:1165
github.com/jackc/pgx/v5.(*Conn).execPrepared+0xf3                              /src/vendor/github.com/jackc/pgx/v5/conn.go:586
github.com/jackc/pgx/v5.(*Conn).exec+0x36f                                     /src/vendor/github.com/jackc/pgx/v5/conn.go:523
github.com/jackc/pgx/v5.(*Conn).Exec+0x12e                                     /src/vendor/github.com/jackc/pgx/v5/conn.go:466
github.com/jackc/pgx/v5/stdlib.(*Conn).ExecContext+0x247                       /src/vendor/github.com/jackc/pgx/v5/stdlib/sql.go:475
database/sql.ctxDriverExec+0xd6                                                /usr/local/go/src/database/sql/ctxutil.go:31
database/sql.(*DB).execDC.func2+0x158                                          /usr/local/go/src/database/sql/sql.go:1703
database/sql.withLock+0x81                                                     /usr/local/go/src/database/sql/sql.go:3530
database/sql.(*DB).execDC+0x228                                                /usr/local/go/src/database/sql/sql.go:1698
database/sql.(*Tx).ExecContext+0xac                                            /usr/local/go/src/database/sql/sql.go:2506

There is something in the Postgres protocol which is not being handled correctly.

Note that this is a rather severe bug. When it happens, it essentially invalidates a connection to Postgres; the connection will never close unless Postgres (or in our case, pgpool) is restarted. In our application, we use advisory locks to serialize writes to an entity with many constraints that need to be checked. This bug causes our entire application to lock up, because all of the connections end up waiting on advisory locks that cannot proceed once a connection waiting for a lock has been invalidated by this bug.
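
For context, the write path looks roughly like this (a minimal sketch, not our actual code; the lock key and the entity logic are placeholders):

package store

import (
	"context"
	"database/sql"
)

// updateEntity serializes writes to a single entity with a transaction-scoped advisory lock.
// If one connection hangs as described above while holding or queued for the lock,
// every other writer for that entity queues up behind it.
func updateEntity(ctx context.Context, pool *sql.DB, entityID int64) error {
	tx, err := pool.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// pg_advisory_xact_lock blocks until the lock is available and releases it at commit/rollback.
	if _, err := tx.ExecContext(ctx, "SELECT pg_advisory_xact_lock($1)", entityID); err != nil {
		return err
	}

	// ... constraint checks and writes for the entity ...

	return tx.Commit()
}
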

jackc (Owner) commented Oct 25, 2024

@rf Given that the same problem occurs in pgx and in pq, it makes me doubt the problem is in pgx. Is a load balancer or connection pooler being used? Another commonality is database/sql. I don't suppose the problem can be reproduced with the native pgx interface?
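
For comparison, the equivalent flow through the native pgx interface would look roughly like this (a sketch; the connection string and statements are placeholders taken from the report above):

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://user:pass@host:5432/example") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	tx, err := conn.Begin(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback(ctx)

	// Same statements as through database/sql, but without the stdlib wrapper in between.
	if _, err := tx.Exec(ctx, "UPDATE example_schema.customers SET name = 'User Name'"); err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec(ctx, "INSERT INTO changelog (note) VALUES ('x')"); err != nil {
		log.Fatal(err)
	}

	if err := tx.Commit(ctx); err != nil {
		log.Fatal(err)
	}
}

If the hang can be reproduced this way, database/sql is ruled out as the common factor.
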


rf commented Oct 25, 2024

We are using pgpool-II. The issue isn't easy to reproduce, so it isn't trivial to try using the native interface. I actually was not able to synthetically reproduce this issue in pgx after switching from pq to pgx. The stack trace above is from a production process that had locked up; we only just saw this issue, and it caused an outage for our product.

It's wholly possible that pgpool-II is somehow munging the protocol. I tend to doubt that since it should mostly be a dumb pipe.

I'm going to try reproducing this with full TCP dumping and a debugger on the process.


rf commented Oct 29, 2024

OK, after further investigation I'm pretty sure we are seeing this error because of a bug in pgpool-II. I don't know if the other people who have seen this are seeing the same bug, however.

The bug presents when reads at the beginning of a transaction are routed to a replica by pgpool-II and those reads are killed because the replica can't keep up with the WAL (terminating connection or cancelling statement due to conflict with recovery). It appears that in this case pgpool-II fails to roll back the transaction on the primary and/or otherwise 'loses track' of the connection, which results in no data being returned to the client. We were able to work around this by first issuing a call to a VOLATILE function, which causes pgpool-II to route the read statements to the primary (SELECT random()).
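
In code, the workaround looks roughly like this (a sketch, not our actual code; error handling and the real statements are trimmed):

package store

import (
	"context"
	"database/sql"
)

// runInTx starts every transaction with a call to a VOLATILE function so pgpool-II
// routes the transaction's subsequent read statements to the primary rather than a replica.
func runInTx(ctx context.Context, pool *sql.DB, work func(*sql.Tx) error) error {
	tx, err := pool.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// VOLATILE function call issued before any reads in the transaction.
	if _, err := tx.ExecContext(ctx, "SELECT random()"); err != nil {
		return err
	}

	if err := work(tx); err != nil {
		return err
	}
	return tx.Commit()
}
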

I didn't end up doing any low-level TCP debugging, so I can't say whether a message is successfully sent back through pgpool-II and then misunderstood by pgx. But I think it's likely that the bug lies outside of pgx.
