Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover gracefully from laptop sleep #1558

Open
enocom opened this issue Sep 28, 2023 · 11 comments · Fixed by #1771
Open

Recover gracefully from laptop sleep #1558

enocom opened this issue Sep 28, 2023 · 11 comments · Fixed by #1771
Assignees
Labels
priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@enocom
Copy link
Member

enocom commented Sep 28, 2023

Right now if a laptop goes to sleep with an active connection to a Cloud SQL instance, the socket factory won't recover on its own once the laptop has been restored.

There are limitations in Java to handling this:

https://stackoverflow.com/questions/52876556/how-does-java-calculate-sleep-time-when-pc-goes-into-hibernate-mode

Nonetheless, we should throw out a bad connection and force a refresh, such that the socket factory recovers gracefully from sleep.

See https://github.com/luryus/cloud-sql-jdbc-hang for a reproduction of this issue.

Related to GoogleCloudPlatform/cloud-sql-proxy#1788.

Whatever we do here, we should port to the AlloyDB Java Connector as well.

@enocom enocom added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Sep 28, 2023
@enocom
Copy link
Member Author

enocom commented Sep 28, 2023

cc @luryus

This is a problem across Connectors and we do want to improve support for sleep.

@enocom enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Oct 2, 2023
@luryus
Copy link

luryus commented Dec 20, 2023

Hi! I noticed that this (or a very similar issue) got fixed in the Go connector: GoogleCloudPlatform/cloud-sql-go-connector#686

Any chance this could now get ported to this Java library as well?

@annafang-google
Copy link

Ack. We'll port the fix over shortly.

@enocom
Copy link
Member Author

enocom commented Dec 20, 2023

I think this change should match the Go connector -- checking for an invalid cert and blocking until one is available if the cert is in fact expired.

@luryus
Copy link

luryus commented Feb 18, 2024

Hi @ttosta-google ,

it seems that this is still not fixed in v1.16.0 and with #1771.

Even though the library now immediately starts a refresh when it notices that the current cert is expired, the force refresh does not succeed. Logs indicate that it gets stuck in the refresh rate limiter somehow. The library logs "Acquiring rate limiter permit.", but not "Rate limiter permit acquired". All subsequent connection retries then just get the good old bad_certificate errors.

This reproduces with both my reproducer project and DBeaver.

Here's sample logs:

Feb 17, 2024 1:33:22 PM com.google.cloud.sql.core.Refresher handleRefreshResult
FINE: [<instance name>] Refresh Operation: Completed refresh with new certificate expiration at 2024-02-17T12:33:21Z.
Feb 17, 2024 1:33:22 PM com.google.cloud.sql.core.Refresher handleRefreshResult
FINE: [<instance name>] Refresh Operation: Next operation scheduled at 2024-02-17T12:29:20Z.
Feb 17, 2024 1:33:37 PM com.lkoskela.googlejdbchang.App main
FINE: Connection is still valid
Feb 17, 2024 1:33:57 PM com.lkoskela.googlejdbchang.App main
FINE: Connection is still valid

...
<sleep for more than an hour>
...

Feb 17, 2024 3:48:29 PM org.postgresql.jdbc.PgConnection isValid
FINE: Validating connection.
org.postgresql.util.PSQLException: This connection has been closed.
        at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:993)
        at org.postgresql.jdbc.PgConnection.setNetworkTimeout(PgConnection.java:1694)
        at org.postgresql.jdbc.PgConnection.isValid(PgConnection.java:1524)
        at com.lkoskela.googlejdbchang.App.main(App.java:56)

Feb 17, 2024 3:48:29 PM com.lkoskela.googlejdbchang.App main
INFO: Connection no longer valid
Feb 17, 2024 3:48:30 PM com.lkoskela.googlejdbchang.App main
INFO: Opening connection...
Feb 17, 2024 3:48:30 PM org.postgresql.Driver connect
FINE: Connecting with URL: jdbc:postgresql:///<redacted>
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection <init>
FINE: PostgreSQL JDBC Driver 42.6.0
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection setDefaultFetchSize
FINE:   setDefaultFetchSize = 0
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection setPrepareThreshold
FINE:   setPrepareThreshold = 5
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl openConnectionImpl
FINE: Trying to establish a protocol version 3 connection to :5432
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher refreshIfExpired
FINE: [<instance name>] Client certificate has expired. Starting next refresh operation immediately.
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher forceRefresh
FINE: [<instance name>] Force Refresh: the next refresh operation was cancelled. Scheduling new refresh operation immediately.
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher startRefreshAttempt
FINE: [<instance name>] Refresh Operation: Acquiring rate limiter permit.
Feb 17, 2024 3:48:30 PM jdk.internal.event.EventHelper logX509CertificateEvent
<redacted>
Feb 17, 2024 3:48:30 PM jdk.internal.event.EventHelper logTLSHandshakeEvent
FINE:  TLSHandshake: <redacted>
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl tryConnect
FINE: Receive Buffer Size is 65,536
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl tryConnect
FINE: Send Buffer Size is 65,536
Feb 17, 2024 3:48:30 PM org.postgresql.Driver connect
FINE: Connection error:
org.postgresql.util.PSQLException: The connection attempt failed.
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
        at org.postgresql.Driver.makeConnection(Driver.java:443)
        at org.postgresql.Driver.connect(Driver.java:297)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
        at com.lkoskela.googlejdbchang.App.main(App.java:54)
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
        at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)

@ttosta-google
Copy link
Collaborator

Thanks for the info @luryus! I've reopened this issue and will investigate it further.

@ttosta-google ttosta-google reopened this Feb 20, 2024
@enocom
Copy link
Member Author

enocom commented Feb 20, 2024

This is another example where our rate limiting approach to refreshes doesn't work so well.

The rate limiter should be only wait 30 seconds to try again, though.

@luryus
Copy link

luryus commented Apr 8, 2024

Hi, any updates on this? With 1.17.1 this still happens. I just retested this with my test program and it seems that after the pc has been asleep, the rate limiter lets the force refresh succeed only after a long time - probably only after the duration for the next scheduled refresh is reached. Until that the connection attempts fail with the errors I pasted in my previous message, even if I wait more than 30 seconds between each attempt.

It looks like a thread pool is starved or something, and that's why the refresh operations get stuck in the rate limiter.

Please take a look at this again - I run to this almost daily while using DBeaver and it's super annoying.

@enocom
Copy link
Member Author

enocom commented Apr 10, 2024

We've recently fixed a similar issue in the Proxy. For now I'd recommend connecting through the Proxy (which handles sleep without issue). And meanwhile, we've moved this up in our list of priorities and will investigate soon.

@luryus
Copy link

luryus commented Aug 29, 2024

This issue seems to be fixed by switching to the lazy refresh strategy. Makes sense - Windows sleeping disturbs timers just like idling on serverless platforms does, so avoiding them entirely fixes the problem.

There's a slight delay when the certificates are refreshed after I wake up my laptop and make a query in DBeaver, but that's not a problem at all.

Perhaps the docs could be updated to cover this use case?

@jackwotherspoon
Copy link
Collaborator

@luryus I would agree. Probably a good idea for us to update the README's Lazy Refresh section to mention the awaking from sleep use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants