Experiencing various database exceptions with Rails 7.1 #796
# 3 - zero-length identifier - Probably the second most common. This one wedges the process and requires a restart.
# 1 - Hard crash of Ruby - D message before T message - I think I've seen this just once.
@zarqman thank you for trying the bleeding edge and opening this issue. That is not good! I just pushed up #797. Could you give that branch a try? I wasn't able to reproduce what you're seeing with a trivial reproduction application, but I fully believe that there can be autoloading problems/deadlocks with current Rails HEAD. I'm fairly certain this is related to recent updates to Active Record, but can't pinpoint yet exactly what. Something like either:
@bensheldon, thanks for looking into this! I just tried #797 @ fb3e386 and unfortunately got another error # 6 (
@zarqman nuts! Would you be able to share your project with me or set up a repro project?
I've set up a repro project here: https://github.com/zarqman/goodjob-rails71-test

I'm able to trigger some sort of error with

I did a lot of exploring as to why that might be and I think it comes down to how much code is being loaded by the Rails reloader. The proxy for that seems to be how many files are in

I haven't yet determined if my added parameters in

By setting

I also happened to see several instances of what appear to be unexpected inbound messages (from PG to Rails) on both 3.7.4 and

# 9 - inbound messages
The most common error continues to be # 6, however. Don't know if this is related or not, but I'm also sometimes seeing GoodJob take extra long to shut down. Instead of the normal < 1 second, it ends up being 5 (at which point
Thank you for the help with the reproduction repo 🙏🏻

I haven't had much luck getting a reproduction of these exceptions, but I have seen the situation where GoodJob fails to shut down on SIGINT and foreman has to send it a SIGKILL. That's not right.

Here's my debugging log 😊 https://ruby.social/@bensheldon/109713413526769074

I was running these in separate terminal windows to try to trigger the deadlock:
I feel somewhat confident that the problem is that the database connection is being accessed by multiple threads and that's bad. But I'm not quite sure which threads would be doing it. Still figuring that out.
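The shared-connection suspicion can be illustrated without any of the real stack. Below is a minimal plain-Ruby sketch, with a made-up stand-in object rather than a real `pg` connection: like a socket-based connection, it must finish one query before the next begins, so unsynchronized access from several threads interleaves, while a mutex serializes it.

```ruby
# FakeConnection is an invented stand-in for a DB connection, not pg internals.
# It "errors" whenever a second thread starts a query mid-flight.
class FakeConnection
  def initialize
    @busy = false
    @errors = 0
  end

  attr_reader :errors

  def query(_sql)
    if @busy
      @errors += 1      # a real connection would raise or return garbled results
    else
      @busy = true
      sleep 0.0001      # simulate the wire round-trip
      @busy = false
    end
  end
end

unsafe = FakeConnection.new
4.times.map { Thread.new { 200.times { unsafe.query("SELECT 1") } } }.each(&:join)
puts "unsynchronized errors: #{unsafe.errors}"   # almost always > 0 on MRI

safe = FakeConnection.new
lock = Mutex.new
4.times.map do
  Thread.new { 200.times { lock.synchronize { safe.query("SELECT 1") } } }
end.each(&:join)
puts "with mutex: #{safe.errors}"                # always 0
```

This is only a model of the symptom class, of course; it says nothing about which thread in GoodJob/ActiveRecord is the one doing the sharing.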
@bensheldon I don't know if I clarified how I'm triggering the errors. They all happen at launch. So, my process has effectively been:
Based on the variety of errors, I agree that the likely problem is the db connection being shared across threads. Your note of
@bensheldon I just had a chance to try v3.12.1 and same issue, unfortunately. 😢 Wish I had a better report!
@bensheldon I spent some time this evening trying to debug this further.

A

Near as I can tell at this point, the connection that

I'm suspicious that the DB connection is being cached somewhere inside

Even though I've explored adding a check in

What's interesting is

It's possible that the underlying issue has been there for some time, but the conservative locking in Rails <= 7.0 compensated for it. With the recent change in locking, it's now exposed.

I can also confirm (and the backtraces in earlier comments show) that ActiveRecord is using NullLock when
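The "connection cached somewhere inside" failure mode described above can be sketched in isolation. This is plain Ruby with invented names (`LeakySchemaCache` and the connection strings are illustrative, not actual ActiveRecord internals): an object memoizes the first connection it is handed, so a later caller on another thread silently reuses a connection owned by a different thread.

```ruby
# A schema-cache-like object that memoizes the first connection it sees.
class LeakySchemaCache
  def columns(connection)
    @connection ||= connection   # first caller's connection is cached...
    @connection                  # ...and handed back to everyone afterwards
  end
end

cache = LeakySchemaCache.new
notifier_conn = "connection-owned-by-notifier-thread"
worker_conn   = "connection-owned-by-worker-thread"

used = nil
Thread.new { cache.columns(notifier_conn) }.join   # Notifier primes the cache
Thread.new { used = cache.columns(worker_conn) }.join

puts used   # => "connection-owned-by-notifier-thread", not the worker's own
```

Once the cache is primed, every thread's schema lookups go through a connection that some other thread may be using concurrently, which matches the cross-thread symptoms in the backtraces.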
Thank you, really, for continuing to dig into this. I want to flag the really dumb/genius thing I'm doing, which is reusing the Notifier's connection for Process via this thing: https://github.com/bensheldon/good_job/blob/main/lib/good_job/assignable_connection.rb

I can speculate that something (schema cache?) is happening when the connection is assigned/overridden, and then maybe Active Record is storing that connection somewhere unexpected so it's not truly "just for the life of the AssignableConnection block" and it's leaking into usage by other models.

Re: NullLock. I think the intention behind it is correct: connections should implicitly be isolated within threads already, so a mutex isn't necessary unless they are, and the only place they possibly could be is in tests / System tests.

I'm pretty sure AssignableConnection is the culprit. Though it would be nice to know how the connection is being leaked out.
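For readers following along, the override-and-restore pattern that AssignableConnection implements can be sketched roughly like this (all names here are invented for illustration; see the linked file for the real implementation):

```ruby
# Model stands in for an ActiveRecord model class with a swappable connection.
class Model
  class << self
    attr_accessor :connection
  end
  self.connection = "pool-connection"

  # Temporarily override the connection the class uses, restoring it afterwards.
  def self.with_connection_override(conn)
    original = connection
    self.connection = conn
    yield
  ensure
    self.connection = original
  end
end

Model.with_connection_override("notifier-connection") do
  puts Model.connection   # => "notifier-connection"
end
puts Model.connection     # => "pool-connection", restored after the block
```

The pattern itself restores cleanly; the speculation in this thread is that something downstream (e.g. the schema cache) holds on to the overridden connection beyond the block's lifetime.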
@bensheldon, the AssignableConnection stuff might aggravate this a little, but at this point I don't think it's the underlying cause. I'm also not sure it's wrong.

If I'm understanding things right, turns out that

To that end, I can conceive of ways to trigger this in multi-threaded code without wonky connection reassignments. If Notifier used its reserved connection to make queries that triggered schema queries, that would likely trigger this as well--all without AssignableConnection.

I suspect the underlying issue has been there a long time, but when connections had their own locking, that effectively covered over it. With those locks replaced by
@zarqman oh yeah! I see that now. Yeah, I do not think that SchemaCache is threadsafe. To walk through what I'm seeing:

Here is where the current connection (locked to the current thread) is passed down:

...and into this un-threadsafe method with an assignment. There should probably be a mutex around that cached assignment

^^ this is not threadsafe.

I want to propose that this could simply (?) be made threadsafe by turning this attr_accessor into a thread_cattr_accessor and use that instead of the

...though I think it would be better (though a much bigger change) for the SchemaCache to instead take a reference to the connection pool and use that to access the current thread's connection without caching a reference to a connection at all. I dunno, what do you think? Is this something you'd be able to share/propose upstream in Rails/Rails?

I think you're spot on.
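The proposed `attr_accessor` → `thread_cattr_accessor` change boils down to per-thread versus class-wide storage. A plain-Ruby emulation (using `Thread.current` to stand in for ActiveSupport's `thread_cattr_accessor`; both classes and the key name are invented for illustration):

```ruby
# One class-wide slot: every thread reads and writes the same value.
class SharedCache
  class << self
    attr_accessor :connection
  end
end

# One slot per thread, emulating ActiveSupport's thread_cattr_accessor.
class ThreadLocalCache
  def self.connection
    Thread.current[:tlc_connection]
  end

  def self.connection=(conn)
    Thread.current[:tlc_connection] = conn
  end
end

results = []
threads = 2.times.map do |i|
  Thread.new do
    SharedCache.connection = "conn-#{i}"
    ThreadLocalCache.connection = "conn-#{i}"
    sleep 0.01   # give the other thread time to clobber the shared slot
    results << [SharedCache.connection, ThreadLocalCache.connection, "conn-#{i}"]
  end
end
threads.each(&:join)

results.each do |shared, local, expected|
  puts "expected #{expected}: thread-local saw #{local}, shared saw #{shared}"
end
```

The thread-local accessor always hands each thread back its own value; the shared accessor hands at least one thread a connection set by the other, which is exactly the cross-thread leak under discussion.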
This was my thought as well, so it's encouraging that you arrived at the same conclusion! 💯
The question there is what to do with connections that aren't part of the connection pool, such as with Notifier after it removes its connection from the pool. In that particular case, Notifier's thread doesn't have a pool-based connection, but would need one during Process.register (just for the schema parts). Either the external assignment is still needed (in which case, is the bigger change warranted at all?) or a fallback mechanism of checking out a new, separate connection becomes needed (and checking that back in looks tricky at first glance). The fallback is an even bigger change, and would also end up requiring an additional connection from the pool itself--something the Rails folks seem (rightly) hesitant to go for. Anyway, yes, I'm working on a PR for Rails now as it'd be great to get this fixed at the source.
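The checkout/checkin fallback discussed here can be sketched with a toy pool. This assumes nothing about Rails' real connection pool API; `TinyPool` and its methods are invented for illustration, and mainly show why "checking that back in" wants an `ensure` block.

```ruby
# A toy fixed-size connection pool backed by a thread-safe Queue.
class TinyPool
  def initialize(size)
    @available = Queue.new
    size.times { |i| @available << "conn-#{i}" }
  end

  # Borrow a connection just for the block, returning it even on error.
  def with_connection
    conn = @available.pop        # blocks if the pool is exhausted
    yield conn
  ensure
    @available << conn if conn   # always check the connection back in
  end

  def size
    @available.size
  end
end

pool = TinyPool.new(2)
pool.with_connection { |c| puts "schema query on #{c}" }

begin
  pool.with_connection { raise "query failed mid-flight" }
rescue RuntimeError
  # the connection was still returned by the ensure clause
end

puts pool.size   # => 2, nothing leaked despite the failing block
```

Even in this toy form, the cost the comment mentions is visible: the fallback consumes an extra slot from the pool for the duration of the schema query, which is the part Rails folks are hesitant about.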
@zarqman I just saw you opened up rails/rails#38577. Seems like we may just need to wait on Rails Core.
rails/rails#47582 is the new PR (just a typo above?). That other PR (rails/rails#38577) is the same issue though, just with Sidekiq. That makes sense since both GoodJob and Sidekiq spin up multiple threads upon launch. I've been asked to produce a more real-life test demonstrating the underlying problem, so that's the next step.
Oh no, Rails is so slow on accepting fixes (attention to detail). :) @bensheldon has been known to fix my bug reports within minutes of reporting, hehe. Shame it was a Rails thing, but c'est la vie.
fyi, this is currently in the Rails 7.1 release milestone so I'm hopeful a fix is coming.
@zarqman it looks like the upstream Rails fix has been merged 🎉
@bensheldon, excellent! And thanks again for your help in figuring out the underlying issue!
I'm running Rails 7.1 from HEAD on a new app and have run into several strange exceptions coming from somewhere in the GoodJob / Rails / pg gem stack.
As you'll see, the errors vary. I'm suspicious of a DB connection being accidentally shared between threads, but am not really sure if it's in GoodJob, ActiveRecord, or even the pg gem. It appears that Rails 7.1 made some optimizations to the threading/locking in ActiveRecord back a couple (few?) months ago, but I can't say for sure if that's related.
The errors all seem to happen right when GoodJob is first loading and making its first queries. If things load without error, the process remains stable. Further, depending on the initial error, the process may stabilize afterward. For other errors, something gets corrupted inside ActiveRecord (schema-type information it seems), and subsequent queries continue to error. In one instance, Ruby itself crashed with a segfault.
I'm posting this here to start, but I'm not fully sure if the issue lies within GoodJob or elsewhere.
Context
I'm running GoodJob in a separate process, via bin/good_job, which is where these show up. I do not recall seeing errors from inside the Rails server process (when scheduling GoodJob jobs or otherwise). I have not tried async or any other modes to compare.
- good_job 3.7.4
- rails 7.1.0.alpha from main as of Jan 14, but examples below vary from today back to Jan 1
- pg 1.4.5
- ruby 3.1.3, on macOS 12.6.2

I also tried 2-3 earlier versions of pg and forced a full recompile of 1.4.5. And, I also tried without bootsnap, but that made no difference.

The errors
As a general rule, I silence ActiveRecord queries in development, so many of these backtraces don't include the queries. I apologize. I'll start with the newest, since it does show them.
(Numbering comes from my own notes; please excuse skipped numbers.)
# 6 - map_types! - This one seems to be the most common.
# 5 - result.clear - I think I've seen this just once.
# 4 - Length error in message type Z