sql: cluster performance degrades after adding and dropping hundreds of tables #20753
Comments
We've run clusters with tens of thousands of replicas and not seen significant performance degradation due to the number of replicas. Can you show the graph of "quiescent" replicas? As long as most of the replicas are quiescent, the number doesn't matter too much. (That's not completely true; quiescent replicas still consume a very tiny amount of CPU.) Regardless of whether it is the number of replicas or something else, this deserves investigating. Assigning to @nvanbenschoten for triage. |
Thanks for filing @LEGO. @petermattis I'll take this one myself. |
Got it. All the replicas do become quiescent. Nothing else stands out about the graphs, just that SQL latency climbs and KV latency climbs (but far less: roughly 5ms at the start to 10ms during bad performance). I'm going to give a shot at running these tests on VMs and a cluster, as there might be other factors at play. I did think it might be a consequence of running it locally, but even when the tests run with no other activity going on, CockroachDB doesn't increase much in CPU or memory usage yet still has trouble. For reference, here are all the queries executed. It's quite noisy; I'll try to narrow down some of the problems. There are two files, one for the queries that create the data and one that runs test cases. It also appears that they both run twice; the second set of queries starts running a bit before the first session finishes. |
So I created a short Go program to record the elapsed time for 5 different queries:
```sql
SELECT t2.oid::REGCLASS::TEXT AS to_table,
       a1.attname AS column,
       a2.attname AS primary_key,
       c.conname AS name,
       c.confupdtype AS on_update,
       c.confdeltype AS on_delete
FROM pg_constraint AS c
JOIN pg_class AS t1 ON c.conrelid = t1.oid
JOIN pg_class AS t2 ON c.confrelid = t2.oid
JOIN pg_attribute AS a1 ON (a1.attnum = c.conkey[1]) AND (a1.attrelid = t1.oid)
JOIN pg_attribute AS a2 ON (a2.attnum = c.confkey[1]) AND (a2.attrelid = t2.oid)
JOIN pg_namespace AS t3 ON c.connamespace = t3.oid
WHERE ((c.contype = 'f') AND (t1.relname = <TABLENAME>)) AND (t3.nspname = ANY (current_schemas(false)))
ORDER BY c.conname
```
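For reference, here is a minimal sketch of the kind of timing harness described above; it is not the gist itself. The connection string, the table name `t0`, and the three placeholder statements are assumptions, and it uses lib/pq since CockroachDB speaks the Postgres wire protocol:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Assumed connection string: a local, insecure, single-node cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Placeholder statements standing in for the benchmarked queries.
	queries := []struct{ name, stmt string }{
		{"create", `CREATE TABLE IF NOT EXISTS t0 (id INT PRIMARY KEY)`},
		{"drop", `DROP TABLE IF EXISTS t0`},
		{"introspect", `SELECT count(*) FROM pg_catalog.pg_constraint`},
	}

	for i := 0; i < 100; i++ {
		for _, q := range queries {
			start := time.Now()
			if _, err := db.Exec(q.stmt); err != nil {
				log.Fatalf("%s: %v", q.name, err)
			}
			// CSV output that can be post-processed into a rolling average.
			fmt.Printf("%d,%s,%f\n", i, q.name, time.Since(start).Seconds())
		}
	}
}
```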
Here's a rolling average of each of the queries. The latency increase is related to the dropped tables and only applies to the DROP TABLE and introspective queries; CREATE TABLE has a marginal increase in latency. I haven't modified the GC time yet in my tests. (The code for benchmarking and plotting can be found in this gist.) A quick guess is that getting all of the descriptors and/or iterating over them could be slow for the introspective query. |
@LEGO let me know if you're planning on continuing this investigation. If not, I'd be happy to take it over. |
I discussed things a bit more with @nvanbenschoten but didn't get to fully investigate this. This is the trace for the (rather long, joining) introspection query.
This is the trace for the DROP TABLE statement.
I also re-ran the experiment. The second span, for DROP TABLE, does not seem to reflect the time spent within the transaction; I would guess a big chunk of it comes from work done after the transaction. During these tests the GC time was not modified, so I would assume the table descriptors were still hanging around (although the single-node cluster was down for a week; should they have been cleaned out?). |
Thanks for the work here @LEGO! I played around with your gist to get more insight into why these queries slow down, and dug deeper into the cause of the slowness. I'm not familiar enough with how the system config works to know what the best course of action here is. Overall, it seems like there could be a few easy wins related to gossiping the system config, now that we know it's a bottleneck for certain access patterns:
- we could make the process beneath Raft asynchronous
- we could coalesce multiple calls to MaybeGossipSystemConfig together
- we could simply drop calls to MaybeGossipSystemConfig when we see too many
- we might be able to do a better job keeping the size of the system config down when tables are deleted
The method is "best effort", but I don't know the implications of a failure to gossip the system config well enough to weigh the relative tradeoffs here. @vivekmenezes or @mberhault, perhaps you could shed some light on this for me. |
Nice debugging!

> we could make the process beneath Raft asynchronous

Yes, we should definitely do this.

> we could coalesce multiple calls to MaybeGossipSystemConfig together

Yes.

> we could simply drop calls to MaybeGossipSystemConfig when we see too many

I don't think we can drop the calls, as schema changes rely on nodes getting the updated table descriptors gossiped in a timely manner. I think aggressive coalescing would be sufficient. This would be similar to how we coalesce WAL sync calls in rocksdb.go (see the sketch below this comment). Something I'm not clear on: is the test creating and dropping tables fast enough that coalescing of the system config gossip calls would have an effect?

> we might be able to do a better job keeping the size of the system config down when tables are deleted

Are we writing a tombstone for deleted tables? In your testing here, how large is the system config that is gossiped? How many key/values does it contain? #21334 might help significantly, as scanning over tombstones will take place entirely in C++.

> The method is "best effort", but I don't know the implications of a failure to gossip the system config well enough to weigh the relative tradeoffs here.

If we don't gossip the system config, schema changes will not make progress. There may be some other badness as well. I'm pretty sure nothing relies on the system config being gossiped synchronously. |
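Purely to illustrate the coalescing idea discussed above (this is not CockroachDB's implementation; `gossipCoalescer`, `notify`, and the 50ms window are invented), a pattern similar in spirit to WAL-sync coalescing might look like this:

```go
package main

import (
	"fmt"
	"time"
)

// gossipCoalescer collapses bursts of "system config changed" notifications
// into a single gossip call. gossipFn is a stand-in for the real work
// (e.g. something like MaybeGossipSystemConfig).
type gossipCoalescer struct {
	trigger  chan struct{}
	gossipFn func()
}

func newGossipCoalescer(gossipFn func()) *gossipCoalescer {
	c := &gossipCoalescer{
		// Buffer of 1: one pending notification is remembered, extras are dropped.
		trigger:  make(chan struct{}, 1),
		gossipFn: gossipFn,
	}
	go c.loop()
	return c
}

// notify is called every time the system config may have changed.
func (c *gossipCoalescer) notify() {
	select {
	case c.trigger <- struct{}{}:
	default: // a gossip is already pending; coalesce
	}
}

func (c *gossipCoalescer) loop() {
	const minInterval = 50 * time.Millisecond // assumed coalescing window
	for range c.trigger {
		c.gossipFn()
		// Wait a bit before honoring the next trigger so that a burst of
		// CREATE/DROP TABLE statements results in a bounded number of gossips.
		time.Sleep(minInterval)
	}
}

func main() {
	c := newGossipCoalescer(func() { fmt.Println("gossiping system config at", time.Now()) })
	for i := 0; i < 100; i++ { // simulate a burst of schema changes
		c.notify()
	}
	time.Sleep(200 * time.Millisecond)
}
```

The buffered channel of size one is what provides the coalescing: any number of notifications arriving while a gossip is in flight collapse into a single pending trigger.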
Thanks for debugging. This is also what I saw when debugging another issue related to TRUNCATE being slow when executed on 100 tables. Looks like changing the gossiping to be coalesced and asynchronous is the way to go. Thanks. |
Related to #19267. |
Adding a new systemConfigGossipQueue helps the isolated benchmarks. Unfortunately, those numbers don't translate over to the full workload @LEGO provided; in fact, when run with the full workload, the change only seems to provide about a 20% improvement. I spent a while trying to figure out why, and it looks like the issue has to do with table leases. After creating each table, the full workload inserts into and queries the table. These operations require a table lease to be acquired on each table. Unfortunately, the acquisition of this table lease re-establishes a dependency that the subsequent DROP TABLE then has to wait on. This is pretty bad. The more I explore, the further away the low-hanging fruit appears.
The only change I've made so far that actually seems worthwhile performance-wise is caching results on the receiving side of the system config gossip (see pkg/sql/schema_changer.go, lines 1029 to 1031 at a81e484). That helps the DROP TABLE-only workload (without the systemConfigGossipQueue). I suspect that this technique could be applied in a number of places. |
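As a rough, hypothetical sketch of "caching results on the receiving side" (the real hook lives in schema_changer.go; every type and name below is invented), the expensive decoding step can be memoized against a fingerprint of the incoming config:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// rawConfig stands in for the gossiped system config: raw key/value pairs
// that the receiver would otherwise re-decode on every gossip update.
type rawConfig struct {
	values [][]byte
}

// descriptor stands in for a decoded table descriptor.
type descriptor struct {
	id int
}

type configCache struct {
	lastHash [32]byte
	decoded  []descriptor
}

// decode is the expensive step we want to avoid repeating (in the real code,
// protobuf unmarshalling of every descriptor in the config).
func decode(cfg rawConfig) []descriptor {
	out := make([]descriptor, 0, len(cfg.values))
	for i := range cfg.values {
		out = append(out, descriptor{id: i})
	}
	return out
}

// get returns cached results when the incoming config bytes are unchanged.
func (c *configCache) get(cfg rawConfig) []descriptor {
	h := sha256.New()
	for _, v := range cfg.values {
		h.Write(v)
	}
	var sum [32]byte
	copy(sum[:], h.Sum(nil))
	if sum == c.lastHash && c.decoded != nil {
		return c.decoded // gossip re-delivered an identical config; skip decoding
	}
	c.lastHash = sum
	c.decoded = decode(cfg)
	return c.decoded
}

func main() {
	cache := &configCache{}
	cfg := rawConfig{values: [][]byte{[]byte("tbl1"), []byte("tbl2")}}
	fmt.Println(len(cache.get(cfg)), len(cache.get(cfg))) // second call hits the cache
}
```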
The last time I looked at this I remember LeaseStore.WaitForOneVersion() having rather unpredictable performance. I think it might be the case that it polls and each retry waits longer than it needs to. There could also be retries caused by transaction retries due to contention. |
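To make the "each retry waits longer than it needs to" concern concrete, here is a schematic poll loop with exponential backoff; it is not the LeaseStore code, and the intervals are made up. A condition that becomes true right after a check is only noticed at the next wakeup, which can be a large fraction of the total latency:

```go
package main

import (
	"fmt"
	"time"
)

// waitForCondition polls check() with exponential backoff. The worst-case
// extra latency is roughly the size of the last backoff interval, which is
// why this kind of loop can look unpredictable in benchmarks.
func waitForCondition(check func() bool) time.Duration {
	start := time.Now()
	backoff := 10 * time.Millisecond // assumed initial interval
	const maxBackoff = 2 * time.Second
	for !check() {
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return time.Since(start)
}

func main() {
	// The condition becomes true after 300ms, but the poller only notices it
	// at the next wakeup, which may be considerably later.
	deadline := time.Now().Add(300 * time.Millisecond)
	elapsed := waitForCondition(func() bool { return time.Now().After(deadline) })
	fmt.Printf("condition true after 300ms, observed after %v\n", elapsed)
}
```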
Ah, I had forgotten about that recent change.
Yes, these timings make sense. Thinking a bit radically here, perhaps we can gossip only the changed system config values. We'd still need a fallback path to gossip the entire system config range when a replica acquires the lease for the system config range, but when mutating a small set of keys it should be safe to use a different gossip key for just those changes. |
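A toy model of this delta-gossip idea, with invented key names and types, just to show why a full-snapshot fallback is still needed:

```go
package main

import "fmt"

// A toy model of the "gossip only the deltas" idea from the comment above.
// Keys and structure are invented for illustration.

type kv = map[string]string

type configReceiver struct {
	full kv // last full system config received
}

// applyFull models the existing behavior: the entire config is re-gossiped.
func (r *configReceiver) applyFull(cfg kv) {
	r.full = make(kv, len(cfg))
	for k, v := range cfg {
		r.full[k] = v
	}
}

// applyDelta models gossiping only the mutated keys under a separate,
// hypothetical gossip key; it is only safe if a full snapshot was delivered
// first, hence the fallback path on lease acquisition.
func (r *configReceiver) applyDelta(delta kv) {
	if r.full == nil {
		return // no baseline yet; must wait for a full gossip
	}
	for k, v := range delta {
		r.full[k] = v
	}
}

func main() {
	r := &configReceiver{}
	r.applyFull(kv{"descriptor/51": "t1", "descriptor/52": "t2"})
	r.applyDelta(kv{"descriptor/53": "t3"}) // a single CREATE TABLE gossips one key
	fmt.Println(len(r.full))                // 3
}
```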
Yes, LeaseStore.WaitForOneVersion is polling.
I was thinking about ideas like this, but I don't think gossip is the right transport mechanism here because it is best-effort, asynchronous, and could lose deltas if multiple updates overwrote one another. This actually seems like a perfect use case for CDC, where a listener could monitor the SystemConfig range directly and be informed of any updates immediately. In the long term I agree that this is something that can be improved tremendously. For now, I've been able to make a series of changes that both capture the previous gains seen in #20753 (comment) and improve the situation for cases like the full workload where a table has outstanding leases. Below are graphs using @LEGO's workload before the change and after. The change is on the branch nvanbenschoten/gossipQueue. This is an extreme work-in-progress consisting of several changes.
Each of these changes can probably be broken into separate PRs as they're all pretty self-contained. |
I also did some investigation into the In the case here, the |
Related to cockroachdb#20753.

Until now, the population of most virtual tables (information_schema, pg_catalog, crdb_internal) has required a kv scan over all descriptors. This became expensive for complex introspection queries that joined multiple virtual tables together, since each would require its own scan to collect the descriptors. To make this worse, certain workloads like adding and dropping tables repeatedly could generate lots of descriptors, making each of these scans take a significant amount of time.

This was pretty bad because we knew that each scan would return the same results when inside the same query. In fact, we knew that even across statements but within the same transaction, the results would almost always be the same unless the transaction itself had done a schema change.

To address this issue, this change now caches all DescriptorProtos in TableCollection, clearing them only when necessary. TableCollection already managed the caching of descriptor-related state like this and made sure to purge the cache on schema changes and txn restarts, so it was a natural place for this extra bit of caching to live.

Release note (performance improvement): information_schema and pg_catalog should be faster to query.
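A rough sketch of the caching pattern that commit describes, with invented stand-ins for TableCollection and the descriptor scan; the point is that the scan result is reused within a transaction and purged on schema changes or txn restarts:

```go
package main

import "fmt"

// descriptor stands in for a decoded DescriptorProto.
type descriptor struct{ name string }

// tableCollection mimics the role of TableCollection in the commit above:
// per-transaction descriptor state with explicit invalidation.
type tableCollection struct {
	allDescriptors []descriptor // nil means "not cached"
	kvScans        int          // for demonstration only
}

// getAllDescriptors returns the cached descriptors, scanning the (fake) kv
// store only on the first call within the transaction.
func (tc *tableCollection) getAllDescriptors() []descriptor {
	if tc.allDescriptors == nil {
		tc.kvScans++
		// In the real code this is a kv scan over all descriptors; multiple
		// virtual tables joined in one query would otherwise repeat it.
		tc.allDescriptors = []descriptor{{"users"}, {"orders"}}
	}
	return tc.allDescriptors
}

// releaseAllDescriptors would be called on schema changes and txn restarts so
// that stale descriptors are never served.
func (tc *tableCollection) releaseAllDescriptors() {
	tc.allDescriptors = nil
}

func main() {
	tc := &tableCollection{}
	_ = tc.getAllDescriptors() // first virtual table population: scans
	_ = tc.getAllDescriptors() // second one in the same query: cache hit
	tc.releaseAllDescriptors() // e.g. the txn performed a schema change
	_ = tc.getAllDescriptors() // scans again
	fmt.Println("kv scans:", tc.kvScans) // 2
}
```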
@nvanbenschoten is this core or is this sql? (Looks more sql to me?) |
While running Ruby's ActiveRecord test suite, each test file will drop and add 185 tables.
After running the tests for a while, the number of replicas grows to several thousand and the tests become extremely slow. As the replica count increased, so did the 99th percentile latency.
Running the same test file continuously, it took around 2s total against a fresh database, with a 99th percentile latency of about 20ms. Over time that would increase to the point where the file would consistently take at least 10s to run, and often 20s-30s, with the 99th percentile latency over 200ms.
Something to note is that dropping and re-creating the database did not help performance or reduce the replica count. The only way to bring the latency back down was to completely delete the database files.
These were the graphs of the database while running it. Mind the break in activity as there was a period where no tests were being run.