
sql: expose table on-disk size #20712

Open
tbg opened this issue Dec 14, 2017 · 38 comments · May be fixed by #59604
Labels
A-sql-encoding Relating to the SQL/KV encoding. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability

Comments

@tbg
Member

tbg commented Dec 14, 2017

As of #20627, the admin UI computes the approximate on-disk size of tables. However, it makes sense to expose this information programmatically via SQL. All that's needed for this is to find a place to expose it, and to populate it from (*adminServer).TableStats. This call isn't extremely cheap (it fans out across the cluster and needs to scan a subset of the meta entries), but caching could be added as necessary.

Perhaps there is a connection to be made with #20323. It seems reasonable that approximate on-disk size for a table would be exposed in SELECT .. FROM system.table_statistics. (cc @RaduBerinde).

@dianasaur323 for triage.

Jira issue: CRDB-5915

Epic: CRDB-24527

@dianasaur323
Contributor

@RaduBerinde and perhaps @tschottdorf, do you mind giving an approximate estimate of how long this would take? It doesn't sound like it would take too long. SELECT .. FROM system.table_statistics makes sense to me.

@petermattis
Collaborator

system.table_statistics is meant for powering query planning. It is possible that it could contain approximate disk size, but I see some impedance mismatch. For example, I believe that table only contains stats on the primary index right now, while for the admin UI I imagine you'd want to include the approximate disk size of indexes too. Also, the refresh rate of system.table_statistics might be too low for the admin UI. Lastly, table statistics are only going to be populated on demand (via a CREATE STATISTICS call) for 2.0.

Adding a crdb_internal virtual table (or a column to an existing table) and powering it via a fan-out TableStats might be more tenable in the near term.
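
For illustration, a strawman of what querying such a virtual table could look like (the table and column names here are hypothetical, not an existing interface):

    -- Hypothetical: a crdb_internal vtable backed by the TableStats fan-out;
    -- neither crdb_internal.table_disk_usage nor these columns exist today.
    SELECT table_id, approx_disk_bytes
      FROM crdb_internal.table_disk_usage
     WHERE table_id = 53;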

@dianasaur323
Contributor

dianasaur323 commented Dec 14, 2017

@petermattis makes sense to me, although my understanding was that we don't do much in terms of documenting crdb_internal. @jseldess correct me if I'm wrong?

I think we are rapidly coming to the point where we have to consider exposing metrics in the admin UI also through either SQL or some other API for users trying to automate things based on metrics, so I'm a bit concerned about choosing one path here that we then have to change when we finally have time to think about how to expose these metrics in a way that is easily consumable by users in the format they want. What are your thoughts on that?

@jseldess
Contributor

That was the basic agreement in the past, yes, @dianasaur323. @knz was involved in that decision. Could we add a SQL statement that queries that internal table?

@petermattis
Collaborator

I think we are rapidly coming to the point where we have to consider exposing metrics in the admin UI also through either SQL or some other API for users trying to automate things based on metrics, so I'm a bit concerned about choosing one path here that we then have to change when we finally have time to think about how to expose these metrics in a way that is easily consumable by users in the format they want. What are your thoughts on that?

I'm in agreement that we should be able to access our internal metrics via SQL. Adding SQL statements for each of these would likely result in an explosion of statements. I believe the hesitance in the past to document crdb_internal was to give ourselves freedom to make backwards-incompatible changes. But we can either revisit that decision or design another set of virtual tables for which we promise to maintain compatibility.

@dianasaur323
Contributor

dianasaur323 commented Dec 14, 2017 via email

@tbg
Member Author

tbg commented Dec 14, 2017

I think the virtual table makes sense for now. Querying this programmatically is generally a little dangerous too because the work done in computing these quantities isn't trivial, so we shouldn't exactly advertise it.

But this also constrains us somewhat: if we make this a virtual table with strawman schema (table_id, approx_size_bytes), will any query to this table compute the stats for all tables? We should avoid that.

If that restriction makes things difficult, a more straightforward way is to make that a function (so that it can only be invoked for individual tableIDs).
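
For illustration, the two strawman shapes side by side (all names here are hypothetical):

    -- Strawman vtable: an unconstrained scan risks computing stats for all tables.
    SELECT approx_size_bytes FROM crdb_internal.table_sizes WHERE table_id = 53;

    -- Strawman builtin: can only be invoked for a single table ID at a time.
    SELECT crdb_internal.approx_table_size(53);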

@RaduBerinde
Member

RaduBerinde commented Dec 15, 2017 via email

@dianasaur323
Contributor

@vivekmenezes do you mind seeing if anyone would have time to do this? It would be nice, since it is a customer request.

@vivekmenezes
Contributor

https://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE

The entire page is a list of the various admin related functions that we could support. I'm in favor of us figuring out what we want to support and building a project around it rather than doing a one off feature request.
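
For reference, the PostgreSQL size functions on that page are typically used like this ('mydb' and 'mytable' are placeholders):

    SELECT pg_size_pretty(pg_database_size('mydb'));          -- whole database
    SELECT pg_size_pretty(pg_table_size('mytable'));          -- table, excluding indexes
    SELECT pg_size_pretty(pg_total_relation_size('mytable')); -- table plus indexes and TOAST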

@tbg
Member Author

tbg commented Dec 18, 2017

@vivekmenezes I think we should leverage the flexibility we have in crdb_internal to iterate quickly, and stabilize things once they've been in for a while (moving them out of crdb_internal or coming up with some other scheme). It's hard to communicate to a user that we're not going to provide necessary functionality because we want to polish it first.

@knz
Contributor

knz commented Dec 18, 2017

👍 on publicizing a set of vtables that we're willing to expose in docs and to users. I would, however, use two separate namespaces -- perhaps crdb_internal for those we want to expose and crdb_experimental as a "staging area" for features we're not sure about yet.

I'd suggest migrating most of those we have in crdb_internal to crdb_experimental, with the exception of the few that users want and we can document.

Also I think it's OK to have some redundancy between the namespaces.

@dianasaur323
Contributor

@knz that sounds like a good approach. cc @jseldess so that he is aware that we may have to account for time to document this

@bdarnell
Contributor

I'd keep crdb_internal for the experimental/fast-moving stuff and maybe crdb_system for things that we want users to be able to rely on.

@jordanlewis
Member

Just want to put a soft -1 on introducing another top-level virtual table namespace until we nail down what's going on with our schemas vs databases.

UI tools show crdb_internal and system as top-level databases and schemas that are present in all databases.

Perhaps we could use this moment to discuss the UX around adding a top-level database that contains several schemas for internal purposes. For example, the crdb database could contain internal, system (our current system db), and some other schema that represents this reliable external interface.

@dianasaur323
Contributor

I agree that we should have that larger discussion - the things we expose to users are getting pretty spread out across different interfaces. That being said, I think it sounds like it's time for someone to take ownership of this and come up with a more complete proposal.

@vivekmenezes
Contributor

Anyone against implementing

pg_table_size(regclass)

for this issue?

@jordanlewis
Member

+1 to that suggestion - I've seen several PG admin UI tools that use that builtin.

@tbg
Member Author

tbg commented Dec 18, 2017

The API call we have right now gives the approximate total replicated size of a table and it needs to fan out across the cluster and also scans from the meta ranges. Hooking this up to pg_table_size is possibly something we could regret.

@vivekmenezes
Contributor

I imagine a system tracking table/database statistics and putting it in one place, and pg_table_size() looking up the stats table.
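
As a sketch of that idea (the table, columns, and refresh mechanism here are all hypothetical):

    -- Hypothetical stats table, refreshed periodically by a background job:
    CREATE TABLE system.table_size_stats (
        table_id     INT8 NOT NULL,
        approx_bytes INT8 NOT NULL,        -- approximate replicated on-disk size
        updated_at   TIMESTAMPTZ NOT NULL,
        PRIMARY KEY (table_id)
    );
    -- pg_table_size() would then reduce to a cheap lookup:
    --   SELECT approx_bytes FROM system.table_size_stats WHERE table_id = $1;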

@dianasaur323
Contributor

It seems like this project is expanding a bit in scope. How long do you think this would take to implement, and is it still reasonable to get this done by 2.0?

@vivekmenezes
Contributor

I doubt anyone's going to get to it anytime soon

@dianasaur323
Contributor

Let's do this in 2.1 then. Moving milestone.

@dianasaur323 dianasaur323 added this to the 2.1 milestone Dec 20, 2017
@charl

charl commented Apr 22, 2018

https://forum.cockroachlabs.com/t/index-size-on-disk/1519 is somewhat related.

When evaluating CRDB (and comparing it with other databases), as well as getting a handle on primary key choices to best suit one's use case, it's really useful to know how index size grows as the number of table rows increases.

@knz knz assigned awoods187 and unassigned dianasaur323 May 9, 2018
@knz knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label May 9, 2018
@brendesp

I need a programmatic way to determine the disk usage of each database as well. From what I can see, this is still not resolved. I was originally looking for TABLESPACE, but given the distributed nature I see why that would be difficult to manage.

Any ETA on the DB size issue?

@awoods187 awoods187 removed their assignment Mar 25, 2020
@robert-s-lee
Contributor

Is there an alternate way to estimate size from SQL?

@jseldess
Contributor

^ @awoods187

@RoachietheSupportRoach
Collaborator

Zendesk ticket #5125 has been linked to this issue.

@jordanlewis
Member

@tbg isn't this possible (but slow) with something like the following?

    select sum(range_size) from crdb_internal.ranges where database_name = 'foo' and table_name = 'bar';

@charl

charl commented Apr 14, 2020

@jordanlewis the query above does not provide any output on my v19.2.5 clusters.

What version were you testing this with?

@robert-s-lee
Contributor

On 19.2.5, the defaultdb is reported as 321.0 KiB by the Admin UI, while the query below returns 731 KiB. range_size is defined as (crdb_internal.range_stats(start_key)->>'key_bytes')::INT8 + (crdb_internal.range_stats(start_key)->>'val_bytes')::INT8. What is the difference in the data captured between the Admin UI and range_size?

> select sum(range_size)/1000 from crdb_internal.ranges where database_name = 'defaultdb';
  ?column?
+----------+
   731.563
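
For reference, expanding range_size per the definition quoted above, that query is equivalent to:

    -- same query with range_size expanded (MVCC key bytes plus value bytes per range)
    select sum(
             (crdb_internal.range_stats(start_key)->>'key_bytes')::INT8
           + (crdb_internal.range_stats(start_key)->>'val_bytes')::INT8
           ) / 1000
    from crdb_internal.ranges
    where database_name = 'defaultdb';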

@charl

charl commented Apr 15, 2020

From one of my production clusters:

user@host:26257/defaultdb> select sum(range_size)/1000 from crdb_internal.ranges where database_name = 'defaultdb';
  ?column?
+----------+
  NULL
(1 row)

Time: 31.687465ms

Version:

# cockroach version
Build Tag:    v19.2.5
Build Time:   2020/03/16 18:27:12
Distribution: CCL
Platform:     linux amd64 (x86_64-unknown-linux-gnu)
Go Version:   go1.12.12
C Compiler:   gcc 6.3.0
Build SHA-1:  4f36d0c62435596ca103454e113ebe8e55f005de
Build Type:   release

@robert-s-lee
Contributor

@charl it could show NULL if there are no tables in defaultdb:

show tables from defaultdb;
table_name
+------------+
(0 rows)

@charl

charl commented Apr 20, 2020

@robert-s-lee, gotcha.

user@host:26257/oneconfig> select sum(range_size)/1000 from crdb_internal.ranges where database_name = 'oneconfig';
   ?column?
+-------------+
  3697358.846
(1 row)

Time: 439.516065ms

@knz
Contributor

knz commented Apr 20, 2020

Note that anything based off crdb_internal.ranges is performing a cluster-wide query and is thus relatively expensive to compute, both in time and in network resources.

Also, anything based off crdb_internal accumulates all the rows in RAM before the client sees anything, so in this case, if you have many ranges, you can run out of RAM.

So all in all, the solutions discussed so far are band-aids and no substitute for a properly designed feature able to report table sizes without such a large expense in resources.

@rafiss
Collaborator

rafiss commented Aug 12, 2022

@kocoten1992

Hi there, any chance the PR will be merged?
