Requests Over Time, Tracked Easily Now is a project written to help optimize your app by letting you know which queries it is murdering your DB with. A reasonable person might ask, "Isn't this why we have pgBadger?" to which I would say that while pgBadger provides excellent analytics in a readable format,
- it requires full logging to not be misleading
- full logging is problematic when you're logging to a remote server for compliance reasons, and also already approaching the packets/sec limits of your hardware before enabling full logging.
- pgBadger reports are generated once for a static set of queries. If you want to drill in on a subset of data, that doesn't work.
rotten attemps to address these things. It does so by frequently looking at the helpful data gathered by pg_stat_statements, and then resetting those counters to get a fresh sample for the next interation. This has shortcomings, but it gets us 95% of the usefulness of logging all queries in order to see which are problematic, without any of the firehose problems that full logging can bring.
Instead of recording our analysis to text files, like pgBadger does, we just store the data in an additional database and punt on any form of UI. Yes, that's a bit of a cop out, but it also lets us employ the power of SQL to filter our data and pull whatever report we might want, for whatever time period we might want.
As a young project rotten makes a lot of assumptions. Among them:
- You are using pg_stat_statements and have no qualms about frequently resetting them.
- You are ok with not getting everything from pg_stat_statements, but rather the "most" interesting queries. Getting everything is orders of magnitude too expensive to be useful on busy systems, and anyway, this project is only trying to find the worst offenders, not an exhaustive snapshot.
- You are ok with a SQL prompt as a UI for now.
- You will have a role on your monitored dbs called "rotten-observer" and it will call some security definer functions (to query and reset pg_stat_statments in a "dba" schema.
- Your monitored databases have distinct identifiers of some kind (fqdn, IP, etc) as well as some logical identification ("the primary production server" or "cluster38 secondary").
- Get and install Go. http://www.golang.org
- Make sure you have modvendor installed:
go get -u github.com/goware/modvendor
- After setting up your $GOPATH and $GOBIN, build it:
go mod tidy
go mod vendor
~/go/bin/modvendor -copy="**/*.c **/*.h **/*.proto" -v
go build github.com/benchub/rotten
- Install pg_partman in the rotten db. See https://github.com/pgpartman/pg_partman. tldr:
- download pg_partman and
make install
- add
pg_partman_bgw
toshared_preload_libraries
in postgresql.conf - run
CREATE EXTENSION pg_partman
in the rotten db
- Import
schema/tables.sql
into the rotten db. Also create arotten-client
role that will use these tables. - Install
pg_stat_statements
in the monitored database:
- add
pg_stat_statements
toshared_preload_libraries
in postgresql.conf (this is a comma-separated string) - run
CREATE EXTENSION pg_stat_statements
in the monitored database
- Install the
rotten-observer
role in your monitored databases and theschema/function.sql
in thedba
schema on those databases. Useschema/function-pg13.sql
if you're on PG 13 or newer. - Unless you like to be webscale with tmux, script up some systemd services to run rotten.
- Modify the conf to fit your environment.
RottenDBConn
andObservedDBConn
are hopefully self-explanatory. Extra care has been given in rotten to make sure that rotten will correct send a root ca with all the needed intermediate certs, if you are working with such an environment.SanityCheck
is a query that will be run against the Observed DB before each window. Returning a boolean True value will tell rotten to proceed; a False will cause rotten to quit. The assumption is that systemd will keep restarting rotten until SanityCheck returns True, and also that you have a function you might call which tells you what the database you have connected to thinks it is. This is useful in environments where the host rotten is connecting to might not be what rotten intends. For example, you might want to be gathering statistics from a secondary database, but the secondary hostname might currently point to the primary server, while the secondary undergoes maintenance. While that might be exactly what most database clients would want, it's not helpful for rotten's purposes. Potentially worse, rotten would not know when to reconnect once maintenance is done, and so would stay connected to the primary until it dies or is manually restarted.StatusInterval
is how often to report status (in seconds) to its log.ObservationInterval
is how long (in seconds) to let pg_stat_statements gather info for. This is the most granular you can make your reports, and the lower you set this, the more data you will need to store in your rotten db.FQDN
is some unique string (typically the FQDN of the observed db) to help find a physical log if more information is desired other than the fingerprint.Project
,Environment
,Cluster
, andRole
are logical identifiers for where the samples of data are coming from.
- The code is ugly.
Um yeah quite a bit.
- make a UI
- allow for arbitrary logical source descriptions, not just Project/Environment/Cluster/Role
- allow for arbitrary locations of the functions in the observed db
- allow for an arbitrary observer role name other than "rotten-observer"
- allow the observation window to adjust size as needed for processing
- allow for a worker pool of reparse executions to speed things up in wall time
- configurable context, instead of the hardcoded controller/action/job_tag, with their hard-coded regexes
- keep a log of the queries we can't parse
- While we make an effort to normalize cursors and temp tables, those regexs should probably not be hardcoded.
- Collapse IN () and VALUES clauses of constants, so that IN (1,2,3) is the same as IN (1).