
Prometheus Metrics Cleanup and Expansion #2169

Closed

jackzampolin opened this issue Aug 28, 2018 · 32 comments

@jackzampolin
Member

jackzampolin commented Aug 28, 2018

Currently (excluding the go_*, promhttp_*, and process_* metrics), we have the following prometheus metrics exported:

| Metric | Type |
| --- | --- |
| consensus_block_interval_seconds | histogram |
| consensus_block_size_bytes | gauge |
| consensus_byzantine_validators | gauge |
| consensus_byzantine_validators_power | gauge |
| consensus_height | gauge |
| consensus_missing_validators | gauge |
| consensus_missing_validators_power | gauge |
| consensus_num_txs | gauge |
| consensus_rounds | gauge |
| consensus_total_txs | gauge |
| consensus_validators | gauge |
| consensus_validators_power | gauge |
| mempool_size | gauge |
| p2p_peers | gauge |

Several additional metrics have been requested by the validator community (tendermint/tendermint#2272):

| Metric | Type | Tags | Notes |
| --- | --- | --- | --- |
| consensus_latest_block_height | gauge | none | /status sync_info number |
| consensus_catching_up | gauge | none | either 0 (syncing) or 1 (synced) |
| p2p_peer_receive_bytes_total | counter | peer_id | for debugging p2p issues |
| p2p_peer_send_bytes_total | counter | peer_id | for debugging p2p issues |
| p2p_peer_pending_send_bytes | gauge | peer_id | for debugging p2p issues |

NOTE: We may be able to handle the p2p metrics by exporting the SendMonitor and RecvMonitor readings as histograms.

Other metrics that would be amazing to add:

| Metric | Type | Tags | Notes |
| --- | --- | --- | --- |
| mempool_num_txs | gauge | none | the number of transactions in the mempool at time of poll |
| mempool_tx_size_bytes | histogram | none | a histogram of transaction size |
| mempool_failed_txs | gauge | none | number of failed transactions |
| mempool_recheck_times | histogram | none | number of transactions rechecked in the mempool |
| p2p_num_txs | gauge | peer_id | number of transactions submitted by each peer_id |
| p2p_block_parts | gauge | peer_id | number of blockparts transmitted by peer |
| p2p_pending_send_bytes | gauge | peer_id | amount of data pending to be sent to peer |
| consensus_block_gas | gauge | none | amount of gas in the previous block (needs work) |
| consensus_block_processing_time | histogram | none | time between BeginBlock and EndBlock |
| sdk_num_transactions | counter | type | number of transactions by type |
| sdk_gov_proposals | counter | type | number of proposals |

Other work that needs to be done on the prometheus implementation:

  • Add a configuration option that allows for a metric prefix for all metrics coming out of the instance (e.g. tm_* or gaiad_*)
  • Add the ability to deactivate metric families from the config file (consensus, p2p, mempool, etc...)

related: tendermint/tendermint#2272
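As a rough illustration of those two configuration items, something along these lines could work on the node side; the struct, field names, and mapstructure tags here are placeholders, not an existing implementation:

```go
// Hypothetical sketch only: names and fields are placeholders, not real config.
package config

// InstrumentationConfig would be read from config.toml.
type InstrumentationConfig struct {
	// Prometheus toggles the metrics endpoint as a whole.
	Prometheus bool `mapstructure:"prometheus"`

	// Namespace is prepended to every metric name (e.g. "tm" or "gaiad"),
	// yielding metrics such as tm_consensus_height.
	Namespace string `mapstructure:"namespace"`

	// EnabledFamilies lists the metric families to export
	// ("consensus", "p2p", "mempool", ...); an empty list means all.
	EnabledFamilies []string `mapstructure:"enabled_families"`
}

// IsFamilyEnabled reports whether metrics for a given family should be registered.
func (cfg *InstrumentationConfig) IsFamilyEnabled(family string) bool {
	if len(cfg.EnabledFamilies) == 0 {
		return true
	}
	for _, f := range cfg.EnabledFamilies {
		if f == family {
			return true
		}
	}
	return false
}
```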

@alexanderbez
Contributor

This is awesome aggregation work, @jackzampolin! Wouldn't this need to be handled on the TM side though?

@jackzampolin
Member Author

I think a lot of it needs to be handled there, but there is also some stuff in the SDK that needs to be handled. I think it's going to be critical to come up with a framework where we can tie metrics to modules and enable monitoring in some parts of the system, but not others. I thought we could use this issue as a place to define what that would look like and which metrics we want, then open separate issues on tendermint for implementation.

@alexanderbez
Contributor

I see, I see. @xla offered some good insight in #1409. Coming up with an interface for modules to export some sort of metrics shouldn't be far from trivial, but we have to be very particular about what we want to expose and track.

@Hyung-bharvest
Contributor

I am highly interested in the p2p_peer data, because it will give us strong tools to analyze malicious attacks, including layer 7 attacks. I would like to know "p2p_peer_pending_receive(send)_cnt(bytes)", which indicates how many packets (bytes) have been received from (sent to) each peer.
I would also like to know the "REAL" public IP (not the IP from external_IP in config.toml) of each peer, if gaiad has it from the raw TCP data, for analysis and defence purposes.

@mdyring

mdyring commented Aug 30, 2018

@jackzampolin agreed, all of the above would be very interesting!

Related to #2131, it would make good sense to include some AVL stats as well as database stats (insert/update/delete counts plus whatever else can be gathered), to further diagnose this.

It would also be valuable for monitoring moving forward.

@mdyring

mdyring commented Aug 30, 2018

@jackzampolin I'd like to add gaia_proposals_total (or similar wording) to track the number of proposals.

This should make it trivial to set up monitoring that alerts the operator when a new proposal is added.

@jackzampolin
Member Author

@mdyring can you mention the names of the IAVL stats we should have?

@melekes
Contributor

melekes commented Aug 31, 2018

mempool_num_txs

that's the same as existing mempool_size?

mempool_recheck_times | histogram | none | number of transactions rechecked in the mempool

the name seems inconsistent with the description to me. Do you want a histogram of how long it took to recheck txs (i.e. 1s - 18 txs, 2s - 10 txs, ...) or the number of transactions rechecked (gauge)?

p2p_num_txs

peers are not aware of the term "transactions". they operate on channels with raw bytes.

number of transactions submitted by each peer_id

moreover, peers do not submit transactions (clients do). peers exchange transactions between each other using mempool.

p2p_block_parts | gauge | peer_id | number of blockparts transmitted by peer

same. it should be either consensus_proposal_block_parts or blockstore_block_parts

consensus_block_gas | gauge | none | amount of gas in the previous block (needs work)

as far as I know, it's not currently implemented.

@mdyring

mdyring commented Aug 31, 2018

@jackzampolin well, basically anything that could be used to pinpoint performance issues.

I am not that familiar with the AVL codebase, but I think the below might be useful.

| Metric | Type | Tags | Notes |
| --- | --- | --- | --- |
| tm_avl_immutable_gets_total | gauge | none | Incremented for each immutable_tree get*() |
| tm_avl_mutable_gets_total | gauge | none | Incremented for each mutable_tree get*() |
| tm_avl_mutable_sets_total | gauge | none | Incremented for each mutable_tree set() |
| tm_avl_mutable_removes_total | gauge | none | Incremented for each mutable_tree remove() |
| tm_avl_mutable_rollbacks_total | gauge | none | Incremented for each mutable_tree rollback() |
| tm_avl_mutable_load_versions_total | gauge | none | Incremented for each mutable_tree load_version() |
| tm_avl_mutable_save_versions_total | gauge | none | Incremented for each mutable_tree save_version() |
| tm_avl_mutable_delete_versions_total | gauge | none | Incremented for each mutable_tree delete_version() |
| tm_avl_nodedb_saves_total | gauge | none | Incremented for each nodedb save() |

@xla

xla commented Aug 31, 2018

@jackzampolin Great work, thanks for putting this together. I'd propose breaking the issue apart into three distinct pieces:

  • Cosmos metrics - anything which can be tracked from the SDK codebase
  • Tendermint metrics - extension and rework of the existing prom metrics
  • Low-level libs - likely raw performance and resource utilisation

That will make the discussion easier, and in this context we should also think about how we can expose the monitoring facilities of tm meaningfully to the SDK.

@xla

xla commented Aug 31, 2018

As @melekes pointed out, most of the metrics concerning blocks and transactions belong in the realm of consensus.

mempool_num_txs

that's the same as existing mempool_size?

It seems both are inaccurate; we should encode the nature of the transactions and probably have metrics for the stages a transaction can go through, so the operator has insight into how many are pending and how many have been processed overall.

@jackzampolin
Member Author

@mslipper and team will be handling this work. I think the next steps here are to open issues in tm, sdk and iavl that detail the metrics there and begin working on implementation, per our call today.

@mslipper
Contributor

mslipper commented Sep 6, 2018

Proposal for how we can handle this:

Currently, we have a MetricsProvider function that returns Metrics structs for the cs, p2p, and mempl packages. Since we're aiming to provide additional metrics for more packages than just these, I'd like to propose making the following changes to Tendermint's Metrics subsystem:

The Metrics structs as they are defined right now return the raw go-kit Counter, Gauge, and Histogram interfaces. Since Go doesn't support union types yet, I suggest we wrap the supported metrics for a given package in an interface with the following signature:

type Metrics interface {
	Counters() []Counter
	Gauges() []Gauge
	Histograms() []Histogram
	Family() string
}

Family() is used to identify the class of metrics exposed by the Metrics in order to programmatically enable/disable them.

We'll need to refactor MetricsProvider to return a slice of Metrics interfaces rather than the three package-specific structs it does now.

To support command-line flags that enable or disable particular metrics, we'll allow an additional string slice argument to NewNode that defines a list of metric categories to enable. By default, all metrics will be enabled. On node startup, the node will iterate through the slice of Metrics objects provided by MetricsProvider and disable any that are not in the list of enabled metric families.
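A minimal sketch of how the refactored provider and the startup filtering could fit together, reusing the interface proposed above; MetricsProvider's new signature and the filterMetrics helper are assumptions, not the final design:

```go
// Sketch only: the provider signature and filterMetrics are assumptions.
package node

import "github.com/go-kit/kit/metrics"

// Metrics is the per-package interface proposed above.
type Metrics interface {
	Counters() []metrics.Counter
	Gauges() []metrics.Gauge
	Histograms() []metrics.Histogram
	Family() string
}

// MetricsProvider would return one Metrics value per instrumented package
// instead of the three package-specific structs it returns today.
type MetricsProvider func() []Metrics

// filterMetrics keeps only the families the operator enabled
// (e.g. via a CLI flag or the config file); an empty list keeps everything.
func filterMetrics(all []Metrics, enabledFamilies []string) []Metrics {
	if len(enabledFamilies) == 0 {
		return all
	}
	enabled := make(map[string]bool, len(enabledFamilies))
	for _, f := range enabledFamilies {
		enabled[f] = true
	}
	var kept []Metrics
	for _, m := range all {
		if enabled[m.Family()] {
			kept = append(kept, m)
		}
	}
	return kept
}
```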

@jackzampolin
Member Author

This looks great @mslipper, nice and extensible as well as being a small change from the current implementation. 👍 from me

@melekes
Contributor

melekes commented Sep 7, 2018

Since we're aiming to provide additional metrics for more packages than just these, I'd like to propose making the following changes to Tendermint's Metrics subsystem

Most probably, but so far I don't see any metrics outside of the existing consensus, mempool and p2p packages.

What problem exactly are you trying to solve?

@jackzampolin
Member Author

@melekes What about the IAVL and SDK module-specific (gov was mentioned) metrics mentioned above? This comment above also speaks to this:

I think it's going to be critical to come up with a framework where we can tie metrics to modules and enable monitoring in some parts of the system, but not others. I thought we could use this issue as a place to define what that would look like and which metrics we want, then open separate issues on tendermint for implementation.

@melekes
Contributor

melekes commented Sep 7, 2018

Add ability to deactivate metric families from the config file

Prometheus metrics are cheap (they're just counters in RAM, which are periodically read by an outside process). I am pretty sure you can configure the Prometheus server to not read some metrics. I just want to make sure you understand that it might be easier to expose everything and read only what the user needs, versus a framework where we enable monitoring in some parts of the system but not others.

@jackzampolin
Member Author

@melekes If you don't think that should be a requirement, how would it change the implementation? I think having each module export metrics makes metrics a first-class citizen in our ecosystem and allows devs an easy way to add them when building. Do you have any issues with this approach?

@jackzampolin
Member Author

Let's go ahead and move forward with this proposal!

@melekes
Contributor

melekes commented Sep 8, 2018

I think having each module export metrics makes metrics a first-class citizen in our ecosystem and allows devs an easy way to add them when building.

This sounds amazing! I am not opposed to adding more metrics.

The thing I am struggling to understand is why each module needs to export:

type Metrics interface {
	Counters() []Counter
	Gauges() []Gauge
	Histograms() []Histogram
	Family() string
}

@melekes
Contributor

melekes commented Sep 8, 2018

If you don't think that should be a requirement, how would it change the implementation?

this requires no changes in Tendermint.

@jackzampolin
Member Author

@melekes That way each module can export metrics of any type. Some of those functions will likely return empty slices in practice. The family name is for specifying the metric namespace.
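For illustration, a hypothetical SDK module could satisfy the proposed interface like this; the gov package and its metric set are made up for the example:

```go
// Illustrative only: the gov module and its metrics are placeholders.
package gov

import "github.com/go-kit/kit/metrics"

// Metrics would be one module-level implementation of the proposed interface.
type Metrics struct {
	// Proposals counts governance proposals submitted (cf. sdk_gov_proposals above).
	Proposals metrics.Counter
	// ActiveProposals tracks proposals currently in their voting period.
	ActiveProposals metrics.Gauge
}

func (m *Metrics) Counters() []metrics.Counter { return []metrics.Counter{m.Proposals} }
func (m *Metrics) Gauges() []metrics.Gauge     { return []metrics.Gauge{m.ActiveProposals} }

// No histograms for this module; an empty slice is fine per the discussion above.
func (m *Metrics) Histograms() []metrics.Histogram { return nil }

// Family names the metric namespace used to enable/disable this group.
func (m *Metrics) Family() string { return "gov" }
```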

@melekes
Contributor

melekes commented Sep 11, 2018

OK, maybe I don't understand something. Please go ahead with the proposal.

@melekes
Contributor

melekes commented Sep 11, 2018

That way each module can export metrics of any type

you don't need to export metrics (as an interface or something else) in order to export them to Prometheus!!!

@xla

xla commented Sep 11, 2018

I share @melekes' concerns: the proposed solution seems not to address the original problem, which should be challenged. Can we expand on why we need such granularity in metric configuration? This would only be reasonable if certain metrics impose performance hits. As long as this is not the case, we should only support toggling of metrics and of the metrics provider (prometheus for now).

@jackzampolin
Member Author

@melekes You do need to register the metrics with Prometheus, however. If the application has multiple modules and each module has some metrics it wants to expose to operators/clients, then having an interface to export those metrics seems sensible.

@xla can you expand on "proposed solution seems to not address the original problem"?

And that is correct: as long as there is no performance hit, we would only need to support configuration for toggling metrics and the provider.

@melekes
Contributor

melekes commented Sep 11, 2018

You do need to register the metrics with prometheus however.

True, but you don't need to export any functions. All you need to do is call it (you can do it in the constructor within the package). The fact that Tendermint exposes the Metrics struct and the PrometheusMetrics() / NopMetrics() functions makes sense for Tendermint because some of our users use Tendermint as a library, and they specifically requested the ability to change the default MetricsProvider.

  1. https://github.com/tendermint/tendermint/blob/af1a405768daf8c1e8a99fca3f331b93baa20851/mempool/metrics.go#L21
  2. https://github.com/go-kit/kit/blob/master/metrics/prometheus/prometheus.go#L58
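For reference, the pattern in the linked mempool/metrics.go looks roughly like this (a simplified sketch, not a verbatim copy): constructing the go-kit wrapper registers the collector with Prometheus' default registerer as a side effect, so nothing extra has to be exported.

```go
// Simplified from the linked mempool/metrics.go: the go-kit constructor
// registers the collector with Prometheus' default registerer, so simply
// calling PrometheusMetrics() inside the package is enough to expose it.
package mempool

import (
	"github.com/go-kit/kit/metrics"
	"github.com/go-kit/kit/metrics/prometheus"
	stdprometheus "github.com/prometheus/client_golang/prometheus"
)

// Metrics contains the metrics exposed by this package.
type Metrics struct {
	// Size of the mempool (number of uncommitted transactions).
	Size metrics.Gauge
}

func PrometheusMetrics() *Metrics {
	return &Metrics{
		Size: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{
			Subsystem: "mempool",
			Name:      "size",
			Help:      "Size of the mempool (number of uncommitted transactions).",
		}, []string{}),
	}
}
```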

@mslipper
Contributor

Taking discussion to Slack.

@xla

xla commented Sep 17, 2018

@xla can you expand on proposed solution seems to not address the original problem?

The original problem is that we want to expose a variety of new metrics, which boils down to setting a gauge, adding to a counter, or observing a histogram in the right place in the code. Fine-grained control down to the metric level introduces complexity and overhead we should shy away from, as it opens up a quite complex configuration matrix. In the most naive approach we can always collect those metrics and have the configuration only influence whether the prometheus endpoint is enabled or not.

Addition of metrics should be confined to package boundaries and not necessitate any changes to how we set up metrics. The one change we should drive is support for a common prefix (namespace in the context of prometheus), as mentioned in tendermint/tendermint#2272. Speaking from the tendermint perspective only here.
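A short sketch of that namespace idea, threading a configured prefix into the go-kit/prometheus options; the PrometheusMetrics(namespace string) signature here is an assumption about how it could be wired, not the merged implementation:

```go
// Sketch of the common-prefix idea: a configured namespace flows into the
// go-kit/prometheus options, so every metric is exported as
// <namespace>_<subsystem>_<name> (e.g. tendermint_consensus_height).
// The PrometheusMetrics(namespace string) signature is an assumption.
package consensus

import (
	"github.com/go-kit/kit/metrics"
	"github.com/go-kit/kit/metrics/prometheus"
	stdprometheus "github.com/prometheus/client_golang/prometheus"
)

type Metrics struct {
	// Height of the chain.
	Height metrics.Gauge
}

func PrometheusMetrics(namespace string) *Metrics {
	return &Metrics{
		Height: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{
			Namespace: namespace, // common prefix, e.g. "tendermint" or "gaiad"
			Subsystem: "consensus",
			Name:      "height",
			Help:      "Height of the chain.",
		}, []string{}),
	}
}
```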

mslipper added a commit to mslipper/tendermint that referenced this issue Sep 18, 2018
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 18, 2018
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 18, 2018
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 19, 2018
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 21, 2018
xla pushed a commit to tendermint/tendermint that referenced this issue Sep 25, 2018
* Add additional metrics to p2p and consensus
Partially addresses cosmos/cosmos-sdk#2169.
* WIP
* Updates from code review
* Updates from code review
* Add instrumentation namespace to configuration
* Fix test failure
* Updates from code review
* Add quotes
* Add atomic load
* Use storeint64
* Use addInt64 in writePacketMsgTo
mslipper added a commit to mslipper/tendermint that referenced this issue Sep 27, 2018
xla pushed a commit to tendermint/tendermint that referenced this issue Oct 10, 2018
* Add additional metrics
Continues addressing cosmos/cosmos-sdk#2169.
* Add nop metrics to fix NPE
* Tweak buckets, code review
* Update buckets
* Update docs with new metrics
* Code review updates
@hleb-albau
Contributor

@melekes So, as far as I understand, there is a global DefaultRegisterer. To add your cosmos-based app metrics, all you need to do is register the metrics using the default Prometheus functions. Is that correct?
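A minimal sketch of that pattern using the client_golang default registry; the package, metric name, and recordTx helper are made up for illustration:

```go
// Minimal illustration: an app registers its own collector with Prometheus'
// global DefaultRegisterer and it shows up on the same /metrics endpoint.
// The package, metric name, and recordTx helper are made up.
package app

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// promauto.NewCounter registers the counter with prometheus.DefaultRegisterer.
var txCounter = promauto.NewCounter(prometheus.CounterOpts{
	Namespace: "myapp",
	Name:      "txs_processed_total",
	Help:      "Number of transactions processed by the application.",
})

// recordTx would be called from the app's transaction handler.
func recordTx() { txCounter.Inc() }
```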

@jackzampolin
Member Author

Going to go ahead and close this. @hleb-albau please open another issue for documenting the process of adding more metrics.

@tnachen tnachen reopened this Aug 23, 2019
@jackzampolin
Member Author

This issue conflates tendermint and sdk metrics. I'm going to close it in anticipation of @tnachen opening another issue on prometheus metrics in SDK-based apps.
