Prometheus Metrics Cleanup and Expansion #2169
Comments
This is awesome aggregation work @jackzampolin! Wouldn't this need to be handled on the TM side though? |
I think a lot of it needs to be handled there, but there is also some stuff in the SDK that needs to be handled. I think it's going to be critical to come up with a framework where we can tie metrics to modules and enable monitoring in some parts of the system but not others. I thought we could use this issue as a place to define what that would look like and which metrics we want, then open separate issues in the relevant repos. |
I am highly interested in the p2p_peer data, because it will give us strong tools to analyze malicious attacks, including layer 7 attacks. I would like to know "p2p_peer_pending_recieve(send)_cnt(bytes)", which would tell us how many packets (and how much data) have been received from / sent to each peer. |
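A minimal sketch of what such per-peer metrics could look like with the Prometheus Go client (client_golang); the metric names, label name, and call sites below are illustrative assumptions based on the suggestion above, not existing Tendermint identifiers:

```go
package p2p

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: per-peer gauges for pending send/receive bytes,
// labeled by peer_id as suggested above.
var (
	pendingSendBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "p2p",
		Name:      "peer_pending_send_bytes",
		Help:      "Bytes queued for sending to a peer.",
	}, []string{"peer_id"})

	pendingRecvBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "p2p",
		Name:      "peer_pending_receive_bytes",
		Help:      "Bytes received from a peer but not yet processed.",
	}, []string{"peer_id"})
)

func init() {
	prometheus.MustRegister(pendingSendBytes, pendingRecvBytes)
}

// updateQueueMetrics is a hypothetical hook, called wherever the per-peer
// send/receive queues change size.
func updateQueueMetrics(peerID string, sendQueued, recvQueued int) {
	pendingSendBytes.WithLabelValues(peerID).Set(float64(sendQueued))
	pendingRecvBytes.WithLabelValues(peerID).Set(float64(recvQueued))
}
```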
@jackzampolin agreed, all of the above would be very interesting! Related to #2131, it would make good sense to include some IAVL stats as well as database stats (insert/update/delete counts + whatever else can be gathered), to further diagnose this. It would also be valuable for monitoring moving forward. |
@jackzampolin I'd like to add gaia_proposals_total (or similar wording) to track the number of proposals. This should make it trivial to set up monitoring that will alert the operator that a new proposal was added. |
@mdyring can you mention the names of the IAVL stats we should have? |
that's the same as the existing mempool_size
the name seems inconsistent with the description to me. Do you want a histogram of how long it took to recheck txs (i.e. 1s - 18 txs, 2s - 10 txs, ...) or the number of transactions rechecked (gauge)?
peers are not aware of the term "transactions". they operate on channels with raw bytes.
moreover, peers do not submit transactions (clients do). peers exchange transactions between each other using mempool.
same issue. it should be either consensus_proposal_block_parts or blockstore_block_parts
as far as I know, it's not currently implemented. |
@jackzampolin well, basically anything that could be used to pinpoint performance issues. I am not that familiar with the IAVL codebase, but I think the below might be useful. |
@jackzampolin Great work, thanks for putting this together. I'd propose to break the issue apart into three distinct pieces. It will make the discussion easier, and we should also think in this context about how we can expose the monitoring facilities of tm meaningfully to the SDK. |
As @melekes pointed out, most of the metrics concerning blocks and transactions belong in the realm of consensus.
It seems both are inaccurate; we should encode the nature of the transactions and probably have metrics for the stages a transaction can go through, so the operator has insight into how many are pending and how many have been processed overall. |
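To make the "stages" idea concrete, here is a hedged sketch using client_golang: a gauge for currently pending transactions plus a counter labeled by outcome for processed ones. The metric names, label values, and hook functions are assumptions for illustration, not existing code.

```go
package mempool

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: a gauge for txs currently pending, plus a counter
// labeled by outcome for txs that have finished processing.
var (
	pendingTxs = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "mempool",
		Name:      "pending_txs",
		Help:      "Transactions currently awaiting processing.",
	})

	processedTxs = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "mempool",
		Name:      "processed_txs_total",
		Help:      "Transactions processed, labeled by outcome.",
	}, []string{"outcome"})
)

func init() {
	prometheus.MustRegister(pendingTxs, processedTxs)
}

// Hypothetical call sites as a tx moves through the pipeline.
func onTxEnqueued()  { pendingTxs.Inc() }
func onTxCommitted() { pendingTxs.Dec(); processedTxs.WithLabelValues("committed").Inc() }
func onTxRejected(reason string) {
	pendingTxs.Dec()
	processedTxs.WithLabelValues(reason).Inc()
}
```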
@mslipper and team will be handling this work. I think next steps here are to open issues in tm, sdk and iavl that detail metrics there and begin working on implementation per our call today. |
Proposal for how we can handle this. The idea is for each module to expose a common Metrics interface:

```go
type Metrics interface {
	Counters() []Counter
	Gauges() []Gauge
	Histograms() []Histogram
	Family() string
}
```

We'll need to refactor the existing metrics code accordingly. To support command-line flags that enable or disable particular metrics, we'll allow an additional string slice argument to the metrics constructors. |
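For illustration, a module might satisfy that interface roughly as below. This is a sketch only; it assumes the Counter/Gauge/Histogram types are the go-kit metrics interfaces Tendermint already uses, and the staking metric names are made up.

```go
package staking

import "github.com/go-kit/kit/metrics"

// Hypothetical module metrics satisfying the proposed interface.
type StakingMetrics struct {
	BondedTokens     metrics.Gauge   // made-up example metric
	DelegationsTotal metrics.Counter // made-up example metric
}

func (m *StakingMetrics) Counters() []metrics.Counter {
	return []metrics.Counter{m.DelegationsTotal}
}

func (m *StakingMetrics) Gauges() []metrics.Gauge {
	return []metrics.Gauge{m.BondedTokens}
}

func (m *StakingMetrics) Histograms() []metrics.Histogram {
	return nil // no histograms for this module
}

// Family names the namespace this module's metrics are registered under.
func (m *StakingMetrics) Family() string {
	return "staking"
}
```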
This looks great @mslipper, nice and extensible as well as being a small change from the current implementation. 👍 from me |
most probably, but here I don't see any metrics outside of the existing ones. What problem exactly are you trying to solve? |
@melekes What about the IAVL and sdk module-specific metrics listed above? |
Prometheus metrics are cheap (it's just counters in RAM, which are periodically read by an outside process). I am pretty sure you can configure the Prometheus server to not read some metrics. I just want to make sure you understand that it might be easier to expose everything and read only what the user needs, versus a framework where we enable monitoring in some parts of the system but not others. |
@melekes If you don't think that should be a requirement, how would it change the implementation? I think having each module export metrics makes metrics a first-class citizen in our ecosystem and gives devs an easy way to add them when building. Do you have any issues with this approach? |
Let's go ahead and move forward with this proposal! |
This sounds amazing! I am not opposed to adding more metrics. The thing I am struggling to understand is why each module needs to export such an interface. |
this requires no changes in Tendermint. |
@melekes That way each module can export metrics of any type. Some of those functions will likely return empty slices in practice. The family name is for specifying the metric namespace. |
OK, maybe I don't understand something. Please go ahead with the proposal. |
you don't need to export metrics (as an interface or something else) in order to export them to Prometheus!!! |
I share @melekes' concerns; the proposed solution seems to not address the original problem, which should be challenged. Can we expand on why we need such granularity in metric configuration? This would only be reasonable if certain metrics impose performance hits. As long as this is not the case, we should only support toggling of metrics and of the metrics provider (prometheus for now). |
@melekes You do need to register the metrics with prometheus, however. If the application has multiple modules and each module has some metrics it wants to expose to operators/clients, then having an interface to export those metrics seems sensible. @xla can you expand on that? And that is correct: as long as there is no performance hit, we would only need to support configuration for toggling metrics and the provider. |
True, but you don't need to export any functions. All you need is to call it (you can do it in the constructor within the package). The fact that Tendermint exposes Metrics struct and PrometheusMetrics() / NopMetrics() functions makes sense for Tendermint because some of our users use Tendermint as a library and they specifically requested the ability to change default MetricsProvider. |
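To illustrate @melekes' point, here is a minimal sketch: metrics can be created and registered entirely inside a package (here via client_golang's promauto against the default registry) with nothing exported. The package name, call site, and the gaia_proposals_total name (borrowed from the earlier suggestion) are assumptions, not existing code.

```go
package gov

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Created and registered with the default registry at package init time;
// nothing needs to be exported for the /metrics endpoint to pick it up.
var proposalsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Namespace: "gaia",
	Name:      "proposals_total",
	Help:      "Number of governance proposals submitted.",
})

// handleSubmitProposal is a hypothetical call site inside the module.
func handleSubmitProposal() {
	// ... module logic ...
	proposalsTotal.Inc()
}
```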
Taking discussion to Slack. |
The original problem is that we want to expose a variety of new metrics, which boils down to setting a gauge, adding to a counter, or observing a histogram in the right place in the code. Fine-grained control down to the metric level introduces complexity and overhead we should shy away from, as it opens up quite a complex configuration matrix. In the most naive approach we can always collect those metrics and have the configuration only influence whether the prometheus endpoint is enabled or not. Addition of metrics should be confined to the package boundaries and not necessitate any changes to how we set up metrics. The one change we should drive is support for a common prefix (a namespace, in prometheus terms), as mentioned in tendermint/tendermint#2272. Speaking from the tendermint perspective only here. |
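As a sketch of that prefix change only (not Tendermint's actual code), a per-package constructor could accept the namespace and pass it through go-kit's prometheus wrappers, so every metric gets the common prefix while metric definitions stay inside the package:

```go
package consensus

import (
	"github.com/go-kit/kit/metrics"
	kitprom "github.com/go-kit/kit/metrics/prometheus"
	stdprometheus "github.com/prometheus/client_golang/prometheus"
)

// Metrics is defined inside the package; adding a metric never touches
// the global setup.
type Metrics struct {
	Height metrics.Gauge
}

// PrometheusMetrics builds the package's metrics under a common namespace,
// e.g. namespace "tm" yields tm_consensus_height.
func PrometheusMetrics(namespace string) *Metrics {
	return &Metrics{
		Height: kitprom.NewGaugeFrom(stdprometheus.GaugeOpts{
			Namespace: namespace,
			Subsystem: "consensus",
			Name:      "height",
			Help:      "Height of the chain.",
		}, nil),
	}
}
```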
Partially addresses cosmos/cosmos-sdk#2169.
Continues addressing cosmos/cosmos-sdk#2169.
* Add additional metrics to p2p and consensus. Partially addresses cosmos/cosmos-sdk#2169.
* WIP
* Updates from code review
* Updates from code review
* Add instrumentation namespace to configuration
* Fix test failure
* Updates from code review
* Add quotes
* Add atomic load
* Use storeint64
* Use addInt64 in writePacketMsgTo
* Add additional metrics. Continues addressing cosmos/cosmos-sdk#2169.
* Add nop metrics to fix NPE
* Tweak buckets, code review
* Update buckets
* Update docs with new metrics
* Code review updates
@melekes So, as far as I understand, there is a global metrics registry that everything registers with? |
Going to go ahead and close this. @hleb-albau please open another issue for documenting the process of adding more metrics. |
This issue conflates tendermint and sdk metrics. I'm going to close it in anticipation of @tnachen opening another issue on prometheus metrics in SDK-based apps. |
Currently (excluding the `go_*`, `promhttp_*`, and `process_*` metrics) we have the following prometheus metrics exported:

- `consensus_block_interval_seconds`
- `consensus_block_size_bytes`
- `consensus_byzantine_validators`
- `consensus_byzantine_validators_power`
- `consensus_height`
- `consensus_missing_validators`
- `consensus_missing_validators_power`
- `consensus_num_txs`
- `consensus_rounds`
- `consensus_total_txs`
- `consensus_validators`
- `consensus_validators_power`
- `mempool_size`
- `p2p_peers`
There are a couple of additional metrics that have been requested by the validator community (tendermint/tendermint#2272):

- `consensus_latest_block_height`
- `consensus_catching_up`
- `p2p_peer_receive_bytes_total` (per `peer_id`)
- `p2p_peer_send_bytes_total` (per `peer_id`)
- `p2p_peer_pending_send_bytes` (per `peer_id`)
Other metrics that would be amazing to add:

- `mempool_num_txs`
- `mempool_tx_size_bytes`
- `mempool_failed_txs`
- `mempool_recheck_times`
- `p2p_num_txs` (per `peer_id`)
- `p2p_block_parts` (per `peer_id`)
- `p2p_pending_send_bytes` (per `peer_id`)
- `consensus_block_gas`
- `consensus_block_processing_time` (time between `BeginBlock` and `EndBlock`; see the sketch after this list)
- `sdk_num_transactions` (per `type`)
- `sdk_gov_proposals` (per `type`)
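A hedged sketch of how `consensus_block_processing_time` could be measured: observe the wall-clock span from the start of `BeginBlock` to the end of `EndBlock` with a histogram. The package, bucket layout, and hook names are assumptions for illustration.

```go
package app

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative only: histogram of seconds spent between BeginBlock and EndBlock.
var blockProcessingTime = promauto.NewHistogram(prometheus.HistogramOpts{
	Namespace: "consensus",
	Name:      "block_processing_time",
	Help:      "Seconds between BeginBlock and EndBlock for each block.",
	Buckets:   prometheus.ExponentialBuckets(0.01, 2, 12), // 10ms .. ~20s
})

// blockTimer would be started in BeginBlock and observed in EndBlock.
type blockTimer struct{ start time.Time }

func (t *blockTimer) beginBlock() { t.start = time.Now() }
func (t *blockTimer) endBlock()   { blockProcessingTime.Observe(time.Since(t.start).Seconds()) }
```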
Other work that needs to be done on the prometheus implementation:

- Namespace the metrics with a common prefix (`tm_*` or `gaiad_*`)
- Allow enabling metrics per module (`consensus`, `p2p`, `mempool`, etc...)

related: tendermint/tendermint#2272