
rfc: Runtime stats #3845

Closed · wants to merge 1 commit

Conversation

LucioFranco (Member):

Rendered

This RFC proposes the low-level stats implementation within tokio, to be used by metrics aggregators/collectors and exposed in dashboards such as Grafana. These low-level stats will be the foundation for tokio's future runtime observability goals; they do not present a complete story, since they will mostly be raw, unaggregated values.

@LucioFranco requested review from carllerche, seanmonstar, jonhoo, hawkw and a team on June 7, 2021 at 14:32
@Darksonn added the A-tokio (Area: The main tokio crate) and M-runtime (Module: tokio/runtime) labels on Jun 7, 2021

### I/O and Timer implementations

Since, the two drivers (I/O and timer) that `tokio` provides are singtons within the runtime there is no need to iterate through their stats like the executor stats. In-addition, it is possible to stream the metrics directly from the driver events rather than needing to batch them like the executor.
Contributor:

When we are using a lot of atomics in this manner, we should be careful regarding false sharing of the atomics.

Member Author:

Can you expand on what you mean by false sharing?

Contributor:

If you have a bunch of atomic variables stored together (in the same cache line), with many threads writing to them concurrently, then this can impact performance quite a lot, even if the writes are affecting two different counters.
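For illustration, a minimal sketch of padding atomics onto separate cache lines, assuming a 64-byte cache line; the types and counter names here are hypothetical and not from the RFC:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Align (and therefore pad) each counter to its own 64-byte cache line so
// that two threads bumping different counters do not contend on the same line.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

struct WorkerCounters {
    polls: PaddedCounter,
    steals: PaddedCounter,
}

impl WorkerCounters {
    fn new() -> Self {
        Self {
            polls: PaddedCounter(AtomicU64::new(0)),
            steals: PaddedCounter(AtomicU64::new(0)),
        }
    }

    fn record_poll(&self) {
        // Relaxed is enough for a plain event counter.
        self.polls.0.fetch_add(1, Ordering::Relaxed);
    }
}
```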

Contributor:

In this case the issue is true sharing (contention on atomics), so padding stuff out won't solve it either.

Member (@carllerche), Jun 7, 2021:

How is it a case of true sharing? My read is that there is a single driver and one thread polls it. It can store stats in an atomic and an arbitrary number of stats aggregators can load it.

Contention happens when there are concurrent mutations, which is not (afaik) the case here.


To avoid any extra overhead in the executor loop, each worker will batch metrics into a `Core` local struct. These values will be incremented or sampled during regular executor cycles when certain operations happen like a work steal attempt or a pop from one of the queues.

The batches will be streamed via atomics to the stats struct directly. This will reduce any cross CPU work while the executor is running and amortize the cost of having to do cross CPU work. Batches will be sent before the executor attempts to park the thread. This will happen either when there is no work to be done or when the executor has hit the maintance tick. At this point before the thread will park, the executor will submit the batch. Generally, since parking is more expensive then submitting batches there should not be any added latency to the executor cycle in this process.
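As a sketch of the batching scheme described above (the type and field names are hypothetical, not the RFC's actual API): the worker keeps plain integers in its `Core`-local batch and only performs atomic writes when it flushes, e.g. just before parking.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Shared side: read by stats collectors on other threads.
struct SharedWorkerStats {
    polls: AtomicU64,
    steal_attempts: AtomicU64,
}

// Worker-local side: plain counters, no atomics on the hot path.
#[derive(Default)]
struct CoreStatsBatch {
    polls: u64,
    steal_attempts: u64,
}

impl CoreStatsBatch {
    // Incremented during normal executor cycles (e.g. after popping a task).
    fn incr_poll(&mut self) {
        self.polls += 1;
    }

    // Called right before the worker parks: one atomic add per counter,
    // amortizing the cross-CPU traffic over the whole batch window.
    fn flush(&mut self, shared: &Arc<SharedWorkerStats>) {
        shared.polls.fetch_add(self.polls, Ordering::Relaxed);
        shared
            .steal_attempts
            .fetch_add(self.steal_attempts, Ordering::Relaxed);
        *self = CoreStatsBatch::default();
    }
}
```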
Contributor:

How is the user supposed to hook up to the stats struct? Read it at regular intervals? Is there a mechanism for being notified when a batch update happens?

Contributor:

I think it is polling, and described in the next paragraph.

Member (@tobz) left a comment:

I think most of the technical aspects of the RFC make sense, but tightening up the grammar/structure/flow will strengthen the overall proposal, especially since this will end up as documentation read by users.


### Executor

The handle will provide stats for each worker thread or in the case of the single threaded executor will provide a single worker. This provides a detailed view into what each worker is doing. Providing the ability for the `tokio-metrics` crate to expose the stats aggregated as a single metric or as a per-worker metric.
Member:

nit: structure/flow

Suggested change
The handle will provide stats for each worker thread or in the case of the single threaded executor will provide a single worker. This provides a detailed view into what each worker is doing. Providing the ability for the `tokio-metrics` crate to expose the stats aggregated as a single metric or as a per-worker metric.
Statistics will be provided on a per-worker basis, whether using the single-threaded or multi-threaded executor. Aggregating and merging these per-worker statistics in a way that makes more sense when used from existing telemetry collection systems will be provided by crates like `tokio-metrics`.

Contributor:

To iterate on that:

Suggested change
The handle will provide stats for each worker thread or in the case of the single threaded executor will provide a single worker. This provides a detailed view into what each worker is doing. Providing the ability for the `tokio-metrics` crate to expose the stats aggregated as a single metric or as a per-worker metric.
Statistics will be provided on a per-worker basis, whether using the single-threaded or multi-threaded executor. Aggregated and merged per-worker statistics, which may be more amenable to existing telemetry collection systems, will be provided by crates like `tokio-metrics`.


The values will be updated in batch from the executor to avoid needing to stream the data on every action. This should amortize the cost by only needing to emit stats at a specific executor wall clock time. Where the executor wall clock time is determinetd by a single executor tick rather than actual system time. This allows the collectors to observe the time and the stats to determine how long certain executor cycles took. This removes the need to acquire the time during executor cycles.

Each worker will expose these stats, updated in batches:
Member:

nit: redundancy

You already mentioned in the above paragraph that these are batched.


The handle will provide stats for each worker thread or in the case of the single threaded executor will provide a single worker. This provides a detailed view into what each worker is doing. Providing the ability for the `tokio-metrics` crate to expose the stats aggregated as a single metric or as a per-worker metric.

The values will be updated in batch from the executor to avoid needing to stream the data on every action. This should amortize the cost by only needing to emit stats at a specific executor wall clock time. Where the executor wall clock time is determinetd by a single executor tick rather than actual system time. This allows the collectors to observe the time and the stats to determine how long certain executor cycles took. This removes the need to acquire the time during executor cycles.
Member:

I think you need to speak more to the time aspect. This will be important to deriving rates from monotonic counters.

In other words, I know what you're driving at by talking about the executor ticking at a predictable interval, but that needs to be made explicit here in order to drive home the point that it's being used, or could be used, as an invariant, specifically because it ties into the staleness guarantees around specific statistics.

Contributor:

I agree. The references to "wall clock time" are also confusing, since wall clock time is, by definition, "real" time and not something that happens in, say, ticks.

- Amount of executor ticks (loop iterations)
- Number of `block_in_place` tasks entered

The main goal of this implementation is to allow a user to run this metrics collection at all times in production with minimal overhead to their application. This would allow users to alarm on any regressions and track how the runtime is performing.
Member:

nit: structure/flow

I think wording like this could go into the summary/motivation sections.


The main goal of this implementation is to allow a user to run this metrics collection at all times in production with minimal overhead to their application. This would allow users to alarm on any regressions and track how the runtime is performing.

Some of the stats include min/max (specifically the queue depth stats) this is because the depth of the queues changes throughout the stats batch window. The value could start low, spike up during the middle of the window then come back down. To understand this behavior the executor stats module will aggregate the depth values to reduce the need to stream the values.
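A small sketch of how the min/max tracking described above could look inside one batch window (the names are hypothetical; the RFC does not spell out this code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Tracked locally by the worker during one batch window.
struct QueueDepthWindow {
    min: usize,
    max: usize,
}

impl QueueDepthWindow {
    fn new() -> Self {
        // Start `min` at MAX so the first observation sets both bounds.
        Self { min: usize::MAX, max: 0 }
    }

    // Called whenever the local queue depth changes (push/pop/steal).
    fn observe(&mut self, depth: usize) {
        self.min = self.min.min(depth);
        self.max = self.max.max(depth);
    }

    // Called when the batch is flushed; publishes the window and resets it.
    fn flush(&mut self, shared_min: &AtomicUsize, shared_max: &AtomicUsize) {
        if self.min != usize::MAX {
            shared_min.store(self.min, Ordering::Relaxed);
        }
        shared_max.store(self.max, Ordering::Relaxed);
        *self = Self::new();
    }
}
```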
Member:

nit: structure/flow

Suggested change
Some of the stats include min/max (specifically the queue depth stats) this is because the depth of the queues changes throughout the stats batch window. The value could start low, spike up during the middle of the window then come back down. To understand this behavior the executor stats module will aggregate the depth values to reduce the need to stream the values.
Some of these statistics, such as queue depth, include the minimum and maximum value measured during the given observation window. These statistics can rapidly change under heavy load, and this approach provides a middle ground between streaming all measurements/changes (expensive) and potentially not observing the spikes at all.

Contributor:

Is there a particular reason why we choose min/max/avg over, say, percentiles?

Contributor:

I think it's a question of the runtime overhead caused by this. Percentiles will require more work while collecting data, and you might not be able to use atomic counters anymore.

An external aggregator that consumes the stats (e.g. once per second) could still perform aggregation and compute percentiles based on the number of occurrences inside the sampling period.

However, an external aggregator won't be able to capture min/max values if there are peaks inside that sampling period. E.g. if you want a metric like "maximum tasks polled inside an executor iteration" and "minimum tasks polled", you couldn't get that if you just have counters of

  • eventloop iterations
  • tasks polled

I guess for tasks where we find those values useful, it makes sense to add them.

Otherwise it's probably easiest to just add always-incrementing counters and let the external application do the diffing and aggregation. You can provide some helpers that allow something like:

let mut last_stats = stats.executor();
loop {
    std::thread::sleep(sampling_time);
    let stats = stats.executor();
    let delta_stats = stats.diff(last_stats);
    my_favorite_metric_system.aggregate_and_emit(delta_stats); // or potentially also the raw stats
    last_stats = stats;
}

E.g. we had issues in the past where metrics that were only emitted once per minute didn't show BPS spikes that happened within a few seconds and caused excessive packet drops.

@LucioFranco Might be worthwhile to document that kind of periodic sampling system in the "guide" section, since there had been a few questions on how to use the thing.


### I/O and Timer implementations

Since, the two drivers (I/O and timer) that `tokio` provides are singtons within the runtime there is no need to iterate through their stats like the executor stats. In-addition, it is possible to stream the metrics directly from the driver events rather than needing to batch them like the executor.
Member:

singtons -> singletons


The batches will be streamed via atomics to the stats struct directly. This will reduce any cross CPU work while the executor is running and amortize the cost of having to do cross CPU work. Batches will be sent before the executor attempts to park the thread. This will happen either when there is no work to be done or when the executor has hit the maintance tick. At this point before the thread will park, the executor will submit the batch. Generally, since parking is more expensive then submitting batches there should not be any added latency to the executor cycle in this process.

This then allows the collector to poll the stats on any interval. Thus, allowing it to drive its own timer to understand an estimate of the duration that a certain amount of ticks too, or how many times the the executor entered the park state.
Member:

nit: structure/flow

This sentence has grammatical issues, but I think the bigger problem is that it potentially conflicts with the idea that the executor ticks on a predictable interval. Why would we need to track the duration vs ticks ratio ourselves?


### `tokio-metrics`

The `tokio-metrics` crate will provide aggregated metrics based on the `Stats` struct. This will include histograms and other useful aggregated forms of the stats that could be emitted by various metrics implementations. This crate is designed to provide the ability to expose the aggregated stats in an unstable `0.1` method outside of the runtime and allow the ability to iterate on how they are aggregated without the need to follow `tokio`'s strict versioning scheme.
Member:

It would help to expand on this. What specific aggregations will tokio-metrics expose? How do you expect this to be used in practice? What will trigger alerts, how are engineers expected to use the aggregations in their workflow, etc.?

- Min local queue depth
- Avg local queue depth
- Queue depth at time of batch emission
- Amount of executor ticks (loop iterations)
Contributor:

If my understanding is correct, you intend to provide information about how long executors spend busy in tasks by allowing collectors to observe the number of ticks which occur in a known time interval. However, I think you also need to measure the amount of time spent parked in order to avoid counting parked time as "busy" time.


The main goal of this implementation is to allow a user to run this metrics collection at all times in production with minimal overhead to their application. This would allow users to alarm on any regressions and track how the runtime is performing.

Some of the stats include min/max (specifically the queue depth stats) this is because the depth of the queues changes throughout the stats batch window. The value could start low, spike up during the middle of the window then come back down. To understand this behavior the executor stats module will aggregate the depth values to reduce the need to stream the values.
Contributor:

This will require a way to atomically capture the values in a single batch (and observe when a new batch is ready), I think.

Contributor:

I agree — unless we want consumers to poll these statistics, we'll need some kind of subscription/notify mechanism.

Contributor:

There is a question of whether atomicity is important for interpreting metrics correctly, or whether you are ok with individual values not matching each other (e.g. the sum of tasks run per worker doesn't match the total tasks run metric).

Since it's "just metrics", I think one can be ok with the latter. It will simplify the implementation.

And polling metrics is reasonable. You can always increase the polling frequency to get more details. Polling mostly isn't feasible if you are interested in every single event. But that won't work in this system anyway, if it makes use of internal batching.
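One possible shape for the "observe when a new batch is ready" idea raised above — not something the RFC specifies — is a generation counter published alongside the values (names hypothetical):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct SharedWorkerStats {
    // Bumped once per flushed batch.
    generation: AtomicU64,
    polls: AtomicU64,
}

impl SharedWorkerStats {
    // Worker side: publish a batch, then bump the generation.
    fn publish(&self, polls_in_batch: u64) {
        self.polls.fetch_add(polls_in_batch, Ordering::Relaxed);
        // Release pairs with the Acquire load in `poll_snapshot`.
        self.generation.fetch_add(1, Ordering::Release);
    }

    // Collector side: returns `None` if no new batch landed since `last_seen`.
    fn poll_snapshot(&self, last_seen: u64) -> Option<(u64, u64)> {
        let current = self.generation.load(Ordering::Acquire);
        if current == last_seen {
            return None;
        }
        Some((current, self.polls.load(Ordering::Relaxed)))
    }
}
```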


There are two main types of stats that a runtime can expose. A per-runtime stats (eg `executor_load`, `fds_registered`) that are collected indepdently of the tasks running on the runtime. A per-task stats (eg `poll_duration`, `amount_polls`) that are collected and aggregated at the task level. This RFC will propose an implemenation for implementing per-runtime stats but will also mention methods to capture per-task stats.

A small note, the term `stats` is used instead of `metrics` because we are only concerned with exposing raw data rather than methods of aggregating and emiting that data.
Contributor:

This was odd to me. "stats" is short for "statistics", which are just as much aggregated as "metrics" are. Would "performance counters" be better? Or "performance events"? Or maybe "observations" or simply "data"?

Member:

+1 for "metrics"; I'm not sure the distinction here matters that much. People will want the metrics, try to figure out how it works, see they need an extra crate, and be on their way. Calling it "stats" doesn't imply that; we'll have to spell it out in the documentation. So just go with the more common term of "metrics," IMO.

Contributor:

I personally think those are fine (and I like "performance counters" too). It should just be consistent.


## Motivation

When developing and writing Tokio applications, there are many forms of support, be it tutorials or discord. But when running these applications in production there is not much support. Users want to understand what is happening behind the scenes. What is my runtime up too? How can I optimize my application? This RFC intends to provide a foundation to answer these questions.
Contributor:

Suggested change
When developing and writing Tokio applications, there are many forms of support, be it tutorials or discord. But when running these applications in production there is not much support. Users want to understand what is happening behind the scenes. What is my runtime up too? How can I optimize my application? This RFC intends to provide a foundation to answer these questions.
When developing and writing Tokio applications, there are many forms of support, be it tutorials or discord. But when running these applications in production there is not much support. Users want to understand what is happening behind the scenes. What is my runtime up to? How can I optimize my application? This RFC intends to provide a foundation to answer these questions.


Runtime stats will be exposed via a struct that is attainable via the `tokio::runtime::Handle`. Calling `Handle::stats()` will return a reference counted struct that contains raw stat values. Through this, there will be a `tokio-metrics` crate that converts these raw stats into proper aggregated metrics that can be consumed by end user metrics collection systems like the `metrics` crate.

```rust=
Contributor:

Is the = intentional here? It prevents highlighting.


The blocking pool already tracks the number of idle threads and the total number of threads. These values are currently within a shared mutex but can be moved to be `AtomicUsize` values and then shared with the `Stats` struct to be sampled by the collector. In addition, a counter that is incremented on each task execution will be included. All values will be streamed to the stats struct via atomics.

Stats from the blocking pool:
- Number of idle threads
Contributor:

This value also feels like it might vary wildly. Should it also present min/max/avg?
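For reference, a rough sketch of the blocking-pool counters described above as atomics (hypothetical names; the real field layout is up to the implementation):

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

#[derive(Default)]
struct BlockingPoolStats {
    // Gauges, updated as threads go idle/busy or are spawned/retired.
    idle_threads: AtomicUsize,
    total_threads: AtomicUsize,
    // Monotonic counter, incremented for each executed blocking task.
    tasks_executed: AtomicU64,
}

impl BlockingPoolStats {
    fn on_thread_spawned(&self) {
        self.total_threads.fetch_add(1, Ordering::Relaxed);
        self.idle_threads.fetch_add(1, Ordering::Relaxed);
    }

    fn on_task_start(&self) {
        self.idle_threads.fetch_sub(1, Ordering::Relaxed);
    }

    fn on_task_finish(&self) {
        self.tasks_executed.fetch_add(1, Ordering::Relaxed);
        self.idle_threads.fetch_add(1, Ordering::Relaxed);
    }
}
```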


### Task

This implementation will avoid tracking stats/metrics at the task level due to the overhead required. This will instead be accomplished by the [tokio console](https://github.com/tokio-rs/console). This will allow the user to attach the console and take the performance hit when they want to explore issues in more detail.
Contributor:

Suggested change
This implementation will avoid tracking stats/metrics at the task level due to the overhead required. This will instead be accomplished by the [tokio console](https://github.com/tokio-rs/console). This will allow the user to attach the console and take the performance hit when they want to explore issues in more detail.
This RFC does not propose tracking stats/metrics at the task level due to the overhead required. Instead, the this is left to projects like the [tokio console](https://github.com/tokio-rs/console), which allows the user to attach the console and take the performance hit when they want to explore issues in more detail.


### I/O driver

Unlike, the executor stats, stats coming from the I/O driver will be streamed directly to the `Stats` struct via atomics. Each value will be incremented (via `AtomicU64::fetch_add`) for each event.
Contributor:

Again, I don't know what "streaming ... via atomics" means.

Also, the `Stats` struct hasn't been defined.

Unlike, the executor stats, stats coming from the I/O driver will be streamed directly to the `Stats` struct via atomics. Each value will be incremented (via `AtomicU64::fetch_add`) for each event.

List of stats provided from the I/O driver:
- Amount of compact
Contributor:

Here, too, "number" seems preferable to "amount".

Contributor:

What does "compact" mean here?

List of stats provided from the I/O driver:
- Amount of compact
- Amount of "token dispatches" (aka ready events)
- Amount of fd currently registered with `io::Driver`
Contributor:

Suggested change
- Amount of fd currently registered with `io::Driver`
- Number of file descriptors currently registered with `io::Driver`

Contributor:

How does this work on Windows?

Contributor:

call it IO handles or IO resources?


This RFC proposes a new way to gather understanding from the Tokio runtime. Currently, the runtime does not expose any methods to understand what is happening under the hood. This provides a rough experience when deploying Tokio based applications into production where you would like to understand what is happening to your code. Via this RFC, we will propose a few methods to collect this data at different levels. Beyond what is proposed as implemenation in this RFC, we will also discuss other methods to gather the information a user might need to be successful with Tokio in a production environment.

There are two main types of stats that a runtime can expose. A per-runtime stats (eg `executor_load`, `fds_registered`) that are collected indepdently of the tasks running on the runtime. A per-task stats (eg `poll_duration`, `amount_polls`) that are collected and aggregated at the task level. This RFC will propose an implemenation for implementing per-runtime stats but will also mention methods to capture per-task stats.
Member:

nit: "A ... stats" seems grammatically weird --- I would just say

Suggested change
There are two main types of stats that a runtime can expose. A per-runtime stats (eg `executor_load`, `fds_registered`) that are collected indepdently of the tasks running on the runtime. A per-task stats (eg `poll_duration`, `amount_polls`) that are collected and aggregated at the task level. This RFC will propose an implemenation for implementing per-runtime stats but will also mention methods to capture per-task stats.
There are two main types of stats that a runtime can expose. Per-runtime stats (eg `executor_load`, `fds_registered`) that are collected indepdently of the tasks running on the runtime, and per-task stats (eg `poll_duration`, `amount_polls`) that are collected and aggregated at the task level. This RFC will propose an implemenation for implementing per-runtime stats but will also mention methods to capture per-task stats.

Contributor:

Plus implemenation -> implementation


There are two main types of stats that a runtime can expose. A per-runtime stats (eg `executor_load`, `fds_registered`) that are collected indepdently of the tasks running on the runtime. A per-task stats (eg `poll_duration`, `amount_polls`) that are collected and aggregated at the task level. This RFC will propose an implemenation for implementing per-runtime stats but will also mention methods to capture per-task stats.

A small note, the term `stats` is used instead of `metrics` because we are only concerned with exposing raw data rather than methods of aggregating and emiting that data.
Member:

nit:

Suggested change
A small note, the term `stats` is used instead of `metrics` because we are only concerned with exposing raw data rather than methods of aggregating and emiting that data.
A small note: the term "stats" is used instead of "metrics", because we are only concerned with exposing raw data rather than methods of aggregating and emitting that data.


Runtime stats will be exposed via a struct that is attainable via the `tokio::runtime::Handle`. Calling `Handle::stats()` will return a reference counted struct that contains raw stat values. Through this, there will be a `tokio-metrics` crate that converts these raw stats into proper aggregated metrics that can be consumed by end user metrics collection systems like the `metrics` crate.

```rust=
Member:

Suggested change
```rust=
```rust

Contributor (@Matthias247) left a comment:

Thanks for getting started on this! Looking forward to it.


## Motivation

When developing and writing Tokio applications, there are many forms of support, be it tutorials or discord. But when running these applications in production there is not much support. Users want to understand what is happening behind the scenes. What is my runtime up too? How can I optimize my application? This RFC intends to provide a foundation to answer these questions.
Contributor:

What is my runtime up too? How can I optimize my application?

I would recommend making this a bit more concrete, because the "what is going on" question is repeated a couple of times in the doc without going much deeper.

Among others:

  • Why is the latency of the system higher than expected?
  • Why does memory utilization grow over time?
  • Why does the service run out of the file descriptor limit?


## Guide-level explanation

Runtime stats will be exposed via a struct that is attainable via the `tokio::runtime::Handle`. Calling `Handle::stats()` will return a reference counted struct that contains raw stat values. Through this, there will be a `tokio-metrics` crate that converts these raw stats into proper aggregated metrics that can be consumed by end user metrics collection systems like the `metrics` crate.
Contributor:

I think what is returned could be a reference-counted accessor for the raw values. But it doesn't have to store the stats itself. It can simply contain a `fn stats(&self) -> RealStats` function which returns a POD struct with just values in it. How the accessor handle gets those values doesn't matter.
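A sketch of the accessor shape described in this comment (all names, including `RealStats`, are hypothetical):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Cheaply clonable, reference-counted accessor handed out by the runtime.
#[derive(Clone)]
struct StatsHandle {
    inner: Arc<Inner>,
}

struct Inner {
    ticks: AtomicU64,
    polls: AtomicU64,
}

// Plain-old-data snapshot returned to the caller.
#[derive(Clone, Copy, Debug)]
struct RealStats {
    ticks: u64,
    polls: u64,
}

impl StatsHandle {
    fn stats(&self) -> RealStats {
        RealStats {
            ticks: self.inner.ticks.load(Ordering::Relaxed),
            polls: self.inner.polls.load(Ordering::Relaxed),
        }
    }
}
```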

Contributor:

Through this, there will be a tokio-metrics crate that converts these raw stats into proper aggregated metrics that can be consumed by end user metrics collection systems like the metrics crate.

This is more confusing than helpful to me at the moment. What are "proper aggregated metrics"? What is not proper about the other ones? Maybe it's easier to leave that detail out of this proposal, and just mention that metric submission is out of scope because it is application dependent?

let executor = stats.executor();

// per-worker stats via the executor.
for worker in executor.workers() {
Contributor:

Is that number even static? Is there a unique worker ID?


Each worker will expose these stats, updated in batches:

- Amount of futures executed
Contributor:

"Amount of futures polled" seems right to me. I don't think it should be "distinct". If the same future gets scheduled multiple times, it is also work.


### Task

This implementation will avoid tracking stats/metrics at the task level due to the overhead required. This will instead be accomplished by the [tokio console](https://github.com/tokio-rs/console). This will allow the user to attach the console and take the performance hit when they want to explore issues in more detail.
Contributor:

This is kind of confusing, since per-task stats seem mentioned in the intro?

A per-task stats (eg poll_duration, amount_polls) that are collected and aggregated at the task level. This RFC will propose an implemenation for implementing per-runtime stats but will also mention methods to capture per-task stats.

Apart from that, I'm ok with them not being available in the beginning. Once users have global visibility and see some abnormalities, they can always add custom instrumentation to their tasks/futures to figure out the details. The executor-level stats are more tricky because those details are not exposed to users.
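As an aside, a rough sketch of the kind of ad-hoc per-future instrumentation a user could add themselves (not part of the RFC; all names hypothetical):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Instant;

// Wraps a future and reports how long each poll took.
struct TimedPolls<F> {
    inner: F,
    polls: u64,
}

impl<F: Future + Unpin> Future for TimedPolls<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        // Safe to take a plain &mut because F: Unpin.
        let this = self.get_mut();
        let start = Instant::now();
        let result = Pin::new(&mut this.inner).poll(cx);
        this.polls += 1;
        println!("poll #{} took {:?}", this.polls, start.elapsed());
        result
    }
}
```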

List of stats provided from the I/O driver:
- Amount of compact
- Amount of "token dispatches" (aka ready events)
Contributor:

Sounds like a tokio/mio concept. "ready events" might be a better term to expose externally


Each worker will expose these stats, updated in batches:

- Amount of futures executed
Contributor:

What about futures passed to block_on? Which thread are they on? What about futures polled in a LocalSet?

Member Author:

So LocalSet is a good question...

For block_on I would say it doesn't run on the main executor so it doesn't count?

@Darksonn (Contributor):

Please see the initial work in #4043 and provide feedback on the direction.

@Darksonn added the M-metrics (Module: tokio/runtime/metrics) label and removed the M-runtime (Module: tokio/runtime) label on Aug 27, 2021
@carllerche (Member):

Thanks for the work. I'm going to close this due to inactivity. If you want to continue this patch, please open a new PR and reference this one.

@carllerche closed this on Nov 22, 2022
@Darksonn deleted the lucio/runtime-stats-rfc branch on November 22, 2022 at 20:14
Labels: A-tokio (Area: The main tokio crate), M-metrics (Module: tokio/runtime/metrics)