
Coroutine scheduler monitoring #1360

Open
asad-awadia opened this issue Jul 21, 2019 · 24 comments

@asad-awadia

Are there any monitoring tools available for how many coroutines are currently active, their state, etc.? It would be nice if this could be exposed so that something like Prometheus can scrape it and visualise it in Grafana.

It would also help in debugging leaks and errors: if we see the coroutine count just rising linearly, something is wrong.

If not, can this be done by looking at the thread stats instead?

Go exposes this via runtime.NumGoroutine()

Related: #494

@elizarov
Contributor

Please, take a look at kotlinx-coroutines-debug module: https://github.com/Kotlin/kotlinx.coroutines/blob/master/kotlinx-coroutines-debug/README.md
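For reference, a minimal sketch of counting live coroutines with that module, assuming a dependency on `org.jetbrains.kotlinx:kotlinx-coroutines-debug` (note the thread below discusses the performance cost of enabling the probes):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.debug.DebugProbes

fun main() = runBlocking {
    // Must be installed before the coroutines you want to track are created.
    DebugProbes.install()
    val jobs = List(3) { launch { delay(1_000) } }
    // dumpCoroutinesInfo() returns a snapshot of all live coroutines;
    // its size could be exported as a Prometheus gauge.
    println("live coroutines: ${DebugProbes.dumpCoroutinesInfo().size}")
    jobs.forEach { it.cancel() }
}
```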

@glasser
Contributor

glasser commented Aug 14, 2019

This does look pretty useful, but it also seems like it might have a notable performance impact?

The monitoring that looks attractive to me would be getting a gauge on the sizes of the CoroutineScheduler queues (global and local).

Our biggest fear is accidentally putting slow blocking work (or worse, deadlocks) in our main dispatcher (which happened to us once on a previous project using Kotlin coroutines incorrectly, and also when using Ratpack’s coroutine-style execution).

So getting alerted if work is building up over time (ie, if the queues are getting too big/growing indefinitely) seems helpful.

Would it be reasonable to expose some of these stats somewhere? These stats are specific to the CoroutineScheduler so I don't think kotlinx-coroutines-debug is relevant.

As an awful hack we are considering parsing (Dispatchers.Default as ExecutorCoroutineDispatcher).executor.toString(), with full understanding that it may break at any time.
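For the record, the hack looks roughly like this. The `toString()` output is an internal implementation detail, and the regex below is a guess at one historical format, so this is purely illustrative and expected to break across releases:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExecutorCoroutineDispatcher

// Relies on Dispatchers.Default being an ExecutorCoroutineDispatcher whose
// executor toString() dumps scheduler internals; both are unstable details.
fun globalQueueSizeHack(): Int? {
    val stats = (Dispatchers.Default as ExecutorCoroutineDispatcher)
        .executor.toString()
    return Regex("""global CPU queue size = (\d+)""")
        .find(stats)?.groupValues?.get(1)?.toIntOrNull()
}
```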

@elizarov elizarov reopened this Aug 14, 2019
@elizarov elizarov changed the title Coroutines Monitoring CoroutinesScheduler Monitoring Aug 14, 2019
@elizarov elizarov changed the title CoroutinesScheduler Monitoring Coroutine scheduler monitoring Aug 14, 2019
@elizarov
Contributor

The monitoring that looks attractive to me would be getting a gauge on the sizes of the CoroutineScheduler queues (global and local).

@glasser Yes, that can be done without the slow debug mode and makes sense. I'll keep it open as an enhancement.

@glasser
Contributor

glasser commented Aug 14, 2019

Thanks! Should I interpret that as "you're going to do it" or "you'd accept patches"?

@qwwdfsad
Contributor

Unfortunately, we are not ready to accept patches right now because the scheduler is being actively reworked.

But it would be really helpful if you could provide a more detailed example of the desired API shape and problem you want to solve with this API.

For example: "Ideally, we'd see it as a pluggable SPI service for the dispatcher with the following methods ..., so we could use it to trigger our monitoring if ..."

@glasser
Contributor

glasser commented Aug 14, 2019

Interesting — is there a branch or design doc or something for the reworking? Curious how it's changing.

My proposal is pretty simple. A few of the core objects involved with coroutine scheduling should be (a) publicly accessible and (b) expose a few properties that provide statistics about them. It's fine if these are documented as "experimental, subject to change, don't rely on this" and as "fetching these properties may have a performance impact if done frequently" (eg, ConcurrentLinkedQueue.size is O(n)).

Most specifically, I'd want to have access to

  • ExperimentalCoroutineDispatcher.coroutineScheduler (which perhaps would return an interface declared to only contain the metrics below)
  • LimitingDispatcher.queueSize: Int
  • CoroutineScheduler.corePoolSize: Int
  • CoroutineScheduler.maxPoolSize: Int
  • CoroutineScheduler.queueSizes: Map<WorkerState, List<Int>>
  • CoroutineScheduler.globalQueueSize: Int
  • CoroutineScheduler.schedulerName: String (for tagging in the unlikely case of multiple schedulers)
    (ie, basically all the stuff in CoroutineScheduler.toString(); I think getting the control state isn't super necessary.)

I don't need kotlinx.coroutines to provide any machinery for hooking this up to my metrics service: I'm happy to keep at application (or external library) level the code that takes the dispatchers I care about, polls them for metrics, and publishes to my metrics service of choice.
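To make the shape concrete, the properties above could be gathered into something like the following hypothetical interface (none of these names exist in kotlinx.coroutines today; `WorkerState` is keyed by name here to avoid exposing the internal enum):

```kotlin
// Hypothetical read-only metrics surface for the coroutine scheduler.
interface CoroutineSchedulerMetrics {
    /** For tagging, in the unlikely case of multiple schedulers. */
    val schedulerName: String
    val corePoolSize: Int
    val maxPoolSize: Int
    /** Size of the shared global queue. */
    val globalQueueSize: Int
    /** Local queue sizes of live workers, grouped by worker state name. */
    val queueSizes: Map<String, List<Int>>
}
```

An application would poll an instance of this periodically and publish the values to its metrics service of choice.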

@qwwdfsad
Contributor

qwwdfsad commented Aug 20, 2019

Interesting — is there a branch or design doc or something for the reworking? Curious how it's changing.

No to both, though the changes will, of course, be properly documented. Mostly it's about changing the parking/spinning strategy, without violating the liveness property, to reduce CPU consumption at low request rates and to make idle thread termination robust. The change is just too intrusive and touches every part of the scheduler.

Thanks for the details!
Could you please clarify: is this for an Android app or for a backend service?
Asking because there is also a chance that Dispatchers.Default will be backed by ForkJoinPool on Android by default (mostly to reduce dex size and thread count), so we would have to make this observability interoperate with FJP as well.

@glasser
Contributor

glasser commented Aug 20, 2019

This is for server usage.

We are currently porting a few web servers from Ratpack to Ktor. Ratpack has an async structure similar to Kotlin coroutines (with a recommended usage of a pool of "compute" threads roughly equal in size to the number of CPUs, plus a scaling "blocking" pool), but because all work has to be done with explicit Promise composition rather than the nice syntax of Kotlin coroutines, we've found that developers often don't bother to keep blocking work out of the compute pool, and often implement error handling incorrectly (eg, by putting try/catch/finally or retry loops around functions that return Promises rather than properly using the Promise API). Our hope is that Kotlin coroutines will be much more accessible. But we still want to monitor that we're not clogging up the pools!

(Ratpack Promises also have some other odd behavior. For example, Blocking.get {}, which is somewhat like withContext(Dispatchers.IO) {}, does not actually invoke the given block on the scalable thread pool until the currently-running code fully returns to the event loop (the equivalent of suspension). This meant that some misguided attempts to make a blocking call within a non-Promise-returning function use the "right" thread pool by writing (effectively) Blocking.get {}.get() not only tied up the current thread as you might expect, but actually blocked indefinitely, because the block never got run! Hopefully our complete rewrite will avoid these border cases.)

@cprice404

+1 to everything that @glasser said. Looking to start replacing some thread pools with coroutines in our high-volume, production, back-end service, and would feel a lot better about it if we had some way to emit metrics about the health of the pools/scheduler. Thanks!

@lfmunoz

lfmunoz commented Jan 10, 2020

I have an app that launches millions of CPU-bound coroutines, and they are taking longer than expected to complete. I'm wondering whether the overhead of scheduling and executing them is the cause. I would like to have monitoring on the queue size for this reason.

@damian-pacierpnik-jamf

damian-pacierpnik-jamf commented Aug 26, 2020

Any updates on this? Any news on when it may be implemented? We are also interested in monitoring the number of coroutines, and it is really disappointing that such a basic metric is not available by default.

@anderssv

Any updates on this? Any other ways of getting similar numbers? Wanting metrics basically because of the same reasons as @glasser . :)

@vikiselev

vikiselev commented Nov 9, 2020

Any updates? I'm interested as well.

@premnirmal

Also interested in this

@qwwdfsad
Contributor

qwwdfsad commented Apr 6, 2021

We aim to implement it in the next releases after 1.5.0

@joost-de-vries

Our use case is also high load server side.
In addition to the metrics glasser mentioned:

  • latency
  • completed tasks.

@soudmaijer
Contributor

@qwwdfsad any updates? Also very much interested in this.

@joost-de-vries

@soudmaijer for us this is so critical that I implemented the 'awful hack' that glasser mentioned. See https://github.com/joost-de-vries/spring-reactor-coroutine-metrics/tree/coroutineDispatcherMetrics/src/main/kotlin/metrics

@cprice404

We aim to implement it in the next releases after 1.5.0

Does that mean that this will be addressed in 1.6.0 (which appears to be close to release)?

@dovchinnikov
Contributor

In IJ we have our own unlimited executor (let's call it ApplicationPool). We log a thread dump when the number of threads exceeds a certain value, but we don't prevent spawning new threads. I'd like to replace ApplicationPool with Dispatchers.IO.limitedParallelism(MAX_VALUE), but I'm missing the diagnostics part.

Using an effectively unlimited IO dispatcher would allow us to drop our own executor service (a single-pool-for-the-whole-app approach) and avoid the unnecessary thread switches that inevitably happen between Dispatchers.Default and ApplicationPool.asCoroutineDispatcher().
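A minimal sketch of that replacement: an effectively unlimited dispatcher view sharing the Dispatchers.IO thread pool (limitedParallelism is experimental API at the time of writing, hence the opt-in):

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi

// Drop-in replacement for an app-wide unbounded executor: threads are
// borrowed from the shared IO pool, so no extra pool or thread switches.
@OptIn(ExperimentalCoroutinesApi::class)
val applicationPool = Dispatchers.IO.limitedParallelism(Int.MAX_VALUE)
```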

@jaredjstewart

Is there any update on this issue?

@chenzhihui28

any update?

@glasser
Contributor

glasser commented Apr 14, 2023

@joost-de-vries is your hack still working out reasonably well for you?

@cleidiano

cleidiano commented Apr 25, 2024

Is there any update on this issue?
