-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
behavior of context and trace in the presence of timers #3144
Comments
After working with tracing for almost 8yrs I don't think I can formulate a one-size-fits-all strategy of how to deal with deferred work. It's very much dependent on the specific scenario and how the application wants to define the "workflow". |
I agree. If I did not misunderstand - current SDK implementations are free to implement whatever behavior they desire, with no consistency no documentation, and no configuration options. These can lead to a bad user experience and infinite traces by default. Tried to trace the reasoning behind the decisions in JS SDK, but it seems just random and not following any documented logic to justify the implementation. @vmarchaud - maybe you know why things were written as they are? |
I don't agree on this, i'm pretty sure that if you start a async loop from within a trace, most SDKs will tie each iteration to the parent trace resulting in infinite traces. I agree that you might not want that in your case but this is consistent with how context propagation works in other cases.
It isn't a decision for timer per se, more of a logical pattern across nodejs callback in general so it's not random. The implementation is mostly ported from opencensus which was then ported from google tracing agents which was inspired from what everyone was doing at the time. Again i can understand that you might not want that in the browser but saying that its random isn't true, it's how everyone was tracing code since it was available in NodeJS (i believe around node 4). |
Thank you @vmarchaud for the input.
What kind of loop are you referring to? for example, this loop will behave differently in node and browser: import api from '@opentelemetry/api';
const tickTracer = api.trace.getTracer('timeout');
const tick = () => {
// in zonejs (browser) - the active context here is `clickSpan`
// in async hooks, the active context here is `ROOT_CONTEXT`
const span = tickTracer.startSpan('interval timeout');
// span will be root trace in nodejs, but will be child of clickSpan in browser
// while the browser tab is open (could be days), more and more spans are added to the `click` trace
api.context.with(api.trace.setSpan(api.context.active(), span), () => {
// process timeout
});
span.end();
}
const clickSpan = api.trace.getTracer('user-interaction').startSpan('click');
api.context.with(api.trace.setSpan(api.context.active(), clickSpan), () => {
setInterval(tick, 1000);
});
clickSpan.end();
In this case, maybe the node implementation is not aligned and needs to propagate the context across timers to be consistent with other SDKs? I think addressing this in spec in a loose manner (SHOULD) will contribute to a consistent user experience and mental model for what traces represents.
In the browser, timers callbacks preserve context because this is how zonejs operates by default. Other callbacks will not be picked up by zone and will thus not preserve context, so browser is already not consistent in this manner, which is what I am suggesting to fix.
Thank you for the input on how things came together. What I meant to say is that the behavior of (at least 2) sdks varies in this aspect, leading to inconsistent behavior which users cannot anticipate.
I was wondering if you know from past discussions if this implementation is intentional. are there any reasons to connect timers in browser and not connect them in node? |
By the way, looks like splunk already encountered a similar issue and are offering custom context manger for the browser to overcome this issue in their otel distro
|
I agree. At the end of the line, one has to install a context manager and it will be implemented one way or another. So my question is more of - should there be a decent default? or is it left for the SDK to decide based on it's knowledge of the runtime |
That's not how it should, setTimeout continue the context for exemple so setInterval should too. What nodejs version are you using to test this ? For me that's a bug.
Adding it in the spec will just make the browser sdk not in spec. We can't track other callbacks in the browser because there are no API to do so, it's a known issue and sadly a enormous task that @legendecas has started (https://github.com/legendecas/proposal-async-context).
Adding on my previous comment, we are constrained by the platform each sdk target, i don't think it's reasonable to expect a standard behavior across all of them (node, browser, lambdas and in the future deno, workers like cloudflare or even bun). However i'm not saying it's impossible but considering known resources (both in the community or companies around the space) it's complicated right now. For example it took Qard ~13 months to land implementing promises hook into NodeJS: nodejs/node#36394 + https://chromium-review.googlesource.com/c/v8/v8/+/2759188) which was i believe sponsored by Datadog. You can see the similar kind of work for Deno in denoland/deno#7010 |
Besides the fragmentation within JS the spec would have to consider other techs like Java. As far as I know Java uses thread local on default for context propagation which works nice for synchronous calls but may fail for async, thread pool based frameworks. Limiting discussion to timer APIs is also just a small part of the topic. In JS if you promisify them special handling for timer APIs is likely lost as promise context propagation takes over. And there are other functionalities which might prolong a trace to a potential unrelated chain of operations. In the end the data collecting side should match to the backend. It's of no help that OTel instrumentations correctly link spans for days or weeks if the backend (or the user on the backend) only wants to correlated for < 2 hours. Therefore this is something for the tuning area not to specify once for all. |
FYI the Java instrumentation agent contains several instrumentations that carry the current context across threads - but only for one-off async tasks. Repeatable scheduled operations (e.g. |
@vmarchaud , I checked again in node, and it seems you are right - the context is indeed passed across timers invocations. My issue was different. Sorry about that. So looks like the behaviors are consistent which is great. I guess for me and for other cases (java distribution and splunk distribution of the browser SDK), it makes sense to start a new trace for timers that are set to happen in the future (
The spec should not define required behaviors that are not feasible to implement. It can describe the preferred behavior as a SHOULD requirement which allows SDKs to diverge if needed. However, I do think that in js it is possible to achieve various behaviors with instrumentation of the timer functions / libaries or registering to zonejs/asynchooks hooks to alter the behavior depending on the kind of trigger. It could be amazing to have some text in the spec to promote consistency and help users know what to expect. It could also be great if the spec explicitly allows implementations to be configured to support various useful cases which users can set to their needs. No matter what, eventually the same spans are generated in all cases, but how to connect them together into traces and where the boundaries are set is a subjective matter, and users might want it one way or another. |
I don't see how both statement can be true at the same time, either it's consistent or it's not.
I understand the use case but i'm pretty sure it's not possible using AsyncLocalStorage in node so you must use AsyncHooks which isn't great. I think you can just implement your own context manager to solve this. |
I was Imagining something like "consistent if possible", to describe the desired expected behavior if it is possible to implement, so at least the common experience is as consistent as possible.
In JS one can patch the |
Yeah sorry i wasn't clear, i meant you can't do this just with ALS so you must patch timers manually which requires having to write your own context manager |
Edit: it turns out I was wrong - nodejs and web ARE consistent across timers. This issue still discusses if the default behavior is the most useful one for normal users, and if it makes sense for the spec to state the behavior explicitly.
I recently started using OpenTelemetry for the browser, using the JS SDK.
In OTEL JS, there are separate context managers for browser and node, one uses nodejs async hooks
which only preserves context across async operations, and the second useszone.js
librarywhich preserves context across async calls, timers, event handlers etc. My findings are that it creates a different behavior for the same JS code that runs in the browser vs node - single large trace vs multiple short unrealted fragmented tracesIn our application, the browser has some polling that fetches an API once every 30 seconds. The generated trace using the current context manager which propagate context across timers makes it so we group all these operations into one single infinite trace which can last for hours and contain many thousands of spans (across browser and the instrumented services). This behavior is not friendly to our tail sampler, and the huge output trace makes it difficult for human end users to extract value from such traces.
I searched the spec to try to understand what is the expected behavior for context managers but couldn't find any reference. Generally, my question is:
I guess different users can have different needs depending on the use case, but:
Would love to hear more opinions on this issue
Do you think the spec is the right place to define the expected behavior and give guidance on how to instrument in practice?
The text was updated successfully, but these errors were encountered: