Smarter waiting for late spans in tailsamplingprocessor #31498

Closed
djluck opened this issue Feb 29, 2024 · 18 comments
Labels
enhancement New feature or request processor/tailsampling Tail sampling processor

Comments

@djluck

djluck commented Feb 29, 2024

Component(s)

No response

Is your feature request related to a problem? Please describe.

Currently we have to deploy the tailsamplingprocessor with a reasonably large decision_wait value (2m). We do this to capture our long-tail traces, but it imposes an unwelcome cost because of how completed traces are buffered in memory:

  1. Memory use is higher than it needs to be: once we've received the root span, waiting the full decision_wait period seems unnecessary (although it makes sense to wait a short period for any lagging spans that are part of the trace).
  2. Completed traces are delayed for decision_wait, even though they are ready to view. In our case, this adds two minutes of latency to every trace!

Describe the solution you'd like

I'm not entirely sure what the solution looks like here, but some thoughts:

  1. Once we receive the root span, we consider the trace complete and send it on.
  2. For traces that start externally, it would be useful to have some mechanism to consider a span as the "root" and to trigger the above behavior.
  3. We might maintain a list of trace ids for recently sampled traces: this would ensure that any late spans can be forwarded on appropriately.

Describe alternatives you've considered

I'm aware of no other alternatives.

Additional context

The services we tail sample deal with very high throughput and can process 100K+ spans/sec. Attempting to tail sample at this volume imposes a significant memory overhead, considering we need to buffer roughly 12 million spans (120s × 100K spans/sec).

@djluck djluck added enhancement New feature or request needs triage New item requiring triage labels Feb 29, 2024
@crobert-1 crobert-1 added the processor/tailsampling Tail sampling processor label Feb 29, 2024

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jiekun
Member

jiekun commented Mar 1, 2024

Hello. I think this idea is great. Currently, in tail sampling, there is no check to determine whether a span is the root span; the actual analysis is only triggered by the decision timer tick.
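
A minimal sketch of such a root-span check, assuming the collector's standard pdata API (go.opentelemetry.io/collector/pdata): a span with no parent span ID set is treated as the root. This is only an illustration of the heuristic; traces that start externally (point 2 of the original issue) would still need a separate mechanism.

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// isRootSpan reports whether a span looks like a trace's root span,
// i.e. it has no parent span ID set.
func isRootSpan(span ptrace.Span) bool {
	return span.ParentSpanID().IsEmpty()
}

func main() {
	td := ptrace.NewTraces()
	span := td.ResourceSpans().AppendEmpty().ScopeSpans().AppendEmpty().Spans().AppendEmpty()
	// No parent span ID has been set on this span, so it counts as a root span.
	fmt.Println(isRootSpan(span)) // true
}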

I would like to suggest some modifications based on your idea:

  1. Receiving the root span does not guarantee that all spans in the trace have reached the collector, but it can be assumed that most spans have been generated and are on their way. Therefore, if decision_wait is set to 30s, upon receiving the root span the processor doesn't need to wait for the entire duration; it can wait a shorter period, such as 5s, to catch the remaining spans.
  2. For services that generate async spans, users can stick to the current decision_wait mechanism and wait the full 30s.

So the configuration layout could be:

processors:
  tail_sampling:
    decision_wait: 30s
    # new config here. Defaults to empty, which is equivalent to the current mechanism.
    # It could also be set to 0s so that analysis starts as soon as the root span is received.
    decision_wait_after_root_span: 5s
    num_traces: 100
    expected_new_traces_per_sec: 10
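
As a rough sketch of the timing behaviour this option would add (the function and parameter names below are illustrative, not the processor's actual internals): every trace keeps the full decision_wait measured from its first span, but once the root span arrives the decision deadline is pulled in to the shorter decision_wait_after_root_span, never pushed out.

package main

import (
	"fmt"
	"time"
)

// decisionDeadline returns when the sampling decision should be made for a
// trace. The baseline is firstSpanAt + decisionWait; if the root span has been
// seen, the deadline is shortened to rootSeenAt + decisionWaitAfterRootSpan,
// but never extended past the original deadline.
func decisionDeadline(firstSpanAt, rootSeenAt time.Time, rootSeen bool,
	decisionWait, decisionWaitAfterRootSpan time.Duration) time.Time {

	deadline := firstSpanAt.Add(decisionWait)
	if rootSeen {
		early := rootSeenAt.Add(decisionWaitAfterRootSpan)
		if early.Before(deadline) {
			deadline = early
		}
	}
	return deadline
}

func main() {
	start := time.Now()
	// Root span arrives 3s after the first span; with decision_wait=30s and
	// decision_wait_after_root_span=5s the decision happens ~8s in, not 30s.
	deadline := decisionDeadline(start, start.Add(3*time.Second), true,
		30*time.Second, 5*time.Second)
	fmt.Println(deadline.Sub(start)) // 8s
}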

@djluck
Author

djluck commented Mar 2, 2024

Hey @jiekun, thanks for your thoughts! You raise some good points, especially around async spans (I hadn't considered this edge case).

For the above reason, perhaps it's not enough to just use a shorter wait for any lagging or async spans? Perhaps maintaining a set of trace IDs that we have sampled in the past X minutes is a better approach. It'll consume less memory than storing all spans for the full decision_wait duration, so we can make the window X much longer. This would allow us to instantly forward any late spans and handle the long tail of traces far more gracefully.

There's also the question of spans that arrive late for traces we decided not to sample: what should we do with these? Do they become interesting because they've arrived late?

@jiekun
Member

jiekun commented Mar 4, 2024

Perhaps maintaining a set of trace ids that we have sampled in the past X minutes is a better approach.

I understand your goal of reducing memory usage while maintaining sampling accuracy, which is great! However, I'm concerned that it might make the processor more complex and harder to understand. Additionally, users who don't need to worry about asynchronous spans and who upgrade without adjusting the tail sampling latency will only notice an increase in memory consumption from storing the additional trace IDs.

I still support these new ideas, but they may require the support of the maintainer. Personally, I would be more than happy to implement them in our internal collector :)

@jiekun
Member

jiekun commented Mar 4, 2024

BTW, may I ask if you plan to submit a PR for those ideas or just the feature request?

I would like to split them into (at least) 2 parts:

  1. Support decision_wait_after_root_span.
  2. Optimize/add a trace_id cache and sample async spans that arrive late.

They can be implemented independently if we have support from the maintainer.

@jpkrohling
Member

I've been talking to a few people about a decision cache, which should solve the second problem. The idea would be to have a simple map of trace IDs to boolean values, indicating whether they were sampled. Note that we want to cache both the positive and the negative answer: we don't want to sample spans for a trace that was rejected before, and we want to sample spans for traces that were accepted. A limitation we need to document is that this cache isn't distributed, so a scalable tail-sampling setup will likely still have the same problems as today if spans land on different collectors than the ones where the decisions were made, potentially because of topology changes.

My original idea was to implement a ring buffer as the cache, so that we have a fixed number of decisions in memory. I also considered an LRU, but I'm not sure it brings any benefits.
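
A minimal sketch of the kind of decision cache described above, assuming the shape discussed here (a map from trace ID to a sampled/not-sampled flag, bounded by a ring buffer that evicts the oldest decision when full). All names are illustrative; this is not the processor's actual implementation.

package main

import "fmt"

type traceID [16]byte

// decisionCache is a fixed-size, non-distributed cache of sampling decisions:
// a map for O(1) lookup plus a ring buffer that bounds memory by evicting the
// oldest decision once the cache is full.
type decisionCache struct {
	decisions map[traceID]bool
	ring      []traceID
	next      int
	full      bool
}

func newDecisionCache(size int) *decisionCache {
	return &decisionCache{
		decisions: make(map[traceID]bool, size),
		ring:      make([]traceID, size),
	}
}

// Put records a positive or negative sampling decision for a trace.
func (c *decisionCache) Put(id traceID, sampled bool) {
	if _, ok := c.decisions[id]; !ok {
		if c.full {
			delete(c.decisions, c.ring[c.next]) // evict the oldest decision
		}
		c.ring[c.next] = id
		c.next = (c.next + 1) % len(c.ring)
		if c.next == 0 {
			c.full = true
		}
	}
	c.decisions[id] = sampled
}

// Get returns the cached decision and whether one exists, so late-arriving
// spans can be forwarded (sampled) or dropped (not sampled) immediately.
func (c *decisionCache) Get(id traceID) (sampled, ok bool) {
	sampled, ok = c.decisions[id]
	return
}

func main() {
	c := newDecisionCache(2)
	c.Put(traceID{1}, true)
	c.Put(traceID{2}, false)
	c.Put(traceID{3}, true) // evicts the decision for traceID{1}
	_, ok := c.Get(traceID{1})
	fmt.Println(ok) // false: the oldest decision was evicted
}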

@jpkrohling
Member

cc @kentquirk, as I believe you are interested in those components as well

@djluck
Author

djluck commented Mar 4, 2024

Thanks for your thoughts @jpkrohling, I had a similar thought about the negative case but was thinking about a different solution: what about pairing the set of sampled trace IDs with a bloom filter containing the set of non-sampled trace IDs?

The idea behind this approach is based on the premise that the tail sampler is likely to reject more traces than it accepts. Using a map for both the accepted and rejected trace IDs would increase memory use further; a bloom filter could optimize the rejected case.

EDIT: I thought about it a bit more and modelled some potential parameters. I don't think the bloom filter would be the first implementation choice: the map would be simpler and wouldn't require orders of magnitude more memory.

@djluck
Author

djluck commented Mar 5, 2024

@jiekun I'm interested in submitting a PR, but if you're keen to work on this too, we could always split the work between us. Happy either way 👍

@jpkrohling
Member

wouldn't require orders of magnitude more memory

Right, traceID is only 16 bytes, which means that ~10MiB is enough to store more than 650k entries in the cache, if my math is right.
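
For reference, the arithmetic (ignoring per-entry overhead of the map or ring buffer holding the IDs): 10 MiB / 16 bytes per trace ID = 10 × 1,048,576 / 16 = 655,360 entries, so the ~650k figure checks out.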

@jpkrohling
Member

I recorded some of the things I had in mind here: #31580

@crobert-1
Member

It looks like there's been a lot of good discussion here, and some action items have been noted. Removing needs triage.

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Mar 5, 2024
@djluck
Author

djluck commented Mar 5, 2024

I'm happy to start thinking about #31583 if that's fine with everyone. I can move future discussion around the decision cache into this issue.

EDIT: wrong issue link, was referring to decision cache only

Contributor

github-actions bot commented May 6, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jamesmoessis
Contributor

Personally, I'm also wondering if disk can play a role in storing longer-lived traces. Or, additionally, an option to compress the spans in memory before caching them (if you are willing to take the CPU tradeoff).


@jpkrohling
Member

I believe this has been addressed by the decision caches recently implemented by @jamesmoessis. If that's not the case, feel free to reopen.
