Trace-preserving mode for processor/tailsampling #25122

Open
garry-cairns opened this issue Aug 9, 2023 · 11 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed), processor/tailsampling (Tail sampling processor)

Comments

@garry-cairns
Contributor

garry-cairns commented Aug 9, 2023

Component(s)

processor/tailsampling

Is your feature request related to a problem? Please describe.

We would like to use tail-based sampling because we believe it will give better insights into our running processes than head-based, and we have far too much data volume to store 100% of traces. We would, however, like to retain connections between our aggregated metrics, which we produce using the spanmetrics connector, and our stored traces. This is not currently possible.

Describe the solution you'd like

We would like a configurable option that separates the concern of sampling from that of filtering. In this model, the tail sampling processor could be configured in a "soft" mode (the name isn't important if you prefer another) that simply updates sampling.priority on all spans of a trace it has decided to sample and does no filtering. Subsequent processors, including but not limited to spanmetrics, could then use this information. The user would be responsible for filtering unsampled traces/spans with the filter processor in their trace pipeline(s).

To expand on the connector/spanmetrics example: this would involve a separate feature request to make its exemplar behavior smarter, so that when such an attribute is present it only uses trace IDs with sampling.priority > 0 as exemplars of aggregated metrics. Spanmetrics could then produce accurate metrics based on 100% of traces, which it needs, without incurring the cost of storing all of those traces.

[Attached image: sampling]
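For illustration, a collector configuration using the proposed mode might look roughly like the sketch below. The decision_only flag, the convention that unsampled spans carry sampling.priority == 0, and the backend endpoint are hypothetical placeholders for this proposal rather than existing options; the tail_sampling policy, filter processor, spanmetrics connector, and forward connector are existing components used here only as a sketch.

receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    # Hypothetical "soft" mode: record the decision as sampling.priority on
    # every span of a trace instead of dropping unsampled traces.
    decision_only: true   # not an existing option; part of this proposal
    decision_wait: 30s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
  filter/unsampled:
    # Drop spans the first stage marked as not sampled (assumed convention:
    # sampling.priority == 0 means "not sampled").
    error_mode: ignore
    traces:
      span:
        - 'attributes["sampling.priority"] == 0'
  batch:

connectors:
  spanmetrics:   # sees 100% of spans, so its metrics stay accurate
  forward:

exporters:
  otlp:
    endpoint: backend:4317   # placeholder

service:
  pipelines:
    traces:                # all spans, with the sampling decision recorded
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [spanmetrics, forward]
    traces/export:         # only spans from sampled traces are stored
      receivers: [forward]
      processors: [filter/unsampled, batch]
      exporters: [otlp]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlp]

With a shape like this, spanmetrics would compute metrics (and could pick exemplars) from every span, while only spans belonging to sampled traces would reach the trace backend.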

Describe alternatives you've considered

One alternative we considered was changing spanmetrics so that it would mutate any trace it used as an exemplar, making the connection between its metrics and the traces they were derived from simpler. But that would mean further changes to spanmetrics, which currently stores references to 100% of the traces it uses to produce its output as "exemplars", and it would couple the solution too tightly to spanmetrics. Our preferred solution leaves current behavior in place for those relying on it, while offering a clean separation of concerns that gives other users much more flexibility to innovate with their pipelines.

Additional context

We are working in an environment with many thousands of hosts running hundreds of thousands of services, each of which may pass context belonging to the same logical traces between them.

@garry-cairns garry-cairns added the enhancement and needs triage labels Aug 9, 2023
@github-actions github-actions bot added the processor/tailsampling label Aug 9, 2023
@github-actions
Contributor

github-actions bot commented Aug 9, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling jpkrohling removed the needs triage label Aug 22, 2023
@jpkrohling
Member

I see the problem, and instead of using the filter processor, it would probably make sense to use a second-stage sampler as a connector:

receivers:
  otlp:

processors:
  firststagesampling:   # (our current tail-sampling processor?)
  spanmetrics:
  batch:

exporters:
  otlp:

connectors:
  secondstagesampling:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [firststagesampling, spanmetrics]
      exporters: [secondstagesampling]
    traces/export:
      receivers: [secondstagesampling]
      processors: [batch]
      exporters: [otlp]

I'm not sure I would use the current tail-sampling for that.

@garry-cairns
Contributor Author

garry-cairns commented Aug 22, 2023

I like the pipeline design, and would likely use it, but couldn't we just use the existing routing connector, with the first-stage sampling decision as the criterion it routes on? (This may have been your intent, but it wasn't clear to me, so let me know.)
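Very roughly, and untested, something like the sketch below is what I have in mind. It assumes the first-stage decision is visible to the routing connector as a resource-level attribute, since the connector's route() statements match on resource attributes; the sampling.priority name, its value convention, and the firststagesampling component are placeholders carried over from the sketch above.

connectors:
  routing:
    error_mode: ignore
    table:
      # assumes the sampling decision is recorded on the resource, since
      # route() statements match resource attributes
      - statement: route() where attributes["sampling.priority"] > 0
        pipelines: [traces/sampled]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [firststagesampling, spanmetrics]   # spanmetrics still sees 100% of spans
      exporters: [routing]
    traces/sampled:
      receivers: [routing]
      processors: [batch]
      exporters: [otlp]

Traces matching no route (with no default_pipelines configured) would, as I understand it, simply be dropped, which would act as the filtering step.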

@jpkrohling
Member

The idea is that the first-stage sampling will appropriately mark the root spans with the sampling decision, and the second-stage sampling will effectively sample out the traces that were not marked as selected. While the routing connector has some of the same features (filtering out data that is not relevant for a pipeline's specific exporter), I think having sampling in two stages will give a better user experience.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Oct 23, 2023
@garry-cairns
Contributor Author

I've got some capacity just now so I'm going to have a go at implementing this.

@github-actions github-actions bot removed the Stale label Dec 16, 2023

@github-actions github-actions bot added the Stale label Feb 15, 2024
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Apr 15, 2024
@jpkrohling jpkrohling self-assigned this Apr 30, 2024
@jpkrohling jpkrohling reopened this Apr 30, 2024

@github-actions github-actions bot added the Stale label Jul 1, 2024
@jpkrohling jpkrohling added the help wanted label and removed the Stale label Jul 8, 2024
@jpkrohling jpkrohling removed their assignment Jul 8, 2024

@github-actions github-actions bot added the Stale label Sep 9, 2024
@jpkrohling jpkrohling removed the Stale label Sep 9, 2024

@github-actions github-actions bot added the Stale label Nov 11, 2024
@jpkrohling jpkrohling removed the Stale label Dec 4, 2024