Decision cache for "keep" trace IDs #33533

Merged

Conversation

jamesmoessis (Contributor)

Description:

Adds a simple LRU decision cache for sampled trace IDs.

The design makes it easy to add another cache for non-sampled IDs.

It stores no information other than the sampled trace ID, and it holds only the right half of the trace ID (as a uint64) in the cache.

By default the cache is a no-op; it becomes active only when the user configures a cache size.
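
For illustration, a minimal self-contained sketch of this design (all names and the eviction code here are illustrative, not the PR's actual implementation): only the low 8 bytes of each sampled trace ID are stored, and the least recently used entry is evicted once the cache is full.

package decisioncache

import (
	"container/list"
	"encoding/binary"
)

// lruDecisionCache remembers "keep" decisions, keyed by the right half
// of the trace ID so each entry costs a fixed 8 bytes.
type lruDecisionCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[uint64]*list.Element // key -> node in order; node holds the key
}

func newLRUDecisionCache(capacity int) *lruDecisionCache {
	return &lruDecisionCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[uint64]*list.Element, capacity),
	}
}

// rightHalf collapses a 16-byte trace ID to its low 8 bytes.
func rightHalf(traceID [16]byte) uint64 {
	return binary.BigEndian.Uint64(traceID[8:])
}

// Put records a "keep" decision, evicting the least recently used entry
// once the cache is full.
func (c *lruDecisionCache) Put(traceID [16]byte) {
	key := rightHalf(traceID)
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return
	}
	if c.capacity > 0 && c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(uint64))
	}
	c.items[key] = c.order.PushFront(key)
}

// Get reports whether a "keep" decision is cached for the trace ID.
func (c *lruDecisionCache) Get(traceID [16]byte) bool {
	el, ok := c.items[rightHalf(traceID)]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}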

Link to tracking Issue: #31583

Testing:

  • unit tests on new code
  • a test in processor_decision_test.go verifying that a trace that was sampled and cached retains its "keep" decision even after its span data has been dropped.

Documentation: Added description to README

@jamesmoessis jamesmoessis requested a review from jpkrohling as a code owner June 13, 2024 04:40
@jamesmoessis jamesmoessis requested a review from a team June 13, 2024 04:40
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jun 13, 2024
@jpkrohling jpkrohling assigned jpkrohling and unassigned mx-psi Jun 13, 2024
@jpkrohling (Member) left a comment:

This is a great start, thank you!

Review thread on processor/tailsamplingprocessor/processor.go (outdated, resolved)
@@ -0,0 +1,14 @@
package decisioncache
Member:

I have two suggestions about the interface:

  1. perhaps this could be a more generic cache, receiving key and value types as parameters?
  2. a delete operation, which might be a no-op for the LRU implementation

It would then look like this:

package cache

type Cache[T comparable, V any] interface {
	// Get returns the value for the given key if it exists in the cache.
	Get(key T) (V, bool)
	// Put sets the value for the given key.
	Put(key T, value V)
	// Delete deletes the value for the given key.
	Delete(key T)
}
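
For context, a trivial map-backed implementation showing how such an interface would be consumed (illustrative only: a plain map never evicts, so the PR would use an LRU behind the same interface):

package main

import "fmt"

// Cache mirrors the interface suggested above.
type Cache[T comparable, V any] interface {
	Get(key T) (V, bool)
	Put(key T, value V)
	Delete(key T)
}

// mapCache is an unbounded toy implementation of Cache.
type mapCache[T comparable, V any] struct{ m map[T]V }

func newMapCache[T comparable, V any]() *mapCache[T, V] {
	return &mapCache[T, V]{m: make(map[T]V)}
}

func (c *mapCache[T, V]) Get(key T) (V, bool) { v, ok := c.m[key]; return v, ok }
func (c *mapCache[T, V]) Put(key T, value V)  { c.m[key] = value }
func (c *mapCache[T, V]) Delete(key T)        { delete(c.m, key) }

func main() {
	// Record and look up a "keep" decision keyed by a trace-ID half.
	var decisions Cache[uint64, bool] = newMapCache[uint64, bool]()
	decisions.Put(0xABCD, true)
	if keep, ok := decisions.Get(0xABCD); ok {
		fmt.Println("cached decision:", keep)
	}
}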

Member:

For completeness, I think we could use this to eventually replace the idToTrace map as well, if we see better performance with the other implementation.

Member:

Further idea, perhaps for a follow-up PR: make the caches observable. We probably want to know the following metrics:

  • cache hits
  • cache misses
  • cache size

This could be a `CacheMetrics` struct, the cache interface could return those counters/gauges, or the individual cache implementations could accept the TelemetryBuilder instance we create in the new-processor function.
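
One way the `CacheMetrics` idea could look, sketched under assumptions (names are illustrative, and only hits/misses are counted here, since an eviction-aware size gauge is better reported by the cache implementation itself):

package cachemetrics

import "sync/atomic"

// Cache is the interface sketched earlier in this thread.
type Cache[T comparable, V any] interface {
	Get(key T) (V, bool)
	Put(key T, value V)
	Delete(key T)
}

// CacheMetrics holds the counters suggested above.
type CacheMetrics struct {
	Hits   atomic.Int64 // Get calls that found an entry
	Misses atomic.Int64 // Get calls that found nothing
}

// instrumentedCache decorates any Cache with hit/miss counting.
type instrumentedCache[T comparable, V any] struct {
	inner   Cache[T, V]
	metrics *CacheMetrics
}

func (c *instrumentedCache[T, V]) Get(key T) (V, bool) {
	v, ok := c.inner.Get(key)
	if ok {
		c.metrics.Hits.Add(1)
	} else {
		c.metrics.Misses.Add(1)
	}
	return v, ok
}

func (c *instrumentedCache[T, V]) Put(key T, value V) { c.inner.Put(key, value) }
func (c *instrumentedCache[T, V]) Delete(key T)       { c.inner.Delete(key) }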

Contributor Author:

All good ideas, working on them now

@jamesmoessis (Contributor Author), Jun 14, 2024:

I think the interface suggestions make sense, but after playing around with the code I think it's easier to use just `[V any]` as the only generic, since the trace ID will always be the key. Structured like this, we can make optimisations around the key in the implementation layer (like using a uint64 instead of pcommon.TraceID internally). This would work for replacing idToTrace as well; a sketch of that shape follows below.

I like the idea of the observable caches as well. Would be happy to do a follow-up PR for that.

Edit: I've pushed it up
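
A sketch of that revised shape (an approximation of the direction described, not the merged code verbatim):

package cache

import "go.opentelemetry.io/collector/pdata/pcommon"

// Cache is keyed by trace ID only, leaving implementations free to
// optimise the key internally (e.g. hashing it down to a uint64).
type Cache[V any] interface {
	// Get returns the value for the given trace ID, if present.
	Get(id pcommon.TraceID) (V, bool)
	// Put stores the value for the given trace ID.
	Put(id pcommon.TraceID, v V)
	// Delete removes the value for the given trace ID.
	Delete(id pcommon.TraceID)
}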

@@ -45,6 +45,11 @@ The following configuration options can also be modified:
- `decision_wait` (default = 30s): Wait time since the first span of a trace before making a sampling decision
- `num_traces` (default = 50000): Number of traces kept in memory.
- `expected_new_traces_per_sec` (default = 0): Expected number of new traces (helps in allocating data structures)
- `decision_cache` (default = `sampled_cache_size: 0`): Configures amount of trace IDs to be kept in an LRU cache,
Member:

I think you mentioned it before, but why not a single decision cache where the decision is the value you get back from the cache? We have two states already, sampled and not sampled. The boolean already being added to the cache could serve this purpose. Or is the point to allow users to keep a bigger cache for sampled than for not sampled?

@jamesmoessis (Contributor Author), Jun 14, 2024:

The idea was to have separate caches for sampled and not-sampled traces. This means you can have different sizes and performance characteristics for each. I thought this comment explained it well: #31583 (comment)

For example, the "do not keep" cache can be a different structure, like a cuckoo filter. I also probably want to keep the different decisions for different lengths of time.

Member:

This explanation definitely belongs in the README!

Review thread on processor/tailsamplingprocessor/processor.go (outdated, resolved)
@jamesmoessis jamesmoessis force-pushed the jmoe/tailsampling-decisioncache branch from 1b3463c to f3bcac9 Compare June 14, 2024 06:42
@jamesmoessis (Contributor Author):

Thanks @jpkrohling for the review and top suggestions. I've addressed the suggestions (some with slight alterations).

@jamesmoessis jamesmoessis force-pushed the jmoe/tailsampling-decisioncache branch from f3bcac9 to e19917f Compare June 17, 2024 05:27
@jpkrohling (Member) left a comment:

I'll merge this as is, as I think it's a great feature to have already. The last comments can be addressed in a follow-up PR.

Review thread on processor/tailsamplingprocessor/config.go (resolved)
@@ -2,6 +2,8 @@ tail_sampling:
decision_wait: 10s
num_traces: 100
expected_new_traces_per_sec: 10
decision_cache:
sampled_cache_size: 500
Member:

If our recommendation is to keep the number of cached items an order of magnitude higher, we should make this much bigger than `num_traces` in this example.
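
For instance, keeping that order-of-magnitude guidance, the example might become (illustrative values, not the committed example):

tail_sampling:
  decision_wait: 10s
  num_traces: 100
  expected_new_traces_per_sec: 10
  decision_cache:
    sampled_cache_size: 1000  # at least 10x num_traces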

- `decision_cache` (default = `sampled_cache_size: 0`): Configures amount of trace IDs to be kept in an LRU cache,
persisting the "keep" decisions for traces that may have already been released from memory.
By default, the size is 0 and the cache is inactive.
If using, configure this to be much higher than `num_traces` so decisions for trace IDs are kept
Member:

The beauty of this cache is that its memory footprint is predictable: it's 8 bytes for the uint64 plus one byte for the boolean, per entry (plus the overhead from the cache implementation itself). So we could measure what that overhead is per entry and add some guidance here, like:

if you want to allocate 100MB to this cache, you can set this value to 10_000_000
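
For reference, the rough arithmetic behind a figure like that (treating the per-entry overhead as an assumption to be measured): 8 bytes for the uint64 key plus 1 byte for the boolean is 9 bytes of payload per entry, so at roughly 10 bytes per entry including overhead, 10,000,000 entries comes to about 100 MB.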

Contributor Author:

Awesome, can add this as a follow-up

@jpkrohling jpkrohling merged commit 18dc9ac into open-telemetry:main Jun 17, 2024
154 checks passed
@github-actions github-actions bot added this to the next release milestone Jun 17, 2024
@jamesmoessis jamesmoessis deleted the jmoe/tailsampling-decisioncache branch June 18, 2024 04:04