Decision cache for "keep" trace IDs #33533

Merged

Conversation

jamesmoessis (Contributor)

Description:

Adds a simple LRU decision cache for sampled trace IDs.

The design makes it easy to add another cache for non-sampled IDs.

It stores no information other than the sampled trace ID, and it holds only the right half of the trace ID (as a uint64) in the cache.

By default the cache is a no-op; it becomes active only when the user configures a cache size.
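
For illustration, a minimal self-contained sketch of this design (all names and the eviction code here are illustrative, not the PR's actual implementation): only the low 8 bytes of each sampled trace ID are stored, and the least recently used entry is evicted once the cache is full.

package decisioncache

import (
	"container/list"
	"encoding/binary"
)

// lruDecisionCache remembers "keep" decisions, keyed by the right half
// of the trace ID so each entry costs a fixed 8 bytes.
type lruDecisionCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[uint64]*list.Element // key -> node in order; node holds the key
}

func newLRUDecisionCache(capacity int) *lruDecisionCache {
	return &lruDecisionCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[uint64]*list.Element, capacity),
	}
}

// rightHalf collapses a 16-byte trace ID to its low 8 bytes.
func rightHalf(traceID [16]byte) uint64 {
	return binary.BigEndian.Uint64(traceID[8:])
}

// Put records a "keep" decision, evicting the least recently used entry
// once the cache is full.
func (c *lruDecisionCache) Put(traceID [16]byte) {
	key := rightHalf(traceID)
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return
	}
	if c.capacity > 0 && c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(uint64))
	}
	c.items[key] = c.order.PushFront(key)
}

// Get reports whether a "keep" decision is cached for the trace ID.
func (c *lruDecisionCache) Get(traceID [16]byte) bool {
	el, ok := c.items[rightHalf(traceID)]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}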

Link to tracking Issue: #31583

Testing:

  • unit tests on new code
  • a test in processor_decision_test.go verifying that a trace that was sampled and cached retains its "keep" decision even after its span data has been dropped.

Documentation: Added description to README

@jamesmoessis jamesmoessis requested a review from jpkrohling as a code owner June 13, 2024 04:40
@jamesmoessis jamesmoessis requested a review from a team June 13, 2024 04:40
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jun 13, 2024
@jpkrohling jpkrohling assigned jpkrohling and unassigned mx-psi Jun 13, 2024
@jpkrohling (Member) left a comment:

This is a great start, thank you!

Review thread on processor/tailsamplingprocessor/processor.go (outdated, resolved)
@@ -0,0 +1,14 @@
package decisioncache
Member:

I have two suggestions about the interface:

  1. perhaps this could be a more generic cache, receiving key and value types as parameters?
  2. a delete operation, which might be a no-op for the LRU implementation

It would then look like this:

package cache

type Cache[T comparable, V any] interface {
	// Get returns the value for the given key if it exists in the cache.
	Get(key T) (V, bool)
	// Put sets the value for the given key.
	Put(key T, value V)
	// Delete deletes the value for the given key.
	Delete(key T)
}
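
For context, a trivial map-backed implementation showing how such an interface would be consumed (illustrative only: a plain map never evicts, so the PR would use an LRU behind the same interface):

package main

import "fmt"

// Cache mirrors the interface suggested above.
type Cache[T comparable, V any] interface {
	Get(key T) (V, bool)
	Put(key T, value V)
	Delete(key T)
}

// mapCache is an unbounded toy implementation of Cache.
type mapCache[T comparable, V any] struct{ m map[T]V }

func newMapCache[T comparable, V any]() *mapCache[T, V] {
	return &mapCache[T, V]{m: make(map[T]V)}
}

func (c *mapCache[T, V]) Get(key T) (V, bool) { v, ok := c.m[key]; return v, ok }
func (c *mapCache[T, V]) Put(key T, value V)  { c.m[key] = value }
func (c *mapCache[T, V]) Delete(key T)        { delete(c.m, key) }

func main() {
	// Record and look up a "keep" decision keyed by a trace-ID half.
	var decisions Cache[uint64, bool] = newMapCache[uint64, bool]()
	decisions.Put(0xABCD, true)
	if keep, ok := decisions.Get(0xABCD); ok {
		fmt.Println("cached decision:", keep)
	}
}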

Member:

For completeness, I think we could use this to eventually replace the idToTrace map as well, if we see better performance with the other implementation.

Member:

Further idea, perhaps for a follow-up PR: make the caches observable. We probably want to know the following metrics:

  • cache hits
  • cache misses
  • cache size

This could be a `CacheMetrics` struct, the cache interface could return those counters/gauges, or the individual cache implementations could accept the TelemetryBuilder instance we create in the new-processor function.
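
One way the `CacheMetrics` idea could look, sketched under assumptions (names are illustrative, and only hits/misses are counted here, since an eviction-aware size gauge is better reported by the cache implementation itself):

package cachemetrics

import "sync/atomic"

// Cache is the interface sketched earlier in this thread.
type Cache[T comparable, V any] interface {
	Get(key T) (V, bool)
	Put(key T, value V)
	Delete(key T)
}

// CacheMetrics holds the counters suggested above.
type CacheMetrics struct {
	Hits   atomic.Int64 // Get calls that found an entry
	Misses atomic.Int64 // Get calls that found nothing
}

// instrumentedCache decorates any Cache with hit/miss counting.
type instrumentedCache[T comparable, V any] struct {
	inner   Cache[T, V]
	metrics *CacheMetrics
}

func (c *instrumentedCache[T, V]) Get(key T) (V, bool) {
	v, ok := c.inner.Get(key)
	if ok {
		c.metrics.Hits.Add(1)
	} else {
		c.metrics.Misses.Add(1)
	}
	return v, ok
}

func (c *instrumentedCache[T, V]) Put(key T, value V) { c.inner.Put(key, value) }
func (c *instrumentedCache[T, V]) Delete(key T)       { c.inner.Delete(key) }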

Contributor Author:

All good ideas, working on them now

@jamesmoessis (Contributor Author), Jun 14, 2024:

I think the interface suggestions make sense, but after playing around with the code I think it's easier to use just `[V any]` as the only generic, since the trace ID will always be the key. Structured like this, we can make optimisations around the key in the implementation layer (like using a uint64 instead of pcommon.TraceID internally). This would work for replacing idToTrace as well; a sketch of that shape follows below.

I like the idea of the observable caches as well. Would be happy to do a follow-up PR for that.

Edit: I've pushed it up
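
A sketch of that revised shape (an approximation of the direction described, not the merged code verbatim):

package cache

import "go.opentelemetry.io/collector/pdata/pcommon"

// Cache is keyed by trace ID only, leaving implementations free to
// optimise the key internally (e.g. hashing it down to a uint64).
type Cache[V any] interface {
	// Get returns the value for the given trace ID, if present.
	Get(id pcommon.TraceID) (V, bool)
	// Put stores the value for the given trace ID.
	Put(id pcommon.TraceID, v V)
	// Delete removes the value for the given trace ID.
	Delete(id pcommon.TraceID)
}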

@@ -45,6 +45,11 @@ The following configuration options can also be modified:
- `decision_wait` (default = 30s): Wait time since the first span of a trace before making a sampling decision
- `num_traces` (default = 50000): Number of traces kept in memory.
- `expected_new_traces_per_sec` (default = 0): Expected number of new traces (helps in allocating data structures)
- `decision_cache` (default = `sampled_cache_size: 0`): Configures amount of trace IDs to be kept in an LRU cache,
Member:

I think you mentioned it before, but why not a single decision cache where the decision is the value you get back from the cache? We have two states already, sampled and not sampled. The boolean already being added to the cache could serve this purpose. Or is the point to allow users to keep a bigger cache for sampled than for not sampled?

@jamesmoessis (Contributor Author), Jun 14, 2024:

The idea was to have separate caches for sampled and not-sampled traces. This means you can have different sizes and performance characteristics for each. I thought this comment explained it well: #31583 (comment)

For example, the "do not keep" cache can be a different structure, like a cuckoo filter. I also probably want to keep the different decisions for different lengths of time.

Member:

This explanation definitely belongs in the README!

Review thread on processor/tailsamplingprocessor/processor.go (outdated, resolved)
@jamesmoessis jamesmoessis force-pushed the jmoe/tailsampling-decisioncache branch from 1b3463c to f3bcac9 Compare June 14, 2024 06:42
@jamesmoessis (Contributor Author):

Thanks @jpkrohling for the review and top suggestions. I've addressed the suggestions (some with slight alterations).

@jamesmoessis jamesmoessis force-pushed the jmoe/tailsampling-decisioncache branch from f3bcac9 to e19917f Compare June 17, 2024 05:27
@jpkrohling (Member) left a comment:

I'll merge this as is, as I think it's a great feature to have already. The last comments can be addressed in a follow-up PR.

Review thread on processor/tailsamplingprocessor/config.go (resolved)
@@ -2,6 +2,8 @@ tail_sampling:
decision_wait: 10s
num_traces: 100
expected_new_traces_per_sec: 10
decision_cache:
sampled_cache_size: 500
Member:

If our recommendation is to keep the number of cached items an order of magnitude higher, we should make this much bigger than `num_traces` in this example.
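
For instance, keeping that order-of-magnitude guidance, the example might become (illustrative values, not the committed example):

tail_sampling:
  decision_wait: 10s
  num_traces: 100
  expected_new_traces_per_sec: 10
  decision_cache:
    sampled_cache_size: 1000  # at least 10x num_traces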

- `decision_cache` (default = `sampled_cache_size: 0`): Configures amount of trace IDs to be kept in an LRU cache,
persisting the "keep" decisions for traces that may have already been released from memory.
By default, the size is 0 and the cache is inactive.
If using, configure this to be much higher than `num_traces` so decisions for trace IDs are kept
Member:

The beauty of this cache is that its memory footprint is predictable: it's 8 bytes for the uint64 plus one byte for the boolean, per entry (plus the overhead from the cache implementation itself). So we could measure what that overhead is per entry and add some guidance here, like:

if you want to allocate 100MB to this cache, you can set this value to 10_000_000
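
For reference, the rough arithmetic behind a figure like that (treating the per-entry overhead as an assumption to be measured): 8 bytes for the uint64 key plus 1 byte for the boolean is 9 bytes of payload per entry, so at roughly 10 bytes per entry including overhead, 10,000,000 entries comes to about 100 MB.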

Contributor Author:

Awesome, can add this as a follow-up

@jpkrohling jpkrohling merged commit 18dc9ac into open-telemetry:main Jun 17, 2024
154 checks passed
@github-actions github-actions bot added this to the next release milestone Jun 17, 2024
@jamesmoessis jamesmoessis deleted the jmoe/tailsampling-decisioncache branch June 18, 2024 04:04