-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[receiver/kafka] otelcol_kafka_receiver_offset_lag not calculated correctly #36093
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Hi @dyanneo, Based on my initial analysis:
We have to be potentially careful here with using Cleanup func to solve this issue. By the time Cleanup is called, the session's claims may no longer reflect the partitions that were assigned, or session.Claims() might be empty. We may need to check that assumption as well... That said... I agree with your approach as long as we are guaranteed on Cleanup(), that the session is accurately populated, we can do something like this:
|
@StephanSalas Thanks for your inputs on this issue. I appreciate your concerns and the suggestion makes sense to me. |
According to the docs, this is the lifecycle of the serama kafka consumer framework:
https://github.com/IBM/sarama/blob/main/consumer_group.go#L23C1-L36C87 Thus it looks likely that we can do it in Cleanup(), except for one noteable edge case:
TODO: We may need to somehow figure out this edge case... will look into it. Initial idea is to use the Errors() func channel listed here: https://github.com/IBM/sarama/blob/main/consumer_group.go#L87C2-L87C28 |
Component(s)
receiver/kafka
What happened?
We wanted to collect, graph, and alert on lag for the kafka receiver, but observed unexpected behavior when observing the
otelcol_kafka_receiver_offset_lag
'slast
values, compared to the values observed using kafka's consumer-groups utility.Description
The value of
last
for the measurementotelcol_kafka_receiver_offset_lag
does not appear to be calculated correctly.Also, for context, we are seeing an issue in the otel collector where it keeps emitting lag metrics for partitions it's no longer consuming.
Steps to Reproduce
last
value, we were surprised to see values that didn't correspond to kafka utility outputs.Expected Result
Per this screen shot, the value of partition 4's lag as shown using kafka's consumer-groups.sh utility changes over time, and is in the low hundreds or smaller:
Actual Result
Per this screen shot, the value of partition 4's lag does not match what true lag is, per kafka's tools:
Query in Grafana query builder:
Query as Grafana is running it:
Collector version
otelcol_version: 0.109.0
Environment information
Environment
OS: linux
container
OpenTelemetry Collector configuration
Log output
Additional context
We think the reason for the incorrect data is the fact that the gauge exists within OTEL's registry even after a rebalance and the metric is not receiving updates
Where this gauge is defined:
opentelemetry-collector-contrib/receiver/kafkareceiver/internal/metadata/generated_telemetry.go
Lines 75 to 79 in 0d28558
One of the places where this gauge is updated:
opentelemetry-collector-contrib/receiver/kafkareceiver/kafka_receiver.go
Line 560 in 0d28558
Where this could be addressed:
opentelemetry-collector-contrib/receiver/kafkareceiver/kafka_receiver.go
Lines 529 to 531 in 0d28558
The text was updated successfully, but these errors were encountered: