Rate Limit Processor #35204

juergen-kaiser-by · 2024-09-16T10:32:28Z

Component(s)

No response

Is your feature request related to a problem? Please describe.

We run an observability backend (Elasticsearch) shared by many teams and services (thousands). The services run in kubernetes clusters and we want to collect the logs of all pods.

Problem: If a service/pod becomes very noisy for some reason, it can burden the backend so much that all of other teams feel it. In short: One team can ruin the day for all others.

We would like to limit the effect a single instance or service can have on the observability backend.

Describe the solution you'd like

There should be a method to limit the flow of logs based on some atributes. We apply a standardized set of labels to each Kubernetes deployment, so we would work with those.
Dropping log lines would be okay (for us) since we can not assume that a noisy service stops being noisy soon. Sampling is not okay because of the nature of logs.
If rate limiting happens then we and our users should be able to see it.
- A single metric ingested into some (other) pipeline would suffice.
- Let us choose a set of attributes the metric should have (copied) from the rate limited logs so that we can map it to the affected service. In our case, we would like to have our standard pod labels copied to the metric.

Considering the points above, we think that there should be a processor for this.

We have no requirements regarding the algorithm backing the rate limiting. It seems that a token bucket filter (example blog entry) is a reasonable choice here.

Describe alternatives you've considered

Rate Limiter in Receivers

Receiver rate limiting is okay if you only work with attributes available in the receivers. In our case, those are insufficient because we need pod labels. As a workaround, we could inject them into the collector config as env variables and focus the collector on a single pod by deploying it as a sidecar. However, a sidecar deployment consumes too much resources across all pods because we have large clusters (>= 10K pods).

A benefit of a rate limiting in receivers could give collector users to choose between dropping incoming telemetry and just not receive it, effectively creating backpressure.

Rate limiting in Receivers is discussed in #6908.

Rate Limiter as Extension

We lack knowledge about how extensions work internally to say anything about it. Rate limiting as extensions also is discussed in #6908.

Additional context

rate limiter of filebeat
Issue for rate limiting in receivers: Rate Limit Receiver Extension #6908
This partially has been discussed in Limit data production from receivers to avoid backpressure scenarios #29410
Official docs for Filelog Receiver in Kubernetes: link

atoulme · 2024-09-17T19:55:09Z

Can you explain why sampling doesn't work for you here? What would the configuration of the processor look like?

juergen-kaiser-by · 2024-09-18T08:38:42Z

The sampling existing today does not work because it always is enabled. We would like to only limit too noisy pods/services. Note that a service suddenly can become noisy, so we do not know the beforehand.

Once noisyness is detected, there are different options how to limit the flow. Whether we need sampling or some other rate limiting algorithm boils down to how useful the output is afterwards. For logs, we think that larger chunks of consecutive log lines are more useful than smaller ones (think about a samples stack trace vs. having a full one), therefore the limiting algorithm should create large once before starting to drop log lines. A simple sampling creates small chunks.

That being said, I do not see why it should not be possible to specify different algorithms in the long run. For the other telemetry types, other algorithms are better.

I do not have a good example configuration, yet. Let me think about that.

juergen-kaiser-by · 2024-09-18T12:54:30Z

As a first draft:

#[...]

processors:
  ratelimit:

    # defines the rate limiting for log signals. Idea for the structure is borrowed from the transformprocessor. We can have own sections for log, metric, and trace if we want this processor to be generic to all types.
    log:

      # grouping defines wheter and how to group the telemetry by a set of attributes.
      grouping:

        #   mode defines how to deal with logs/metrics/traces not not falling into one of the defined groups (not having the fields)
        #   - strict (default): nonexisting fields are treated as if they do (having a default value) => always rate limit. Fall back to the last group if existent.
        #   - relaxed: rate limiting applies only if all fields are present
        mode: strict

        groups:
        - tbf: # used rate limiting algorithm. Others are possible but only one must be set. (tbf = token bucket filter)
            average_rate_per_sec: 1000
            max_burst_per_sec: 5000

          # defines the list of attributes to use for grouping.
          attributes:
          - context: resource
            attribute: k8s.pod.uid
          - context: resource
            attribute: k8s.pod.tag.my-project-name

#[...]

The structures allows user to:

define rate limiting for different signal types
define different rate limiting strategies
define multiple groups

We could also add conditions to the group attributes to enable users to define more specialized groups.

axw · 2024-10-11T08:20:56Z

Receiver rate limiting is okay if you only work with attributes available in the receivers. In our case, those are insufficient because we need pod labels. As a workaround, we could inject them into the collector config as env variables and focus the collector on a single pod by deploying it as a sidecar. However, a sidecar deployment consumes too much resources across all pods because we have large clusters (>= 10K pods).

Do you have any special authn/z requirements? If not, an option that may work for you is Kubernetes service account tokens in conjuction with the oidcauthextension.

Service account tokens are OIDC-compatible, and their claims contain the namespace and service account name, among other things. They do not contain pod labels - maybe you can enforce service account names for your customers?

If all of those stars align, then you could extract those claims and use them as keys for a rate limiter extension...

Rate limiting in Receivers is discussed in #6908.

We have built a distributed rate limiter extension for internal purposes, and we are planning to offer it to the contrib repo. The extension is currently rate limiting in the exporter, but it would be fairly straightforward to update it to also rate limit in receivers, and it's our plan to do so.

juergen-kaiser-by · 2024-10-11T12:26:29Z

In our case, we consume logs generated by pods/containers in a Kubernetes clusters. Those logs are stored in files by Kubernetes and are mounted into the otel collector's container for consumption. There is no authentication involved in this path (in my understanding).

If you need some unique key for the "entity" which should be rate limited then we could work with the pod id or container id. However, those must be extracted first from the log file name. If the extension you mention works with the output of a receiver then this can work. However, it would not allow us to get useful notifications/metrics telling us about rate limited services. For that, we need more information than just the pod/container id.

axw · 2024-10-14T01:45:55Z

@juergen-kaiser-by I see, thanks for elaborating. The extension we have built is based on the auth framework (not the ideal interface, but it works) which would not fit with filelog anyway. The core functionality could also be extracted into a processor.

I think creating back pressure on the producer is ideal, which makes rate limiting on resource/record attributes tricky: there are potentially multiple resources per OTLP batch, so rate limiting on one resource may cause back pressure for the other. I think that probably doesn't apply in your case, I'm just thinking about whether it would lead to surprising behaviour for others.

Another option would be to allow the processor to only rate limit based on batch/request-level metadata. For example you might use this to rate limit a batch of data based on some HTTP header (e.g. a tenant ID), if you were using it with the OTLP receiver. Then you would find a way to get the metadata you need (i.e. pod labels) in there. We could potentially extend receivercreator to set this metadata based on pod labels. Something like:

extensions:
  k8s_observer:

processors:
  ratelimiter:
    metadata_keys: [tenant]
    # other config ...

receivers:
  receiver_creator:
    watch_observers: [k8s_observer]
    receivers:
      filelog:
        rule: type == "pod.container"
        metadata:
          - key: tenant
            value: `pod.labels["tenant"]`
        config:
          include:
            - /var/log/pods/`pod.namespace`_`pod.name`_`pod.uid`/`container_name`/*.log
          include_file_name: false
          include_file_path: true
          operators:
            - id: container-parser
              type: container

juergen-kaiser-by · 2024-10-15T12:04:12Z

I think creating back pressure on the producer is ideal, which makes rate limiting on resource/record attributes tricky: there are potentially multiple resources per OTLP batch, so rate limiting on one resource may cause back pressure for the other. I think that probably doesn't apply in your case, I'm just thinking about whether it would lead to surprising behaviour for others.

Please remember that we would like to be able to see when log messages get dropped.

If you work with backpressure, then it is up to the previous instances to decide what to do if there is too much telemetry. In our case, the source of the log messages can be seen as the logging service or Kubernetes which stores the logs into files. Obviously, we do not want to slow down the service. Kubernetes itself does not care whether you consumed the logs. It does not wait for you and will rollover and delete the log messages as necessary. Consequently, our only chance to get information about dropped logs is to get them into the collector and drop the messages there as necessary.

axw · 2024-10-16T04:05:16Z

@juergen-kaiser-by makes sense. I think the ratelimit processor could have configuration to either return an error, silently discard, or create backpressure. In your case could configure it to silently discard the data, and the processor would record metrics about rate limiting. I think that aligns with the solution you originally described.

github-actions · 2024-12-16T03:40:10Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

axw · 2024-12-16T03:59:16Z

We're in the process of open sourcing a rate limiter processor that we've been using internally: elastic/opentelemetry-collector-components#247

Eventually we intend to propose adding it opentelemetry-collector-contrib, that repo is just an interim home.

juergen-kaiser-by added enhancement New feature or request needs triage New item requiring triage labels Sep 16, 2024

juergen-kaiser-by changed the title ~~Rate Limit based on Atributes~~ Rate Limit Processor Sep 16, 2024

github-actions bot mentioned this issue Sep 17, 2024

Weekly Report: 2024-09-10 - 2024-09-17 #35228

Closed

github-actions bot mentioned this issue Sep 24, 2024

Weekly Report: 2024-09-17 - 2024-09-24 #35377

Closed

crobert-1 added the Sponsor Needed New component seeking sponsor label Sep 25, 2024

This was referenced Oct 1, 2024

Weekly Report: 2024-09-24 - 2024-10-01 #35498

Closed

Weekly Report: 2024-10-01 - 2024-10-08 #35659

Closed

atoulme removed the needs triage New item requiring triage label Oct 12, 2024

github-actions bot mentioned this issue Oct 15, 2024

Weekly Report: 2024-10-08 - 2024-10-15 #35785

Closed

github-actions bot mentioned this issue Oct 22, 2024

Weekly Report: 2024-10-15 - 2024-10-22 #35905

Closed

This was referenced Oct 29, 2024

Weekly Report: 2024-10-22 - 2024-10-29 #36039

Closed

Weekly Report: 2024-10-29 - 2024-11-05 #36187

Closed

github-actions bot mentioned this issue Nov 12, 2024

Weekly Report: 2024-11-05 - 2024-11-12 #36302

Closed

github-actions bot mentioned this issue Nov 19, 2024

Weekly Report: 2024-11-12 - 2024-11-19 #36426

Closed

github-actions bot mentioned this issue Nov 26, 2024

Weekly Report: 2024-11-19 - 2024-11-26 #36533

Closed

anuraj381 mentioned this issue Nov 28, 2024

Processor/logratelimitprocessor #36592

Open

github-actions bot mentioned this issue Dec 3, 2024

Weekly Report: 2024-11-26 - 2024-12-03 #36628

Closed

github-actions bot mentioned this issue Dec 10, 2024

Weekly Report: 2024-12-03 - 2024-12-10 #36739

Open

github-actions bot added the Stale label Dec 16, 2024

github-actions bot removed the Stale label Dec 16, 2024

github-actions bot mentioned this issue Dec 17, 2024

Weekly Report: 2024-12-10 - 2024-12-17 #36867

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rate Limit Processor #35204

Rate Limit Processor #35204

juergen-kaiser-by commented Sep 16, 2024 •

edited

Loading

atoulme commented Sep 17, 2024

juergen-kaiser-by commented Sep 18, 2024 •

edited

Loading

juergen-kaiser-by commented Sep 18, 2024

axw commented Oct 11, 2024

juergen-kaiser-by commented Oct 11, 2024

axw commented Oct 14, 2024

juergen-kaiser-by commented Oct 15, 2024

axw commented Oct 16, 2024

github-actions bot commented Dec 16, 2024

axw commented Dec 16, 2024

Rate Limit Processor #35204

Rate Limit Processor #35204

Comments

juergen-kaiser-by commented Sep 16, 2024 • edited Loading

Component(s)

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Rate Limiter in Receivers

Rate Limiter as Extension

Additional context

atoulme commented Sep 17, 2024

juergen-kaiser-by commented Sep 18, 2024 • edited Loading

juergen-kaiser-by commented Sep 18, 2024

axw commented Oct 11, 2024

juergen-kaiser-by commented Oct 11, 2024

axw commented Oct 14, 2024

juergen-kaiser-by commented Oct 15, 2024

axw commented Oct 16, 2024

github-actions bot commented Dec 16, 2024

axw commented Dec 16, 2024

juergen-kaiser-by commented Sep 16, 2024 •

edited

Loading

juergen-kaiser-by commented Sep 18, 2024 •

edited

Loading