Enable Receiver to extract Tenant from a label present in incoming timeseries #7081
Comments
This would be a really cool feature indeed! We've tried to build a proxy that extracts the tenant from a label and sends one request per tenant with the appropriate header, but it overwhelmed receivers and was not worth the hassle. Having the feature natively built into Thanos is the way to go. |
Thanks Filip - do you think the proposal for how the feature would work makes sense? I’m a little worried about adding too much overhead to the routing receivers and causing requests to back up.
|
Wouldn't this be evaluated on the ingesting receiver when deciding which tenant the sample should be written to? I would imagine that instead of one local write, we would inspect the request, group it by tenant, and issue multiple local writes here Line 793 in 4a73fc3
|
The ingesting receiver does not always have access to the hashring (e.g. in router-ingester split mode). So routers need to know which ingester to send samples to. |
Had some discussions with @MichaHoffmann. We came to the conclusion that the proposed implementation would completely break the current rate limiting concept. For example, if 1 tenant in a batch of 20 is over the limit, what should Thanos do? A 429 will be retried and result in the whole batch being ingested again. If we drop all metrics in the batch because 1 tenant is over the limit, we have created a noisy neighbour problem. But if we accept the metrics from the valid 19/20 tenants, then we will have out-of-order issues when the batch is retried. So the current design is incompatible with per-tenant rate limiting as it stands. |
Another way to do it, is via current
The same way it is done in mimir: Example of changes needed in Thanos: |
@sepich I like that approach. Do you know how Mimir handles per-tenant limits in that situation? |
@verejoel I think it has the same issues with the ratelimit since all the samples still come from one remote write request! |
Implemented a PoC for this, works really well. A few caveats:
|
@GiedriusS any chance you can open a PR for this issue? Would love to try it out. |
@GiedriusS Hi! |
I've tried this. Thanos is deployed with the Bitnami Helm Chart; here is the receive configuration:
I created a ServiceMonitor which adds the label |
Tenant might be a reserved label iirc: --receive.tenant-label-name="tenant_id" |
So, if I understand, it's impossible to use the |
Also, Receive is adding a |
It's an important feature, so does this PR complete everything in this proposal? @GiedriusS @verejoel Thanks! |
@benjaminhuo I tried using this with the stateless ruler and the router/ingester setup. It does not seem to work as expected, as the ALERTS metrics disappear. I still need to devote some time to working out why that might be the case - it could be that they get written to some default tenant that is not queryable in our setup. I would be interested to hear if anyone else has managed to get this working with the stateless ruler, as this is one of the ways to enable a multi-tenanted ruler based on internal metric labels. |
@verejoel Is it still an issue for you to use this feature with the stateless ruler? Stateless ruler labels are kind of hard to configure, but if you can share your configuration, we can help you debug a bit. It would also be nice to create a doc for users about how to troubleshoot this. I will create an issue |
@yeya24 that sounds good, I will dedicate some time to it this week. Let me know which issue and I will post my findings there |
Currently experiencing this same issue. Was an issue ever created? |
Is your proposal related to a problem?
One way to support a multi-tenant ruler is to enable receives to infer the tenant based not on HTTP headers, but on labels present in the incoming time-series. See the comment in issue #5133 for more information.
As well as helping move forward with the multi-tenant ruler, this would help enable multi-tenancy in clients that deliver telemetry. For example, currently if we want to ship telemetry using the OpenTelemetry collector prometheus remote write exporter, we would need to configure an exporter per tenant (I am aware that one can use the headers setter extension to dynamically set headers, but this only works if you have the same tenant for the whole request context).
Describe the solution you'd like
Introduce a new CLI flag for the receiver, --incoming-tenant-label-name. Thanos receiver will then search each time-series for occurrences of this label, building a map of unique tenant names discovered from the label values to slices of time series belonging to that tenant. Each element of the map can then be distributed according to the hashring config.
The process is summarised in this flow chart:
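A minimal sketch of the grouping step described above, written in Go since that is what Thanos uses. The Label and TimeSeries types are simplified stand-ins for the remote-write types, and groupByTenant and its fallbackTenant parameter are hypothetical names used for illustration; how series without the label should be handled is an assumption here, not part of the proposal.

```go
package main

import "fmt"

// Label and TimeSeries are simplified stand-ins for the remote-write
// protobuf types (assumption: samples are omitted for brevity).
type Label struct{ Name, Value string }

type TimeSeries struct {
	Labels []Label
}

// groupByTenant builds a map from tenant name to the series carrying that
// tenant in tenantLabel. Series that lack the label (or have an empty value)
// are assigned to fallbackTenant, which is an assumed behaviour.
func groupByTenant(series []TimeSeries, tenantLabel, fallbackTenant string) map[string][]TimeSeries {
	out := make(map[string][]TimeSeries)
	for _, ts := range series {
		tenant := fallbackTenant
		for _, l := range ts.Labels {
			if l.Name == tenantLabel && l.Value != "" {
				tenant = l.Value
				break
			}
		}
		out[tenant] = append(out[tenant], ts)
	}
	return out
}

func main() {
	series := []TimeSeries{
		{Labels: []Label{{"__name__", "up"}, {"tenant_id", "team-a"}}},
		{Labels: []Label{{"__name__", "up"}, {"tenant_id", "team-b"}}},
		{Labels: []Label{{"__name__", "up"}}}, // no tenant label
	}
	// Each group would then be routed separately according to the hashring.
	for tenant, group := range groupByTenant(series, "tenant_id", "default-tenant") {
		fmt.Println(tenant, len(group))
	}
}
```

Grouping once per request keeps the cost to a single pass over the series, but each resulting group still becomes its own forwarded write, which is where the overhead concern raised in the comments comes from.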
Notes:
Describe alternatives you've considered
Performing the manipulations to the THANOS-TENANT header upstream dynamically (using an OTel collector, for example).
Additional context
We'd need to modify the behaviour of the receiveHTTP handler, in particular where we extract the tenant from the HTTP request.