Do not create ES indices too far into the past/future #841

Open
mabn opened this issue May 23, 2018 · 5 comments

Comments

mabn commented May 23, 2018

Requirement - what kind of business use case are you trying to solve?

Jaeger performance should not degrade due to bugs in services reporting spans.

Problem - what in Jaeger blocks you from solving the requirement?

One of our services (due to some bug) reports spans with start timestamps far into the future (by years). Because ES indices are created per day, this causes many indices to be created in Elasticsearch for strange dates. For example:

$ curl $ESADDR/_cat/shards -s | wc -l
9986
$ curl $ESADDR/_cat/indices -s | head
green open jaeger-service-8160-12-02   E759tRv5TeyZD0cB2xubCw 48 0          1   0   9.4kb   9.4kb
green open jaeger-span-8154-09-01      tDhwyd9sQEqIgDYxmd8xXw 48 0          0   0     6kb     6kb
green open jaeger-span-8157-02-28      4Yam481RTF-jS7NpGKBS1g 48 0          0   0     6kb     6kb
green open jaeger-service-8154-03-05   YGICrbBkTNaFu-tiaoFYbw 48 0          1   0   9.1kb   9.1kb
green open jaeger-service-8148-12-15   97E695J7R3uyFSImODLx7g 48 1          1   0    19kb   9.5kb
green open jaeger-span-8151-11-09      U1owZ6ieQ6CrPMHlmcNc1g 48 0          0   0     6kb     6kb
green open jaeger-span-8160-02-18      FFVy6vQ1RamzyauHT_Njow 48 0          0   0     6kb     6kb
green open jaeger-span-8156-02-26      -hNsr6rtS6CCxAUH4HrJmw 48 0          0   0     6kb     6kb
green open jaeger-service-208917-08-31 Qx7Xm9jbQe-3Lf1ryv0paQ 48 0          1   0   9.1kb   9.1kb
green open jaeger-service-8163-10-27   g5mLfk9IQQCvYlCooWWqWQ 48 0          1   0   9.4kb   9.4kb

This impacts the ES cluster because, for example, every shard of every index holds its own file handles.
Additionally, our curator script does not remove those indices, as they are considered to be in the future (and only past ones are removed).
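To illustrate how a single bad timestamp becomes a stray index, and how a cleanup job could spot such indices, here is a minimal Go sketch. It assumes the per-day naming scheme visible in the listing above (prefix plus `YYYY-MM-DD`); `indexName`, `indexDate`, `isFutureIndex`, `exampleNow` and `oneDay` are hypothetical names for illustration, not Jaeger or curator APIs.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// Assumed reference values, used only for illustration.
var exampleNow = time.Date(2018, time.May, 23, 12, 0, 0, 0, time.UTC)

const oneDay = 24 * time.Hour

// indexName derives a per-day index name from a span start time,
// mirroring names such as "jaeger-span-8154-09-01" in the listing.
func indexName(prefix string, spanStart time.Time) string {
	return prefix + spanStart.UTC().Format("2006-01-02")
}

// datePattern matches a trailing date whose year may have more than
// four digits (for example "208917-08-31").
var datePattern = regexp.MustCompile(`-(\d{4,9})-(\d{2})-(\d{2})$`)

// indexDate extracts the day encoded at the end of an index name.
func indexDate(name string) (time.Time, bool) {
	m := datePattern.FindStringSubmatch(name)
	if m == nil {
		return time.Time{}, false
	}
	var parts [3]int
	for i := range parts {
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return time.Time{}, false
		}
		parts[i] = n
	}
	return time.Date(parts[0], time.Month(parts[1]), parts[2], 0, 0, 0, 0, time.UTC), true
}

// isFutureIndex reports whether the index day lies more than maxFuture
// after now, i.e. an index that a past-only cleanup would never remove.
func isFutureIndex(name string, now time.Time, maxFuture time.Duration) bool {
	day, ok := indexDate(name)
	return ok && day.After(now.Add(maxFuture))
}

func main() {
	for _, name := range []string{
		"jaeger-span-2018-05-23",
		"jaeger-span-8154-09-01",
		"jaeger-service-208917-08-31",
	} {
		fmt.Println(name, isFutureIndex(name, exampleNow, oneDay))
	}
}
```

A cleanup script could use a check like `isFutureIndex` to list candidates for manual review instead of deleting them automatically.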

Proposal - what do you suggest to solve the problem or improve the existing situation?

Restrict in the collector what timestamps are allowed and reject spans which are too old or too far into the future. E.g. not older than 14 days, at most 1 day into the future. Drop spans outside of this range or save them into the "current" index.
The range could be configurable.
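The proposed range check could look roughly like the following Go sketch. The `Window` type and its fields are hypothetical; the defaults (14 days back, 1 day ahead) are the example values from the proposal, not existing collector settings.

```go
package main

import (
	"fmt"
	"time"
)

// Window is a hypothetical, configurable range of accepted span start
// times, measured relative to the moment the collector sees the span.
type Window struct {
	MaxPast   time.Duration // how old a span may be
	MaxFuture time.Duration // how far ahead of "now" a span may start
}

// defaultWindow uses the example values from the proposal.
var defaultWindow = Window{MaxPast: 14 * 24 * time.Hour, MaxFuture: 24 * time.Hour}

// Contains reports whether start lies within [now-MaxPast, now+MaxFuture].
func (w Window) Contains(start, now time.Time) bool {
	earliest := now.Add(-w.MaxPast)
	latest := now.Add(w.MaxFuture)
	return !start.Before(earliest) && !start.After(latest)
}

// Sample timestamps, assumed for illustration.
var (
	sampleNow = time.Date(2018, time.May, 23, 12, 0, 0, 0, time.UTC)
	recent    = sampleNow.Add(-2 * time.Hour)
	stale     = sampleNow.AddDate(0, 0, -30)
	farFuture = time.Date(8154, time.September, 1, 0, 0, 0, 0, time.UTC)
)

func main() {
	fmt.Println("recent:", defaultWindow.Contains(recent, sampleNow))
	fmt.Println("stale:", defaultWindow.Contains(stale, sampleNow))
	fmt.Println("far future:", defaultWindow.Contains(farFuture, sampleNow))
}
```

What happens to a span that falls outside the window (drop, redirect to the current index, or something else) is a separate decision from the check itself.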

Contributor

vprithvi commented May 23, 2018

I think this is a good idea.

Internally, we've run into a problem where services were setting timestamps far enough in the future to cause overflows.

> Drop spans outside of this range or save them into the "current" index.

I don't think dropping spans for incorrect timestamps is reasonable. Instead, we could overwrite the timestamp with the ingestion time and log a warning on the span. (Ideally, we'd like users to be able to retrieve these spans as part of a trace even if the timestamps are invalid.) I'm not sure whether saving them into the current index accomplishes the same thing.
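A rough Go sketch of the overwrite-and-warn idea follows. The `Span` struct here is a simplified, hypothetical stand-in rather than Jaeger's actual span model, and `sanitizeStartTime` and the limit constants are illustrative names only.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified, hypothetical span with only the fields this
// sketch needs.
type Span struct {
	OperationName string
	StartTime     time.Time
	Warnings      []string
}

// Assumed limits and sample values, for illustration only.
const (
	fourteenDays = 14 * 24 * time.Hour
	oneDay       = 24 * time.Hour
)

var (
	sampleIngest   = time.Date(2018, time.May, 23, 12, 0, 0, 0, time.UTC)
	farFutureStart = time.Date(8154, time.September, 1, 0, 0, 0, 0, time.UTC)
)

// sanitizeStartTime replaces an out-of-range start time with the
// ingestion time and records a warning on the span, so the span stays
// retrievable as part of its trace. It reports whether it rewrote
// the timestamp.
func sanitizeStartTime(s *Span, ingested time.Time, maxPast, maxFuture time.Duration) bool {
	inRange := !s.StartTime.Before(ingested.Add(-maxPast)) &&
		!s.StartTime.After(ingested.Add(maxFuture))
	if inRange {
		return false
	}
	s.Warnings = append(s.Warnings, fmt.Sprintf(
		"start time %s outside accepted range; replaced with ingestion time",
		s.StartTime.UTC().Format(time.RFC3339)))
	s.StartTime = ingested
	return true
}

func main() {
	s := Span{OperationName: "example-op", StartTime: farFutureStart}
	rewritten := sanitizeStartTime(&s, sampleIngest, fourteenDays, oneDay)
	fmt.Println(rewritten, s.StartTime, s.Warnings)
}
```

Keeping the original value inside the warning text preserves the evidence of the bad timestamp for whoever debugs the reporting service.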

Author

mabn commented May 23, 2018

Overwriting sounds good.

Author

mabn commented Jun 12, 2018

It turns out that it isn't a bug in the service - I've added extra logging wrapped around Sender and it didn't catch anything. I suspect that once in a while UDP packets sent to the agent are corrupted.

Based on the number of extra indices in ES, it happens a few times per 10^9 spans.

Member

pavolloffay commented Dec 5, 2019

Just adding a note: this behavior is not present when using rollover aliases (the --es.use-aliases flag), as it writes data to a single index.

@mehta-ankit
Member

> I think this is a good idea.
>
> Internally, we've run into a problem where services were setting timestamps far enough in the future to cause overflows.
>
> > Drop spans outside of this range or save them into the "current" index.
>
> I don't think dropping spans for incorrect timestamps is reasonable, instead we could overwrite the timestamp with the ingestion time, and log a warning on the span. (Ideally, we'd like users to be able to retrieve these spans as part of a trace even if the timestamps are invalid). I'm not sure whether saving them into the current index accomplishes the same thing.

A solution to this problem would really help. Currently we run into trouble with Elasticsearch having too many indices.

I agree that we should rewrite timestamps if they are in the future.
But ideally there would be a flag that lets the user decide whether to accept out-of-order spans (aka future spans) or rewrite their timestamps.
This is similar to what Vector has: https://vector.dev/docs/reference/configuration/sinks/loki/#out_of_order_action
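Such a flag could be wired up along these lines. This is only a sketch using Go's standard `flag` package: the flag name `out-of-range-action` and the values `accept`, `rewrite_timestamp` and `drop` are hypothetical choices for illustration, not existing Jaeger options.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// Action is a hypothetical policy for spans with out-of-range timestamps.
type Action string

const (
	ActionAccept  Action = "accept"            // store the span unchanged
	ActionRewrite Action = "rewrite_timestamp" // replace with ingestion time
	ActionDrop    Action = "drop"              // discard the span
)

// parseAction validates a user-supplied policy name.
func parseAction(s string) (Action, error) {
	switch a := Action(s); a {
	case ActionAccept, ActionRewrite, ActionDrop:
		return a, nil
	default:
		return "", fmt.Errorf("unknown out-of-range action %q", s)
	}
}

// actionFromArgs reads the hypothetical flag from command-line arguments,
// defaulting to rewriting the timestamp.
func actionFromArgs(args []string) (Action, error) {
	fs := flag.NewFlagSet("collector", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	raw := fs.String("out-of-range-action", string(ActionRewrite),
		"accept, rewrite_timestamp, or drop")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return parseAction(*raw)
}

func main() {
	action, err := actionFromArgs([]string{"--out-of-range-action=accept"})
	fmt.Println(action, err)
}
```

Defaulting to the rewrite policy matches the preference expressed in this thread, while `accept` keeps today's behavior available for users who want it.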
