Use Elasticsearch Rollover API to manage indices #1242

Closed
pavolloffay opened this issue Dec 7, 2018 · 34 comments

@pavolloffay
Member

Requirement - what kind of business use case are you trying to solve?

Use the ES Rollover API to manage retention. It is an alternative to the date-based indices currently used in Jaeger. We could make it an optional feature.

Before running Jaeger we have to create the write (read) alias:

# Note: is_write_index is supported only in ES 6.4 and later.
curl -ivX PUT -H "Content-Type: application/json" localhost:9200/jaeger-span-000001 -d '{
  "aliases": {
    "jaeger-span": {"is_write_index": true}
  }
}'

The command creates index jaeger-span-000001 and alias jaeger-span.

Now the collector can write to the jaeger-span alias. Once the index is too large, an external service can roll over to a new index. The rollover API has to be called periodically; if the conditions are met at the time of a call, ES creates a new index.

curl -ivX POST -H "Content-Type: application/json" localhost:9200/jaeger-span/_rollover -d '{
  "conditions": {
    "max_age":   "7d",
    "max_docs":  1
  }
}'

The command creates index jaeger-span-000002, which is added to the alias jaeger-span. Note that the old index jaeger-span-000001 stays in the alias if "is_write_index": true is set (supported only in ES >= 6.4).
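
As a rough illustration of the naming convention in the commands above, the rollover target can be derived from the current index name by incrementing the numeric suffix and zero-padding it to six digits. The helper `next_rollover_index` is hypothetical; Elasticsearch computes the name itself on the server side, and nothing here is part of Jaeger.

```python
import re


def next_rollover_index(current: str) -> str:
    """Return the index name a rollover would create after `current`.

    Hypothetical helper for illustration only: Elasticsearch derives the
    name itself when the index name ends in "-<number>".
    """
    match = re.fullmatch(r"(.+)-(\d+)", current)
    if match is None:
        raise ValueError(f"index name {current!r} does not end in -<number>")
    prefix, counter = match.groups()
    # The counter is incremented and zero-padded to six digits.
    return f"{prefix}-{int(counter) + 1:06d}"
```

For example, `next_rollover_index("jaeger-span-000001")` gives `"jaeger-span-000002"`, matching the index created by the rollover call above.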

ES < 6.4

When using ES < 6.4, we also have to use a read alias, because the main alias jaeger-span can contain only one index.

curl -ivX POST -H "Content-Type: application/json" localhost:9200/_aliases -d '{
    "actions" : [
        { "add" : { "index" : "jaeger-span", "alias" : "jaeger-span-read" } }
    ]
}'

This command creates read alias jaeger-span-read which points to jaeger-span index (the write index).

When calling rollover we have to specify the alias names. A newly created index will be put into the alias.

curl -ivX POST -H "Content-Type: application/json" localhost:9200/jaeger-span/_rollover -d '{
  "conditions": {
    "max_age":   "7d",
    "max_docs":  1
  },
  "aliases": {
    "jaeger-span-read": {}
  }
}'
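
The two rollover calls shown so far differ only in the optional "aliases" section. A minimal sketch of building either request body follows; the function name `rollover_body` and its parameters are assumptions made for illustration, not an existing Jaeger or Elasticsearch client API.

```python
from typing import Optional


def rollover_body(max_age: str, max_docs: int,
                  read_alias: Optional[str] = None) -> dict:
    """Build the JSON body for POST /<write-alias>/_rollover.

    Hypothetical helper: pass `read_alias` only for the ES < 6.4 setup,
    where the newly created index must also be added to a read alias.
    """
    body = {"conditions": {"max_age": max_age, "max_docs": max_docs}}
    if read_alias is not None:
        # The new index is created with this alias attached.
        body["aliases"] = {read_alias: {}}
    return body
```

Serializing the result with `json.dumps` yields the payload passed to `-d` in the curl commands above.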

https://www.elastic.co/guide/en/elasticsearch/reference/6.5/indices-rollover-index.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/indices-rollover-index.html
https://www.elastic.co/blog/managing-time-based-indices-efficiently

Proposal - what do you suggest to solve the problem or improve the existing situation?

Introduce a flag that makes Jaeger use a single index (alias) to read and write.

--es.use-single-index    Use a single index name without date (e.g. "jaeger-span") to write and read.
--es.read-alias          Use "-read" alias for read indices.
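
To make the intent of the two proposed flags concrete, here is a small sketch of how write and read targets might be chosen. Everything in it is an assumption: `resolve_targets` is a hypothetical function, the flags are only proposed in this issue, and the date-based name is assumed to follow the "<prefix>-YYYY-MM-DD" pattern.

```python
from datetime import date
from typing import Tuple


def resolve_targets(prefix: str, day: date, use_single_index: bool,
                    read_alias: bool) -> Tuple[str, str]:
    """Return (write_target, read_target) for one index family."""
    if not use_single_index:
        # Date-based mode: assumed "<prefix>-YYYY-MM-DD" naming.
        dated = f"{prefix}-{day:%Y-%m-%d}"
        return dated, dated
    # Single-index mode: write to the bare name, optionally read via "-read".
    read_target = f"{prefix}-read" if read_alias else prefix
    return prefix, read_target
```

With both flags set, writes go to `jaeger-span` and reads go to `jaeger-span-read`, which is the ES < 6.4 layout described above.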

Any open questions to address

  • Should the Jaeger service be responsible for creating aliases, or is it the responsibility of an external component, e.g. an operator? (I think it should be done externally)
@pavolloffay
Member Author

cc @jaegertracing/elasticsearch

@yurishkuro
Member

I just read https://www.elastic.co/blog/managing-time-based-indices-efficiently - while the primitives make sense, the process itself is absolutely horrifying: 7 or more steps, any of which can fail, with undefined wait periods between them. At least our daily indices require almost no maintenance, just the delete job with a single step.

I would only consider the rollover pattern if it's fully supported by the curator. If it is, I think it's a good direction, but it sounds like we'd still need to provide a tool to generate the curator yaml files with all the actions.

@pavolloffay
Member Author

7 or more steps, any of which can fail, with undefined wait periods between them. At least our daily indices require almost no maintenance, just the delete job with a single step.

What steps exactly do you mean? I think the linked example uses an even more complicated deployment model with hot/cold nodes. Curator already supports rollover https://www.elastic.co/guide/en/elasticsearch/client/curator/current/rollover.html. In addition to that I am also interested in adding elastic/curator#1278 to its API.

To make rollover work the only required steps are:

  1. setup aliases at init-time
  2. periodically call rollover API.
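
The second step can be sketched as a small polling loop. This is only an illustration under stated assumptions: `run_rollover` and the injected `post` callable are hypothetical names, and the loop relies on the rollover response carrying the `rolled_over` and `new_index` fields.

```python
import time
from typing import Callable, List


def run_rollover(post: Callable[[str, dict], dict], alias: str,
                 conditions: dict, rounds: int,
                 interval_s: float = 0.0) -> List[str]:
    """Call the rollover endpoint `rounds` times; return the indices created.

    `post(path, body)` is an injected HTTP helper (an assumption here) that
    sends a JSON POST to Elasticsearch and returns the decoded response.
    """
    created = []
    for _ in range(rounds):
        response = post(f"/{alias}/_rollover", {"conditions": conditions})
        # ES rolls over only if a condition is met at the time of the call.
        if response.get("rolled_over"):
            created.append(response["new_index"])
        time.sleep(interval_s)
    return created
```

In a real deployment the loop would more likely be a cron job or a Kubernetes CronJob that makes one call per run.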

The date-based indices will still be supported. Rollover will be an additional feature for users who can benefit from it.

@yurishkuro
Member

I was referring to the steps in the blog post. Rollover is just one step, all others have to do with managing the index aliases, relocating index to warm nodes, compressing it, etc. Calling the rollover API only triggers index rollover once in a while, it's not sufficient for managing the whole thing via aliases.

I am not opposed to the approach, as long as curator provides the necessary automation for managing the aliases after the rollover.

@pavolloffay
Member Author

pavolloffay commented Jan 11, 2019

Adding questions from weekly meeting:

  • If we use a read alias with many indices, how can we tell ES to search only those that match the desired time range?

--es.max-span-age - The maximum lookback for spans in Elasticsearch
Response: The reader will access only one alias pointing to potentially multiple indices. An external component (part of this project) will be executed to remove old indices from the read alias.

@otisg

otisg commented Jan 11, 2019

If it helps: https://sematext.com/blog/field-stats-plugin-elasticsearch/ (Github repo for the plugin linked in there)

@pavolloffay
Member Author

Thanks for the pointer, @otisg. I think we would like to stay with the official ES distribution only, if possible. The --es.max-span-age setting (the maximum lookback for spans in Elasticsearch) can be managed by curator or the ES API by removing old indices from the read alias. We will provide a script/component to do that.

@pavolloffay
Member Author

#1197 introduces esRollover to manage rollover indices. My last open question is how to manage --es.max-span-age (the maximum lookback for spans in Elasticsearch). At the moment the reader creates a list of indices based on the supplied es.max-span-age. With rollover we will always read from a single target, the read alias.

My design is that an external component would remove indices from the alias to mimic the behavior of es.max-span-age.
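
A minimal sketch of that external component, assuming the index creation times are already known (for example from index settings). The function `read_alias_removals` is hypothetical and is not the esRollover implementation from #1197; it only builds the body for the `_aliases` endpoint.

```python
from datetime import datetime, timedelta
from typing import Dict


def read_alias_removals(creation_times: Dict[str, datetime], read_alias: str,
                        max_age: timedelta, now: datetime) -> dict:
    """Build a POST /_aliases body that drops old indices from the read alias.

    Indices created before `now - max_age` are removed from the alias only;
    the indices themselves are not deleted.
    """
    cutoff = now - max_age
    actions = [
        {"remove": {"index": index, "alias": read_alias}}
        for index, created in sorted(creation_times.items())
        if created < cutoff
    ]
    return {"actions": actions}
```

Removing an index from the alias hides it from the reader, which mimics es.max-span-age without touching the data.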

Any more thoughts on this from @jaegertracing/elasticsearch ?

@yurishkuro
Member

seems like es.max-span-age should not be used if the alias mode is selected

@pavolloffay
Member Author

Yes, but we should provide an alternative solution to that... I have added this functionality to the esRollover script in #1197. It removes indices from the read alias based on a configured age.

@masteinhauser
Member

the es.max-span-age is a fairly critical feature for us, unfortunately, and it doesn't seem like the archive indices will be enough? I could absolutely be confused here, based on the multiple PRs inflight for this refactor

@pavolloffay Would it be possible to use a NumericRangeQuery? It feels like this might be the most efficient query method from Elasticsearch's perspective:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html#_date_format_in_range_queries

Possible example:

# Note: it appears the "format" field IS NOT needed in Jaeger's query.
GET _search
{
    "query": {
        "range" : {
            "startTimeMillis" : {
                "format": "epoch_millis",
                "gte" : "now-72h/h",
                "lt" :  "now/d"
            }
        }
    }
}
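
The same kind of range query can also be built with explicit epoch-millisecond bounds instead of date math. This is a sketch only: `start_time_range_query` is a hypothetical helper, not Jaeger's query code, and it assumes timezone-aware datetimes.

```python
from datetime import datetime, timedelta


def start_time_range_query(now: datetime, lookback: timedelta) -> dict:
    """Build a search body restricting startTimeMillis to [now - lookback, now)."""

    def to_millis(moment: datetime) -> int:
        # Expects a timezone-aware datetime; returns epoch milliseconds.
        return int(moment.timestamp() * 1000)

    return {
        "query": {
            "range": {
                "startTimeMillis": {
                    "format": "epoch_millis",
                    "gte": to_millis(now - lookback),
                    "lt": to_millis(now),
                }
            }
        }
    }
```

Explicit bounds make the window independent of the rounding that `now-72h/h` and `now/d` apply.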

@pavolloffay
Member Author

@masteinhauser I think using NumericRangeQuery compared to first filtering by indices has performance implications as the ES would have to go through all data for all indices kept in the read alias.
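
For comparison, "first filtering by indices" in the date-based layout amounts to listing one index per day in the lookback window. A minimal sketch, assuming daily indices named "<prefix>-YYYY-MM-DD"; `daily_indices` is a hypothetical helper and not the actual reader code.

```python
from datetime import date, timedelta
from typing import List


def daily_indices(prefix: str, today: date, lookback_days: int) -> List[str]:
    """List the daily indices covering today and the previous `lookback_days` days."""
    return [
        f"{prefix}-{today - timedelta(days=offset):%Y-%m-%d}"
        for offset in range(lookback_days + 1)
    ]
```

Only these indices are searched, whereas a range query against the read alias is sent to every index the alias contains.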

is a fairly critical feature for us

Can you please explain why is it critical to you? Do you deploy multiple query services with a different es.max-span-age? I think this is the only tricky part when using rollover, then you would have to use different index prefixes and keep a different set of indices per index.

There is only one PR related to rollover: #1197, see the first comment and section Managing max-span-age and delete old indices to better understand how it works.

@masteinhauser
Member

masteinhauser commented Jan 17, 2019

as the ES would have to go through all data for all indices kept in the read alias.

Yep, we already see that exact behavior with our es.max-span-age=720h.

Can you please explain why is it critical to you?

Unfortunately, we have far too many defects filed from production speakers and customers that get worked on outside the default 72h range, but almost all of them get worked within 30 days. These don't always have an exact TraceID to refer to during their investigation. I'm actively trying to determine how to better support this use case, and was hoping this work might be related. My next attempt is to deploy multiple services with different configurations.

I'm not sure how Kibana does this, but I do know it handles far more data over a larger timeframe much better than the Jaeger Query searches seem to. (We use Kibana to figure out all of our Traces, and then use those TraceIDs to pull up Jaeger's view of the spans)

Oh, apologies, I'll take a look at #1197 once again to re-familiarize myself. Thanks for the reference!

@pavolloffay
Member Author

https://discuss.elastic.co/t/filter-indices-for-range-query-in-time-based-indices/149913 mentions that range query could be used with a large number of indices, that ES does some optimizations to avoid going through all indices. One way or another this will be done separately with some perf tests.

@pavolloffay
Member Author

ES 6.5 and 7.0.0 (I was able to test this with 7 only) support rollover policies https://www.elastic.co/guide/en/elasticsearch/reference/6.x//using-policies-rollover.html. This means that rollover conditions are set in a policy and ES automatically creates a new index, so there is no need to periodically call the index/_rollover endpoint.

The following example will create a new index every 5s and delete it once it is older than 20s. To make this work at a granularity of seconds, we have to set the cluster setting indices.lifecycle.poll_interval=1s when starting ES.

docker run --rm -it -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node"  -e "indices.lifecycle.poll_interval=1s" docker.elastic.co/elasticsearch/elasticsearch:7.0.0-alpha2

curl -ivX PUT -H "Content-Type: application/json" localhost:9200/_ilm/policy/archive_index_policy -d '{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "5s"
          }
        }
      },
      "delete": {
        "min_age": "20s",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}'

curl -ivX PUT -H "Content-Type: application/json" localhost:9200/_template/archive_index_template -d '{
  "index_patterns": ["jaeger-span-archive-*"], 
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.lifecycle.name": "archive_index_policy", 
    "index.lifecycle.rollover_alias": "jaeger-span-archive-write" 
  }
}'

# Note: I am using a single alias here.
curl -ivX PUT -H "Content-Type: application/json"  localhost:9200/jaeger-span-archive-000001 -d '{
  "aliases": {
    "jaeger-span-archive-write": {"is_write_index": true}
  }
}'
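
The policy body above has only two knobs, so it can be generated from them. A sketch under the assumption that only the rollover age and the delete age vary; `ilm_policy` is a hypothetical helper and not part of any Elasticsearch client.

```python
def ilm_policy(rollover_max_age: str, delete_min_age: str) -> dict:
    """Build the body for PUT _ilm/policy/<name> with hot and delete phases."""
    return {
        "policy": {
            "phases": {
                # Hot phase: roll over once the index reaches the given age.
                "hot": {
                    "actions": {"rollover": {"max_age": rollover_max_age}}
                },
                # Delete phase: remove the index after the given minimum age.
                "delete": {
                    "min_age": delete_min_age,
                    "actions": {"delete": {}},
                },
            }
        }
    }
```

`ilm_policy("5s", "20s")` reproduces the test policy above; a production policy would use values such as days.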

@pavolloffay
Member Author

Heads up: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/index-lifecycle-management-api.html is an enterprise X-Pack feature, so we cannot use it in OSS. The only improvement we can make is time range queries (#1361).

Maybe there is an OSS plugin that provides index lifecycle management; in that case the deployment would not need to run the esRollover rollover action. However, we would still have to provide it.

@owenhaynes

Is ILM not a basic feature of Elastic 7+ now?

@pavolloffay
Member Author

@primeroz

@pavolloffay

https://discuss.elastic.co/t/filter-indices-for-range-query-in-time-based-indices/149913 mentions that range query could be used with a large number of indices, that ES does some optimizations to avoid going through all indices. One way or another this will be done separately with some perf tests.

I have been trying to find more info about that comment. Were you able to confirm how these optimizations work?

@pavolloffay
Member Author

Actually I wasn't able to find any concrete docs. There is a PR that implements wildcard index for query - depending only on time range. #1969

Our (for now) internal results show that it is slower than providing a compiled list of indices to query.

@bhiravabhatla
Contributor

@pavolloffay - I have been trying to use ILM to manage the jaeger rollovers and deletion - Instead of having a cron job hitting rollover api to manually perform rollover - as specified in this blog (https://medium.com/jaegertracing/using-elasticsearch-rollover-to-manage-indices-8b3d0c77915d).

To achieve the same, I am creating override index templates (for span and service) before running the init. Then I run esrollover.py init to create the span and service templates, the aliases, and the first indices (span-00001 and service-00001).

PUT _template/override-jaeger-span-index-template
{
  "order": 1,
  "index_patterns": ["jaeger-span-*"],
  "settings": {
    "index": {
      "lifecycle": {
        "name": "jaeger-ILM-Policy",
        "rollover_alias": "jaeger-span-write"
      }
    }
  },
  "aliases": {
    "jaeger-span-read": {}
  }
}

jaeger-ILM-Policy is created beforehand.

PUT _ilm/policy/jaeger-ILM-Policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_age": "1d" },
          "set_priority": { "priority": 100 }
        }
      },
      "delete": {
        "min_age": "1d",
        "actions": { "delete": {} }
      }
    }
  }
}

In the override template I add an alias "jaeger-span-read", which makes sure all the indices created by Jaeger have the read alias. I use "jaeger-span-write" as index_rollover_alias. I see the initial rollover (rollover from hot) working fine. I am having a challenge, when it tries to perform checks after initial rollover (to delete), it fails as the initial index or previous index no longer is part of index_rollover_alias (jaeger-span-write). I wanted to understand the rationale of using two different alias for reading and writing, we could have used one alias and used "is_write_index". I see the same mentioned in one of the above comments for archive-index.
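
The override template described in this comment could be generated as follows. This is a sketch: `override_template` and its parameter names are hypothetical, and the structure simply mirrors the template body shown in this comment.

```python
def override_template(index_pattern: str, policy_name: str,
                      write_alias: str, read_alias: str,
                      order: int = 1) -> dict:
    """Build the body for PUT _template/<name> that attaches ILM settings."""
    return {
        # A higher order lets this template override Jaeger's own template.
        "order": order,
        "index_patterns": [index_pattern],
        "settings": {
            "index": {
                "lifecycle": {
                    "name": policy_name,
                    "rollover_alias": write_alias,
                }
            }
        },
        # Every index matching the pattern gets the read alias.
        "aliases": {read_alias: {}},
    }
```

The service template would be built the same way with its own pattern and aliases.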

@pavolloffay
Member Author

I am having a challenge, when it tries to perform checks after initial rollover (to delete), it fails as the initial index or previous index no longer is part of index_rollover_alias (jaeger-span-write)

What component is causing the issue?

I wanted to understand the rationale of using two different alias for reading and writing, we could have used one alias and used "is_write_index". I see the same mentioned in one of the above comments for archive-index.

IIRC it was done for ES5. ES5 does not support the is_write_index property.

@bhiravabhatla
Contributor

@pavolloffay - Thanks for the quick reply. Since we don't set is_write_index on the initial index, span-0001 is removed from the jaeger-span-write alias (which is the ILM rollover alias) after the first rollover. When ILM polls the span-0001 index for further lifecycle events it complains:

{\"type\":\"illegal_argument_exception\",\"reason\":\"index.lifecycle.rollover_alias [jaeger-span-write] does not point to index [jaeger-span-000001]\",....}

If we add is_write_index true while creating span-0001 - I suspect this would work. I am going to give it a try and update.
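
The difference between the two setups can be modeled in a few lines. This is a simplified model of the behavior described in this thread, not Elasticsearch code; both function names are hypothetical.

```python
from typing import List


def write_alias_after_rollover(members: List[str], new_index: str,
                               uses_is_write_index: bool) -> List[str]:
    """Return the indices behind the write alias after one rollover."""
    if uses_is_write_index:
        # Old indices stay in the alias; only the write flag moves on.
        return members + [new_index]
    # Without the flag the alias is swapped to point at the new index only.
    return [new_index]


def ilm_alias_check_passes(index: str, alias_members: List[str]) -> bool:
    """Model the ILM check that the rollover alias still points to `index`."""
    return index in alias_members
```

In this model jaeger-span-000001 fails the check after a rollover without the flag and passes it with the flag, which matches the error above.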

@pavolloffay
Member Author

thanks @bhiravabhatla. It would be great to put a guide/docs or blog post on this topic if you are interested.

@bhiravabhatla
Contributor

bhiravabhatla commented Aug 31, 2020

Will do @pavolloffay, thank you! I think we can set is_write_index to true while adding indices to the write alias by passing extra_settings here -

alias.add(ilo)

Correct me if I am wrong

@bhiravabhatla
Contributor

bhiravabhatla commented Aug 31, 2020

Hi @pavolloffay - I was able to implement this with a few tweaks to esRollover.py. I pushed the updated image here - https://github.com/bhiravabhatla/jaeger-index-rollover-with-ilm. I have tested it with an example application, and I could see that Jaeger is able to read from the read alias and ILM is able to roll over and delete indices as specified in the config.

Note - Have not tested for archive indices.

Summary:
To use ILM for managing Jaeger rollover, I followed the steps below:

-- Create an ILM policy for Jaeger in Elasticsearch. In the sample below, for demo purposes, I have set max_age and the delete age in minutes.

Sample :
PUT _ilm/policy/jaeger-ILM-Policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_age": "1m" },
          "set_priority": { "priority": 100 }
        }
      },
      "delete": {
        "min_age": "2m",
        "actions": { "delete": {} }
      }
    }
  }
}

-- Run init to create the initial set of aliases and templates. I am creating override templates (with a different name and order=1) because when Jaeger starts up, it creates/updates the templates named jaeger-service and jaeger-span.

docker run -it --rm --net=host bhiravabhatla/jaeger-es-rollover-init:latest init <ES HOST>

-- Start Jaeger with es.use-aliases=true

Note - By default indices.lifecycle.poll_interval is set to 10m; for testing, we have to set it to something lower, say 10s.

PUT /_cluster/settings?flat_settings=true
{
  "transient": {
    "indices.lifecycle.poll_interval": "10s"
  }
}

@bhiravabhatla
Contributor

@pavolloffay - Could you please share feedback on above. One thing I could have done was to parameterise jaeger ILM policy names in the templates

@pavolloffay
Member Author

One thing I could have done was to parameterise jaeger ILM policy names in the templates

In the jaeger index templates? We should make the ILM work with the upstream Jaeger if possible without requiring users to do changes.

I don't have experience with ILM configuration so I cannot really comment on whether it's good or not. Perhaps somebody from @jaegertracing/elasticsearch can have a look at the approach mentioned above?

@bhiravabhatla would you be interested in documenting this in jaegertracing.io or writing a medium post?

@bhiravabhatla
Contributor

bhiravabhatla commented Sep 1, 2020

In the jaeger index templates? We should make the ILM work with the upstream Jaeger if possible without requiring users to do changes.

I agree we should make this work with upstream Jaeger. The above can be seen as a workaround to use ILM with current Jaeger capabilities. In future, I think we can have a flag --es.use-ILM or something similar and create the index template accordingly from jaeger itself - open to discussions on this.

@bhiravabhatla would you be interested in documenting this in jaegertracing.io or writing a medium post?

Sure. I can, let me know the process.

@pavolloffay
Member Author

pavolloffay commented Sep 1, 2020

I think we can have a flag --es.use-ILM or something similar and create the index template accordingly from jaeger itself - open to discussions on this.

Would you also be interested in submitting a PR to do this?

The docs are hosted here https://github.com/jaegertracing/documentation/blob/master/content/docs/next-release/deployment.md#elasticsearch you can create a PR against that. The blog is hosted on medium https://medium.com/jaegertracing. If you prefer the blog I can add you to the medium Jaeger org so that you can submit a publication there - I will need your medium account.

@bhiravabhatla
Contributor

@pavolloffay - I actually have drafted a blog in my medium account. Have not published yet. My medium account https://medium.com/@bhiravabhatla

@bhiravabhatla
Contributor

Would you also be interested in submitting a PR to do this?

Have not used golang before, I am interested - but would need some help. :)

@pavolloffay
Member Author

np we can help you with golang :). I have sent you an invite on medium to join jaegertracing.

@bhiravabhatla
Contributor

bhiravabhatla commented Sep 1, 2020

np we can help you with golang :)

Thank you :).

I have sent you an invite on medium to join jaegertracing.

Thank you @pavolloffay - I have submitted the draft.
