Performance degradation after upgrading from 8.6.1 to 8.16.1 #118623

Open
gregolsen opened this issue Dec 12, 2024 · 10 comments
Labels
>bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@gregolsen

gregolsen commented Dec 12, 2024

Elasticsearch Version

8.16.1

Installed Plugins

No response

Java Version

bundled

OS Version

6.1.112-124.190.amzn2023.aarch64

Problem Description

After starting an upgrade from 8.6.1 to 8.16.1, we noticed that data nodes running the new version perform worse.
Their CPU usage was significantly higher than on the data nodes running the old version:
Image

We also noticed that nodes on the new version have a much higher flush rate:
Image

It doesn't matter whether an old node is upgraded to 8.16.1 or a brand new node is spun up with 8.16.1: they all show the same symptoms of higher CPU usage and a much higher flush rate, degrading the performance of the cluster as a whole.

We added more data nodes running the new 8.16.1 version and now have a cluster with 35 "old" nodes and 36 "new" nodes.
The new nodes consume twice as much CPU and have a much higher flush rate:
Image
Image
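
For anyone who wants to reproduce the flush-rate comparison without a dashboard, here is a minimal sketch. It assumes two snapshots of the `GET _nodes/stats/indices` response taken a known number of seconds apart and compares the `indices.flush.total` counter per node. The function names, node names, and numbers are hypothetical and only illustrate the arithmetic.

```python
def flush_rates(before, after, interval_seconds):
    """Return flushes per second for each node present in both snapshots."""
    rates = {}
    for node_id, node in after["nodes"].items():
        earlier = before["nodes"].get(node_id)
        if earlier is None:
            continue  # node joined between the two snapshots
        delta = node["indices"]["flush"]["total"] - earlier["indices"]["flush"]["total"]
        rates[node["name"]] = delta / interval_seconds
    return rates


def snapshot(totals):
    """Build a trimmed-down node stats response from {name: flush_total} (demo helper)."""
    return {
        "nodes": {
            name: {"name": name, "indices": {"flush": {"total": total}}}
            for name, total in totals.items()
        }
    }


# Hypothetical counters sampled 60 seconds apart.
before = snapshot({"old-node": 1000, "new-node": 1000})
after = snapshot({"old-node": 1030, "new-node": 1300})
rates = flush_rates(before, after, interval_seconds=60)
print(rates)
```

The counters reset when a node restarts, so a negative delta most likely means the node was restarted between the two samples and that pair should be discarded.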

Another notable difference is that the new nodes have a larger cache size:
Image

The reason why we were upgrading to 8.16.1 is to mitigate this issue

We are running Elasticsearch on EC2 with ephemeral instance store on c7gd.16xlarge. Apart from the Elasticsearch version, no changes were made to the cluster infrastructure, JVM/Elasticsearch options, etc.

Steps to Reproduce

Upgrade an 8.6.1 data node to 8.16.1

Logs (if relevant)

No response

None of the settings below changed during the upgrade

JVM options

## JVM configuration
-Xms30500m
-Xmx30500m
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
8:-XX:+AlwaysPreTouch
8:-Xss1m
8:-Djava.awt.headless=true
8:-Dfile.encoding=UTF-8
8:-Djna.nosys=true
8:-XX:-OmitStackTraceInFastThrow
8:-Dio.netty.noUnsafe=true
8:-Dio.netty.noKeySetOptimization=true
8:-Dio.netty.recycler.maxCapacityPerThread=0
8:-Dlog4j.shutdownHookEnabled=false
8:-Dlog4j2.disable.jmx=true
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
9-:-Djava.locale.providers=COMPAT
-Des.index.memory.max_index_buffer_size=10240m
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
14-:-Djava.io.tmpdir=${ES_TMPDIR}
14-:-XX:+HeapDumpOnOutOfMemoryError
14-:-XX:HeapDumpPath=data
14-:-XX:ErrorFile=logs/hs_err_pid%p.log
-Dlog4j2.formatMsgNoLookups=true
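
Since the file above mixes version-prefixed lines, it may help to check which of them actually apply to the bundled JDK. The sketch below follows the documented `jvm.options` prefix syntax (`n:` for exactly version n, `n-:` for n and above, `n-m:` for n through m, no prefix for all versions). The function name is hypothetical, the sample is a subset of the lines above, and Java 23 is assumed from the `java.base@23` frames in the hot threads output later in this thread.

```python
import re

# Optional "n:", "n-:" or "n-m:" prefix in front of a JVM flag.
_PREFIX = re.compile(r"^(\d+)(-(\d+)?)?:(.*)$")


def applicable_options(lines, java_major):
    """Return the flags from jvm.options lines that apply to the given Java major version."""
    selected = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("-"):
            selected.append(line)  # unprefixed flags apply to every version
            continue
        match = _PREFIX.match(line)
        if match is None:
            continue  # not a recognised option line
        low = int(match.group(1))
        if match.group(2) is None:
            high = low  # "n:" means exactly n
        elif match.group(3) is None:
            high = None  # "n-:" means n and above
        else:
            high = int(match.group(3))  # "n-m:" means n through m
        if java_major >= low and (high is None or java_major <= high):
            selected.append(match.group(4))
    return selected


sample = [
    "## JVM configuration",
    "-Xms30500m",
    "8-13:-XX:+UseConcMarkSweepGC",
    "8:-XX:+AlwaysPreTouch",
    "9-:-Djava.locale.providers=COMPAT",
    "14-:-XX:+UseG1GC",
]
active = applicable_options(sample, java_major=23)
print(active)
```

Under this reading, only the unprefixed lines and the `9-:` and `14-:` lines would be passed to a Java 23 JVM; the `8:` and `8-13:` lines would be ignored.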

Elasticsearch static node config

cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
action.auto_create_index: false
indices.memory.index_buffer_size: 25%
network.host: 0.0.0.0
http.compression: true
action.destructive_requires_name: true
bootstrap.memory_lock: true
thread_pool.write.queue_size: 2500
script.max_compilations_rate: 30/1m
xpack.graph.enabled: false
xpack.ml.enabled: false
xpack.watcher.enabled: false
discovery.seed_providers: ec2
indices.breaker.request.limit: 1gb
indices.breaker.fielddata.limit: 1gb
xpack.security.enabled: true
xpack.security.http.ssl.enabled: false
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
transport.compress: true
transport.compression_scheme: lz4
cluster.name: CLUSTER_NAME
discovery.ec2.tag.elasticsearch_cluster_name: CLUSTER_NAME
node.name: NODE_NAME
path.data: "/media/ephemeral"
path.logs: "/var/log/elasticsearch"
reindex.remote.whitelist: 10.0.*.*:9200
node.roles:
- data

Elasticsearch cluster settings

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "cluster_concurrent_rebalance": "100",
          "node_concurrent_recoveries": "100",
          "disk": {
            "watermark": {
              "low": "75%",
              "high": "85%"
            }
          },
          "node_initial_primaries_recoveries": "114"
        }
      }
    },
    "indices": {
      "breaker": {
        "fielddata": {
          "limit": "1gb"
        },
        "request": {
          "limit": "1gb"
        }
      },
      "recovery": {
        "max_bytes_per_sec": "1000mb"
      }
    },
    "search": {
      "default_search_timeout": "120s",
      "default_allow_partial_results": "false",
      "max_open_scroll_context": "500000"
    },
    "ingest": {
      "geoip": {
        "downloader": {
          "enabled": "false"
        }
      }
    },
    "logger": {
      "org": {
        "elasticsearch": {
          "deprecation": "ERROR"
        }
      }
    }
  },
  "transient": {}
}
@gregolsen gregolsen added >bug needs:triage Requires assignment of a team area label labels Dec 12, 2024
@javanna
Member

javanna commented Dec 13, 2024

After starting an upgrade from 8.6.1 to 8.16.1 we noticed that data nodes running new version are performing worse.

I don't see any details on how the nodes are performing worse, only higher CPU usage, which is not necessarily a problem in itself. Could you expand on the performance you are observing? Which API is affected, what changes are you seeing, etc.?

@gregolsen
Author

@javanna sorry - my bad!
Here's an example of a node that I upgraded to 8.16.1 in place (pause shard rebalancing, stop old ES, install new version, start ES, enable allocation)
Image

After the upgrade, the node's search latency got worse and never settled back to its original value, while CPU usage was much higher:
Image

Here's how flush rate changed for that node:
Image

But even if all the other metrics had stayed the same, I would still consider the CPU increase a problem: nodes doing the same amount of work while consuming significantly more CPU means performance has effectively degraded, and I would have to scale up the cluster after the upgrade.

@gregolsen
Author

I also added ES and JVM config/settings to the issue description – none of those changed between the upgrades.

@gregolsen
Author

gregolsen commented Dec 13, 2024

I wonder if this is related to the improvements introduced in #94607

@gregolsen
Author

Another indirect sign of the degradation on the upgraded node is that the number of active threads in the search thread pool went up:
Image

@gregolsen
Author

gregolsen commented Dec 13, 2024

Another observation, which may or may not be relevant, is the change in query cache behavior. Judging by hit rate and size, the query cache became more effective, but at the same time the eviction rate increased, reducing the overall memory footprint. Increased evictions were mentioned here too, and getting the query cache memory overcounting issue fixed was our original intent behind the upgrade.
Image
Image

@javanna javanna added :Search Foundations/Search Catch all for Search Foundations and removed feedback_needed needs:triage Requires assignment of a team area label labels Dec 13, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Dec 13, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@javanna
Member

javanna commented Dec 13, 2024

What do your queries look like? Did you identify specific ones that are slowing down more than others? Could you run them with profile enabled so that we can see where the time is spent?

@gregolsen
Author

I wasn't able to pinpoint a specific query shape. I guess I would need to complete the migration, get rid of the nodes running the old version, and then compare query performance on the old version against the new one.
However, given how much higher the CPU utilization is, I will have to scale up the cluster, so it won't be a fair performance comparison.

@gregolsen
Author

gregolsen commented Dec 13, 2024

Upon comparing the hot threads API output for 8.6.1 and 8.16.1 nodes, I noticed that this stack shows up a lot on 8.16.1 and is missing on 8.6.1:

   100.0% [cpu=97.6%, other=2.4%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[es-data-i-022a7c6157cb44c54][search][T#84]'
     3/10 snapshots sharing following 43 elements
       java.base@23/java.lang.invoke.VarHandleGuards.guard_LJ_I(VarHandleGuards.java:1002)
       java.base@23/jdk.internal.foreign.AbstractMemorySegmentImpl.get(AbstractMemorySegmentImpl.java:765)
       app/[email protected]/org.apache.lucene.store.MemorySegmentIndexInput.readShort(MemorySegmentIndexInput.java:247)
       app/[email protected]/org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:113)
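
To put a number on "shows up a lot", the hot threads text from each node can be scanned for a frame of interest. This is a minimal sketch, not a full parser: it assumes each thread section begins with a line containing `cpu usage by thread`, as in the output above, and the thread names and the second stack frame in the sample are made up.

```python
def threads_with_frame(hot_threads_text, needle):
    """Count hot threads whose captured stack mentions `needle`.

    Returns (matching_threads, total_threads).
    """
    total = 0
    matching = 0
    in_thread = False
    seen = False
    for line in hot_threads_text.splitlines():
        if "cpu usage by thread" in line:
            # A new thread section starts; account for the previous one.
            if in_thread and seen:
                matching += 1
            total += 1
            in_thread = True
            seen = False
        elif in_thread and needle in line:
            seen = True
    if in_thread and seen:
        matching += 1  # account for the last section
    return matching, total


# Hypothetical, shortened hot threads output with two thread sections.
sample = """\
   100.0% [cpu=97.6%, other=2.4%] (500ms out of 500ms) cpu usage by thread 'search-thread-1'
     3/10 snapshots sharing following 43 elements
       org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:113)
    50.0% [cpu=50.0%, other=0.0%] (250ms out of 500ms) cpu usage by thread 'search-thread-2'
     10/10 snapshots sharing following 2 elements
       example.Other.method(Other.java:1)
"""
result = threads_with_frame(sample, "LZ4.decompress")
print(result)
```

Running the same count against output from an 8.6.1 node and an 8.16.1 node gives a rough side-by-side figure, though hot threads is a sampling tool and the counts will vary between calls.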

This points to LZ4.decompress being slower in Lucene 9.12.0 due to guard_LJ_I.
I found dacapobench/dacapobench#264 and am considering trying out

-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
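
One way to apply a flag like this without editing the main `jvm.options` file is to drop it into a file ending in `.options` under the `jvm.options.d` directory of the Elasticsearch config directory, then restart the node. The sketch below only writes such a file; the file name is hypothetical, the config directory depends on how Elasticsearch was installed, and the demonstration uses a temporary directory rather than a real config path.

```python
from pathlib import Path
import tempfile

FLAG = "-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"


def write_override(config_dir, file_name="disable-memory-segments.options"):
    """Write FLAG into <config_dir>/jvm.options.d/<file_name> and return the path."""
    options_dir = Path(config_dir) / "jvm.options.d"
    options_dir.mkdir(parents=True, exist_ok=True)
    target = options_dir / file_name
    target.write_text(FLAG + "\n", encoding="utf-8")
    return target


# Demonstration against a throwaway directory instead of a real config path.
with tempfile.TemporaryDirectory() as tmp:
    written = write_override(tmp)
    content = written.read_text(encoding="utf-8")
    print(written.name, content.strip())
```

Keeping the override in its own file makes it easy to remove again once the experiment is over.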

Update

I restarted one 8.16.1 node after disabling Lucene memory segments with the setting above and its CPU went down!
Image

I repeated the experiment by restarting two 8.16.1 nodes: one without any settings changes (control) and another one with memory segments disabled again. Again, CPU on the node with memory segments disabled went down significantly:
Image
