Performance degradation after upgrading from 8.6.1 to 8.16.1 #118623

gregolsen · 2024-12-12T21:38:03Z

Elasticsearch Version

8.16.1

Installed Plugins

No response

Java Version

bundled

OS Version

6.1.112-124.190.amzn2023.aarch64

Problem Description

After starting an upgrade from 8.6.1 to 8.16.1 we noticed that data nodes running the new version are performing worse.
Their CPU usage was significantly higher than on the data nodes running the old version:

We also noticed that nodes on the new version have a much higher flush rate:

It doesn't matter whether an old node is upgraded to 8.16.1 or a brand new node is spun up with 8.16.1: they all show the same symptoms of higher CPU usage and a much higher flush rate, degrading the performance of the cluster as a whole.
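For anyone who wants to reproduce the flush-rate comparison, here is a minimal sketch. It assumes the rate is derived from the cumulative `indices.flush.total` counter in the node stats API, sampled twice per node. The function name and the numbers in the usage note are made up for illustration, and this is not necessarily how the graphs in this issue were produced.

```python
def flush_rate_per_minute(total_before: int, total_after: int, seconds: float) -> float:
    """Flushes per minute between two samples of a cumulative flush counter."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    delta = total_after - total_before
    if delta < 0:
        # The counter went backwards, e.g. the node restarted between samples.
        raise ValueError("counter reset between samples")
    return delta * 60.0 / seconds
```

For example, two samples taken 120 seconds apart with counter values 100 and 160 give 30 flushes per minute. Comparing this figure between an 8.6.1 node and an 8.16.1 node carrying similar shards would show the difference described above.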

We added more data nodes with the new 8.16.1 version and now have a cluster with 35 "old" nodes and 36 "new" nodes.
The new nodes consume twice as much CPU and have a much higher flush rate:

Another notable difference is that the new nodes have a larger cache size:

We were upgrading to 8.16.1 to mitigate a query cache memory overcounting issue.

We are running Elasticsearch on EC2 with ephemeral instance store on c7gd.16xlarge. Aside from the Elasticsearch version, no changes were made to the cluster infrastructure, JVM/Elasticsearch options, etc.

Steps to Reproduce

Upgrade an 8.6.1 data node to 8.16.1

Logs (if relevant)

No response

None of the settings below changed during the upgrade

JVM options

## JVM configuration
-Xms30500m
-Xmx30500m
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
8:-XX:+AlwaysPreTouch
8:-Xss1m
8:-Djava.awt.headless=true
8:-Dfile.encoding=UTF-8
8:-Djna.nosys=true
8:-XX:-OmitStackTraceInFastThrow
8:-Dio.netty.noUnsafe=true
8:-Dio.netty.noKeySetOptimization=true
8:-Dio.netty.recycler.maxCapacityPerThread=0
8:-Dlog4j.shutdownHookEnabled=false
8:-Dlog4j2.disable.jmx=true
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
9-:-Djava.locale.providers=COMPAT
-Des.index.memory.max_index_buffer_size=10240m
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
14-:-Djava.io.tmpdir=${ES_TMPDIR}
14-:-XX:+HeapDumpOnOutOfMemoryError
14-:-XX:HeapDumpPath=data
14-:-XX:ErrorFile=logs/hs_err_pid%p.log
-Dlog4j2.formatMsgNoLookups=true
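
Since many of the lines above carry a Java version prefix, it may help to see which ones actually take effect. The sketch below assumes the documented `jvm.options` prefix rules (`8:` means Java 8 only, `8-13:` means Java 8 through 13, `14-:` means Java 14 and later, no prefix means every version). The function name `effective_options` is hypothetical and this is not Elasticsearch's own parser.

```python
import re

# "<low>:", "<low>-:" or "<low>-<high>:" followed by the option itself.
_PREFIX = re.compile(r"^(\d+)(-)?(\d+)?:(.*)$")

def effective_options(lines, java_major):
    """Return the options from `lines` that apply to the given Java major version."""
    result = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = _PREFIX.match(line)
        if m is None:
            result.append(line)  # no version prefix: always applies
            continue
        low = int(m.group(1))
        if m.group(2) is None:
            high = low            # "8:" applies to exactly that version
        elif m.group(3) is None:
            high = None           # "14-:" is an open-ended range
        else:
            high = int(m.group(3))  # "8-13:" is a closed range
        if java_major >= low and (high is None or java_major <= high):
            result.append(m.group(4))
    return result
```

With the bundled JDK reporting `java.base@23` (as seen in the hot threads output in this thread), calling `effective_options(lines, 23)` on the file above would keep only the unprefixed lines plus the `9-:` and `14-:` entries. The CMS flags and the `8:` entries would be skipped.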

Elasticsearch static node config

cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
action.auto_create_index: false
indices.memory.index_buffer_size: 25%
network.host: 0.0.0.0
http.compression: true
action.destructive_requires_name: true
bootstrap.memory_lock: true
thread_pool.write.queue_size: 2500
script.max_compilations_rate: 30/1m
xpack.graph.enabled: false
xpack.ml.enabled: false
xpack.watcher.enabled: false
discovery.seed_providers: ec2
indices.breaker.request.limit: 1gb
indices.breaker.fielddata.limit: 1gb
xpack.security.enabled: true
xpack.security.http.ssl.enabled: false
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
transport.compress: true
transport.compression_scheme: lz4
cluster.name: CLUSTER_NAME
discovery.ec2.tag.elasticsearch_cluster_name: CLUSTER_NAME
node.name: NODE_NAME
path.data: "/media/ephemeral"
path.logs: "/var/log/elasticsearch"
reindex.remote.whitelist: 10.0.*.*:9200
node.roles:
- data

Elasticsearch cluster settings

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "cluster_concurrent_rebalance": "100",
          "node_concurrent_recoveries": "100",
          "disk": {
            "watermark": {
              "low": "75%",
              "high": "85%"
            }
          },
          "node_initial_primaries_recoveries": "114"
        }
      }
    },
    "indices": {
      "breaker": {
        "fielddata": {
          "limit": "1gb"
        },
        "request": {
          "limit": "1gb"
        }
      },
      "recovery": {
        "max_bytes_per_sec": "1000mb"
      }
    },
    "search": {
      "default_search_timeout": "120s",
      "default_allow_partial_results": "false",
      "max_open_scroll_context": "500000"
    },
    "ingest": {
      "geoip": {
        "downloader": {
          "enabled": "false"
        }
      }
    },
    "logger": {
      "org": {
        "elasticsearch": {
          "deprecation": "ERROR"
        }
      }
    }
  },
  "transient": {}
}


javanna · 2024-12-13T09:05:34Z

After starting an upgrade from 8.6.1 to 8.16.1 we noticed that data nodes running the new version are performing worse.

I don't find any details on how nodes are performing worse, but rather higher CPU usage, which is per se not necessarily a problem. Could you expand on the performance you are observing? Which API, what changes are you observing etc.

gregolsen · 2024-12-13T09:24:08Z

@javanna sorry - my bad!
Here's an example of a node that I upgraded to 8.16.1 in place (pause shard rebalancing, stop old ES, install new version, start ES, enable allocation)
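
For readers following along, the pause and resume steps can be sketched as two request bodies for `PUT _cluster/settings`. Which setting was actually used here is not stated, so this is an assumption: the sketch uses `cluster.routing.allocation.enable`, set to `primaries` before the restart and reset to its default with `null` afterwards, as in the documented rolling-upgrade procedure. The helper name is hypothetical.

```python
import json

def allocation_settings_body(value):
    """JSON body for PUT _cluster/settings; a value of None resets the setting."""
    return json.dumps(
        {"persistent": {"cluster.routing.allocation.enable": value}}
    )

# Before stopping the old node: only allow primaries to be allocated.
before_restart = allocation_settings_body("primaries")

# After the upgraded node has rejoined: restore the default behaviour.
after_restart = allocation_settings_body(None)
```

Sending `before_restart` prior to stopping the node and `after_restart` once it has rejoined keeps the cluster from reshuffling replicas during the short outage.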

After the upgrade, the node's search latency got worse and never settled back to its original value, while CPU usage was much higher:

Here's how flush rate changed for that node:

But even if all the other metrics had stayed the same, I would still consider the CPU increase a problem: nodes doing the same amount of work while consuming significantly more CPU means performance has effectively degraded, and I would have to scale up the cluster after the upgrade.

gregolsen · 2024-12-13T09:30:51Z

I also added ES and JVM config/settings to the issue description – none of those changed between the upgrades.

gregolsen · 2024-12-13T09:58:54Z

I wonder if this is related to the improvements introduced here #94607

gregolsen · 2024-12-13T10:05:03Z

Another indirect indication of how performance degraded on the upgraded node - active threads in search thread pool went up:

gregolsen · 2024-12-13T10:12:36Z

Another observation, which may or may not be relevant, is the change in query cache behavior. Judging by hit rate and size, the query cache appears to have become more effective, but at the same time the rate of evictions increased, reducing the overall memory footprint. Increased evictions were mentioned here too, and getting the query cache memory overcounting issue fixed was our original intent behind the upgrade.
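
To make the hit-rate and eviction comparison concrete, here is a minimal sketch. It assumes two snapshots of the `indices.query_cache` section of the node stats API, taken some time apart on the same node. The helper name and the sample numbers in the usage note are invented for illustration.

```python
def query_cache_delta(before: dict, after: dict) -> dict:
    """Summarise query cache activity between two node stats snapshots."""
    hits = after["hit_count"] - before["hit_count"]
    misses = after["miss_count"] - before["miss_count"]
    lookups = hits + misses
    return {
        # Fraction of lookups served from the cache during the interval.
        "hit_rate": hits / lookups if lookups else 0.0,
        "evictions": after["evictions"] - before["evictions"],
        "memory_delta_bytes": (
            after["memory_size_in_bytes"] - before["memory_size_in_bytes"]
        ),
    }
```

For instance, going from 100 hits and 100 misses to 400 hits and 200 misses gives a hit rate of 0.75 for the interval. Running the same calculation on an old and a new node over the same window would put numbers on the behavior described above.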

elasticsearchmachine · 2024-12-13T12:48:48Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

javanna · 2024-12-13T12:49:14Z

What do your queries look like? Did you identify specific ones that are slowing down more than others? Could you run them with profile enabled so that we look where time is spent?

gregolsen · 2024-12-13T15:23:55Z

I wasn't able to pinpoint a specific query shape. I guess I would need to complete the migration, get rid of the nodes running the old version, and then compare the performance of the queries on the old version vs. the new one.
However, given how much higher the CPU utilization is, I will have to scale up the cluster. So it won't be a fair comparison of performance.

gregolsen · 2024-12-13T17:19:50Z

Upon comparing hot threads API output for 8.6.1 and 8.16.1 nodes, I noticed that this stack shows up a lot in 8.16.1 and is missing from 8.6.1:

   100.0% [cpu=97.6%, other=2.4%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[es-data-i-022a7c6157cb44c54][search][T#84]'
     3/10 snapshots sharing following 43 elements
       java.base@23/java.lang.invoke.VarHandleGuards.guard_LJ_I(VarHandleGuards.java:1002)
       java.base@23/jdk.internal.foreign.AbstractMemorySegmentImpl.get(AbstractMemorySegmentImpl.java:765)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.store.MemorySegmentIndexInput.readShort(MemorySegmentIndexInput.java:247)
       app/org.apache.lucene.core@9.12.0/org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:113)

This points towards LZ4.decompress being slower in Lucene 9.12.0 due to guard_LJ_I.
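
A comparison like this can be made less anecdotal by counting how many sampled threads contain a given frame. The sketch below assumes the plain-text hot threads output has been saved per node, and that each thread section starts with a line containing `cpu usage by thread`, as in the excerpt above. The function name is hypothetical.

```python
def threads_with_frame(hot_threads_text: str, needle: str) -> tuple[int, int]:
    """Return (threads whose stack mentions `needle`, total threads seen)."""
    sections = []
    for line in hot_threads_text.splitlines():
        if "cpu usage by thread" in line:
            sections.append([])        # a new thread section starts here
        elif sections:
            sections[-1].append(line)  # stack frame or snapshot header
    matching = sum(any(needle in frame for frame in s) for s in sections)
    return matching, len(sections)
```

Calling it with `needle="LZ4.decompress"` or `needle="VarHandleGuards"` on output from an 8.6.1 node and an 8.16.1 node gives a rough count to compare, rather than relying on eyeballing the stacks.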
I found dacapobench/dacapobench#264 and am considering trying out:

-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false

Update

I restarted one 8.16.1 node after disabling Lucene memory segments with the setting above and its CPU went down!
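
For anyone wanting to try the same thing, one way to apply the flag is sketched below. This is an assumption: how the setting was applied on these nodes is not stated. The sketch drops a `.options` file into Elasticsearch's `config/jvm.options.d/` directory, which is read at startup. The file name and the helper are hypothetical, and the node must be restarted for the flag to take effect.

```python
from pathlib import Path

# The system property quoted earlier in this comment.
FLAG = "-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"

def write_override(options_dir: Path,
                   name: str = "disable-memory-segments.options") -> Path:
    """Write a one-line JVM options file and return its path."""
    options_dir.mkdir(parents=True, exist_ok=True)
    target = options_dir / name
    target.write_text(FLAG + "\n", encoding="utf-8")
    return target
```

Keeping the override in its own file makes it easy to remove later and leaves the main `jvm.options` identical across control and test nodes.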

I repeated the experiment by restarting two 8.16.1 nodes: one without any settings changes (control) and another one, again, with memory segments disabled. Again, CPU on the node with memory segments disabled went down significantly:

gregolsen added the >bug and needs:triage labels Dec 12, 2024

javanna added the feedback_needed label Dec 13, 2024

javanna added the :Search Foundations/Search label and removed the feedback_needed and needs:triage labels Dec 13, 2024

elasticsearchmachine added the Team:Search Foundations label Dec 13, 2024
