Concurrently reconcile CloudStackMachine resources #290

chrisdoherty4 · 2023-07-13T17:27:19Z

AWS analyzed CAPC in high-node-count contexts and found that it takes considerable time to scale clusters. Part of the issue stems from CloudStackMachine resources being reconciled serially. This change enables concurrent reconciliation of CloudStackMachine resources, improving efficiency and preventing other parts of the system from reacting to slowness.

I tested these changes by scaling a machine deployment up and down between 1 and 11 nodes. Scale-ups took comparable time (55s) to serial reconciliation, which is expected because most of the time is consumed by VM provisioning. Scale-down improved from 1m57s to 27s, roughly a 77% reduction.

Related #274

k8s-ci-robot · 2023-07-13T17:27:21Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2023-07-13T17:27:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrisdoherty4

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [chrisdoherty4]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2023-07-13T17:27:26Z

✅ Deploy Preview for kubernetes-sigs-cluster-api-cloudstack ready!

Name	Link
🔨 Latest commit	`9f73dae`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-cluster-api-cloudstack/deploys/64b1b1fa22dde80007b9dc8b
😎 Deploy Preview	https://deploy-preview-290--kubernetes-sigs-cluster-api-cloudstack.netlify.app

To edit notification comments on pull requests, go to your Netlify site configuration.

codecov-commenter · 2023-07-14T20:12:14Z

Codecov Report

Patch coverage has no change and project coverage change: -0.05 ⚠️

Comparison is base (4ccf853) 25.29% compared to head (9f73dae) 25.25%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #290      +/-   ##
==========================================
- Coverage   25.29%   25.25%   -0.05%     
==========================================
  Files          59       59              
  Lines        5585     5582       -3     
==========================================
- Hits         1413     1410       -3     
  Misses       4035     4035              
  Partials      137      137

Impacted Files	Coverage Δ
controllers/cloudstackmachine_controller.go	`54.85% <ø> (ø)`
pkg/cloud/instance.go	`82.38% <ø> (-0.16%)`	⬇️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

chrisdoherty4 · 2023-07-14T20:38:21Z

/run-e2e -c 4.18

g-gaston · 2023-07-14T21:06:53Z

/lgtm

g-gaston · 2023-07-14T21:07:24Z

/hold

chrisdoherty4 · 2023-07-14T21:07:27Z

/hold

chrisdoherty4 · 2023-07-14T21:17:49Z

The E2E tests don't seem to be getting kicked off?

/assign @vishesh92 @weizhouapache

k8s-ci-robot · 2023-07-14T21:17:51Z

@chrisdoherty4: GitHub didn't allow me to assign the following users: vishesh92.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

The E2E don't seem to be getting kicked off?

/assign @vishesh92 @weizhouapache

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rohityadavcloud · 2023-07-15T04:19:35Z

@chrisdoherty4 it's possible the backend BO script is hitting GitHub rate limits (we recently found that the script and Jenkins jobs were running but couldn't post results, because the GitHub rate limit would void the API request to post comments for some reason)

/run-e2e help

blueorangutan · 2023-07-15T04:20:02Z

@rohityadavcloud
The command to run e2e test for CAPC.

Usage: /run-e2e [-k Kubernetes_Version] [-c CloudStack_Version] [-h Hypervisor] [-i Template/Image] [-f Kubernetes_Version_Upgrade_From] [-t Kubernetes_Version_Upgrade_To]

Supported Kubernetes versions are: ['1.27.2', '1.26.5', '1.25.10', '1.24.14', '1.23.3', '1.22.6']. The default value is '1.27.2'.
Supported CloudStack versions are: ['4.18', '4.17', '4.16']. If it is not set, an existing environment will be used.
Supported hypervisors are: ['kvm', 'vmware', 'xen']. The default value is 'kvm'.
Supported templates are: ['ubuntu-2004-kube', 'rockylinux-8-kube']. The default value is 'ubuntu-2004-kube'.
By default it tests Kubernetes upgrade from version '1.26.5' to '1.27.2'.

Examples:

/run-e2e
/run-e2e -k 1.27.2 -h kvm -i ubuntu-2004-kube
/run-e2e -k 1.27.2 -c 4.18 -h kvm -i ubuntu-2004-kube -f 1.26.5 -t 1.27.2

rohityadavcloud · 2023-07-15T04:20:13Z

/run-e2e -c 4.18

blueorangutan · 2023-07-15T04:21:04Z

@rohityadavcloud a Jenkins job has been kicked off to run tests with the following parameters:

kubernetes version: 1.27.2
CloudStack version: 4.18
hypervisor: kvm
template: ubuntu-2004-kube
Kubernetes upgrade from: 1.26.5 to 1.27.2

blueorangutan · 2023-07-15T11:20:41Z

Test Results : (tid-272)
Environment: kvm Rocky8(x3), Advanced Networking with Management Server Rocky8
Kubernetes Version: v1.27.2
Kubernetes Version upgrade from: v1.26.5
Kubernetes Version upgrade to: v1.27.2
CloudStack Version: 4.18
Template: ubuntu-2004-kube
E2E Test Run Logs: https://github.com/blueorangutan/capc-prs/releases/download/capc-pr-ci-cd/capc-e2e-artifacts-pr290-sl-272.zip

[PASS] When testing node drain timeout A node should be forcefully removed if it cannot be drained in time
[PASS] with two clusters should successfully add and remove a second cluster without breaking the first cluster
[PASS] When testing subdomain Should create a cluster in a subdomain
[PASS] When testing app deployment to the workload cluster with network interruption [ToxiProxy] Should be able to create a cluster despite a network interruption during that process
[PASS] When testing affinity group Should have host affinity group when affinity is anti
[PASS] When testing machine remediation Should replace a machine when it is destroyed
[PASS] When testing Kubernetes version upgrades Should successfully upgrade kubernetes versions when there is a change in relevant fields
[PASS] When testing K8S conformance [Conformance] Should create a workload cluster and run kubetest
[PASS] When testing MachineDeployment rolling upgrades Should successfully upgrade Machines upon changes in relevant MachineDeployment fields
[PASS] When testing with custom disk offering Should successfully create a cluster with a custom disk offering
[PASS] When testing multiple CPs in a shared network with kubevip Should successfully create a cluster with multiple CPs in a shared network
[PASS] When testing with disk offering Should successfully create a cluster with disk offering
[PASS] When the specified resource does not exist Should fail due to the specified account is not found [TC4a]
[PASS] When the specified resource does not exist Should fail due to the specified domain is not found [TC4b]
[PASS] When the specified resource does not exist Should fail due to the specified control plane offering is not found [TC7]
[PASS] When the specified resource does not exist Should fail due to the specified template is not found [TC6]
[PASS] When the specified resource does not exist Should fail due to the specified zone is not found [TC3]
[PASS] When the specified resource does not exist Should fail due to the specified disk offering is not found
[PASS] When the specified resource does not exist Should fail due to the compute resources are not sufficient for the specified offering [TC8]
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is not customized but the disk size is specified
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is customized but the disk size is not specified
[PASS] When the specified resource does not exist Should fail due to the public IP can not be found
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade worker machine due to insufficient compute resources
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade control plane machine due to insufficient compute resources
[PASS] When testing horizontal scale out/in [TC17][TC18][TC20][TC21] Should successfully scale machine replicas up and down horizontally
[PASS] When testing app deployment to the workload cluster with slow network [ToxiProxy] Should be able to download an HTML from the app deployed to the workload cluster


Summarizing 3 Failures:

[Fail] When testing affinity group [It] Should have host affinity group when affinity is pro 
/jenkins/workspace/capc-e2e-new/test/e2e/common.go:331

[Fail] When testing resource cleanup [AfterEach] Should create a new network when the specified network does not exist 
/jenkins/workspace/capc-e2e-new/test/e2e/resource_cleanup.go:101

[Fail] When testing app deployment to the workload cluster [TC1][PR-Blocking] [It] Should be able to download an HTML from the app deployed to the workload cluster 
/jenkins/workspace/capc-e2e-new/test/e2e/deploy_app.go:111

Ran 28 of 29 Specs in 9307.217 seconds
FAIL! -- 25 Passed | 3 Failed | 0 Pending | 1 Skipped
--- FAIL: TestE2E (9307.23s)
FAIL

chrisdoherty4 · 2023-07-17T12:54:28Z

/run-e2e -c 4.18

blueorangutan · 2023-07-17T12:55:03Z

@chrisdoherty4 a Jenkins job has been kicked off to run tests with the following parameters:

kubernetes version: 1.27.2
CloudStack version: 4.18
hypervisor: kvm
template: ubuntu-2004-kube
Kubernetes upgrade from: 1.26.5 to 1.27.2

chrisdoherty4 · 2023-07-17T13:33:08Z

/uncc @davidjumani

blueorangutan · 2023-07-17T15:58:50Z

Test Results : (tid-273)
Environment: kvm Rocky8(x3), Advanced Networking with Management Server Rocky8
Kubernetes Version: v1.27.2
Kubernetes Version upgrade from: v1.26.5
Kubernetes Version upgrade to: v1.27.2
CloudStack Version: 4.18
Template: ubuntu-2004-kube
E2E Test Run Logs: https://github.com/blueorangutan/capc-prs/releases/download/capc-pr-ci-cd/capc-e2e-artifacts-pr290-sl-273.zip

[PASS] When testing with disk offering Should successfully create a cluster with disk offering
[PASS] When testing app deployment to the workload cluster [TC1][PR-Blocking] Should be able to download an HTML from the app deployed to the workload cluster
[PASS] When testing with custom disk offering Should successfully create a cluster with a custom disk offering
[PASS] When testing horizontal scale out/in [TC17][TC18][TC20][TC21] Should successfully scale machine replicas up and down horizontally
[PASS] with two clusters should successfully add and remove a second cluster without breaking the first cluster
[PASS] When testing app deployment to the workload cluster with network interruption [ToxiProxy] Should be able to create a cluster despite a network interruption during that process
[PASS] When testing K8S conformance [Conformance] Should create a workload cluster and run kubetest
[PASS] When testing multiple CPs in a shared network with kubevip Should successfully create a cluster with multiple CPs in a shared network
[PASS] When testing machine remediation Should replace a machine when it is destroyed
[PASS] When testing subdomain Should create a cluster in a subdomain
[PASS] When the specified resource does not exist Should fail due to the specified account is not found [TC4a]
[PASS] When the specified resource does not exist Should fail due to the specified domain is not found [TC4b]
[PASS] When the specified resource does not exist Should fail due to the specified control plane offering is not found [TC7]
[PASS] When the specified resource does not exist Should fail due to the specified template is not found [TC6]
[PASS] When the specified resource does not exist Should fail due to the specified zone is not found [TC3]
[PASS] When the specified resource does not exist Should fail due to the specified disk offering is not found
[PASS] When the specified resource does not exist Should fail due to the compute resources are not sufficient for the specified offering [TC8]
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is not customized but the disk size is specified
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is customized but the disk size is not specified
[PASS] When the specified resource does not exist Should fail due to the public IP can not be found
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade worker machine due to insufficient compute resources
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade control plane machine due to insufficient compute resources
[PASS] When testing affinity group Should have host affinity group when affinity is anti
[PASS] When testing resource cleanup Should create a new network when the specified network does not exist
[PASS] When testing node drain timeout A node should be forcefully removed if it cannot be drained in time
[PASS] When testing Kubernetes version upgrades Should successfully upgrade kubernetes versions when there is a change in relevant fields
[PASS] When testing MachineDeployment rolling upgrades Should successfully upgrade Machines upon changes in relevant MachineDeployment fields
[PASS] When testing app deployment to the workload cluster with slow network [ToxiProxy] Should be able to download an HTML from the app deployed to the workload cluster


Summarizing 1 Failure:

[Fail] When testing affinity group [It] Should have host affinity group when affinity is pro 
/jenkins/workspace/capc-e2e-new/test/e2e/common.go:331

Ran 28 of 29 Specs in 8523.486 seconds
FAIL! -- 27 Passed | 1 Failed | 0 Pending | 1 Skipped
--- FAIL: TestE2E (8523.49s)
FAIL

chrisdoherty4 · 2023-07-17T20:32:23Z

The failing affinity E2E test also fails on main, so it is not an error introduced by this change.

chrisdoherty4 · 2023-07-17T20:32:31Z

/unhold

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 13, 2023

k8s-ci-robot requested review from dims and g-gaston July 13, 2023 17:27

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 13, 2023

chrisdoherty4 force-pushed the feature/concurrenc-reconciles branch 2 times, most recently from 2d5efba to 37a85e8 Compare July 14, 2023 20:03

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 14, 2023

chrisdoherty4 force-pushed the feature/concurrenc-reconciles branch from 37a85e8 to 4fb8931 Compare July 14, 2023 20:09

Concurrently reconcile CloudStackMachine resources

9f73dae

chrisdoherty4 force-pushed the feature/concurrenc-reconciles branch from 4fb8931 to 9f73dae Compare July 14, 2023 20:37

chrisdoherty4 changed the title ~~WIP: Concurrently reconcile CloudStackMachine resources~~ Concurrently reconcile CloudStackMachine resources Jul 14, 2023

chrisdoherty4 marked this pull request as ready for review July 14, 2023 20:38

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 14, 2023

k8s-ci-robot requested a review from davidjumani July 14, 2023 20:38

k8s-ci-robot assigned g-gaston Jul 14, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 14, 2023

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 14, 2023

k8s-ci-robot assigned weizhouapache Jul 14, 2023

k8s-ci-robot removed the request for review from davidjumani July 17, 2023 13:33

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2023

k8s-ci-robot merged commit 9445277 into kubernetes-sigs:main Jul 17, 2023

chrisdoherty4 deleted the feature/concurrenc-reconciles branch July 26, 2023 18:30
