Concurrently reconcile CloudStackMachine resources #290

Merged

chrisdoherty4
Member

@chrisdoherty4 chrisdoherty4 commented Jul 13, 2023

AWS analyzed CAPC in high-node-count contexts and found that it takes considerable time to scale clusters. Part of the issue stems from CloudStackMachine resources being reconciled serially. This change enables concurrent reconciliation of CloudStackMachine resources, improving efficiency and preventing other parts of the system from reacting to slowness.

I have tested these changes by scaling a machine deployment up and down between 1 and 11 nodes. Scale-ups took a comparable time (55s) to serial reconciliation, which is expected because most of the time is consumed by VM provisioning. Scale-down time dropped by roughly 77%, from 1m57s to 27s.

Related #274

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 13, 2023
@k8s-ci-robot k8s-ci-robot requested review from dims and g-gaston July 13, 2023 17:27
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrisdoherty4

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify bot commented Jul 13, 2023

Deploy Preview for kubernetes-sigs-cluster-api-cloudstack ready!

Name Link
🔨 Latest commit 9f73dae
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-cloudstack/deploys/64b1b1fa22dde80007b9dc8b
😎 Deploy Preview https://deploy-preview-290--kubernetes-sigs-cluster-api-cloudstack.netlify.app

To edit notification comments on pull requests, go to your Netlify site configuration.

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 13, 2023
@chrisdoherty4 chrisdoherty4 force-pushed the feature/concurrenc-reconciles branch 2 times, most recently from 2d5efba to 37a85e8 Compare July 14, 2023 20:03
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 14, 2023
@chrisdoherty4 chrisdoherty4 force-pushed the feature/concurrenc-reconciles branch from 37a85e8 to 4fb8931 Compare July 14, 2023 20:09
@codecov-commenter

codecov-commenter commented Jul 14, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.05 ⚠️

Comparison is base (4ccf853) 25.29% compared to head (9f73dae) 25.25%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #290      +/-   ##
==========================================
- Coverage   25.29%   25.25%   -0.05%     
==========================================
  Files          59       59              
  Lines        5585     5582       -3     
==========================================
- Hits         1413     1410       -3     
  Misses       4035     4035              
  Partials      137      137              
Impacted Files Coverage Δ
controllers/cloudstackmachine_controller.go 54.85% <ø> (ø)
pkg/cloud/instance.go 82.38% <ø> (-0.16%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@chrisdoherty4 chrisdoherty4 force-pushed the feature/concurrenc-reconciles branch from 4fb8931 to 9f73dae Compare July 14, 2023 20:37
@chrisdoherty4 chrisdoherty4 changed the title WIP: Concurrently reconcile CloudStackMachine resources Concurrently reconcile CloudStackMachine resources Jul 14, 2023
@chrisdoherty4 chrisdoherty4 marked this pull request as ready for review July 14, 2023 20:38
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 14, 2023
@chrisdoherty4
Member Author

/run-e2e -c 4.18

@k8s-ci-robot k8s-ci-robot requested a review from davidjumani July 14, 2023 20:38
@g-gaston
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 14, 2023
@g-gaston
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 14, 2023
@chrisdoherty4
Member Author

/hold

@chrisdoherty4
Member Author

The E2E tests don't seem to be getting kicked off?

/assign @vishesh92 @weizhouapache

@k8s-ci-robot
Contributor

@chrisdoherty4: GitHub didn't allow me to assign the following users: vishesh92.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

The E2E tests don't seem to be getting kicked off?

/assign @vishesh92 @weizhouapache

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rohityadavcloud
Member

@chrisdoherty4 it's possible the backend BO script is hitting GitHub rate limits. We recently found that the script and Jenkins jobs were running but couldn't post results, because the rate limit was voiding the API requests that post comments.

/run-e2e help

@blueorangutan

@rohityadavcloud
The command to run e2e test for CAPC.

Usage: /run-e2e [-k Kubernetes_Version] [-c CloudStack_Version] [-h Hypervisor] [-i Template/Image] [-f Kubernetes_Version_Upgrade_From] [-t Kubernetes_Version_Upgrade_To]

  • Supported Kubernetes versions are: ['1.27.2', '1.26.5', '1.25.10', '1.24.14', '1.23.3', '1.22.6']. The default value is '1.27.2'.
  • Supported CloudStack versions are: ['4.18', '4.17', '4.16']. If it is not set, an existing environment will be used.
  • Supported hypervisors are: ['kvm', 'vmware', 'xen']. The default value is 'kvm'.
  • Supported templates are: ['ubuntu-2004-kube', 'rockylinux-8-kube']. The default value is 'ubuntu-2004-kube'.
  • By default it tests Kubernetes upgrade from version '1.26.5' to '1.27.2'.

Examples:

  • /run-e2e
  • /run-e2e -k 1.27.2 -h kvm -i ubuntu-2004-kube
  • /run-e2e -k 1.27.2 -c 4.18 -h kvm -i ubuntu-2004-kube -f 1.26.5 -t 1.27.2
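
As an illustration of the usage text above, a comment such as `/run-e2e -c 4.18` could be parsed and validated with Go's standard `flag` package roughly as follows. This is only a sketch under stated assumptions: the bot's real implementation is not shown in this conversation, `e2eRequest`, `oneOf`, and `parseRunE2E` are hypothetical names, and the `help` subcommand is not modeled. The defaults and supported values are copied from the help text.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// e2eRequest holds the options of one /run-e2e comment (hypothetical type).
type e2eRequest struct {
	Kubernetes  string
	CloudStack  string // empty means "use an existing environment"
	Hypervisor  string
	Template    string
	UpgradeFrom string
	UpgradeTo   string
}

// Supported values, copied from the bot's help text.
var (
	kubernetesVersions = []string{"1.27.2", "1.26.5", "1.25.10", "1.24.14", "1.23.3", "1.22.6"}
	cloudStackVersions = []string{"4.18", "4.17", "4.16"}
	hypervisors        = []string{"kvm", "vmware", "xen"}
	templates          = []string{"ubuntu-2004-kube", "rockylinux-8-kube"}
)

// oneOf reports whether value appears in allowed.
func oneOf(value string, allowed []string) bool {
	for _, a := range allowed {
		if a == value {
			return true
		}
	}
	return false
}

// parseRunE2E turns a PR comment into an e2eRequest, applying the documented
// defaults and rejecting values outside the supported lists.
func parseRunE2E(comment string) (e2eRequest, error) {
	fields := strings.Fields(comment)
	if len(fields) == 0 || fields[0] != "/run-e2e" {
		return e2eRequest{}, fmt.Errorf("not a /run-e2e command: %q", comment)
	}

	var r e2eRequest
	fs := flag.NewFlagSet("run-e2e", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // suppress the flag package's own usage output
	fs.StringVar(&r.Kubernetes, "k", "1.27.2", "Kubernetes version")
	fs.StringVar(&r.CloudStack, "c", "", "CloudStack version")
	fs.StringVar(&r.Hypervisor, "h", "kvm", "hypervisor")
	fs.StringVar(&r.Template, "i", "ubuntu-2004-kube", "template/image")
	fs.StringVar(&r.UpgradeFrom, "f", "1.26.5", "Kubernetes version to upgrade from")
	fs.StringVar(&r.UpgradeTo, "t", "1.27.2", "Kubernetes version to upgrade to")
	if err := fs.Parse(fields[1:]); err != nil {
		return e2eRequest{}, err
	}
	if fs.NArg() > 0 {
		return e2eRequest{}, fmt.Errorf("unexpected argument %q", fs.Arg(0))
	}

	switch {
	case !oneOf(r.Kubernetes, kubernetesVersions):
		return e2eRequest{}, fmt.Errorf("unsupported Kubernetes version %q", r.Kubernetes)
	case r.CloudStack != "" && !oneOf(r.CloudStack, cloudStackVersions):
		return e2eRequest{}, fmt.Errorf("unsupported CloudStack version %q", r.CloudStack)
	case !oneOf(r.Hypervisor, hypervisors):
		return e2eRequest{}, fmt.Errorf("unsupported hypervisor %q", r.Hypervisor)
	case !oneOf(r.Template, templates):
		return e2eRequest{}, fmt.Errorf("unsupported template %q", r.Template)
	}
	return r, nil
}
```

With this sketch, `parseRunE2E("/run-e2e -c 4.18")` fills in the remaining fields from the documented defaults, while an unsupported value such as `-h hyperv` is rejected with an error.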

@rohityadavcloud
Member

/run-e2e -c 4.18

@blueorangutan

@rohityadavcloud a Jenkins job has been kicked off to run tests with the following parameters:

  • kubernetes version: 1.27.2
  • CloudStack version: 4.18
  • hypervisor: kvm
  • template: ubuntu-2004-kube
  • Kubernetes upgrade from: 1.26.5 to 1.27.2

@blueorangutan

Test Results : (tid-272)
Environment: kvm Rocky8(x3), Advanced Networking with Management Server Rocky8
Kubernetes Version: v1.27.2
Kubernetes Version upgrade from: v1.26.5
Kubernetes Version upgrade to: v1.27.2
CloudStack Version: 4.18
Template: ubuntu-2004-kube
E2E Test Run Logs: https://github.com/blueorangutan/capc-prs/releases/download/capc-pr-ci-cd/capc-e2e-artifacts-pr290-sl-272.zip

[PASS] When testing node drain timeout A node should be forcefully removed if it cannot be drained in time
[PASS] with two clusters should successfully add and remove a second cluster without breaking the first cluster
[PASS] When testing subdomain Should create a cluster in a subdomain
[PASS] When testing app deployment to the workload cluster with network interruption [ToxiProxy] Should be able to create a cluster despite a network interruption during that process
[PASS] When testing affinity group Should have host affinity group when affinity is anti
[PASS] When testing machine remediation Should replace a machine when it is destroyed
[PASS] When testing Kubernetes version upgrades Should successfully upgrade kubernetes versions when there is a change in relevant fields
[PASS] When testing K8S conformance [Conformance] Should create a workload cluster and run kubetest
[PASS] When testing MachineDeployment rolling upgrades Should successfully upgrade Machines upon changes in relevant MachineDeployment fields
[PASS] When testing with custom disk offering Should successfully create a cluster with a custom disk offering
[PASS] When testing multiple CPs in a shared network with kubevip Should successfully create a cluster with multiple CPs in a shared network
[PASS] When testing with disk offering Should successfully create a cluster with disk offering
[PASS] When the specified resource does not exist Should fail due to the specified account is not found [TC4a]
[PASS] When the specified resource does not exist Should fail due to the specified domain is not found [TC4b]
[PASS] When the specified resource does not exist Should fail due to the specified control plane offering is not found [TC7]
[PASS] When the specified resource does not exist Should fail due to the specified template is not found [TC6]
[PASS] When the specified resource does not exist Should fail due to the specified zone is not found [TC3]
[PASS] When the specified resource does not exist Should fail due to the specified disk offering is not found
[PASS] When the specified resource does not exist Should fail due to the compute resources are not sufficient for the specified offering [TC8]
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is not customized but the disk size is specified
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is customized but the disk size is not specified
[PASS] When the specified resource does not exist Should fail due to the public IP can not be found
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade worker machine due to insufficient compute resources
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade control plane machine due to insufficient compute resources
[PASS] When testing horizontal scale out/in [TC17][TC18][TC20][TC21] Should successfully scale machine replicas up and down horizontally
[PASS] When testing app deployment to the workload cluster with slow network [ToxiProxy] Should be able to download an HTML from the app deployed to the workload cluster


Summarizing 3 Failures:

[Fail] When testing affinity group [It] Should have host affinity group when affinity is pro 
/jenkins/workspace/capc-e2e-new/test/e2e/common.go:331

[Fail] When testing resource cleanup [AfterEach] Should create a new network when the specified network does not exist 
/jenkins/workspace/capc-e2e-new/test/e2e/resource_cleanup.go:101

[Fail] When testing app deployment to the workload cluster [TC1][PR-Blocking] [It] Should be able to download an HTML from the app deployed to the workload cluster 
/jenkins/workspace/capc-e2e-new/test/e2e/deploy_app.go:111

Ran 28 of 29 Specs in 9307.217 seconds
FAIL! -- 25 Passed | 3 Failed | 0 Pending | 1 Skipped
--- FAIL: TestE2E (9307.23s)
FAIL

@chrisdoherty4
Member Author

/run-e2e -c 4.18

@blueorangutan

@chrisdoherty4 a Jenkins job has been kicked off to run tests with the following parameters:

  • kubernetes version: 1.27.2
  • CloudStack version: 4.18
  • hypervisor: kvm
  • template: ubuntu-2004-kube
  • Kubernetes upgrade from: 1.26.5 to 1.27.2

@chrisdoherty4
Member Author

/uncc @davidjumani

@k8s-ci-robot k8s-ci-robot removed the request for review from davidjumani July 17, 2023 13:33
@blueorangutan

Test Results : (tid-273)
Environment: kvm Rocky8(x3), Advanced Networking with Management Server Rocky8
Kubernetes Version: v1.27.2
Kubernetes Version upgrade from: v1.26.5
Kubernetes Version upgrade to: v1.27.2
CloudStack Version: 4.18
Template: ubuntu-2004-kube
E2E Test Run Logs: https://github.com/blueorangutan/capc-prs/releases/download/capc-pr-ci-cd/capc-e2e-artifacts-pr290-sl-273.zip

[PASS] When testing with disk offering Should successfully create a cluster with disk offering
[PASS] When testing app deployment to the workload cluster [TC1][PR-Blocking] Should be able to download an HTML from the app deployed to the workload cluster
[PASS] When testing with custom disk offering Should successfully create a cluster with a custom disk offering
[PASS] When testing horizontal scale out/in [TC17][TC18][TC20][TC21] Should successfully scale machine replicas up and down horizontally
[PASS] with two clusters should successfully add and remove a second cluster without breaking the first cluster
[PASS] When testing app deployment to the workload cluster with network interruption [ToxiProxy] Should be able to create a cluster despite a network interruption during that process
[PASS] When testing K8S conformance [Conformance] Should create a workload cluster and run kubetest
[PASS] When testing multiple CPs in a shared network with kubevip Should successfully create a cluster with multiple CPs in a shared network
[PASS] When testing machine remediation Should replace a machine when it is destroyed
[PASS] When testing subdomain Should create a cluster in a subdomain
[PASS] When the specified resource does not exist Should fail due to the specified account is not found [TC4a]
[PASS] When the specified resource does not exist Should fail due to the specified domain is not found [TC4b]
[PASS] When the specified resource does not exist Should fail due to the specified control plane offering is not found [TC7]
[PASS] When the specified resource does not exist Should fail due to the specified template is not found [TC6]
[PASS] When the specified resource does not exist Should fail due to the specified zone is not found [TC3]
[PASS] When the specified resource does not exist Should fail due to the specified disk offering is not found
[PASS] When the specified resource does not exist Should fail due to the compute resources are not sufficient for the specified offering [TC8]
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is not customized but the disk size is specified
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is customized but the disk size is not specified
[PASS] When the specified resource does not exist Should fail due to the public IP can not be found
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade worker machine due to insufficient compute resources
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade control plane machine due to insufficient compute resources
[PASS] When testing affinity group Should have host affinity group when affinity is anti
[PASS] When testing resource cleanup Should create a new network when the specified network does not exist
[PASS] When testing node drain timeout A node should be forcefully removed if it cannot be drained in time
[PASS] When testing Kubernetes version upgrades Should successfully upgrade kubernetes versions when there is a change in relevant fields
[PASS] When testing MachineDeployment rolling upgrades Should successfully upgrade Machines upon changes in relevant MachineDeployment fields
[PASS] When testing app deployment to the workload cluster with slow network [ToxiProxy] Should be able to download an HTML from the app deployed to the workload cluster


Summarizing 1 Failure:

[Fail] When testing affinity group [It] Should have host affinity group when affinity is pro 
/jenkins/workspace/capc-e2e-new/test/e2e/common.go:331

Ran 28 of 29 Specs in 8523.486 seconds
FAIL! -- 27 Passed | 1 Failed | 0 Pending | 1 Skipped
--- FAIL: TestE2E (8523.49s)
FAIL

@chrisdoherty4
Member Author

The failing affinity E2E test also fails on main, so it is not a regression introduced by this change.

@chrisdoherty4
Member Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2023
@k8s-ci-robot k8s-ci-robot merged commit 9445277 into kubernetes-sigs:main Jul 17, 2023
@chrisdoherty4 chrisdoherty4 deleted the feature/concurrenc-reconciles branch July 26, 2023 18:30
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants