
fix (crc/machine) : KubeContext left in invalid state after crc stop (#1569) #4400

Merged: 1 commit merged into crc-org:main on Oct 29, 2024

Conversation

@rohanKanojia (Contributor) commented Oct 15, 2024

Description

Fix #1569

At the moment, we only clean up the crc context from the kubeconfig during crc delete. This is problematic if the user runs any cluster-related command after crc stop, because the kubeconfig still points to a CRC cluster that is no longer active.

I checked minikube's behavior and noticed that it cleans up the kubeconfig for both the stop and delete commands. This change makes crc consistent with minikube by performing the kubeconfig cleanup in both subcommands.

Signed-off-by: Rohan Kumar [email protected]

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Checklist

  • I have read the contributing guidelines
  • My code follows the style guidelines of this project
  • I Keep It Small and Simple: The smaller the PR is, the easier it is to review and have it merged
  • I use conventional commits in my commit messages
  • I have performed a self-review of my code
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I tested my code on specified platforms
    • Linux
    • Windows
    • MacOS

Fixes: Issue #1569

Relates to: Issue #1569

Solution/Idea

Clean up the .kube/config file during crc stop so that the kubeconfig is not left in an inconsistent state.

Currently, after crc stop, the .kube/config file is left pointing to an outdated kube-context:

  current-context: default/api-crc-testing:6443/kubeadmin

This results in client-side timeouts when the user tries to access the cluster with any Kubernetes client (oc/kubectl):

crc : $ time oc get pods
E1015 15:46:38.452130   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35058->127.0.0.1:6443: read: connection reset by peer
E1015 15:47:10.615173   72163 memcache.go:265] couldn't get current server API group list: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:38388->127.0.0.1:6443: read: connection reset by peer
E1015 15:47:43.548507   72163 memcache.go:265] couldn't get current server API group list: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:55098->127.0.0.1:6443: read: connection reset by peer
E1015 15:48:15.549643   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35854->127.0.0.1:6443: read: connection reset by peer
E1015 15:48:47.550725   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:44620->127.0.0.1:6443: read: connection reset by peer
error: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:44620->127.0.0.1:6443: read: connection reset by peer

real    2m41.162s
user    0m0.150s
sys     0m0.059s

This pull request cleans up .kube/config to align crc's behavior with minikube, so that the client now fails fast.
Trying to access the cluster after crc stop:

crc : $ time oc get pods
error: Missing or incomplete configuration info.  Please point to an existing, complete config file:


  1. Via the command-line flag --kubeconfig
  2. Via the KUBECONFIG environment variable
  3. In your home directory as ~/.kube/config

To view or setup config directly use the 'config' command.

real    0m0.126s
user    0m0.062s
sys     0m0.051s

Proposed changes

Add a call to cleanKubeconfig in stop.go to clean up the kubeconfig while stopping the cluster.
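
The thread doesn't show cleanKubeconfig's body, so as a rough illustration only, here is a minimal sketch of what such a helper could look like using client-go's clientcmd package. The api.crc.testing host match, the function body, and the assumption that CRC user entries aren't shared with non-CRC contexts are all editorial assumptions, not the PR's actual code:

package machine // hypothetical package name; the real helper lives elsewhere in crc

import (
	"strings"

	"k8s.io/client-go/tools/clientcmd"
)

// cleanKubeconfig removes every cluster whose server looks like the CRC
// API server, plus the contexts and users referencing it, and clears
// current-context when it pointed at a removed context.
func cleanKubeconfig(input, output string) error {
	cfg, err := clientcmd.LoadFromFile(input)
	if err != nil {
		return err // wraps os.ErrNotExist when the input file is missing
	}
	for clusterName, cluster := range cfg.Clusters {
		if !strings.Contains(cluster.Server, "api.crc.testing") {
			continue
		}
		delete(cfg.Clusters, clusterName)
		for ctxName, ctx := range cfg.Contexts {
			if ctx.Cluster != clusterName {
				continue
			}
			delete(cfg.AuthInfos, ctx.AuthInfo)
			delete(cfg.Contexts, ctxName)
			if cfg.CurrentContext == ctxName {
				cfg.CurrentContext = ""
			}
		}
	}
	return clientcmd.WriteToFile(*cfg, output)
}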

Testing

In order to test this branch, follow these steps:

  1. Run make cross to build the crc binary
  2. Set up a new cluster with the newly built crc binary
    • ./out/linux-amd64/crc setup
    • ./out/linux-amd64/crc start
    • ./out/linux-amd64/crc stop
  3. Verify that .kube/config is cleaned up after crc stop:
crc : $ cat ~/.kube/config
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
  4. Verify that accessing the stopped cluster with kubectl / oc fails fast:
crc : $ ./out/linux-amd64/crc stop
INFO Stopping the instance, this may take a few minutes... 
Stopped the instance
crc : $ kubectl get pods
E1015 14:29:44.984352   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.984593   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.985937   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.986265   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.987715   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?

openshift-ci bot commented Oct 15, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gbraad for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented Oct 15, 2024

Hi @rohanKanojia. Thanks for your PR.

I'm waiting for a crc-org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gbraad (Contributor) commented Oct 15, 2024

/ok-to-test

@rohanKanojia (Contributor Author) commented Oct 16, 2024

Could anyone please help me understand CI failures in the Windows-QE pipeline? Could it be a flaky failure? From the GitHub action logs, it seems that an action failed to generate a report. I'm not entirely sure whether these failures are related to changes made in this pull request.

@redbeam (Contributor) commented Oct 16, 2024

@rohanKanojia I would say they are related to something else, since these two pipelines fail for me too in #4343.

@rohanKanojia rohanKanojia marked this pull request as ready for review October 16, 2024 10:23
@openshift-ci openshift-ci bot requested review from gbraad and praveenkumar October 16, 2024 10:23
@gbraad (Contributor) commented Oct 16, 2024

@adrianriobo and @lilyLuLiu can help you with this

@lilyLuLiu (Contributor) commented Oct 17, 2024

The Windows-QE pipeline CI failures happened while copying test resources to the target machine; this is QE-related, not caused by this PR.
@adrianriobo we need to improve the failure handling for deliverest.

@rohanKanojia (Contributor Author)

@lilyLuLiu: Is there any open issue to track this?

@lilyLuLiu (Contributor)

@rohanKanojia https://github.com/adrianriobo/deliverest/issues/50

	if !errors.Is(err, os.ErrNotExist) {
		logging.Warnf("Failed to remove crc contexts from kubeconfig: %v", err)
	}
}
Member

@rohanKanojia this block of code should be at the end; we shouldn't clean the kubeconfig until the instance has been stopped first.

Contributor Author

I had moved this above in order to make it easier to test. I had added tests in kubeconfig_test to verify that calling cleanupKubeconfig multiple times wouldn't affect the kubeconfig.

Let me try to adapt tests after moving it to the bottom.
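
For illustration, a hypothetical sketch of the kind of idempotency test described above; the test name, package, and the testdata/kubeconfig fixture path are assumptions, not the PR's actual test code:

package machine // hypothetical; mirrors wherever cleanKubeconfig lives

import (
	"bytes"
	"os"
	"path/filepath"
	"testing"
)

func TestCleanKubeconfigIsIdempotent(t *testing.T) {
	out := filepath.Join(t.TempDir(), "config")
	// First pass: clean a fixture kubeconfig into a temporary file.
	if err := cleanKubeconfig("testdata/kubeconfig", out); err != nil {
		t.Fatalf("first cleanup failed: %v", err)
	}
	first, err := os.ReadFile(out)
	if err != nil {
		t.Fatal(err)
	}
	// Second pass: cleaning the already-clean output must be a no-op.
	if err := cleanKubeconfig(out, out); err != nil {
		t.Fatalf("second cleanup failed: %v", err)
	}
	second, err := os.ReadFile(out)
	if err != nil {
		t.Fatal(err)
	}
	if !bytes.Equal(first, second) {
		t.Error("cleanKubeconfig changed an already-clean kubeconfig")
	}
}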

Contributor Author

Hello, on second thought, from the user's perspective we should always clean up .kube/config regardless of whether the instance was stopped successfully (to avoid the inconsistent state the user was facing).

Do you think it would be okay if we moved this cleanup statement into a defer block?

	defer func(input, output string) {
		err := cleanKubeconfig(input, output)
		// guard on err != nil so a successful cleanup is not logged as a warning
		if err != nil && !errors.Is(err, os.ErrNotExist) {
			logging.Warnf("Failed to remove crc contexts from kubeconfig: %v", err)
		}
	}(getGlobalKubeConfigPath(), getGlobalKubeConfigPath())

Member

Yes, having it as part of a defer would also be good so that it executes regardless.

}
if len(kubeConfig) == 0 {
	fmt.Println("Unable to load kubeconfig file")
	os.Exit(1)
Member

Should we return an error here or exit? I think if we exit, then the other tests are not run, which we don't want.
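
A minimal sketch of the reviewer's suggestion, assuming the check lives in a test helper; the loadKubeConfig name, package, and signature are hypothetical. Returning an error lets each caller fail just its own test (for example via t.Fatalf) instead of os.Exit(1) terminating the whole test binary:

package machine // hypothetical; shown only to illustrate the suggestion

import (
	"fmt"
	"os"
)

// loadKubeConfig reports an empty or unreadable kubeconfig as an error
// instead of calling os.Exit(1), so one failing check doesn't abort the
// remaining tests in the run.
func loadKubeConfig(path string) ([]byte, error) {
	kubeConfig, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if len(kubeConfig) == 0 {
		return nil, fmt.Errorf("unable to load kubeconfig file %q", path)
	}
	return kubeConfig, nil
}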

@rohanKanojia rohanKanojia force-pushed the pr/issue1569 branch 3 times, most recently from 3fffa8b to 453bfbb, on October 22, 2024 14:55
At the moment, we are only cleaning up crc context from kubeconfig
during `crc delete`. This can be problematic if user tries to run any
cluster related command after running `crc stop` as kubeconfig still
points to CRC cluster that is not active.

I checked minikube's behavior and noticed it's cleaning up kube config
in case of both stop and delete commands. Make crc behavior consistent
with minikube and perform kubeconfig cleanup in both sub commands.

Signed-off-by: Rohan Kumar <[email protected]>

openshift-ci bot commented Oct 22, 2024

@rohanKanojia: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-crc
Commit: cdc863f
Required: true
Rerun command: /test e2e-crc

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@praveenkumar praveenkumar merged commit 0342835 into crc-org:main Oct 29, 2024
28 of 32 checks passed
@rohanKanojia rohanKanojia deleted the pr/issue1569 branch October 29, 2024 07:02
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] After stopping CRC the Kube context is left in inconsistent state causing timeouts
5 participants