
fix (crc/machine) : KubeContext left in invalid state after crc stop (#1569) #4400

Merged: 1 commit merged into crc-org:main on Oct 29, 2024

Conversation

@rohanKanojia (Contributor) commented Oct 15, 2024

Description

Fix #1569

At the moment, we only clean up the crc context from the kubeconfig during crc delete. This is problematic if the user runs any cluster-related command after crc stop, because the kubeconfig still points to a CRC cluster that is no longer active.

I checked minikube's behavior and noticed that it cleans up the kubeconfig for both the stop and delete commands. This change makes crc consistent with minikube by performing the kubeconfig cleanup in both subcommands.

Signed-off-by: Rohan Kumar [email protected]

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Checklist

  • I have read the contributing guidelines
  • My code follows the style guidelines of this project
  • I Keep It Small and Simple: The smaller the PR is, the easier it is to review and have it merged
  • I use conventional commits in my commit messages
  • I have performed a self-review of my code
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I tested my code on specified platforms
    • Linux
    • Windows
    • MacOS

Fixes: Issue #1569

Relates to: Issue #1569

Solution/Idea

Clean up the .kube/config file during crc stop so that the kubeconfig is not left in an inconsistent state.

Currently, after crc stop, the .kube/config file is left pointing to an outdated kube-context:

  current-context: default/api-crc-testing:6443/kubeadmin

This results in client-side timeouts when the user tries to access the cluster with any Kubernetes client (oc/kubectl):

crc : $ time oc get pods
E1015 15:46:38.452130   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35058->127.0.0.1:6443: read: connection reset by peer
E1015 15:47:10.615173   72163 memcache.go:265] couldn't get current server API group list: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:38388->127.0.0.1:6443: read: connection reset by peer
E1015 15:47:43.548507   72163 memcache.go:265] couldn't get current server API group list: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:55098->127.0.0.1:6443: read: connection reset by peer
E1015 15:48:15.549643   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35854->127.0.0.1:6443: read: connection reset by peer
E1015 15:48:47.550725   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:44620->127.0.0.1:6443: read: connection reset by peer
error: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:44620->127.0.0.1:6443: read: connection reset by peer

real    2m41.162s
user    0m0.150s
sys     0m0.059s

This pull request cleans up .kube/config to align crc's behavior with minikube, so that the client now fails fast.
Trying to access the cluster after crc stop:

crc : $ time oc get pods
error: Missing or incomplete configuration info.  Please point to an existing, complete config file:


  1. Via the command-line flag --kubeconfig
  2. Via the KUBECONFIG environment variable
  3. In your home directory as ~/.kube/config

To view or setup config directly use the 'config' command.

real    0m0.126s
user    0m0.062s
sys     0m0.051s

Proposed changes

Add a call to cleanKubeconfig in stop.go to clean up the kubeconfig while stopping the cluster.
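
The thread doesn't show cleanKubeconfig's body, so as a rough illustration only, here is a minimal sketch of what such a helper could look like using client-go's clientcmd package. The api.crc.testing host match, the function body, and the assumption that CRC user entries aren't shared with non-CRC contexts are all editorial assumptions, not the PR's actual code:

package machine // hypothetical package name; the real helper lives elsewhere in crc

import (
	"strings"

	"k8s.io/client-go/tools/clientcmd"
)

// cleanKubeconfig removes every cluster whose server looks like the CRC
// API server, plus the contexts and users referencing it, and clears
// current-context when it pointed at a removed context.
func cleanKubeconfig(input, output string) error {
	cfg, err := clientcmd.LoadFromFile(input)
	if err != nil {
		return err // wraps os.ErrNotExist when the input file is missing
	}
	for clusterName, cluster := range cfg.Clusters {
		if !strings.Contains(cluster.Server, "api.crc.testing") {
			continue
		}
		delete(cfg.Clusters, clusterName)
		for ctxName, ctx := range cfg.Contexts {
			if ctx.Cluster != clusterName {
				continue
			}
			delete(cfg.AuthInfos, ctx.AuthInfo)
			delete(cfg.Contexts, ctxName)
			if cfg.CurrentContext == ctxName {
				cfg.CurrentContext = ""
			}
		}
	}
	return clientcmd.WriteToFile(*cfg, output)
}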

Testing

In order to test this branch, follow these steps:

  1. Run make cross to build the crc binary
  2. Set up a new cluster with the newly built crc binary
    • ./out/linux-amd64/crc setup
    • ./out/linux-amd64/crc start
    • ./out/linux-amd64/crc stop
  3. Verify that .kube/config is cleaned up after crc stop:
crc : $ cat ~/.kube/config
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
  4. Verify that accessing the stopped cluster with kubectl / oc fails fast:
crc : $ ./out/linux-amd64/crc stop
INFO Stopping the instance, this may take a few minutes... 
Stopped the instance
crc : $ kubectl get pods
E1015 14:29:44.984352   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.984593   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.985937   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.986265   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.987715   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?

openshift-ci bot commented Oct 15, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gbraad for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented Oct 15, 2024

Hi @rohanKanojia. Thanks for your PR.

I'm waiting for a crc-org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gbraad (Contributor) commented Oct 15, 2024

/ok-to-test

@rohanKanojia (Contributor Author) commented Oct 16, 2024

Could anyone please help me understand CI failures in the Windows-QE pipeline? Could it be a flaky failure? From the GitHub action logs, it seems that an action failed to generate a report. I'm not entirely sure whether these failures are related to changes made in this pull request.

@redbeam (Contributor) commented Oct 16, 2024

@rohanKanojia I would say they are related to something else, since these two pipelines fail for me too in #4343.

@rohanKanojia rohanKanojia marked this pull request as ready for review October 16, 2024 10:23
@openshift-ci openshift-ci bot requested review from gbraad and praveenkumar October 16, 2024 10:23
@gbraad (Contributor) commented Oct 16, 2024

@adrianriobo and @lilyLuLiu can help you with this

@lilyLuLiu (Contributor) commented Oct 17, 2024

The Windows-QE pipeline CI failures happened while copying test resources to the target machine; this is QE-related, not caused by this PR.
@adrianriobo we need to improve the failure handling for deliverest.

@rohanKanojia (Contributor Author)

@lilyLuLiu: Is there any open issue to track this?

@lilyLuLiu (Contributor)

@rohanKanojia https://github.com/adrianriobo/deliverest/issues/50

	if !errors.Is(err, os.ErrNotExist) {
		logging.Warnf("Failed to remove crc contexts from kubeconfig: %v", err)
	}
}
Member

@rohanKanojia this block of code should be at the end; we shouldn't clean the kubeconfig until the instance has been stopped first.

Contributor Author

I had moved this above in order to make it easier to test. I had added tests in kubeconfig_test to verify that calling cleanupKubeconfig multiple times wouldn't affect the kubeconfig.

Let me try to adapt tests after moving it to the bottom.
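
For illustration, a hypothetical sketch of the kind of idempotency test described above; the test name, package, and the testdata/kubeconfig fixture path are assumptions, not the PR's actual test code:

package machine // hypothetical; mirrors wherever cleanKubeconfig lives

import (
	"bytes"
	"os"
	"path/filepath"
	"testing"
)

func TestCleanKubeconfigIsIdempotent(t *testing.T) {
	out := filepath.Join(t.TempDir(), "config")
	// First pass: clean a fixture kubeconfig into a temporary file.
	if err := cleanKubeconfig("testdata/kubeconfig", out); err != nil {
		t.Fatalf("first cleanup failed: %v", err)
	}
	first, err := os.ReadFile(out)
	if err != nil {
		t.Fatal(err)
	}
	// Second pass: cleaning the already-clean output must be a no-op.
	if err := cleanKubeconfig(out, out); err != nil {
		t.Fatalf("second cleanup failed: %v", err)
	}
	second, err := os.ReadFile(out)
	if err != nil {
		t.Fatal(err)
	}
	if !bytes.Equal(first, second) {
		t.Error("cleanKubeconfig changed an already-clean kubeconfig")
	}
}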

Contributor Author

Hello, on second thought, from the user's perspective we should always clean up .kube/config regardless of whether the instance was stopped successfully (to avoid the inconsistent state the user was facing).

Do you think it would be okay if we moved this cleanup statement into a defer block?

	defer func(input, output string) {
		err := cleanKubeconfig(input, output)
		// guard on err != nil so a successful cleanup is not logged as a warning
		if err != nil && !errors.Is(err, os.ErrNotExist) {
			logging.Warnf("Failed to remove crc contexts from kubeconfig: %v", err)
		}
	}(getGlobalKubeConfigPath(), getGlobalKubeConfigPath())

Member

Yes, having it as part of a defer would also be good so that it executes regardless.

}
if len(kubeConfig) == 0 {
	fmt.Println("Unable to load kubeconfig file")
	os.Exit(1)
Member

Should we return an error here or exit? I think if we exit, then the other tests are not run, which we don't want.
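
A minimal sketch of the reviewer's suggestion, assuming the check lives in a test helper; the loadKubeConfig name, package, and signature are hypothetical. Returning an error lets each caller fail just its own test (for example via t.Fatalf) instead of os.Exit(1) terminating the whole test binary:

package machine // hypothetical; shown only to illustrate the suggestion

import (
	"fmt"
	"os"
)

// loadKubeConfig reports an empty or unreadable kubeconfig as an error
// instead of calling os.Exit(1), so one failing check doesn't abort the
// remaining tests in the run.
func loadKubeConfig(path string) ([]byte, error) {
	kubeConfig, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if len(kubeConfig) == 0 {
		return nil, fmt.Errorf("unable to load kubeconfig file %q", path)
	}
	return kubeConfig, nil
}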

@rohanKanojia rohanKanojia force-pushed the pr/issue1569 branch 3 times, most recently from 3fffa8b to 453bfbb, on October 22, 2024 14:55
At the moment, we are only cleaning up crc context from kubeconfig
during `crc delete`. This can be problematic if user tries to run any
cluster related command after running `crc stop` as kubeconfig still
points to CRC cluster that is not active.

I checked minikube's behavior and noticed it's cleaning up kube config
in case of both stop and delete commands. Make crc behavior consistent
with minikube and perform kubeconfig cleanup in both sub commands.

Signed-off-by: Rohan Kumar <[email protected]>

openshift-ci bot commented Oct 22, 2024

@rohanKanojia: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-crc
Commit: cdc863f
Required: true
Rerun command: /test e2e-crc

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@praveenkumar praveenkumar merged commit 0342835 into crc-org:main Oct 29, 2024
28 of 32 checks passed
@rohanKanojia rohanKanojia deleted the pr/issue1569 branch October 29, 2024 07:02
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] After stopping CRC the Kube context is left in inconsistent state causing timeouts
5 participants