Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only provision workloads when network creation has started #29348

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

maiqueb
Copy link
Contributor

@maiqueb maiqueb commented Dec 5, 2024

No description provided.

@openshift-ci openshift-ci bot requested review from knobunc and trozet December 5, 2024 18:15
@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 5, 2024

/test e2e-aws-ovn-virt-techpreview

@maiqueb maiqueb force-pushed the only-provision-workloads-when-network-creation-has-started branch from 47890fa to 7542b3d Compare December 9, 2024 15:30
udnManifest := generateUserDefinedNetworkManifest(&c)
By(fmt.Sprintf("Creating UserDefinedNetwork %s/%s", c.namespace, c.name))
Expect(applyManifest(c.namespace, udnManifest)).To(Succeed())
Expect(waitForUserDefinedNetworkReady(c.namespace, c.name, udnCrReadyTimeout)).To(Succeed())
return networkAttachmentConfig{networkAttachmentConfigParams{networkName: c.namespace + "." + c.name}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ormergi how can I have access to the name of the network ?

Right now, the only thing I think I can do is to read the corresponding NAD, unmarshal, and return that ...

Is there something simpler ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@npinaeva how about you ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fetched it from the NAD generated for the UDN.

I think we can assume the name is the same (i.e. NAD.metadata.Name == UDN.metadata.Name)

If that changes, we'll need to be able to derive it somehow. Not thinking much about that, because the change must be backwards compatible.

@npinaeva

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would fetch the NAD that created by the UDN controller and parse the network-name it got.
Doing so ensure tests wont break even if network-name will be changed, no need to go back to tests and do some adjustments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that would make sense. I have the owner reference for that, right ? @ormergi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can also fetch the NAD with the meta.name as the UDN object as well

@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 9, 2024

/test e2e-aws-ovn-virt-techpreview

@maiqueb maiqueb force-pushed the only-provision-workloads-when-network-creation-has-started branch 2 times, most recently from 3961085 to f009222 Compare December 9, 2024 16:55
@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 9, 2024

/test e2e-aws-ovn-virt-techpreview

Copy link

openshift-trt bot commented Dec 9, 2024

Job Failure Risk Analysis for sha: f009222

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (2051): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift-serial IncompleteTests
Tests for this run (22) are below the historical average (425): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift IncompleteTests
Tests for this run (22) are below the historical average (949): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd Medium
[bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info
This test has passed 93.44% of 61 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-virt-techpreview Low
[sig-network][OCPFeatureGate:PersistentIPsForVirtualization][Feature:Layer2LiveMigration] Kubevirt Virtual Machines when using openshift ovn-kubernetes with user defined networks and persistent ips configured created using UserDefinedNetwork [Suite:openshift/network/virtualization] should keep ip when the VM attached to a primary UDN is restarted
This test has passed 75.00% of 4 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[sig-auth] all workloads in ns/openshift-cnv must set the 'openshift.io/required-scc' annotation
This test has passed 0.00% of 4 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

@maiqueb
Copy link
Contributor Author

maiqueb commented Dec 10, 2024

/test e2e-aws-ovn-virt-techpreview

@deads2k
Copy link
Contributor

deads2k commented Dec 10, 2024

/hold

fix the product, not the test.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2024
Copy link

openshift-trt bot commented Dec 10, 2024

Job Failure Risk Analysis for sha: 3e93667

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (1692): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift-serial IncompleteTests
Tests for this run (22) are below the historical average (363): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift IncompleteTests
Tests for this run (22) are below the historical average (807): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-virt-techpreview Low
[sig-auth] all workloads in ns/openshift-cnv must set the 'openshift.io/required-scc' annotation
This test has passed 0.00% of 5 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

Comment on lines 486 to 507
out, err = runOcWithRetry(oc.AsAdmin(), "get",
"pods",
"-o", "name",
"-n", "openshift-ovn-kubernetes",
"--field-selector", fmt.Sprintf("spec.nodeName=%s", nodeName),
"-l", "app=ovnkube-node")
if err != nil {
return false, err
}
outReader := bufio.NewScanner(strings.NewReader(out))
re := regexp.MustCompile("^pod/(.*)")
for outReader.Scan() {
match := re.FindSubmatch([]byte(outReader.Text()))
if len(match) != 2 {
continue
}
podName = string(match[1])
break
}
if podName == "" {
return false, fmt.Errorf("could not find a valid ovnkube-node pod on node '%s'", nodeName)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this code looks for ovnkube-node pod name, cant it be done using the programmatic go-client (similar to line 508)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can, but I just copied this code from somewhere else - I still need to make this coalesce.

And still, I guess staff engineering opposes this PR: #29348 (comment)

@deads2k
Copy link
Contributor

deads2k commented Dec 12, 2024

And still, I guess staff engineering opposes this PR: #29348 (comment)

How about creating a copy of this test that is showing the UDN problem and making it minimum reproducer. Then gate that test on UDN. This ensures that the failure mode is obvious and we don't lose the signal. With the signal preserved, it is then ok to adjust other tests to compensate for a product bug because the signal won't be lost and will block the promotion of UDN.

Signed-off-by: Miguel Duarte Barroso <[email protected]>
Otherwise, we risk facing a situation where the primary UDN will not be
plumbed into the worlkload.

Signed-off-by: Miguel Duarte Barroso <[email protected]>
This way, we can better debug when things go wrong.

Signed-off-by: Miguel Duarte Barroso <[email protected]>
This way, we can re-use this logic in other places without duplicating
code

Signed-off-by: Miguel Duarte Barroso <[email protected]>
… race is avoided

Signed-off-by: Miguel Duarte Barroso <[email protected]>
@maiqueb maiqueb force-pushed the only-provision-workloads-when-network-creation-has-started branch from 3e93667 to f2451e8 Compare December 13, 2024 16:18
Copy link
Contributor

openshift-ci bot commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maiqueb
Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Dec 13, 2024

@maiqueb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-virt-techpreview 3e93667 link false /test e2e-aws-ovn-virt-techpreview
ci/prow/verify 5fd52b8 link true /test verify
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 5fd52b8 link false /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/okd-scos-e2e-aws-ovn 5fd52b8 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-single-node-upgrade 5fd52b8 link false /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link

openshift-trt bot commented Dec 13, 2024

Job Failure Risk Analysis for sha: 5fd52b8

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (937): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-virt-techpreview Low
[sig-auth] all workloads in ns/openshift-cnv must set the 'openshift.io/required-scc' annotation
This test has passed 0.00% of 6 runs on release 4.19 [Architecture:amd64 FeatureSet:techpreview Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants