Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use omitempty for optional fields in Job Pod Failure Policy #126046

Merged
merged 1 commit into from
Jul 12, 2024

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Jul 12, 2024

What type of PR is this?

/kind bug
/kind api-change

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #126040

Special notes for your reviewer:

My procedure for testing (repro):
0. repro by creating Jobset with Job template embedding the snippet:

        podFailurePolicy:
          rules:
          - action: FailJob
            onPodConditions:
            - type: DisruptionTarget
          - action: Ignore
            onExitCodes:
              operator: In
              values: [ 42 ]

In results in the following errors:

> k create -f jobset.yaml
The JobSet "failurepolicy-abcdef" is invalid: 
* spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].onExitCodes: Invalid value: "null": spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].onExitCodes in body must be of type object: "null"
* spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[1].onExitCodes.containerName: Invalid value: "null": spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[1].onExitCodes.containerName in body must be of type string: "null"
* spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[1].onPodConditions: Invalid value: "null": spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[1].onPodConditions in body must be of type array: "null"
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation
  1. Build JobSet using the updated k8s.io/api:
  • clone https://github.com/kubernetes/api into JobSet repo under bin/k8s_api/api, and checkout the release-1.29 branch
  • make code adjustments as in this repro (adding "omitempty")
  • adjust JobSet's go.mod with replace k8s.io/api => ./bin/k8s_api/api to use the code, and adjust its Dockerfile with new line replace k8s.io/api => ./bin/k8s_api/api, then go mod tidy
  • build JobSet image with make kind-image-build
  • load the image to kind: kind load docker-image gcr.io/k8s-staging-jobset/jobset:0b8c19c-dirty --name kind
  • adjust the JobSet image for the running deployment with kubectl edit deploy/jobset-controller-manager -njobset-system
  1. The JobSet creation succeeds, the kubectl get job -oyaml returns the correct snippet:
    podFailurePolicy:
      rules:
      - action: FailJob
        onPodConditions:
        - status: "True"
          type: DisruptionTarget
      - action: Ignore
        onExitCodes:
          containerName: null
          operator: In
          values:
          - 42

(same snippet is created when creating the Job directly, without JobSet)

I have also confirmed that it only matters which version of k8s API is used to build the JobSet. In particular, when used fixed version to build JobSet, then I was able to create JobSets on 1.26 and 1.27 k8s clusters.

I see we did similar fixes in the past like here: #121000

Does this PR introduce a user-facing change?

Use omitempty for optional Job Pod Failure Policy fields

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 12, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 12, 2024
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 12, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Jul 12, 2024

/cc @alculquicondor @deads2k @liggitt

@k8s-ci-robot k8s-ci-robot requested a review from liggitt July 12, 2024 07:17
@mimowo mimowo force-pushed the fix-job-pod-failure-api branch from 9fcb840 to f523be6 Compare July 12, 2024 08:45
@mimowo mimowo changed the title WIP: Use omitempty for optional Job Pod Failure Policy fields Use omitempty for optional fields in Job Pod Failure Policy Jul 12, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Jul 12, 2024

@alculquicondor please add to kubernetes/enhancements#3329
Hope it does not need to block the graduation in #125442 and we can cherry-pick the fix

@k8s-triage-robot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@mimowo
Copy link
Contributor Author

mimowo commented Jul 12, 2024

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Jul 12, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Jul 12, 2024

I've checked that it only matters which version of k8s api is used to build the JobSet webhook. In particular, I was able to successfully create JobSets on k8s 1.26 and 1.27 clusters using JobSet built against a patched version of k8s API (for 1.29).

@deads2k
Copy link
Contributor

deads2k commented Jul 12, 2024

Thinking through our serialization issues

  1. if serialized with zero values into etcd, the kube-apiserver will start returning responses without values. ContainerName nil versus missing makes no semantic difference. OnExitCodes nil versus empty makes no semantic difference. OnPodConditions nil versus zero length array makes no semantic difference: some languages may handle length interactions differently, but had to tolerate old servers anyway.
  2. if serialized without zero values, the kube-apiserver will stop providing zero values. As above, no semantic difference for controllers and the difference will be zero length array versus null.

On balance, I think it's better to let a consumer handle the empty array to nil shift than to have weird embedding problems.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 12, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3c37175e4d9e62b18975b5ffa9caefeb956f6c20

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 12, 2024
@k8s-ci-robot k8s-ci-robot merged commit d11e860 into kubernetes:master Jul 12, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jul 12, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Jul 12, 2024

Thanks, I will prepare cherry-picks down to 1.28 https://kubernetes.io/releases/. I think it will be particularly useful to cherry-pick to 1.29 or 1.30 for JobSet. cc @danielvegamyhre

@mimowo
Copy link
Contributor Author

mimowo commented Jul 12, 2024

One more thing tested is I compared the response patches generated by the webhook generated using k8s.io/api before and after the change. To see that I enable log level 10 in the API server, and use k logs kube-apiserver-kind-control-plane -nkube-system | grep "Response Body", then decode base64 the patch by the webhook.

before:

[{"op":"add","path":"/metadata/creationTimestamp","value":null},
{"op":"add","path":"/spec/replicatedJobs/0/template/metadata","value":{"creationTimestamp":null}},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/template/metadata","value":{"creationTimestamp":null}},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/template/spec/containers/0/resources","value":{}},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/completionMode","value":"Indexed"},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/podFailurePolicy/rules/0/onExitCodes","value":null},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/podFailurePolicy/rules/0/onPodConditions/0/status","value":""},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/podFailurePolicy/rules/1/onExitCodes/containerName","value"
:null},{"op":"add","path":"/spec/replicatedJobs/0/template/spec/podFailurePolicy/rules/1/onPodConditions","value":null},
{"op":"add","path":"/spec/network","value":{"enableDNSHostnames":true}},
{"op":"add","path":"/spec/successPolicy","value":{"operator":"All"}},{"op":"add","path":"/spec/startupPolicy","value":
{"startupPolicyOrder":"AnyOrder"}},{"op":"add","path":"/status","value":{}}]

after:

[{"op":"add","path":"/metadata/creationTimestamp","value":null},
{"op":"add","path":"/spec/replicatedJobs/0/template/metadata","value":{"creationTimestamp":null}},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/completionMode","value":"Indexed"},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/podFailurePolicy/rules/0/onPodConditions/0/status","value":""
},{"op":"add","path":"/spec/replicatedJobs/0/template/spec/template/metadata","value":{"creationTimestamp":null}},
{"op":"add","path":"/spec/replicatedJobs/0/template/spec/template/spec/containers/0/resources","value":{}},
{"op":"add","path":"/spec/network","value":{"enableDNSHostnames":true,"publishNotReadyAddresses":true}},
{"op":"add","path":"/spec/successPolicy","value":{"operator":"All"}},{"op":"add","path":"/spec/startupPolicy","value":
{"startupPolicyOrder":"AnyOrder"}},{"op":"add","path":"/status","value":{}}]

So it is surprising that in the broken flow API server receives "null" for onExitCodes which should be semantically equivalent to no value (I suppose?).

@mimowo
Copy link
Contributor Author

mimowo commented Jul 12, 2024

I've opened the cherry-pick PR. PTAL

@danielvegamyhre
Copy link
Member

I've opened the cherry-pick PR. PTAL

Awesome thanks for the quick fix on this! Taking a look now.


// Represents the requirement on the pod conditions. The requirement is represented
// as a list of pod condition patterns. The requirement is satisfied if at
// least one pattern matches an actual pod condition. At most 20 elements are allowed.
// +listType=atomic
// +optional
OnPodConditions []PodFailurePolicyOnPodConditionsPattern `json:"onPodConditions" protobuf:"bytes,3,opt,name=onPodConditions"`
OnPodConditions []PodFailurePolicyOnPodConditionsPattern `json:"onPodConditions,omitempty" protobuf:"bytes,3,opt,name=onPodConditions"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where we're validating immutable jobs on update:

allErrs = append(allErrs, apivalidation.ValidateImmutableField(spec.PodFailurePolicy, oldSpec.PodFailurePolicy, fldPath.Child("podFailurePolicy"))...)

would an empty array vs nil cause a validation failure, or are we collapsing those with apiequality.Semantic.DeepEqual?

Copy link
Contributor Author

@mimowo mimowo Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added to validation_test.go the following test case and it passes:

		"update pod failure policy - nil to empty": {
			old: batch.Job{
				ObjectMeta: metav1.ObjectMeta{Name: "abc", Namespace: metav1.NamespaceDefault},
				Spec: batch.JobSpec{
					Selector: validGeneratedSelector,
					Template: validPodTemplateSpecForGeneratedRestartPolicyNever,
					PodFailurePolicy: &batch.PodFailurePolicy{
						Rules: []batch.PodFailurePolicyRule{{
							Action:          batch.PodFailurePolicyActionIgnore,
							OnPodConditions: []batch.PodFailurePolicyOnPodConditionsPattern{},
							OnExitCodes: &batch.PodFailurePolicyOnExitCodesRequirement{
								Operator: batch.PodFailurePolicyOnExitCodesOpIn,
								Values:   []int32{42},
							},
						}},
					},
				},
			},
			update: func(job *batch.Job) {
				job.Spec.PodFailurePolicy.Rules[0].OnPodConditions = nil
			},
		},

(I could post a follow up PR). Yes, we use apiequality.Semantic.DeepEqual underneath by calling

func ValidateImmutableField(newVal, oldVal interface{}, fldPath *field.Path) field.ErrorList {

@liggitt
Copy link
Member

liggitt commented Jul 17, 2024

Do you have the jobset CRD that gets produced prior to this PR?

Those validation errors when submitting the jobset CR object look to be coming from the server, not client-side validation.

I'm surprised the null values aren't getting pruned out by the server by PruneNonNullableNullsWithoutDefaults when submitting the object.

@liggitt
Copy link
Member

liggitt commented Jul 17, 2024

I mostly agree that the peer on* fields should have omitempty, and I think it is safe to do (once we double-check that won't break immutability checks on job updates).

However, I don't understand why a CRD produced from this is complaining about null values for these fields... the server explicitly strips out null values for non-nullable fields without default values in the schema, so it should not have been complaining about these fields. Seeing the JobSet CRD schema would help in understanding why this is causing an error.

@danielvegamyhre
Copy link
Member

I mostly agree that the peer on* fields should have omitempty, and I think it is safe to do (once we double-check that won't break immutability checks on job updates).

However, I don't understand why a CRD produced from this is complaining about null values for these fields... the server explicitly strips out null values for non-nullable fields without default values in the schema, so it should not have been complaining about these fields. Seeing the JobSet CRD schema would help in understanding why this is causing an error.

Here is a link to the relevant section of the base CRD in the JobSet repo. Note the CRD is quite large since it has a batchv1.JobTemplate embedded in it.

@mimowo
Copy link
Contributor Author

mimowo commented Jul 17, 2024

I'm surprised the null values aren't getting pruned out by the server by PruneNonNullableNullsWithoutDefaults when submitting the object.

A guess, maybe there is a different code path in the api server, because the webhook actually returns the base64 encoded patch rather than the full object (as in the #126046 (comment))

@liggitt
Copy link
Member

liggitt commented Jul 17, 2024

there are custom steps done after applying the patch from the webhook to decode and apply defaulting:

if newVersionedObject, _, err = jsonSerializer.Decode(patchedJS, nil, newVersionedObject); err != nil {
return false, apierrors.NewInternalError(err)
}
changed = !apiequality.Semantic.DeepEqual(attr.VersionedObject, newVersionedObject)
span.AddEvent("Patch applied")
annotator.addPatchAnnotation(patchObj, result.PatchType)
attr.Dirty = true
attr.VersionedObject = newVersionedObject
o.GetObjectDefaulter().Default(attr.VersionedObject)

it appears that not all of the special things added into CR decoding are being done there

d := schemaCoercingDecoder{delegate: decoder, validator: unstructuredSchemaCoercer{structuralSchemas: s.structuralSchemas, structuralSchemaGK: s.structuralSchemaGK, preserveUnknownFields: s.preserveUnknownFields, returnUnknownFieldPaths: returnUnknownFieldPaths}}

func (d schemaCoercingDecoder) Decode(data []byte, defaults *schema.GroupVersionKind, into runtime.Object) (runtime.Object, *schema.GroupVersionKind, error) {
var decodingStrictErrs []error
obj, gvk, err := d.delegate.Decode(data, defaults, into)
if err != nil {
decodeStrictErr, ok := runtime.AsStrictDecodingError(err)
if !ok || obj == nil {
return nil, gvk, err
}
decodingStrictErrs = decodeStrictErr.Errors()
}
var unknownFields []string
if u, ok := obj.(*unstructured.Unstructured); ok {
unknownFields, err = d.validator.apply(u)
if err != nil {
return nil, gvk, err
}
}
if d.validator.returnUnknownFieldPaths && (len(decodingStrictErrs) > 0 || len(unknownFields) > 0) {
for _, unknownField := range unknownFields {
decodingStrictErrs = append(decodingStrictErrs, fmt.Errorf(`unknown field "%s"`, unknownField))
}
return obj, gvk, runtime.NewStrictDecodingError(decodingStrictErrs)
}
return obj, gvk, nil
}

func (v *unstructuredSchemaCoercer) apply(u *unstructured.Unstructured) (unknownFieldPaths []string, err error) {
// save implicit meta fields that don't have to be specified in the validation spec
kind, foundKind, err := unstructured.NestedString(u.UnstructuredContent(), "kind")
if err != nil {
return nil, err
}
apiVersion, foundApiVersion, err := unstructured.NestedString(u.UnstructuredContent(), "apiVersion")
if err != nil {
return nil, err
}
objectMeta, foundObjectMeta, metaUnknownFields, err := schemaobjectmeta.GetObjectMetaWithOptions(u.Object, schemaobjectmeta.ObjectMetaOptions{
DropMalformedFields: v.dropInvalidMetadata,
ReturnUnknownFieldPaths: v.returnUnknownFieldPaths,
})
if err != nil {
return nil, err
}
unknownFieldPaths = append(unknownFieldPaths, metaUnknownFields...)
// compare group and kind because also other object like DeleteCollection options pass through here
gv, err := schema.ParseGroupVersion(apiVersion)
if err != nil {
return nil, err
}
if gv.Group == v.structuralSchemaGK.Group && kind == v.structuralSchemaGK.Kind {
if !v.preserveUnknownFields {
// TODO: switch over pruning and coercing at the root to schemaobjectmeta.Coerce too
pruneOpts := structuralschema.UnknownFieldPathOptions{}
if v.returnUnknownFieldPaths {
pruneOpts.TrackUnknownFieldPaths = true
}
unknownFieldPaths = append(unknownFieldPaths, structuralpruning.PruneWithOptions(u.Object, v.structuralSchemas[gv.Version], true, pruneOpts)...)
structuraldefaulting.PruneNonNullableNullsWithoutDefaults(u.Object, v.structuralSchemas[gv.Version])
}
err, paths := schemaobjectmeta.CoerceWithOptions(nil, u.Object, v.structuralSchemas[gv.Version], false, schemaobjectmeta.CoerceOptions{
DropInvalidFields: v.dropInvalidMetadata,
ReturnUnknownFieldPaths: v.returnUnknownFieldPaths,
})
if err != nil {
return nil, err
}
unknownFieldPaths = append(unknownFieldPaths, paths...)
// fixup missing generation in very old CRs
if v.repairGeneration && objectMeta.Generation == 0 {
objectMeta.Generation = 1
}
}
// restore meta fields, starting clean
if foundKind {
u.SetKind(kind)
}
if foundApiVersion {
u.SetAPIVersion(apiVersion)
}
if foundObjectMeta {
if err := schemaobjectmeta.SetObjectMeta(u.Object, objectMeta); err != nil {
return nil, err
}
}
return unknownFieldPaths, nil
}

k8s-ci-robot added a commit that referenced this pull request Jul 18, 2024
…46-upstream-release-1.29

Automated cherry pick of #126046: Use omitempty for optional fields in Job Pod Failure Policy
k8s-ci-robot added a commit that referenced this pull request Jul 18, 2024
…46-upstream-release-1.28

Automated cherry pick of #126046: Use omitempty for optional fields in Job Pod Failure Policy
k8s-ci-robot added a commit that referenced this pull request Jul 18, 2024
…46-upstream-release-1.30

Automated cherry pick of #126046: Use omitempty for optional fields in Job Pod Failure Policy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
Archived in project
Archived in project
Development

Successfully merging this pull request may close these issues.

JobTemplate validation bug: OnExitCodes should be an optional field for PodFailurePolicyRule
6 participants