Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graduate indexed job to beta #101292

Merged

Conversation

AliceZhang2016
Copy link
Contributor

@AliceZhang2016 AliceZhang2016 commented Apr 20, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

Graduate indexed job from alpha to beta, contain three parts in total:

  1. add three metrics to job controller:
    a. job_sync_duration_seconds: tracks the latency of a Job sync.
    b. job_sync_total: tracks the number of Job syncs.
    c. job_finished_total: tracks the number of finished job.

  2. add limitations pods create/delete operations per job sync, so that:
    a. big job will not occupy controller too much time which may get other jobs stuck, due to too many pods create/delete at once.
    b. gives much better predictability of metrics job_sync_duration_seconds

  3. change the default feature gate value of "IndexedJob" from false to true, to make it enabled by default in beta.

Which issue(s) this PR fixes:

fixes #98434

Special notes for your reviewer:

This is a requirement to graduate indexed job to Beta
The PR of changing hostname and podname is #101601

Does this PR introduce a user-facing change?

Add three metrics to job controller to monitor if Job works in a healthy condition.
IndexedJob promoted to Beta

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2214-indexed-job

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 20, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @AliceZhang2016. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 20, 2021
@AliceZhang2016
Copy link
Contributor Author

cc @alculquicondor

@k8s-ci-robot k8s-ci-robot requested review from mortent and soltysh April 20, 2021 20:35
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 20, 2021
pkg/controller/job/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/controller/job/metrics/metrics.go Outdated Show resolved Hide resolved
"sync"

"k8s.io/component-base/metrics"
"k8s.io/component-base/metrics/legacyregistry"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this package suggests it's deprecated. We can probably use prometheus as suggested here https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md#quick-start

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code of package metrics seems only wraps the prometheus, it also uses prometheus inside. https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/metrics/histogram.go#L95

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that sense, there is no need for using legacyregistry, just use prometheus directly. Note that the guide was updated in February, so it's pretty up to date.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the example in that guide was updated 6 years ago.

requestCounter = prometheus.NewCounterVec(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@logicalhan what is the up-to-date recommendation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't use prometheus directly, we actually have CI checks which should block you from merging PRs that use prometheus. The legacyregistry is deprecated because we want to eventually move off of using globals, which should happen hopefully this upcoming release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the recommendation is to use the wrappers (from component base) and currently legacyregistry unless you are exposing a new metrics endpoint. I've submitted a PR to update the docs (kubernetes/community#5737), sorry for the confusion.

@alculquicondor
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 21, 2021
@AliceZhang2016 AliceZhang2016 force-pushed the job_controller_metrics branch from 7475e81 to 2f1341f Compare April 21, 2021 15:43
@logicalhan
Copy link
Member

/triage accepted
/assign @brancz

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 21, 2021
@AliceZhang2016 AliceZhang2016 force-pushed the job_controller_metrics branch from 2f1341f to 9c87df8 Compare April 21, 2021 17:17
pkg/controller/job/job_controller.go Outdated Show resolved Hide resolved
pkg/controller/job/job_controller.go Outdated Show resolved Hide resolved
Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
(from sig instrumentation) sorry about the confusion in the docs.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2021
@AliceZhang2016 AliceZhang2016 force-pushed the job_controller_metrics branch from 9f3f610 to 8dbccdc Compare May 5, 2021 14:46
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 5, 2021
@alculquicondor
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 5, 2021
@AliceZhang2016 AliceZhang2016 force-pushed the job_controller_metrics branch from 8dbccdc to 07f0ceb Compare May 5, 2021 17:04
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 5, 2021
@AliceZhang2016 AliceZhang2016 force-pushed the job_controller_metrics branch from 07f0ceb to 0c99f29 Compare May 7, 2021 13:29
Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2021
@liggitt
Copy link
Member

liggitt commented May 7, 2021

validation change looks good to me. eric's approve is now effective

@liggitt
Copy link
Member

liggitt commented May 7, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2021
@k8s-ci-robot k8s-ci-robot merged commit 548fb43 into kubernetes:master May 7, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 7, 2021
tengqm added a commit to tengqm/website that referenced this pull request Jun 27, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
tengqm added a commit to tengqm/website that referenced this pull request Jun 27, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
tengqm added a commit to tengqm/website that referenced this pull request Jun 28, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
tengqm added a commit to tengqm/website that referenced this pull request Jun 28, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
tengqm added a commit to tengqm/website that referenced this pull request Jun 28, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
tengqm added a commit to tengqm/website that referenced this pull request Jun 29, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
tengqm added a commit to tengqm/website that referenced this pull request Jun 30, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
tengqm added a commit to tengqm/website that referenced this pull request Jun 30, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
tengqm added a commit to tengqm/website that referenced this pull request Jul 2, 2021
This PR updates the feature gate status based on upstream
implementation.

The chanegs are listed below with links to upstream PRs merged:

- BalanceAttacedNodeVolumes  kubernetes/kubernetes#102443
- CSIMigrationvSphereComplete kubernetes/kubernetes#101272
- CSIServiceAccountToken kubernetes/kubernetes#103001
- DaemonSetUpdateSurge kubernetes/kubernetes#101742
- DisableCloudProviders kubernetes/kubernetes#100136
- IndexedJob kubernetes/kubernetes#101292
- LegacyNodeRoleBehavior    kubernetes/kubernetes#100776
- NamespaceDefaultLabelName kubernetes/kubernetes#101342
- NetworkPolicyEndPort   kubernetes/kubernetes#102834
- NodeDisruptionExclusion   kubernetes/kubernetes#100776
- PodAffinityNamespaceSelector  kubernetes/kubernetes#101496
- PodDeletionCost  kubernetes/kubernetes#101080
- PreferNominatedNode kubernetes/kubernetes#102201
- ServiceLoadBalancerClass  kubernetes/kubernetes#103129
- ServiceNodeExclusion  kubernetes/kubernetes#100776
- ServiceTopology kubernetes/kubernetes#102412
- SizeMemoryBackedVoluems kubernetes/kubernetes#101048
- StatefulSetMinReadySeconds kubernetes/kubernetes#100842
- SuspendJob kubernetes/kubernetes#102022
- WindowsHostProcessContainers  kubernetes/kubernetes#99576
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: API review completed, 1.22
Development

Successfully merging this pull request may close these issues.

Metrics for job-controller
10 participants