diff --git a/keps/sig-network/2594-multiple-cluster-cidrs/README.md b/keps/sig-network/2594-multiple-cluster-cidrs/README.md new file mode 100644 index 000000000000..8816d1b8400b --- /dev/null +++ b/keps/sig-network/2594-multiple-cluster-cidrs/README.md @@ -0,0 +1,776 @@ +# KEP-2594: Enhanced NodeIPAM to support Discontiguous Cluster CIDR + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Add more pod IPs to the cluster](#add-more-pod-ips-to-the-cluster) + - [Add nodes with higher or lower capabilities](#add-nodes-with-higher-or-lower-capabilities) + - [Provision discontiguous ranges](#provision-discontiguous-ranges) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [New Resource](#new-resource) + - [Expected Behavior](#expected-behavior) + - [Example Config](#example-config) + - [Controller](#controller) + - [Data Structures](#data-structures) + - [Startup](#startup) + - [Event Watching Loops](#event-watching-loops) + - [Node Added](#node-added) + - [Node Updated](#node-updated) + - [Node Deleted](#node-deleted) + - [ClusterCIDRConfig Added](#clustercidrconfig-added) + - [ClusterCIDRConfig Updated](#clustercidrconfig-updated) + - [ClusterCIDRConfig Deleted](#clustercidrconfig-deleted) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Share Resources with Service API](#share-resources-with-service-api) + - [Pros](#pros) + - [Cons](#cons) + - [Nodes Register CIDR Request](#nodes-register-cidr-request) + - [Pros](#pros-1) + - [Cons](#cons-1) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [ ] (R) Graduation criteria is in place
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+Today, the pod CIDRs for nodes are allocated from a single range assigned to the cluster (the cluster CIDR). Each node gets a range of a fixed size from the overall cluster CIDR. That size is specified at cluster startup and cannot be modified later.
+
+This proposal enhances how pod CIDRs are allocated to nodes by adding a new CIDR allocator that can be controlled by a new resource, `ClusterCIDRConfig`. This enables users to dynamically allocate more IP ranges for pods.
+
+## Motivation
+
+Today, the pod CIDRs for nodes are allocated from a single range assigned to the cluster (the cluster CIDR). Each node gets a range of a fixed size from the overall cluster CIDR. That size is specified at cluster startup and cannot be modified later. This has multiple disadvantages:
+* There is just one cluster CIDR from which all pod CIDRs are allocated. This means that users need to provision the entire IP range up front, accounting for the largest cluster that may be created. This can waste IP addresses.
+* If a cluster grows beyond expectations, there isn't a simple way to add more IP addresses.
+* The cluster CIDR is one large range. It may be difficult to find a contiguous block of IP addresses that satisfies the needs of the cluster.
+* Each node gets a pod CIDR of the same fixed size. This means that if nodes differ in size and capacity, users cannot allocate a bigger pod range to a node with larger capacity and a smaller range to a node with less capacity. This wastes a lot of IP addresses.
+
+### Goals
+
+* Support multiple discontiguous IP CIDR blocks for the cluster CIDR
+* Support node affinity of CIDR blocks
+* Be extensible enough to allow different block sizes to be allocated to different nodes
+* Do not require a master or controller restart to add or remove pod ranges
+
+### Non-Goals
+
+* Not providing a generalized IPAM API to Kubernetes. We plan to enhance the RangeAllocator's current behavior (give each Node a /XX from the cluster CIDR as its `PodCIDR`)
+* No change to the default behavior of a Kubernetes cluster.
+  * This will be an optional API and can be disabled (as today's NodeIPAM controllers may also be disabled)
+
+## Proposal
+
+This proposal enhances how pod CIDRs are allocated to nodes by adding a new CIDR allocator that can be controlled by a new resource, `ClusterCIDRConfig`. This enables users to dynamically allocate more IP ranges for pods. In addition, it gives users the capability to control which ranges are allocated to specific nodes, as well as the size of the pod CIDR allocated to those nodes.
+
+### User Stories (Optional)
+
+#### Add more pod IPs to the cluster
+A user created a cluster with an initial cluster CIDR of 10.1.0.0/20. Each node is assigned a /24 pod CIDR, so the user could create a maximum of 16 nodes. Now the cluster needs to be expanded, but the user does not have enough IPs for pods.
+
+With this enhancement, the user can allocate an additional CIDR for pods, e.g. 10.2.0.0/20, with the same configuration that allocates a /24 pod CIDR per node. This way, the cluster can grow by an additional 16 nodes.
+
+#### Add nodes with higher or lower capabilities
+A user created a cluster with an amply sized cluster CIDR. All the initial nodes are of uniform capacity, capable of running a maximum of 256 pods, and they are each assigned a /24 pod CIDR. The user is planning to add more nodes to the system that are capable of running 500 pods. However, they cannot take advantage of the additional capacity because all nodes are assigned a /24 pod CIDR.
+
+With this enhancement, the user configures a new allocation which uses the original cluster CIDR but allocates a /23 instead of a /24 to each node. They use the node selector to allocate these IPs only to the nodes with the higher capacity.
+
+#### Provision discontiguous ranges
+A user wants to create a cluster with 32 nodes, each with the capacity to run 256 pods. This means that each node needs a /24 pod CIDR, for a total range of /19. However, there aren't enough contiguous IPs in the user's network: they can find 4 free ranges of size /21 but no single contiguous /19 range.
+
+Using this enhancement, the user creates 4 different CIDR configurations, each with a /21 range. The CIDR allocator allocates a /24 range to each node from any of these /21 ranges, and the user can now create the cluster.
+
+### Notes/Constraints/Caveats (Optional)
+
+A major precondition for this feature is to prevent churn on the IPs assigned to Nodes. Specifically, we want to prevent the controller from attempting to change a Node's assigned PodCIDRs. For example, K8s will reject attempts to delete a `ClusterCIDRConfig` that is currently in use by a Node. This should prevent accidental broad-ranging changes to the cluster.
+
+### Risks and Mitigations
+
+- Do racing controllers need a more sophisticated design? The current plan is to plug into the kube-controller-manager leader election.
+- Can the controller install admission controls? This is a new pattern for in-tree components.
+
+## Design Details
+
+### New Resource
+```
+ClusterCIDRConfig {
+  # Nodes may only have 1 range from each family.
+  # TODO: maybe this is redundant and can just be inferred by the controller.
+  IPFamily FamilyType
+
+  # An IP block in CIDR notation ("10.0.0.0/8", "fd12:3456:789a:1::/64")
+  # +required
+  IPCIDRBlock string
+
+  # This defines which nodes the config is applicable to. An empty selector
+  # matches all nodes.
+  # +optional
+  NodeSelector v1.LabelSelector
+
+  # Netmask size (e.g. 25 -> "/25") to allocate to a node.
+  # Users would have to ensure that the kubelet doesn't try to schedule
+  # more pods than are supported by the node's netmask (i.e. the kubelet's
+  # --max-pods flag)
+  # +required
+  PerNodeMaskSize int
+}
+
+var (
+  IPV4 FamilyType = "IPV4"
+  IPV6 FamilyType = "IPV6"
+)
+```
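+
+To make the shape of the API concrete, the following is a rough, non-normative sketch of how the resource above could be expressed as a typed Go API. The `Spec` wrapper, the `metav1` embeds, and the JSON tags are assumptions based on common Kubernetes API conventions; only the four fields described above are part of this proposal.
+
+```go
+package v1alpha1
+
+import (
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+)
+
+// FamilyType selects the IP family a config applies to.
+type FamilyType string
+
+const (
+	IPV4 FamilyType = "IPV4"
+	IPV6 FamilyType = "IPV6"
+)
+
+// ClusterCIDRConfig describes a CIDR block from which per-node pod CIDRs
+// may be allocated, and which nodes are eligible to receive them.
+type ClusterCIDRConfig struct {
+	metav1.TypeMeta   `json:",inline"`
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	Spec ClusterCIDRConfigSpec `json:"spec,omitempty"`
+}
+
+// ClusterCIDRConfigSpec mirrors the fields described in this proposal.
+type ClusterCIDRConfigSpec struct {
+	// IPFamily of IPCIDRBlock; a node gets at most one range per family.
+	IPFamily FamilyType `json:"ipFamily"`
+
+	// IPCIDRBlock is an IP block in CIDR notation,
+	// e.g. "10.0.0.0/8" or "fd12:3456:789a:1::/64".
+	IPCIDRBlock string `json:"ipCIDRBlock"`
+
+	// NodeSelector defines which nodes this config applies to.
+	// An empty selector matches all nodes.
+	// +optional
+	NodeSelector *metav1.LabelSelector `json:"nodeSelector,omitempty"`
+
+	// PerNodeMaskSize is the netmask size (e.g. 25 -> "/25") to allocate
+	// to each matching node.
+	PerNodeMaskSize int32 `json:"perNodeMaskSize"`
+}
+```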
+
+#### Expected Behavior
+
+- Each node will be assigned up to one range from each `FamilyType`.
+  - **TODO:** what is the tie-break if multiple ranges match?
+- An empty `NodeSelector` functions as a default that applies to all nodes. This should be the fall-back and should not take precedence if any other range matches.
+- `IPFamily`, `IPCIDRBlock`, and `PerNodeMaskSize` are immutable after creation.
+- Reject any changes that would change existing node allocations:
+  - Cannot delete a `ClusterCIDRConfig` that is in use by a node.
+  - Cannot modify a `NodeSelector` to deselect a node that is using that range.
+
+#### Example Config
+```
+[
+{
+  IPFamily: IPV4,
+  # For existing clusters this is the same as ClusterCIDR
+  IPCIDRBlock: "10.0.0.0/8",
+  # Default for nodes not matching any other rule
+  NodeSelector: nil,
+  # For existing API this is the same as NodeCIDRMaskSize
+  PerNodeMaskSize: 24,
+},
+{
+  IPFamily: IPV4,
+  IPCIDRBlock: "172.16.0.0/14",
+  # Another range, also allocatable to any node
+  NodeSelector: nil,
+  PerNodeMaskSize: 24,
+},
+{
+  IPFamily: IPV4,
+  IPCIDRBlock: "10.0.0.0/8",
+  NodeSelector: { key: "np" op: "IN" value:["np1"] },
+  PerNodeMaskSize: 26,
+},
+{
+  IPFamily: IPV4,
+  IPCIDRBlock: "192.168.0.0/16",
+  NodeSelector: { key: "np" op: "IN" value:["np2"] },
+  PerNodeMaskSize: 26,
+},
+{
+  IPFamily: IPV4,
+  IPCIDRBlock: "5.2.0.0/16",
+  NodeSelector: { "np": "np3" },
+  PerNodeMaskSize: 20,
+},
+{
+  IPFamily: IPV6,
+  IPCIDRBlock: "fd12:3456:789a:1::/64",
+  NodeSelector: { "np": "np3" },
+  PerNodeMaskSize: 112,
+},
+...
+]
+```
+
+Given the above config, a valid potential allocation might be:
+
+```
+{"np": "np1"} --> "10.0.0.0/26"
+{"np": "np2"} --> "192.168.0.0/26"
+{"np": "np3"} --> "5.2.0.0/20", "fd12:3456:789a:1::/112"
+{"np": "np4"} --> "172.16.0.0/24"
+```
+
+### Controller
+
+Implement a new [NodeIPAM controller](https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/nodeipam). The controller will set up watchers on the `ClusterCIDRConfig` objects and the `Node` objects.
+
+#### Data Structures
+We want to use a set of radix trees to store allocations. This is a commonly used pattern (Linux uses it to store routing tables). The trees track the ranges that are already allocated. We will also store the range allocated to each node, in order to handle conflicts in case some entity modifies the `PodCIDR` value of the Node object.
+
+#### Startup
+- Fetch the values of the `--cluster-cidr` and `--node-cidr-mask-size` arguments passed to kube-controller-manager.
+- Fetch the list of `ClusterCIDRConfig`s and build the internal data structures (radix trees).
+- Fetch the list of `Node`s and check each node for `PodCIDRs`:
+  - If `PodCIDR` is set, mark the allocation in the internal data structure and store this association with the node.
+  - If `PodCIDR` is set, but is not part of one of the tracked `ClusterCIDRConfig`s, emit a K8s event but do nothing.
+  - If `PodCIDR` is not set, save the Node for allocation in the next step.
+
+After processing all nodes, allocate ranges to any nodes without Pod CIDR(s) (same logic as the Node Added event).
+
+#### Event Watching Loops
+
+##### Node Added
+Go through the list of `ClusterCIDRConfig`s and find ranges matching the node selector from each family. Attempt to allocate Pod CIDR(s) with the given per-node size. If that `ClusterCIDRConfig` cannot fit a node, search for another `ClusterCIDRConfig`.
+
+If no `ClusterCIDRConfig` matches the node, or if all matching `ClusterCIDRConfig`s are full, raise a K8s event and put the Node back on the queue for allocation (infinite retries).
+
+Upon successfully allocating CIDR(s), update the Node object with the allocated Pod CIDRs.
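+
+To illustrate this allocation step, below is a simplified, IPv4-only sketch of the matching and allocation flow. It uses only the Go standard library and a plain set of allocated ranges instead of the radix trees described above, and it resolves the open tie-break question by simply taking the first matching `ClusterCIDRConfig` with free space; the type and function names (`config`, `allocator`, `allocate`) are illustrative and not part of the proposal.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net"
+)
+
+// config is a simplified stand-in for a ClusterCIDRConfig.
+type config struct {
+	cidr            *net.IPNet
+	perNodeMaskSize int
+	nodeSelector    map[string]string // nil/empty selector matches every node
+}
+
+// allocator tracks which per-node ranges have already been handed out.
+// The real controller would use radix trees; a set of allocated CIDR
+// strings is enough to show the flow.
+type allocator struct {
+	allocated map[string]bool
+}
+
+// matches reports whether the config's selector selects the node's labels.
+func (c *config) matches(nodeLabels map[string]string) bool {
+	for k, v := range c.nodeSelector {
+		if nodeLabels[k] != v {
+			return false
+		}
+	}
+	return true
+}
+
+// allocate walks the matching configs in order and returns the first free
+// per-node CIDR. If every matching config is exhausted, it returns an error
+// (the controller would emit an event and requeue the Node).
+func (a *allocator) allocate(nodeLabels map[string]string, configs []*config) (*net.IPNet, error) {
+	for _, c := range configs {
+		if !c.matches(nodeLabels) {
+			continue
+		}
+		ones, bits := c.cidr.Mask.Size()
+		if c.perNodeMaskSize < ones || c.perNodeMaskSize > bits {
+			continue // per-node range cannot fit inside this block
+		}
+		count := 1 << (c.perNodeMaskSize - ones) // sub-blocks available in this config
+		step := 1 << (bits - c.perNodeMaskSize)  // addresses per sub-block
+		base := ipToUint32(c.cidr.IP)
+		for i := 0; i < count; i++ {
+			candidate := &net.IPNet{
+				IP:   uint32ToIP(base + uint32(i*step)),
+				Mask: net.CIDRMask(c.perNodeMaskSize, bits),
+			}
+			if !a.allocated[candidate.String()] {
+				a.allocated[candidate.String()] = true
+				return candidate, nil
+			}
+		}
+	}
+	return nil, fmt.Errorf("no matching config with free space")
+}
+
+func ipToUint32(ip net.IP) uint32 {
+	v4 := ip.To4()
+	return uint32(v4[0])<<24 | uint32(v4[1])<<16 | uint32(v4[2])<<8 | uint32(v4[3])
+}
+
+func uint32ToIP(v uint32) net.IP {
+	return net.IPv4(byte(v>>24), byte(v>>16), byte(v>>8), byte(v)).To4()
+}
+
+func main() {
+	_, block, _ := net.ParseCIDR("10.0.0.0/20")
+	a := &allocator{allocated: map[string]bool{}}
+	configs := []*config{{cidr: block, perNodeMaskSize: 24}} // nil selector: default range
+	for i := 0; i < 3; i++ {
+		podCIDR, err := a.allocate(map[string]string{"np": "np1"}, configs)
+		fmt.Println(podCIDR, err) // 10.0.0.0/24, then 10.0.1.0/24, then 10.0.2.0/24
+	}
+}
+```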
+
+##### Node Updated
+Check that its Pod CIDR(s) match the internal allocation.
+
+- If `node.spec.PodCIDRs` is already set, honor that allocation and mark those ranges as allocated.
+- If `node.spec.PodCIDRs` contains a CIDR that is not from any `ClusterCIDRConfig`, raise an error event.
+- If the ranges are already marked as allocated to some other node, raise an error event (there isn't an obvious reconciliation step the controller can take unilaterally).
+
+##### Node Deleted
+Release the Node's allocation in the internal data structure.
+
+##### ClusterCIDRConfig Added
+Update the internal representation of CIDRs (add a radix tree or join an existing one).
+Every failed Node allocation is stored in a queue; those allocations are retried asynchronously against the new range.
+
+##### ClusterCIDRConfig Updated
+_`IPFamily`, `IPCIDRBlock`, and `PerNodeMaskSize` are immutable._
+
+_Admission Controller: If the change to `NodeSelector` would deselect a node already using that range, reject the change._
+
+Update any internal state.
+
+##### ClusterCIDRConfig Deleted
+_Admission Controller: Reject attempts to delete a `ClusterCIDRConfig` in use by a node._
+
+Delete its corresponding internal data (radix tree).
+
+### Test Plan
+
+### Graduation Criteria
+
+### Upgrade / Downgrade Strategy
+
+### Version Skew Strategy
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [ ] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name:
+  - Components depending on the feature gate:
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+###### Does enabling the feature change any default behavior?
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+###### Are there any tests for feature enablement/disablement?
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout fail? Can it impact already running workloads?
+
+###### What specific metrics should inform a rollback?
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+###### Will enabling / using this feature result in introducing new API types?
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+## Drawbacks
+
+## Alternatives
+
+### Share Resources with Service API
+
+There have also been discussions about updating the Service API to support multiple ranges. One proposal is to share a common `CIDRRange` resource between both APIs.
+
+The potential for divergence between Service CIDRs and Pod CIDRs is quite high, as discussed in the cons section below.
+
+```
+CIDRRange {
+  Type CIDRType
+  IPFamily string           # Perhaps omit – inferable from CIDRBlock
+  CIDRBlock string          # Example "10.0.0.0/8" or "fd12:3456:789a:1::/64"
+  Selector v1.LabelSelector # Specifies which Services or Nodes can be
+                            # assigned IPs from this block.
+  BlockSize string          # How large of an IP block to allocate. For services
+                            # this would always be "/32". Example "/24"
+}
+
+var (
+  ServiceCIDR CIDRType = "service"
+  ClusterCIDR CIDRType = "cluster"
+)
+```
+
+#### Pros
+
+- First-party resource to allow editing of the ClusterCIDR or ServiceCIDR without a cluster restart.
+- Single IPAM resource for K8s. Potentially extensible for more use cases down the line.
+
+#### Cons
+- Need a strategy for supporting divergence of the Service and NodeIPAM APIs in the future.
+  - Already, `BlockSize` feels odd, as Service will not make use of it.
+- Any differences in how Service treats an object vs. how NodeIPAM treats an object are likely to cause confusion.
+  - Requires enforcing API-level requirements across multiple unrelated controllers.
+
+### Nodes Register CIDR Request
+
+Nodes might register a request for a CIDR (as a K8s resource). The NodeIPAM controllers would watch this resource and attempt to fulfill these requests.
+
+The major goal behind this design is to provide more flexibility in IPAM. Additionally, it ensures that nodes ask for what they need, so users don't need to ensure that the `ClusterCIDRConfig`'s per-node mask size and the Node's `--max-pods` value are in alignment.
+
+A major factor in not recommending this strategy is the complexity it adds to Kubernetes' IPAM model. One of the stated non-goals is that this proposal doesn't seek to provide a general IPAM solution or to drastically change how Kubernetes does IPAM.
+
+```
+NodeCIDRRequest {
+  NodeName string   # Name of node requesting the CIDR
+  IPFamily IPFamilyType
+  RangeSize string  # Example "/24"
+  CIDRBlock string  # Populated by some IPAM controller. Example: "10.2.0.0/24"
+}
+
+var (
+  IPv4 IPFamilyType = "IPv4"
+  IPv6 IPFamilyType = "IPv6"
+)
+```
+
+#### Pros
+- Because the node is registering its request, it can ensure that it is asking for enough IPs to cover its `--max-pods` value.
+- Added flexibility to support different IPAM models:
+  - Example: Nodes can request additional Pod IPs on the fly. This can help address customer requests for centralized IP handling, as opposed to assigning IPs in chunks.
+
+#### Cons
+- Requires changes to the kubelet in addition to changes to the NodeIPAM controller.
+  - The kubelet needs to register the requests.
+- Potentially more confusing API.
+- _Minor: O(nodes) more objects in etcd. Could be thousands in large clusters._
+
diff --git a/keps/sig-network/2594-multiple-cluster-cidrs/kep.yaml b/keps/sig-network/2594-multiple-cluster-cidrs/kep.yaml
new file mode 100644
index 000000000000..e1b98a15630f
--- /dev/null
+++ b/keps/sig-network/2594-multiple-cluster-cidrs/kep.yaml
@@ -0,0 +1,28 @@
+title: Enhanced NodeIPAM to support Discontiguous ClusterCIDR
+kep-number: NNNN
+authors:
+  - "@rahulkjoshi"
+  - "@sdmodi"
+owning-sig: sig-network
+status: provisional
+creation-date: 2021-03-22
+reviewers:
+  - TBD
+approvers:
+  - TBD
+prr-approvers:
+  - TBD
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.22"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.22"
+  beta: "v1.23"
+  stable: "v1.25"