From 5927f9602927bbd447ee33f3b3357ebb458fcc3d Mon Sep 17 00:00:00 2001
From: Cezary Zukowski
Date: Mon, 19 Oct 2020 18:10:59 +0200
Subject: [PATCH] The Memory Manager official documentation

Signed-off-by: Cezary Zukowski
---
 .../concepts/policy/node-resource-managers.md |  22 +
 .../administer-cluster/memory-manager.md      | 380 ++++++++++++++++++
 .../administer-cluster/topology-manager.md    |  10 +-
 static/images/docs/memory-manager-diagram.svg |   2 +
 4 files changed, 409 insertions(+), 5 deletions(-)
 create mode 100644 content/en/docs/concepts/policy/node-resource-managers.md
 create mode 100644 content/en/docs/tasks/administer-cluster/memory-manager.md
 create mode 100644 static/images/docs/memory-manager-diagram.svg

diff --git a/content/en/docs/concepts/policy/node-resource-managers.md b/content/en/docs/concepts/policy/node-resource-managers.md
new file mode 100644
index 0000000000000..ce9e6be98ba87
--- /dev/null
+++ b/content/en/docs/concepts/policy/node-resource-managers.md
@@ -0,0 +1,22 @@
+---
+reviewers:
+- derekwaynecarr
+- klueska
+title: Node Resource Managers
+content_type: concept
+weight: 50
+---
+
+<!-- overview -->
+
+In order to support latency-critical and high-throughput workloads, Kubernetes offers a suite of Resource Managers. The managers aim to coordinate and optimize the alignment of a node's resources for pods that are configured with specific requirements for CPUs, devices, and memory (hugepages) resources.
+
+<!-- body -->
+
+The main manager, the Topology Manager, is a Kubelet component that coordinates the overall resource management process through its [policy](/docs/tasks/administer-cluster/topology-manager/).
+
+The configuration of individual managers is described in dedicated documents:
+
+- [CPU Manager Policies](/docs/tasks/administer-cluster/cpu-management-policies/)
+- [Device Manager](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-integration-with-the-topology-manager)
+- [Memory Manager Policies](/docs/tasks/administer-cluster/memory-manager/)
\ No newline at end of file
diff --git a/content/en/docs/tasks/administer-cluster/memory-manager.md b/content/en/docs/tasks/administer-cluster/memory-manager.md
new file mode 100644
index 0000000000000..60f2ded206325
--- /dev/null
+++ b/content/en/docs/tasks/administer-cluster/memory-manager.md
@@ -0,0 +1,380 @@
+---
+title: Memory Manager
+
+reviewers:
+- klueska
+- derekwaynecarr
+
+content_type: task
+min-kubernetes-server-version: v1.21
+---
+
+
+
+{{< feature-state state="alpha" for_k8s_version="v1.21" >}}
+
+The Kubernetes *Memory Manager* enables guaranteed memory (and hugepages) allocation for pods in the `Guaranteed` {{< glossary_tooltip text="QoS class" term_id="qos-class" >}}.
+
+The Memory Manager employs a hint generation protocol to yield the most suitable NUMA affinity for a pod. The Memory Manager feeds the central manager (the *Topology Manager*) with these affinity hints. Based on both the hints and the Topology Manager policy, the pod is rejected or admitted to the node.
+
+Moreover, the Memory Manager ensures that the memory which a pod requests is allocated from a minimum number of NUMA nodes.
+
+The Memory Manager is only pertinent to Linux-based hosts.
+
+## {{% heading "prerequisites" %}}
+
+{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
+
+To align memory resources with other requested resources in a Pod Spec:
+- the CPU Manager should be enabled and proper CPU Manager policy should be configured on a Node. See [control CPU Management Policies](/docs/tasks/administer-cluster/cpu-management-policies/);
+- the Topology Manager should be enabled and proper Topology Manager policy should be configured on a Node. See [control Topology Management Policies](/docs/tasks/administer-cluster/topology-manager/).
+
+Support for the Memory Manager requires the `MemoryManager` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) to be enabled.
+
+That is, the `kubelet` must be started with the following flag:
+
+`--feature-gates=MemoryManager=true`
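+
+To verify on a running node that the feature gate and the settings described below took effect, you can, for example, dump the kubelet's current configuration through the `configz` endpoint and look for the `MemoryManager` feature gate and the memory manager policy in the output. This is only a sketch: `<node-name>` is a placeholder, and the availability and format of the `configz` output can differ between Kubernetes versions.
+
+```shell
+# Sketch: inspect the kubelet's live configuration on <node-name> (placeholder).
+kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | python3 -m json.tool
+```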
+
+## How the Memory Manager Operates
+
+The Memory Manager currently offers guaranteed memory (and hugepages) allocation for pods in the `Guaranteed` QoS class. To put the Memory Manager into operation immediately, follow the guidelines in the section [Memory Manager configuration](#memory-manager-configuration), and subsequently prepare and deploy a `Guaranteed` pod as illustrated in the section [Placing a Pod in the Guaranteed QoS class](#placing-a-pod-in-the-guaranteed-qos-class).
+
+The Memory Manager is a Hint Provider: it provides topology hints to the Topology Manager, which then aligns the requested resources according to these hints. The Memory Manager also enforces `cgroups` (i.e. `cpuset.mems`) for pods. The complete flow diagram of the pod admission and deployment process is illustrated in [Memory Manager KEP: Design Overview][4] and below:
+
+![Memory Manager in the pod admission and deployment process](/images/docs/memory-manager-diagram.svg)
+
+During this process, the Memory Manager updates its internal counters stored in [Node Map and Memory Maps][2] to manage guaranteed memory allocation.
+
+The Memory Manager updates the Node Map during startup and runtime as follows.
+
+### Startup
+
+This occurs once a node administrator employs the `--reserved-memory` flag (section [Reserved memory flag](#reserved-memory-flag)). In this case, the Node Map is updated to reflect this reservation, as illustrated in [Memory Manager KEP: Memory Maps at start-up (with examples)][5].
+
+The administrator must provide the `--reserved-memory` flag when the `static` policy is configured.
+
+### Runtime
+
+Reference [Memory Manager KEP: Memory Maps at runtime (with examples)][6] illustrates how a successful pod deployment affects the Node Map, and it also relates to how potential Out-of-Memory (OOM) situations are handled further by Kubernetes or the operating system.
+
+An important topic in the context of Memory Manager operation is the management of NUMA groups. Each time a pod's memory request exceeds the capacity of a single NUMA node, the Memory Manager attempts to create a group that comprises several NUMA nodes and thus features extended memory capacity. How this is achieved is elaborated in [Memory Manager KEP: How to enable the guaranteed memory allocation over many NUMA nodes?][3]. Also, [Memory Manager KEP: Simulation - how the Memory Manager works? (by examples)][1] illustrates how the management of groups occurs.
+
+## Memory Manager configuration
+
+Other managers should be pre-configured first (see the prerequisites above). Next, the Memory Manager feature should be enabled (via the `MemoryManager` feature gate) and be run with the `static` policy (section [static policy](#policy-static)). Optionally, some amount of memory can be reserved for system or kubelet processes in order to increase node stability (section [Reserved memory flag](#reserved-memory-flag)).
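+
+For illustration, a complete set of kubelet flags that pre-configures the other managers and runs the Memory Manager with the `static` policy might resemble the sketch below. The policies and, in particular, the reservation sizes are examples only and must be adapted to your nodes; the memory-related flags are explained in detail in the following sections.
+
+```shell
+# Illustrative sketch only; adjust the policies and reservation sizes to your nodes.
+--feature-gates=MemoryManager=true
+--cpu-manager-policy=static
+--topology-manager-policy=single-numa-node
+--memory-manager-policy=static
+--kube-reserved=cpu=1,memory=1Gi
+--eviction-hard=memory.available<100Mi
+# 1Gi (kube-reserved) + 100Mi (eviction-hard) = 1124Mi reserved in total
+--reserved-memory 0:memory=1124Mi
+```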
+
+### Policies
+
+The Memory Manager supports two policies. You can select a policy via the `kubelet` flag `--memory-manager-policy`:
+
+* `none` (default)
+* `static`
+
+#### none policy {#policy-none}
+
+This is the default policy and does not affect the memory allocation in any way.
+It acts the same as if the Memory Manager were not present at all.
+
+The `none` policy returns a default topology hint. This special hint denotes that a Hint Provider (the Memory Manager in this case) has no preference for NUMA affinity with any resource.
+
+#### static policy {#policy-static}
+
+In the case of a `Guaranteed` pod, the `static` Memory Manager policy returns topology hints relating to the set of NUMA nodes where the memory can be guaranteed, and reserves the memory by updating the internal [NodeMap][2] object.
+
+In the case of a `BestEffort` or `Burstable` pod, the `static` Memory Manager policy sends back the default topology hint, as there is no request for guaranteed memory, and it does not reserve any memory in the internal [NodeMap][2] object.
+
+### Reserved memory flag
+
+The [Node Allocatable](/docs/tasks/administer-cluster/reserve-compute-resources/) mechanism is commonly used by node administrators to reserve node system resources for the kubelet or operating system processes in order to enhance node stability. A dedicated set of flags can be used for this purpose to set the total amount of reserved memory for a node. This pre-configured value is subsequently utilized to calculate the real amount of a node's "allocatable" memory available to pods.
+
+The Kubernetes scheduler incorporates "allocatable" to optimize the pod scheduling process. The flags involved are `--kube-reserved`, `--system-reserved` and `--eviction-hard`. The sum of their values accounts for the total amount of reserved memory.
+
+A new `--reserved-memory` flag was added to the Memory Manager to allow this total reserved memory to be split (by a node administrator) and accordingly reserved across many NUMA nodes.
+
+The flag specifies a comma-separated list of memory reservations per NUMA node.
+This parameter is only useful in the context of the Memory Manager feature.
+The Memory Manager will not use this reserved memory for the allocation of container workloads.
+
+For example, if you have a NUMA node "NUMA0" with `10Gi` of memory available, and `--reserved-memory` is specified to reserve `1Gi` of memory at "NUMA0", the Memory Manager assumes that only `9Gi` is available for containers.
+
+You can omit this parameter; however, you should be aware that the quantity of reserved memory from all NUMA nodes should be equal to the quantity of memory specified by the [Node Allocatable feature](/docs/tasks/administer-cluster/reserve-compute-resources/). If at least one node allocatable parameter is non-zero, you will need to specify `--reserved-memory` for at least one NUMA node. In fact, the `eviction-hard` threshold value is equal to `100Mi` by default, so if the `static` policy is used, `--reserved-memory` is obligatory.
+
+Also, avoid the following configurations (an example of an invalid configuration follows this list):
+1. duplicates, i.e. the same NUMA node or memory type, but with a different value;
+2. setting a zero limit for any memory type;
+3. NUMA node IDs that do not exist in the machine hardware;
+4. memory type names different than `memory` or `hugepages-<size>` (hugepages of a particular `<size>` should also exist).
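+
+For instance, the following hypothetical flag combination violates the first rule, because memory for NUMA node `0` is specified twice with different values:
+
+```shell
+# Invalid: the same NUMA node and memory type appear twice with conflicting values.
+--reserved-memory 0:memory=1Gi --reserved-memory 0:memory=2Gi
+```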
+
+Syntax:
+
+`--reserved-memory N:memory-type1=value1,memory-type2=value2,...`
+* `N` (integer) - NUMA node index, e.g. `0`
+* `memory-type` (string) - represents memory type:
+  * `memory` - conventional memory
+  * `hugepages-2Mi` or `hugepages-1Gi` - hugepages
+* `value` (string) - the quantity of reserved memory, e.g. `1Gi`
+
+Example usage:
+
+`--reserved-memory 0:memory=1Gi,hugepages-1Gi=2Gi`
+
+or
+
+`--reserved-memory 0:memory=1Gi --reserved-memory 1:memory=2Gi`
+
+When you specify values for the `--reserved-memory` flag, you must comply with the settings that you previously provided via the Node Allocatable flags. That is, the following rule must be obeyed for each memory type:
+
+`sum(reserved-memory(i)) = kube-reserved + system-reserved + eviction-threshold`,
+
+where `i` is the index of a NUMA node.
+
+If you do not follow the formula above, the Memory Manager will report an error on startup.
+
+In other words, the last example above illustrates that for the conventional memory (`type=memory`), we reserve `3Gi` in total, i.e.:
+
+`sum(reserved-memory(i)) = reserved-memory(0) + reserved-memory(1) = 1Gi + 2Gi = 3Gi`
+
+An example of kubelet command-line arguments relevant to the Node Allocatable configuration:
+* `--kube-reserved=cpu=500m,memory=50Mi`
+* `--system-reserved=cpu=123m,memory=333Mi`
+* `--eviction-hard=memory.available<500Mi`
+
+{{< note >}}
+The default hard eviction threshold is 100MiB, and **not** zero. Remember to increase the quantity of memory that you reserve via `--reserved-memory` by that hard eviction threshold. Otherwise, the kubelet will not start the Memory Manager and will display an error.
+{{< /note >}}
+
+Here is an example of a correct configuration:
+
+```shell
+--feature-gates=MemoryManager=true
+--kube-reserved=cpu=4,memory=4Gi
+--system-reserved=cpu=1,memory=1Gi
+--memory-manager-policy=static
+--reserved-memory 0:memory=3Gi --reserved-memory 1:memory=2148Mi
+```
+
+Let us validate the configuration above:
+1. `kube-reserved + system-reserved + eviction-hard(default) = reserved-memory(0) + reserved-memory(1)`
+2. `4GiB + 1GiB + 100MiB = 3GiB + 2148MiB`
+3. `5120MiB + 100MiB = 3072MiB + 2148MiB`
+4. `5220MiB = 5220MiB` (which is correct)
+
+## Placing a Pod in the Guaranteed QoS class
+
+If the selected policy is anything other than `none`, the Memory Manager identifies pods that are in the `Guaranteed` QoS class. The Memory Manager provides specific topology hints to the Topology Manager for each `Guaranteed` pod. For pods in a QoS class other than `Guaranteed`, the Memory Manager provides default topology hints to the Topology Manager.
+
+The following excerpts from pod manifests assign a pod to the `Guaranteed` QoS class.
+
+A pod requesting an integer amount of CPU runs in the `Guaranteed` QoS class when `requests` are equal to `limits`:
+
+```yaml
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+    resources:
+      limits:
+        memory: "200Mi"
+        cpu: "2"
+        example.com/device: "1"
+      requests:
+        memory: "200Mi"
+        cpu: "2"
+        example.com/device: "1"
+```
+
+Also, a pod sharing CPU(s) runs in the `Guaranteed` QoS class when `requests` are equal to `limits`:
+
+```yaml
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+    resources:
+      limits:
+        memory: "200Mi"
+        cpu: "300m"
+        example.com/device: "1"
+      requests:
+        memory: "200Mi"
+        cpu: "300m"
+        example.com/device: "1"
+```
+
+Notice that both CPU and memory requests must be specified for a pod to be assigned to the `Guaranteed` QoS class.
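+
+After the pod is created, you can, for example, confirm which QoS class was actually assigned by inspecting the pod's status (replace `<pod-name>` with the name of your pod):
+
+```shell
+# Prints the QoS class that Kubernetes assigned to the pod; expect "Guaranteed".
+kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
+```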
+
+## Troubleshooting
+
+The following means can be used to troubleshoot why a pod could not be deployed or was rejected on a node:
+- pod status - indicates topology affinity errors
+- system logs - include valuable information for debugging, e.g., about generated hints
+- state file - the dump of the internal state of the Memory Manager (includes [Node Map and Memory Maps][2])
+
+### Pod status (TopologyAffinityError) {#TopologyAffinityError}
+
+This error typically occurs in the following situations:
+* a node has insufficient resources available to satisfy the pod's request
+* the pod's request is rejected due to particular Topology Manager policy constraints
+
+The error appears in the status of a pod:
+```shell
+# kubectl get pods
+NAME         READY   STATUS                  RESTARTS   AGE
+guaranteed   0/1     TopologyAffinityError   0          113s
+```
+
+Use `kubectl describe pod <id>` or `kubectl get events` to obtain a detailed error message:
+```shell
+Warning  TopologyAffinityError  10m   kubelet, dell8  Resources cannot be allocated with Topology locality
+```
+
+### System logs
+
+Search the system logs with respect to a particular pod.
+
+The set of hints that the Memory Manager generated for the pod can be found in the logs.
+Also, the set of hints generated by the CPU Manager should be present in the logs.
+
+The Topology Manager merges these hints to calculate a single best hint.
+The best hint should also be present in the logs.
+
+The best hint indicates where to allocate all the resources. The Topology Manager tests this hint against its current policy and, based on the verdict, either admits the pod to the node or rejects it.
+
+Also, search the logs for occurrences associated with the Memory Manager, e.g. to find out information about `cgroups` and `cpuset.mems` updates.
+
+### Examine the Memory Manager state on a node
+
+Let us first deploy a sample `Guaranteed` pod whose specification is as follows:
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: guaranteed
+spec:
+  containers:
+  - name: guaranteed
+    image: consumer
+    imagePullPolicy: Never
+    resources:
+      limits:
+        cpu: "2"
+        memory: 150Gi
+      requests:
+        cpu: "2"
+        memory: 150Gi
+    command: ["sleep","infinity"]
+```
+
+Next, let us log into the node where the pod was deployed and examine the state file in `/var/lib/kubelet/memory_manager_state`:
+```json
+{
+   "policyName":"static",
+   "machineState":{
+      "0":{
+         "numberOfAssignments":1,
+         "memoryMap":{
+            "hugepages-1Gi":{
+               "total":0,
+               "systemReserved":0,
+               "allocatable":0,
+               "reserved":0,
+               "free":0
+            },
+            "memory":{
+               "total":134987354112,
+               "systemReserved":3221225472,
+               "allocatable":131766128640,
+               "reserved":131766128640,
+               "free":0
+            }
+         },
+         "nodes":[
+            0,
+            1
+         ]
+      },
+      "1":{
+         "numberOfAssignments":1,
+         "memoryMap":{
+            "hugepages-1Gi":{
+               "total":0,
+               "systemReserved":0,
+               "allocatable":0,
+               "reserved":0,
+               "free":0
+            },
+            "memory":{
+               "total":135286722560,
+               "systemReserved":2252341248,
+               "allocatable":133034381312,
+               "reserved":29295144960,
+               "free":103739236352
+            }
+         },
+         "nodes":[
+            0,
+            1
+         ]
+      }
+   },
+   "entries":{
+      "fa9bdd38-6df9-4cf9-aa67-8c4814da37a8":{
+         "guaranteed":[
+            {
+               "numaAffinity":[
+                  0,
+                  1
+               ],
+               "type":"memory",
+               "size":161061273600
+            }
+         ]
+      }
+   },
+   "checksum":4142013182
+}
+```
+
+It can be deduced from the state file that the pod was pinned to both NUMA nodes, i.e.:
+
+```json
+"numaAffinity":[
+   0,
+   1
+],
+```
+
+The term "pinned" means that the pod's memory consumption is constrained (through the `cgroups` configuration) to these NUMA nodes.
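+
+You can cross-check this pinning on the node itself by reading the `cpuset.mems` file of the pod's cgroup. The sketch below assumes cgroup v1 with the `cgroupfs` cgroup driver; the exact path depends on your cgroup version, cgroup driver and container runtime, and `<pod-uid>` and `<container-id>` are placeholders:
+
+```shell
+# Sketch only: the cgroup layout differs between drivers and runtimes.
+cat /sys/fs/cgroup/cpuset/kubepods/pod<pod-uid>/<container-id>/cpuset.mems
+# Expected output for the pod above:
+# 0-1
+```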
+
+The fact that this pod is pinned to both NUMA nodes automatically implies that the Memory Manager instantiated a new group that comprises these two NUMA nodes, i.e. the NUMA nodes with indices `0` and `1`.
+
+Notice that the management of groups is handled in a relatively complex manner; further elaboration is provided in the Memory Manager KEP, in [this][1] and [this][3] section.
+
+In order to analyze the memory resources available in a group, the corresponding entries from the NUMA nodes belonging to the group must be added up.
+
+For example, the total amount of free "conventional" memory in the group can be computed by adding up the free memory available at every NUMA node in the group, i.e., in the `"memory"` section of NUMA node `0` (`"free":0`) and NUMA node `1` (`"free":103739236352`). So, the total amount of free "conventional" memory in this group is equal to `0 + 103739236352` bytes.
+
+The line `"systemReserved":3221225472` indicates that the administrator of this node reserved `3221225472` bytes (i.e. `3Gi`) to serve kubelet and system processes at NUMA node `0`, by using the `--reserved-memory` flag.
+
+## {{% heading "whatsnext" %}}
+
+- [Memory Manager KEP: Design Overview][4]
+
+- [Memory Manager KEP: Memory Maps at start-up (with examples)][5]
+
+- [Memory Manager KEP: Memory Maps at runtime (with examples)][6]
+
+- [Memory Manager KEP: Simulation - how the Memory Manager works? (by examples)][1]
+
+- [Memory Manager KEP: The Concept of Node Map and Memory Maps][2]
+
+- [Memory Manager KEP: How to enable the guaranteed memory allocation over many NUMA nodes?][3]
+
+[1]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#simulation---how-the-memory-manager-works-by-examples
+[2]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#the-concept-of-node-map-and-memory-maps
+[3]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#how-to-enable-the-guaranteed-memory-allocation-over-many-numa-nodes
+[4]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#design-overview
+[5]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#memory-maps-at-start-up-with-examples
+[6]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#memory-maps-at-runtime-with-examples
diff --git a/content/en/docs/tasks/administer-cluster/topology-manager.md b/content/en/docs/tasks/administer-cluster/topology-manager.md
index 7d3f017940279..e8d2e7c19d6c9 100644
--- a/content/en/docs/tasks/administer-cluster/topology-manager.md
+++ b/content/en/docs/tasks/administer-cluster/topology-manager.md
@@ -69,6 +69,10 @@ Details on the various `scopes` and `policies` available today can be found belo
 To align CPU resources with other requested resources in a Pod Spec, the CPU Manager should be enabled and proper CPU Manager policy should be configured on a Node. See [control CPU Management Policies](/docs/tasks/administer-cluster/cpu-management-policies/).
 {{< /note >}}
+{{< note >}}
+To align memory (and hugepages) resources with other requested resources in a Pod Spec, the Memory Manager should be enabled and proper Memory Manager policy should be configured on a Node. See the [Memory Manager](/docs/tasks/administer-cluster/memory-manager/) documentation.
+{{< /note >}} + ### Topology Manager Scopes The Topology Manager can deal with the alignment of resources in a couple of distinct scopes: @@ -263,8 +267,4 @@ Using this information the Topology Manager calculates the optimal hint for the ### Known Limitations 1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes there will be a state explosion when trying to enumerate the possible NUMA affinities and generating their hints. -2. The scheduler is not topology-aware, so it is possible to be scheduled on a node and then fail on the node due to the Topology Manager. - -3. The Device Manager and the CPU Manager are the only components to adopt the Topology Manager's HintProvider interface. This means that NUMA alignment can only be achieved for resources managed by the CPU Manager and the Device Manager. Memory or Hugepages are not considered by the Topology Manager for NUMA alignment. - - +2. The scheduler is not topology-aware, so it is possible to be scheduled on a node and then fail on the node due to the Topology Manager. \ No newline at end of file diff --git a/static/images/docs/memory-manager-diagram.svg b/static/images/docs/memory-manager-diagram.svg new file mode 100644 index 0000000000000..43832f140c917 --- /dev/null +++ b/static/images/docs/memory-manager-diagram.svg @@ -0,0 +1,2 @@ + +
+[drawio SVG markup not reproduced here. The diagram depicts the pod admission and deployment flow inside the Kubelet: the Topology Manager receives Admit(...) and AddContainer(...) calls and queries the Memory Manager (MM) via GetTopologyHints(...) and Allocate(...); the Memory Manager retrieves the counters (free memory) from the Node Map (a part of MM) and considers memory pre-allocation in the Memory Maps; it computes the NUMA node affinity for a container (an adequate amount of memory at single NUMA node 0 => attach hint "10"; an adequate amount of memory at a multi-NUMA group => attach hint "11"; and so forth) and returns the hints ("10", "11", etc.); cgroups (cpuset.mems) are updated using the CRI API.]
\ No newline at end of file