Set cache dir to volatile dir #1066
Signed-off-by: Henry Wang <[email protected]>
@squeed @MikeZappa87 or others?
Interested in answers to the questions here. We are running into issues where the CNI cache directory is corrupted with 0-byte files.
Backstory time: we specifically set the cache directory to be non-volatile so that a CNI DEL would be consistent, even after reboots. Specifically, we use the cache directory to store the CNI ARGS / capability args passed to the plugins on ADD. We then reconstruct them for DEL, which plugins may rely on. A reboot does not absolve us of the need to supply a CNI DEL, which the existing runtimes (containerd, cri-o) correctly do. There may be non-volatile resources that need to be cleaned up, even after reboot. We now also use the cache directory for GC, which is convenient for cleaning up stale resources. So, we should keep the cache directory on persistent storage. That said, we should be extremely tolerant of errors in the DEL case; @alaypatel07 would you mind filing an issue?
This issue is likely the result of an OS crash, where dirty pages were lost before being persisted to disk. containerd had a similar issue with its runtime data and had to introduce an optional |
@henry118 so I understand the issue better: in case of an OS crash, are all the files in Also, is there any way to confirm, upon seeing this issue, that it was indeed an OS crash? |
That's just my guess. I haven't validated it with a repro. But it looks like all cases so far involved a node reboot in some way. Assuming the zero-sized files are the result of unflushed writes, it would only affect the files being written at the time of the OS crash. Other existing files in the dir should remain intact.
Opened #1072 as an alternative solution. Closing this thread.
This was meant to be a question but since it's a pretty easy change I'll just use a PR to kick off the discussion.
The question is: should the cached CNI result persist across node restarts?
My understanding is that most networking setups will not persist across node reboot, and libcni will not recover those network setups either. Pods/container recovery after reboot is expected to be driven by the runtime. Today in k8s, pods are normally rescheduled once a node is restarted, therefore any cached CNI result will immediately go stale. It could be a problem if the same node continues to get new pods scheduled on it.
Should we set the cache dir to something volatile, say /run/cni or /var/run/cni?
Likewise, I opened containerd/containerd#9825 to make this dir customizable in the runtime. I'm just unsure where the best place for the change would be...
xref: containerd/containerd#9825
xref: #1055