
Set cache dir to volatile dir #1066

Closed
henry118 wants to merge 1 commit into containernetworking:main from henry118:cachedir

Conversation

henry118 (Member) commented Feb 15, 2024

This was meant to be a question but since it's a pretty easy change I'll just use a PR to kick off the discussion.

The question is: should the CNI cached result persist across a node restart?

My understanding is that most networking setups will not persist across a node reboot, and libcni will not recover those network setups either. Pod/container recovery after a reboot is expected to be driven by the runtime. Today in k8s, pods are normally rescheduled once a node is restarted, so any cached CNI result immediately goes stale. That could be a problem if the same node continues to get new pods scheduled on it.

Should we set the cache dir to something volatile, say /run/cni or /var/run/cni?

Likewise, I opened containerd/containerd#9825 to enable customization of this dir in the runtime. I'm just unsure where the best place for the change is...
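
For reference, this is roughly what the change looks like from a caller's side. It is only a sketch, assuming libcni's NewCNIConfigWithCacheDir constructor; the paths are illustrative, not proposed defaults:

```go
package main

import (
	"github.com/containernetworking/cni/libcni"
)

func main() {
	// Today libcni caches ADD results under a persistent dir (/var/lib/cni).
	// The idea here is to point it at a tmpfs-backed path instead, e.g.
	// /run/cni, so the cache is cleared on reboot.
	cniConf := libcni.NewCNIConfigWithCacheDir(
		[]string{"/opt/cni/bin"}, // plugin search paths (illustrative)
		"/run/cni",               // volatile cache dir instead of /var/lib/cni
		nil,                      // default exec
	)
	_ = cniConf
}
```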

xref: containerd/containerd#9825
xref: #1055

coveralls commented

Coverage Status: coverage 70.133%, remained the same when pulling 8526416 on henry118:cachedir into b62753a on containernetworking:main.

henry118 (Member, Author)

@squeed @MikeZappa87 or others?

MikeZappa87 (Contributor)

I am not certain of the backstory behind why the cache ended up in /var/lib vs /var/run. @squeed @dcbw

alaypatel07

Interested in the answers to the questions here. We are running into issues where the CNI cache directory is corrupted with 0-byte files.

squeed (Member) commented Feb 23, 2024

Backstory time: we specifically set the cache directory to be non-volatile so that a CNI DEL would be consistent, even after reboots.

Specifically, we use the cache directory to store the CNI_ARGS / capability args passed to the plugins on ADD. We then reconstruct them for DEL, which plugins may rely on. A reboot does not absolve us of the need to supply a CNI DEL -- which the existing runtimes (containerd, cri-o) correctly do. There may be non-volatile resources that need to be cleaned up, even after a reboot.
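
To make that concrete, here is a rough sketch of the flow involved, assuming libcni's GetNetworkListCachedConfig / DelNetworkList helpers (not the exact code path the runtimes use, and with error handling trimmed):

```go
package cniexample

import (
	"context"

	"github.com/containernetworking/cni/libcni"
)

// teardown issues the DEL for an attachment, preferring the config and
// runtime args (CNI_ARGS, capability args) that were cached at ADD time.
// Because the cache lives on persistent storage, this still works after a
// node reboot, when the runtime's own in-memory state is gone.
func teardown(ctx context.Context, cni *libcni.CNIConfig, list *libcni.NetworkConfigList, rt *libcni.RuntimeConf) error {
	cachedBytes, cachedRT, err := cni.GetNetworkListCachedConfig(list, rt)
	if err == nil && cachedBytes != nil && cachedRT != nil {
		if cachedList, cerr := libcni.ConfListFromBytes(cachedBytes); cerr == nil {
			// Replay DEL with the same inputs the plugins saw on ADD.
			list, rt = cachedList, cachedRT
		}
	}
	return cni.DelNetworkList(ctx, list, rt)
}
```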

We now also use the cache directory for GC, which is convenient for cleaning up stale resources.
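
(For context, a rough sketch of that GC flow, assuming the GCNetworkList/GCArgs API; exact names depend on the libcni version:)

```go
package cniexample

import (
	"context"

	"github.com/containernetworking/cni/libcni"
	"github.com/containernetworking/cni/pkg/types"
)

// gcStale tells libcni which attachments are still in use; anything else
// recorded in the cache dir from earlier ADDs is treated as stale and torn
// down. The container ID here is illustrative.
func gcStale(ctx context.Context, cni *libcni.CNIConfig, list *libcni.NetworkConfigList) error {
	valid := []types.GCAttachment{
		{ContainerID: "pod-sandbox-1", IfName: "eth0"},
	}
	return cni.GCNetworkList(ctx, list, &libcni.GCArgs{ValidAttachments: valid})
}
```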

So, we should keep the cache directory on persistent storage. That said, we should be extremely tolerant of errors in the DEL case; @alaypatel07 would you mind filing an issue?

alaypatel07

@squeed There is already an issue for the error we are running into: #1055.

What would break if, upon reboot, the results directory were deleted?

henry118 (Member, Author) commented Feb 23, 2024

This issue is likely the result of an OS crash, where dirty pages were lost before being persisted to disk. containerd had a similar issue with its runtime data and had to introduce an optional fsync as a safeguard (containerd/containerd#9401). Can we do something similar in libcni?
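
For illustration, the safeguard would look roughly like the usual write-temp/fsync/rename pattern. This is only a sketch of the idea, not libcni's current cache code, and the function name is made up:

```go
package cniexample

import (
	"os"
	"path/filepath"
)

// writeCacheFile writes a cache entry durably: write to a temp file, fsync
// it, then atomically rename it into place. An OS crash then leaves either
// the old file or the fully written new one, never a truncated 0-byte file.
func writeCacheFile(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if we fail before the rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// Flush file contents to stable storage before making the file visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Atomic on the same filesystem.
	return os.Rename(tmp.Name(), path)
}
```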

alaypatel07 commented Feb 23, 2024

@henry118 just so I understand the issue better: in the case of an OS crash, are all the files in /var/lib/cni/results expected to be of size 0, or only the ones that were dirty at the time?

Also, is there any way to confirm, upon seeing this issue, that it was indeed an OS crash?

henry118 (Member, Author)

That's just my guess; I haven't validated it with a repro. But it looks like all the cases so far involved a node reboot in some way.

Assuming the 0-sized files are the result of unflushed writes, only the files being written at the time of the OS crash would be affected. Other existing files in the dir should remain intact.

henry118 (Member, Author) commented Mar 7, 2024

Opened #1072 as an alternative solution. Closing this thread.

henry118 closed this Mar 7, 2024