
Set cache dir to volatile dir #1066

Closed
henry118 wants to merge 1 commit into containernetworking:main from henry118:cachedir

Conversation

henry118 (Member) commented Feb 15, 2024

This was meant to be a question but since it's a pretty easy change I'll just use a PR to kick off the discussion.

The question is: should the CNI cached result persist across a node restart?

My understanding is that most networking setups will not persist across a node reboot, and libcni will not recover those network setups either. Pod/container recovery after a reboot is expected to be driven by the runtime. Today in k8s, pods are normally rescheduled once a node is restarted, so any cached CNI result immediately goes stale. That could be a problem if the same node continues to get new pods scheduled on it.

Should we set the cache dir to something volatile, say /run/cni or /var/run/cni?

Likewise, I opened containerd/containerd#9825 to enable customization of this dir in the runtime. I'm just unsure where the best place for the change is...
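
For reference, this is roughly what the change looks like from a caller's side. It is only a sketch, assuming libcni's NewCNIConfigWithCacheDir constructor; the paths are illustrative, not proposed defaults:

```go
package main

import (
	"github.com/containernetworking/cni/libcni"
)

func main() {
	// Today libcni caches ADD results under a persistent dir (/var/lib/cni).
	// The idea here is to point it at a tmpfs-backed path instead, e.g.
	// /run/cni, so the cache is cleared on reboot.
	cniConf := libcni.NewCNIConfigWithCacheDir(
		[]string{"/opt/cni/bin"}, // plugin search paths (illustrative)
		"/run/cni",               // volatile cache dir instead of /var/lib/cni
		nil,                      // default exec
	)
	_ = cniConf
}
```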

xref: containerd/containerd#9825
xref: #1055

coveralls commented

Coverage Status: coverage 70.133%, remained the same when pulling 8526416 on henry118:cachedir into b62753a on containernetworking:main.

henry118 (Member, Author)

@squeed @MikeZappa87 or others?

MikeZappa87 (Contributor)

I am not certain of the backstory behind why the cache ended up in /var/lib vs /var/run. @squeed @dcbw

alaypatel07

Interested in the answers to the questions here. We are running into issues where the CNI cache directory is corrupted with 0-byte files.

squeed (Member) commented Feb 23, 2024

Backstory time: we specifically set the cache directory to be non-volatile so that a CNI DEL would be consistent, even after reboots.

Specifically, we use the cache directory to store the CNI_ARGS / capability args passed to the plugins on ADD. We then reconstruct them for DEL, which plugins may rely on. A reboot does not absolve us of the need to supply a CNI DEL -- which the existing runtimes (containerd, cri-o) correctly do. There may be non-volatile resources that need to be cleaned up, even after a reboot.
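
To make that concrete, here is a rough sketch of the flow involved, assuming libcni's GetNetworkListCachedConfig / DelNetworkList helpers (not the exact code path the runtimes use, and with error handling trimmed):

```go
package cniexample

import (
	"context"

	"github.com/containernetworking/cni/libcni"
)

// teardown issues the DEL for an attachment, preferring the config and
// runtime args (CNI_ARGS, capability args) that were cached at ADD time.
// Because the cache lives on persistent storage, this still works after a
// node reboot, when the runtime's own in-memory state is gone.
func teardown(ctx context.Context, cni *libcni.CNIConfig, list *libcni.NetworkConfigList, rt *libcni.RuntimeConf) error {
	cachedBytes, cachedRT, err := cni.GetNetworkListCachedConfig(list, rt)
	if err == nil && cachedBytes != nil && cachedRT != nil {
		if cachedList, cerr := libcni.ConfListFromBytes(cachedBytes); cerr == nil {
			// Replay DEL with the same inputs the plugins saw on ADD.
			list, rt = cachedList, cachedRT
		}
	}
	return cni.DelNetworkList(ctx, list, rt)
}
```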

We now also use the cache directory for GC, which is convenient for cleaning up stale resources.
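
(For context, a rough sketch of that GC flow, assuming the GCNetworkList/GCArgs API; exact names depend on the libcni version:)

```go
package cniexample

import (
	"context"

	"github.com/containernetworking/cni/libcni"
	"github.com/containernetworking/cni/pkg/types"
)

// gcStale tells libcni which attachments are still in use; anything else
// recorded in the cache dir from earlier ADDs is treated as stale and torn
// down. The container ID here is illustrative.
func gcStale(ctx context.Context, cni *libcni.CNIConfig, list *libcni.NetworkConfigList) error {
	valid := []types.GCAttachment{
		{ContainerID: "pod-sandbox-1", IfName: "eth0"},
	}
	return cni.GCNetworkList(ctx, list, &libcni.GCArgs{ValidAttachments: valid})
}
```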

So, we should keep the cache directory on persistent storage. That said, we should be extremely tolerant of errors in the DEL case; @alaypatel07 would you mind filing an issue?

alaypatel07

@squeed There is already an issue for the error we are running into: #1055.

What would break if, upon reboot, the results directory were deleted?

henry118 (Member, Author) commented Feb 23, 2024

This issue is likely the result of an OS crash, where dirty pages were lost before being persisted to disk. containerd had a similar issue with its runtime data and had to introduce an optional fsync as a safeguard (containerd/containerd#9401). Can we do something similar in libcni?
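
For illustration, the safeguard would look roughly like the usual write-temp/fsync/rename pattern. This is only a sketch of the idea, not libcni's current cache code, and the function name is made up:

```go
package cniexample

import (
	"os"
	"path/filepath"
)

// writeCacheFile writes a cache entry durably: write to a temp file, fsync
// it, then atomically rename it into place. An OS crash then leaves either
// the old file or the fully written new one, never a truncated 0-byte file.
func writeCacheFile(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if we fail before the rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// Flush file contents to stable storage before making the file visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Atomic on the same filesystem.
	return os.Rename(tmp.Name(), path)
}
```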

alaypatel07 commented Feb 23, 2024

@henry118 just so I understand the issue better: in the case of an OS crash, are all the files in /var/lib/cni/results expected to be of size 0, or only the ones that were dirty at the time?

Also, is there any way to confirm, upon seeing this issue, that it was indeed an OS crash?

henry118 (Member, Author)

That's just my guess; I haven't validated it with a repro. But it looks like all the cases so far involved a node reboot in some way.

Assuming the 0-sized files are the result of unflushed writes, only the files being written at the time of the OS crash would be affected. Other existing files in the dir should remain intact.

henry118 (Member, Author) commented Mar 7, 2024

Opened #1072 as an alternative solution. Closing this thread.

henry118 closed this Mar 7, 2024