checkpoint rootfs diff is wasteful #24826

Open
hanwen-flow opened this issue Dec 12, 2024 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@hanwen-flow

Issue Description

Container checkpointing with CRIU works for me, but its speed is disappointing. What currently happens is:

  • checkpoint process into checkpoint/
  • discover changed files
  • put changed files into rootfs-diff.tar
  • put deleted files into a deleted.json
  • tar up all of the above in the final checkpoint file.

In my case, my containers have large local file system differences. On my laptop, the tarring runs at 400 MB/s, so 10 GB of file system differences takes 25 s to write into the rootfs-diff.tar archive. Then this data goes through tar again, which takes another 25 s.

Wouldn't it be better to insert the rootfs diff directly into the snapshot tar (perhaps under a rootfs-diff/ directory)? Then my large file content goes through tar only once, yielding a 2x speedup.

Note that Go isn't particularly efficient with tar files either, see golang/go#70807.

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

@hanwen-flow hanwen-flow added the kind/bug Categorizes issue or PR as related to a bug. label Dec 12, 2024
@mheon
Member

mheon commented Dec 12, 2024

@adrianreber Any thoughts?

@adrianreber
Collaborator

@hanwen-flow Sounds like a good idea. You should do a proof of concept and open a PR to verify it.

Another thing which would be interesting is to see if using --compress=none changes anything during checkpointing. The checkpoint archive is compressed with zstd, but you can switch to none. Maybe that also helps.

The main problem seems to be that we write the file-system changes twice. Thinking about your approach, we will still need to write them twice. Once from the container to the temporary directory and the second time while creating the final tar archive from the temporary directory.

Thinking more about it: if we write the content from the container to rootfs-diff.tar, it is written uncompressed. Is writing the data to an uncompressed tar archive so much slower than writing it to a directory? My expectation, without measuring it, would be that there should not be much difference. The --compress flag only controls the final tar archive. The internal tar archive is explicitly not compressed to avoid compressing data twice.

If you find a way to write the data only once, that would be great. If you have an idea how to improve it, please open a PR.

@hanwen-flow
Author

Once from the container to the temporary directory and the second time while creating the final tar archive from the temporary directory.

Why? CRIU has to create files in a temp directory because it is a separate process, but podman can create a tar.Writer in Go for the final file directly, insert the CRIU snapshots (from the temp directory on the file system), and read the changed files from the container directly.

If you do this without compression, the whole process could use kernel file copying (assuming the linked golang proposal goes through). This requires dropping the gratuitous io.Pipe() calls that connect tar streams to output sinks.

@hanwen-flow
Author

Re. compression: I observed that zstd compression on the final archive incurs 10% overhead.

@adrianreber
Collaborator

Once from the container to the temporary directory and the second time while creating the final tar archive from the temporary directory.

why?

Good question.

CRIU has to create files in a temp directory, because it is a separate process, but podman can create a tar.Writer in golang for the final file directly, insert the CRIU snapshots (from the file system tmp dir) and read the changed files from the container directly.

There is a tool called criu-image-streamer which tries to avoid it. So it would be possible to also optimize this step.

if you do this without compression, the whole process could use kernel file copying (assuming the linked golang proposal goes through). This requires dropping the gratuitous io.Pipe() calls to connect tar streams to output sinks.

Sounds great, please open a PR. From my point of view there is nothing against having this.

@Luap99
Member

Luap99 commented Dec 13, 2024

Do we need to worry about backwards compat here? If the layout is changed podman would be unable to load a checkpoint created on a previous version or the other way around.

@hanwen-flow
Author

There is a tool called criu-image-streamer which tries to avoid it. So it would be possible to also optimize this step.

I had a quick look. That is a small tool, so it could be reimplemented in podman or one of its dependencies, I guess?

Re. compatibility: we could keep the tar-in-tar format, and stream the embedded rootfs tar directly into the final tar file without going to disk.

@rst0git
Contributor

rst0git commented Dec 13, 2024

That is a small tool, so it could be reimplemented in podman or one of its dependencies, I guess?

A better place to re-implement image streaming would be go-criu. We use these bindings in Podman. We can propose this as a GSoC project for next year: https://www.criu.org/Google_Summer_of_Code_Ideas

@hanwen

hanwen commented Dec 13, 2024

and stream the embedded rootfs tar directly into the final tar file without going to disk.

Now I remember again: we can't do this. We need to know the size of the file in advance, because the metadata (including size) precedes the file content.


6 participants