Add timeout to Ceph GET API calls #900

Open
karthik-us opened this issue Jul 10, 2023 · 5 comments

@karthik-us

This is to add the necessary changes in go-ceph to handle the ceph-csi issue ceph/ceph-csi#3657.

Provide a way to configure a timeout for the Ceph Get API calls, so that a command does not hang when there is a problem between the Ceph cluster and the CSI driver (cluster health, slow ops, or a brief network connectivity problem).

For more info, please refer to the ceph-csi issue.

@phlogistonjohn
Collaborator

Can you be more specific about what APIs you mean? When I read "Get API calls" I think RGW (HTTP) APIs, but when I look at the linked issue it doesn't seem to be RGW specific.

The APIs that wrap C calls from Ceph do not support things like Go's contexts, so the typical methods for timing out in Go do not work. There are some timeout-related parameters in the Ceph configuration that you could apply to a rados connection. You'd probably need to experiment with them to see which of them, if any, work for your use case.
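As a rough illustration of applying timeout-related configuration to a rados connection, here is a minimal sketch. It assumes the Ceph options `rados_osd_op_timeout` and `rados_mon_op_timeout` (values in seconds) are the relevant ones, which would still need to be confirmed by experiment. The `configSetter` interface, `applyTimeouts` and `fakeConn` are hypothetical names introduced for the example. The interface mirrors the `SetConfigOption(option, value string) error` method on go-ceph's `*rados.Conn`, so the sketch runs without a cluster.

```go
package main

import "fmt"

// configSetter is a hypothetical interface for this sketch. It mirrors the
// SetConfigOption method of go-ceph's *rados.Conn, so a real connection
// could be passed in place of the fake below.
type configSetter interface {
	SetConfigOption(option, value string) error
}

// applyTimeouts sets the same timeout (in seconds) on the OSD and monitor
// operation timeout options. Whether these two options are sufficient for
// the hang described in this issue is an assumption, not a tested result.
func applyTimeouts(c configSetter, seconds int) error {
	value := fmt.Sprintf("%d", seconds)
	for _, opt := range []string{"rados_osd_op_timeout", "rados_mon_op_timeout"} {
		if err := c.SetConfigOption(opt, value); err != nil {
			return fmt.Errorf("setting %s: %w", opt, err)
		}
	}
	return nil
}

// fakeConn records the options it is given, standing in for a real
// rados connection so the example is self-contained.
type fakeConn map[string]string

func (f fakeConn) SetConfigOption(option, value string) error {
	f[option] = value
	return nil
}
```

With a real connection, the options would presumably need to be set after `rados.NewConn()` and before `Connect()`, for example `applyTimeouts(conn, 30)`.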

@karthik-us
Author

karthik-us commented Jul 13, 2023

Hi @phlogistonjohn, the problem we are trying to solve is the CSI pod hanging when something is wrong in the Ceph cluster or there are network problems. In such cases, a pod restart is the only manual fix available at the moment, so we are trying to add timeouts to such CSI calls (mainly the get calls). If it is possible to do that directly on rados, that would be great. Otherwise, we might need to write wrappers around the get calls to handle it. Some more context on this can be found here (a bit old though).

Thanks for your input on the timeout-related parameters in the Ceph configuration. Let me check whether those can be useful here.

@yxxhero

yxxhero commented Jul 18, 2024

any updates?

@black-dragon74
Member

I would prefer not to implement timeouts in go-ceph, as it is supposed to be a simple wrapper around the C libraries.

If one were to modify go-ceph to support timeouts, it would lead to major refactoring of the project as well as of its consumers.

Storage systems are expected to be transparent: if something is not in an expected state, that should be clear, and we should not try to pretend otherwise. Moreover, a timeout is not something that would be useful in every use case.

Since we only need timeouts in the CSI driver for GET calls, we can implement a wrapper on the driver side with something like:

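// This is just a mockup
package main

import (
	"context"
	"time"
)

// TimedWrapper returns the result of the wrapped call, or ctx.Err() if
// ctx is cancelled or its deadline passes first.
func TimedWrapper(ctx context.Context) (string, error) {
	// Type this chan to the return type of the wrapped call. The buffer
	// of 1 lets the goroutine complete its send even after a timeout,
	// when nobody is receiving anymore.
	done := make(chan string, 1)

	go func() {
		defer close(done)

		// Mock up the call to the go-ceph API func: sleep for 10s.
		chunks := 10
		for i := 0; i < chunks; i++ {
			time.Sleep(time.Second)
		}

		// Return the value
		done <- "success!"
	}()

	select {
	case a := <-done:
		return a, nil
	case <-ctx.Done():
		// The goroutine above keeps running until the call returns.
		return "", ctx.Err()
	}
}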

The drawback of this approach is that we have no way to kill the command once it is in flight. We can terminate the goroutine itself using signals.

As it is just a simple GET call, would leaving it as is be an issue?

Please share your thoughts on this. Thank you!

cc: @nixpanic @Madhu-1

@nixpanic
Member

The main question I have is whether it is possible to terminate a goroutine while it is executing a librados, librbd, or libcephfs call. Interrupting the C library call may not be possible, or may not be reliable, depending on the call.

Experimenting with that and sharing the results would be needed to really understand whether this approach gives any benefit.

@ansiwen might have ideas about interrupting CGo calls too.
