"omicron_dev db-run" tool for running CockroachDB in the background #48
I've filed this bug for the Windows CI failure: The signal-hook-tokio crate appears to depend on part of signal-hook that's not available on Windows. We're using this to handle ^C in the new
I will briefly look at the |
The ctrlc approach looks promising. I just pushed a change to use that instead of signal-hook. I did notice that ctrlc pulls in the nix crate, which doesn't yet work on illumos, but should after nix-rust/nix#1394. If for some reason that gets derailed, we can go back to the signal-hook implementation.
src/backoff.rs
pub enum CondCheckError<E: std::error::Error + 'static> {
    /** the condition we're waiting for is not true */
    #[error("poll condition not yet ready")]
    NotYet,
This NotYet case is surprising to see in a value which is returned from a Future. Why not have the future itself classify this case with Poll::Pending?
I was going for the same style of interface here that's used in the backoff crate (which we already use here). This is also the one I've used in Node.js. That's not to say it's ideal in Rust.
The idea here is that it's easy to write a block of code that, say, attempts to make an (async) HTTP request to a dependent service or checks some (async) condition. backoff and the interface here seek to make it as ergonomic as possible to take an existing chunk of code like that and do it in a loop (with delays) until it succeeds, times out, or encounters some permanent error. I see what you're saying about this sounding like a Future, and about having the closure return Poll::Pending instead, and maybe that's good? But I'm not sure how ergonomic that will be to use. It would help to see an example of what that would look like.
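For concreteness, here is a minimal sketch of the closure-based style described above. It is a blocking, std-only analogue so that it runs without an async runtime; the helper in this PR is async. The names wait_for_condition_blocking, Failed, and PermanentError are hypothetical, as is passing the durations by value.

```rust
use std::time::{Duration, Instant};

/// Result of one condition check. `NotYet` mirrors the variant under review;
/// `Failed` is a hypothetical name for the permanent-error case.
#[derive(Debug, PartialEq)]
pub enum CondCheckError<E> {
    /// The condition is not true yet; check again after the poll interval.
    NotYet,
    /// A permanent failure; stop retrying immediately.
    Failed(E),
}

/// Overall outcome when the wait gives up (variant names are assumptions).
#[derive(Debug, PartialEq)]
pub enum Error<E> {
    TimedOut(Duration),
    PermanentError(E),
}

/// Blocking analogue of the retry loop: call `cond` until it succeeds,
/// fails permanently, or `poll_max` has elapsed.
pub fn wait_for_condition_blocking<T, E, F>(
    mut cond: F,
    poll_interval: Duration,
    poll_max: Duration,
) -> Result<T, Error<E>>
where
    F: FnMut() -> Result<T, CondCheckError<E>>,
{
    let start = Instant::now();
    loop {
        // The timeout is only consulted between attempts: it decides whether
        // to start another check, and never interrupts one in progress.
        let elapsed = start.elapsed();
        if elapsed > poll_max {
            return Err(Error::TimedOut(elapsed));
        }
        match cond() {
            Ok(value) => return Ok(value),
            Err(CondCheckError::NotYet) => std::thread::sleep(poll_interval),
            Err(CondCheckError::Failed(e)) => return Err(Error::PermanentError(e)),
        }
    }
}
```

The caller only writes the small "check once" closure; the loop, the delays, and the give-up policy live in the helper. Because the overall timeout is checked only between attempts, a slow condition is never cut short by it.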
src/backoff.rs
 * remember Clulow's adage:
 *
 *     Timeouts, timeouts: always wrong!
 *     Some too short and some too long.
 *
Love this
src/backoff.rs
        return Err(Error::TimedOut(duration));
    }

    let check = cond().await;
I'm kinda surprised to see this cond() function queried multiple times - if it didn't return a Future, that would make sense to me, but futures are already pollable objects. Why are we constructing a completely new future (and discarding the old one), instead of polling the future we already have?
From the async book:
It's common that futures aren't able to complete the first time they are polled. When this happens, the future needs to ensure that it is polled again once it is ready to make more progress. This is done with the Waker type.
Note that to guarantee that cond() returns from .await, callers of this function will need to implement a timeout of their own internally. If callers supply a value of cond that .awaits for a long time (e.g., longer than the poll interval or timeout), it won't be interrupted by this helper.
I went ahead and implemented this here: https://docs.rs/interval_future/0.1.1/interval_future/fn.to_future.html
WDYT?
I'm kinda surprised to see this cond() function queried multiple times - if it didn't return a Future, that would make sense to me, but futures are already pollable objects. Why are we constructing a completely new future (and discarding the old one), instead of polling the future we already have?
Maybe I'm misunderstanding your question, but I think there are two levels of async operation going on here. The user is providing us one that's small and easy to write -- think making an async HTTP request or checking whether a file exists. We're taking that and wrapping it in one that retries based on some policy. If we have them give us a Future that only resolves when the whole thing is done, it feels like that puts all the work on the consumer. But maybe I just don't grok what it would look like.
This is an example of behavior where I believe the current implementation in this PR might struggle:
poll::wait_for_condition(
    || async {
        // Or substitute your own long-running operation of choice!
        tokio::time::sleep(Duration::from_secs(20)).await;
        Ok(())
    },
    &Duration::from_millis(25),
    &Duration::from_secs(10),
).await
With the current implementation, this example would ignore the user-supplied timeout, and would block for 20 seconds.
We talked about this a bunch offline. I understood one of the concerns was that if the async function itself takes longer than the overall operation timeout, there's some confusion about who's responsible for making sure the overall timeout is not violated. My expectations here are:
- that the overall timeout is really a "time after which we stop retrying", not a "time after which we promise that there are no attempts in progress"
- that it's the caller's responsibility to choose this overall timeout, and potentially to bound the time used by the async function, to ensure that the resulting behavior makes sense for their use case
Critically for that second bullet: this abstraction is only intended for the test suite and developer tools, and the only reason there even is an overall timeout is so that these things don't hang when something goes wrong (e.g., test breakage). The current use cases are indeed expected to be quick in reality (reading a tiny file off local disk and the equivalent of waitpid(WNOHANG)).
I'm thinking maybe it makes sense to make this more explicit by creating a new "dev" module, moving this abstraction there, moving the (small) contents of the "test_util" module there, and moving all of "dev_db.rs" into there as well. That way it's clear that all of these things are intended for use by the test suite and dev tools and not, say, Nexus or the various agents. For things that we want to use in production with Nexus or the agents, we may indeed want to be more careful about exactly what promises are made with respect to timeouts and what various callers' expectations are. I don't have such a use case in mind, so I don't want to try to design an API for it.
All of this is not to say we can't improve this later, but that's my current plan. I plan to make this change and integrate it soon.
Thanks for taking a look @smklein!
src/dev_db.rs
 * TODO-robustness It would be nice if there were a version
 * of tokio::fs::read_to_string() that accepted a maximum
 * byte count so that this couldn't, say, use up all of
 * memory.
There basically is a version that accepts a maximum byte count: https://doc.rust-lang.org/std/fs/struct.File.html#impl-Read
Tokio's read_to_string implementation directly invokes the std version (so it blocks the executor for this task, because it's synchronous), which itself is just a wrapper around File::read + str::from_utf8 until EOF.
Are you referring more specifically to https://doc.rust-lang.org/std/io/trait.Read.html#tymethod.read?
I expect that will work and we can do that to resolve this TODO. (It's a fair bit more code than just swapping in a different method call: open the file, call read in a loop, append the correct subslice of bytes to a separate String, handle EOF and errors, etc. It's obviously not hard but it's more work than using read_to_string(). I think it would be useful to have a read_to_string() that just took a maximum byte count to do this work. Unless I'm looking at the wrong function?)
Tokio's read_to_string implementation directly invokes the std version (so it blocks the executor for this task, because it's synchronous)
That is technically true, but it looks like it takes care to make sure that doesn't have the impact one would expect from blocking the executor:
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/read_to_string.rs#L25
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/mod.rs#L9-L14
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/mod.rs#L106-L118
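As a point of comparison for the TODO discussed above, here is a minimal sketch of a bounded read using only std. It relies on std::io::Read::take to cap the number of bytes pulled from the reader, so the manual read loop is not needed. The function name read_to_string_bounded and its error message are hypothetical. It is blocking, so it is not a drop-in replacement for the tokio call.

```rust
use std::io::{self, Read};

/// Read at most `max_bytes` bytes from `reader` into a String.
///
/// Returns an error if the input is longer than `max_bytes` or is not valid
/// UTF-8. (Hypothetical helper; not part of std or tokio.)
pub fn read_to_string_bounded<R: Read>(reader: R, max_bytes: u64) -> io::Result<String> {
    let mut contents = String::new();
    // Allow one byte past the limit so that oversized input can be detected
    // instead of being silently truncated. Memory use stays bounded.
    let mut limited = reader.take(max_bytes.saturating_add(1));
    // If the cap lands in the middle of a multi-byte character, this call
    // itself fails with InvalidData, which is still an error for the caller.
    limited.read_to_string(&mut contents)?;
    if contents.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "input exceeds maximum byte count",
        ));
    }
    Ok(contents)
}
```

Any type that implements Read works here, including a std::fs::File. In async code the call would need to run somewhere that is allowed to block.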
src/dev_db.rs
     * great way for us to know when this has happened, unfortunately.  So
     * we just poll for it up to some maximum timeout.
     */
    let wait_result = poll::wait_for_condition(
Looking at both of the uses of wait_for_condition, I think it might make more sense for this function to operate directly on a future, rather than a function which returns a future.
The behavior below - checking "exited", trying to read over and over again until a file exists - smells a lot like an implementation of Future::poll.
The latest commit (49f1ca1) downloads official CockroachDB binaries and uses them in the build. Unfortunately there's a new Windows build failure:
I'm guessing these don't exist in libc for Windows. This is not unreasonable, but at this point I'm going to just drop the Windows build for the reasons mentioned earlier.
This change is the first in my plan to replace Nexus's in-memory datastore with a CockroachDB instance. This step adds some facilities for running a single-node CockroachDB instance using a temporary directory for its data. The data is removed upon successful cleanup. There's a command-line tool, omicron_dev db-run, intended for doing this in development. There's also a bunch of automated tests for all this, including ^C handling.
The idea is to make these two use cases as simple as possible:
omicron_dev db-run in a separate terminal, then do whatever else you were going to do. When you ^C that command (or otherwise shut it down), the database is cleaned up. This is intended to mimic the in-memory case as much as possible.
Note that this change does not change Nexus, or the datastore, or any of the existing test suite. This all still uses the in-memory datastore. Think of this as phase 1.
I also plan to extend omicron_dev to include db-populate and db-wipe commands for creating and removing the Omicron database schema in a CockroachDB database. That's not here yet, but that's why there's a command with just one subcommand. I also wanted "dev" in the name so that someone would think twice before running this in production.
@smklein I know you're irked by mixed underscores and hyphens. I feel like CLI commands with subcommands usually use hyphens in the subcommand names (e.g., "db-run" in this case). So if we move towards consistency, I imagine we want to change all the commands to use hyphens (e.g., "sled-agent"). That'll also be consistent with the conventions around SMF FMRIs. Is there any reason not to do this? Are there other contexts that use underscores? Rust crate identifiers are the only place that jumps to mind -- and fortunately we just removed those when we renamed this repo and crate. I'm considering this not-a-blocker for this change.