"omicron_dev db-run" tool for running CockroachDB in the background #48

Merged: davepacheco merged 27 commits into master from omicron_dev on Mar 12, 2021

Conversation

davepacheco
Collaborator

@davepacheco commented Mar 10, 2021

This change is the first step in my plan to replace Nexus's in-memory datastore with a CockroachDB instance. It adds facilities for running a single-node CockroachDB instance using a temporary directory for its data; the data is removed upon successful cleanup. There's a command-line tool, omicron_dev db-run, intended for doing this in development. There's also a bunch of automated tests for all of this, including ^C handling.

The idea is to make these two use cases as simple as possible:

  • a developer wants to run Nexus + other control plane components to play around with. Today, this is trivial because Nexus always uses an in-memory datastore created from a clean slate. After this change, you'll run omicron_dev db-run in a separate terminal, then do whatever else you were going to do. When you ^C that command (or otherwise shut it down), the database is cleaned up. This is intended to mimic the in-memory case as much as possible.
  • the test suite wants to run Nexus + other control plane components to run a battery of tests. Again, today, this trivially works using Nexus's in-memory datastore. After this change, the test suite can use a library interface to set up a CockroachDB instance and have it cleaned up upon normal process termination (see the sketch below).
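
As a purely hypothetical sketch of that second use case (the type and function names here are placeholders; the actual interface lives in dev_db.rs and may look different):

    // Hypothetical test-suite usage; `dev_db::start_cockroach` and `cleanup`
    // are placeholder names, not necessarily what dev_db.rs exposes.
    #[tokio::test]
    async fn test_against_real_database() {
        // Start a single-node CockroachDB using a temporary directory for its data.
        let db = dev_db::start_cockroach().await.expect("failed to start CockroachDB");

        // ... exercise Nexus / control plane code against `db` ...

        // Shut the database down and remove its temporary directory.
        db.cleanup().await.expect("failed to clean up CockroachDB");
    }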

Note that this change does not modify Nexus, the datastore, or any of the existing test suite. Those all still use the in-memory datastore. Think of this as phase 1.

I also plan to extend omicron_dev to include db-populate and db-wipe commands for creating and removing the Omicron database schema in a CockroachDB database. That's not here yet, but that's why there's a command with just one subcommand. I also wanted "dev" in the name so that someone would think twice before running this in production.

@smklein I know you're irked by mixed underscores and hyphens. I feel like CLI commands with subcommands usually use hyphens in the subcommand names (e.g., "db-run" in this case). So if we move towards consistency, I imagine we want to change all the commands to use hyphens (e.g., "sled-agent"). That'll also be consistent with the conventions around SMF FMRIs. Is there any reason not to do this? Are there other contexts that use underscores? Rust crate identifiers are the only place that jumps to mind -- and fortunately we just removed those when we renamed this repo and crate. I'm considering this not-a-blocker for this change.

Mar 10, 2021: @davepacheco changed the title from Omicron dev to "omicron_dev db-run" tool for running CockroachDB in the background
@davepacheco
Collaborator Author

I've filed this bug for the Windows CI failure:
vorner/signal-hook#100

The signal-hook-tokio crate appears to depend on part of signal-hook that's not available on Windows. We're using this to handle ^C in the new omicron_dev tool. I'm guessing this breakage won't be easy to fix because that crate's Windows support is best-effort. Given that, there are a few options:

  • switch to the ctrlc crate, which claims to be cross-platform
  • make the ^C handling dependent on not being Windows. This is kind of a copout.
  • drop Windows support. I don't see this software being practically useful on Windows in the foreseeable future, but trying to keep the build clean has felt useful in terms of avoiding unnecessary platform dependence. That said, I think most of us working on this don't currently have a Windows test environment, and it would take some doing to make that happen, so this might only be worthwhile as long as it's easy. And this might be an indication that it's not easy.

I will briefly look at the ctrlc crate, and otherwise I lean towards dropping the Windows CI build.

@davepacheco
Collaborator Author

The ctrlc approach looks promising. I just pushed a change to use that instead of signal-hook. I did notice that ctrlc pulls in Nix, which doesn't yet work on illumos, but should after nix-rust/nix#1394. If for some reason that gets derailed, we can go back to the signal-hook implementation.
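
(For reference, a minimal sketch of the ctrlc style of handler, using the crate's documented set_handler API; this is not the exact code in this change. The handler just signals a channel and the main path performs cleanup before exiting.)

    use std::sync::mpsc::channel;

    fn main() {
        let (tx, rx) = channel();

        // ctrlc runs this handler when ^C (SIGINT) arrives; we only notify the
        // main thread rather than doing cleanup inside the signal handler.
        ctrlc::set_handler(move || {
            let _ = tx.send(());
        })
        .expect("failed to install ^C handler");

        // ... start CockroachDB with a temporary data directory, do work ...

        // Block until ^C, then clean up (stop the database, remove its
        // temporary directory) before exiting.
        rx.recv().expect("^C channel closed unexpectedly");
        // cleanup goes here
    }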

src/backoff.rs Outdated
pub enum CondCheckError<E: std::error::Error + 'static> {
    /** the condition we're waiting for is not true */
    #[error("poll condition not yet ready")]
    NotYet,
Collaborator


This NotYet case is surprising to see in a value which is returned from a Future - why not have the future itself classify this case with Poll::Pending?

Collaborator Author


I was going for the same style of interface here that's used in the backoff crate (which we already use here). This is also the one I've used in Node.js. That's not to say it's ideal in Rust.

The idea here is that it's easy to write a block of code that, say, attempts to make an (async) HTTP request to a dependent service or checks some (async) condition. backoff and the interface here seek to make it as ergonomic as possible to take an existing chunk of code like that and run it in a loop (with delays) until it succeeds, times out, or encounters some permanent error. I see what you're saying about this sounding like a Future -- should we just have the closure return Poll::Pending? Maybe that's good, but I'm not sure how ergonomic that would be to use. It would help to see an example of what that would look like.
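
(For context, here is a rough sketch of the retry-loop style being described, with a simplified error type and a signature that may not match the actual code in src/backoff.rs: the caller supplies a closure that produces a fresh future per attempt, and the helper loops with delays until success, a permanent error, or an overall timeout.)

    use std::time::{Duration, Instant};

    // Simplified version of the condition-check error, for illustration only.
    enum CondCheckError<E> {
        NotYet,     // condition not satisfied yet; retry after the poll interval
        Failed(E),  // permanent error; stop retrying immediately
    }

    async fn wait_for_condition<T, E, F, Fut>(
        mut cond: F,
        poll_interval: Duration,
        poll_max: Duration,
    ) -> Result<T, String>
    where
        F: FnMut() -> Fut,
        Fut: std::future::Future<Output = Result<T, CondCheckError<E>>>,
        E: std::fmt::Display,
    {
        let start = Instant::now();
        loop {
            if start.elapsed() > poll_max {
                return Err(format!("timed out after {:?}", poll_max));
            }
            // Each attempt constructs and awaits a brand-new future from the closure.
            match cond().await {
                Ok(value) => return Ok(value),
                Err(CondCheckError::Failed(e)) => return Err(format!("permanent error: {}", e)),
                Err(CondCheckError::NotYet) => tokio::time::sleep(poll_interval).await,
            }
        }
    }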

src/backoff.rs Outdated
Comment on lines 69 to 73
* remember Clulow's adage:
*
* Timeouts, timeouts: always wrong!
* Some too short and some too long.
*
Collaborator


Love this

src/backoff.rs Outdated
        return Err(Error::TimedOut(duration));
    }

    let check = cond().await;
Collaborator


I'm kinda surprised to see this cond() function queried multiple times - if it didn't return a Future, that would make sense to me, but futures are already pollable objects. Why are we constructing a completely new future (and discarding the old one), instead of polling the future we already have?

From the async book:

It's common that futures aren't able to complete the first time they are polled. When this happens, the future needs to ensure that it is polled again once it is ready to make more progress. This is done with the Waker type.

Note that for the cond() to return from .await, callers of this function will need to implement a timeout of their own internally - if callers supplied a value of cond that .awaits for a long time (e.g., longer than the poll interval or timeout), it won't be interrupted by this helper.

Collaborator

@smklein commented Mar 11, 2021


I went ahead and implemented this here: https://docs.rs/interval_future/0.1.1/interval_future/fn.to_future.html

WDYT?

Collaborator Author


I'm kinda surprised to see this cond() function queried multiple times - if it didn't return a Future, that would make sense to me, but futures are already pollable objects. Why are we constructing a completely new future (and discarding the old one), instead of polling the future we already have?

Maybe I'm misunderstanding your question, but I think there are two levels of async operation going on here. The user is providing us one that's small and easy to write -- think making an async HTTP request or checking whether a file exists. We're taking that and wrapping it in one that retries based on some policy. If we have them give us a Future that only resolves when the whole thing is done, it feels like that puts all the work on the consumer. But maybe I just don't grok what it would look like.

Collaborator


This is an example of behavior where I believe the current implementation in this PR might struggle:

        poll::wait_for_condition(
            || async {
                // Or substitute your own long-running operation of choice!
                tokio::time::sleep(Duration::from_secs(20)).await;
                Ok(())
            },
            &Duration::from_millis(25),
            &Duration::from_secs(10),
        ).await

With the current implementation, this example would ignore the user-supplied timeout, and would block for 20 seconds.

Collaborator Author


We talked about this a bunch offline. I understood one of the concerns was that if the async function itself takes longer than the overall operation timeout, there's some confusion about who's responsible for making sure the overall timeout is not violated. My expectations here are:

  • that the overall timeout is really a "time after which we stop retrying", not a "time after which we promise that there are no attempts in progress"
  • that it's the caller's responsibility to choose this overall timeout, and potentially to bound the time used by the async function, to ensure that the resulting behavior makes sense for their use case

Critically for that second bullet: this abstraction is only intended for the test suite and developer tools, and the only reason there even is an overall timeout is so that these things don't hang when something goes wrong (e.g., test breakage). The current use cases are indeed expected to be quick in reality (reading a tiny file off local disk and the equivalent of waitpid(WNOHANG)).
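
(To illustrate that second point: a caller that does want each attempt bounded could wrap its own condition body in tokio::time::timeout, mapping a slow attempt to a "not yet" result. This is just a sketch, and it assumes CondCheckError has a permanent-failure variant like Failed(E) alongside NotYet; check_service_ready() is a placeholder for whatever the caller is polling.)

    use std::time::Duration;

    // Placeholder for whatever async check the caller actually performs.
    async fn check_service_ready() -> Result<(), std::io::Error> {
        Ok(())
    }

    // Bound each attempt so the overall timeout passed to wait_for_condition()
    // stays meaningful even if one attempt stalls.
    async fn bounded_check() -> Result<(), CondCheckError<std::io::Error>> {
        match tokio::time::timeout(Duration::from_millis(500), check_service_ready()).await {
            Ok(Ok(())) => Ok(()),                          // condition satisfied
            Ok(Err(e)) => Err(CondCheckError::Failed(e)),  // permanent failure: stop retrying
            Err(_elapsed) => Err(CondCheckError::NotYet),  // attempt took too long; try again
        }
    }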

I'm thinking maybe it makes sense to make this more explicit by creating a new "dev" module, moving this abstraction there, moving the (small) contents of the "test_util" module there, and moving all of "dev_db.rs" into there as well. That way it's clear that all of these things are intended for use by the test suite and dev tools and not, say, Nexus or the various agents. For things that we want to use in production with Nexus or the agents, we may indeed want to be more careful about exactly what promises are made with respect to timeouts and what various callers' expectations are. I don't have such a use case in mind, so I don't want to try to design an API for it.

All of this is not to say we can't improve this later, but that's my current plan. I plan to do this and integrate it soon.

Thanks for taking a look @smklein!

src/dev_db.rs Outdated
Comment on lines 274 to 277
* TODO-robustness It would be nice if there were a version
* of tokio::fs::read_to_string() that accepted a maximum
* byte count so that this couldn't, say, use up all of
* memory.
Collaborator


There basically is a version that accepts a maximum byte count: https://doc.rust-lang.org/std/fs/struct.File.html#impl-Read

Tokio's read_to_string implementation directly invokes the std version (so it blocks the executor for this task, because it's synchronous), which itself is just a wrapper around File::read + str::from_utf8 until EOF.

Collaborator Author


Are you referring more specifically to https://doc.rust-lang.org/std/io/trait.Read.html#tymethod.read?

I expect that will work and we can do that to resolve this TODO. (It's a fair bit more code than just calling that method instead: open the file, call read in a loop, append the correct subslice of bytes to a separate String, handle EOF and errors, etc. It's obviously not hard but it's more work than using read_to_string(). I think it would be useful to have a read_to_string() that just took a maximum byte count to do this work. Unless I'm looking at the wrong function?)

Tokio's read_to_string implementation directly invokes the std version (so it blocks the executor for this task, because it's synchronous)

That is technically true, but it looks like it takes care to make sure that doesn't have the impact one would expect from blocking the executor:
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/read_to_string.rs#L25
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/mod.rs#L9-L14
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/mod.rs#L106-L118
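
(For what it's worth, one way to get a capped read without hand-writing the loop -- a sketch rather than code from this change -- is to combine tokio's AsyncReadExt::take adapter with read_to_string:)

    use tokio::fs::File;
    use tokio::io::AsyncReadExt;

    // Read at most `max_bytes` of UTF-8 from `path`. The `take` adapter caps
    // how much the underlying file can yield, so a huge file can't use up all
    // of memory. (Truncating mid-UTF-8-character makes read_to_string return
    // an error, which is probably fine for this use case.)
    async fn read_to_string_capped(path: &str, max_bytes: u64) -> std::io::Result<String> {
        let mut contents = String::new();
        File::open(path).await?.take(max_bytes).read_to_string(&mut contents).await?;
        Ok(contents)
    }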

src/dev_db.rs Outdated
* great way for us to know when this has happened, unfortunately. So
* we just poll for it up to some maximum timeout.
*/
let wait_result = poll::wait_for_condition(
Collaborator


Looking at both of the uses of wait_for_condition, I think it might make more sense for this function to operate directly on a future, rather than a function which returns a future.

The behavior below - checking "exited", trying to read over and over again until a file exists - smells a lot like an implementation of Future::poll.

@davepacheco
Collaborator Author

The latest commit (49f1ca1) downloads official CockroachDB binaries and uses them in the build.

Unfortunately there's a new Windows build failure:

error[E0425]: cannot find function `kill` in crate `libc`
  --> src\test_util.rs:37:26
   |
37 |     0 == (unsafe { libc::kill(pid as libc::pid_t, 0) })
   |                          ^^^^ not found in `libc`

error[E0412]: cannot find type `pid_t` in crate `libc`
  --> src\test_util.rs:37:44
   |
37 |     0 == (unsafe { libc::kill(pid as libc::pid_t, 0) })
   |                                            ^^^^^ not found in `libc`

I'm guessing these don't exist in libc for Windows. This is not unreasonable, but at this point I'm going to just drop the Windows build for the reasons mentioned earlier.

@davepacheco marked this pull request as ready for review March 12, 2021 19:17
@davepacheco merged commit 7607003 into master Mar 12, 2021
@davepacheco deleted the omicron_dev branch March 12, 2021 19:18