"omicron_dev db-run" tool for running CockroachDB in the background #48

Merged: davepacheco merged 27 commits into master from omicron_dev on Mar 12, 2021

Conversation

davepacheco
Collaborator

@davepacheco commented Mar 10, 2021

This change is the first step in my plan to replace Nexus's in-memory datastore with a CockroachDB instance. It adds facilities for running a single-node CockroachDB instance using a temporary directory for its data; the data is removed upon successful cleanup. There's a command-line tool, omicron_dev db-run, intended for doing this in development. There's also a bunch of automated tests for all of this, including ^C handling.

The idea is to make these two use cases as simple as possible:

  • a developer wants to run Nexus + other control plane components to play around with. Today, this is trivial because Nexus always uses an in-memory datastore created from a clean slate. After this change, you'll run omicron_dev db-run in a separate terminal, then do whatever else you were going to do. When you ^C that command (or otherwise shut it down), the database is cleaned up. This is intended to mimic the in-memory case as much as possible.
  • the test suite wants to run Nexus + other control plane components to run a battery of tests. Again, today, this trivially works using Nexus's in-memory datastore. After this change, the test suite can use a library interface to set up a CockroachDB instance and have it cleaned up upon normal process termination (see the sketch below).
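
As a purely hypothetical sketch of that second use case (the type and function names here are placeholders; the actual interface lives in dev_db.rs and may look different):

    // Hypothetical test-suite usage; `dev_db::start_cockroach` and `cleanup`
    // are placeholder names, not necessarily what dev_db.rs exposes.
    #[tokio::test]
    async fn test_against_real_database() {
        // Start a single-node CockroachDB using a temporary directory for its data.
        let db = dev_db::start_cockroach().await.expect("failed to start CockroachDB");

        // ... exercise Nexus / control plane code against `db` ...

        // Shut the database down and remove its temporary directory.
        db.cleanup().await.expect("failed to clean up CockroachDB");
    }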

Note that this change does not modify Nexus, the datastore, or any of the existing test suite. Those all still use the in-memory datastore. Think of this as phase 1.

I also plan to extend omicron_dev to include db-populate and db-wipe commands for creating and removing the Omicron database schema in a CockroachDB database. That's not here yet, but that's why there's a command with just one subcommand. I also wanted "dev" in the name so that someone would think twice before running this in production.

@smklein I know you're irked by mixed underscores and hyphens. I feel like CLI commands with subcommands usually use hyphens in the subcommand names (e.g., "db-run" in this case). So if we move towards consistency, I imagine we want to change all the commands to use hyphens (e.g., "sled-agent"). That'll also be consistent with the conventions around SMF FMRIs. Is there any reason not to do this? Are there other contexts that use underscores? Rust crate identifiers are the only place that jumps to mind -- and fortunately we just removed those when we renamed this repo and crate. I'm considering this not-a-blocker for this change.

Mar 10, 2021: @davepacheco changed the title from Omicron dev to "omicron_dev db-run" tool for running CockroachDB in the background
@davepacheco
Collaborator Author

I've filed this bug for the Windows CI failure:
vorner/signal-hook#100

The signal-hook-tokio crate appears to depend on part of signal-hook that's not available on Windows. We're using this to handle ^C in the new omicron_dev tool. I'm guessing this breakage won't be easy to fix because that crate's Windows support is best-effort. Given that, there are a few options:

  • switch to the ctrlc crate, which claims to be cross-platform
  • make the ^C handling dependent on not being Windows. This is kind of a copout.
  • drop Windows support. I don't see this software being practically useful on Windows in the foreseeable future, but trying to keep the build clean has felt useful in terms of avoiding unnecessary platform dependence. That said, I think most of us working on this don't currently have a Windows test environment, and it would take some doing to make that happen, so this might only be worthwhile as long as it's easy. And this might be an indication that it's not easy.

I will briefly look at the ctrlc crate, and otherwise I lean towards dropping the Windows CI build.

@davepacheco
Collaborator Author

The ctrlc approach looks promising. I just pushed a change to use that instead of signal-hook. I did notice that ctrlc pulls in Nix, which doesn't yet work on illumos, but should after nix-rust/nix#1394. If for some reason that gets derailed, we can go back to the signal-hook implementation.
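
(For reference, a minimal sketch of the ctrlc style of handler, using the crate's documented set_handler API; this is not the exact code in this change. The handler just signals a channel and the main path performs cleanup before exiting.)

    use std::sync::mpsc::channel;

    fn main() {
        let (tx, rx) = channel();

        // ctrlc runs this handler when ^C (SIGINT) arrives; we only notify the
        // main thread rather than doing cleanup inside the signal handler.
        ctrlc::set_handler(move || {
            let _ = tx.send(());
        })
        .expect("failed to install ^C handler");

        // ... start CockroachDB with a temporary data directory, do work ...

        // Block until ^C, then clean up (stop the database, remove its
        // temporary directory) before exiting.
        rx.recv().expect("^C channel closed unexpectedly");
        // cleanup goes here
    }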

src/backoff.rs Outdated
pub enum CondCheckError<E: std::error::Error + 'static> {
    /** the condition we're waiting for is not true */
    #[error("poll condition not yet ready")]
    NotYet,
Collaborator


This NotYet case is surprising to see in a value which is returned from a Future - why not have the future itself classify this case with Poll::Pending?

Collaborator Author


I was going for the same style of interface here that's used in the backoff crate (which we already use here). This is also the one I've used in Node.js. That's not to say it's ideal in Rust.

The idea here is that it's easy to write a block of code that, say, attempts to make an (async) HTTP request to a dependent service or checks some (async) condition. backoff and the interface here seek to make it as ergonomic as possible to take an existing chunk of code like that and run it in a loop (with delays) until it succeeds, times out, or encounters some permanent error. I see what you're saying about this sounding like a Future -- should we just have the closure return Poll::Pending? Maybe that's good, but I'm not sure how ergonomic that would be to use. It would help to see an example of what that would look like.
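
(For context, here is a rough sketch of the retry-loop style being described, with a simplified error type and a signature that may not match the actual code in src/backoff.rs: the caller supplies a closure that produces a fresh future per attempt, and the helper loops with delays until success, a permanent error, or an overall timeout.)

    use std::time::{Duration, Instant};

    // Simplified version of the condition-check error, for illustration only.
    enum CondCheckError<E> {
        NotYet,     // condition not satisfied yet; retry after the poll interval
        Failed(E),  // permanent error; stop retrying immediately
    }

    async fn wait_for_condition<T, E, F, Fut>(
        mut cond: F,
        poll_interval: Duration,
        poll_max: Duration,
    ) -> Result<T, String>
    where
        F: FnMut() -> Fut,
        Fut: std::future::Future<Output = Result<T, CondCheckError<E>>>,
        E: std::fmt::Display,
    {
        let start = Instant::now();
        loop {
            if start.elapsed() > poll_max {
                return Err(format!("timed out after {:?}", poll_max));
            }
            // Each attempt constructs and awaits a brand-new future from the closure.
            match cond().await {
                Ok(value) => return Ok(value),
                Err(CondCheckError::Failed(e)) => return Err(format!("permanent error: {}", e)),
                Err(CondCheckError::NotYet) => tokio::time::sleep(poll_interval).await,
            }
        }
    }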

src/backoff.rs Outdated
Comment on lines 69 to 73
* remember Clulow's adage:
*
* Timeouts, timeouts: always wrong!
* Some too short and some too long.
*
Collaborator


Love this

src/backoff.rs Outdated
        return Err(Error::TimedOut(duration));
    }

    let check = cond().await;
Collaborator


I'm kinda surprised to see this cond() function queried multiple times - if it didn't return a Future, that would make sense to me, but futures are already pollable objects. Why are we constructing a completely new future (and discarding the old one), instead of polling the future we already have?

From the async book:

It's common that futures aren't able to complete the first time they are polled. When this happens, the future needs to ensure that it is polled again once it is ready to make more progress. This is done with the Waker type.

Note that for the cond() to return from .await, callers of this function will need to implement a timeout of their own internally - if callers supplied a value of cond that .awaits for a long time (e.g., longer than the poll interval or timeout), it won't be interrupted by this helper.

Collaborator

@smklein commented Mar 11, 2021


I went ahead and implemented this here: https://docs.rs/interval_future/0.1.1/interval_future/fn.to_future.html

WDYT?

Collaborator Author


I'm kinda surprised to see this cond() function queried multiple times - if it didn't return a Future, that would make sense to me, but futures are already pollable objects. Why are we constructing a completely new future (and discarding the old one), instead of polling the future we already have?

Maybe I'm misunderstanding your question, but I think there are two levels of async operation going on here. The user is providing us one that's small and easy to write -- think making an async HTTP request or checking whether a file exists. We're taking that and wrapping it in one that retries based on some policy. If we have them give us a Future that only resolves when the whole thing is done, it feels like that puts all the work on the consumer. But maybe I just don't grok what it would look like.

Collaborator


This is an example of behavior where I believe the current implementation in this PR might struggle:

        poll::wait_for_condition(
            || async {
                // Or substitute your own long-running operation of choice!
                tokio::time::sleep(Duration::from_secs(20)).await;
                Ok(())
            },
            &Duration::from_millis(25),
            &Duration::from_secs(10),
        ).await

With the current implementation, this example would ignore the user-supplied timeout, and would block for 20 seconds.

Collaborator Author


We talked about this a bunch offline. I understood one of the concerns was that if the async function itself takes longer than the overall operation timeout, there's some confusion about who's responsible for making sure the overall timeout is not violated. My expectations here are:

  • that the overall timeout is really a "time after which we stop retrying", not a "time after which we promise that there are no attempts in progress"
  • that it's the caller's responsibility to choose this overall timeout, and potentially to bound the time used by the async function, to ensure that the resulting behavior makes sense for their use case

Critically for that second bullet: this abstraction is only intended for the test suite and developer tools, and the only reason there even is an overall timeout is so that these things don't hang when something goes wrong (e.g., test breakage). The current use cases are indeed expected to be quick in reality (reading a tiny file off local disk and the equivalent of waitpid(WNOHANG)).
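
(To illustrate that second point: a caller that does want each attempt bounded could wrap its own condition body in tokio::time::timeout, mapping a slow attempt to a "not yet" result. This is just a sketch, and it assumes CondCheckError has a permanent-failure variant like Failed(E) alongside NotYet; check_service_ready() is a placeholder for whatever the caller is polling.)

    use std::time::Duration;

    // Placeholder for whatever async check the caller actually performs.
    async fn check_service_ready() -> Result<(), std::io::Error> {
        Ok(())
    }

    // Bound each attempt so the overall timeout passed to wait_for_condition()
    // stays meaningful even if one attempt stalls.
    async fn bounded_check() -> Result<(), CondCheckError<std::io::Error>> {
        match tokio::time::timeout(Duration::from_millis(500), check_service_ready()).await {
            Ok(Ok(())) => Ok(()),                          // condition satisfied
            Ok(Err(e)) => Err(CondCheckError::Failed(e)),  // permanent failure: stop retrying
            Err(_elapsed) => Err(CondCheckError::NotYet),  // attempt took too long; try again
        }
    }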

I'm thinking maybe it makes sense to make this more explicit by creating a new "dev" module, moving this abstraction there, moving the (small) contents of the "test_util" module there, and moving all of "dev_db.rs" into there as well. That way it's clear that all of these things are intended for use by the test suite and dev tools and not, say, Nexus or the various agents. For things that we want to use in production with Nexus or the agents, we may indeed want to be more careful about exactly what promises are made with respect to timeouts and what various callers' expectations are. I don't have such a use case in mind, so I don't want to try to design an API for it.

All of this is not to say we can't improve this later, but that's my current plan. I plan to do this and integrate it soon.

Thanks for taking a look @smklein!

src/dev_db.rs Outdated
Comment on lines 274 to 277
* TODO-robustness It would be nice if there were a version
* of tokio::fs::read_to_string() that accepted a maximum
* byte count so that this couldn't, say, use up all of
* memory.
Collaborator


There basically is a version that accepts a maximum byte count: https://doc.rust-lang.org/std/fs/struct.File.html#impl-Read

Tokio's read_to_string implementation directly invokes the std version (so it blocks the executor for this task, because it's synchronous), which itself is just a wrapper around File::read + str::from_utf8 until EOF.

Collaborator Author


Are you referring more specifically to https://doc.rust-lang.org/std/io/trait.Read.html#tymethod.read?

I expect that will work and we can do that to resolve this TODO. (It's a fair bit more code than just calling that method instead: open the file, call read in a loop, append the correct subslice of bytes to a separate String, handle EOF and errors, etc. It's obviously not hard but it's more work than using read_to_string(). I think it would be useful to have a read_to_string() that just took a maximum byte count to do this work. Unless I'm looking at the wrong function?)

Tokio's read_to_string implementation directly invokes the std version (so it blocks the executor for this task, because it's synchronous)

That is technically true, but it looks like it takes care to make sure that doesn't have the impact one would expect from blocking the executor:
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/read_to_string.rs#L25
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/mod.rs#L9-L14
https://github.com/tokio-rs/tokio/blob/edfff7551abc24e6d6ee5e0dc8e894cf1309a53d/tokio/src/fs/mod.rs#L106-L118
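
(For what it's worth, one way to get a capped read without hand-writing the loop -- a sketch rather than code from this change -- is to combine tokio's AsyncReadExt::take adapter with read_to_string:)

    use tokio::fs::File;
    use tokio::io::AsyncReadExt;

    // Read at most `max_bytes` of UTF-8 from `path`. The `take` adapter caps
    // how much the underlying file can yield, so a huge file can't use up all
    // of memory. (Truncating mid-UTF-8-character makes read_to_string return
    // an error, which is probably fine for this use case.)
    async fn read_to_string_capped(path: &str, max_bytes: u64) -> std::io::Result<String> {
        let mut contents = String::new();
        File::open(path).await?.take(max_bytes).read_to_string(&mut contents).await?;
        Ok(contents)
    }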

src/dev_db.rs Outdated
* great way for us to know when this has happened, unfortunately. So
* we just poll for it up to some maximum timeout.
*/
let wait_result = poll::wait_for_condition(
Collaborator


Looking at both of the uses of wait_for_condition, I think it might make more sense for this function to operate directly on a future, rather than a function which returns a future.

The behavior below - checking "exited", trying to read over and over again until a file exists - smells a lot like an implementation of Future::poll.

@davepacheco
Collaborator Author

The latest commit (49f1ca1) downloads official CockroachDB binaries and uses them in the build.

Unfortunately there's a new Windows build failure:

error[E0425]: cannot find function `kill` in crate `libc`
  --> src\test_util.rs:37:26
   |
37 |     0 == (unsafe { libc::kill(pid as libc::pid_t, 0) })
   |                          ^^^^ not found in `libc`

error[E0412]: cannot find type `pid_t` in crate `libc`
  --> src\test_util.rs:37:44
   |
37 |     0 == (unsafe { libc::kill(pid as libc::pid_t, 0) })
   |                                            ^^^^^ not found in `libc`

I'm guessing these don't exist in libc for Windows. This is not unreasonable, but at this point I'm going to just drop the Windows build for the reasons mentioned earlier.

@davepacheco marked this pull request as ready for review March 12, 2021 19:17
@davepacheco merged commit 7607003 into master Mar 12, 2021
@davepacheco deleted the omicron_dev branch March 12, 2021 19:18