Skip to content
This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

membership rollback issue #250

Closed
as58545652 opened this issue Nov 24, 2021 · 6 comments · Fixed by #354
Closed

membership rollback issue #250

as58545652 opened this issue Nov 24, 2021 · 6 comments · Fixed by #354
Labels
Bug Confirmed to be a bug

Comments

@as58545652
Copy link

as58545652 commented Nov 24, 2021

assert(entry != NULL);

The assertion may fail if the configuration entry has already been removed. In addition, an uncommitted configuration may be stored in a snapshot, making it impossible to roll back.

@as58545652
Copy link
Author

One possible solution would be to take a snapshot only if there is no uncommitted configuration, and to restore the configuration from the snapshot if there is no configuration entry in the log.

@MathieuBordere
Copy link
Contributor

Thanks for this, will investigate.

@MathieuBordere MathieuBordere added the Bug Confirmed to be a bug label Nov 24, 2021
@as58545652
Copy link
Author

raft/src/start.c

Lines 16 to 87 in cbb9cd7

static int restoreMostRecentConfiguration(struct raft *r,
struct raft_entry *entry,
raft_index index)
{
struct raft_configuration configuration;
int rv;
raft_configuration_init(&configuration);
rv = configurationDecode(&entry->buf, &configuration);
if (rv != 0) {
raft_configuration_close(&configuration);
return rv;
}
configurationTrace(r, &configuration, "restore most recent configuration");
raft_configuration_close(&r->configuration);
r->configuration = configuration;
r->configuration_index = index;
return 0;
}
/* Restore the entries that were loaded from persistent storage. The most recent
* configuration entry will be restored as well, if any.
*
* Note that we don't care whether the most recent configuration entry was
* actually committed or not. We don't allow more than one pending uncommitted
* configuration change at a time, plus
*
* when adding or removing just a single server, it is safe to switch directly
* to the new configuration.
*
* and
*
* The new configuration takes effect on each server as soon as it is added to
* that server's log: the C_new entry is replicated to the C_new servers, and
* a majority of the new configuration is used to determine the C_new entry's
* commitment. This means that servers do notwait for configuration entries to
* be committed, and each server always uses the latest configuration found in
* its log.
*
* as explained in section 4.1.
*
* TODO: we should probably set configuration_uncommitted_index as well, since we
* can't be sure a configuration change has been committed and we need to be
* ready to roll back to the last committed configuration.
*/
static int restoreEntries(struct raft *r,
raft_index snapshot_index,
raft_term snapshot_term,
raft_index start_index,
struct raft_entry *entries,
size_t n)
{
struct raft_entry *conf = NULL;
raft_index conf_index = 0;
size_t i;
int rv;
logStart(&r->log, snapshot_index, snapshot_term, start_index);
r->last_stored = start_index - 1;
for (i = 0; i < n; i++) {
struct raft_entry *entry = &entries[i];
rv = logAppend(&r->log, entry->term, entry->type, &entry->buf,
entry->batch);
if (rv != 0) {
goto err;
}
r->last_stored++;
if (entry->type == RAFT_CHANGE) {
conf = entry;
conf_index = r->last_stored;
}
}
if (conf != NULL) {
rv = restoreMostRecentConfiguration(r, conf, conf_index);

The most recent configuration might not be a committed one, which would probably lead to another membership rollback issue.

@MathieuBordere
Copy link
Contributor

One possible solution would be to take a snapshot only if there is no uncommitted configuration, and to restore the configuration from the snapshot if there is no configuration entry in the log.

There are degenerate cases where not taking a snapshot while there's an uncommitted configuration could lead to never taking a snapshot anymore.

There's this fragment in the raft dissertation in Chapter 5

Raft also retains the latest configuration from the discarded log prefix in order to support cluster
membership changes.

In think the straightforward solution is to save the last known configuration when taking a snapshot so that it can be restored.

@as58545652
Copy link
Author

One possible solution would be to take a snapshot only if there is no uncommitted configuration, and to restore the configuration from the snapshot if there is no configuration entry in the log.

There are degenerate cases where not taking a snapshot while there's an uncommitted configuration could lead to never taking a snapshot anymore.

There's this fragment in the raft dissertation in Chapter 5

Raft also retains the latest configuration from the discarded log prefix in order to support cluster
membership changes.

In think the straightforward solution is to save the last known configuration when taking a snapshot so that it can be restored.

Yeah, I think you might be right. We just need to save the previous configuration if there is an uncommitted one when taking a snapshot. But I have to say that an uncommitted configuration could still stand in the way because a snapshot can never be taken beyond any uncommitted index.

@cole-miller
Copy link
Contributor

cole-miller commented Nov 28, 2022

It seems pretty clear to me that the description of Raft in the dissertation envisions that

  1. only committed + applied entries are incorporated into a snapshot
  2. the InstallSnapshot RPC includes the "latest configuration as of lastIndex", which because of (1) can only be a committed configuration

So if we're storing uncommitted configurations in snapshots at all, that seems like a deviation from the algorithm as presented.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Bug Confirmed to be a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants