
Avoid writer poisoned errors #4533

Merged: 5 commits from lutter/restart into master on Apr 25, 2023

Conversation

@lutter (Collaborator) commented on Apr 11, 2023

If a subgraph encountered an error, it would get restarted with some backoff. If that error was caused by some store interaction, the WritableStore would remain in a poisoned state and refuse to process any more writes.

This PR changes that so that the WritableStore is restarted when a subgraph is restarted, clearing the error. That requires clearing any internal state that is based on assumptions about what has been written so far. I am not entirely sure that the changes in core/src/subgraph/runner.rs are enough to do that, and would appreciate a thorough review of what other state might have to be cleared.
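To make the failure mode concrete, here is a minimal, self-contained sketch of the behaviour described above, with hypothetical stand-in types (`SketchWriter`, `WriteError`, `restart`) rather than graph-node's actual `WritableStore` API: once a write fails, the writer stays poisoned and rejects every later write until it is recreated.

```rust
// Hypothetical stand-ins; not graph-node's real types.
#[derive(Debug, Clone)]
struct WriteError(String);

struct SketchWriter {
    // The first error that occurred, if any; set once and never cleared
    // except by recreating the writer.
    poisoned: Option<WriteError>,
}

impl SketchWriter {
    fn new() -> Self {
        SketchWriter { poisoned: None }
    }

    fn write(&mut self, value: &str) -> Result<(), WriteError> {
        // Once poisoned, refuse all further writes with the original error.
        if let Some(err) = &self.poisoned {
            return Err(err.clone());
        }
        // Simulate a failed store interaction for one particular value.
        if value == "bad" {
            let err = WriteError("store interaction failed".to_string());
            self.poisoned = Some(err.clone());
            return Err(err);
        }
        println!("wrote {value}");
        Ok(())
    }

    // What this PR does, conceptually: restarting the subgraph recreates
    // the writer, which clears the poisoned state.
    fn restart(&mut self) {
        *self = SketchWriter::new();
    }
}

fn main() {
    let mut writer = SketchWriter::new();
    writer.write("ok").unwrap();
    assert!(writer.write("bad").is_err());
    // Before this PR: the writer stays poisoned, so even good writes fail.
    assert!(writer.write("ok").is_err());
    // After this PR: the restart clears the error and writes succeed again.
    writer.restart();
    writer.write("ok").unwrap();
}
```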

@leoyvens (Collaborator) commented:

For correctness, you should use SubgraphRunner::revert_state, which is supposed to clear any dirty state written at or beyond a given block number.

The current approach also has a performance issue: it clears the entity cache on every Action::Restart. But Action::Restart is also issued when a dynamic data source is created, where clearing the cache would be unnecessarily detrimental to performance. So we want to revert the state in the error case, but not in the created-data-source case. One way to do this would be to add an Action::RestartFromError(BlockPtr), used when restarting mid-block with potentially dirty state.
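A rough sketch of that split might look like the following; the `Runner` struct, the extra `Action` variants, and the use of a plain block number in place of a `BlockPtr` are simplifications assumed here for illustration, not graph-node's actual definitions.

```rust
// Simplified, assumed types for illustration only.
type BlockNumber = i32;

#[allow(dead_code)]
#[derive(Debug)]
enum Action {
    Continue,
    // Restart triggered by a newly created dynamic data source; the entity
    // cache can be kept.
    Restart,
    // Restart triggered by an error, with potentially dirty state at or
    // beyond this block.
    RestartFromError(BlockNumber),
    Stop,
}

struct Runner {
    entity_cache_entries: usize,
}

impl Runner {
    // Stand-in for SubgraphRunner::revert_state: discard any state written
    // at or beyond `block`, including the entity cache.
    fn revert_state(&mut self, block: BlockNumber) {
        println!("reverting state at or beyond block {block}");
        self.entity_cache_entries = 0;
    }

    fn handle(&mut self, action: Action) {
        match action {
            // Only an error-driven restart reverts potentially dirty state.
            Action::RestartFromError(block) => self.revert_state(block),
            // A plain restart (new data source) keeps the warm cache.
            Action::Restart | Action::Continue | Action::Stop => {}
        }
    }
}

fn main() {
    let mut runner = Runner { entity_cache_entries: 42 };
    runner.handle(Action::Restart);
    assert_eq!(runner.entity_cache_entries, 42);
    runner.handle(Action::RestartFromError(17));
    assert_eq!(runner.entity_cache_entries, 0);
}
```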

@lutter (Collaborator, Author) commented on Apr 14, 2023

I added a commit that hopefully addresses that.

let logger = self.store.logger.clone();
if let Err(e) = self.stop().await {
    warn!(logger, "Writable had error when stopping, it is safe to ignore this error"; "error" => e.to_string());
}
Review comment (Collaborator):

Is it sufficient to send a stop request here? Shouldn't this also join the task handle of the writer, to ensure it has finished?

@lutter (Collaborator, Author) replied:

I am actually wondering if this is too defensive - if the writer is poisoned, it's because we had an error in the Queue, and start_writer will return, i.e. shut down, on any error. Maybe I should just remove that code and leave a comment here?

@lutter (Collaborator, Author) added:

Actually, this would always fail - when a writer has been poisoned, it doesn't allow adding anything more to the queue. I changed this code to just log a warning if the join handle indicates that the background writer is still running. Joining here is a bit tricky since it risks blocking indexing of that subgraph if we really do have a problem with the writer stopping.
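To illustrate that trade-off, here is a small hedged sketch using tokio: the writer's join handle is only checked with `JoinHandle::is_finished` and a warning is printed if it has not stopped, rather than awaiting it, so the caller can never block on a stuck writer. The task and function names are hypothetical, not graph-node's code.

```rust
use tokio::task::JoinHandle;
use tokio::time::{sleep, Duration};

// Check the background writer without awaiting it; awaiting would guarantee
// shutdown but could block the subgraph runner if the writer is stuck.
fn check_writer(handle: &JoinHandle<()>) {
    if !handle.is_finished() {
        // In graph-node this would go through the subgraph logger as a warn!.
        eprintln!("warning: background writer is still running");
    }
}

#[tokio::main]
async fn main() {
    // Hypothetical stand-in for the background writer task.
    let writer: JoinHandle<()> = tokio::spawn(async {
        sleep(Duration::from_millis(50)).await;
    });

    // Non-blocking check plus a warning, as described in the comment above.
    check_writer(&writer);

    // Awaiting the handle is the alternative that was rejected here because
    // it risks blocking indexing of the subgraph.
    let _ = writer.await;
}
```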

lutter added 5 commits April 25, 2023 11:28
When the subgraph runner encounters an error, it needs to restart the store
to clear any errors that might have happened.
We only need to reset the state of the runner when the `WritableStore`
actually had to be restarted because of an error; when there was an error,
the state must be reset.

Use `SubgraphRunner.revert_state` to properly reset the runner state.
@lutter (Collaborator, Author) commented on Apr 25, 2023

Rebased to latest master.

@lutter merged commit 2ce7566 into master on Apr 25, 2023
@lutter deleted the lutter/restart branch on April 25, 2023 at 19:10
@lutter (Collaborator, Author) commented on Apr 25, 2023

Whoops ... I screwed up: I thought this had already been approved, and merged it.
