
Add retry mechanics to pallet-scheduler #3060

Merged
44 commits merged on Feb 16, 2024

Conversation

georgepisaltu
Contributor

Fixes #3014

This PR adds retry mechanics to pallet-scheduler, as described in the issue above.

Users can now set a retry configuration for a task so that, in case its scheduled run fails, it will be retried after a number of blocks, for a specified number of times or until it succeeds.

If a retried task runs successfully before running out of retries, its remaining retry counter will be reset to the initial value. If a retried task runs out of retries, it will be removed from the schedule.

Tasks which need to be scheduled for a retry are still subject to weight metering and agenda space, same as a regular task. Periodic tasks will have their periodic schedule put on hold while the task is retrying.
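
For illustration, a minimal sketch of the two-step flow, written in the style of the scheduler's own mock-runtime tests; the `set_retry` name comes from the discussion below, but its exact signature and the mock helpers (`new_test_ext`, `Logger`, `root()`) are assumptions, not quotes from this PR.

```rust
// Hedged sketch: schedule a task, then attach a retry configuration to it.
// Signatures and mock-runtime helpers are assumed, not copied from the PR.
#[test]
fn schedule_then_set_retry() {
	new_test_ext().execute_with(|| {
		// Step 1: schedule a call for block 4 (non-periodic, priority 127).
		let address = Scheduler::do_schedule(
			DispatchTime::At(4),
			None,
			127,
			root(),
			Preimage::bound(RuntimeCall::Logger(LoggerCall::log {
				i: 42,
				weight: Weight::from_parts(10, 0),
			}))
			.unwrap(),
		)
		.unwrap();
		// Step 2: if the call fails when it runs, retry it up to 3 times,
		// waiting 10 blocks between attempts.
		assert_ok!(Scheduler::set_retry(root().into(), address, 3, 10));
	});
}
```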

@georgepisaltu georgepisaltu requested a review from muharem January 25, 2024 11:19
@georgepisaltu georgepisaltu self-assigned this Jan 25, 2024
@georgepisaltu georgepisaltu requested review from a team January 25, 2024 11:19
Signed-off-by: georgepisaltu <[email protected]>
@mordamax
Contributor

bot bench-all pallet -v PIPELINE_SCRIPTS_REF=mak-bench-all-pallet --pallet=pallet_scheduler

@command-bot

command-bot bot commented Jan 25, 2024

@mordamax https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/5029575 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/bench-all/bench-all.sh" --pallet=pallet_scheduler. Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 1-44309cf4-2563-41eb-9bf6-8c037436c2c7 to cancel this command or bot cancel to cancel all commands in this pull request.

@paritytech-review-bot paritytech-review-bot bot requested a review from a team January 25, 2024 15:47
@command-bot

command-bot bot commented Jan 25, 2024

@mordamax Command "$PIPELINE_SCRIPTS_DIR/commands/bench-all/bench-all.sh" --pallet=pallet_scheduler has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/5029575 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/5029575/artifacts/download.

@mordamax
Contributor

bot help

@command-bot

command-bot bot commented Jan 26, 2024

Here's a link to docs

@@ -481,8 +499,15 @@ pub mod pallet {
/// will be removed from the schedule.
///
/// Tasks which need to be scheduled for a retry are still subject to weight metering and
/// agenda space, same as a regular task. Periodic tasks will have their periodic schedule
/// put on hold while the task is retrying.
/// agenda space, same as a regular task. If a periodic task fails, it will be scheduled
Contributor

It may be worth mentioning that we are talking about a possibly recoverable failure.

Contributor Author

My view on this is that we shouldn't mention in the pallet docs what type of calls this applies to, because the pallet itself cannot distinguish between a recoverable and an unrecoverable failure. Only a user can do that, and they should update their retry config accordingly.

@@ -1124,11 +1258,17 @@ impl<T: Config> Pallet<T> {
},
Err(()) => Err((Overweight, Some(task))),
Contributor

In this and the Err arm above, you abandon the retry entry if there is one.
I would move the retry handling one level above; for this you just need to add, for the Ok variant, an indication of whether it failed or not.
Or at least handle it before execute_dispatch.
Please write tests for this.

Contributor Author

in this and the Err arm above you abandon the retry entry if there is any

The Err arms above deal with completely different problems:

  • Err(Unavailable) is a task which may never be able to run and it may never even be attempted again
  • Err(Overweight) is a task which couldn't run because the scheduler ran out of weight, but it will be run the next time the scheduler services agendas; the scheduler runs agendas from previous blocks if they didn't execute completely, in order of block number from earliest up until present

This means that an overweight task is not abandoned and will be run again soon, but also that there is absolutely nothing that we can do about it. There is no fallback for the current block if we run out of weight. This is the scheduler's behavior before this PR and I believe the retry mechanism has nothing to do with this.

I would move the retry handling one level above, for this you just need to add for Ok varian an info if it failed or not.

Most of the logic is common to failed and successful tasks, so there would be a lot of copied code. But if you just want this refactored, I could move the common logic out of the match.

or at least take it before execute_dispatch

Tasks are scheduled for retries if they actually run but fail in their execution. By definition, there is no way to handle this before running execute_dispatch.
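
To make the ordering concrete, a simplified sketch (not the PR's exact code) of why the retry decision can only be taken from the result of execute_dispatch; the Retries storage key and the schedule_retry signature are assumptions.

```rust
// Simplified sketch of the servicing path: only after `execute_dispatch`
// returns do we know whether the call actually ran and failed.
match Self::execute_dispatch(weight, task.origin.clone(), call) {
	// The scheduler ran out of weight: the task was never attempted, it stays
	// in its agenda and will be serviced again, so the retry entry is untouched.
	Err(()) => Err((Overweight, Some(task))),
	// The call was dispatched; only a failed dispatch with a retry
	// configuration attached should spawn a retry (key/signature assumed).
	Ok(result) => {
		if result.is_err() && Retries::<T>::contains_key((when, agenda_index)) {
			Self::schedule_retry(weight, now, when, agenda_index, &task);
		}
		Ok(())
	},
}
```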

Contributor

I thought that the task is removed if Unavailable, but now I see that we keep it in storage. This is why I proposed to remove the retry as well. And I also see now that Overweight does not change its address.

Comment on lines 549 to 550
/// derived from the original task's configuration, but will have a lower value for
/// `remaining` than the original `total_retries`. Tasks scheduled as a result of a retry of
Contributor

Can you explain why you think it should be this way?
The way I think about it: I want some call to be executed 5 times, and when any of those executions fails, I want to retry it 3 times. I want to be sure that every execution gets a chance to be retried 3 times. If the first one exhausts all retries for some unexpected reason, I don't want to leave the remaining executions without retries. It is hard for me to model such a situation.

Contributor Author

The way you described it is the way it works.

A periodic task that fails at any point during its periodic execution will have a clone of itself scheduled as a retry. That clone will have a retry configuration attached, with remaining initially set to total_retries - 1. The original task still has its own unchanged retry configuration. The retry task will keep being retried until it runs out of retries. The periodic task keeps running periodically, potentially spawning other retry clones if it fails again.

For non-periodic tasks, there is no clone; the task itself becomes the retry task, because there is no point in keeping clones around for something which will never run again.

In short, the point of the doc comment is to say that a retry configuration with remaining == total_retries means it's the original task. A retry configuration with remaining < total_retries means it's a retry clone.

I will try to make it clearer in the docs.
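
For reference, the retry configuration being described might look roughly like this; the field names total_retries and remaining come from the quoted doc comment, while the exact struct layout is an assumption.

```rust
/// Sketch of a per-task retry configuration (exact layout assumed).
pub struct RetryConfig<Period> {
	/// Number of retries the user originally asked for.
	pub total_retries: u8,
	/// Retries still available. `remaining == total_retries` identifies the
	/// original task; `remaining < total_retries` identifies a retry clone.
	pub remaining: u8,
	/// Number of blocks to wait between a failed run and the next attempt.
	pub period: Period,
}
```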

Contributor Author

Done.

@muharem
Contributor

muharem commented Feb 13, 2024

I do not know your motivation for a PR from your fork, but it makes it harder for me to pull your branch and review it on my local machine.

Comment on lines +1388 to +1397
if weight
	.try_consume(T::WeightInfo::schedule_retry(T::MaxScheduledPerBlock::get()))
	.is_err()
{
	Self::deposit_event(Event::RetryFailed {
		task: (when, agenda_index),
		id: task.maybe_id,
	});
	return;
}
Contributor

I meant that it would be better to have this in the service_task context. Right now schedule_retry is fallible/a no-op, which is not clear from the service_task context.

Contributor Author

The downside is that you can't know ahead of time in service_task whether the task will fail and whether it has a retry config. You could check that you can consume the weight anyway, but then you might be preventing users from running their tasks when it wasn't necessary.

Contributor

@NachoPal NachoPal left a comment

LGTM. Just wondering if you considered extending the Scheduled struct with RetryConfig fields instead of creating a new Retries storage item. That way it would natively support retries when calling schedule(), without needing to do it in two steps (schedule() + set_retry()).

@georgepisaltu
Contributor Author

LGTM. Just wondering if you considered extending the Scheduled struct with RetryConfig fields instead of creating a new Retries storage item. That way it would natively support retries when calling schedule(), without needing to do it in two steps (schedule() + set_retry()).

I considered it but I decided against it. This is a niche mechanic which we don't expect to be widely used right now. Doing it in Scheduled would bloat existing Scheduled entries. It's not backwards compatible in terms of storage (would require a migration) or API. It makes more sense for the few users of this mechanic to pay extra than for everyone to pay more for something that is not a core feature. If, in the future, we find that a lot of people are doing retries, we can reconsider in order to optimize costs.
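
As a sketch of the design choice described here, the retry configuration can live in its own map keyed by task address, leaving Scheduled entries untouched; the hashers and exact type parameters below are assumptions.

```rust
// Sketch only: retry configs in a separate map, so only tasks that opt into
// retries pay for the extra storage. Hashers/types are assumptions.
#[pallet::storage]
pub type Retries<T: Config> = StorageMap<
	_,
	Blake2_128Concat,
	TaskAddress<BlockNumberFor<T>>,
	RetryConfig<BlockNumberFor<T>>,
	OptionQuery,
>;
```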

@georgepisaltu georgepisaltu added this pull request to the merge queue Feb 16, 2024
Merged via the queue into paritytech:master with commit 9346019 Feb 16, 2024
129 of 130 checks passed
@georgepisaltu georgepisaltu deleted the retry-schedule branch February 16, 2024 12:07
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024
Labels

A0-needs_burnin: Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix.
I5-enhancement: An additional feature request.
T2-pallets: This PR/Issue is related to a particular pallet.
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

Scheduler: Retry a Task on a Call Failure
5 participants