Fix some fsm issues #3350

Merged: 27 commits from the fix/some-fsm-issues branch, Aug 29, 2020

Conversation

@magik6k (Contributor) commented on Aug 27, 2020:

This PR grew a bit bigger than I originally expected, but it's really just a bunch of smallish fixes, so reviewing commit-by-commit should be pretty easy.

Summary of changes:

  • lotus-miner sectors remove [sid] now works for sectors in all states
  • We now print precommit/commit message CIDs in lotus-miner sectors status
  • Separate state for sending the Commit message
  • Auto-retry sending precommit/commit messages when they hit out-of-gas (a sketch of the retry idea follows the table below)
  • Recover from reorged dealIDs
  • Mostly deprecate the PackingFailed state
  • Show tasks assigned to worker windows, but not running yet (1)
  • Give more priority to FinalizeSector and other tasks that move data around
  • Don't start scheduling tasks for 3s after restart (lets the FSM tasks restart)
  • Auto-retry failed sector remove tasks
  • Get more pending tasks before calling trySched (should noticeably mitigate its slowness in bigger deployments)
  • Compact assigned worker windows
    • In the current scheduling logic, each worker has 2 resource-based windows
    • When a window is empty, it gets sent to the main scheduler goroutine, filled with some tasks, and then sent back to the worker for processing (the scheduler only holds empty / "open" windows; as soon as at least one task is put into a window, it's sent to the worker)
    • When the worker gets windows back from the scheduler, it starts executing them sequentially
    • In some scenarios this can waste a fair bit of resources: we may be refusing to execute smaller tasks that we could be running, simply because they aren't in the window currently being executed
    • This change implements logic which looks at 'newer' windows and, if it sees tasks that can fit into the currently executing window, moves them there so they hopefully execute sooner (a simplified sketch follows the table below)
  • Don't require tasks in the currently running window to be executed in order (ties in with the above; I'm not sure why it was done this way before)

(1) Example output showing tasks assigned to worker windows but not yet running:

ID  Sector  Worker  Hostname  Task  State        Time
2   11      2       avocado   PC1   running      45m1.2s
3   27      0       pancetta  GET   running      45m1s
0   62      3       banana    AP    running      3m2.4s
0   31      0       pancetta  AP    assigned(0)  45m1s
0   25      0       pancetta  PC1   assigned(0)  39m5s
0   63      0       pancetta  AP    assigned(0)  3m1.9s
0   64      0       pancetta  AP    assigned(0)  3m1.5s
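For the out-of-gas retry item above, here is a minimal, hedged sketch of the idea in Go. The names and the helper shape are hypothetical (the real behaviour lives in the sector FSM's state handlers, not in a function like this); the point is only that an out-of-gas execution error leads back to the submit step instead of a terminal *Failed state, so the message is re-estimated and re-pushed.

```go
// Hypothetical illustration of the retry-on-out-of-gas behaviour; not the
// actual lotus FSM code.
package example

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
)

// submitWithRetry re-sends the precommit/commit message while the failure is
// classified as an out-of-gas execution error, instead of parking the sector
// in a terminal *Failed state.
func submitWithRetry(
	ctx context.Context,
	send func(context.Context) (cid.Cid, error), // pushes the message with fresh gas estimation
	isOutOfGas func(error) bool, // error classifier supplied by the caller
	cooldown time.Duration,
) (cid.Cid, error) {
	for {
		c, err := send(ctx)
		if err == nil {
			return c, nil
		}
		if !isOutOfGas(err) {
			return cid.Undef, err // anything else goes through the normal failure states
		}
		select {
		case <-time.After(cooldown): // back off, then retry
		case <-ctx.Done():
			return cid.Undef, ctx.Err()
		}
	}
}
```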
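The window-compaction change (the last two bullets of the summary) can be illustrated with a similarly hedged sketch. The types below are made up for illustration; the real scheduler tracks several resource dimensions (CPU, GPU, memory) per task type rather than a single scalar. The idea is just that tasks from newer windows get pulled forward whenever they fit into the spare capacity of an earlier window, so they don't wait behind a window that is executing with room left over.

```go
package example

// task and window are stand-ins for the scheduler's real types; resources is a
// single scalar here instead of the scheduler's full resource table.
type task struct {
	id        int
	resources int
}

type window struct {
	todo []task
}

// compactWindows moves tasks from later windows into earlier ones whenever the
// earlier window still has spare capacity.
func compactWindows(windows []window, capacity int) []window {
	for i := range windows {
		used := 0
		for _, t := range windows[i].todo {
			used += t.resources
		}
		// Scan newer windows for tasks that fit into window i's spare capacity.
		for j := i + 1; j < len(windows); j++ {
			var keep []task
			for _, t := range windows[j].todo {
				if used+t.resources <= capacity {
					windows[i].todo = append(windows[i].todo, t)
					used += t.resources
				} else {
					keep = append(keep, t)
				}
			}
			windows[j].todo = keep
		}
	}
	// Drop windows that ended up empty so the worker doesn't spin on them.
	compacted := windows[:0]
	for _, w := range windows {
		if len(w.todo) > 0 {
			compacted = append(compacted, w)
		}
	}
	return compacted
}
```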

magik6k force-pushed the fix/some-fsm-issues branch from 6364f50 to 7806a98 on August 27, 2020 at 15:51
magik6k force-pushed the fix/some-fsm-issues branch from aa9fb45 to 489d523 on August 27, 2020 at 19:11
@arajasek (Contributor) left a comment:

The deal recovery stuff and UX/logging all LGTM. I'm less sure about some of the scheduling changes, but I think it all makes sense (or makes no less sense than the earlier code).

On the whole seems fiiine, I guess it all works.

An inline review comment on this diff hunk:

    }

    if height >= proposal.StartEpoch {
        // TODO: check if we are in an early enough state (before precommit), try to remove the offending pieces

sounds awful
