Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: update log progress on network migrations #12732

Merged
merged 13 commits into from
Dec 12, 2024

Conversation

virajbhartiya
Copy link
Member

Closes #12259

Code changes:

  • chain/consensus/filcns/upgrades.go:
    • Updated the ProgressLogPeriod from 5 minutes to 2 seconds in the PreUpgradeActorsV16 function.
    • Updated the ProgressLogPeriod from 10 seconds to 2 seconds in the UpgradeActorsV16 function.
    • Enhanced the Log method in the migrationLogger to provide more detailed logging for migration jobs, including the percentage of jobs completed and the rate of completion.

Documentation changes:

  • documentation/misc/Building_a_network_skeleton.md:
    • Updated the ProgressLogPeriod from 5 minutes to 2 seconds in the migration configuration example.
    • Updated the ProgressLogPeriod from 10 seconds to 2 seconds in another migration configuration example.

@virajbhartiya
Copy link
Member Author

@rvagg is there a way I can run this locally to test it out?

@rvagg
Copy link
Member

rvagg commented Nov 28, 2024

Yes, you can run this locally! make lotus-shed and then you can run ./lotus-shed migrate-state 25 CID where CID is a block from a recent tipset, it doesn't matter which block, just grab one of the CIDs from a tipset that's going to be in your blockstore (so within the last couple of days for a splitstore node). Easiest way is to watch the log output, look for New heaviest tipset an pick one of the CIDs in the tipset CID list it prints. This will run the nv25 migration which is the new one we have a skeleton for and will invoke the functions in upgrades.go. Note that you'll need to shut your node down while you run this, it needs exclusive access to the repo.

@virajbhartiya
Copy link
Member Author

Raised a PR in go-state-types @ filecoin-project/go-state-types#329

@BigLep
Copy link
Member

BigLep commented Dec 3, 2024

@virajbhartiya ; what are the next steps? Is this blocked until filecoin-project/go-state-types#329 lands?

metrics/metrics.go Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Dec 3, 2024

Screenshot 2024-12-03 at 7 51 02 pm

Screenshot 2024-12-03 at 7 51 34 pm

@virajbhartiya this is looking good so far but a couple of tweaks now that I see it:

  • Let's remove the Worker %d done log line, it's not helping anyone
  • Let's round the multiple time prints to 100ms (.Round(100*time.Millisecond) for the 4 of them, that I can see)

Here's what that looks like after hacking it up locally:

Screenshot 2024-12-03 at 7 58 02 pm

@rjan90 what do you think of this output? Printing every 2 seconds by default, but this can be configured per-migration if we have a particularly long one that might get spammy (hopefully we never have long ones!). Any feedback?

@rvagg
Copy link
Member

rvagg commented Dec 3, 2024

Also this needs a changelog entry. It might be nice to make SPs prepared because there's a lot of log grepping as upgrades happen.

btw the 4 redundant-seeming "done" prints may not be redundant in some upgrades - almost all of the >3M jobs will be done together but we may have one or two special jobs that keep on going because they are doing something complicated (like the nv22 upgrade with all its state mangling), and there may be deferred migrations to be run too.

@rjan90
Copy link
Contributor

rjan90 commented Dec 3, 2024

@rjan90 what do you think of this output? Printing every 2 seconds by default, but this can be configured per-migration if we have a particularly long one that might get spammy (hopefully we never have long ones!). Any feedback?

Two seconds sounds fine for smaller state nodes. But can we have an environment variable that makes it tweakable as well? Historical archive operators might want to tweak this a lot higher, since they will face hour long migrations, even if it's a non heavy network migration.

@rvagg
Copy link
Member

rvagg commented Dec 3, 2024

@virajbhartiya git grep Getenv..LOTUS_ -- *.go to see how we do env vars currently. How about we make LOTUS_MIGRATE_PROGRESS_LOG_SECONDS and we just parse an integer, and replace the config.ProgressLogPeriod field with it if it's set. If it's 0 then they turn it off. Alternative suggestions welcome of course @virajbhartiya @rjan90.

@rvagg
Copy link
Member

rvagg commented Dec 3, 2024

you can add in go-state-types@master to this PR now, we'll just have to wait for a tagged release but that should come fairly quickly

@virajbhartiya
Copy link
Member Author

I'll add the env variable for log period and also go-state-types@master

go.mod Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Dec 11, 2024

in my local checkout of this I rebased on master and got rid of the go-state-types submodule then added those env var parsing changes I mentioned and it seems to work nicely, setting the env var to 1:

2024-12-11T15:29:47.013+1100    INFO    bundle  bundle/bundle.go:60     manifest cid: bafy2bzacedb5sfhrzpd6joitui6wfyjhnfmmojyvlhlmpkhlrhkrh3tmwwb5e
2024-12-11T15:29:47.014+1100    INFO    fil-consensus   filcns/upgrades.go:3028 Started 29 workers
2024-12-11T15:29:47.014+1100    INFO    fil-consensus   filcns/upgrades.go:3028 Creating migration jobs
2024-12-11T15:29:47.014+1100    INFO    fil-consensus   filcns/upgrades.go:3028 Result writer started
2024-12-11T15:29:48.014+1100    INFO    fil-consensus   filcns/upgrades.go:3028 Performing migration: 316281 of 316292 jobs processed (316237/s) [1s elapsed]
2024-12-11T15:29:49.014+1100    INFO    fil-consensus   filcns/upgrades.go:3028 Performing migration: 643408 of 643408 jobs processed (321670/s) [2s elapsed]
2024-12-11T15:29:50.014+1100    INFO    fil-consensus   filcns/upgrades.go:3028 Performing migration: 952228 of 952247 jobs processed (317382/s) [3s elapsed]
2024-12-11T15:29:51.014+1100    INFO    fil-consensus   filcns/upgrades.go:3028 Performing migration: 1317248 of 1317248 jobs processed (329282/s) [4s elapsed]
2024-12-11T15:29:52.014+1100    INFO    fil-consensus   filcns/upgrades.go:3028 Performing migration: 1654826 of 1654845 jobs processed (330934/s) [5s elapsed]

Default prints out every 2s as expected. Setting it to 10 I don't get any because it runs fast enough to not get to print one. So I think we're good to go with those changes.

go.mod Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Dec 11, 2024

👍 looking good, couple of notes left

… to period across functions and documentation
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 we got there! thanks for coordinating across repos for this

@rvagg rvagg enabled auto-merge (squash) December 12, 2024 10:54
@rvagg rvagg merged commit 4f29e2e into filecoin-project:master Dec 12, 2024
81 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ☑️ Done (Archive)
Development

Successfully merging this pull request may close these issues.

Log progress on network migrations
4 participants