Running install.sh with --cdk-cmd update in rapid succession can damage the cluster #221

Closed
gwolski opened this issue Apr 12, 2024 · 1 comment · Fixed by #244

gwolski commented Apr 12, 2024

I ran --cdk-cmd update to change my instance selections. Then I realized I wanted an additional change, so I modified my config file and ran the update again. Unfortunately, this corrupted my cluster because the two commands ran too close together: the second command attempted a rollback, and the rollback failed.

Can we add a check to ensure the CloudFormation stack is not in an IN_PROGRESS state before allowing install.sh to update?

To reproduce, change some instances in your config and run the update, then change the config and run the update again before the first one finishes.

@cartalla cartalla self-assigned this Apr 24, 2024
@cartalla
Contributor

Add a check that the cluster stack isn't already being updated or in a bad state, and abort the install if it is.
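
For anyone who wants a guard like this in their own wrapper script, here is a minimal sketch in Python. It is not the installer's actual code. The function names (classify_stack_status, get_stack_status, check_stack_updatable) and the choice of which statuses count as "bad" are assumptions made for illustration; the only AWS call used is boto3's CloudFormation describe_stacks, and the status strings are the standard CloudFormation StackStatus values.

```python
# Statuses from which CloudFormation accepts a stack update.
UPDATABLE = {
    "CREATE_COMPLETE",
    "UPDATE_COMPLETE",
    "UPDATE_ROLLBACK_COMPLETE",
    "IMPORT_COMPLETE",
    "IMPORT_ROLLBACK_COMPLETE",
}


def classify_stack_status(status):
    """Return 'ok', 'busy', or 'bad' for a CloudFormation StackStatus string."""
    if status in UPDATABLE:
        return "ok"
    if status.endswith("_IN_PROGRESS"):
        return "busy"
    # Assumption: everything else (*_FAILED, ROLLBACK_COMPLETE, DELETE_COMPLETE)
    # needs manual attention before an update is attempted.
    return "bad"


def get_stack_status(stack_name):
    """Look up the current status with boto3 (imported lazily so the rest runs without it)."""
    import boto3

    cfn = boto3.client("cloudformation")
    return cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]


def check_stack_updatable(stack_name, get_status=get_stack_status):
    """Return (ok, message); the caller aborts the update when ok is False."""
    status = get_status(stack_name)
    verdict = classify_stack_status(status)
    if verdict == "ok":
        return True, f"{stack_name} is {status}; safe to update."
    if verdict == "busy":
        return False, f"{stack_name} is {status}; wait for the operation to finish."
    return False, f"{stack_name} is {status}; fix the stack before updating."
```

A wrapper would call check_stack_updatable with the cluster's stack name and exit non-zero when it returns False, before invoking install.sh. Note that describe_stacks raises an error for a stack that does not exist yet, so a first-time install would need to handle that case separately.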

cartalla added a commit that referenced this issue Jul 12, 2024
Add support for ParallelCluster 3.10.0.

Add alinux2023 support.

Add support for external slurmdbd instance.

Update documentation.

Change the UID of the slurm user to 401 to match what ParallelCluster uses.
Otherwise munge flags security errors because the UID of the submitter doesn't match the head node.

Change the UpdateHeadNode lambda to only do the update via ssm if the cluster isn't already being updated.

Resolves #242

Change the installer so that it checks to make sure that the cluster stack
isn't already being changed or in a bad state.

Resolves #221
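
The UpdateHeadNode change in the commit message above follows the same idea: skip the SSM-driven update while the cluster is mid-change. A rough sketch of that control flow is below. It is a guess at the shape, not the lambda's real code. The function update_head_node and its injected get_status and send_command callables are hypothetical; in a real Lambda they would presumably wrap boto3 calls such as CloudFormation describe_stacks and SSM send_command.

```python
def update_head_node(cluster_name, get_status, send_command):
    """Run the head-node update unless the cluster is already being changed.

    get_status(cluster_name) returns a status string such as UPDATE_IN_PROGRESS.
    send_command(cluster_name) triggers the update and returns a command id.
    Both are injected (hypothetical) so the sketch runs without AWS access.
    """
    status = get_status(cluster_name)
    if status.endswith("_IN_PROGRESS"):
        # Assumption: an in-flight cluster update should not be raced by an
        # SSM command, so report the skip instead of sending anything.
        return {"updated": False, "reason": f"cluster is {status}"}
    command_id = send_command(cluster_name)
    return {"updated": True, "command_id": command_id}
```

Passing the two callables in keeps the decision logic testable on its own, which matters here because the failure mode only shows up when two operations overlap.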
cartalla added a commit that referenced this issue Jul 12, 2024
Add support for ParallelCluster 3.10.0.

Add alinux2023 support.

Add support for external slurmdbd instance.

Update documentation.

Change the UID of the slurm user to 401 to match what ParallelCluster uses.
Otherwise munge flags security errors because the UID of the submitter doesn't match the head node.

Change the UpdateHeadNode lambda to only do the update via ssm if the cluster isn't already being updated.

Resolves #242

Change the installer so that it checks to make sure that the cluster stack
isn't already being changed or in a bad state.

Resolves #221

Add support for ParallelCluster 3.10.1.

Resolves #243