[BUG] Can only configure 3 clusters on a submitter host #204
Comments
cartalla
added a commit
that referenced
this issue
May 10, 2024
Add support for rhel9 and rocky9. Resolves #229 Set SubmitterInstanceTags based on RESEnvironmentName. Remove SubmitterSecurityGroupIds parameter. This option added rules to existing security groups and if they were used by multiple clusters then the number of security group rules would exceed the maximum allowed. With the addition of adding security groups to the head and compute nodes the customer should supply their own security groups that meet the slurm cluster requirements, attach them to their login nodes and configure them as additional security groups for the head and compute nodes. Resolves #204
cartalla
added a commit
that referenced
this issue
May 10, 2024
Add support for rhel9 and rocky9. Had to update some of the ansible playbooks to mimic rhel8 changes. Resolves #229 Set SubmitterInstanceTags based on RESEnvironmentName. Remove SubmitterSecurityGroupIds parameter. This option added rules to existing security groups and if they were used by multiple clusters then the number of security group rules would exceed the maximum allowed. With the addition of adding security groups to the head and compute nodes the customer should supply their own security groups that meet the slurm cluster requirements, attach them to their login nodes and configure them as additional security groups for the head and compute nodes. Resolves #204
cartalla
added a commit
that referenced
this issue
May 13, 2024
Add support for rhel9 and rocky9. Had to update some of the ansible playbooks to mimic rhel8 changes. Resolves #229 Set SubmitterInstanceTags based on RESEnvironmentName. Remove SubmitterSecurityGroupIds parameter. This option added rules to existing security groups and if they were used by multiple clusters then the number of security group rules would exceed the maximum allowed. With the addition of adding security groups to the head and compute nodes the customer should supply their own security groups that meet the slurm cluster requirements, attach them to their login nodes and configure them as additional security groups for the head and compute nodes. Resolves #204 Update CallSlurmRestApiLambda from Python 3.8 to 3.9. Resolves #230 Update CDK version to 2.111.0. This is the latest version supported by nodejs 16. Really need to move to nodejs 20, but it isn't supported on Amazon Linux 2 or RHEL 7 family. Would require either running in a Docker container or on a newer OS version. I think that I'm going to change the prerequisites for the OS distribution so that I can stay on the latest tools. For example, I can't update to Python 3.12 until I do this.
cartalla
added a commit
that referenced
this issue
May 13, 2024
Add support for rhel9 and rocky9. Had to update some of the ansible playbooks to mimic rhel8 changes. Resolves #229 Set SubmitterInstanceTags based on RESEnvironmentName. Remove SubmitterSecurityGroupIds parameter. This option added rules to existing security groups and if they were used by multiple clusters then the number of security group rules would exceed the maximum allowed. With the addition of adding security groups to the head and compute nodes the customer should supply their own security groups that meet the slurm cluster requirements, attach them to their login nodes and configure them as additional security groups for the head and compute nodes. Resolves #204 Update CallSlurmRestApiLambda from Python 3.8 to 3.9. Resolves #230 Update CDK version to 2.111.0. This is the latest version supported by nodejs 16. Really need to move to nodejs 20, but it isn't supported on Amazon Linux 2 or RHEL 7 family. Would require either running in a Docker container or on a newer OS version. I think that I'm going to change the prerequisites for the OS distribution so that I can stay on the latest tools. For example, I can't update to Python 3.12 until I do this. Update DeconfigureRESUsersGroupsJson to pass if last statement fails.
cartalla
added a commit
that referenced
this issue
May 13, 2024
Add support for rhel9 and rocky9. Had to update some of the ansible playbooks to mimic rhel8 changes. Resolves #229 Set SubmitterInstanceTags based on RESEnvironmentName. Remove SubmitterSecurityGroupIds parameter. This option added rules to existing security groups and if they were used by multiple clusters then the number of security group rules would exceed the maximum allowed. With the addition of adding security groups to the head and compute nodes the customer should supply their own security groups that meet the slurm cluster requirements, attach them to their login nodes and configure them as additional security groups for the head and compute nodes. Resolves #204 Update CallSlurmRestApiLambda from Python 3.8 to 3.9. Resolves #230 Update CDK version to 2.111.0. This is the latest version supported by nodejs 16. Really need to move to nodejs 20, but it isn't supported on Amazon Linux 2 or RHEL 7 family. Would require either running in a Docker container or on a newer OS version. I think that I'm going to change the prerequisites for the OS distribution so that I can stay on the latest tools. For example, I can't update to Python 3.12 until I do this. Update DeconfigureRESUsersGroupsJson to pass if last statement fails. Fix bug in create_slurm_accounts.py Resolves #231
Describe the bug
If you try to configure a submitter host as a login node for a 4th cluster, the stack fails with the following error:
The stack adds rules to the security groups configured in slurm/SubmitterSecurityGroupIds to allow connections to the head node's NFS exports and Slurm controller ports.
The head node security group has ingress rules allowing connections from submitters.
I suspect that usually there would only be 1 security group for user remote desktops (DCV, VNC, etc.).
So, I don't anticipate hitting the limit on the head node's ingress rules.
However, multiple security group rules are added to the submitter's security group for each cluster.
Right now the limit on the number of rules in that security group is exceeded when the 4th cluster is configured.
Each cluster also creates a submitter security group that could be attached to the submitter instances, but I think that there is a limit of 5 security groups that can be attached to an instance, so using those security groups would still hit a limit after a small number of clusters.
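To make the two limits concrete, here is a rough back-of-the-envelope sketch. The numbers are assumptions, not measurements: `RULES_QUOTA` is an assumed rules-per-security-group quota, `RULES_PER_CLUSTER` is a hypothetical count picked only so the arithmetic matches the failure at the 4th cluster, and `GROUPS_PER_INSTANCE` is the limit of 5 guessed at above. `max_clusters` is a helper made up for this illustration.

```python
def max_clusters(quota: int, per_cluster: int) -> int:
    """Return how many clusters fit under a quota when each one consumes per_cluster units."""
    if per_cluster <= 0:
        raise ValueError("per_cluster must be positive")
    return quota // per_cluster


# Assumed values, for illustration only.
RULES_QUOTA = 60         # assumed quota on rules in one security group
RULES_PER_CLUSTER = 16   # hypothetical; the real per-cluster count is not stated here
GROUPS_PER_INSTANCE = 5  # the limit guessed at above; may differ per account

# Option 1: every cluster adds rules to the shared submitter security group.
print("clusters before the rule limit:", max_clusters(RULES_QUOTA, RULES_PER_CLUSTER))

# Option 2: attach each cluster's own submitter security group to the instance.
print("clusters before the attachment limit:", max_clusters(GROUPS_PER_INSTANCE, 1))
```

Either way the ceiling is a small constant, which is why neither option scales to many clusters.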
I think that the best way to create the required security group rules without hitting limits may be to configure a single submitter security group that is allowed access to the cluster. The same security group could be used for multiple clusters. Each cluster would only be accessible to submitter hosts that belong to the configured security group. However, that submitter security group would also need a destination security group in the cluster, and that destination group would have to be passed to the cluster as well.
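One possible shape for the single-submitter-security-group idea, as a sketch only and not the project's implementation: each cluster adds ingress rules to its own head node security group that name the shared submitter group as the source, so nothing accumulates on the submitter group. `build_ingress_permissions`, the security group ID, and the port numbers (2049 for NFS, 6817 for slurmctld) are assumptions for illustration; the actual ports depend on the cluster configuration.

```python
from typing import Dict, List


def build_ingress_permissions(submitter_sg_id: str, ports: Dict[str, int]) -> List[dict]:
    """Build one TCP ingress rule per port, with the submitter security group as the source."""
    permissions = []
    for name, port in sorted(ports.items()):
        permissions.append({
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # Referencing a group instead of CIDRs keeps the rule count per cluster fixed.
            "UserIdGroupPairs": [
                {"GroupId": submitter_sg_id, "Description": f"{name} from submitters"}
            ],
        })
    return permissions


# Hypothetical ID and assumed default ports.
perms = build_ingress_permissions("sg-0123456789abcdef0", {"nfs": 2049, "slurmctld": 6817})
for perm in perms:
    print(perm["FromPort"], perm["UserIdGroupPairs"][0]["Description"])

# The list has the shape boto3 expects for IpPermissions, e.g. (not executed here):
#   ec2.authorize_security_group_ingress(GroupId=head_node_sg_id, IpPermissions=perms)
```

With this layout the number of rules on any one head node security group depends only on the number of ports, not on how many clusters share the submitter group. Outbound rules on the submitter group are not covered by this sketch.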
Need to think about this a bit to figure out a solution that is relatively easy to implement and use.