[BUG] Can only configure 3 clusters on a submitter host #204

Closed
cartalla opened this issue Feb 12, 2024 · 0 comments · Fixed by #232

Describe the bug

If you try to configure a submitter host as a login node for a 4th cluster, the stack fails with the following error:

Resource handler returned message: "The maximum number of rules per security group has been reached. (Service: Ec2, Status Code: 400, Request ID: 3ab7f4c4-cdea-4ada-97fe-f650b116f7f1)" (RequestToken: ff9bc132-d76c-50cd-9f8c-41db329f01d1, HandlerErrorCode: ServiceLimitExceeded)

The stack adds rules to the security groups configured in slurm/SubmitterSecurityGroupIds to allow connections to the head node's NFS exports and Slurm controller ports.
The head node security group has ingress rules allowing connections from submitters.
I suspect that usually there would only be 1 security group for user remote desktops (DCV, VNC, etc.).
So, I don't anticipate hitting the limit on the head node's ingress rules.

However, multiple security group rules are added to the submitter's security group for each cluster.
Right now that limit is exceeded with the 4th configured cluster.
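
To make the arithmetic concrete, here is a rough sketch that counts the rules already present on a security group and estimates how many more clusters would fit. It assumes the input is shaped like one entry of boto3's `describe_security_groups()["SecurityGroups"]` response, and it assumes the default quota of 60 rules per direction (the quota is adjustable, so check your account). The figure of 16 rules per cluster is hypothetical and chosen only for illustration; the actual number added by the stack is not stated in this issue.

```python
from __future__ import annotations

# Keys under which an EC2 IpPermission lists its sources or destinations.
SOURCE_KEYS = ("IpRanges", "Ipv6Ranges", "PrefixListIds", "UserIdGroupPairs")


def count_rules(permissions: list[dict]) -> int:
    """Count one rule per source/destination entry in each permission.

    Note: a prefix list reference really counts as the prefix list's
    maximum entry count; it is treated as a single rule here for simplicity.
    """
    return sum(len(p.get(key, [])) for p in permissions for key in SOURCE_KEYS)


def clusters_that_fit(used: int, per_cluster: int, quota: int = 60) -> int:
    """Estimate how many more clusters can add rules before the quota is hit."""
    if per_cluster <= 0:
        raise ValueError("per_cluster must be positive")
    return max(0, (quota - used) // per_cluster)


# Hypothetical submitter security group, shaped like a boto3 response entry.
submitter_sg = {
    "GroupId": "sg-EXAMPLE",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "UserIdGroupPairs": [{"GroupId": "sg-A"}, {"GroupId": "sg-B"}]},
        {"IpProtocol": "tcp", "FromPort": 8443, "ToPort": 8443,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    ],
    "IpPermissionsEgress": [
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}

ingress_used = count_rules(submitter_sg["IpPermissions"])
RULES_PER_CLUSTER = 16  # hypothetical value, not taken from the stack
print(ingress_used, clusters_that_fit(ingress_used, RULES_PER_CLUSTER))  # 3 3
```

With these illustrative numbers, three clusters fit and the fourth exceeds the quota, which matches the shape of the failure described above.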

Each cluster also creates a submitter security group that could be attached to the submitter instances. However, I think there is a limit of 5 security groups that can be attached to an instance, so using those security groups would still hit a limit after a small number of clusters.

I think the best way to create the required security group rules without hitting limits may be to configure a single submitter security group that is allowed access to the cluster. The same security group could be used for multiple clusters, and each cluster would only be accessible to submitter hosts that belong to the configured security group. However, that security group should also have a destination security group in the cluster, which would have to be passed to the cluster as well.
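
The idea of a single shared submitter security group can be sketched with a toy model. This is not the stack's implementation: the `SecurityGroup` class and `add_cluster` helper are hypothetical, and the model assumes the shared group keeps an allow-all egress rule so that no per-cluster rules need to be added to it. If egress were restricted, the destination security group mentioned above would be needed as well. The ports used are Slurm's default slurmctld port (6817) and the standard NFS port (2049); a real cluster may need more.

```python
from __future__ import annotations

from dataclasses import dataclass, field

SLURMCTLD_PORT = 6817  # Slurm's default slurmctld port
NFS_PORT = 2049        # standard NFS port


@dataclass
class SecurityGroup:
    """Toy stand-in for a security group: a name plus its ingress rules."""
    name: str
    # Each rule is (source security group name, port).
    ingress: list[tuple[str, int]] = field(default_factory=list)


def add_cluster(shared: SecurityGroup, cluster_name: str) -> SecurityGroup:
    """Create a head node group whose ingress rules reference the shared group.

    The rules live on the cluster's own group, so the shared submitter
    group is never modified.
    """
    head = SecurityGroup(f"{cluster_name}-head-node")
    for port in (SLURMCTLD_PORT, NFS_PORT):
        head.ingress.append((shared.name, port))
    return head


shared = SecurityGroup("shared-submitter")
heads = [add_cluster(shared, f"cluster{i}") for i in range(1, 11)]

# The shared group's rule count stays at zero no matter how many clusters
# are added; each head node group carries a small, fixed number of rules.
print(len(shared.ingress), [len(h.ingress) for h in heads][:3])  # 0 [2, 2, 2]
```

Under these assumptions the rule count on the submitter side no longer grows with the number of clusters, which is the property the proposal is after.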

Need to think about this a bit to figure out a solution that is relatively easy to implement and use.

cartalla added a commit that referenced this issue May 10, 2024
Add support for rhel9 and rocky9.

Resolves #229

Set SubmitterInstanceTags based on RESEnvironmentName.

Remove SubmitterSecurityGroupIds parameter.
This option added rules to existing security groups, and if those groups were used by multiple clusters the number of security group rules would exceed the maximum allowed.
Now that additional security groups can be added to the head and compute nodes, the customer should supply their own security groups that meet the Slurm cluster requirements, attach them to their login nodes, and configure them as additional security groups for the head and compute nodes.

Resolves #204
cartalla added a commit that referenced this issue May 10, 2024
Add support for rhel9 and rocky9.
Had to update some of the ansible playbooks to mimic rhel8 changes.

Resolves #229

Set SubmitterInstanceTags based on RESEnvironmentName.

Remove SubmitterSecurityGroupIds parameter.
This option added rules to existing security groups, and if those groups were used by multiple clusters the number of security group rules would exceed the maximum allowed.
Now that additional security groups can be added to the head and compute nodes, the customer should supply their own security groups that meet the Slurm cluster requirements, attach them to their login nodes, and configure them as additional security groups for the head and compute nodes.

Resolves #204
cartalla added a commit that referenced this issue May 13, 2024
Add support for rhel9 and rocky9.
Had to update some of the ansible playbooks to mimic rhel8 changes.

Resolves #229

Set SubmitterInstanceTags based on RESEnvironmentName.

Remove SubmitterSecurityGroupIds parameter.
This option added rules to existing security groups, and if those groups were used by multiple clusters the number of security group rules would exceed the maximum allowed.
Now that additional security groups can be added to the head and compute nodes, the customer should supply their own security groups that meet the Slurm cluster requirements, attach them to their login nodes, and configure them as additional security groups for the head and compute nodes.

Resolves #204

Update CallSlurmRestApiLambda from Python 3.8 to 3.9.

Resolves #230

Update CDK version to 2.111.0.
This is the latest version supported by nodejs 16.
Really need to move to nodejs 20, but it isn't supported on Amazon Linux 2 or
RHEL 7 family.
Would require either running in a Docker container or on a newer OS version.
I think that I'm going to change the prerequisites for the OS distribution
so that I can stay on the latest tools.
For example, I can't update to Python 3.12 until I do this.
cartalla added a commit that referenced this issue May 13, 2024
Add support for rhel9 and rocky9.
Had to update some of the ansible playbooks to mimic rhel8 changes.

Resolves #229

Set SubmitterInstanceTags based on RESEnvironmentName.

Remove SubmitterSecurityGroupIds parameter.
This option added rules to existing security groups, and if those groups were used by multiple clusters the number of security group rules would exceed the maximum allowed.
Now that additional security groups can be added to the head and compute nodes, the customer should supply their own security groups that meet the Slurm cluster requirements, attach them to their login nodes, and configure them as additional security groups for the head and compute nodes.

Resolves #204

Update CallSlurmRestApiLambda from Python 3.8 to 3.9.

Resolves #230

Update CDK version to 2.111.0.
This is the latest version supported by nodejs 16.
Really need to move to nodejs 20, but it isn't supported on Amazon Linux 2 or
RHEL 7 family.
Would require either running in a Docker container or on a newer OS version.
I think that I'm going to change the prerequisites for the OS distribution
so that I can stay on the latest tools.
For example, I can't update to Python 3.12 until I do this.

Update DeconfigureRESUsersGroupsJson to pass if last statement fails.
cartalla added a commit that referenced this issue May 13, 2024
Add support for rhel9 and rocky9.
Had to update some of the ansible playbooks to mimic rhel8 changes.

Resolves #229

Set SubmitterInstanceTags based on RESEnvironmentName.

Remove SubmitterSecurityGroupIds parameter.
This option added rules to existing security groups, and if those groups were used by multiple clusters the number of security group rules would exceed the maximum allowed.
Now that additional security groups can be added to the head and compute nodes, the customer should supply their own security groups that meet the Slurm cluster requirements, attach them to their login nodes, and configure them as additional security groups for the head and compute nodes.

Resolves #204

Update CallSlurmRestApiLambda from Python 3.8 to 3.9.

Resolves #230

Update CDK version to 2.111.0.
This is the latest version supported by nodejs 16.
Really need to move to nodejs 20, but it isn't supported on Amazon Linux 2 or
RHEL 7 family.
Would require either running in a Docker container or on a newer OS version.
I think that I'm going to change the prerequisites for the OS distribution
so that I can stay on the latest tools.
For example, I can't update to Python 3.12 until I do this.

Update DeconfigureRESUsersGroupsJson to pass if last statement fails.

Fix bug in create_slurm_accounts.py

Resolves #231