Add support for ParallelCluster versions 3.9.0 and 3.9.1 (#232)
Add support for rhel9 and rocky9.
Had to update some of the Ansible playbooks to mimic the rhel8 changes.

Resolves #229

Set SubmitterInstanceTags based on RESEnvironmentName.

Remove SubmitterSecurityGroupIds parameter.
This option added rules to existing security groups; when those groups were shared by multiple clusters, the number of security group rules could exceed the maximum allowed.
Now that additional security groups can be attached to the head and compute nodes, the
customer should supply their own security groups that meet the Slurm cluster requirements, attach them to their login nodes, and configure them as additional security groups for the head and compute nodes.

Resolves #204

Update CallSlurmRestApiLambda from Python 3.8 to 3.9.

Resolves #230

Update CDK version to 2.111.0.
This is the latest version supported by nodejs 16.
Really need to move to nodejs 20, but it isn't supported on Amazon Linux 2 or the
RHEL 7 family.
Would require either running in a Docker container or on a newer OS version.
I think that I'm going to change the prerequisites for the OS distribution
so that I can stay on the latest tools.
For example, I can't update to Python 3.12 until I do this.

Update DeconfigureRESUsersGroupsJson to pass even if the last statement fails.

Fix bug in create_slurm_accounts.py

Resolves #231
cartalla authored May 13, 2024
1 parent ded618c commit 8dff7cd
Showing 15 changed files with 169 additions and 118 deletions.
10 changes: 1 addition & 9 deletions docs/config.md
@@ -76,8 +76,6 @@ This project creates a ParallelCluster configuration file that is documented in
- str
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#HeadNode-v3-Imds">Imds</a>:
<a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/HeadNode-v3.html#yaml-HeadNode-Imds-Secured">Secured</a>: bool
<a href="#submittersecuritygroupids">SubmitterSecurityGroupIds</a>:
SecurityGroupName: SecurityGroupId
<a href="#submitterinstancetags">SubmitterInstanceTags</a>: str
TagName:
- TagValues
@@ -249,7 +247,7 @@ See the [ParallelCluster docs](https://docs.aws.amazon.com/parallelcluster/lates

See the [ParallelCluster docs](https://docs.aws.amazon.com/parallelcluster/latest/ug/Image-v3.html#yaml-Image-CustomAmi) for the custom AMI documentation.

**NOTE**: A CustomAmi must be provided for Rocky8.
**NOTE**: A CustomAmi must be provided for Rocky8 or Rocky9.
All other distributions have a default AMI that is provided by ParallelCluster.

#### Architecture
@@ -491,12 +489,6 @@ Additional security groups that will be added to the head node instance.

List of Amazon Resource Names (ARNs) of IAM policies for Amazon EC2 that will be added to the head node instance.

### SubmitterSecurityGroupIds

External security groups that should be able to use the cluster.

Rules will be added to allow it to interact with Slurm.

### SubmitterInstanceTags

Tags of instances that can be configured to submit to the cluster.
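For illustration only, the tag map and the matching logic can be sketched in Python. The `is_submitter` helper below is hypothetical and not part of this project; it only shows the shape of the configuration (a tag name mapped to a list of accepted values):

```python
# Hypothetical sketch of how SubmitterInstanceTags is interpreted:
# the config maps a tag name to the list of tag values that mark an
# instance as a cluster submitter. Names and values are examples.
submitter_instance_tags = {"res:EnvironmentName": ["res-eda"]}

def is_submitter(instance_tags: dict) -> bool:
    """Return True if the instance carries any configured submitter tag."""
    return any(
        instance_tags.get(tag_name) in tag_values
        for tag_name, tag_values in submitter_instance_tags.items()
    )

print(is_submitter({"res:EnvironmentName": "res-eda"}))  # True
print(is_submitter({"Name": "random-instance"}))         # False
```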
74 changes: 73 additions & 1 deletion docs/deployment-prerequisites.md
@@ -99,6 +99,76 @@ The version that has been tested is in the CDK_VERSION variable in the install s

The install script will try to install the prerequisites if they aren't already installed.

## Security Groups for Login Nodes

If you want to allow instances such as remote desktops to use the cluster directly, you must define
three security groups that allow connections between the instance, the Slurm head node, and the Slurm compute nodes.
The instance that connects to the Slurm cluster is called a login node or a submitter instance.

The three security groups are given the following names in this guide, but you can name them whatever you want.

* SlurmSubmitterSG
* SlurmHeadNodeSG
* SlurmComputeNodeSG

### Slurm Submitter Security Group

The SlurmSubmitterSG will be attached to your login nodes, such as your virtual desktops.

It needs at least the following inbound rules:

| Type | Port range | Source | Description
|------|------------|--------|------------
| TCP | 1024-65535 | SlurmHeadNodeSG | SlurmHeadNode ephemeral
| TCP | 1024-65535 | SlurmComputeNodeSG | SlurmComputeNode ephemeral
| TCP | 6000-7024 | SlurmComputeNodeSG | SlurmComputeNode X11

It needs the following outbound rules:

| Type | Port range | Destination | Description
|------|------------|-------------|------------
| TCP | 2049 | SlurmHeadNodeSG | SlurmHeadNode NFS
| TCP | 6818 | SlurmComputeNodeSG | SlurmComputeNode slurmd
| TCP | 6819 | SlurmHeadNodeSG | SlurmHeadNode slurmdbd
| TCP | 6820-6829 | SlurmHeadNodeSG | SlurmHeadNode slurmctld
| TCP | 6830 | SlurmHeadNodeSG | SlurmHeadNode slurmrestd

### Slurm Head Node Security Group

The SlurmHeadNodeSG will be specified in your configuration file for the `slurm/SlurmCtl/AdditionalSecurityGroups` parameter.

It needs at least the following inbound rules:

| Type | Port range | Source | Description
|------|------------|--------|------------
| TCP | 2049 | SlurmSubmitterSG | SlurmSubmitter NFS
| TCP | 6819 | SlurmSubmitterSG | SlurmSubmitter slurmdbd
| TCP | 6820-6829 | SlurmSubmitterSG | SlurmSubmitter slurmctld
| TCP | 6830 | SlurmSubmitterSG | SlurmSubmitter slurmrestd

It needs the following outbound rules:

| Type | Port range | Destination | Description
|------|------------|-------------|------------
| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral

### Slurm Compute Node Security Group

The SlurmComputeNodeSG will be specified in your configuration file for the `slurm/InstanceConfig/AdditionalSecurityGroups` parameter.

It needs at least the following inbound rules:

| Type | Port range | Source | Description
|------|------------|--------|------------
| TCP | 6818 | SlurmSubmitterSG | SlurmSubmitter slurmd

It needs the following outbound rules:

| Type | Port range | Destination | Description
|------|------------|-------------|------------
| TCP | 1024-65535 | SlurmSubmitterSG | SlurmSubmitter ephemeral
| TCP | 6000-7024 | SlurmSubmitterSG | SlurmSubmitter X11
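As a sanity check, the rule tables above can be encoded as data, for example to feed into `boto3` security group calls. This is a hypothetical sketch, not code from this repository; the port names, helper, and group names are illustrative:

```python
# Hypothetical encoding of the Slurm-related port ranges from the tables above.
SLURM_PORTS = {
    "nfs": (2049, 2049),
    "slurmd": (6818, 6818),
    "slurmdbd": (6819, 6819),
    "slurmctld": (6820, 6829),
    "slurmrestd": (6830, 6830),
    "ephemeral": (1024, 65535),
    "x11": (6000, 7024),
}

# Inbound rules per security group as (from_port, to_port, source_sg) tuples.
inbound = {
    "SlurmSubmitterSG": [
        (*SLURM_PORTS["ephemeral"], "SlurmHeadNodeSG"),
        (*SLURM_PORTS["ephemeral"], "SlurmComputeNodeSG"),
        (*SLURM_PORTS["x11"], "SlurmComputeNodeSG"),
    ],
    "SlurmHeadNodeSG": [
        (*SLURM_PORTS["nfs"], "SlurmSubmitterSG"),
        (*SLURM_PORTS["slurmdbd"], "SlurmSubmitterSG"),
        (*SLURM_PORTS["slurmctld"], "SlurmSubmitterSG"),
        (*SLURM_PORTS["slurmrestd"], "SlurmSubmitterSG"),
    ],
    "SlurmComputeNodeSG": [
        (*SLURM_PORTS["slurmd"], "SlurmSubmitterSG"),
    ],
}

def to_ip_permissions(rules):
    """Convert (from, to, sg) tuples into boto3-style IpPermissions dicts."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": from_port,
            "ToPort": to_port,
            "UserIdGroupPairs": [{"GroupId": source_sg}],
        }
        for from_port, to_port, source_sg in rules
    ]
```

Each group's rules could then be applied with `ec2_client.authorize_security_group_ingress(GroupId=..., IpPermissions=to_ip_permissions(...))`, substituting the real security group ids for the names.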

## Create Configuration File

Before you deploy a cluster you need to create a configuration file.
@@ -108,6 +178,7 @@ Ideally you should version control this file so you can keep track of changes.

The schema for the config file along with its default values can be found in [source/cdk/config_schema.py](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L230-L445).
The schema is defined in Python, but the actual config file must be written in YAML.
See [Configuration File Format](config.md) for documentation on all of the parameters.

The following are key parameters that you will need to update.
If you do not have the required parameters in your config file then the installer script will fail unless you specify the `--prompt` option.
@@ -120,7 +191,6 @@ You should save your selections in the config file.
| [Region](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L368-L369) | Region where VPC is located | | `$AWS_DEFAULT_REGION`
| [VpcId](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L372-L373) | The vpc where the cluster will be deployed. | vpc-* | None
| [SshKeyPair](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L370-L371) | EC2 Keypair to use for instances | | None
| [slurm/SubmitterSecurityGroupIds](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L480-L485) | Existing security groups that can submit to the cluster. For SOCA this is the ComputeNodeSG* resource. | sg-* | None
| [ErrorSnsTopicArn](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L379-L380) | ARN of an SNS topic that will be notified of errors | `arn:aws:sns:{{region}}:{AccountId}:{TopicName}` | None
| [slurm/InstanceConfig](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L491-L543) | Configure instance types that the cluster can use and number of nodes. | | See [default_config.yml](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/resources/config/default_config.yml)

Expand All @@ -137,7 +207,9 @@ all nodes must have the same architecture and Base OS.
| CentOS 7 | x86_64
| RedHat 7 | x86_64
| RedHat 8 | x86_64, arm64
| RedHat 9 | x86_64, arm64
| Rocky 8 | x86_64, arm64
| Rocky 9 | x86_64, arm64
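The support matrix above can be expressed as a simple validation check, sketched here for illustration. The dictionary keys are illustrative; the authoritative schema lives in `source/cdk/config_schema.py`:

```python
# Hypothetical sketch of the Base OS / architecture support table above.
SUPPORTED_ARCHITECTURES = {
    "centos7": ["x86_64"],
    "rhel7":   ["x86_64"],
    "rhel8":   ["x86_64", "arm64"],
    "rhel9":   ["x86_64", "arm64"],
    "rocky8":  ["x86_64", "arm64"],
    "rocky9":  ["x86_64", "arm64"],
}

def check_os_architecture(base_os: str, architecture: str) -> None:
    """Raise ValueError if the OS/architecture combination is unsupported."""
    supported = SUPPORTED_ARCHITECTURES.get(base_os)
    if supported is None:
        raise ValueError(f"Unsupported Base OS: {base_os}")
    if architecture not in supported:
        raise ValueError(f"{base_os} does not support {architecture}")
```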

You can exclude instance types by family or by specific instance type.
By default the InstanceConfig excludes older generation instance families.
9 changes: 7 additions & 2 deletions docs/res_integration.md
@@ -11,11 +11,12 @@ The intention is to completely automate the deployment of ParallelCluster and se
|-----------|-------------|------
| VpcId | VPC id for the RES cluster | vpc-xxxxxx
| SubnetId | Subnet in the RES VPC. | subnet-xxxxx
| SubmitterSecurityGroupIds | The security group names and ids used by RES VDIs. The name will be something like *EnvironmentName*-vdc-dcv-host-security-group | *EnvironmentName*-*VDISG*: sg-xxxxxxxx
| SubmitterInstanceTags | The tags of VDI instances. | 'res:EnvironmentName': '*EnvironmentName*'
| ExtraMounts | The mount parameters for the /home directory. This is required for access to the home directory. |
| ExtraMountSecurityGroups | Security groups that give access to the ExtraMounts. These will be added to compute nodes so they can access the file systems.

You must also create security groups as described in [Security Groups for Login Nodes](deployment-prerequisites.md#security-groups-for-login-nodes) and specify the SlurmHeadNodeSG in the `slurm/SlurmCtl/AdditionalSecurityGroups` parameter and the SlurmComputeNodeSG in the `slurm/InstanceConfig/AdditionalSecurityGroups` parameter.

When you specify **RESEnvironmentName**, a lambda function will run SSM commands to create a cron job on a RES domain joined instance to update the users_groups.json file every hour. Another lambda function will also automatically configure all running VDI hosts to use the cluster.

The following example shows the configuration parameters for a RES with the EnvironmentName=res-eda.
@@ -51,11 +52,15 @@ slurm:
Database:
DatabaseStackName: pcluster-slurm-db-res
SlurmCtl: {}
SlurmCtl:
AdditionalSecurityGroups:
- sg-12345678 # SlurmHeadNodeSG
# Configure typical EDA instance types
# A partition will be created for each combination of Base OS, Architecture, and Spot
InstanceConfig:
AdditionalSecurityGroups:
- sg-23456789 # SlurmComputeNodeSG
UseSpot: true
NodeCounts:
DefaultMaxCount: 10
3 changes: 2 additions & 1 deletion docs/soca_integration.md
@@ -11,7 +11,8 @@ Set the following parameters in your config file.
| Parameter | Description | Value
|-----------|-------------|------
| VpcId | VPC id for the SOCA cluster | vpc-xxxxxx
| SubmitterSecurityGroupIds | The ComputeNode security group name and id | *cluster-id*-*ComputeNodeSG*: sg-xxxxxxxx
| slurm/SlurmCtl/AdditionalSecurityGroups | Security group ids that give desktop instances access to the head node and that give the head node access to VPC resources such as file systems.
| slurm/InstanceConfig/AdditionalSecurityGroups | Security group ids that give desktop instances access to the compute nodes and that give compute nodes access to VPC resources such as file systems.
| ExtraMounts | Add the mount parameters for the /apps and /data directories. This is required for access to the home directory. |

Deploy your slurm cluster.
11 changes: 10 additions & 1 deletion setup.sh
@@ -41,7 +41,16 @@ fi
echo "Using python $python_version"

# Check nodejs version
# https://nodejs.org/en/about/previous-releases
required_nodejs_version=16.20.2
# required_nodejs_version=18.20.2
# On Amazon Linux 2 and nodejs 18.20.2 I get the following errors:
# node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node)
# node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node)
# required_nodejs_version=20.13.1
# On Amazon Linux 2 and nodejs 20.13.1 I get the following errors:
# node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by node)
# node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by node)
export JSII_SILENCE_WARNING_DEPRECATED_NODE_VERSION=1
if ! which node &> /dev/null; then
echo -e "\nnode not found in your path."
@@ -88,7 +97,7 @@ fi
echo "Using nodejs version $nodejs_version"

# Create a local installation of cdk
CDK_VERSION=2.91.0 # If you change the CDK version here, make sure to also change it in source/requirements.txt
CDK_VERSION=2.111.0 # When you change the CDK version here, make sure to also change it in source/requirements.txt
if ! cdk --version &> /dev/null; then
echo "CDK not installed. Installing global version of cdk@$CDK_VERSION."
if ! npm install -g aws-cdk@$CDK_VERSION; then
48 changes: 5 additions & 43 deletions source/cdk/cdk_slurm_stack.py
@@ -231,17 +231,6 @@ def override_config_with_context(self):
logger.error(f"Must set --{command_line_switch} from the command line or {config_key} in the config files")
exit(1)

config_key = 'SubmitterSecurityGroupIds'
context_key = config_key
submitterSecurityGroupIds_b64_string = self.node.try_get_context(context_key)
if submitterSecurityGroupIds_b64_string:
submitterSecurityGroupIds = json.loads(base64.b64decode(submitterSecurityGroupIds_b64_string).decode('utf-8'))
if config_key not in self.config['slurm']:
logger.info(f"slurm/{config_key:20} set from command line: {submitterSecurityGroupIds}")
else:
logger.info(f"slurm/{config_key:20} in config file overridden on command line from {self.config['slurm'][config_key]} to {submitterSecurityGroupIds}")
self.config['slurm'][config_key] = submitterSecurityGroupIds

def check_config(self):
'''
Check config, set defaults, and sanity check the configuration.
@@ -425,6 +414,9 @@ def update_config_for_res(self):
'''
res_environment_name = self.config['RESEnvironmentName']
logger.info(f"Updating configuration for RES environment: {res_environment_name}")

self.config['slurm']['SubmitterInstanceTags'] = {'res:EnvironmentName': [res_environment_name]}

cloudformation_client = boto3.client('cloudformation', region_name=self.config['Region'])
res_stack_name = None
stack_statuses = {}
@@ -481,13 +473,6 @@ def update_config_for_res(self):
self.config['SubnetId'] = subnet_ids[0]
logger.info(f" SubnetId: {self.config['SubnetId']}")

submitter_security_group_ids = []
if 'SubmitterSecurityGroupIds' not in self.config['slurm']:
self.config['slurm']['SubmitterSecurityGroupIds'] = {}
else:
for security_group_name, security_group_ids in self.config['slurm']['SubmitterSecurityGroupIds'].items():
submitter_security_group_ids.append(security_group_ids)

# Get RES VDI Security Group
res_vdc_stack_name = f"{res_stack_name}-vdc"
if res_vdc_stack_name not in stack_statuses:
@@ -508,11 +493,6 @@ def update_config_for_res(self):
if not res_dcv_security_group_id:
logger.error(f"RES VDI security group not found.")
exit(1)
if res_dcv_security_group_id not in submitter_security_group_ids:
res_dcv_security_group_name = f"{res_environment_name}-dcv-sg"
logger.info(f" SubmitterSecurityGroupIds['{res_dcv_security_group_name}'] = '{res_dcv_security_group_id}'")
self.config['slurm']['SubmitterSecurityGroupIds'][res_dcv_security_group_name] = res_dcv_security_group_id
submitter_security_group_ids.append(res_dcv_security_group_id)

# Get cluster manager Security Group
logger.debug(f"Searching for cluster manager security group id")
@@ -535,11 +515,6 @@ def update_config_for_res(self):
if not res_cluster_manager_security_group_id:
logger.error(f"RES cluster manager security group not found.")
exit(1)
if res_cluster_manager_security_group_id not in submitter_security_group_ids:
res_cluster_manager_security_group_name = f"{res_environment_name}-cluster-manager-sg"
logger.info(f" SubmitterSecurityGroupIds['{res_cluster_manager_security_group_name}'] = '{res_cluster_manager_security_group_id}'")
self.config['slurm']['SubmitterSecurityGroupIds'][res_cluster_manager_security_group_name] = res_cluster_manager_security_group_id
submitter_security_group_ids.append(res_cluster_manager_security_group_id)

# Get vdc controller Security Group
logger.debug(f"Searching for VDC controller security group id")
@@ -564,11 +539,6 @@ def update_config_for_res(self):
if not res_vdc_controller_security_group_id:
logger.error(f"RES VDC controller security group not found.")
exit(1)
if res_vdc_controller_security_group_id not in submitter_security_group_ids:
res_vdc_controller_security_group_name = f"{res_environment_name}-vdc-controller-sg"
logger.info(f" SubmitterSecurityGroupIds['{res_vdc_controller_security_group_name}'] = '{res_vdc_controller_security_group_id}'")
self.config['slurm']['SubmitterSecurityGroupIds'][res_vdc_controller_security_group_name] = res_vdc_controller_security_group_id
submitter_security_group_ids.append(res_vdc_controller_security_group_id)

# Configure the /home mount from RES if /home not already configured
home_mount_found = False
@@ -1025,7 +995,7 @@ def create_parallel_cluster_lambdas(self):
],
compatible_runtimes = [
aws_lambda.Runtime.PYTHON_3_9,
aws_lambda.Runtime.PYTHON_3_10,
# aws_lambda.Runtime.PYTHON_3_10, # Doesn't work: No module named 'rpds.rpds'
# aws_lambda.Runtime.PYTHON_3_11, # Doesn't work: No module named 'rpds.rpds'
],
)
@@ -1694,7 +1664,7 @@ def create_callSlurmRestApiLambda(self):
function_name=f"{self.stack_name}-CallSlurmRestApiLambda",
description="Example showing how to call Slurm REST API",
memory_size=128,
runtime=aws_lambda.Runtime.PYTHON_3_8,
runtime=aws_lambda.Runtime.PYTHON_3_9,
architecture=aws_lambda.Architecture.ARM_64,
timeout=Duration.minutes(1),
log_retention=logs.RetentionDays.INFINITE,
@@ -1842,14 +1812,6 @@ def create_security_groups(self):
Tags.of(self.slurm_submitter_sg).add("Name", self.slurm_submitter_sg_name)
self.suppress_cfn_nag(self.slurm_submitter_sg, 'W29', 'Egress port range used to block all egress')
self.submitter_security_groups[self.slurm_submitter_sg_name] = self.slurm_submitter_sg
for slurm_submitter_sg_name, slurm_submitter_sg_id in self.config['slurm']['SubmitterSecurityGroupIds'].items():
(allow_all_outbound, allow_all_ipv6_outbound) = self.allow_all_outbound(slurm_submitter_sg_id)
self.submitter_security_groups[slurm_submitter_sg_name] = ec2.SecurityGroup.from_security_group_id(
self, f"{slurm_submitter_sg_name}",
security_group_id = slurm_submitter_sg_id,
allow_all_outbound = allow_all_outbound,
allow_all_ipv6_outbound = allow_all_ipv6_outbound
)

self.slurm_rest_api_lambda_sg = ec2.SecurityGroup(self, "SlurmRestLambdaSG", vpc=self.vpc, allow_all_outbound=False, description="SlurmRestApiLambda to SlurmCtl Security Group")
self.slurm_rest_api_lambda_sg_name = f"{self.stack_name}-SlurmRestApiLambdaSG"