Clean up security groups and permissions for extra mounts (#246)
Create a CDK script to automate the creation of security groups for external
login nodes and for external FSx file systems.
Add a parameter, AdditionalSecurityGroupsStackName, to get the security
group IDs from the created stack and to configure the additional security
groups for the head and compute nodes.
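A lambda consuming the new parameter would read the created stack's outputs with CloudFormation's `describe_stacks`. A minimal sketch of that lookup; the function names and the convention of recognizing security group IDs by their `sg-` prefix are illustrative assumptions, not the actual implementation:

```python
def security_group_ids_from_outputs(outputs):
    """Pick out security group IDs from a CloudFormation Outputs list.

    `outputs` has the shape returned by describe_stacks:
    [{'OutputKey': ..., 'OutputValue': ...}, ...]
    """
    return {
        o['OutputKey']: o['OutputValue']
        for o in outputs
        if o['OutputValue'].startswith('sg-')
    }

def get_additional_security_groups(stack_name):
    # boto3 is imported lazily so the pure helper above can be used
    # and tested without AWS credentials.
    import boto3
    cfn = boto3.client('cloudformation')
    stack = cfn.describe_stacks(StackName=stack_name)['Stacks'][0]
    return security_group_ids_from_outputs(stack.get('Outputs', []))
```

The returned mapping of output key to security group ID can then be merged into the head and compute node `AdditionalSecurityGroups` configuration.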

Update docs.
Update deployment-prerequisites.md.
Add security-groups.md.

Replace the RESEnvironmentName parameter with RESStackName.
Get the RES environment name from the parameters of the RES stack.

Delete the SubmitterInstanceTags parameter because it is not used anywhere.
A new parameter will be added to configure/deconfigure external login nodes.

Don't add extra mount security groups to ParallelCluster.

Don't add extra mount security groups to the create cluster lambda.

Update the permissions of the lambda that creates the ParallelCluster cluster.
Add the ec2:DeleteTags permission.
Add missing FSx permissions.

Use cluster-manager instead of vdc-controller to create the users/groups JSON.

Add errors to the SNS notification in the CreateBuildFiles lambda.

Handle the special case where the same cluster name exists in multiple VPCs.
This creates Route53 hosted zones with the same name, and the A record for
the head node can be created in the wrong hosted zone.
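Same-named zones can be disambiguated by matching the hosted zone's VPC associations rather than its name alone. A hedged sketch of that selection logic; the zone dicts below are a simplified form of what Route53's `get_hosted_zone` returns (the `Name`, `VPCs`, and `VPCId` field names follow the Route53 API, but the helper itself is illustrative):

```python
def find_hosted_zone_for_vpc(zones, zone_name, vpc_id):
    """Return the ID of the hosted zone with the given name that is
    associated with the given VPC, so the head node's A record lands in
    the zone for this cluster's VPC rather than a same-named zone from
    another VPC. Returns None if no zone matches."""
    for zone in zones:
        if zone['Name'] != zone_name:
            continue
        # Only a zone associated with this cluster's VPC qualifies.
        if any(v['VPCId'] == vpc_id for v in zone.get('VPCs', [])):
            return zone['Id']
    return None
```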

Make sure to send an SNS notification if the ParallelCluster create or update fails.
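Guaranteeing a notification on create/update failure might look like the following sketch; the topic ARN, message shape, and function names are assumptions for illustration only:

```python
import json

def build_failure_notification(cluster_name, operation, error):
    """Build the SNS subject and message body for a failed
    ParallelCluster create or update."""
    subject = f"ParallelCluster {operation} failed: {cluster_name}"
    message = json.dumps({
        'cluster': cluster_name,
        'operation': operation,
        'error': str(error),
    })
    return subject, message

def notify_failure(topic_arn, cluster_name, operation, error):
    # boto3 is imported lazily so the message builder stays testable
    # without AWS credentials.
    import boto3
    subject, message = build_failure_notification(cluster_name, operation, error)
    boto3.client('sns').publish(TopicArn=topic_arn, Subject=subject, Message=message)
```

The lambda would call `notify_failure` from the exception handler around the create/update call so that no failure path exits silently.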
cartalla authored Aug 19, 2024
1 parent 2ae1b13 commit 2d84608
Showing 25 changed files with 1,045 additions and 585 deletions.
23 changes: 14 additions & 9 deletions README.md
@@ -14,30 +14,31 @@ Key features are:
* Automatic scaling of AWS EC2 instances based on demand
* Use any AWS EC2 instance type including Graviton2
* Use of spot instances
* Memory-aware scheduling
* License-aware scheduling (Manages tool licenses as a consumable resource)
* User and group fair share scheduling
* Handling of spot terminations
* Handling of insufficient capacity exceptions
* Batch and interactive partitions (queues)
* Slurm accounting database
* CloudWatch dashboard
* Job preemption
* Manage on-premises compute nodes
* Configure partitions (queues) and nodes that are always on to support reserved instances (RIs) and savings plans (SPs).
* Integration with [Research and Engineering Studio on AWS (RES)](https://aws.amazon.com/hpc/res/)

Features in the legacy version that are not in the ParallelCluster version:

* Heterogeneous clusters with mixed OSes and CPU architectures on compute nodes.
* Multi-AZ support. Supported by ParallelCluster, but not currently implemented.
* Multi-region support
* AWS Fault Injection Simulator (FIS) templates to test spot terminations
* Support for MungeKeySsmParameter
* Multi-cluster federation

ParallelCluster Limitations

* Number of "Compute Resources" (CRs) is limited to 50, which limits the number of instance types allowed in a cluster.
ParallelCluster can have multiple instance types in a compute resource (CR), but with memory-based scheduling enabled, they must all have the same number of cores and amount of memory.
* All Slurm instances must have the same OS and CPU architecture.
* Stand-alone Slurm database daemon instance. Prevents federation.
* Multi-region support. This is unlikely to change because multi-region services run against our architectural philosophy.
@@ -57,11 +58,12 @@ ParallelCluster:

* Amazon Linux 2
* CentOS 7
* RedHat 7, 8 and 9
* Rocky Linux 8 and 9

This Slurm cluster supports both Intel/AMD (x86_64) based instances and Graviton (arm64/aarch64) based instances.

[Graviton instances require](https://github.com/aws/aws-graviton-getting-started/blob/main/os.md) Amazon Linux 2 or RedHat/Rocky >=8 operating systems.
RedHat 7 and CentOS 7 do not support Graviton 2.

This provides the following different combinations of OS and processor architecture.
@@ -72,10 +74,13 @@ ParallelCluster:
* Amazon Linux 2 and x86_64
* CentOS 7 and x86_64
* RedHat 7 and x86_64
* RedHat 8/9 and arm64
* RedHat 8/9 and x86_64
* Rocky 8/9 and arm64
* Rocky 8/9 and x86_64

Note that in ParallelCluster, all compute nodes must have the same OS and architecture.
However, you can create as many clusters as you require.

## Documentation

9 changes: 9 additions & 0 deletions create-slurm-security-groups.sh
@@ -0,0 +1,9 @@
#!/bin/bash -xe

cd create-slurm-security-groups

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pwd
./create-slurm-security-groups.py "$@"
24 changes: 24 additions & 0 deletions create-slurm-security-groups/.gitignore
@@ -0,0 +1,24 @@
*.swp
package-lock.json
.pytest_cache
*.egg-info

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# CDK Context & Staging files
.cdk.staging/
cdk.out/

cdk.context.json
65 changes: 65 additions & 0 deletions create-slurm-security-groups/README.md
@@ -0,0 +1,65 @@

# Welcome to your CDK Python project!

You should explore the contents of this project. It defines a CDK app with a single stack (`CreateSlurmSecurityGroupsStack`)
that creates the security groups for external login nodes and external FSx file systems.

The `cdk.json` file tells the CDK Toolkit how to execute your app.

This project is set up like a standard Python project. The initialization process also creates
a virtualenv within this project, stored under the .venv directory. To create the virtualenv
it assumes that there is a `python3` executable in your path with access to the `venv` package.
If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv
manually once the init process completes.

To manually create a virtualenv on MacOS and Linux:

```
$ python3 -m venv .venv
```

After the init process completes and the virtualenv is created, you can use the following
step to activate your virtualenv.

```
$ source .venv/bin/activate
```

If you are on a Windows platform, you would activate the virtualenv like this:

```
% .venv\Scripts\activate.bat
```

Once the virtualenv is activated, you can install the required dependencies.

```
$ pip install -r requirements.txt
```

At this point you can now synthesize the CloudFormation template for this code.

```
$ cdk synth
```

You can now begin exploring the source code, contained in the `create_slurm_security_groups` directory.
There is also a very trivial test included that can be run like this:

```
$ pytest
```

To add additional dependencies, for example other CDK libraries, just add to
your requirements.txt file and rerun the `pip install -r requirements.txt`
command.

## Useful commands

* `cdk ls` list all stacks in the app
* `cdk synth` emits the synthesized CloudFormation template
* `cdk deploy` deploy this stack to your default AWS account/region
* `cdk diff` compare deployed stack with current state
* `cdk docs` open CDK documentation

Enjoy!
17 changes: 17 additions & 0 deletions create-slurm-security-groups/app.py
@@ -0,0 +1,17 @@
#!/usr/bin/env python3

import aws_cdk as cdk
from aws_cdk import App, Environment
from create_slurm_security_groups.create_slurm_security_groups_stack import CreateSlurmSecurityGroupsStack

app = cdk.App()

cdk_env = Environment(
    account = app.node.try_get_context('account_id'),
    region = app.node.try_get_context('region')
)
stack_name = app.node.try_get_context('stack_name')

CreateSlurmSecurityGroupsStack(app, stack_name, env=cdk_env, termination_protection=True)

app.synth()
62 changes: 62 additions & 0 deletions create-slurm-security-groups/cdk.json
@@ -0,0 +1,62 @@
{
  "app": "python3 app.py",
  "watch": {
    "include": [
      "**"
    ],
    "exclude": [
      "README.md",
      "cdk*.json",
      "requirements*.txt",
      "source.bat",
      "**/__init__.py",
      "python/__pycache__",
      "tests"
    ]
  },
  "context": {
    "@aws-cdk/aws-lambda:recognizeLayerVersion": true,
    "@aws-cdk/core:checkSecretUsage": true,
    "@aws-cdk/core:target-partitions": [
      "aws",
      "aws-cn"
    ],
    "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true,
    "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true,
    "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true,
    "@aws-cdk/aws-iam:minimizePolicies": true,
    "@aws-cdk/core:validateSnapshotRemovalPolicy": true,
    "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true,
    "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true,
    "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true,
    "@aws-cdk/aws-apigateway:disableCloudWatchRole": true,
    "@aws-cdk/core:enablePartitionLiterals": true,
    "@aws-cdk/aws-events:eventsTargetQueueSameAccount": true,
    "@aws-cdk/aws-iam:standardizedServicePrincipals": true,
    "@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true,
    "@aws-cdk/aws-iam:importedRoleStackSafeDefaultPolicyName": true,
    "@aws-cdk/aws-s3:serverAccessLogsUseBucketPolicy": true,
    "@aws-cdk/aws-route53-patters:useCertificate": true,
    "@aws-cdk/customresources:installLatestAwsSdkDefault": false,
    "@aws-cdk/aws-rds:databaseProxyUniqueResourceName": true,
    "@aws-cdk/aws-codedeploy:removeAlarmsFromDeploymentGroup": true,
    "@aws-cdk/aws-apigateway:authorizerChangeDeploymentLogicalId": true,
    "@aws-cdk/aws-ec2:launchTemplateDefaultUserData": true,
    "@aws-cdk/aws-secretsmanager:useAttachedSecretResourcePolicyForSecretTargetAttachments": true,
    "@aws-cdk/aws-redshift:columnId": true,
    "@aws-cdk/aws-stepfunctions-tasks:enableEmrServicePolicyV2": true,
    "@aws-cdk/aws-ec2:restrictDefaultSecurityGroup": true,
    "@aws-cdk/aws-apigateway:requestValidatorUniqueId": true,
    "@aws-cdk/aws-kms:aliasNameRef": true,
    "@aws-cdk/aws-autoscaling:generateLaunchTemplateInsteadOfLaunchConfig": true,
    "@aws-cdk/core:includePrefixInUniqueNameGeneration": true,
    "@aws-cdk/aws-efs:denyAnonymousAccess": true,
    "@aws-cdk/aws-opensearchservice:enableOpensearchMultiAzWithStandby": true,
    "@aws-cdk/aws-lambda-nodejs:useLatestRuntimeVersion": true,
    "@aws-cdk/aws-efs:mountTargetOrderInsensitiveLogicalId": true,
    "@aws-cdk/aws-rds:auroraClusterChangeScopeOfInstanceParameterGroupWithEachParameters": true,
    "@aws-cdk/aws-appsync:useArnForSourceApiAssociationIdentifier": true,
    "@aws-cdk/aws-rds:preventRenderingDeprecatedCredentials": true,
    "@aws-cdk/aws-codepipeline-actions:useNewDefaultBranchForCodeCommitSource": true
  }
}
