Reducing number of compute resources too aggressively #220
I've found that by commenting out the following three lines in source/cdk/cdk_slurm_stack.py I could turn off the reduction code (at line 2770 in the code I have):
The next line checks whether I've exceeded MAX_NUMBER_OF_COMPUTE_RESOURCES, so there is a nice check in case my configuration were too large. I want to be able to have machines with the same core count and less memory - no need to pay for more than I need.
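For illustration, here is a minimal sketch of the kind of reduction pass being described. Everything except the MAX_NUMBER_OF_COMPUTE_RESOURCES name is hypothetical, since the actual lines from cdk_slurm_stack.py are not quoted above; judging from the log output below, the pass keeps only the smallest core count in each memory bucket:

# Hypothetical sketch of the reduction pass described above. Only the name
# MAX_NUMBER_OF_COMPUTE_RESOURCES comes from the issue; the rest is illustrative.
MAX_NUMBER_OF_COMPUTE_RESOURCES = 50  # value assumed from the docs link below

def reduce_compute_resources(buckets):
    """buckets maps memory_gb -> {cores: [instance types]}."""
    selected = []
    for memory_gb, by_cores in sorted(buckets.items()):
        smallest_cores = min(by_cores)
        for cores, instance_types in sorted(by_cores.items()):
            if cores != smallest_cores:
                # The culling being complained about: drop all but one core
                # count per memory size, even when well under the limit.
                print(f"Skipping od-{memory_gb}gb-{cores}-cores to reduce number of CRs.")
                continue
            selected.append((memory_gb, cores, instance_types))
    # The check the commenter wants to keep: fail if still over the hard limit.
    if len(selected) > MAX_NUMBER_OF_COMPUTE_RESOURCES:
        raise ValueError(f"{len(selected)} CRs exceeds MAX_NUMBER_OF_COMPUTE_RESOURCES")
    return selected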
I was trying to configure as many instance types as allowed by ParallelCluster's limits, but in retrospect, I should really leave this up to the user to configure. I've changed the code to just create 1 instance type per CR and 1 CR per queue/partition.
I was previously only allowing 1 memory size/core count combination to keep the number of compute resources down and also was combining multiple instance types in one compute resource if possible. This was to try to maximize the number of instance types that were configured. This led to people not being able to configure the exact instance types they wanted. The preference is to notify the user and let them choose which instance types to exclude or to reduce the number of included types. So, I've reverted to my original strategy of 1 instance type per compute resource and 1 CR per queue. The compute resources can be combined into any queues that the user wants using custom Slurm settings. I had to exclude instance types in the default configuration in order to keep from exceeding the PC limits. Resolves #220

Update ParallelCluster version in config files and docs. Clean up security scan.
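In ParallelCluster 3 config terms, the reverted strategy amounts to something like the following sketch. It is illustrative only, not the stack's actual generation code, and the queue-naming convention is assumed; the dict keys follow the ParallelCluster 3 SlurmQueues layout:

# Illustrative sketch of the reverted strategy: one compute resource per
# instance type and one CR per queue. Not the stack's actual code.
def build_slurm_queues(instance_types, purchase_option='od'):
    queues = []
    for instance_type in instance_types:
        name = f"{purchase_option}-{instance_type.replace('.', '-')}"  # assumed naming
        queues.append({
            'Name': name,
            'ComputeResources': [{
                'Name': name,
                'Instances': [{'InstanceType': instance_type}],
                'MinCount': 0,
                'MaxCount': 10,  # DefaultMaxCount from the config below
            }],
        })
    return queues

With the nine instance types from the config below this yields nine queues and nine compute resources, comfortably under the limits, and users who want several instance types in one partition can regroup the CRs with custom Slurm settings.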
I'm building a cluster with just nine instance types, and certain instance types are being culled to "reduce number of CRs" - this is unnecessary as I do not have many compute resources.
Config file has:
InstanceConfig:
  UseSpot: false
  NodeCounts:
    # @todo: Update the max number of each instance type to configure
    DefaultMaxCount: 10
  Include:
    InstanceTypes:
      - m7a.large
      - m7a.xlarge
      - m7a.2xlarge
      - m7a.4xlarge
      - r7a.large
      - r7a.xlarge
      - r7a.2xlarge
      - r7a.4xlarge
      - r7a.8xlarge
It then buckets appropriately:
INFO: Instance type by memory and core:
INFO:     6 unique memory size:
INFO:         8 GB
INFO:             1 instance type with 2 core(s): ['m7a.large']
INFO:         16 GB
INFO:             1 instance type with 2 core(s): ['r7a.large']
INFO:             1 instance type with 4 core(s): ['m7a.xlarge']
INFO:         32 GB
INFO:             1 instance type with 4 core(s): ['r7a.xlarge']
INFO:             1 instance type with 8 core(s): ['m7a.2xlarge']
INFO:         64 GB
INFO:             1 instance type with 8 core(s): ['r7a.2xlarge']
INFO:             1 instance type with 16 core(s): ['m7a.4xlarge']
INFO:         128 GB
INFO:             1 instance type with 16 core(s): ['r7a.4xlarge']
INFO:         256 GB
INFO:             1 instance type with 32 core(s): ['r7a.8xlarge']
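The grouping itself is easy to reproduce. A sketch using a hardcoded spec table (the core and memory figures match the log above; presumably the real code looks these values up from EC2):

# Reproduce the memory/core bucketing shown in the log. The spec table is
# hardcoded here for illustration.
from collections import defaultdict

SPECS = {  # instance type -> (cores, memory_gb)
    'm7a.large': (2, 8), 'm7a.xlarge': (4, 16), 'm7a.2xlarge': (8, 32),
    'm7a.4xlarge': (16, 64), 'r7a.large': (2, 16), 'r7a.xlarge': (4, 32),
    'r7a.2xlarge': (8, 64), 'r7a.4xlarge': (16, 128), 'r7a.8xlarge': (32, 256),
}

buckets = defaultdict(lambda: defaultdict(list))
for instance_type, (cores, memory_gb) in SPECS.items():
    buckets[memory_gb][cores].append(instance_type)

for memory_gb in sorted(buckets):
    print(f"{memory_gb} GB")
    for cores in sorted(buckets[memory_gb]):
        types = buckets[memory_gb][cores]
        print(f"    {len(types)} instance type(s) with {cores} core(s): {types}")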
But then it starts culling unnecessarily, as ParallelCluster/Slurm can handle 9 compute resources...
INFO: Configuring od-8-gb queue:
INFO:     Adding od-8gb-2-cores compute resource: ['m7a.large']
INFO: Configuring od-16-gb queue:
INFO:     Adding od-16gb-2-cores compute resource: ['r7a.large']
INFO:     Skipping od-16gb-4-cores compute resource: ['m7a.xlarge'] to reduce number of CRs.
INFO: Configuring od-32-gb queue:
INFO:     Adding od-32gb-4-cores compute resource: ['r7a.xlarge']
INFO:     Skipping od-32gb-8-cores compute resource: ['m7a.2xlarge'] to reduce number of CRs.
INFO: Configuring od-64-gb queue:
INFO:     Adding od-64gb-8-cores compute resource: ['r7a.2xlarge']
INFO:     Skipping od-64gb-16-cores compute resource: ['m7a.4xlarge'] to reduce number of CRs.
INFO: Configuring od-128-gb queue:
INFO:     Adding od-128gb-16-cores compute resource: ['r7a.4xlarge']
INFO: Configuring od-256-gb queue:
INFO:     Adding od-256gb-32-cores compute resource: ['r7a.8xlarge']
INFO: Created 6 queues with 6 compute resources
I would like to have a 16-core 64 GB machine, an 8-core 32 GB machine, etc. How do I disable/modify this "culling"? I would argue we should only start culling when we exceed what ParallelCluster can handle.
We can now have 50 Slurm queues per cluster, 50 compute resources per queue, and 50 compute resources per cluster! See:
https://docs.aws.amazon.com/parallelcluster/latest/ug/configuration-of-multiple-queues-v3.html
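A simple guard against those documented limits would look something like this sketch (the 50/50/50 figures come from the page linked above; the function and its layout are illustrative):

# Check a planned queue layout against the documented ParallelCluster limits:
# 50 queues per cluster, 50 CRs per queue, and 50 CRs per cluster.
def fits_parallelcluster_limits(queues, max_queues=50, max_crs_per_queue=50, max_crs=50):
    total_crs = sum(len(q['ComputeResources']) for q in queues)
    return (len(queues) <= max_queues
            and all(len(q['ComputeResources']) <= max_crs_per_queue for q in queues)
            and total_crs <= max_crs)

Nine queues with one CR each passes easily, so culling only needs to kick in once a check like this fails.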