Reducing number of compute resources too aggressively #220
I've found that by commenting out the following three lines in source/cdk/cdk_slurm_stack.py I could turn off the reduction code (at line 2770 in the code I have):
The next line checks whether I've exceeded MAX_NUMBER_OF_COMPUTE_RESOURCES, so there is a nice check in case my configuration were too large. I want to be able to have machines with the same core count and less memory - no need to pay for more than I need.
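For illustration, here is a minimal sketch of the kind of reduction pass being described. Everything except the MAX_NUMBER_OF_COMPUTE_RESOURCES name is hypothetical, since the actual lines from cdk_slurm_stack.py are not quoted above; judging from the log output below, the pass keeps only the smallest core count in each memory bucket:

# Hypothetical sketch of the reduction pass described above. Only the name
# MAX_NUMBER_OF_COMPUTE_RESOURCES comes from the issue; the rest is illustrative.
MAX_NUMBER_OF_COMPUTE_RESOURCES = 50  # value assumed from the docs link below

def reduce_compute_resources(buckets):
    """buckets maps memory_gb -> {cores: [instance types]}."""
    selected = []
    for memory_gb, by_cores in sorted(buckets.items()):
        smallest_cores = min(by_cores)
        for cores, instance_types in sorted(by_cores.items()):
            if cores != smallest_cores:
                # The culling being complained about: drop all but one core
                # count per memory size, even when well under the limit.
                print(f"Skipping od-{memory_gb}gb-{cores}-cores to reduce number of CRs.")
                continue
            selected.append((memory_gb, cores, instance_types))
    # The check the commenter wants to keep: fail if still over the hard limit.
    if len(selected) > MAX_NUMBER_OF_COMPUTE_RESOURCES:
        raise ValueError(f"{len(selected)} CRs exceeds MAX_NUMBER_OF_COMPUTE_RESOURCES")
    return selected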
I was trying to configure as many instance types as allowed by ParallelCluster's limits, but in retrospect, I should really leave this up to the user to configure. I've changed the code to just create 1 instance type per CR and 1 CR per queue/partition.
I was previously only allowing 1 memory size/core count combination to keep the number of compute resources down and also was combining multiple instance types in one compute resource if possible. This was to try to maximize the number of instance types that were configured. This led to people not being able to configure the exact instance types they wanted. The preference is to notify the user and let them choose which instance types to exclude or to reduce the number of included types. So, I've reverted to my original strategy of 1 instance type per compute resource and 1 CR per queue. The compute resources can be combined into any queues that the user wants using custom Slurm settings. I had to exclude instance types in the default configuration in order to keep from exceeding the PC limits. Resolves #220

Update ParallelCluster version in config files and docs. Clean up security scan.
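In ParallelCluster 3 config terms, the reverted strategy amounts to something like the following sketch. It is illustrative only, not the stack's actual generation code, and the queue-naming convention is assumed; the dict keys follow the ParallelCluster 3 SlurmQueues layout:

# Illustrative sketch of the reverted strategy: one compute resource per
# instance type and one CR per queue. Not the stack's actual code.
def build_slurm_queues(instance_types, purchase_option='od'):
    queues = []
    for instance_type in instance_types:
        name = f"{purchase_option}-{instance_type.replace('.', '-')}"  # assumed naming
        queues.append({
            'Name': name,
            'ComputeResources': [{
                'Name': name,
                'Instances': [{'InstanceType': instance_type}],
                'MinCount': 0,
                'MaxCount': 10,  # DefaultMaxCount from the config below
            }],
        })
    return queues

With the nine instance types from the config below this yields nine queues and nine compute resources, comfortably under the limits, and users who want several instance types in one partition can regroup the CRs with custom Slurm settings.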
I'm building a cluster with just nine instance types, and certain instance types are being culled to "reduce number of CRs" - this is unnecessary as I do not have many compute resources.
Config file has:
InstanceConfig:
  UseSpot: false
  NodeCounts:
    # @todo: Update the max number of each instance type to configure
    DefaultMaxCount: 10
  Include:
    InstanceTypes:
      - m7a.large
      - m7a.xlarge
      - m7a.2xlarge
      - m7a.4xlarge
      - r7a.large
      - r7a.xlarge
      - r7a.2xlarge
      - r7a.4xlarge
      - r7a.8xlarge
It then buckets appropriately:
INFO: Instance type by memory and core:
INFO:     6 unique memory size:
INFO:         8 GB
INFO:             1 instance type with 2 core(s): ['m7a.large']
INFO:         16 GB
INFO:             1 instance type with 2 core(s): ['r7a.large']
INFO:             1 instance type with 4 core(s): ['m7a.xlarge']
INFO:         32 GB
INFO:             1 instance type with 4 core(s): ['r7a.xlarge']
INFO:             1 instance type with 8 core(s): ['m7a.2xlarge']
INFO:         64 GB
INFO:             1 instance type with 8 core(s): ['r7a.2xlarge']
INFO:             1 instance type with 16 core(s): ['m7a.4xlarge']
INFO:         128 GB
INFO:             1 instance type with 16 core(s): ['r7a.4xlarge']
INFO:         256 GB
INFO:             1 instance type with 32 core(s): ['r7a.8xlarge']
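The grouping itself is easy to reproduce. A sketch using a hardcoded spec table (the core and memory figures match the log above; presumably the real code looks these values up from EC2):

# Reproduce the memory/core bucketing shown in the log. The spec table is
# hardcoded here for illustration.
from collections import defaultdict

SPECS = {  # instance type -> (cores, memory_gb)
    'm7a.large': (2, 8), 'm7a.xlarge': (4, 16), 'm7a.2xlarge': (8, 32),
    'm7a.4xlarge': (16, 64), 'r7a.large': (2, 16), 'r7a.xlarge': (4, 32),
    'r7a.2xlarge': (8, 64), 'r7a.4xlarge': (16, 128), 'r7a.8xlarge': (32, 256),
}

buckets = defaultdict(lambda: defaultdict(list))
for instance_type, (cores, memory_gb) in SPECS.items():
    buckets[memory_gb][cores].append(instance_type)

for memory_gb in sorted(buckets):
    print(f"{memory_gb} GB")
    for cores in sorted(buckets[memory_gb]):
        types = buckets[memory_gb][cores]
        print(f"    {len(types)} instance type(s) with {cores} core(s): {types}")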
But then it starts culling unnecessarily, as ParallelCluster/Slurm can handle 9 compute resources...
INFO: Configuring od-8-gb queue:
INFO:     Adding od-8gb-2-cores compute resource: ['m7a.large']
INFO: Configuring od-16-gb queue:
INFO:     Adding od-16gb-2-cores compute resource: ['r7a.large']
INFO:     Skipping od-16gb-4-cores compute resource: ['m7a.xlarge'] to reduce number of CRs.
INFO: Configuring od-32-gb queue:
INFO:     Adding od-32gb-4-cores compute resource: ['r7a.xlarge']
INFO:     Skipping od-32gb-8-cores compute resource: ['m7a.2xlarge'] to reduce number of CRs.
INFO: Configuring od-64-gb queue:
INFO:     Adding od-64gb-8-cores compute resource: ['r7a.2xlarge']
INFO:     Skipping od-64gb-16-cores compute resource: ['m7a.4xlarge'] to reduce number of CRs.
INFO: Configuring od-128-gb queue:
INFO:     Adding od-128gb-16-cores compute resource: ['r7a.4xlarge']
INFO: Configuring od-256-gb queue:
INFO:     Adding od-256gb-32-cores compute resource: ['r7a.8xlarge']
INFO: Created 6 queues with 6 compute resources
I would like to have a 16-core 64 GB machine, an 8-core 32 GB machine, etc. How do I disable/modify this "culling"? I would argue we should only start culling when we exceed what ParallelCluster can handle.
We can now have 50 Slurm queues per cluster, 50 compute resources per queue, and 50 compute resources per cluster! See:
https://docs.aws.amazon.com/parallelcluster/latest/ug/configuration-of-multiple-queues-v3.html
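A simple guard against those documented limits would look something like this sketch (the 50/50/50 figures come from the page linked above; the function and its layout are illustrative):

# Check a planned queue layout against the documented ParallelCluster limits:
# 50 queues per cluster, 50 CRs per queue, and 50 CRs per cluster.
def fits_parallelcluster_limits(queues, max_queues=50, max_crs_per_queue=50, max_crs=50):
    total_crs = sum(len(q['ComputeResources']) for q in queues)
    return (len(queues) <= max_queues
            and all(len(q['ComputeResources']) <= max_crs_per_queue for q in queues)
            and total_crs <= max_crs)

Nine queues with one CR each passes easily, so culling only needs to kick in once a check like this fails.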