Scaling ECS down is not a straightforward task. If the decision is based on a single metric alone, taking an instance down can cause two problems:
- Forcefully removing a host while containers are running on it cuts off active connections, causing service downtime
- Removing an instance based on a utilization / capacity metric alone may cause an endless loop of scaling in and back out
Once identified, the target is moved to the "draining" state, and a new instance of each of its tasks is raised on an available host. Once the new containers are ready, the draining instance starts draining connections from its active tasks. Once the draining process is complete, the instance is terminated.
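The drain-then-terminate sequence described above could be sketched roughly as follows. This is a minimal illustration, not the code in `ecscale.py`: the function names are hypothetical, and the clients are passed in (in practice they would be `boto3.client("ecs")` and `boto3.client("autoscaling")`) so the logic can be exercised without AWS access.

```python
def drain_instance(ecs_client, cluster, container_instance_arn):
    """Ask ECS to move one container instance to the DRAINING state."""
    response = ecs_client.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    # Return the statuses ECS reports back for the updated instances.
    return [ci["status"] for ci in response.get("containerInstances", [])]


def is_drained(ecs_client, cluster, container_instance_arn):
    """True when the instance is DRAINING and no tasks are left running on it."""
    response = ecs_client.describe_container_instances(
        cluster=cluster,
        containerInstances=[container_instance_arn],
    )
    instance = response["containerInstances"][0]
    return instance["status"] == "DRAINING" and instance["runningTasksCount"] == 0


def terminate_if_drained(ecs_client, asg_client, cluster,
                         container_instance_arn, ec2_instance_id):
    """Terminate the EC2 host only after draining has completed."""
    if not is_drained(ecs_client, cluster, container_instance_arn):
        return False
    # Decrementing the desired capacity stops the ASG from replacing the host.
    asg_client.terminate_instance_in_auto_scaling_group(
        InstanceId=ec2_instance_id,
        ShouldDecrementDesiredCapacity=True,
    )
    return True
```

Because draining takes time, a scheduled job would typically call `drain_instance` on one run and `terminate_if_drained` on a later run.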
- Throw the `ecscale.py` code to AWS Lambda, providing a relevant role to handle ECS and autoscaling (instructions ahead)
- Set a repeated run (recommended every 60 minutes, using a CloudWatch Events trigger for Lambda)
- That's it... Your ECS hosts are being gracefully removed if needed. No metrics/alarms needed
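If you prefer to create the schedule from code rather than the console, something along these lines might work. It is only a sketch: the rule name `ecscale-schedule`, the target ID and the function names are made up for illustration, and `events_client` is assumed to be a `boto3.client("events")` instance passed in by the caller.

```python
def schedule_expression(minutes):
    """Build a rate expression; the unit must be singular when the value is 1."""
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


def schedule_lambda(events_client, function_arn,
                    rule_name="ecscale-schedule", minutes=60):
    """Create (or update) a scheduled rule and point it at the Lambda function."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression(minutes),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "ecscale", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

The Lambda function also needs a resource-based permission that allows the rule to invoke it. The console adds this automatically when the trigger is configured there.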
- `SCALE_IN_CPU_TH = 30` # Below this EC2 average CPU metric, scaling would take action
- `SCALE_IN_MEM_TH = 60` # Below this cluster average memory metric, scaling would take action
- `FUTURE_MEM_TH = 70` # Below this predicted future memory metric, scaling would take action
- `DRAIN_ALL_EMPTY_INSTANCES = True` # Set to False to prevent scaling in more than one instance at a time
- `ASG_PREFIX = ''` # Use this when your ASG naming convention requires a prefix (e.g. 'ecs-')
- `ASG_SUFFIX = ''` # Use this when your ASG naming convention requires a suffix (e.g. '-live')
- `ECS_AVOID_STR = 'awseb'` # Use this to avoid clusters containing a specific string (e.g. Elastic Beanstalk clusters)
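To illustrate how the naming and filtering parameters might be applied, here is a small sketch. It assumes the ASG name is the cluster name wrapped in the prefix and suffix; the helper names are hypothetical and are not taken from `ecscale.py`.

```python
ASG_PREFIX = "ecs-"      # example value; the default is ''
ASG_SUFFIX = "-live"     # example value; the default is ''
ECS_AVOID_STR = "awseb"


def asg_name_for(cluster_name, prefix=ASG_PREFIX, suffix=ASG_SUFFIX):
    """Derive the ASG name from the cluster name (assumed naming convention)."""
    return f"{prefix}{cluster_name}{suffix}"


def should_skip(cluster_name, avoid=ECS_AVOID_STR):
    """Skip clusters whose name contains the avoid string; '' disables the filter."""
    return bool(avoid) and avoid in cluster_name


clusters = ["web", "awseb-myapp-env", "workers"]
targets = [asg_name_for(name) for name in clusters if not should_skip(name)]
print(targets)
```

With the example values above, the Elastic Beanstalk cluster is filtered out and the remaining two clusters map to `ecs-web-live` and `ecs-workers-live`.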
- When creating the Lambda function, you'll be asked to select a role or create a new one; choose to create a new role
- Provide the JSON from `policy.json` as the role policy
- All set: ecscale is now allowed to do its work
- Iterate over existing ECS clusters
- Check a cluster's ability to scale-in based on predicted future memory reservation capacity
- Look for empty hosts that can be scaled in
- Look for least utilized host
- Choose a candidate and put it in the draining state
- Terminate a draining host that has no running tasks and decrease the desired number of instances
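The candidate-selection part of the flow above could look something like the sketch below. It is an assumption-laden illustration: the dictionary keys (`registered_mem`, `remaining_mem`, `running_tasks`) are invented for the example, and the "future" reservation is estimated conservatively by removing the largest host, which may differ from how `ecscale.py` computes it.

```python
FUTURE_MEM_TH = 70
DRAIN_ALL_EMPTY_INSTANCES = True


def future_memory_reservation(instances):
    """Percent of memory that would be reserved if the largest host were removed."""
    total = sum(i["registered_mem"] for i in instances)
    used = sum(i["registered_mem"] - i["remaining_mem"] for i in instances)
    largest = max(i["registered_mem"] for i in instances)
    remaining_capacity = total - largest
    if remaining_capacity <= 0:
        return float("inf")  # removing a host would leave no capacity at all
    return 100.0 * used / remaining_capacity


def pick_candidates(instances):
    """Return the hosts to drain, or an empty list if scaling in is not safe."""
    if not instances or future_memory_reservation(instances) >= FUTURE_MEM_TH:
        return []
    # Prefer hosts that run nothing: draining them disturbs no tasks.
    empty = [i for i in instances if i["running_tasks"] == 0]
    if empty:
        return empty if DRAIN_ALL_EMPTY_INSTANCES else empty[:1]
    # Otherwise pick the host with the least memory in use.
    least = min(instances, key=lambda i: i["registered_mem"] - i["remaining_mem"])
    return [least]
```

Checking the predicted reservation before choosing a host is what guards against the scale-in / scale-out loop: a host is only drained if the remaining hosts can absorb the current load with headroom to spare.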