For running large workflows, you'll typically need to scale the controller to match.
You cannot horizontally scale the controller.
v3.0 and after

As of v3.0, the controller supports a hot standby for High Availability.
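For example, assuming the controller runs as the standard `workflow-controller` Deployment in the `argo` namespace, a second replica can act as the standby; leader election keeps only one replica active:

```bash
# One active controller plus one hot standby; leader election decides which replica is active.
kubectl -n argo scale deployment workflow-controller --replicas 2
```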
You can scale the controller vertically in these ways:
If you observe the Controller using all of its CPU or memory requests, increase those requests (see the sketch below).
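A minimal sketch of raising the requests on the `workflow-controller` Deployment (the values are placeholders, not recommendations):

```yaml
# Excerpt from the workflow-controller Deployment; values are illustrative only.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          resources:
            requests:
              cpu: "2"
              memory: 2Gi
```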
If you have sufficient CPU cores, you can take advantage of them with more goroutines (see the sketch after this list):
- If you have many Workflows and you notice they're not being reconciled fast enough, increase `--workflow-workers`.
- If you're using `TTLStrategy` in your Workflows and you notice they're not being deleted fast enough, increase `--workflow-ttl-workers`.
- If you're using `PodGC` in your Workflows and you notice the Pods aren't being deleted fast enough, increase `--pod-cleanup-workers`.
v3.5 and after

- If you're using a lot of `CronWorkflows` and they don't seem to be firing on time, increase `--cron-workflow-workers`.
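The worker counts above are arguments to the Controller binary, so a hedged sketch of raising them looks like this (the numbers are placeholders, not recommendations):

```yaml
# Excerpt from the workflow-controller Deployment's container spec; values are illustrative.
containers:
  - name: workflow-controller
    args:
      - --workflow-workers=64
      - --workflow-ttl-workers=8
      - --pod-cleanup-workers=8
      - --cron-workflow-workers=16  # v3.5 and after
```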
The K8S client library rate limits the requests the Controller can send to the K8S API Server.
If you frequently see messages similar to this in the Controller log (issued by the library):
```
Waited for 7.090296384s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argo/workflowtemplates/s2t
```
Or, in >= v3.5, if you see warnings similar to this (it could be any CR, not just `WorkflowTemplate`):

```
Waited for 7.090296384s, request:GET:https://10.100.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argo/workflowtemplates/s2t
```
Then, if your K8S API Server can handle more requests:
- Increase both the `--qps` and `--burst` arguments for the Controller (see the sketch below). The `qps` value indicates the average number of queries per second allowed by the K8S Client. The `burst` value is the number of queries the Client can send in a burst before it starts enforcing `qps`, so typically `burst` > `qps`. If not set, the default values are `qps=20` and `burst=30` (as of v3.5; refer to `cmd/workflow-controller/main.go` in case the values change).
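As a sketch of one way to apply this (assuming the controller container is the first container in the `workflow-controller` Deployment and already has an `args` list; the values are placeholders):

```bash
# Illustrative only: append higher client-side rate limits to the controller's arguments.
kubectl -n argo patch deployment workflow-controller --type json -p '[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--qps=50"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--burst=75"}
]'
```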
Rather than running a single installation in your cluster, run one installation per namespace using the `--namespaced` flag.
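A minimal sketch of what that looks like in one namespace's controller arguments (the managed namespace name is an example):

```yaml
# Excerpt: a controller that only manages Workflows in a single namespace.
containers:
  - name: workflow-controller
    args:
      - --namespaced
      # Optionally manage a different single namespace instead of the controller's own:
      # - --managed-namespace=my-team
```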
Within a cluster, you can use instance IDs to run N Argo instances.
Create one namespace for each Argo instance, e.g. `argo-i1`, `argo-i2`.
Edit `workflow-controller-configmap.yaml` for each namespace to set an instance ID:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  instanceID: i1
```
v2.9 and after
You may need to pass the instance ID to the CLI:
```bash
argo --instanceid i1 submit my-wf.yaml
```
You do not need to have one instance ID per namespace; you could have many or few.
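Under the hood, the Controller only reconciles Workflows labelled with its instance ID, and the CLI adds that label for you when you pass `--instanceid`. If you create Workflows another way (e.g. with `kubectl`), a hedged sketch of the label (key name as of recent versions; check the docs for your version) looks like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-wf-
  labels:
    # Only the controller configured with instanceID: i1 reconciles this Workflow.
    workflows.argoproj.io/controller-instanceid: i1
```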
In order to protect users against infinite recursion, the controller has a default maximum recursion depth of 100 calls to templates.
This protection can be disabled with the environment variable `DISABLE_MAX_RECURSION=true`.
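For example, the variable is set on the Controller's container like any other environment variable:

```yaml
# Excerpt from the workflow-controller Deployment's container spec.
containers:
  - name: workflow-controller
    env:
      - name: DISABLE_MAX_RECURSION
        value: "true"
```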
See also Running At Massive Scale.