st2api being bogged down by MongoEngine #4030
Comments
Maybe this has something to do with it? https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-comapred-to-native-pymongo-usage I'm sure the parent workflow is very large, with all of the results in one big workflow object. |
And here I was wondering what I was going to do in the morning. ;) |
I am also hitting this issue with 100s of items in a list. Let me know if there is anything I can provide to help troubleshoot. @nmaludy was kind enough to share his strategy for dealing with this, but I would love to get a bump on getting a fix into st2. |
Did you encounter this issue when hitting that API endpoint using the WebUI (or some other method)? #4300 should help with WebUI performance a lot once that PR is merged and the WebUI starts utilizing it (should be in the next release). Besides that, I hope to have a look at this exact issue when I get a chance (likely late next week). |
@Kami I hit this issue by running a Mistral workflow with a lot of |
This isn't so much a "UI rendering slow" issue and more of a "my workflow slows down as the number of executions increases" issue. We use with-items to handle deployment, so with 5-10 items running subworkflows we have great performance; with 50 items, we have bad performance :-) |
Here is a good quantification of the issue. Notice in a 50-item with-items execution how each iteration takes longer and longer; it gets worse if you go to 100, 200, etc. As @nmaludy said, the st2api calls can be seen increasing in latency in the logs. The larger the task size in the workflow (the output specifically), the longer it takes.
|
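For anyone who wants to quantify this from the logs, here is a minimal sketch that pulls the per-request timings out of st2api's log so the upward trend is easy to see. It assumes the "Total time:" line format quoted in the next comment and a log path of /var/log/st2/st2api.log, both of which may differ per install.

```python
import re

# Assumed log location; adjust for your install.
LOG_PATH = "/var/log/st2/st2api.log"

# Matches lines like:
#   [6397] [INFO] [POST] URI /v1/actionexecutions Total time: 100.877137899
LINE_RE = re.compile(r"\[(GET|POST|PUT|DELETE)\] URI (\S+) Total time: ([\d.]+)")

def request_timings(path=LOG_PATH):
    """Yield (method, uri, seconds) for every timed request found in the log."""
    with open(path) as handle:
        for line in handle:
            match = LINE_RE.search(line)
            if match:
                method, uri, seconds = match.groups()
                yield method, uri, float(seconds)

if __name__ == "__main__":
    # Print POSTs to the execution endpoint in order, so growth over time is visible.
    for method, uri, seconds in request_timings():
        if method == "POST" and "actionexecutions" in uri:
            print("%s %s took %.2fs" % (method, uri, seconds))
```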
I'll try to reproduce this issue today. The weird thing is that the POST operation takes a long time - "[6397] [INFO] [POST] URI /v1/actionexecutions Total time: 100.877137899". The POST (aka schedule action execution) operation is quite simple, so there is no reason it should take that long (it inserts a couple of objects into the database and publishes a message on the message bus). It's likely that some other API operation (e.g. GET) is bogging it down, or that there is some weird contention going on on the message bus or similar. |
After some digging in, I wonder if the issue lies somewhere else and not in st2api. If I run the workflow above, which uses Mistral to run a lot of actions in a short time-frame, I notice large CPU utilization by mistral-server and, partially, st2rulesengine. The st2rulesengine utilization is expected, because each execution results in at least one trigger instance. So could this issue simply be due to resource exhaustion caused by mistral-server hogging most of the available CPU? |
I tried to replicate this issue on a single-server setup. After some more instrumentation and digging in, I noticed the same behavior as reported above. When scheduling a lot of executions in a short time-frame through a Mistral workflow, mistral-server starts hogging a lot of CPU and, as such, everything else (including st2api) slows down due to the lack of resources. 10 - 20 tasks work relatively fine, but as soon as I move above 30 or so, things start to get very slow (and I noticed very high server CPU utilization, aka the server basically being bogged down). Things get even worse if I change the with-items action from core.local to some Mistral workflow action, because it causes additional load on Mistral. If I concurrently schedule a lot of runs of

In short - I don't think there is any quick or easy code change we can do at the moment. The root cause seems to be related to inefficiencies in Mistral and there is nothing we can do, besides advising users to dedicate more resources to Mistral and other StackStorm components (ideally running StackStorm services across multiple correctly sized servers). One thing which may also help (but I didn't have time to verify it) is splitting "one large Mistral workflow with a lot of child tasks / actions" into multiple Mistral workflows with fewer child tasks.

Going forward, we will ensure with-items with a lot of items (e.g. scheduling a lot of actions through an Orquesta workflow) has good performance in Orquesta, our new workflow engine.

@nmaludy Are you running all of the StackStorm components on a single server?

@m4dcoder See above. One of our "benchmarks" for Orquesta should also be an orquesta workflow with a lot of |
I don't think this is with-items only. This seems like a known issue with the scheduler in Mistral. If you are executing a lot of concurrent workflows through Mistral, Mistral will try to schedule all the workflows and tasks at once. As tasks are completed, it will schedule the next tasks immediately. Soon there are a lot of tasks waiting to be executed, and each workflow will take longer and longer to complete because there are not enough st2 action runners to execute them. At this point in the lifecycle for Mistral, since it's going to be deprecated soon, we don't intend to put a lot of time into fixing the scheduler. However, @Kami we have a line item on our roadmap to address this in Orquesta, our new workflow engine. The workaround on Mistral here is to define a concurrency policy in st2 @ https://docs.stackstorm.com/reference/policies.html#concurrency for the workflow being executed by the with-items task. Also, please note that if you average the completion time of the workflow executions in the with-items, the average time to completion is more or less the same: the same amount of work (or tasks) gets completed over time as a function of the number of system resources and st2 action runners. |
@m4dcoder I have tested this with a concurrency policy, and the same issue exists. I'm OK with saying this is a won't-fix for Mistral, but the core issue of st2api being slow seems like it will exist with Orquesta too, right? @Kami with regards to the performance in st2api, the POST calls to st2api should not be taking 3-5 seconds, correct?
|
EDITED... WORKFLOW REPLACED WITH #4030 (comment) |
@jbrownsc we don't support concurrency policies on inline subworkflows (i.e. subworkflows in a Mistral workbook). Try to make |
Here is the new workflow. with-items-test.yaml
with-items-test-subworkflow.yaml
workflows/with-items-test.yaml
workflows/with-items-test-subworkflow.yaml
|
From tcpdump
|
I turned the MongoDB profiling level to 2 (https://docs.mongodb.com/manual/reference/method/db.setProfilingLevel/) and I see a lot of
|
Also looking at the MongoDB query log, I see about 399 queries in 1 second.
|
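For reference, a minimal sketch of that kind of inspection with plain pymongo, assuming a default local MongoDB and the st2 database (profiling level 2 records every operation, so only leave it on temporarily):

```python
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://127.0.0.1:27017")  # assumed default local install
db = client["st2"]

# Equivalent of db.setProfilingLevel(2) in the mongo shell: profile every operation.
db.command("profile", 2)

# ... run the workflow, then look at the slowest recorded operations ...
for op in db["system.profile"].find().sort("millis", DESCENDING).limit(10):
    print(op.get("op"), op.get("ns"), op.get("millis"), "ms")

# Rough queries-per-second count, to reproduce the "~399 queries in 1 second" observation.
pipeline = [
    {"$group": {
        "_id": {"$dateToString": {"format": "%Y-%m-%dT%H:%M:%S", "date": "$ts"}},
        "count": {"$sum": 1},
    }},
    {"$sort": {"count": -1}},
    {"$limit": 5},
]
for bucket in db["system.profile"].aggregate(pipeline):
    print(bucket["_id"], bucket["count"], "ops")

# Turn profiling back off when done.
db.command("profile", 0)
```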
I still think @nmaludy pointed out the DictField serialization glory. MongoEngine/mongoengine#1230 This also references MongoEngine/mongoengine#1230 (comment) which encapsulates the index issue. |
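For anyone who wants to experiment with the kind of workaround typically suggested in those mongoengine threads, the idea is to avoid instantiating full Document objects for large DictField payloads. A rough sketch follows; the model below is illustrative only, not StackStorm's actual execution model:

```python
from mongoengine import Document, DictField, StringField, connect

connect("st2")  # assumes a default local MongoDB

class ExampleExecution(Document):
    """Illustrative stand-in for a document with a potentially huge result dict."""
    status = StringField()
    result = DictField()
    meta = {"collection": "example_executions"}

# Normal mongoengine access converts the whole (possibly multi-MB) result dict
# into Document/BaseDict objects, which is where the time goes:
slow = [doc.status for doc in ExampleExecution.objects]

# Workaround 1: only fetch the fields you need, skipping the big result field.
fast_fields = ExampleExecution.objects.only("status")

# Workaround 2: skip mongoengine object construction entirely and get raw dicts
# straight from pymongo (no Document instances are built).
for doc in ExampleExecution.objects.as_pymongo():
    print(doc["_id"], doc.get("status"))
```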
What does the server and service load look like (CPU and IO usage) when you experience those issues? I was able to reproduce slow API responses the other day, but that was due to mistral-server hogging most of the CPU, aka the server being overloaded, and nothing in particular to do with st2api. Again, I would need more information (specifically server and service CPU usage) so we can get a clearer picture of what is going on. |
When I've seen this happen, |
Are all of those services running on the same box? Did you try pinning the st2api and mistral-server processes to different cores / CPUs? Both services are single-threaded. On a related note - what does MongoDB resource utilization look like? As mentioned before, the "schedule execution" API operation is relatively simple; it does the following:
None of those operations are particularly complex. Only 1) and 2) could get somewhat slower if you are doing something crazy with the config / action parameters. So it's unlikely that the operation itself is the culprit. It looks like I will need to try to reproduce this again on a server with more resources. Last time I reproduced it, mistral-server was the bottleneck and the cause of slow st2api responses and large CPU usage. On a related note - I'm also working on some improvements to our metrics framework (already made a lot of those, targeted for the v2.9.0 release). It will allow StackStorm services to easily send CPU and memory usage to the metrics backend (statsd at the moment). This should make troubleshooting issues like this much easier in the future. |
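Until that metrics work lands, one way to capture the per-service CPU/memory numbers being asked about here is a small psutil-based sampler like the sketch below; the service name fragments are assumptions and may need adjusting for a given deployment.

```python
import time
import psutil

# Process names / cmdline fragments to watch; adjust for your deployment.
SERVICES = ("st2api", "st2rulesengine", "st2actionrunner", "mistral", "mongod", "rabbitmq")

def sample(interval=5.0):
    """Print CPU and RSS usage of matching processes over `interval` seconds."""
    procs = []
    for proc in psutil.process_iter(attrs=["name", "cmdline"]):
        ident = " ".join(proc.info["cmdline"] or []) or (proc.info["name"] or "")
        if any(svc in ident for svc in SERVICES):
            proc.cpu_percent(None)  # prime the CPU counter
            procs.append((ident[:60], proc))
    time.sleep(interval)
    for ident, proc in procs:
        try:
            cpu = proc.cpu_percent(None)
            rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
            print("%-60s cpu=%5.1f%% rss=%7.1f MiB" % (ident, cpu, rss_mb))
        except psutil.NoSuchProcess:
            pass

if __name__ == "__main__":
    sample()
```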
st2api pegs 1 CPU. Mongo, Postgres, Redis, and RabbitMQ all seem like they're doing something, but nothing major. Disk access is around 60/100 write IOPS on Mongo. ST2 Process Usage
Mongo IO
MongoDB CPU
RabbitMQ
PostgreSQL
|
We have provided a pretty usable test case in #4030 (comment)
|
On a related note - which version of StackStorm are you using? I tested it with the latest version (Mistral callbacks are used and not the query service). In any case, I / we need to approach this issue more systematically. There are a lot of potential causes mentioned in this issue, but some of them could be unrelated. The first step should be trying to replicate the st2api execution schedule issue by taking Mistral out of the loop (using apachebench directly on POST /v1/executions - I will look into that). It's unlikely that the mongoengine issue mentioned above (the large dict issue) would affect this, because as mentioned above, a POST to /v1/executions doesn't include any large data, and a small and simple object is inserted into the database. |
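In the same spirit, here is a rough way to exercise that endpoint without Mistral in the loop, using Python/requests instead of apachebench. The host/port, auth token source, and action ref are assumptions for a default single-node install; any cheap action will do.

```python
import os
import time
import requests

API_URL = "http://127.0.0.1:9101/v1/executions"   # assumed default st2api port
TOKEN = os.environ.get("ST2_AUTH_TOKEN", "")       # e.g. obtained via `st2 auth`
PAYLOAD = {"action": "core.local", "parameters": {"cmd": "true"}}  # example action

def hammer(count=50):
    """POST `count` executions sequentially and print each request's latency."""
    headers = {"X-Auth-Token": TOKEN, "Content-Type": "application/json"}
    timings = []
    for i in range(count):
        start = time.time()
        resp = requests.post(API_URL, json=PAYLOAD, headers=headers)
        elapsed = time.time() - start
        timings.append(elapsed)
        print("request %3d -> HTTP %s in %.3fs" % (i, resp.status_code, elapsed))
    print("avg %.3fs, max %.3fs" % (sum(timings) / len(timings), max(timings)))

if __name__ == "__main__":
    hammer()
```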
I was able to reproduce it. There is a lot going on in that controller, so it will be a slow process to find the root cause (it could be a lot of small, short blocking operations, such as rendering live params, etc., blocking the main eventlet loop). From what I've seen, "get" queries in that API endpoint don't seem to be an issue. There are quite a few of them, but all of them return in sub-millisecond time and use an index, and none of them return a lot of data. |
@Kami Thanks for looking further. Perhaps the payload of the POST is not the issue, but the concurrency, as you mentioned. I don't know how st2api > mongoengine > mongo will handle the createIndexes operations, but I can only assume it's causing some form of blocking. Is there any way we can run your apache bench test with
Version
|
Some more updates - I added a lot of instrumentation, and most of the operations performed inside that API controller (including the mongo operations) finish relatively fast. The problem is that this controller does a lot of things (some of them duplicated) and they simply add up cumulatively and result in poor performance. I still don't think it's specifically related to the mongoengine issue mentioned above, because that endpoint works with small documents, and even if I comment out the most "heavy" MongoDB operations, performance is still bad. So, having said that, the only option we might have to make the performance of that API endpoint somewhat decent (compared to scheduling actions directly via the rules engine) might be re-creating that API endpoint from the ground up, keeping the list of steps it performs to a minimum. In the meantime, a not-so-great but still workable workaround would be to run more st2api processes and load-balance between them. |
I'm 100% ok with
Going to test now. |
Scaling the workers to 10, using a concurrency policy of 5, seems to have given a pretty sizable increase in performance. None of the requests are over 2.5 seconds, whereas previously they would climb to 12 seconds as time went on.
Previous runtime with st2api worker process (1 worker + 1 thread)
Current runtime with st2api worker process (10 workers + 1 thread)
|
Yeah, this should help. Single-process performance is not great, but it scales horizontally, so running more workers (until saturating all the available CPU cycles) should definitely help. If you can easily do it, it would also be great if you could test the changes from #4331 - I wonder how much they increase performance in your specific scenario. I can let you know once the PR is merged into master and new v2.9dev packages are built. |
Today I tested changes from #4331 using the latest v2.9dev packages and workflows from #4030 (comment) on an AWS t2.medium instance. I initially tested with range 250, but then changed it to 50 so things complete faster (on v2.8.1 and a small AWS instance, range 250 simply takes way too long).

v2.8.1 stable

ubuntu@ip-172-31-24-123:/opt/stackstorm/packs/default$ st2 --version
st2 2.8.1, on Python 2.7.12

1) 1 st2api gunicorn worker process (default setup) - range 250

| + 5b96348bbe2a4332e2a2dca1 | default.with-items-test | st2admin | succeeded | Mon, 10 Sep 2018 09:08:27 | Mon, 10 Sep 2018 10:34:20 UTC |
| | | | (5153s elapsed) | UTC | |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+

5153 seconds

2) 1 st2api gunicorn worker process (default setup) - range 50

| | | | (5153s elapsed) | UTC | |
| + 5b964a19be2a43669a8498d6 | default.with-items-test | st2admin | succeeded (619s elapsed) | Mon, 10 Sep 2018 10:40:25 | Mon, 10 Sep 2018 10:50:44 UTC |
| | | | | UTC | |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+

619 seconds

v2.9dev

ubuntu@ip-172-31-18-251:~$ st2 --version
st2 2.9dev (c4db536), on Python 2.7.12

1) 1 st2api gunicorn worker process (default setup) - range 250

| + 5b96364c88b07d3318ad9041 | default.with-items-test | st2admin | succeeded | Mon, 10 Sep 2018 09:15:56 | Mon, 10 Sep 2018 09:51:44 UTC |
| | | | (2148s elapsed) | UTC | |
+----------------------------+-------------------------+--------------+--------------------------+-----------------------------+-------------------------------+

2148 seconds

2) 1 st2api gunicorn worker process (default setup) - range 50

| + 5b96414d88b07d3318ad9bfc | default.with-items-test | st2admin | succeeded (157s elapsed) | Mon, 10 Sep 2018 10:02:53 | Mon, 10 Sep 2018 10:05:30 UTC |
| | | | | UTC | |

157 seconds

From those numbers it looks like the speed-up is more than 50% for this particular use case (scheduling actions via the Mistral / StackStorm API) and workflow. A couple of things:

In my setup, scaling st2api worker processes beyond 1 didn't make sense, because I used a small AWS instance and the CPU was already fully utilized by all the StackStorm services with a single st2api worker process (in fact, as expected, performance was worse, because it just saturated an already over-saturated CPU even more). |
Is there any update on this? We still see rather slow performance with MongoDB for large datasets. |
#4846 should help substantially with this issue. |
Working recently with a bunch of large workflows and seeing the st2api service using 100% CPU. I turned on a WSGI profiler (https://medium.com/@maxmaxmaxmax/measuring-performance-of-python-based-apps-using-gunicorn-and-cprofile-ee4027b2be41) and am seeing profile stacktraces like so:
And general times reaching 100+ seconds as I let it go for longer and longer.
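For anyone who wants to reproduce that profiling setup, the general idea is simply a cProfile wrapper around the WSGI app. A minimal sketch is below; it is illustrative, not the exact middleware from the linked post, and the usage names are hypothetical.

```python
import cProfile
import pstats

def profiled(app, top=25):
    """Wrap a WSGI application and dump the hottest calls after each request."""
    def middleware(environ, start_response):
        profiler = cProfile.Profile()
        # This profiles building the response iterable (not iterating it), which
        # is enough to spot hotspots such as document deserialization.
        response = profiler.runcall(app, environ, start_response)
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(top)
        return response
    return middleware

# Usage (names hypothetical): in your WSGI module or gunicorn config,
#   application = profiled(application)
```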
This is reproduced using the following action workflow: