Move message processing on workers into separate thread #103

Closed
DasSkelett opened this issue Aug 5, 2023 · 0 comments · Fixed by #106
Assignees
Labels
enhancement New feature or request

Comments

@DasSkelett
Member

Problem

Our wgkex workers reconnect to the MQTT broker very frequently, roughly every 40 seconds on average.
This appears to happen because, whenever a burst of messages comes in, the worker is so busy handling them (including the heavy netlink processing) on the main MQTT loop thread that it may not be able to send the MQTT ping request when it is due (every 5 seconds in our current configuration).
If the pings are not sent (and answered) right away, the MQTT client deems the connection faulty, closes it (TCP RST) and reconnects to the broker with a new one.

I believe eclipse-paho/paho.mqtt.python#328 also contributes to the problem showing up like this: as far as I can tell, Paho MQTT waits only one loop iteration after the keepalive (= ping interval) timer has expired, and giving it a bit more time would make sense. (The MQTT broker waits 1.5 * keepalive after the last ping before it cuts the connection, i.e. 0.5 * keepalive after it should have received one.)
That said, due to the burstiness of our traffic and the amount of processing required for each message, this extra 0.5 * keepalive would not help us much, especially not with a keepalive of 5.
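
To make the timing above concrete, here is a small sketch of the arithmetic. It only restates the 1.5 * keepalive rule mentioned above; the function name and the returned keys are made up for illustration and are not part of Paho or wgkex.

```python
def keepalive_deadlines(keepalive: float) -> dict:
    """Return timing bounds, in seconds since the last ping, for a keepalive.

    Hypothetical helper: it encodes only the broker-side rule quoted above.
    """
    broker_cutoff = 1.5 * keepalive  # broker cuts the connection after this
    return {
        "ping_due": keepalive,               # when the client should ping
        "broker_cutoff": broker_cutoff,
        "grace": broker_cutoff - keepalive,  # slack after the ping was due
    }


for seconds in (5, 10, 20):
    print(seconds, keepalive_deadlines(seconds))
```

With a keepalive of 5, the broker-side slack is only 2.5 seconds, which a single burst of netlink work can plausibly exceed.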

To investigate, I also played around with the keepalive time. Bumping it should reduce the chance that a ping is due right after a burst, and increase the average time the worker has to work through each burst before the next ping is due.
10 seconds didn't make any difference; 20 seconds helped a bit, maybe reducing the reconnects to a third or a fifth.

 

All MQTT packets are shown in black, RSTs in red, from a packet capture on docker04, where Mosquitto is running. Notice how the resets (4 of them, as we have 4 workers/gateways) always come after the bursts.
[image: wgkex-reconnect]

Packet capture on gw04 only, keepalive at 10:
[image]

See also http://www.steves-internet-guide.com/mqtt-keep-alive-by-example/

Suggested Solution

Move the message processing, especially the netlink work, into a separate thread (or coroutine, or similar) to unblock the main loop as soon as possible. This could use an internal FIFO queue onto which the on_message callback pushes the messages. The alternative of spawning a new thread or coroutine per message is most likely too expensive for the number of messages we have to deal with.

@DasSkelett DasSkelett added the enhancement New feature or request label Aug 5, 2023
awlx added a commit that referenced this issue Sep 18, 2023
@awlx awlx self-assigned this Sep 18, 2023
@awlx awlx closed this as completed in #106 Sep 18, 2023
2 participants