Problem
Our wgkex workers reconnect to the MQTT broker very frequently, something like every 40 seconds on average.
This appears to be because, whenever a burst of messages comes in, the worker is so busy handling them, including the heavy netlink processing, all on the main MQTT loop thread, that it may not be able to send the MQTT ping request when it is due (every 5 seconds with our current configuration).
If the pings are not sent (and answered) right away, the MQTT client deems the connection faulty, closes it (TCP RST) and reconnects to the broker with a new one.
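To make the mechanism concrete, here is a toy model (not Paho's actual loop) of a single-threaded client whose ping can only go out once the thread is free again. The function name, the busy intervals and the rescheduling rule are illustrative assumptions, not measurements from our workers.

```python
def ping_lateness(keepalive, busy_periods):
    """Return how late each keepalive ping is sent in a single-threaded loop.

    busy_periods: sorted, non-overlapping (start, end) intervals in seconds
    during which the loop thread is blocked handling messages.

    Toy model (assumption): a ping that falls due while the thread is blocked
    is sent only when that blocking period ends, and the next ping is due one
    keepalive after the previous one was actually sent.
    """
    lateness = []
    due = keepalive
    horizon = busy_periods[-1][1] if busy_periods else 0
    while due <= horizon:
        sent = due
        for start, end in busy_periods:
            if start <= due < end:
                sent = end  # blocked: the ping has to wait for the handler
                break
        lateness.append(sent - due)
        due = sent + keepalive
    return lateness


# Made-up example: keepalive of 5 s, one 8 s burst and one 4 s burst.
print(ping_lateness(5, [(3, 11), (40, 44)]))
```

In this model only the pings that fall due inside a burst are late, which would match resets showing up right after bursts.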
I believe eclipse-paho/paho.mqtt.python#328 also contributes to the problem showing up like this: as far as I can tell, Paho MQTT only waits one loop iteration after the keepalive (= ping interval) timer has expired, and giving it a bit more time would make sense. (The MQTT broker waits 1.5 * keepalive after the last ping before it cuts the connection, i.e. 0.5 * keepalive after it should have received one.)
That said, due to the burstiness of our traffic and the amount of processing required for handling each message, these 0.5 * keepalive would not help us much, especially not with a keepalive of 5.
To investigate, I also played around with the keepalive time: bumping it should reduce the chance that a ping falls due right after a burst and increase the average time the worker has for working through each burst before the ping is due.
10 seconds didn't make any difference; 20 seconds helped a tad, reducing the reconnects to maybe a third or a fifth.
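For reference, the broker-side slack for the keepalive values tried above works out as follows. This is a minimal sketch that only applies the 1.5 * keepalive rule mentioned earlier; the function name is made up for illustration.

```python
def broker_grace_seconds(keepalive):
    """Extra time after a ping was due before the broker cuts the connection.

    Assumes the broker waits 1.5 * keepalive after the last ping,
    i.e. 0.5 * keepalive beyond the point where the next ping was due.
    """
    return 1.5 * keepalive - keepalive


for keepalive in (5, 10, 20):
    cutoff = 1.5 * keepalive
    grace = broker_grace_seconds(keepalive)
    print(f"keepalive={keepalive}s: broker cutoff after {cutoff:g}s, grace {grace:g}s")
```

Even with a keepalive of 20, a burst that blocks the loop for longer than the grace period would still exhaust it.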
All MQTT packets in black, RSTs in red. From a packet capture on docker04 where Mosquitto is running. Notice how the resets (4, we have 4 workers/gateways) always come after the bursts.
Packet capture on gw04 only, keepalive at 10:
See also http://www.steves-internet-guide.com/mqtt-keep-alive-by-example/
Suggested Solution
Move the message processing, especially the netlink stuff, into a separate thread (or coroutine or whatever) to unblock the main loop as soon as possible. Maybe using an internal (FIFO) queue that the on_message callback pushes the messages onto. The alternative of spawning a new thread/coroutine per message is most likely too expensive for the number of messages we have to deal with.
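A minimal sketch of the queue idea, assuming a single worker thread is enough. `start_message_worker` and `handler` are hypothetical names standing in for whatever wgkex would actually call; only the `on_message(client, userdata, message)` callback shape and the `topic`/`payload` message attributes come from Paho.

```python
import logging
import queue
import threading


def start_message_worker(handler):
    """Start one background thread that processes queued MQTT messages in order.

    `handler(topic, payload)` is a hypothetical stand-in for the expensive
    per-message work (e.g. the netlink calls).
    Returns (on_message, work_queue).
    """
    work_queue = queue.Queue()  # unbounded FIFO; a maxsize would add back-pressure

    def on_message(client, userdata, message):
        # Runs on the MQTT loop thread: only enqueue, never do heavy work here.
        work_queue.put((message.topic, message.payload))

    def run():
        while True:
            topic, payload = work_queue.get()
            try:
                handler(topic, payload)
            except Exception:
                # Keep the worker alive if handling a single message fails.
                logging.exception("Failed to handle message on %s", topic)
            finally:
                work_queue.task_done()

    threading.Thread(target=run, name="mqtt-worker", daemon=True).start()
    return on_message, work_queue
```

The returned `on_message` would be assigned to the Paho client's `on_message` attribute. The callback then returns almost immediately, so the loop thread stays free to send pings while the worker drains the queue. A single FIFO worker also preserves message order, which a thread per message would not.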