The bug is caused by SWSS being busy while it processes a large number of NextHopGroup/Route objects.
Because swss is a single-threaded application, link notifications from syncd might not be processed by swss in time, which eventually causes a link-up delay of about 30 sec. The issue can be observed during the CMIS interfaces bulk toggle stress test.
This is a CMIS timing issue caused by the existing SONiC architecture.
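The effect described above can be illustrated with a toy model. This is only a sketch under loud assumptions: it treats the consumer as a strict one-at-a-time FIFO, which is a simplification and not how orchagent actually schedules its work, and the task names, counts, and per-task costs are hypothetical values chosen for illustration.

```python
def handle_times(tasks):
    """Return {name: completion_time_s} for tasks handled strictly one at a time.

    tasks: list of (name, arrival_time_s, service_time_s), ordered by arrival.
    This is a toy FIFO model, not the real orchagent scheduling logic.
    """
    clock = 0.0
    done = {}
    for name, arrival, service in tasks:
        # A task cannot start before it arrives or before the previous one ends.
        clock = max(clock, arrival) + service
        done[name] = clock
    return done


# Hypothetical workload: 30000 route updates (1 ms each) are queued at t=0,
# and a port oper-status notification arrives behind them at t=1 s.
tasks = [(f"route-{i}", 0.0, 0.001) for i in range(30000)]
tasks.append(("port-oper-up", 1.0, 0.001))

done = handle_times(tasks)
delay = done["port-oper-up"] - 1.0
print(f"notification handled {delay:.1f} s after it arrived")
```

With these made-up numbers the notification waits roughly 29 s behind the route batch, which is the same order of magnitude as the delay reported in this issue.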
Port startup flow:
system log:
2024 Sep 6 13:42:14.975387 sonic NOTICE PORT-ACTION: => Ethernet144 START
2024 Sep 6 13:42:18.062154 sonic NOTICE swss#orchagent: :- doPortTask: Set port Ethernet144 admin status to up
2024 Sep 6 13:42:18.116118 sonic NOTICE swss#orchagent: :- setHostTxReady: Setting host_tx_ready status = false, alias = Ethernet144, port_id = 0x100000000001f
2024 Sep 6 13:42:18.117390 sonic WARNING pmon#xcvrd: $$$ Ethernet144 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'state': 'ok', 'netdev_oper_status': 'down', 'admin_status': 'up', 'mtu': '9100', 'host_tx_ready': 'false', 'supported_speeds': '400000,200000,100000,50000,40000,25000,10000,1000', 'supported_fecs': 'none,rs,fc,auto', 'speed': '400000', 'fec': 'rs'}
2024 Sep 6 13:42:18.218562 sonic NOTICE pmon#xcvrd: CMIS: Ethernet144 Forcing Tx laser OFF
2024 Aug 31 03:04:24.282117 sonic INFO start-LogAnalyzer-test_lag_member_flap[CRC-SRC_IP-ipv6-None-None].2024-08-31-00:04:23
2024 Aug 31 03:09:32.207163 sonic INFO end-LogAnalyzer-test_lag_member_flap[CRC-SRC_IP-ipv6-None-None].2024-08-31-00:04:23
hash/test_generic_hash.py:493:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
duthost = <MultiAsicSonicHost sonic>
interfaces = ['Ethernet112', 'Ethernet120', 'Ethernet144', 'Ethernet160']
portchannels = dict_keys(['PortChannel102', 'PortChannel105', 'PortChannel108', 'PortChannel1011'])
times = 3
    def flap_interfaces(duthost, interfaces, portchannels=[], times=3):
        """
        Flap the specified interfaces. Assert when any of the interfaces is not up after the flapping.
        Args:
            duthost (AnsibleHost): Device Under Test (DUT)
            interfaces: a list of interfaces to be flapped
            portchannels: a list of portchannels which need to check the status after the flapping
            times: flap times, every interface will be shutdown/startup for the value number times
        """
        logger.info(f"Flap the interfaces {interfaces} for {times} times.")
        # Flap the interface
        for _ in range(times):
            for interface in interfaces:
                shutdown_interface(duthost, interface)
                startup_interface(duthost, interface)
        # Check the interfaces status are up
        for interface in interfaces:
>           pytest_assert(wait_until(30, 2, 0, duthost.is_interface_status_up, interface),
                          f"The interface {interface} is not up after the flapping.")
E           Failed: The interface Ethernet144 is not up after the flapping.
_ = 2
duthost = <MultiAsicSonicHost sonic>
interface = 'Ethernet144'
interfaces = ['Ethernet112', 'Ethernet120', 'Ethernet144', 'Ethernet160']
portchannels = dict_keys(['PortChannel102', 'PortChannel105', 'PortChannel108', 'PortChannel1011'])
times = 3
hash/generic_hash_helper.py:334: Failed
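The failing assertion polls the interface state with `wait_until(30, 2, 0, ...)`, that is, a 30 sec timeout, a 2 sec interval and no initial delay. The sketch below is a simplified stand-in for that helper, not the sonic-mgmt implementation; `FakeClock`, `wait_until_sim` and `link_up_after` are hypothetical names, and a simulated clock is used so the example runs instantly. It shows why a link that needs slightly more than 30 sec after polling starts makes the check fail.

```python
class FakeClock:
    """Simulated clock so the example does not really sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def wait_until_sim(timeout, interval, delay, condition, clock):
    """Poll condition() every `interval` seconds until it is true or `timeout` expires."""
    clock.sleep(delay)
    deadline = clock.time() + timeout
    while clock.time() < deadline:
        if condition():
            return True
        clock.sleep(interval)
    return False


def link_up_after(clock, seconds):
    """Hypothetical condition: the link reports up once `seconds` have elapsed."""
    return lambda: clock.time() >= seconds


clock = FakeClock()
fast = wait_until_sim(30, 2, 0, link_up_after(clock, 20), clock)

clock = FakeClock()
slow = wait_until_sim(30, 2, 0, link_up_after(clock, 31), clock)

print(fast, slow)
```

In this model a link that comes up 20 sec after polling starts passes, while one that needs 31 sec is reported as not up, because the last poll happens before the 30 sec deadline.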
Describe the results you expected:
No errors are expected
Output of show version:
N/A
Output of show techsupport:
N/A
Additional information you deem important (e.g. issue happens only occasionally):
Platform:
Platform: x86_64-mlnx_msn4700-r0
HwSKU: Mellanox-SN4700-O32
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2119X03331
Model Number: MSN4700-WS2F_QP1
Hardware Revision: A2
Port startup flow:
system log:
sairedis log:
system log:
sairedis log:
Time diff: 2024 Sep 6 13:43:20.050878 - 2024 Sep 6 13:42:18.062154 ~ 62 sec
Next Hop Group & Route configuration:
system log:
sairedis log:
Time diff: 2024-09-06.10:42:50.065928 - 2024-09-06.10:42:20.140774 ~ 30 sec
Port standalone flows:
Shutdown:
Startup:
Time diff: 2024 Sep 6 12:53:42.404368 - 2024 Sep 6 12:53:11.862933 ~ 31 sec
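The three time differences quoted above can be reproduced from the log timestamps. This is a minimal sketch; the helper name `elapsed_seconds` and the two format constants are illustrative, and parsing the month abbreviation assumes an English (C) locale.

```python
from datetime import datetime

# Timestamp layouts as they appear in this report.
SYSLOG_FMT = "%Y %b %d %H:%M:%S.%f"      # e.g. "2024 Sep 6 13:42:18.062154"
SAIREDIS_FMT = "%Y-%m-%d.%H:%M:%S.%f"    # e.g. "2024-09-06.10:42:20.140774"


def elapsed_seconds(start, end, fmt):
    """Return the number of seconds between two timestamps in the same format."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


port_startup = elapsed_seconds(
    "2024 Sep 6 13:42:18.062154", "2024 Sep 6 13:43:20.050878", SYSLOG_FMT)
nhg_route = elapsed_seconds(
    "2024-09-06.10:42:20.140774", "2024-09-06.10:42:50.065928", SAIREDIS_FMT)
standalone = elapsed_seconds(
    "2024 Sep 6 12:53:11.862933", "2024 Sep 6 12:53:42.404368", SYSLOG_FMT)

for name, value in [("port startup flow", port_startup),
                    ("NHG & route configuration", nhg_route),
                    ("port standalone flow", standalone)]:
    print(f"{name}: {value:.3f} s")
```

The computed values round to the figures reported above (about 62, 30 and 31 sec).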
Steps to reproduce the issue:
Automation:
Manual:
t1-32-lag topo
Describe the results you received: