Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DUT sent lldp frame with incorrect port ID to EOS neighbor #11255

Closed
ZhaohuiS opened this issue Jun 25, 2022 · 7 comments
Closed

DUT sent lldp frame with incorrect port ID to EOS neighbor #11255

ZhaohuiS opened this issue Jun 25, 2022 · 7 comments
Labels
MSFT Triaged this issue has been triaged

Comments

@ZhaohuiS
Copy link
Contributor

ZhaohuiS commented Jun 25, 2022

Description

The Neighbor Port ID is not correct in the output of 'show lldp neighbors' on EOS neighbor.
It should be the DUT's interface alias name, not mac address.

Steps to reproduce the issue:

  1. Load internal image on testbed
  2. Check show lldp neighbors on EOS neighbor, neighbor port ID is mac address, not interface alias.
  3. Four interfaces of DUT have this issue, after debugging, it's because their netdev_oper_status is down.
Jun 24 06:04:13.164578 str2-msn2700-spy-2 INFO lldp#lldpmgrd[36]: port Ethernet0 is not up, continue
Jun 24 06:04:13.164979 str2-msn2700-spy-2 INFO lldp#lldpmgrd[36]: port Ethernet28 is not up, continue
Jun 24 06:04:13.165524 str2-msn2700-spy-2 INFO lldp#lldpmgrd[36]: port Ethernet48 is not up, continue
Jun 24 06:04:13.166013 str2-msn2700-spy-2 INFO lldp#lldpmgrd[36]: port Ethernet36 is not up, continue

Describe the results you received:

Neighbor EOS show lldp neighbor display Neighbor Port ID which is mac address, not interface name:

ARISTA01T2#show lldp neighbors 
Last table change time   : 1:09:57 ago
Number of table inserts  : 3
Number of table deletes  : 1
Number of table drops    : 0
Number of table age-outs : 0

Port          Neighbor Device ID       Neighbor Port ID    TTL 
---------- ------------------------ ---------------------- --- 
Et1           str2-msn2700-spy-2       506b.4b93.97bc      120 
Et2           str2-msn2700-spy-2       etp2                120 

tcpdump to capture lldp frame from dut shows the Port ID carries Subtype MAC address not Subtype Local
is not correct:

03:18:33.241300 LLDP, length 237
        Chassis ID TLV (1), length 7
          Subtype MAC address (4): 50:6b:4b:f5:02:58
        Port ID TLV (2), length 7
          Subtype MAC address (3): 50:6b:4b:93:97:bc
        Time to Live TLV (3), length 2: TTL 120s
        System Name TLV (5), length 18: str2-msn2700-spy-2
        System Description TLV (6), length 138
          SONiC Software Version: SONiC.internal.56021768-de2887bccb - HwSku: Mellanox-SN2700 - Distribution: Debian 11.3 - Kernel: 5.10.0-8-2-amd64
        System Capabilities TLV (7), length 4
          System  Capabilities [Bridge, WLAN AP, Router, Station Only] (0x009c)
          Enabled Capabilities [Bridge, Router] (0x0014)
        Management Address TLV (8), length 12
          Management Address length 5, AFI IPv4 (1): 10.3.146.88
          Unknown Interface Numbering (1): 0
        Port Description TLV (4), length 9: Ethernet0
        Organization specific TLV (127), length 9: OUI IEEE 802.3 Private (0x00120f)
          Link aggregation Subtype (3)
            aggregation status [supported, enabled], aggregation port ID 8
        Organization specific TLV (127), length 9: OUI IEEE 802.3 Private (0x00120f)
          MAC/PHY configuration/status Subtype (1)
            autonegotiation [none] (0x00)
            PMD autoneg capability [unknown] (0x0000)
            MAU type Unknown (0x0000)
        End TLV (0), length 0

The correct lldp frame should look like this, the value of Port ID should Subtype Local which is interface alias name of DUT.

03:21:06.030356 LLDP, length 246
        Chassis ID TLV (1), length 7
          Subtype MAC address (4): 50:6b:4b:f5:02:58
        Port ID TLV (2), length 5
          Subtype Local (7): etp2
        Time to Live TLV (3), length 2: TTL 120s
        System Name TLV (5), length 18: str2-msn2700-spy-2
        System Description TLV (6), length 138
          SONiC Software Version: SONiC.internal.56021768-de2887bccb - HwSku: Mellanox-SN2700 - Distribution: Debian 11.3 - Kernel: 5.10.0-8-2-amd64
        System Capabilities TLV (7), length 4
          System  Capabilities [Bridge, WLAN AP, Router, Station Only] (0x009c)
          Enabled Capabilities [Bridge, Router] (0x0014)
        Management Address TLV (8), length 12
          Management Address length 5, AFI IPv4 (1): 10.3.146.88
          Unknown Interface Numbering (1): 0
        Port Description TLV (4), length 20: ARISTA01T2:Ethernet2
        Organization specific TLV (127), length 9: OUI IEEE 802.3 Private (0x00120f)
          Link aggregation Subtype (3)
            aggregation status [supported, enabled], aggregation port ID 8
        Organization specific TLV (127), length 9: OUI IEEE 802.3 Private (0x00120f)
          MAC/PHY configuration/status Subtype (1)
            autonegotiation [none] (0x00)
            PMD autoneg capability [unknown] (0x0000)
            MAU type Unknown (0x0000)
        End TLV (0), length 0

lldpmgrd logs show, Ethernet0 is not up, so the correct lldpcli is not sent to lldp:
Jun 25 03:10:35.851413 str2-msn2700-spy-2 INFO lldp#lldpmgrd[39]: port Ethernet0 is not up, continue

Otherwise it should send command like this:
Jun 25 03:10:35.817256 str2-msn2700-spy-2 INFO lldp#lldpmgrd[39]: Running command: 'lldpcli configure ports Ethernet8 lldp portidsubtype local etp3 description 'ARISTA03T2:Ethernet1''

That's because its netdev_oper_status is down in PORT_TABLE.

admin@str2-msn2700-spy-2:~$ redis-cli -n 6 hgetall "PORT_TABLE|Ethernet0"
 1) "state"
 2) "ok"
 3) "netdev_oper_status"
 4) "down"
 5) "admin_status"
 6) "up"
 7) "mtu"
 8) "9100"
 9) "supported_speeds"
10) "100000,25000,50000,56000,40000,10000,1000"
11) "speed"
12) "100000"

But interface status shows it's up.

admin@str2-msn2700-spy-2:~$ show interface status
      Interface            Lanes    Speed    MTU    FEC    Alias             Vlan    Oper    Admin             Type    Asym PFC
---------------  ---------------  -------  -----  -----  -------  ---------------  ------  -------  ---------------  ----------
      Ethernet0          0,1,2,3     100G   9100     rs     etp1   PortChannel102      up       up  QSFP28 or later         off
      Ethernet4          4,5,6,7     100G   9100     rs     etp2   PortChannel102      up       up  QSFP28 or later         off

Not sure why netdev_oper_status of many interfaces is down, which should be up.

From beginning, it has set Ethernet 0 down and it set netdev_oper_status down in state_db.

Jun 24 06:03:58.943622 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:0 oper:0 addr:50:6b:4b:93:97:80 ifindex:49 master:0 type:sx_netdev
Jun 24 06:03:58.945701 str2-msn2700-spy-2 INFO kernel: [   71.146996] 8021q: adding VLAN 0 to HW filter on device Ethernet0
Jun 24 06:03:58.945715 str2-msn2700-spy-2 INFO kernel: [   71.147583] PortChannel102: Port device Ethernet0 added
Jun 24 06:03:58.946176 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Jun 24 06:03:58.946420 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:50:6b:4b:93:97:80 ifindex:49 master:0 type:sx_netdev
Jun 24 06:03:58.946491 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Jun 24 06:03:58.946528 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:50:6b:4b:93:97:80 ifindex:49 master:8 type:sx_netdev
Jun 24 06:03:58.946592 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:50:6b:4b:93:97:80 ifindex:49 master:8 type:sx_netdev

Then after seconds, it tried to set oper to up, but it didn’t set netdev_oper_status up in state_db eventually.
It doesn’t have this kind of log( swss#portsyncd: :- onMsg: Publish Ethernet0(ok:up) to state db) to show it’s up, no other error log either.

Jun 24 06:04:10.421875 str2-msn2700-spy-2 INFO kernel: [   82.620460] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet0: link becomes ready
Jun 24 06:04:10.424554 str2-msn2700-spy-2 NOTICE syncd#SDK: [SAI_SWITCH.NOTICE] mlnx_sai_switch.c[5499]- event_thread_func: Port 1003d changed state to up
Jun 24 06:04:10.425591 str2-msn2700-spy-2 NOTICE syncd#SDK: [SAI_UTILS.NOTICE] mlnx_sai_utils.c[1833]- set_dispatch_attrib_handler: Set SCHEDULER_PROFILE_ID, key:scheduler group id 10025:1:4, val:SCHEDULER,(0:0),1,0000,0
Jun 24 06:04:10.431924 str2-msn2700-spy-2 NOTICE syncd#SDK: [SAI_UTILS.NOTICE] mlnx_sai_utils.c[1833]- set_dispatch_attrib_handler: Set WRED_PROFILE_ID, key:queue 10025:4, val:WRED,(0:0),0,0000,0
Jun 24 06:04:10.442224 str2-msn2700-spy-2 NOTICE syncd#SDK: [SAI_UTILS.NOTICE] mlnx_sai_utils.c[1833]- set_dispatch_attrib_handler: Set SCHEDULER_PROFILE_ID, key:scheduler group id 10025:1:5, val:SCHEDULER,(0:0),0,0000,0
Jun 24 06:04:10.452049 str2-msn2700-spy-2 NOTICE syncd#SDK: [SAI_UTILS.NOTICE] mlnx_sai_utils.c[1833]- set_dispatch_attrib_handler: Set SCHEDULER_PROFILE_ID, key:scheduler group id 10025:1:6, val:SCHEDULER,(0:0),0,0000,0
Jun 24 06:04:10.461858 str2-msn2700-spy-2 WARNING kernel: [   82.661747] sx_netdev_handle_pude_event: Called for logical port - 10017 status UP
Jun 24 06:04:10.461889 str2-msn2700-spy-2 INFO kernel: [   82.661781] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet108: link becomes ready
Jun 24 06:04:10.462219 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:1 addr:50:6b:4b:93:97:80 ifindex:49 master:8 type:sx_netdev
Jun 24 06:04:10.462219 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet108 admin:1 oper:1 addr:50:6b:4b:93:97:80 ifindex:76 master:0 type:sx_netdev
Jun 24 06:04:10.462893 str2-msn2700-spy-2 NOTICE syncd#SDK: [SAI_SWITCH.NOTICE] mlnx_sai_switch.c[5499]- event_thread_func: Port 10017 changed state to up
Jun 24 06:04:10.463108 str2-msn2700-spy-2 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet108(ok:up) to state db

Describe the results you expected:

Neighbor EOS show lldp neighbor displays correct neighbor Port ID which should be DUT's interface alias name:

ARISTA01T2#show lldp neighbors 
Last table change time   : 1:09:57 ago
Number of table inserts  : 3
Number of table deletes  : 1
Number of table drops    : 0
Number of table age-outs : 0

Port          Neighbor Device ID       Neighbor Port ID    TTL 
---------- ------------------------ ---------------------- --- 
Et1           str2-msn2700-spy-2       etp1                120 
Et2           str2-msn2700-spy-2       etp2                120 

Output of show version:

admin@str2-msn2700-spy-2:~$ show version

SONiC Software Version: SONiC.internal.56021768-de2887bccb
Distribution: Debian 11.3
Kernel: 5.10.0-8-2-amd64
Build commit: de2887bccb
Build date: Wed Jun  1 13:56:13 UTC 2022
Built by: cloudtest@4f11fe25c000001

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1826X04469
Model Number: MSN2700-CS2ROS
Hardware Revision: A1
Uptime: 03:26:31 up 21:23,  4 users,  load average: 1.91, 1.84, 1.83
Date: Sat 25 Jun 2022 03:26:31

Docker images:
REPOSITORY                 TAG                            IMAGE ID       SIZE
docker-macsec              internal.56021768-de2887bccb   d25d9a5cd59e   453MB
docker-macsec              latest                         d25d9a5cd59e   453MB
docker-dhcp-relay          latest                         b47cf966748a   445MB
docker-router-advertiser   internal.56021768-de2887bccb   4d041345c9e2   435MB
docker-router-advertiser   latest                         4d041345c9e2   435MB
docker-sonic-telemetry     internal.56021768-de2887bccb   758c3bb658d8   516MB
docker-sonic-telemetry     latest                         758c3bb658d8   516MB
docker-mux                 internal.56021768-de2887bccb   3709be0c6753   484MB
docker-mux                 latest                         3709be0c6753   484MB
docker-database            internal.56021768-de2887bccb   5008a62e0a5c   435MB
docker-database            latest                         5008a62e0a5c   435MB
docker-orchagent           internal.56021768-de2887bccb   b1a61bc97d7d   439MB
docker-orchagent           latest                         b1a61bc97d7d   439MB
docker-fpm-frr             internal.56021768-de2887bccb   a123c50c50a3   439MB
docker-fpm-frr             latest                         a123c50c50a3   439MB
docker-teamd               internal.56021768-de2887bccb   fe02e859aa46   420MB
docker-teamd               latest                         fe02e859aa46   420MB
docker-syncd-mlnx          internal.56021768-de2887bccb   d33abfaa1b0a   899MB
docker-syncd-mlnx          latest                         d33abfaa1b0a   899MB
docker-acms                internal.56021768-de2887bccb   1168e7d1fe37   615MB
docker-acms                latest                         1168e7d1fe37   615MB
docker-snmp                internal.56021768-de2887bccb   c708ac0e95eb   449MB
docker-snmp                latest                         c708ac0e95eb   449MB
docker-platform-monitor    internal.56021768-de2887bccb   768387e09a71   639MB
docker-platform-monitor    latest                         768387e09a71   639MB
docker-lldp                internal.56021768-de2887bccb   fdfce9c585ba   446MB
docker-lldp                latest                         fdfce9c585ba   446MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@ZhaohuiS ZhaohuiS changed the title The Neighbor Port ID is not correct in the output of 'show lldp neighbors' on EOS neighbor DUT sent lldp frame with incorrect port ID to EOS neighbor Jun 27, 2022
@zhangyanzhao zhangyanzhao added Triaged this issue has been triaged MSFT labels Jul 6, 2022
@zhangyanzhao
Copy link
Collaborator

Fix is already available per Prince S. This issue can be closed after adding the fix PR.

@zhangyanzhao
Copy link
Collaborator

@prsunny

@lguohan
Copy link
Collaborator

lguohan commented Jul 6, 2022

is this a regression? which pr caused the regression? are we missing any test to catch this one?

@prsunny
Copy link
Contributor

prsunny commented Jul 6, 2022

Fixed PR - sonic-net/sonic-swss#2233

@prsunny
Copy link
Contributor

prsunny commented Jul 6, 2022

Seems to be introduced by - #9386

@ZhaohuiS
Copy link
Contributor Author

ZhaohuiS commented Jul 7, 2022

As it's fixed in sonic-net/sonic-swss#2233, close this issue.

@ZhaohuiS ZhaohuiS closed this as completed Jul 7, 2022
@ZhaohuiS
Copy link
Contributor Author

ZhaohuiS commented Jul 7, 2022

is this a regression? which pr caused the regression? are we missing any test to catch this one?

@lguohan it's a regression, but it's already fixed in sonic-net/sonic-swss#2233, actually, test_lldp.py::test_lldp_neighbor helped me to catch this issue. It's already covered by our test.

ZhaohuiS added a commit to sonic-net/sonic-mgmt that referenced this issue Jul 18, 2022
…s is in sync with state_db (#5952)

What is the motivation for this PR?
PR test didn't catch the issue sonic-net/sonic-buildimage#11255.
test_lldp.py run to success because netdev_oper_status of interfaces is up in state_db.
But in physical testbed, netdev_oper_status of some interfaces is down and it impacts the lldp behavior. So test_lldp.py caught it.
Root cause is the netdev_oper_status in state_db is not synced with the real status of interface.
If the image has this issue, no matter you shutdown or no shutdown interface, the netdev_oper_status can't be updated.

How did you do it?
Add two cases in test_lag_2.py to check if the status of lag interfaces is in sync with state_db.

test_lag_db_status is to check if status of lag's member interfaces is synced with port's netdev_oper_status in state_db. It can run on any topology.
1). Check if status of interface is in sync with state_db after bootup.
2). Check if status of interface is in sync with state_db after shutdown/no shutdown.

test_lag_db_status_with_po_update is to check if status of lag's member interfaces is synced with port's netdev_oper_status in state_db with add/delete operation for lag member. It only supports t1-lag topology.

Remove port member from portchannel
Shutdown this port to check if status is down
Add this port back into portchannel and check if status is synced
Retrieve lag_facts after shutdown interface and check if status is synced
No shutdown this port to check if status is up
Retrieve lag_facts after no shutdown interface to check if status is synced
Since there are so many interfaces in t1-lag, we can define --completeness_level confident to pick up some interface to run it.

How did you verify/test it?
run test_lag_2.py::test_lag_db_status and test_lag_2.py::test_lag_db_status_with_po_update

Signed-off-by: Zhaohui Sun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MSFT Triaged this issue has been triaged
Projects
None yet
Development

No branches or pull requests

4 participants