[T2] [Chassis] - Zebra process crashed during line card config-reload #20942

Open
sanjair-git opened this issue Nov 27, 2024 · 6 comments
Labels
Chassis 🤖 Modular chassis support P0 Priority of the issue

Comments

@sanjair-git

Description

On a T2 chassis line card, the 'zebra' process crashed and generated the following core during an OC test run, when the test performed a 'config reload'.

2024 Nov 24 06:26:29.856703 ixre-egl-board27 NOTICE swss1#orchagent: :- addRoutePost: Update Nexthop Group 10.0.0.165@Ethernet144,10.0.0.167@Ethernet152,10.0.0.169@Ethernet160,10.0.0.171@Ethernet168,10.0.0.173@Ethernet176,10.0.0.175@Ethernet184,10.0.0.177@Ethernet192,10.0.0.179@Ethernet200,10.0.0.181@Ethernet208,10.0.0.183@Ethernet216,10.0.0.185@Ethernet224,10.0.0.187@Ethernet232,10.0.0.189@Ethernet240
2024 Nov 24 06:26:29.864921 ixre-egl-board27 NOTICE bgp0#fpmsyncd: message repeated 8 times: [ :- onRouteMsg: RouteTable del msg for route with only one nh on eth0/docker0: fe80::/64 :: eth0 ]
2024 Nov 24 06:26:29.864955 ixre-egl-board27 ERR bgp0#fpmsyncd: :- poll_descriptors: readData error: Connection reset by peer
2024 Nov 24 06:26:29.865190 ixre-egl-board27 INFO bgp0#supervisord: fpmsyncd Connection lost, reconnecting...
2024 Nov 24 06:26:29.865559 ixre-egl-board27 INFO bgp0#supervisord: fpmsyncd Waiting for fpm-client connection...
2024 Nov 24 06:26:29.865639 ixre-egl-board27 INFO bgp0#supervisord 2024-11-24 06:26:29,865 WARN exited: zebra (terminated by SIGABRT (core dumped); not expected)
2024 Nov 24 06:26:29.868045 ixre-egl-board27 WARNING bgp0#bgpmon: *WARNING* JSONDecodeError: Expecting value: line 1 column 1 (char 0) when execute: ['vtysh', '-c', 'show bgp summary json'] Retry attempt: 1
2024 Nov 24 06:26:29.900148 ixre-egl-board27 NOTICE swss1#orchagent: :- addRoutePost: Update Nexthop Group 

Attaching the core, syslog, and zebra logs for reference:

zebra.1732429588.53.0.core.gz
zebra-logs.zip
zebra-issue-board27.gz
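
For anyone scanning the attached syslog, the crash can be located by searching for supervisord's "exited ... not expected" notice. The following is a minimal sketch, assuming the message format shown in the excerpt above; the regular expression and the function name are illustrative and not part of any SONiC tool.

```python
import re

# Matches supervisord's exit notice as it appears in the syslog excerpt above.
EXIT_RE = re.compile(
    r"exited: (?P<proc>\S+) \(terminated by (?P<signal>SIG[A-Z0-9]+)"
    r"(?P<core> \(core dumped\))?; not expected\)"
)

def find_unexpected_exits(lines):
    """Return (process, signal, core_dumped) for each unexpected exit found."""
    events = []
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            events.append((m.group("proc"), m.group("signal"), m.group("core") is not None))
    return events

sample = [
    "2024 Nov 24 06:26:29.865559 ixre-egl-board27 INFO bgp0#supervisord: fpmsyncd Waiting for fpm-client connection...",
    "2024 Nov 24 06:26:29.865639 ixre-egl-board27 INFO bgp0#supervisord 2024-11-24 06:26:29,865 WARN exited: zebra (terminated by SIGABRT (core dumped); not expected)",
]
print(find_unexpected_exits(sample))
# [('zebra', 'SIGABRT', True)]
```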

Steps to reproduce the issue:

  1. On any T2 chassis line card, run a test that performs a 'config reload'. The issue occurs intermittently.
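
Because the crash is intermittent, a repeat loop may help when trying to reproduce it. The sketch below is only an illustration: the container name bgp0 is taken from the syslog tags above, while the settle time, the attempt count, and the function names are assumptions. The `run` and `sleep` parameters exist so the loop can be exercised without touching a device.

```python
import subprocess
import time

def zebra_running(run=subprocess.run, container="bgp0"):
    """True if supervisorctl reports zebra as RUNNING inside the container."""
    result = run(
        ["docker", "exec", container, "supervisorctl", "status", "zebra"],
        capture_output=True, text=True,
    )
    return "RUNNING" in result.stdout

def reload_until_crash(attempts, settle_seconds=300, run=subprocess.run, sleep=time.sleep):
    """Repeat 'config reload'; return the attempt on which zebra was down, else None."""
    for attempt in range(1, attempts + 1):
        run(["sudo", "config", "reload", "-y"], check=True)
        sleep(settle_seconds)  # assumed settle time; tune for the platform
        if not zebra_running(run=run):
            return attempt
    return None

# Example (runs real commands, so only on a lab device):
#   reload_until_crash(attempts=20)
```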

Describe the results you received:

  • The zebra process crashed and did not come back up for the next 30 minutes.
  • A core file was generated.
2024 Nov 24 06:27:29.924944 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (1.0 minutes).
2024 Nov 24 06:28:29.979504 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (2.0 minutes).
2024 Nov 24 06:29:30.034640 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (3.0 minutes).
2024 Nov 24 06:30:30.090872 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (4.0 minutes).
2024 Nov 24 06:31:30.141823 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (5.0 minutes).
2024 Nov 24 06:32:30.196762 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (6.0 minutes).
2024 Nov 24 06:33:30.253843 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (7.0 minutes).
2024 Nov 24 06:34:30.307730 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (8.0 minutes).
2024 Nov 24 06:35:30.362860 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (9.0 minutes).
2024 Nov 24 06:36:30.418050 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (10.0 minutes).
2024 Nov 24 06:37:30.475897 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (11.0 minutes).
2024 Nov 24 06:38:30.534972 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (12.0 minutes).
2024 Nov 24 06:39:30.593887 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (13.0 minutes).
2024 Nov 24 06:40:30.649885 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (14.0 minutes).
2024 Nov 24 06:41:30.707705 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (15.0 minutes).
2024 Nov 24 06:42:30.765343 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (16.0 minutes).
2024 Nov 24 06:43:30.822839 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (17.0 minutes).
2024 Nov 24 06:44:30.882951 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (18.0 minutes).
2024 Nov 24 06:45:30.936833 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (19.0 minutes).
2024 Nov 24 06:46:30.992880 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (20.0 minutes).
2024 Nov 24 06:47:31.051059 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (21.0 minutes).
2024 Nov 24 06:48:31.106957 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (22.0 minutes).
2024 Nov 24 06:49:31.166279 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (23.0 minutes).
2024 Nov 24 06:50:31.224898 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (24.0 minutes).
2024 Nov 24 06:51:31.280794 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (25.0 minutes).
2024 Nov 24 06:52:31.337925 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (26.0 minutes).
2024 Nov 24 06:53:31.395933 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (27.0 minutes).
2024 Nov 24 06:54:31.451244 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (28.0 minutes).
2024 Nov 24 06:55:31.509811 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (29.0 minutes).
2024 Nov 24 06:56:31.566762 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (30.0 minutes).
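
The repeated listener messages above can be condensed into a single downtime figure per process and namespace. This is a rough sketch, assuming the line format shown in the log; the function name and the returned tuple layout are illustrative.

```python
import re
from datetime import datetime

LISTENER_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"Process '(?P<proc>[^']+)' is not running in namespace '(?P<ns>[^']+)' "
    r"\((?P<mins>\d+(?:\.\d+)?) minutes\)"
)

def summarize_downtime(lines):
    """Map (process, namespace) -> (first_seen, last_seen, max_minutes_reported)."""
    summary = {}
    for line in lines:
        m = LISTENER_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y %b %d %H:%M:%S.%f")
        key = (m.group("proc"), m.group("ns"))
        mins = float(m.group("mins"))
        first, last, longest = summary.get(key, (ts, ts, mins))
        summary[key] = (min(first, ts), max(last, ts), max(longest, mins))
    return summary

sample = [
    "2024 Nov 24 06:27:29.924944 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (1.0 minutes).",
    "2024 Nov 24 06:56:31.566762 ixre-egl-board27 ERR bgp0#supervisor-proc-exit-listener: Process 'zebra' is not running in namespace 'asic0' (30.0 minutes).",
]
for (proc, ns), (first, last, longest) in summarize_downtime(sample).items():
    print(proc, ns, "reported down for up to", longest, "minutes; span", last - first)
```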

Describe the results you expected:

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@rlhui rlhui added the P0 Priority of the issue label Nov 28, 2024
@prabhataravind prabhataravind added the Chassis 🤖 Modular chassis support label Dec 4, 2024
@prabhataravind
Contributor

@sanjair-git could you please check if the issue you observe is fixed by this PR?

#20990

@sanjair-git
Author

Just FYI, this issue was seen in 202405 sonic-buildimage

@prabhataravind
Contributor

> Just FYI, this issue was seen in 202405 sonic-buildimage

OK, can you please paste the full backtrace for the zebra crash?

@sanjair-git
Author

> @sanjair-git could you please check if the issue you observe is fixed by this PR?
>
> #20990

Hi @prabhataravind, we picked up the above PR and did a full OC test run. The issue was not seen.

Also, we could not reproduce the issue manually, even after multiple attempts.

@sanjair-git
Author

> > Just FYI, this issue was seen in 202405 sonic-buildimage
>
> OK, can you please paste the full backtrace for the zebra crash?

docker run -it --entrypoint=/bin/bash -v ~/debug:/debug 8b27b6c4df1a
root@b078c945639c:/# ls /usr/lib/frr/zebra 
/usr/lib/frr/zebra
root@b078c945639c:/# ls -l /usr/lib/frr/zebra 
-rwxr-xr-x 1 root root 2017816 Dec  6 09:12 /usr/lib/frr/zebra
root@b078c945639c:/# cd /debug                        
root@b078c945639c:/debug# ls
zebra.1732429588.53.0.core
root@b078c945639c:/debug# gdb /usr/lib/frr/zebra zebra.1732429588.53.0.core 
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
 
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/frr/zebra...
Reading symbols from /usr/lib/debug/.build-id/08/c6c2d27aa426cda9ddf445b02122676d607250.debug...
 
warning: Can't open file /var/tmp/frr/zebra.53/logbuf.79 during file-backed mapping note processing
 
warning: Can't open file /var/tmp/frr/zebra.53/logbuf.78 during file-backed mapping note processing
 
warning: Can't open file /var/tmp/frr/zebra.53/logbuf.70 during file-backed mapping note processing
 
warning: Can't open file /var/tmp/frr/zebra.53/logbuf.63 during file-backed mapping note processing
 
warning: Can't open file /var/tmp/frr/zebra.53/logbuf.62 during file-backed mapping note processing
 
warning: Can't open file /var/tmp/frr/zebra.53/logbuf.61 during file-backed mapping note processing
 
warning: Can't open file /var/tmp/frr/zebra.53/logbuf.53 during file-backed mapping note processing
[New LWP 53]
[New LWP 61]
[New LWP 63]
[New LWP 60]
[New LWP 70]
[New LWP 78]
[New LWP 79]
[New LWP 62]
Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f4455041ebc in ?? ()
[Current thread is 1 (LWP 53)]
(gdb) bt
#0  0x00007f4455041ebc in ?? ()
#1  0x00007fffa1818eb0 in ?? ()
#2  0xc6e332abefc6e100 in ?? ()
#3  0x0000000000000006 in ?? ()
#4  0x00007f4454c9f7c0 in ?? ()
#5  0x000055df8bb06fc0 in ?? ()
#6  0x00007f44400276a0 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb) info thread
  Id   Target Id         Frame 
* 1    LWP 53            0x00007f4455041ebc in ?? ()
  2    LWP 61            0x00007f44550b8923 in ?? ()
  3    LWP 63            0x00007f44550b32d6 in ?? ()
  4    LWP 60            0x00007f44550b8799 in ?? ()
  5    LWP 70            0x00007f44550b32d6 in ?? ()
  6    LWP 78            0x00007f44550b32d6 in ?? ()
  7    LWP 79            0x00007f44550b32d6 in ?? ()
  8    LWP 62            0x00007f44550c1bbd in ?? ()
(gdb)

The backtrace didn't help either; every frame is unresolved (??).
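
One thing that may be worth checking (not confirmed anywhere in this thread) is whether the binary and libraries in the debug container match the build that crashed: the zebra binary in the session above is dated Dec 6, while the core file name carries an earlier epoch. The sketch below decodes that epoch and builds a non-interactive gdb command that lists the shared libraries gdb matched and dumps all threads. The core name layout (process, epoch seconds, pid, index) is an assumption inferred from the file name and the LWP 53 in the session; the function names are illustrative.

```python
from datetime import datetime, timezone

def core_timestamp(core_name):
    """Read the epoch field from a name shaped like 'zebra.1732429588.53.0.core'."""
    # Assumed layout: <process>.<epoch seconds>.<pid>.<index>.core
    epoch = int(core_name.split(".")[1])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)

def gdb_backtrace_cmd(binary, core):
    """Build a non-interactive gdb invocation that dumps every thread's stack."""
    return [
        "gdb", "-batch",
        "-ex", "info sharedlibrary",   # shows which libraries gdb could match
        "-ex", "thread apply all bt",  # backtrace for all threads, not just LWP 53
        binary, core,
    ]

print(core_timestamp("zebra.1732429588.53.0.core").isoformat())
# 2024-11-24T06:26:28+00:00
print(" ".join(gdb_backtrace_cmd("/usr/lib/frr/zebra", "zebra.1732429588.53.0.core")))
```

The decoded time lines up with the 06:26:29 crash in the syslog if those logs are in UTC.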

@rlhui
Contributor

rlhui commented Dec 11, 2024

@sanjair-git, could you please build a debug image and retest? Thanks.
