Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Z9264/201911-170/Orchagent and python crash with warm-reboot. Kernal crash seen often #5258

Open
mini-nair-dell opened this issue Aug 27, 2020 · 8 comments

Comments

@mini-nair-dell
Copy link

Description:
=============

While performing warm-reboot observe python core and orchagent core generated. This is consistently seen in latest NOv image-HEAD
Often we see a kernal crash as well. The logs are attached below.

Platform :
===========

DellEMC-Z9264f-C64 with T1 config loaded

Build :
======

root@sonic:~# show ver

SONiC Software Version: SONiC.HEAD.170-c243b8a9
Distribution: Debian 9.13
Kernel: 4.9.0-11-2-amd64
Build commit: c243b8a
Build date: Tue Aug 25 18:59:41 UTC 2020
Built by: johnar@jenkins-worker-7

Platform: x86_64-dellemc_z9264f_c3538-r0
HwSKU: DellEMC-Z9264f-C64
ASIC: broadcom
Serial Number: CN030K8MDND0093E0001
Uptime: 07:09:59 up 4 min, 1 user, load average: 0.96, 1.15, 0.55

Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-brcm HEAD.170-c243b8a9 8b26f1bf8730 436MB

Steps to repro:
===============

With T1 config loaded in the box, perform a warm-reboot

root@sonic:~# warm-reboot -vvv
Thu Aug 27 04:22:40 UTC 2020 Pausing orchagent ...
Thu Aug 27 04:22:41 UTC 2020 Stopping nat ...
Error response from daemon: Cannot kill container: nat: No such container: nat
Thu Aug 27 04:22:41 UTC 2020 Stopped nat ...
Thu Aug 27 04:22:41 UTC 2020 Stopping radv ...
Thu Aug 27 04:22:41 UTC 2020 Stopping bgp ...
Thu Aug 27 04:22:42 UTC 2020 Stopped bgp ...
Thu Aug 27 04:22:46 UTC 2020 Initialize pre-shutdown ...
Thu Aug 27 04:22:47 UTC 2020 Requesting pre-shutdown ...
Thu Aug 27 04:22:48 UTC 2020 Waiting for pre-shutdown ...
Thu Aug 27 04:22:54 UTC 2020 Pre-shutdown succeeded ...
Thu Aug 27 04:22:54 UTC 2020 Backing up database ...
Thu Aug 27 04:22:56 UTC 2020 Stopping teamd ...
Thu Aug 27 04:22:56 UTC 2020 Stopped teamd ...
Thu Aug 27 04:22:56 UTC 2020 Stopping syncd ...
Thu Aug 27 04:23:09 UTC 2020 Stopped syncd ...
Thu Aug 27 04:23:09 UTC 2020 Stopping all remaining containers ...
Warning: Stopping telemetry.service, but it can still be activated by:
telemetry.timer
Thu Aug 27 04:23:11 UTC 2020 Stopped all remaining containers ...
Thu Aug 27 04:23:14 UTC 2020 Enabling Watchdog before warm-reboot
Traceback (most recent call last):
File "/usr/bin/watchdogutil", line 8, in
from watchdogutil.main import watchdogutil
File "/usr/lib/python2.7/dist-packages/watchdogutil/main.py", line 15, in
raise ImportError("%s - required module not found" % str(e[ 414.310317] kexec_core: Starting new kernel
))
ImportError:[ 0.075766] DMAR-IR: Failed to copy IR table for dmar0 from previous kernel
[ 4.880792] DMAR: DRHD: handling fault status reg 2
[ 4.939137] DMAR: [INTR-REMAP] Request device [04:00.0] fault index 49 [fault reason 34] Present field in the IRTE entry is clear
[ 4.987163] rc.local[449]: + sonic-cfggen -y /etc/sonic/sonic_version.yml -v build_version
[ 5.406697] kdump-tools[484]: Starting kdump-tools: no crashkernel= parameter in the kernel cmdline ... failed!
[ 6.379721] rc.local[449]: + SONIC_VERSION=HEAD.170-c243b8a9
[ 6.456690] rc.local[449]: + FIRST_BOOT_FILE=/host/image-HEAD.170-c243b8a9/platform/firsttime
[ 6.574341] rc.local[449]: + logger SONiC version HEAD.170-c243b8a9 starting up...
[ 6.676308] rc.local[449]: + [ ! -e /host/machine.conf ]
[ 6.748245] rc.local[449]: + . /host/machine.conf
[ 6.808361] rc.local[449]: + onie_version=3.42.1.B-6
[ 6.868366] rc.local[449]: + onie_vendor_id=674
[ 6.936113] rc.local[449]: + onie_platform=x86_64-dellemc_z9264f_c3538-r0
[ 7.028385] rc.local[449]: + onie_machine=dellemc_z9264f_c3538
[ 7.104376] rc.local[449]: + onie_machine_rev=0
[ 7.164447] rc.local[449]: + onie_arch=x86_64
[ 7.224271] rc.local[449]: + onie_config_version=1
[ 7.490195] c3xxx 0000:01:00.0: Failed to send admin msg to accelerator
[ 7.569294] c3xxx 0000:01:00.0: Failed to send init message
[ 7.376463] rc.local[449]: + onie_build_date=2019-10-16T18:56-07:00
[ 7.516407] rc.local[449]: + onie_partition_type=gpt
[ 7.576399] rc.local[449]: + onie_kernel_version=4.9.30
[ 7.648386] rc.local[449]: + onie_firmware=auto
[ 7.708397] rc.local[449]: + onie_switch_asic=bcm
[ 7.768401] rc.local[449]: + onie_skip_ethmgmt_macs=no
[ 7.840400] rc.local[449]: + program_console_speed
[ 7.962119] rc.local[449]: + cat /proc/cmdline
[ 8.041394] rc.local[449]: + grep -Eo console=ttyS[0-9]+,[0-9]+
[ 8.120595] rc.local[449]: + cut -d , -f2
[ 8.176915] rc.local[449]: + speed=9600
[ 8.236417] rc.local[449]: + [ -z 9600 ]
[ 8.288459] rc.local[449]: + CONSOLE_SPEED=9600
[ 8.352416] rc.local[449]: + sed -i s|--keep-baud .* %I| 9600 %I|g /lib/systemd/system/[email protected]
[ 8.476404] rc.local[449]: + systemctl daemon-reload
[ 8.536383] rc.local[449]: + [ -f /host/image-HEAD.170-c243b8a9/platform/firsttime ]
[ 8.640423] rc.local[449]: + [ -f /var/log/fsck.log.gz ]
[ 8.714123] rc.local[449]: + gunzip -d -c /var/log/fsck.log.gz
[ 8.788944] rc.local[449]: + logger -t FSCK
[ 8.848795] rc.local[449]: + rm -f /var/log/fsck.log.gz
[ 8.912388] rc.local[449]: + exit 0

Debian GNU/Linux 9 sonic ttyS0

sonic login: [ OK ] Started Update hostname based on configdb.
Stopping System Logging Service...
[ OK ] Stopped System Logging Service.
Starting System Logging Service...
[ OK ] Started System Logging Service.
[ OK ] Started Update rsyslog configuration.
[ OK ] Started BGP container.
[ OK ] Started Platform monitor container.
Stopping ifupdown2 networking initialization...
[ OK ] Stopped ifupdown2 networking initialization.
Starting ifupdown2 networking initialization...
[ OK ] Started ifupdown2 networking initialization.
[ OK ] Started Update interfaces configuration.
Starting switch state service...
[ OK ] Started switch state service.
Starting syncd service...
[ OK ] Started watchdog control service.
Starting TEAMD container...
[ OK ] Started TEAMD container.
[ OK ] Started syncd service.
Starting LLDP container...
Starting DHCP relay container...
Starting Management Framework container...
Starting Router advertiser container...
[ OK ] Started Router advertiser container.
[ OK ] Started Management Framework container.
[ OK ] Started DHCP relay container.
[ OK ] Started LLDP container.
Starting Update NTP configuration...
Stopping LSB: Start NTP daemon...
[ OK ] Stopped LSB: Start NTP daemon.
Starting LSB: Start NTP daemon...
[ OK ] Started LSB: Start NTP daemon.
[ OK ] Started Update NTP configuration.

Debian GNU/Linux 9 sonic ttyS0

sonic login: admin
Password:

Login incorrect
sonic login: admin
Password:
Last login: Thu Aug 27 04:20:20 UTC 2020 on ttyS0
Linux sonic 4.9.0-11-2-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
You are on


/ | / _ | \ | ()/ |
_
| | | | | | | |
) | || | |\ | | |

|
/ ___/|| _||____|

-- Software for Open Networking in the Cloud --

Unauthorized access and/or use are prohibited.
All access and/or use are subject to monitoring.

Help: http://azure.github.io/SONiC/

admin@sonic:$
admin@sonic:
$
admin@sonic:$
admin@sonic:
$ sudo -i

root@sonic:#
root@sonic:
#
root@sonic:#
root@sonic:
#
root@sonic:#
root@sonic:
# ls -l /var/core
total 7908
-rw-rw-rw- 1 root root 8094020 Aug 27 04:23 python.1598502191.1255.core.gz

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
28bc2fb500f3 docker-sonic-telemetry:latest "/usr/bin/supervisord" 16 minutes ago Up 8 seconds telemetry
d63e7415c56a docker-lldp-sv2:latest "/usr/bin/docker-lld…" 20 minutes ago Up 3 minutes lldp
96a4a65bfc34 docker-router-advertiser:latest "/usr/bin/docker-ini…" 20 minutes ago Up 3 minutes radv
0f07e9f6aadc docker-sonic-mgmt-framework:latest "/usr/bin/supervisord" 20 minutes ago Up About a minute mgmt-framework
955fdf680ddf docker-dhcp-relay:latest "/usr/bin/docker_ini…" 20 minutes ago Up 3 minutes dhcp_relay
797ba772beec docker-syncd-brcm:latest "/usr/bin/supervisord" 20 minutes ago Up 3 minutes syncd
b0387b9a1d2b docker-teamd:latest "/usr/bin/supervisord" 20 minutes ago Up 3 minutes teamd
c9076d2c96da docker-fpm-frr:latest "/usr/bin/supervisord" 20 minutes ago Up 3 minutes bgp
28cb8500068a docker-platform-monitor:latest "/usr/bin/docker_ini…" 20 minutes ago Up 3 minutes pmon
3195be5b90ce docker-database:latest "/usr/local/bin/dock…" 20 minutes ago Up 3 minutes database
[ OK ] Stopped Delays snmp container until SONiC has started.

root@sonic:/var/core# ls -l
total 43816
-rw-rw-rw- 1 root root 277011 Aug 27 04:25 orchagent.1598502300.47.core.gz
-rw-rw-rw- 1 root root 276711 Aug 27 04:26 orchagent.1598502410.50.core.gz
-rw-rw-rw- 1 root root 277774 Aug 27 04:28 orchagent.1598502521.50.core.gz
-rw-rw-rw- 1 root root 276775 Aug 27 04:30 orchagent.1598502631.47.core.gz
-rw-rw-rw- 1 root root 43753472 Aug 27 04:23 python.1598502191.1255.core
root@sonic:/var/core# gdb python python.1598502191.1255.core
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 1255]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python /usr/bin/caclmgrd'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fd2a6f2afff in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007fd2a6f2afff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fd2a6f2c42a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fd2a31740ad in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fd2a3172066 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fd2a31720b1 in std::terminate() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fd2a31722c9 in __cxa_throw ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fd2a452e4ea in swss::SubscriberStateTable::readData() ()
from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#7 0x00007fd2a451e387 in swss::Select::poll_descriptors(swss::Selectable**, unsigned int) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#8 0x00007fd2a451e54b in swss::Select::select(swss::Selectable**, int) ()
from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#9 0x00007fd2a4766c6b in _wrap_Select_select__SWIG_0 (args=)
at swsscommon_wrap.cpp:13075
#10 _wrap_Select_select (self=, args=)
at swsscommon_wrap.cpp:13162
#11 0x000055e19cf59a6f in PyEval_EvalFrameEx ()
#12 0x000055e19cf57c35 in PyEval_EvalCodeEx ()
#13 0x000055e19cf5f3e9 in PyEval_EvalFrameEx ()
#14 0x000055e19cf5f0df in PyEval_EvalFrameEx ()
---Type to continue, or q to quit---
#15 0x000055e19cf5f0df in PyEval_EvalFrameEx ()
#16 0x000055e19cf57c35 in PyEval_EvalCodeEx ()
#17 0x000055e19cf579d9 in PyEval_EvalCode ()
#18 0x000055e19cf8844f in ?? ()
#19 0x000055e19cf83262 in PyRun_FileExFlags ()
#20 0x000055e19cf82dae in PyRun_SimpleFileExFlags ()
#21 0x000055e19cf335f1 in Py_Main ()
#22 0x00007fd2a6f182e1 in __libc_start_main ()
from /lib/x86_64-linux-gnu/libc.so.6
#23 0x000055e19cf32e1a in _start ()
(gdb)

Kernal crash logs
====================

root@sonic:~# warm-reboot -vvv
Thu Aug 27 04:13:00 UTC 2020 Pausing orchagent ...
Thu Aug 27 04:13:00 UTC 2020 Stopping nat ...
Error response from daemon: Cannot kill container: nat: No such container: nat
Thu Aug 27 04:13:01 UTC 2020 Stopped nat ...
Thu Aug 27 04:13:01 UTC 2020 Stopping radv ...
Thu Aug 27 04:13:01 UTC 2020 Stopping bgp ...
Thu Aug 27 04:13:02 UTC 2020 Stopped bgp ...
Thu Aug 27 04:13:06 UTC 2020 Initialize pre-shutdown ...
Thu Aug 27 04:13:07 UTC 2020 Requesting pre-shutdown ...
Thu Aug 27 04:13:08 UTC 2020 Waiting for pre-shutdown ...
Thu Aug 27 04:13:08 UTC 2020 Pre-shutdown succeeded ...
Thu Aug 27 04:13:08 UTC 2020 Backing up database ...
Thu Aug 27 04:13:10 UTC 2020 Stopping teamd ...
Thu Aug 27 04:13:10 UTC 2020 Stopped teamd ...
Thu Aug 27 04:13:10 UTC 2020 Stopping syncd ...
[ 436.866859] BUG: unable to handle kernel paging request at ffff8e240a077ffe
[ 436.950307] IP: [] free_block+0x107/0x1c0
[ 437.017073] PGD 3eb334067 [ 437.047337] PUD 0
[ 437.071360]
[ 437.089145] Oops: 0002 [#1] SMP
[ 437.126688] Modules linked in: dummy xt_hl nf_conntrack_ipv6 nf_defrag_ipv6 xt_TCPMSS xt_tcpudp iptable_mangle optoe at24 nvmem_core dell_z9264f_fpga_ocores(O) i2c_ocores ipmi_devintf i2c_mux_pca954x i2c_mux i2c_dev nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo qat_c3xxx intel_qat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel bonding linux_knet_cb(O) linux_bcm_knet(O) dh_generic psample(O) linux_user_bde(O) linux_kernel_bde(O) evdev dcdbas kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel efi_pstore sg efivars button rsa_generic asn1_decoder mpi pcspkr ipmi_si iTCO_wdt ipmi_msghandler iTCO_vendor_support authenc shpchp ip6table_filter ip6_tables xt_conntrack ebt_vlan iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 ebtable_broute bridge xt_addrtype stp llc iptable_nat nf_conntrack_ipv4 ebtable_nat nf_defrag_ipv4 nf_nat_ipv4 nf_nat ebtable_filter nf_conntrack ebtables efivarfs ip_tables x_tables autofs4 loop ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache nls_utf8 nls_cp437 nls_ascii vfat fat overlay squashfs sd_mod mmc_block crc32c_intel i2c_ismt ahci libahci aesni_intel sdhci_pci xhci_pci sdhci xhci_hcd aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd usbcore usb_common i2c_i801 i2c_smbus mmc_core libata scsi_mod igb i2c_algo_bit dca ptp pps_core thermal
[ 438.528288] CPU: 2 PID: 61 Comm: kworker/2:1 Tainted: G W O 4.9.0-11-2-amd64 #1 Debian 4.9.189-3+deb9u2
[ 438.652153] Hardware name: Dell EMC Z9264F-ON/Z9264F-ON-CPU, BIOS 3.42.0.B-4b 04/16/2020
[ 438.748977] Workqueue: events cache_reap
[ 438.795978] task: ffff8e236d267000 task.stack: ffffaf6c81d2c000
[ 438.866805] RIP: 0010:[] [] free_block+0x107/0x1c0
[ 438.962692] RSP: 0018:ffffaf6c81d2fd18 EFLAGS: 00010086
[ 439.026236] RAX: 00000000ffffffff RBX: ffff8e237fd1bca0 RCX: 00000000fffffffe
[ 439.111619] RDX: 000000000000281e RSI: 0000000000000000 RDI: ffff8e230a078000
[ 439.197003] RBP: ffff8e236f0009c0 R08: 0000000000000000 R09: 0000000000018644
[ 439.282385] R10: 0000000000000002 R11: 0000000000000001 R12: ffff8e230a078000
[ 439.367769] R13: fffffb0f90281e20 R14: fffffb0f90281e00 R15: ffff8e236f001180
[ 439.453153] FS: 0000000000000000(0000) GS:ffff8e237fd00000(0000) knlGS:0000000000000000
[ 439.549977] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 439.618718] CR2: ffff8e240a077ffe CR3: 00000003eb008000 CR4: 0000000000340670
[ 439.704099] Stack:
[ 439.728123] ffffaf6c81d2fd98 ffff8e237fd1bca8 ffff8e236f001188 ffff8e236f0011a8
[ 439.817043] ffff8e237fd1bc90 0000000000000001 0000000000000001 ffff8e237fd1bca0
[ 439.905962] 0000000000000000 ffff8e237fd16d88 ffffffffa99e72af ffff8e237fd1bc90
[ 439.994880] Call Trace:
[ 440.024106] [] ? drain_array_locked+0x5f/0x90
[ 440.095971] [] ? drain_array+0x85/0xb0
[ 440.160553] [] ? cache_reap+0x7c/0x210
[ 440.225137] [] ? process_one_work+0x18a/0x430
[ 440.297000] [] ? worker_thread+0x4d/0x490
[ 440.364705] [] ? process_one_work+0x430/0x430
[ 440.436568] [] ? kthread+0xd9/0xf0
[ 440.496991] [] ? __switch_to_asm+0x41/0x70
[ 440.565734] [] ? kthread_park+0x60/0x60
[ 440.631358] [] ? ret_from_fork+0x57/0x70
[ 440.698019] Code: 0f af f9 0f b6 4d 1c 48 c1 ef 20 29 fa d3 ea 0f b6 4d 1d 01 fa 49 8b 7e 10 d3 ea 8d 48 ff 48 85 ff 41 89 4e 18 0f 84 94 00 00 00 <88> 14 0f 41 8b 46 18 85 c0 0f 84 46 ff ff ff 49 8b 77 10 48 8b
[ 440.930560] RIP [] free_block+0x107/0x1c0
[ 440.998369] RSP
[ 441.040071] CR2: ffff8e240a077ffe
[ 441.079698] ---[ end trace ef4c826a61da8907 ]---
[ 441.139688] general protection fault: 0000 [#2] SMP
[ 441.198048] Modules linked in: dummy xt_hl nf_conntrack_ipv6 nf_defrag_ipv6 xt_TCPMSS xt_tcpudp iptable_mangle optoe at24 nvmem_core dell_z9264f_fpga_ocores(O) i2c_ocores ipmi_devintf i2c_mux_pca954x i2c_mux i2c_dev nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo qat_c3xxx intel_qat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel bonding linux_knet_cb(O) linux_bcm_knet(O) dh_generic psample(O) linux_user_bde(O) linux_kernel_bde(O) evdev dcdbas kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel efi_pstore sg efivars button rsa_generic asn1_decoder mpi pcspkr ipmi_si iTCO_wdt ipmi_msghandler iTCO_vendor_support authenc shpchp ip6table_filter ip6_tables xt_conntrack ebt_vlan iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 ebtable_broute bridge xt_addrtype stp llc iptable_nat nf_conntrack_ipv4 ebtable_nat nf_defrag_ipv4 nf_nat_ipv4 nf_nat ebtable_filter nf_conntrack ebtables efivarfs ip_tables x_tables autofs4 loop ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache nls_utf8 nls_cp437 nls_ascii vfat fat overlay squashfs sd_mod mmc_block crc32c_intel i2c_ismt ahci libahci aesni_intel sdhci_pci xhci_pci sdhci xhci_hcd aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd usbcore usb_common i2c_i801 i2c_smbus mmc_core libata scsi_mod igb i2c_algo_bit dca ptp pps_core thermal
[ 442.599642] CPU: 2 PID: 61 Comm: kworker/2:1 Tainted: G D W O 4.9.0-11-2-amd64 #1 Debian 4.9.189-3+deb9u2
[ 442.723509] Hardware name: Dell EMC Z9264F-ON/Z9264F-ON-CPU, BIOS 3.42.0.B-4b 04/16/2020
[ 442.820343] task: ffff8e236d267000 task.stack: ffffaf6c81d2c000
[ 442.891151] RIP: 0010:[] [] __wake_up_common+0x28/0x90
[ 442.991203] RSP: 0018:ffffaf6c81d2fe70 EFLAGS: 00010002
[ 443.054744] RAX: ad0ff1a9a62fa7a5 RBX: ffffaf6c81d2ff10 RCX: 0000000000000000
[ 443.140126] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffaf6c81d2ff10
[ 443.225511] RBP: ffffaf6c81d2ff18 R08: 0000000000000000 R09: 000000007deec019
[ 443.310892] R10: 0000000000000018 R11: 0000000000000000 R12: 0000000000000282
[ 443.396275] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000046
[ 443.481660] FS: 0000000000000000(0000) GS:ffff8e237fd00000(0000) knlGS:0000000000000000
[ 443.578486] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 443.647226] CR2: 0000000000000028 CR3: 00000003eb008000 CR4: 0000000000340670
[ 443.732609] Stack:
[ 443.756633] 0000000100000000 ffffaf6c81d2ff10 ffffaf6c81d2ff08 0000000000000282
[ 443.845552] 0000000000000000 0000000000000001 0000000000000046 ffffffffa98be3f1
[ 443.934470] ffff8e236d267720 ffff8e236d267000 0000000000000000 ffffffffa98780c0
[ 444.023390] Call Trace:
[ 444.052617] [] ? complete+0x31/0x40
[ 444.114081] [] ? mm_release+0xb0/0x130
[ 444.178663] [] ? do_exit+0x14f/0xb60
[ 444.241167] [] ? rewind_stack_do_exit+0x17/0x20
[ 444.315107] Code: 00 00 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 41 89 cd 55 53 48 89 fd 48 83 c5 08 48 83 ec 08 48 8b 47 08 89 54 24 04 48 39 c5 <48> 8b 08 74 46 48 8d 78 e8 4c 8d 79 e8 41 89 f6 4d 89 c4 8b 1f
[ 444.547648] RIP [] __wake_up_common+0x28/0x90
[ 444.619617] RSP
[ 444.661320] ---[ end trace ef4c826a61da8908 ]---
[ 444.721299] Fixing recursive fault but reboot is needed!
[ 444.784777] BUG: unable to handle kernel paging request at ffffffffffffffd8
[ 444.868278] IP: [] kthread_data+0xc/0x20
[ 444.934007] PGD 3eb00d067 [ 444.964268] PUD 3eb00f067
PMD 0 [ 445.002953]
[ 445.020742] Oops: 0000 [#3] SMP
[ 445.058283] Modules linked in: dummy xt_hl nf_conntrack_ipv6 nf_defrag_ipv6 xt_TCPMSS xt_tcpudp iptable_mangle optoe at24 nvmem_core dell_z9264f_fpga_ocores(O) i2c_ocores ipmi_devintf i2c_mux_pca954x i2c_mux i2c_dev nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo qat_c3xxx intel_qat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel bonding linux_knet_cb(O) linux_bcm_knet(O) dh_generic psample(O) linux_user_bde(O) linux_kernel_bde(O) evdev dcdbas kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel efi_pstore sg efivars button rsa_generic asn1_decoder mpi pcspkr ipmi_si iTCO_wdt ipmi_msghandler iTCO_vendor_support authenc shpchp ip6table_filter ip6_tables xt_conntrack ebt_vlan iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 ebtable_broute bridge xt_addrtype stp llc iptable_nat nf_conntrack_ipv4 ebtable_nat nf_defrag_ipv4 nf_nat_ipv4 nf_nat ebtable_filter nf_conntrack ebtables efivarfs ip_tables x_tables autofs4 loop ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache nls_utf8 nls_cp437 nls_ascii vfat fat overlay squashfs sd_mod mmc_block crc32c_intel i2c_ismt ahci libahci aesni_intel sdhci_pci xhci_pci sdhci xhci_hcd aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd usbcore usb_common i2c_i801 i2c_smbus mmc_core libata scsi_mod igb i2c_algo_bit dca ptp pps_core thermal
[ 446.459882] CPU: 2 PID: 61 Comm: kworker/2:1 Tainted: G D W O 4.9.0-11-2-amd64 #1 Debian 4.9.189-3+deb9u2
[ 446.583748] Hardware name: Dell EMC Z9264F-ON/Z9264F-ON-CPU, BIOS 3.42.0.B-4b 04/16/2020
[ 446.680581] task: ffff8e236d267000 task.stack: ffffaf6c81d2c000
[ 446.751389] RIP: 0010:[] [] kthread_data+0xc/0x20
[ 446.846239] RSP: 0018:ffffaf6c81d2fe70 EFLAGS: 00010002
[ 446.909780] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 446.995165] RDX: 00000299830798e6 RSI: ffff8e236d267080 RDI: ffff8e236d267000
[ 447.080547] RBP: ffffaf6c81d2fec8 R08: ffff8e236d2670a8 R09: 00000000000a0c00
[ 447.165932] R10: 0000000000000000 R11: ffff8e236d267080 R12: ffff8e237fd18980
[ 447.251316] R13: ffff8e236d267000 R14: ffff8e236d2675c0 R15: 0000000000000046
[ 447.336699] FS: 0000000000000000(0000) GS:ffff8e237fd00000(0000) knlGS:0000000000000000
[ 447.433524] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 447.502264] CR2: 0000000000000028 CR3: 00000003eb008000 CR4: 0000000000340670
[ 447.587646] Stack:
[ 447.611669] ffffffffa9895f7a ffffffffa9e1892c ffffaf6c00000008 00ffaf6c81d2fee8
[ 447.700589] ffff8e237fd18980 ad0ff1a9a62fa7a5 ffff8e236d267000 ffffaf6c81d2fdc8
[ 447.789508] 0000000000000000 0000000000000001 0000000000000046 000000000000000b
[ 447.878425] Call Trace:
[ 447.907652] [] ? wq_worker_sleeping+0xa/0x80
[ 447.978477] [] ? __schedule+0x4ac/0x6f0
[ 448.044101] [] ? schedule+0x32/0x80
[ 448.105563] [] ? do_exit+0x8b6/0xb60
[ 448.168069] [] ? rewind_stack_do_exit+0x17/0x20
[ 448.242009] Code: 0b ab 04 00 84 c0 0f 85 50 ff ff ff eb ab e8 bc ac 04 00 e9 9e fe ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 87 60 05 00 00 <48> 8b 40 d8 c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f
[ 448.474550] RIP [] kthread_data+0xc/0x20
[ 448.541319] RSP
[ 448.583022] CR2: ffffffffffffffd8
[ 448.622644] ---[ end trace ef4c826a61da8909 ]---
[ 448.682618] Fixing recursive fault but reboot is needed!

Attachemnt
=========

Pls find the syslog attached.

Thanks
Mini

@mini-nair-dell
Copy link
Author

@anshuv-mfst
Copy link

Issue in 201911.

@anshuv-mfst
Copy link

Need more information:

  • what is platform? TH
  • Is this DELL platform specific issue?
  • Warmboot from 2019 to 2019 release?

@anshuv-mfst
Copy link

Seems like HW watchdog issue and Dell specific cc: @lguohan

@lguohan
Copy link
Collaborator

lguohan commented Sep 2, 2020

there is a hardware watchdog issue, that must be dell platform specific issue in the going down path.

File "/usr/bin/watchdogutil", line 8, in
from watchdogutil.main import watchdogutil
File "/usr/lib/python2.7/dist-packages/watchdogutil/main.py", line 15, in
raise ImportError("%s - required module not found" % str(e[ 414.310317] kexec_core: Starting new kernel
))

@rck-innovium
Copy link
Contributor

Description:

While performing warm-reboot and fast-reboot PTF testcases in T0 topology, python core is generated. This is consistently seen in post-August NOS image.

The python corefile is generated in both T0 and T1 topologies when a simple ‘warm-reboot -vvv’ command is issued on the CLI.

Platform :

x86_64-delta_evs-a-32q56-r0 with T0 topology loaded.

Build :

root@sonic-k2svk26-dut:~# show ver

SONiC Software Version: SONiC.SONiC201911_Innovium.0-dirty-20200820.233443

Distribution: Debian 9.13

Kernel: 4.9.0-11-2-amd64

Build commit: 427382b

Build date: Fri Aug 21 09:40:12 UTC 2020

Built by: admin@serv706

Platform: x86_64-delta_evs-a-32q56-r0

HwSKU: Delta-evs-a-32q56_32x100

ASIC: innovium

Serial Number: EVS32Q56DG2032006

Uptime: 05:05:50 up 1 min, 2 users, load average: 0.82, 0.29, 0.10

Docker images:

REPOSITORY TAG IMAGE ID SIZE

docker-syncd-invm-rpc latest a17c6050eace 1.24GB

Steps to repro:

With T1 or T0 config loaded in the box, perform a warm-reboot

root@sonic:~# warm-reboot -vvv

Tue Sep 15 05:08:48 PDT 2020 Pausing orchagent ...

Tue Sep 15 05:08:48 PDT 2020 Stopping nat ...

Dumping conntrack entries failed

Error response from daemon: Cannot kill container: nat: No such container: nat

Tue Sep 15 05:08:48 PDT 2020 Stopped nat ...

Tue Sep 15 05:08:48 PDT 2020 Stopping radv ...

Tue Sep 15 05:08:48 PDT 2020 Stopping bgp ...

Tue Sep 15 05:08:49 PDT 2020 Stopped bgp ...

Tue Sep 15 05:08:50 PDT 2020 Initialize pre-shutdown ...

Tue Sep 15 05:08:51 PDT 2020 Requesting pre-shutdown ...

Tue Sep 15 05:08:51 PDT 2020 Waiting for pre-shutdown ...

Tue Sep 15 05:08:51 PDT 2020 Pre-shutdown succeeded ...

Tue Sep 15 05:08:51 PDT 2020 Backing up database ...

Tue Sep 15 05:08:52 PDT 2020 Stopping teamd ...

Tue Sep 15 05:08:52 PDT 2020 Stopped teamd ...

Tue Sep 15 05:08:52 PDT 2020 Stopping syncd ...

Tue Sep 15 05:09:09 PDT 2020 Stopped syncd ...

Tue Sep 15 05:09:09 PDT 2020 Stopping all remaining containers ...

Warning: Stopping telemetry.service, but it can still be activated by:

telemetry.timer

Tue Sep 15 05:09:10 PDT 2020 Stopped all remaining containers ...

Tue Sep 15 05:09:13 PDT 2020 Rebooting with /sbin/kexec -e to SONiC-OS-SONiC201911_Innovium.0-dirty-20200820.233443 ...

root@sonic-k2svk26-dut:# [ OK ] Stopped Router advertiser container.
root@sonic-k2svk26-dut:
# [ OK ] Stopped LLDP container.
Stopping SNMP container...
Stopping Management Framework container...
[ OK ] Stopped switch state service.
[ OK ] Stopped Management Framework container.
[ OK ] Stopped SNMP container.
Stopping syncd service...
[ OK ] Started Delays snmp container until SONiC has started.
[ OK ] Stopped Delays snmp container until SONiC has started.

root@sonic-k2svk26-dut:~# [ OK ] Stopped syncd service.
[ OK ] Stopped Telemetry container.
[ OK ] Stopped DHCP relay container.
[ OK ] Stopped TEAMD container.
[ OK ] Stopped Platform monitor container.
Stopping Process and docker CPU/mem… utilization data export daemon...
[ OK ] Stopped BGP container.
[ OK ] Stopped switch state service.
Stopping Control Plane ACL configuration daemon...
[ OK ] Stopped Monitor warm recovery and disable warmboot when done.
[ OK ] Stopped Process and docker CPU/memory utilization data export daemon.
[ OK ] Stopped Database container.
[ OK ] Stopped Control Plane ACL configuration daemon.
[ OK ] Stopped Update minigraph and set configuration based on minigraph.
[ OK ] Stopped Config initialization and migration service.
[ OK ] Stopped Database container.
Stopping Docker Application Container Engine...
[ OK ] Stopped Docker Application Container Engine.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Closed Docker Socket for the API.
[ OK ] Started Update UTMP about System Runlevel Changes.
[ 274.485592] kexec_core: Starting new kernel
[ 0.188352] DMAR-IR: Failed to copy IR table for dmar0 from previous kernel
[ 4.237840] Spectre V2 : System may be vulnerable to spectre v2
[ 6.061138] rc.local[620]: + sonic-cfggen -y /etc/sonic/sonic_version.yml -v build_version [40/1577]
[ 6.219067] kdump-tools[599]: Starting kdump-tools: no crashkernel= parameter in the kernel cmdline ... failed!
[ 6.567276] at24_probe chip.byte_len = 0x4000
[ 6.572149] at24_probe chip.flags = 0x80
[ 6.576532] at24_probe chip.magic = 0x300e
[ 6.581108] at24_probe use_smbus = 2
[ 6.585099] at24_probe num_addresses = 1
[ 6.589481] at24_probe 24c128-delta done
[ 7.554814] rc.local[620]: + SONIC_VERSION=SONiC201911_Innovium.0-dirty-20200820.233443
[ 7.572710] rc.local[620]: + FIRST_BOOT_FILE=/host/image-SONiC201911_Innovium.0-dirty-20200820.233443/platform/firsttime
[ 7.593618] rc.local[620]: + logger SONiC version SONiC201911_Innovium.0-dirty-20200820.233443 starting up...
[ 7.620422] rc.local[620]: + [ ! -e /host/machine.conf ]
[ 7.636380] rc.local[620]: + . /host/machine.conf
[ 7.652441] rc.local[620]: + onie_arch=x86_64
[ 7.672433] rc.local[620]: + onie_bin=
[ 7.684669] rc.local[620]: + onie_boot_reason=rescue
[ 7.700476] rc.local[620]: + onie_build_date=2019-12-20T14:21+08:00
[ 7.720420] rc.local[620]: + onie_build_machine=delta_evs-a-32q56
[ 7.736688] rc.local[620]: + onie_build_platform=x86_64-delta_evs-a-32q56-r0
[ 7.760361] rc.local[620]: + onie_cli_static_parms=
[ 7.776397] rc.local[620]: + onie_cli_static_url=sonic-innovium.bin_08-Aug-20.224014
[ 7.796385] rc.local[620]: + onie_config_version=1
[ 7.812336] rc.local[620]: + onie_dev=/dev/sda2
[ 7.828222] rc.local[620]: + onie_exec_url=sonic-innovium.bin_08-Aug-20.224014
[ 7.844241] rc.local[620]: + onie_firmware=auto
[ 7.860230] rc.local[620]: + onie_grub_image_name=grubx64.efi
[ 7.876185] rc.local[620]: + onie_initrd_tmp=/
[ 7.892207] rc.local[620]: + onie_installer=/var/tmp/installer
[ 7.908610] rc.local[620]: + onie_kernel_version=4.9.95
[ 7.924247] rc.local[620]: + onie_machine=delta_evs-a-32q56
[ 7.940194] rc.local[620]: + onie_machine_rev=0
[ 7.956184] rc.local[620]: + onie_partition_type=gpt
[ 7.972254] rc.local[620]: + onie_platform=x86_64-delta_evs-a-32q56-r0
[ 7.992222] rc.local[620]: + onie_root_dir=/mnt/onie-boot/onie
[ 8.008199] rc.local[620]: + onie_skip_ethmgmt_macs=no
[ 8.028188] rc.local[620]: + onie_switch_asic=bft
[ 8.044357] rc.local[620]: + onie_uefi_arch=x64
[ 8.060310] rc.local[620]: + onie_uefi_boot_loader=grubx64.efi
[ 8.076326] rc.local[620]: + onie_vendor_id=2254
[ 8.092320] rc.local[620]: + onie_version=master-201912201421delta_evs-a-32q56_v0.1.0-dirty
[ 8.112382] rc.local[620]: + program_console_speed
[ 8.129489] rc.local[620]: + cat /proc/cmdline
[ 8.166424] rc.local[620]: + grep -Eo console=ttyS[0-9]+,[0-9]+
[ 8.191507] rc.local[620]: + cut -d , -f2
[ 8.214718] rc.local[620]: + speed=115200
[ 8.228419] rc.local[620]: + [ -z 115200 ]
[ 8.240363] rc.local[620]: + CONSOLE_SPEED=115200
[ 8.256332] rc.local[620]: + sed -i s|--keep-baud .* %I| 115200 %I|g /lib/systemd/system/[email protected]
[ 8.276264] rc.local[620]: + systemctl daemon-reload
[ 8.292231] rc.local[620]: + [ -f /host/image-SONiC201911_Innovium.0-dirty-20200820.233443/platform/firsttime ]
[ 8.312445] rc.local[620]: + [ -f /var/log/fsck.log.gz ]
[ 8.335876] rc.local[620]: + gunzip -d -c /var/log/fsck.log.gz
[ 8.354240] rc.local[620]: + logger -t FSCK
[ 8.368655] rc.local[620]: + rm -f /var/log/fsck.log.gz
[ 8.388173] rc.local[620]: + exit 0

root@sonic-k2svk26-dut:~# ls -lt /var/core/
total 31636
-rw-rw-rw- 1 root root 8092919 Sep 15 05:09 python.1600171750.1106.core.gz
-rw-rw-rw- 1 root root 8096675 Sep 15 05:04 python.1600171473.1107.core.gz
-rw-rw-rw- 1 root root 8109865 Sep 15 02:14 python.1600161252.1115.core.gz
-rw-rw-rw- 1 root root 8093097 Sep 14 16:05 python.1600124732.1112.core.gz

BACKTRACE

root@sonic-k2svk26-dut:~# gdb /usr/bin/python
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python...(no debugging symbols found)...done.
(gdb) core-file /var/core/python.1600171750.1106.core
warning: core file may not match specified executable file.
[New LWP 1106]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python /usr/bin/caclmgrd'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007feb3cdf7fff in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007feb3cdf7fff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007feb3cdf942a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007feb390400ad in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007feb3903e066 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007feb3903e0b1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007feb3903e2c9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007feb3a3fa4ea in swss::SubscriberStateTable::readData() () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#7 0x00007feb3a3ea387 in swss::Select::poll_descriptors(swss::Selectable**, unsigned int) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#8 0x00007feb3a3ea54b in swss::Select::select(swss::Selectable**, int) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#9 0x00007feb3a632c6b in _wrap_Select_select__SWIG_0 (args=) at swsscommon_wrap.cpp:13075
#10 _wrap_Select_select (self=, args=) at swsscommon_wrap.cpp:13162
#11 0x000055d44408a84a in PyEval_EvalFrameEx ()
#12 0x000055d4440889f5 in PyEval_EvalCodeEx ()
#13 0x000055d444090448 in PyEval_EvalFrameEx ()
#14 0x000055d44409014f in PyEval_EvalFrameEx ()
#15 0x000055d44409014f in PyEval_EvalFrameEx ()
#16 0x000055d4440889f5 in PyEval_EvalCodeEx ()
#17 0x000055d4440887b9 in PyEval_EvalCode ()
#18 0x000055d4440b8bff in ?? ()
#19 0x000055d4440b3b52 in PyRun_FileExFlags ()
#20 0x000055d4440b369e in PyRun_SimpleFileExFlags ()
#21 0x000055d444064771 in Py_Main ()
#22 0x00007feb3cde52e1 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#23 0x000055d444063f9a in _start ()
(gdb)

@rck-innovium
Copy link
Contributor

there is a hardware watchdog issue, that must be dell platform specific issue in the going down path.

File "/usr/bin/watchdogutil", line 8, in
from watchdogutil.main import watchdogutil
File "/usr/lib/python2.7/dist-packages/watchdogutil/main.py", line 15, in
raise ImportError("%s - required module not found" % str(e[ 414.310317] kexec_core: Starting new kernel
))

The calcmgrd core during warm-reboot is happening on our platform as well.
On our platform, we do NOT have the hardware watchdog issue or the kernel crash.
We have updated the details here.
#5258 (comment)

@aravindmani-1
Copy link
Contributor

aravindmani-1 commented Oct 29, 2020

2.0 API was invoked by watchdogutil but 2.0 API was not implemented in 201911 branch. This has caused the crash. #5637 will fix this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants