- Revision
- About this Manual
- Scope
- Acronyms
- 1. Modular VOQ Chassis - Reference
- 2. SONiC Platform Management & Monitoring
- 3. Detailed Workflow
Rev | Date | Author | Change Description |
---|---|---|---|
1.0 | | Manjunath Prabhu, Sureshkannan Duraisamy, Marty Lok, Marc Snider | Initial version |
This document provides the design requirements for, and the interactions between, platform drivers and PMON for SONiC on a VOQ chassis with linecard CPUs.
For the first phase of the design, this document covers the high-level design of the platform support and its interactions with PMON in a VOQ chassis environment. Operations like firmware upgrade will be added at a later stage of development. This document assumes that all linecards and control cards (aka supervisors) have a CPU complex where SONiC runs. It only considers fabric cards without a CPU, or cases where SONiC is not running on the fabric CPU even if one is available. This document also assumes that linecards cannot be powered on unless the control card is operationally up and running.
PSU - Power Supply Unit
SFM - Switch Fabric Module
Platform Stack - Set of processes, daemons and dockers implementing the functional requirements of a platform & its peripherals (e.g. PMON docker, Thermalctld, Database docker, etc.)
Management Stack - Set of processes, daemons and dockers implementing the management interface of the chassis, linecards and per-asic (e.g. CLI, SNMP, DHCP, etc.).
Control Plane Stack - Set of processes, daemons and dockers implementing control plane protocols such as BGP and EVPN, and also providing complete APP/ASIC DB orchestration (OrchAgent).
Datapath Stack - Set of processes, daemons, dockers and APIs implementing datapath ASIC hardware programming via the SAI interface.
The picture below shows a reference high-level hardware architecture of a VOQ chassis. The chassis has 1 or 2 control cards (aka supervisor cards), 1 or more linecards and 1 or more switch fabric cards. It also has 1 or more fan trays, 1 or more PSUs and a midplane ethernet. In general, the control cards manage peripherals like the fans, PSUs, midplane ethernet, etc.
As an example, the Nokia modular VOQ chassis is the IXR-7250, which has control cards (e.g. CPMv1, CPMv2), linecards (e.g. imm36-400g-qsfpdd, imm36-32x100g-4x400g-qsfpdd, etc.) and fabric cards (e.g. SFMv1, SFMv2).
At a functional level of the chassis, SONiC will manage the control cards, line cards and all other peripheral devices of the chassis as required by the chassis platform vendor specification. The requirements below capture some of the key areas required to operate a VOQ chassis.
- Chassis control cards & line cards should be able to boot using ONIE or any vendor-specific boot method.
- Linecards should be managed via the control card to support operations like power up/down and getting operational status.
- In a typical chassis, the control card manages the fan speed based on readings from various temperature sensors on the linecards and the chassis.
- The control card monitors the PSUs of the chassis.
- LEDs and transceivers are present on linecards and can be managed via the linecard's SONiC platform instance.
- Some of these peripherals are pluggable and hot-swap capable.
- In general, a VOQ chassis has a midplane ethernet which interconnects the linecards and control cards for internal communication. This should be initialized upon platform boot and can be used for IP connectivity between control cards and linecards.
- Each linecard will have a management interface, either directly to the external management network or via the internal midplane ethernet.
In the modular disaggregated SONiC software architecture, each linecard runs an instance of the SONiC platform stack, and the control card runs its own instance of the SONiC platform stack. Each linecard's resources are managed as an independent fixed platform while still providing all of the above functional requirements to operate the chassis. The picture below describes a high-level view of the platform stack.
- Each linecard & control card will have its own ONIE_PLATFORM string to differentiate the cards and their variants.
- The control card won't run any protocol stack, except the SWSS and SyncD dockers for managing the Switch Fabric.
- Each linecard & control card would run one instance of the PMON container.
- Each linecard & control card would run one instance of the redis server for common platform monitoring (host network) and also use the per-asic redis for SFP monitoring.
- Control card & linecards communicate over the midplane ethernet. In order to provide this IP connectivity between control & line cards, the midplane ethernet drivers run in the host network namespace.
- Each linecard & control card gets an IP address (internal network) assigned to this midplane ethernet based on slot information.
- The control card PMON will have all sensor readings, either by fetching them from the linecard redis servers (subscribe to multiple servers) or from a global redis DB on the control card (publish to a single server).
- SONiC on fixed platforms has put together the PMON 2.0 APIs for platform vendors to implement peripheral drivers (kernel or userspace). Most of the existing PMON 2.0 APIs will be used for the chassis; some key changes and enhancements are required, as detailed below.
- The control card will provide a driver implementation to obtain linecard status such as present or empty.
SONiC supports ONIE as a boot method and also allows vendor-specific boot methods. In either case, the control card of the chassis is booted first, followed by the linecards. For the first phase of the design, it is assumed that the control card must be operationally ready before the linecards boot. This is important because some of the sensor and fan settings are managed on the control card, and they have to be set to correct values while linecards are running to keep the chassis healthy and avoid overheating.
The control card can be booted using the ONIE method. Upon boot, a unique ONIE_PLATFORM string provided in the ONIE firmware differentiates the cards and determines which services/dockers to start via the systemd generator. In the case of the control card, dockers like BGP, LLDP, etc. won't be started. This service list is included as part of a platform-specific service list file.
device/
|-- <VENDOR_NAME>/
| |-- <ONIE_PLATFORM_STRING>/
| | |-- <HARDWARE_SKU>/
| | | |-- port_config.ini
| | | |-- sai.profile
| | | |-- xxx.config.bcm
| | |-- default_sku
| | |-- fancontrol
| | |-- installer.conf
| | |-- platform_env.conf
| | |-- led_proc_init.soc
| | |-- platform_reboot
| | |-- pmon_daemon_control.json
| | |-- sensors.conf
| | |-- asic.conf
| | |-- services.conf [NEWFILE]
sonic-buildimage/device/nokia/x86_64-nokia_ixr7250_36x400g-r0$ cat asic.conf
NUM_ASIC=1
HW_TYPE=IOM
sonic-buildimage/device/nokia/x86_64-nokia_ixr7250_36x400g-r0$
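services.conf is a new file listing the services/dockers to be started on a given card. The exact format is not finalized; a hypothetical control-card version might look like this (contents illustrative):

# services started on the control card (hypothetical format)
database
pmon
swss
syncd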
The linecard boot process is very similar to that of the control card; the main difference is that the services started on a linecard will include protocol dockers such as BGP, LLDP, etc. Also, the SyncD docker will be started for the VOQ ASIC instead of the SF ASIC.
A typical modern modular chassis includes a midplane ethernet to interconnect the control card & line cards. This is a new component that needs to be added to SONiC. This document proposes treating the midplane ethernet as a platform peripheral and captures the design as follows.
- When a linecard or control card boots, the midplane ethernet gets initialized as part of its initialization.
- The slot number is generally used in assigning an IP address to these interfaces, as sketched below.
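A minimal sketch of such slot-based address assignment; the internal subnet and helper name are assumptions, actual addressing is vendor-specific:

# Hypothetical internal midplane subnet; actual addressing is vendor-specific
MIDPLANE_SUBNET_PREFIX = "192.168.1."

def get_midplane_ip(slot):
    # e.g. the card in slot 16 gets 192.168.1.16
    return MIDPLANE_SUBNET_PREFIX + str(slot)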
In order to allow direct access to linecards from outside of the chassis over the external management network, the chassis midplane ethernet network and the external management network need to be connected to each other. There are a couple of options to consider.
- L2 Bridging: The control-card can create a virtual switch (linux bridge) and add the midplane ethernet and external management interfaces to this bridge. This is the L2 mode of operation, but internal communication and external L2 station traffic will be seen inside this midplane ethernet.
- IP Routing: The midplane ethernet could be configured with an externally reachable network (announced via a routing protocol). This requires the management interface on the control card to run a routing protocol, which isn't a common deployment.
- NAT: A statically assigned, externally reachable management IP address per linecard via chassisd, with NAT used to map between the external and internal midplane IP addresses (see the sketch after this list). In this case, internal midplane ethernet traffic won't be seen on the external management network and only direct communication is allowed using the NAT rules.
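For the NAT option, a minimal sketch of the mapping using iptables on the control card; the external (10.0.0.101) and internal (192.168.1.1) addresses are illustrative assumptions:

# Map a hypothetical external management IP to a line-card's internal midplane IP
iptables -t nat -A PREROUTING -d 10.0.0.101 -j DNAT --to-destination 192.168.1.1
iptables -t nat -A POSTROUTING -s 192.168.1.1 -j SNAT --to-source 10.0.0.101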
Allowing DHCP relay or a DHCP client on the internal midplane ethernet isn't considered for the first phase of the design.
A modular chassis has control-cards, line-cards and fabric-cards along with other peripherals. The different types of cards have to be managed and monitored.
- Identify a central entity that has visibility of the entire chassis.
- Monitor the status of the line-cards, fabric-cards, etc. using new PMON 2.0 APIs. The assumption is that each vendor will have platform drivers or an implementation to detect the status of the cards in the chassis.
- The status will need to be persisted in REDIS-DB.
- PMON processes can subscribe to UP/DOWN events of these cards.
The schema for CHASSIS_CARD_INFO table in State DB is:
key = CHASSIS_CARD|<card index>         ; card information, stored in state_db
; field = value
name = STRING ; name of the card
slot = 1*2DIGIT ; slot number in the chassis
status = "Empty" | "Online" | "Offline" ; status of the card
type = "control"| "line" | "fabric" ; card-type
The line-card status update will happen in the main monitoring loop.
In src/sonic-platform-daemons/sonic-chassisd/scripts/chassisd:
class DaemonChassisd(DaemonBase):
    def run(self):
        # Connect to STATE_DB and create linecard/chassis info tables
        state_db = daemon_base.db_connect(swsscommon.STATE_DB)
        linecard_tbl = swsscommon.Table(state_db, LINECARD_INFO_TABLE)

        # Start main loop: periodically refresh line-card status in STATE_DB
        logger.log_info("Start daemon main loop")
        while not self.stop.wait(LINECARD_INFO_UPDATE_PERIOD_SECS):
            linecard_db_update(linecard_tbl, num_linecard)
        logger.log_info("Stop daemon main loop")
A LineCardBase class is introduced for chassis vendors to implement their representation of line-cards in a chassis.
In src/sonic-platform-common/sonic_platform_base/linecard_base.py
class LineCardBase(object):
    """
    Abstract base class for implementing a platform-specific class to
    represent a control-card, line-card or fabric-card of a chassis
    """

    def __init__(self):
        self._linecard_list = []

    def get_name(self):
        raise NotImplementedError

    def get_description(self):
        raise NotImplementedError

    def get_slot(self):
        raise NotImplementedError

    def get_status(self):
        raise NotImplementedError

    def reboot_slot(self):
        raise NotImplementedError

    def set_admin_state(self, state):
        # enable or disable the card
        raise NotImplementedError
In src/sonic-platform-common/sonic_platform_base/chassis_base.py
class ChassisBase(device_base.DeviceBase):
def get_num_linecards(self):
def get_all_linecards(self):
def get_linecard_presence(self, lc_index):
An example vendor implementation would be as follows:
In platform/broadcom/<vendor>/sonic_platform/linecard.py
from sonic_platform_base.linecard_base import LineCardBase

class LineCard(LineCardBase):
    def __init__(self, linecard_index):
        LineCardBase.__init__(self)
        # Vendor-specific initialization for the card at this index
        self.index = linecard_index
The show platform command is enhanced to show chassis information:
show platform details
PLATFORM INFO TABLE
-----------------------------------------------------------
| Slot | Name | Status |
-----------------------------------------------------------
| 16 | cpm2-ixr | Online |
| 1 | imm36-400g-qsfpdd | Online |
| 2 | imm36-400g-qsfpdd | Online |
| 3 | imm36-400g-qsfpdd | Online |
| 4 | Empty | Empty |
| 17 | SFM1 | Offline |
| 18 | SFM2 | Offline |
| 19 | SFM3 | Offline |
| 20 | SFM4 | Offline |
| 21 | SFM5 | Online |
| 22 | SFM6 | Offline |
-----------------------------------------------------------
In some environments, the control-card and the linecards may not necessarily have reachability to external networks. Linecards without an external USB slot could use the control-card as an image server to download the SONiC image, assuming the control-cards have external USB storage or internal storage hosting the images. We propose that chassisd on the control-card can be a placeholder for the SONiC bootable images and run an HTTP server for image download by the line-cards.
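A minimal sketch of such an image server using the Python standard library; the image directory and port are assumptions:

import http.server
import socketserver
import os

IMAGE_DIR = "/var/images"   # hypothetical location of SONiC images
HTTP_PORT = 8000            # hypothetical port

def serve_images():
    # Serve the image directory over HTTP so line-cards can fetch images
    os.chdir(IMAGE_DIR)
    with socketserver.TCPServer(("", HTTP_PORT),
                                http.server.SimpleHTTPRequestHandler) as httpd:
        httpd.serve_forever()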
In a chassis environment, processes monitoring peripherals will need to have a view of the components across multiple cards. The requirement would be to aggregate the data on the control-card. There are 2 options:
- Disaggregated DB - Each card updates its local REDIS-DB. The monitoring process will pull from or subscribe to the table updates of each card (see the sketch below).
- Global DB - Each card will update its state to a line-card table in the Global-DB.
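A minimal sketch of the disaggregated-DB option, where the control card connects to a line-card's REDIS over the midplane; get_linecard_midplane_ip() is the API proposed in the midplane section below, and the port/timeout constants are assumptions:

REDIS_PORT = 6379              # assumed default REDIS port
REDIS_TIMEOUT_MSECS = 0        # assumed timeout

def connect_linecard_state_db(chassis, slot):
    # Connect to the STATE_DB of the line-card in the given slot
    ip = chassis.get_linecard_midplane_ip(slot)
    return swsscommon.DBConnector(swsscommon.STATE_DB, ip,
                                  REDIS_PORT, REDIS_TIMEOUT_MSECS)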
Processes running in the PMON container will differ based on the HWSKU. In the chassis, the control-card and line-cards will run a subset of the PMON processes. The existing control file /device/<vendor>/<platform>/<hwsku>/pmon_daemon_control.json is used to start processes on each of the cards. A new template /dockers/docker-platform-monitor/critical_processes.j2 is introduced to dynamically generate the critical_processes file instead of the current statically defined list.
PSUd in PMON will monitor the PSUs and maintain their state in REDIS-DB. On a chassis, the PSUs are fully managed by the control-card. Currently, the platform exposes APIs for PSUd to periodically query PSU status/presence.
One of the functional requirements for the chassis is to manage and monitor the power available vs the power required. The total number of PSUs required is a function of the number of line-cards, SFMs and FANs.
- PSUd will get the power-capacity of each PSU.
- PSUd will calculate the total power capacity as the power-capacity of each PSU multiplied by the number of PSUs with a valid status.
- PSUd will get the fixed maximum power requirement of each type of line-card, each SFM and each FAN.
- PSUd will calculate the total power required as the sum, over each card type, of the number of cards of that type multiplied by the maximum power requirement of that card type.
- PSUd will set a Master-LED state based on power available vs power required.
We do not see a requirement for real-time monitoring of current power usage of each card.
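A minimal sketch of this power-budget check in PSUd; get_capacity() and get_max_power() are assumed helpers, not existing PMON 2.0 APIs:

def check_power_budget(chassis):
    # Total capacity: per-PSU capacity summed over PSUs with a valid status
    total_capacity = sum(psu.get_capacity() for psu in chassis.get_all_psus()
                         if psu.get_status())
    # Total required: fixed maximum power summed over line-cards and FANs
    # (SFMs would be summed in the same way)
    total_required = sum(card.get_max_power()
                         for card in chassis.get_all_linecards())
    total_required += sum(fan.get_max_power() for fan in chassis.get_all_fans())
    # Reflect available-vs-required power on the Master-LED
    chassis.set_status_led('green' if total_capacity >= total_required else 'red')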
show platform psustatus
admin@sonic:~$ show platform psustatus
PSU Status
----- -----------
PSU 1 OK
PSU 2 OK
PSU 3 OK
PSU 4 NOT PRESENT
PSU 5 NOT PRESENT
PSU 6 NOT PRESENT
Thermalctld monitors temperatures and fan speeds, and allows policies to control the fan speed.
- There are multiple temperature sensors that need to be monitored. All these readings need to be available on the control-card.
- Temperature sensors are on the control-card.
- Temperature sensors are on the line-cards.
- Temperature sensors are on the SFMs.
- The FAN control is limited to the control-card.
- Line-card up/down events notified by chassisd are subscribed to by thermalctld.
- All local temperature sensors are recorded on both control and line-cards for monitoring. The control-card also monitors the temperature sensors of the SFMs.
- Chassisd on the control-card will periodically fetch the summary info from each of the line-cards. Alternatively, thermalctld on the control-card can subscribe to the line-card sensor updates.
- The local temperatures of control-card, line-cards and fabric-cards are passed on to the fan-control algorithm.
- The fan-control algorithm can be implemented in PMON or in the platform-driver.
The change in thermalctld is to have a TemperatureUpdater class for each line-card. Each updater class will fetch the values of all temperature sensors of the line-card from the REDIS-DB of that line-card.
In src/sonic-platform-daemons/sonic-thermalctld/scripts/thermalctld:
class TemperatureUpdater():
    def __init__(self, chassis, slot):
        self.chassis = chassis
        self.slot = slot

    def update_per_slot(self, slot):
        # Connect to the STATE_DB of the given slot and record all thermal
        # sensor values against the corresponding line-card object
        self.chassis._linecard_list[slot].set_thermal_info()

class ThermalMonitor(ProcessTaskBase):
    def __init__(self, chassis):
        if platform_chassis.get_controlcard_slot() == platform_chassis.get_my_slot():
            # On the control-card: one updater per line-card slot
            self.temperature_updater = {}
            for card in platform_chassis.get_all_linecards():
                slot = card.get_slot()
                self.temperature_updater[slot] = TemperatureUpdater(chassis, slot)
        else:
            # On a line-card: a single updater for the local slot
            slot = platform_chassis.get_my_slot()
            self.temperature_updater = TemperatureUpdater(chassis, slot)

    def task_worker(self):
        while not self.task_stopping_event.wait(wait_time):
            # Only on the control-card: update sensors for every line-card slot
            if platform_chassis.get_controlcard_slot() == platform_chassis.get_my_slot():
                for slot, updater in self.temperature_updater.items():
                    updater.update_per_slot(slot)
            else:
                self.temperature_updater.update()
The thermal_infos.py and thermal_actions.py will continue to be vendor-specific. In the collect() function, the vendor will have access to information from all the sensors of the chassis.
In platform/broadcom/<vendor>/sonic_platform/thermal_infos.py
class ThermalInfo(ThermalPolicyInfoBase):
    def collect(self, chassis):
        # Vendor-specific calculation from all available sensor values on the chassis
        pass
In approach-1, the thermal_policy.json can provide additional actions, e.g. checking if a line-card temperature has exceeded its threshold. The thermalctld run_policy() will match the required condition and take the appropriate action to set the fan speed.
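A hypothetical thermal_policy.json fragment for approach-1; the condition and action type names are illustrative and would be registered by the vendor's thermal_infos.py/thermal_actions.py:

{
    "policies": [
        {
            "name": "linecard temperature over threshold",
            "conditions": [
                {"type": "thermal.over.high_threshold"}
            ],
            "actions": [
                {"type": "fan.all.set_speed", "speed": "100"}
            ]
        }
    ]
}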
In approach-2, the sensor information could be passed on to the platform-driver, which can then control the fan speed.
show platform fan
admin@sonic:~$ show platform fan
FAN Speed Direction Presence Status Timestamp
-------- ------- --------------------- ---------- -------- -----------------
FanTray1 50% FAN_DIRECTION_EXHAUST Present OK 20200429 06:11:16
FanTray2 50% FAN_DIRECTION_EXHAUST Present OK 20200429 06:11:17
FanTray3 50% FAN_DIRECTION_EXHAUST Present OK 20200429 06:11:18
show platform temperature
admin@sonic:~$ show platform temperature
Sensor Temperature High TH Low TH Crit High TH Crit Low TH Warning Timestamp
--------- ------------- --------- -------- -------------- ------------- --------- -----------------
Thermal 0 28 50 0 N/A N/A False 20200529 01:49:39
Thermal 1 37 50 0 N/A N/A False 20200529 01:49:39
Thermal 2 40 68 0 N/A N/A False 20200529 01:49:39
Thermal 3 45 68 0 N/A N/A False 20200529 01:49:39
Thermal 4 32 68 0 N/A N/A False 20200529 01:49:39
Thermal 5 59 68 0 N/A N/A False 20200529 01:49:39
- Database connections per namespace - Database dockers run per namespace and PMON processes need to connect to each of these database instances.
- Update per namespace port status - The PMON processes will need to run per-asic specific functionality in a separate thread.
Below is a code snippet introducing a new API, db_unix_connect.
In src/sonic-daemon-base/sonic_daemon_base/daemon_base.py
def db_unix_connect(db, namespace):
from swsscommon import swsscommon
return swsscommon.DBConnector(db,
REDIS_UNIX_SOCKET_PATH+str(namespace)+REDIS_UNIX_SOCKET_INFO,
REDIS_TIMEOUT_MSECS)
Below is a code snippet to connect to State-DB.
In src/sonic-platform-daemons/sonic-xcvrd/scripts/xcvrd
use_unix_sockets = False
# Check if environment is multi-asic
if check_multiasic():
use_unix_sockets = True
# Connect to STATE_DB and create transceiver dom/sfp info tables
if not use_unix_sockets:
state_db = daemon_base.db_connect(swsscommon.STATE_DB)
else:
state_db = daemon_base.db_unix_connect(swsscommon.STATE_DB, namespace)
Below is a code snippet to run namespace-specific functionality per thread.
In src/sonic-platform-daemons/sonic-xcvrd/scripts/xcvrd
# Run daemon
def run(self):
    logger.log_info("Starting up...")

    # Start daemon initialization sequence
    self.init()

    if num_asics == 1:
        use_unix_sockets = False
        self.run_per_asic(0)
    else:
        self.xcvrd_thread_list = []
        for i in range(0, num_asics):  # Number of ASICs per pmon
            thread = threading.Thread(target=self.run_per_asic, args=(i,))
            thread.setName('Xcvrd Thread ' + str(i))
            self.xcvrd_thread_list.append(thread)
        for thread in self.xcvrd_thread_list:
            thread.start()
        for thread in self.xcvrd_thread_list:
            thread.join()

    # Start daemon deinitialization sequence
    self.deinit()
Additional new APIs like set_namespace() and get_namespace() can be provided in chassis_base.py, which can be set by PMON processes. This will enable modules supporting Platform 2.0 to be aware of, or query, which namespace they are running in.
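A minimal sketch of these accessors in chassis_base.py:

class ChassisBase(device_base.DeviceBase):
    def set_namespace(self, namespace):
        # Record the namespace this PMON process instance operates in
        self._namespace = namespace

    def get_namespace(self):
        # Return the namespace recorded via set_namespace()
        return self._namespace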
show interfaces status
admin@sonic:~$ sudo ip netns exec asic0 show interfaces status
Interface Lanes Speed MTU Alias Vlan Oper Admin Type Asym PFC
----------- ----------------------- ------- ----- ------------ ------ ------ ------- ----------------------------------------------- ----------
Ethernet1 8,9,10,11,12,13,14,15 400G 9100 Ethernet1/1 routed down down N/A N/A
Ethernet2 0,1,2,3,4,5,6,7 400G 9100 Ethernet1/2 routed down down N/A N/A
Ethernet3 24,25,26,27,28,29,30,31 400G 9100 Ethernet1/3 routed up up QSFP-DD Double Density 8X Pluggable Transceiver N/A
Ethernet4 16,17,18,19,20,21,22,23 400G 9100 Ethernet1/4 routed down down N/A N/A
Ethernet5 40,41,42,43,44,45,46,47 400G 9100 Ethernet1/5 routed down down N/A N/A
Ethernet6 32,33,34,35,36,37,38,39 400G 9100 Ethernet1/6 routed down down N/A N/A
Ethernet7 80,81,82,83,84,85,86,87 400G 9100 Ethernet1/7 routed down down QSFP-DD Double Density 8X Pluggable Transceiver N/A
Ethernet8 88,89,90,91,92,93,94,95 400G 9100 Ethernet1/8 routed down down N/A N/A
Ethernet9 64,65,66,67,68,69,70,71 400G 9100 Ethernet1/9 routed down down N/A N/A
Ethernet10 72,73,74,75,76,77,78,79 400G 9100 Ethernet1/10 routed down down N/A N/A
Ethernet11 48,49,50,51,52,53,54,55 400G 9100 Ethernet1/11 routed down down N/A N/A
Ethernet12 56,57,58,59,60,61,62,63 400G 9100 Ethernet1/12 routed down down N/A N/A
show interfaces transceiver presence
admin@sonic:~$ sudo ip netns exec asic0 show interfaces transceiver presence
Port Presence
---------- -----------
Ethernet1 Not present
Ethernet2 Not present
Ethernet3 Present
Ethernet4 Not present
Ethernet5 Not present
Ethernet6 Not present
Ethernet7 Present
Ethernet8 Not present
Ethernet9 Not present
Ethernet10 Not present
Ethernet11 Not present
Ethernet12 Not present
The requirements are similar to Xcvrd:
- Ledd needs to subscribe to the REDIS-DB in each namespace to receive PORT UP/DOWN updates.
- Ledd needs to be modified to be namespace aware. The LED monitoring tasks are run per namespace (a minimal sketch follows).
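A minimal sketch of a per-namespace LED listener, reusing the db_unix_connect() API introduced above; SELECT_TIMEOUT_MSECS is an assumed constant:

SELECT_TIMEOUT_MSECS = 1000   # assumed select timeout

def led_task_worker(namespace):
    # Subscribe to PORT_TABLE updates in this namespace's APPL_DB
    appl_db = daemon_base.db_unix_connect(swsscommon.APPL_DB, namespace)
    sel = swsscommon.Select()
    sst = swsscommon.SubscriberStateTable(appl_db, swsscommon.APP_PORT_TABLE_NAME)
    sel.addSelectable(sst)
    while True:
        (state, selectable) = sel.select(SELECT_TIMEOUT_MSECS)
        if state != swsscommon.Select.OBJECT:
            continue
        (key, op, fvp) = sst.pop()
        # Update the front-panel LED for this port based on its oper status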
show led status
admin@sonic:~$show led status
FRONT-PANEL INTERFACE STATUS TABLE
----------------------------------------------
| Interface | Status |
----------------------------------------------
| Ethernet1 | state=fast-blink amber |
| Ethernet2 | state=fast-blink amber |
| Ethernet3 | state=on green |
| Ethernet4 | state=fast-blink amber |
| Ethernet5 | state=fast-blink amber |
| Ethernet6 | state=fast-blink amber |
| Ethernet7 | state=fast-blink amber |
| Ethernet8 | state=fast-blink amber |
| Ethernet9 | state=fast-blink amber |
| Ethernet10 | state=fast-blink amber |
| Ethernet11 | state=fast-blink amber |
| Ethernet12 | state=fast-blink amber |
| Ethernet13 | state=fast-blink amber |
| Ethernet14 | state=fast-blink amber |
| Ethernet15 | state=on green |
| Ethernet16 | state=fast-blink amber |
| Ethernet17 | state=fast-blink amber |
| Ethernet18 | state=fast-blink amber |
| Ethernet19 | state=fast-blink amber |
| Ethernet20 | state=fast-blink amber |
| Ethernet21 | state=fast-blink amber |
| Ethernet22 | state=fast-blink amber |
| Ethernet23 | state=fast-blink amber |
| Ethernet24 | state=fast-blink amber |
| Ethernet25 | state=fast-blink amber |
| Ethernet26 | state=fast-blink amber |
| Ethernet27 | state=fast-blink amber |
| Ethernet28 | state=fast-blink amber |
| Ethernet29 | state=fast-blink amber |
| Ethernet30 | state=fast-blink amber |
| Ethernet31 | state=fast-blink amber |
| Ethernet32 | state=fast-blink amber |
| Ethernet33 | state=fast-blink amber |
| Ethernet34 | state=fast-blink amber |
| Ethernet35 | state=fast-blink amber |
| Ethernet36 | state=fast-blink amber |
----------------------------------------------
Syseepromd will run on control and line-cards independently and monitor for any changes in syseeprom. The functionality is similar to fixed-platform devices.
To manage and monitor the midplane ethernet, the following vendor-specific PMON 2.0 APIs can be introduced:
- API to initialize the midplane on both control and line cards - init_midplane_switch()
- APIs to check midplane connectivity:
- On line-card to check if control-card is reachable via midplane - is_midplane_controlcard_reachable()
- On control-card to check if line-card on slot is reachable via midplane - is_midplane_linecard_reachable(slot)
- APIs to get slot and IP-addresses of control and line cards.
In platform/broadcom/<vendor>/sonic_platform/chassis.py:
def init_midplane_switch():
def is_midplane_controlcard_reachable():
def is_midplane_linecard_reachable(slot):
def get_my_slot():
def get_controlcard_slot():
def get_controlcard_midplane_ip():
def get_linecard_midplane_ip(slot):
The proposal would be to use Chassisd to implement this functionality.
In src/sonic-platform-daemons/sonic-chassisd/scripts/chassisd:
class midplane_monitor_task:
def task_worker(self):
# Create midplane network
if platform_chassis is not None:
platform_chassis.init_midplane_switch()
else:
sys.exit(NOT_IMPLEMENTED)
logger.log_info("Start midplane task loop")
while not self.stop.wait(MIDPLANE_MONITOR_PERIOD_SECS):
if platform_chassis.get_controlcard_slot() == platform_chassis.get_my_slot():
for card in platform_chassis.get_all_linecards():
platform_chassis.is_midplane_linecard_reachable(card.get_slot())
else:
platform_chassis.is_midplane_controlcard_reachable()
logger.log_info("Stop midplane task loop")