What I did
This PR changes the ARP refresh mechanism; details are provided in the sections below.
Why I did it
SONiC depends on the Linux kernel to manage the ARP/ND tables. SONiC then listens to ARP/ND events from the kernel and synchronizes the hardware as required. However, this approach has several problems:
The kernel does not "see" the routed (in HW) through-traffic, and so cannot update its "hit bits" accordingly. Therefore the kernel may age out an entry that is still in use.
The kernel also does not "see" the HW MAC aging process, and so does not know that a MAC address associated with an ARP/ND entry has been aged out, and so does not refresh it. This can result in traffic black holes for a "quiet" neighbor (i.e. one that does not transmit much).
There is a further problem in MCLAG/ICCP setups whereby the response to an ARP/ND initiated by the kernel on one peer can go to the other peer. This eventually makes its way back across the ICCP control plane, but by then the kernel may have already aged out the entry.
The current ARP Refresh process is implemented as a bash script, and cannot run fast enough to be effective at scale, requiring the network operator to set much higher aging timers than would otherwise be used. It's also a very inefficient use of system resources. So, the proposal here is to design and implement a much faster and more efficient instance of the ARP Refresh process.
How I verified it
a. For ARP (three refreshes for 12.12.12.2 are shown in the logs below; other ARP entries and further logs are omitted)
admin@sonic:~$ show arp
Address MacAddress Iface Vlan
10.59.128.1 00:00:0c:9f:f4:68 eth0 -
12.12.12.2 00:10:94:00:00:05 Ethernet0 -
12.12.12.3 00:10:94:00:00:06 Ethernet0 -
12.12.12.4 00:10:94:00:00:07 Ethernet0 -
12.12.12.5 00:10:94:00:00:08 Ethernet0 -
Total number of entries 5
admin@sonic:~$ sudo tcpdump -ei Ethernet0
19:27:40.364309 3c:2c:99:2d:84:35 (oui Unknown) > 00:10:94:00:00:05 (oui Unknown), ethertype ARP (0x0806), length 42: Request who-has 12.12.12.2 tell 12.12.12.1, length 28
19:27:40.364666 00:10:94:00:00:05 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype ARP (0x0806), length 60: Reply 12.12.12.2 is-at 00:10:94:00:00:05 (oui Unknown), length 46
19:32:40.397044 3c:2c:99:2d:84:35 (oui Unknown) > 00:10:94:00:00:05 (oui Unknown), ethertype ARP (0x0806), length 42: Request who-has 12.12.12.2 tell 12.12.12.1, length 28
19:32:40.397380 00:10:94:00:00:05 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype ARP (0x0806), length 60: Reply 12.12.12.2 is-at 00:10:94:00:00:05 (oui Unknown), length 46
19:37:40.428211 3c:2c:99:2d:84:35 (oui Unknown) > 00:10:94:00:00:05 (oui Unknown), ethertype ARP (0x0806), length 42: Request who-has 12.12.12.2 tell 12.12.12.1, length 28
19:37:40.428622 00:10:94:00:00:05 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype ARP (0x0806), length 60: Reply 12.12.12.2 is-at 00:10:94:00:00:05 (oui Unknown), length 46
admin@sonic:~$ sudo tcpdump -ei Ethernet0
b. For NDP (three refreshes for 2100::2 are shown in the logs below; other NDP entries and further logs are omitted)
admin@sonic:~$ show ndp | head
Address MacAddress Iface Vlan Status
2100::2 00:10:94:00:00:09 Ethernet0 - REACHABLE
2100::3 00:10:94:00:00:0a Ethernet0 - REACHABLE
2100::4 00:10:94:00:00:0b Ethernet0 - REACHABLE
2100::5 00:10:94:00:00:0c Ethernet0 - REACHABLE
fe80::1a5a:58ff:fe17:c2e0 18:5a:58:17:c2:e0 eth0 - STALE
fe80::1a5a:58ff:fe18:f720 18:5a:58:18:f7:20 eth0 - STALE
fe80::1a5a:58ff:fe19:620 18:5a:58:19:06:20 eth0 - STALE
fe80::3e2c:99ff:fe2d:8735 3c:2c:99:2d:87:35 eth0 - STALE
11:55:46.283420 3c:2c:99:2d:84:35 (oui Unknown) > 33:33:ff:00:00:02 (oui Unknown), ethertype IPv6 (0x86dd), length 86: fe80::3e2c:99ff:fe2d:8435 > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2100::2, length 32
11:55:46.283763 00:10:94:00:00:09 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype IPv6 (0x86dd), length 86: 2100::2 > fe80::3e2c:99ff:fe2d:8435: ICMP6, neighbor advertisement, tgt is 2100::2, length 32
12:00:46.314416 3c:2c:99:2d:84:35 (oui Unknown) > 33:33:ff:00:00:02 (oui Unknown), ethertype IPv6 (0x86dd), length 86: fe80::3e2c:99ff:fe2d:8435 > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2100::2, length 32
12:00:46.314820 00:10:94:00:00:09 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype IPv6 (0x86dd), length 86: 2100::2 > fe80::3e2c:99ff:fe2d:8435: ICMP6, neighbor advertisement, tgt is 2100::2, length 32
12:06:46.350847 3c:2c:99:2d:84:35 (oui Unknown) > 33:33:ff:00:00:02 (oui Unknown), ethertype IPv6 (0x86dd), length 86: fe80::3e2c:99ff:fe2d:8435 > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2100::2, length 32
12:06:46.351333 00:10:94:00:00:09 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype IPv6 (0x86dd), length 86: 2100::2 > fe80::3e2c:99ff:fe2d:8435: ICMP6, neighbor advertisement, tgt is 2100::2, length 32
Details if related
ARP Refresh Thread:
ARP refresh functionality is added to the neighsyncd process.
Neighsyncd is responsible for syncing the kernel ARP table to the hardware via the APP_DB and OrchAgents. Neighsyncd listens on netlink events (RTM_NEWNEIGH, RTM_DELNEIGH) and creates/deletes NEIGH_TABLE entries in APP_DB.
The existing functionality of neighsyncd is retained as is. In addition to managing NEIGH_TABLE entries in APP_DB, neighsyncd also queues the details of each neighbor for the new ARP Refresh thread described below.
A new ARP refresh thread is created in neighsyncd:
to dequeue the neighbor events and populate a neighbor cache.
to periodically refresh ARP/ND entries by sending ARP request / NS packets.
to subscribe to redis-db to gather the data required to send the ARP refresh packets.
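The hand-off between the netlink listener and the refresh thread can be pictured as a simple producer/consumer queue. The following is a minimal Python sketch for illustration only, not the neighsyncd code (which is C++); the names `neigh_events`, `neighbor_cache`, and `refresh_thread_main`, the event tuple layout, and the `None` shutdown marker are all assumptions.

```python
import queue
import threading

# Hypothetical queue filled by the netlink listener and drained by the refresh thread.
neigh_events = queue.Queue()
# Hypothetical neighbor cache keyed by (IP address, interface name).
neighbor_cache = {}


def refresh_thread_main(events, cache):
    """Dequeue neighbor events and keep the cache in step with the kernel."""
    while True:
        event = events.get()
        if event is None:  # assumed shutdown marker, not part of the PR
            break
        op, ip, ifname, mac = event
        if op == "RTM_NEWNEIGH":
            cache[(ip, ifname)] = mac
        elif op == "RTM_DELNEIGH":
            cache.pop((ip, ifname), None)


worker = threading.Thread(target=refresh_thread_main,
                          args=(neigh_events, neighbor_cache))
worker.start()

# The listener side: enqueue events as they arrive from the kernel.
neigh_events.put(("RTM_NEWNEIGH", "12.12.12.2", "Ethernet0", "00:10:94:00:00:05"))
neigh_events.put(("RTM_NEWNEIGH", "12.12.12.3", "Ethernet0", "00:10:94:00:00:06"))
neigh_events.put(("RTM_DELNEIGH", "12.12.12.3", "Ethernet0", None))
neigh_events.put(None)
worker.join()
```

Keeping the cache private to the refresh thread means the existing APP_DB path never has to take a lock on it; the queue is the only shared object.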
The ARP refresh thread consists of the following modules.
Neighbor Cache Management
Add neighbor entries to cache when the entry is learned from the kernel
All Dynamically learned neighbor entries [ARP, ND (Global, LinkLocal)]
All Static neighbor entries (MAC can be dynamic)
The following entries are not added to the neighbor cache:
Neighbors learned on the "eth0" interface
Neighbors learned from BGP/EVPN MAC/IP type-2 routes
Entries for the switch's own IP addresses, and permanent FF:FF:FF:FF:FF:FF entries
Remove entries from cache when the entry is deleted from Kernel
v4/v6 Neighbors Cache [map] contents are: -
Key = IP Address + InterfaceName [Phy/PortChannel/Vlan/Sag]
Value
MAC Address
State (Reachable/Failed)
Timestamp (Entry creation/last refresh)
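The cache layout and the exclusion rules above can be sketched as a small map keyed by IP address and interface name. This is an illustrative Python sketch, not the actual implementation; `NeighborEntry`, `NeighborCache`, and the `is_evpn_remote` / `is_own_ip` flags are hypothetical names standing in for whatever the netlink attributes provide.

```python
import time
from dataclasses import dataclass, field


@dataclass
class NeighborEntry:
    mac: str
    state: str  # "Reachable" or "Failed"
    # Entry creation / last refresh time.
    timestamp: float = field(default_factory=time.monotonic)


class NeighborCache:
    """Map of (IP address, interface name) -> NeighborEntry."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def is_cacheable(ifname, mac, is_evpn_remote=False, is_own_ip=False):
        # Exclusions listed in the description; the two flags are assumed inputs.
        if ifname == "eth0":
            return False
        if is_evpn_remote or is_own_ip:
            return False
        if mac.lower() == "ff:ff:ff:ff:ff:ff":
            return False
        return True

    def on_new_neigh(self, ip, ifname, mac, state="Reachable", **flags):
        """Add or update an entry; returns False if the entry is excluded."""
        if not self.is_cacheable(ifname, mac, **flags):
            return False
        self._entries[(ip, ifname)] = NeighborEntry(mac=mac, state=state)
        return True

    def on_del_neigh(self, ip, ifname):
        """Remove the entry when the kernel deletes the neighbor."""
        self._entries.pop((ip, ifname), None)

    def get(self, ip, ifname):
        return self._entries.get((ip, ifname))

    def __len__(self):
        return len(self._entries)
```

For example, the management-port neighbor from the `show arp` output would be rejected, while the Ethernet0 neighbors would be cached.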
Interface Cache Management
Required for framing the ARP packets we send
Interface Cache [Map]
Key = Interface name
Value = IP, MAC, Ifname to Index
Subscribe to redis-db tables
IP address
- CONFIG_DB: INTERFACE, VLAN_INTERFACE, SAG_INTERFACE
MAC
- CONFIG_DB: DEVICE_METADATA ==> System MAC
- CONFIG_DB: SAG_GLOBAL
Ifname to Index (required for socket send)
Packet Builder
Based on Neighbor Cache
Build ARP packet
Build NS packet
For a resolved ARP entry (destination MAC known), the ARP request is unicast
For an unresolved ARP entry (destination MAC unknown), the ARP request is broadcast
IPv6 NS uses multicast
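To make the addressing rules concrete, here is a rough Python sketch of how the destination addresses could be chosen. It is an assumption-laden illustration, not the PR's packet builder: `build_arp_request` and `solicited_node_multicast` are hypothetical helpers, and the target hardware address field is simply zero-filled.

```python
import ipaddress
import struct

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def mac_to_bytes(mac):
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(src_mac, src_ip, target_ip, target_mac=None):
    """Ethernet + ARP request: unicast if target_mac is known, else broadcast."""
    dst_mac = target_mac if target_mac else BROADCAST_MAC
    eth_header = (mac_to_bytes(dst_mac) + mac_to_bytes(src_mac)
                  + struct.pack("!H", 0x0806))  # ethertype ARP
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6, 4,     # hardware / protocol address lengths
        1,        # opcode: request
        mac_to_bytes(src_mac),
        ipaddress.IPv4Address(src_ip).packed,
        bytes(6),  # target hardware address left as zeros (assumption)
        ipaddress.IPv4Address(target_ip).packed,
    )
    return eth_header + arp_payload


def solicited_node_multicast(target_ip):
    """Return the (IPv6 group, Ethernet MAC) an NS for target_ip is sent to."""
    low24 = ipaddress.IPv6Address(target_ip).packed[-3:]
    # ff02::1:ff00:0/104 prefix followed by the low 24 bits of the target.
    group = ipaddress.IPv6Address(
        bytes.fromhex("ff0200000000000000000001ff") + low24)
    # IPv6 multicast MAC: 33:33 followed by the low 32 bits of the group.
    mac = "33:33:" + ":".join(f"{b:02x}" for b in group.packed[-4:])
    return str(group), mac


unicast = build_arp_request("3c:2c:99:2d:84:35", "12.12.12.1",
                            "12.12.12.2", "00:10:94:00:00:05")
broadcast = build_arp_request("3c:2c:99:2d:84:35", "12.12.12.1", "12.12.12.2")
ns_group, ns_mac = solicited_node_multicast("2100::2")
```

With the addresses from the verification logs, the unicast frame is 42 bytes long and addressed to 00:10:94:00:00:05, and the NS destination for 2100::2 works out to ff02::1:ff00:2 with MAC 33:33:ff:00:00:02, which matches the tcpdump captures.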
Send Refresh
Send ARP/NS packets using raw socket
Separate sockets for ARP and ICMPv6 NS
Send Unicast packet
VLAN tagging & FDB lookup happens in kernel based on outgoing interface
Refresh Timer
Traverse the neighbor Cache entries periodically (every 30 secs)
Check whether the refresh timeout has elapsed for each neighbor
If elapsed then send ARP/NS packet
Refresh timeout Calculation:
To avoid sending all ARP/NS packets simultaneously, each neighbor entry is given a different refresh timeout value. This value is derived from the MAC and ARP/ND aging times.
ARP Reference Timeout (ARP_RT) = Lesser of [MAC age, ARP age]
ND Reference Timeout (ND_RT) = Lesser of [MAC age, ND age]
Refresh Timeout = 30% to 70% of [ARP/ND Reference Timeout]
For example:
MAC Age is less than ARP age
ARP ageout = 60 mins
MAC ageout = 30 mins
Reference Timeout = 30 mins.
Refresh timeout will be between 30% and 70% of the reference timeout (9 to 21 mins).
ARP Age is less than MAC age
ARP ageout = 60 mins
MAC ageout = 90 mins
Reference Timeout = 60 mins
Refresh timeout will be between 30% and 70% of the reference timeout (18 to 42 mins).
The refresh timeout is set whenever the neighbor entry is added or updated in the cache, and is recomputed after each ARP/NS refresh packet is sent.
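The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical, and a uniform random draw is an assumption, since the description says only that the value lies between 30% and 70% of the reference timeout.

```python
import random


def reference_timeout(mac_age_s, neigh_age_s):
    """Lesser of the MAC age and the ARP/ND age, in seconds."""
    return min(mac_age_s, neigh_age_s)


def refresh_timeout(mac_age_s, neigh_age_s, rng=random):
    """Pick a per-neighbor timeout in the 30%..70% band of the reference timeout."""
    ref = reference_timeout(mac_age_s, neigh_age_s)
    return rng.uniform(0.3 * ref, 0.7 * ref)  # uniform spread is an assumption


def due_for_refresh(last_refresh_s, timeout_s, now_s):
    """Checked for each cache entry on every periodic (30 s) traversal."""
    return now_s - last_refresh_s >= timeout_s


# First example from the description: MAC age 30 mins, ARP age 60 mins.
timeout_s = refresh_timeout(mac_age_s=30 * 60, neigh_age_s=60 * 60)
```

Here `timeout_s` falls somewhere between 540 and 1260 seconds (9 to 21 mins). Because each entry draws its own value, and draws again after every refresh, the probes spread out over time instead of arriving in one burst.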
Recommended Configurations: