Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Voq chassis orchagent crash with 34K routes #3329

Merged

Conversation

saksarav-nokia
Copy link
Contributor

What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two asics and has 2 port channels in each asic and 2 port members in each port channel.
The ip address is configured on each port channel and bgp is enabled. The neighbor and routes are learned on these port channel.
In sonic-mgmt pc suite, the test case po-update removes the port members from one of the port channel, removes the ip address configured on that port channel, creates new port channel, adds the same port members to the new port channel, adds the same ip address to the new port channel.
In the remote asic, before all the routes learned on the old port channel are removed by routeOrch, orchagent trries to remove the neighbor and nexthop for the old portchannel. But since the routes are pending, the old nexthop and neighbor are not removed. Then the neighbor and nexthop for the new port channel are being added. If the neighbor is learned on remote system port in remote asic, the nexthop is added with alias as inband port's alias, so the key (ip,alias) is same for both old nexthop and new nexthop. When the new nexthop is added , it calls hasNextHop function to check if the nexthop with (ip-address, alias) as key and since the old nexthop is not removed yet, the hasNextHop returns true, however the assert(!hasNextHop) does n't trigger the crash. So addNextHop function replace the old nexthop with old rif-id with new nexthop with new old rif-id in the nexthop map. Then after all the routes learned on old port channel is removed, the old neighbor and old nexthop are being removed. Sine the old nexthop was replaced with new nexthop, when orchagent tries to delete the old nexthop, it actually deletes the new nexthop from SAI. Then when it tries to remove the old neighbor, SAI returns error since orchagent removed the new nexthop from SAI instead of old nexthop and old neighbor is still referenced by the old nexthop in SAI. So orchagent crashes when SAI returns error.
How I verified it
Ran pc and voq suite and verified it passes.
Details if related

@saksarav-nokia
Copy link
Contributor Author

This PR fixes the issue sonic-net/sonic-buildimage#20507 which is introduced by #3269

@saksarav-nokia saksarav-nokia force-pushed the saksara-nokia-orchagent-crash-fix branch from 1f88894 to a5d3baa Compare November 15, 2024 17:35
@saksarav-nokia
Copy link
Contributor Author

@arlakshm for your viz

remote_neigh = nkey
break

assert remote_neigh != "", "Remote neigh not found in ASIC_DB"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the remote neigh entry will be present. Should we check if the remote neighbor entry is still old neighbor? also should we add a check to see if the nexthop is not updated because of addneighbor?

remote_neigh = nkey
break

assert remote_neigh != "", "Remote neigh not found in ASIC_DB"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a check to see we have new nexthop for new neighbor?

@saksarav-nokia saksarav-nokia force-pushed the saksara-nokia-orchagent-crash-fix branch 2 times, most recently from f12e81d to b4e808f Compare November 18, 2024 15:12
@saksarav-nokia
Copy link
Contributor Author

@arlakshm , addressed your review comments. Please check

@rlhui rlhui added the P0 label Nov 28, 2024
@rlhui
Copy link
Contributor

rlhui commented Nov 28, 2024

@saksarav-nokia - please rebase? thanks.

NextHopKey nexthop = { ip_address, ibif.m_alias};
if (hasNextHop(nexthop))
{
it++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation is not right

arlakshm
arlakshm previously approved these changes Nov 28, 2024
Copy link
Contributor

@arlakshm arlakshm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@prsunny prsunny merged commit d006374 into sonic-net:master Dec 2, 2024
17 checks passed
@mlok-nokia
Copy link
Contributor

@prsunny This PR is required by 202405 also. Please cherry-pick it to 202405 branch

bradh352 pushed a commit to bradh352/sonic-swss that referenced this pull request Dec 4, 2024
What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two asics and has 2 port channels in each asic and 2 port members in each port channel.
The ip address is configured on each port channel and bgp is enabled. The neighbor and routes are learned on these port channel.
bradh352 pushed a commit to bradh352/sonic-swss that referenced this pull request Dec 4, 2024
What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two asics and has 2 port channels in each asic and 2 port members in each port channel.
The ip address is configured on each port channel and bgp is enabled. The neighbor and routes are learned on these port channel.
mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Dec 4, 2024
What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two asics and has 2 port channels in each asic and 2 port members in each port channel.
The ip address is configured on each port channel and bgp is enabled. The neighbor and routes are learned on these port channel.
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202405: #3410

mssonicbld pushed a commit that referenced this pull request Dec 4, 2024
What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two asics and has 2 port channels in each asic and 2 port members in each port channel.
The ip address is configured on each port channel and bgp is enabled. The neighbor and routes are learned on these port channel.
divyachandralekha pushed a commit to divyachandralekha/sonic-swss that referenced this pull request Dec 12, 2024
What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two asics and has 2 port channels in each asic and 2 port members in each port channel.
The ip address is configured on each port channel and bgp is enabled. The neighbor and routes are learned on these port channel.
divyachandralekha pushed a commit to divyachandralekha/sonic-swss that referenced this pull request Dec 12, 2024
What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two asics and has 2 port channels in each asic and 2 port members in each port channel.
The ip address is configured on each port channel and bgp is enabled. The neighbor and routes are learned on these port channel.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

7 participants