Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed FDB cleanup race issue where the mac flush may flush newly learnt MACs #2679

Merged
merged 3 commits into from
Dec 18, 2020
Merged

Conversation

gechiang
Copy link
Contributor

Description of PR

test_fdb.py results may be flaky from time to time and for some platform it may always fail due to the nature of race condition introduced by the fixture to clean up the FDB.

As part of the test case the fixture "fdb_cleanup" is ran at init time where it issues "sonic-clear mac" to the DUT. But this cmd can take time to execute within the DUT. If the test case proceeds to start sending packets to populate the MAC table before this clear MAC is fully executed by the DUT, those intended MACs can end up accidentally cleared out due to race condition and causing rest of the tests to fail since there are no MACs or missing MACs in the MAC table. The expected traffic will not be able to be forwarded without those MACs.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

How did you do it?

To eliminate this race condition and the uncertainty that it causes, I have converted the fixture "fdb_cleanup" as a standalone method to be called at setup time and at clean up time. I also changed the algorithm to always check if the MAC table is already empty which no need to issue the "sonic-clear MAC" cmd. In case clear MAC is required, instead of sending the "sonic-clear mac" and thinking it is done, I have changed it to wait until it sees there are no more MACs before it allows the next test to proceed. This way we are sure we will not accidentally clear out the MACS that is needed for the test to run.

How did you verify/test it?

Ran the changed testcase on the platform that was always failing as well as run it on MLNX platform to ensur ethe new changes did not break any functionality.

@bingwang-ms
Copy link
Collaborator

I didn't find the test plan for test_fdb. In my opinion, the purpose for this case is to verify whether FDB entry is populated by ethernet/arp reply/arp request. If the FDB entry is not cleared between each run, how can we verify that?

@gechiang
Copy link
Contributor Author

I didn't find the test plan for test_fdb. In my opinion, the purpose for this case is to verify whether FDB entry is populated by ethernet/arp reply/arp request. If the FDB entry is not cleared between each run, how can we verify that?

Thanks for the comment! I have changed the code to ensure that each time we always clear the MAC table and also at the end of the entire test run. Also moved some duplicate code to a new method to reuse same code...

Copy link
Contributor

@daall daall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some feedback.

Comment on lines 177 to 182
while not done:
total_dyn_mac_count = get_fdb_dynamic_mac_count(duthost)
if total_dyn_mac_count != 0:
time.sleep(FDB_CLEAN_UP_SLEEP_TIMEOUT)
else:
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section could be refactored to use wait_until. This does require us to set an upper-bound on how long we'll poll, but I think that's probably a good thing to have in the (hopefully unlikely 😄) case we hit some bug and the dynamic MAC count never hits 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daall Agreed. I have made the changes to use wait_until() and pytest_assert().

tests/fdb/test_fdb.py Show resolved Hide resolved
@gechiang gechiang merged commit 7b36652 into sonic-net:master Dec 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants