Fixed FDB cleanup race issue where the mac flush may flush newly learnt MACs #2679
…wrong time when new mac learnt
I didn't find the test plan for |
… of the entire test
Thanks for the comment! I have changed the code so that we always clear the MAC table each time, and also at the end of the entire test run. I also moved some duplicated code into a new method so the same code can be reused.
Added some feedback.
tests/fdb/test_fdb.py
Outdated
    while not done:
        total_dyn_mac_count = get_fdb_dynamic_mac_count(duthost)
        if total_dyn_mac_count != 0:
            time.sleep(FDB_CLEAN_UP_SLEEP_TIMEOUT)
        else:
            return
I think this section could be refactored to use wait_until. This does require us to set an upper bound on how long we'll poll, but I think that's probably a good thing to have in the (hopefully unlikely 😄) case we hit some bug and the dynamic MAC count never hits 0.
@daall Agreed. I have made the changes to use wait_until() and pytest_assert().
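For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of a bounded polling helper. It is not the repository's wait_until; `poll_until` and `FakeCounter` are hypothetical names invented for illustration, and the timeout and interval values are arbitrary. The point is only that the loop returns False once the deadline passes, instead of spinning forever if the dynamic MAC count never reaches 0.

```python
import time


def poll_until(timeout, interval, condition, *args, **kwargs):
    """Call condition repeatedly until it is truthy or timeout seconds elapse.

    Hypothetical helper for illustration; returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args, **kwargs):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeCounter:
    """Hypothetical stand-in for the DUT: replays a list of dynamic MAC counts,
    then keeps reporting the last one."""

    def __init__(self, counts):
        self._counts = list(counts)

    def is_empty(self):
        # Consume the next count; once one value is left, repeat it forever.
        count = self._counts.pop(0) if len(self._counts) > 1 else self._counts[0]
        return count == 0
```

With a counter that drains (`FakeCounter([3, 1, 0])`) the helper returns True after a few polls; with one stuck at a non-zero count it gives up and returns False, which the caller can then turn into a test failure with an assertion.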
…) for more code reuse
Description of PR
test_fdb.py results can be flaky, and on some platforms the test may always fail, because of a race condition introduced by the fixture that cleans up the FDB.
As part of the test case, the fixture "fdb_cleanup" runs at init time and issues "sonic-clear mac" to the DUT. This command can take time to execute within the DUT. If the test case starts sending packets to populate the MAC table before the clear has fully executed on the DUT, the newly learnt MACs can be cleared out by accident. The rest of the tests then fail because MACs are missing from the MAC table, and the expected traffic cannot be forwarded without them.
Type of change
How did you do it?
To eliminate this race condition and the uncertainty it causes, I have converted the fixture "fdb_cleanup" into a standalone method that is called at setup time and at cleanup time. I also changed the algorithm to first check whether the MAC table is already empty, in which case there is no need to issue the "sonic-clear mac" command. When a clear is required, instead of sending "sonic-clear mac" and assuming it is done, the method now waits until no MACs remain before it allows the next test to proceed. This way we can be sure we will not accidentally clear out the MACs that the test needs in order to run.
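The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in tests/fdb/test_fdb.py: `FakeDut` is a hypothetical in-memory stand-in for the DUT whose flush completes a few polls after the clear command, the timeout and interval defaults are arbitrary, and a plain `assert` stands in for pytest_assert().

```python
import time


class FakeDut:
    """Hypothetical DUT model: the MAC flush finishes some polls after clear_mac()."""

    def __init__(self, dynamic_macs, polls_to_flush=2):
        self.dynamic_macs = dynamic_macs
        self.polls_to_flush = polls_to_flush
        self.clear_issued = 0
        self._pending = None  # polls remaining until the flush takes effect

    def clear_mac(self):
        # Models issuing "sonic-clear mac": the command returns before the flush is done.
        self.clear_issued += 1
        self._pending = self.polls_to_flush

    def dynamic_mac_count(self):
        if self._pending is not None:
            if self._pending == 0:
                self.dynamic_macs = 0
                self._pending = None
            else:
                self._pending -= 1
        return self.dynamic_macs


def fdb_cleanup(dut, timeout=1.0, interval=0.01):
    """Clear dynamic MACs only if needed, then block until the table is empty."""
    if dut.dynamic_mac_count() == 0:
        return  # table already empty: no clear command needed
    dut.clear_mac()
    deadline = time.monotonic() + timeout
    while dut.dynamic_mac_count() != 0:
        # Bounded wait: fail loudly rather than let the next test race the flush.
        assert time.monotonic() < deadline, "FDB table not cleared within timeout"
        time.sleep(interval)
```

Calling `fdb_cleanup(FakeDut(dynamic_macs=4))` issues one clear and returns only after the count reads 0, while `fdb_cleanup(FakeDut(dynamic_macs=0))` returns immediately without issuing a clear.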
How did you verify/test it?
Ran the changed test case on the platform that was always failing, and also ran it on the MLNX platform to ensure the new changes did not break any functionality.