Fixed FDB cleanup race issue where the mac flush may flush newly learnt MACs #2679

gechiang · 2020-12-17T05:36:30Z

Description of PR

test_fdb.py results can be flaky, and on some platforms the test always fails, because the fixture that cleans up the FDB introduces a race condition.

As part of the test case, the fixture "fdb_cleanup" runs at init time and issues "sonic-clear mac" to the DUT. This cmd can take time to execute on the DUT. If the test case starts sending packets to populate the MAC table before the clear has fully completed on the DUT, the MACs the test intended to learn can be cleared out by the race. The rest of the tests then fail because MACs are missing from the MAC table, and the expected traffic cannot be forwarded without them.
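The race described above can be illustrated with a toy model. This is only a sketch: `FakeDut` and its methods are hypothetical names invented for illustration, and the deferred flush is an assumed simplification of how the DUT processes the clear command.

```python
class FakeDut:
    """Toy model (hypothetical) of a DUT whose MAC flush completes asynchronously."""

    def __init__(self):
        self.mac_table = set()
        self.flush_pending = False

    def request_clear(self):
        # Models the clear command returning before the flush has actually run.
        self.flush_pending = True

    def learn(self, mac):
        # Models the DUT learning a source MAC from received traffic.
        self.mac_table.add(mac)

    def run_pending_work(self):
        # Models the DUT getting around to the queued flush later.
        if self.flush_pending:
            self.mac_table.clear()
            self.flush_pending = False


dut = FakeDut()
dut.learn("00:11:22:33:44:01")  # stale entry left by an earlier test
dut.request_clear()             # cleanup is issued, but not yet executed
dut.learn("00:11:22:33:44:02")  # test traffic arrives before the flush runs
dut.run_pending_work()          # the late flush wipes the new entry too
print(dut.mac_table)            # set()
```

The entry the test needed is gone, so any forwarding check that depends on it would fail.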

Type of change

Bug fix
Testbed and Framework(new/improvement)
Test case(new/improvement)

How did you do it?

To eliminate this race condition and the uncertainty it causes, I converted the fixture "fdb_cleanup" into a standalone method that is called at setup time and at cleanup time. I also changed the algorithm to first check whether the MAC table is already empty, in which case there is no need to issue the "sonic-clear mac" cmd. When a clear is required, instead of sending "sonic-clear mac" and assuming it is done, the method now waits until there are no more MACs before it allows the next test to proceed. This way we are sure we will not accidentally clear out the MACs that are needed for the test to run.
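The flow described above can be sketched roughly as follows. This is not the code from the PR: the two callables stand in for "count the dynamic MACs on the DUT" and "issue the clear command", and the `max_polls` bound and the parameter names are assumptions added so the sketch cannot loop forever.

```python
import time


def fdb_cleanup(count_dynamic_macs, clear_macs, max_polls=50, interval_s=2.0):
    """Clear the MAC table only if needed, then wait until it is really empty.

    count_dynamic_macs and clear_macs are hypothetical callables supplied by
    the caller; max_polls is an assumed upper bound on the wait.
    """
    if count_dynamic_macs() == 0:
        return  # table already empty, no clear command needed

    clear_macs()

    # Do not trust that the clear has finished; poll until no MACs remain.
    for _ in range(max_polls):
        if count_dynamic_macs() == 0:
            return
        time.sleep(interval_s)

    raise RuntimeError("dynamic MACs still present after FDB cleanup")


# Usage with an in-memory table standing in for the DUT.
table = {"00:11:22:33:44:01"}
fdb_cleanup(lambda: len(table), table.clear, interval_s=0.0)
print(len(table))  # 0
```

Calling the same method at setup and at teardown means each test starts from an empty table, and no traffic is sent until the flush is confirmed complete.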

How did you verify/test it?

Ran the changed test case on the platform that was always failing, and also ran it on the MLNX platform to ensure the new changes did not break any functionality.

bingwang-ms · 2020-12-17T07:11:42Z

I didn't find the test plan for test_fdb. In my opinion, the purpose for this case is to verify whether FDB entry is populated by ethernet/arp reply/arp request. If the FDB entry is not cleared between each run, how can we verify that?

gechiang · 2020-12-17T18:31:08Z

> I didn't find the test plan for test_fdb. In my opinion, the purpose for this case is to verify whether FDB entry is populated by ethernet/arp reply/arp request. If the FDB entry is not cleared between each run, how can we verify that?

Thanks for the comment! I have changed the code so that we always clear the MAC table at the beginning of each test and also at the end of the entire test run. I also moved some duplicate code into a new method for reuse.

daall

Added some feedback.

daall · 2020-12-17T18:38:38Z

tests/fdb/test_fdb.py

+        while not done:
+            total_dyn_mac_count = get_fdb_dynamic_mac_count(duthost)
+            if total_dyn_mac_count != 0:
+                time.sleep(FDB_CLEAN_UP_SLEEP_TIMEOUT)
+            else:
+                return


I think this section could be refactored to use wait_until. This does require us to set an upper-bound on how long we'll poll, but I think that's probably a good thing to have in the (hopefully unlikely 😄) case we hit some bug and the dynamic MAC count never hits 0.

@daall Agreed. I have made the changes to use wait_until() and pytest_assert().
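For readers unfamiliar with the pattern, here is a rough, self-contained sketch of a bounded poll. The `wait_until` below is a local stand-in written for this example, not the sonic-mgmt helper itself; its `(timeout, interval, condition, *args)` call shape is an assumption about that helper, and a plain `assert` stands in for `pytest_assert`. `fdb_table_is_empty` and `fake_count` are hypothetical names.

```python
import time


def wait_until(timeout, interval, condition, *args, **kwargs):
    """Stand-in helper: poll condition until it is true or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args, **kwargs):
            return True
        if time.monotonic() >= deadline:
            return False  # upper bound reached, let the caller decide how to fail
        time.sleep(interval)


def fdb_table_is_empty(get_count):
    # get_count is a hypothetical callable returning the dynamic MAC count.
    return get_count() == 0


# Fake counter: the table drains over successive polls.
remaining = [2, 1, 0]


def fake_count():
    return remaining.pop(0) if len(remaining) > 1 else remaining[0]


# A plain assert stands in for pytest_assert in this sketch.
assert wait_until(1, 0, fdb_table_is_empty, fake_count), "FDB cleanup timed out"
```

If the count never reaches 0, the helper returns `False` once the timeout expires and the assertion fails with a clear message instead of the test hanging.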

Fixed FDB cleanup race issue where the mac flush may have occurred at …

4c0d1f8

…wrong time when new mac learnt

gechiang requested review from vaibhavhd, bingwang-ms, lguohan, rlhui, daall and abdosi December 17, 2020 05:36

Make sure fdb_cleanup is done at beginning of each test and at the end…

3592040

… of the entire test

daall reviewed Dec 17, 2020

replaced the wait for mac clear logic with pytest_assert(wait_until()…

33397c9

…) for more code reuse

daall approved these changes Dec 17, 2020

bingwang-ms approved these changes Dec 18, 2020

gechiang merged commit 7b36652 into sonic-net:master Dec 18, 2020
