[copp - NoPolicyTest] RX performance issue of ptf_nn_agent.py #308

Closed
okanchou9 opened this issue Oct 20, 2017 · 11 comments

@okanchou9
Contributor

okanchou9 commented Oct 20, 2017

Hi,

I'm running the CoPP test on my box and found a large gap between the RX counter of the CPU and that of ptf_nn_agent.

After a test (e.g., DHCPTest), I read the CPU RX counter from /proc/bcm/knet/stats as follows:

root@switch2:/home/admin# cat /proc/bcm/knet/stats | grep "Rx0 packets"
Rx0 packets 100127 --> About 100K packets received by CPU

But the RX counter of ptf_nn_agent is only around 83K:

2017-10-20 01:27:45 : DHCPTest
2017-10-20 01:28:20 :
2017-10-20 01:28:20 : Counters before the test:
2017-10-20 01:28:20 : If counter (0, n): (87, 0)
2017-10-20 01:28:20 : NN counter (0, n): (66637, 500002)
2017-10-20 01:28:20 : If counter (1, n): (2, 0)
2017-10-20 01:28:20 : NN counter (1, n): (2, 0)
2017-10-20 01:28:20 :
2017-10-20 01:28:20 : Counters after the test:
2017-10-20 01:28:20 : If counter (0, n): (87, 100000)
2017-10-20 01:28:20 : NN counter (0, n): (66637, 600002)
2017-10-20 01:28:20 : If counter (1, n): (83074, 0)
2017-10-20 01:28:20 : NN counter (1, n): (83074, 0)
2017-10-20 01:28:20 :
2017-10-20 01:28:20 : Sent through NN to local ptf_nn_agent: 100000
2017-10-20 01:28:20 : Sent through If to remote ptf_nn_agent: 100000
2017-10-20 01:28:20 : Recv from If on remote ptf_nn_agent: 83072
2017-10-20 01:28:20 : Recv from NN on from remote ptf_nn_agent: 83072
2017-10-20 01:28:20 :
2017-10-20 01:28:20 : test stats
2017-10-20 01:28:20 : Packet sent = 100000
2017-10-20 01:28:20 : Packet rcvd = 83072
2017-10-20 01:28:20 : Test time = 0:00:23.654488
2017-10-20 01:28:20 : TX PPS = 4227
2017-10-20 01:28:20 : RX PPS = 3511
2017-10-20 01:28:20 :
2017-10-20 01:28:20 : Checking constraints (NoPolicy):
2017-10-20 01:28:20 : rx_pps (3511) > NO_POLICER_LIMIT (840): True
2017-10-20 01:28:20 : total_rcv_pkt_cnt (83072) > pkt_rx_limit (90000): False
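
The figures in the log above can be reproduced from the sent and received counts and the test time. The following is a minimal sketch, not the actual test code: the function name is hypothetical, the two limits are copied from the log, and truncating the PPS values to integers is an assumption inferred from the logged numbers.

```python
from datetime import timedelta

# Thresholds copied from the "Checking constraints (NoPolicy)" log lines.
NO_POLICER_LIMIT = 840   # minimum acceptable RX packets per second
PKT_RX_LIMIT = 90000     # minimum acceptable number of received packets


def check_no_policy(sent, rcvd, test_time):
    """Recompute the PPS figures and the two NoPolicy constraints.

    Hypothetical helper: integer truncation of PPS is an assumption
    that happens to match the values printed in the log.
    """
    seconds = test_time.total_seconds()
    tx_pps = int(sent / seconds)
    rx_pps = int(rcvd / seconds)
    return {
        "tx_pps": tx_pps,
        "rx_pps": rx_pps,
        "rate_ok": rx_pps > NO_POLICER_LIMIT,
        "count_ok": rcvd > PKT_RX_LIMIT,
    }


# Values from the failing DHCPTest run (test time 0:00:23.654488).
result = check_no_policy(100000, 83072,
                         timedelta(seconds=23, microseconds=654488))
print(result)
```

The rate constraint passes while the count constraint fails, which matches the log: the drop shows up in the total packet count, not in the per-second rate.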

After some investigation, I added a 1 µs delay after each packet sent:

for i in xrange(count):
    testutils.send_packet(self, send_intf, packet)
    time.sleep(1.0 / 1000000.0)  # request a 1-microsecond pause per packet

With the above script modification, the test passed when I reran it:

2017-10-20 01:45:19 : DHCPTest
2017-10-20 01:46:07 :
2017-10-20 01:46:07 : Counters before the test:
2017-10-20 01:46:07 : If counter (0, n): (11, 0)
2017-10-20 01:46:07 : NN counter (0, n): (66685, 1100002)
2017-10-20 01:46:07 : If counter (1, n): (23, 0)
2017-10-20 01:46:07 : NN counter (1, n): (567847, 0)
2017-10-20 01:46:07 :
2017-10-20 01:46:07 : Counters after the test:
2017-10-20 01:46:07 : If counter (0, n): (15, 100000)
2017-10-20 01:46:07 : NN counter (0, n): (66689, 1200002)
2017-10-20 01:46:07 : If counter (1, n): (98760, 0)
2017-10-20 01:46:07 : NN counter (1, n): (666584, 0)
2017-10-20 01:46:07 :
2017-10-20 01:46:07 : Sent through NN to local ptf_nn_agent: 100000
2017-10-20 01:46:07 : Sent through If to remote ptf_nn_agent: 100000
2017-10-20 01:46:07 : Recv from If on remote ptf_nn_agent: 98737
2017-10-20 01:46:07 : Recv from NN on from remote ptf_nn_agent: 98737
2017-10-20 01:46:07 :
2017-10-20 01:46:07 : test stats
2017-10-20 01:46:07 : Packet sent = 100000
2017-10-20 01:46:07 : Packet rcvd = 98733
2017-10-20 01:46:07 : Test time = 0:00:34.352164
2017-10-20 01:46:07 : TX PPS = 2911
2017-10-20 01:46:07 : RX PPS = 2874
2017-10-20 01:46:07 :
2017-10-20 01:46:07 : Checking constraints (NoPolicy):
2017-10-20 01:46:07 : rx_pps (2874) > NO_POLICER_LIMIT (840): True
2017-10-20 01:46:07 : total_rcv_pkt_cnt (98733) > pkt_rx_limit (90000): True
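
Comparing the two runs shows how much the delay actually slowed the sender. The sketch below uses only numbers from the two logs; the helper names are hypothetical, and it assumes the extra test time is attributable to the added time.sleep() call.

```python
# Figures copied from the "test stats" sections of the two logs.
runs = {
    "no delay":        {"sent": 100000, "rcvd": 83072, "seconds": 23.654488},
    "with time.sleep": {"sent": 100000, "rcvd": 98733, "seconds": 34.352164},
}


def loss_percent(sent, rcvd):
    """Share of sent packets that were never counted as received."""
    return 100.0 * (sent - rcvd) / sent


def per_packet_us(sent, seconds):
    """Average wall-clock time spent per sent packet, in microseconds."""
    return 1e6 * seconds / sent


for name, run in runs.items():
    print("%-16s loss=%.3f%%  per-packet=%.1f us" % (
        name,
        loss_percent(run["sent"], run["rcvd"]),
        per_packet_us(run["sent"], run["seconds"]),
    ))

# Extra time per packet introduced by the modification (assumption:
# the whole difference in test time comes from the sleep call).
extra_us = (per_packet_us(100000, 34.352164)
            - per_packet_us(100000, 23.654488))
print("extra per packet: %.1f us" % extra_us)
```

Under that assumption the requested 1 µs sleep costs roughly 107 µs per packet, so the modification passes the test mainly by lowering the send rate rather than by fixing the receive path.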

My testbed topology is as follows:

[testbed_server]---[fan-out switch]---[DUT]
 PTF_host_node                        remote_node
 172.20.200.202                       172.20.192.94

I bring up the remote node on the DUT by running ptf_nn_agent.py with the CLI "python ptf_nn_agent.py --device-socket 0@tcp://172.20.192.94:10900 -i 0-3@Ethernet12&".

I then run the CoPP test on the testbed server with the CLI 'ansible-playbook test_sonic.yml -i inventory --limit DUT --become --tags copp --extra-vars "ptf_host=172.20.200.202"'.

I'm not sure whether anyone else has hit the same or a similar situation. Please advise if you have any suggestions, thanks.


Regards,
Kenie Liu

@cytsai0409

ptf_nn_agent sends packets back from the DUT to the test server so that matched packets can be counted, and an insufficient socket buffer causes packet drops on the DUT.

This issue can be fixed by increasing the write socket buffer and the nanomsg socket buffers on the DUT.

To increase the write socket buffer, add the following line to /etc/sysctl.conf on the DUT:

  • net.core.wmem_max = 2097152

To increase the nanomsg socket buffers, add the options below to the ptf_nn_agent command on the DUT:

  • python ptf_nn_agent.py --device-socket 1@tcp://[DUT_MGMT_IP]:10900 -i 1-3@Ethernet12 --set-nn-rcv-buffer=10000000 --set-iface-rcv-buffer=10000000 --set-nn-snd-buffer=10000000 --set-iface-snd-buffer=10000000
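
A line added to /etc/sysctl.conf only takes effect once the settings are reloaded (for example with `sysctl -p`) or the system reboots, so it can help to confirm the live value before starting ptf_nn_agent. A minimal sketch, assuming a Linux DUT where sysctl values are exposed under /proc/sys; the helper name `read_sysctl` is hypothetical:

```python
import os

REQUIRED_WMEM_MAX = 2097152  # value suggested in the comment above


def read_sysctl(name, root="/proc/sys"):
    """Return the integer value of a sysctl such as 'net.core.wmem_max'.

    Hypothetical helper: maps dots to path separators under /proc/sys
    and returns None if the value cannot be read or parsed.
    """
    path = os.path.join(root, *name.split("."))
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return None


current = read_sysctl("net.core.wmem_max")
if current is None:
    print("net.core.wmem_max is not readable on this system")
elif current < REQUIRED_WMEM_MAX:
    print("net.core.wmem_max is %d, below %d" % (current, REQUIRED_WMEM_MAX))
else:
    print("net.core.wmem_max is %d, OK" % current)
```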

@okanchou9
Contributor Author

@cytsai0409

Thanks for the info!
I tried this fix and my test passed.

@cytsai0409

cytsai0409 commented Nov 8, 2017

Hi, @maggiemsft

Sorry to bother you, but we need your help here.

Do you have experience with the CoPP test, and did you encounter the packet loss issue as we did?
If so, did you increase the write buffer in the kernel and in Python, as we did, to pass the CoPP tests?
And what is the switch ASIC in your DUT? Broadcom Tomahawk?

Thanks.

@okanchou9 okanchou9 reopened this Nov 14, 2017
@pavel-shirshov
Contributor

Hi cytsai0409,

I have some experience with the CoPP test.
I encountered the packet loss issue with the test and solved it
by increasing SO_RCVBUF for both the AF_PACKET socket and the nn socket.
See the parameters in the p4lang/ptf@64b6b36#diff-8ecfc196f315309bc019fd91bdbeb02a patch. Try customizing the "--set-iface-rcv-buffer" and "--set-nn-rcv-buffer" parameters.
Also, please make sure the sysctl "net.core.rmem_max" is set to a larger value, because it limits the SO_RCVBUF value you can actually get:
https://github.com/Azure/sonic-buildimage/blob/f4e37a66f92099cc5d3bbde33823cd38b30a48a8/build_debian.sh#L284
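
The limit that net.core.rmem_max places on SO_RCVBUF can be observed directly by requesting a large buffer and reading back what the kernel granted. This is a minimal sketch, not ptf_nn_agent code: it assumes a Linux host, uses a plain UDP socket for illustration instead of an AF_PACKET socket (which needs root), and the function name is hypothetical.

```python
import socket


def request_rcvbuf(requested):
    """Ask for a receive buffer of `requested` bytes and return the
    size the kernel reports afterwards.

    Hypothetical helper. On Linux an oversized request is silently
    capped at net.core.rmem_max, and the reported value is doubled
    to account for bookkeeping overhead.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        sock.close()


granted = request_rcvbuf(109430400)
print("requested 109430400 bytes, kernel reports %d bytes" % granted)
```

If the reported size is far below the request, the sysctl limit is the likely cause, and raising the command-line buffer options alone will not help.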

I didn't work with the write buffer at all, because I don't see any reason to change it: the kernel reads packets much faster than the Python test sends them, so as far as I understand it's not a problem.

I tested the CoPP test with TD2 and Mellanox Spectrum ASICs.

Please let me know if you have more questions.
Also, please share the test output with me; I can usually deduce what's going on from the log.

Thanks

@cytsai0409

Hi, Pavel:

Thanks for your suggestions.
We will try it on our testbed and report the results later.

@cytsai0409

cytsai0409 commented Nov 20, 2017

Hi, Pavel:

It works when we set the read buffers with "--set-nn-rcv-buffer=109430400 --set-iface-rcv-buffer=109430400" and set net.core.rmem_max to 109430400 on our testbed.
We previously set the read buffers to 10000000. Apparently that is not enough for the CoPP test.

I have another question.
Is it acceptable to set net.core.rmem_max to more than 2097152?
The CoPP test failed when we set net.core.rmem_max to 2097152 (with --set-nn-rcv-buffer and --set-iface-rcv-buffer set to 109430400).

We are using the Broadcom Tomahawk ASIC.

Thanks for your help.

=====================================================
Here is the log when setting net.core.rmem_max to 2097152 (--set-nn-rcv-buffer and --set-iface-rcv-buffer are set to 109430400):

CoPP DHCP fail.log.txt

@pavel-shirshov
Contributor

Hi,

Sure, it's acceptable to set it to any value that helps your system pass the test. We need this parameter because Python is slow.
Feel free to use any rcvbuf value you want.

@cytsai0409

Hi, @pavel-shirshov

Regarding the read buffer limit "net.core.rmem_max": do we need to commit a change on GitHub to increase its value to 109430400, or should we just change its value in sysctl.conf when running the CoPP test?

Thanks.

@pavel-shirshov
Contributor

Hi Jason,

It's up to you. The best way is to patch that value locally in your base image, without a repo change. If you can't do that, you may create a PR to change the repo.

Thanks

@cytsai0409

Hi, Pavel:

Ok, we will try to patch it locally.
Thanks.

@okanchou9
Contributor Author

Closing this issue since a solution was found.

abdosi added a commit to abdosi/sonic-mgmt that referenced this issue Jul 7, 2020
100K+ packets to CPU
Ptf_nn_agent is not able to account for all the packets. Based on this
thread sonic-net#308
I have increased both the send and receive socket buffers on both the kernel and
socket side. The issue is seen on the Broadcom-based Dell-6000 platform.
abdosi added a commit that referenced this issue Jul 8, 2020
…1857)

100K+ packets to CPU
Ptf_nn_agent is not able to account for all the packets. Based on this
thread #308
I have increased both the send and receive socket buffers on both the kernel and
socket side. The issue is seen on the Broadcom-based Dell-6000 platform.