Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Recent DMA change ed4814f6239097dde5eecf2b4fbd58661db84dda stm32h7/serial: Do not wait on TXDMA semaphore breaks a lot of things. #22399

Closed
davids5 opened this issue Nov 16, 2023 · 24 comments

Comments

@davids5
Copy link
Member

davids5 commented Nov 16, 2023

Describe the bug

With stm32h7/serial: Do not wait on TXDMA semaphore in, the console will not complete transmitting the complete output until a character is input on console.

gps status

sh> gps status
INFO  [gps] Main GPS
INFO  [gps] protocol: UBX
INFO  [gps] status: OK, port: /dev/ttyS0, baudrate: 115200
INFO  [gps] sat info: disabled
INFO  [gps] rate reading:                  817 B/s
INFO  [gps] rate position:                5.00 Hz
INFO  [gps] rate velocity:                5.00 Hz
INFO  [gps] rate publication:             5.00 Hz
INFO  [gps] rate RTCM injection:          0.00 Hz
 sensor_gps
    timestamp: 402227138 (0.141411 seconds ago)

< hit a key >

    timestamp_sample: 0
    latitude_deg: 27.589204
    longitude_deg: -80.435646
    altitude_msl_m: 21.876000
    altitude_ellipsoid_m: -6.741000
    time_utc_usec: 1700158702599596
    device_id: 11010053 (Type: 0xA8, SERIAL:0 (0x00))
    s_variance_m_s: 0.91500
    c_variance_rad: 0.77875
    eph: 4.81900
    epv: 6.55500
    hdop: 1.67000
    vdop: 2.80000
    noise_per_ms: 98
    jamming_indicator: 47
    vel_m_s: 0.04700
    vel_n_m_s: 0.02800
    vel_e_m_s: -0.03800
    vel_d_m_s: 0.00200
    cog_rad: 1.72410
    timestamp_time_relative: 0
    heading: nan
    heading_offset: 0.00000
    heading_accuracy: 0.00000
    rtcm_injection_rate: 0.00000
    automatic_gain_control: 780
    fix_type: 3
    jamming_state: 0
    spoofing_state: 1
    vel_ned_valid: True
    satellites_used: 8
    selected_rtcm_instance: 0
    rtcm_crc_failed: False
    rtcm_msg_used: 0

To Reproduce

see above

Expected behavior

not extra input needed

Screenshot / Media

No response

Flight Log

NA

Software Version

main with nuttx that has ed4814f6239097dde5eecf2b4fbd58661db84dda

Flight controller

all H7 with TX DMA

Vehicle type

None

How are the different components wired up (including port information)

No response

Additional context

No response

@davids5 davids5 changed the title [Bug] Recent DMA change ed4814f6239097dde5eecf2b4fbd58661db84dda stm32h7/serial: Do not wait on TXDMA semaphore break alot. [Bug] Recent DMA change ed4814f6239097dde5eecf2b4fbd58661db84dda stm32h7/serial: Do not wait on TXDMA semaphore breaks a lot of things. Nov 16, 2023
@davids5
Copy link
Member Author

davids5 commented Nov 16, 2023

I have have also seen where gps driver on ports that have TXDMA will not work. (v6xrt)

@AlexKlimaj
Copy link
Member

I am seeing the same thing on the ARKV6X.

@PetervdPerk-NXP
Copy link
Member

I think the TXDMA semaphore change is wrong and should be reverted the way it was.
The write() function by posix definition can block indefinitely.
https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

The issue #22302 tried to fix was not a problem in the NuttX kernel.
But a problem in the mavlink application itself since it opens the uart in blocking mode.

_uart_fd = ::open(uart_name, O_RDWR | O_NOCTTY);

My proposed solution is that mavlink open the uart using O_NONBLOCK instead. And should handles the EAGAIN return code.

@julianoes
Copy link
Contributor

julianoes commented Nov 16, 2023

Can someone point me to a commit where things are still sane?

Edit: looks like main is ok again?

@niklaut
Copy link
Contributor

niklaut commented Nov 17, 2023

But then should the implementation not be:

  1. UART configured as blocking -> nxsem_wait.
  2. UART configured as non-blocking -> nxsem_trywait.

@PetervdPerk-NXP
Copy link
Member

PetervdPerk-NXP commented Nov 17, 2023

up_dma_txavailable should get a return value and the NuttX serial upper driver should be extended as well to feedback EAGAIN to userspace.
https://github.com/PX4/NuttX/blob/e7da5ac0e660238cb353948b2a9118579727267a/drivers/serial/serial.c#L293
https://github.com/PX4/NuttX/blob/e7da5ac0e660238cb353948b2a9118579727267a/drivers/serial/serial.c#L494
https://github.com/PX4/NuttX/blob/e7da5ac0e660238cb353948b2a9118579727267a/drivers/serial/serial.c#L1261

Edit: Maybe we don't even need EAGAIN but I'm afraid we lose data if we don't. Need some more investigation.

@davids5
Copy link
Member Author

davids5 commented Nov 17, 2023

I need to review the sequence for the Transmit.

@davids5
Copy link
Member Author

davids5 commented Nov 21, 2023

I have a nuttx branch with

The stm32h7 fix: PX4/NuttX@c696d5e
The imx: fix: PX4/NuttX@2de0013 PX4/NuttX@52c5524 PX4/NuttX@8cae36b

The console requiring a second key press is fixed. serial_test runs at 3Mbps and with PX4/PX4-GPSDrivers#143 the imxrt can use TX DMA again.

@niklaut can you pull in the stm32h7 and test the original issue with mavlink?

@beniaminopozzan
Copy link
Member

As @dakejahl mentioned, this issue is also heavly affecting the uxrce_dds_client.
After #22302 the RTT of the xrce-dds communication with a companion computer increased from 1.5ms to 20ms (testing on a Raspberry Pi 4, serial hardware on the raspberry (serial5), CUAV pixhawk 6x, TELEM1 used on the CUAV at 921600b/s).
20ms of RTT makes the time synchronization impossible.

@davids5 , I tested your nuttx branch and it definitively improves but it does not remove entirely the issue as the RTT is now around 6ms.

@davids5
Copy link
Member Author

davids5 commented Nov 22, 2023

As @dakejahl mentioned, this issue is also heavly affecting the uxrce_dds_client. After #22302 the RTT of the xrce-dds communication with a companion computer increased from 1.5ms to 20ms (testing on a Raspberry Pi 4, serial hardware on the raspberry (serial5), CUAV pixhawk 6x, TELEM1 used on the CUAV at 921600b/s). 20ms of RTT makes the time synchronization impossible.

@davids5 , I tested your nuttx branch and it definitively improves but it does not remove entirely the issue as the RTT is now around 6ms.

@beniaminopozzan am I following that if you just revert the H7 commit that created the problem you get some timing that 1.5ms. With that commit it goes to to 20 and then with px4_firmware_nuttx-10.3.0%2B-serial-fixes-testing it is 6ms?

@beniaminopozzan
Copy link
Member

As @dakejahl mentioned, this issue is also heavly affecting the uxrce_dds_client. After #22302 the RTT of the xrce-dds communication with a companion computer increased from 1.5ms to 20ms (testing on a Raspberry Pi 4, serial hardware on the raspberry (serial5), CUAV pixhawk 6x, TELEM1 used on the CUAV at 921600b/s). 20ms of RTT makes the time synchronization impossible.
@davids5 , I tested your nuttx branch and it definitively improves but it does not remove entirely the issue as the RTT is now around 6ms.

@beniaminopozzan am I following that if you just revert the H7 commit that created the problem you get some timing that 1.5ms. With that commit it goes to to 20 and then with px4_firmware_nuttx-10.3.0%2B-serial-fixes-testing it is 6ms?

@davids5 , that's correct!

  • Current PX4 main: RTT = 20ms
  • Current PX4 main with this commit reverted in Nuttx PX4/NuttX@ed4814f : RTT = 1.5ms (as it should)
  • Current PX4 main with Nuttx on your branch (PX4/NuttX@8cae36b): RTT = 6ms

@davids5
Copy link
Member Author

davids5 commented Nov 22, 2023

@beniaminopozzan can you describe the setup (wiring and commands) so I can reproduce the issue?

@beniaminopozzan
Copy link
Member

@davids5 Of course

Raspberry Pi 4 Setup

PX4 setup

  • I'm using a CUAV Pixhawk v6x
  • The relevant (non firmware default) parameters for uXRCE-DDS are
Parameter value
UXRCE_DDS_CFG TELEM 1
MAV_0_CONFIG Disabled
SER_TEL1_BAUD 921600

Hardware connections

The CUAV TELEM1 is directly connected to uart4 of the Raspberry. The Rasberry RX4 is on PIN21 while TX4 is on PIN24.

Testing procedure

  1. Power on the CUAV
  2. Power on the Raspberry and start the microXRCE-DDS agent. On my setup the command is MicroXRCEAgent serial --dev /dev/ttyAMA1 -b 921600
  3. On a new terminal, echo the /fmu/out/timesync_status ROS2 topic and look for the round_trip_time field. It's value is in microseconds.

Altenative setup

Potentially, you can use any usb to serial adapter so that the raspberry is not neeeded. However I always found the quite unreliable.

@niklaut
Copy link
Contributor

niklaut commented Nov 24, 2023

@beniaminopozzan You're right, there are differences in timing even on an unconnected TELEM1: PX4/NuttX#285 (comment).

@DronecodeBot
Copy link

This issue has been mentioned on Discussion Forum for PX4, Pixhawk, QGroundControl, MAVSDK, MAVLink. There might be relevant details there:

https://discuss.px4.io/t/gps2-on-cube-orange-mini-mapping-possibly-incorrect/35512/6

@beniaminopozzan
Copy link
Member

@davids5 any news on this?

@AlexKlimaj
Copy link
Member

Looks like the two commits still haven't fixed this. Run it long enough and TX gets stuck.

@AlexKlimaj
Copy link
Member

AlexKlimaj commented Dec 24, 2023

I thought removing O_NONBLOCK from the dds client driver fixed it but it just looks like the whole driver locks up when the serial port freezes.

With O_NONBLOCK, the driver thinks its sending data but nothing gets received by the agent. RX into the client is still fully working.

Without O_NONBLOCK, the entire driver locks up.

@davids5
Copy link
Member Author

davids5 commented Dec 24, 2023

I thought removing O_NONBLOCK from the dds client driver fixed it but it just looks like the whole driver locks up when the serial port freezes.

With O_NONBLOCK, the driver thinks its sending data but nothing gets received by the agent. RX into the client is still fully working.

Without O_NONBLOCK, the entire driver locks up.

How many threads is it? Maye you can walk me through that you have learned from uxrce_dds_client/uxrce_dds_client.cpp.

@davids5
Copy link
Member Author

davids5 commented Dec 24, 2023

I thought removing O_NONBLOCK from the dds client driver fixed it but it just looks like the whole driver locks up when the serial port freezes.
With O_NONBLOCK, the driver thinks its sending data but nothing gets received by the agent. RX into the client is still fully working.
Without O_NONBLOCK, the entire driver locks up.

How many threads is it? Maye you can walk me through that you have learned from uxrce_dds_client/uxrce_dds_client.cpp.

@AlexKlimaj can you scope TX, RX and the RTS, CTS lines at the point of failure and post it here?

@AlexKlimaj
Copy link
Member

I'm testing using mavlink now instead of DDS. Running into the same disconnection problem. It will run for a while, then lock up.

@AlexKlimaj
Copy link
Member

I can scope next week but I just tested current main with telem 2 on mavlink and it barely runs for 10 minutes without disconnecting. Main with the dma commits reverted has been running for 4 hours

@davids5
Copy link
Member Author

davids5 commented Jan 24, 2024

@beniaminopozzan The serial DMA issue was fixed with #22667. Please test current master. If this is still and issue please reopen this or post a new issues with what you are seeing.

@davids5 davids5 closed this as completed Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants