Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Heightmap integration test #785

Merged
merged 28 commits into from
Dec 28, 2022

Conversation

darksylinc
Copy link
Contributor

@darksylinc darksylinc commented Dec 10, 2022

🦟 Bug fix

Almost all issues in #546 are covered.

Only Thermal Camera is not covered.

Fixes #332 (again) for macOS and possibly any OS/HW/Driver combination.
Fixes #369

Summary

This PR adds a heightmap test.

This test has uncovered a lot of bugs recently.

Note: This PR currently includes part of #784 because it otherwise will fail unit tests.

Checklist

  • Signed all commits for DCO
  • Added tests
  • Updated documentation (as needed)
  • Updated migration guide (as needed)
  • Consider updating Python bindings (if the library has them)
  • codecheck passed (See contributing)
  • All tests passed (See test coverage)
  • While waiting for a review on your PR, please help review another open pull request to support the maintainers

Note to maintainers: Remember to use Squash-Merge and edit the commit message to match the pull request summary while retaining Signed-off-by messages.

Use proper sRGB conversion in DepthCamera

Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
OgreNext has CompositorPassSceneDef::setVisibilityMask to prevent
overwriting reserved flags.

If user calls gz::Camera::SetVisibilityMask with the reserved flags set,
wrong rendering could happen.

Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
@darksylinc darksylinc requested a review from iche033 as a code owner December 10, 2022 22:40
@github-actions github-actions bot added the 🌱 garden Ignition Garden label Dec 10, 2022
@darksylinc
Copy link
Contributor Author

darksylinc commented Dec 10, 2022

Oooofff.

CI has a lot of failures:

  • Ubuntu CIs from Github Action failed everywhere. Not expected.
    • Bizarrely, total number of wrong pixels is 0.42%; but doesn't sound reasonable.
    • The sky not being constrained to the top 1/4th of the image is puzzling.
  • Homebew macOS fails with ogre1. Although unexpected, this can be ignored. OGL drivers on macOS are bad.
  • Homebew macOS on ogre2 has a few unexpected failures. Look similar to Ubuntu/GH Actions
    • Bizarrely, total number of wrong pixels is 0.25%; which sounds reasonable?
    • But the problem is how wrong those pixels were. But it may be explained by an "off by 1" bug (e.g. if terrain is off by 1 pixel between reference & depth pass; this would be explained). These differences can happen due to how shader compiler optimizes floating point calculations for vertex position. I'm not sold on this being the reason though.
    • The sky not being constrained to the top 1/4th of the image is puzzling.
  • Homebew macOS found a legit bug. GpuRays + Heightmap + ogre2 + metal will result in a shader compiler error

Update: It seems a lot of these errors actually belong to ogre1 as I can repro them locally. I'd swear I tested ogre1 and it was passing. Maybe I accidentally tested ogre2 instead? Or maybe a regression? (when I rebased gz-rendering7)

Fix memory leak

Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
@darksylinc
Copy link
Contributor Author

This test is a gift that keeps on giving. There is a memory corruption in GpuRays:

if (this->dataPtr->firstPassTextures[i])
    {
      ogreRoot->getRenderSystem()->getTextureGpuManager()->destroyTexture(
         this->dataPtr->firstPassTextures[i]);
      this->dataPtr->firstPassTextures[i] = nullptr;
    }
    if (this->dataPtr->ogreCompositorWorkspace1st[i])
    {
      ogreCompMgr->removeWorkspace(
          this->dataPtr->ogreCompositorWorkspace1st[i]);
      this->dataPtr->ogreCompositorWorkspace1st[i] = nullptr;
    }

The order of destruction is inverted.

Order of destruction was inverted

Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
So tell clang tidy to ignore that.

Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
Signed-off-by: Matias N. Goldberg <[email protected]>
@darksylinc
Copy link
Contributor Author

darksylinc commented Dec 11, 2022

OK new round of fixes:

  • Focal CI keeps crashing with a memory corruption error. That makes no sense. I have Focal on my system. I ran valgrind. Found a memory corruption. Fixed it. The problem remains but Valgrind now says everything is ok (except for a few leaks). I have no idea what is causing it.
    • The only thing that comes to mind is that the Focal CI is using an extremely outdated version of gz-common (ancient! would it even compile?) as there was a corruption going on with common::Image::SwapRedBlue
    • ARRGGH!!! I tried pushing ASAN into the CI to get some callstack and all I got was:
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.

Useless!

  • I added code to dump to Console the output of the CI in base64 form for later inspection. I needed this to test macOS.
    • Btw this functionality should be extended to all tests. It's extremely handy!!!
  • macOS fails the test but interestingly the problem is not in the terrain but rather in the shadowed side of the cube:

Reference (according to macOS CI):
Original

RGB: 0 73 0

Depth Camera (according to macOS CI):
DepthRgbData

RGB: 0 0 0

There is only two possibilities:

  • An OgreNext/Gazebo bug where the DepthCamera does not properly set the ambient lighting but only affects Metal or macOS.
  • That bug again where encoding RGBA8888 as a floating point ends up being flushed to 0.

Signed-off-by: Matias N. Goldberg <[email protected]>
@darksylinc darksylinc force-pushed the matias-unitTestHeightmap branch from 7ada41b to e70fb25 Compare December 11, 2022 22:27
@iche033
Copy link
Contributor

iche033 commented Dec 20, 2022

I ran the INTEGRATION_heightmap test on Jammy and it passed without issue. I then ran it on Focal inside docker and it crashed. If it helps, here's the backtrace:

backtrace
[Wrn] [Ogre2Heightmap.cc:118] Heightmap final sampling should be 2^n
 which differs from ogre1's 2^n+1
The last row and column will be cropped
size = (width * sampling) - sampling + 1
[257] = ([129] * [2]) - [2] + 1
[Msg] Loading heightmap: example_bowl
[Msg] Heightmap loaded. Process took 0 ms.
INTEGRATION_heightmap: /home/jenkins/workspace/ogre-2.3-debbuilder/repo/OgreMain/include/Math/Array/SSE2/Single/OgreMathlibSSE2.h:136: static Ogre::ArrayReal Ogre::MathlibSSE2::Cmov4(Ogre::ArrayReal, Ogre::ArrayReal, Ogre::ArrayMaskR): Assertion `_mm_movemask_ps( _mm_cmpeq_ps( arg1, arg1 ) ) == 0x0f && _mm_movemask_ps( _mm_cmpeq_ps( arg2, arg2 ) ) == 0x0f && "Passing NaN values to CMov4"' failed.

Thread 1 "INTEGRATION_hei" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fe99195c859 in __GI_abort () at abort.c:79
#2  0x00007fe99195c729 in __assert_fail_base (fmt=0x7fe991af2588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=0x7fe98792fc20 "_mm_movemask_ps( _mm_cmpeq_ps( arg1, arg1 ) ) == 0x0f && _mm_movemask_ps( _mm_cmpeq_ps( arg2, arg2 ) ) == 0x0f && \"Passing NaN values to CMov4\"", 
    file=0x7fe98792fbb0 "/home/jenkins/workspace/ogre-2.3-debbuilder/repo/OgreMain/include/Math/Array/SSE2/Single/OgreMathlibSSE2.h", line=136, function=<optimized out>) at assert.c:92
#3  0x00007fe99196dfd6 in __GI___assert_fail (assertion=0x7fe98792fc20 "_mm_movemask_ps( _mm_cmpeq_ps( arg1, arg1 ) ) == 0x0f && _mm_movemask_ps( _mm_cmpeq_ps( arg2, arg2 ) ) == 0x0f && \"Passing NaN values to CMov4\"", 
    file=0x7fe98792fbb0 "/home/jenkins/workspace/ogre-2.3-debbuilder/repo/OgreMain/include/Math/Array/SSE2/Single/OgreMathlibSSE2.h", line=136, 
    function=0x7fe98792fb48 "static Ogre::ArrayReal Ogre::MathlibSSE2::Cmov4(Ogre::ArrayReal, Ogre::ArrayReal, Ogre::ArrayMaskR)") at assert.c:101
#4  0x00007fe9877185b4 in Ogre::Node::updateFromParentImpl() () from /usr/lib/x86_64-linux-gnu/OGRE-2.3/libOgreNextMain.so.2.3.1
#5  0x00007fe987718cbf in Ogre::Node::_updateFromParent() () from /usr/lib/x86_64-linux-gnu/OGRE-2.3/libOgreNextMain.so.2.3.1
#6  0x00007fe987718cb6 in Ogre::Node::_updateFromParent() () from /usr/lib/x86_64-linux-gnu/OGRE-2.3/libOgreNextMain.so.2.3.1
#7  0x00007fe987719d7d in Ogre::Node::_getDerivedOrientationUpdated() () from /usr/lib/x86_64-linux-gnu/OGRE-2.3/libOgreNextMain.so.2.3.1
#8  0x00007fe9875c86fe in Ogre::Camera::isViewOutOfDate() const () from /usr/lib/x86_64-linux-gnu/OGRE-2.3/libOgreNextMain.so.2.3.1
#9  0x00007fe98760ea11 in Ogre::Frustum::updateView() const () from /usr/lib/x86_64-linux-gnu/OGRE-2.3/libOgreNextMain.so.2.3.1
#10 0x00007fe9875caa71 in Ogre::Camera::getDerivedPosition() const () from /usr/lib/x86_64-linux-gnu/OGRE-2.3/libOgreNextMain.so.2.3.1
#11 0x00007fe987d76968 in Ogre::Terra::update (this=this@entry=0x55688e33d3e0, lightDir=..., lightEpsilon=lightEpsilon@entry=9.99999997e-07) at /home/user/code/gz_g_ws/src/gz-rendering/ogre2/src/terrain/Terra/src/Terra.cpp:597
#12 0x00007fe987ce04a8 in gz::rendering::v7::Ogre2Heightmap::UpdateForRender (this=this@entry=0x55688e35f940, _activeCamera=_activeCamera@entry=0x55688e2022d0)
    at /home/user/code/gz_g_ws/src/gz-rendering/ogre2/src/Ogre2Heightmap.cc:424
#13 0x00007fe987d49f38 in gz::rendering::v7::Ogre2Scene::UpdateAllHeightmaps (this=0x55688dd0c730, _camera=0x55688e2022d0) at /home/user/code/gz_g_ws/src/gz-rendering/ogre2/src/Ogre2Scene.cc:497
#14 0x00007fe987d4a247 in gz::rendering::v7::Ogre2Scene::StartRendering (this=0x55688dd0c730, _camera=<optimized out>) at /home/user/code/gz_g_ws/src/gz-rendering/ogre2/src/Ogre2Scene.cc:288
#15 0x00007fe987d41ee5 in gz::rendering::v7::Ogre2RenderTarget::Render (this=0x55688e1753d8) at /usr/include/c++/9/bits/shared_ptr_base.h:1020
#16 0x00007fe987ca5a6e in gz::rendering::v7::BaseCamera<gz::rendering::v7::Ogre2Sensor>::Update (this=0x55688e185700) at /usr/include/c++/9/ext/atomicity.h:82
#17 0x00007fe987ca306a in gz::rendering::v7::BaseCamera<gz::rendering::v7::Ogre2Sensor>::Capture (this=0x55688e185700, _image=...) at /home/user/code/gz_g_ws/src/gz-rendering/include/gz/rendering/base/BaseCamera.hh:431
#18 0x000055688c7cb5d3 in HeightmapTest_Heightmap_Test::TestBody (this=0x55688d67cfe0) at /usr/include/c++/9/bits/shared_ptr_base.h:1020
#19 0x000055688c80fee1 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x55688c812e31 "the test body", method=<optimized out>, object=0x55688d67cfe0)
    at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:2580
#20 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x55688d67cfe0, method=<optimized out>, location=location@entry=0x55688c812e31 "the test body")
    at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:2635
#21 0x000055688c801c26 in testing::Test::Run (this=this@entry=0x55688d67cfe0) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:2674
#22 0x000055688c801eb2 in testing::Test::Run (this=0x55688d67cfe0) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:2853
#23 testing::TestInfo::Run (this=0x55688d67cb00) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:2853
#24 0x000055688c80203a in testing::TestSuite::Run (this=0x55688d644f80) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:3012
#25 0x000055688c8025e5 in testing::TestSuite::Run (this=<optimized out>) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:2986
#26 testing::internal::UnitTestImpl::RunAllTests (this=0x55688d659000) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:5870
#27 0x000055688c810451 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x55688c8148c8 "auxiliary test code (environments or event listeners)", method=<optimized out>, 
    object=0x55688d659000) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:2580
#28 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x55688d659000, method=<optimized out>, 
    location=location@entry=0x55688c8148c8 "auxiliary test code (environments or event listeners)") at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest.cc:2635
#29 0x000055688c802b30 in testing::UnitTest::Run (this=0x55688c82c8a0 <testing::UnitTest::GetInstance()::instance>) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/include/gtest/gtest.h:1246
#30 0x000055688c7c9a04 in RUN_ALL_TESTS () at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/include/gtest/gtest.h:2293
#31 main (argc=<optimized out>, argv=0x7ffe45cc7908) at /home/user/code/gz_g_ws/src/gz-rendering/test/gtest_vendor/src/gtest_main.cc:51

I found that if I comment out the directional light in the test, then it passes.

@darksylinc
Copy link
Contributor Author

darksylinc commented Dec 21, 2022

Mmmm this still reeks of some weird corruption going on.

The assert is detecting that a position/direction is NaN and then complaining.

But this shouldn't be happening based on how it runs on my machine and jammy. That is what really throws me off.

Perhaps I should try using the *.SO built by gazebo (since I see the crash comes from /usr/lib/x86_64-linux-gnu/OGRE-2.3/libOgreNextMain.so.2.3.1). Perhaps the problem is in how the library was compiled.

I've been using my own OgreNext 2.3 to run the samples.

Also, if you are able to run Valgrind on it, that would be of great help.

I then ran it on Focal inside docker

Is it complex to launch a docker instance w/ gazebo to test it on?

@iche033
Copy link
Contributor

iche033 commented Dec 21, 2022

ah ok I think I found the right combination for the crash: Focal + llvmpipe. The docker container I was on was doing software rendering. When I tried with Focal + nvidia docker, the test passed. The test also passes on Jammy with llvmpipe so it's just Focal.

So to reproduce, I think you just need to force software rendering:

Xvfb :2 -ac -noreset -core -screen 0 1280x1024x24 &
DISPLAY=:2.0 GZ_ENGINE_TO_TEST=ogre2 ./build/gz-rendering7/bin/INTEGRATION_heightmap

We have a few tests disabled because it does not work well on CI due to the same reason, e.g. here. So if it's not trivial, we should probably just disable this on CI.

If you're still interested in valgrind output, here's the gist:
https://gist.github.com/iche033/76cf86e4043ecac0799f732f452f9eb3

@darksylinc
Copy link
Contributor Author

darksylinc commented Dec 22, 2022

Out of all problems, it was an llvmpipe crash? I did not see that one coming.

That explains it, I am using Kisak PPA which provides more recent mesa drivers, which includes llvmpipe.

Either llvmpipe is buggy or it did not provide a feature we rely on.

This is a major relief because it means there is no corruption bug in our part.

We just need to disable the test for that CI (and see if we can detect why it crashes, perhaps it's just unsupported).

As for the macOS check; I'm torn apart:

  1. The test is working as intended (it detects an error)
  2. One solution is to skip that particular test on macOS
  3. Another solution is to intentionally increase the threshold that is failing, while leaving the other thresholds intact
    • And place a big fat warning around this increase that is only a TODO
  4. Ideally another test should be made that detects this problem in particular.
    • Now that Gazebo has migrated to OgreNext >= 2.2; likely the best solution would be to use RGBA32_UINT ( instead of RGBA32_FLOAT) and instead of packing an RGBA8888 into a float in the A channel, the shader that writes to it reinterprets the RGB channels from uint to float (back and forth) as needed. This solution won't work if blending is required though (I don't think it is though, since the error happens in the last pass where we write to the RGBA32_FLOAT target using a single "combine" fullscreen pass)

I'm leaning towards implementing solution in pt 3 (increase thresholds) because at least it covers some errors

As for the solution in pt 4, I won't be around to help you guys implement that, BUT it shouldn't be too difficult to implement.

@iche033
Copy link
Contributor

iche033 commented Dec 22, 2022

Going with pt 3 with a todo note sounds good given the limited time we have. We'll take care of pt 4 later in time. Thanks!

@codecov
Copy link

codecov bot commented Dec 24, 2022

Codecov Report

Merging #785 (befc848) into gz-rendering7 (bf15100) will increase coverage by 0.09%.
The diff coverage is n/a.

@@                Coverage Diff                @@
##           gz-rendering7     #785      +/-   ##
=================================================
+ Coverage          76.10%   76.19%   +0.09%     
=================================================
  Files                164      164              
  Lines              14393    14393              
=================================================
+ Hits               10954    10967      +13     
+ Misses              3439     3426      -13     
Impacted Files Coverage Δ
ogre2/src/Ogre2DepthCamera.cc 88.17% <ø> (ø)
ogre2/src/Ogre2GpuRays.cc 93.64% <0.00%> (+2.23%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@darksylinc
Copy link
Contributor Author

We have a few tests disabled because it does not work well on CI due to the same reason, e.g. here. So if it's not trivial, we should probably just disable this on CI.

Thanks. It's not just enough to do that, since we only want to block Focal, not Jammy.

I added IsUbuntuFocal to check for that.
Technically we shouldn't be checking for Focal, but rather Mesa versions. But this is good enough.

The only shame is that Focal CI uploads the coverage results, and since Focal is skipped, Codecov won't pick up the coverage results.

Given the other checks via "MESA_GL_VERSION_OVERRIDE" appear to be on GpuRays as well, it sounds like it's the same problem.

It also appears on the GI test, and seems to be a similar problem (llvmpipe doesn't support what we need yet, or crashes)

@darksylinc
Copy link
Contributor Author

darksylinc commented Dec 24, 2022

CI is done running. All checks passed :)

(macOS has relaxed thresholds and we skip the test on Focal CI)

@darksylinc
Copy link
Contributor Author

Whoa!

So I wrote down the instructions for somebody else to pick up and implement pt 4 (use RGBA32_UINT instead of RGBA32_FLOAT).

But my instructions were so detailed I realized it was quick and easy to do so... I just did it. AND IT WORKS!

The problem is forever solved now.

This ticket now closes bug #332 on macOS (and possibly other Linux/Windows driver/HW) for good.
I've enabled the tests in depth_camera.cc that were disabled for macOS.

I apologize for fixing a (mostly unrelated) bug in this PR rather than creating another one. I just won't be around long enough to do that.

Copy link
Contributor

@iche033 iche033 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me! Thanks for fixing #332 as well. I think it'd have taking us a bit of time to do!

@iche033 iche033 merged commit 4de8d1e into gazebosim:gz-rendering7 Dec 28, 2022
iche033 pushed a commit that referenced this pull request Dec 28, 2022
Fixes #332 (again) for macOS and possibly any OS/HW/Driver combination.
Fixes #369

For a very long time macOS (and possibly other OS/HW/Driver combination) would fail the tests because the shader code was doing (simplified, in C to convey the explanation):

```
uint32_t valueToStore = r << 24 | g << 16 | b << 8 | a;
float output;
output = (float*)&valueToStore;
return output;
```

In ticket #332 I made a deep analysis and found out that given the wrong conditions, the bit pattern would result in either a denormal or a NaN in floating point; which risks being flushed to 0 (or set all types of NaN to the same type bit pattern of NaN).

Although the ticket was closed and tagged as "fixed", on some OS/HW/Driver combinations (e.g. macOS + Metal) this error would reappear.

This isn't just a mere test failure. This is an actual bug. Running DepthCamera in macOS could result in wrong readings.

This PR fixes the problem for good by using RGBA32_UINT instead of RGBA32_FLOAT; and having uints be reinterpreted to floats and back when needed; which is well-defined and much more stable as far as GPU shader compilers go.

The same fix is also part of PR #785 which targets Garden and includes other changes.

This PR is a backport specific for Fortress.

Signed-off-by: Matias N. Goldberg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🌱 garden Ignition Garden
Projects
Archived in project
2 participants