Use Manifold for much faster & multithreaded CSG & minkowski operations #4533
Conversation
Super exciting, @ochafik! Don't hesitate to open issues on Manifold if you find any. I'm about to push a new release, but soon I'd like to integrate Clipper lib for a proper 2D subsystem. Curious if that would be useful for you? |
Thanks @elalish! So far it looks remarkably stable (and fast), amazing work! I've seen some weird corrupted results when I dramatically increased the number of vertices. Not sure yet if it's a manifold or OpenSCAD thing, will circle back on that. Also, I can't seem to get much multithreading going on w/ TBB, not sure I'm doing it right / for which ops it's meant to kick in (seemingly getting 1.5 cores used at most). Re/ Clipper, think it's already used in OpenSCAD but I'm unfamiliar w/ its 2D stack. I'd be surprised if its performance were a bottleneck for most users, though. Things that would be great to have in Manifold on the 3D front, although they could be quite expensive: quickhull, minkowski (OpenSCAD has its own optimized implementation, which relies on CGAL's Nef to decompose bodies into convex parts). |
Yeah, I've never done much with the OMP/TBB backends of Thrust - considering it's an Nvidia library, my guess is they don't get a lot of optimization love. I would also guess our parallelization is a bit too fine-grained for multi-CPU to help that much. I never prioritized convex hull or minkowski since I've never actually used them in my OpenSCAD designs. I'd guess Hull wouldn't be too hard with a sweep-plane - PRs welcome! |
For multithreading, can you try building manifold with python and run … which is using a lot of cores, so I think thrust is working fine for this, perhaps it is an OS/build flag/TBB version thing? |
Ah yes, managing 2.5 cores with another model (condensed matter from #3641 (comment)), guess it will depend on the model tree shape. Getting the same utilisation figures as you with Manifold’s perfTest, so flags are fine. I’m working on separate PRs for tree transforms to alter said shape anyway, and on octree scene partitioning to max out parallelism (probably in a multi-process setup for starters). Edit: seeing peak core utilization of 9 cores (avg. 4) 🎉 w/ a model such as … |
Yes, we have a very simple tree transform in https://github.com/elalish/manifold/blob/master/src/manifold/src/csg_tree.cpp; there is some parallelization going on but it is limited, as I'm not that good at math. Feel free to open PRs to add octree partitioning for speedup! |
Just tested your branch and managed to compile. I have one polyhedron with points and faces in my source. It displays in default OpenSCAD, but not within your branch. Is it expected to display already, or will that come eventually? |
@gsohler thanks for trying it out! That's unexpected, could you share a model that exhibits the issue? Any special build flags beyond what I've listed above? And did you check that there are no other experimental features enabled? You could also try the rendering in the command line (w/ … |
Hi Oliver, thank you for your attention. points=[…] faces=[…] union() … |
@gsohler ah great, thanks for sharing! If you inspect the console output (easier in the command line) you’ll see the warning … I’ll see if it’s worth hacking up some mesh-repairing attempt for when manifold chokes on inputs. CGAL has some helpers for that, although it won't be perfect (see this question on SO for instance: PMP::self_intersections + aggressive CGAL::Euler::remove_face + hole filling) |
@elalish, can manifold be used to fix self-intersecting objects (maybe by disabling the optimization that assumes edges never cut faces of the same object)? |
Hi, |
Hi, … Can anyone tell me if this is expected or does it suggest I'm missing something somewhere? |
@mconsidine thanks for reporting back on that, I'd missed that warning. Turns out these macros are already defined by Manifold's cmake build files; I got confused with Thrust's instructions / cmake variables that don't apply to the way Manifold includes it. Hopefully you should see some multicore utilization regardless, depending on the model. |
EDIT: Looks like my best route is to limit the number of processors being used. Once it compiles, a "render" in openscad shows all 32 cores being engaged. Nice!

Fyi, compilation of a fresh clone of the branch brings a 32-core, 32 GB RAM Linux Mint 21 box to its knees. It happens reliably at the 80% point, "Building CXX object CMakeFiles/OpenSCAD.dir/src/glview/cgal/CGALRenderer.cc.o". If I reboot and don't clean the directory it appears to pick up where it left off. One thing I can try is to increase swap memory. But that's already at 4G, and since it and RAM are maxed at this point, I'm suspecting I've got some flag set incorrectly or some configuration setting that is a problem. Any advice? |
@mconsidine re/ build issue: you can be more gentle on your machine by reducing the build parallelism. And re/ utilization of all 32 cores during a render, that's good to read (I'm curious what kind of model you got this with, feel free to share!). There should be ways to limit the max parallelism so as not to hog the entire system (e.g. tbb::global_control(tbb::global_control::max_allowed_parallelism)), I'll look into it. |
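For anyone wanting to experiment with such a cap themselves, here's a minimal standalone sketch of the tbb::global_control mechanism mentioned above (the limit of 6 threads is an arbitrary example, not anything OpenSCAD or Manifold does today):

```cpp
#include <tbb/global_control.h>
#include <tbb/parallel_for.h>

#include <cstddef>
#include <vector>

int main() {
    // Arbitrary example cap (an assumption, not an OpenSCAD default): let TBB
    // use at most 6 worker threads process-wide while `cap` is alive.
    tbb::global_control cap(tbb::global_control::max_allowed_parallelism, 6);

    std::vector<double> xs(1'000'000, 1.0);
    // Any TBB-parallelized work started below respects the cap.
    tbb::parallel_for(std::size_t{0}, xs.size(), [&](std::size_t i) {
        xs[i] *= 2.0;
    });
    return 0;
}
```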
I'll get some numbers to you in the next couple of days. The model (which is a WIP) is a reconstruction from blueprints of an old driving clock for a telescope. So I suspect most of the render resources are being used on calculation of threads. Haven't added gears yet.

Re: the fast-linalg2 work. Is/was that specific to fastcsg, or are there yet more speed-up opportunities when paired with manifold? Just curious if I can start paying attention to only the manifold version.

Re core usage, I noticed that it seemed to be spreading the load out evenly. I.e. if one was at 24% they pretty much all were. Ditto for other values. So I suppose there could be a model that would max all of them out. Or, put differently, if my machine only had 4 cores, I might bury it. I do have a Linux laptop that I could run the experiment on.

Compiling with all cores but 2 has been nice (i.e. make -j$(nproc --ignore=2) I think does the trick). But it was only with the latest work that I noticed all the RAM and swap partition getting used up. So I don't think I'll try compiling on the laptop :) |
@mconsidine fast-linalg2 is completely orthogonal to this PR (and isn't ready for prime time yet, although I might open a draft PR soon). It's meant to optimize the evaluation stage of the model (execution of the OpenSCAD language itself) as opposed to the rendering of the geometry (done in C++ using CGAL or, now, Manifold). Only models that do huge amounts of geometry computations in their script will benefit from the linalg work (e.g. when going overboard w/ BOSL2). To know if that could be the case for your model, you could see how long it takes to render to CSG (i.e. just dump the render AST of the model). |
@t-paul I'd need some help w/ the build images for CircleCI: … |
AppImage: I'll check the backports package. I suppose moving on to 20.04 would be fine by now too, but that might take longer. |
I tried updating the AppImage to 20.04, which went surprisingly smoothly (well, it builds and runs on my machine 😀). It's grabbing … |
Latest pull compiles without error. FWIW, I am compiling with … And, fwiw, attached is the openscad file that I'm working with (definitely a work in progress, so there are probably a zillion ways to do things more effectively ...) |
Right, adding … |
@t-paul btw, not sure how often the MacOS dev snapshot is built, is there something jammed there? |
It builds via scheduler as we are limited to 500 minutes per month. |
@ochafik Thank you for this. The WIP seems to work - below are results of testing it on my model. The bottleneck doesn't seem to have been in the STL generation, though there is a pick-up. The enormous gain is in the use of manifold. In this version of the model I have added BOSL2 and the generation of gears with teeth and threads, rather than use stand-ins. I have not tried to evaluate the STL yet for printing, as there is still more model building to do. Also, I note that invoking manifold gets me around the errors shown doing it the former way. Many, many thanks!

This only invokes 1 core:

time ./openscad /home/matt/Downloads/3D_design_files/EquatorialClockDrive_reorg.scad --render -out.stl '-D$fn=192'
EXPORT-WARNING: Exported object may not be a valid 2-manifold and may need repair
real 17m18.075s |
It's so exciting to see this @ochafik! OpenSCAD is what got me into both 3D design and computational geometry; I started Manifold in hopes of contributing back, and it's amazing to see it actually happen. Big thank you to everyone in the community! |
Compiled master on a 12-core machine; I've been playing with F6 all day on models that previously took upwards of 0.5 hr to render - cool beans! Just wondering, what's needed to enable CUDA? |
I don't remember the details, but I believe I looked at how ASCII STLs are generated and found that they were just horrible in terms of buffering (or lack thereof). |
#4316 has the details from my previous investigation of ASCII STL performance. |
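For readers who haven't followed that issue, the buffering point boils down to something like this hypothetical sketch (not OpenSCAD's actual exporter): format all facets into one buffer and issue a single large write, rather than many tiny writes straight to the output stream.

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <vector>

struct Triangle { std::array<std::array<float, 3>, 3> v; };

// Hypothetical exporter: build the whole ASCII STL text in memory, then write
// it out in one call, instead of emitting each facet with its own tiny write.
void write_ascii_stl(std::FILE* out, const std::vector<Triangle>& tris) {
    std::string buf = "solid model\n";
    char line[128];
    for (const auto& t : tris) {
        buf += "  facet normal 0 0 0\n    outer loop\n";
        for (const auto& p : t.v) {
            std::snprintf(line, sizeof line, "      vertex %g %g %g\n",
                          (double)p[0], (double)p[1], (double)p[2]);
            buf += line;
        }
        buf += "    endloop\n  endfacet\n";
    }
    buf += "endsolid model\n";
    std::fwrite(buf.data(), 1, buf.size(), out);
}

int main() {
    std::vector<Triangle> tris(100000);  // all-zero dummy facets, just to exercise the writer
    write_ascii_stl(stdout, tris);
}
```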
@mconsidine thanks for the follow-up! Re/ STL export speed, let's move the discussion to #4316 (@jordanbrown0 thanks for the pointer!); my initial parallel experiment doesn't help much but switching to indexed meshes does. In the meantime note that …

@butcherg Glad this helps! Re/ CUDA, we'd need someone with access to a CUDA-enabled machine to play with the build flags and check it all works. And to find out how to set up the CI / release environments to probably build CUDA-specific binary variants. |
Just a note: As our execution policy is not yet optimized (elalish/manifold#380), CUDA may be slower than CPU only execution. But this should hopefully be fixed soon. |
@pca006132 good to know, thanks! Also, I noticed Manifold::Transform seems ~8x slower than OpenSCAD's PolySet::transform (which uses Eigen transforms); haven't fully investigated, but wondering if TBB maybe has too much overhead? (Even without Eigen's SIMD optimizations, I'd expect to match its speed when throwing a dozen cores at it.) I tried batching the thrust::transform calls in Impl::Transform, to no avail. |
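As an illustration of that overhead concern (a standalone sketch, not Manifold's actual Impl::Transform): when the per-element work is as cheap as one affine transform, any parallel speedup tends to hinge on processing vertices in coarse chunks so scheduling cost doesn't dominate.

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;
using Affine = std::array<float, 12>;  // hypothetical row-major 3x4 transform

inline Vec3 apply(const Affine& m, const Vec3& p) {
    return {m[0] * p[0] + m[1] * p[1] + m[2]  * p[2] + m[3],
            m[4] * p[0] + m[5] * p[1] + m[6]  * p[2] + m[7],
            m[8] * p[0] + m[9] * p[1] + m[10] * p[2] + m[11]};
}

void transform_all(std::vector<Vec3>& verts, const Affine& m) {
    // The explicit, coarse grain size (16k vertices per task) keeps the TBB
    // task overhead small relative to the very cheap per-vertex work.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, verts.size(), 1 << 14),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                verts[i] = apply(m, verts[i]);
        });
}

int main() {
    std::vector<Vec3> verts(2'000'000, Vec3{1.f, 2.f, 3.f});
    Affine scale2 = {2, 0, 0, 0,  0, 2, 0, 0,  0, 0, 2, 0};  // uniform scale by 2
    transform_all(verts, scale2);
}
```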
Yeah, TBB probably has quite a bit of overhead. And maybe we should move that discussion to elalish/manifold#380. |
It would be nice to get some CUDA results for comparison, but I'm not sure this is going to be in an official release. As far as I understand this is still tied to Nvidia only, and I'm not convinced we can maintain a whole GPU-specific set of builds which would be difficult to test. Maybe there's a chance for a vendor-independent solution at some point. |
@ochafik, I have a GeForce GTX 1660 Ti GPU, and a compiled OpenSCAD from the master branch; I'm willing to do some testing. @t-paul, fully agree with that "vendor independence". I'm just starting my dig into current GPU software architectures, particularly to support an image processing program that uses Vulkan; not yet deep enough to know what functions equivalent to CUDA it provides. Maybe you can short-circuit that: would Vulkan be a viable alternative? |
I would guess a better solution would be for us over at Manifold to switch from using Thrust to C++17 parallel algorithms, which is the standard the Thrust library evolved into. It would be good to know from a user's perspective how seamless that would make it to target the various parallel architectures out there. Certainly it would be nice to remove our related build flags and let the compilers take care of it for us. |
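For reference, this is roughly what that style looks like in plain C++17 (a generic sketch, not Manifold code); whether it actually runs in parallel, and on what hardware, is left to the toolchain (e.g. nvc++ can offload it, while libstdc++'s implementation delegates to TBB):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> xs(1'000'000, 1.0);
    // The execution policy is only a request; the standard library decides how
    // (and whether) to actually parallelize or vectorize the work.
    std::transform(std::execution::par_unseq, xs.begin(), xs.end(), xs.begin(),
                   [](double x) { return x * 2.0; });
    return 0;
}
```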
I think only nvc++ currently supports stdpar, and I don't see any news from AMD, clang or gcc. I don't think stdpar will help us enable GPU-accelerated processing on other GPUs anytime soon. Also, from a compiler perspective, automatically parallelizing code to the GPU is an extremely hard problem: it depends on the access pattern and the size of the workload, and usually requires special algorithm or data structure design. At least I don't think C++ is a good language for this, so I doubt stdpar will be that useful if we already have code designed around the GPU.

I think compute shaders are probably the way to go for accelerating some of the most time-consuming parts while remaining somewhat vendor neutral. We can incrementally switch over to compute shaders; I think sorting and copying are probably the most time-consuming parts. An alternative would be to use google/highway for SIMD sorting, but their code currently does not support sort-by-key, and I am too lazy to try to adapt their algorithm for it...

Due to the nature of our workload, we still have to tune various thresholds for selecting single-core, multicore or GPU-accelerated computation, so we will probably still need to test multiple different configurations to deliver the best performance. And I don't think this is bad either: we can make some profiling code for users to run if they want to improve the performance, similar to building with -march=native, and potentially submit a PR to us if the default is very bad for a certain configuration. |
@butcherg maybe just try adding -DMANIFOLD_USE_CUDA=1 to CMake's command line args as per Manifold's build instructions (https://github.com/elalish/manifold#building). In fact, I'd try building and running Manifold's tests on your CUDA hardware to make sure it all works. Then you could source benchmark scripts from https://gist.github.com/ochafik/70a6b15e982b7ccd5a79ff9afd99dbcf or #391 (comment).

@pca006132 @elalish Vulkan feels like the best standard practical way to harness both CPU and GPU. Also, I reckon it could definitely attract more OSS contributors (well, speaking for myself, that is; no plans to buy any NVIDIA hardware, and I used to be very much into OpenCL 😅). That said, I'm now super excited again about google/highway, might give it a deeper look. |
Thanks, that's good feedback, @ochafik and @pca006132. I haven't looked into Vulkan much, but it does seem like a reasonable alternative. And then there's WebGPU, which is coming rapidly - I wonder if there's any inkling of a way to cross-compile one to the other? |
I have an older Nvidia Quadro M4000 card in my box, so I can try this recompile as well. Do I need to use the version of manifold that is part of this git, or can I separately pull down the latest code from the manifold GitHub site and drop it into the submodules folder in openscad? Apologies if it's a naive question. |
@mconsidine manifold is a git submodule and will be pulled for you if you follow the general OpenSCAD build instructions. |
Yeah, you can have CUDA. I just finished restoring my NVIDIA driver configuration after my attempt to install the CUDA toolkit borked it all up. Probably my fault, but I'm just not into troubleshooting it, and I need to keep my GPU configuration clean for the Vulkan-based raw processing software to which I'm contributing. That, and its NVIDIA dependency is not appealing to the open-architecture fool I am... Manifold has already provided a multiple-order-of-magnitude improvement to OpenSCAD's mesh rendering, to the point of showing up preview rendering in some cases. Thanks, @elalish and @ochafik, it keeps OpenSCAD relevant for my model-building. |
@ochafik Okay, it looks like (against all odds) I've got the CUDA toolkit installed. A separate download of the latest manifold source compiled, with a few tests failing out of 135 - as well as it seemingly getting hung up on another. Regardless of that, are there any flags you would recommend for a compile beyond this: …

EDIT: Okay, I think I'm with @butcherg on this. CUDA-related compile errors cropped up in a couple of places, one seemingly being between manifold and CGAL 5.6 (latest master). Going back down to 5.5.3 seemed to deal with that, but then more CUDA issues cropped up. And this is just not a rabbit hole I need to live in. So I'm sticking with a non-CUDA version for the time being. Given the size of the model I'm working with, there would be far more to be gained for me if Preview mode could be sped up, as I note that it is only using one core at a time. |
This introduces experimental support for the https://github.com/elalish/manifold library by @elalish.

TL;DR: It's wicked fast, and ready to test / review. Maybe we get to ditch CGAL for most of the 3D rendering soon (except hulling & convex decomposition for minkowski), with 5-30x speedups over fast-csg (YMMV, multithreading works better on some models than others) and only a theoretical precision downgrade (TBC, please help test!).
You can download Windows / Linux binaries from the last green CircleCI run (e.g. a Linux AppImage, a Win64 build, a Win32 build).
Note: fenced by the `manifold` experimental feature (needs enabling in the UI's settings or passing `--enable=manifold` in the CLI - which doesn't affect the UI).

TODO:

- … some confusion about TBB flags. This incidentally fixes Multi-threaded Geometry rendering [$1,275] #391, although more work could be needed to maximize core utilization.
- tests/data/scad/issues/issue2342.scad (sent fix to Manifold: Simplify the CsgOpNodes as we build them, rather than in GetChildren, elalish/manifold#368)

A note on precision
Manifold uses single-precision floats, which on paper is a far cry from OpenSCAD's numeric handling, which is double precision at worst and, as a general rule, uses exact rationals (GMP / MPFR + CGAL) when doing CSG operations.

However, there are already substantial exceptions to OpenSCAD's exact numeric handling:
- `minkowski` and `hull` use double precision conversions and lose the exact rationals
- `TransformNode`s (`multmatrix` / `translate` / `scale` / `rotate`) transform things in double then reconvert them to "exact" rationals. There's no reasonable way to get exact transforms in the general case anyway, e.g. to guarantee that the result of N rotations of `PI / N` around the origin is exact, or that applying two successive opposite transforms produces exactly the same solid (although that may appear to work in some cases).

That said, rounding errors from transforming single-precision coordinates could potentially snowball in trees that have many nested transforms.
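To make the snowballing concern concrete, here's a tiny standalone demo (not OpenSCAD or Manifold code) comparing how far a point drifts after a full turn made of many small rotations in single vs. double precision; exact numbers will vary by platform:

```cpp
#include <cmath>
#include <cstdio>

// Rotate (1, 0) around the origin n times by 2*pi/n; a full turn should land
// exactly back on the starting point, so any distance from it is pure
// accumulated rounding error.
template <typename T>
T drift_after_full_turn(int n) {
    const T pi = std::acos(T(-1));
    const T c = std::cos(T(2) * pi / T(n));
    const T s = std::sin(T(2) * pi / T(n));
    T x = 1, y = 0;
    for (int i = 0; i < n; ++i) {
        const T nx = c * x - s * y;
        y = s * x + c * y;
        x = nx;
    }
    return std::hypot(x - T(1), y);
}

int main() {
    std::printf("drift, float : %g\n", (double)drift_after_full_turn<float>(100000));
    std::printf("drift, double: %g\n", drift_after_full_turn<double>(100000));
}
```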
The simple solution I'm rooting for here is to push the double-precision world transforms all the way down to pseudo-leaf nodes (i.e. actual leaves but also shared subtrees, minkowski, render, 2D-3D transitions), similarly to what I did in #3637.
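Roughly, that idea could look like the following sketch (made-up node types, not the actual #4561 code): compose the world matrix in double precision while walking the tree, and only round to the mesh's single-precision coordinates once, at the leaves.

```cpp
#include <array>
#include <memory>
#include <vector>

using Mat4  = std::array<double, 16>;  // column-major 4x4, double precision
using Vec3f = std::array<float, 3>;    // mesh vertices stay single precision

// Plain column-major 4x4 matrix product.
Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            for (int k = 0; k < 4; ++k)
                out[c * 4 + r] += a[k * 4 + r] * b[c * 4 + k];
    return out;
}

// Made-up node type for illustration: a local transform, children, and (for
// leaves) a mesh in float coordinates.
struct Node {
    Mat4 local;
    std::vector<std::unique_ptr<Node>> kids;
    std::vector<Vec3f> mesh;
};

// Compose transforms in double all the way down, and bake the combined world
// matrix into a leaf's float vertices in a single step.
void bake(Node& n, const Mat4& parentWorld) {
    const Mat4 world = mul(parentWorld, n.local);
    if (n.kids.empty()) {
        for (auto& v : n.mesh) {
            const double x = v[0], y = v[1], z = v[2];
            v = {float(world[0] * x + world[4] * y + world[8]  * z + world[12]),
                 float(world[1] * x + world[5] * y + world[9]  * z + world[13]),
                 float(world[2] * x + world[6] * y + world[10] * z + world[14])};
        }
    }
    for (auto& k : n.kids) bake(*k, world);
}

int main() {
    Mat4 id{}; id[0] = id[5] = id[10] = id[15] = 1.0;
    Node leaf; leaf.local = id; leaf.mesh = {{1.f, 2.f, 3.f}};
    bake(leaf, id);
}
```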
I'll explore that route separately in #4561 (after initially trying to push the transform on the fly in `GeometryEvaluator` the way `CSGTreeEvaluator` does with the state; found out this messes up caching, so going a more traditional tree transform route).

Licensing: GPLv2+ + Apache 2 = GPLv3
Manifold is under Apache 2 license, which can be included in GPLv3 but not GPLv2 projects.
Since OpenSCAD source is GPLv2+, this means one can only release an OpenSCAD + Manifold binary under GPLv3. The same was already happening w/ CGAL (see this thread) so this doesn't change anything.
Building & running
To use Manifold as an experimental rendering engine, build locally (make sure to install the general prerequisites) and run with `--enable=manifold` (or enable the feature in the UI's preferences).

(Interestingly enough, the rendering of that model breaks down at like `$fn=1000`. Might get back to this later to see if it's a Manifold issue or if it's about how we build spheres.)

minkowski, 20x faster (YMMV)
I've ported the OpenSCAD minkowski logic for the Manifold case, with a few upgrades:

- … use the `OPENSCAD_NO_PARALLEL=1` env var to see the difference (doesn't disable Manifold's parallel operations tho).

Long story short: minkowski is orders of magnitude faster in lots of cases.
For instance, @revarbat's BOSL2 offset3d docs essentially say the function is so slow they don't even bother generating a preview. Well, the following takes 4sec on my M1 mac (at roughly 2.5 cores utilization), instead of 4m31sec (or 1m34sec with fast-csg):
(Ngl, I struggled to find cool minkowski examples - adding some here - which this PR might help make more practical and widespread)
That said, it looks like the CGAL Nef-based convex component decomposition we use is quick to crash with random models. Might muster the courage to file a bug soon.