-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LLVM green commit tracking for coordination experiment #1178
Comments
Green LLVM commit for the week of 8/1/2022: ec5def5 |
Week of 8/8/2022: |
Week of 8/15/2022: |
Week of 8/22/2022: |
Week of 8/29/2022: |
Week of 9/5/2022: |
Week of 9/12/2022: |
Week of 9/19/2022: |
Week of 9/26/2022: |
Week of 10/02/2022: |
…ld in a few places. My team found this while investigating building llvm-project and onnx-mlir in parallel using a single cmake command. (llvm#1178) I've chosen to add the over-arching `MLIRIR` target rather than individual dependencies to avoid having to update these as interfaces are added/removed/moved within the MLIR IR. Below is the exact list of dependencies i found while researching this problem: MLIRCallInterfacesIncGen MLIRCastInterfacesIncGen MLIRDataLayoutInterfacesIncGen MLIROpAsmInterfaceIncGen MLIRRegionKindInterfaceIncGen MLIRSideEffectInterfacesIncGen MLIRSymbolInterfacesIncGen MLIRTensorEncodingIncGen ###ResultTypeInferenceOpInterface.cpp/HasOnnxSubgraphOpInterface.cpp/SpecializedKernelOpInterface.cpp MLIRBuiltinTypeInterfacesIncGen MLIRTensorEncodingIncGen MLIREmitCAttributesIncGen MLIRGPUOpsAttributesIncGen MLIRGPUOpsEnumsGen MLIROpenACCOpsIncGen MLIRSparseTensorAttrDefsIncGen MLIRTosaStructsIncGen Signed-off-by: Ian Bearman <[email protected]> Co-authored-by: Stella Stamenova <[email protected]>
Week of 10/10/2022: |
Week of 10/17/2022: cc: @silvasean |
Week of 10/24/2022: cc: @silvasean |
Week of 10/31/2022: cc: @tanyokwok |
Week of 11/07/2022: cc: @qiuxiafei |
Week of 11/14/2022: cc: @Shukla-Gaurav |
Week of 08/14/2023: Green LLVM commit: cc: @silvasean |
Sean is out of office this week, so I've also taken this week's bump: #2397 |
Week of 08/21/2023: Green LLVM commit: cc: @tanyokwok |
Week of 08/28/2023: Green LLVM commit: cc: @qiuxiafei |
Week of 09/04/2023: Green LLVM commit: cc: @Shukla-Gaurav |
Hi folks, I see that @ashay posted on Discord that they are no longer able to post green commits. I think we all appreciate the work that has gone into keeping the wheels on the cart: thank you. I'd like to give the community a chance to speak up, but I don't think without a volunteer we can keep the stablehlo backend enabled in its present state, given that it is tied to mlir-hlo, which is both deprecated and a very finicky dep. Can some folks speak up about the need/investment here so that we don't do something that is disconnected from expectations? I'd be open to various pragmatic options to take us forward as needed, including any of:
In any situation I can see, without some work done on the dependency and testing situation, I don't see how we can continue to keep this enabled in the default build. Please speak up if you are invested here and would like to discuss and find a path forward. I'm being a bit direct because this has been dragging on for a year, and I feel that we need to force a resolution. But I'm perfectly willing to discuss and reach a compromise on a supportable path forward. |
|
I agree we should drop the MLHO dep (given it is deprecated and we don't need some blackmagic greencommits where mhlo works). I know there are some users of StableHLO (not necessarily with the LinAlg lowering). So leaving it off by default and without e2e tests is ok. We can add a FYI CI that builds it with it ON and once @burmako work to get LinAlg lowering in StableHLO lands we can potentially re-enable StableHLO e2e testing. |
I will also add a RollLLVM CI that rolls LLVM nightly like we do for Pytorch and if anything fails we can fix it up right away so we are always on top of LLVM (just like we are on PyTorch) |
The other aspect of publishing the green commits was that they pointed to a state of LLVM where we were reasonably sure that it would work correctly on most platforms (including Windows). Then onnx-mlir was able to snap to the same commits and keep torch-mlir and onnx-mlir in sync for those that depend on both. I am not familiar with the RollLLVM CI, would that help in this situation? |
Thanks for initiating this discussion. For what it's worth, we don't use MHLO at work, so my opinion (understandably) carries less weight than someone who depends on MHLO. In the short term, I'd be in favor of not building the MHLO backend (and tests) in the Torch-MLIR CI. I don't think it would help to run a separate CI with the MHLO backend enabled, since we wouldn't have a way to fix problems until we rebase MHLO/StableHlo to the same LLVM commit chosen by Torch-MLIR (which we can't do without creating a branch/fork of MHLO/StableHlo). My (likely very naive) understanding is that if we drop the MHLO/StableHlo integration, there would be no way to lower PyTorch models to StableHlo (please correct me if I am wrong), which would be sad. If that's correct, I can maintain a fork of StableHlo that periodically rebases upstream StableHlo to the chosen LLVM green commit. Assuming that we don't build Torch-MLIR CI with MHLO enabled, it would not be a problem that Torch-MLIR's submodule of StableHlo would use a lagging LLVM commit. Of course, that doesn't resolve the problem of picking the LLVM commit hash. The RollLLVM CI sounds nice, but unless ONNX-MLIR uses the same LLVM commit, we'd be risking a similar fragmentation between Torch-MLIR and ONNX-MLIR as we currently have between Torch-MLIR and MHLO/StableHlo, causing headaches for consumers of both Torch-MLIR and ONNX-MLIR (which was the reason we started the coordination experiment). For the longer term, I wonder if we could kick this upstream and have the LLVM project produce periodic green commits that downstream projects sync on. My (arguably naive) opinion is that unless dependent projects (that build their dependencies from source) use the same commit, there's no way to guarantee problem-free builds. |
I suspect that we need to switch to a rolling commit branch where we land fixes to API breaks and make automatic LLVM bumps. That way everyone can use the branch as a source of patches to sync to their needs vs having one clock that is impossible to synchronize. Specific constraints on a version really need to belong to the consumer in this kind of situation. The upstream project can "keep the train moving and tested" so that others can grab what they need, but it should really only have two buttons: pause-and-fix-for-breakage and move-forward. I've been running an experiment for a couple of months and, leaving aside testing infra deps and all of the gunk that mlir-hlo pulls in, the core of the project is quite tolerant to version skew, and I think we need to make that go faster and patch more incrementally -- not slow it down to the lowest common denominator. Having more sync points is really the only way to meet everyone and stay patched (also, fixing things around actual breakages at the granularity of ~a few upstream patches is the most effective way to stay up to date). (as usual, I'm expressing a strong opinion to avoid ambiguity but am open to discuss it more) |
I believe something like what I am describing above is what @powderluv is referring to as a RollLLVMCI. I don't think it exists at the moment but could relatively easily. What I would like to see is for that CI to roll forward a branch that we keep patched when it breaks and then periodically merge back to main. Then anyone can do a roll-up of the project at the commit that they need. When I've run this experiment in the past, I had the branch move forward only to |
If automation like that is fairly easy to setup and fixes end up being cheaper, it seems like that's the way forward. How complex would it be to make sure it runs on all platforms of interest (e.g. windows, linux, mac) and to replicate elsewhere (e.g. onnx-mlir)? |
When I've done this in the past, I've just had each platform be a normal GitHub actions workflow that either turns green or doesn't and is owned by someone with enough stake in that platform being green for such a project. In my experience there, the issue is not complexity but cost: Windows runners are an order of magnitude more costly than anything else and they need to be funded by someone who has the need and is willing to pay. Technically, MacOS runners aren't cheap either but that community seems to get by with little armies of mac minis toiling away in basements. I expect that the automation part of this can largely be on the order of ~a python script with some knobs. |
From the point of view of Stablehlo users, first of all, we fully agree to remove the mhlo dependency. Regarding the timing of removing mhlo, we prefer to wait for @burmako to migrate Stablehlo-To-Linalg from IREE to the Stablehlo repository(#2177). But if we want to remove the mhlo dependency before this migration is complete, we can accept temporarily shutting down the e2e test of Stablehlo. |
Basically, I agree with @qingyunqu. One thing I'm concerned is that disabling stablehlo e2e tests might cause trouble merging relevant features into torch-mlir. So we hope the migration of Stablehlo-To-Linalg pass from IREE to the Stablehlo can be finished as soon as possible. |
I think I follow what @stellaraccident is describing and it seems reasonable to me. How would we determine when to merge the rolling branch into 'main'? Is it time based, or as needed/desired? Those merges should be low effort, just trying to form a mental model of the flow. One feature that would be nice would be to include more information when the RollLLVMCI fails. I get the RollPyTorch action failure notifications, but not a lot of information is in the mail. To get more, I need to read the CI log. A summary of the error messages or similar would be convenient, and might help get the right person to act on the failures quickly. Not a blocker, just a nice to have after watching what's happening the the LLVM move to GitHub PR notifications. |
I think the thing we (nod+Iree) are balancing is how much to keep the current state of torch-mlir intact vs starting a fork specifically for pytorch2/dynamo that ejects the legacy and sets up the next major version. Having spent some time getting reacquainted, I don't think there is a direct line between the current state and where we need to be (and in fact are already for some of our leading edge downstream work): all of the old torchscript stuff, the jit importer, native pytorch dep, shape inference infra, module builder IR/APIs, and several of the other things go away with pytorch2. What is left is the torch dialect infra and conversation pipelines. The rest is provided by direct Python interop with frontends. This is pretty in line with the long term vision that Sean layed out -- except that the time seems to be now to be going this path for real. Given that there is clearly utility and people relying on the existing setup, I am thinking that we just let people with a dependency on the pt1 world to stay on This would let those of us working on the limited subset that exists in the pt2 world to run at a higher rate and disconnected from the stability needs of the main branch. We can likely keep these two in sync and mergable for a while so as to avoid duplicate patching. In addition, the patch history of the dynamo branch can be used to make it easier to keep the main branch updated, since I assume we will be running that one faster. Happy to have a bigger discussion, but I just wanted to signal that based on what I see, the project will need to break a lot of compatibility to upgrade cleanly to pytorch2, and if we accept that and branch, that also fixes some of the conflicting integrate needs in the short term. I expect that it will be a few months of work to fully upgrade to pytorch2 and be ready for existing users who prefer a more stable current version to think about migrating. We can build out a lighter weight CI and testing integration as part of the new work (since the integration surface area and testing needs are drastically simplified). |
Just to be clear: I have no visibility into whether any of this can or will happen, and since the project goals I work on don't require it, I would request that people who do have such a need be proactive about pushing to get this done (or finding an alternative way to convince ourselves that it is tested adequately). |
Thanks, I think I am beginning to understand your suggestion. The part that I don't comprehend yet is if/when the |
I don't claim to have all of the answers yet, but it is pretty clear to me that the future state has to have more optionality, both in terms of downstream "exits" (i.e. StableHLO, et al) and the testing approach to them. The StableHLO dep itself is light-weight enough that it does not experience much version skew (and in my experience, it is within the realm of what this project could keep patched -- if operating at the level of conversion pipeline and lit tests). My expectation is that if the upstream project is responsible for the conversion pipeline and downstreams are responsible for contributing a CI job that validates it against a battery of tests (maybe even using the StableHLO interpreter in that case), then the dependency is very cheap. In short, torch-mlir/dynamo is:
So I guess to answer your question: I don't think it will be a big problem when the time comes so long as we keep this project out of the business of lock-step integration testing. Dynamo itself lends itself quite well to decoupling middleware like this from backends, and we should just lean in there. The legacy of pt1 was that if we didn't have lock-step integration testing, everything would fall apart. However, I think we've moved on from that need and re-introducing the testing approach from the perspective of pt2, we'd do it completely differently and in a much more manageable way. With the integration testing sequestered in such a way, the compile/unit-test scoped dep itself isn't too bad to manage. |
(and yes, on the rolling integrate branch, we may be broken with respect to some backends when core mlir APIs break, but this will often clear on its own as everything catches up. We just don't down merge the integrate branch to main/dynamo until everything is sufficiently green for our tastes. But more aggressive projects can also pin to a state that doesn't have all backends passing. The key is to not have monolithic builds/tests so we let the timeline progress in a partially green state, acknowledging that many kinds of API breaks can be ignored while things catch up and eventually align) |
Thanks, that resolves my concerns. I agree that we should keep the dependencies light weight to permit a higher tolerance to skews in chosen LLVM commits, and non-monolithic builds/tests sounds like a nice way to make progress faster. |
For anyone on this discussion who may not already know, PyTorch/XLA already does provide API to save StableHLO for your PyTorch models. Please feel free to file issues there if you would like the team to look into any particular features. |
+1 - it hasn't been clear to me that long-term the people using StableHLO based pipelines should be serviced by torch-mlir (if there is another tool for the job, supported by the people invested there). I don't have an opinion on that (our projects are not currently doing this). Definitely not looking to deplatform anything -- just supporting the investigation of other supported options that limit cross-tech-stack issues. For those not using StableHLO, there are a lot of fundamental features that keep us from using StableHLO+PyTorch/XLA, but that is a different conversation vs seeing if there is another, better supported way out of PyTorch to StableHLO. I don't know enough about the motivations of the people here that are using the Torch-MLIR->StableHLO pipelines to know what the gaps are. Just scanning the StableHLO pipelines, it does appear that they are heavily using dynamic shapes and other lowering/policy tweaks. Looking at the other pipelines, there are still further gaps that bar entry to using StableHLO at all (more just presenting as an FYI -- we find each of these to be essential and cannot indirect through a layer that constricts them). With the torch-mlir linalg pipeline, for example, we can:
|
FYI - the deprecated mlir-hlo dep hit me today with the inability to patch or navigate the contribution process. It is time: here is a patch to remove it: #2460 |
Development Notes still refer to this issue for "green commit", but there are not updates anymore and git submodules use top-of-the-trunk for LLVM and StableHLO. So a couple of questions here:
|
Please use #1135 for process discussion.
I'm adding the new issue, so that we can track just the green LLVM commits and corresponding commits/branches in torch-mlir, onnx-mlir and MHLO until we have a more formal process.
The text was updated successfully, but these errors were encountered: