run tests many times #2011

faddat · 2024-10-21T11:58:17Z

This pull request runs the tests twenty times with a cache and twenty times without a cache on linux and mac, for a grand total of eighty runs of the test suite.

The observed error is always:

runtime: bad pointer in frame github.com/CosmWasm/wasmvm/v2/internal/api.AnalyzeCode at 0xc001268738: 0x1
fatal error: invalid pointer found on stack

This pull request makes no changes to the code in this repository, and simply runs its tests eighty times.

faddat · 2024-10-21T12:02:15Z

As can be seen in the test results, we get a pointer error stemming from the internal api of wasmvm a significant portion of the time.

I can also say that after fixing the gas issues that using v2.1.3 poses, this problem still occurs about 5% of the time.

Bump github.com/cosmos/ibc-go/v8 from 8.4.0 to 8.5.1 cosmwasgo/cosmwasgo#10

If I had to guess: CGO.

codecov · 2024-10-21T12:26:18Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 42.87%. Comparing base (b2b6abe) to head (55d930a).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2011   +/-   ##
=======================================
  Coverage   42.87%   42.87%           
=======================================
  Files          85       85           
  Lines       11587    11587           
=======================================
  Hits         4968     4968           
  Misses       6179     6179           
  Partials      440      440

chipshort · 2024-10-22T04:51:22Z

Good find! I'm also able to reproduce this locally on my M1 Macbook. That's really bad.
Could it be similar to CosmWasm/wasmvm#536? Except this one seems to happen with the default optimization level.
We'll look into this.

faddat · 2024-11-02T00:11:40Z

The problem is not the continuous integration system. It is all of the binary blobs that sit in the internal folder of wasm VM, and the fact that:

CGO is nearly impossible to debug
CGO is known to leak ram like whoa
CGO interrupts a bunch of the basic low level features of golang
The current design pass go <-> c <-> rust (rust)

And yes I agree that it is very bad. It means that when merging prs, the process has been to just click retry even though at best there are flaky tests and at worst there's a very serious logic error, but it was chronically ignored.

And the reality is that can actually be seen in the CI system.

chipshort · 2024-11-04T03:23:48Z

I agree that debuggability is a bit of a pain point here.
I don't know much about the internals of CGO, but I would be very surprised if it leaked a lot of memory. And FWIW we haven't gotten any reports about memory leaks that could be traced back to CGO as a culprit. Memory leaks are most likely caused by the underlying C / Rust library that is called through CGO.
It doesn't really pass through C as a language, but we do need to use C ABI on the Rust side. We are planning to investigate using uniffi to simplify all these bindings and make them more robust.

It means that when merging prs, the process has been to just click retry even though at best there are flaky tests and at worst there's a very serious logic error, but it was chronically ignored

You are making some very strong claims here that are entirely based on the assumption that we found this bug already before your PR. To my knowledge, this is not the case. Do you have any proof that someone just reran CI to hide such a bug? Because if so, I would like to talk to them about it.

faddat · 2024-11-08T14:14:02Z

Concerning strong claims -- some of them are by necessity true.

The only one I'm not sure of concerns weather or not the issue is in the tests or

I can assure you (you can believe me or not) that cgo is basically the weak point of any app it is in.

faddat · 2024-12-03T10:19:32Z

Guys, I noticed that a fix to this had been merged to main.

Unfortunately CI is not set up to run tests 20 times.

The result of this is, to quote Do Kwon, "your patch is not patch"

... I am going to make a new PR that targets main, just to confirm this result, but I also wanted to make sure that the main branch is patched?

I am of course referencing lthe PRs/Issues linked to this PR.

chipshort · 2024-12-03T14:18:05Z

I have already confirmed this in a previous PR, but feel free to confirm this yourself.
Just keep in mind that we won't merge it because running all tests that often all the time is just wasteful.

aumetra · 2024-12-04T13:54:19Z

The current design pass go <-> c <-> rust (rust)

If you have the money to fund a few months of full-time dev and all the resulting maintenance, feel free to contact us, then I can aid in implementing the Go calling convention using assembly trampolines to call Rust without CGO. Otherwise we will have to keep this design.

C is the lingua franca of programming languages. Not because it's a good choice, but because it's abundant. C doesn't even have a well-defined ABI, it's just "what compiler people agreed on doing at some point"(tm), and we all have to live with how terrible C is at representing anything complex (see: https://faultlore.com/blah/c-isnt-a-language/).

Sidenote on that "what compiler people agreed on" point: Not even that is true. There are multiple instances of painful interop difficulties between different compilers because they do subtly different things on the ABI level: https://faultlore.com/blah/abi-puns/

But enough of this tangent.

there are flaky tests and at worst there's a very serious logic error

Yes. And these are probably Heisenbugs due to Go having a garbage collector, and garbage collectors are usually non-deterministic. This would be a non-issue if we had deterministic memory reclamation, but unfortunately we don't live in that world.

This is probably an issue of Go not keeping the object around quite long enough, but due to the GC not running immediately it will work in most cases. But in some cases where the GC happens to reclaim memory right before the pointer is dereferenced, the thing will crash.

What's additionally painful is that all of this is very much implicit and has no compiler guidance, so you're basically poking in the dark. It's as if you were writing C, but a chaos-monkey inserts free calls wherever it thinks it could fit at runtime. Not optimal. Quite suboptimal, actually.

Unfortunately CI is not set up to run tests 20 times.

There could be a middleground here. Running the tests scheduled once a day 20 times. Should still catch bad regressions, but it's not as wasteful as running it on every commit.

Otherwise, again, feel free to pay for the CircleCI credits. Then we can run it as often as you want.

faddat · 2024-12-04T15:26:30Z

Hey :)

I don't know who you are, but work is underway to remove cgo.

I am not paying for it, nor am I being paid for it.

I am jfdi it.

This shouldn't cost a penny on circleci: these are GitHub actions jobs.

Personally not a big fan of circleci.

https://GitHub.com/cosmwasgo/cosmwasgo is what I am working on here, and basically all I want to achieve is a drop-in replacement to wasmvm because it is in my opinion very problematic.

There could be a middleground here. Running the tests scheduled once a day 20 times. Should still catch bad regressions, but it's not as wasteful as running it on every commit.

Could you lmk what these tests are costing?

Again -- for me/orgs I set up on github -- typically I keep them 100% public, then there's no need to pay github. I know that sometimes, GH services can be horiffically expensive.

aumetra · 2024-12-04T16:51:48Z

I'm an engineer at Confio. This comment was made purely to explain our stack choices and technical reasons behind it. Not for you specifically but for people coming across this, to get across our reasoning here.

I felt this was needed since your comments display a lack of knowledge on why these things work how they work right now, and make quite bold claims.

Whether this is done on purpose, due to a language barrier, or otherwise: I don't really care.

And the comments about insinuating that "you should pay" have a simple reason: you want us to foot the bill for your CI workflows, that you are seemingly insistent on adding.

This is not something that should need adding and is a defect of tooling around Go, since every other language in existence has address sanitizers.

faddat · 2024-12-04T17:56:04Z

the jobs I added work just fine with the totally free github actions, which is actually all I personally use, and indeed I personally use that exact option.

I 100% get you on the GC and heisenbug thing and that is the precise reason that I am working on pure Go VM to run compiled rust contracts in.

I don't think that it should be controversial -- or bold at all -- to say that it is a bad idea to use unsafe in rust, or report -- as I have for years -- that chains using x/wasm immediately begin to leak RAM. They do, although confio has consistently denied that... but in the end, well, I was able to find the issue, and much to Confio's credit, the issue is now resolved. Thank you and your team for that.

Really, that is about all I am really aiming at here.

It is unfortunate that I (placing blame on my side) was not able to get the point across during the first couple of years.

Weather or not this changes your thoughts on what I am doing here -- I think CosmWasm is a great way to build blockchain applications, but that WasmVM isn't real safe, due to the use of CGO -- in fact... until I made this PR to run the tests many many times, I didn't realize that unsafe was being used.

There's nothing at all wrong with the original virtual machine design in CosmWasm. But today there are better options avail.able.

Thanks to you and anyone else at confio for assisting in the resolution of this issue. To me, that's the key.

If there is anything further that you feel I do not understand here, please, let me know.

PS: about costs -- at no time was I attempting to impose costs on Confio -- me, I just use all the free github stuff, works really well. If you are using the paid stuff, well, I totally get it, and sorry if it came off as a demand that Confio pay.

@aumetra I missed your last sentence

This is not something that should need adding and is a defect of tooling around Go, since every other language in existence has address sanitizers.

yes absolutely, cgo is one of the worst, very very worst parts of go!

That's why I'd like to write CosmWasm contracts in Rust, and run it without cgo.

faddat · 2024-12-04T18:38:18Z

Yes. And these are probably Heisenbugs due to Go having a garbage collector, and garbage collectors are usually non-deterministic. This would be a non-issue if we had deterministic memory reclamation, but unfortunately we don't live in that world.

this is why I want to stop using cgo in chains written in Go. I feel like we are in fact on the same page.

faddat · 2024-12-04T18:45:13Z

we haven't gotten any reports about memory leaks that could be traced back to CGO as a culprit. Memory leaks are most likely caused by the underlying C / Rust library that is called through CGO.

You have, but that is okay!

Look there's one key thing here and the others don't matter: we are fixing stuff.

pinosu · 2024-12-12T09:44:39Z

Hey @faddat , the fix is finally in main. Here you can see the tests passing #2057 .
I will close this PR because the issue is now solved in main.
Thanks for your contribution 👍

faddat · 2024-12-16T19:31:58Z

I'm just really happy to see this get addressed. Thanks for your work.

run tests many times

a6f6d18

run them 20 times each....

3bdd56a

faddat mentioned this pull request Oct 21, 2024

Update to wasmvm 2.2.0-rc.1 #2009

Closed

chipshort mentioned this pull request Nov 13, 2024

Fix pointer problem in UnmanagedVector CosmWasm/wasmvm#571

Merged

This was referenced Nov 13, 2024

Fix pointer problem in UnmanagedVector (backport #571) CosmWasm/wasmvm#572

Merged

Fix pointer problem in UnmanagedVector (backport #571) CosmWasm/wasmvm#573

Merged

Fix pointer problem in UnmanagedVector (backport #571) CosmWasm/wasmvm#574

Merged

Merge remote-tracking branch 'upstream/main' into patch-12

55d930a

use patch

c9fd2af

pinosu closed this Dec 12, 2024

pinosu mentioned this pull request Dec 12, 2024

Run tests 20 times #2057

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

run tests many times #2011

run tests many times #2011

faddat commented Oct 21, 2024 •

edited

Loading

faddat commented Oct 21, 2024 •

edited

Loading

codecov bot commented Oct 21, 2024 •

edited

Loading

chipshort commented Oct 22, 2024

faddat commented Nov 2, 2024 •

edited

Loading

chipshort commented Nov 4, 2024

faddat commented Nov 8, 2024

faddat commented Dec 3, 2024

chipshort commented Dec 3, 2024

aumetra commented Dec 4, 2024 •

edited

Loading

faddat commented Dec 4, 2024 •

edited

Loading

aumetra commented Dec 4, 2024

faddat commented Dec 4, 2024 •

edited

Loading

faddat commented Dec 4, 2024

faddat commented Dec 4, 2024

pinosu commented Dec 12, 2024

faddat commented Dec 16, 2024

run tests many times #2011

run tests many times #2011

Conversation

faddat commented Oct 21, 2024 • edited Loading

faddat commented Oct 21, 2024 • edited Loading

codecov bot commented Oct 21, 2024 • edited Loading

Codecov Report

chipshort commented Oct 22, 2024

faddat commented Nov 2, 2024 • edited Loading

chipshort commented Nov 4, 2024

faddat commented Nov 8, 2024

faddat commented Dec 3, 2024

chipshort commented Dec 3, 2024

aumetra commented Dec 4, 2024 • edited Loading

faddat commented Dec 4, 2024 • edited Loading

aumetra commented Dec 4, 2024

faddat commented Dec 4, 2024 • edited Loading

faddat commented Dec 4, 2024

faddat commented Dec 4, 2024

pinosu commented Dec 12, 2024

faddat commented Dec 16, 2024

faddat commented Oct 21, 2024 •

edited

Loading

faddat commented Oct 21, 2024 •

edited

Loading

codecov bot commented Oct 21, 2024 •

edited

Loading

faddat commented Nov 2, 2024 •

edited

Loading

aumetra commented Dec 4, 2024 •

edited

Loading

faddat commented Dec 4, 2024 •

edited

Loading

faddat commented Dec 4, 2024 •

edited

Loading