Occur panic in `resCbRecheck` when sending about 400tx/s #395

tnasu · 2022-04-06T07:55:13Z

Ostracon version

https://github.com/line/ostracon/tree/v1.0.4

Environment

Starting 100 validator
Use abci.example.Counter
- Use abcicli.localClient

What happened

If sending about 400tx/s, there is a high probability that the consensus of one node will stop, and then the consensus of other nodes will stop one after another. It seems that the resCbRecheck function failed to load tx, panic occurred, and recvRoutine of consensus/state stopped.

https://github.com/line/ostracon/blob/v1.0.4/mempool/clist_mempool.go#L553

		if e, ok := mem.txsMap.Load(txHash); ok {
			//...
		} else {
			panic(fmt.Sprintf("unexpected tx response from proxy during recheck\ntxHash=%X, tx=%X", txHash, tx))
		}

Logs on a node

CONSENSUS FAILURE!!!                         module=consensus err="unexpected tx response from proxy during recheck
txHash=C65EC3DCBD8C7036A518F913C2CD329E290B71E702BBC0B9143E48E6D111C664, tx=6170706C6531383738393D31363438303235323130363634" 
stack="goroutine 318 [running]:
	runtime/debug.Stack(0xc003e69948, 0x108b9e0, 0xc004562f70)
	runtime/debug/stack.go:24 +0x9f
github.com/line/ostracon/consensus.(*State).receiveRoutine.func2(0xc00166bc00, 0x13bb858)
	github.com/line/ostracon/consensus/state.go:780 +0x57
panic(0x108b9e0, 0xc004562f70)
	runtime/panic.go:553 +0x1b9
github.com/line/ostracon/mempool.(*CListMempool).resCbRecheck(0xc0000a4400, 0xc004562f50, 0xc004562f60)
	github.com/line/ostracon/mempool/clist_mempool.go:553 +0x58c
github.com/line/ostracon/mempool.(*CListMempool).globalCb(0xc0000a4400, 0xc004562f50, 0xc004562f60)
	github.com/line/ostracon/mempool/clist_mempool.go:376 +0xb9
github.com/line/ostracon/abci/client.(*localClient).done(0xc0003280e0, 0xc0085bce70, 0xc004562f60, 0x0)
	github.com/line/ostracon/abci/client/local_client.go:365 +0x83
github.com/line/ostracon/abci/client.(*localClient).CheckTxAsync.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	github.com/line/ostracon/abci/client/local_client.go:114 +0x10e
github.com/line/ostracon/abci/example/counter.(*Application).CheckTxAsync(0xc000b61ee0, 0xc006aca060, 0x18, 0x18, 0x1, 0xc005d3cfe0)
	github.com/line/ostracon/abci/example/counter/counter.go:70 +0xec
github.com/line/ostracon/abci/client.(*localClient).CheckTxAsync(0xc0003280e0, 0xc006aca060, 0x18, 0x18, 0x1, 0xc004562f20, 0xc004562f20)
	github.com/line/ostracon/abci/client/local_client.go:112 +0x1f2
github.com/line/ostracon/proxy.(*appConnMempool).CheckTxAsync(0xc000bba2e0, 0xc006aca060, 0x18, 0x18, 0x1, 0xc004562f20, 0xc003e69f20)
	github.com/line/ostracon/proxy/app_conn.go:130 +0x63
github.com/line/ostracon/mempool.(*CListMempool).recheckTxs(0xc0000a4400)
	github.com/line/ostracon/mempool/clist_mempool.go:758 +0xb7
github.com/line/ostracon/mempool.(*CListMempool).Update(0xc0000a4400, 0xc007d3e600, 0xc00326a000, 0x2c9, 0x400, 0xc004d56bb0, 0x1, 0xc00161fa68)
	github.com/line/ostracon/mempool/clist_mempool.go:717 +0x875
github.com/line/ostracon/state.(*BlockExecutor).Commit(0xc000ba4060, 0xb, 0x0, 0x0, 0x0, 0xc00161fa50, 0xb, 0x1, 0xc00161fa68, 0x827, ...)
	github.com/line/ostracon/state/execution.go:307 +0x44e
github.com/line/ostracon/state.(*BlockExecutor).ApplyBlock(0xc000ba4060, 0xb, 0x0, 0x0, 0x0, 0xc00161fa50, 0xb, 0x1, 0xc00161fa68, 0x827, ...)
	github.com/line/ostracon/state/execution.go:218 +0x750
github.com/line/ostracon/consensus.(*State).finalizeCommit(0xc00166bc00, 0x827)
	github.com/line/ostracon/consensus/state.go:1699 +0xab7
github.com/line/ostracon/consensus.(*State).tryFinalizeCommit(0xc00166bc00, 0x827)
	github.com/line/ostracon/consensus/state.go:1610 +0x428
github.com/line/ostracon/consensus.(*State).addProposalBlockPart(0xc00166bc00, 0xc0031d5400, 0xc00422a000, 0x28, 0x2f746a89e2d, 0x1b9f980, 0x14a23aca)
	github.com/line/ostracon/consensus/state.go:2037 +0x7c5
github.com/line/ostracon/consensus.(*State).handleMsg(0xc00166bc00, 0x14be1c0, 0xc0031d5400, 0xc00422a000, 0x28)
	github.com/line/ostracon/consensus/state.go:870 +0x5a5
github.com/line/ostracon/consensus.(*State).receiveRoutine(0xc00166bc00, 0x0)
	github.com/line/ostracon/consensus/state.go:816 +0x86c
created by github.com/line/ostracon/consensus.(*State).OnStart
	github.com/line/ostracon/consensus/state.go:430 +0x896"

How panic occurs

Add tx1 by CListMempool.addTx via CListMempool.CheckTxAsync
- add into CListMempool.txs: clist.CList
- put into CListMempool.txsMap: sync.Map
- put into CListMempool.cache: txCache(map)
Propose a block with tx1 by CListMempool.ReapMaxBytesMaxGasMaxTxs is using CListMempool.txs
Add some txs and cache out tx1 via CListMempool.CheckTxAsync
Add tx1' by CListMempool.addTx
- CListMempool.txs have duplicate tx1 in list
- CListMempool.txsMap have tx1 as only one in map
- CListMempool.cache have tx1 as only one in map
Commit a block with tx1
Delivery tx1
Update mempool by CListMempool.Update
Remove tx1 by CListMempool.removeTx using block.txs
- CListMempool.txs remains the one of duplicate tx1 in list
- CListMempool.txsMap remove tx1 in map
- ClistMempool.cache add/replace tx1 in map
Recheck tx1' by CListMempool.rechecTxs using CListMempool.txs
Occur panic since does not exist in CListMempool.txsMap when checking tx1' by CListMempool.resCbRecheck using CListMempool.txsMap
Dead go receiveRoutine

Cause

The cause is an inconsistent state of mempool when ClistMempool.cache cannot work well.

Mempool can't filter duplicate tx if mempool cache misses it. tendermint/tendermint#5281 (comment)

If Tendermint allows being an inconsistent state of mempool, Ostracon shouldn't happen panic when CListMempool.resCbRecheck since Ostracon cannot check it with CListMempool.txsMap on CListMempool.resCbRecheck.
(In Tendermint-v0.34.x, they just check with CListMempool.txs on CListMempool.resCbRecheck, so they rarely meet this problem since they don't use CListMempool.txsMap.)
And from another point of view, what Ostracon wants to do with this CListMempool.recheckTxs is to remove Ostracon's mempool to synchronize with the ABCI side when ABCI says it's not OK. If it's already removed, Ostracon doesn't have to do anything. In other words, it doesn't need to occur a panic.

In summary:

There are definitely conditions for panic to occur
It doesn't need to occur a panic
It doesn't affect not to occur a panic on Ostracon, even if it's already removed from CListMempool.txsMap

How to fix for short term

If you want to record it when tx is already removed, just logger.Error is fine. (Never occur a panic)

Suggested the patch here without logging:

diff --git a/mempool/clist_mempool.go b/mempool/clist_mempool.go
index c88de1fdee..ccdeb09b35 100644
--- a/mempool/clist_mempool.go
+++ b/mempool/clist_mempool.go
@@ -537,20 +537,18 @@ func (mem *CListMempool) resCbFirstTime(
 func (mem *CListMempool) resCbRecheck(req *abci.Request, res *abci.Response) {
        switch r := res.Value.(type) {
        case *abci.Response_CheckTx:
-               tx := req.GetCheckTx().Tx
-               txHash := TxKey(tx)
-               if e, ok := mem.txsMap.Load(txHash); ok {
-                       celem := e.(*clist.CElement)
-                       if r.CheckTx.Code == abci.CodeTypeOK {
-                               // Good, nothing to do.
-                       } else {
+               if r.CheckTx.Code == abci.CodeTypeOK {
+                       // Good, nothing to do.
+               } else {
+                       tx := req.GetCheckTx().Tx
+                       txHash := TxKey(tx)
+                       if e, ok := mem.txsMap.Load(txHash); ok {
+                               celem := e.(*clist.CElement)
                                // Tx became invalidated due to newly committed block.
                                mem.logger.Error("tx is no longer valid", "tx", txID(tx), "res", r)
                                // NOTE: we remove tx from the cache because it might be good later
                                mem.removeTx(tx, celem, true)
                        }
-               } else {
-                       panic(fmt.Sprintf("unexpected tx response from proxy during recheck\ntxHash=%X, tx=%X", txHash, tx))
                }
        default:
                // ignore other messages

How to fix for long term

If we change the policy of allowing duplicate txs on Ostracon, we should add to check whether the duplicated tx on CListMempool.checkTxAsync

Suggested the patch here:

diff --git a/mempool/clist_mempool.go b/mempool/clist_mempool.go
index c88de1fdee..1255d355a3 100644
--- a/mempool/clist_mempool.go
+++ b/mempool/clist_mempool.go
@@ -297,6 +299,11 @@ func (mem *CListMempool) checkTxAsync(tx types.Tx, txInfo TxInfo, prepareCb func
 
 // CONTRACT: `caller` should held `mem.updateMtx.RLock()`
 func (mem *CListMempool) prepareCheckTx(tx types.Tx, txInfo TxInfo) error {
+       if _, ok := mem.txsMap.Load(TxKey(tx)); ok {
+               return ErrTxInMap
+       }
+
        txSize := len(tx)

diff --git a/mempool/errors.go b/mempool/errors.go
index 00f288ac8f..a6f77a004d 100644
--- a/mempool/errors.go
+++ b/mempool/errors.go
@@ -8,6 +8,7 @@ import (
 var (
        // ErrTxInCache is returned to the client if we saw tx earlier
        ErrTxInCache = errors.New("tx already exists in cache")
+       ErrTxInMap   = errors.New("tx already exists in txsMap")
 )

Appendix

Ostracon has updated to do concurrency `ABCI.CheckTxAsync`

The cause was made by the below PR:

https://github.com/line/ostracon/pull/163/files#diff-f71975385bf5bc21c04eb08f496140cc8cca405b8dbd7b628342e6d8afb61cc5R505

Ostracon has updated to make the `reserved` feature

This doesn't relate, JFYI:

https://github.com/line/ostracon/pull/158/files#diff-f71975385bf5bc21c04eb08f496140cc8cca405b8dbd7b628342e6d8afb61cc5R36-R38

Tendermint also has met near panic

abci/grpc: panic during concurrent CheckTx tendermint/tendermint#5439

But, it's not the same reason. When they use gRPC/Socket client, they have met to occur panic and its ongoing issues.

And nobody knows why occurs panic in this commit since 6 years ago:

tendermint/tendermint@d31d3c5#diff-37e0f6f3ccf82f3fcc49edc58fdace00678d70956ff412c13c64b764dc58a6f4R161

And they think to allow to occur a panic now:

mempool: remove panic when recheck-tx was not sent to ABCI application tendermint/tendermint#7134 (comment)

creachadair on 16 Oct 2021

I think this shouldn't happen, right? Is it worth logging? (I realize we didn't before either, but it seems like something that should be unusual and interesting)

williambanfield on 16 Oct 2021

In truth, this may be the kind of place we would be allowed to panic. If this happens, something on the level of a compile-error occurred, i.e. we've used the code in such a wrong way as to be sort of completely broken.

creachadair on 16 Oct 2021

That was my impression as well. That said: This at least preserves the existing behaviour. I was thinking a log would help us debugging them later, once we are no longer gating a near-term release on it

williambanfield on 16 Oct 2021

Definitely, log added.

I agree it's reasonable to allow to occur panic, remove it, and logging

The text was updated successfully, but these errors were encountered:

Kynea0b mentioned this issue Apr 6, 2022

fix: Remove panic for unexpected tx response in resCbRecheck #394

Merged

torao assigned Kynea0b Apr 6, 2022

torao added the C: bug Classification: Something isn't working label Apr 6, 2022

tnasu mentioned this issue Apr 14, 2022

Improve checking tx with txsMap for fixing the inconsistency between mem.txs and mem.txsMap #404

Merged

tnasu closed this as completed in #404 Apr 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Occur panic in `resCbRecheck` when sending about 400tx/s #395

Occur panic in `resCbRecheck` when sending about 400tx/s #395

tnasu commented Apr 6, 2022 •

edited

Loading

Occur panic in resCbRecheck when sending about 400tx/s #395

Occur panic in resCbRecheck when sending about 400tx/s #395

Comments

tnasu commented Apr 6, 2022 • edited Loading

Ostracon version

Environment

What happened

Logs on a node

How panic occurs

Cause

How to fix for short term

How to fix for long term

Appendix

Ostracon has updated to do concurrency ABCI.CheckTxAsync

Ostracon has updated to make the reserved feature

Tendermint also has met near panic

Occur panic in `resCbRecheck` when sending about 400tx/s #395

Occur panic in `resCbRecheck` when sending about 400tx/s #395

tnasu commented Apr 6, 2022 •

edited

Loading

Ostracon has updated to do concurrency `ABCI.CheckTxAsync`

Ostracon has updated to make the `reserved` feature