Occur panic in resCbRecheck when sending about 400tx/s #395

Closed
tnasu opened this issue Apr 6, 2022 · 0 comments · Fixed by #404
Labels: C: bug (Classification: Something isn't working)
tnasu commented Apr 6, 2022

Ostracon version

https://github.com/line/ostracon/tree/v1.0.4

Environment

  • Starting 100 validators
  • Use abci.example.Counter
    • Use abcicli.localClient

What happened

When sending about 400 tx/s, there is a high probability that consensus stops on one node, and then consensus stops on the other nodes one after another. It seems that the resCbRecheck function failed to load the tx from txsMap, a panic occurred, and the receiveRoutine of consensus/state stopped.

https://github.com/line/ostracon/blob/v1.0.4/mempool/clist_mempool.go#L553

		if e, ok := mem.txsMap.Load(txHash); ok {
			//...
		} else {
			panic(fmt.Sprintf("unexpected tx response from proxy during recheck\ntxHash=%X, tx=%X", txHash, tx))
		}

Logs on a node

CONSENSUS FAILURE!!!                         module=consensus err="unexpected tx response from proxy during recheck
txHash=C65EC3DCBD8C7036A518F913C2CD329E290B71E702BBC0B9143E48E6D111C664, tx=6170706C6531383738393D31363438303235323130363634" 
stack="goroutine 318 [running]:
	runtime/debug.Stack(0xc003e69948, 0x108b9e0, 0xc004562f70)
	runtime/debug/stack.go:24 +0x9f
github.com/line/ostracon/consensus.(*State).receiveRoutine.func2(0xc00166bc00, 0x13bb858)
	github.com/line/ostracon/consensus/state.go:780 +0x57
panic(0x108b9e0, 0xc004562f70)
	runtime/panic.go:553 +0x1b9
github.com/line/ostracon/mempool.(*CListMempool).resCbRecheck(0xc0000a4400, 0xc004562f50, 0xc004562f60)
	github.com/line/ostracon/mempool/clist_mempool.go:553 +0x58c
github.com/line/ostracon/mempool.(*CListMempool).globalCb(0xc0000a4400, 0xc004562f50, 0xc004562f60)
	github.com/line/ostracon/mempool/clist_mempool.go:376 +0xb9
github.com/line/ostracon/abci/client.(*localClient).done(0xc0003280e0, 0xc0085bce70, 0xc004562f60, 0x0)
	github.com/line/ostracon/abci/client/local_client.go:365 +0x83
github.com/line/ostracon/abci/client.(*localClient).CheckTxAsync.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	github.com/line/ostracon/abci/client/local_client.go:114 +0x10e
github.com/line/ostracon/abci/example/counter.(*Application).CheckTxAsync(0xc000b61ee0, 0xc006aca060, 0x18, 0x18, 0x1, 0xc005d3cfe0)
	github.com/line/ostracon/abci/example/counter/counter.go:70 +0xec
github.com/line/ostracon/abci/client.(*localClient).CheckTxAsync(0xc0003280e0, 0xc006aca060, 0x18, 0x18, 0x1, 0xc004562f20, 0xc004562f20)
	github.com/line/ostracon/abci/client/local_client.go:112 +0x1f2
github.com/line/ostracon/proxy.(*appConnMempool).CheckTxAsync(0xc000bba2e0, 0xc006aca060, 0x18, 0x18, 0x1, 0xc004562f20, 0xc003e69f20)
	github.com/line/ostracon/proxy/app_conn.go:130 +0x63
github.com/line/ostracon/mempool.(*CListMempool).recheckTxs(0xc0000a4400)
	github.com/line/ostracon/mempool/clist_mempool.go:758 +0xb7
github.com/line/ostracon/mempool.(*CListMempool).Update(0xc0000a4400, 0xc007d3e600, 0xc00326a000, 0x2c9, 0x400, 0xc004d56bb0, 0x1, 0xc00161fa68)
	github.com/line/ostracon/mempool/clist_mempool.go:717 +0x875
github.com/line/ostracon/state.(*BlockExecutor).Commit(0xc000ba4060, 0xb, 0x0, 0x0, 0x0, 0xc00161fa50, 0xb, 0x1, 0xc00161fa68, 0x827, ...)
	github.com/line/ostracon/state/execution.go:307 +0x44e
github.com/line/ostracon/state.(*BlockExecutor).ApplyBlock(0xc000ba4060, 0xb, 0x0, 0x0, 0x0, 0xc00161fa50, 0xb, 0x1, 0xc00161fa68, 0x827, ...)
	github.com/line/ostracon/state/execution.go:218 +0x750
github.com/line/ostracon/consensus.(*State).finalizeCommit(0xc00166bc00, 0x827)
	github.com/line/ostracon/consensus/state.go:1699 +0xab7
github.com/line/ostracon/consensus.(*State).tryFinalizeCommit(0xc00166bc00, 0x827)
	github.com/line/ostracon/consensus/state.go:1610 +0x428
github.com/line/ostracon/consensus.(*State).addProposalBlockPart(0xc00166bc00, 0xc0031d5400, 0xc00422a000, 0x28, 0x2f746a89e2d, 0x1b9f980, 0x14a23aca)
	github.com/line/ostracon/consensus/state.go:2037 +0x7c5
github.com/line/ostracon/consensus.(*State).handleMsg(0xc00166bc00, 0x14be1c0, 0xc0031d5400, 0xc00422a000, 0x28)
	github.com/line/ostracon/consensus/state.go:870 +0x5a5
github.com/line/ostracon/consensus.(*State).receiveRoutine(0xc00166bc00, 0x0)
	github.com/line/ostracon/consensus/state.go:816 +0x86c
created by github.com/line/ostracon/consensus.(*State).OnStart
	github.com/line/ostracon/consensus/state.go:430 +0x896"

How the panic occurs

  1. Add tx1 by CListMempool.addTx via CListMempool.CheckTxAsync
    • added to CListMempool.txs: clist.CList
    • put into CListMempool.txsMap: sync.Map
    • put into CListMempool.cache: txCache (map)
  2. Propose a block containing tx1 by CListMempool.ReapMaxBytesMaxGasMaxTxs, which reads CListMempool.txs
  3. Add more txs via CListMempool.CheckTxAsync so that tx1 is evicted from the cache
  4. Add tx1' (the same tx again) by CListMempool.addTx
    • CListMempool.txs now holds tx1 twice in the list
    • CListMempool.txsMap still holds only one tx1 in the map
    • CListMempool.cache holds only one tx1 in the map
  5. Commit a block containing tx1
  6. Deliver tx1
  7. Update the mempool by CListMempool.Update
  8. Remove tx1 by CListMempool.removeTx using block.txs
    • CListMempool.txs still keeps one of the duplicate tx1 entries in the list
    • CListMempool.txsMap removes tx1 from the map
    • CListMempool.cache adds/replaces tx1 in the map
  9. Recheck tx1' by CListMempool.recheckTxs, which iterates CListMempool.txs
  10. A panic occurs because tx1' no longer exists in CListMempool.txsMap when CListMempool.resCbRecheck checks it against CListMempool.txsMap (see the sketch after this list)
  11. The receiveRoutine goroutine dies
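
The divergence described above can be reproduced with a minimal, self-contained sketch. This is not Ostracon code: container/list and plain string keys stand in for clist.CList and TxKey(tx), and the cache itself is omitted.

package main

import (
	"container/list"
	"fmt"
	"sync"
)

func main() {
	txs := list.New()   // stands in for CListMempool.txs
	var txsMap sync.Map // stands in for CListMempool.txsMap

	addTx := func(tx string) {
		e := txs.PushBack(tx)
		txsMap.Store(tx, e) // at most one entry per tx key; a duplicate add overwrites it
	}

	// Steps 1-4: tx1 enters the mempool, is evicted from the cache, and the
	// same tx (tx1') is added again. The list now has two elements, the map one.
	addTx("tx1")
	addTx("tx1")
	fmt.Println("list length:", txs.Len()) // 2

	// Steps 7-8: Update removes the committed tx1 once. The single map entry
	// is deleted, but one list element for tx1 survives.
	if e, ok := txsMap.Load("tx1"); ok {
		txs.Remove(e.(*list.Element))
		txsMap.Delete("tx1")
	}

	// Steps 9-10: recheckTxs walks the list; resCbRecheck then looks each tx up
	// in the map and, not finding it, panics in the original code.
	for e := txs.Front(); e != nil; e = e.Next() {
		if _, ok := txsMap.Load(e.Value.(string)); !ok {
			fmt.Println("would panic: tx is in txs but not in txsMap:", e.Value)
		}
	}
}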

Cause

The cause is an inconsistent mempool state that arises when CListMempool.cache cannot do its job.

If Tendermint allows the mempool to be in an inconsistent state, Ostracon should not panic in CListMempool.resCbRecheck either, because the CListMempool.txsMap lookup in resCbRecheck is not a reliable check in that situation.
(In Tendermint v0.34.x, resCbRecheck only consults CListMempool.txs, so Tendermint rarely hits this problem because it does not use CListMempool.txsMap.)
From another point of view, what Ostracon wants to achieve with CListMempool.recheckTxs is to remove txs from its mempool so that it stays in sync with the ABCI side when the ABCI application says a tx is no longer OK. If the tx has already been removed, Ostracon does not have to do anything. In other words, there is no need to panic.

In summary:

  • There are clearly conditions under which the panic can occur
  • The panic is unnecessary
  • Not panicking has no adverse effect on Ostracon, even if the tx has already been removed from CListMempool.txsMap

How to fix in the short term

If we want to record the case where the tx has already been removed, a logger.Error call is enough. (A panic should never occur.)

Suggested patch, without logging:

diff --git a/mempool/clist_mempool.go b/mempool/clist_mempool.go
index c88de1fdee..ccdeb09b35 100644
--- a/mempool/clist_mempool.go
+++ b/mempool/clist_mempool.go
@@ -537,20 +537,18 @@ func (mem *CListMempool) resCbFirstTime(
 func (mem *CListMempool) resCbRecheck(req *abci.Request, res *abci.Response) {
        switch r := res.Value.(type) {
        case *abci.Response_CheckTx:
-               tx := req.GetCheckTx().Tx
-               txHash := TxKey(tx)
-               if e, ok := mem.txsMap.Load(txHash); ok {
-                       celem := e.(*clist.CElement)
-                       if r.CheckTx.Code == abci.CodeTypeOK {
-                               // Good, nothing to do.
-                       } else {
+               if r.CheckTx.Code == abci.CodeTypeOK {
+                       // Good, nothing to do.
+               } else {
+                       tx := req.GetCheckTx().Tx
+                       txHash := TxKey(tx)
+                       if e, ok := mem.txsMap.Load(txHash); ok {
+                               celem := e.(*clist.CElement)
                                // Tx became invalidated due to newly committed block.
                                mem.logger.Error("tx is no longer valid", "tx", txID(tx), "res", r)
                                // NOTE: we remove tx from the cache because it might be good later
                                mem.removeTx(tx, celem, true)
                        }
-               } else {
-                       panic(fmt.Sprintf("unexpected tx response from proxy during recheck\ntxHash=%X, tx=%X", txHash, tx))
                }
        default:
                // ignore other messages
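
For clarity, here is a standalone sketch (not the actual Ostracon types) of the patched control flow, with plain Go values standing in for abci.Response and clist.CElement: only a non-OK CheckTx code triggers the map lookup, and a missing entry is treated as "already removed" rather than as a fatal inconsistency.

package main

import (
	"fmt"
	"sync"
)

const codeTypeOK uint32 = 0 // stands in for abci.CodeTypeOK

// resCbRecheck mirrors the patched logic: check the response code first,
// and only then look the tx up in the map; a missing entry is not fatal.
func resCbRecheck(txsMap *sync.Map, tx string, code uint32) {
	if code == codeTypeOK {
		return // good, nothing to do
	}
	if _, ok := txsMap.Load(tx); ok {
		// Tx became invalid due to the newly committed block: remove it.
		fmt.Println("tx is no longer valid, removing:", tx)
		txsMap.Delete(tx)
		return
	}
	// Already removed by Update; nothing to do (the old code panicked here).
	fmt.Println("tx already removed, ignoring:", tx)
}

func main() {
	var txsMap sync.Map
	resCbRecheck(&txsMap, "tx1", 1) // non-OK code for a tx not in the map: no panic
}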

How to fix in the long term

If we change Ostracon's policy of allowing duplicate txs, we should add a check for duplicated txs in CListMempool.checkTxAsync.

Suggested patch:

diff --git a/mempool/clist_mempool.go b/mempool/clist_mempool.go
index c88de1fdee..1255d355a3 100644
--- a/mempool/clist_mempool.go
+++ b/mempool/clist_mempool.go
@@ -297,6 +299,11 @@ func (mem *CListMempool) checkTxAsync(tx types.Tx, txInfo TxInfo, prepareCb func
 
 // CONTRACT: `caller` should held `mem.updateMtx.RLock()`
 func (mem *CListMempool) prepareCheckTx(tx types.Tx, txInfo TxInfo) error {
+       if _, ok := mem.txsMap.Load(TxKey(tx)); ok {
+               return ErrTxInMap
+       }
+
        txSize := len(tx)
diff --git a/mempool/errors.go b/mempool/errors.go
index 00f288ac8f..a6f77a004d 100644
--- a/mempool/errors.go
+++ b/mempool/errors.go
@@ -8,6 +8,7 @@ import (
 var (
        // ErrTxInCache is returned to the client if we saw tx earlier
        ErrTxInCache = errors.New("tx already exists in cache")
+       ErrTxInMap   = errors.New("tx already exists in txsMap")
 )
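
As a hedged sketch of the long-term idea (simplified stand-in types, not the real CListMempool): a tx that is already tracked in txsMap is rejected before it can be appended to the list a second time, so the list and the map cannot diverge. The Store that the real addTx performs later is folded into the same function here to keep the sketch small.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrTxInMap matches the sentinel error added in the patch above.
var ErrTxInMap = errors.New("tx already exists in txsMap")

type mempool struct {
	txsMap sync.Map // stands in for CListMempool.txsMap
}

// prepareCheckTx refuses a tx that the mempool is already tracking, so the
// list can never receive the same tx twice.
func (m *mempool) prepareCheckTx(tx string) error {
	if _, ok := m.txsMap.Load(tx); ok {
		return ErrTxInMap
	}
	m.txsMap.Store(tx, struct{}{}) // done by addTx in the real code
	return nil
}

func main() {
	m := &mempool{}
	fmt.Println(m.prepareCheckTx("tx1")) // <nil>
	fmt.Println(m.prepareCheckTx("tx1")) // tx already exists in txsMap
}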

Appendix

Ostracon was updated to run ABCI CheckTxAsync concurrently

The cause was introduced by the PR below:

Ostracon was updated to add the reserved feature

This is unrelated, JFYI:

Tendermint has also hit a similar panic

However, it is not for the same reason. They have hit the panic when using the gRPC/socket client, and that issue is still ongoing.

And nobody knows why the panic occurs in this commit from 6 years ago:

And they now consider it acceptable for the panic to occur:

creachadair on 16 Oct 2021

I think this shouldn't happen, right? Is it worth logging? (I realize we didn't before either, but it seems like something that should be unusual and interesting)

williambanfield on 16 Oct 2021

In truth, this may be the kind of place we would be allowed to panic. If this happens, something on the level of a compile-error occurred, i.e. we've used the code in such a wrong way as to be sort of completely broken.

creachadair on 16 Oct 2021

That was my impression as well. That said: This at least preserves the existing behaviour. I was thinking a log would help us debugging them later, once we are no longer gating a near-term release on it

williambanfield on 16 Oct 2021

Definitely, log added.

I agree it's reasonable either to allow the panic to occur, or to remove it and add logging.
