Occur panic in resCbRecheck when sending about 400tx/s #395
Labels
C: bug
Classification: Something isn't working
Ostracon version
https://github.com/line/ostracon/tree/v1.0.4
Environment
What happened
When sending about 400 tx/s, there is a high probability that consensus on one node stops, and then consensus on the other nodes stops one after another. It seems that the resCbRecheck function failed to load a tx, a panic occurred, and the recvRoutine of consensus/state stopped.
https://github.com/line/ostracon/blob/v1.0.4/mempool/clist_mempool.go#L553
Logs on a node
How panic occurs
1. tx1 is added by CListMempool.addTx via CListMempool.CheckTxAsync
   - CListMempool.txs: clist.CList
   - CListMempool.txsMap: sync.Map
   - CListMempool.cache: txCache (map)
2. tx1 is reaped by CListMempool.ReapMaxBytesMaxGasMaxTxs, which is using CListMempool.txs
3. tx1 arrives again via CListMempool.CheckTxAsync, and tx1' is added by CListMempool.addTx
   - CListMempool.txs has a duplicate tx1 in the list
   - CListMempool.txsMap has tx1 as only one entry in the map
   - CListMempool.cache has tx1 as only one entry in the map
4. CListMempool.Update runs for tx1, and tx1 is removed by CListMempool.removeTx using block.txs
   - CListMempool.txs still holds one of the duplicate tx1 entries in the list
   - CListMempool.txsMap removes tx1 from the map
   - CListMempool.cache adds/replaces tx1 in the map
5. tx1' is rechecked by CListMempool.recheckTxs using CListMempool.txs
6. The lookup in CListMempool.txsMap fails when checking tx1' by CListMempool.resCbRecheck using CListMempool.txsMap, and the panic occurs in go receiveRoutine
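The sequence above can be reproduced with a toy model. The following is a minimal sketch, not Ostracon's code: toyMempool and its methods are hypothetical names, a plain container/list and Go maps stand in for clist.CList, sync.Map and txCache, and it assumes the duplicate tx is appended to the list while the map entry is overwritten.

```go
package main

import (
	"container/list"
	"fmt"
)

// toyMempool is a hypothetical, simplified stand-in for CListMempool.
// It only mirrors the three structures named in the walkthrough.
type toyMempool struct {
	txs    *list.List               // plays the role of CListMempool.txs (duplicates possible)
	txsMap map[string]*list.Element // plays the role of CListMempool.txsMap (one entry per key)
	cache  map[string]struct{}      // plays the role of CListMempool.cache
}

func newToyMempool() *toyMempool {
	return &toyMempool{
		txs:    list.New(),
		txsMap: make(map[string]*list.Element),
		cache:  make(map[string]struct{}),
	}
}

// addTx appends to the list unconditionally and overwrites the map entry.
// Assumption: the cache did not filter out the duplicate.
func (m *toyMempool) addTx(tx string) {
	m.txsMap[tx] = m.txs.PushBack(tx)
	m.cache[tx] = struct{}{}
}

// removeTx removes only the list element the map points at,
// then deletes the map entry.
func (m *toyMempool) removeTx(tx string) {
	if e, ok := m.txsMap[tx]; ok {
		m.txs.Remove(e)
		delete(m.txsMap, tx)
	}
}

// recheckMissing walks the list, as a recheck would, and returns every tx
// that is still in the list but can no longer be found in the map.
func (m *toyMempool) recheckMissing() []string {
	var missing []string
	for e := m.txs.Front(); e != nil; e = e.Next() {
		tx := e.Value.(string)
		if _, ok := m.txsMap[tx]; !ok {
			missing = append(missing, tx)
		}
	}
	return missing
}

func main() {
	m := newToyMempool()
	m.addTx("tx1")    // step 1
	m.addTx("tx1")    // step 3: duplicate tx1'
	m.removeTx("tx1") // step 4: removal after the block is committed
	// One stale list entry remains, and the map no longer knows it.
	fmt.Println(m.txs.Len(), len(m.txsMap), m.recheckMissing())
}
```

Running this prints `1 0 [tx1]`: the list still holds one tx1 while the map is empty, which is the state in which a map lookup during recheck cannot succeed.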
Cause
The cause is an inconsistent mempool state, which arises when CListMempool.cache does not work well.

If Tendermint allows the mempool to be in an inconsistent state, Ostracon should not panic in CListMempool.resCbRecheck, since Ostracon may fail to find the tx in CListMempool.txsMap during CListMempool.resCbRecheck. (In Tendermint v0.34.x, they just check with CListMempool.txs in CListMempool.resCbRecheck, so they rarely meet this problem since they don't use CListMempool.txsMap.)

From another point of view, what Ostracon wants to do with CListMempool.recheckTxs is to remove txs from Ostracon's mempool so that it stays synchronized with the ABCI side when ABCI says a tx is not OK. If the tx is already removed, Ostracon doesn't have to do anything. In other words, there is no need to panic.

In summary:
CListMempool.txsMap
How to fix for short term
If you want to record the case where a tx is already removed, a logger.Error is fine. (Never panic.)
Suggested patch, without logging:
How to fix for long term
If we change the policy of allowing duplicate txs in Ostracon, we should add a check for duplicated txs in
CListMempool.CheckTxAsync
Suggested patch:
Appendix
Ostracon was updated to run ABCI.CheckTxAsync concurrently
The cause was introduced by the PR below:
Ostracon was updated to add the reserved feature
This is unrelated, JFYI:
Tendermint has also met a similar panic
But the reason is not the same. When they use the gRPC/Socket client, they hit a panic, and the issues are still ongoing.
Nobody knows why the panic has been there since this commit from 6 years ago:
And they are now thinking of allowing the case that causes the panic:
I agree it's reasonable to allow that case, remove the panic, and log instead