
Negative WaitGroup counter during the analyze cause TiDB is killed #52601

Closed
hawkingrei opened this issue Apr 15, 2024 · 5 comments · Fixed by #52634 · May be fixed by #52604
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. report/customer Customers have encountered this bug. severity/moderate sig/planner SIG: Planner type/bug The issue is confirmed as a bug.

Comments

@hawkingrei
Member

hawkingrei commented Apr 15, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Analyze a table with large columns such as BLOB.

2. What did you expect to see? (Required)

success

3. What did you see instead (Required)

(screenshot attached)

4. What is your TiDB version? (Required)

v6.5.5

@hawkingrei hawkingrei added type/bug The issue is confirmed as a bug. sig/sql-infra SIG: SQL Infra severity/moderate labels Apr 15, 2024
@hawkingrei hawkingrei added affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. labels Apr 15, 2024
@elsa0520
Contributor

elsa0520 commented Apr 16, 2024

Triggering conditions

  1. Auto analyze is running.
  2. The analyze memory limit control tidb_mem_quota_analyze is enabled (https://docs.pingcap.com/zh/tidb/v6.5/system-variables#tidb_mem_quota_analyze-%E4%BB%8E-v610-%E7%89%88%E6%9C%AC%E5%BC%80%E5%A7%8B%E5%BC%95%E5%85%A5).
  3. The analyze job uses more memory than the quota, for example because the table has big column types such as BLOB, JSON, or TEXT.

Even if all three conditions are met, a TiDB restart is not guaranteed to be triggered; the problem occurs only with a certain probability.
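To make condition 3 concrete, below is a minimal, self-contained Go sketch of the failure mode only; the quotaTracker type and all numbers are made up for illustration and are not TiDB's memory.Tracker or globalPanicOnExceed implementation. It shows how exceeding the analyze memory quota turns into a panic inside the goroutine that calls Consume:

package main

import "fmt"

// quotaTracker is a toy stand-in for a memory tracker with a
// panic-on-exceed action attached.
type quotaTracker struct {
	quota    int64
	consumed int64
}

// Consume adds to the tracked total and panics once the quota is exceeded,
// mirroring how a tracker's Consume triggers the attached panic action.
func (t *quotaTracker) Consume(bytes int64) {
	t.consumed += bytes
	if t.consumed > t.quota {
		panic("Out Of Global Analyze Memory Limit!")
	}
}

func main() {
	t := &quotaTracker{quota: 4 << 20} // pretend tidb_mem_quota_analyze is 4 MiB
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	// A sample response for a large BLOB/JSON/TEXT column can easily exceed the quota.
	t.Consume(16 << 20)
}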

@elsa0520
Contributor

elsa0520 commented Apr 16, 2024

Phenomenon and supporting evidence

  1. TiDB restarted with the following panic on stdout:
panic: Out Of Global Analyze Memory Limit! [recovered]
        panic: sync: negative WaitGroup counter

goroutine 168489963 [running]:
sync.(*WaitGroup).Add(0x4672ca0?, 0xc0801be170?)
        /usr/local/go/src/sync/waitgroup.go:83 +0xda
sync.(*WaitGroup).Done(...)
        /usr/local/go/src/sync/waitgroup.go:108
github.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).subMergeWorker.func1()
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:589 +0x3f4
panic({0x4672ca0, 0xc0801be170})
        /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/pingcap/tidb/executor.(*globalPanicOnExceed).Action(0x212524e?, 0xc000377f80?)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:186 +0x9f
github.com/pingcap/tidb/util/memory.(*Tracker).Consume.func1(0xc0012ce1f0, 0xc0d515c280?)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:440 +0x10b
github.com/pingcap/tidb/util/memory.(*Tracker).Consume(0xc06ccf5680?, 0x11264fb)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:484 +0x332
github.com/pingcap/tidb/statistics.(*baseCollector).FromProto(0xc0235a2000, 0xc0c7ed6080, 0x1177ba6?)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/row_sampler.go:307 +0x617
github.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).subMergeWorker(0xc1027b4108, 0xc00d602b40, 0xc00d602ba0, 0x10, 0x0)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:624 +0x6f1
created by github.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).buildSamplingStats
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:260 +0xa37
  2. Before TiDB is killed, the log includes "analyze table xxx has failed", "analyze worker panicked", and "Out Of Global Analyze Memory Limit!":
[2024/04/15 22:09:27.477 +08:00] [INFO] [update.go:1156] ["[stats] auto analyze triggered"] [sql="analyze table `zwcdb`.`plat_trans_rsp`"] [reason="too many modifications(675962/275669>0.1)"]
[2024/04/15 22:09:34.415 +08:00] [INFO] [analyze.go:575] ["analyze table `zwcdb`.`plat_trans_rsp` has failed"] [partition=] ["job info"="auto analyze table all columns with 256 buckets, 500 topn, 0.17925586003702448 samplerate"] ["start time"=2024/04/15 22:09:27.486 +08:00] ["end time"=2024/04/15 22:09:34.410 +08:00] [cost=6.924309188s]
[2024/04/15 22:09:40.927 +08:00] [ERROR] [analyze_col_v2.go:578] ["analyze worker panicked"] [recover="Out Of Global Analyze Memory Limit!"] [stack="github.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).subMergeWorker.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:578\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/executor.(*globalPanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:186\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:440\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:484\ngithub.com/pingcap/tidb/statistics.(*baseCollector).FromProto\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/row_sampler.go:307\ngithub.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).subMergeWorker\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:624"]
  3. Immediately after the time of the log [2024/04/15 22:09:34.415 +08:00] [INFO] [analyze.go:575] ["analyze table `zwcdb`.`plat_trans_rsp` has failed"], a new analyze job is created.

(screenshot attached)

  4. The analyze job
  • is in the running state and its end_time is NULL;
  • its start_time is later than the time of the "analyze failed" log;
  • its start_time is earlier than the TiDB restart time.

@elsa0520
Contributor

elsa0520 commented Apr 16, 2024

Root cause and analysis

Root cause

The analyze job panics when it is retried after exceeding the memory limit. (There is only a certain probability that this happens.)

Analysis

  1. The analyze job is retried when analyzeColumnsPushDownV2() returns an error.
  2. The retried job and the first job share the same field samplingMergeWg *util.WaitGroupWrapper, which leads to the "panic: sync: negative WaitGroup counter":
func (e *AnalyzeColumnsExecV2) analyzeColumnsPushDownWithRetryV2() *statistics.AnalyzeResults {
	analyzeResult := e.analyzeColumnsPushDownV2()
	// do not retry if succeed / not oom error / not auto-analyze / samplerate not set
	if analyzeResult.Err == nil || analyzeResult.Err != errAnalyzeOOM ||
		!e.ctx.GetSessionVars().InRestrictedSQL ||
		e.analyzePB.ColReq == nil || *e.analyzePB.ColReq.SampleRate <= 0 {
		return analyzeResult
	}
	finishJobWithLog(e.ctx, analyzeResult.Job, analyzeResult.Err)
	......
	prepareV2AnalyzeJobInfo(e.AnalyzeColumnsExec, true)
	AddNewAnalyzeJob(e.ctx, e.job)
	StartAnalyzeJob(e.ctx, e.job)
	return e.analyzeColumnsPushDownV2()
}
func (e *AnalyzeColumnsExecV2) subMergeWorker(resultCh chan<- *samplingMergeResult, taskCh <-chan []byte, l int, index int) {
	isClosedChanThread := index == 0
	defer func() {
		if r := recover(); r != nil {
			logutil.BgLogger().Error("analyze worker panicked", zap.Any("recover", r), zap.Stack("stack"))
			metrics.PanicCounter.WithLabelValues(metrics.LabelAnalyze).Inc()
			resultCh <- &samplingMergeResult{err: getAnalyzePanicErr(r)}
		}
		// Consume the remaining things.
		for {
			_, ok := <-taskCh
			if !ok {
				break
			}
		}
		e.samplingMergeWg.Done()
		if isClosedChanThread {
			e.samplingMergeWg.Wait()
			close(resultCh)
		}
	}()
	failpoint.Inject("mockAnalyzeSamplingMergeWorkerPanic", func() {
		panic("failpoint triggered")
	})
	failpoint.Inject("mockAnalyzeMergeWorkerSlowConsume", func(val failpoint.Value) {
		times := val.(int)
		for i := 0; i < times; i++ {
			e.memTracker.Consume(5 << 20)
			time.Sleep(100 * time.Millisecond)
		}
	})
	retCollector := statistics.NewRowSampleCollector(int(e.analyzePB.ColReq.SampleSize), e.analyzePB.ColReq.GetSampleRate(), l)
	for i := 0; i < l; i++ {
		retCollector.Base().FMSketches = append(retCollector.Base().FMSketches, statistics.NewFMSketch(maxSketchSize))
	}
	for {
		data, ok := <-taskCh
		if !ok {
			break
		}
		dataSize := int64(cap(data))
		colResp := &tipb.AnalyzeColumnsResp{}
		err := colResp.Unmarshal(data)
		if err != nil {
			resultCh <- &samplingMergeResult{err: err}
			return
		}
		colRespSize := int64(colResp.Size())
		e.memTracker.Consume(colRespSize)
		subCollector := statistics.NewRowSampleCollector(int(e.analyzePB.ColReq.SampleSize), e.analyzePB.ColReq.GetSampleRate(), l)
		subCollector.Base().FromProto(colResp.RowCollector, e.memTracker)
		UpdateAnalyzeJob(e.ctx, e.job, subCollector.Base().Count)
		........
	}
	resultCh <- &samplingMergeResult{collector: retCollector}
}
  3. Since both jobs share the same e.samplingMergeWg, the second job overwrites it before the subMergeWorker() goroutines of the first job have finished, so the leftover Done() calls drive the counter negative (a standalone illustrative sketch of this reuse pattern follows the snippet below).
defer func() {
		if r := recover(); r != nil {
			logutil.BgLogger().Error("analyze worker panicked", zap.Any("recover", r), zap.Stack("stack"))
			metrics.PanicCounter.WithLabelValues(metrics.LabelAnalyze).Inc()
			resultCh <- &samplingMergeResult{err: getAnalyzePanicErr(r)}
		}
		......
		e.samplingMergeWg.Done()
		if isClosedChanThread {
			e.samplingMergeWg.Wait()
			close(resultCh)
		}
	}()
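For illustration only, the standalone sketch below reproduces the reuse pattern described in point 3 with hypothetical executor/runJob names; it is not the TiDB code. A straggler worker from the first job is still running when the retry replaces the shared WaitGroup, so its Done() lands on the new counter, drives it negative, and the resulting panic fires inside a deferred function in a goroutine:

package main

import (
	"fmt"
	"sync"
	"time"
)

// executor mimics the idea of AnalyzeColumnsExecV2: the WaitGroup lives on a
// shared field, so the original job and the retried job see the same pointer.
// The unsynchronized shared field is exactly the point of the demo.
type executor struct {
	wg *sync.WaitGroup
}

// runJob replaces the shared WaitGroup and starts workers, roughly like
// buildSamplingStats starting subMergeWorkers for one analyze attempt.
func (e *executor) runJob(workers int, straggler bool) {
	e.wg = &sync.WaitGroup{}
	e.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func(id int) {
			defer func() {
				// Done() reads the current value of the shared field, which
				// the retried job may already have replaced by now.
				e.wg.Done()
			}()
			if straggler && id == 0 {
				time.Sleep(200 * time.Millisecond) // first job's slow worker
			}
		}(i)
	}
}

func main() {
	e := &executor{}
	e.runJob(2, true)                  // first attempt; one worker is still running...
	time.Sleep(50 * time.Millisecond)  // ...when the OOM retry kicks in
	e.runJob(2, false)                 // retry replaces e.wg; the new counter is 2
	time.Sleep(500 * time.Millisecond) // three Done() calls hit the new counter
	fmt.Println("not reached: the stray Done() panics with a negative counter first")
}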
  4. The panic happens inside a deferred function that itself runs inside a goroutine. This kind of panic cannot be recovered even though there is a recover() in a deferred function in the outer logic.

I wrote a demo of a panic inside a deferred function within a goroutine. This kind of panic cannot be recovered, and it causes the process to be killed.

package main

import (
	"fmt"
	"sync"
)

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("Recovered in main", r)
		}
	}()
	for i := 0; i < 2; i++ {
		go f()
	}
	fmt.Println("Returned normally from f.")
}

func f() {
	var samplingMergeWg *sync.WaitGroup
	samplingMergeWg = &sync.WaitGroup{}
	samplingMergeWg.Add(1)
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("Recovered in f", r)
		}
		samplingMergeWg.Add(-2) // counter 1 - 2 = -1: this panics after recover() above has already run, so nothing can catch it
	}()
	fmt.Println("Calling g.")
	g(0)
	fmt.Println("Returned normally from g.")
}

func g(i int) {
	if i > 3 {
		fmt.Println("Panicking!")
		panic(fmt.Sprintf("%v", i))
	}
	defer fmt.Println("Defer in g", i)
	fmt.Println("Printing in g", i)
	g(i + 1)
}

The panic output of my demo code:

Calling g.
Printing in g 0
Printing in g 1
Printing in g 2
Printing in g 3
Panicking!
Defer in g 3
Defer in g 2
Defer in g 1
Returned normally from f.
Defer in g 0
Recovered in f 4
panic: 4 [recovered]
        panic: sync: negative WaitGroup counter

goroutine 18 [running]:
sync.(*WaitGroup).Add(0x104f4b098?, 0x140000a0008?)
        /Users/elsa/Package/go/src/sync/waitgroup.go:62 +0x108
main.f.func1()
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:42 +0x84
panic({0x104f374c0?, 0x14000010050?})
        /Users/elsa/Package/go/src/runtime/panic.go:914 +0x218
main.g(0x4)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:52 +0x190
main.g(0x3)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:56 +0xec
main.g(0x2)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:56 +0xec
main.g(0x1)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:56 +0xec
main.g(0x0)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:56 +0xec
main.f()
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:45 +0xa8
created by main.main in goroutine 1
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:29 +0x4c

Process finished with the exit code 2

  5. When testing, I also found that it is uncertain whether this type of panic can be caught: in some runs it is caught, and in others it is not. So this is effectively undefined behavior in Go.

  6. TiDB happened to hit exactly this kind of panic, so the outer layer could not catch the exception and TiDB exited abnormally.

@elsa0520
Contributor

Thinking

  1. Streamline the defer logic to avoid panicking inside deferred functions, especially deferred functions that run inside goroutines.
  2. Avoid internal reset operations when two different goroutines share the same WaitGroup (see the sketch below).
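As a hedged sketch of suggestion 2 (one possible shape only, not necessarily what the linked fix PR does), each analyze attempt could own its WaitGroup locally and let the worker closures capture it, so a straggler from an earlier attempt can never decrement a later attempt's counter:

package main

import (
	"fmt"
	"sync"
)

// runAttempt gives every attempt its own WaitGroup, captured by the worker
// closures, instead of a field on the executor that a retry could overwrite.
func runAttempt(attempt string, workers int) {
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func(id int) {
			defer wg.Done() // always pairs with this attempt's Add
			fmt.Printf("%s: worker %d done\n", attempt, id)
		}(i)
	}
	wg.Wait() // a straggler from another attempt cannot touch this counter
}

func main() {
	runAttempt("first attempt", 2)
	runAttempt("retried attempt", 2) // fully independent of the first attempt
}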

@elsa0520 elsa0520 changed the title negative WaitGroup counter during the analyze negative WaitGroup counter during the analyze cause TiDB is killed Apr 16, 2024
@elsa0520 elsa0520 changed the title negative WaitGroup counter during the analyze cause TiDB is killed Negative WaitGroup counter during the analyze cause TiDB is killed Apr 16, 2024
@elsa0520 elsa0520 added the sig/planner SIG: Planner label Apr 16, 2024
ti-chi-bot bot pushed a commit that referenced this issue Apr 16, 2024
@YangKeao YangKeao removed the sig/sql-infra SIG: SQL Infra label Apr 19, 2024
@seiya-annie

/found customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Jun 4, 2024