query stuck when trying to kill it because of OOM #55042

Closed
guo-shaoge opened this issue Jul 30, 2024 · 18 comments · Fixed by #55118
Assignees
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. impact/oom severity/major sig/execution SIG execution type/bug The issue is confirmed as a bug.

Comments

@guo-shaoge
Collaborator

guo-shaoge commented Jul 30, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Hack the TiDB code by deleting the following guard, so that tidb_mem_quota_query can be set to a very small value, which helps reproduce this bug:
    if intVal > 0 && intVal < 128 { // 128 Bytes
        s.StmtCtx.AppendWarning(ErrTruncatedWrongValue.FastGenByArgs(TiDBServerMemoryLimitSessMinSize, originalValue))
        intVal = 128
    }
  2. Build TiDB and start tiup playground.
  3. Download 10-tidb-slow.log, then run set @@tidb_slow_query_file = '/home/guojiangtao/10-tidb-slow.log'; (remember to change the slow query file path)
  4. set @@tidb_mem_quota_query = 10;
  5. run select time,host host_ip,Query_time as exec_max_time,parse_time,compile_time,Query as sql_text,Digest as sql_id,is_internal,succ, Plan as plan_text,mem_max as mem_max,User as parse_user,DB as database_name,total_keys,request_count,process_time,process_keys from information_schema.SLOW_QUERY order by time desc;
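
For reference, the effect of the guard removed in step 1 can be sketched in isolation. This is only an illustration based on the four lines quoted above: clampMemQuota and minMemQuota are hypothetical names, and the boolean return stands in for the warning that the real code appends to the statement context.

```go
package main

import "fmt"

// minMemQuota mirrors the 128-byte floor in the guard quoted in step 1.
const minMemQuota = 128

// clampMemQuota is a hypothetical stand-in for the removed guard.
// Positive values below the floor are raised to the floor; the second
// return value reports whether a warning would have been appended.
func clampMemQuota(intVal int64) (int64, bool) {
	if intVal > 0 && intVal < minMemQuota {
		return minMemQuota, true
	}
	return intVal, false
}

func main() {
	for _, v := range []int64{0, 10, 128, 1 << 30} {
		got, warned := clampMemQuota(v)
		fmt.Printf("requested=%d effective=%d warned=%t\n", v, got, warned)
	}
}
```

With the guard in place, the value 10 from step 4 would become 128, which is why the guard has to be deleted before such a small quota takes effect.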

2. What did you expect to see? (Required)

The query is canceled.

3. What did you see instead (Required)

The query gets stuck.

4. What is your TiDB version? (Required)

master 560e92e

@guo-shaoge guo-shaoge added the type/bug The issue is confirmed as a bug. label Jul 30, 2024
@guo-shaoge guo-shaoge changed the title from "tidb cannot stuck when trying to cancel query" to "tidb stuck when trying to kill query" Jul 30, 2024
@guo-shaoge guo-shaoge changed the title from "tidb stuck when trying to kill query" to "query stuck when trying to kill it because of OOM" Jul 30, 2024
@yibin87
Contributor

yibin87 commented Jul 31, 2024

/assign @yibin87

@yibin87
Contributor

yibin87 commented Jul 31, 2024

Selecting Plan from the SLOW_QUERY table without a LIMIT clause is not recommended, and this is a long-standing issue. Downgrading the severity to major.

@yibin87
Contributor

yibin87 commented Jul 31, 2024

/remove severity-critical

@yibin87
Contributor

yibin87 commented Jul 31, 2024

/remove-severity critical

@yibin87
Contributor

yibin87 commented Jul 31, 2024

/severity major

@yibin87
Contributor

yibin87 commented Jul 31, 2024

On versions 6.5 and 7.5, the query hits a nil pointer dereference instead: ERROR 1105 (HY000): runtime error: invalid memory address or nil pointer dereference

@yibin87
Contributor

yibin87 commented Jul 31, 2024

v8.1.0 also gets stuck.

@yibin87
Contributor

yibin87 commented Aug 1, 2024

The plan is Sort <=== Projection <==== MemTableReader
After debugging, I located the goroutine that gets stuck.
First, the Projection executor tries to allocate memory:

e.memTracker.Consume(outputChk.MemoryUsage())

Then it triggers the memory action panic:

panic(err)

After entering runtime.gopanic, it invokes the recover function in executor::Next:

err = util.GetRecoverError(r)

Finally, it gets stuck when invoking the runtime's mcall(recovery).

@yibin87
Contributor

yibin87 commented Aug 1, 2024

It turned out that the hang described above occurs only when debugging on macOS; the real hang happens on the following stack:
[screenshot of the stack trace; image not preserved]

@yibin87
Contributor

yibin87 commented Aug 1, 2024

Checking the source code, I found that the unlock may not be invoked: [embedded code snippet not preserved]
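
Since the referenced snippet is not preserved here, the following is only a generic Go illustration of how an unlock can be skipped when a panic is recovered further up the stack. It is not the TiDB code in question; guarded, runManualUnlock, runDeferredUnlock, and stillLocked are all hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// guarded is a hypothetical type that protects some work with a mutex.
type guarded struct {
	mu sync.Mutex
}

// runManualUnlock unlocks only on the normal return path. If work panics,
// the Unlock call is never reached.
func (g *guarded) runManualUnlock(work func()) {
	g.mu.Lock()
	work()
	g.mu.Unlock()
}

// runDeferredUnlock registers the unlock with defer, so it also runs
// while a panic is unwinding the stack.
func (g *guarded) runDeferredUnlock(work func()) {
	g.mu.Lock()
	defer g.mu.Unlock()
	work()
}

// stillLocked calls run with work that panics, recovers the panic the way
// an outer handler would, and reports whether the mutex was left locked.
func stillLocked(g *guarded, run func(func())) bool {
	func() {
		defer func() { _ = recover() }()
		run(func() { panic("memory quota exceeded") })
	}()
	if g.mu.TryLock() { // TryLock requires Go 1.18 or later
		g.mu.Unlock()
		return false
	}
	return true
}

func main() {
	a, b := &guarded{}, &guarded{}
	fmt.Println("manual unlock leaves mutex locked:", stillLocked(a, a.runManualUnlock))
	fmt.Println("deferred unlock leaves mutex locked:", stillLocked(b, b.runDeferredUnlock))
}
```

The manual variant reports true and the deferred variant reports false. Any later goroutine that calls Lock on the leaked mutex would block forever, which is one way a recovered panic can turn into a stuck query.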

@yibin87
Contributor

yibin87 commented Aug 1, 2024

After fixing this, the hang disappeared, but I still do not know what is actually happening when the panic occurs.

@yibin87
Contributor

yibin87 commented Aug 1, 2024

/label affects-8.1

@ti-chi-bot ti-chi-bot bot added affects-8.1 This bug affects the 8.1.x(LTS) versions. and removed may-affects-8.1 labels Aug 1, 2024
@yibin87
Contributor

yibin87 commented Aug 1, 2024

/label affects-7.5

@yibin87
Contributor

yibin87 commented Aug 1, 2024

/label affects-7.1

@ti-chi-bot ti-chi-bot bot added affects-7.1 This bug affects the 7.1.x(LTS) versions. and removed may-affects-7.1 labels Aug 1, 2024
@yibin87
Contributor

yibin87 commented Aug 1, 2024

/label affects-6.5

@ti-chi-bot ti-chi-bot bot added affects-6.5 This bug affects the 6.5.x(LTS) versions. and removed may-affects-6.5 labels Aug 1, 2024
@yibin87
Contributor

yibin87 commented Aug 1, 2024

/label affects-6.1

@ti-chi-bot ti-chi-bot bot added affects-6.1 This bug affects the 6.1.x(LTS) versions. and removed may-affects-6.1 labels Aug 1, 2024
@yibin87
Contributor

yibin87 commented Aug 1, 2024

/remove-impact leak

@ti-chi-bot ti-chi-bot bot removed the impact/leak label Aug 1, 2024
@yibin87
Contributor

yibin87 commented Aug 1, 2024

/label affects-5.4

@ti-chi-bot ti-chi-bot bot added affects-5.4 This bug affects the 5.4.x(LTS) versions. and removed may-affects-5.4 This bug maybe affects 5.4.x versions. labels Aug 1, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in b0aced8 Aug 1, 2024
3 participants