TiDB memory leak after hitting "fail to get stats version for this histogram" #54022

Closed
Rustin170506 opened this issue Jun 14, 2024 · 6 comments
Labels: affects-7.1 (affects the 7.1.x LTS versions), affects-7.5 (affects the 7.5.x LTS versions), impact/leak, report/customer (customers have encountered this bug), severity/major, sig/planner (SIG: Planner), type/bug (the issue is confirmed as a bug)


@Rustin170506 (Member)

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Change the async stats loading code so that histogram loading always fails, and disable synchronous loading:
diff --git a/pkg/sessionctx/variable/tidb_vars.go b/pkg/sessionctx/variable/tidb_vars.go
index e10aa9c5bbb9f..39153b4433448 100644
--- a/pkg/sessionctx/variable/tidb_vars.go
+++ b/pkg/sessionctx/variable/tidb_vars.go
@@ -1287,7 +1287,7 @@ const (
 	DefTiDBTableCacheLease                         = 3 // 3s
 	DefTiDBPersistAnalyzeOptions                   = true
 	DefTiDBEnableColumnTracking                    = false
-	DefTiDBStatsLoadSyncWait                       = 100
+	DefTiDBStatsLoadSyncWait                       = 0
 	DefTiDBStatsLoadPseudoTimeout                  = true
 	DefSysdateIsNow                                = false
 	DefTiDBEnableMutationChecker                   = false
diff --git a/pkg/statistics/handle/storage/read.go b/pkg/statistics/handle/storage/read.go
index bacf43abef60f..e1f43e8ecdc65 100644
--- a/pkg/statistics/handle/storage/read.go
+++ b/pkg/statistics/handle/storage/read.go
@@ -539,7 +539,7 @@ func loadNeededColumnHistograms(sctx sessionctx.Context, statsCache util.StatsCa
 	if err != nil {
 		return errors.Trace(err)
 	}
-	if len(rows) == 0 {
+	if true {
 		logutil.BgLogger().Error("fail to get stats version for this histogram", zap.Int64("table_id", col.TableID), zap.Int64("hist_id", col.ID))
 		return errors.Trace(fmt.Errorf("fail to get stats version for this histogram, table_id:%v, hist_id:%v", col.TableID, col.ID))
 	}
@@ -599,7 +599,7 @@ func loadNeededIndexHistograms(sctx sessionctx.Context, statsCache util.StatsCac
 	if err != nil {
 		return errors.Trace(err)
 	}
-	if len(rows) == 0 {
+	if true {
 		logutil.BgLogger().Error("fail to get stats version for this histogram", zap.Int64("table_id", idx.TableID), zap.Int64("hist_id", idx.ID))
 		return errors.Trace(fmt.Errorf("fail to get stats version for this histogram, table_id:%v, hist_id:%v", idx.TableID, idx.ID))
 	}
  2. Create some tables and analyze them.
  3. Issue some SELECT queries against TiDB.

2. What did you expect to see? (Required)

The memory usage of TiDB does not increase.

3. What did you see instead (Required)

(screenshots of TiDB memory usage omitted)

4. What is your TiDB version? (Required)

v7.5.1

@Rustin170506 added the type/bug label on Jun 14, 2024
@hawkingrei added the sig/planner label on Jun 14, 2024
@kennedy8312

/epic leak

@Rustin170506 (Member, Author)

The memory leak issue:

  1. The issue was triggered by a persistent error: the user adjusted tidb_opt_objective, which hit a bug in the stats code and caused the index histogram to be loaded asynchronously, repeatedly logging: [2024/06/12 01:53:27.176 +00:00] [ERROR] [read.go:603] ["fail to get stats version for this histogram"] [table_id=143] [hist_id=1]
  2. On the error path, our util method drops the session context instead of putting it back into the pool (see the sketch after this list).
    (screenshot omitted)
  3. Not returning it to the pool is not the worst part. The worst part is that we also record every session internally to optimize GC safe-point advancement, so a discarded session cannot be reclaimed by the Go garbage collector because it is still referenced there.
    (screenshot omitted)
  4. Because the pool is essentially unbounded, new sessions are created all the time, and an internal reference is kept for each of them.
    (screenshot omitted)
  5. As a result: (1) memory keeps growing; (2) even after the stats failure is fixed by dropping the stats, the already-allocated memory cannot be freed, because the internal references block GC; (3) this also explains why that portion of memory stopped growing once the problem was fixed; and (4) why the problem went away after restarting the other node.
  6. A record of a solid reproduction:
    a. Make stats loading report errors directly and turn off synchronous loading (as in the diff in the issue description).
    (screenshot omitted)
    b. Create some tables, analyze them, and then run a query to trigger statistics loading.
    c. The leak is visible in the memory profiles and monitoring.
    (memory profile and monitoring screenshots omitted)
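To make the leak pattern concrete, here is a minimal, self-contained Go sketch of what points 2-4 describe. The names (sessionPool, tracked, loadHistogramBuggy, loadHistogramFixed, readStatsRow) are hypothetical and only illustrate the shape of the bug; they are not TiDB's actual session pool API.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// session stands in for a heavyweight session context.
type session struct{ buf [1 << 20]byte }

type sessionPool struct {
	mu   sync.Mutex
	idle []*session
	// Every session ever created is also registered here so its start time
	// can be used to advance the GC safe point; this reference keeps the
	// session reachable even if it is never returned to the pool.
	tracked map[*session]struct{}
}

func (p *sessionPool) get() *session {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.idle); n > 0 {
		s := p.idle[n-1]
		p.idle = p.idle[:n-1]
		return s
	}
	s := &session{}
	p.tracked[s] = struct{}{}
	return s
}

func (p *sessionPool) put(s *session) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, s)
}

// Buggy pattern: the session is returned only on success, so every failed
// histogram load creates and strands one tracked session.
func loadHistogramBuggy(p *sessionPool) error {
	s := p.get()
	if err := readStatsRow(s); err != nil {
		return err // s is dropped here, but p.tracked still references it
	}
	p.put(s)
	return nil
}

// Fixed pattern: always return the session, regardless of the error.
func loadHistogramFixed(p *sessionPool) error {
	s := p.get()
	defer p.put(s)
	return readStatsRow(s)
}

// readStatsRow simulates the persistent failure seen in the logs.
func readStatsRow(_ *session) error {
	return errors.New("fail to get stats version for this histogram")
}

func main() {
	p := &sessionPool{tracked: map[*session]struct{}{}}
	for i := 0; i < 1000; i++ {
		_ = loadHistogramBuggy(p) // each failing call strands a new session
	}
	fmt.Println("tracked:", len(p.tracked), "idle:", len(p.idle)) // tracked: 1000 idle: 0
}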

@Rustin170506 (Member, Author)

How to trigger the bug and the root cause:

  1. Looking at the user's table structure, the amount of data, and the AskTUG information, one characteristic stands out: the problematic tables are very small and do not even reach the auto-analyze threshold; the tables involved in the oncall have only 2-4 rows.
  2. The parameters the user adjusted around that time:
    05/29 17:00 (UTC+09:00): tidb_auto_analyze_ratio=0.1, tidb_auto_analyze_partition_batch_size=128
    05/30 18:30: tidb_opt_objective='determinate'
  3. The problem logs actually started appearing on the 30th.
  4. Combined with the fact that the failing code path is the one that loads index histograms, we can conclude that queries were trying to load the index statistics of these tables.
  5. Looking at the code again, there is only one place that could cause an index's stats to be loaded incorrectly:
    (screenshot omitted)
  6. The only way this can go wrong is if the physical ID isn't -1.
  7. Looking upward, the only parameter that can make the physical ID not -1 is allowPseudoTblTriggerLoading:
    (screenshot omitted)
  8. Looking at where this parameter is set, we see a familiar name: OptObjectiveDeterminate.

(screenshot omitted)
Based on the above, I suspect that after enabling this option, under certain conditions it triggers loading of the index statistics of a table whose index actually has no statistics at all, which leads to an endless retry loop. A sketch of this suspected decision path follows.
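The suspicion above can be illustrated with a small, hedged Go sketch. statsTable and chooseHistogramTarget are made-up names that only mirror the decision described in points 5-8 (a pseudo-stats table normally keeps physical ID -1 and triggers no load, but allowPseudoTblTriggerLoading, driven by OptObjectiveDeterminate, lets the real physical ID through); this is not TiDB's actual code.

package main

import "fmt"

type statsTable struct {
	physicalID int64
	pseudo     bool // the table has no real statistics
}

// chooseHistogramTarget mirrors the suspected decision: with the option off, a
// pseudo table yields physical ID -1 and no async load is requested; with it
// on, the real physical ID is used and an async load is queued for an index
// histogram that has no row in the system table, so the load fails forever.
func chooseHistogramTarget(tbl statsTable, allowPseudoTblTriggerLoading bool) (physicalID int64, load bool) {
	if tbl.pseudo && !allowPseudoTblTriggerLoading {
		return -1, false
	}
	return tbl.physicalID, true
}

func main() {
	tbl := statsTable{physicalID: 143, pseudo: true}

	id, load := chooseHistogramTarget(tbl, false)
	fmt.Println("determinate off:", id, load) // -1 false: no load requested

	id, load = chooseHistogramTarget(tbl, true)
	fmt.Println("determinate on: ", id, load) // 143 true: load of a missing histogram
}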

@Rustin170506 (Member, Author) commented Jun 17, 2024

Summary:

  1. We were able to reproduce the problem along this path. The only thing we are not quite sure about is what causes the histogram records for the index to disappear from the system table; I removed them manually, and kunqin simulated it by dropping the stats.
  2. At present, we can determine that tidb_opt_objective='determinate' is a necessary condition for reproduction.
  3. The affected tables are relatively small and have no statistics, so if the problem is found in time, re-analyzing these small tables resolves it.
  4. The fix, as kunqin suggested, is to add another check when loading so that we avoid loading this kind of index information (see the sketch after this list).
  5. Because the default value of tidb_opt_objective is not 'determinate', the impact of this problem is not very large, and the probability of users hitting it is small.
  6. As for how the histogram records get lost in the first place, tracking that down directly would be very costly, so I think we should first fix the unreasonable loading behavior itself and add more logging at the places where this can happen, then continue investigating after it reproduces again (not the leak itself, but the later scenario: the index exists in the cached index info while the system table has no histogram record for it).
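Here is a hedged Go sketch of the fix direction in point 4; histExists and asyncLoadIndexHist are hypothetical helpers, not TiDB functions. The idea is to verify that a histogram row actually exists before treating its absence as a retryable error.

package main

import (
	"errors"
	"log"
)

var errNoStatsVersion = errors.New("fail to get stats version for this histogram")

// histExists stands in for a lookup along the lines of
// SELECT ... FROM mysql.stats_histograms WHERE table_id = ? AND hist_id = ?.
func histExists(tableID, histID int64) bool {
	return false // simulate the customer case: the row is missing
}

func asyncLoadIndexHist(tableID, histID int64) error {
	if !histExists(tableID, histID) {
		// Extra check: log and drop the request instead of failing forever.
		log.Printf("skip loading index histogram without a stats row, table_id=%d, hist_id=%d", tableID, histID)
		return nil
	}
	// ... load the histogram as before ...
	return errNoStatsVersion // not reached in this simulation
}

func main() {
	_ = asyncLoadIndexHist(143, 1)
}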

@Rustin170506 (Member, Author) commented Jun 17, 2024

The way to reproduce it:

create table t(a int, b int, index ia(a));

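# Drop the stats so that index ia has no histogram rows in the system tables (the condition simulated by dropping stats, per the summary above).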
drop stats t;

insert into t value(1,1), (2,2);

# Wait for a while until you can see the stats for the table.
show stats_meta; 

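# Necessary condition for the repro: determinate mode is what triggers loading of the missing index stats.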
set tidb_opt_objective='determinate';

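# The filter on a should make the optimizer request the histogram of index ia, which has no stats row, triggering the failing async load.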
explain select * from t where a = 1 and b = 1;

@seiya-annie

/report customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Aug 2, 2024