
RFC: Load-based replica read #105

Merged: 7 commits merged into tikv:master on Sep 18, 2023

Conversation

sticnarf (Contributor)

No description provided.

@sticnarf (Contributor Author) commented on Jan 12, 2023

This design only maintains the load of TiKV nodes in the client, and the client receives the load info only when ServerIsBusy is returned. I hope we can get enough benefit with this simple design.

I'm still not confident about the retry strategy in this document. Completely different strategy designs are welcome.


The current queue length is easily known, but we have to predict the average time slice in the near future. We can use the EWMA of the previous time slices to estimate it. $S_{now}$ is the average time slice length of the read pool in the past second. We update the latest EWMA $S_{i}$ every second using the following formula:

$$S_{i}=\alpha \cdot S_{now}+(1-\alpha) \cdot S_{i-1}$$
@ekexium (Contributor) commented on Jan 12, 2023

This seems a bit vague to me. Does $S_i$ represent an estimated value or observed data? Maybe we should distinguish them. Oh, I get it: $S_i$ always means the estimate, and only $S_{now}$ is observed data. We could use different symbols to distinguish them.

@sticnarf (Contributor Author)

Now I use $\hat S$ for the predicted value (EWMA) and $Y_{t}$ as the observed value.
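For concreteness, here is a minimal sketch of that update in Go (purely illustrative; the `TimeSliceEWMA` type, its field names, and the `alpha` knob are assumptions of this sketch rather than part of the RFC, and TiKV itself implements the server side in Rust). It computes $\hat S_{i}=\alpha \cdot Y_{i}+(1-\alpha) \cdot \hat S_{i-1}$:

```go
// Illustrative EWMA estimator for the read pool's average time slice.
package loadsketch

// TimeSliceEWMA holds the smoothing factor and the current prediction Ŝ.
type TimeSliceEWMA struct {
	alpha     float64 // smoothing factor in (0, 1]; an assumed tuning knob
	predicted float64 // Ŝ_{i-1}: current predicted time slice, in seconds
}

// Update folds the latest observation Y_i into the prediction:
// Ŝ_i = alpha*Y_i + (1-alpha)*Ŝ_{i-1}.
func (e *TimeSliceEWMA) Update(observed float64) {
	e.predicted = e.alpha*observed + (1-e.alpha)*e.predicted
}

// Predicted returns the current estimate Ŝ_i.
func (e *TimeSliceEWMA) Predicted() float64 { return e.predicted }
```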


Knowing the current queue length $L$ and the average time slice $S$ of the read pool, we can estimate that the wait duration is $T_{waiting} =L \cdot S$.

The current queue length is easily known, but we have to predict the average time slice in the near future. We can use the EWMA of the previous time slices to estimate it. $S_{now}$ is the average time slice length of the read pool in the past second. We update the latest EWMA $S_{i}$ every second using the following formula:
Contributor

It seems it can take up to 1 second for the mechanism to recognize a spike in load. Underestimating the load might undermine the optimization.
Would a shorter interval improve the sensitivity without introducing much more overhead?

@sticnarf (Contributor Author)

I changed it to 200ms. The average time slice does not change much under a load spike, so the update interval needn't be very short.
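Continuing the sketch above in the same hypothetical package (it additionally imports `time`), a 200 ms refresh loop could look like the following; `readPoolStats` is an assumed interface introduced only for illustration, not a real TiKV API:

```go
// readPoolStats is an assumed view of the read pool used only in this sketch.
type readPoolStats interface {
	QueueLen() int                  // L: current queue length
	ObservedTimeSlice() float64     // Y_i: average time slice over the last interval, in seconds
	PublishEstimatedWait(s float64) // expose T_waiting so it can be piggybacked on ServerIsBusy
}

// runEstimator refreshes T_waiting = L * Ŝ every 200 ms.
func runEstimator(stats readPoolStats, ewma *TimeSliceEWMA) {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		ewma.Update(stats.ObservedTimeSlice())
		waiting := float64(stats.QueueLen()) * ewma.Predicted()
		stats.PublishEstimatedWait(waiting)
	}
}
```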


Knowing the current queue length $L$ and the average time slice $S$ of the read pool, we can estimate that the wait duration is $T_{waiting} =L \cdot S$.

The current queue length is easily known, but we have to predict the average time slice in the near future. We can use the EWMA of the previous time slices to estimate it. $S_{now}$ is the average time slice length of the read pool in the past second. We update the latest EWMA $S_{i}$ every second using the following formula:
Contributor

When load is extremely low (e.g. there is only 1 large read request, or even 0), could it misestimate $S_{now}$ by simply calculating the average?

@sticnarf (Contributor Author)

Yes, it's a good point. I added a paragraph below for this case.


To make use of as many resources as possible, the load we predict should not be larger than the current load. Otherwise, we may skip a node that is already free to execute requests and not get the best performance.

We use `estimatedWait - time.Since(waitTimeUpdatedAt)` as the estimated waiting duration in the client. This estimate is almost certainly smaller than the real value because TiKV keeps accepting requests in the meantime and some queries don't finish within a single time slice.
@cfzjywxk (Contributor) commented on Jan 18, 2023

My initial thought was to let the client use observed metrics like cop_task_avg_wait_duration over a recent time interval, or something like that, to decide which replica to choose next. This `estimatedWait - time.Since(waitTimeUpdatedAt)` looks simpler and could avoid retrying already-busy replicas 🤔
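As a minimal client-side sketch of that expression (the `replicaLoad` bookkeeping and the `pickReplica` helper are hypothetical names used only for illustration; client-go's real types differ):

```go
package loadsketch

import "time"

// replicaLoad is hypothetical per-store bookkeeping, refreshed whenever a
// ServerIsBusy error carries new load information from that store.
type replicaLoad struct {
	estimatedWait     time.Duration // wait duration reported by the store
	waitTimeUpdatedAt time.Time     // when that report was received
}

// remainingWait decays the reported wait by the time elapsed since the
// report, so the estimate tends to undershoot the store's real load.
func (r *replicaLoad) remainingWait() time.Duration {
	remaining := r.estimatedWait - time.Since(r.waitTimeUpdatedAt)
	if remaining < 0 {
		return 0
	}
	return remaining
}

// pickReplica chooses the candidate with the smallest remaining estimated wait.
func pickReplica(candidates []*replicaLoad) *replicaLoad {
	if len(candidates) == 0 {
		return nil
	}
	best := candidates[0]
	for _, c := range candidates[1:] {
		if c.remainingWait() < best.remainingWait() {
			best = c
		}
	}
	return best
}
```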


Because we will retry in replica-read mode, we don't need the follower or learner to issue a read index RPC again after knowing the applied index.

What will the replica-read node do when its applied index is not satisfied?

@sticnarf (Contributor Author)

It waits until it has applied up to that index. This saves the read index RPC; the other procedures are the same as the original replica read.

@zhangjinpeng87 (Member)

Please also consider the cross-AZ data transfer fee when deploying TiKV across AZs.

@sticnarf (Contributor Author)

> Please also consider the cross-AZ data transfer fee when deploying TiKV across AZs.

If user experience is more important, this feature is also worth considering despite the extra cost.

Anyway, this mode is not currently available to users of the closest-replica/adaptive mode.

@ekexium (Contributor) commented on Sep 18, 2023

/merge

@ekexium merged commit 23a29b6 into tikv:master on Sep 18, 2023