[Ruler] Ruler high availability issue #390

Open
frezes opened this issue Nov 8, 2023 · 3 comments

Comments

@frezes
Collaborator

frezes commented Nov 8, 2023

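In the current design, Ruler is deployed as a StatefulSet, with one independent Ruler per cluster that evaluates recording rules and some of the alerting rules. However, when a node goes down, the StatefulSet's Pod is not rescheduled onto another node, so the Ruler can stop working.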

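Possible fixes:

  1. Add a PDB or another policy so that the StatefulSet Pod is rescheduled or recreated (whether a PDB works remains to be verified; see here).
  2. Increase the number of Pod replicas (this requires handling duplicated recording rule data and the loss of precision in `irate` calculations).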
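
To make the concern in fix 2 concrete, here is a minimal sketch of one way duplicated recording rule samples could distort `irate`, and how keeping a single replica's samples avoids it. It assumes two hypothetical replicas (`ruler-0`, `ruler-1`) that evaluate the same rule 1 s apart and see the same underlying value; the helper names and the "latest sample wins" dedup policy are illustrative assumptions, not anything Ruler implements today. The `irate` helper only mirrors the "last two samples" behavior and ignores counter resets.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: List[Sample]) -> float:
    """Per-second rate from the last two samples (counter resets ignored)."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def dedup_by_replica(series: Dict[str, List[Sample]]) -> List[Sample]:
    """Keep one replica's samples: the replica with the newest sample wins."""
    best = max(series, key=lambda name: series[name][-1][0])
    return series[best]


# Hypothetical data: both replicas record the same value,
# but ruler-1 evaluates 1 s after ruler-0.
replicas = {
    "ruler-0": [(0.0, 100.0), (30.0, 400.0), (60.0, 700.0)],
    "ruler-1": [(1.0, 100.0), (31.0, 400.0), (61.0, 700.0)],
}

# Naive merge: interleave every replica's samples by timestamp.
merged = sorted(s for samples in replicas.values() for s in samples)

naive = irate(merged)                      # last two samples: (60, 700), (61, 700)
clean = irate(dedup_by_replica(replicas))  # last two samples: (31, 400), (61, 700)
print(naive, clean)
```

With the naive merge, the last two samples come from different replicas and carry the same value, so the computed rate collapses to zero; after deduplication the rate reflects the real 30 s interval.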
@frezes
Collaborator Author

frezes commented Nov 10, 2023

A PDB cannot protect against involuntary disruptions (such as a node going down), so using a PDB as proposed in fix 1 is not feasible.

@frezes
Collaborator Author

frezes commented Nov 10, 2023

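In the current design, Ruler is deployed as a StatefulSet, with one independent Ruler per cluster that evaluates recording rules and some of the alerting rules. However, when a node goes down, the StatefulSet's Pod is not rescheduled onto another node, so the Ruler can stop working.

Possible fixes:

  1. Add a PDB or another policy so that the StatefulSet Pod is rescheduled or recreated (whether a PDB works remains to be verified; see here).
  2. Increase the number of Pod replicas (this requires handling duplicated recording rule data and the loss of precision in `irate` calculations).
  3. For other self-implemented approaches, see: TiDB failover strategy.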
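
As a rough illustration of what a self-implemented failover check could look like, the sketch below selects Ruler Pods whose node has been NotReady for longer than a grace period. Everything here is an assumption for discussion: the `PodState` type, the 300 s default, and the Pod and node names are hypothetical, and this is not TiDB's implementation. A real controller would read this state from the Kubernetes API and then decide how to act on the candidates, for example by creating a replacement or force-deleting the stuck Pod, with the usual caveat that force deletion of StatefulSet Pods risks two Pods running with the same identity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PodState:
    name: str
    node: str
    # Unix seconds when the node became NotReady; None if the node is Ready.
    node_not_ready_since: Optional[float]


def pods_to_fail_over(
    pods: List[PodState], now: float, grace_seconds: float = 300.0
) -> List[str]:
    """Return names of Pods whose node has been NotReady longer than the grace period."""
    return [
        p.name
        for p in pods
        if p.node_not_ready_since is not None
        and now - p.node_not_ready_since > grace_seconds
    ]


# Hypothetical snapshot of three per-cluster Ruler Pods.
pods = [
    PodState("ruler-cluster-a-0", "node-1", None),     # node is Ready
    PodState("ruler-cluster-b-0", "node-2", 1_000.0),  # NotReady for 400 s
    PodState("ruler-cluster-c-0", "node-3", 1_250.0),  # NotReady for 150 s
]
candidates = pods_to_fail_over(pods, now=1_400.0)
print(candidates)
```

The grace period keeps short node flaps from triggering a failover; only the Pod on `node-2` exceeds it in this snapshot.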

@benjaminhuo
Member

Running multiple rulers per tenant is somewhat wasteful. We could consider an implementation in which multiple tenants share multiple rulers.
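
One possible way to sketch the shared-pool idea is rendezvous (highest-random-weight) hashing, where each tenant is mapped to a small subset of a common ruler pool. This is only an assumption about how sharing might work, not a proposed design: `assign_rulers`, the tenant and ruler names, and the choice of two rulers per tenant are all hypothetical.

```python
import hashlib
from typing import List


def assign_rulers(tenant: str, rulers: List[str], replicas: int = 2) -> List[str]:
    """Pick `replicas` rulers for a tenant via rendezvous hashing."""

    def weight(ruler: str) -> int:
        # Stable per-(tenant, ruler) score derived from a hash.
        digest = hashlib.sha256(f"{tenant}/{ruler}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring rulers own this tenant's rules.
    return sorted(rulers, key=weight, reverse=True)[:replicas]


# Hypothetical shared pool of four rulers.
rulers = ["ruler-0", "ruler-1", "ruler-2", "ruler-3"]
assigned = assign_rulers("tenant-a", rulers)
print(assigned)
```

A property of this scheme is that removing a ruler that does not serve a tenant leaves that tenant's assignment unchanged, so a node failure only reshuffles the tenants that were on the lost ruler.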
