[Bug]: Subscription may not be cleaned by datacoord under corner cases, causing quota exceeded in milvus #15371

Closed
1 task done
xiaofan-luan opened this issue Jan 24, 2022 · 4 comments
Labels: kind/bug, stale, triage/accepted

@xiaofan-luan
Collaborator

xiaofan-luan commented Jan 24, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:
- Deployment mode(standalone or cluster):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After #15353 is merged, subscriptions will be cleaned up by datacoord.

However, if a node crashes after a balance task, the subscription source recorded for the previous balance task may already have been deleted, so there is nowhere to look up the previous nodeId. The subscription created from the crashed node can then never be unsubscribed, and it leaks.

If this corner case occurs, the Pulsar backlog is never consumed because of the leaked subscription, and the system can no longer produce messages: producers fail with a quota exceeded exception.
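To make the sequence concrete, here is a toy model of the leak as described above. It is not Milvus code: `broker_subscriptions`, `assignments`, and the channel name are hypothetical stand-ins, and the model assumes datacoord keeps only one owner record per channel.

```python
# Toy model of the leak; every name here is hypothetical, not Milvus code.
broker_subscriptions = set()   # (node_id, channel) pairs held by the broker
assignments = {}               # datacoord's only record: channel -> node_id


def assign(channel, node_id):
    assignments[channel] = node_id
    broker_subscriptions.add((node_id, channel))


def balance(channel, new_node):
    # The record of the old owner is overwritten here; the old node is
    # expected to unsubscribe on its own once it releases the channel.
    assign(channel, new_node)


def cleanup_known_subscriptions():
    # Cleanup can only cover owners that are still recorded.
    for channel, node_id in list(assignments.items()):
        broker_subscriptions.discard((node_id, channel))
    assignments.clear()


assign("dml-channel-0", node_id=1)
balance("dml-channel-0", new_node=2)   # node 1 crashes before unsubscribing
cleanup_known_subscriptions()
leaked = sorted(broker_subscriptions)
print(leaked)  # [(1, 'dml-channel-0')]
```

The subscription held by node 1 survives cleanup because nothing records that node 1 ever owned the channel.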

Expected Behavior

Multiple tasks handled in parallel should be handled gracefully.

I would suggest adding more states in etcd. A task life cycle in channel_manager should be as follows:

  1. Add -> when the channel is assigned
  2. Offline(Added) -> when datacoord decides the channel should be closed at a datanode
  3. Delete -> when the channel is successfully removed from the datanode
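The three steps listed above could be sketched as a small state machine. This is only an illustration under assumptions: `ChannelState`, `ChannelStore`, and the `(channel, node_id)` key layout are hypothetical, and a plain dict stands in for etcd. The point is that the old owner's record stays around in the offline state until removal is confirmed, so the nodeId is still known after a crash.

```python
from enum import Enum


class ChannelState(Enum):
    ADD = "add"
    OFFLINE = "offline"
    DELETE = "delete"


# Allowed transitions, following the three steps listed above.
TRANSITIONS = {
    ChannelState.ADD: {ChannelState.OFFLINE},
    ChannelState.OFFLINE: {ChannelState.DELETE},
    ChannelState.DELETE: set(),
}


class ChannelStore:
    """In-memory stand-in for the etcd records (hypothetical layout)."""

    def __init__(self):
        self.records = {}  # (channel, node_id) -> ChannelState

    def add(self, channel, node_id):
        self.records[(channel, node_id)] = ChannelState.ADD

    def transition(self, channel, node_id, new_state):
        key = (channel, node_id)
        current = self.records[key]
        if new_state not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current.name} -> {new_state.name}")
        if new_state is ChannelState.DELETE:
            del self.records[key]  # drop only after the removal is confirmed
        else:
            self.records[key] = new_state

    def pending_unsubscribe(self):
        # Records still OFFLINE after a restart name the nodes to clean up.
        return sorted(k for k, s in self.records.items() if s is ChannelState.OFFLINE)


store = ChannelStore()
store.add("dml-channel-0", 1)
store.transition("dml-channel-0", 1, ChannelState.OFFLINE)  # balance away from node 1
store.add("dml-channel-0", 2)                               # new owner
print(store.pending_unsubscribe())  # [('dml-channel-0', 1)]
```

If node 1 crashes at this point, the offline record still identifies it, so datacoord can retry the unsubscribe and only then move the record to Delete.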

Steps To Reproduce

No response

Anything else?

As a workaround, you can use pulsar admin to clean up the existing subscriptions.
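For reference, a minimal sketch of that cleanup against the Pulsar admin REST API (the same operations `pulsar-admin topics subscriptions` and `pulsar-admin topics unsubscribe` perform). The admin URL, tenant, namespace, topic, and subscription names in the usage comment are placeholders, and deciding which subscriptions are actually leaked is left to the operator.

```python
import json
import urllib.parse
import urllib.request


def subscriptions_url(admin_url, tenant, namespace, topic):
    # Lists the subscriptions of a persistent topic.
    return f"{admin_url}/admin/v2/persistent/{tenant}/{namespace}/{topic}/subscriptions"


def unsubscribe_url(admin_url, tenant, namespace, topic, subscription):
    # Subscription names may contain characters that need escaping.
    sub = urllib.parse.quote(subscription, safe="")
    return f"{admin_url}/admin/v2/persistent/{tenant}/{namespace}/{topic}/subscription/{sub}"


def list_subscriptions(admin_url, tenant, namespace, topic):
    url = subscriptions_url(admin_url, tenant, namespace, topic)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def delete_subscription(admin_url, tenant, namespace, topic, subscription):
    url = unsubscribe_url(admin_url, tenant, namespace, topic, subscription)
    req = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Usage (placeholders; run against your own cluster):
#   admin = "http://localhost:8080"
#   for sub in list_subscriptions(admin, "public", "default", "my-topic"):
#       print(sub)
#   delete_subscription(admin, "public", "default", "my-topic", "leaked-sub")
```

As far as I know, Pulsar refuses to delete a subscription that still has active consumers, so this should only remove subscriptions whose consumer is really gone.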

@xiaofan-luan xiaofan-luan added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2022
@xiaofan-luan xiaofan-luan added this to the 2.0-Backlog milestone Jan 24, 2022
@xiaofan-luan xiaofan-luan added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2022
@yanliang567
Contributor

@sunby would you take this improvement?
/assign @sunby
/unassign

@sre-ci-robot sre-ci-robot assigned sunby and unassigned yanliang567 Jan 24, 2022
@xiaofan-luan xiaofan-luan changed the title from "[Bug]: Subscription may not be cleaned by datacoord under" to "[Bug]: Subscription may not be cleaned by datacoord under corner cases, causing quota exceeded in milvus" Jan 24, 2022
@stale

stale bot commented Feb 23, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Feb 23, 2022
@congqixia
Contributor

Working on extending channel_manager state. Let's keep this issue open.

@stale stale bot removed the stale indicates no udpates for 30 days label Feb 24, 2022
@stale

stale bot commented Mar 26, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Mar 26, 2022
@stale stale bot closed this as completed Apr 2, 2022