[Bug]: [benchmark][standalone] search and query timeout #14077
rocksmq produces messages too slowly
rocksmq consumes messages too slowly
querynode first receives a query msg
querynode waits for the query msg until it times out
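A minimal sketch of the wait-with-deadline pattern behind the last symptom above (hypothetical names, not Milvus internals): if the message never arrives before the deadline, the consumer side returns a timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errWaitTimeout = errors.New("wait for query msg timed out")

// waitForMsg blocks until a message arrives on msgCh or the timeout expires.
func waitForMsg(ctx context.Context, msgCh <-chan string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	select {
	case msg := <-msgCh:
		return msg, nil
	case <-ctx.Done():
		return "", errWaitTimeout
	}
}

func main() {
	msgCh := make(chan string) // nothing is ever produced: stands in for a slow rocksmq
	_, err := waitForMsg(context.Background(), msgCh, 100*time.Millisecond)
	fmt.Println(err) // wait for query msg timed out
}
```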
It seems that RocksDB is too slow when accessing the disk.
The etcd log is normal, so it should not be etcd that caused the slow sending of timetick.
Rocksmq produce and consume both took a long time to acquire the lock.
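A minimal instrumentation sketch, assuming a rocksmq-like topic guarded by a single mutex (type and field names are illustrative, not the actual rocksmq code), showing one way to log how long produce waits for that lock:

```go
package main

import (
	"log"
	"sync"
	"time"
)

// topicQueue is a stand-in for a rocksmq topic guarded by one mutex.
type topicQueue struct {
	mu   sync.Mutex
	msgs [][]byte
}

// Produce appends a message and logs how long it waited to acquire the lock.
func (q *topicQueue) Produce(payload []byte) {
	start := time.Now()
	q.mu.Lock()
	waited := time.Since(start)
	defer q.mu.Unlock()

	if waited > 50*time.Millisecond {
		log.Printf("produce waited %v for the topic lock", waited)
	}
	q.msgs = append(q.msgs, payload)
}

func main() {
	q := &topicQueue{}
	q.Produce([]byte("hello"))
}
```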
argo task: benchmark-tag-cnrlf
test yaml:
server:
client pod: benchmark-tag-cnrlf-3228498151
client log:
Using local-path.
argo task: benchmark-tag-dcmmg
test yaml:
server:
client pod: benchmark-tag-dcmmg-2773654560
client log:
/unassign
Are we still hitting the same issue after the fix, @wangting0128?
/assign @wangting0128
Recoverable after query and search failure.
argo task: benchmark-px6ww
test yaml:
server:
client pod: benchmark-px6ww-1300394705
client log:
This seems to be a different issue.
Can't find the cause from the stack trace log. Tried to reproduce it on my computer: executed steps 1 to 6 mentioned above, then kept loading the collection and searching concurrently with 50 threads. QueryNode doesn't crash.
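For reference, a rough sketch of that reproduction attempt, with a hypothetical client interface standing in for the real SDK: one goroutine keeps reloading the collection while 50 goroutines search concurrently.

```go
package main

import (
	"log"
	"sync"
)

// searcher is a stand-in for whatever client talks to the collection.
type searcher interface {
	Load() error
	Search(vector []float32) error
}

// hammer keeps reloading the collection while `threads` goroutines search.
func hammer(c searcher, threads, rounds int) {
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < rounds; i++ {
			if err := c.Load(); err != nil {
				log.Printf("load failed: %v", err)
			}
		}
	}()

	for t := 0; t < threads; t++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			vec := make([]float32, 128)
			for i := 0; i < rounds; i++ {
				if err := c.Search(vec); err != nil {
					log.Printf("search failed: %v", err)
				}
			}
		}()
	}
	wg.Wait()
}

// noopClient lets the sketch compile and run without a live server.
type noopClient struct{}

func (noopClient) Load() error                   { return nil }
func (noopClient) Search(vector []float32) error { return nil }

func main() {
	hammer(noopClient{}, 50, 100)
}
```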
argo task: benchmark-no-clean-kxmp4
test yaml:
server:
client pod: benchmark-no-clean-kxmp4-1560983271
locust_report_2022-01-10_119.log
client logs:
Standalone restarted, but QueryNode recovery failed. RootCoord sets its state to healthy before registering its session to etcd, so when getRecoveryInfo calls RootCoord's describeCollection, RootCoord returns an error after two retries.
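A simplified sketch of that ordering (illustrative names, not the actual RootCoord code): if the healthy flag is set before the session is registered in etcd, a caller arriving in that window sees a "healthy" RootCoord whose session lookup still fails.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rootCoordLike models only the two flags relevant to the ordering bug.
type rootCoordLike struct {
	healthy           atomic.Bool
	sessionRegistered atomic.Bool
}

// registerSession stands in for writing the session key to etcd.
func (rc *rootCoordLike) registerSession() { rc.sessionRegistered.Store(true) }

// buggyStart mirrors the reported order: healthy first, session second.
func (rc *rootCoordLike) buggyStart() {
	rc.healthy.Store(true) // callers may already see a healthy coord here...
	rc.registerSession()   // ...before the session exists in etcd
}

// fixedStart registers the session before reporting healthy.
func (rc *rootCoordLike) fixedStart() {
	rc.registerSession()
	rc.healthy.Store(true)
}

func main() {
	rc := &rootCoordLike{}
	rc.healthy.Store(true) // the window in the middle of buggyStart
	fmt.Println("healthy:", rc.healthy.Load(), "session registered:", rc.sessionRegistered.Load())
}
```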
argo task: benchmark-tag-sk47h
test yaml:
server:
client pod: benchmark-tag-sk47h-2692756289
locust_report_2022-01-12_149.log
client log:
Search panic: after the standalone restart, QueryCoord cannot obtain recovery information through DataCoord.GetRecoveryInfo when recovering the node load, so standalone panics again. However, QueryCoord clears the QueryNode information that still needs to be recovered before the panic, which is wrong.
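A hedged sketch of the behavior implied above (hypothetical names, not the real QueryCoord code): the pending-recovery record should only be cleared after GetRecoveryInfo succeeds, so a failure or panic leaves the node queued for the next attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// recoveryQueue tracks which nodes still need their load recovered.
type recoveryQueue struct {
	pending map[int64]struct{}
}

// recoverNode clears the pending record only after recovery succeeds,
// so a failure (or a panic before this point) leaves it in place for retry.
func (q *recoveryQueue) recoverNode(nodeID int64, getRecoveryInfo func(int64) error) error {
	if err := getRecoveryInfo(nodeID); err != nil {
		return fmt.Errorf("recover node %d: %w", nodeID, err)
	}
	delete(q.pending, nodeID)
	return nil
}

func main() {
	q := &recoveryQueue{pending: map[int64]struct{}{1: {}}}
	err := q.recoverNode(1, func(int64) error { return errors.New("DataCoord.GetRecoveryInfo unavailable") })
	fmt.Println(err, "- pending nodes left:", len(q.pending)) // the node stays queued for retry
}
```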
Search panic after
Have checked all code diffs under the directory "internal/core" between rc8 and latest; no findings.
#run1
#run2 (panic)
#run3 (panic)
#run4
Any progress?
Still running.
#run5
#run6
#run7
argo task: benchmark-tag-s7642
test yaml:
server:
client pod: benchmark-tag-s7642-4095128662
client log:
Ran the argo task 6 times; the root cause of this issue MUST be in the code changes to "core" between 2021/12/15 ~ 2021/12/22.
Code changes to "core" between 2021/12/15 ~ 2021/12/22:
Compared the code diff between 2021/11/30 ~ 2021/12/31; found no suspicious PR that could cause the search panic.
Have built a docker image that supports coredump. Using this docker image to verify the search panic.
Removing the urgent label, as #15133 was fixed in 2.0.1.
verify: benchmark-d4l42
Is there an existing issue for this?
Environment
Current Behavior
client pod: benchmark-tag-bc6ld-3487203131
client logs:
Expected Behavior
argo task: benchmark-tag-bc6ld
test yaml:
client-configmap: client-random-locust-100m-ddl-r8-w2
server-configmap: server-single-32c128m
server:
Steps To Reproduce
Anything else?
client-random-locust-100m-ddl-r8-w2:
scene_test: