
[Bug]: [benchmark][cluster] Milvus hang when insert data #14747

Closed
wangting0128 opened this issue Jan 4, 2022 · 9 comments
Labels: kind/bug · priority/critical-urgent · test/benchmark · triage/accepted
Milestone: 2.0.0-GA

@wangting0128
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20211231-9baa6e8
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.0.0rc9.dev22
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

client pod: benchmark-tag-nc88q-4263525061

locust_report_2021-12-31_193.log

client log:

[2021-12-31 15:11:02,723] [   DEBUG] - Milvus insert run in 1.6924s (milvus_benchmark.client:53)
[2021-12-31 15:11:02,727] [   DEBUG] - Row count: 142181696 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2021-12-31 15:11:02,728] [   DEBUG] - 142181696 (milvus_benchmark.runners.base:89)
[2021-12-31 15:11:03,177] [   DEBUG] - Start id: 360750000, end id: 360800000 (milvus_benchmark.runners.base:76)
[2021-12-31 15:11:05,344] [   DEBUG] - Milvus insert run in 2.1643s (milvus_benchmark.client:53)
[2021-12-31 15:11:05,349] [   DEBUG] - Row count: 142181696 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2021-12-31 15:11:05,350] [   DEBUG] - 142181696 (milvus_benchmark.runners.base:89)
[2021-12-31 15:11:07,405] [   DEBUG] - Start id: 360800000, end id: 360850000 (milvus_benchmark.runners.base:76)

Expected Behavior

argo task: benchmark-tag-nc88q

test yaml:
client-configmap:client-random-locust-search-84h-1b
server-configmap:server-cluster-8c64m-datanode2-indexnode4-querynode6

server:

NAME                                                              READY   STATUS      RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-tag-nc88q-1-etcd-0                                      1/1     Running     0          3d17h   10.97.16.118   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-etcd-1                                      1/1     Running     0          3d17h   10.97.17.78    qa-node014.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-etcd-2                                      1/1     Running     0          3d17h   10.97.16.117   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-datacoord-547dc7547d-2s69x           1/1     Running     0          3d17h   10.97.14.229   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-datanode-d947cc4bc-75tq6             1/1     Running     1          3d17h   10.97.10.87    qa-node008.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-datanode-d947cc4bc-hv4kx             1/1     Running     1          3d17h   10.97.16.113   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-indexcoord-6d974fcd55-cqzdk          1/1     Running     0          3d17h   10.97.14.228   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-indexnode-75fbf69bfd-nflrv           1/1     Running     101        3d17h   10.97.14.227   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-indexnode-75fbf69bfd-pnbsf           1/1     Running     16         3d17h   10.97.12.130   qa-node015.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-indexnode-75fbf69bfd-vq8tk           1/1     Running     150        3d17h   10.97.19.206   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-indexnode-75fbf69bfd-zkqql           1/1     Running     147        3d17h   10.97.16.112   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-proxy-6ffd49bc99-gj8n2               1/1     Running     0          3d17h   10.97.14.230   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-querycoord-655df9d48d-8fcsk          1/1     Running     0          3d17h   10.97.14.226   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-querynode-5654f66b44-5gfpz           1/1     Running     0          3d17h   10.97.14.231   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-querynode-5654f66b44-bslxx           1/1     Running     0          3d17h   10.97.20.128   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-querynode-5654f66b44-dg74h           1/1     Running     0          3d17h   10.97.20.127   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-querynode-5654f66b44-dpg8q           1/1     Running     0          3d17h   10.97.17.77    qa-node014.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-querynode-5654f66b44-l8wwc           1/1     Running     13         3d17h   10.97.19.205   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-querynode-5654f66b44-m6wpc           1/1     Running     0          3d17h   10.97.16.114   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-milvus-rootcoord-5c46bbcd94-8xb94           1/1     Running     0          3d17h   10.97.14.225   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-minio-0                                     1/1     Running     98         3d17h   10.97.19.210   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-minio-1                                     1/1     Running     98         3d17h   10.97.19.208   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-minio-2                                     1/1     Running     14         3d17h   10.97.12.132   qa-node015.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-minio-3                                     1/1     Running     99         3d17h   10.97.19.207   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-autorecovery-69764fbf4d-2f2gw        1/1     Running     0          3d17h   10.97.11.190   qa-node009.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-bastion-85b47ddcc6-vwsd8             1/1     Running     0          3d17h   10.97.3.20     qa-node001.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-bookkeeper-0                         1/1     Running     0          3d17h   10.97.9.94     qa-node007.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-bookkeeper-1                         1/1     Running     0          3d17h   10.97.11.191   qa-node009.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-bookkeeper-2                         1/1     Running     0          3d17h   10.97.3.37     qa-node001.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-broker-55dcfb458d-gsrdh              1/1     Running     0          3d17h   10.97.17.75    qa-node014.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-proxy-6f56b87896-p4bv9               2/2     Running     0          3d17h   10.97.13.129   qa-node010.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-zookeeper-0                          1/1     Running     0          3d17h   10.97.3.21     qa-node001.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-zookeeper-1                          1/1     Running     0          3d17h   10.97.8.129    qa-node006.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-zookeeper-2                          1/1     Running     0          3d17h   10.97.8.132    qa-node006.zilliz.local   <none>           <none>
benchmark-tag-nc88q-1-pulsar-zookeeper-metadata-dfnpm             0/1     Completed   0          3d17h   10.97.8.125    qa-node006.zilliz.local   <none>           <none>
redash-postgres-postgresql-0                                      1/1     Running     0          145d    10.97.5.64     qa-node003.zilliz.local   <none>           <none>

Steps To Reproduce

1. create collection
2. build index
3. insert 1 billion vectors <- Milvus hangs (see the sketch below)
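
A minimal pymilvus sketch of these steps; the connection details, field names, and index parameters are assumptions (the actual benchmark config lives in the client-configmap referenced below):

import random
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Assumed proxy address; the schema and IVF_SQ8 params below are placeholders.
connections.connect(host='127.0.0.1', port='19530')

fields = [
    FieldSchema('id', DataType.INT64, is_primary=True),
    FieldSchema('vec', DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection('sift_1b_128_l2', CollectionSchema(fields))
collection.create_index('vec', {'index_type': 'IVF_SQ8', 'metric_type': 'L2',
                                'params': {'nlist': 1024}})

batch = 50000  # matches the 50k-id windows in the client log below
for start in range(0, 1_000_000_000, batch):
    ids = list(range(start, start + batch))
    vectors = [[random.random() for _ in range(128)] for _ in range(batch)]
    collection.insert([ids, vectors])  # hangs partway through on the affected build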

Anything else?

No response

@wangting0128 wangting0128 added kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. test/benchmark benchmark test labels Jan 4, 2022
@yanliang567 yanliang567 added this to the 2.0.0-GA milestone Jan 4, 2022
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 4, 2022
@yanliang567
Contributor

/assign @sunby
/unassign

@sre-ci-robot sre-ci-robot assigned sunby and unassigned yanliang567 Jan 4, 2022
@congqixia
Contributor

The proxy failed to create a Pulsar producer because the topic's backlog quota was exceeded:

time="2022-01-01T15:11:10Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=benchmark-tag-nc88q-1-pulsar-0-28 topic="persistent://public/default/by-dev-rootcoord-dml_0" 
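
To confirm the backlog-quota block independently of Milvus, a minimal sketch with the pulsar-client Python package can try to create a producer on the same topic (the proxy service URL is an assumption):

import pulsar

# Assumed service URL; point it at the pulsar-proxy of your deployment.
client = pulsar.Client('pulsar://benchmark-tag-nc88q-1-pulsar-proxy:6650')
try:
    # With a producer-blocking backlog policy, this fails with
    # ProducerBlockedQuotaExceededException once the quota is hit.
    producer = client.create_producer('persistent://public/default/by-dev-rootcoord-dml_0')
    producer.close()
    print('producer created: backlog quota not exceeded')
except Exception as exc:
    print(f'producer creation failed: {exc}')
finally:
    client.close()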

@congqixia
Contributor

The data node failed to create a datasyncservice because of a MinIO error: "Specified key does not exist."
MinIO also restarted hundreds of times:

benchmark-tag-nc88q-1-minio-0                                     1/1     Running     100        3d19h
benchmark-tag-nc88q-1-minio-1                                     1/1     Running     101        3d19h
benchmark-tag-nc88q-1-minio-2                                     1/1     Running     14         3d19h
benchmark-tag-nc88q-1-minio-3                                     1/1     Running     101        3d19h

Some highlighted MinIO error logs:

Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164092354297857%2F100%2F430164116930297857&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164156592685058%2F100%2F430164169214918657&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164092354297857%2F100%2F430164108987334657&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164156592685058%2F100%2F430164167275053057&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164188744646658%2F100%2F430164210790957057&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164188744646658%2F100%2F430164203909677057&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164156592685058%2F100%2F430164160931168257&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164156592685058%2F100%2F430164184628985857&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164216230445057%2F100%2F430164239941369857&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164216230445058%2F100%2F430164222757830657&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164156592685058%2F100%2F430164178389958657&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164156592685058%2F100%2F430164180172537857&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164156592685057%2F100%2F430164160944275457&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/readversion?check-data-dir=true&disk-id=6ca7377d-a56d-4a76-8a58-1a0d5d90ed9a&file-path=file%2Finsert_log%2F430162438343822657%2F430162438343822658%2F430164125253369857%2F100%2F430164130247213057&version-id=&volume=milvus-bucket": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/lock/export/v4/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
Marking http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22 temporary offline; caused by Post "http://benchmark-tag-nc88q-1-minio-3.benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000/minio/storage/export/v22/health?": dial tcp 10.97.19.207:9000: connect: connection refused
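
To check whether the object behind "Specified key does not exist." is actually present, a minimal sketch with the minio Python client can stat one of the keys decoded from the readversion requests above (the credentials are assumptions; the endpoint is the in-cluster service from the logs):

from minio import Minio
from minio.error import S3Error

# Credentials assumed to be the Milvus helm defaults.
client = Minio('benchmark-tag-nc88q-1-minio-svc.qa-milvus.svc.cluster.local:9000',
               access_key='minioadmin', secret_key='minioadmin', secure=False)

# Decoded file-path from one of the readversion requests above.
key = ('file/insert_log/430162438343822657/430162438343822658/'
       '430164092354297857/100/430164116930297857')
try:
    stat = client.stat_object('milvus-bucket', key)
    print('key exists:', stat.size, 'bytes')
except S3Error as exc:
    print('key missing:', exc.code)  # NoSuchKey matches the reported error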

@ThreadDao
Contributor

I also encountered this problem. Flush hangs after inserting 139779 entities (dim=512). mic-memory-minio-2 restarted once and mic-memory-etcd-2 restarted twice:

mic-memory-etcd-0                                                 1/1     Running     0          3h33m
mic-memory-etcd-1                                                 1/1     Running     0          3h33m
mic-memory-etcd-2                                                 1/1     Running     2          3h33m
mic-memory-milvus-datacoord-7956b49bbb-qsm8d                      1/1     Running     0          3h30m
mic-memory-milvus-datanode-7c79c7cf65-tvbct                       1/1     Running     0          3h30m
mic-memory-milvus-indexcoord-5fd6577fdc-84ggs                     1/1     Running     0          3h30m
mic-memory-milvus-indexnode-dbd6748f6-482d8                       1/1     Running     0          3h30m
mic-memory-milvus-proxy-6499c76645-cc5nk                          1/1     Running     0          3h30m
mic-memory-milvus-querycoord-8475865bfb-bpwz5                     1/1     Running     0          3h30m
mic-memory-milvus-querynode-664c94bf68-klmlg                      1/1     Running     0          3h30m
mic-memory-milvus-rootcoord-9b6f956f8-vb497                       1/1     Running     0          3h30m
mic-memory-minio-0                                                1/1     Running     0          3h33m
mic-memory-minio-1                                                1/1     Running     0          3h33m
mic-memory-minio-2                                                1/1     Running     1          3h33m
mic-memory-minio-3                                                1/1     Running     0          3h33m
mic-memory-pulsar-bookie-0                                        1/1     Running     0          3h33m
mic-memory-pulsar-bookie-1                                        1/1     Running     0          3h33m
mic-memory-pulsar-bookie-2                                        1/1     Running     0          3h33m
mic-memory-pulsar-broker-0                                        1/1     Running     0          3h33m
mic-memory-pulsar-broker-1                                        1/1     Running     0          3h33m
mic-memory-pulsar-proxy-0                                         1/1     Running     0          3h33m
mic-memory-pulsar-proxy-1                                         1/1     Running     0          3h33m
mic-memory-pulsar-recovery-0                                      1/1     Running     0          3h33m
mic-memory-pulsar-toolset-0                                       1/1     Running     0          3h33m
mic-memory-pulsar-zookeeper-0                                     1/1     Running     0          3h33m
mic-memory-pulsar-zookeeper-1                                     1/1     Running     0          3h33m
mic-memory-pulsar-zookeeper-2                                     1/1     Running     0          3h32m
  1. Image version: master-20220104-47e19fd
  2. Steps to reproduce (the imports below are an assumption about where cf, log, and ApiCollectionWrapper come from in the Milvus python_client test framework):

     from base.collection_wrapper import ApiCollectionWrapper
     from common import common_func as cf
     from utils.util_log import test_log as log

     nb = 399360
     dim = 512
     c_name = cf.gen_unique_str('chaos_memory')
     collection_w = ApiCollectionWrapper()
     collection_w.init_collection(name=c_name,
                                  schema=cf.gen_default_collection_schema(dim=dim))
     for i in range(10):
         df = cf.gen_default_dataframe_data(nb=nb, dim=dim)
         collection_w.insert(df)
         # num_entities triggers a flush and waits for it; this is where the hang shows up
         log.info(f'After {i + 1} insert, num_entities: {collection_w.num_entities}')
  3. Client log:
<partitions>: [{"name": "_default", "collection_name": "chaos_memory_fUU5jmwh", "description": ""}]
<description>: 
<schema>: {
  auto_id: False
  description: 
  fields: [{
    name: int64
    description: 
    type: 5
    is_primary: True
 ......  (api_request.py:27)
[2022-01-04 06:43:48,623 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [       int64    float                                       float_vector
0          0      0.0  [0.07346399916402581, 0.008118018784176988, 0....
1          1      1.0  [0.02299471850774338, 0.03650073879641807, 0.0...
2          2      2.0  [0.07638110721837194, 0.046163664631532444, 0....
3      ......, kwargs: {'timeout': 20} (api_request.py:55)
[2022-01-04 06:43:58,492 - DEBUG - ci_test]: (api_response) : (insert count: 39936, delete count: 0, upsert count: 0, timestamp: 430251346864898049)  (api_request.py:27)
[2022-01-04 06:47:05,946 - INFO - ci_test]: After 1 insert, num_entities: 39936 (test_issue.py:23)
[2022-01-04 06:47:13,585 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [       int64    float                                       float_vector
0          0      0.0  [0.05125624698035131, 0.007397765540320532, 0....
1          1      1.0  [0.012949129257385934, 0.013630122127253225, 0...
2          2      2.0  [0.045867500260432625, 0.03802056208585535, 0....
3      ......, kwargs: {'timeout': 20} (api_request.py:55)
[2022-01-04 06:47:23,674 - DEBUG - ci_test]: (api_response) : (insert count: 39936, delete count: 0, upsert count: 0, timestamp: 430251400617525250)  (api_request.py:27)
[2022-01-04 06:47:25,899 - INFO - ci_test]: After 2 insert, num_entities: 79872 (test_issue.py:23)
[2022-01-04 06:47:32,590 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [       int64    float                                       float_vector
0          0      0.0  [0.06715657931228057, 0.05852660935810229, 0.0...
1          1      1.0  [0.01524461165085814, 0.0353182117755068, 0.04...
2          2      2.0  [0.008627019802114309, 0.0345993015147255, 0.0...
3      ......, kwargs: {'timeout': 20} (api_request.py:55)
[2022-01-04 06:47:42,112 - DEBUG - ci_test]: (api_response) : (insert count: 39936, delete count: 0, upsert count: 0, timestamp: 430251405414760450)  (api_request.py:27)
[2022-01-04 06:47:44,130 - INFO - ci_test]: After 3 insert, num_entities: 119808 (test_issue.py:23)
[2022-01-04 06:47:51,524 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [       int64    float                                       float_vector
0          0      0.0  [0.0006018422325566739, 0.03216450650895717, 0...
1          1      1.0  [0.057846384626960935, 0.046126230193544955, 0...
2          2      2.0  [0.025808790874017818, 0.05704071989994611, 0....
3      ......, kwargs: {'timeout': 20} (api_request.py:55)
[2022-01-04 06:48:01,487 - DEBUG - ci_test]: (api_response) : (insert count: 39936, delete count: 0, upsert count: 0, timestamp: 430251410487246849)  (api_request.py:27)
  4. Server log:
     milvus_logs.tar.gz

@wangting0128
Contributor Author

argo task: benchmark-tag-p5gkx

test yaml:
client-configmap:client-random-locust-search-84h-1b
server-configmap:server-cluster-8c64m-datanode2-indexnode4-querynode6

server:

NAME                                                            READY   STATUS      RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-tag-p5gkx-1-etcd-0                                    1/1     Running     0          2d15h   10.97.16.81    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-etcd-1                                    1/1     Running     0          2d15h   10.97.17.233   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-etcd-2                                    1/1     Running     0          2d15h   10.97.16.83    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-datacoord-77b659d7fd-2hsbd         1/1     Running     0          2d15h   10.97.14.192   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-datanode-7cddd8fd87-h48db          1/1     Running     1          2d15h   10.97.16.75    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-datanode-7cddd8fd87-pvvsk          1/1     Running     1          2d15h   10.97.20.13    qa-node018.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-indexcoord-796bcb999f-ch7xm        1/1     Running     0          2d15h   10.97.9.8      qa-node007.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-indexnode-664f65d954-hcqwq         1/1     Running     0          2d15h   10.97.16.78    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-indexnode-664f65d954-pmn88         1/1     Running     0          2d15h   10.97.20.15    qa-node018.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-indexnode-664f65d954-ztxwn         1/1     Running     0          2d15h   10.97.10.169   qa-node008.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-indexnode-664f65d954-zxcf9         1/1     Running     0          2d15h   10.97.17.231   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-proxy-5dddbbb979-jfk5c             1/1     Running     0          2d15h   10.97.16.72    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-querycoord-555c4fc768-bg27l        1/1     Running     0          2d15h   10.97.9.7      qa-node007.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-querynode-5b88599944-lfs5r         1/1     Running     0          2d15h   10.97.12.181   qa-node015.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-querynode-5b88599944-lthmk         1/1     Running     0          2d15h   10.97.16.74    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-querynode-5b88599944-qrq6w         1/1     Running     0          2d15h   10.97.17.230   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-querynode-5b88599944-sqdnm         1/1     Running     0          2d15h   10.97.14.193   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-querynode-5b88599944-wkfxk         1/1     Running     0          2d15h   10.97.20.14    qa-node018.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-querynode-5b88599944-z8dwl         1/1     Running     0          2d15h   10.97.16.71    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-milvus-rootcoord-55dbbcb7b6-qtn5f         1/1     Running     0          2d15h   10.97.9.9      qa-node007.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-minio-0                                   1/1     Running     0          2d15h   10.97.16.80    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-minio-1                                   1/1     Running     0          2d15h   10.97.20.17    qa-node018.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-minio-2                                   1/1     Running     0          2d15h   10.97.12.183   qa-node015.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-minio-3                                   1/1     Running     0          2d15h   10.97.12.185   qa-node015.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-autorecovery-5c84f874f8-mxjhj      1/1     Running     0          2d15h   10.97.20.12    qa-node018.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-bastion-84d66cc74d-pbbrm           1/1     Running     0          2d15h   10.97.10.168   qa-node008.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-bookkeeper-0                       1/1     Running     0          2d15h   10.97.4.36     qa-node002.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-bookkeeper-1                       1/1     Running     0          2d15h   10.97.17.239   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-bookkeeper-2                       1/1     Running     0          2d15h   10.97.4.39     qa-node002.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-broker-5f98945459-pdxzp            1/1     Running     0          2d15h   10.97.17.229   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-proxy-64d7f77d85-4cf8w             2/2     Running     0          2d15h   10.97.9.10     qa-node007.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-zookeeper-0                        1/1     Running     0          2d15h   10.97.4.35     qa-node002.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-zookeeper-1                        1/1     Running     0          2d15h   10.97.7.15     qa-node005.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-zookeeper-2                        1/1     Running     0          2d15h   10.97.11.194   qa-node009.zilliz.local   <none>           <none>
benchmark-tag-p5gkx-1-pulsar-zookeeper-metadata-jhpjr           0/1     Completed   0          2d15h   10.97.9.6      qa-node007.zilliz.local   <none>           <none>

client pod: benchmark-tag-p5gkx-2685632679

client log:

[2022-01-07 14:02:47,328] [   DEBUG] - 234524972 (milvus_benchmark.runners.base:89)
[2022-01-07 14:02:48,620] [   DEBUG] - Start id: 234600000, end id: 234650000 (milvus_benchmark.runners.base:76)
[2022-01-07 14:02:49,988] [   DEBUG] - Milvus insert run in 1.3646s (milvus_benchmark.client:53)
[2022-01-07 14:02:49,992] [   DEBUG] - Row count: 234574871 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-07 14:02:49,992] [   DEBUG] - 234574871 (milvus_benchmark.runners.base:89)
[2022-01-07 14:02:50,557] [   DEBUG] - Start id: 234650000, end id: 234700000 (milvus_benchmark.runners.base:76)
[2022-01-07 14:02:52,855] [   DEBUG] - Milvus insert run in 2.2959s (milvus_benchmark.client:53)
[2022-01-07 14:02:52,860] [   DEBUG] - Row count: 234624929 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-07 14:02:52,860] [   DEBUG] - 234624929 (milvus_benchmark.runners.base:89)
[2022-01-07 14:02:54,670] [   DEBUG] - Start id: 234700000, end id: 234750000 (milvus_benchmark.runners.base:76)
[2022-01-07 14:02:56,094] [   DEBUG] - Milvus insert run in 1.4205s (milvus_benchmark.client:53)
[2022-01-07 14:02:56,107] [   DEBUG] - Row count: 234674835 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-07 14:02:56,107] [   DEBUG] - 234674835 (milvus_benchmark.runners.base:89)
[2022-01-07 14:02:56,653] [   DEBUG] - Start id: 234750000, end id: 234800000 (milvus_benchmark.runners.base:76)

@sroshkul

sroshkul commented Jan 10, 2022

We also have this issue.
After inserting ~20M vectors, Milvus crashes with:

RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1641815286.357480005","description":"Error received from peer ipv4:34.236.189.197:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Socket closed","grpc_status":14}"
        {'API start': '2022-01-10 13:45:08.402234', 'RPC start': '2022-01-10 13:45:08.402590', 'RPC error': '2022-01-10 13:48:06.363412'}
Addr [xxxxxxxxxxxxx94.xxxxxxx.elb.amazonaws.com:19530] bulk_insert
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1641815468.041768598","description":"Error received from peer ipv4:54.163.172.194:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Socket closed","grpc_status":14}"
        {'API start': '2022-01-10 13:48:08.885172', 'RPC start': '2022-01-10 13:48:08.886289', 'RPC error': '2022-01-10 13:51:08.044843'}

In the proxy/pulsar-broker logs we see:

time="2022-01-10T14:10:46Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-pulsar-1-662 topic="persistent://public/default/by-dev-rootcoord-dml_8"
time="2022-01-10T14:10:46Z" level=info msg="[Reconnecting to broker in  1m2.769784626s]" producerID=2 producer_name=milvus-pulsar-1-662 topic="persistent://public/default/by-dev-rootcoord-dml_8"
[2022/01/10 14:10:48.666 +00:00] [WARN] [grpclog.go:46] ["[core]grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams failed to receive the preface from client: EOF\""]
[2022/01/10 14:10:58.348 +00:00] [WARN] [grpclog.go:46] ["[core]grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams failed to receive the preface from client: EOF\""]
NAME                                          READY   STATUS      RESTARTS   AGE
milvus-datacoord-f89857d8-5wm4r               1/1     Running     1          7h35m
milvus-datanode-666fcdc654-2jgpd              1/1     Running     0          3m27s
milvus-datanode-666fcdc654-54kkk              1/1     Running     0          3m57s
milvus-datanode-666fcdc654-gmfjq              1/1     Running     0          3m27s
milvus-datanode-666fcdc654-lkgkj              1/1     Running     0          3m57s
milvus-datanode-666fcdc654-mv9t9              1/1     Running     0          5h44m
milvus-datanode-666fcdc654-pl4rx              1/1     Running     0          12m
milvus-datanode-666fcdc654-r6r2k              1/1     Running     0          3m57s
milvus-datanode-666fcdc654-rnpt6              1/1     Running     0          3m27s
milvus-datanode-666fcdc654-t97px              1/1     Running     0          44m
milvus-datanode-666fcdc654-xk7jq              1/1     Running     0          123m
milvus-etcd-0                                 1/1     Running     0          7h35m
milvus-etcd-1                                 1/1     Running     0          7h35m
milvus-etcd-2                                 1/1     Running     0          7h35m
milvus-indexcoord-5df9695cff-spqhg            1/1     Running     0          7h35m
milvus-indexnode-778ff6c59-4hslp              1/1     Running     0          7h35m
milvus-indexnode-778ff6c59-6q8lb              1/1     Running     0          5h25m
milvus-proxy-5b548dc64-5m686                  1/1     Running     0          149m
milvus-proxy-5b548dc64-b44wc                  1/1     Running     1          4h55m
milvus-proxy-5b548dc64-dm6tl                  1/1     Running     0          4h48m
milvus-proxy-5b548dc64-t5jnl                  1/1     Running     0          4h55m
milvus-proxy-5b548dc64-zpd9j                  1/1     Running     0          6h24m
milvus-proxy-5b548dc64-zqzzq                  1/1     Running     1          6h24m
milvus-pulsar-autorecovery-75bd6f6dff-9scvg   1/1     Running     0          7h35m
milvus-pulsar-bastion-6f8cbcd9c7-2vwjf        1/1     Running     0          7h35m
milvus-pulsar-bookkeeper-0                    1/1     Running     0          7h35m
milvus-pulsar-bookkeeper-1                    1/1     Running     0          7h25m
milvus-pulsar-bookkeeper-2                    1/1     Running     0          7h33m
milvus-pulsar-broker-66d576f5f6-7f2x7         1/1     Running     1          7h35m
milvus-pulsar-broker-66d576f5f6-x5hn2         1/1     Running     0          123m
milvus-pulsar-proxy-5754b7cc4b-5jwhc          2/2     Running     0          7h25m
milvus-pulsar-zookeeper-0                     1/1     Running     0          7h35m
milvus-pulsar-zookeeper-1                     1/1     Running     0          7h35m
milvus-pulsar-zookeeper-2                     1/1     Running     0          7h34m
milvus-pulsar-zookeeper-metadata-d9st2        0/1     Completed   0          7h35m
milvus-querycoord-564bc84b8b-zgbwd            1/1     Running     1          7h35m
milvus-querynode-f57599f7f-8c2ld              1/1     Running     0          5h42m
milvus-querynode-f57599f7f-gspc7              1/1     Running     0          5h41m
milvus-querynode-f57599f7f-jzkqh              1/1     Running     1          7h35m
milvus-querynode-f57599f7f-l6jr2              1/1     Running     0          5h42m
milvus-rootcoord-89798cb68-4l287              1/1     Running     1          7h35m

After ~4 hours the Pulsar queue seems to have been processed: the ProducerBlockedQuotaExceededException stopped appearing and we are able to run collection.num_entities (while the exception was occurring, that call hung and returned no result).
What can we do to insert >1B vectors without hitting this exception? (A client-side mitigation sketch is below.)
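
One generic client-side mitigation, not a fix for the underlying backlog problem, is to throttle inserts and retry with backoff when the gRPC channel drops; a minimal sketch, assuming a pymilvus connection to an already-created collection:

import time
from pymilvus import connections, Collection

connections.connect(host='127.0.0.1', port='19530')  # endpoint is an assumption
collection = Collection('sift_1b_128_l2')            # assumed existing collection

def insert_with_retry(entities, retries=5, base_delay=10):
    # StatusCode.UNAVAILABLE ("Socket closed") surfaces as an exception
    # from pymilvus, so catch broadly and back off exponentially.
    for attempt in range(retries):
        try:
            return collection.insert(entities)
        except Exception as exc:
            wait = base_delay * 2 ** attempt
            print(f'insert failed ({exc}); retrying in {wait}s')
            time.sleep(wait)
    raise RuntimeError('insert failed after all retries')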

@wangting0128
Contributor Author

argo task: benchmark-tag-xcglg

test yaml:
client-configmap:client-random-locust-search-84h-1b
server-configmap:server-cluster-8c64m-datanode2-indexnode4-querynode6

server:

NAME                                                         READY   STATUS      RESTARTS   AGE    IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-tag-xcglg-1-etcd-0                                 1/1     Running     0          15h    10.97.17.226   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-etcd-1                                 1/1     Running     0          15h    10.97.16.171   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-etcd-2                                 1/1     Running     0          15h    10.97.17.227   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-datacoord-64b968bddb-zgs67      1/1     Running     0          15h    10.97.6.120    qa-node004.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-datanode-57ffc9ccf9-hcrj7       1/1     Running     0          15h    10.97.20.105   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-datanode-57ffc9ccf9-zzsv4       1/1     Running     0          15h    10.97.11.201   qa-node009.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexcoord-86878fd687-mxtbc     1/1     Running     0          15h    10.97.3.117    qa-node001.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexnode-8497dd4794-7fmmx      1/1     Running     0          15h    10.97.16.168   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexnode-8497dd4794-f4jsw      1/1     Running     0          15h    10.97.12.62    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexnode-8497dd4794-hwr4z      1/1     Running     0          15h    10.97.10.7     qa-node008.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexnode-8497dd4794-nd6cq      1/1     Running     0          15h    10.97.20.103   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-proxy-c788bfbbc-4vf8g           1/1     Running     0          15h    10.97.6.119    qa-node004.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querycoord-78c6b86644-cbn52     1/1     Running     0          15h    10.97.6.118    qa-node004.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-6cr4s      1/1     Running     0          15h    10.97.17.223   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-9xfjv      1/1     Running     0          15h    10.97.19.77    qa-node016.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-j4zxt      1/1     Running     0          15h    10.97.16.169   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-lxbbg      1/1     Running     0          15h    10.97.17.225   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-sdg9x      1/1     Running     0          15h    10.97.20.104   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-sp54k      1/1     Running     0          15h    10.97.14.210   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-rootcoord-77bf69d4cd-kfvd6      1/1     Running     0          15h    10.97.14.209   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-minio-0                                1/1     Running     0          15h    10.97.12.65    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-minio-1                                1/1     Running     0          15h    10.97.19.79    qa-node016.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-minio-2                                1/1     Running     0          15h    10.97.12.68    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-minio-3                                1/1     Running     0          15h    10.97.12.66    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-autorecovery-84f79f4cc4-dxtfc   1/1     Running     0          15h    10.97.13.60    qa-node010.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-bastion-69ff9fddb8-lfzvh        1/1     Running     0          15h    10.97.3.118    qa-node001.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-bookkeeper-0                    1/1     Running     0          15h    10.97.4.32     qa-node002.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-bookkeeper-1                    1/1     Running     0          15h    10.97.13.61    qa-node010.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-bookkeeper-2                    1/1     Running     0          15h    10.97.4.33     qa-node002.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-broker-57cfc8f8ff-4pv2n         1/1     Running     0          15h    10.97.8.145    qa-node006.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-proxy-58cf7d9b8f-2pjqt          2/2     Running     0          15h    10.97.11.202   qa-node009.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-zookeeper-0                     1/1     Running     0          15h    10.97.9.11     qa-node007.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-zookeeper-1                     1/1     Running     0          15h    10.97.3.119    qa-node001.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-zookeeper-2                     1/1     Running     0          15h    10.97.7.153    qa-node005.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-zookeeper-metadata-gmnjl        0/1     Completed   0          15h    10.97.3.116    qa-node001.zilliz.local   <none>           <none>

client pod: benchmark-tag-xcglg-1701033979

locust_report_2022-01-12_835.log

client log:

[2022-01-13 03:08:18,887] [   DEBUG] - Row count: 991600000 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-13 03:08:18,887] [   DEBUG] - 991600000 (milvus_benchmark.runners.base:89)
[2022-01-13 03:08:19,279] [   DEBUG] - Start id: 991650000, end id: 991700000 (milvus_benchmark.runners.base:76)
[2022-01-13 03:08:21,186] [   DEBUG] - Milvus insert run in 1.9047s (milvus_benchmark.client:53)
[2022-01-13 03:08:21,191] [   DEBUG] - Row count: 991600000 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-13 03:08:21,192] [   DEBUG] - 991600000 (milvus_benchmark.runners.base:89)
[2022-01-13 03:08:24,155] [   DEBUG] - Start id: 991700000, end id: 991750000 (milvus_benchmark.runners.base:76)
[2022-01-13 03:08:26,047] [   DEBUG] - Milvus insert run in 1.8882s (milvus_benchmark.client:53)
[2022-01-13 03:08:26,051] [   DEBUG] - Row count: 991700000 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-13 03:08:26,053] [   DEBUG] - 991700000 (milvus_benchmark.runners.base:89)
[2022-01-13 03:08:26,456] [   DEBUG] - Start id: 991750000, end id: 991800000 (milvus_benchmark.runners.base:76)

@sunby
Contributor

sunby commented Jan 17, 2022

argo task: benchmark-tag-xcglg

test yaml:
client-configmap:client-random-locust-search-84h-1b
server-configmap:server-cluster-8c64m-datanode2-indexnode4-querynode6

server:

NAME                                                         READY   STATUS      RESTARTS   AGE    IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-tag-xcglg-1-etcd-0                                 1/1     Running     0          15h    10.97.17.226   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-etcd-1                                 1/1     Running     0          15h    10.97.16.171   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-etcd-2                                 1/1     Running     0          15h    10.97.17.227   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-datacoord-64b968bddb-zgs67      1/1     Running     0          15h    10.97.6.120    qa-node004.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-datanode-57ffc9ccf9-hcrj7       1/1     Running     0          15h    10.97.20.105   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-datanode-57ffc9ccf9-zzsv4       1/1     Running     0          15h    10.97.11.201   qa-node009.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexcoord-86878fd687-mxtbc     1/1     Running     0          15h    10.97.3.117    qa-node001.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexnode-8497dd4794-7fmmx      1/1     Running     0          15h    10.97.16.168   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexnode-8497dd4794-f4jsw      1/1     Running     0          15h    10.97.12.62    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexnode-8497dd4794-hwr4z      1/1     Running     0          15h    10.97.10.7     qa-node008.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-indexnode-8497dd4794-nd6cq      1/1     Running     0          15h    10.97.20.103   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-proxy-c788bfbbc-4vf8g           1/1     Running     0          15h    10.97.6.119    qa-node004.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querycoord-78c6b86644-cbn52     1/1     Running     0          15h    10.97.6.118    qa-node004.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-6cr4s      1/1     Running     0          15h    10.97.17.223   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-9xfjv      1/1     Running     0          15h    10.97.19.77    qa-node016.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-j4zxt      1/1     Running     0          15h    10.97.16.169   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-lxbbg      1/1     Running     0          15h    10.97.17.225   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-sdg9x      1/1     Running     0          15h    10.97.20.104   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-querynode-6bd564bd88-sp54k      1/1     Running     0          15h    10.97.14.210   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-milvus-rootcoord-77bf69d4cd-kfvd6      1/1     Running     0          15h    10.97.14.209   qa-node011.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-minio-0                                1/1     Running     0          15h    10.97.12.65    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-minio-1                                1/1     Running     0          15h    10.97.19.79    qa-node016.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-minio-2                                1/1     Running     0          15h    10.97.12.68    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-minio-3                                1/1     Running     0          15h    10.97.12.66    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-autorecovery-84f79f4cc4-dxtfc   1/1     Running     0          15h    10.97.13.60    qa-node010.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-bastion-69ff9fddb8-lfzvh        1/1     Running     0          15h    10.97.3.118    qa-node001.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-bookkeeper-0                    1/1     Running     0          15h    10.97.4.32     qa-node002.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-bookkeeper-1                    1/1     Running     0          15h    10.97.13.61    qa-node010.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-bookkeeper-2                    1/1     Running     0          15h    10.97.4.33     qa-node002.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-broker-57cfc8f8ff-4pv2n         1/1     Running     0          15h    10.97.8.145    qa-node006.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-proxy-58cf7d9b8f-2pjqt          2/2     Running     0          15h    10.97.11.202   qa-node009.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-zookeeper-0                     1/1     Running     0          15h    10.97.9.11     qa-node007.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-zookeeper-1                     1/1     Running     0          15h    10.97.3.119    qa-node001.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-zookeeper-2                     1/1     Running     0          15h    10.97.7.153    qa-node005.zilliz.local   <none>           <none>
benchmark-tag-xcglg-1-pulsar-zookeeper-metadata-gmnjl        0/1     Completed   0          15h    10.97.3.116    qa-node001.zilliz.local   <none>           <none>

client pod: benchmark-tag-xcglg-1701033979

locust_report_2022-01-12_835.log

client log:

[2022-01-13 03:08:18,887] [   DEBUG] - Row count: 991600000 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-13 03:08:18,887] [   DEBUG] - 991600000 (milvus_benchmark.runners.base:89)
[2022-01-13 03:08:19,279] [   DEBUG] - Start id: 991650000, end id: 991700000 (milvus_benchmark.runners.base:76)
[2022-01-13 03:08:21,186] [   DEBUG] - Milvus insert run in 1.9047s (milvus_benchmark.client:53)
[2022-01-13 03:08:21,191] [   DEBUG] - Row count: 991600000 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-13 03:08:21,192] [   DEBUG] - 991600000 (milvus_benchmark.runners.base:89)
[2022-01-13 03:08:24,155] [   DEBUG] - Start id: 991700000, end id: 991750000 (milvus_benchmark.runners.base:76)
[2022-01-13 03:08:26,047] [   DEBUG] - Milvus insert run in 1.8882s (milvus_benchmark.client:53)
[2022-01-13 03:08:26,051] [   DEBUG] - Row count: 991700000 in collection: <sift_1b_128_l2> (milvus_benchmark.client:421)
[2022-01-13 03:08:26,053] [   DEBUG] - 991700000 (milvus_benchmark.runners.base:89)
[2022-01-13 03:08:26,456] [   DEBUG] - Start id: 991750000, end id: 991800000 (milvus_benchmark.runners.base:76)

Does it work after we change the missingTolerance? @wangting0128
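
One way to answer that empirically, a minimal watchdog sketch for the re-run (assuming pymilvus 2.0.x and the same collection and endpoint as above; the 60-second poll and 10-minute threshold are arbitrary choices, not values from this thread):

import time

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # hypothetical endpoint
collection = Collection("sift_1b_128_l2")

last_count = collection.num_entities
last_change = time.time()
while True:
    time.sleep(60)
    count = collection.num_entities  # flushes, then reports server-side row count
    if count != last_count:
        last_count, last_change = count, time.time()
    elif time.time() - last_change > 600:
        # No growth for 10+ minutes while the insert load is still running:
        # the hang has likely recurred.
        print(f"Row count stuck at {count}; Milvus may be hanging again")
        break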

@sunby
Contributor

sunby commented Jan 17, 2022

/assign @wangting0128
