[Bug]: [benchmark][cluster] queryNode OOM when all queryNode.mmap params are enabled #38410

Open
1 task done
wangting0128 opened this issue Dec 12, 2024 · 1 comment
Labels: kind/bug, test/benchmark, triage/accepted
@wangting0128

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20241211-e279ccf1-amd64
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): 2.5.0rc124
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-mdh8v

server:

NAME                                                              READY   STATUS        RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
verify-master-dql-ddl-etcd-0                                      1/1     Running       0                22h     10.104.17.29    4am-node23   <none>           <none>
verify-master-dql-ddl-etcd-1                                      1/1     Running       0                22h     10.104.34.104   4am-node37   <none>           <none>
verify-master-dql-ddl-etcd-2                                      1/1     Running       0                22h     10.104.23.88    4am-node27   <none>           <none>
verify-master-dql-ddl-milvus-datanode-68f58cdc68-hv82g            1/1     Running       2 (22h ago)      22h     10.104.26.120   4am-node32   <none>           <none>
verify-master-dql-ddl-milvus-indexnode-75db697bd9-8qcc9           1/1     Running       2 (22h ago)      22h     10.104.27.116   4am-node31   <none>           <none>
verify-master-dql-ddl-milvus-indexnode-75db697bd9-fh7w5           1/1     Running       2 (22h ago)      22h     10.104.16.101   4am-node21   <none>           <none>
verify-master-dql-ddl-milvus-indexnode-75db697bd9-xdfng           1/1     Running       2 (22h ago)      22h     10.104.17.25    4am-node23   <none>           <none>
verify-master-dql-ddl-milvus-indexnode-75db697bd9-xwm84           1/1     Running       1 (22h ago)      22h     10.104.6.245    4am-node13   <none>           <none>
verify-master-dql-ddl-milvus-mixcoord-79cf79c44f-rbgvp            1/1     Running       1 (22h ago)      22h     10.104.6.243    4am-node13   <none>           <none>
verify-master-dql-ddl-milvus-proxy-5c5db9b6bf-hfhb6               1/1     Running       2 (22h ago)      22h     10.104.16.100   4am-node21   <none>           <none>
verify-master-dql-ddl-milvus-querynode-6b5798ff-rfkrg             1/1     Running       24 (7h34m ago)   22h     10.104.21.129   4am-node24   <none>           <none>
verify-master-dql-ddl-minio-0                                     1/1     Running       0                22h     10.104.34.102   4am-node37   <none>           <none>
verify-master-dql-ddl-minio-1                                     1/1     Running       0                22h     10.104.25.146   4am-node30   <none>           <none>
verify-master-dql-ddl-minio-2                                     1/1     Running       0                22h     10.104.26.124   4am-node32   <none>           <none>
verify-master-dql-ddl-minio-3                                     1/1     Running       0                22h     10.104.17.31    4am-node23   <none>           <none>
verify-master-dql-ddl-pulsarv3-bookie-0                           1/1     Running       0                22h     10.104.34.101   4am-node37   <none>           <none>
verify-master-dql-ddl-pulsarv3-bookie-1                           1/1     Running       0                22h     10.104.16.104   4am-node21   <none>           <none>
verify-master-dql-ddl-pulsarv3-bookie-2                           1/1     Running       0                22h     10.104.27.119   4am-node31   <none>           <none>
verify-master-dql-ddl-pulsarv3-bookie-init-hfwt2                  0/1     Completed     0                22h     10.104.6.248    4am-node13   <none>           <none>
verify-master-dql-ddl-pulsarv3-broker-0                           1/1     Running       0                22h     10.104.14.197   4am-node18   <none>           <none>
verify-master-dql-ddl-pulsarv3-broker-1                           1/1     Running       0                22h     10.104.26.121   4am-node32   <none>           <none>
verify-master-dql-ddl-pulsarv3-proxy-0                            1/1     Running       0                22h     10.104.6.250    4am-node13   <none>           <none>
verify-master-dql-ddl-pulsarv3-proxy-1                            1/1     Running       0                22h     10.104.14.198   4am-node18   <none>           <none>
verify-master-dql-ddl-pulsarv3-pulsar-init-lm9qg                  0/1     Completed     0                22h     10.104.6.246    4am-node13   <none>           <none>
verify-master-dql-ddl-pulsarv3-recovery-0                         1/1     Running       0                22h     10.104.6.249    4am-node13   <none>           <none>
verify-master-dql-ddl-pulsarv3-zookeeper-0                        1/1     Running       0                22h     10.104.34.100   4am-node37   <none>           <none>
verify-master-dql-ddl-pulsarv3-zookeeper-1                        1/1     Running       0                22h     10.104.17.28    4am-node23   <none>           <none>
verify-master-dql-ddl-pulsarv3-zookeeper-2                        1/1     Running       0                22h     10.104.26.125   4am-node32   <none>           <none>
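
Restart counts in a listing like the one above can be checked mechanically. This is a minimal Python sketch and not part of the original report: the names `restart_counts` and `noisy_pods` and the threshold of 5 are hypothetical, and the sample rows are copied from the listing with the trailing columns trimmed.

```python
import re

# Matches one row of `kubectl get pods -o wide` output up to the AGE column.
# RESTARTS is either "0" or "24 (7h34m ago)", so the parenthesised part is optional.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<ready>\d+/\d+)\s+(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)(?:\s+\((?P<last>[^)]*)\))?\s+(?P<age>\S+)"
)


def restart_counts(listing):
    """Map pod name to restart count; the header and unparsable lines are skipped."""
    counts = {}
    for line in listing.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[match.group("name")] = int(match.group("restarts"))
    return counts


def noisy_pods(listing, threshold=5):
    """Return (name, restarts) pairs at or above the threshold, highest first."""
    hits = [(n, c) for n, c in restart_counts(listing).items() if c >= threshold]
    return sorted(hits, key=lambda item: -item[1])


SAMPLE = """\
NAME                                                    READY   STATUS    RESTARTS         AGE
verify-master-dql-ddl-etcd-0                            1/1     Running   0                22h
verify-master-dql-ddl-milvus-datanode-68f58cdc68-hv82g  1/1     Running   2 (22h ago)      22h
verify-master-dql-ddl-milvus-querynode-6b5798ff-rfkrg   1/1     Running   24 (7h34m ago)   22h
"""

for name, restarts in noisy_pods(SAMPLE):
    print(name, restarts)
```

On these sample rows only the querynode pod is flagged, matching its 24 restarts in 22h in the listing.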

queryNode OOM
Screenshot 2024-12-12 18 05 38

Screenshot 2024-12-12 18 06 00

image

client log:
request failed
image

Expected Behavior

No response

Steps To Reproduce

1. create a collection with fields: 'id', 'float_vector', 'float_vector_1', 'sparse_float_vector', 'bfloat16_vector', 'int64_1', 'varchar_1'
2. create indexes
   - INVERTED: int64_1, varchar_1
   - HNSW: float_vector
   - DISKANN: float_vector_1
   - SPARSE_INVERTED_INDEX: sparse_float_vector
   - IVF_SQ8: bfloat16_vector
3. insert 20 million rows (dataset_size: 20m)
4. flush collection
5. build indexes again with the same params
6. load collection
7. concurrent requests:
   - scene_hybrid_search_test
     (collection: create->insert->flush->index->load->hybrid_search->drop)
   - scene_test
     (collection: create->insert->flush->index->drop)
   - scene_test_partition_hybrid_search
     (partition: create->insert->flush->index again->load->hybrid_search->release->hybrid_search failed->drop)
   - search <- search on default partition
   - hybrid_search <- hybrid_search on default partition
   - query <- query on default partition

Milvus Log

No response

Anything else?

server config:

extraConfigFiles:
  user.yaml: |+
    queryNode:
      mmap:
        vectorField: true
        vectorIndex: true
        scalarField: true
        scalarIndex: true
queryNode:
  resources:
    limits:
      cpu: '32'
      memory: 32Gi
    requests:
      cpu: '16'
      memory: 32Gi
  replicas: 1
  nodeSelector:
    node-role/nvme: 'true'
indexNode:
  resources:
    limits:
      cpu: '4.0'
      memory: 16Gi
    requests:
      cpu: '2.0'
      memory: 4Gi
  replicas: 4
dataNode:
  resources:
    limits:
      cpu: '2.0'
      memory: 16Gi
    requests:
      cpu: '2.0'
      memory: 5Gi
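
For scale, the raw dense-vector data implied by this setup can be compared with the 32Gi queryNode memory limit. This is a rough Python estimate and not part of the original report. It assumes 20 million rows, 4 bytes per float32 element and 2 bytes per bfloat16 element, with the dimensions (128, 768, 256) taken from the client config in this report. It ignores the sparse vector, the scalar fields, index structures and any runtime overhead, so it illustrates the data volume rather than diagnosing the OOM.

```python
ROWS = 20_000_000
GIB = 1024 ** 3
QUERYNODE_LIMIT_BYTES = 32 * GIB  # queryNode.resources.limits.memory: 32Gi

# field name -> (dimension, bytes per element); element widths are assumptions
DENSE_FIELDS = {
    "float_vector": (128, 4),     # float32
    "float_vector_1": (768, 4),   # float32
    "bfloat16_vector": (256, 2),  # bfloat16
}


def raw_bytes(rows, fields):
    """Raw storage per dense vector field, with no index or metadata overhead."""
    return {name: rows * dim * width for name, (dim, width) in fields.items()}


sizes = raw_bytes(ROWS, DENSE_FIELDS)
total = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name}: {size / GIB:.2f} GiB")
print(f"total: {total / GIB:.2f} GiB vs limit {QUERYNODE_LIMIT_BYTES / GIB:.0f} GiB")
```

Under these assumptions the dense vectors alone are larger than the memory limit on the single queryNode replica, so the node depends on the mmap settings above to keep most of that data out of resident memory.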

client config:

{
     "dataset_params": {
          "metric_type": "L2",
          "dim": 128,
          "scalars_index": {
               "int64_1": {
                    "index_type": "INVERTED"
               },
               "varchar_1": {
                    "index_type": "INVERTED"
               }
          },
          "vectors_index": {
               "float_vector_1": {
                    "index_type": "DISKANN",
                    "index_param": {},
                    "metric_type": "IP"
               },
               "sparse_float_vector": {
                    "index_type": "SPARSE_INVERTED_INDEX",
                    "index_param": {
                         "drop_ratio_build": 0.2
                    },
                    "metric_type": "IP"
               },
               "bfloat16_vector": {
                    "index_type": "IVF_SQ8",
                    "index_param": {
                         "nlist": 2048
                    },
                    "metric_type": "L2"
               }
          },
          "scalars_params": {
               "float_vector_1": {
                    "params": {
                         "dim": 768
                    },
                    "other_params": {
                         "dataset": "laion2b_multi",
                         "column_name": "float32_vector"
                    }
               },
               "sparse_float_vector": {
                    "other_params": {
                         "dim": 10000,
                         "sparse_range": [
                              1,
                              20
                         ]
                    }
               },
               "bfloat16_vector": {
                    "params": {
                         "dim": 256
                    }
               }
          },
          "dataset_name": "sift",
          "dataset_size": "20m",
          "ni_per": 10000
     },
     "collection_params": {
          "other_fields": [
               "float_vector_1",
               "sparse_float_vector",
               "bfloat16_vector",
               "int64_1",
               "varchar_1"
          ],
          "shards_num": 2
     },
     "index_params": {
          "index_type": "HNSW",
          "index_param": {
               "M": 8,
               "efConstruction": 200
          }
     },
     "concurrent_params": {
          "concurrent_number": 20,
          "during_time": "96h",
          "interval": 20
     },
     "concurrent_tasks": [
          {
               "type": "scene_hybrid_search_test",
               "weight": 1,
               "params": {
                    "nq": 2,
                    "top_k": 5,
                    "reqs": [
                         {
                              "search_param": {
                                   "nprobe": 128
                              },
                              "anns_field": "float_vector",
                              "expr": "bool_1 == True",
                              "top_k": 100
                         },
                         {
                              "search_param": {
                                   "nprobe": 32
                              },
                              "anns_field": "binary_vector_scene_hybrid_search_test_1",
                              "expr": "bool_1 != True",
                              "top_k": 10
                         },
                         {
                              "search_param": {
                                   "search_list": 30
                              },
                              "anns_field": "float16_vector_scene_hybrid_search_test_2",
                              "expr": "int64_1 >= 1500",
                              "top_k": 5
                         },
                         {
                              "search_param": {
                                   "drop_ratio_search": 0.1
                              },
                              "anns_field": "sparse_float_vector_scene_hybrid_search_test_3",
                              "expr": "varchar_1 like \"1%\"",
                              "top_k": 10
                         }
                    ],
                    "rerank": {
                         "RRFRanker": []
                    },
                    "output_fields": [
                         "*"
                    ],
                    "timeout": 600,
                    "random_data": true,
                    "dataset": "local",
                    "dim": 128,
                    "shards_num": 2,
                    "data_size": 3000,
                    "nb": 3000,
                    "index_type": "IVF_SQ8",
                    "index_param": {
                         "nlist": 2048
                    },
                    "metric_type": "L2",
                    "other_fields": [
                         "binary_vector_scene_hybrid_search_test_1",
                         "float16_vector_scene_hybrid_search_test_2",
                         "sparse_float_vector_scene_hybrid_search_test_3",
                         "int64_1",
                         "bool_1",
                         "varchar_1"
                    ],
                    "replica_number": 1,
                    "scalars_params": {
                         "binary_vector_scene_hybrid_search_test_1": {
                              "params": {
                                   "dim": 512
                              },
                              "other_params": {
                                   "dataset": "binary"
                              }
                         },
                         "float16_vector_scene_hybrid_search_test_2": {
                              "params": {
                                   "dim": 64
                              }
                         }
                    },
                    "scalars_index": {
                         "int64_1": {},
                         "bool_1": {
                              "index_type": "BITMAP"
                         },
                         "varchar_1": {
                              "index_type": "INVERTED"
                         }
                    },
                    "vectors_index": {
                         "binary_vector_scene_hybrid_search_test_1": {
                              "index_type": "BIN_IVF_FLAT",
                              "index_param": {
                                   "nlist": 2048
                              },
                              "metric_type": "JACCARD"
                         },
                         "float16_vector_scene_hybrid_search_test_2": {
                              "index_type": "DISKANN",
                              "index_param": {},
                              "metric_type": "IP"
                         },
                         "sparse_float_vector_scene_hybrid_search_test_3": {
                              "index_type": "SPARSE_WAND",
                              "index_param": {
                                   "drop_ratio_build": 0.2
                              },
                              "metric_type": "IP"
                         }
                    },
                    "hybrid_search_counts": 10
               }
          },
          {
               "type": "scene_test",
               "weight": 1,
               "params": {
                    "dim": 128,
                    "data_size": 3000,
                    "nb": 3000,
                    "index_type": "IVF_SQ8",
                    "index_param": {
                         "nlist": 2048
                    },
                    "metric_type": "L2"
               }
          },
          {
               "type": "scene_test_partition_hybrid_search",
               "weight": 1,
               "params": {
                    "nq": 1,
                    "top_k": 1,
                    "reqs": [
                         {
                              "search_param": {
                                   "ef": 32
                              },
                              "anns_field": "float_vector",
                              "top_k": 10
                         },
                         {
                              "search_param": {
                                   "search_list": 30
                              },
                              "anns_field": "float_vector_1",
                              "top_k": 10
                         },
                         {
                              "search_param": {
                                   "drop_ratio_search": 0.3
                              },
                              "anns_field": "sparse_float_vector",
                              "top_k": 30
                         },
                         {
                              "search_param": {
                                   "nprobe": 16
                              },
                              "anns_field": "bfloat16_vector",
                              "top_k": 400
                         }
                    ],
                    "rerank": {
                         "RRFRanker": []
                    },
                    "output_fields": [
                         "*"
                    ],
                    "timeout": 6000,
                    "random_data": true,
                    "hybrid_search_counts": 10,
                    "data_size": 3000,
                    "ni": 3000
               }
          },
          {
               "type": "search",
               "weight": 1,
               "params": {
                    "nq": 1000,
                    "top_k": 1,
                    "search_param": {
                         "nprobe": 1000
                    },
                    "expr": "int64_1 >= 0",
                    "timeout": 6000,
                    "random_data": true,
                    "partition_names": [
                         "_default"
                    ]
               }
          },
          {
               "type": "hybrid_search",
               "weight": 1,
               "params": {
                    "nq": 1,
                    "top_k": 100,
                    "reqs": [
                         {
                              "search_param": {
                                   "ef": 32
                              },
                              "anns_field": "float_vector",
                              "expr": "int64_1 > 100000",
                              "top_k": 10
                         },
                         {
                              "search_param": {
                                   "search_list": 30
                              },
                              "anns_field": "float_vector_1",
                              "expr": "id < 900000",
                              "top_k": 10
                         },
                         {
                              "search_param": {
                                   "drop_ratio_search": 0.3
                              },
                              "anns_field": "sparse_float_vector",
                              "expr": "varchar_1 > \"1\"",
                              "top_k": 30
                         },
                         {
                              "search_param": {
                                   "nprobe": 16
                              },
                              "anns_field": "bfloat16_vector",
                              "top_k": 400
                         }
                    ],
                    "rerank": {
                         "WeightedRanker": [
                              0.85,
                              0.95,
                              0.51,
                              0.32
                         ]
                    },
                    "output_fields": [
                         "*"
                    ],
                    "partition_names": [
                         "_default"
                    ],
                    "timeout": 6000,
                    "random_data": true
               }
          },
          {
               "type": "query",
               "weight": 1,
               "params": {
                    "expr": "int64_1 > -1 && ",
                    "output_fields": [
                         "*"
                    ],
                    "partition_names": [
                         "_default"
                    ],
                    "limit": 10,
                    "timeout": 6000,
                    "custom_expr": " {0} < id < {0} + 1000000",
                    "custom_range": [
                         0,
                         20000000
                    ]
               }
          }
     ]
}
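
The `query` task above splits its filter into `expr`, `custom_expr` and `custom_range`. The benchmark client's actual logic is not shown in this report. One plausible reading, sketched below in Python with hypothetical helper names (`random_start`, `build_query_expr`), is that a start value is drawn from `custom_range`, substituted into `custom_expr`, and appended to `expr`.

```python
import random


def random_start(custom_range, rng=random):
    """Pick a start id inside the inclusive [low, high] range (assumed behaviour)."""
    low, high = custom_range
    return rng.randint(low, high)


def build_query_expr(expr, custom_expr, start):
    """Append the formatted custom_expr to the base expr (assumed behaviour)."""
    return expr + custom_expr.format(start)


# Values copied from the "query" task in the client config.
EXPR = "int64_1 > -1 && "
CUSTOM_EXPR = " {0} < id < {0} + 1000000"
CUSTOM_RANGE = [0, 20000000]

start = random_start(CUSTOM_RANGE)
print(build_query_expr(EXPR, CUSTOM_EXPR, start))
```

If this reading is right, each query request filters on a window of about one million ids at a random offset within the 20 million inserted rows.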
wangting0128 added the kind/bug, needs-triage and test/benchmark labels on Dec 12, 2024
wangting0128 added this to the 2.5.0 milestone on Dec 12, 2024
@yanliang567

/assign @sunby
/unassign

sre-ci-robot assigned sunby and unassigned yanliang567 on Dec 13, 2024
yanliang567 added the triage/accepted label and removed the needs-triage label on Dec 13, 2024