Your current environment
The output of `python collect_env.py`:
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.11.11 | packaged by conda-forge | (main, Dec 5 2024, 14:17:24) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.149-99.162.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 1
Stepping: 0
BogoMIPS: 5599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 24 MiB (48 instances)
L3 cache: 192 MiB (12 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.47.0
[pip3] triton==3.1.0
[conda] libmagma 2.8.0 h0af6554_0 conda-forge
[conda] libmagma_sparse 2.8.0 h0af6554_0 conda-forge
[conda] libtorch 2.4.1 cuda120_h1d34654_302 conda-forge
[conda] mkl 2023.2.0 h84fe81f_50496 conda-forge
[conda] nccl 2.23.4.1 h2b5d15b_3 conda-forge
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] pyzmq 26.2.0 py311h7deb3e3_3 conda-forge
[conda] torch 2.5.1 pypi_0 pypi
[conda] torchvision 0.20.1 pypi_0 pypi
[conda] transformers 4.47.0 pyhd8ed1ab_0 conda-forge
[conda] triton 3.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-95 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=GPU-4d7c9f05-a7a3-237a-0be1-bc1c9c8219d2
CUDA_MODULE_LOADING=LAZY
Model Input Dumps
No response
🐛 Describe the bug
This PR (https://github.com/vllm-project/vllm/pull/9506/files) introduced default loading of the sentence-transformer config whenever the model config is created, regardless of the actual model type (sentence transformer vs. regular causal LM).
Minimal reproduction:

from vllm import LLM

# Not a sentence-transformer model, yet vLLM still probes the Hub for
# sentence_bert_config.json during ModelConfig initialization.
model = LLM("meta-llama/Llama-3.1-70B-Instruct")

This fails with the following error:
{
"name": "HfHubHTTPError",
"message": "404 Client Error: Not Found for url: https://hfproxy/meta-llama/Llama-3.1-70B-Instruct/resolve/main/sentence_bert_config.json",
"stack": "---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/utils/_http.py:406, in hf_raise_for_status(response, endpoint_name)
405 try:
--> 406 response.raise_for_status()
407 except HTTPError as e:
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/requests/models.py:1024, in Response.raise_for_status(self)
1023 if http_error_msg:
-> 1024 raise HTTPError(http_error_msg, response=self)
HTTPError: 404 Client Error: Not Found for url: https://hfproxy/meta-llama/Llama-3.1-70B-Instruct/resolve/main/sentence_bert_config.json
The above exception was the direct cause of the following exception:
HfHubHTTPError Traceback (most recent call last)
Cell In[4], line 1
----> 1 model = LLM(\"meta-llama/Llama-3.1-70B-Instruct\")
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/utils.py:1028, in deprecate_args.<locals>.wrapper.<locals>.inner(*args, **kwargs)
1021 msg += f\" {additional_message}\"
1023 warnings.warn(
1024 DeprecationWarning(msg),
1025 stacklevel=3, # The inner function takes up one level
1026 )
-> 1028 return fn(*args, **kwargs)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/entrypoints/llm.py:210, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_overrides, mm_processor_kwargs, task, override_pooler_config, **kwargs)
207 self.engine_class = self.get_engine_class()
209 # TODO(rob): enable mp by default (issue with fork vs spawn)
--> 210 self.llm_engine = self.engine_class.from_engine_args(
211 engine_args, usage_context=UsageContext.LLM_CLASS)
213 self.request_counter = Counter()
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/engine/llm_engine.py:582, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
580 \"\"\"Creates an LLM engine from the engine arguments.\"\"\"
581 # Create the engine configs.
--> 582 engine_config = engine_args.create_engine_config()
583 executor_class = cls._get_executor_cls(engine_config)
584 # Create the LLM engine.
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/engine/arg_utils.py:959, in EngineArgs.create_engine_config(self)
954 assert self.cpu_offload_gb >= 0, (
955 \"CPU offload space must be non-negative\"
956 f\", but got {self.cpu_offload_gb}\")
958 device_config = DeviceConfig(device=self.device)
--> 959 model_config = self.create_model_config()
961 if model_config.is_multimodal_model:
962 if self.enable_prefix_caching:
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/engine/arg_utils.py:891, in EngineArgs.create_model_config(self)
890 def create_model_config(self) -> ModelConfig:
--> 891 return ModelConfig(
892 model=self.model,
893 task=self.task,
894 # We know this is not None because we set it in __post_init__
895 tokenizer=cast(str, self.tokenizer),
896 tokenizer_mode=self.tokenizer_mode,
897 chat_template_text_format=self.chat_template_text_format,
898 trust_remote_code=self.trust_remote_code,
899 allowed_local_media_path=self.allowed_local_media_path,
900 dtype=self.dtype,
901 seed=self.seed,
902 revision=self.revision,
903 code_revision=self.code_revision,
904 rope_scaling=self.rope_scaling,
905 rope_theta=self.rope_theta,
906 hf_overrides=self.hf_overrides,
907 tokenizer_revision=self.tokenizer_revision,
908 max_model_len=self.max_model_len,
909 quantization=self.quantization,
910 quantization_param_path=self.quantization_param_path,
911 enforce_eager=self.enforce_eager,
912 max_seq_len_to_capture=self.max_seq_len_to_capture,
913 max_logprobs=self.max_logprobs,
914 disable_sliding_window=self.disable_sliding_window,
915 skip_tokenizer_init=self.skip_tokenizer_init,
916 served_model_name=self.served_model_name,
917 limit_mm_per_prompt=self.limit_mm_per_prompt,
918 use_async_output_proc=not self.disable_async_output_proc,
919 config_format=self.config_format,
920 mm_processor_kwargs=self.mm_processor_kwargs,
921 override_neuron_config=self.override_neuron_config,
922 override_pooler_config=self.override_pooler_config,
923 )
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/config.py:214, in ModelConfig.__init__(self, model, task, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, allowed_local_media_path, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, config_format, chat_template_text_format, hf_overrides, mm_processor_kwargs, override_neuron_config, override_pooler_config)
211 self.hf_config = hf_config
213 self.hf_text_config = get_hf_text_config(self.hf_config)
--> 214 self.encoder_config = self._get_encoder_config()
215 self.hf_image_processor_config = get_hf_image_processor_config(
216 self.model, revision)
217 self.dtype = _get_and_verify_dtype(self.hf_text_config, dtype)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/config.py:287, in ModelConfig._get_encoder_config(self)
286 def _get_encoder_config(self):
--> 287 return get_sentence_transformer_tokenizer_config(
288 self.model, self.revision)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/transformers_utils/config.py:383, in get_sentence_transformer_tokenizer_config(model, revision, token)
359 \"\"\"
360 Returns the tokenization configuration dictionary for a
361 given Sentence Transformer BERT model.
(...)
372 for the Sentence Transformer BERT model.
373 \"\"\"
374 for config_name in [
375 \"sentence_bert_config.json\",
376 \"sentence_roberta_config.json\",
(...)
381 \"sentence_xlnet_config.json\",
382 ]:
--> 383 encoder_dict = get_hf_file_to_dict(config_name, model, revision, token)
384 if encoder_dict:
385 break
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/transformers_utils/config.py:263, in get_hf_file_to_dict(file_name, model, revision, token)
247 \"\"\"
248 Downloads a file from the Hugging Face Hub and returns
249 its contents as a dictionary.
(...)
259 the contents of the downloaded file.
260 \"\"\"
261 file_path = Path(model) / file_name
--> 263 if file_or_path_exists(model=model,
264 config_name=file_name,
265 revision=revision,
266 token=token):
268 if not file_path.is_file():
269 try:
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/transformers_utils/config.py:91, in file_or_path_exists(model, config_name, revision, token)
88 # NB: file_exists will only check for the existence of the config file on
89 # hf_hub. This will fail in offline mode.
90 try:
---> 91 return file_exists(model, config_name, revision=revision, token=token)
92 except huggingface_hub.errors.OfflineModeIsEnabled:
93 # Don't raise in offline mode, all we know is that we don't have this
94 # file cached.
95 return False
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:114, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
111 if check_use_auth_token:
112 kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 114 return fn(*args, **kwargs)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/hf_api.py:2907, in HfApi.file_exists(self, repo_id, filename, repo_type, revision, token)
2905 if token is None:
2906 token = self.token
-> 2907 get_hf_file_metadata(url, token=token)
2908 return True
2909 except GatedRepoError: # raise specifically on gated repo
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:114, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
111 if check_use_auth_token:
112 kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 114 return fn(*args, **kwargs)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:1296, in get_hf_file_metadata(url, token, proxies, timeout, library_name, library_version, user_agent, headers)
1293 headers[\"Accept-Encoding\"] = \"identity\" # prevent any compression => we want to know the real size of the file
1295 # Retrieve metadata
-> 1296 r = _request_wrapper(
1297 method=\"HEAD\",
1298 url=url,
1299 headers=headers,
1300 allow_redirects=False,
1301 follow_relative_redirects=True,
1302 proxies=proxies,
1303 timeout=timeout,
1304 )
1305 hf_raise_for_status(r)
1307 # Return
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:277, in _request_wrapper(method, url, follow_relative_redirects, **params)
275 # Recursively follow relative redirects
276 if follow_relative_redirects:
--> 277 response = _request_wrapper(
278 method=method,
279 url=url,
280 follow_relative_redirects=False,
281 **params,
282 )
284 # If redirection, we redirect only relative paths.
285 # This is useful in case of a renamed repository.
286 if 300 <= response.status_code <= 399:
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:301, in _request_wrapper(method, url, follow_relative_redirects, **params)
299 # Perform request and return if status_code is not in the retry list.
300 response = get_session().request(method=method, url=url, **params)
--> 301 hf_raise_for_status(response)
302 return response
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/utils/_http.py:477, in hf_raise_for_status(response, endpoint_name)
473 raise _format(HfHubHTTPError, message, response) from e
475 # Convert `HTTPError` into a `HfHubHTTPError` to display request information
476 # as well (request id and/or server error message)
--> 477 raise _format(HfHubHTTPError, str(e), response) from e
HfHubHTTPError: 404 Client Error: Not Found for url: https://hfproxy/meta-llama/Llama-3.1-70B-Instruct/resolve/main/sentence_bert_config.json"
}
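For anyone hitting this behind a Hub mirror/proxy: the chain above shows that ModelConfig._get_encoder_config unconditionally probes the Hub for sentence_bert_config.json (and its siblings) through file_or_path_exists, which already swallows OfflineModeIsEnabled but lets a generic HfHubHTTPError propagate. A possible stopgap, assuming only the call signature visible in the traceback above (a sketch, not an official fix), is to monkeypatch the probe to treat any Hub HTTP error as "file absent":

from huggingface_hub.utils import HfHubHTTPError

import vllm.transformers_utils.config as vllm_hf_config

# Stopgap sketch (not an official fix): tolerate proxies that answer a
# missing optional config file with a plain 404 instead of the Hub's
# EntryNotFoundError. Apply before constructing LLM().
_orig_file_or_path_exists = vllm_hf_config.file_or_path_exists

def _tolerant_file_or_path_exists(model, config_name, revision, token=None):
    try:
        return _orig_file_or_path_exists(model=model,
                                         config_name=config_name,
                                         revision=revision,
                                         token=token)
    except HfHubHTTPError:
        # The optional sentence_*_config.json is simply absent; report
        # that instead of propagating the proxy's 404.
        return False

vllm_hf_config.file_or_path_exists = _tolerant_file_or_path_exists

Alternatively, if the model is already in the local HF cache, setting HF_HUB_OFFLINE=1 sidesteps the probe entirely: as the file_or_path_exists source quoted in the traceback shows, OfflineModeIsEnabled is caught and treated as "file not cached". A proper upstream fix would presumably either gate _get_encoder_config on the task/model type or catch HfHubHTTPError alongside OfflineModeIsEnabled.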
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.