Your current environment
The output of `python collect_env.py`:
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.11.11 | packaged by conda-forge | (main, Dec 5 2024, 14:17:24) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.149-99.162.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 1
Stepping: 0
BogoMIPS: 5599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 24 MiB (48 instances)
L3 cache: 192 MiB (12 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.47.0
[pip3] triton==3.1.0
[conda] libmagma 2.8.0 h0af6554_0 conda-forge
[conda] libmagma_sparse 2.8.0 h0af6554_0 conda-forge
[conda] libtorch 2.4.1 cuda120_h1d34654_302 conda-forge
[conda] mkl 2023.2.0 h84fe81f_50496 conda-forge
[conda] nccl 2.23.4.1 h2b5d15b_3 conda-forge
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] pyzmq 26.2.0 py311h7deb3e3_3 conda-forge
[conda] torch 2.5.1 pypi_0 pypi
[conda] torchvision 0.20.1 pypi_0 pypi
[conda] transformers 4.47.0 pyhd8ed1ab_0 conda-forge
[conda] triton 3.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-95 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=GPU-4d7c9f05-a7a3-237a-0be1-bc1c9c8219d2
CUDA_MODULE_LOADING=LAZY
Model Input Dumps
No response
🐛 Describe the bug
This PR (https://github.com/vllm-project/vllm/pull/9506/files) introduced default loading of the sentence-transformer config whenever the model config is created, regardless of the actual model type (sentence transformer vs. regular causal LM).
Minimal reproduction:

from vllm import LLM

# Not a sentence-transformer model, yet vLLM still probes the Hub for
# sentence_bert_config.json during ModelConfig initialization.
model = LLM("meta-llama/Llama-3.1-70B-Instruct")

This fails with the following error:
{
"name": "HfHubHTTPError",
"message": "404 Client Error: Not Found for url: https://hfproxy/meta-llama/Llama-3.1-70B-Instruct/resolve/main/sentence_bert_config.json",
"stack": "---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/utils/_http.py:406, in hf_raise_for_status(response, endpoint_name)
405 try:
--> 406 response.raise_for_status()
407 except HTTPError as e:
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/requests/models.py:1024, in Response.raise_for_status(self)
1023 if http_error_msg:
-> 1024 raise HTTPError(http_error_msg, response=self)
HTTPError: 404 Client Error: Not Found for url: https://hfproxy/meta-llama/Llama-3.1-70B-Instruct/resolve/main/sentence_bert_config.json
The above exception was the direct cause of the following exception:
HfHubHTTPError Traceback (most recent call last)
Cell In[4], line 1
----> 1 model = LLM(\"meta-llama/Llama-3.1-70B-Instruct\")
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/utils.py:1028, in deprecate_args.<locals>.wrapper.<locals>.inner(*args, **kwargs)
1021 msg += f\" {additional_message}\"
1023 warnings.warn(
1024 DeprecationWarning(msg),
1025 stacklevel=3, # The inner function takes up one level
1026 )
-> 1028 return fn(*args, **kwargs)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/entrypoints/llm.py:210, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_overrides, mm_processor_kwargs, task, override_pooler_config, **kwargs)
207 self.engine_class = self.get_engine_class()
209 # TODO(rob): enable mp by default (issue with fork vs spawn)
--> 210 self.llm_engine = self.engine_class.from_engine_args(
211 engine_args, usage_context=UsageContext.LLM_CLASS)
213 self.request_counter = Counter()
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/engine/llm_engine.py:582, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
580 \"\"\"Creates an LLM engine from the engine arguments.\"\"\"
581 # Create the engine configs.
--> 582 engine_config = engine_args.create_engine_config()
583 executor_class = cls._get_executor_cls(engine_config)
584 # Create the LLM engine.
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/engine/arg_utils.py:959, in EngineArgs.create_engine_config(self)
954 assert self.cpu_offload_gb >= 0, (
955 \"CPU offload space must be non-negative\"
956 f\", but got {self.cpu_offload_gb}\")
958 device_config = DeviceConfig(device=self.device)
--> 959 model_config = self.create_model_config()
961 if model_config.is_multimodal_model:
962 if self.enable_prefix_caching:
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/engine/arg_utils.py:891, in EngineArgs.create_model_config(self)
890 def create_model_config(self) -> ModelConfig:
--> 891 return ModelConfig(
892 model=self.model,
893 task=self.task,
894 # We know this is not None because we set it in __post_init__
895 tokenizer=cast(str, self.tokenizer),
896 tokenizer_mode=self.tokenizer_mode,
897 chat_template_text_format=self.chat_template_text_format,
898 trust_remote_code=self.trust_remote_code,
899 allowed_local_media_path=self.allowed_local_media_path,
900 dtype=self.dtype,
901 seed=self.seed,
902 revision=self.revision,
903 code_revision=self.code_revision,
904 rope_scaling=self.rope_scaling,
905 rope_theta=self.rope_theta,
906 hf_overrides=self.hf_overrides,
907 tokenizer_revision=self.tokenizer_revision,
908 max_model_len=self.max_model_len,
909 quantization=self.quantization,
910 quantization_param_path=self.quantization_param_path,
911 enforce_eager=self.enforce_eager,
912 max_seq_len_to_capture=self.max_seq_len_to_capture,
913 max_logprobs=self.max_logprobs,
914 disable_sliding_window=self.disable_sliding_window,
915 skip_tokenizer_init=self.skip_tokenizer_init,
916 served_model_name=self.served_model_name,
917 limit_mm_per_prompt=self.limit_mm_per_prompt,
918 use_async_output_proc=not self.disable_async_output_proc,
919 config_format=self.config_format,
920 mm_processor_kwargs=self.mm_processor_kwargs,
921 override_neuron_config=self.override_neuron_config,
922 override_pooler_config=self.override_pooler_config,
923 )
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/config.py:214, in ModelConfig.__init__(self, model, task, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, allowed_local_media_path, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, config_format, chat_template_text_format, hf_overrides, mm_processor_kwargs, override_neuron_config, override_pooler_config)
211 self.hf_config = hf_config
213 self.hf_text_config = get_hf_text_config(self.hf_config)
--> 214 self.encoder_config = self._get_encoder_config()
215 self.hf_image_processor_config = get_hf_image_processor_config(
216 self.model, revision)
217 self.dtype = _get_and_verify_dtype(self.hf_text_config, dtype)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/config.py:287, in ModelConfig._get_encoder_config(self)
286 def _get_encoder_config(self):
--> 287 return get_sentence_transformer_tokenizer_config(
288 self.model, self.revision)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/transformers_utils/config.py:383, in get_sentence_transformer_tokenizer_config(model, revision, token)
359 \"\"\"
360 Returns the tokenization configuration dictionary for a
361 given Sentence Transformer BERT model.
(...)
372 for the Sentence Transformer BERT model.
373 \"\"\"
374 for config_name in [
375 \"sentence_bert_config.json\",
376 \"sentence_roberta_config.json\",
(...)
381 \"sentence_xlnet_config.json\",
382 ]:
--> 383 encoder_dict = get_hf_file_to_dict(config_name, model, revision, token)
384 if encoder_dict:
385 break
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/transformers_utils/config.py:263, in get_hf_file_to_dict(file_name, model, revision, token)
247 \"\"\"
248 Downloads a file from the Hugging Face Hub and returns
249 its contents as a dictionary.
(...)
259 the contents of the downloaded file.
260 \"\"\"
261 file_path = Path(model) / file_name
--> 263 if file_or_path_exists(model=model,
264 config_name=file_name,
265 revision=revision,
266 token=token):
268 if not file_path.is_file():
269 try:
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/vllm/transformers_utils/config.py:91, in file_or_path_exists(model, config_name, revision, token)
88 # NB: file_exists will only check for the existence of the config file on
89 # hf_hub. This will fail in offline mode.
90 try:
---> 91 return file_exists(model, config_name, revision=revision, token=token)
92 except huggingface_hub.errors.OfflineModeIsEnabled:
93 # Don't raise in offline mode, all we know is that we don't have this
94 # file cached.
95 return False
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:114, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
111 if check_use_auth_token:
112 kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 114 return fn(*args, **kwargs)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/hf_api.py:2907, in HfApi.file_exists(self, repo_id, filename, repo_type, revision, token)
2905 if token is None:
2906 token = self.token
-> 2907 get_hf_file_metadata(url, token=token)
2908 return True
2909 except GatedRepoError: # raise specifically on gated repo
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:114, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
111 if check_use_auth_token:
112 kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 114 return fn(*args, **kwargs)
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:1296, in get_hf_file_metadata(url, token, proxies, timeout, library_name, library_version, user_agent, headers)
1293 headers[\"Accept-Encoding\"] = \"identity\" # prevent any compression => we want to know the real size of the file
1295 # Retrieve metadata
-> 1296 r = _request_wrapper(
1297 method=\"HEAD\",
1298 url=url,
1299 headers=headers,
1300 allow_redirects=False,
1301 follow_relative_redirects=True,
1302 proxies=proxies,
1303 timeout=timeout,
1304 )
1305 hf_raise_for_status(r)
1307 # Return
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:277, in _request_wrapper(method, url, follow_relative_redirects, **params)
275 # Recursively follow relative redirects
276 if follow_relative_redirects:
--> 277 response = _request_wrapper(
278 method=method,
279 url=url,
280 follow_relative_redirects=False,
281 **params,
282 )
284 # If redirection, we redirect only relative paths.
285 # This is useful in case of a renamed repository.
286 if 300 <= response.status_code <= 399:
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:301, in _request_wrapper(method, url, follow_relative_redirects, **params)
299 # Perform request and return if status_code is not in the retry list.
300 response = get_session().request(method=method, url=url, **params)
--> 301 hf_raise_for_status(response)
302 return response
File ~/.airconda-environments/production--ml_infra--ray--vllm--v0.0.9/lib/python3.11/site-packages/huggingface_hub/utils/_http.py:477, in hf_raise_for_status(response, endpoint_name)
473 raise _format(HfHubHTTPError, message, response) from e
475 # Convert `HTTPError` into a `HfHubHTTPError` to display request information
476 # as well (request id and/or server error message)
--> 477 raise _format(HfHubHTTPError, str(e), response) from e
HfHubHTTPError: 404 Client Error: Not Found for url: https://hfproxy/meta-llama/Llama-3.1-70B-Instruct/resolve/main/sentence_bert_config.json"
}
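For anyone hitting this behind a Hub mirror/proxy: the chain above shows that ModelConfig._get_encoder_config unconditionally probes the Hub for sentence_bert_config.json (and its siblings) through file_or_path_exists, which already swallows OfflineModeIsEnabled but lets a generic HfHubHTTPError propagate. A possible stopgap, assuming only the call signature visible in the traceback above (a sketch, not an official fix), is to monkeypatch the probe to treat any Hub HTTP error as "file absent":

from huggingface_hub.utils import HfHubHTTPError

import vllm.transformers_utils.config as vllm_hf_config

# Stopgap sketch (not an official fix): tolerate proxies that answer a
# missing optional config file with a plain 404 instead of the Hub's
# EntryNotFoundError. Apply before constructing LLM().
_orig_file_or_path_exists = vllm_hf_config.file_or_path_exists

def _tolerant_file_or_path_exists(model, config_name, revision, token=None):
    try:
        return _orig_file_or_path_exists(model=model,
                                         config_name=config_name,
                                         revision=revision,
                                         token=token)
    except HfHubHTTPError:
        # The optional sentence_*_config.json is simply absent; report
        # that instead of propagating the proxy's 404.
        return False

vllm_hf_config.file_or_path_exists = _tolerant_file_or_path_exists

Alternatively, if the model is already in the local HF cache, setting HF_HUB_OFFLINE=1 sidesteps the probe entirely: as the file_or_path_exists source quoted in the traceback shows, OfflineModeIsEnabled is caught and treated as "file not cached". A proper upstream fix would presumably either gate _get_encoder_config on the task/model type or catch HfHubHTTPError alongside OfflineModeIsEnabled.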
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.