
[Bugfix] Hermes tool parser fails to check for & handle None values in some cases #9908

Closed

Conversation

K-Mistele
Contributor

This PR patches a bug in the Hermes tool parser where certain edge conditions (which are not fully understood, as I have not been able to replicate them) can result in values being set to None, which causes downstream errors because those values are not checked for None before use.

While I have not been able to replicate the issue, @ankush13r was kind enough to point me at the specific lines causing the issue. Appropriate checks & handling have been added.
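For reference, the change amounts to adding None checks before those values are parsed or indexed. A simplified illustration of the pattern (not the literal diff; the real parser works on partial JSON while streaming, so plain json.loads stands in here):

# Simplified illustration of the None-guard pattern added in this PR.
# Not the actual parser code: the real implementation streams partial JSON;
# json.loads stands in for that here.
import json
from typing import Optional

def parse_portion(tool_call_portion: Optional[str]) -> Optional[dict]:
    if tool_call_portion is None:
        # Previously this case could fall through and later code would call
        # .get() on None, producing the downstream errors described above.
        return None
    try:
        return json.loads(tool_call_portion)
    except json.JSONDecodeError:
        # Not enough of the stream has arrived yet to form valid JSON.
        return None

print(parse_portion(None))                       # None, no crash
print(parse_portion('{"name": "add_numbers"'))   # None (still partial)
print(parse_portion('{"name": "add_numbers"}'))  # {'name': 'add_numbers'}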

cc @DarkLight1337 @githebs @ankush13r

FIX #9693
FIX #9874

Recommended debugging process for @githebs @ankush13r to determine if this fixes the behavior you are experiencing:

  1. Copy the entire updated HermesToolParser from this PR onto your local disk.

  2. Change the plugin name in the ToolParserManager decorator from this:

    @ToolParserManager.register_module("hermes")

    to this:

    @ToolParserManager.register_module("hermes_patched")

    This prevents the patched version that you'll be testing from conflicting with the parser that vLLM ships with.

  3. Run vLLM with the following flags to enable tools (--tool-call-parser uses the renamed parser from step 2; --chat-template can be omitted if you're using your model's default chat template):

        --enable-auto-tool-choice \
        --tool-parser-plugin /path/to/plugin/file.py \
        --tool-call-parser hermes_patched \
        --chat-template <your chat template>
  4. Try to reproduce the issue however you encountered it (a minimal example request is sketched after this list). If the issue no longer occurs, then this PR fixes
    [Bug]: Function calling with stream vs without stream, arguments=None when stream option is enabled #9693
    and [Bug]: Function calling with Qwen & Streaming ('NoneType' object has no attribute 'get') #9874.
    If you still encounter the same issue, then there is a different problem.
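For step 4, one way to exercise streaming tool calls is a raw request like the sketch below (hypothetical host, port, and model name; adjust to your deployment):

# Sketch of a streaming tool-call request against the server from step 3.
# Host, port, and model name are assumptions -- adjust to your deployment.
import json
import httpx

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Add 2 and 3 using the tool."}],
    "stream": True,
    "tools": [{
        "type": "function",
        "function": {
            "name": "add_numbers",
            "description": "Add two numbers together",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                "required": ["a", "b"],
            },
        },
    }],
}

with httpx.stream("POST", "http://localhost:8000/v1/chat/completions",
                  json=payload, timeout=60.0) as response:
    for line in response.iter_lines():
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line.removeprefix("data: "))
        if chunk.get("choices"):
            # With the fix, tool-call deltas should stream without errors.
            print(chunk["choices"][0]["delta"])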

Once this has been tested by @ankush13r and @githebs and they confirm that it fixes their issues, I will move this from a draft to an open PR.


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behaviors of vLLM. This helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


github-actions bot commented Nov 1, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@githebs

githebs commented Nov 1, 2024

Hello @K-Mistele, thanks for the ping on the two issues #9693 and #9674.

Here is what I tried.

The patch did not fix the issue; with streaming, tool calls still do not work. Here is the new error:

vllm  | INFO 11-01 10:47:43 engine.py:290] Added request chat-0c85bd4ed3ac4c84ae7df36d59ee63bc.
vllm  | INFO 11-01 10:47:43 metrics.py:349] Avg prompt throughput: 47.6 tokens/s, Avg generation throughput: 1.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%.
vllm  | Error trying to handle streaming tool call.
vllm  | Traceback (most recent call last):
vllm  |   File "/tools/hermes_tool_parser.py", line 219, in extract_tool_calls_streaming
vllm  |     flags) if tool_call_portion else None
vllm  |               ^^^^^^^^^^^^^^^^^
vllm  | UnboundLocalError: cannot access local variable 'tool_call_portion' where it is not associated with a value

I confirm that normal chat works and that tools without streaming work.
I can jump on the vLLM Discord to reproduce the bug on screenshare if needed.

@K-Mistele
Contributor Author

Thanks for this, @githebs - I will take another crack at it and see if I can resolve that issue; failing that, reproducing on screenshare would be awesome. I'm beginning to suspect this is related to the Qwen models doing something slightly different than Hermes; I need to dig into their chat template a little more deeply. The couple of open bug issues I've seen about this all involve Qwen models rather than Hermes. I'll follow up as soon as I have an update :)
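In the meantime, the UnboundLocalError in your traceback usually means tool_call_portion is only assigned on some branches before it is read; the usual fix is to bind the name unconditionally before branching. A minimal sketch of that pattern (illustrative only, not the actual extract_tool_calls_streaming code):

# Minimal sketch of the UnboundLocalError fix pattern -- illustrative only,
# not the actual extract_tool_calls_streaming implementation.
import json
from typing import Optional

def extract_portion(delta_text: str, in_tool_call: bool) -> Optional[dict]:
    # Bind the name on every path so the read below can never raise
    # UnboundLocalError when in_tool_call is False.
    tool_call_portion = None
    if in_tool_call:
        tool_call_portion = delta_text
    try:
        return json.loads(tool_call_portion) if tool_call_portion else None
    except json.JSONDecodeError:
        return None

print(extract_portion('{"a": 1}', True))   # {'a': 1}
print(extract_portion("ignored", False))   # None -- previously the unbound path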

@githebs

githebs commented Nov 1, 2024

Just in case it helps, I've included a draft of a FastAPI endpoint for testing vLLM with Qwen; it's what led me to discover the issue.

I might be totally wrong in the implementation

import logging
import json
from datetime import datetime
from typing import Dict, Any, List, Optional, AsyncGenerator
import httpx
from fastapi import APIRouter, HTTPException, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from pathlib import Path

# Ensure logs directory exists
Path("logs").mkdir(exist_ok=True)

# API Constants
VLLM_API_BASE = "http://192.168.1.100:9000/v1/chat/completions"
DEFAULT_MODEL = "/models/Qwen-Qwen2.5-7B-Instruct"

# Define tools at module level
DEFAULT_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_public_ip",
            "description": "Get the public IP address of the server",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": []
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "add_numbers",
            "description": "Add two numbers together",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "number", "description": "First number"},
                    "b": {"type": "number", "description": "Second number"}
                },
                "required": ["a", "b"]
            }
        }
    }
]

router = APIRouter()

class ChatRequest(BaseModel):
    messages: List[Dict[str, str]] = [
        {"role": "assistant", "content": "You are a useful assistant."},
        {"role": "user", "content": "hello"}
    ]
    temperature: Optional[float] = 0.7
    tools: Optional[List[Dict]] = DEFAULT_TOOLS

class RequestLogger:
    def __init__(self):
        self.start_time = datetime.now()
        
        # Create a unique log file for this request using detailed timestamp
        log_dir = Path("logs")
        log_dir.mkdir(exist_ok=True)
        
        # Format: chat_YYYYMMDD_HHMMSS_microseconds.log
        self.request_id = self.start_time.strftime("%Y%m%d_%H%M%S_%f")
        self.file_handler = logging.FileHandler(
            f'logs/chat_{self.request_id}.log'
        )
        
        # Configure formatter
        formatter = logging.Formatter(
            '%(asctime)s - %(levelname)s - %(message)s'
        )
        self.file_handler.setFormatter(formatter)
        
        # Create a logger specific to this request
        self.logger = logging.getLogger(f'chat.{self.request_id}')
        self.logger.setLevel(logging.INFO)
        self.logger.addHandler(self.file_handler)
        
        # Prevent propagation to root logger
        self.logger.propagate = False

    def __call__(self, msg: str, data: Any = None, level: str = 'info'):
        elapsed = (datetime.now() - self.start_time).total_seconds()
        
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'request_id': self.request_id,
            'elapsed_seconds': round(elapsed, 3),
            'message': msg,
            'data': data
        }
        
        getattr(self.logger, level)(
            json.dumps(log_entry, ensure_ascii=False)
        )

    def __del__(self):
        # Clean up handlers when the logger is destroyed
        if hasattr(self, 'file_handler'):
            self.file_handler.close()
            self.logger.removeHandler(self.file_handler)

def add_numbers(a: float, b: float) -> float:
    """Simple addition function."""
    return a + b

def get_public_ip() -> str:
    """Get the public IP address using ifconfig.me."""
    try:
        response = httpx.get("https://ifconfig.me", timeout=5)
        return response.text.strip()
    except httpx.RequestError as e:
        return f"Failed to get IP: {str(e)}"

async def stream_response(payload: dict, log: RequestLogger) -> AsyncGenerator[str, None]:
    """Handle streaming response from vLLM."""
    async with httpx.AsyncClient() as client:
        try:
            async with client.stream(
                'POST',
                VLLM_API_BASE,
                json=payload,
                headers={"Content-Type": "application/json"},
                timeout=30.0
            ) as response:
                if response.status_code != 200:
                    error_msg = f"vLLM API error: {response.status_code}"
                    log(error_msg, level='error')
                    yield f"data: {json.dumps({'error': error_msg})}\n\n"
                    return

                async for line in response.aiter_lines():
                    if not line or not line.startswith('data: '):
                        continue
                        
                    line = line.removeprefix('data: ')
                    if line.strip() == '[DONE]':
                        log("Stream completed")
                        yield 'data: [DONE]\n\n'
                        break
                    
                    try:
                        parsed = json.loads(line)
                        log("Streaming chunk", parsed)

                        # Handle tool calls in streaming response
                        if 'choices' in parsed and parsed['choices']:
                            choice = parsed['choices'][0]
                            if 'delta' in choice and 'tool_calls' in choice['delta']:
                                tool_call = choice['delta']['tool_calls'][0]
                                
                                if ('function' in tool_call and 
                                    'name' in tool_call['function'] and 
                                    'arguments' in tool_call['function']):
                                    
                                    func_name = tool_call['function']['name']
                                    args = json.loads(tool_call['function']['arguments'])
                                    
                                    if func_name == 'add_numbers':
                                        result = add_numbers(args['a'], args['b'])
                                        yield f'data: {json.dumps({"choices": [{"delta": {"content": str(result)}}]})}\n\n'
                                        continue

                        yield f'data: {line}\n\n'
                    except json.JSONDecodeError as e:
                        log(f"Failed to parse streaming response: {str(e)}", level='error')
                        continue

        except httpx.RequestError as e:
            error_msg = f"Streaming request failed: {str(e)}"
            log(error_msg, level='error')
            yield f"data: {json.dumps({'error': error_msg})}\n\n"
        
    log("Stream connection closed")

@router.post("/chat/completions")
async def chat_completion(request: ChatRequest, raw_request: Request):
    log = RequestLogger()
    
    # Parse raw request to check for streaming
    request_body = await raw_request.json()
    is_streaming = request_body.get('stream', False)
    
    # Use default tools if none provided
    tools = request.tools if request.tools is not None else DEFAULT_TOOLS
    
    log("Received chat completion request", {
        "messages": request.messages,
        "temperature": request.temperature,
        "tools": tools,
        "stream": is_streaming
    })

    try:
        payload = {
            "model": DEFAULT_MODEL,
            "messages": request.messages,
            "temperature": request.temperature,
            "stream": is_streaming,
            "tools": tools
        }

        log("Sending request to vLLM", payload)
        
        if is_streaming:
            return StreamingResponse(
                stream_response(payload, log),
                media_type='text/event-stream'
            )

        # Non-streaming response handling
        async with httpx.AsyncClient() as client:
            response = await client.post(
                VLLM_API_BASE,
                headers={"Content-Type": "application/json"},
                json=payload,
                timeout=30.0
            )

            if not response.content:
                log("Empty response from vLLM API", level='error')
                raise HTTPException(status_code=500, detail="Empty response from API")

            result = response.json()
            log("Received response from vLLM", result)

            # Handle tool calls if present
            if "choices" in result and result["choices"] and "message" in result["choices"][0]:
                message = result["choices"][0]["message"]
                if "tool_calls" in message and message["tool_calls"]:
                    log("Processing tool calls", message["tool_calls"])
                    
                    # Add assistant's tool call message to conversation
                    messages = request.messages + [{
                        "role": "assistant",
                        "content": None,
                        "tool_calls": message["tool_calls"]
                    }]

                    for tool_call in message["tool_calls"]:
                        try:
                            func_name = tool_call["function"]["name"]
                            args = json.loads(tool_call["function"]["arguments"])
                            
                            if func_name == "get_public_ip":
                                tool_result = get_public_ip()
                            elif func_name == "add_numbers":
                                tool_result = add_numbers(args["a"], args["b"])
                            else:
                                raise ValueError(f"Unknown tool: {func_name}")
                            
                            log(f"Tool execution result for {func_name}", tool_result)
                            
                            # Add tool result to messages
                            messages.append({
                                "role": "tool",
                                "tool_call_id": tool_call["id"],
                                "name": func_name,
                                "content": str(tool_result)
                            })

                        except Exception as e:
                            error_msg = f"Tool execution failed: {str(e)}"
                            log(error_msg, level='error')
                            raise HTTPException(status_code=500, detail=error_msg)

                    # Make another call to vLLM with tool results
                    final_payload = {
                        "model": DEFAULT_MODEL,
                        "messages": messages,
                        "temperature": request.temperature,
                        "stream": is_streaming,
                        "tools": tools
                    }
                    
                    final_response = await client.post(
                        VLLM_API_BASE,
                        headers={"Content-Type": "application/json"},
                        json=final_payload,
                        timeout=30.0
                    )
                    
                    result = final_response.json()
                    log("Received final response from vLLM", result)

            return result

    except httpx.RequestError as e:
        error_msg = f"Request to vLLM failed: {str(e)}"
        log(error_msg, level='error')
        raise HTTPException(status_code=500, detail=error_msg)
    except Exception as e:
        error_msg = f"Unexpected error: {str(e)}"
        log(error_msg, level='error')
        raise HTTPException(status_code=500, detail=error_msg)
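For reference, a client request against this endpoint looks like the sketch below (hypothetical host/port and mount path for the router; adjust to wherever the app is served):

# Hypothetical client for the endpoint above. Host, port, and router mount
# path are assumptions -- adjust to wherever the FastAPI app is served.
import httpx

payload = {
    "messages": [
        {"role": "user", "content": "What is this server's public IP address?"}
    ],
    "stream": True,  # read by the endpoint from the raw request body
}

with httpx.stream("POST", "http://localhost:8080/chat/completions",
                  json=payload, timeout=60.0) as response:
    for line in response.iter_lines():
        if line:
            print(line)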

@K-Mistele
Contributor Author

Out of curiosity, are you trying to get the model to call the function with an empty arguments object? I wonder if that's the issue; the parser may not handle that case properly due to how the streaming extraction works. If so, can you try getting it to call a version of the function with non-empty arguments and see if that resolves the issue? I will dig in more in the meantime.
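For context, the shape I mean is a streamed delta along these lines (illustrative values only, following the OpenAI streaming format):

# Illustrative streamed chunk for a tool call whose function takes no
# arguments (OpenAI-style shape; values are made up).
chunk = {
    "choices": [{
        "delta": {
            "tool_calls": [{
                "index": 0,
                "id": "chatcmpl-tool-123",
                "type": "function",
                # With the reported bug, "arguments" can arrive as None instead
                # of "{}" or "", which breaks clients that call json.loads on it.
                "function": {"name": "get_public_ip", "arguments": ""},
            }]
        }
    }]
}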

@K-Mistele
Contributor Author

cc @frei-x

#9874 (comment)

@githebs

githebs commented Nov 15, 2024

I re-implemented this from scratch, with and without the openai package, and I can confirm that all parser versions fail if the tool doesn't have arguments.

I can reproduce the bug every time, and it goes away (with streaming) if arguments are present.
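A minimal reproduction looks like the sketch below (hypothetical endpoint and model name; the key detail is that the only tool offered takes no arguments):

# Sketch of the failing case: streaming enabled and the only tool offered
# takes no arguments. Endpoint and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is the server's public IP?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_public_ip",
            "description": "Get the public IP address of the server",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    }],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.tool_calls:
        print(chunk.choices[0].delta.tool_calls[0])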

@K-Mistele
Contributor Author

I re-implemented this from scratch, with and without the openai package, and I can confirm that all parser versions fail if the tool doesn't have arguments.

I can reproduce the bug every time, and it goes away (with streaming) if arguments are present.

OK, check out #10398 and see if it fixes this for you.
