Freeing VRAM after timeout #413

Open
ei23fxg opened this issue Nov 30, 2024 · 2 comments

Labels
enhancement New feature or request

@ei23fxg (Contributor)

ei23fxg commented Nov 30, 2024

I'm using this great program as a Docker container, and I'm wondering if it would be possible to unload the Whisper model from VRAM after a certain idle time, so that it doesn't permanently occupy VRAM.
Maybe some kind of timeout parameter would be useful: set it to 60 seconds, for example, and after 60 seconds of standby the model is removed from VRAM again.

Restarting the container is currently my workaround.

  whisper-webui:
    build:
      context: ./volumes/Whisper-WebUI/.
      dockerfile: dockerfile
    image: jhj0517/whisper-webui:latest
    container_name: Whisper-WebUI
    restart: "no"
    volumes:
      - ./volumes/Whisper-WebUI/models:/Whisper-WebUI/models
      - ./volumes/Whisper-WebUI/outputs:/Whisper-WebUI/outputs
      - ./volumes/Whisper-WebUI/configs:/Whisper-WebUI/configs
    ports:
      - "7860:7860"
    stdin_open: true
    tty: true
    entrypoint: ["python", "app.py", "--server_port", "7860", "--whisper_type", "insanely-fast-whisper", "--api_open", "True", "--server_name", "0.0.0.0"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]

Maybe something like this:

import gc
import torch

class App:
    def __init__(self, args):
        self.args = args
        # ... (rest of your __init__ code)
        self.whisper_inf = None  # Initialize as None

    def load_model(self):
        if self.whisper_inf is None:
            self.whisper_inf = WhisperFactory.create_whisper_inference(
                # ... your model loading parameters
            )

    def unload_model(self):
        if self.whisper_inf is not None:
            del self.whisper_inf
            self.whisper_inf = None
            gc.collect()  # Force garbage collection
            torch.cuda.empty_cache()  # Empty CUDA cache

As mentioned, this could be triggered by a timeout, or by an API call / web request / button for unloading.
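
For the timeout variant, a rough sketch of what an inactivity watcher could look like on top of the load_model()/unload_model() methods above (the IdleUnloader class, the touch() hook, and the one-second polling interval are just illustrative names, not existing Whisper-WebUI code):

import threading
import time

class IdleUnloader:
    """Unload the Whisper model after a period of inactivity (illustrative sketch)."""

    def __init__(self, app, timeout=60):
        self.app = app                    # object exposing unload_model() and whisper_inf
        self.timeout = timeout            # seconds of idle time before unloading
        self.last_used = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def touch(self):
        """Call at the start of every transcription request."""
        self.last_used = time.monotonic()

    def _watch(self):
        # Poll once per second; unload once the idle time exceeds the timeout.
        while not self._stop.wait(1.0):
            idle = time.monotonic() - self.last_used
            if self.app.whisper_inf is not None and idle > self.timeout:
                self.app.unload_model()

    def stop(self):
        self._stop.set()

The app would create it once, e.g. self.unloader = IdleUnloader(self, timeout=60), and each request handler would call self.unloader.touch() followed by self.load_model() before running inference.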

@ei23fxg ei23fxg added the enhancement New feature or request label Nov 30, 2024
@jhj0517 (Owner)

jhj0517 commented Dec 1, 2024

Hi, thanks for your suggestion!

Similar to #398, it seems a lot more people are asking for this feature than I thought.

Probably running a dedicated background thread for it, like in #398, would be a good way to approach this?
And I'm planning to add an offload() function to each inferencer, as I did in MusicSeparator():

def offload(self):
    """Offload the model and free up the memory"""

@ei23fxg (Contributor, Author)

ei23fxg commented Dec 1, 2024

> Hi, thanks for your suggestion!
>
> Similar to #398, it seems a lot more people are asking for this feature than I thought.
>
> Probably running a dedicated background thread for it, like in #398, would be a good way to approach this? And I'm planning to add an offload() function to each inferencer, as I did in MusicSeparator():
>
>     def offload(self):
>         """Offload the model and free up the memory"""

Sounds great!
BTW, I made a Python voice keyboard (currently only tested on Debian) based on this tool.
It runs pretty fast, even through the Gradio API. It already feels like mic streaming.
It's not public at the moment, but let me know if you're interested or if there's anything else I can help with.
