Allow multiple inference engines in single script #4384

Merged
5 commits merged into master from mrwyattii/allow-multiple-inf-engines
Sep 22, 2023

Conversation

mrwyattii
Contributor

The InferenceEngine relies on some global values and fails to clean up its workspaces, so running multiple models via DeepSpeed-Inference in a single script (see example below) produces errors like this:

RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

This PR adds a destroy() method that will be automatically called on subsequent invocations of deepspeed.init_inference().

example.py:

import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

model1 = "bigscience/bloom-560m"
task1 = "text-generation"
pipe1 = pipeline(task1, model1, torch_dtype=torch.float16, device=local_rank)
pipe1.model = deepspeed.init_inference(
    pipe1.model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    mp_size=world_size,
)
print(model1, pipe1("test one two"))

# Initializing a second engine previously failed; with this PR, the first
# engine's destroy() is called automatically by deepspeed.init_inference().
model2 = "gpt2"
task2 = "text-generation"
pipe2 = pipeline(task2, model2, torch_dtype=torch.float16, device=local_rank)
pipe2.model = deepspeed.init_inference(
    pipe2.model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    mp_size=world_size,
)
print(model2, pipe2("test one two"))

Co-authored-by: Jeff Rasley [email protected]

@mrwyattii mrwyattii removed this pull request from the merge queue due to a manual request Sep 22, 2023
@mrwyattii mrwyattii enabled auto-merge September 22, 2023 22:09
@mrwyattii mrwyattii disabled auto-merge September 22, 2023 22:56
@mrwyattii mrwyattii merged commit 4c35880 into master Sep 22, 2023
16 checks passed
@mrwyattii mrwyattii deleted the mrwyattii/allow-multiple-inf-engines branch September 22, 2023 23:11
CurryRice233 pushed a commit to CurryRice233/DeepSpeed that referenced this pull request Sep 28, 2023
* origin/master:
  Allow multiple inference engines in single script (microsoft#4384)
  adds triton flash attention2 kernel (microsoft#4337)
  Fix llama meta tensor loading in AutoTP and kernel injected inference (microsoft#3608)
  Fix min torch version (microsoft#4375)
  Fix multinode runner to properly append to PDSH_SSH_ARGS_APPEND (microsoft#4373)
  add the missing method (microsoft#4363)
  Openfold fix (microsoft#4368)
  deepspeed4science japanese blog (microsoft#4369)
  deepspeed4science chinese blog (microsoft#4366)
  Enable workflow dispatch on Torch 1.10 CI tests (microsoft#4361)
  Update conda env to have max pydantic version (microsoft#4362)
  add deepspeed4science blog link (microsoft#4364)
  added check to avoid undefined behavior when the input_id length is greater than max_tokens (microsoft#4349)
  Add the policy to run llama model from the official repo (microsoft#4313)
  fix deepspeed4science links (microsoft#4358)
  DeepSpeed4Science (microsoft#4357)
  Support InternLM (microsoft#4137)
  Pass base_dir to model files can be loaded for auto-tp/meta-tensor. (microsoft#4348)