Describe the bug
Using DeepSpeed inference with tensor parallelism and ZeRO optimization for the NVLM model shows slower performance than the HuggingFace baseline implementation.
Also, no matter whether `tp_size` is set to 2, 4, or 6, the inference time with DeepSpeed stays the same.
I ask the same 6 questions (I have 6 A100 GPUs) and verify that I get the same answers, but the run is slower, even though DeepSpeed can answer the 6 questions in parallel with `deepspeed --num_gpus=6 main.py`.
[Screenshot: inference times; left is the baseline, right is DeepSpeed]
DeepSpeed's GPU utilization is very high (around 80% on average) and it uses more memory (figure below).
[Screenshot: DeepSpeed GPU utilization and memory usage]
[Screenshot: baseline GPU utilization and memory usage]
To Reproduce
- Run an identical inference workload (6 questions) on both implementations and check that the answers match (a comparison sketch is shown after this list)
- Tensor parallelism and ZeRO optimization in the DeepSpeed version (`deepspeed --num_gpus=6 main.py`)
- Default HuggingFace implementation as the baseline
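To confirm that both runs return the same answers, a minimal comparison sketch like the one below can be used. The file names `baseline_responses.json` and `deepspeed_responses.json` are hypothetical, assuming each run dumps its six answers to a JSON list; this is not part of the reported scripts.

```python
import json

# Hypothetical output files: each run is assumed to dump its six answers as a JSON list.
with open("baseline_responses.json") as f:
    baseline_answers = json.load(f)
with open("deepspeed_responses.json") as f:
    deepspeed_answers = json.load(f)

# Compare answer by answer across the six questions.
for i, (a, b) in enumerate(zip(baseline_answers, deepspeed_answers)):
    status = "MATCH" if a.strip() == b.strip() else "DIFFER"
    print(f"question {i}: {status}")
```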
Expected behavior
DeepSpeed with tensor parallelism and ZeRO optimization should show faster inference times than the HuggingFace baseline.
deepspeed code
```python
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM, AutoModel
from transformers.integrations import HfDeepSpeedConfig
import deepspeed
import os
import torch
import time
import torch.distributed as dist
from collections import defaultdict

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "nvidia/NVLM-D-72B"

ds_config = {
    # "replace_with_kernel_inject": True,
    "bf16": {
        "enabled": True
    },
    "tensor_parallel": {
        "tp_size": 6
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "steps_per_print": 2000,
    # "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# The next line instructs transformers to partition the model directly over multiple GPUs using
# deepspeed.zero.Init when the model's `from_pretrained` method is called.
#
# **It has to be run before loading the model with AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# Otherwise the model will first be loaded normally and only partitioned at forward time, which is
# less efficient and may fail when there is little CPU RAM.
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()

# initialise DeepSpeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

prompts = [
    "Is this review positive or negative? Review: The customer service was absolutely terrible and I'll never shop here again",
    "Is this review positive or negative? Review: Great product, exactly what I needed and arrived on time",
    "Is this review positive or negative? Review: Don't waste your money, it broke after two uses",
    "Is this review positive or negative? Review: Amazing quality and the price can't be beat, highly recommend",
    "Is this review positive or negative? Review: Mediocre at best, there are better options out there",
    "Is this review positive or negative? Review: Beyond disappointed with this purchase, complete garbage"
]

rank = torch.distributed.get_rank()
text_in = prompts[rank]  # each rank answers one of the six questions
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Synchronize before starting timing
torch.distributed.barrier()
start_time = time.time()

response, _ = ds_engine.module.chat(
    tokenizer,
    None,
    text_in,
    {"max_new_tokens": 1024},
    history=None,
    return_history=True
)

if rank == 0:
    end_time = time.time()
    total_time = end_time - start_time

# Gather responses from all ranks
all_responses = [None] * world_size
dist.all_gather_object(all_responses, response)

# Calculate on rank 0
if rank == 0:
    # Calculate total characters generated
    total_chars = sum(len(resp) for resp in all_responses)
    throughput = total_chars / total_time
    print(f"\nTotal characters generated: {total_chars}")
    print(f"Total time taken: {total_time:.2f} seconds")
    print(f"Throughput: {throughput:.2f} characters/second")

    # Print individual responses
    for i, resp in enumerate(all_responses):
        print(f"\nGPU {i} response ({len(resp)} chars):\n{resp}")
```
baseline code template
```python
import torch
from transformers import AutoTokenizer, AutoModel
import math
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode


def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map


path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()
print(model)

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=False)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
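```

For an apples-to-apples comparison with the DeepSpeed script, the baseline can be timed on the same six review prompts. A minimal sketch, reusing `model`, `tokenizer`, `generation_config`, and the prompt list from the scripts above, might look like this (the baseline is a single process, so the questions run sequentially):

```python
import time

prompts = [
    "Is this review positive or negative? Review: The customer service was absolutely terrible and I'll never shop here again",
    "Is this review positive or negative? Review: Great product, exactly what I needed and arrived on time",
    "Is this review positive or negative? Review: Don't waste your money, it broke after two uses",
    "Is this review positive or negative? Review: Amazing quality and the price can't be beat, highly recommend",
    "Is this review positive or negative? Review: Mediocre at best, there are better options out there",
    "Is this review positive or negative? Review: Beyond disappointed with this purchase, complete garbage",
]

# Time all six questions back to back on the pipeline-sharded baseline model.
start_time = time.time()
responses = []
for question in prompts:
    response, _ = model.chat(tokenizer, None, question, generation_config,
                             history=None, return_history=True)
    responses.append(response)
total_time = time.time() - start_time

total_chars = sum(len(r) for r in responses)
print(f"Total characters generated: {total_chars}")
print(f"Total time taken: {total_time:.2f} seconds")
print(f"Throughput: {total_chars / total_time:.2f} characters/second")
```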
System info (please complete the following information):
- OS: Ubuntu 22.04.3 LTS
- GPU count and types: one node with 6x A100s