[Bug]: The final reason why you will get a model that cannot stop generation when you fine-tune the Qwen2.5-7b-base use Lora and a non-<|endoftext|> token as eos_token. #1064

Open
4 tasks done
hxs91 opened this issue Nov 8, 2024 · 9 comments

Comments


hxs91 commented Nov 8, 2024

Model Series

Qwen2.5

What are the models used?

Qwen2.5-7b-base

What is the scenario where the problem happened?

sft with huggingface trainer

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

Doesn't matter.

Log output

Doesn't matter.

Description

Steps to reproduce

  1. Use Qwen2.5-7b-base.
  2. Modify its eos_token from <|endoftext|> to <|im_end|> (or any other special token) in tokenizer_config.json.
  3. Use LoRA to fine-tune the model on your downstream task; make sure your LoRA will not fine-tune the lm_head and embedding.
  4. The fine-tuning data follow the input, instruction, output format; I concatenate them as input + instruction + output + eos_token, and the labels are built the same way (a sketch follows this list).
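
A minimal sketch of step 4 (the paths and field names are assumptions, not the exact script used here): the key point is that the eos token id is appended to both input_ids and labels so the model is supervised to emit it.

from transformers import AutoTokenizer

# Assumes tokenizer_config.json has already been edited so that eos_token is <|im_end|>.
tokenizer = AutoTokenizer.from_pretrained("path/to/Qwen2.5-7B-with-modified-eos")

def build_features(example):
    # example is a hypothetical dict with "input", "instruction", and "output" fields.
    prompt_ids = tokenizer(example["input"] + example["instruction"], add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(example["output"], add_special_tokens=False)["input_ids"]
    # Append the eos token explicitly so it appears in both the inputs and the labels.
    input_ids = prompt_ids + output_ids + [tokenizer.eos_token_id]
    labels = list(input_ids)  # masking the prompt part with -100 is a common variant
    return {"input_ids": input_ids, "labels": labels}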

Expected results

Expected: the fine-tuned model produces text in the format presented in the training data.
What happened: the model does generate appropriate text, but it cannot stop generation.

Attempts to fix

  1. Checked that the generate function receives the right eos_token_id.
  2. Checked that the training procedure gets the right input_ids and labels, and that the eos_token_id is included in the trained labels (see the sketch below).
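
A minimal sketch of these two checks, assuming the model and tokenizer are already loaded and reusing the hypothetical build_features helper from the sketch above:

# 1) The generate call should see the intended eos_token_id.
print("tokenizer eos:", tokenizer.eos_token, tokenizer.eos_token_id)
print("generation_config eos:", model.generation_config.eos_token_id)

# 2) The eos id should actually appear at the end of the supervised sequence.
features = build_features(example)
assert features["input_ids"][-1] == tokenizer.eos_token_id
assert features["labels"][-1] == tokenizer.eos_token_id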

Final reason

If I change the eos_token back to <|endoftext|>, the model will have the right behavior.

After careful inspection, I found the reason: Qwen2.5-7b-base has identical lm_head weights (and effectively untrained embedding weights) for the additional tokens such as <|im_end|>, <|object_ref_start|>, and so on, except for <|endoftext|>.

print("lm_head")
print("151643:"+str(model.lm_head.weight[151643]))
print("151644:"+str(model.lm_head.weight[151644]))
print("151645:"+str(model.lm_head.weight[151645]))
print("151646:"+str(model.lm_head.weight[151646]))
print("151647:"+str(model.lm_head.weight[151647]))
print("151648:"+str(model.lm_head.weight[151648]))
print("151649:"+str(model.lm_head.weight[151649]))
print("151650:"+str(model.lm_head.weight[151650]))
print("142333:"+str(model.lm_head.weight[142333]))
print("embedding")
print("151643:"+str(model.get_input_embeddings().weight[151643]))
print("151644:"+str(model.get_input_embeddings().weight[151644]))
print("151645:"+str(model.get_input_embeddings().weight[151645]))
print("151646:"+str(model.get_input_embeddings().weight[151646]))
print("151647:"+str(model.get_input_embeddings().weight[151647]))
print("151648:"+str(model.get_input_embeddings().weight[151648]))
print("151649:"+str(model.get_input_embeddings().weight[151649]))
print("151650:"+str(model.get_input_embeddings().weight[151650]))
print("142333:"+str(model.get_input_embeddings().weight[142333]))

The output:

qwen2_base_7b
lm_head
151643:tensor([-0.0025, -0.0061, -0.0063,  ..., -0.0042, -0.0118,  0.0019],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151644:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151645:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151646:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151647:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151648:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151649:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151650:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
142333:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
embedding
151643:tensor([-0.0186,  0.0347,  0.0092,  ...,  0.0040, -0.0077,  0.0006],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151644:tensor([ 1.1755e-37, -1.1755e-37,  1.1755e-37,  ...,  1.1755e-37,
        -1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151645:tensor([-1.1755e-37, -1.1755e-37,  1.1755e-37,  ...,  1.1755e-37,
        -1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151646:tensor([-1.1755e-37, -1.1755e-37,  1.1755e-37,  ...,  1.1755e-37,
         1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151647:tensor([ 1.1755e-37,  1.1755e-37, -1.1755e-37,  ...,  1.1755e-37,
         1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151648:tensor([ 1.1755e-37, -1.1755e-37,  1.1755e-37,  ...,  1.1755e-37,
        -1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151649:tensor([ 1.1755e-37, -1.1755e-37,  1.1755e-37,  ..., -1.1755e-37,
         1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151650:tensor([-1.1755e-37, -1.1755e-37,  1.1755e-37,  ..., -1.1755e-37,
         1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
142333:tensor([ 1.1755e-37, -1.1755e-37, -1.1755e-37,  ...,  1.1755e-37,
         1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)

151643 is the id of <|endoftext|>, and 142333 is a random id I picked (there may be more ids like this); the other ids are defined in tokenizer_config.json. This explains why the fine-tuned model cannot stop generation: although <|im_end|> (151645) is trained, its weight in lm_head is identical to that of many other ids, so when it should be generated at inference time its logit equals the logits of all those ids, and any of them can be picked during decoding. In fact, I observe that the logit for 151645 does increase when it should be generated, but so do the logits of the ids sharing the same lm_head weight. This does not happen for 151643, because it has a distinct lm_head weight; presumably it was trained during the pre-training stage.
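
A minimal sketch (assuming torch and the model loaded as above) that checks the tie programmatically instead of eyeballing the truncated printout:

import torch

W = model.lm_head.weight
E = model.get_input_embeddings().weight

# <|im_end|> (151645) shares its lm_head row with, e.g., <|im_start|> (151644) and the random id 142333 ...
print(torch.equal(W[151645], W[151644]), torch.equal(W[151645], W[142333]))
# ... while <|endoftext|> (151643) has a distinct row.
print(torch.equal(W[151645], W[151643]))

# The untrained embedding rows sit at a tiny magnitude (around 1e-37), unlike <|endoftext|>.
print(E[151645].abs().max().item(), E[151643].abs().max().item())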

I am quite confused about why this happens, since all lm_head and embedding weights should have been initialized from a normal distribution according to the released code.

@hxs91 hxs91 changed the title [Bug]: The final reason why you will get a model that cannot stop generation when you fine-tune the Qwen2.5-7b-base use Lora and a non-<|endoftext|> token as bos_token. [Bug]: The final reason why you will get a model that cannot stop generation when you fine-tune the Qwen2.5-7b-base use Lora and a non-<|endoftext|> token as eos_token. Nov 8, 2024
jklj077 (Collaborator) commented Nov 14, 2024

"<|im_start|>" and "<|im_end|>" are indeed not trained for the base models in the Qwen2.5 series.

This explains why the fine-tuned model cannot stop generation: although <|im_end|> (151645) is trained, ...

I understand that you "make sure your LoRA will not fine-tune the lm_head and embedding". So,

  • it is trained only in the sense that the parameters of the LoRA adapters do receive gradients and are updated in such a way that "<|im_end|>" becomes more likely to be generated; but
  • it is not trained in the sense that the parameters of the embedding or the LM head for "<|im_end|>" are updated, so "<|im_end|>" still cannot be distinguished from other untrained tokens.

This does not happen for 151643, because it has a distinct lm_head weight; presumably it was trained during the pre-training stage.

This is true. "<|endoftext|>" is trained. See also https://qwen.readthedocs.io/en/latest/getting_started/concepts.html#control-tokens-chat-template.

If you must keep the lm_head and embedding from being fine-tuned, you can

  • regard all untrained tokens as stopping tokens, e.g., by setting eos_token_id (which accepts a list) in generation_config.json for transformers; or
  • randomly re-initialize the embedding and lm_head rows for "<|im_start|>" and "<|im_end|>" after the model is loaded from the checkpoint (a sketch of these first two options follows this list); or
  • use a different chat template that does not rely on special tokens and set stopping criteria accordingly.
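
A minimal sketch of the first two options, assuming a transformers model and already-tokenized inputs; the token ids and the init std are illustrative assumptions:

import torch

# (a) Treat the untrained special tokens as additional stop tokens at generation time.
out = model.generate(
    **inputs,                                # tokenized prompt, assumed to be defined
    max_new_tokens=256,
    eos_token_id=[151643, 151644, 151645],   # <|endoftext|>, <|im_start|>, <|im_end|>
)

# (b) Re-initialize the embedding and lm_head rows of <|im_start|>/<|im_end|> before
#     fine-tuning so they are no longer indistinguishable from other untrained rows.
std = model.config.initializer_range         # assumption: reuse the config's init std
with torch.no_grad():
    for tid in (151644, 151645):
        model.get_input_embeddings().weight[tid].normal_(mean=0.0, std=std)
        model.lm_head.weight[tid].normal_(mean=0.0, std=std)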

However, the best way is still to train the embedding and the lm_head of the new tokens.
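
For that recommendation, a minimal sketch with PEFT; the target_modules and hyperparameters are assumptions, not the setup used in this issue:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Fully train (and save) the output head and the input embeddings so the new
    # eos token gets its own trained, distinguishable weights.
    modules_to_save=["lm_head", "embed_tokens"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()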


hxs91 commented Nov 14, 2024

@jklj077 Thanks for the patient explanation. It took me quite a lot of time to figure out this problem. Maybe it should not be called a "bug"; I wrote it down in case others encounter the same problem.

BTW, if all embedding and lm_head weights were initialized from a normal distribution, every token would have different embedding and lm_head weights, which is what I expected; then the phenomenon I described would not happen, even for tokens that were not trained in the pre-training stage.


echoht commented Nov 15, 2024

For a multi-turn QA dataset trained with LoRA, will there be any problem with setting eos_token to <|endoftext|>?


hxs91 commented Nov 15, 2024

For a multi-turn QA dataset trained with LoRA, will there be any problem with setting eos_token to <|endoftext|>?

Setting it to <|endoftext|> should not cause any problem.


echoht commented Nov 15, 2024

I read this doc: https://qwen.readthedocs.io/zh-cn/latest/getting_started/concepts.html#control-tokens-chat-template. There, <|endoftext|> is placed at the end of the last turn of a multi-turn conversation, which seems to conflict with this, but I have not verified it experimentally.


echoht commented Nov 15, 2024

If I change the eos_token back to <|endoftext|>, the model will have the right behavior.

Where did you make this change? Before LoRA fine-tuning, or at generation time after LoRA fine-tuning?


hxs91 commented Nov 15, 2024

I read this doc: https://qwen.readthedocs.io/zh-cn/latest/getting_started/concepts.html#control-tokens-chat-template. There, <|endoftext|> is placed at the end of the last turn of a multi-turn conversation, which seems to conflict with this, but I have not verified it experimentally.

I fine-tuned the base model, so this has nothing to do with the chat template.

If I change the eos_token back to <|endoftext|>, the model will have the right behavior.

Where did you make this change? Before LoRA fine-tuning, or at generation time after LoRA fine-tuning?

In tokenizer_config.json, before fine-tuning.


echoht commented Nov 15, 2024

In tokenizer_config.json, before fine-tuning.

Aren't you fine-tuning with the llama-factory training framework?


hxs91 commented Nov 15, 2024

Aren't you fine-tuning with the llama-factory training framework?

No, I used the huggingface trainer.
