[Bug]: The final reason why you will get a model that cannot stop generation when you fine-tune Qwen2.5-7b-base with LoRA and a non-<|endoftext|> token as eos_token. #1064
Comments
"<|im_start|>" and "<|im_end|>" are indeed not trained for the base models in the Qwen2.5 series.
I understand that you "make sure your LoRA will not fine-tune the lm_head and embedding". So, the weights of those untrained tokens stay exactly as they were initialized.
This is true. "<|endoftext|>" is trained. See also https://qwen.readthedocs.io/en/latest/getting_started/concepts.html#control-tokens-chat-template. If you must keep the lm_head and embedding from being finetuned, you can keep "<|endoftext|>" as the eos_token.
However, the best way is still to train the embedding and the lm_head of the new tokens.
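As a rough sketch of that suggestion, assuming the PEFT library on top of the Hugging Face model (the hub id, rank, and target modules below are illustrative choices, not values taken from this thread):

```python
# Sketch: LoRA fine-tuning that also trains full copies of the embedding and lm_head,
# so new special tokens such as <|im_end|> get their own learned weights.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # assumed hub id

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Keep full, trainable copies of these modules alongside the LoRA adapters.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

With modules_to_save, PEFT stores complete trained copies of embed_tokens and lm_head together with the adapter, which is what lets the eos token's lm_head row diverge from the other untrained special tokens.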
@jklj077 Thanks for the patient explanation. It took me quite a lot of time to figure out this problem. Maybe it should not be called a "BUG"; I wrote it down in case others encounter the same problem. BTW, if all embedding and lm_head weights were initialized from a normal distribution, every token would have a distinct embedding and lm_head weight (which I think should be the case), and the phenomenon I described would not happen, even for tokens that were not trained in the pretraining stage.
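As an illustration of that remark, a minimal sketch of what re-initializing the untrained special-token rows from a normal distribution before fine-tuning could look like (the token list and the use of config.initializer_range are assumptions, not something verified in this thread):

```python
# Sketch: give untrained special tokens distinct, normally distributed rows in the
# embedding and lm_head before fine-tuning, so their logits are no longer tied.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B"  # assumed hub id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

std = model.config.initializer_range  # 0.02 for Qwen2-style configs
for token in ["<|im_start|>", "<|im_end|>"]:  # illustrative token list
    i = tok.convert_tokens_to_ids(token)
    torch.nn.init.normal_(model.model.embed_tokens.weight.data[i], mean=0.0, std=std)
    torch.nn.init.normal_(model.lm_head.weight.data[i], mean=0.0, std=std)
```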
For a multi-turn Q&A dataset trained with LoRA, will setting eos_token to <|endoftext|> cause any problems?
Setting it to <|endoftext|> should not cause any problems.
I looked at this doc: https://qwen.readthedocs.io/zh-cn/latest/getting_started/concepts.html#control-tokens-chat-template. There, <|endoftext|> is placed at the end of the last turn of a multi-turn conversation, which seems to conflict with that, but I have not verified it experimentally.
Regarding "If I change the eos_token back to <|endoftext|>, the model will have the right behavior": where did you make this adjustment? Before LoRA fine-tuning, or at generation time after LoRA fine-tuning?
I fine-tuned the base model, so the chat template is not involved.
The change was made in tokenizer_config.json before fine-tuning.
Aren't you fine-tuning with the llama-factory training framework?
No, I used the huggingface trainer.
Model Series
Qwen2.5
What are the models used?
Qwen2.5-7b-base
What is the scenario where the problem happened?
sft with huggingface trainer
Is this a known issue?
Information about environment
Doesn't matter.
Log output
Doesn't matter.
Description
Steps to reproduce
Change the eos_token from <|endoftext|> to <|im_end|> (or any other special token) in the tokenizer_config.json, then fine-tune Qwen2.5-7b-base with LoRA while keeping the lm_head and embedding frozen.
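For reference, a sketch of that change done at load time; passing eos_token to AutoTokenizer.from_pretrained has roughly the same effect as editing the "eos_token" field in tokenizer_config.json (the hub id is assumed):

```python
# Sketch: override eos_token at load time instead of editing tokenizer_config.json.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B", eos_token="<|im_end|>")
print(tokenizer.eos_token, tokenizer.eos_token_id)  # expected: <|im_end|> 151645
```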
Expected results
Expected: The fine-tuned model can produce text as presented in the training data.
Happened: The model does generate appropriate text, but it cannot stop generating.
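A sketch of how this shows up at generation time (the checkpoint path and the prompt are hypothetical placeholders; assumes the LoRA weights have been merged into or loaded with the checkpoint):

```python
# Sketch: check whether the fine-tuned model emits its eos token and stops.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/finetuned-checkpoint"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16)

inputs = tokenizer("Question: what is LoRA?\nAnswer:", return_tensors="pt")  # placeholder prompt
out = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,  # 151645 when eos_token is <|im_end|>
)
# Observed in this issue: the continuation looks right, but generation runs to
# max_new_tokens instead of stopping at <|im_end|>; with <|endoftext|> it stops.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```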
Attempts to fix
Final reason
If I change the eos_token back to <|endoftext|>, the model will have the right behavior. After careful inspection, I found the reason: Qwen2.5-7b-base has the same weights in lm_head and embedding for additional tokens like <|im_end|>, <|object_ref_start|>, and so on, with the sole exception of <|endoftext|> (see the comparison sketch below).

Here, 151643 is the id of <|endoftext|>; 142333 is a random id I picked (maybe there are more ids like this); the other ids are defined in the tokenizer_config.json. This explains why the fine-tuned model cannot stop generation: although <|im_end|> (151645) is trained, its weight in lm_head is identical to that of many other ids, so at the step where it should be generated its logits also equal those of the other ids, and any of them can be picked during decoding. In fact, I did observe the logits for 151645 increase when it should be generated, but the same happens for all the ids that share the same lm_head weights. This does not happen for 151643, because it has a different lm_head weight; maybe it was trained during the pre-training stage. I am quite confused why this happens, since all lm_head and embedding weights are initialized from a normal distribution according to the released code.
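A minimal sketch of the kind of weight comparison described above (the hub id and dtype are assumptions; the module paths model.model.embed_tokens and model.lm_head are the ones used by Qwen2-style checkpoints in transformers):

```python
# Sketch: compare lm_head and embedding rows of special tokens against <|im_end|>.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B"  # assumed hub id of the base model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

lm_head = model.lm_head.weight            # [vocab_size, hidden_size]
embed = model.model.embed_tokens.weight   # [vocab_size, hidden_size]

ids = {t: tok.convert_tokens_to_ids(t)
       for t in ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|object_ref_start|>"]}
ids["random untrained id"] = 142333  # the arbitrary id mentioned above

ref = ids["<|im_end|>"]
for t, i in ids.items():
    print(f"{t} ({i}): "
          f"lm_head row equals <|im_end|>'s: {torch.equal(lm_head[ref], lm_head[i])}, "
          f"embedding row equals <|im_end|>'s: {torch.equal(embed[ref], embed[i])}")
```

If the rows really are identical, the logits of these tokens can never differ, which matches the decoding behavior described above.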