Add a use_parallel_residual argument to control the way the residual is computed #18695
Conversation
Thanks @NinedayWang! If someone could review this, that would be great. This PR will also allow loading our PolyCoder model in Transformers.
Thanks a lot for the PR @NinedayWang, I'm however not 100% sure we want this, as we generally don't try to make Transformers models very configurable.
Hi @patrickvonplaten! I appreciate your perspective, but I think in this case supporting the variation is warranted. The default in nearly all training configurations in the NeoX toolkit is to have this flag set to `False`.
Sorry, I don't fully follow here:
Ah, yes, let me clarify. GPT-NeoX is a toolkit that can be (and is actively) used to train GPT-style models. It supports a broad range of model sizes, and has a few other hyper-parameters to vary the architecture in other ways, like that residual flag.

Now, NeoX-20B is a specific, 20B-parameter model trained with this toolkit. It largely uses the same configuration that other models trained with GPT-NeoX would, with the notable exception of the aforementioned residual flag: that flag is set to `True` for NeoX-20B, while it defaults to `False` in the toolkit.

So for HuggingFace/transformers to support most other models trained with the NeoX toolkit, including PolyCoder, we could either add multiple other model implementations, or make this one flag configurable in the existing one.

Hope this clarifies things!
Hey @VHellendoorn,

Thanks for clarifying! Putting @LysandreJik and @sgugger in cc here.

Given the "single-file" policy of Transformers (see post here), I think we would indeed prefer to add a new modeling file for this architecture.

We're definitely more than happy though to help get PolyCoder added to Transformers (cc @lvwerra as well).
Thanks, yes that would work for us too. The reason we can't load PolyCoder with that architecture file is precisely because of this residual flag.

-Vincent
As @patrickvonplaten mentioned, Transformers is not a modular toolkit. It's therefore not surprising that one toolkit class such as GPT-NeoX in EleutherAI is split into several different classes in Transformers (exactly like BART from fairseq is split into multiple classes here).
Thanks for your reply @patrickvonplaten @sgugger. Let me explain. GPT-NeoX supports two different ways of computing the residual, controlled by the `gpt_j_residual` flag; the two variants are sketched below.
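As an illustration (not the exact code from the PR), here is a minimal, self-contained PyTorch sketch of the two residual schemes; the `nn.Linear` sub-layers stand in for the real attention and MLP blocks, and only the residual wiring is the point:

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Minimal pre-LN transformer block illustrating the two residual schemes."""

    def __init__(self, hidden_size: int, use_parallel_residual: bool = True):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_size)
        self.ln2 = nn.LayerNorm(hidden_size)
        # Stand-ins for the real attention / MLP sub-layers.
        self.attn = nn.Linear(hidden_size, hidden_size)
        self.mlp = nn.Linear(hidden_size, hidden_size)
        self.use_parallel_residual = use_parallel_residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_parallel_residual:
            # "parallel" residual (gpt_j_residual=True, used by gpt-neox-20b):
            # x = x + attn(ln1(x)) + mlp(ln2(x))
            return x + self.attn(self.ln1(x)) + self.mlp(self.ln2(x))
        # sequential residual (the NeoX toolkit default, used by PolyCoder):
        # x = x + attn(ln1(x)); x = x + mlp(ln2(x))
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))


x = torch.randn(1, 8, 32)
print(ToyBlock(32, use_parallel_residual=True)(x).shape)   # torch.Size([1, 8, 32])
print(ToyBlock(32, use_parallel_residual=False)(x).shape)  # torch.Size([1, 8, 32])
```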
As @VHellendoorn said, gpt-neox-20b is a special case that sets `gpt_j_residual=True`.

I've read about the "single-file" policy, but I think GPT-NeoX is a bit special. If we load a model that was trained with the toolkit's default (non-parallel) residual using the current GPT-NeoX modeling file, the forward pass will not match the training-time computation and the outputs will be wrong.
Hey @NinedayWang,

Thanks for the explanation. Sorry, some more questions to clarify:

Why is it called `gpt_j_residual`?

If we have half of the gpt-neox checkpoints using one residual architecture and gpt-neox-20b another architecture, I'm actually not too against trying to fit it in one file.
Thanks a lot! The name `gpt_j_residual` comes from GPT-J, which computes the attention and MLP branches in parallel and adds both to the residual in one step.
Is this essentially the "parallel" residual computation that allows the model to be tensor-parallelized better (especially for TPUs), e.g. the same architecture that was used in PaLM: https://arxiv.org/abs/2204.02311?
We can make an exception for the same family of checkpoints indeed. There is something similar in BLOOM.

However, the parameter should be better named (`gpt_j_residual` will not evoke anything to a user) and needs to be documented.
    use_cache=use_cache,
    output_attentions=output_attentions,
)
attn_output = attention_layer_outputs[0]  # output_attn: a, present, (attentions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe leave everything until this line outside of the if block? Duplicating this code doesn't serve any purpose here.
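For illustration, the suggested restructuring would look roughly like the following sketch; the attribute names (`attention`, `input_layernorm`, `post_attention_layernorm`, `mlp`) and the exact argument list are assumed from the surrounding modeling code and may differ from the actual commit:

```python
# Run attention once, outside the branch (shared by both residual schemes).
attention_layer_outputs = self.attention(
    self.input_layernorm(hidden_states),
    attention_mask=attention_mask,
    head_mask=head_mask,
    layer_past=layer_past,
    use_cache=use_cache,
    output_attentions=output_attentions,
)
attn_output = attention_layer_outputs[0]  # output_attn: a, present, (attentions)
outputs = attention_layer_outputs[1:]

if self.use_parallel_residual:
    # pseudocode: x = x + attn(ln1(x)) + mlp(ln2(x))
    mlp_output = self.mlp(self.post_attention_layernorm(hidden_states))
    hidden_states = mlp_output + attn_output + hidden_states
else:
    # pseudocode: x = x + attn(ln1(x)); x = x + mlp(ln2(x))
    attn_output = attn_output + hidden_states
    mlp_output = self.mlp(self.post_attention_layernorm(attn_output))
    hidden_states = mlp_output + attn_output
```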
Thanks for your review! I fixed it in 70aaec5
@@ -99,6 +99,7 @@ def __init__(
    bos_token_id=0,
    eos_token_id=2,
    tie_word_embeddings=False,
    gpt_j_residual=False,
This name is not informative at all. Reading the code, it's more an `add_residual_at_end` or something along those lines. The new parameter will also require documentation.
I renamed it to `use_parallel_residual` and set the default value to `True` (4c12b69) so that the released "gpt-neox-20b" doesn't need a config file change.

Yes, it's the same "parallel" architecture as PaLM, which provides faster training speed when training large-scale models.
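As a usage sketch of the final parameter (assuming it is exposed on `GPTNeoXConfig` as described above; the tiny sizes below are only there to keep the example cheap to run):

```python
from transformers import GPTNeoXConfig, GPTNeoXForCausalLM

# A toy config; use_parallel_residual is the only flag of interest here.
config = GPTNeoXConfig(
    vocab_size=1024,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=256,
    use_parallel_residual=False,  # NeoX-toolkit default / PolyCoder-style residual
)

# Build a randomly initialized model from the config.
model = GPTNeoXForCausalLM(config)
print(model.config.use_parallel_residual)  # False
```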
@@ -66,6 +66,9 @@ class GPTNeoXConfig(PretrainedConfig):
    use_cache (`bool`, *optional*, defaults to `True`):
        Whether or not the model should return the last key/values attentions (not used by all models). Only
        relevant if `config.is_decoder=True`.
    use_parallel_residual (`bool`, *optional*, defaults to `True`):
like the name!
hidden_states = mlp_output + attn_output + residual
if self.use_parallel_residual:
    # pseudocode:
    # x = x + attn(ln1(x)) + mlp(ln2(x))
very nice comments!
Looks good to me
What does this PR do?
Add a `gpt_j_residual` argument to control the way the residual is computed. The default value is `False`, which is consistent with https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/transformer.py#L592. This makes it easier to convert models trained with gpt-neox into the Hugging Face format.
Who can review?
Anyone in the community is free to review the PR once the tests have passed.
@LysandreJik @patrickvonplaten