
Question: Shrinking Tokenizer Vocabulary for Reduced Memory Consumption with Pre-Trained Model (LLaMA) Fine-Tuning #1686

Amerehei opened this issue Nov 23, 2024 · 0 comments

@Amerehei

Hi,

I am working on fine-tuning a LLaMA model and want to reduce the tokenizer vocabulary size to cut memory consumption. Specifically, I would like to:

Retain special tokens, English characters, symbols, and numbers.

Remove tokens related to other languages (as I don’t need them).

My questions are:

  1. Is it feasible to shrink the tokenizer vocabulary in this way and still use a pre-trained model for fine-tuning without affecting its performance significantly?

  2. What are the recommended approaches or tools for modifying the tokenizer vocabulary in such cases?

  3. Are there any caveats I should be aware of when performing this adjustment (e.g., issues with token embeddings or alignment with the pre-trained model)?

  4. Is it a good idea at all to reduce the vocabulary size? Can it meaningfully reduce memory consumption and make generation faster?

Any guidance or references to similar implementations would be greatly appreciated.
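For concreteness, here is a rough sketch of the kind of approach I have been considering (assuming the Hugging Face `transformers` stack; the checkpoint name and the `keep_token` filter are only placeholders, and this only handles the embedding side, not rebuilding the tokenizer itself):

```python
# Rough sketch, not a tested recipe: keep special tokens plus ASCII-only tokens,
# then shrink the model's input embeddings and LM head to match.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "meta-llama/Llama-2-7b-hf"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def keep_token(token: str) -> bool:
    # Illustrative filter: keep tokens made of printable ASCII characters,
    # allowing the SentencePiece word-boundary marker "▁". A real filter
    # would also need to handle byte-fallback tokens such as "<0x0A>".
    return all(ch == "\u2581" or 32 <= ord(ch) < 127 for ch in token)

vocab = tokenizer.get_vocab()                 # token string -> old id
special_ids = set(tokenizer.all_special_ids)
kept_old_ids = sorted(
    old_id
    for token, old_id in vocab.items()
    if old_id in special_ids or keep_token(token)
)
old_to_new = {old: new for new, old in enumerate(kept_old_ids)}  # id remap

with torch.no_grad():
    # Snapshot the original rows before resizing.
    old_in = model.get_input_embeddings().weight.clone()
    old_out = model.get_output_embeddings().weight.clone()

    # Shrink both the input embeddings and the LM head, then copy over the
    # rows that correspond to the kept token ids, in their new order.
    model.resize_token_embeddings(len(kept_old_ids))
    model.get_input_embeddings().weight.copy_(old_in[kept_old_ids])
    model.get_output_embeddings().weight.copy_(old_out[kept_old_ids])

# Caveat: the tokenizer still emits the *old* ids, so inputs would need to be
# remapped with old_to_new (or the tokenizer rebuilt, e.g. by editing
# tokenizer.json or training a new tokenizer) before fine-tuning.
```

I am unsure whether this is the right direction, so any correction is welcome.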

Thank you!
