
Question: Shrinking Tokenizer Vocabulary for Reduced Memory Consumption with Pre-Trained Model (LLaMA) Fine-Tuning #1686

Amerehei opened this issue Nov 23, 2024 · 0 comments

@Amerehei

Hi,

I am working on fine-tuning a LLaMA model and want to reduce the tokenizer vocabulary size to cut memory consumption. Specifically, I would like to:

Retain special tokens, English characters, symbols, and numbers.

Remove tokens related to other languages (as I don’t need them).

My questions are:

  1. Is it feasible to shrink the tokenizer vocabulary in this way and still use a pre-trained model for fine-tuning without affecting its performance significantly?

  2. What are the recommended approaches or tools for modifying the tokenizer vocabulary in such cases?

  3. Are there any caveats I should be aware of when performing this adjustment (e.g., issues with token embeddings or alignment with the pre-trained model)?

  4. Is it a good idea at all to reduce the vocabulary size? Can it meaningfully reduce memory consumption and make generation faster?

Any guidance or references to similar implementations would be greatly appreciated.
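For concreteness, here is a rough sketch of the kind of approach I have been considering (assuming the Hugging Face `transformers` stack; the checkpoint name and the `keep_token` filter are only placeholders, and this only handles the embedding side, not rebuilding the tokenizer itself):

```python
# Rough sketch, not a tested recipe: keep special tokens plus ASCII-only tokens,
# then shrink the model's input embeddings and LM head to match.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "meta-llama/Llama-2-7b-hf"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def keep_token(token: str) -> bool:
    # Illustrative filter: keep tokens made of printable ASCII characters,
    # allowing the SentencePiece word-boundary marker "▁". A real filter
    # would also need to handle byte-fallback tokens such as "<0x0A>".
    return all(ch == "\u2581" or 32 <= ord(ch) < 127 for ch in token)

vocab = tokenizer.get_vocab()                 # token string -> old id
special_ids = set(tokenizer.all_special_ids)
kept_old_ids = sorted(
    old_id
    for token, old_id in vocab.items()
    if old_id in special_ids or keep_token(token)
)
old_to_new = {old: new for new, old in enumerate(kept_old_ids)}  # id remap

with torch.no_grad():
    # Snapshot the original rows before resizing.
    old_in = model.get_input_embeddings().weight.clone()
    old_out = model.get_output_embeddings().weight.clone()

    # Shrink both the input embeddings and the LM head, then copy over the
    # rows that correspond to the kept token ids, in their new order.
    model.resize_token_embeddings(len(kept_old_ids))
    model.get_input_embeddings().weight.copy_(old_in[kept_old_ids])
    model.get_output_embeddings().weight.copy_(old_out[kept_old_ids])

# Caveat: the tokenizer still emits the *old* ids, so inputs would need to be
# remapped with old_to_new (or the tokenizer rebuilt, e.g. by editing
# tokenizer.json or training a new tokenizer) before fine-tuning.
```

I am unsure whether this is the right direction, so any correction is welcome.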

Thank you!
