Chat data + prompt template tutorial #823
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/823
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 16be2a1 with merge base bec7bab.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
docs/source/tutorials/chat.rst
Outdated
In this tutorial, we'll demystify what prompt templates are and when you'll need them
and talk about the differences between prompt templating for LLaMA2 and LLaMA3. Then,
we'll wrap up with a LLaMA3 finetuning example on a custom chat dataset.
To me this doesn't fully line up with the title of the tutorial. Also, kind of a nit, but I wouldn't use "demystify" in the opening sentence; I think it can turn people off by making the topic seem more complex than it is.
Maybe frame it more as, "llama3 handles chat formats differently, here's what you need to know"? Also what would you suggest for the title?
docs/source/tutorials/chat.rst
Outdated
Prompt templates and tokenization schemes are often conflated, and it's sometimes
unclear what you need to do to format your data to optimize the training performance
Imo this is not a good starting point. You need to set the stage of the tutorial with "here's a specific problem we're going to solve that you care about"
docs/source/tutorials/chat.rst
Outdated
of your model. Let's walk through the LLaMA2/LLaMA3 templates to better understand
the distinction.
This is a better motivation for a section: learning how Llama3 format differs from Llama2 format
docs/source/tutorials/chat.rst
Outdated
Let's test our understanding by trying to fine-tune the LLaMA3-8B model with a custom
chat dataset. We'll walk through how to set up our data so that it can be tokenized
correctly and fed into our model.
🍿
docs/source/tutorials/chat.rst
Outdated
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message_1 }} [/INST] {{ model_answer_1 }} </s>
<s>[INST] {{ user_message_2 }} [/INST]
This is nice but you need to give actual examples. Take a specific input text and walk through all the steps: prompt format, tokenization, special tokens, etc. Give a consistent simple example that readers can always anchor back to for their understanding.
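For instance, the tutorial could pin the whole walkthrough to one short made-up exchange (the text below is illustrative, not from the PR) and show it in the Llama2 template first:

.. code-block:: text

    <s>[INST] <<SYS>>
    You are a helpful assistant.
    <</SYS>>

    What is the capital of France? [/INST] The capital of France is Paris. </s>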
docs/source/tutorials/chat.rst
Outdated
with the :class:`~torchtune.data.Llama2ChatFormat`, as long as this is what the model
sees during inference. The model should be robust enough to adapt to a new template.

Special tokens in LLaMA3
These sections don't flow together super clearly. Prompt template + special tokens section, followed by when to use prompt template, followed by specific handling of special tokens for Llama3. Why not just tackle prompt template and special tokens as their own standalone entities and walk through them fully one at a time?
Yeah, let me see if I can separate these two a bit better.
docs/source/tutorials/chat.rst
Outdated
.. code-block:: text

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
Again, give a specific example here
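E.g., reusing the same made-up exchange as above, the Llama3 version could look roughly like this (my sketch; exact whitespace and newlines should be double-checked against what the tokenizer actually emits):

.. code-block:: text

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>

    You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

    What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

    The capital of France is Paris.<|eot_id|>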
docs/source/tutorials/chat.rst
Outdated
print(tokenizer._encode_special_token("<|begin_of_text|>"))
# 128000
print(tokenizer._encode_special_token("<|eot_id|>"))
# 128009
Could be worth a quick digression into the TikToken encoding configuration for allowed and disallowed special tokens, and how we treat any special tokens appearing in the text as regular text (idk if this is too in the weeds though)
I think that's way too deep, and no one should be playing with those APIs
docs/source/tutorials/chat.rst
Outdated
.. code-block:: python

    class ChatDataset(Dataset):
Overall I like this section. Still a couple of comments: (1) I think it can be a bit overwhelming to just provide the general ChatDataset API from the get-go without a ton of context. Why not start with a single sample, show the format we want to get it into for tokenization, then walk through each of the necessary steps to get there? Then use that to motivate the exact definition of lima_dataset given below.
That's a sound approach
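Something like this, maybe; a rough sketch where the sample text and the message_converter helper are made up for illustration (only the Message class is real torchtune):

.. code-block:: python

    from torchtune.data import Message

    # One raw sample as it might appear in a custom chat dataset (made-up example)
    sample = {
        "dialogue": [
            {"from": "human", "value": "What is the capital of France?"},
            {"from": "gpt", "value": "The capital of France is Paris."},
        ],
    }

    def message_converter(sample: dict) -> list:
        # Hypothetical helper: map the dataset's own role names onto the roles
        # the tokenizer understands, keeping the turn order intact
        role_map = {"human": "user", "gpt": "assistant"}
        return [
            Message(role=role_map[turn["from"]], content=turn["value"])
            for turn in sample["dialogue"]
        ]

    messages = message_converter(sample)

Then the tokenizer section can pick up from messages and show the special tokens getting added around each turn.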
docs/source/tutorials/chat.rst
Outdated
1. You are running inference on the base model and it was pre-trained with a prompt
   template
2. You want to prime a fine-tuned model to expect a certain prompt structure on inference
   for a specific task
This isn't formatting correctly in the live docs
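Once the list renders correctly, a tiny concrete example might also help the second point land, e.g. the kind of structure you'd prime a summarization fine-tune with (illustrative only, not tied to any torchtune template class):

.. code-block:: text

    Summarize the following article:

    {{ article }}

    Summary: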
docs/source/tutorials/chat.rst
Outdated
It is not strictly necessary to fine-tune with a prompt template, but generally you
want the model to perform some sort of task, which will require some formatting of
the prompt.
I might scrap this sentence altogether, esp "you want the model to perform some sort of task" is kinda self-evident
docs/source/tutorials/chat.rst
Outdated
Fine-tuning on a custom chat dataset
------------------------------------

Let's test our understanding by trying to fine-tune the LLaMA3-8B model with a custom
If we're doing a LoRA finetune, we might wanna use the instruct version of the model here
.. code-block:: bash

    $ tune run lora_finetune_single_device --config custom_8B_lora_single_device.yaml epochs=15
So many epochs lol
# ]

The LLaMA3 tokenizer class, :class:`~torchtune.modules.tokenizers.TikTokenTokenizer`,
expects the input to be in the :class:`~torchtune.data.Message` format. Let's
You use the Message class a few times prior to this. Might be worth a code snippet with a sentence or two explaining it somewhere in here.
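Even something minimal would do, e.g. (a sketch; I'm assuming the constructor takes these fields, worth double-checking against the docstring):

.. code-block:: python

    from torchtune.data import Message

    # A single turn: who is speaking and what they said. The extra flags control
    # things like loss masking; role/content is the core of it.
    msg = Message(
        role="user",  # "system", "user", or "assistant"
        content="What is the capital of France?",
        masked=True,  # don't compute the loss on the user's turn
    )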
nah, I think the docstrings serve that purpose. Going through eot/ipython/etc would be distracting.
This looks great! Left a handful of comments but no major concerns from me
Context
Tutorial on how to set up chat data with llama3, with a discussion of how prompt templates work, especially with the new special tokens.
Changelog
Test plan