
Chat data + prompt template tutorial #823

Merged
merged 7 commits into pytorch:main from chat_tutorial on Apr 24, 2024
Conversation

RdoubleA (Contributor)

Context

Tutorial on how to set up chat data with Llama3, and a discussion of how prompt templates work, especially with the new special tokens.

Changelog

  • ...

Test plan

  • ....

pytorch-bot (bot) commented Apr 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/823

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 16be2a1 with merge base bec7bab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Apr 20, 2024
Comment on lines 5 to 7
In this tutorial, we'll demystify what prompt templates are and when you'll need them
and talk about the differences between prompt templating for LLaMA2 and LLaMA3. Then,
we'll wrap up with a LLaMA3 finetuning example on a custom chat dataset.
Contributor

To me this doesn't actually fully line up with the title of the tutorial. Also kind of a nit, but I wouldn't use "demystify" in the opening sentence; I think it can potentially turn people off by making the topic seem too complex.

Contributor Author

Maybe frame it more as, "llama3 handles chat formats differently, here's what you need to know"? Also what would you suggest for the title?

Comment on lines 26 to 27
Prompt templates and tokenization schemes are often conflated, and it's sometimes
unclear what you need to do to format your data to optimize the training performance
Contributor

Imo this is not a good starting point. You need to set the stage of the tutorial with "here's a specific problem we're going to solve that you care about"

Comment on lines 28 to 29
of your model. Let's walk through the LLaMA2/LLaMA3 templates to better understand
the distinction.
Contributor

This is a better motivation for a section: learning how Llama3 format differs from Llama2 format


Let's test our understanding by trying to fine-tune the LLaMA3-8B model with a custom
chat dataset. We'll walk through how to set up our data so that it can be tokenized
correctly and fed into our model.
Contributor

🍿

Comment on lines 39 to 44
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message_1 }} [/INST] {{ model_answer_1 }} </s>
<s>[INST] {{ user_message_2 }} [/INST]
Contributor

This is nice but you need to give actual examples. Take a specific input text and walk through all the steps: prompt format, tokenization, special tokens, etc. Give a consistent simple example that readers can always anchor back to for their understanding.
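For illustration (a hypothetical exchange, not an excerpt from the tutorial), the template above filled in with concrete text could be assembled like this:

.. code-block:: python

   # Hypothetical values for the template placeholders above
   system_prompt = "You are a helpful assistant."
   user_message_1 = "What is the capital of France?"
   model_answer_1 = "The capital of France is Paris."

   # First turn of the Llama2 chat format: system prompt wrapped in <<SYS>>,
   # user message wrapped in [INST], model answer closed with </s>
   prompt = (
       f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
       f"{user_message_1} [/INST] {model_answer_1} </s>"
   )
   print(prompt)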

with the :class:`~torchtune.data.Llama2ChatFormat`, as long as this is what the model
sees during inference. The model should be robust enough to adapt to a new template.

Special tokens in LLaMA3
Contributor

These sections don't flow together super clearly: the prompt template + special tokens section, followed by when to use a prompt template, followed by specific handling of special tokens for Llama3. Why not just tackle prompt templates and special tokens as their own standalone entities and walk through them fully one at a time?

Contributor Author

Yeah, let me see if I can separate these two a bit better.


.. code-block:: text

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Contributor

Again, give a specific example here
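As a sketch of what such an example could look like (the message text is hypothetical, and the exact whitespace layout is an assumption; the tags follow the Llama3 format above):

.. code-block:: python

   # The same hypothetical exchange rendered with Llama3 special tokens
   prompt = (
       "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
       "You are a helpful assistant.<|eot_id|>"
       "<|start_header_id|>user<|end_header_id|>\n\n"
       "What is the capital of France?<|eot_id|>"
       "<|start_header_id|>assistant<|end_header_id|>\n\n"
       "The capital of France is Paris.<|eot_id|>"
   )
   print(prompt)

Note that each message is delimited by header tokens for its role and closed with <|eot_id|>, rather than the bracketed [INST] tags of Llama2.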

print(tokenizer._encode_special_token("<|begin_of_text|>"))
# 128000
print(tokenizer._encode_special_token("<|eot_id|>"))
# 128009
Contributor

Could be worth a quick digression into the configurations for TikToken encoding for allowed and disallowed special tokens and how we will treat any special tokens appearing in the text just as regular text (idk if this is too in the weeds though)

Contributor Author

I think that's way too deep imo, and no one should be playing with those APIs

@RdoubleA changed the title from "[WIP] Chat data + prompt template tutorial" to "Chat data + prompt template tutorial" Apr 21, 2024

.. code-block:: python

class ChatDataset(Dataset):
Contributor

Overall I like this section. Still a couple of comments: (1) I think it can be a bit overwhelming to just provide the general ChatDataset API from the get-go without a ton of context. Why not start with a single sample, show the format we want to get it into for tokenization, then walk through each of the necessary steps to get there? Then use that to motivate the exact definition of lima_dataset given below.

Contributor Author

That's a sound approach
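A minimal sketch of that approach (the sample fields and the conversion below are hypothetical, not the tutorial's actual code):

.. code-block:: python

   from torchtune.data import Message

   # One hypothetical raw sample from a chat dataset
   sample = {
       "instruction": "What is the capital of France?",
       "response": "The capital of France is Paris.",
   }

   # Convert the sample into the Message format the tokenizer expects
   messages = [
       Message(role="user", content=sample["instruction"]),
       Message(role="assistant", content=sample["response"]),
   ]

Starting from a single sample like this makes the later ChatDataset definition read as a generalization of steps the reader has already seen.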

Comment on lines 208 to 211
1. You are running inference on the base model and it was pre-trained with a prompt
template
2. You want to prime a fine-tuned model to expect a certain prompt structure on inference
for a specific task
Contributor

This isn't formatting correctly in the live docs

Comment on lines 213 to 215
It is not strictly necessary to fine-tune with a prompt template, but generally you
want the model to perform some sort of task, which will require some formatting of
the prompt.
Contributor

I might scrap this sentence altogether, esp "you want the model to perform some sort of task" is kinda self-evident

Fine-tuning on a custom chat dataset
------------------------------------

Let's test our understanding by trying to fine-tune the LLaMA3-8B model with a custom
Contributor

If we're doing a LoRA finetune might wanna use the instruct version of the model here


.. code-block:: bash

$ tune run lora_finetune_single_device --config custom_8B_lora_single_device.yaml epochs=15
Contributor

So many epochs lol


The LLaMA3 tokenizer class, :class:`~torchtune.modules.tokenizers.TikTokenTokenizer`,
expects the input to be in the :class:`~torchtune.data.Message` format. Let's
Contributor

You use the Message class a few times prior to this. Might be worth a code snippet with a sentence or two explaining everything somewhere in here.

Contributor Author

nah, I think the docstrings serve that purpose. Going through eot/ipython/etc would be distracting.
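For readers who do want a quick reference, the tokenization step looks roughly like this (a sketch: the constructor path is a placeholder, and the tokenize_messages signature is assumed from the tokenizer classes named in this thread):

.. code-block:: python

   from torchtune.modules.tokenizers import TikTokenTokenizer

   # Placeholder path to the downloaded Llama3 tokenizer model
   tokenizer = TikTokenTokenizer("/tmp/Meta-Llama-3-8B/original/tokenizer.model")

   # tokenize_messages converts a list of Message objects into token IDs
   # plus a mask marking which tokens should be ignored in the loss
   tokens, mask = tokenizer.tokenize_messages(messages)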

@ebsmothers (Contributor) left a comment

This looks great! Left a handful of comments but no major concerns from me

@RdoubleA merged commit dd99f37 into pytorch:main Apr 24, 2024
27 checks passed
@RdoubleA deleted the chat_tutorial branch April 24, 2024 00:06