
Chat data + prompt template tutorial #823

Merged
merged 7 commits into pytorch:main from chat_tutorial on Apr 24, 2024
Conversation

RdoubleA (Contributor)

Context

Tutorial on how to set up chat data with Llama3, and a discussion of how prompt templates work, especially with the new special tokens.

Changelog

  • ...

Test plan

  • ....

pytorch-bot (bot) commented Apr 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/823

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 16be2a1 with merge base bec7bab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Apr 20, 2024
Comment on lines 5 to 7
In this tutorial, we'll demystify what prompt templates are and when you'll need them
and talk about the differences between prompt templating for LLaMA2 and LLaMA3. Then,
we'll wrap up with a LLaMA3 finetuning example on a custom chat dataset.
Contributor

To me this doesn't actually fully line up with the title of the tutorial. Also kind of a nit, but I wouldn't use "demystify" in the opening sentence; I think it can potentially turn people off by making the topic seem too complex.

Contributor Author

Maybe frame it more as, "llama3 handles chat formats differently, here's what you need to know"? Also what would you suggest for the title?

Comment on lines 26 to 27
Prompt templates and tokenization schemes are often conflated, and it's sometimes
unclear what you need to do to format your data to optimize the training performance
Contributor

Imo this is not a good starting point. You need to set the stage of the tutorial with "here's a specific problem we're going to solve that you care about"

Comment on lines 28 to 29
of your model. Let's walk through the LLaMA2/LLaMA3 templates to better understand
the distinction.
Contributor

This is a better motivation for a section: learning how Llama3 format differs from Llama2 format


Let's test our understanding by trying to fine-tune the LLaMA3-8B model with a custom
chat dataset. We'll walk through how to set up our data so that it can be tokenized
correctly and fed into our model.
Contributor

🍿

Comment on lines 39 to 44
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message_1 }} [/INST] {{ model_answer_1 }} </s>
<s>[INST] {{ user_message_2 }} [/INST]
Contributor

This is nice but you need to give actual examples. Take a specific input text and walk through all the steps: prompt format, tokenization, special tokens, etc. Give a consistent simple example that readers can always anchor back to for their understanding.
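For illustration (a hypothetical exchange, not an excerpt from the tutorial), the template above filled in with concrete text could be assembled like this:

.. code-block:: python

   # Hypothetical values for the template placeholders above
   system_prompt = "You are a helpful assistant."
   user_message_1 = "What is the capital of France?"
   model_answer_1 = "The capital of France is Paris."

   # First turn of the Llama2 chat format: system prompt wrapped in <<SYS>>,
   # user message wrapped in [INST], model answer closed with </s>
   prompt = (
       f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
       f"{user_message_1} [/INST] {model_answer_1} </s>"
   )
   print(prompt)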

with the :class:`~torchtune.data.Llama2ChatFormat`, as long as this is what the model
sees during inference. The model should be robust enough to adapt to a new template.

Special tokens in LLaMA3
Contributor

These sections don't flow together super clearly: the prompt template + special tokens section, followed by when to use a prompt template, followed by specific handling of special tokens for Llama3. Why not just tackle prompt templates and special tokens as their own standalone entities and walk through them fully one at a time?

Contributor Author

Yeah, let me see if I can separate these two a bit better.


.. code-block:: text

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Contributor

Again, give a specific example here
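As a sketch of what such an example could look like (the message text is hypothetical, and the exact whitespace layout is an assumption; the tags follow the Llama3 format above):

.. code-block:: python

   # The same hypothetical exchange rendered with Llama3 special tokens
   prompt = (
       "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
       "You are a helpful assistant.<|eot_id|>"
       "<|start_header_id|>user<|end_header_id|>\n\n"
       "What is the capital of France?<|eot_id|>"
       "<|start_header_id|>assistant<|end_header_id|>\n\n"
       "The capital of France is Paris.<|eot_id|>"
   )
   print(prompt)

Note that each message is delimited by header tokens for its role and closed with <|eot_id|>, rather than the bracketed [INST] tags of Llama2.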

print(tokenizer._encode_special_token("<|begin_of_text|>"))
# 128000
print(tokenizer._encode_special_token("<|eot_id|>"))
# 128009
Contributor

Could be worth a quick digression into the configurations for TikToken encoding for allowed and disallowed special tokens and how we will treat any special tokens appearing in the text just as regular text (idk if this is too in the weeds though)

Contributor Author

I think that's way too deep imo, and no one should be playing with those APIs

@RdoubleA changed the title from "[WIP] Chat data + prompt template tutorial" to "Chat data + prompt template tutorial" Apr 21, 2024

.. code-block:: python

class ChatDataset(Dataset):
Contributor

Overall I like this section. Still a couple of comments: (1) I think it can be a bit overwhelming to just provide the general ChatDataset API from the get-go without a ton of context. Why not start with a single sample, show the format we want to get it into for tokenization, then walk through each of the necessary steps to get there? Then use that to motivate the exact definition of lima_dataset given below.

Contributor Author

That's a sound approach
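A minimal sketch of that approach (the sample fields and the conversion below are hypothetical, not the tutorial's actual code):

.. code-block:: python

   from torchtune.data import Message

   # One hypothetical raw sample from a chat dataset
   sample = {
       "instruction": "What is the capital of France?",
       "response": "The capital of France is Paris.",
   }

   # Convert the sample into the Message format the tokenizer expects
   messages = [
       Message(role="user", content=sample["instruction"]),
       Message(role="assistant", content=sample["response"]),
   ]

Starting from a single sample like this makes the later ChatDataset definition read as a generalization of steps the reader has already seen.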

Comment on lines 208 to 211
1. You are running inference on the base model and it was pre-trained with a prompt
template
2. You want to prime a fine-tuned model to expect a certain prompt structure on inference
for a specific task
Contributor

This isn't formatting correctly in the live docs

Comment on lines 213 to 215
It is not strictly necessary to fine-tune with a prompt template, but generally you
want the model to perform some sort of task, which will require some formatting of
the prompt.
Contributor

I might scrap this sentence altogether, esp "you want the model to perform some sort of task" is kinda self-evident

Fine-tuning on a custom chat dataset
------------------------------------

Let's test our understanding by trying to fine-tune the LLaMA3-8B model with a custom
Contributor

If we're doing a LoRA finetune might wanna use the instruct version of the model here


.. code-block:: bash

$ tune run lora_finetune_single_device --config custom_8B_lora_single_device.yaml epochs=15
Contributor

So many epochs lol


The LLaMA3 tokenizer class, :class:`~torchtune.modules.tokenizers.TikTokenTokenizer`,
expects the input to be in the :class:`~torchtune.data.Message` format. Let's
Contributor

You use the Message class a few times prior to this. Might be worth a code snippet with a sentence or two explaining everything somewhere in here.

Contributor Author

nah, I think the docstrings serve that purpose. Going through eot/ipython/etc would be distracting.
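For readers who do want a quick reference, the tokenization step looks roughly like this (a sketch: the constructor path is a placeholder, and the tokenize_messages signature is assumed from the tokenizer classes named in this thread):

.. code-block:: python

   from torchtune.modules.tokenizers import TikTokenTokenizer

   # Placeholder path to the downloaded Llama3 tokenizer model
   tokenizer = TikTokenTokenizer("/tmp/Meta-Llama-3-8B/original/tokenizer.model")

   # tokenize_messages converts a list of Message objects into token IDs
   # plus a mask marking which tokens should be ignored in the loss
   tokens, mask = tokenizer.tokenize_messages(messages)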

@ebsmothers (Contributor) left a comment

This looks great! Left a handful of comments but no major concerns from me

@RdoubleA merged commit dd99f37 into pytorch:main Apr 24, 2024
27 checks passed
@RdoubleA deleted the chat_tutorial branch April 24, 2024 00:06