Warn about structured generation without a prompt for Llama tokenizers #1321

brandonwillard · 2024-12-04T23:09:36Z

Llama (/SentencePiece?) tokenizers treat leading spaces specially when a token comes first in a sequence. For example:

from transformers import AutoTokenizer

model = "NousResearch/Nous-Hermes-llama-2-7b"
tokenizer = AutoTokenizer.from_pretrained(
    model, clean_up_tokenization_spaces=True
)

https_tokens = tokenizer.encode("https://", add_special_tokens=False)
print(https_tokens)
# [2045, 597]

prompt_tokens = tokenizer.encode("prompt", add_special_tokens=False)
print(prompt_tokens)
# [9508]

print(tokenizer.batch_decode([https_tokens], skip_special_tokens=True))
# ['https://']

print(tokenizer.batch_decode([prompt_tokens + https_tokens], skip_special_tokens=True))
# ['prompt https://']

The decoding used by our structured generation doesn't apply this no-space-for-the-first-token rule: it strictly interprets a token like 2045 as " https", leading space included. As a result, structured generation won't allow such tokens as the first token when generation starts without a prompt, and we need to warn people about that. I'm not sure why one would start generation without a prompt, but it's worth mentioning.
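To make the mismatch concrete, here is a minimal, self-contained sketch that needs no model download. The token IDs mirror the example above, but the token strings in `TOY_VOCAB` and all three helper functions are assumptions made up for illustration; this is not the actual tokenizer vocabulary or the library's decoding code.

```python
# Toy vocabulary: IDs mirror the example above, token strings are assumed.
# "\u2581" is the SentencePiece word-boundary marker, rendered as a space.
TOY_VOCAB = {2045: "\u2581https", 597: "://", 9508: "\u2581prompt"}


def sentencepiece_like_decode(token_ids):
    """Join tokens, then drop the space in front of the very first token."""
    text = "".join(TOY_VOCAB[i] for i in token_ids).replace("\u2581", " ")
    return text[1:] if text.startswith(" ") else text


def strict_token_string(token_id):
    """Decode one token in isolation, always keeping its leading space."""
    return TOY_VOCAB[token_id].replace("\u2581", " ")


def allowed_first_tokens(target):
    """Tokens whose strict string is a prefix of the text we must generate."""
    return [i for i in TOY_VOCAB if target.startswith(strict_token_string(i))]


print(repr(sentencepiece_like_decode([2045, 597])))
print(repr(strict_token_string(2045)))
print(allowed_first_tokens("https://"))
print(allowed_first_tokens(" https://"))
```

In this sketch the sequence-level decode yields 'https://', while the strict per-token view of 2045 is ' https'. Matching against a target that must start with "https://" therefore leaves no valid first token, and 2045 only becomes valid once the target itself starts with a space.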


brandonwillard added the documentation, structured generation, tokenization, and correctness labels on Dec 4, 2024
