Token healing

Guidance uses what we call “token healing” to fix tokenization artifacts that normally arise at the boundary between the end of a prompt and the beginning of a set of generated tokens. Note that token healing requires direct endpoint integration to run efficiently, so it is currently supported only for the guidance.models.LlamaCpp and guidance.models.Transformers LLM backends.

Why token healing is needed

Language models process tokens, which are chunks of text that are often similar to a word. This impacts how language models see text, and also how we can prompt them, since every prompt has to be a sequence of tokens. Encodings like BPE, which are used by GPT-style models, map all input bytes to token ids in an optimized (greedy, frequency-based) manner. This works well during training, but it can lead to subtle issues during prompting and inference, because the token boundaries at the end of a prompt often don’t line up with the boundaries we would get if the generated text that comes next were tokenized together with the prompt. Of course the end of a prompt always aligns with a token boundary, because the prompt is tokenized before being extended by the model; but if the first characters of the completion are part of a longer token that would span the prompt boundary, that longer token cannot be used (even though it is what the model would expect based on its training data).

To see why token healing is important, consider the prompt “This is a ”, which is then completed with “fine day.” by the model, resulting in the final string “This is a fine day.”. If we tokenize the prompt “This is a ” with the GPT2 BPE we get [1212, 318, 257, 220], and the tokenization of the extension “fine day.” is [38125, 1110, 13]. This results in a final prompt + generation token sequence of [1212, 318, 257, 220, 38125, 1110, 13]. If however we were to tokenize the whole string “This is a fine day.” jointly, we instead get [1212, 318, 257, 3734, 1110, 13]. Which tokenization is correct? Well, the correct tokenization is the one that best communicates intent to the model. Since the model learned intent based on an optimized tokenization of the training text, the joint tokenization (which also uses optimized matching) will better align with how the model processed the training data, and so is more likely to clearly communicate intent to the model. This is why ending your prompt with a space is almost always a bad idea in GPT models: most word-based tokens have the space before the word, not after it.
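
As a rough check of that last claim, we can scan the GPT2 vocabulary and count how many purely alphabetic tokens carry a leading space versus a trailing one (this snippet loads its own copy of the GPT2 tokenizer; the exact counts are just illustrative):

[ ]:
import transformers

# count GPT2 tokens of the form " word" vs. "word "
tok = transformers.AutoTokenizer.from_pretrained('gpt2')
pieces = [tok.decode([i]) for i in range(len(tok))]
leading = sum(p.startswith(" ") and p[1:].isalpha() for p in pieces)
trailing = sum(p.endswith(" ") and p[:-1].isalpha() for p in pieces)
print(f"{leading} tokens look like ' word', {trailing} tokens look like 'word '")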

Note that another way to see that the “standard” prompt-boundary-based encoding is worse than the joint one we get with token healing is to observe that 38125 (the token id for “fine”) is a large number, which means that token is uncommon in the training data (since BPE encodings are built up greedily based on frequency). In contrast, 3734 (the token id for “ fine”) is a much more common token, and so is more likely to clearly communicate intent to the model (since the model has seen it many times and hence has had more opportunity to learn its meaning in many contexts).
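
Decoding the two ids (reusing the tok tokenizer loaded in the snippet above) makes the difference explicit:

[ ]:
# 38125 decodes to "fine" (no leading space); 3734 decodes to " fine";
# in GPT2's BPE, lower ids generally correspond to more frequent merges
print(repr(tok.decode([38125])), repr(tok.decode([3734])))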

How token healing works

Guidance avoids the above tokenization artifacts automatically using a method we call “token healing”, which backs up the generation process by one or more tokens before the end of the prompt, then constrains the first tokens generated to have a prefix that matches the last token in the prompt. This allows the generated text string to have the token encoding the model would expect based on its training data, not an unusual alternative encoding forced by the prompt boundary. Token healing allows you to express your prompts however you wish, without worrying about boundaries (which affect many tokens, not just space characters).

[1]:
import transformers

# compute the tokenizations of the example above
tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2')
print("Tokenization of `This is a `:", tokenizer.encode("This is a "))
print("Tokenization of `fine day.`:", tokenizer.encode("fine day."))
print("Tokenization of `This is a fine day.`:", tokenizer.encode("This is a fine day."))
Tokenization of `This is a `: [1212, 318, 257, 220]
Tokenization of `fine day.`: [38125, 1110, 13]
Tokenization of `This is a fine day.`: [1212, 318, 257, 3734, 1110, 13]
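
To make the mechanism concrete, below is a minimal sketch of the healing constraint, reusing the tokenizer loaded above. It only enumerates which first tokens would be allowed after backing up over the trailing space token; an actual implementation such as Guidance's constrains the model's sampling to this set rather than scanning the vocabulary like this:

[ ]:
# minimal sketch of the token-healing constraint (illustrative, not Guidance's internals)
prompt_ids = tokenizer.encode("This is a ")   # [1212, 318, 257, 220]
dropped = prompt_ids.pop()                    # back up over the trailing " " token
prefix = tokenizer.decode([dropped])          # the text the next token must extend

# prompt_ids would now be fed to the model, with the first generated token
# constrained to start with `prefix`
allowed = [i for i in range(len(tokenizer)) if tokenizer.decode([i]).startswith(prefix)]
print(len(allowed), "candidate first tokens; is ' fine' (3734) allowed?", 3734 in allowed)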

Token healing in action

Below is a prompt that we run both with and without token healing to see how it can impact generation quality.

With token healing

[2]:
import guidance
from guidance import gen, models

gpt2 = models.Transformers("gpt2", temperature=0.8)

gpt2 += "The url of Google is http:" + gen(max_tokens=5)
The url of Google is http://www.google.

Without token healing

[3]:
generator = transformers.pipeline('text-generation', model='gpt2')

generator("The url of Google is http:", max_length=20, temperature=0.0001)[0]["generated_text"]
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[3]:
'The url of Google is http: //www.google.com/search?q=google+'

While it may seem strange to us that GPT2 does not put a “//” after the colon character (instead writing a space), it makes sense if we think about the tokens involved. If a “//” were likely to come after the colon, then the colon would not have been encoded on its own; it would have been included as part of the token “://”. By sending the token id 25 (a colon by itself) to the model we are communicating that what comes next is not something that would have been part of a token starting with a colon (since otherwise the greedy/optimized tokenization would have used that longer token). So GPT2 picks something that cannot be merged with the colon into a larger token: a space character (since “: ” is not a GPT2 token).

[4]:
tokenizer.encode(":"), tokenizer.encode(": "), tokenizer.encode("://")
[4]:
([25], [25, 220], [1378])

Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!