The Art of Prompt Design

[1]:
# Increase this if you encounter rate limiting
call_delay_secs = 0

Use clear syntax

This is the first installment of a series on how to use guidance to control large language models (LLMs). We’ll start from the basics and work our way up to more advanced topics.

In this document, we’ll show that having clear syntax enables you to communicate your intent to the LLM, and also ensures that outputs are easy to parse (like JSON that is guaranteed to be valid). For the sake of clarity and reproducibility we’ll start with an open source Mistral 7B Instruct model loaded locally. Then, we will show how the same ideas apply to instruction-tuned models like GPT-3.5 and chat-tuned models like ChatGPT / GPT-4.

Clear syntax helps with parsing the output

The first and most obvious benefit of using clear syntax is that it makes it easier to parse the output of the LLM. Even if the LLM generates a correct answer, it may be difficult to programmatically extract the desired information from the output. For example, consider the following Guidance prompt (where gen() is a guidance function to generate text from the LLM):

[2]:
import math

from huggingface_hub import hf_hub_download

import guidance
from guidance import models, gen, select

repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
filename = "mistral-7b-instruct-v0.2.Q8_0.gguf"
model_kwargs = {"verbose": True, "n_gpu_layers": -1, "n_ctx": 1024}

downloaded_file = hf_hub_download(repo_id=repo_id, filename=filename)
lm = guidance.models.LlamaCpp(downloaded_file, **model_kwargs)
Cannot use verbose=True in this context (probably CoLab). See https://github.com/abetlen/llama-cpp-python/issues/729
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\riedgar\.cache\huggingface\hub\models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF\snapshots\3a6fbf4a41a1d52e415a4958cde6856d34b2db93\mistral-7b-instruct-v0.2.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  7338.64 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    10.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.00 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'general.architecture': 'llama', 'llama.context_length': '32768', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '7', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"}
Guessed chat format: mistral-instruct

We can now ask a question:

[3]:
# run a guidance program (by appending to the model state)
lm + "Name common Linux operating system commands." + gen(max_tokens=50)
[3]:
Name common Linux operating system commands.

Here are some common Linux operating system commands:

1. `cd`: Change directory. This command is used to navigate through the file system. For example, `cd /home/user/Documents` changes the current directory

While the answer is readable, the output format is arbitrary (i.e. we don’t know it in advance), and thus hard to parse programmatically. For example, here is another run of the same prompt where the output format is very different:

[4]:
lm + "Name common Mac operating system commands." + gen(max_tokens=50)
[4]:
Name common Mac operating system commands.

1. Open Finder: `Command + Space` and type "Finder" or press `Command + Shift + G` and type "/" to open the Go to Folder dialog box and type "/Applications/Finder.app

Enforcing clear syntax in your prompts can help reduce the problem of arbitrary output formats. There are a couple of ways you can do this:

1. Giving structure hints to the LLM inside a standard prompt (perhaps even using few-shot examples).
2. Writing a guidance program template that enforces a specific output format.

These are not mutually exclusive. Let’s see an example of each approach.

Traditional prompt with structure hints

Here is an example of a traditional prompt that uses structure hints to encourage the use of a specific output format. The prompt is designed to generate a list of 5 items that is easy to parse. Note that in comparison to the previous prompt, we have written this prompt in such a way that it commits the LLM to a specific, clear syntax (a number followed by a quoted string). This makes it much easier to parse the output after generation.

[5]:
lm + '''\
What are the most common commands used in the Linux operating system?

Here are the 5 most common commands:
1. "''' + gen(max_tokens=70)
[5]:
What are the most common commands used in the Linux operating system?

Here are the 5 most common commands:
1. "ls" (list files and directories): This command is used to list the files and directories in the current directory. You can also use options with this command to modify the output, such as "-l" to display detailed information about each file.
2. "cd" (change directory): This command is used to change the current directory

Note that the LLM follows the syntax correctly, but does not stop after generating 5 items. We can fix this by creating a clear stopping criterion, e.g. asking for 6 items and stopping when we see the start of the sixth item (so we end up with five):

[6]:
lm + '''\
What are the most common commands used in the Linux operating system?

Here are the 6 most common commands:
1. "''' + gen(max_tokens=100, stop="\n6.")
[6]:
What are the most common commands used in the Linux operating system?

Here are the 6 most common commands:
1. "ls" (list files and directories): This command is used to list the files and directories in the current directory. You can also use options with this command to modify the output, such as "-l" to display detailed information about each file.
2. "cd" (change directory): This command is used to change the current directory. You can specify the directory you want to change to by providing its path.
3. "cp" (copy files and directories): This

Enforcing syntax with a guidance program

Rather than using hints, a Guidance program enforces a specific output format, inserting the tokens that are part of the structure rather than getting the LLM to generate them. For example, this is what we would do if we wanted to enforce a numbered list as a format:

[7]:
lm2 = lm + """What are the most common commands used in the Linux operating system?

Here are the 5 most common commands:
"""
for i in range(5):
    lm2 += f'''{i+1}. "{gen('commands', list_append=True, stop='"', max_tokens=50)}"\n'''
What are the most common commands used in the Linux operating system?

Here are the 5 most common commands:
1. "ls"
2. "cd"
3. "mkdir"
4. "rm"
5. "cp"

Here is what is happening in the above prompt:

- The lm2 = lm + """What are... command saves the new model state that results from appending the string to the starting model into the variable lm2. The for loop then iteratively updates lm2 by appending a mixture of strings and generated sequences.
- Note that the structure (the numbers and the quotes) is not generated by the LLM.

Output parsing is done automatically by the Guidance program, so we don’t need to worry about it. In this case, the commands variable will be the list of generated command names:

[8]:
lm2["commands"]
[8]:
['ls', 'cd', 'mkdir', 'rm', 'cp']

Forcing valid JSON syntax: Using guidance we can create any syntax we want with absolute confidence that what we generate will exactly follow the format we specify. This is particularly useful for things like JSON:

[9]:
import guidance

# define a re-usable "guidance function" that we can use below
@guidance
def quoted_list(lm, name, n):
    for i in range(n):
        if i > 0:
            lm += ", "
        lm += '"' + gen(name, list_append=True, stop='"') + '"'
    return lm

lm + f"""What are the most common commands used in the Linux operating system?

Here are the 5 most common commands in JSON format:
{{
    "commands": [{quoted_list('commands', 5)}],
    "my_favorite_command": "{gen('favorite_command', stop='"')}"
}}"""
[9]:
What are the most common commands used in the Linux operating system?

Here are the 5 most common commands in JSON format:
{
    "commands": ["ls", "cd", "mkdir", "rm", "cp"],
    "my_favorite_command": "ls -la"
}
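
Because every brace, key, and quote above is part of the program rather than generated text, the captured variables can be reassembled into a Python object with no risk of a parse failure. A minimal sketch, assuming the result of the cell above had been assigned to a (hypothetical) variable lm3:

import json

# hypothetical: assume the model state returned by the cell above was assigned to `lm3`
result = {
    "commands": lm3["commands"],
    "my_favorite_command": lm3["favorite_command"],
}
print(json.dumps(result, indent=4))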

Guidance acceleration: Another benefit of guidance programs is speed – incremental generation is actually faster than a single generation of the entire list, because the LLM does not have to generate the syntax tokens of the list itself, only the actual command names (this makes more of a difference when the output structure is richer). If you are using a model endpoint that does not support such acceleration (e.g. OpenAI models), then guidance instead lets the model generate all the tokens in a single API call, since many incremental API calls would slow you down. Note that this throws an exception if the output does not match the Guidance pattern (we may enable retrying with several calls in the future).

What this means in practice is that you can use guidance with simple remote endpoint models, but you only get parsing, not controlled decoding, since we can’t efficiently force the model at each token.
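
For the local model above, one rough way to see the acceleration effect is to time a structured program against a single free-form generation of similar length. This is only a sketch that reuses the local lm model from earlier; the absolute numbers will vary with hardware and output length:

import time

# structured: guidance appends the list syntax, the model only generates command names
start = time.perf_counter()
lm_structured = lm + "Here are the 5 most common Linux commands:\n"
for i in range(5):
    lm_structured += f'{i+1}. "' + gen("cmds", list_append=True, stop='"', max_tokens=20) + '"\n'
structured_secs = time.perf_counter() - start

# free-form: the model generates everything, including the list syntax
start = time.perf_counter()
lm_free = lm + "Here are the 5 most common Linux commands:\n" + gen(max_tokens=120)
free_secs = time.perf_counter() - start

print(f"structured: {structured_secs:.1f}s, free-form: {free_secs:.1f}s")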

[10]:
import os
import time

# Uncomment if using DefaultAzureCredential below
# from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# This is the name of the model deployed, such as 'gpt-4' or 'gpt-3.5-turbo'
model = os.getenv("AZUREAI_CHAT_MODEL", "Please set the model")

# This is the deployment URL, as provided in the Azure AI playground ('view code')
# It will end with 'openai.azure.com'
azure_endpoint = os.getenv("AZUREAI_CHAT_BASE_ENDPOINT", "Please set the endpoint")

# This is the name of the deployment specified in the Azure portal
azure_deployment = os.getenv("AZUREAI_CHAT_DEPLOYMENT", "Please set the deployment name")

# This is the deployed API version, such as 2024-02-15-preview
azure_api_version = os.getenv("AZUREAI_CHAT_API_VERSION", "Please set the API version")

# The environment variable should be set to the API key from the Azure AI playground:
api_key=os.getenv("AZUREAI_CHAT_KEY", "Please set API key")

# Alternatively, we can use Entra authentication
# token_provider = get_bearer_token_provider(
#     DefaultAzureCredential(),
#     "https://cognitiveservices.azure.com/.default"
#)

Now that things have been configured, we can create our model:

[11]:
from guidance import models, gen
from guidance import user, assistant

azureai_model = models.AzureOpenAI(
    model=model,
    azure_endpoint=azure_endpoint,
    azure_deployment=azure_deployment,
    version=azure_api_version,
    # For authentication, use either
    api_key=api_key
    # or
    # azure_ad_token_provider=token_provider
)

And now use it:

[12]:
with user():
    lm2 = azureai_model + 'What are the 5 most common commands used in the Linux operating system? Please list them as 1. "command" ...one per line with double quotes and no description.'

with assistant():
    for i in range(5):
        lm2 += f'''{i+1}. "{gen('commands', list_append=True, stop='"', temperature=1)}"\n'''
        time.sleep(call_delay_secs)
user
What are the 5 most common commands used in the Linux operating system? Please list them as 1. "command" ...one per line with double quotes and no description.
assistant
1. "cd" 2. "ls" 3. "pwd" 4. "rm" 5. "cp"
[13]:
lm2["commands"]
[13]:
['cd', 'ls', 'pwd', 'rm', 'cp']

Clear syntax gives the user more power

Getting stuck in a low-diversity rut is a common failure mode of LLMs, which can happen even if we use a relatively high temperature:

[14]:
lm2 = lm + """What are the most common commands used in the Linux operating system?
"""
for i in range(10):
    lm2 += f'''- "{gen('commands', list_append=True, stop='"', temperature=0.8, max_tokens=40)}"\n'''
    time.sleep(call_delay_secs)
What are the most common commands used in the Linux operating system?
- "ls"
- "cd"
- "mkdir"
- "rm"
- "rmdir"
- "cp"
- "mv"
- "chmod"
- "chown"
- "sudo"

One common fix to this problem is asking for parallel completions (so that prior generated commands do not influence the next command generation):

[15]:
lm2 = lm + '''What are the most common commands used in the Linux operating system?
- "'''
commands = []
for i in range(10):
    lm_tmp = lm2 + gen('command', stop='"', temperature=0.8)
    commands.append(lm_tmp["command"])
    time.sleep(call_delay_secs)
What are the most common commands used in the Linux operating system?
- "ls
[16]:
commands
[16]:
['`cd`',
 ', ',
 ':/ - Change the current directory to the root directory (/)\n- ',
 'ls',
 'cd',
 '><?php\n// Here are some common Linux commands:\n\n// List files and directories in the current directory\necho ',
 '`cd`',
 './myfile.sh',
 '`ls`',
 'ls']

We get more variability than before. Since clear structure gives us outputs that are easy to parse and manipulate, we can easily take the output, remove duplicates, and use them in the next step of our program.

Here is an example program that takes the listed commands, picks one, and does further operations on it:

[17]:
newline = "\n" # because for python < 3.12 we can't put a backslash in f-string values
lm2 = lm + 'What are the most common commands used in the Linux operating system?\n'

# generate a bunch of command names
lm_tmp = lm2 + 'Here is a common command: "'
commands = [(lm_tmp + gen('command', stop='"', max_tokens=20, temperature=1.0))["command"] for i in range(10)]
time.sleep(call_delay_secs)

# discuss them
for i,command in enumerate(set(commands)):
    lm2 += f'{i+1}. "{command}"\n'
lm2 += f'''Perhaps the most useful command from that list is: "{gen('cool_command', stop='"')}", because {gen('cool_command_desc', max_tokens=100, stop=newline)}
On a scale of 1-10, it has a coolness factor of: {gen('coolness', regex="[0-9]+")}.'''
What are the most common commands used in the Linux operating system?
1. ""
2. "./filename.sh"
3. ";
1. pwd (Print Working Directory) - Displays the current working directory"
4. ">> ls -l"
5. "ls"
Perhaps the most useful command from that list is: "ls", because it allows you to list the files and directories in the current directory. The ">>" symbol is used to append text to a file. The ";" symbol is used to execute multiple commands on a single line. The "pwd" command is used to display the current working directory. The "./filename.sh" command is used to execute a shell script in the current directory. The first command, "echo", is used to print text to the terminal.
On a scale of 1-10, it has a coolness factor of: 8.

We introduced one important control method in the above program: the regex pattern guide for generation. The command gen('coolness', regex="[0-9]+") uses a regular expression to enforce a certain syntax on the output (i.e. forcing the output to match an arbitrary regular expression). In this case we force the coolness score to be a whole number (note that generation stops once the model has completed the pattern and starts to generate something else).
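
The same idea extends to other constraints: regex patterns can force numeric or date-like outputs, and select() (imported at the top of this notebook but not used yet) restricts generation to a fixed set of options. A minimal sketch reusing the local lm model:

# force a four-digit number with a regex pattern
lm_tmp = lm + "The year the Linux kernel was first released: " + gen("year", regex=r"[0-9]{4}")
print(lm_tmp["year"])

# restrict the answer to a fixed set of options with select()
lm_tmp = lm + 'Is "ls" a command or a file? Answer: ' + select(["command", "file"], name="answer")
print(lm_tmp["answer"])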

Combining clear syntax with model-specific structure like chat

All of the examples above used the model as a plain completion engine, without any chat formatting. But if the model you are using has been fine-tuned for chat, it is important to combine clear syntax with the structure that has been tuned into the model. For example, chat models have been fine-tuned to expect several special “role” tags in the prompt. We can leverage these tags to further enhance the structure of our programs/prompts.

The following example adapts the above prompt for use with a chat-based model. guidance has special role context blocks (like user()), which allow you to mark out various roles and get them automatically translated into the right special tokens or API calls for the LLM you are using. This helps make prompts easier to read and more general across different chat models.

[18]:
# if we have multiple GPUs we can load the chat model on a different GPU with the `device` argument
del lm
time.sleep(call_delay_secs)
chat_lm = guidance.models.LlamaCpp(downloaded_file, **model_kwargs)
Cannot use verbose=True in this context (probably CoLab). See https://github.com/abetlen/llama-cpp-python/issues/729
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\riedgar\.cache\huggingface\hub\models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF\snapshots\3a6fbf4a41a1d52e415a4958cde6856d34b2db93\mistral-7b-instruct-v0.2.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  7338.64 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    10.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.00 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'general.architecture': 'llama', 'llama.context_length': '32768', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '7', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"}
Guessed chat format: mistral-instruct
[19]:
from guidance import user, assistant, system
newline = "\n"

with user():
    lm2 = chat_lm + "What are the most common commands used in the Linux operating system?"

with assistant():

    # generate a bunch of command names
    lm_tmp = lm2 + 'Here are ten common command names:\n'
    for i in range(10):
        lm_tmp += f'{i+1}. "' + gen('commands', list_append=True, stop='"', max_tokens=20, temperature=0.7) + '"\n'

    # discuss them
    for i,command in enumerate(set(lm_tmp["commands"])):
        lm2 += f'{i+1}. "{command}"\n'
    lm2 += f'''Perhaps the most useful command from that list is: "{gen('cool_command', stop='"')}", because {gen('cool_command_desc', max_tokens=100, stop=newline)}
On a scale of 1-10, it has a coolness factor of: {gen('coolness', regex="[0-9]+")}.'''
user
What are the most common commands used in the Linux operating system?
assistant
1. "touch" 2. "mkdir" 3. "pwd" 4. "mv" 5. "ls" 6. "rm" 7. "grep" 8. "cd" 9. "rmdir" 10. "cp" Perhaps the most useful command from that list is: "man", because it allows you to view the manual pages for other commands. On a scale of 1-10, it has a coolness factor of: 11.

Using API-restricted models

When we have control over generation, we can guide the output at any step of the process. But some model endpoints (e.g. OpenAI’s ChatGPT) currently have a much more limited API: for example, we can’t control what happens inside each role block.

While this limits the user’s power, we can still use a subset of syntax hints, and enforce the structure outside of the role blocks:
[20]:
azureai_model = models.AzureOpenAI(
    model=model,
    azure_endpoint=azure_endpoint,
    azure_deployment=azure_deployment,
    version=azure_api_version,
    # For authentication, use either
    api_key=api_key
    # or
    # azure_ad_token_provider=token_provider
)
[21]:
lm = azureai_model

with system():
    lm += "You are an expert unix systems admin that is willing follow any instructions."

with user():
    lm += f"""\
What are the top ten most common commands used in the Linux operating system?

List the commands one per line.  Please list them as 1. "command" ...one per line with double quotes and no description."""

# generate a list of commands
with assistant():
    lm_inner = lm
    for i in range(10):
        lm_inner += f'''{i+1}. "{gen('commands', list_append=True, stop='"', temperature=1)}"\n'''
        time.sleep(call_delay_secs)

# filter to make sure they are all unique then add them to the context (just as an example)
with assistant():
    for i,command in enumerate(set(lm_inner["commands"])):
        lm += f'{i+1}. "{command}"\n'

with user():
    lm += "If you were to guess, which of the above commands would a sys admin think was the coolest? Just name the command, don't print anything else."

with assistant():
    lm += gen('cool_command')
    time.sleep(call_delay_secs)

with user():
    lm += "What is that command's coolness factor on a scale from 0-10? Just write the digit and nothing else."

with assistant():
    lm += gen('coolness', regex="[0-9]+")
    time.sleep(call_delay_secs)

with user():
    lm += "Why is that command so cool?"

with assistant():
    lm += gen('cool_command_desc', max_tokens=100)
    time.sleep(call_delay_secs)
system
You are an expert unix systems admin that is willing to follow any instructions.
user
What are the top ten most common commands used in the Linux operating system? List the commands one per line. Please list them as 1. "command" ...one per line with double quotes and no description.
assistant
1. "sudo" 2. "pwd" 3. "cat" 4. "man" 5. "mv" 6. "ls" 7. "rm" 8. "echo" 9. "cd" 10. "cp"
user
If you were to guess, which of the above commands would a sys admin think was the coolest? Just name the command, don't print anything else.
assistant
"sudo"
user
What is that command's coolness factor on a scale from 0-10? Just write the digit and nothing else.
assistant
8
user
Why is that command so cool?
assistant
The "sudo" command is considered cool because it allows a system administrator to execute commands with the security privileges of another user (by default, the superuser). This is powerful because it provides control over the system, allowing the execution of administrative tasks while also providing a layer of security. It's like having the master key to a building, you can access any room you want.

Summary

Whenever you are building a prompt to control a model, it is important to consider not only the content of the prompt, but also its syntax. Clear syntax makes it easier to parse the output, helps the LLM produce output that matches your intent, and lets you write complex multi-step programs. While even a trivial example (listing common OS commands) benefits from clear syntax, most tasks are much more complex and benefit even more. We hope this post gives you some ideas on how to use clear syntax to improve your prompts.

Also, make sure to check out guidance. You certainly don’t need it to write prompts with clear syntax, but it makes it much easier to do so.


Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!