Example: generating and running code

Warning: This notebook runs LLM-generated code without any checks. Run at your own risk.

Loading a code model:

[1]:
from guidance import models, gen
from guidance.library._gen import will_gen
from guidance import capture, one_or_more, any_char, zero_or_more, commit_point, select
import guidance
import re
base_path = '/home/marcotcr_google_com/work/models/'
model_path = base_path + 'mistral-7b-codealpaca-lora.Q8_0.gguf'
mistral = models.LlamaCpp(model_path, n_gpu_layers=-1, n_ctx=4096)

Loading the HumanEval dataset:

[2]:
from datasets import load_dataset
dataset = load_dataset("openai_humaneval")

Let’s write a very simple baseline:

[3]:
import re
import guidance
@guidance
def baseline(lm, prompt):
    r = re.findall(r'def (.*?)\(', prompt)
    name = r[-1]
    lm += f'Here is an implementation of {name}:\n'
    lm += '```python\n' + prompt + gen(max_tokens=800, stop=['```', 'if __name__', 'def test'], name='program')
    lm = lm.set('program', prompt + lm['program'])
    return lm
[4]:
idx = 121
prompt = dataset['test']['prompt'][idx]
lm = mistral + baseline(prompt)
Here is an implementation of solution:
```python

def solution(lst):
    """Given a non-empty list of integers, return the sum of all of the odd elements that are in even positions.


    Examples
    solution([5, 8, 7, 1]) ==> 12
    solution([3, 3, 3, 3, 3]) ==> 9
    solution([30, 13, 24, 321]) ==>0
    """
    return sum(lst[i] for i in range(0, len(lst), 2) if lst[i] % 2 != 0)

# test the function
print(solution([5, 8, 7, 1]))  # should print 12
print(solution([3, 3, 3, 3, 3]))  # should print 9
print(solution([30, 13, 24, 321]))  # should print 0

Here is a simple function to evaluate a generated program against the HumanEval evaluation tests:

[6]:
# Returns True if it passes the evaluation tests, False otherwise
def eval_program(program, i):
    # Loads the `check` function
    exec(dataset['test']['test'][i])
    try:
        # Executes the function definition
        exec(program, globals())
    except Exception as e:
        # Program not valid
        return False
    name = dataset['test']['entry_point'][i]
    try:
        # Run the unit tests
        eval('check(%s)' % name)
        # If we get here, we passed the tests
        return True
    except:
        # The program ran, but failed the unit tests, or ran into some other exception
        return False
[7]:
eval_program(lm['program'], idx)
12
9
0
[7]:
True
If you run this prompt on all of HumanEval, you get 54.9% accuracy.
The model generates valid code (i.e., code that doesn’t trip up the Python interpreter) for 96% of the examples, but the code runs without exceptions for only 93% of them.
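
For reference, the 54.9% number can be computed with a loop like the sketch below, which reuses mistral, baseline, eval_program, and dataset from above. This is only an illustration of one way to run such an evaluation (an actual run would likely add timeouts and sandboxing), and humaneval_accuracy is just an illustrative name:

def humaneval_accuracy(make_program):
    # Run a guidance function on every HumanEval problem and return the
    # fraction of generated programs that pass the official tests
    prompts = dataset['test']['prompt']
    n_correct = 0
    for i, p in enumerate(prompts):
        out = mistral + make_program(p)
        if eval_program(out['program'], i):
            n_correct += 1
    return n_correct / len(prompts)

# e.g. humaneval_accuracy(baseline) gives roughly 0.549 with this model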

Let’s try another one:

[8]:
idx = 71
prompt = dataset['test']['prompt'][idx]
lm = mistral + baseline(prompt)
Here is an implementation of triangle_area:
```python

def triangle_area(a, b, c):
    '''
    Given the lengths of the three sides of a triangle. Return the area of
    the triangle rounded to 2 decimal points if the three sides form a valid triangle.
    Otherwise return -1
    Three sides make a valid triangle when the sum of any two sides is greater
    than the third side.
    Example:
    triangle_area(3, 4, 5) == 6.00
    triangle_area(1, 2, 10) == -1
    '''
    if a + b > c and a + c > b and b + c > a:
        s = (a + b + c) / 2
        return round(s * (s - a) * (s - b) * (s - c), 2)
    else:
        return -1
[9]:
eval_program(lm['program'], idx)
[9]:
False
Notice that this time the generated program doesn’t pass the evaluation tests.
But it’s worse than that: the program doesn’t even pass the first example in the docstring:
[10]:
exec(lm['program'])
triangle_area(3, 4, 5) # should be 6.00
[10]:
36.0
This suggests an improvement: let’s extract the tests in the docstring, and only return a program if it passes at least those tests.
First, let’s write a simple prompt to extract the examples from the docstring into tests:
[12]:
from guidance import any_char_but, regex
@guidance(stateless=True)
def test(lm, fn_name):
    """Only allows: assert fn_name(args) == expected"""
    return lm + ('   assert ' + fn_name + '('
                 + capture(zero_or_more(any_char_but(['\n'])), name='args')
                 + commit_point(select([') == ', ') is ', ')' + regex(r'\s\s?\s?\s?') + '== ']))
                 + capture(one_or_more(any_char()), name='result')
                 + commit_point('\n'))

@guidance
def write_tests(lm, prompt):
    r = re.findall(r'def (.*?)\(', prompt)
    name = r[-1]
    lm += '```python\n' + prompt + '    pass\n'
    lm += f'\ndef test_{name}():\n'
    lm += '    """Turns the example(s) in the docstring above into asserts"""\n'
    args = []
    expected = []
    # Write at most 10 tests, but stop when the model wants to stop
    for i in range(10):
        lm += test(name)
        args.append(lm['args'])
        expected.append(lm['result'])
        if not lm.will_gen('assert', ignore_spaces=True):
            break
    lm = lm.set('args', args)
    lm = lm.set('expected', expected)
    return lm
[13]:
lm = mistral + write_tests(prompt)
args = lm['args']
expected = lm['expected']
```python

def triangle_area(a, b, c):
    '''
    Given the lengths of the three sides of a triangle. Return the area of
    the triangle rounded to 2 decimal points if the three sides form a valid triangle.
    Otherwise return -1
    Three sides make a valid triangle when the sum of any two sides is greater
    than the third side.
    Example:
    triangle_area(3, 4, 5) == 6.00
    triangle_area(1, 2, 10) == -1
    '''
    pass

def test_triangle_area():
    """Turns the example(s) in the docstring above into asserts"""
   assert triangle_area(3, 4, 5) == 6.00
   assert triangle_area(1, 2, 10) == -1
   assert triangle_area(3, 4, 7) == -1
   assert triangle_area(3, 3, 3) == 0.00
   assert triangle_area(1, 1, 2) == -1
The LM went beyond extracting tests: it also generated a few of its own. While some of these may be incorrect, at least we have the original ones as well.
What’s more, we already stored the inputs and expected results in the lm object:
[14]:
# (input, expected output)
list(zip(lm['args'], lm['expected']))
[14]:
[('3, 4, 5', '6.00'),
 ('1, 2, 10', '-1'),
 ('3, 4, 7', '-1'),
 ('3, 3, 3', '0.00'),
 ('1, 1, 2', '-1')]

Let’s combine the baseline and the test generation prompts into a single guidance function:

[15]:
@guidance
def reconstruct_tests(lm, name, args, expected):
    """Helper to format tests nicely"""
    lm += f'def test_{name}():\n'
    for arg, e in zip(args, expected):
        lm += f'   assert {name}({arg}) == {e}\n'
    return lm

@guidance
def add_program_and_tests(lm, name, program, args, expected):
    """Helper to format program and tests nicely"""
    lm += f'Here is an implementation of {name}:\n'
    lm += '```python\n'
    lm += program + '\n'
    lm += reconstruct_tests(name, args, expected) + '```\n'
    return lm

@guidance
def baseline_and_tests(lm, prompt):
    lm2 = lm + baseline(prompt)
    r = re.findall(r'def (.*?)\(', prompt)
    name = r[-1]
    program = lm2['program']
    lm2 = lm + write_tests(prompt)
    args, expected = lm2['args'], lm2['expected']
    lm = lm.set('program', program)
    lm = lm.set('args', args)
    lm = lm.set('expected', expected)
    lm = lm.set('name', name)
    lm += add_program_and_tests(name, program, args, expected)
    return lm
[16]:
lm = mistral + baseline_and_tests(prompt)
Here is an implementation of triangle_area:
```python

def triangle_area(a, b, c):
    '''
    Given the lengths of the three sides of a triangle. Return the area of
    the triangle rounded to 2 decimal points if the three sides form a valid triangle.
    Otherwise return -1
    Three sides make a valid triangle when the sum of any two sides is greater
    than the third side.
    Example:
    triangle_area(3, 4, 5) == 6.00
    triangle_area(1, 2, 10) == -1
    '''
    if a + b > c and a + c > b and b + c > a:
        s = (a + b + c) / 2
        return round(s * (s - a) * (s - b) * (s - c), 2)
    else:
        return -1

def test_triangle_area():
   assert triangle_area(3, 4, 5) == 6.00
   assert triangle_area(1, 2, 10) == -1
   assert triangle_area(3, 4, 7) == -1
   assert triangle_area(3, 3, 3) == 0.00
   assert triangle_area(1, 1, 2) == -1
```
[17]:
lm['args'], lm['expected']
[17]:
(['3, 4, 5', '1, 2, 10', '3, 4, 7', '3, 3, 3', '1, 1, 2'],
 ['6.00', '-1', '-1', '0.00', '-1'])

Now, if we have a generated program and a set of tests, we can write a guidance function that runs the tests and outputs the results:

[18]:
# Helper function to load the program
def load_program(name, program):
    error = None
    try:
        exec(program, globals())
        fn = eval(name)
    except Exception as e:
        fn = None
        error = e
    return fn, error

# Tolerance when x and y are floats
def equals(x, y):
    if isinstance(x, float) and isinstance(y, float):
        return abs(x - y) < 0.00001
    else:
        return x == y

@guidance
def run_tests(lm, name, program, args, expected):
    fn, error = load_program(name, program)
    all_pass = True
    lm += 'Running the test(s) above gives:\n'
    for arg, e in zip(args, expected):
        # Reconstruct the test
        lm += f'assert {name}({arg}) == {e}\n'
        try:
            arg = eval(arg)
            expected_result = eval(e)
        except:
            continue
        try:
            if isinstance(arg, tuple):
                r = fn(*arg)
            else:
                r = fn(arg)
        except Exception as ex:
            r = ex
        if equals(r, expected_result):
            lm += 'Assertion passed.\n'
        else:
            all_pass = False
            lm += f'Assertion failed.\n'
            lm += f'Expected: {e}\n'
            lm += f'Actual: {r}\n'
        lm += '---\n'
    lm = lm.set('all_pass', all_pass)
    return lm

[19]:
mistral + run_tests(lm['name'], lm['program'], lm['args'], lm['expected'])
[19]:
Running the test(s) above gives:
assert triangle_area(3, 4, 5) == 6.00
Assertion failed.
Expected: 6.00
Actual: 36.0
---
assert triangle_area(1, 2, 10) == -1
Assertion passed.
---
assert triangle_area(3, 4, 7) == -1
Assertion passed.
---
assert triangle_area(3, 3, 3) == 0.00
Assertion failed.
Expected: 0.00
Actual: 15.19
---
assert triangle_area(1, 1, 2) == -1
Assertion passed.
---

Now, we can put this all together into a function that gets the LM to rewrite the program when the tests don’t work:

[20]:
@guidance
def run_tests_and_fix(lm, prompt):
    lm2 = lm + baseline_and_tests(prompt)
    name, program, args, expected = lm2['name'], lm2['program'], lm2['args'], lm2['expected']
    # Try this at most 3 times
    for _ in range(3):
        lm2 += run_tests(name, program, args, expected)
        # If the tests pass, we can stop
        if lm2['all_pass']:
            break
        lm2 += '\n'
        # Get the model to think about what's wrong
        lm2 += f'My implementation of {name} is wrong, because' + gen(stop='\n') + '\n'
        lm2 += 'In order to fix it, I need to' + gen(stop='\n') + '\n'
        lm2 += 'Here is a fixed implementation:\n'
        # Write a new program
        lm2 += '```python\n' + prompt + gen(max_tokens=800, stop_regex=r'\n[^\s]', name='program')
        lm2 += '```\n'
        # Reset the slate, start over with new program
        program = prompt + lm2['program']
        lm2 = lm + add_program_and_tests(name, program, args, expected)
        lm2 = lm2.set('program', program)
    return lm2
[22]:
lm = mistral + run_tests_and_fix(prompt)
Here is an implementation of triangle_area:
```python

def triangle_area(a, b, c):
    '''
    Given the lengths of the three sides of a triangle. Return the area of
    the triangle rounded to 2 decimal points if the three sides form a valid triangle.
    Otherwise return -1
    Three sides make a valid triangle when the sum of any two sides is greater
    than the third side.
    Example:
    triangle_area(3, 4, 5) == 6.00
    triangle_area(1, 2, 10) == -1
    '''
    # Check if the sides form a valid triangle
    if a + b > c and a + c > b and b + c > a:
        # Calculate the semi-perimeter
        s = (a + b + c) / 2
        # Check if the triangle is right-angled
        if a**2 + b**2 == c**2:
            # Use Gauss's formula for the area of a right-angled triangle
            area = ((s * (s - a) * (s - b) * (s - c)) ** 0.5)
        else:
            # Use Heron's formula for the area of a general triangle
            area = ((s * (s - a) * (s - b) * (s - c)) ** 0.5)
        return round(area, 2)
    else:
        return -1

def test_triangle_area():
   assert triangle_area(3, 4, 5) == 6.00
   assert triangle_area(1, 2, 10) == -1
   assert triangle_area(3, 4, 7) == -1
   assert triangle_area(1, 1, 2) == -1
   assert triangle_area(12, 5, 13) == 17.71
```
[23]:
program = lm['program']
exec(program)
print(triangle_area(3, 4, 5))
print(triangle_area(1, 2, 10))
6.0
-1

In this particular case, having more rounds lets the model fix its program so that it passes the unit tests. Does it also result in a program that passes the evaluation tests?

[24]:
eval_program(program, idx)
[24]:
True

Yes. Indeed, this simple prompt modification raises accuracy from 54.9% to 57.3% for this model (we’ve seen bigger gains with larger models).
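
To reproduce this comparison, the illustrative humaneval_accuracy helper sketched earlier can be pointed at either prompt (again only a sketch, with the same caveats as before):

baseline_acc = humaneval_accuracy(baseline)           # roughly 0.549 with this model
improved_acc = humaneval_accuracy(run_tests_and_fix)  # roughly 0.573 with this model
print('%.1f%% -> %.1f%%' % (100 * baseline_acc, 100 * improved_acc))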

Anyway, the point of this notebook is just to illustrate how easy it is to guide generation based on what was generated earlier (e.g., the test results depend on the current version of the code).