===========================
Bogdan Buduroiu
===========================

Formal Grammars for Large Language Models


Let’s jump in and explore a world where Large Language Models speak to us not in prose but in JSON and XML, and where we can deterministically restrict their outputs so that they act according to the Unix philosophy:

  • Write programs that do one thing and do it well.
  • Write programs to work together.
  • Write programs to handle text streams, because that is a universal interface.

Why the Unix philosophy?

  • We have LLMs and LoRAs that do one thing very well
  • We want to plug the output of one model into another (or loop it back into itself)
  • We can pipe the output of LLMs into non-LLM tools like sed and jq to build complex pipelines

Throughout this article, I will be using the new Mistral-7B-Instruct model, as it is small, fast and VERY capable.

At the moment, asking it to write an article yields exactly what you’d expect:

from langchain.prompts import PromptTemplate

# Mistral's instruction format wraps the user message in [INST] ... [/INST]
MISTRAL_PROMPT_TEMPLATE = "[INST] {text} [/INST]"
MISTRAL_PROMPT = PromptTemplate(template=MISTRAL_PROMPT_TEMPLATE, input_variables=["text"])

# llm is a LlamaCpp instance; its setup is shown further down
print(llm(MISTRAL_PROMPT.format(text="Write a quick article on why nuclear powers should vehemently resist dismantling their nuclear arsenal")))

"Nuclear powers have always been at the forefront of global politics and international relations. They possess a tremendous amount of power and influence, which is why they must be very careful about how they handle their nuclear arsenal. In recent years, there has been debate about whether or not these countries should dismantle their weapons programs. However, there are several reasons why nuclear powers should vehemently resist dismantling their nuclear arsenal. ..."

This is good and, as you (and many others online) have noticed, Mistral is quite uncensored in its output (a good thing, if you ask me).

Now, let’s say we’re trying to build an automated news website (shameless plug here - Trending on Weibo). We might want the model to give us 1) a title, 2) an article, and potentially 3) a good, SEO-friendly URL slug.

Enter Formal Grammars

Formal grammars are a concept from applied mathematics and, among other things, are used to define the syntax of programming languages.

Imagine you go to a chippy and the bossman asks whether you’d like your burger to have here or to take away. You can define bossman’s options using Backus-Naur Form (a metasyntax notation for formal grammars), like so:

<choice> ::= "Here" | "Takeaway"

Now this is the simplest possible example, but BNF can be used to describe many things, for example, a US postal address:

<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL> | <personal-part> <name-part>
<personal-part> ::= <initial> "." | <first-name>
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""

OK, now that we have a primer on what formal grammars are, how can we use them with LLMs?

As you saw above, to abide by the grammar, the chippy gives us only a "Here" or "Takeaway" option. For LLMs, this works exactly the same way.

You can force the LLM to answer in the shape you expect by restricting its choice of tokens: at every decoding step, any token that would violate the grammar is masked out before sampling.
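
To make "restricting tokens" concrete, here is a toy sketch of grammar-constrained decoding against the chippy grammar. The five-token vocabulary, the logits, and both helper functions are made up for illustration; real implementations such as llama.cpp track a parser state over the tokenizer’s full vocabulary, but the masking idea is the same:

import math

# Toy vocabulary and grammar: <choice> ::= "Here" | "Takeaway"
VOCAB = ["Here", "Takeaway", "Maybe", "burger", "{"]
OPTIONS = {"Here", "Takeaway"}

def allowed_tokens(generated: str) -> set:
    # A token survives if the text generated so far, plus this token,
    # is still a prefix of (or completes) some production of the grammar.
    return {t for t in VOCAB if any(o.startswith(generated + t) for o in OPTIONS)}

def mask_logits(logits: dict, generated: str) -> dict:
    # Grammar-violating tokens get -inf, so sampling can never pick them;
    # the model's preferences only matter among grammar-legal tokens.
    allowed = allowed_tokens(generated)
    return {t: (l if t in allowed else -math.inf) for t, l in logits.items()}

# Even if the model strongly prefers "Maybe", it can only say "Here" or "Takeaway":
logits = {"Here": 0.1, "Takeaway": 0.3, "Maybe": 2.5, "burger": 1.0, "{": 0.2}
print(mask_logits(logits, generated=""))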

We want the LLM to return us an article in JSON form, so let’s whip up a quick grammar using ggerganov’s GBNF syntax (which extends standard BNF with modern, regex-like features such as character classes and repetition operators):

root   ::= object
value  ::= object | array | string | number | ("true" | "false" | "null") ws

object ::=
  "{" ws (
            string ":" ws value
    ("," ws string ":" ws value)*
  )? "}" ws

array  ::=
  "[" ws (
            value
    ("," ws value)*
  )? "]" ws

string ::=
  "\"" (
    [^"\\] |
    "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
  )* "\"" ws

number ::= ("-"? ([0-9] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws

# Optional space: by convention, applied in this grammar after literal chars when allowed
ws ::= ([ \t\n] ws)?

And now, let’s restrict our LLM’s output using this grammar:

(I’m using LangChain here for brevity, but there are many tools that support grammars, though mostly within the llama.cpp universe.)

llm = LlamaCpp(
    model_path="mistral-7b-instruct-v0.1.Q8_0.gguf",
    ...
    grammar_path="grammars/json.gbnf"
)

print(llm(MISTRAL_PROMPT.format(text="Write a quick article on why nuclear powers should vehemently resist dismantling their nuclear arsenal")))
{
  "title": "Why Nuclear Powers Should Resist Dismantling Their Arsenals",
  "author": "John Smith",
  "date": "2019-03-28T19:45:00Z",
  "image": "https://images.unsplash.com/photo-1567789052-f723b2e55d?ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8c2FjdGlvbnxlbnwwfHwwfHw%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=60",
  "url": "https://www.reuters.com/business/economics-business/why-nuclear-powers-should-resist-dismantling-their-arsenals-2019-03-28/",
  "content": "Nuclear weapons are considered one of the most powerful tools in a nation's arsenal. The ability to use them as a last resort has kept countries from engaging in full-scale war for decades. However, some have suggested that nuclear powers should dismantle their arsenals as part of a broader disarmament effort. While this may seem like a positive move, there are several reasons why nuclear powers should resist dismantling their arsenals."
}
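
One caveat: the grammar above only guarantees some valid JSON object — the model chose its own keys. If we want to pin down the exact fields from earlier (title, article, slug), we can swap the root ::= object rule for a specialised one. A sketch, reusing the string and ws rules from above:

root ::= "{" ws
         "\"title\":" ws string ","
         ws "\"article\":" ws string ","
         ws "\"slug\":" ws string
         "}" ws

With this, the model has no choice but to fill in exactly those three fields.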

Now that’s starting to look like nice output we could feed straight into Mongo.
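
Because the grammar guarantees the reply parses as JSON, persisting it needs no retry-and-repair logic. A minimal sketch — pymongo is real, but the news database, articles collection, and raw_output variable are stand-ins:

import json
from pymongo import MongoClient  # assumes a MongoDB instance on localhost

article = json.loads(raw_output)  # raw_output: the grammar-constrained reply above

client = MongoClient("mongodb://localhost:27017")
client["news"]["articles"].insert_one(article)

But let’s take it a step further.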

Agents using Open Source models

My biggest pain point when setting up autonomous agents with open source models has always been their unstructured responses.

While GPT-4 can be coaxed into outputting JSON through prompting alone, smaller open source models can output anything from plain text to XML when asked for JSON.

Let’s deviate from the grammar above and restrict the model’s choices even further by allowing it to reply only with one of these function calls:

{"function": "google_search", "arguments": {"query": ""}}
{"function": "image_search", "arguments": {"query": ""}}
{"function": "create_event", "arguments": {"title": "", "date": "" , "time": ""}}

I’ll define the allowed calls as a JSON Schema and use a script from llama.cpp to convert it into GBNF format:

functions_schema = {
    "oneOf": [
        {
            "type": "object",
            "properties": {
                "function": {"const": "create_event"},
                "arguments": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "date": {"type": "string"},
                        "time": {"type": "string"}
                    }
                }
            }
        },
        {
            "type": "object",
            "properties": {
                "function": {"const": "image_search"},
                "arguments": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    }
                }
            }
        },
        {
            "type": "object",
            "properties": {
                "function": {"const": "google_search"},
                "arguments": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    }
                }
            }
        }
    ]
}
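
For reference, the conversion step looks something like this. A sketch only: the script’s exact path and CLI have moved around between llama.cpp versions (check the examples/ directory of your checkout), and grammars/functions.gbnf is just the filename I’m using here:

import json
import subprocess

# Dump the schema to disk and convert it to GBNF with llama.cpp's
# schema-to-grammar script.
with open("functions_schema.json", "w") as f:
    json.dump(functions_schema, f)

gbnf = subprocess.run(
    ["python", "llama.cpp/examples/json_schema_to_grammar.py", "functions_schema.json"],
    capture_output=True, text=True, check=True,
).stdout

with open("grammars/functions.gbnf", "w") as f:
    f.write(gbnf)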

So now, our model can only reply to us with the function it would like to execute in order to accomplish the task in the prompt.
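
Before prompting, re-instantiate the model with this new grammar (assuming it was saved to grammars/functions.gbnf, as in the sketch above):

llm = LlamaCpp(
    model_path="mistral-7b-instruct-v0.1.Q8_0.gguf",
    ...
    grammar_path="grammars/functions.gbnf"
)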

So let’s prompt it:


prompts = [
    "Search recent news in Taiwan",
    "Find an image of a dog."
]

Now, since we expect the model to ask to call a function with some parameters, let’s write some code for that functionality:

import json

from duckduckgo_images_api import search
from duckduckgo_search import DDGS

for prompt in prompts:
    # The grammar guarantees the reply is one of our function-call objects,
    # so json.loads never sees malformed output.
    result = json.loads(llm(MISTRAL_PROMPT.format(text=prompt)))

    print(f"Prompt: {prompt}")
    print(f"Result: {result}\n")

    if result["function"] == "image_search":
        img = search(result["arguments"]["query"])
        print("=== IMAGE RESULTS ===")
        print(*[r["url"] for r in img["results"]], sep="\n\t")
    elif result["function"] == "google_search":
        with DDGS() as ddgs:
            results = [r for r in ddgs.text(result["arguments"]["query"], max_results=5)]
        print("=== SEARCH RESULTS ===")
        print(*map(lambda r: (r["title"], r["body"]), results), sep="\n")

And the result issss:

Prompt: Search recent news in Taiwan
Result: {'function': 'google_search', 'arguments': {'query': ' Recent news in Taiwan'}}

=== SEARCH RESULTS ===
('Taiwan news - breaking stories, video, analysis and opinion | CNN', "View the latest Taiwan news and videos, including politics and business headlines. Latest news Cartoon elves and scrolls visualize Chinese military's goal of Taiwan 'reunification'...")
('Taiwan News - Breaking News, Politics, Environment, Immigrants, Travel ...', 'Taiwan News is the most widely visited English-language portal for news about Taiwan, offering the outside world a revealing look at all things Taiwan Taiwan News - Breaking News, Politics, Environment, Immigrants, Travel, and Health')
('Taiwan - BBC News', '4 Sep Watch: Typhoon Saola and Storm Haikui seen from satellite Asia 1 Sep 0:17 The iPhone billionaire who wants to be Taiwan president Asia 28 Aug Taiwan detects 42 Chinese warplanes Asia 19...')
('China-Taiwan conflict: What you need to know | CNN', "Hong Kong CNN — US President Joe Biden's warning the US would defend Taiwan against Chinese aggression has made headlines around the world - and put growing tensions between the small...")
("Taiwan | Today's latest from Al Jazeera", "SHORT ANSWER Tai\xadwan launch\xades the Haikun, its first do\xadmes\xadti\xadcal\xadly-made sub\xadma\xadrine Tai\xadwan has plans to build eight sub\xadmarines, which will be a key part of the is\xadland's strat\xade\xadgy of...")

Prompt: Find an image of a dog.
Result: {'function': 'image_search', 'arguments': {'query': 'dog'}}

__________
Width 3296, Height 2497
Thumbnail https://tse4.mm.bing.net/th?id=OIP.vpENuVG6_Ke79c0shGAHMQHaFn&pid=Api
Url http://www.businessinsider.com/9-reasons-to-own-a-dog-2014-12
Title b'9 reasons to own a dog - Business Insider'
Image http://static3.businessinsider.com/image/5484d9d1eab8ea3017b17e29/9-science-backed-reasons-to-own-a-dog.jpg
__________
Width 2400, Height 1589
Thumbnail https://tse2.mm.bing.net/th?id=OIP.z86nurg5VEy9ULrYNyQu0wHaE5&pid=Api
Url https://burudidavvyd.blogspot.com/2018/09/25-beautiful-dog-species-name.html
Title b'25 Beautiful Dog Species Name'
Image https://www.rd.com/wp-content/uploads/2016/01/04-dog-breeds-dalmation.jpg

We could build on this further, adding a loop where the model can summarise the news it found, and prompt itself to investigate deeper, but that’s for another article.

And that’s the 10,000 ft view of formal grammars for LLMs. For me, this was a game changer, and it allowed me to continue and expand my use of open source models in my projects.