Breaking down PPM: Parser Combinators

The primitives are critical to defining parsers, but aren’t very interesting on their own; we need ways to sequence them, represent notions of choice, and deal with repetition. First, we look at the various ways to sequence a collection of parsers.

Standard Sequencing

First and foremost, we need a way to run one parser after another. In DaeDaLus, we write sequenced parsers by writing the keyword block and, on the following lines, writing each parser we want to run, one parser per line. We’ve already seen this in the Token parser:

def Token P =
  block
    $$ = P
    Many (1..) WS

This says “run the parser P, then run the parser Many (1..) WS.” If either of these were to fail, the entire sequence would fail.

By default, when we use parser sequencing, the result of the last parser in the sequence is what will be returned. We can subvert this default, as in the Token example, using the special variable $$: assigning to this variable in a sequence means “return this as the result of the whole sequence.” As we’ll see later, this is simply syntactic sugar for a more verbose construction with exactly the same behavior.

Note

There is another way of writing sequenced parsers in DaeDaLus that you may see sometimes that does not rely on whitespace sensitivity or code layout.

Consider the two following declarations:

def UseBraces = { Match "A"; Match "B" }
def UseLayout =
  block
    Match "A"
    Match "B"

We can use braces/semicolons instead of layout which uses the block keyword.

Note that when using layout, all of the parsers we are sequencing must be aligned on the same column, and any text that is indented beyond this column belongs to the corresponding (preceding) parser. To demonstrate:

def UseLayout2 =
  block
    Match
      "A"
    Match "B"

The example above behaves the same as both of the previous parsers. We recommend using layout for more complex parsers, and braces/semicolons for short parsers that fit on a single line. These aren’t rules, though; use what’s comfortable for you!

Array Sequencing

We may also use square braces ([ .. ]) for sequencing parsers. When we use this notation, rather than returning a single result from one of the sequenced parsers, we return an array containing all of the results. Crucially, in this case, all of the parsers being sequenced must return the same type of semantic value since array elements must all have the same type.

Warning

Remember: Arrays in DaeDaLus must contain only elements of the same type! If you need to package up data of varying types, keep reading on about structure sequencing.

Structure Sequencing

What if we need to keep the results of multiple parsers, but they return different types of semantic values? Neither standard nor array sequencing are sufficient, so we need something a bit fancier.

In many programming languages, we can define record types that store a collection of named fields, each of which has its own type. For example, a Person record might contain a field name of type string, and an age field of type uint 8. We build a new Person by providing values for each field, and from a Person we may extract the values of each field, typically using some kind of “field access” notation.

DaeDaLus also supports record types, though in a non-traditional way: a record is defined by a corresponding parser. This idea is best shown by example.

In the PPM specification, we have the following declaration for a parser (pop quiz: how do we know it’s a parser?) called RGB:

def RGB =
  block
    red   = Token Natural
    green = Token Natural
    blue  = Token Natural

Note here that, rather than simply sequencing three parsers, we are storing the result of each in a variable. Doing so in this way means that the semantic value produced by the RGB parser will be a record (hereafter referred to as a structure) with three fields, red, green, and blue, and with a type named after the parser itself, i.e. RGB. As you might hope, the names we introduce are available to be referred to later in the sequence of parsers, so if we needed to, we could use the value stored in red while parsing green or blue.

To better demonstrate this last point, consider this more contrived example:

def S =
  block
    x = UInt8
    y = ^ x + 17

This defines a parser named S which will return a semantic value that is a structure (whose type is also named S) with two fields, x and y, where x is a byte we parse and y is that byte plus 17.

Note

It is also possible to define local variables within a declaration without causing a structure to be created; this can be useful when we want to save parsing results for later, or have some complex semantic value that we don’t want to write down more than once.

To introduce a local variable that won’t be turned into a structure field, prefix the assignment with the keyword let. We’ve already seen an example of this in the Digit parser:

def Digit =
  block
    let d = $['0' .. '9']
    ^ d - '0'

Here, the result of the parser $['0' .. '9'] is stored in a local variable d which we later use in a lifted semantic value to return the value of the digit itself.

Remember: If we prefix the assignment with let, we’re just creating a local variable, not a field of a structure!

De-Sugaring Nonstandard Structure Sequences

Let’s pull back the curtain a bit: as it turns out, most of the constructs for sequencing we’ve looked at so far can be expressed using only local variables and standard sequencing!

First, recall that the special variable $$ allows us to control which parser’s result is returned in a standard sequence. If we have { $$ = P; Q }, that means “run parser P”, then run parser Q, and return the result of parser P.” Can we write this without using the special variable?

Yes! All we need to do is store the result of P to refer to later, like so: { let x = P; Q; ^ x }. Here, we store the result of P in the local variable x, which we later lift using the primitive pure parser ^.

Similarly, array sequencing of parsers, such as [ P; Q ], can be written: { let x0 = P; let x1 = Q; ^ [x0, x1] }. Note that, in both this and the previous case, the expanded forms require us to come up with more names for things. Arguably, naming is one of the hardest problems we face in computer science, so it’s nice to be able to avoid coming up with new names using the shorthand originally presented.

Finally, even structure sequencing can be written this way, since we can construct structure semantic values using the primitive pure parser. If we have { x = P; y = Q }, this can also be written { let x = P; let y = Q; ^ { x = x, y = y } }.

While we recommend using the shorthand, developing an understanding of what it actually means can make it more obvious when each construct is appropriate for your use-cases.

Parsing Alternates

While it is great to be able to parse many things in sequence, most interesting formats require that we be able to parse one of a set of alternatives. As an example, in a programming language, there are typically many different forms of expression, and anywhere an expression is allowed, we must be able to successfully parse any of those different forms.

DaeDaLus is unique in that it provides two ways of handling alternatives: biased choice and unbiased choice. Many parsing libraries do not provide this flexibility. We’ll now look at these alternatives (no pun intended), and some examples that demonstrate their differing behaviors.

Note that our working PPM example does not use any alternative parsing. The extended exercise following this section, to implement the PNG image format, will show off these features more concretely.

Biased Choice Parsing

If we have two parsers, P and Q, we can construct the parser P <| Q. This new parser succeeds if either P or Q succeeds, and crucially, even if both would succeed on the input, it behaves like P since P takes precedence over Q. Thought about another way: P <| Q tries to parse using P, and if this fails, it backtracks and tries parsing with Q.

Consider this contrived example:

def P = $['A'] <| (^ 'B')

P consumes a single byte, 'A', and returns it, or it consumes nothing and returns the byte 'B' (in the case that parsing a single 'A' fails.) It’s important to note that P’s behavior is unambiguous on inputs starting with 'A'. It will always consume the 'A' rather than consuming nothing.

Unbiased Choice Parsing

We can also construct the parser P | Q from two parsers P and Q. Like biased choice, this parser succeeds if either P or Q succeed - However, if both would succeed on the input, it is ambiguous, and can parse inputs in more than one way. Typically, these ambiguities are handled by sequencing with other parsers.

If we take our biased choice example and replace <| with |:

def P = $['A'] | (^ 'B')

P is now ambiguous on inputs that start with 'A', since it can consume either one or zero bytes. Remember, DaeDaLus parsers in general only need to match a prefix of the input to succeed.

There are many grammars that have intentional ambiguities, and this unbiased choice facility in DaeDaLus allows us to express those formats with ease.

Note

Much like with parser sequencing, we can use a layout-based syntax to write down alternatives parsers. We use the keyword First for biased choice, and Choose for unbiased choice, like so:

def BP =
  First
    block
      Match "This is"
      Match "the first alternative"
    Match
      "The second one is here"

def UP =
  Choose
    block
      Match "This is"
      Match "the first alternative"
    Match
      "The second one is here"

Again, the language does not prefer this style over the use of <| and |; use whatever syntax is more comfortable for you. There is one major exception to this, which we’ll address in the next section.

Tagged Sum Types in DaeDaLus: Unions

Something not mentioned above is that, like array-sequenced parsers, alternative parsers must parse to the same type of semantic value on all branches, but this is limiting! What if, for example, we’re parsing a format that allows strings or numbers to appear in the same place? As described so far, we can’t handle this using biased or unbiased choice.

Enter sum types, called “unions” in DaeDaLus.

In many programming languages, sum types are how we can describe a set of alternatives. Typically, the variants of a sum type are labeled with a tag which may or may not carry some additional data of some other type. In DaeDaLus, such a sum type is called a union and the tags given to its alternatives are called constructors.

As a simple example, the type bool is a union with two constructors: true and false.

DaeDaLus allows us to implicitly declare and return a union semantic value using variations of the layout-based syntax described in the note above, similar to how we can build structures using parser sequencing. To do this, we use First and Choose. For example:

def GoodOrBad =
  First
    good = $['G']
    bad  = $['B']

This parser implicitly declares a new union called GoodOrBad with two constructors, good and bad. The parser returns a semantic value of type GoodOrBad.

Note

You might wonder if, like sequencing earlier, there is some syntactic sugar at play. Indeed, we can construct semantic values of union types explicitly, using a special bracket syntax, {| ... |}:

def GoodOrBad2 =
  First
    block
      let x = $['G']
      ^ {| good = x |}
    block
      let x = $['B']
      ^ {| bad = x |}

Note that, because of the way DaeDaLus attempts to infer the types of these sum-typed values, this declaration will in fact create a new sum type named GoodOrBad2. It is not interchangeable with the previous definition of GoodOrBad, even though both types have essentially the same values.

If we wanted this new parser to return the same type of semantic value as the original GoodOrBad, we would need to provide type annotations to guide the type inferencer:

def GoodOrBad3 =
  First
    block
      let x = $['G']
      ^ {| good = x |} : GoodOrBad
    block
      let x = $['B']
      ^ {| bad = x |}

Note that, since all branches of First and Choose parsers must have the same type, we need only annotate the first branch’s result. The type inferencer will take care of the rest.

Unions can be declared explicitly and then used in parsers as an alternative to declaring and using unions implicitly with First or Choose. For example:

def GoodOrBadData =
  union
    good: {}
    bad: {}

def GoodOrBad: GoodOrBadData =
  First
    block
      $['G']
      {| good |}
    block
      $['B']
      {| bad |}

In the example above, we’ve explicitly declared a union called GoodOrBadData and we’re returning semantic values of that type in the two parsers in GoodOrBad. Taking this approach rather than implicitly declaring a union can be helpful when we need to declare a union in one place and construct its values in multiple places or when we need to return semantic values of a union somewhere other than a First or Choose construct.

With this set of rich type-constructing mechanisms, you can go forth and create many interesting format specifications with DaeDaLus. But, what happens when you need to parse many copies of the same thing in sequence, perhaps an unknown number of times? For that, we use parser repetition, which we describe in the next section.

Repeating Parsers

If we need to parse the same thing multiple times, we can use the Many parser combinator. In its most basic form, it can parse an arbitrarily long sequence, stopping only when the given parser first fails. As a simple example:

def P = { $$ = Many $['7']; $['0'] }

This parser will match any number of 7s followed by a 0, e.g. "0", "70", "770", etc. The semantic value returned by the above parser is an array of all the 7s that were parsed.

Be cautious when using this unbounded form of Many! It parses inputs maximally, so it’s possible to accidentally create a parser that never succeeds, e.g.:

def P = { Many $['7']; $['7'] }

When we know that we are only parsing a particular number of things (or even that there is a lower or upper bound on the number of things), we can provide an optional additional argument to Many:

The parser Many n P succeeds if P succeeds exactly n times
The parser Many (i .. j) P succeeds if P succeeds at least i and at most j times

This latter form can be modified to leave off either the lower or upper bound, e.g. Many (i ..) P, which will succeed if P succeeds at least i times.

All we’re missing now for a complete understanding of the PPM example is some control-flow mechanisms and expressions involving semantic values. If you’re already familiar with other programming languages, you can probably figure out what’s going on with the for-loops and integer expressions, but the following section will explain these features in more detail.