Parsers

Primitive Parsers

Specific Byte. The parser $[ set ] matches a single byte that belongs to the set of bytes described by set. For example, $[ '0' .. '9' ] matches any bytes in the range 48 ('0') through 57 ('9') inclusive. See Character Classes for deatails on how to write sets of bytes.

Any Byte. The parser UInt8 extracts a single byte from the input. It fails if there are no bytes left in the input. If successful, it constructs a value of type uint 8. This is equivalent to writing $[ $any ].

Specific Byte Sequence. The parser Match bytes matches the byte sequence bytes in the current input. The resulting semantic value is an array of bytes, [uint 8], corresponding to the matched bytes. For example Match "keyword" will match "keyword", while Match [0x00,0x01] will match two bytes: 0 followed by 1.

End of Input. The parser END succeeds only if there is no more input to be parsed. If successful, the result is the trivial semantic value {}. Normally DaeDaLus parsers succeed as long as they match a prefix of the entire input. By sequencing (see Sequencing Parsers) a parser with END we specify that the entire input must be matched.

Pure Parsers. Any semantic value may be turned into a parser that does not consume any input and always succeeds with the given result. To do so prefix the semantic value with the operator ^. Thus, ^ 'A' is a parser that always succeeds and produces byte 'A' as a result. The parser Accept may be used to match the empty string without constructing an interesting semantic value. It is equivalent to ^ {}. In many cases, one may omit the ^ operator because of Implicit Lifting.

Explicit Failure The Fail construct will always fail. This parser is parameterized by an optional location, along with an error message.

Examples:

{- Declaration                      Matches       Result          -}
def GetByte     = UInt8             -- Any byte X X
def TheLetterA  = $[ 'A' ]          -- Byte 65    65
def TheNumber3  = $[0 .. 5]         -- A byte between 0 to 5
def TheNumber16 = $[ 0x10 ]         -- Byte 16    16
def Magic       = Match "HELLO"     -- "HELLO"    [72,69,76,76,79]
def AlwaysA     = ^ 'A'             -- ""         65
def GiveUp      = Fail "I give up"  -- (none)    Failure with message "I give up"

Sequencing Parsers

Basic Sequencing

Multiple parsers may be executed one after the other, by listing them either between { and } or between [ and ], and separating them with ;. Thus, { P; Q; R } and [ P; Q; R ] are both composite parsers that will execute P , then Q, and finally R. If any of the sequenced parsers fails, then the whole sequence fails.

Parsers sequenced with [] produce an array, with each element of the array containing the result of the corresponding parser. Since all elements in an array have the same type, all parsers sequenced with [] should construct the same type of semantic value.

By default, parsers sequenced with {} return the result of the last parser in the sequence.

Examples:

{- Declaration                                  Matches       Result -}
def ABC1 = { $['A']; $['B']; $['C'] }        -- "ABC"         67
def ABC2 = [ $['A']; $['B']; $['C'] ]        -- "ABC"         [65,66,67]
def ABC3 = { Match "Hello"; Match "ABC" }    -- "HelloABC"    [65,66,67]
def ABC4 = { Match "Hello"; $['C'] }         -- "HelloC"      67

An alternative notation for { .. } parsers is to use the block keyword and layout:

def UseBraces = { Match "A"; Match "B" }
def UseLayout =
  block
    Match "A"
    Match "B"

The parsers UseBraces and UseLayout are the same, just using a different notation. When using layout, the entries in the sequence must start on the same column, and any text that is indented more than that column belongs to the corresponding parser. So, the following parser is also equivalent to the previous two:

def AlsoTheSame =
  block
    Match
      "A"
    Match "B"

Explicit Result

A block or {}-sequenced group of parsers may return the result from any member of the group instead of the last one. To do so, assign the result of the parser to the special variable $$. For example:

def ReturnMiddle =
  block
    P
    $$ = Q
    R

In the example above, the semantic value produce by ReturnMiddle is that produced by Q.

Local Variables

It is also possible to combine the results of some of the block/{}-sequenced parsers by using local variables and the pure parser. Assignments prefixed by the keyword let introduce a local variable, which is in scope in the following parsers. Here is an example:

def Add =
  block
    let x = UInt8
    $[ '+' ]
    let y = UInt8
    ^ x + y

The parser Add is a sequence of 4 parsers. The local variables x and y store the results of the first and the third parser. The result of the sequence is the result of the last parser, which does not consume any input, but only constructs a semantic value by adding x and y together.

Structure Sequence

It is also possible to return results from more than one of the parsers in a block/{}-sequenced group. To do so give names to the desired results (without let). The semantic value of the resulting parser is a structure with fields containing the value of the correspondingly named parsers. Consider, for example, the following declaration:

def S =
  block
    x = UInt8
    y = Match "HELLO"

This declaration defines a parser named S, which will extract a byte followed by the sequence "HELLO". The result of this parser is a structure type, also named S, which has two fields, x and y: x is a byte, while y is an array of bytes.

Note that structure fields also introduce a local variable with the same name, so later parsers in the sequence may depend on the semantic values in earlier parsers in the sequence. For example:

def S1 =
  block
    x = UInt8
    y = block
          let z = UInt8
          ^ x + z

The parser S1 is a sequence of two parsers, whose semantic value is a structure with two fields, x and y. Both fields have type uint 8. The first parser just extracts a byte from input. The second parser is itself a sequence: first it extracts a byte from the input, but its semantic value is the sum of the two extracted bytes. As another example, here is an equivalent way to define the same parser:

def S2 =
  block
    x = UInt8
    let z = UInt8
    y = ^ x + z

Syntactic Sugar

A number of the constructs described in this section are may be thought of as simply syntactic sugar for using local variables. Here are some examples:

Expression:

Equivalent to:

{ $$ = P; Q }

{ let x  = P; Q;          ^ x }

[ P; Q ]

{ let x0 = P; let x1 = Q; ^ [x0,x1] }

{ x = P; y = Q }

{ let x  = P; let y  = Q; ^ { x = x; y = y } }

Parsing Alternatives

Biased Choice

Given two parsers P and Q we may construct the composite parser P <| Q. This parser succeeds if either P or Q succeeds. In the case that both succeed, the parser behaves like P. Note that P and Q have to construct semantic values of the same type.

More operationally, P would be used to parse the input first, and only if it fails would we execute Q on the same input. While this may be a useful intuition about the behavior of this parser, the actual parsing algorithm might implement this behavior in a different way.

Here are some examples:

{- Declaration           Matches        Result   -}
def B1 = $[ 'A' ]     -- "A"            'A', or
      <| $[ 'B' ]     -- "B"            'B'

def B2 = $[ 'A' ]
      <| ^ 'B'        -- "A"            'A', or
                      -- ""             'B'
These two are quite different:
  • B1 matches a single byte, either A or B and returns the matched byte as the result of the parser.

  • B2 matches either 1 byte, which must be A and will be returned as the result of the parser, or 0 bytes, in which case it will return byte B.

Unbiased Choice

Given two parsers P and Q we may construct the composite parser``P | Q``. This parser succeeds if either P or Q succeeds on the given input. Unlike biased choice, if both succeed, then the resulting parser is ambiguous for the given input, which means that input may be parsed in more than one way. It is possible, however, to resolve ambiguities by composing (e.g., in sequence) with other parsers.

Here are some examples:

def U1 = $[ 'A' ] | ^ 0
def U2 = { U1; 'B' }

Parser U1 on its own is ambiguous on inputs starting with "A" because it could produce either 'A (by consuming it from the input), or 0 (by consuming nothing). This happens because parsers only need to match a prefix of the input to succeed.

Parser U2 accepts inputs starting with either "AB" (by using the left alternative of U1) or starting with "B" (by using the right alternative of U1). No inputs are ambiguous in this case.

Alternative Syntax

Given multiple parsers A, B, … we can use the Choose keyword for unbiased choice and First for biased choice. These constructs use layout, in a similar style to block: when using this notation eahc alternative must start at the same indention in the file, and the entire definition of an alternative must be indented furter. Here are some examples:

def BiasedExample =
  First
    block
      Match "This is"
      Match "the firts alternaitve"
    Match
      "The second one is here"

def BiasedExample =
  Choose
    block
      Match "This is"
      Match "the firts alternaitve"
    Match
      "The second one is here"

Tagged Unions

DaeDaLus supports a variation on Choose and First that can be used to construct tagged unions, which is useful if you’d like the semantic value to reflect which of the parsers succeeded, or if the branches need to return construct results of different types.

For example, the following parser constructs a union with possible tags good and bad, depending on whether the input character is 'G' or 'B'.

def BorG =
  First
    good = $[ 'G' ]
    bad  = $[ 'B' ]

This parser works in a similar way to ordinary First except that if an alternative succeeds, the resulting semantic value is tagged with the given tag (e.g., good or bad and the previous example). The type of the semantic value is of a new user-defined type, derived from the name of the declaration—in the previous example, the result of the parser would of a newly defined union type called BorG.

It is also possible to construct a value if a tagged-union type using the notation {| good = 'G' |}. For example, an alternative way to write the previous example is like this:

def AnotherBorG =
  First
    block
      let x = $[ 'G' ]
      ^ {| good = x |}
    block
      let x = $[ 'B' ]
      ^ {| bad = x |}

Note that when using the {| tag = value |} notation, DaeDaLus will try to infer the type of the tagged union. If it cannot infer it, it will generate a new user defined type: this is the case in the previous example, and so parser AnotherBorG will return values of a newly generated type also called AnotherBorG.

It is important to note that even though BorG and AnotherBorG have essentially the same values, these values have distinct types and cannot be freely interchanged.

If we want to make a tagged union value of an existing type, we’d have to provide a type annotation, unless the type can already be inferred from the context. For example:

def YetAnotherBorG =
  Choose
    block
      let x = $[ 'G' ]
      ^ {| good = x |} : BorG
    block
      let x = $[ 'B' ]
      ^ {| bad = x |}

The : BorG in the first alternative specifies that we are making a value of type BorG. Note that we do not need to provide the annotation on the second alternative because all alternatives in (untagged) Choose have the same type, so DaeDaLus can infer that we are also making a value of type BorG.

Unions can also be declared explicitly for use in the above scenarios rather than declaring them implicitly by using Choose or First. For example:

def MyUnion =
  union
    Good: uint 8
    Bad: [uint 8]

In this example, we have explicitly declared a union MyUnion with two constructors, Good and Bad. The Good constructor carries a single uint 8 and the Bad constructor carries a list of uint 8. Such a union could then be used in a parser as follows:

def MyUnionParser: MyUnion =
  Choose
    block
      let x = UInt8
      {| Good = x |}
    block
      {| Bad = "Some text" |}

To use an explicitly-declared union, we give MyUnionParser a return type annotation to indicate that it returns values of type MyUnion. We then construct union semantic values using the Good and Bad constructors and the {| ... |} notation.

Repetition

The Many construct allows the same parser to be run multiple times in sequence on an incoming data stream, and it returns an array containing the resulting semantic values.

block
  $$ = Many $[ '7' ]
  $[ '0' ]
  END

This code will successfully parse any stream consisting of multiple 7 characters, terminated by the 0 character at the end of the stream. For example, the stream "7770" will return the array ['7', '7', '7'].

The Many construct optionally takes either a uint 64 value or an interval bounded by two uint 64 values:

  • Many n P succeeds if it executes parser P exactly n times.

  • Many (i..j) P succeeds if it executes parser P at least i and at most j times.

  • Many also supports lower-bounded intervals Many (i..) P, and likewise upper-bounded intervals Many (..j) P.

To avoid spurious backtracking, Many will parse any input maximally. This can have counter-intuitive consequences! For example, the following code will never succeed:

block
  Many $[ '7' ]
  $[ '7' ]

The call to Many will consume all the input characters matching 7, meaning that the following $[ '7' ] will always fail. This may be difficult to spot in situations where two more complex parsers are run in sequence, the first of which contains an unbounded call to Many.