Breaking down PPM: Primitive Parsing

Before moving on to the more complex parsers defined in this specification, it helps to outline some of the primitive parsers available in the DaeDaLus standard library, pointing out their uses in the PPM example where appropriate:

UInt8

The parser UInt8 parses a single byte of input, failing if there are no bytes left to consume. When it succeeds, it constructs a semantic value of type uint 8. The PPM example doesn’t use this parser, but later we will see examples of parsers that do.
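
As a minimal sketch, a block can run UInt8 twice to read two arbitrary bytes. The name TwoBytes and its field names are hypothetical and do not appear in the PPM specification:

def TwoBytes =
  block
    first  = UInt8
    second = UInt8

Assuming the usual block semantics, this parser should fail on inputs shorter than two bytes and otherwise return a structure holding both bytes.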

Note

Use UInt8 when you need to parse a single byte from input, whatever its value.

$[...]

$[...] parses a single byte of input that matches any of the bytes specified by the content in the square brackets. We can specify inclusive ranges of bytes, such as '0' .. '9', or we can specify the set elements explicitly, as in this example from the PPM specification:

def WS = $[0 | 9 | 12 | 32 | '\n' | '\r']
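
As a further sketch, ranges and explicit elements can be mixed in one set. The name Letter is hypothetical, and combining ranges with | inside the brackets is assumed to work the same way as the explicit elements in WS:

def Letter = $['a' .. 'z' | 'A' .. 'Z']

Under that assumption, Letter accepts exactly one ASCII letter and fails on any other byte.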

Note

Use $[...] when you need to parse one of a finite set of bytes.

Match ...

Match parses a particular sequence of bytes in the input, returning a semantic value that is an array of bytes (whose type is written [uint 8]). As an example, we can write Match "keyword" to match exactly those bytes corresponding to the string "keyword". We can also use this parser when working with binary formats, where it is often useful to specify precisely the bytes we’re expecting, e.g. Match [0x00, 0x01], which matches the two bytes 0 and 1.

In the declaration for the parser PPM, we use Match "P" to consume the first part of the aforementioned PPM “magic number”. We could just as well have used $['P'] here: the result is not used, so the meaning in this case is the same.
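
For a binary format, the same idea might look like the sketch below, which matches the eight-byte signature that begins every PNG file. The name PngSignature is hypothetical and unrelated to the PPM example:

def PngSignature = Match [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]

Writing the bytes in hexadecimal keeps the definition easy to compare against a format’s published specification.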

Note

Use Match when you need to parse a specific sequence of bytes.

END

The parser END succeeds only if there is no additional input to be consumed. It results in the ‘trivial’ semantic value {}. Typically, a DaeDaLus parser succeeds consuming only a prefix of the input, but adding END to the end of our parsing sequence means we must consume the entire input in order to succeed.

The PPM specification does not use the END parser, so the generated parser will in fact accept any input that begins with a well-formed ASCII PPM, leaving whatever follows unconsumed; depending on the use case, this may or may not be desirable.
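
If trailing bytes should be rejected, one possible sketch wraps the existing PPM parser and follows it with END. The name StrictPPM is hypothetical:

def StrictPPM =
  block
    $$ = PPM
    END

Here $$ = PPM is meant to make the result of PPM the result of the whole block, so StrictPPM returns the same value as PPM but succeeds only when nothing follows the image.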

Note

Use END when you want to guarantee that the entire input is consumed.

^ ...

It is sometimes convenient to ‘lift’ a semantic value into a parser that consumes no input and always succeeds, returning that same semantic value; this is an important part of how data-dependent parsing works in DaeDaLus. We can accomplish this by placing a ^ before any semantic value. For example, ^ 'A' is a parser that consumes no input and always succeeds, producing the byte 'A' as a result. If we want to consume no input and return nothing interesting, we may write ^ {} (using the same ‘trivial’ semantic value returned by the END parser); DaeDaLus also provides the more readable synonym Accept for this trivially succeeding parser.

The idea here is best shown by example. Consider the declaration of the Digit parser:

def Digit =
  block
    let d = $['0' .. '9']
    ^ d - '0'

Parsers can only be combined with other parsers, so to transform the ASCII byte we read with $[...] into the actual digit it represents, we must write a parser that returns that transformed value. This is exactly what ^ is for, since we don’t wish to read any additional input.

Parsers defined with ^ are called ‘pure’, because they do not consume any input (that is, they don’t alter the internal parsing state in any way). We’ll see many more examples of this in the other formats we study.
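
The same pattern applies to other conversions. The sketch below, using the hypothetical name LowerHexDigit, maps the bytes 'a' through 'f' to the values 10 through 15:

def LowerHexDigit =
  block
    let d = $['a' .. 'f']
    ^ 10 + (d - 'a')

As in Digit, the final line consumes no input; it only computes the value the block returns.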

Note

Use ^ ... to turn semantic values into parsers that don’t consume input.

Fail ...

We can trigger a failure with the Fail parser, which always fails. Optionally, we can provide a message to this parser which will be printed as part of the triggered failure; this is how you may indicate to users of your specifications what exactly went wrong while trying to parse.

The PPM example does not make use of the Fail parser; it is mostly useful for validating parsed data, a task that is often better left to later stages of processing. We’ll have more to say about this later.
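
As an illustration of validation with Fail, the sketch below reads one byte and rejects zero. The name NonZeroByte and the message text are hypothetical, and the if ... then ... else form is assumed to be available for choosing between two parsers:

def NonZeroByte =
  block
    let b = UInt8
    if b == 0
      then Fail "expected a nonzero byte"
      else ^ b

When the byte is zero, the parse fails with the given message; otherwise the byte is returned unchanged.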

Note

Use Fail ... to make parsing fail at that point, optionally with an error message.