\[ \newcommand{\tr}{\Rightarrow} \newcommand{\trs}{\tr^{\!\ast}} \newcommand{\rlnm}[1]{\mathsf{(#1)}} \newcommand{\rred}[1]{\xrightarrow{#1}} \newcommand{\rreds}[1]{\mathrel{\xrightarrow{#1}\!\!^*}} \newcommand{\cl}{\mathsf{Cl}} \newcommand{\pow}{\mathcal{P}} \newcommand{\matches}{\mathrel{\mathsf{matches}}} \newcommand{\kw}[1]{\mathsf{#1}} \newcommand{\andop}{\mathrel{\&\!\&}} \newcommand{\orop}{\mathrel{\|}} \newcommand{\ff}{\mathsf{false}} \newcommand{\tt}{\mathsf{true}} \newcommand{\abra}[1]{\langle #1 \rangle} \newcommand{\bnfnt}[1]{\abra{\small \textsf{#1}}} \newcommand{\llbracket}{[\![} \newcommand{\rrbracket}{]\!]} \newcommand{\first}{\mathsf{First}} \newcommand{\nullable}{\mathsf{Nullable}} \newcommand{\follow}{\mathsf{Follow}} \newcommand{\tm}[1]{\mathsf{#1}} \]
\[\newcommand{\andop}{\mathrel{\&\!\&}} \newcommand{\orop}{\mathrel{\|}} \newcommand{\ff}{\mathsf{false}} \newcommand{\tt}{\mathsf{true}} \newcommand{\tm}[1]{\mathsf{#1}}\]

Predictive Recognition

Once you have an LL(1) parsing table, it is straightforward to implement a kind of parser called a predictive parser. It is predictive in the sense that it can predict which rule to use at every step, precisely because the grammar is LL(1). We start with some basic plumbing. Let's suppose we have a mutable list of tokens; in OCaml this can be a reference to a token list:

tokens : (token list) ref

Much like in our lexer, we need a way to peek at the next element of the input (in this case, the next token) and a way to consume it (actually remove it from the head of the list). Also like in our lexer, we will only consume the head of the list if it matches a given token. So we will assume we have two functions:

peek : unit -> token
eat : token -> unit

Let's assume that, if the head of the list !tokens is tk, then eat tk evaluates to the unit value () and removes tk from the head of the list as a side effect; otherwise, it throws a parser error exception.
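One way to realise this plumbing is sketched below. The token type here is an assumption: its constructors are simply chosen to match those used by the parsing functions later in this section.

```ocaml
(* A sketch of the plumbing. The token type is an assumption: its
   constructors match those used by the parsing functions below. *)
type token =
  | TkTrue | TkFalse | TkId of string
  | TkAnd | TkOr
  | TkLParen | TkRParen | TkEnd

exception ParseFailure

let tokens : token list ref = ref []

(* Return the next token without consuming it; running out of input
   is treated as a parse error. *)
let peek () =
  match !tokens with
  | tk :: _ -> tk
  | [] -> raise ParseFailure

(* Remove the head of the list, but only if it matches the expected
   token; otherwise raise a parser error. *)
let eat tk =
  match !tokens with
  | tk' :: rest when tk' = tk -> tokens := rest
  | _ -> raise ParseFailure
```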

Then, to implement the parser, we simply make one new function for each non-terminal in the grammar. Recall the LL(1) grammar we studied last time:

\[\begin{array}{lcl} S &\longrightarrow& D\ \$\\ D &\longrightarrow& C\ D' \\ D' &\longrightarrow& \orop C\ D' \mid \epsilon \\ C &\longrightarrow& A\ C' \\ C' &\longrightarrow& \andop A\ C' \mid \epsilon \\ A &\longrightarrow& \tt \mid \ff \mid \mathsf{id} \mid (D) \end{array}\]

So we will have 6 functions, s, d, d', c, c' and a. Each of these functions will be responsible for parsing those strings that are derivable from the corresponding non-terminal in the grammar. So, for example, d will parse exactly those strings derivable from $D$. Here’s the complete set of parsing functions:

  let rec d () =
    match peek () with
    | TkLParen | TkTrue | TkFalse | TkId _ -> c (); d' ()
    | _ -> raise ParseFailure

  and d' () =
    match peek () with
    | TkRParen | TkEnd -> ()
    | TkOr -> eat TkOr; c (); d' ()
    | _ -> raise ParseFailure

  and c () =
    match peek () with
    | TkLParen | TkTrue | TkFalse | TkId _ -> a (); c' ()
    | _ -> raise ParseFailure

  and c' () =
    match peek () with
    | TkRParen | TkOr | TkEnd -> ()
    | TkAnd -> eat TkAnd; a (); c' ()
    | _ -> raise ParseFailure

  and a () =
    match peek () with
    | TkLParen -> eat TkLParen; d (); eat TkRParen
    | TkTrue -> eat TkTrue
    | TkFalse -> eat TkFalse
    | TkId x -> eat (TkId x)
    | _ -> raise ParseFailure

  let s () =
    match peek () with
    | TkLParen | TkTrue | TkFalse | TkId _ -> d (); eat TkEnd
    | _ -> raise ParseFailure

The implementation strategy is simple: for each non-terminal $X$, we perform a case analysis on the next token $a$ of the input and then "execute" the unique parsing rule listed in the parsing table at row $X$ and column $a$. If that cell of the table is empty, we raise an exception.
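For reference, the parsing table for this grammar, reconstructed here from the case analyses in the code, is the following (rows are non-terminals, columns are lookahead tokens, each non-empty cell gives the right-hand side of the rule to execute, and empty cells signal a parse error):

\[\begin{array}{c|cccccccc}
 & \tt & \ff & \mathsf{id} & ( & ) & \andop & \orop & \$ \\ \hline
S & D\ \$ & D\ \$ & D\ \$ & D\ \$ & & & & \\
D & C\ D' & C\ D' & C\ D' & C\ D' & & & & \\
D' & & & & & \epsilon & & \orop C\ D' & \epsilon \\
C & A\ C' & A\ C' & A\ C' & A\ C' & & & & \\
C' & & & & & \epsilon & \andop A\ C' & \epsilon & \epsilon \\
A & \tt & \ff & \mathsf{id} & (D) & & & &
\end{array}\]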

For example, for non-terminal $D$, the only terminals that have non-empty cells are true, false, left-paren and identifier and, in each of these, the parsing rule is $D \longrightarrow C\ D'$. We think of this parsing rule as saying: "to parse a string using $D$, first parse the first part of the string using $C$ and then parse the second part of the string using $D'$". So, in the code for d() we have a match on the next token, with these four cases all giving rise to the same expression c (); d' (), which consists of parsing the first part of the string using c () and then parsing the next part of the string using d' ().

When a parsing rule would require us to match a given terminal symbol $a$, as in $A \longrightarrow (D)$, we use eat a to do so. So, this part of the case analysis for a () reads: eat TkLParen; d (); eat TkRParen. When a parsing rule requires that we do nothing, e.g. $C' \longrightarrow \epsilon$, then we just return unit, (), in the implementation.

Then, to use the parsing functions, we simply need to take the input string, lex it to obtain a list of tokens and then “call” the start symbol s():

  let parse inp = 
    tokens := lex inp; s ()

If parse inp returns (), we can conclude that inp is a string in the language of the grammar; otherwise, when parse inp throws an exception, we can conclude that inp is not a string in the language of the grammar.
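To try the recognizer without committing to a particular lexer (lex is not shown here), we can feed it hand-built token lists directly. The following self-contained sketch packages the code from this section, together with an accepts helper that converts the exception-based outcome into a boolean; the token type and helper name are assumptions for illustration.

```ocaml
(* Self-contained sketch of the whole recognizer, run on hand-built
   token lists standing in for the output of a lexer. *)
type token =
  | TkTrue | TkFalse | TkId of string
  | TkAnd | TkOr
  | TkLParen | TkRParen | TkEnd

exception ParseFailure

let tokens : token list ref = ref []

let peek () =
  match !tokens with tk :: _ -> tk | [] -> raise ParseFailure

let eat tk =
  match !tokens with
  | tk' :: rest when tk' = tk -> tokens := rest
  | _ -> raise ParseFailure

let rec d () =
  match peek () with
  | TkLParen | TkTrue | TkFalse | TkId _ -> c (); d' ()
  | _ -> raise ParseFailure
and d' () =
  match peek () with
  | TkRParen | TkEnd -> ()
  | TkOr -> eat TkOr; c (); d' ()
  | _ -> raise ParseFailure
and c () =
  match peek () with
  | TkLParen | TkTrue | TkFalse | TkId _ -> a (); c' ()
  | _ -> raise ParseFailure
and c' () =
  match peek () with
  | TkRParen | TkOr | TkEnd -> ()
  | TkAnd -> eat TkAnd; a (); c' ()
  | _ -> raise ParseFailure
and a () =
  match peek () with
  | TkLParen -> eat TkLParen; d (); eat TkRParen
  | TkTrue -> eat TkTrue
  | TkFalse -> eat TkFalse
  | TkId x -> eat (TkId x)
  | _ -> raise ParseFailure

let s () =
  match peek () with
  | TkLParen | TkTrue | TkFalse | TkId _ -> d (); eat TkEnd
  | _ -> raise ParseFailure

(* Run the recognizer on a token list, converting success/failure
   into a boolean. *)
let accepts tks =
  tokens := tks;
  try s (); true with ParseFailure -> false
```

For instance, the token list for "x && (true || false)" followed by the end marker is accepted, while one beginning with || is rejected at the very first lookahead.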