Predictive Recognition
Once you have an LL(1) parsing table, it is straightforward to implement a kind of parser called a predictive parser. It is predictive in the sense that it can predict what rule to use at every step - due to the grammar being LL(1). We start with some basic plumbing. Let’s suppose we have some mutable list of tokens, in OCaml this can be a reference to a token list:
1
tokens : (token list) ref
Much like in our lexer, we need a way to peek at the next element of the input (in this case, the next token) and a way to consume it (actually remove it from the head of the list). Also like in our lexer, we will only consume the head of the list if it matches a given token. So we will assume we have two functions:
1
2
peek : void -> token
eat : token -> void
Lets assume that, if the head of the list !tokens
is tk
then eat tk
evaluates to void ()
and modifies tokens
to remove tk
from the head as a side effect, otherwise it throws a parser error exception.
Then, to implement the parser, we simply make one new function for each non-terminal in the grammar. Recall the LL(1) grammar we studied last time:
\[\begin{array}{lcl} S &\longrightarrow& D\ \$\\ D &\longrightarrow& C\ D' \\ D' &\longrightarrow& \orop C\ D' \mid \epsilon \\ C &\longrightarrow& A\ C' \\ C' &\longrightarrow& \andop A\ C' \mid \epsilon \\ A &\longrightarrow& \tt \mid \ff \mid \mathsf{id} \mid (D) \end{array}\]So we will have 6 functions, s
, d
, d'
, c
, c'
and a
. Each of these functions will be responsible for parsing those strings that are derivable from the corresponding non-terminal in the grammar. So, for example, d
will parse exactly those strings derivable from $D$. Here’s the complete set of parsing functions:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
let rec d () =
match peek () with
| TkLParen | TkTrue | TkFalse | TkId _ -> c (); d' ()
| _ -> raise ParseFailure
and d' () =
match peek () with
| TkRParen | TkEnd -> ()
| TkOr -> eat TkOr; c(); d'()
| _ -> raise ParseFailure
and c () =
match peek () with
| TkLParen | TkTrue | TkFalse | TkId _ -> a(); c'()
| _ -> raise ParseFailure
and c' () =
match peek () with
| TkRParen | TkOr | TkEnd -> ()
| TkAnd -> eat TkAnd; a(); c'()
| _ -> raise ParseFailure
and a () =
match peek () with
| TkLParen -> eat TkLParen; d(); eat TkRParen
| TkTrue -> eat TkTrue
| TkFalse -> eat TkFalse
| TkId x -> eat (TkId x)
| _ -> raise ParseFailure
let s () =
match peek () with
| TkLParen | TkTrue | TkFalse | TkId _ -> d (); eat TkEnd
| _ -> raise ParseFailure
The implementation strategy is simple, for each non-terminal $X$ we perform a case analysis on the next token $a$ of the input, and then we “execute” the unique parsing rule that is listed in the parsing table with row $X$ and column $a$. Otherwise, if that cell of the table is empty, we raise an exception.
For example, for non-terminal $D$, the only terminals that have non-empty cells are true, false, left-paren and identifier and, in each of these, the parsing rule is $D \longrightarrow C\ D’$. We think of this parsing rule as saying: “to parse a string using $D$, first parse the first part of the string using $C$ and then parse the second part of the string using $D’$”. So, in the code for d()
we have a match on the next token, with these four cases all giving rise to the same expression c(); d'()
which consists of parsing the first part of the string using c()
and then parsing the next part of the string using d'()
.
When a parsing rule would require us to match a given terminal symbol $a$, as in $A \longrightarrow (D)$, we use eat a
to do so. So, this part of the case analysis for $a()$ reads: eat TkLParen; d(); eat TkRParen
. When a parsing rule requires that we do nothing, e.g. $C’ \longrightarrow \epsilon$, then we just return void, ()
, in the implementation.
Then, to use the parsing functions, we simply need to take the input string, lex it to obtain a list of tokens and then “call” the start symbol s()
:
1
2
let parse inp =
tokens := lex inp; s ()
If parse inp
returns void, we can conclude that inp
is a string in the language of the grammar and, otherwise, when parse inp
throws an exception, we can conclude that inp
is not a string in the language of the grammar.