Lexing
So far, we have used grammars as a mechanism to define a language - a potentially infinite set of strings, each of which can be generated by the process of derivation. In practice, in a compiler or interpreter for a programming language (let us just say a translator, to avoid fixing one or the other), we don't want to generate strings; rather, we are given a particular string (the source code of someone's program) as input. The goal of the translator is to check whether or not this string is valid with respect to the grammar for the programming language and, if so, to construct some internal representation of it for further processing by the translator. This is the problem called parsing.
Since the syntax of programming languages is often quite complex, there is usually a preprocessing step before parsing that makes both the grammar of the language and, consequently, the task of parsing strings in that language, conceptually simpler and more efficient.
Lexical Analysis
This preprocessing step is called lexical analysis, or lexing for short (also called scanning). It is the job of the lexer (also called the scanner) to take the input string and try to identify the smallest meaningful pieces, that is, substrings that correspond to some part of the syntax, but which would become meaningless (within that particular program) if they were broken up any further. These smallest meaningful pieces are called lexemes.
For example, given the assignment statement:
delta = dx + 3
we would expect the lexer to identify “delta” as a lexeme - it has meaning as an identifier - a variable name - but, if it were broken up at all, say into two pieces “del” and “ta”, neither of these on its own corresponds to a meaningful part of the syntax of this particular assignment statement.
In fact, there are no hard and fast rules about what should count as a meaningful part of the syntax; it is up to the designer of the language to decide what makes sense for them. Most of the time, though, a good rule of thumb is that lexing is about breaking up the string according to whitespace and punctuation.
Tokens
The lexer packages up the lexeme with a terminal symbol that describes what kind of entity it is with respect to the syntax of the programming language; this combination is called a token. For example, we might have a terminal symbol for all identifiers, $\mathsf{id}$, and then a token corresponding to the situation described above would be:
(id, “delta”)
i.e. the terminal $\mathsf{id}$ says it is an identifier, and the lexeme records which one - specifically “delta”.
We can also use a grammar to specify the possible shape of lexemes but, since lexemes are usually simpler to recognize, it would also suffice to use a regular expression (we no longer study regular expressions in this unit, but if you wish to look deeper into the theory of lexical analysis, you will see that it is underpinned by regular languages, which are the languages describable by regular expressions).
By consulting our grammar for the While language, we can identify the following lexemes:
- An identifier, terminal $\mathsf{id}$, consists of a lowercase letter, which is followed by any number of upper or lowercase letters or the prime character.
- A number, terminal $\mathsf{num}$, consists of a non-empty sequence of digits.
- The while keyword, terminal $\mathsf{while}$, consists of the string “while”.
- A left parenthesis, terminal $\mathsf{lparen}$, consists of the single character ‘(’.
- The assignment symbol, terminal $\mathsf{assn}$, consists of the string “<-”.
- The plus symbol, terminal $\mathsf{plus}$, consists of the character ‘+’.
- and so on
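To give a taste of the regular-expression view mentioned above, here is a small sketch (not part of the original notes) of how a few of these lexeme shapes could be written using OCaml's Str library; OCaml is the language we will use for the lexer later in this section, and the names id_re, num_re and so on are made up for this illustration:

(* requires the str library, e.g. #load "str.cma" in the plain OCaml toplevel *)
let id_re    = Str.regexp "[a-z][a-zA-Z']*"   (* lowercase letter, then letters or primes *)
let num_re   = Str.regexp "[0-9]+"            (* a non-empty sequence of digits *)
let while_re = Str.regexp_string "while"      (* exactly the keyword "while" *)
let assn_re  = Str.regexp_string "<-"         (* the assignment symbol *)

For instance, Str.string_match id_re "delta" 0 evaluates to true, whereas Str.string_match id_re "Delta" 0 evaluates to false, since identifiers must start with a lowercase letter.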
Lexing
Lexical analysis consists of a transformation from a given input string, i.e. a sequence of characters, into a sequence of tokens.
“delta <- dx + 3” --lexing--> (id, “delta”) (assn, “<-”) (id, “dx”) (plus, “+”) (num, “3”)
Typically, a lexer will just read the input string, character by character, trying to match substrings as lexemes and classify them with a terminal symbol, whilst throwing away irrelevant whitespace and comments.
Then the parser operates on tokens, looking at the first part of each token - the terminal symbol - in order to check if the input is valid, and looking at the second part when it constructs the internal representation, which we will see in a later lecture. For example, the token sequence above will be recognised as a valid statement by a parser for a grammar with rules:
\[\begin{array}{rcl} \mathit{Stmt} &\longrightarrow& \mathsf{id}\ \mathsf{assn}\ \mathit{Exp} \mid \cdots{}\\ \mathit{Exp} &\longrightarrow& \mathsf{id} \mid \mathsf{num} \mid \mathit{Exp}\ \mathsf{plus}\ \mathit{Exp} \mid \cdots{} \end{array}\]

So, when we incorporate lexing, we will usually use more abstract grammars that do not specify what individual lexemes may actually look like. For example, the grammar above does not specify what an identifier is allowed to look like; instead, all possible identifiers are represented in the grammar by a single terminal symbol, say $\mathsf{id}$. The specification of what an identifier looks like will be given elsewhere, not cluttering up this grammar. For example, in Microsoft’s specification of the C programming language, there is a separate “lexical grammar” which specifies e.g. what an identifier can look like.
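As a quick check that these rules do accept the token sequence from the example, here is a sketch of one possible derivation, reading only the terminal symbol of each token:

\[\mathit{Stmt} \Longrightarrow \mathsf{id}\ \mathsf{assn}\ \mathit{Exp} \Longrightarrow \mathsf{id}\ \mathsf{assn}\ \mathit{Exp}\ \mathsf{plus}\ \mathit{Exp} \Longrightarrow \mathsf{id}\ \mathsf{assn}\ \mathsf{id}\ \mathsf{plus}\ \mathit{Exp} \Longrightarrow \mathsf{id}\ \mathsf{assn}\ \mathsf{id}\ \mathsf{plus}\ \mathsf{num}\]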
A Lexer for Boolean Expressions
Deciding which substrings constitute meaningful parts of the syntax of the program is another string classification task, and there is some theory that concerns the kinds of languages that underpin lexical analysis. However, our time is short and lexers are relatively simple programs anyway, so we will not go into it. Instead, I would like to show you an example of a lexer, because most lexers you ever encounter will likely look very similar.
My implementation will be in the functional programming language OCaml. In the labs you may use whatever programming language you prefer. Functional programming languages are ideal for implementing anything to do with tools for programming languages (such as compilers, interpreters, static analysis tools), because these typically involve many (mutually) recursive functions defined over different kinds of tree data structures, and functional programming languages can describe these kinds of computations particularly elegantly. As a case in point, there was a long time when Perl had gotten so complex that the only fully-working Perl 5 interpreter was written in Haskell.
I’m going to assume you are familiar with general functional programming concepts like recursive functions and algebraic data types from your experience with Haskell. Of course these things look a little different in OCaml, but I’ll point them out to you.
You may wonder why I don’t just present my examples in Haskell. One reason is that I simply want you to see another perspective on functional programming, and give you a second chance if you never got it the first time. Another reason is that OCaml is less of an extreme language than Haskell and it makes some more practical choices (in my opinion). First and foremost of these, OCaml is call-by-value, which makes execution much easier to predict and reason about. Second, it has much more straightforward support for mutable state and IO, which makes writing complete and useful programs more accessible to beginners than in Haskell.
The type of tokens
The language for which I want to implement a lexer consists of simple Boolean expressions. The grammar is:
\[B \longrightarrow B \andop B \mid B \orop B \mid \tt \mid \ff \mid \tm{id} \mid (B)\]

The kinds of token/terminal for our Boolean expressions are:
- The keywords “true” and “false”
- The operators “&&” and “||”
- Left “(“ and right “)” parentheses
- Identifiers, which are non-empty sequences of lowercase letters, uppercase letters and the prime (apostrophe), and which must start with a lowercase letter
- An end-of-input token
Of those, we are only really interested in keeping the lexemes associated with identifiers. For example, there is no need to actually keep the lexeme “&&” associated with the conjunction operator: since it is always “&&”, we won’t lose any information by simply discarding it (if we know the token is the and-operator, then we already know the lexeme is “&&”).
We use an algebraic data type to represent tokens in OCaml, with one constructor for each kind of token. Note that, in contrast to Haskell, in OCaml types are written starting with a lowercase letter. Also in contrast to Haskell, when we write that a certain expression e has a certain type t, this is written e : t rather than e :: t.
type token =
| TkTrue
| TkFalse
| TkAnd
| TkOr
| TkLParen
| TkRParen
| TkId of string
| TkEnd
This introduces a new type called token which has 8 constructors, TkTrue, TkFalse etc. The first 6 constructors and the last one don't take any arguments; they simply are 7 different kinds of token - that is, each has type token, e.g. TkTrue : token. The other one takes a single argument of type string. Thus TkId is not a token, but for any string s, TkId s is a token, e.g. TkId "foo" : token.
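For instance, in the OCaml REPL (an illustrative session):

# TkTrue;;
- : token = TkTrue
# TkId "foo";;
- : token = TkId "foo"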
So, the goal of this lexer is to take some input of type string and return an output of type token list. In contrast to Haskell, when we have a type constructor like list, we write its argument before it, instead of after it. Moreover, there is no special syntactic sugar for the list type in OCaml. Compare the following, in which option is the equivalent OCaml type constructor to Haskell's Maybe:
| Haskell | OCaml |
|---|---|
| Maybe Int | int option |
| [Int] | int list |
| (Int, String) | int * string |
| [a] | 'a list |
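So, for example, a possible value of the output type (just an illustration, not from the lecture code) is:

let example_output : token list = [TkId "foo"; TkAnd; TkTrue; TkEnd]   (* a hypothetical example value *)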
The overall approach to converting the input string into a token list is straightforward: we simply look at each character of the input in turn, starting from the left-hand side of the string, and decide which kind of token it probably belongs to depending on what character it is.
The following code assumes that we have three mutable variables in scope: input of type string, input_len of type int and idx of type int. In OCaml, mutable variables are called references, and a reference to a piece of memory that holds something of type t has type t ref.
- input : string ref - a reference to the whole input string
- input_len : int ref - a reference to the length of the input string
- idx : int ref - a reference to the index of the next unmatched character in the input string
Reading the value stored at the piece of memory referenced can be done with the dereferencing operator !. Thus !input : string is the string currently occupying the memory referenced by input : string ref. To overwrite the contents of a reference, we use the assignment operator :=. Thus, idx := !idx + 1 overwrites the value stored in reference idx with one more than its old value. Notice that on the LHS you have a reference, the name of a piece of memory, and on the RHS you have an ordinary value, namely the value you want to store at that piece of memory. You can't add a piece of memory to the number 1, idx + 1, but you can add the number stored at a piece of memory to the number 1, !idx + 1.
let idx = ref 0
let input = ref ""
let input_len = ref 0
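As a quick illustration of these operators (an illustrative REPL session, not part of the lexer itself):

# idx := 5;;
- : unit = ()
# idx := !idx + 1;;
- : unit = ()
# !idx;;
- : int = 6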
Rather than interacting with these references directly, we will introduce a little API:
- init s - will load the string s into input, set input_len accordingly and set idx to 0.
- is_more () - returns true if the input is not yet fully consumed and false otherwise.
- peek () - will return the next unmatched character without consuming it (i.e. without advancing the index).
- eat c - attempts to match the next unmatched character to its argument; if it matches then it advances the index, and otherwise it raises an exception.
(* The exception raised when lexing fails *)
exception LexFailure of string

let init (s : string) =
  input := s;
  input_len := String.length s;
  idx := 0

let is_more () : bool =
  !idx < !input_len

let peek () : char =
  !input.[!idx]

let eat (c : char) =
  if peek () = c then
    idx := !idx + 1
  else
    raise (LexFailure (Printf.sprintf "Expected '%c' but found '%c'!" c (peek ())))
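To get a feel for this little API, here is an illustrative REPL session:

# init "ab";;
- : unit = ()
# peek ();;
- : char = 'a'
# eat 'a'; peek ();;
- : char = 'b'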
The code also assumes that we have helper functions is_lower : char -> bool, is_prime : char -> bool, is_upper : char -> bool and is_wspace : char -> bool that test whether a given character is lowercase, a prime, uppercase or whitespace (spaces and newlines) respectively.
let lex (s : string) : token list =
  init s;
  let tokens = ref [] in
  while is_more () do
    match peek () with
    | '(' ->
      eat '(';
      tokens := TkLParen :: !tokens
    | ')' ->
      eat ')';
      tokens := TkRParen :: !tokens
    | '&' ->
      eat '&';
      eat '&';
      tokens := TkAnd :: !tokens
    | '|' ->
      eat '|';
      eat '|';
      tokens := TkOr :: !tokens
    | c when is_lower c ->
      tokens := lex_kw_or_id () :: !tokens
    | c when is_wspace c -> eat c
    | c ->
      let error_msg =
        Printf.sprintf "Did not expect to find '%c'!" c
      in raise (LexFailure error_msg)
  done;
  (* Must reverse the list of tokens because
     we were always adding to the front *)
  List.rev (TkEnd :: !tokens)
This function first loads the input string into the reference and sets the length and index accordingly. Then it creates a new local reference tokens, initialised to the empty list, which is used to accumulate the list of tokens that will ultimately be returned.

After this, the code enters a while loop, processing each character of the string in turn. What it does in the loop is determined by what character it sees. For example, if the next character is & then this can only mean that we are looking at the first character of the logical-and syntax &&, so it proceeds to consume this first & and then the next one (if, in reality, the second character here is not an &, this indicates an invalid program and eat '&' will raise an exception). Once the code has consumed the whole lexeme it creates a new token, TkAnd, and adds it onto the list tokens to be output.
In the penultimate case, the next character of the input is whitespace (a space or newline) and it is simply consumed without creating a token.
When the next character of the input is a lowercase letter, then it could be that we are looking at either one of the two keywords true or false, or we are looking at an identifier. In general, we can't tell which until we have looked beyond the next five characters, because it is possible to have an identifier with a name like falsey.
Since there is no upper bound to the length of a valid identifier, it’s neater to defer the processing of strings of characters like this to a separate function implementing another loop. This function simply accumulates the lexeme (consuming characters so long as they are allowable characters for an identifier) and then afterwards checks whether the lexeme is one of the reserved words “true” or “false”.
let lex_kw_or_id () : token =
  let lexeme = ref "" in
  let is_id_char c =
    is_lower c || is_upper c || is_prime c
  in
  while is_more () && is_id_char (peek ()) do
    let c = peek () in
    eat c;
    lexeme := !lexeme ^ String.make 1 c
  done;
  match !lexeme with
  | "true" -> TkTrue
  | "false" -> TkFalse
  | _ -> TkId !lexeme
Then we can use the lexer in the REPL as follows:
# lex "foo && true || (false && bar)";;
- : token list =
[TkId "foo"; TkAnd; TkTrue; TkOr; TkLParen;
TkFalse; TkAnd; TkId "bar"; TkRParen;
TkEnd]
# lex "foo && true || (false * bar)";;
Exception: LexFailure "Did not expect to find '*'!".