Mathematical Notation
Some notation for discussing context free grammars a bit more precisely.
A Context Free Grammar (CFG) consists of four components:
- An alphabet of terminal symbols, which we shall usually write as
(capital letter sigma) - A finite, non-empty set of non-terminal symbols, disjoint from the terminals, which we shall usually write as
(caligraphic letter N) - A finite set of production rules, which we shall usually write as
(caligraphic letter R) - A designated non-terminal from
, called the start symbol, which we will usually write as .
We will adopt the following conventions. We will use lowercase letters from the start of the alphabet,
If you think back to the previous example, you will notice that a key part of the grammar is manipulating sequences of terminals and non-terminals, like
A sentential form is just a finite sequence of terminals (from
According to this view, we will reserve metavariables
We use metavariables when we want to state something that involves a string, or a sentential form that we don’t want to specify concretely. For example, we might say, “for all strings
However, to make the situation even more confusing, but much more convenient, we will by some abuse use the metavariables for terminals and nonterminals also as the names of particular terminals and non-terminals when their names don’t matter. For example, we might say “Consider a grammar for generating strings over the alphabet
The production rules
where
Here’s another example of a context free grammar which is particularly important to this unit, it is the syntax of a little idealised imperative programming language called While.
The syntax of the While Programming Language is given by the CFG with:
- Terminal symbols:
, , , , , , , , , , , , . - Nonterminal symbols:
, , , , , , , , , -
Production rules:
- Start symbol:
Informally the “meaning” of the different nonterminals (sometimes called the syntactic classes of the language) are as follows:
: statements, i.e. code that, when executed, does not return a value but instead has some effect (such as updating the value of a variable) : Boolean expressions, i.e. code that, when executed, will return a Boolean (true/false) value : arithmetic expressions, i.e. code that, when executed, will return a numerical value : single digits : possibly empty sequences of digits : single lowercase letters, or the prime (apostrophe) symbol ‘ : single uppercase letters : possibly empty sequences of letters and underscores : variable names (identifiers) : numerals
There are quite a number of nonterminals, but there is a sense (which we will make more precise later) that the first three are really doing all the heavy lifting in the description of this language. The latter seven non-terminals are just describing two things: the shape of numbers and the shape of variable names (also known as identifiers).
Note: there is nothing special about the particular letters used for the nonterminals, we could have used alternative names and we would still consider it essentially the same description of the While programming language. For example, if we replaced everywhere in the rules the letter N by the letter P.
Some remarks on the notation:
is the “do nothing” statement, it has no effect when executed is an assignment statement: assign the value described by the arithmetic expression derived from to the variable whose name derives from .- {
} are just brackets for statements (as parentheses are to expressions) - We use
to combine two statements by executing them in sequence, rather than as a terminator for every (atomic) statement, so e.g. is a perfectly good statement derivable from .
Derivation
The process of deriving a string from the grammar, step by step, is made precise using the derivation relation
The one-step derivation relation is a binary relation on sentential forms with two sentential forms
We write
Finally, we say that a word
In other words, we write
Some common ways in which this notation will be used:
means that the grammar, starting from the start symbol (which is a sentential form of length 1), derives a string (of terminals) . Here the sentential form at the end of some number of derivation steps is asserted to be a string (i.e. a sentential form containing no non-terminals). means that sentential form is derivable in the grammar. may or may not be a string, so without further information it could be that represents a kind of intermediate point in the derivation which is not yet at an end.
For example, in the grammar above we have the following:
-
This is justified by the definition if we take
to be , to be , to be and to be . -
This is justified by the definition if we take
to be , to be , to be the sentential form and as . -
This is justified by the definition if we take
to be , to be , to be and to be .
If we put these together in order:
then this justifies the statement:
and this justifies us to claim that
Design of Context-Free Grammars
We discuss the following examples over alphabet
A grammar for this language is:
Each derivation step consists of either extending the sentential form by any choice of digit 0 or 1, or ending the string with epsilon. Thus, any possible word can be derived.
A grammar for this language is:
A word derivable from S must include 3 1s, because there is only one rule possible which introduces them. Around the 1s you may put any words you like because we only require “at least” 3, so the rules for
A grammar for this language is:
If you look at the shape of words derivable from
A grammar for this language is:
When we use the first production rule for
A grammar for this string is:
Here again, the middle of each sentential form (of length > 1) will be
Here
A word in this language can be cut down the middle, and the substring on the left is different from the reverse of the substring on the right. In other words, there is some letter, say
The grammar encodes this in the following way. Starting from