\[ \newcommand{\tr}{\Rightarrow} \newcommand{\trs}{\tr^{\!\ast}} \newcommand{\rlnm}[1]{\mathsf{(#1)}} \newcommand{\rred}[1]{\xrightarrow{#1}} \newcommand{\rreds}[1]{\mathrel{\xrightarrow{#1}\!\!^*}} \newcommand{\cl}{\mathsf{Cl}} \newcommand{\pow}{\mathcal{P}} \newcommand{\matches}{\mathrel{\mathsf{matches}}} \newcommand{\kw}[1]{\mathsf{#1}} \newcommand{\andop}{\mathrel{\&\!\&}} \newcommand{\orop}{\mathrel{\|}} \newcommand{\ff}{\mathsf{false}} \newcommand{\tt}{\mathsf{true}} \newcommand{\abra}[1]{\langle #1 \rangle} \newcommand{\bnfnt}[1]{\abra{\small \textsf{#1}}} \newcommand{\llbracket}{[\![} \newcommand{\rrbracket}{]\!]} \newcommand{\first}{\mathsf{First}} \newcommand{\nullable}{\mathsf{Nullable}} \newcommand{\follow}{\mathsf{Follow}} \newcommand{\tm}[1]{\mathsf{#1}} \]

Strings and Languages

An alphabet is any finite set, whose members are called letters (equivalently: symbols or characters). We typically use $\Sigma$ to denote a generic alphabet and $a,b,c,d$ as variables that stand for the letters.

A string (equivalently: word) over an alphabet $\Sigma$ is a finite sequence of characters from $\Sigma$. The sequence may be empty, and we write the empty string as $\epsilon$. We typically use $u,v,w,x,y,z$ as variables that stand for a string.

For example, $\mathsf{hello}$ is a string over the alphabet \(\{\mathsf{a},\mathsf{b},\mathsf{c},\ldots,\mathsf{y},\mathsf{z}\}\), but $\mathsf{Hello}$ is not. Another string over the same alphabet is $\mathsf{abbbbb}$ and another is $\epsilon$. Strings over the alphabet \(\{▽,△\}\) include $\mathord{\bigtriangleup}\mathord{\bigtriangleup}\mathord{▽}$ and $\bigtriangledown$. Strings over the alphabet \(\{\mathsf{oh},\mathsf{really}\}\) in which each letter is one of these two strings, include $\mathsf{oh}\mathsf{really}$ and $\mathsf{really}\mathsf{oh}\mathsf{really}$, but not $\mathsf{ohr}$.

The set of all strings over the alphabet $\Sigma$ is written $\Sigma^*$, this always includes $\epsilon$. We will refer to a set of strings as a language.

Now we consider some operations on strings.

A string $w$ is said to be a substring of a string $v$ just if $w$ appears consecutively in $v$.

For example, $aab$ is a substring of $aaabbb$ but it’s not a substring of $abab$. On the other hand $a$ is a substring of both (typically, we are not too concerned about distinguishing between $a$ the letter and $a$ the string consisting of a single letter). We also have that $\epsilon$ is a substring of any string (over any alphabet).

The length of a string $w$, written $|w|$, is just the number of characters in the string. That is, if $x = a_1\cdots{}a_k$ then $|x| = k$.

Given strings $x$ and $y$, we write $xy$ for the string obtained by concatenating $y$ to the end of $x$. That is, if $x = a_1\cdots{}a_k$ and $y = b_1 \cdots{} b_m$ then $xy = a_1\cdots{}a_k b_1 \cdots{} b_m$. We write $w^k$ for the $k$-fold concatenation of $w$ with itself, i.e. the word $\underbrace{ww\cdots{}w}_{\text{$k$-times}}$.

We will use concatenation and variables to describe the shape of strings abstractly, just as we use arithmetic operators and variables to describe numbers abstractly. For example, to say that a string $u$ over the alphabet \(\{a,b\}\) contains at least two occurrences of $a$, we can say that:

$u$ is of shape $w_1aw_2aw_3$

More precisely, with this form of words – “of shape” – we are saying that there exist strings \(w_1,w_2,w_3 \in \{a,b\}^*\) such that $u = w_1 a w_2 a w_3$ (i.e. $u$ is exactly the concatenation of $w_1$ followed by an $a$ followed by $w_2$ followed by an $a$ followed by $w_3$). Every string over this alphabet that contains at least two $a$ can be decomposed in this way (there may be several ways, generally), for example:

String Decomposition as $w_1aw_2aw_3$
$bbabab$ $w_1 = bb$, $w_2 = b$, $w_3 = b$
$abbaa$ $w_1 = \epsilon$, $w_2 = bba$, $w_3 = \epsilon$
$aaabbb$ $w_1 = \epsilon$, $w_2 = \epsilon$, $w_3 = abbb$

Compare this with how you would say that an integer $n$ is even: $n$ is of shape $2*m$ (for some $m$).

We will often use this in combination with set notation, to describe languages of strings that all have a certain shape. For example:

  • \(\{w_1 a w_2 a w_3 \mid w_1, w_2, w_3 \in \{a,b\}^* \}\) - the set of all strings over \(\{a,b\}\) containing at least two $a$
  • \(\{w_1 a w_2 a w_3 \mid w_1, w_2, w_3 \in \{b\}^* \}\) - the set of all strings over \(\{a,b\}\) containing exactly two $a$
  • \(\{xy \mid x,y \in \{0,1\}^* \ \text{and}\ \lvert x \rvert = \lvert y \rvert \}\) - the set of all strings over \(\{0,1\}\) that are of even length.
  • \(\{ 0u01u10u0 \mid u \in \{0,1\}^* \}\) - the set of all strings over ${0,1}$ that are formed by repeating a substring three times consecutively, with the first wrapped in 0s, the second in 1s and the third in 0s
  • \[\{ uv \mid u,v \in \{0,1\}^* \ \text{and}\ #_0(u) = #_1(v) \}\]
  • \[\{ 0^ku0^k \mid u \in \{0,1\}^* \ \text{and}\ k > 0 \}\]

Note! Do not confuse, e.g. \(\{w_1 a w_2 a w_3 \mid w_1, w_2, w_3 \in \{a,b\}^* \}\) with \(\{w_1 a w_2 a w_3\}\): the latter is a singleton set, it contains the string \(w_1 a w_2 a w_2\) as its only element (and for that to make sense we must have introduced $w_1$, $w_2$ and $w_3$ already). E.g.

Suppose, $w_1 = a$, $w_2 =aa$ and $w_3 = aaa$. Then my FAVOURITE set is \(\{w_1aw_2aw_3\}\), which I just ADORE because its ONE AND ONLY element is $aaaaaaaa$.