Strings and Languages

An alphabet is any finite set, whose members are called letters (equivalently: symbols or characters). We typically use Σ to denote a generic alphabet and a,b,c,d as variables that stand for the letters.

A string (equivalently: word) over an alphabet Σ is a finite sequence of characters from Σ. The sequence may be empty, and we write the empty string as ϵ. We typically use u,v,w,x,y,z as variables that stand for a string.

For example, hello is a string over the alphabet {a,b,c,,y,z}, but Hello is not. Another string over the same alphabet is abbbbb and another is ϵ. Strings over the alphabet {,} include and . Strings over the alphabet {oh,really} in which each letter is one of these two strings, include ohreally and reallyohreally, but not ohr.

The set of all strings over the alphabet Σ is written Σ, this always includes ϵ. We will refer to a set of strings as a language.

Now we consider some operations on strings.

A string w is said to be a substring of a string v just if w appears consecutively in v.

For example, aab is a substring of aaabbb but it’s not a substring of abab. On the other hand a is a substring of both (typically, we are not too concerned about distinguishing between a the letter and a the string consisting of a single letter). We also have that ϵ is a substring of any string (over any alphabet).

The length of a string w, written |w|, is just the number of characters in the string. That is, if x=a1ak then |x|=k.

Given strings x and y, we write xy for the string obtained by concatenating y to the end of x. That is, if x=a1ak and y=b1bm then xy=a1akb1bm. We write wk for the k-fold concatenation of w with itself, i.e. the word wwwk-times.

We will use concatenation and variables to describe the shape of strings abstractly, just as we use arithmetic operators and variables to describe numbers abstractly. For example, to say that a string u over the alphabet {a,b} contains at least two occurrences of a, we can say that:

u is of shape w1aw2aw3

More precisely, with this form of words – “of shape” – we are saying that there exist strings w1,w2,w3{a,b} such that u=w1aw2aw3 (i.e. u is exactly the concatenation of w1 followed by an a followed by w2 followed by an a followed by w3). Every string over this alphabet that contains at least two a can be decomposed in this way (there may be several ways, generally), for example:

String Decomposition as w1aw2aw3
bbabab w1=bb, w2=b, w3=b
abbaa w1=ϵ, w2=bba, w3=ϵ
aaabbb w1=ϵ, w2=ϵ, w3=abbb

Compare this with how you would say that an integer n is even: n is of shape 2m (for some m).

We will often use this in combination with set notation, to describe languages of strings that all have a certain shape. For example:

  • {w1aw2aw3w1,w2,w3{a,b}} - the set of all strings over {a,b} containing at least two a
  • {w1aw2aw3w1,w2,w3{b}} - the set of all strings over {a,b} containing exactly two a
  • {xyx,y{0,1} and |x|=|y|} - the set of all strings over {0,1} that are of even length.
  • {0u01u10u0u{0,1}} - the set of all strings over 0,1 that are formed by repeating a substring three times consecutively, with the first wrapped in 0s, the second in 1s and the third in 0s
  • You can't use 'macro parameter character #' in math mode
  • {0ku0ku{0,1} and k>0}

Note! Do not confuse, e.g. {w1aw2aw3w1,w2,w3{a,b}} with {w1aw2aw3}: the latter is a singleton set, it contains the string w1aw2aw2 as its only element (and for that to make sense we must have introduced w1, w2 and w3 already). E.g.

Suppose, w1=a, w2=aa and w3=aaa. Then my FAVOURITE set is {w1aw2aw3}, which I just ADORE because its ONE AND ONLY element is aaaaaaaa.