Proof of Kleene’s Theorem

In my last post, “Kleene’s Theorem,” I provided some useful background information about strings, regular languages, regular expressions, and finite automata before introducing the eponymous theorem that has become one of the cornerstones of artificial intelligence and, more specifically, natural language processing (NLP). Kleene’s Theorem tells us that regular expressions and finite state automata are equivalent when it comes to describing regular languages: both describe exactly the same class of languages. In this post I will provide a proof of this groundbreaking principle.

Kleene’s Theorem

A language over an alphabet is regular if and only if it can be accepted by a finite automaton.

We can break this theorem down into two separate lemmas that need to be proven in order to satisfy the “if and only if” constraint. Remembering that regular expressions are just representations of regular languages, we then need to prove:

Lemma 1
Any regular language can be accepted by a finite state automaton. (There exists a finite state automaton for every regular expression.)

Lemma 2
Any language accepted by a finite state automaton is regular. (There exists a regular expression for any language accepted by a finite state automaton.)

Note that some texts extend Kleene’s Theorem to include transition graphs, stating that transition graphs are equivalent to regular expressions and finite state automata as well. Since transition graphs are just visual representations of FSAs, this seems like a trivial and unnecessary addition IMHO.

As a reminder, the definition of a regular language over an alphabet Σ is:

The empty language ∅, i.e., the set containing no strings at all (not even the empty string), is a regular language.
The language containing just the empty string, {ε}, is a regular language.
The singleton sets containing the individual symbols from Σ, {a} for each a ∈ Σ, are regular languages.
If A and B are regular languages then:
- The union of A and B, A ∪ B, is a regular language.
- A concatenated with B, A • B, is a regular language.
- The Kleene star of A, A*, is a regular language. (The same is true for B.)
There are no other regular languages over Σ than those described above.

Lemma 1 Proof

Any regular language can be accepted by a finite state automaton. (There exists a finite state automaton for every regular expression.)

We will use the recursive definition of regular languages above to inductively prove this lemma.

Basis Step

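We can construct FSAs for the languages ∅, {ε}, and {a} for any symbol a in alphabet Σ. For ∅, a single non-accepting start state with no transitions suffices. For {ε}, a single accepting start state with no transitions suffices. For {a}, we need a start state with one transition on a to an accepting state.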
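As a rough sketch, the three base machines might be written in Python as follows. The tuple representation `(states, start, accepting, delta)` and all function names are illustrative assumptions, not part of the original proof.

```python
# Hypothetical NFA representation (an assumption for this sketch):
# (states, start, accepting, delta), where delta maps (state, symbol)
# to a set of next states. A missing entry means "no transition".

def accepts(nfa, word):
    """Simulate the NFA on word by tracking the set of reachable states."""
    states, start, accepting, delta = nfa
    current = {start}
    for symbol in word:
        current = {t for s in current for t in delta.get((s, symbol), set())}
    return bool(current & accepting)

def empty_language():
    # One state that is never accepting: no string is accepted.
    return ({0}, 0, set(), {})

def empty_string():
    # The start state is accepting and has no outgoing transitions,
    # so only the empty string is accepted.
    return ({0}, 0, {0}, {})

def single_symbol(a):
    # Reading exactly one `a` moves from the start state to the accepting state.
    return ({0, 1}, 0, {1}, {(0, a): {1}})
```

For instance, `accepts(single_symbol("a"), "a")` is true, while the same machine rejects both the empty string and "aa".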

Inductive Step

Assume that $L_1$ and $L_2$ are regular languages accepted by FSAs $M_1$ and $M_2$, respectively.

To complete the inductive proof, we need to show that we can build FSAs for $L_1 \cup L_2$, $L_1 \bullet L_2$, and $L_1^*$.

Union

$L_1 \cup L_2$

Concatenation

$L_1 \bullet L_2$

Kleene Star

$L^*_1$
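One common way to realize the three constructions is with ε-transitions, in the style of Thompson's construction. The sketch below is an assumption about how the machines could be built, not necessarily the construction pictured in the original figures. The representation `(start, accept, edges)`, the `None` label for ε, and every function name are hypothetical.

```python
from itertools import count

_fresh = count()  # supplies unused state names (hypothetical helper)

def symbol(a):
    # Base machine for {a}: start --a--> accept.
    s, f = next(_fresh), next(_fresh)
    return (s, f, [(s, a, f)])

def union(m1, m2):
    # A new start state branches by epsilon into both machines, and both
    # old accept states lead by epsilon to a new accept state.
    (s1, f1, e1), (s2, f2, e2) = m1, m2
    s, f = next(_fresh), next(_fresh)
    glue = [(s, None, s1), (s, None, s2), (f1, None, f), (f2, None, f)]
    return (s, f, e1 + e2 + glue)

def concat(m1, m2):
    # The accept state of m1 leads by epsilon to the start state of m2.
    (s1, f1, e1), (s2, f2, e2) = m1, m2
    return (s1, f2, e1 + e2 + [(f1, None, s2)])

def star(m1):
    # Allow skipping m1 entirely (zero copies) or looping back for another copy.
    s1, f1, e1 = m1
    s, f = next(_fresh), next(_fresh)
    glue = [(s, None, s1), (s, None, f), (f1, None, s1), (f1, None, f)]
    return (s, f, e1 + glue)

def closure(states, edges):
    # Epsilon-closure: everything reachable using only epsilon edges.
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(m, word):
    start, accept, edges = m
    current = closure({start}, edges)
    for ch in word:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved, edges)
    return accept in current
```

Each argument must be a freshly built machine so that state names stay disjoint. For example, `star(union(symbol("a"), concat(symbol("b"), symbol("c"))))` accepts "abca" but rejects "ab".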

And that completes the proof. We can create FSAs for the empty language ∅, for the language containing just the empty string, {ε}, and for each singleton language {a}. For regular languages $L_1$ and $L_2$, we can build FSAs for $L_1 \cup L_2$, $L_1 \bullet L_2$, and $L_1^*$. As the definition of regular languages states, there are no other regular languages over Σ than these.

Lemma 2 Proof

Any language accepted by a finite state automaton is regular. (There exists a regular expression for any language accepted by a finite state automaton.)

We can restate this lemma more specifically:

If M = (Σ, S, s₀, δ, F) is a finite state automaton that recognizes a language R, i.e., R = L(M), then there is a regular expression over alphabet Σ that corresponds to R.

Background

By ordering the states using the natural numbers, we can prove this lemma using mathematical induction. Without any loss of generality we can assume that each state in M is an integer from 1 to n where n = |S|.

We define the language of all strings that move M from state p to state q as:

$R^{()}_{p,q} = \{ x \in \Sigma^* | \hat{\delta}(p, x) = q\}$

Where:

$R^{()}_{p,q} \equiv$ the set of strings (language) that moves M from p to q

$\Sigma^* \equiv$ the Kleene closure over alphabet $\Sigma$

$\hat{\delta} \equiv$ the extended transition function, $\hat{\delta} : S \times \Sigma^* \rightarrow S$
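To make the extended transition function concrete, here is a minimal sketch that assumes M is deterministic, so that reading a string from a state lands in exactly one state. The two-state machine `DELTA` and the names `delta_hat` and `in_R` are hypothetical, chosen only for illustration.

```python
# Hypothetical 2-state DFA over {a, b}, with states numbered from 1:
# state 1 = an even number of a's read so far, state 2 = an odd number.
DELTA = {(1, "a"): 2, (1, "b"): 1, (2, "a"): 1, (2, "b"): 2}

def delta_hat(delta, p, x):
    """Extended transition function: the state reached from p after reading x."""
    state = p
    for symbol in x:
        state = delta[(state, symbol)]
    return state

def in_R(delta, p, q, x):
    """Membership test for the set of strings that move the machine from p to q."""
    return delta_hat(delta, p, x) == q
```

Here `in_R(DELTA, 1, 2, "a")` holds, since a single a flips the parity, while `in_R(DELTA, 1, 2, "aa")` does not.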

In a deterministic FSA, each string traces exactly one path from a given starting state. We then want to define what it means to say that a path goes through a specific state.

Consider a string $x \in \Sigma^*$. We can also think of x as a path through FSA M. If string x moves M from state p to state q, i.e., $\hat{\delta}(p,x) = q$, we can say that x goes through state s if:

x = yz (x is the concatenation of strings y and z.)
|y| > 0 and |z| > 0 (Neither y nor z is the empty string.)
$\hat{\delta}(p, y) = s$ (String y moves M from p to s)
$\hat{\delta}(s, z) = q$ (String z moves M from s to q)
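The four conditions above can be checked mechanically by trying every split of x into two non-empty parts. The sketch below assumes a deterministic machine stored as a dictionary. `DELTA`, `delta_hat`, and `goes_through` are hypothetical names.

```python
# Hypothetical 2-state DFA over {a, b}: state 1 = even number of a's, 2 = odd.
DELTA = {(1, "a"): 2, (1, "b"): 1, (2, "a"): 1, (2, "b"): 2}

def delta_hat(delta, p, x):
    """State reached from p after reading the string x."""
    state = p
    for symbol in x:
        state = delta[(state, symbol)]
    return state

def goes_through(delta, p, x, s):
    """True if the path of x from p passes through s strictly inside the path."""
    # Try every split x = y z with |y| > 0 and |z| > 0. In a deterministic
    # machine, once y leads from p to s, the remainder z necessarily leads
    # from s to wherever x ends, so only the first half needs checking.
    return any(delta_hat(delta, p, x[:i]) == s for i in range(1, len(x)))
```

With this machine, the string "ab" read from state 1 goes through state 2, but a single-symbol string such as "a" goes through no state at all because it cannot be split into two non-empty parts.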

Keeping in mind that we have numbered the states in M from 1 to n where n = |S|, we can define a language based on the states it passes through:

$R^{(J)}_{pq} = \{ x \in \Sigma^* | x \text{ is a path from } p \text{ to } q \text{ that goes through no state higher than } J\}$

“Passing through” does not include starting and stopping states. p may be equal to 8 and q may be equal to 10, but if the intermediate states are 1, 3, and 5, then we can say the path goes through no state higher than J = 5.
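A small sketch of this membership test, again assuming a deterministic machine. The three-state cycle `DELTA3` and the names `intermediate_states` and `in_R_J` are made up for illustration.

```python
# Hypothetical 3-state DFA over {a} that cycles 3 -> 1 -> 2 -> 3.
DELTA3 = {(3, "a"): 1, (1, "a"): 2, (2, "a"): 3}

def intermediate_states(delta, p, x):
    """States entered strictly between the start and the end of the path of x."""
    visited, state = [], p
    for symbol in x[:-1]:  # skip the last symbol: it enters the stopping state
        state = delta[(state, symbol)]
        visited.append(state)
    return visited

def in_R_J(delta, p, q, x, J):
    """True if x moves the machine from p to q through no state higher than J."""
    state = p
    for symbol in x:
        state = delta[(state, symbol)]
    return state == q and all(s <= J for s in intermediate_states(delta, p, x))
```

Reading "aaa" from state 3 returns to state 3 via intermediate states 1 and 2, so the string qualifies for J = 2 even though the start and stop state is 3. Reading "aaaaaa" passes through state 3 in the middle of the path, so it qualifies only for J = 3.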

The empty string ε and all strings consisting of a single symbol don’t go through any states, since neither can be split into two non-empty parts.

Finally, since there is no state higher than n, $R^{()}_{p,q} = R^{(n)}_{p,q}$.

For our proof we will use induction on J to show that for every pair of states p and q with 1 ≤ p ≤ n and 1 ≤ q ≤ n, each set $R^{(J)}_{pq}$ with 0 ≤ J ≤ n is a regular language.