Regular Expressions

Lexington implements its own regular expression engine in pure Python. This provides a few important advantages over using the re module:

  • Input data can be matched against regexes incrementally.
  • Complex regular expressions can be built out of smaller ones.
  • The matching algorithm is simpler to understand.

Regular Expression API

class lexington.regex.Regex[source]

This is the abstract base class for elements of regular expressions. (It’s also used as a factory for converting mundane Python data types like strings into regular expressions.)

Regex objects are immutable, hashable, and comparable.

In practice, Regex‘s subclasses should be regarded as implementation details. You shouldn’t attempt to create instances of them, create new subclasses of Regex, or test that a regex is an instance of a particular Regex subclass.

Parameters:e – An expression to convert into a regular expression. (This is equivalent to regexify.)
match(subject)[source]

Determines whether the subject matches this regex. This performs a total match – if you want the behavior of re.match, which only matches a prefix, use match_prefix.

This returns True if the match succeeds, and False if not.

Parameters:subject – The string to match against this regex.
alphabet[source]

Indicates the alphabet of the strings this regular expression can match. None indicates that this regular expression is independent of alphabet.

The alphabet will usually be Text or Bytestring.

literal[source]

If this regular expression matches a literal string exactly, this property contains that string. Otherwise, it will be None. (You can generally only assume that regexes constructed from literal strings will have this property.)

Mathematical Properites

derive(sym)[source]

Returns the derivative of this regular expression with respect to a symbol. You can view it as:

{s[1:] for s in languages_generated_by(self) if s and s[0] == sym}
Parameters:sym – The symbol to derive this regular expression with regards to.
accepts_empty_string[source]

Indicates whether this regular expression will accept the empty string.

Building Regexes

regex + other

Creates a regular expression that matches this regex followed by other. (Equivalent to concat.)

regex | other

Creates a regular expression that either matches this regex or other. (Equivalent to union.)

regex ** count

Creates a regular expression that matches this regex count times. (Equivalent to repeat. Keep in mind that this is the exponentiation operator, not multiplication as is more common with Python’s strings.)

star()[source]

Creates a regular expression that accepts this one repeated any number of times. (Equivalent to the star function.)

plus()[source]

Creates a regular expression that accepts this one repeated any number of times.

maybe()[source]

Creates a regular expression that accepts this one, or the empty string.

Special Regexes

lexington.regex.Null = Null

A regular expression that refuses to accept anything – even the empty string.

lexington.regex.Epsilon = Epsilon

A regular expression that matches the empty string.

lexington.regex.Any = Any

A regular expression that matches any single symbol.

Constructor Functions

lexington.regex.regexify(e)[source]

Converts a Python object to a Regex. If it’s already a Regex, it just returns it. It will also accept any string or character type, and create a regex that matches that exactly.

Parameters:e – The Python object to create a regex of.
lexington.regex.union(*options)[source]

Creates a regular expression that accepts any of the following regexes.

Parameters:options – The regular expressions to accept.
lexington.regex.concat(prefix, suffix)[source]

Concatenates two regular expressions, such that prefix will be matched, then suffix will be matched after it is complete.

Parameters:
  • prefix – The first regular expression.
  • suffix – The second regular expression.
lexington.regex.join(regexes)[source]

Concatenates a sequence of regular expressions, such that each one will be matched successively. (Note that a Null value anywhere in the sequence will result in a Null overall.) A zero-length sequence will return Epsilon.

Parameters:regexes – The regular expressions to concatenate.
lexington.regex.star(regex)[source]

Creates a regular expression that accepts regex, repeated any number of times – even 0.

Parameters:regex – The regular expression describing the strings to repeat.
lexington.regex.repeat(regex, count)[source]

Creates a regular expression that accepts regex repeated a specific number of times.

Parameters:
  • regex – The regular expression describing the strings to reepeat.
  • count – The number of times to repeat it.

Mathematical Concepts

The ideas behind “regular expressions” as used in modern programming languages come from formal language theory (i.e. math). You don’t need to read this section to use Lexington, but it’s useful for understanding how Lexington works and why its regular expressions are constructed a certain way.

Strings and Alphabets

The mathematical concept of a string is very similar to how programmers view it: a sequence of zero or more (but not infinity) symbols from an alphabet. “Alphabet” doesn’t refer to actual alphabets (like the Latin alphabet or Georgian alphabet), but to any finite set containing symbols. (“Playing card suits,” for example, is a perfectly valid alphabet.)

For example, if we use “Unicode codepoints” as our alphabet, the following are all strings:

""              # empty string - commonly called ε (epsilon)
"A"
"hello world"
"Spam, spam, spam, spam..."
"Ĉu vi parolas Esperanton?"

The Python str type (unicode in Python 2) is the type of all strings over the alphabet of Unicode characters. bytes (str in Python 2) is the type of all strings over the alphabet of bytes (the numbers 0 through 255).

Languages

A language is simply a set of strings. For example:

  • The set of all valid Python integer literals
  • The set of all valid Python programs
  • The set of all Monty Python skits
  • The set of names of Monty Python members
  • The set of all palindromes
  • The set of every string ever
  • The empty set (called “null,” which is sometimes written as ∅)

One way to describe a set is by explicitly listing every member. For example, we could describe the set of names of Monty Python members:

\[\{ \mathtt{John\,Cleese}, \mathtt{Eric\,Idle}, \mathtt{Terry\,Gilliam}, \mathtt{Terry\,Jones}, \mathtt{Graham\,Chapman}, \mathtt{Michael\,Palin} \}\]

In Python, this looks like:

{"John Cleese", "Eric Idle", "Terry Gilliam", "Terry Jones",
 "Graham Chapman", "Michael Palin"}

However, this only works for finite languages. Some languages (like “all valid Python programs,” or “all palindromes”) contain an infinite number of strings, and therefore cannot be enumerated.

So, another way to describe a language is by giving criteria for being part of the set. In math, this is done using “set builder” notation. An expression for the set of all palindromes would look like:

\[\{ w \mid w = w^\mathcal{R} \}\]

Which is read as, “All strings \(w\) where \(w\) is equal to \(w\) reversed.” It’s similar in concept to Python’s set comprehensions:

{w for w in every_string_ever if w == w[::-1]}

Except that set comprehensions only work when you actually have a finite number of strings. (Otherwise, Python runs out of memory and crashes.)

Regular Expressions

Regular expressions are another way to describe languages, which are similar to mathematical expressions. For example, here’s a regular expression for “the set of all Python integer literals in base 10:”

\[(\mathtt{-} \cup \epsilon) \, ((\mathtt{1} \cup \cdots \cup \mathtt{9})\, (\mathtt{0} \cup \mathtt{1} \cup \cdots \cup \mathtt{9})^* \cup \mathtt{0} \mathtt{0}^*)\]

The basic elements of a regular expression are:

  • \(a\), where \(a\) is any symbol in the alphabet
  • \(\epsilon\)
  • \(\emptyset\)

The notation \(L(R)\) means, “the set of strings described by \(R\).” The languages of these basic regular expressions are.

  • \(L(a)\) is \(\{a\}\): just the one-symbol string "a".
  • \(L(\epsilon)\) is \(\{\epsilon\}\): just the empty string.
  • \(L(\emptyset)\) is \(\emptyset\): no strings at all are in this set.

Then, there are three operations that allow us to build more complicated regular expressions:

  • \(R_1 \circ R_2\) (or just \(R_1 R_2\)) is concatenation: first, a string in \(R_1\), then a string in \(R_2\).
  • \(R_1 \cup R_2\) is union: any string either in \(R_1\) or \(R_2\).
  • \(R_1^*\) is Kleene star: zero or more strings in \(R_1\) repeated.

Here are some examples:

  • \(\mathtt{abc}\) describes "abc"
  • \(\mathtt{a} \cup \mathtt{b}\) describes "a" and "b"
  • \(\mathtt{a} \cup \epsilon\) describes "a" and ""
  • \(\mathtt{a}^*\) describes "", "a", "aa", and so on
  • \(\mathtt{aa}^*\) describes "a", "aa", "aaa", and so on
  • \((\mathtt{a} \cup \mathtt{b})^*\) describes "", "a", "b", "aa", "ab", "ba", "aaa", etc.

Going back to our integer-literal regular expression above, we can read it as:

“First, there will either be a -, or nothing. Then, we either have a digit from 1-9 followed by any number of digits from 0-9, or one or more 0‘s.”

So, regular expressions are useful for answering both “what strings are in this language?” and “how do I tell if a string is in this language?”.

Unix’s Regular Expressions

Here’s how to translate traditional Unix regular expressions (as implemented in Python’s re module) to their mathematical counterparts:

  • R* is exactly the same as \(R^*\).
  • R+ is equivalent to \(R\,R^*\).
  • R? is equivalent to \(R \cup \epsilon\).
  • R{m} is equivalent to \(R R \cdots R\), however many times.
  • R{m,n} is really complicated in math notation, but trust me, it works.
  • R|S is equivalent to \(R \cup S\).
  • [abc] is equivalent to \(\mathtt{a} \cup \mathtt{b} \cup \mathtt{c}\). (Ranges like a-z would add a union term for each character in the range.)
  • \d and friends are shortcuts for [] set notation, so they would be translated the same way.

Some features of Python’s regular expression module aren’t present in the mathematical model.

A Side Note: Non-Regular Languages

Regular expressions are called regular expressions because they are only capable of matching regular languages. One language they are not capable of matching is the bracket language, which just looks like this:

""
"[]"
"[[]]"
"[[[]]]"

The bracket language is not a regular language – it is a context-free language. Therefore, no matter how hard you try, you can never create a regular expression that can correctly recognize whether a string is in the bracket language or not.

If you take a course or read a book about automata, you’ll learn exactly what qualifies a language as regular, context-free, or context-sensitive. Sometimes it’s obvious, but other times, you need to use a mathematical proof to determine what class a language falls into.

However, in general you can assume that any recursive language is not regular. The bracket language is recursive: each bracket pair may contain another bracket pair inside it, and there’s no limit to how deep the brackets can be nested. Python is also recursive, because statements like for loops can contain other statements, and expressions can contain other expressions. HTML is recursive, because each tag can contain other tags inside. (Which explains this famous Stack Overflow answer.)

Derivatives of Languages

There are many strategies for actually parsing things with regular expressions, including NFA conversion, DFA conversion, and backtracking. The one Lexington uses is arguably the simplest, and it’s based on the idea of the derivative of a language.

The derivative in language theory isn’t really related to the derivative in calculus. The derivative of a language with respect to a character is defined as “all the strings in a language that begin with that character, without that character at the beginning.” Mathematically, it’s:

\[D_c(L) = \{ w \mid c w \in L \}\]

And a Python implementation for finite languages would look like:

def derivative(language, char):
    return {s[1:] for s in language if s[0] == char}

To see how this is useful for pattern matching, consider this language, which consists of all Triangle Transit routes in Chapel Hill:

>>> routes = {"400", "405", "420", "800", "805", "CRX"}

Let’s say we want to test whether "405" is in this language.

>>> derivative(routes, "4")
{"00", "05", "20"}
>>> derivative(_, "0")
{"0", "5"}
>>> derivative(_, "5")
{""}

We repeatedly derive the language with each character in the input string. After deriving all the characters, if "" (the empty string) is in the result set, we accept. Here’s an example of an unsuccessful match, this time for "CAT":

>>> derivative(routes, "C")
{"RX"}
>>> derivative(routes, "A")
set()

If we ever reach the empty set while deriving, that means the string is definitely not in the language.

However, compared to "405" in routes, this is really inefficient. It’s also impossible to actually compute this when we have an infinite set (like “Python integer literals”).

Derivatives of Regular Expressions

However, the derivative is also defined on regular expressions – and if we have a regular expression that describes a language, then the derivative of that regular expression describes the derivative of that language.

Also, taking the derivative of a regular expression is really simple (don’t be fooled by the giant tower of math):

\[\begin{split}& D_c(\emptyset) &= \emptyset \\ & D_c(\epsilon) &= \emptyset \\ & D_c(c) &= \epsilon \\ & D_c(c') &= \emptyset & \quad \text{ if } c \neq c' \\ & D_c(R_1 \circ R_2) &= D_c(R_1) \circ R_2 \cup D_c(R_2) & \quad \text{ if } \epsilon \in R_2 \\ & D_c(R_1 \circ R_2) &= D_c(R_1) \circ R_2 & \quad \text{ otherwise} \\ & D_c(R_1 \cup R_2) &= D_c(R_1) \cup D_c(R_2) \\ & D_c(R^*) &= D_c(R) \circ R^*\end{split}\]

This means that we can use the derivative matching algorithm above on regular expressions directly – and besides being easy to implement, it’s actually pretty efficient. And this is how Lexington’s regexes work.

Further Reading

I learned about using derivatives for pattern matching from a pair of blog articles by Matthew Might:

For a more general introduction to languages and how to recognize them, I recommend Michael Sisper’s Introduction to the Theory of Computation (my CSC 333 textbook).