Lexington implements its own regular expression engine in pure Python. This provides a few important advantages over using the re module:
This is the abstract base class for elements of regular expressions. (It’s also used as a factory for converting mundane Python data types like strings into regular expressions.)
Regex objects are immutable, hashable, and comparable.
In practice, Regex‘s subclasses should be regarded as implementation details. You shouldn’t attempt to create instances of them, create new subclasses of Regex, or test that a regex is an instance of a particular Regex subclass.
| Parameters: | e – An expression to convert into a regular expression. (This is equivalent to regexify.) |
|---|
Determines whether the subject matches this regex. This performs a total match – if you want the behavior of re.match, which only matches a prefix, use match_prefix.
This returns True if the match succeeds, and False if not.
| Parameters: | subject – The string to match against this regex. |
|---|
Indicates the alphabet of the strings this regular expression can match. None indicates that this regular expression is independent of alphabet.
The alphabet will usually be Text or Bytestring.
If this regular expression matches a literal string exactly, this property contains that string. Otherwise, it will be None. (You can generally only assume that regexes constructed from literal strings will have this property.)
Mathematical Properites
Returns the derivative of this regular expression with respect to a symbol. You can view it as:
{s[1:] for s in languages_generated_by(self) if s and s[0] == sym}
| Parameters: | sym – The symbol to derive this regular expression with regards to. |
|---|
Indicates whether this regular expression will accept the empty string.
Building Regexes
Creates a regular expression that matches this regex followed by other. (Equivalent to concat.)
Creates a regular expression that either matches this regex or other. (Equivalent to union.)
Creates a regular expression that matches this regex count times. (Equivalent to repeat. Keep in mind that this is the exponentiation operator, not multiplication as is more common with Python’s strings.)
A regular expression that refuses to accept anything – even the empty string.
A regular expression that matches the empty string.
A regular expression that matches any single symbol.
Converts a Python object to a Regex. If it’s already a Regex, it just returns it. It will also accept any string or character type, and create a regex that matches that exactly.
| Parameters: | e – The Python object to create a regex of. |
|---|
Creates a regular expression that accepts any of the following regexes.
| Parameters: | options – The regular expressions to accept. |
|---|
Concatenates two regular expressions, such that prefix will be matched, then suffix will be matched after it is complete.
| Parameters: |
|
|---|
Concatenates a sequence of regular expressions, such that each one will be matched successively. (Note that a Null value anywhere in the sequence will result in a Null overall.) A zero-length sequence will return Epsilon.
| Parameters: | regexes – The regular expressions to concatenate. |
|---|
The ideas behind “regular expressions” as used in modern programming languages come from formal language theory (i.e. math). You don’t need to read this section to use Lexington, but it’s useful for understanding how Lexington works and why its regular expressions are constructed a certain way.
The mathematical concept of a string is very similar to how programmers view it: a sequence of zero or more (but not infinity) symbols from an alphabet. “Alphabet” doesn’t refer to actual alphabets (like the Latin alphabet or Georgian alphabet), but to any finite set containing symbols. (“Playing card suits,” for example, is a perfectly valid alphabet.)
For example, if we use “Unicode codepoints” as our alphabet, the following are all strings:
"" # empty string - commonly called ε (epsilon)
"A"
"hello world"
"Spam, spam, spam, spam..."
"Ĉu vi parolas Esperanton?"
The Python str type (unicode in Python 2) is the type of all strings over the alphabet of Unicode characters. bytes (str in Python 2) is the type of all strings over the alphabet of bytes (the numbers 0 through 255).
A language is simply a set of strings. For example:
One way to describe a set is by explicitly listing every member. For example, we could describe the set of names of Monty Python members:
In Python, this looks like:
{"John Cleese", "Eric Idle", "Terry Gilliam", "Terry Jones",
"Graham Chapman", "Michael Palin"}
However, this only works for finite languages. Some languages (like “all valid Python programs,” or “all palindromes”) contain an infinite number of strings, and therefore cannot be enumerated.
So, another way to describe a language is by giving criteria for being part of the set. In math, this is done using “set builder” notation. An expression for the set of all palindromes would look like:
Which is read as, “All strings \(w\) where \(w\) is equal to \(w\) reversed.” It’s similar in concept to Python’s set comprehensions:
{w for w in every_string_ever if w == w[::-1]}
Except that set comprehensions only work when you actually have a finite number of strings. (Otherwise, Python runs out of memory and crashes.)
Regular expressions are another way to describe languages, which are similar to mathematical expressions. For example, here’s a regular expression for “the set of all Python integer literals in base 10:”
The basic elements of a regular expression are:
The notation \(L(R)\) means, “the set of strings described by \(R\).” The languages of these basic regular expressions are.
Then, there are three operations that allow us to build more complicated regular expressions:
Here are some examples:
Going back to our integer-literal regular expression above, we can read it as:
“First, there will either be a -, or nothing. Then, we either have a digit from 1-9 followed by any number of digits from 0-9, or one or more 0‘s.”
So, regular expressions are useful for answering both “what strings are in this language?” and “how do I tell if a string is in this language?”.
Here’s how to translate traditional Unix regular expressions (as implemented in Python’s re module) to their mathematical counterparts:
Some features of Python’s regular expression module aren’t present in the mathematical model.
Regular expressions are called regular expressions because they are only capable of matching regular languages. One language they are not capable of matching is the bracket language, which just looks like this:
""
"[]"
"[[]]"
"[[[]]]"
The bracket language is not a regular language – it is a context-free language. Therefore, no matter how hard you try, you can never create a regular expression that can correctly recognize whether a string is in the bracket language or not.
If you take a course or read a book about automata, you’ll learn exactly what qualifies a language as regular, context-free, or context-sensitive. Sometimes it’s obvious, but other times, you need to use a mathematical proof to determine what class a language falls into.
However, in general you can assume that any recursive language is not regular. The bracket language is recursive: each bracket pair may contain another bracket pair inside it, and there’s no limit to how deep the brackets can be nested. Python is also recursive, because statements like for loops can contain other statements, and expressions can contain other expressions. HTML is recursive, because each tag can contain other tags inside. (Which explains this famous Stack Overflow answer.)
There are many strategies for actually parsing things with regular expressions, including NFA conversion, DFA conversion, and backtracking. The one Lexington uses is arguably the simplest, and it’s based on the idea of the derivative of a language.
The derivative in language theory isn’t really related to the derivative in calculus. The derivative of a language with respect to a character is defined as “all the strings in a language that begin with that character, without that character at the beginning.” Mathematically, it’s:
And a Python implementation for finite languages would look like:
def derivative(language, char):
return {s[1:] for s in language if s[0] == char}
To see how this is useful for pattern matching, consider this language, which consists of all Triangle Transit routes in Chapel Hill:
>>> routes = {"400", "405", "420", "800", "805", "CRX"}
Let’s say we want to test whether "405" is in this language.
>>> derivative(routes, "4")
{"00", "05", "20"}
>>> derivative(_, "0")
{"0", "5"}
>>> derivative(_, "5")
{""}
We repeatedly derive the language with each character in the input string. After deriving all the characters, if "" (the empty string) is in the result set, we accept. Here’s an example of an unsuccessful match, this time for "CAT":
>>> derivative(routes, "C")
{"RX"}
>>> derivative(routes, "A")
set()
If we ever reach the empty set while deriving, that means the string is definitely not in the language.
However, compared to "405" in routes, this is really inefficient. It’s also impossible to actually compute this when we have an infinite set (like “Python integer literals”).
However, the derivative is also defined on regular expressions – and if we have a regular expression that describes a language, then the derivative of that regular expression describes the derivative of that language.
Also, taking the derivative of a regular expression is really simple (don’t be fooled by the giant tower of math):
This means that we can use the derivative matching algorithm above on regular expressions directly – and besides being easy to implement, it’s actually pretty efficient. And this is how Lexington’s regexes work.
I learned about using derivatives for pattern matching from a pair of blog articles by Matthew Might:
For a more general introduction to languages and how to recognize them, I recommend Michael Sisper’s Introduction to the Theory of Computation (my CSC 333 textbook).