Tools for parsing and manipulating regular expressions. Note that this is a very different concept from that of simply creating and using those regular expressions, functionality which is present in basically every programming language in the world, Python included.
This project was undertaken because I wanted to be able to compute the intersection between two regular expressions. The "intersection" is the set of strings which both regular expressions will accept, represented as a third regular expression.
pip install greeneryfrom greenery import parse
print(parse("abc...") & parse("...def"))
# "abcdef"
print(parse("\d{4}-\d{2}-\d{2}") & parse("19.*"))
# "19\d{2}-\d{2}-\d{2}"
print(parse("\W*") & parse("[a-g0-8$%\^]+") & parse("[^d]{2,8}"))
# "[$%\^]{2,8}"
print(parse("[bc]*[ab]*") & parse("[ab]*[bc]*"))
# "([ab]*a|[bc]*c)?b*"
print(parse("a*") & parse("b*"))
# ""
print(parse("a") & parse("b"))
# "[]"In the penultimate example, the empty string is returned, because only the empty string is in both of the regular languages a* and b*. In the final example, an empty character class has been returned. An empty character class can never match anything, which means greenery can use this to represent a regular expression which matches no strings at all. Note that this is different from only matching the empty string.
Internally, greenery works by converting regular expressions to finite state machines, computing the intersection of the two FSMs as a third FSM, and using the Brzozowski algebraic method (q.v.) to convert the third FSM back to a regular expression.
This function takes a regular expression (i.e. a string) as input and returns a Pattern object (see below) representing that regular expression.
The following metacharacters and formations have their usual meanings: ., *, +, ?, {m}, {m,}, {m,n}, (), |, [], ^ within [] character ranges only, - within [] character ranges only, and \ to escape any of the preceding characters or itself.
These character escapes are possible: \t, \r, \n, \f, \v.
These predefined character sets also have their usual meanings: \w, \d, \s and their negations \W, \D, \S. . matches any character, including new line characters and carriage returns.
An empty charclass [] is legal and matches no characters: when used in a regular expression, the regular expression may match no strings.
- 
This method is intentionally rigorously simple, and tolerates no ambiguity. For example, a hyphen must be escaped in a character class even if it appears first or last. [-abc]is a syntax error, write[\-abc]. Escaping something which doesn't need it is a syntax error too:[\ab]resolves to neither[\\ab]nor[ab].
- 
The ^and$metacharacters are not supported. By default,greeneryassumes that all regexes are anchored at the start and end of any input string. Carets and dollar signs will be parsed as themselves. If you want to not anchor at the start or end of the string, put.*at the start or end of your regex respectively.This is because computing the intersection between .*a.*and.*b.*(1) is largely pointless and (2) usually results in gibberish coming out of the program.
- 
The non-greedy operators *?,+?,??and{m,n}?are permitted but do nothing. This is because they do not alter the regular language. For example,abc{0,5}defandabc{0,5}?defrepresent precisely the same set of strings.
- 
Parentheses are used to alternate between multiple possibilities e.g. (a|bc)only, not for capture grouping. Here's why:print(parse("(ab)c") & parse("a(bc)")) # "abc" 
- 
The (?:...)syntax for non-capturing groups is permitted, but does nothing.
- 
Other (?...)constructs are not supported (and most are not regular in the computer science sense).
- 
Back-references, such as ([aeiou])\1, are not regular.
A Pattern represents a regular expression and exposes various methods for manipulating it and combining it with other regular expressions. Patterns are immutable.
A regular language is a possibly-infinite set of strings. With this in mind, Pattern implements numerous methods like those on frozenset, as well as many regular expression-specific methods.
It's not intended that you construct new Pattern instances directly; use parse(string), above.
| Method | Behaviour | 
|---|---|
| pattern.matches("a")"a" in pattern | Returns Trueif the regular expression matches the string orFalseif not. | 
| pattern.strings()for string in pattern | Returns a generator of all the strings that this regular expression matches. | 
| pattern.empty() | Returns Trueif this regular expression matches no strings, otherwiseFalse. | 
| pattern.cardinality()len(pattern) | Returns the number of strings which the regular expression matches. Throws an OverflowErrorif this number is infinite. | 
| pattern1.equivalent(pattern2) | Returns Trueif the two regular expressions match exactly the same strings, otherwiseFalse. | 
| pattern.copy() | Returns a shallow copy of pattern. | 
| pattern.everythingbut() | Returns a regular expression which matches every string not matched by the original. pattern.everythingbut().everythingbut()matches the same strings aspattern, but is not necessarily identical in structure. | 
| pattern.reversed()reversed(pattern) | Returns a reversed regular expression. For each string that patternmatched,reversed(pattern)will match the reversed string.reversed(reversed(pattern))matches the same strings aspattern, but is not necessarily identical. | 
| pattern.times(star)pattern * star | Returns the input regular expression multiplied by any Multiplier(see below). | 
| pattern1.concatenate(pattern2, ...)pattern1 + pattern2 + ... | Returns a regular expression which matches any string of the form a·b·... where a is a string matched by pattern1, b is a string matched bypattern2and so on. | 
| pattern1.union(pattern2, ...)pattern1 | pattern2 | ... | Returns a regular expression matching any string matched by any of the input regular expressions. This is also called alternation. | 
| pattern1.intersection(pattern2, ...)pattern1 & pattern2 & ... | Returns a regular expression matching any string matched by all input regular expressions. The successful implementation of this method was the ultimate goal of this entire project. | 
| pattern1.difference(pattern2, ...)pattern1 - pattern2 - ... | Subtract the set of strings matched by pattern2onwards from those matched bypattern1and return the resulting regular expression. | 
| pattern1.symmetric_difference(pattern2, ...)pattern1 ^ pattern2 ^ ... | Returns a regular expression matching any string accepted by pattern1orpattern2but not both. | 
| pattern.derive("a") | Return the Brzozowski derivative of the input regular expression with respect to "a". | 
| pattern.reduce() | Returns a regular expression which is equivalent to pattern(i.e. matches exactly the same strings) but is simplified as far as possible. See dedicated section below. | 
Call this method to try to simplify the regular expression object. The follow simplification heuristics are supported:
- (ab|cd|ef|)gto- (ab|cd|ef)?g
- ([ab])*to- [ab]*
- ab?b?cto- ab{0,2}c
- aato- a{2}
- a(d(ab|a*c))to- ad(ab|a*c)
- 0|[2-9]to- [02-9]
- abc|adeto- a(bc|de)
- xyz|stzto- (xy|st)z
- abc()defto- abcdef
- a{1,2}|a{3,4}to- a{1,4}
The value returned is a new Pattern object.
Note that in a few cases this did not result in a shorter regular expression.
A combination of a finite lower Bound (see below) and a possibly-infinite upper Bound.
from greenery import parse, Bound, INF, Multiplier
print(parse("a") * Multiplier(Bound(3), INF)) # "a{3,}"Special Multiplier, equal to Multiplier(Bound(0), INF). When it appears in a regular expression, this is {0,} or the Kleene star *.
Special Multiplier, equal to Multiplier(Bound(0), Bound(1)). When it appears in a regular expression, this is {0,1} or ?.
Special Multiplier, equal to Multiplier(Bound(1), INF). When it appears in a regular expression, this is {1,} or +.
Represents a non-negative integer or infinity.
Special Bound representing no limit. Can be used as an upper bound only.
This class represents a character class such as a, \w, ., [A-Za-z0-9_], and so on. Charclasses must be constructed longhand either using a string containing all the desired characters, or a tuple of ranges, where each range is a pair of characters to be used as the range's inclusive endpoints. Use ~ to negate a Charclass.
- a=- Charclass("a")
- [abyz]=- Charclass("abyz")
- [a-z]=- Charclass("abcdefghijklmnopqrstuvwxyz")or- Charclass((("a", "z"),))
- \w=- Charclass((("a", "z"), ("A", "Z"), ("0", "9"), ("_", "_")))
- [^x]=- ~Charclass("x")
- \D=- ~Charclass("0123456789")
- .=- ~Charclass(())
An Fsm is a finite state machine which accepts strings (or more generally iterables of Unicode characters) as input. This is used internally by Pattern for most regular expression manipulation operations.
In theory, accepting strings as input means that every Fsm's alphabet is the same: the set of all 1,114,112 possible Unicode characters which can make up a string. But this is a very large alphabet and would result in extremely large transition maps, and have very poor performance. So, in practice, Fsm uses not single characters but Charclasses (see above) for its alphabet and its map transitions.
# FSM accepting only the string "a"
a = Fsm(
    alphabet={Charclass("a"), ~Charclass("a")},
    states={0, 1, 2},
    initial=0,
    finals={1},
    map={
        0: {Charclass("a"): 1, ~Charclass("a"): 2},
        1: {Charclass("a"): 2, ~Charclass("a"): 2},
        2: {Charclass("a"): 2, ~Charclass("a"): 2},
    },
)Notes:
- The Charclasses which make up the alphabet must partition the space of all Unicode characters - every Unicode character must be a member of exactly oneCharclassin the alphabet.
- States must be integers.
- The map must be complete. Omitting transition symbols or states is not permitted.
A regular language is a possibly-infinite set of strings. With this in mind, Fsm implements several methods like those on frozenset.
| Method | Behaviour | 
|---|---|
| fsm.accepts("a") | Returns Trueif the FSM accepts string orFalseif not. | 
| fsm.strings() | Returns a generator of all the strings which this FSM accepts. | 
| fsm.empty() | Returns Trueif this FSM accepts no strings, otherwiseFalse. | 
| fsm.cardinality() | Returns the number of strings which the FSM accepts. Throws an OverflowErrorif this number is infinite. | 
| fsm1.equivalent(fsm2) | Returns Trueif the two FSMs accept exactly the same strings, otherwiseFalse. | 
| fsm.copy() | Returns a shallow copy of fsm. | 
| fsm.everythingbut() | Returns an FSM which accepts every string not matched by the original. fsm.everythingbut().everythingbut()matches the same strings asfsm. | 
| fsm1.concatenate(fsm2, ...) | Returns an FSM which accepts any string of the form a·b·... where a is a string accepted by fsm1, b is a string accepted byfsm2and so on. | 
| fsm.times(multiplier) | Returns the input FSM concatenated with itself multipliertimes.multipliermust be a non-negative integer. | 
| fsm.star() | Returns an FSM which is the Kleene star closure of the original. | 
| fsm1.union(fsm2, ...) | Returns an FSM accepting any string matched by any of the input FSMs. This is also called alternation. | 
| fsm1.intersection(fsm2, ...) | Returns an FSM accepting any string matched by all input FSMs. | 
| fsm1.difference(fsm2, ...) | Subtract the set of strings matched by fsm2onwards from those matched byfsm1and return the resulting FSM. | 
| fsm1.symmetric_difference(fsm2, ...) | Returns an FSM matching any string accepted by fsm1orfsm2but not both. | 
| fsm.derive(string) | Return the Brzozowski derivative of the input FSM with respect to the input string. | 
| fsm.reduce() | Returns an FSM which is equivalent to fsm(i.e. accepts exactly the same strings) but has a minimal number of states. | 
Note that methods combining FSMs usually output new FSMs with modified alphabets. For example, concatenating an FSM with alphabet {Charclass("a"), ~Charclass("a")} and another FSM with alphabet {Charclass("abc"), ~Charclass("abc")} usually results in a third FSM with a repartitioned alphabet of {Charclass("a"), Charclass("bc"), ~Charclass("abc")}. Notice how all three alphabets partition the space of all Unicode characters.
Several other methods on Fsm instances are available - these should not be used, they're subject to change.
Special Fsm which accepts only the empty string.
Special Fsm which accepts no strings.
pip install -r requirements.dev.txt
isort .
black .
mypy greenery
flake8 --count --statistics --show-source --select=E9,F63,F7,F82 .
flake8 --count --statistics --exit-zero --max-complexity=10 .
pylint --recursive=true .
pytest- Update the version in ./setup.py
- Trash ./dist
- python -m build- creates a- ./distdirectory with some stuff in it
- python -m twine upload dist/*