RegExDot

Background

Regular-expressions are powerful but imprecise.

Regexen are a bit like hammers, they're as easy to use as they are blunt instruments, & having gained some familiarity with them, suddenly you'll be surrounded by things that beg to be whacked; before you know it, you'll have created many bent nails & a throbbing thumb.

They were lean & mean when I first encountered them in ed, acquired some fat around the gut with the inclusion of capture-groups, back-references, & POSIX character-classes, & now look like Mr Creosote after being fed on a feature-rich diet in Perl-5, which included look-around assertions, shortcuts & ironically, non-greedy quantification.

They've outgrown their origins to such an extent, that they can no longer justify their original name "regular expression" (given its anachronistic association with the subject of "regular algebra", from which they emerged), but are now collectively known as "regexen".

The Grudge

I see several problems with regexen:

My Big Idea Enlightenment

regexdot aims to rectify these specific issues:

Implementation

You may wonder why a direct implementation of a polymorphic regex-engine is required. If, for example, one has an enumerated type, with unique instances corresponding to say, the colours of the rainbow, & one wishes to match a lists of these colours, against a regex describing a colour-pattern (exactly why one might want to do this, isn't really the point), couldn't one just define a mapping, from each of the possible enumerated colours, to an arbitrary seven-member subset of characters, then perform the match using a traditional regex-engine ? This scheme, for which Heath Robinson doubtless holds a patent, might work to a limited extent, but closer inspection reveals its inadequacy:

This website would be much smaller if I'd decided that the best strategy was not to bother, & though I don't currently know of many instances in which such a beast might be useful, my gut-instinct is that it's worth producing; regrettably, my gut can't even forecast meal-times accurately.

The implementation of a regex-engine falls into three basic categories; recursive, NFA, & DFA. The Thompson NFA algorithm is theoretically most efficient, but; it trades the undesirable exponential time-complexity inherent to other implementations, for (arguably less) undesirable space-complexity; can't easily be queried regarding the nature of any match; & makes the implementation of some newer regex-features difficult. Because I want to; know the precise nature of the match; implement features like lazy quantification; & retain the option to implement back-references; regexdot is implemented using the simple recursive algorithm.

This design-choice results in relatively poor performance on some pathological regexen, but on typical regexen, this performance-penalty is masked anyway, by the poor rate of data-consumption resulting from its inherent polymorphism, which blocks adoption of Char-specific optimizations. On the plus side, the algorithm is capable of evaluating alternative sub-expressions in parallel, & will do so when linked with the threaded ghc-runtime & where multiple CPU-cores are available.

Features

regexdot's Feature-set
Supported Partially Supported Unsupported
Capture-groups:

Supported to an indefinite level of nesting.

Alternation:

An indefinite number of alternative sub-expressions are permitted. An exhaustive search amongst them, for the optimal match, is performed as required by POSIX, though there's an execution-option to short-circuit the search at the left-most matching sub-expression, like Perl.

Non-greedy Quantification:

Supported.

Anchors:

Supported; including internal anchors around capture-groups.

Bracket-expression Negation:

Supported.

Perl-style Shortcuts:

Conceptually supported for input-data of arbitrary type, by permitting the definition of custom domain-specific shortcuts.

Comments:

Indirectly supported; one can use Haskell-comments, when the regex is composed via its API.

Free-spacing:

Indirectly supported; one can use white-space, when the regex is composed via its API.

Bracket-expression Ranges:

This feature would require the formal type-parameter to implement either of the type-classes "Enum" or "Ord", which to avoid unnecessarily constraining its applicability, regexdot doesn't. Despite this deliberately lax policy, one can still use the type-classes supported by the actual type-parameter in the implementation of type-specific Perl-style shortcuts.

Conceptually meaningful only in a character-context

† Fully supported by the package "regexchar".

The Package

In the pages to follow, some of the more significant modules in regexdot, will be referenced. To give you some feel for the environment, this is the complete set of files from which it is composed, with those files corresponding to significant modules, emphasised. A summary of the function of each file is available, & can be accessed (given focus), by hovering your cursor over its name.

$ tar -zxf 'regexdot-0.11.1.2.tar.gz' && ls -pR 'regexdot-0.11.1.2/'	#Unpack & list the package-contents.
regexdot-0.11.1.2/:
changelog.markdown copyright LICENSE README.markdown regexdot.cabal Setup.hs src-lib/

regexdot-0.11.1.2/src-lib:
RegExDot/

regexdot-0.11.1.2/src-lib/RegExDot:
Anchor.hs                  CompilationOptions.hs ConsumptionProfile.hs DSL.hs              Meta.hs      Repeatable.hs Span.hs
BracketExpression.hs       Consumer.hs           DataSpan.hs           ExecutionOptions.hs RegEx.hs     Result.hs     Tree.hs
BracketExpressionMember.hs ConsumptionBounds.hs  DataSpanTree.hs       InstanceInt.hs      RegExOpts.hs ShowablePredicate.hs

Test Data-type

In order to demonstrate the use of the package "regexdot", I need a test data-type, & lacking a better example, I'll use the one for which it was developed, i.e. the type "Bid". This data-type can be inspected using ghci (using package "bridge-0.1.0.12" which, in contrast to earlier versions, exposes a library).

ghci
GHCi, version 7.6.3: https://www.haskell.org/ghc/  :? for help
Loading package ghc-prim … linking … done.
Loading package integer-gmp … linking … done.
Loading package base … linking … done.

Prelude> :module +Bridge.Tier2.Bid
Prelude Bridge.Tier2.Bid> :info Bid
data Bid = Redouble | Double | Pass | MkBid BidLevel.BidLevel Trump.Trump	-- Defined at Bridge/Tier2/Bid.hs:103:6-8
instance Enum Bid							-- Defined at Bridge/Tier2/Bid.hs:124:10-17
instance Eq Bid								-- Defined at Bridge/Tier2/Bid.hs:109:17-18
instance Ord Bid								-- Defined at Bridge/Tier2/Bid.hs:115:10-16
instance Read Bid							-- Defined at Bridge/Tier2/Bid.hs:141:10-17
instance Show Bid							-- Defined at Bridge/Tier2/Bid.hs:135:10-17

Prelude Bridge.Tier2.Bid> take 38 [Redouble ..]
Loading package array-0.4.0.1 … linking … done.
Loading package deepseq-1.3.0.1 … linking … done.
Loading package bytestring-0.10.0.2 … linking … done.
Loading package containers-0.5.0.0 … linking … done.
Loading package binary-0.7.6.1 … linking … done.
Loading package filepath-1.3.0.1 … linking … done.
Loading package old-locale-1.0.0.5 … linking … done.
Loading package time-1.4.0.1 … linking … done.
Loading package unix-2.6.0.1 … linking … done.
Loading package directory-1.2.0.1 … linking … done.
Loading package pretty-1.1.1.0 … linking … done.
Loading package process-1.1.0.2 … linking … done.
Loading package Cabal-1.22.4.0 … linking … done.
Loading package parallel-3.2.0.3 … linking … done.
Loading package primes-0.2.1.0 … linking … done.
Loading package random-1.0.1.1 … linking … done.
Loading package QuickCheck-2.6 … linking … done.
Loading package toolshed-0.16.0.0 … linking … done.
Loading package factory-0.2.1.2 … linking … done.
Loading package transformers-0.3.0.0 … linking … done.
Loading package mtl-2.1.2 … linking … done.
Loading package text-0.11.3.1 … linking … done.
Loading package parsec-3.1.3 … linking … done.
Loading package regexdot-0.11.1.2 … linking … done.
Loading package bridge-0.1.0.12 … linking … done.
[xx,x,Pass,1C,1D,1H,1S,1NT,2C,2D,2H,2S,2NT,3C,3D,3H,3S,3NT,4C,4D,4H,4S,4NT,5C,5D,5H,5S,5NT,6C,6D,6H,6S,6NT,7C,7D,7H,7S,7NT]

As required, the data-type "Bridge.Tier2.Bid.Bid" implements the type-classes "Eq" & "Show". ghci used the latter, to print the 38 unique instances of the data-type, which can be built using the four data-constructors.

The first three Bid-instances each have their own nullary data-constructor, but the remainder must be constructed using the only binary data-constructor, so for convenience, the module "Bridge.Tier2.Bid" also exports constants, corresponding to each these.

Bid-instances exported by the module "Bridge.Tier2.Bid"
Required
tricks
Trump-suit No-trumps
7 oneClub oneDiamond oneHeart oneSpade oneNT
8 twoClubs twoDiamonds twoHearts twoSpades twoNTs
9 threeClubs threeDiamonds threeHearts threeSpades threeNTs
10 fourClubs fourDiamonds fourHearts fourSpades fourNTs
11 fiveClubs fiveDiamonds fiveHearts fiveSpades fiveNTs
12 smallSlamClubs smallSlamDiamonds smallSlamHearts smallSlamSpades smallSlamNTs
13 grandSlamClubs grandSlamDiamonds grandSlamHearts grandSlamSpades grandSlamNTs

API

Constructing a Regex