Composing the Regex

Why ?

By using regexdot's API to compose a polymorphic regex, one can:

One could argue that Perl has had the freedom to explore new regex-concepts precisely because regexen, though integrated with the language, weren't constructed via an API, but were still defined by an encoded string (perhaps with fancy delimiters), thus leaving the core language unchanged in the face of new regex-syntax; not that the language didn't change beyond recognition anyway, for other reasons. My view is that if a new regex-feature is introduced, which might alter the meaning of existing string-based specifications, then one should want existing code to fail compilation. Of course, one would still hope that changes to this API would be backwards-compatible, leaving the user with option, but not the obligation to use new features.

The Goal

What I want to be able to do, is construct a polymorphic regex, where each meta-datum (can't really call them meta-characters any more), is qualified by a postfix operator which defines the permissible repetition-range, & then is concatenated with previously quantified meta-data. If the identifiers for these postfix operators are chosen to match the traditional quantifiers, then the result should look rather like the string-representation of a regex, just without the quotes, presenting a lower entry-barrier to those familiar with this traditional syntax.

There are several problems with this goal.

The Solution

Quantification

To create a set of uniformly binary operators, which can be interleaved between meta-data without the requirement for parenthesis to specify precedence explicitly, the requirement to quantify, is combined with the requirement to concatenate. The arguments to each binary operator are then:

  1. The thing to be repeated; which may be either some meta-datum or a capture-group, subsequently collectively referred to as a "RegExDot.RegEx.Pattern".

    Any additional arguments, required to define a range of permissible repetitions, are aggregated with the RegExDot.RegEx.Pattern to form a single tuple.

  2. Any previously concatenated meta-data; & if there aren't any, just a null list.

RegExDot.RegEx.Pattern operator (RegExDot.RegEx.Pattern, bounds) operator [quantified patterns]

The module "RegExDot.DSL" exports a set of binary operators designed to implement this scheme.

$ ghci
GHCi, version 7.6.3: https://www.haskell.org/ghc/  :? for help
Loading package ghc-prim … linking … done.
Loading package integer-gmp … linking … done.
Loading package base … linking … done.

Prelude> :browse RegExDot.DSL
(#->#:)	:: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.RepetitionBounds) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(#->#?:)	:: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.RepetitionBounds) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(#->:)	:: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.Repetitions) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(#->?:)	:: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.Repetitions) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(#:)	:: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.Repetitions) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(*:)	:: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(*?:)	:: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(+:)	:: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(+?:)	:: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(?:)	:: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(??:)	:: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(-:)	:: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a
(<~>)	:: (Maybe RegExDot.Anchor.Anchor, Maybe RegExDot.Anchor.Anchor) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.ExtendedRegEx a

The identifiers are intended to suggest the number of permissible repetitions with which the specified RegExDot.RegEx.Pattern is qualified before concatenation, in an attempt to mimic traditional regex-notation, but without presenting a conflict with the arithmetic operators. Though you may see this as either a failed attempt to clarify, or a successful attempt to encrypt, some quick cursor-hovering over each identifier should give you the hang of it.

One can interleave these binary operators, using infix notation (where according Haskell-convention, they'd lose the above delimiting parenthesis) between RegExDot.RegEx.Patterns, such that the first parameter required by each operator is the previous RegExDot.RegEx.Pattern in the list, potentially combined as a tuple, with any repetition-bounds, & the second parameter required by each operator, is the RegExDot.RegEx.Concatenation (just a type-synonym for a list of quantified RegExDot.RegEx.Patterns), constructed in the same manner, from the remainder of the list. Sorry, that's was quite complicated; hopefully it'll become clear in the subsequent examples.

The final operator defined above, is rather different. It sandwiches the completed RegExDot.RegEx.Concatenation between optional RegExDot.Anchor.Anchors, to form a RegExDot.RegEx.ExtendedRegEx.

Defining Capture-groups & Alternative Sub-expressions

Though the above operators allow one to quantify & concatenate RegExDot.RegEx.Patterns, & then to define optional RegExDot.Anchor.Anchors around the resulting RegExDot.RegEx.Concatenation, one must first be able to construct the RegExDot.RegEx.Patterns on which to operate. A RegExDot.RegEx.Pattern can be used to represent either a simple meta-datum or a capture-group. A capture-group may be used either to delimit a single sub-expression (presumably for subsequent quantification), or to encapsulate a set of alternative sub-expressions.

Though I've used the traditional name for this concept, suggesting that it delimits a section of the regex, from which any captured input-data may subsequently be requested, it's a misnomer in the context of the package "regexdot", which actually captures everything without choice.

The module "RegExDot.RegEx" exports the relevant data-types & their data-constructors.

Prelude> :module + RegExDot.RegEx

Prelude RegExDot.RegEx> :info Pattern
data Pattern a	= Require (RegExDot.Meta.Meta a) | CaptureGroup (Alternatives a)	-- Defined in RegExDot.RegEx
instance (Eq a)		=> Eq (Pattern a)						-- Defined in RegExDot.RegEx
instance (Show a)	=> Show (Pattern a)					-- Defined in RegExDot.RegEx

Prelude RegExDot.RegEx> :info Alternatives
newtype Alternatives a	= MkAlternatives [ExtendedRegEx a]			-- Defined in RegExDot.RegEx
instance (Eq a)		=> Eq (Alternatives a)					-- Defined in RegExDot.RegEx
instance (Show a)	=> Show (Alternatives a)					-- Defined in RegExDot.RegEx
RegExDot.RegEx.Pattern

This data-type has two data-constructors "RegExDot.RegEx.Require" & "RegExDot.RegEx.CaptureGroup", which respectively allow one to specify the requirement for either just a single meta-datum or a whole capture-group. Once constructed, they can then be treated uniformly during subsequent quantification & concatenation.

RegExDot.RegEx.Alternatives

The only data-constructor "RegExDot.RegEx.MkAlternatives", is merely used to encapsulate a list of alternative RegExDot.RegEx.ExtendedRegExs, as required for the previous data-constructor "RegExDot.RegEx.CaptureGroup". If there isn't an alternative sub-expression to express, which is a legitimate requirement when the capture-group is being used for subsequent quantification, then one merely defines a singleton list of sub-expressions.

The common requirement to compose the two data-constructors "RegExDot.RegEx.CaptureGroup" & "RegExDot.RegEx.MkAlternatives", is captured in the convenience-function "RegExDot.RegEx.captureGroup", which merely takes a list of RegExDot.RegEx.ExtendedRegEx as its single parameter, & returns an instance of the type "RegExDot.RegEx.Pattern".

The order in which alternative sub-expressions are listed within a capture-group, is normally irrelevant, since the package "regexdot" defaults to the rigorous POSIX-requirement to find the longest match, rather than Perl's left-most match.

Constructing Meta-data

Though one now has the ability to quantify RegExDot.RegEx.Patterns, (which may be capture-groups, potentially containing alternative sub-expressions), concatenate them, & to define optional anchors, one must first construct the basic meta-data on which these higher-level operations depend. The module "RegExDot.Meta" exports the data-type "Meta" & its data-constructors.

Prelude> :module + RegExDot.Meta

Prelude RegExDot.Meta> :info Meta
data Meta a
	= Any
	| Literal a
	| AnyOf (RegExDot.BracketExpression.BracketExpression a)
	| NoneOf (RegExDot.BracketExpression.BracketExpression a)
	| Predicate (RegExDot.ShowablePredicate.ShowablePredicate a)	-- Defined in RegExDot.Meta
instance (Eq a)				=> Eq (Meta a)		-- Defined in RegExDot.Meta
instance (ShortcutExpander a, Read a)	=> Read (Meta a)	-- Defined in RegExDot.Meta
instance (Show a)			=> Show (Meta a)	-- Defined in RegExDot.Meta

The purpose of the first four of these five data-constructors is reasonably self-explanatory:

Any

Specifies that any input-datum will match. Note that this since there no need even to know the datum on which it operates, this is a nullary data-constructor.

The equivalent of /./, in the standard notation for a regex.

Literal

Specifies a requirement for a specific input-datum of the parameterised data-type.

AnyOf

Defines a bracket-expression, based on the supplied list of acceptable data.

The equivalent of /[]/, in the standard notation for a regex.

NoneOf

Defines a bracket-expression, based on the supplied list of unacceptable data.

The equivalent of /[^]/, in the standard notation for a regex.

Predicate

This concept isn't directly available in a traditional regex, but permits one to define a predicate-function, the return-value of which determines whether the current input-datum matches one's requirements.

This enables one to implement efficiently some Perl-style shortcuts.

Examples

Type-parameter = Bridge.Tier2.Bid.Bid

In contrast to earlier versions, package "bridge-0.1.0.12" exposes a library which must be installed for the following example.

$ ghci
GHCi, version 7.6.3: https://www.haskell.org/ghc/  :? for help
Loading package ghc-prim … linking … done.
Loading package integer-gmp … linking … done.
Loading package base … linking … done.
*Bridge.Tier3.MetaBid> :module + RegExDot.RegEx RegExDot.DSL
*Bridge.Tier3.MetaBid RegExDot.RegEx RegExDot.DSL> :set prompt "$ "
$ :{
$ (
$ 	Just RegExDot.Anchor.Bow, Nothing {-Define the required anchors-}
$ ) <~> (
$ 	Require $ RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass, (0, Just 3) {-Combine the RegExDot.Meta Bid with the permissible repetition-range-}
$ ) #->#: Require (
$ 	RegExDot.Meta.Literal Bridge.Tier2.Bid.oneNT
$ ) -: Require (
$ 	RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass
$ ) -: Require (
$ 	RegExDot.Meta.AnyOf $ map RegExDot.BracketExpressionMember.Literal [Bridge.Tier2.Bid.twoDiamonds, Bridge.Tier2.Bid.twoHearts]
$ ) ?: [] {-Start with a null RegExDot.RegEx.Concatenation-}
$ :}
Loading package transformers-0.3.0.0 … linking … done.
Loading package array-0.4.0.1 … linking … done.
Loading package mtl-2.1.2 … linking … done.
Loading package deepseq-1.3.0.1 … linking … done.
Loading package containers-0.5.0.0 … linking … done.
Loading package parallel-3.2.0.3 … linking … done.
Loading package bytestring-0.10.0.2 … linking … done.
Loading package text-0.11.3.1 … linking … done.
Loading package parsec-3.1.3 … linking … done.
Loading package binary-0.7.6.1 … linking … done.
Loading package filepath-1.3.0.1 … linking … done.
Loading package old-locale-1.0.0.5 … linking … done.
Loading package time-1.4.0.1 … linking … done.
Loading package unix-2.6.0.1 … linking … done.
Loading package directory-1.2.0.1 … linking … done.
Loading package pretty-1.1.1.0 … linking … done.
Loading package process-1.1.0.2 … linking … done.
Loading package Cabal-1.22.4.0 … linking … done.
Loading package primes-0.2.1.0 … linking … done.
Loading package random-1.0.1.1 … linking … done.
Loading package QuickCheck-2.6 … linking … done.
Loading package toolshed-0.16.0.0 … linking … done.
Loading package factory-0.2.1.2 … linking … done.
Loading package regexdot-0.11.1.2 … linking … done.
Loading package bridge-0.1.0.12 … linking … done.
^[Pass{0,3},1NT,Pass,[2D,2H]?]

Whilst you might say, that composing the above expression using the API, would take much longer than two-finger typing the traditional terse syntax into a string, which one then parses into the required regex, this isn't a traditional regex; it's polymorphic.

Besides, because the regex-specification is now merely a Haskell-expression, rather than some special syntax squashed into a string, it can be separated with white-space as appropriate, & the sub-components can be commented, just as one can with any Haskell source-code.

Further, the whole regex-specification is now statically type-checked by the compiler, rather than waiting until runtime to find syntax-errors. It doesn't stop you writing the wrong thing (that requires a deeper level of help), but it does reduce the scope of potential errors to the semantics of the expression.

Now let's try to use the API to reproduce the example of the Stayman-convention, specified previously by a Perl-style shortcut.

$  :{
$  (Just RegExDot.Anchor.Bow, Nothing) <~> captureGroup (
$      map (
$          (Nothing, Nothing) <~>
$      ) [
$          captureGroup (
$              map (
$                  (Nothing, Nothing) <~>
$              ) [
$                  (
$                      Require $ RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass, (0, Just 3)
$                  ) #->#: [] {-Start with a null RegExDot.RegEx.Concatenation-},
$                  (
$                      Require $ RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass, (0, Just 2)
$                  ) #->#: (
$                      Require . RegExDot.Meta.AnyOf $ map RegExDot.BracketExpressionMember.Literal [Bridge.Tier2.Bid.oneClub .. Bridge.Tier2.Bid.oneSpade]
$                  ) -: (
$                      captureGroup . return {-to List-monad-} $ (
$                          Nothing, Nothing
$                      ) <~> (Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass), 2) #: [] {-Start with a null RegExDot.RegEx.Concatenation-}
$                  ) ?: [] {-Start with a null RegExDot.RegEx.Concatenation-}
$              ]
$          ) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.oneNT) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.twoClubs) -: [] {-Start with a null RegExDot.RegEx.Concatenation-},
$          (
$              Require $ RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass, (0, Just 3)
$          ) #->#: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.twoNTs) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.threeClubs) -: [] {-Start with a null RegExDot.RegEx.Concatenation-}
$      ]
$   ) -: []
$  :}
^[([([Pass{0,3}]|[Pass{0,2},[1C,1D,1H,1S],([Pass{2}])?]),1NT,Pass,2C]|[Pass{0,3},2NT,Pass,3C])]

If you feel your eyebrows migrating relentlessly northbound in alarm, read southbound.


Nobody said it was going to be easy.

Fortunately, this is about as complicated as it gets, & by way of consolation, even writing the equivalent terse string ready to parse into the corresponding regex, wouldn't have been easy. Rather than repeatedly fully qualifying function-identifiers with their module-name, a more concise but acceptable policy, would be to import specific functions from the module "RegExDot.Meta", & leave subsequent references to them unqualified. Equally, after spending some more time in solitary confinement, I may be able to provide a wider set of operators, to replace some of the verbose function-names.

As before, the three match-operators are available to test whether sample input-data matches, & if so, to examine its mapping into the regex. So, defining a regex to represent a Quantitative Raise, i.e.

"^[Pass?,[1NT,2NT,3NT],.,4NT,.,[6NT,Pass]]"
.
$  :{
$  [
$  	Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.oneNT,Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.fourNTs,Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.smallSlamNTs,Bridge.Tier2.Bid.Double,Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.Pass
$  ] +~ RegExDot.RegExOpts.mkRegEx (
$  	(
$  		Just RegExDot.Anchor.Bow, Nothing {-Specify a Bow-anchor only-}
$  	) <~> Require (
$  		RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass
$  	) ?: Require (
$  		RegExDot.Meta.AnyOf $ map RegExDot.BracketExpressionMember.Literal [Bridge.Tier2.Bid.oneNT, Bridge.Tier2.Bid.twoNTs, Bridge.Tier2.Bid.threeNTs] {-Construct a bracket-expression-}
$  	) -: Require RegExDot.Meta.Any -: Require (
$  		RegExDot.Meta.Literal Bridge.Tier2.Bid.fourNTs
$  	) -: Require RegExDot.Meta.Any -: Require (
$  		RegExDot.Meta.AnyOf $ map RegExDot.BracketExpressionMember.Literal [Bridge.Tier2.Bid.smallSlamNTs, Bridge.Tier2.Bid.Pass] {-Construct a bracket-expression-}
$  	) -: []
$  )
$  :}
(Nothing,Just [(Pass?,0,[Pass]),([1NT,2NT,3NT],1,[1NT]),(.,2,[Pass]),(4NT,3,[4NT]),(.,4,[Pass]),([6NT,Pass],5,[6NT])],Just (.*,6,[x,Pass,Pass,Pass]))

The value of type "RegExDot.RegEx.Result" returned from the match-operation, tells one that the input-data matched, & the list of triples from which the central part of this structure is composed, each of which corresponds to a quantified "RegExDot.Meta.Meta  Bridge.Tier2.Bid.Bid" in the regex, confirms the expected input-data consumption-profile. The only surprise is that, since the regex wasn't terminated in an RegExDot.Anchor.Stern, four Bridge.Tier2.Bid.Bids were consumed after the match, as defined by the last part of the outermost triple.