Composing the Regex
Why ?
By using regexdot's API to compose a polymorphic regex, one can:
-
Reduce the scope for syntactic errors, by exposing the components from which the regex is built, to the Haskell-compiler.
-
Improve its readability by embedding white-space & comments.
-
Largely release oneself from concerns regarding the string-representation of the actual type-parameter, which, should the regex alternatively be parsed from a String, might conflict with traditional regex quantifiers & delimiters.
If the chosen type-parameter can't unambiguously be parsed, because of unavoidable conflict with the traditional symbols for the meta-characters from which the regex is composed, then even after composition via the API described here, the printed form of the regex will exhibit the same ambiguity. This is a certainly a problem, but counting the number of times I've needed to print a regex, rather than just use it to match something, I repeatedly arrive at the answer, "Zero".
-
Potentially dynamically construct the regex; maybe from the results of a query. One could achieve this by concatenating strings containing snippets of regex, but this sends an icy shudder rippling down my spine, reflecting the unilateral decision by my subconscious, to relegate this dubious practice to the bush-league.
One could argue that Perl has had the freedom to explore new regex-concepts precisely because regexen, though integrated with the language, weren't constructed via an API, but were still defined by an encoded string (perhaps with fancy delimiters), thus leaving the core language unchanged in the face of new regex-syntax; not that the language didn't change beyond recognition anyway, for other reasons. My view is that if a new regex-feature is introduced, which might alter the meaning of existing string-based specifications, then one should want existing code to fail compilation. Of course, one would still hope that changes to this API would be backwards-compatible, leaving the user with option, but not the obligation to use new features.
The Goal
What I want to be able to do, is construct a polymorphic regex, where each meta-datum (can't really call them meta-characters any more), is qualified by a postfix operator which defines the permissible repetition-range, & then is concatenated with previously quantified meta-data. If the identifiers for these postfix operators are chosen to match the traditional quantifiers, then the result should look rather like the string-representation of a regex, just without the quotes, presenting a lower entry-barrier to those familiar with this traditional syntax.
There are several problems with this goal.
-
I don't particularly like overloading the traditional arithmetic operators + & *. This would require one either to hide portions of the standard module "Prelude", or to qualify each of these new operators with its module-name, ruining the goal of producing something vaguely similar to a traditional regex.
-
Haskell doesn't cater for postfix operators; just infix or prefix.
-
Some of the required quantifiers only take one parameter (*, *?, +, +?, ?, ??), the meta-datum they're quantifying. Unary operators are both rare & awkward in Haskell; there's currently only one in common usage, the unary minus, & there are many who would like to subtract it from this short-list too.
-
Some quantifiers require additional arguments ({n,}, {n,}?, {n}, {n, m}, {n, m}?), to define the minimum, & potentially also the maximum, permissible repetition-bounds.
The Solution
Quantification
To create a set of uniformly binary operators, which can be interleaved between meta-data without the requirement for parenthesis to specify precedence explicitly, the requirement to quantify, is combined with the requirement to concatenate. The arguments to each binary operator are then:
-
The thing to be repeated; which may be either some meta-datum or a capture-group, subsequently collectively referred to as a "RegExDot.RegEx.Pattern".
Any additional arguments, required to define a range of permissible repetitions, are aggregated with the RegExDot.RegEx.Pattern to form a single tuple.
-
Any previously concatenated meta-data; & if there aren't any, just a null list.
RegExDot.RegEx.Pattern operator (RegExDot.RegEx.Pattern, bounds) operator [quantified patterns]
The module "RegExDot.DSL" exports a set of binary operators designed to implement this scheme.
$ghci
GHCi, version 7.6.3: https://www.haskell.org/ghc/ :? for help Loading package ghc-prim … linking … done. Loading package integer-gmp … linking … done. Loading package base … linking … done. Prelude>:browse RegExDot.DSL
(#->#:) :: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.RepetitionBounds) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (#->#?:) :: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.RepetitionBounds) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (#->:) :: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.Repetitions) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (#->?:) :: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.Repetitions) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (#:) :: (RegExDot.RegEx.Pattern a, RegExDot.Repeatable.Repetitions) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (*:) :: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (*?:) :: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (+:) :: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (+?:) :: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (?:) :: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (??:) :: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (-:) :: RegExDot.RegEx.Pattern a -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.Concatenation a (<~>) :: (Maybe RegExDot.Anchor.Anchor, Maybe RegExDot.Anchor.Anchor) -> RegExDot.RegEx.Concatenation a -> RegExDot.RegEx.ExtendedRegEx a
The identifiers are intended to suggest the number of permissible repetitions with which the specified RegExDot.RegEx.Pattern is qualified before concatenation, in an attempt to mimic traditional regex-notation, but without presenting a conflict with the arithmetic operators. Though you may see this as either a failed attempt to clarify, or a successful attempt to encrypt, some quick cursor-hovering over each identifier should give you the hang of it.
One can interleave these binary operators, using infix notation (where according Haskell-convention, they'd lose the above delimiting parenthesis) between RegExDot.RegEx.Patterns, such that the first parameter required by each operator is the previous RegExDot.RegEx.Pattern in the list, potentially combined as a tuple, with any repetition-bounds, & the second parameter required by each operator, is the RegExDot.RegEx.Concatenation (just a type-synonym for a list of quantified RegExDot.RegEx.Patterns), constructed in the same manner, from the remainder of the list. Sorry, that's was quite complicated; hopefully it'll become clear in the subsequent examples.
The final operator defined above, is rather different. It sandwiches the completed RegExDot.RegEx.Concatenation between optional RegExDot.Anchor.Anchors, to form a RegExDot.RegEx.ExtendedRegEx.
Defining Capture-groups & Alternative Sub-expressions
Though the above operators allow one to quantify & concatenate RegExDot.RegEx.Patterns, & then to define optional RegExDot.Anchor.Anchors around the resulting RegExDot.RegEx.Concatenation, one must first be able to construct the RegExDot.RegEx.Patterns on which to operate. A RegExDot.RegEx.Pattern can be used to represent either a simple meta-datum or a capture-group. A capture-group may be used either to delimit a single sub-expression (presumably for subsequent quantification), or to encapsulate a set of alternative sub-expressions.
Though I've used the traditional name for this concept, suggesting that it delimits a section of the regex, from which any captured input-data may subsequently be requested, it's a misnomer in the context of the package "regexdot", which actually captures everything without choice.
The module "RegExDot.RegEx" exports the relevant data-types & their data-constructors.
Prelude>:module + RegExDot.RegEx
Prelude RegExDot.RegEx>:info Pattern
data Pattern a = Require (RegExDot.Meta.Meta a) | CaptureGroup (Alternatives a) -- Defined in RegExDot.RegEx instance (Eq a) => Eq (Pattern a) -- Defined in RegExDot.RegEx instance (Show a) => Show (Pattern a) -- Defined in RegExDot.RegEx Prelude RegExDot.RegEx>:info Alternatives
newtype Alternatives a = MkAlternatives [ExtendedRegEx a] -- Defined in RegExDot.RegEx instance (Eq a) => Eq (Alternatives a) -- Defined in RegExDot.RegEx instance (Show a) => Show (Alternatives a) -- Defined in RegExDot.RegEx
- RegExDot.RegEx.Pattern
-
This data-type has two data-constructors "RegExDot.RegEx.Require" & "RegExDot.RegEx.CaptureGroup", which respectively allow one to specify the requirement for either just a single meta-datum or a whole capture-group. Once constructed, they can then be treated uniformly during subsequent quantification & concatenation.
- RegExDot.RegEx.Alternatives
-
The only data-constructor "RegExDot.RegEx.MkAlternatives", is merely used to encapsulate a list of alternative RegExDot.RegEx.ExtendedRegExs, as required for the previous data-constructor "RegExDot.RegEx.CaptureGroup". If there isn't an alternative sub-expression to express, which is a legitimate requirement when the capture-group is being used for subsequent quantification, then one merely defines a singleton list of sub-expressions.
The common requirement to compose the two data-constructors "RegExDot.RegEx.CaptureGroup" & "RegExDot.RegEx.MkAlternatives", is captured in the convenience-function "RegExDot.RegEx.captureGroup", which merely takes a list of RegExDot.RegEx.ExtendedRegEx as its single parameter, & returns an instance of the type "RegExDot.RegEx.Pattern".
The order in which alternative sub-expressions are listed within a capture-group, is normally irrelevant, since the package "regexdot" defaults to the rigorous POSIX-requirement to find the longest match, rather than Perl's left-most match.
Constructing Meta-data
Though one now has the ability to quantify RegExDot.RegEx.Patterns, (which may be capture-groups, potentially containing alternative sub-expressions), concatenate them, & to define optional anchors, one must first construct the basic meta-data on which these higher-level operations depend. The module "RegExDot.Meta" exports the data-type "Meta" & its data-constructors.
Prelude>:module + RegExDot.Meta
Prelude RegExDot.Meta>:info Meta
data Meta a = Any | Literal a | AnyOf (RegExDot.BracketExpression.BracketExpression a) | NoneOf (RegExDot.BracketExpression.BracketExpression a) | Predicate (RegExDot.ShowablePredicate.ShowablePredicate a) -- Defined in RegExDot.Meta instance (Eq a) => Eq (Meta a) -- Defined in RegExDot.Meta instance (ShortcutExpander a, Read a) => Read (Meta a) -- Defined in RegExDot.Meta instance (Show a) => Show (Meta a) -- Defined in RegExDot.Meta
The purpose of the first four of these five data-constructors is reasonably self-explanatory:
- Any
-
Specifies that any input-datum will match. Note that this since there no need even to know the datum on which it operates, this is a nullary data-constructor.
The equivalent of /./, in the standard notation for a regex.
- Literal
-
Specifies a requirement for a specific input-datum of the parameterised data-type.
- AnyOf
-
Defines a bracket-expression, based on the supplied list of acceptable data.
The equivalent of /[…]/, in the standard notation for a regex.
- NoneOf
-
Defines a bracket-expression, based on the supplied list of unacceptable data.
The equivalent of /[^…]/, in the standard notation for a regex.
- Predicate
-
This concept isn't directly available in a traditional regex, but permits one to define a predicate-function, the return-value of which determines whether the current input-datum matches one's requirements.
This enables one to implement efficiently some Perl-style shortcuts.
Examples
Type-parameter = Bridge.Tier2.Bid.Bid
In contrast to earlier versions, package "bridge-0.1.0.12" exposes a library which must be installed for the following example.
$ghci
GHCi, version 7.6.3: https://www.haskell.org/ghc/ :? for help Loading package ghc-prim … linking … done. Loading package integer-gmp … linking … done. Loading package base … linking … done. *Bridge.Tier3.MetaBid>:module + RegExDot.RegEx RegExDot.DSL
*Bridge.Tier3.MetaBid RegExDot.RegEx RegExDot.DSL>:set prompt "$ "
$:{
$(
$Just RegExDot.Anchor.Bow, Nothing {-Define the required anchors-}
$) <~> (
$Require $ RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass, (0, Just 3) {-Combine the RegExDot.Meta Bid with the permissible repetition-range-}
$) #->#: Require (
$RegExDot.Meta.Literal Bridge.Tier2.Bid.oneNT
$) -: Require (
$RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass
$) -: Require (
$RegExDot.Meta.AnyOf $ map RegExDot.BracketExpressionMember.Literal [Bridge.Tier2.Bid.twoDiamonds, Bridge.Tier2.Bid.twoHearts]
$) ?: [] {-Start with a null RegExDot.RegEx.Concatenation-}
$:}
Loading package transformers-0.3.0.0 … linking … done. Loading package array-0.4.0.1 … linking … done. Loading package mtl-2.1.2 … linking … done. Loading package deepseq-1.3.0.1 … linking … done. Loading package containers-0.5.0.0 … linking … done. Loading package parallel-3.2.0.3 … linking … done. Loading package bytestring-0.10.0.2 … linking … done. Loading package text-0.11.3.1 … linking … done. Loading package parsec-3.1.3 … linking … done. Loading package binary-0.7.6.1 … linking … done. Loading package filepath-1.3.0.1 … linking … done. Loading package old-locale-1.0.0.5 … linking … done. Loading package time-1.4.0.1 … linking … done. Loading package unix-2.6.0.1 … linking … done. Loading package directory-1.2.0.1 … linking … done. Loading package pretty-1.1.1.0 … linking … done. Loading package process-1.1.0.2 … linking … done. Loading package Cabal-1.22.4.0 … linking … done. Loading package primes-0.2.1.0 … linking … done. Loading package random-1.0.1.1 … linking … done. Loading package QuickCheck-2.6 … linking … done. Loading package toolshed-0.16.0.0 … linking … done. Loading package factory-0.2.1.2 … linking … done. Loading package regexdot-0.11.1.2 … linking … done. Loading package bridge-0.1.0.12 … linking … done. ^[Pass{0,3},1NT,Pass,[2D,2H]?]
-
To reduce excess verbosity, the data-constructor "RegExDot.RegEx.Require", & the binary operators from the module "RegExDot.DSL", were pulled into the context for expression-evaluation, from the package "regexdot-0.10.2.0".
-
A multi-line ghci-command was started.
-
Four "RegExDot.Meta.Meta Bridge.Tier2.Bid.Bids" were constructed; three of them literal instances of the type-parameter, but the final one a bracket-expression.
-
A RegExDot.RegEx.Pattern was constructed from each "RegExDot.Meta.Meta Bridge.Tier2.Bid.Bid"; in each case using the data-constructor "RegExDot.RegEx.Require", since in this example, none of them is a capture-group.
-
Each RegExDot.RegEx.Pattern was then quantified & concatenated, by interleaving infix-instances of the binary operators exported from the module "RegExDot.DSL".
-
Since, in this example, no stern-anchor is required, the resulting RegExDot.RegEx.Concatenation was merely bow-anchored, to create a "RegExDot.RegEx.ExtendedRegEx Bid".
-
Lacking any request actually to use the regex in a match-operation, ghci merely prints it, using RegExDot.RegEx.ExtendedRegEx's implementation of the type-class "Show".
Whilst you might say, that composing the above expression using the API, would take much longer than two-finger typing the traditional terse syntax into a string, which one then parses into the required regex, this isn't a traditional regex; it's polymorphic.
Besides, because the regex-specification is now merely a Haskell-expression, rather than some special syntax squashed into a string, it can be separated with white-space as appropriate, & the sub-components can be commented, just as one can with any Haskell source-code.
Further, the whole regex-specification is now statically type-checked by the compiler, rather than waiting until runtime to find syntax-errors. It doesn't stop you writing the wrong thing (that requires a deeper level of help), but it does reduce the scope of potential errors to the semantics of the expression.
Now let's try to use the API to reproduce the example of the Stayman-convention, specified previously by a Perl-style shortcut.
$:{
$(Just RegExDot.Anchor.Bow, Nothing) <~> captureGroup (
$map (
$(Nothing, Nothing) <~>
$) [
$captureGroup (
$map (
$(Nothing, Nothing) <~>
$) [
$(
$Require $ RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass, (0, Just 3)
$) #->#: [] {-Start with a null RegExDot.RegEx.Concatenation-},
$(
$Require $ RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass, (0, Just 2)
$) #->#: (
$Require . RegExDot.Meta.AnyOf $ map RegExDot.BracketExpressionMember.Literal [Bridge.Tier2.Bid.oneClub .. Bridge.Tier2.Bid.oneSpade]
$) -: (
$captureGroup . return {-to List-monad-} $ (
$Nothing, Nothing
$) <~> (Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass), 2) #: [] {-Start with a null RegExDot.RegEx.Concatenation-}
$) ?: [] {-Start with a null RegExDot.RegEx.Concatenation-}
$]
$) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.oneNT) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.twoClubs) -: [] {-Start with a null RegExDot.RegEx.Concatenation-},
$(
$Require $ RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass, (0, Just 3)
$) #->#: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.twoNTs) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass) -: Require (RegExDot.Meta.Literal Bridge.Tier2.Bid.threeClubs) -: [] {-Start with a null RegExDot.RegEx.Concatenation-}
$]
$) -: []
$:}
^[([([Pass{0,3}]|[Pass{0,2},[1C,1D,1H,1S],([Pass{2}])?]),1NT,Pass,2C]|[Pass{0,3},2NT,Pass,3C])]
If you feel your eyebrows migrating relentlessly northbound in alarm, read southbound.
Nobody said it was going to be easy.
Fortunately, this is about as complicated as it gets, & by way of consolation, even writing the equivalent terse string ready to parse into the corresponding regex, wouldn't have been easy. Rather than repeatedly fully qualifying function-identifiers with their module-name, a more concise but acceptable policy, would be to import specific functions from the module "RegExDot.Meta", & leave subsequent references to them unqualified. Equally, after spending some more time in solitary confinement, I may be able to provide a wider set of operators, to replace some of the verbose function-names.
-
The main difference from the previous example, is the brooding presence of three capture-groups, the first two of which both offer a choice of two alternative sub-expressions.
To reduce verbosity, it used a trivial convenience-function "RegExDot.RegEx.captureGroup", which merely composes two data-constructors, to build a RegExDot.RegEx.Pattern from a list of alternative sub-expressions.
Each of the sub-expressions (& also the whole regex) begins as a null list, to which successive RegExDot.RegEx.Patterns, after quantification, are prepended, using the binary operators exported from the module "RegExDot.DSL".
-
Most of the meta-data from which these RegExDot.RegEx.Patterns are constructed, are literal instances of the type-parameter "Bridge.Tier2.Bid.Bid", constructed using the data-constructor "RegExDot.Meta.Literal", but there is also one bracket-expression, constructed using the data-constructor "RegExDot.Meta.AnyOf", from a list generated using a range-specifier & the actual type-parameter's implementation of the type-class "Enum". The polymorphic regex-engine doesn't mandate that the formal type-parameter implement this type-class, but that doesn't prohibit its use in this non-polymorphic context.
-
Once printed, the resulting regex can be seen to be identical to the version specified previously by a Perl-style shortcut.
As before, the three match-operators are available to test whether sample input-data matches, & if so, to examine its mapping into the regex. So, defining a regex to represent a Quantitative Raise, i.e.
"^[Pass?,[1NT,2NT,3NT],.,4NT,.,[6NT,Pass]]"
.
$:{
$[
$Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.oneNT,Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.fourNTs,Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.smallSlamNTs,Bridge.Tier2.Bid.Double,Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.Pass,Bridge.Tier2.Bid.Pass
$] +~ RegExDot.RegExOpts.mkRegEx (
$(
$Just RegExDot.Anchor.Bow, Nothing {-Specify a Bow-anchor only-}
$) <~> Require (
$RegExDot.Meta.Literal Bridge.Tier2.Bid.Pass
$) ?: Require (
$RegExDot.Meta.AnyOf $ map RegExDot.BracketExpressionMember.Literal [Bridge.Tier2.Bid.oneNT, Bridge.Tier2.Bid.twoNTs, Bridge.Tier2.Bid.threeNTs] {-Construct a bracket-expression-}
$) -: Require RegExDot.Meta.Any -: Require (
$RegExDot.Meta.Literal Bridge.Tier2.Bid.fourNTs
$) -: Require RegExDot.Meta.Any -: Require (
$RegExDot.Meta.AnyOf $ map RegExDot.BracketExpressionMember.Literal [Bridge.Tier2.Bid.smallSlamNTs, Bridge.Tier2.Bid.Pass] {-Construct a bracket-expression-}
$) -: []
$)
$:}
(Nothing,Just [(Pass?,0,[Pass]),([1NT,2NT,3NT],1,[1NT]),(.,2,[Pass]),(4NT,3,[4NT]),(.,4,[Pass]),([6NT,Pass],5,[6NT])],Just (.*,6,[x,Pass,Pass,Pass]))
The value of type "RegExDot.RegEx.Result" returned from the match-operation, tells one that the input-data matched, & the list of triples from which the central part of this structure is composed, each of which corresponds to a quantified "RegExDot.Meta.Meta Bridge.Tier2.Bid.Bid" in the regex, confirms the expected input-data consumption-profile. The only surprise is that, since the regex wasn't terminated in an RegExDot.Anchor.Stern, four Bridge.Tier2.Bid.Bids were consumed after the match, as defined by the last part of the outermost triple.