RegExChar's Native API


By delving beneath the standard Haskell-API, defined in the module "Text.Regex.Base.RegexLike", & implemented in the module "RegExChar.RegExOptsChar", to the Haskell-98 compliant API beneath, the additional features available within the package "regexdot" (& consequently the Char-specific specialisation, defined by the package "regexchar"), can be accessed. Specifically, the ability to retrieve a tree-structure containing the complete match-result, via just one non-polymorphic operator. One can then carve up the result-data as required.

The module "RegExChar.ExtendedRegExChar" exports three match-operators:

$ ghci
GHCi, version 7.6.3:  :? for help
Loading package ghc-prim … linking … done.
Loading package integer-gmp … linking … done.
Loading package base … linking … done.
Prelude> :browse RegExChar.ExtendedRegExChar
(+~)	:: InputData -> RegExDot.RegExOpts.RegExOpts ExtendedRegExChar -> RegExDot.RegEx.Result Char
(/~)	:: InputData -> RegExDot.RegExOpts.RegExOpts ExtendedRegExChar -> Bool
(=~)	:: InputData -> RegExDot.RegExOpts.RegExOpts ExtendedRegExChar -> Bool
data ExtendedRegExChar	= MkExtendedRegExChar {
	hasNonCapturingTopLevelAlternatives	:: Bool,
	extendedRegEx				:: RegExDot.RegEx.ExtendedRegEx Char
type InputData	= RegExDot.RegEx.InputData Char

ghci, locates the module "RegExChar.ExtendedRegExChar" in the package "regexchar", using the information recorded during its installation.

The return-type of these operators is well-defined, rather than the polymorphic value returned by the standard interface.

Returns Bool, & merely indicates the success or failure of the match.
Returns the opposite of (=~).
Returns a value of type "RegExDot.RegEx.Result  Char", & contains the mapping of the input-data into the tree-structure which represents the optimal solution.

Unless Haskell-98 compliance is required, the functionality of (=~) is covered by the standard interface, & (/~) is just a convenience function, so only (+~) is of further interest.

The type "RegExChar.ExtendedRegExChar.InputData" is merely a type-synonym for a list of Char.


To demonstrate this facility to best effect, a regex with a flourishing tree-like structure is preferable, i.e. one with repeated capture-groups. Many pathological regexen fall into this category, which have little inherent value, but are frequently discussed in the context of performance; since the performance of regex-engines (other than those implemented using a DFA), can degrade exponentially with increasing data-length, when attempting to match against one. Whilst their relevance to measuring performance is debatable, since they're not typically encountered in real tasks, they seem appropriate for the current requirement.

More detailed information is made available when you hover your cursor over specific parts of these examples.

Prelude> :module + RegExChar.ExtendedRegExChar
Prelude RegExChar.ExtendedRegExChar> replicate (2 ^ 4 - 1) 'a' +~ RegExDot.RegExOpts.mkRegEx (read "^(a?|a{2}|a{4}|a{8}|a{16})*$")
Loading package transformers- … linking … done.
Loading package array- … linking … done.
Loading package mtl-2.1.2 … linking … done.
Loading package deepseq- … linking … done.
Loading package containers- … linking … done.
Loading package parallel- … linking … done.
Loading package bytestring- … linking … done.
Loading package text- … linking … done.
Loading package parsec-3.1.3 … linking … done.
Loading package regex-base-0.93.2 … linking … done.
Loading package old-locale- … linking … done.
Loading package time- … linking … done.
Loading package random- … linking … done.
Loading package QuickCheck-2.6 … linking … done.
Loading package filepath- … linking … done.
Loading package unix- … linking … done.
Loading package directory- … linking … done.
Loading package toolshed- … linking … done.
Loading package regexdot- … linking … done.
Loading package regexchar- … linking … done.
(Nothing,Just [[[('a'{8},0,"aaaaaaaa")],[('a'{4},8,"aaaa")],[('a'{2},12,"aa")],[('a'?,14,"a")]]],Nothing)
What just happened ?!
  • First a String containing fifteen 'a' Chars, was supplied as input-data, as required by the first parameter of the subsequent infix match-operator.

  • The second parameter required by the match-operator is a "RegExDot.RegExOpts.RegExOpts  RegExChar.ExtendedRegExChar.ExtendedRegExChar", which aggregates the data-type "RegExDot.Options.Options" (which allows one to control the operation of the regex-engine), with the regex. This aggregation, allows the match-operator to receive the three parameters it requires, while retaining the clarity of a simple binary operator. To build this data-structure, we referenced the module "RegExDot.RegExOpts", which was loaded from the package "regexdot" (which should previously have been installed), & which exports a function "mkRegEx". This function uses the default value for RegExDot.Options.Options, & then merely requires a RegExChar.ExtendedRegExChar.ExtendedRegExChar before it can build the required data-structure. One can alternatively customise the construction of RegExDot.Options.Options, by replacing this convenient builder-function with a more verbose data-constructor exported from the same module;

    RegExDot.RegExOpts.MkRegExOpts RegExDot.Options.defaultValue {RegExDot.CompilationOptions.field=valueRegExDot.Options.defaultValue {RegExDot.ExecutionOptions.field=value}.
  • The RegExChar.ExtendedRegExChar.ExtendedRegExChar, was read from a String, & as promised contains a repeatable capture-group, which results in a tree-like result-structure. This rather pointless regex merely defines the subset of strings composed entirely from 'a's; which could have been defined more clearly and efficiently as /^a*$/, & perhaps would have been, had I not wanted to illustrate the nature of the result; but the end justifies the means.

  • Finally, since the type "RegExDot.RegEx.Result" implements the type-class Show, ghci prints it. The result is a triple, composed from:

    • Any input-data consumed before the start of the matching portion; since the regex was bow-anchored, this can't contain anything.

    • The match-structure has the shape of a Rose tree, each leaf of which is also a triple, composed from;

      • The term in the regex which consumed the input-data.
      • The index into the input-data, from which data-consumption began.
      • The consumed input-data.

      The capture-group matched four times. Each of the alternative sub-expressions uses the default greedy quantification, so, on successive iterations, it finds the maximal available match;

      • eight Chars from the original fifteen,
      • four from the remaining seven,
      • two from the remaining three,
      • & finally the last Char.
    • Any unconsumed input-data; since the regex was stern-anchored, this can't contain anything.

    This is the correct solution according to POSIX standard for regex-matching, which requires an exhaustive search of all alternative sub-expressions on each repetition of the capture-group, occasionally resulting in either an unacceptable time-complexity or an unacceptable space-complexity. This contrasts with Perl's regex-engine, which uses a cheaper algorithm which merely finds the left-most match amongst alternative sub-expressions, & short-circuits potentially superior matches further right.

    Each leaf in the tree-shaped match, is also a triple, composed from:

    • The term in the regex which consumed the input-data.
    • The index into the input-data, from which data-consumption began.
    • The consumed input-data.

    In contrast to the standard interface, all the data from the match-operation is available from one function-call, & can be extracted as required.

  • RegExDot.RegEx.Result-tree

    Despite the billing, this tree-structure isn't exactly flourishing, because the result has; a single trunk, sprouting four branches, on each of which is just one leaf; but as may be inferred from the seemingly redundant singleton lists, the result-structure caters for greater complexity.

    A regex is composed from a concatenation of quantified terms, the result for which can be contained within a list. If the terms in question are merely quantified meta-characters, then a list of triples will suffice, but each may also be a quantified capture-group, the repeated results for which require a sub-list rather than just a triple. Each alternative sub-expression in a capture-group, is recursively defined as a regex, composed from a concatenation of terms … the result for which has the same storage-requirements as the outer regex; hence the recursive nature of the structure. To summarise, whenever a capture-group is included within the regex, rather than merely a meta-character, a list of lists is introduced into the result-structure, rather than a just triple. If the capture-group is only repeated once, either because its quantification permitted no more, or because the optimal solution was one, then the corresponding outer list in the result-structure, is merely a singleton. If there's only one term in the matching sub-expression from that capture-group, then the corresponding inner list in the result-structure, is also a singleton.

Let's try another example.

Prelude RegExChar.ExtendedRegExChar> replicate 4 'a' +~ RegExDot.RegExOpts.mkRegEx (read "^(a?){4}(a){4}$")
(Nothing,Just [[[('a'?,0,"")],[('a'?,0,"")],[('a'?,0,"")],[('a'?,0,"")]],[[('a',0,"a")],[('a',1,"a")],[('a',2,"a")],[('a',3,"a")]]],Nothing)

Since both repeatable capture-groups, can independently consume all the input-data, but the former merely desires it, whereas the latter requires it, the former is starved by the latter. Though the only viable match requires that all the input-data must ultimately be consumed by the second capture-group, this is only discovered after exhausting all the alternatives, hence the pathological categorisation of this expression.

As before, the regex matched, which this time required exactly four repetitions of both capture-groups, & being anchored at both ends, consumed all the input-data. This time the result is a tree with; two trunks, each one of which sprouts four branches, on each of which is just one leaf, which consumes one Char.

Prelude RegExChar.ExtendedRegExChar> replicate (2 ^ 4) 'a' +~ RegExDot.RegExOpts.mkRegEx (read "^(((a{0,2}){0,2}){0,2}){0,2}$")
(Nothing,Just [[[[[[[('a'{0,2},0,"aa")],[('a'{0,2},2,"aa")]]],[[[('a'{0,2},4,"aa")],[('a'{0,2},6,"aa")]]]]],[[[[[('a'{0,2},8,"aa")],[('a'{0,2},10,"aa")]]],[[[('a'{0,2},12,"aa")],[('a'{0,2},14,"aa")]]]]]]],Nothing)

This time, any consumption of input-data by the regex, must be by just one literal meta-character; that's all there is. Since this meta-character can match any datum from the available input-data, the only decision left to the regex-engine is to determine the optimal combination of repetitions. The only valid is solution is to choose the maximum repetitions of each capture-group, permissible according to the quantified bounds.

As before, the match consumed all the input-data, but this time the result is a tree with; one trunk, which has two branches, each of which sprouts two twigs, on each of which are two leaves, each of which consumes two Chars.

Prelude RegExChar.ExtendedRegExChar> ['0' .. '5'] +~ RegExDot.RegExOpts.mkRegEx (read "^a?((01?)*|(12?)*|(23?)*|(34?)*|(45?)*)*$")
(Nothing,Just [('a'?,0,""),[[[[('0',0,"0"),('1'?,1,"1")]]],[[[('2',2,"2"),('3'?,3,"3")]]],[[[('4',4,"4"),('5'?,5,"5")]]]]],Nothing)

This regex looks simple to solve; the optional literal meta-character with which it starts, always fails to match, then every second alternative sub-expression is used on successive repetitions of the outermost capture-group to consume all the input-data; but, since POSIX standard for regex-matching requires that we find the optimal solution amongst alternative sub-expressions, an exhaustive search of the other possible solutions must be conducted, before reluctantly concluding that the first solution found was actually the best.

As before, the match consumed all the input-data, but this time the result is a tree with; two trunks, the first of which has just one leaf, which consumes nothing, but the second of which sprouts three branches, on each of which are two leaves, each of which consumes just one Char.

Prelude RegExChar.ExtendedRegExChar> "0x0B0110C5" +~ RegExDot.RegExOpts.mkRegEx (read "0[Xx]([[:xdigit:]]{2})+")
(Just (.*?,0,""),Just [('0',0,"0"),(['X','x'],1,"x"),[[([[:xdigit:]]{2},2,"0B")],[([[:xdigit:]]{2},4,"01")],[([[:xdigit:]]{2},6,"10")],[([[:xdigit:]]{2},8,"C5")]]],Just (.*,10,""))
Returning to the non-pathological example from the previous page, describing the standard API, one can now access all matches of the capture-group.