RegExChar's Standard API

The package "regexchar" implements the Haskell-API defined in the module "Text.Regex.Base.RegexLike"; though this is the de facto standard, it uses features which exceed the Haskell-98 standard. This uniform interface makes it easy to swap between any of a plethora of implementations, but is sufficiently esoteric to warrant further explanation. The obscurity of the interface arises from the decision to make the return-type of the match-operator polymorphic. Many programming-languages, though they may support polymorphism, can't handle this slant on the concept at all; though the Perl-function "wantarray" allows a subroutine to inspect the context from which it was called, in order to gain a limited understanding of the required type, & consequently value, to return. In contrast, it's not uncommon to find functions of this nature in Haskell.

$ ghci
GHCi, version 7.6.3: https://www.haskell.org/ghc/  :? for help
Loading package ghc-prim … linking … done.
Loading package integer-gmp … linking … done.
Loading package base … linking … done.

Prelude> :type read
read :: Read a => String -> a

Prelude> read "123"
<interactive>:3:1:
	No instance for (Read a0) arising from a use of `read'
	The type variable `a0' is ambiguous
	Possible fix: add a type signature that fixes these type variable(s)
	Note: there are several potential instances:
		instance Read () -- Defined in `GHC.Read'
		instance (Read a, Read b) => Read (a, b) -- Defined in `GHC.Read'
		instance (Read a, Read b, Read c) => Read (a, b, c)
			-- Defined in `GHC.Read'
			...plus 25 others
	In the expression: read "123"
	In an equation for `it': it = read "123"

Prelude> read "123" :: Int
123

Prelude> read "123" :: Double
123.0

This seems reasonable, & one can understand without difficulty what it means. What makes the regex-interface, in which the return-type is also polymorphic, more difficult, is the esoteric relationship between the chosen return-type & the data which one might consequently expect to receive. The documentation may precisely define the data expected for each return-type, but the relationship isn't self-explanatory, & occasionally, where a single return-type could sensibly contain more than one set of data, the decision to favour one can appear rather arbitrary. So rather than creating a set of different match-functions, each with a concrete type & named according to the data it returns, there is just one polymorphic match-operator "(=~)", & the caller defines the expected return-type according to the data they require from the match. Whilst this undoubtedly cerebral interface may seem obfuscated, Perl's similar regex match-operator employs the same concept, though to a more limited extent. This approach has the advantage that the interface isn't bloated by a large number of verbose function-identifiers, but the disadvantage that the resulting code is hard to read.
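The idea can be sketched without any regex-library at all. The following toy assumes only the package "base"; the type-class "Target", the type "Literal" & the function "matchLiteral" are hypothetical names invented for illustration, & search for a literal sub-string rather than a regex. None of them is part of regex-base or regexchar. As with "(=~)", the caller's type-annotation selects the instance, & therefore the data returned.

```haskell
{-# LANGUAGE FlexibleInstances #-}

import Data.List (findIndex, isPrefixOf, tails)

-- | Hypothetical stand-in for a regex: a literal sub-string to search for.
newtype Literal = Literal String

-- | Toy analogue of a polymorphic match: the instance chosen by the
-- caller's requested return-type decides which data is returned.
class Target t where
  matchLiteral :: Literal -> String -> t

-- Does the literal occur anywhere in the input?
instance Target Bool where
  matchLiteral (Literal needle) = any (needle `isPrefixOf`) . tails

-- How many non-overlapping occurrences are there?
instance Target Int where
  matchLiteral (Literal needle) = go
    where
      go [] = 0
      go haystack@(_ : rest)
        | null needle = 0 -- an empty literal is counted as no match
        | needle `isPrefixOf` haystack = 1 + go (drop (length needle) haystack)
        | otherwise = go rest

-- Offset & length of the first occurrence; (-1, 0) signals failure.
instance Target (Int, Int) where
  matchLiteral (Literal needle) haystack =
    case findIndex (needle `isPrefixOf`) (tails haystack) of
      Just offset -> (offset, length needle)
      Nothing -> (-1, 0)

-- The alternative design: one descriptively named function per concrete type.
occurs :: Literal -> String -> Bool
occurs = matchLiteral

countOccurrences :: Literal -> String -> Int
countOccurrences = matchLiteral
```

Loaded into ghci, the expression 'matchLiteral (Literal "Function") "Function-alley Function Alley" :: Int' evaluates to 2, whereas annotating the same expression with Bool or (Int, Int) yields True or (0,8) respectively. The wrappers "occurs" & "countOccurrences" show what is traded away: they are easier to read, but each additional return-type requires another identifier.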

Examples

Prelude> :module +RegExChar.RegExOptsChar
Prelude RegExChar.RegExOptsChar> :type (=~)
(=~) :: (Text.Regex.Base.RegexLike.RegexContext  RegExOptsChar  RegExChar.ExtendedRegExChar.InputData  target) => RegExChar.ExtendedRegExChar.InputData -> String -> target

The match-operator for the package "regexchar" is defined in the module "RegExChar.RegExOptsChar". This module is brought into the context for expression-evaluation, so that the exported symbols don't subsequently need to be fully qualified with their module-name.

Whilst the match-operator has an alarming type-signature, the important thing to take away from it is that since "target" has a lower-case initial, it must be a type-parameter rather than a concrete type, &, according to the context, must implement a multi-parameter type-class, of which the first two types have already been defined. One can use ghci to identify pre-defined instances, & consequently the possible values for the type-parameter "target".

$ ghci
GHCi, version 7.6.3: https://www.haskell.org/ghc/  :? for help
Loading package ghc-prim … linking … done.
Loading package integer-gmp … linking … done.
Loading package base … linking … done.

Prelude> :module +Text.Regex.Base
Prelude Text.Regex.Base> :info RegexContext
class RegexLike regex source => RegexContext regex source target	where
	match	:: regex -> source -> target
	matchM	:: Monad m => regex -> source -> m target	-- Defined in Text.Regex.Base.RegexLike
instance RegexLike a b => RegexContext a b [[b]]	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (MatchResult b)	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b Int	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b Bool	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (AllTextSubmatches [] (b, (MatchOffset, MatchLength)))	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (AllTextSubmatches [] b)	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (AllTextMatches [] b)	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (AllSubmatches [] (MatchOffset, MatchLength))	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (AllMatches [] (MatchOffset, MatchLength))	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (b, b, b, [b])	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (b, b, b)	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b (MatchOffset, MatchLength)	-- Defined in Text.Regex.Base.Context
instance RegexLike a b => RegexContext a b ()	-- Defined in Text.Regex.Base.Context

Regrettably, this says nothing about the data a caller might expect to receive for each specific type; to discover that, one can refer to the source-code of the module "Text.Regex.Base.Context", in which these instances of the type-class "RegexContext" are defined. I'll cover most of these types & the corresponding data, to demonstrate the flexibility & length of the rope with which you can hang yourself.

Single Match

Prelude> :module +RegExChar.RegExOptsChar
Prelude RegExChar.RegExOptsChar> "Function Alley" =~ "^Function(\\s+|-)[Aa]l" :: Bool -- Note the escaped Perl-style shortcut & the explicit type-signature.
Loading package transformers-0.3.0.0 … linking … done.
Loading package array-0.4.0.1 … linking … done.
Loading package mtl-2.1.2 … linking … done.
Loading package deepseq-1.3.0.1 … linking … done.
Loading package containers-0.5.0.0 … linking … done.
Loading package bytestring-0.10.0.2 … linking … done.
Loading package parallel-3.2.0.3 … linking … done.
Loading package regex-base-0.93.2 … linking … done.
Loading package text-0.11.3.1 … linking … done.
Loading package parsec-3.1.3 … linking … done.
Loading package old-locale-1.0.0.5 … linking … done.
Loading package time-1.4.0.1 … linking … done.
Loading package random-1.0.1.1 … linking … done.
Loading package QuickCheck-2.6 … linking … done.
Loading package filepath-1.3.0.1 … linking … done.
Loading package unix-2.6.0.1 … linking … done.
Loading package directory-1.2.0.1 … linking … done.
Loading package toolshed-0.16.0.0 … linking … done.
Loading package regexdot-0.11.1.2 … linking … done.
Loading package regexchar-0.9.0.13 … linking … done.
True

In this instance the regex is quite simple, being composed mostly from literals (i.e. meta-characters which must match the corresponding Char exactly, & in this case just once), but permits a little flexibility in the Chars used to separate the two words from which the input-data is required to be composed, & in the capitalisation of the second. Note that special regex-delimiters, such as Perl's '/', aren't required, since the bounds of the regex are just those of the String in which it is contained. Note also the additional '\', required to prevent the '\' in /\s/ from being interpreted as the start of an escape-sequence within the Haskell String-literal.

By defining the return-type as simply a Bool, one can determine whether the match-operation was successful, but not how that match was achieved. This type of match is rather special, since the regex-engine can return after finding any match, rather than proceeding to determine the optimal match; depending on the implementation, it can therefore return early.

Besides regexchar, the requested match-operation requires a large number of ancillary packages, including toolshed & regexdot. ghci locates these using the file "$HOME/.ghc/ … /package.conf", & loads them.

In this instance, the match was successful. You may be wondering what the fuss regarding polymorphic return-types is all about, since this operation is quite clear; read on, & I will attempt to obfuscate.

Prelude RegExChar.RegExOptsChar> "Function Alley" =~ "^Function(\\s+|-)[Aa]l" :: (Text.Regex.Base.RegexLike.MatchOffset, Text.Regex.Base.RegexLike.MatchLength)
(0,11)

By defining the return-type as a pair, composed from type-synonyms exported from Text.Regex.Base.RegexLike, as a more readable alternative to the Ints with which they're synonymous, one can extract the start-index & length of the matched input-data.

The value "(-1,0)" would be returned, after a failure to match anything.

The lack of any '$' permits a match against just the first eleven Chars of input-data, leaving the remaining text unconsumed.
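To make the meaning of the pair concrete, here is a minimal sketch, assuming only the package "base", of how the offset & length relate to the input-data. The function "splitAtMatch" is a hypothetical helper, not part of regex-base or regexchar; it treats a negative offset, such as in the failure-value "(-1,0)", as the absence of a match.

```haskell
-- | Split the input around a match described by (offset, length).
-- Returns Nothing when the offset is negative, as in the failure-value (-1, 0).
splitAtMatch :: String -> (Int, Int) -> Maybe (String, String, String)
splitAtMatch input (offset, len)
  | offset < 0 = Nothing
  | otherwise = Just (before, matched, after)
  where
    (before, rest) = splitAt offset input -- text preceding the match
    (matched, after) = splitAt len rest -- matched text & unconsumed remainder
```

Applied to "Function Alley" & the pair (0,11), this yields 'Just ("","Function Al","ley")', which makes the unconsumed "ley" explicit.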

Prelude RegExChar.RegExOptsChar> "Function-alley" =~ "^Function(\\s+|-)[Aa]l" :: Text.Regex.Base.RegexLike.MatchArray
array (0,1) [(0,(0,11)),(1,(8,1))]

By defining the return-type as another exported type-synonym, but this time an instance of the polymorphic type "Data.Array.Array", one can obtain not only the previous result, but also the start-index & length of any capture-groups.

The value "array (1,0) [ ]" would be returned, after a failure to match anything.

As before, the regex matched the first eleven Chars of input-data, but this time the regex-engine also identifies that the only capture-group (used to define the permissible Chars between the words in the input-data), captured the single Char '-'.
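A result of this shape can be turned back into text using nothing more than the package "array". The following is a sketch under stated assumptions: "Span" is a local alias standing in for the pair (MatchOffset, MatchLength), "spansToText" is a hypothetical helper, & the two arrays are simply the values quoted above, typed in by hand rather than produced by a regex-engine.

```haskell
import Data.Array (Array, array, elems)

-- | Local alias standing in for (MatchOffset, MatchLength); both are Ints.
type Span = (Int, Int)

-- | Recover the text of the whole match (index 0) & of each capture-group
-- (subsequent indices) by slicing the original input-data.
spansToText :: String -> Array Int Span -> [String]
spansToText input = map slice . elems
  where
    slice (offset, len) = take len (drop offset input)

-- | The array shown in the example above.
exampleSpans :: Array Int Span
exampleSpans = array (0, 1) [(0, (0, 11)), (1, (8, 1))]

-- | The empty array quoted above as the result of a failed match.
noMatch :: Array Int Span
noMatch = array (1, 0) []
```

Evaluating 'spansToText "Function-alley" exampleSpans' gives ["Function-al","-"], whereas the empty array gives the empty list.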

Prelude RegExChar.RegExOptsChar> "Function-alley" =~ "^Function(\\s+|-)[Aa]l" :: (String, Text.Regex.Base.RegexLike.MatchText String, String)
("",array (0,1) [(0,("Function-al",(0,11))),(1,("-",(8,1)))],"ley")

Prelude RegExChar.RegExOptsChar> :module +Text.Regex.Posix
Prelude RegExChar.RegExOptsChar Text.Regex.Posix> "Function-alley" Text.Regex.Posix.=~ "^Function(\\s+|-)[Aa]l" :: (String, Text.Regex.Base.RegexLike.MatchText String, String)
("",array (0,1) [(0,("Function-al",(0,11))),(1,("-",(8,1)))],"ley")

Prelude RegExChar.RegExOptsChar Text.Regex.Posix> :module -Text.Regex.Posix

By defining the return-type as a triple, composed from another exported type-synonym based on Data.Array.Array, one can extract the data from the previous two examples combined, sandwiched between any input-data preceding & following the match.

The value '("Function-alley",array (1,0) [ ],"")' would be returned, after a failure to match anything.

In this instance, no input-data preceded the match, but three Chars of input-data followed it.

The test was then repeated to demonstrate compatibility, with an alternative implementation of the same type-class, from the module "Text.Regex.Posix". Note that the infix match-operator was, in this instance, rather awkwardly fully qualified, to avoid ambiguity with its namesake from the module "RegExChar.RegExOptsChar".

Prelude RegExChar.RegExOptsChar> "Function-alley" =~ "^Function(\\s+|-)[Aa]l" :: (String, String, String)
("","Function-al","ley")

By defining the return-type as a simple triple of Strings, one can extract just the text; preceding the match, matching the regex, & following the match.

The value '("Function-alley","","")' would be returned, after a failure to match anything.

In this instance, no data preceded the match (as represented by the null String), but "ley" followed it.

Prelude RegExChar.RegExOptsChar> "Function-alley" =~ "^Function(\\s+|-)[Aa]l" :: (String, String, String, [String])
("","Function-al","ley",["-"])

By defining the return-type as a quadruple of three Strings followed by a String-list, one can extract the same result as before, plus any input-data consumed by capture-groups.

The value '("Function-alley","","",[ ])' would be returned, after a failure to match anything.

In this instance, no data preceded the match (as represented by the null String), but "ley" followed it, & the single Char '-' was consumed by the only capture-group.

Prelude RegExChar.RegExOptsChar> Text.Regex.Base.RegexLike.mrBefore (("Function-alley" =~ "^Function(\\s+|-)[Aa]l") :: Text.Regex.Base.RegexLike.MatchResult String)
""

Prelude RegExChar.RegExOptsChar> Text.Regex.Base.RegexLike.mrMatch (("Function-alley" =~ "^Function(\\s+|-)[Aa]l") :: Text.Regex.Base.RegexLike.MatchResult String)
"Function-al"

Prelude RegExChar.RegExOptsChar> Text.Regex.Base.RegexLike.mrAfter (("Function-alley" =~ "^Function(\\s+|-)[Aa]l") :: Text.Regex.Base.RegexLike.MatchResult String)
"ley"

Prelude RegExChar.RegExOptsChar> Text.Regex.Base.RegexLike.mrSubList (("Function-alley" =~ "^Function(\\s+|-)[Aa]l") :: Text.Regex.Base.RegexLike.MatchResult String)
["-"]

Prelude RegExChar.RegExOptsChar> Text.Regex.Base.RegexLike.mrSubs (("Function-alley" =~ "^Function(\\s+|-)[Aa]l") :: Text.Regex.Base.RegexLike.MatchResult String)
array (0,1) [(0,"Function-al"),(1,"-")]

Accessor-functions can be used to extract the required data explicitly from an instance of the record Text.Regex.Base.RegexLike.MatchResult. Whilst this provides an oasis of clarity amongst the previous queries, the record contains neither Text.Regex.Base.RegexLike.MatchOffset nor Text.Regex.Base.RegexLike.MatchLength.

Either the value '[ ]', or the value "array (1,0) [ ]", would be returned, after a failure to match anything.

The data returned is identical to the previous call, in which the return-type was defined as a quadruple, though there's now an additional option to access the input-data consumed by capture-groups as a "Data.Array.Array Int String".

Repeated Non-overlapping Matches

Specification of any of the following return-types results in repeated attempts to re-match against unconsumed input-data. To allow this, the input-data has been expanded, & the '^' has been dropped from the regex.

Prelude RegExChar.RegExOptsChar> "Function-alley Function Alley" =~ "Function(\\s+|-)[Aa]l" :: Int
2

By defining the return-type as an Int, one can determine the number of non-overlapping matches.

The value "0" would be returned, after a failure to match anything.

In this instance, the regex matched twice.

Prelude RegExChar.RegExOptsChar> "Function-alley Function Alley" =~ "Function(\\s+|-)[Aa]l" :: [Text.Regex.Base.RegexLike.MatchArray]
[array (0,1) [(0,(0,11)),(1,(8,1))],array (0,1) [(0,(15,11)),(1,(23,1))]]

By defining the return-type as a list of Text.Regex.Base.RegexLike.MatchArrays, one can extract the start-index & length of each repeated match, including any capture-groups.

The value "[ ]" would be returned, after a failure to match anything.

In this instance, the regex matched twice, resulting in a list of length two, & each match included one capture-group, resulting in two-element Text.Regex.Base.RegexLike.MatchArrays.

Prelude RegExChar.RegExOptsChar> "Function-alley Function Alley" =~ "Function(\\s+|-)[Aa]l" :: [Text.Regex.Base.RegexLike.MatchText String]
[array (0,1) [(0,("Function-al",(0,11))),(1,("-",(8,1)))],array (0,1) [(0,("Function Al",(15,11))),(1,(" ",(23,1)))]]

Prelude RegExChar.RegExOptsChar> :module +Text.Regex.Posix
Prelude RegExChar.RegExOptsChar Text.Regex.Posix> "Function-alley Function Alley" Text.Regex.Posix.=~ "Function(\\s+|-)[Aa]l" :: [Text.Regex.Base.RegexLike.MatchText String]
[array (0,1) [(0,("Function-al",(0,11))),(1,("-",(8,1)))],array (0,1) [(0,("Function Al",(15,11))),(1,(" ",(23,1)))]]

Prelude RegExChar.RegExOptsChar Text.Regex.Posix> :module -Text.Regex.Posix

The type "Text.Regex.Base.RegexLike.MatchText String" was used previously as the middle member of a triple, but by alternatively specifying a list of them, one can extract the data from the previous two examples combined.

The value "[ ]" would be returned, after a failure to match anything.

In this instance, as before, the regex matched twice, & for each of these, the capture-group consumed a single Char, but this time, as required, one additionally receives the start-index & length.

Since this result is about the most comprehensive that can be extracted from the match, the test was then repeated to demonstrate compatibility, with an alternative implementation of the same type-class, from the module "Text.Regex.Posix". Note that the infix match-operator was, in this instance, rather awkwardly fully qualified, to avoid ambiguity with its namesake from the module "RegExChar.RegExOptsChar".

Prelude RegExChar.RegExOptsChar> "Function-alley Function Alley" =~ "Function(\\s+|-)[Aa]l" :: [[String]]
[["Function-al","-"],["Function Al"," "]]

By defining the return-type as a list of String-lists, one can extract the text consumed by each match & any capture-groups; the same data as when the return-type was defined as "[Data.Array.Array Int String]", just in a different container.

The value "[ ]" would be returned, after a failure to match anything.

In this instance, the regex matched twice, resulting in a list of length two, & for each of these, the capture-groups consumed a single Char, resulting in sub-lists also of length two.

I think these examples adequately demonstrate the power of the interface, but also its hostility. Despite the wide choice in what information is returned, some information can't be obtained by specifying any of the possible return-types.

Prelude RegExChar.RegExOptsChar> "0x0B0110C5" =~ "0[Xx]([[:xdigit:]]{2})+" :: (String, Text.Regex.Base.RegexLike.MatchText String, String)
("",array (0,1) [(0,("0x0B0110C5",(0,10))),(1,("C5",(8,2)))],"")

Whilst I'm not claiming there's any point to this particular regex, its point, err, isn't the point. The capture-group matched four times, but only the last instance is available in the results. This conforms to the POSIX standard for regex-matching, but that's little consolation if your requirement was a list containing each of the individual byte-sized chunks of input-data matched by the repeatable capture-group. Knowing in advance the number of times the capture-group will be repeated in the optimum solution would allow one to re-write the regex with four separate capture-groups; but, aside from the regex becoming unnecessarily verbose, one doesn't necessarily know this in advance.
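One possible workaround, offered as a sketch rather than as anything provided by the interface, is to take the text of the whole match from the result & to divide it up afterwards. The functions "chunksOf" & "hexBytes" below are hypothetical helpers which assume only the package "base"; "hexBytes" further assumes that its argument is the complete matched text, including the two-Char prefix.

```haskell
-- | Divide a list into consecutive chunks of the specified length;
-- the final chunk is shorter when the list doesn't divide evenly.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0 || null xs = []
  | otherwise = chunk : chunksOf n rest
  where
    (chunk, rest) = splitAt n xs

-- | Drop the two-Char "0x" or "0X" prefix from the matched text,
-- then divide the remaining hex-digits into byte-sized pairs.
hexBytes :: String -> [String]
hexBytes = chunksOf 2 . drop 2
```

Applied to the matched text "0x0B0110C5", "hexBytes" returns ["0B","01","10","C5"]. This works only because each repetition has a fixed width; a capture-group of variable width would need a second match against the captured text instead.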