SRELL

SRELL (std::regex-like library) is a Unicode-aware regular expression template library for C++.

Features

Header-only and the same class design as std::regex

SRELL is a header-only template library and does not need any installation. SRELL has an ECMAScript (JavaScript) compatible regular expression engine wrapped into the same class design as std::regex. As the APIs are compatible, SRELL can be used in the same way as std::regex (or boost::regex, on which std::regex is based).

Unicode-specific implementation

SRELL has native support for Unicode:

  • UTF-8, UTF-16, and UTF-32 strings can be handled without any additional configurations.
  • '.' does not match half of a surrogate pair in UTF-16 strings, nor a lone code unit in UTF-8 strings.
  • Supplementary Characters can be specified in a character class such as [丈𠀋], and a range also can be specified in a character class such as [\u{1b000}-\u{1b0ff}].
  • When the case-insensitive match is performed, even characters having two lowercase letters for one uppercase letter such as Greek Σ (u+03c2 [ς] and u+03c3 [σ]) or having the third case called "titlecase" besides the uppercase and the lowercase such as DŽ (uppercase; DŽ, lowercase; dž and titlecase; Dž) are processed appropriately.

Consideration for ignore-case (icase) search

SRELL has been tuned up not to slow down remarkably when case-insensitive (icase) search is performed.

As std::regex was proposed early for C++0x (now C++11), it depends little on C++11's new features. So SRELL should be usable even with pre-C++11 compilers, as long as they interpret C++ templates accurately. (The oldest compiler on which I have confirmed that SRELL can be used is Visual C++ 2005 in Visual Studio 2005.)

Download

Notes

How to use

No preparation is required. Place srell*.h* (the three files srell.hpp, srell_ucfdata2.h, and srell_updata3.h) somewhere in your compiler's include path and include srell.hpp.

If you have used <regex>, you already know the basics of how to use SRELL.

//  Example 01:
#include <cstdio>
#include <string>
#include <iostream>
#include "srell.hpp"

int main()
{
    srell::regex e;     //  Regular expression object holder.
    srell::cmatch m;    //  Object which receives results.

    e = "\\d+[^-\\d]+"; //  Compile a regular expression string.
    if (srell::regex_search("1234-5678-90ab-cdef", m, e))
    {
        //  When using printf.
        const std::string s(m[0].first, m[0].second);
            //  The code above can be replaced with one of the following lines.
            //  const std::string s(m[0].str());
            //  const std::string s(m.str(0));
        std::printf("result: %s\n", s.c_str());

        //  When using iostream.
        std::cout << "result: " << m[0] << std::endl;
    }
    return 0;
}
	

As in this example, all classes and algorithms that belong to SRELL have been put within namespace "srell". Except for this point, the usage is basically identical to std::regex.
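
Because only the namespace differs, code can be written against a namespace alias and switched between the two libraries. The following is a minimal sketch, not part of SRELL: the alias rx and the helper first_match are hypothetical names. It uses std::regex so that it builds without SRELL; the assumption is that including "srell.hpp" and pointing the alias at srell gives the same result.

```cpp
#include <regex>
#include <string>

//  Hypothetical alias. Assumption: with SRELL, include "srell.hpp" instead
//  of <regex> and write `namespace rx = srell;` here.
namespace rx = std;

//  Hypothetical helper: returns the first substring of text matched by
//  pattern, or an empty string when nothing matches.
std::string first_match(const std::string &text, const std::string &pattern)
{
    const rx::regex e(pattern);     //  Compile the pattern.
    rx::smatch m;                   //  -smatch because text is a std::string.
    return rx::regex_search(text, m, e) ? m.str(0) : std::string();
}
```

For instance, first_match("1234-5678-90ab-cdef", "\\d+[a-z]+") returns "90ab".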

Please see also readme_en.txt included in the zip archive.

C++11 and later features

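New features introduced in C++11 and later that SRELL may use are as follows:

  • char16_t and char32_t (C++11)
  • initializer_list (C++11)
  • Move semantics (C++11)
  • char8_t and std::u8string (C++20)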

SRELL determines whether these features are available or not by using the following macros:

#ifdef __cpp_unicode_characters
  #ifndef SRELL_CPP11_CHAR1632_ENABLED
  #define SRELL_CPP11_CHAR1632_ENABLED  //  Do typedef for char16_t and char32_t.
  #endif
#endif

#ifdef __cpp_initializer_lists
  #include <initializer_list>
  #ifndef SRELL_CPP11_INITIALIZER_LIST_ENABLED
  #define SRELL_CPP11_INITIALIZER_LIST_ENABLED  //  Make it possible to pass initializer_list as an argument.
  #endif
#endif

#ifdef __cpp_rvalue_references
  #ifndef SRELL_CPP11_MOVE_ENABLED
  #define SRELL_CPP11_MOVE_ENABLED  //  Enable move semantics in constructors and operator=().
  #endif
#endif

#ifdef __cpp_char8_t
  #ifdef __cpp_lib_char8_t
  #define SRELL_CPP20_CHAR8_ENABLED 2   //  Both char8_t support and std::u8string support.
  #else
  #define SRELL_CPP20_CHAR8_ENABLED 1   //  Only char8_t support.
  #endif
#endif
		

If your compiler does not set the __cpp_* macros even though the corresponding features are actually available, you can turn on the feature(s) you need by defining the SRELL_CPP*_ENABLED macro(s) above before including SRELL.

Syntax

The expressions defined in the RegExp (Regular Expression) Objects section in ECMAScript 2023 Specification are available. By default, the u flag is assumed to be always set.

Starting with version 4.000, SRELL supports v flag mode, which is turned on by passing the unicodesets flag to the pattern compiler (srell::basic_regex). For the details of the v mode, see the proposal page.

The detailed list of supported expressions is as follows:

List of Regular Expressions being available in SRELL
Characters
.

Matches any character but LineTerminator (i.e., any code point but U+000A, U+000D, U+2028, and U+2029).
If the dotall option flag is passed to the pattern compiler, '.' matches every code point (i.e., equivalent to [\u{0}-\u{10ffff}]). Note that when dotall is specified, .* matches the whole of the remaining string.

Note: The dotall option flag is available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0. This corresponds to //s in Perl 5.

\0

Matches NULL (\u0000).

\t

Matches Horizontal Tab (\u0009).

\n

Matches Line Feed (\u000a).

\v

Matches Vertical Tab (\u000b).

\f

Matches Form Feed (\u000c).

\r

Matches Carriage Return (\u000d).

\cX

Matches a control character corresponding to ((the code point value of X) & 0x1f) where X is one of [A-Za-z].
If \c is not followed by one of A-Z or a-z, then error_escape is thrown.
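
The arithmetic behind \cX can be sketched as follows. This is an illustration only: control_escape_value is a hypothetical helper, not an SRELL function, and it assumes an ASCII-compatible execution character set. It returns -1 for the inputs for which SRELL is described above as throwing error_escape.

```cpp
//  Hypothetical helper: computes the code point that \cX denotes,
//  i.e. (the code point value of X) & 0x1f.
//  Returns -1 when X is not one of [A-Za-z].
int control_escape_value(char x)
{
    const bool is_letter = (x >= 'A' && x <= 'Z') || (x >= 'a' && x <= 'z');
    return is_letter ? (x & 0x1f) : -1;
}
```

For instance, \cJ and \cj both give 0x0a (Line Feed), and \cM gives 0x0d (Carriage Return).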

\\

Matches a backslash (\u005c) itself.

\xHH

Matches a character whose code unit value in UTF-16 is identical to the value represented by two hexadecimal digits HH.
If \x is not followed by two hexadecimal digits, error_escape is thrown.

Because code unit values 0x00-0xFF in UTF-16 represent U+0000-U+00FF respectively, HH in this expression virtually represents a code point.

\uHHHH

Matches a character whose Unicode code point is identical to the value represented by four hexadecimal digits HHHH.
If \u is not followed by four hexadecimal digits, error_escape is thrown.

SRELL 2.500-: When sequential \uHHHH escapes represent a valid surrogate pair in UTF-16, they are interpreted as a Unicode code point value. For example, /\uD842\uDF9F/ is interpreted as being equivalent to /\u{20B9F}/.
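
How a surrogate pair maps to a single code point can be sketched with the standard UTF-16 formula. The helper combine_surrogates below is hypothetical and only illustrates the arithmetic; it is not taken from SRELL's source.

```cpp
#include <cstdint>

//  Hypothetical helper: combines a UTF-16 high surrogate (0xD800-0xDBFF)
//  and a low surrogate (0xDC00-0xDFFF) into one code point.
//  Returns 0 if the two values do not form a valid pair.
std::uint32_t combine_surrogates(std::uint32_t high, std::uint32_t low)
{
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return 0;
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

With the pair from the example above, combine_surrogates(0xD842, 0xDF9F) yields 0x20B9F.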

\u{H...}

Matches a character whose Unicode code point is identical to the value represented by one or more hexadecimal digits H....
If the inside of {} in \u{...} is not one or more hexadecimal digits, a value represented by the hexadecimal digits exceeds the max value of Unicode code points (0x10FFFF), or the closing curly bracket '}' does not exist, then error_escape is thrown.

Note: This expression has been available since ECMAScript 6.0. In SRELL up to version 2.001, what could be specified as H... in \u{H...} was "one to six hexadecimal digits". This is because this feature was implemented based on the proposal document, and the change that was made to the text when the proposal was approved formally was overlooked.

\

When a \ is followed by one of ^ $ . * + ? ( ) [ ] { } | /, the sequence represents the following character itself. This removes the special meaning that the character usually has in a regular expression and makes the pattern compiler interpret it literally. (The reason why '/' is also included in the list is probably that a sequence of regular expressions is enclosed by // in ECMAScript.)
In the character class mentioned below, '-' also becomes a member of this group in addition to the fourteen characters above and can be written as "\-".

Note: In the u flag mode of ECMAScript, all the combinations of \ and some-letter are reserved. You cannot expect that if \ SOME-LETTER does not have any special meaning, the sequence is treated as SOME-LETTER itself. An arbitrary combination of \ and something causes a syntax error.

Any character but
^$.*+?()[]{}|\/

Represents that character itself.

Alternatives
A|B

Matches a sequence of regular expressions A or B. An arbitrary number of '|' can be used to separate expressions, such as /abc|def|ghi?|jkl?/.
Each sequence of regular expressions separated by '|' is tried from left to right, and only the sequence that first succeeds in matching is adopted.
For example, when matching /abc|abcdef/ against "abcdef", the result is "abc".
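
The left-to-right rule can be observed with a short sketch. It uses std::regex, whose default ECMAScript grammar follows the same rule, so that it builds without SRELL; the assumption is that srell::regex gives the same results. The helper name alternation_match is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: returns what the first match of pattern in text
//  covers, or an empty string when there is no match.
std::string alternation_match(const std::string &pattern, const std::string &text)
{
    const std::regex e(pattern);    //  ECMAScript grammar is the default.
    std::smatch m;
    return std::regex_search(text, m, e) ? m.str(0) : std::string();
}
```

Against "abcdef", the pattern "abc|abcdef" stops at "abc", whereas "abcdef|abc" matches the whole string, so longer alternatives should be written first when the longest match is wanted.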

Character Class
[]

A character class. A set of characters:

  • [ABC] matches 'A', 'B', and 'C'.
  • [^DEF] matches any character but 'D', 'E', 'F'. When the first character in [] is '^', any character not included in [] is matched. I.e., '^' as the first character means negation.
  • [G^H] matches 'G', '^', and 'H'. '^' that is not the first character in [] is treated as an ordinary character.
  • [I-K] matches 'I', 'J', and 'K'. The sequence CH1-CH2 represents "any character in the range from the Unicode code point of CH1 to the code point of CH2 inclusive".
  • [-LM] matches '-', 'L', and 'M'. '-' that does not fall under the condition above is treated as an ordinary character.
  • [N-P-R] matches 'N', 'O', 'P', '-', and 'R'; does not match 'Q'. '-' following a range sequence represents '-' itself.
  • [S\-U] matches 'S', '-', and 'U'. '-' escaped by \ is treated as '-' itself ("\-" is available only in the character class).
  • [.|({] matches '.', '|', '(', and '{'. These characters lose their special meanings in [].
  • [] is the empty class. It does not match any code point. This expression always makes matching fail whenever it occurs.
  • [^] is the complementary set of the empty class. Thus it matches any code point. The same as [\0-\u{10FFFF}].

Examples when case insensitive match is performed (when the icase flag is set):

  • [E-F] matches 'E', 'F', 'e', and 'f'; all the characters in the range from 'E' (u+0045) to 'F' (u+0046) inclusive, and the ones regarded as the same character as any in this range when Unicode case folding is applied.
  • [E-f] matches 'A' to 'Z', 'a' to 'z', '[', '\', ']', '^', '_', '`', 'ſ' (u+017f), and 'K' (u+212a, KELVIN SIGN); all the characters in the range from 'E' (u+0045) to 'f' (u+0066) inclusive, and the ones regarded as the same character as any in this range when Unicode case folding is applied.

Although ']' immediately after '[' is counted as a ']' itself in Perl's regular expression, there is no such special treatment in ECMAScript's RegExp. To include ']' in a character class, it always needs to be escaped as "\]", by prefixing a '\' to ']'.

If regular expressions contain a mismatched '[' or ']', error_brack is thrown. If regular expressions contain an invalid character range such as [b-a], error_range is thrown.
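
Catching such an error looks like the following sketch. It is written against std::regex so that it builds without SRELL; because the class design is the same, the assumption is that srell::regex, srell::regex_error, and srell::regex_constants can be substituted. The helper name compile_fails_with is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: returns true if compiling pattern throws a
//  regex_error whose code() equals expected.
bool compile_fails_with(const std::string &pattern,
                        std::regex_constants::error_type expected)
{
    try
    {
        const std::regex e(pattern);
        (void)e;    //  Only the compilation itself is of interest.
    }
    catch (const std::regex_error &err)
    {
        return err.code() == expected;
    }
    return false;   //  The pattern compiled without an error.
}
```

Here compile_fails_with("[b-a]", std::regex_constants::error_range) is true, while the valid class "[a-b]" compiles and gives false.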

In the v mode (when the unicodesets flag is specified), in addition to the features explained above (called union), the following features are available in the character class:

  • && operator: The sequence CC1&&CC2 represents "any character that is included in both the character class CC1 and the character class CC2" (i.e., intersection). For example, [\p{sc=Latin}&&\p{Ll}] matches any character that belongs to the Latin script (\p{sc=Latin}) and is a lowercase letter (\p{Ll}).
  • -- operator: The sequence CC1--CC2 represents "any character that is included in the character class CC1, but is NOT included in the character class CC2" (i.e., difference/subtraction). For example, [\p{sc=Latin}--\p{Ll}] matches any character that belongs to the Latin script (\p{sc=Latin}) and is NOT a lowercase letter (\p{Ll}).
  • Containing of strings: By using \q{...}, strings can be contained in a character class. For example, [a-z\q{ch|th|ph}] matches any single character in the range [a-z], or the sequences ch, th, or ph. When strings are included in a character class, it is ensured that the longest strings are matched first. Consequently, the previous example is virtually equivalent to (?:ch|th|ph|[a-z]).
    \q{...} can be used as an operand of the operations (union, intersection, and difference).
  • [] can be nested and used as an operand of the operations. For example, [\p{sc=Latin}--[a-z]] matches any character that belongs to the Latin script (\p{sc=Latin}) and is NOT in the range [a-z].

Per level of [...], only one type of operator can be used. (Note that in the following examples, A, B, C, D represent a character class each):

  • [AB--CD]: Error. SRELL throws error_operator, because after the union operation was used at AB, a different type of operator, -- appeared.
  • [[AB]--[CD]]: OK.
  • [A[B--C]D]: OK.
  • [\p{sc=Latin}--\p{Lu}--[a-z]]: OK. Using one type of operator multiple times does not cause an error.

In the v mode, the eight characters ( ) [ { } / - | cannot be written directly in a character class. They need to be escaped by placing \ in front of them; otherwise, SRELL throws error_noescape. (Regardless of the v mode, ] always needs to be escaped in the character class.)

Moreover, the following 18 double punctuators are reserved in the v mode for future use. They cannot be written in []. If written, SRELL throws error_operator.

  • !!, ##, $$, %%, **, ++, ,,, ..,
  • ::, ;;, <<, ==, >>, ??, @@, ^^,
  • ``, ~~

Predefined Character Classes
\d

Equivalent to [0-9]. This expression can be used also in a character class, such as [\d!"#$%&'()].

\D

Equivalent to [^0-9]. This can be used in a character class, as well as \d.

\s

Equivalent to [ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]. This can be used in a character class, too.

Note: Strictly speaking, this consists of the union of WhiteSpace and LineTerminator. If code points are added to category Zs in a future version of Unicode, the set of code points that \s matches grows accordingly.

\S

Equivalent to [^ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]. This can be used in a character class, too.

\w

Equivalent to [0-9A-Za-z_]. This can be used in a character class, too.

\W

Equivalent to [^0-9A-Za-z_]. This can be used in a character class, too.

\p{...}

Matches any character that has the Unicode property specified in "...". For example, \p{scx=Latin} matches every character defined as a Latin letter in Unicode. This expression can be used also in a character class.

For the details about what can be specified in "...", see the tables in the latest draft of the ECMAScript specification.

In the v mode, properties of strings (Unicode properties that match sequences of characters) are also supported. They can be used also in the character class, except negated character classes ([^...]). If used in a negated character class, SRELL throws error_complement.

Note: This expression is available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.
Properties of strings are available only in the v mode in SRELL 4.000 and later.

\P{...}

Matches any character that does not have the Unicode property specified in "...". This can be used in a character class, too.

Unlike \p above, even in the v mode \P{...} supports only properties that match single characters and does not support properties of strings. If a property name that represents a property of strings is specified in \P{...}, SRELL throws error_complement.

Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.
When icase (case-insensitive) matching is performed, \P{...} may represent different character sets in the u mode and in the v mode. See here for details.

Quantifiers
*
*?

Repeats matching the preceding expression 0 or more times. * tries to match as many as possible, whereas *? tries to match as few as possible.

If this appears without a preceding expression, error_badrepeat is thrown. This applies to the following five also.

+
+?

Repeats matching the preceding expression 1 or more time(s). + tries to match as many as possible, whereas +? tries to match as few as possible.

?
??

Repeats matching the preceding expression 0 or 1 time(s). ? tries to match as many as possible, whereas ?? tries to match as few as possible.

{n}

Repeats matching the preceding expression exactly n times.

If regular expressions contain a mismatched '{' or '}', error_brace is thrown. This applies to the following two also.

{n,}
{n,}?

Repeats matching the preceding expression at least n times. {n,} tries to match as many as possible, whereas {n,}? tries to match as few as possible.

{n,m}
{n,m}?

Repeats matching the preceding expression at least n times and at most m times. {n,m} tries to match as many as possible, whereas {n,m}? tries to match as few as possible.

If an invalid range in {} is specified like {3,2}, error_badbrace is thrown.
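
The difference between the greedy and non-greedy forms can be seen by comparing match lengths. The sketch below uses std::regex, which supports the same quantifiers in its ECMAScript grammar, so that it builds without SRELL; the assumption is that srell::regex gives the same lengths. The helper name matched_length is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: length of the first match of pattern in text,
//  or -1 when there is no match.
long matched_length(const std::string &pattern, const std::string &text)
{
    const std::regex e(pattern);
    std::smatch m;
    return std::regex_search(text, m, e) ? static_cast<long>(m.length(0)) : -1L;
}
```

Against "aaa", "a+" matches three characters and "a+?" one; "a*?" and "a??" match the empty string at the start; "a{2,}?" matches two characters.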

Brackets and backreference
(...)

Grouping of regular expressions and capturing a matched string. Every pair of capturing brackets is assigned a number starting from 1, in the order in which its left round bracket '(' appears from left to right in the entire sequence of regular expressions. The substring matched by the regular expressions enclosed by the pair can be referenced by that number from other positions in the expressions.

If regular expressions contain a mismatched '(' or ')', error_paren is thrown.

When a pair of capturing round brackets is itself bound to a quantifier, or is inside another pair of brackets that has a quantifier, the string captured by the pair is cleared whenever a repetition happens. Thus, a captured string cannot be carried over to the next loop. For example, when /(?:(a)|(b))+/ matches something, one of \1 and \2 is always empty.

\N (N is a positive integer)

Backreference. When '\' is followed by a number that begins with 1-9, it is regarded as a backreference to the string captured by the (...) assigned the corresponding number, and matching is performed with that string. If a pair of brackets assigned the number N does not exist in the entire sequence of regular expressions, error_backref is thrown.

For example, /(TO|to)..\1/ matches "TOMATO" or "tomato", but does not match "Tomato".
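
The example above can be tried with the following sketch. It uses std::regex, which supports numbered backreferences in its ECMAScript grammar, so that it builds without SRELL; the assumption is that srell::regex behaves the same for this pattern. The function name contains_tomato is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: true if text contains "TO" or "to", then any two
//  characters, then the same two letters that the first group captured.
bool contains_tomato(const std::string &text)
{
    static const std::regex e("(TO|to)..\\1");
    return std::regex_search(text, e);
}
```

contains_tomato returns true for "TOMATO" and "tomato" but false for "Tomato", because \1 must repeat exactly what the group captured.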

In RegExp of ECMAScript, capturing brackets are not required to appear prior to their corresponding backreference(s). So expressions such as /\1(abc)/ and /(abc\1)/ are valid and not treated as an error.

When a pair of brackets does not capture anything, it is treated as having captured the special undefined value. A backreference to undefined is equivalent to an empty string, so matching with it always succeeds.

(?<NAME>...)

Identical to (...) except that a substring matched with the regular expressions inside a pair of brackets can be referenced by the name NAME as well as the number assigned to the pair of the brackets.

For example, in the case of /(?<year>\d+)\/(?<month>\d+)\/(?<day>\d+)/, the string captured by the first pair of parentheses can be referenced by either \1 or \k<year>.

Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.

\k<NAME>

Refers to a substring captured by the pair of brackets named NAME. If the corresponding pair of brackets does not exist in the entire sequence of regular expressions, error_backref is thrown.

Note: Available since SRELL 2.000, following the enhancement of RegExp in ES2018/ES9.0.

(?:...)

Grouping. Unlike (...), this does not capture anything; it only groups. So no number for backreference is assigned.
For example, /tak(?:e|ing)/ matches "take" or "taking", but does not capture anything for backreference. Usually, this is somewhat faster than (...).

Assertions
^

Matches at the beginning of the string. When the multiline option is specified, ^ also matches every position immediately after a LineTerminator.

$

Matches at the end of the string. When the multiline option is specified, $ also matches every position immediately before a LineTerminator.

\b

Out of a character class: matches a boundary between \w and \W.

Inside a character class: matches Backspace (\u0008).

\B

Out of a character class: matches any boundary where \b does not match.

Inside a character class: error_escape is thrown.

(?=...)

A zero-width positive lookahead assertion. For example, /a(?=bc|def)/ matches "a" followed by "bc" or "def", but only "a" is counted as the matched string.

(?!...)

A zero-width negative lookahead assertion. For example, /a(?!bc|def)/ matches "a" followed by neither "bc" nor "def".

Incidentally, the expression /&(?!amp;|lt|gt|#)/ is useful for finding and escaping bare '&'s when source code that contains many '&'s is copied into an HTML file.
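
Such an escaping pass can be sketched as follows. The block uses std::regex and std::regex_replace, which support negative lookahead in the ECMAScript grammar, so that it builds without SRELL; the assumption is that srell::regex and srell::regex_replace can be substituted. The function name escape_bare_ampersands is hypothetical.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: replaces every '&' that does not already start
//  "&amp;", "&lt", "&gt", or "&#" with "&amp;".
std::string escape_bare_ampersands(const std::string &text)
{
    static const std::regex e("&(?!amp;|lt|gt|#)");
    //  In the default (ECMAScript) format string, a lone '&' is literal.
    return std::regex_replace(text, e, "&amp;");
}
```

Given "a & b &amp; c &lt; d", only the first '&' is rewritten, producing "a &amp; b &amp; c &lt; d".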

(?<=...)

A zero-width positive lookbehind assertion. For example, /(?<=bc|de)a/ matches "a" following "bc" or "de", but only "a" is counted as the matched string and "bc" or "de" is not.

Note: In SRELL 1, the number of characters matched by the regular expressions inside a lookbehind assertion must be fixed, as in /(?<=abc|def)/ or /(?<=\d{2})/; otherwise error_lookbehind is thrown. This restriction does not exist in SRELL 2.000 or later.

(?<!...)

A zero-width negative lookbehind assertion. For example, /(?<!bc|de)a/ matches "a" following neither "bc" nor "de".

Note: In SRELL 1, the number of characters matched by the regular expressions inside a lookbehind assertion must be fixed; otherwise error_lookbehind is thrown. This restriction does not exist in SRELL 2.000 or later.

Embedded flag modifiers
(?ism-ism)

Option flag modifiers. (?i) behaves as if the icase flag is specified, similarly, (?m) and (?s) correspond to the multiline flag and the dotall flag, respectively.
These can be combined and used like (?ism).
When preceded by - like (?-m), the corresponding feature(s) is/are disabled.

This expression can be used only at the beginning of a regular expression (the same as Python 3.11). If used elsewhere, error_modifier is thrown.

Note: Available since SRELL 4.007. This feature is an extension and not part of the ECMAScript specification. It can be disabled by defining the SRELL_NO_UBMOD macro.

Footnotes

Extensions to std::regex

Unicode support

For Unicode support, SRELL has the following typedefs and extensions that do not exist in <regex>:

Typedef list of three basic classes (basic_regex, match_results, sub_match)

Prefix    Interpretation    Specialised with  basic_regex  match_results               sub_match
          of string                           (-regex)     (-cmatch, -smatch)          (-csub_match, -ssub_match)
u8-       UTF-8             char8_t or char   u8regex      u8cmatch, u8smatch          u8csub_match, u8ssub_match
u16-      UTF-16            char16_t          u16regex     u16cmatch, u16smatch        u16csub_match, u16ssub_match
u32-      UTF-32            char32_t          u32regex     u32cmatch, u32smatch        u32csub_match, u32ssub_match
u8c-      UTF-8             char              u8cregex     u8ccmatch, u8csmatch        u8ccsub_match, u8cssub_match
u16w-     UTF-16            wchar_t           u16wregex    u16wcmatch, u16wsmatch      u16wcsub_match, u16wssub_match
u32w-     UTF-32            wchar_t           u32wregex    u32wcmatch, u32wsmatch      u32wcsub_match, u32wssub_match
u1632w-   UTF-16 or UTF-32  wchar_t           u1632wregex  u1632wcmatch, u1632wsmatch  u1632wcsub_match, u1632wssub_match

Notes on the table:

  • u8-: char8_t is used only when char8_t is supported (detected by checking if __cpp_char8_t or SRELL_CPP20_CHAR8_ENABLED is defined). Otherwise, these types are just aliases of the u8c- types.
  • u16-, u32-: Defined only when char16_t and char32_t are supported (detected by checking if __cpp_unicode_characters or SRELL_CPP11_CHAR1632_ENABLED is defined).
  • u16w-: Defined only when 0xFFFF <= WCHAR_MAX < 0x10FFFF.
  • u32w-: Defined only when WCHAR_MAX >= 0x10FFFF.
  • u1632w-: Aliases of the u16w- or u32w- types above, depending on the value of WCHAR_MAX.

The meaning of each prefix is as follows:

  • u8: meaning changes depending on whether your compiler supports the char8_t type. It is detected by checking whether the __cpp_char8_t or SRELL_CPP20_CHAR8_ENABLED macro is defined or not:
    • If char8_t supported: handles an array of char8_t and an instance of std::u8string as a UTF-8 string.
    • If char8_t not supported: identical to the "u8c-" prefix. Defined as mere aliases of "u8c-" types shown below.
    By varying as above, "u8-" prefix types are always suitable for UTF-8 string literals (u8"...") in code for both before and after C++20, in which the type of u8"..." was changed from char to char8_t.
  • u16: handles an array of char16_t and an instance of std::u16string as a UTF-16 string. Suitable for UTF-16 string literals (u"...").
  • u32: handles an array of char32_t and an instance of std::u32string as a UTF-32 string. Suitable for UTF-32 string literals (U"...").
  • u8c: handles an array of char and an instance of std::string as a UTF-8 string. (Introduced in SRELL version 2.100. Until version 2.002, the "u8-" prefix was used for this kind of type.)
  • u16w: handles an array of wchar_t and an instance of std::wstring as a UTF-16 string. (Defined only when WCHAR_MAX is equal to or more than 0xFFFF and less than 0x10FFFF.)
  • u32w: handles an array of wchar_t and an instance of std::wstring as a UTF-32 string. (Defined only when WCHAR_MAX is equal to or more than 0x10FFFF.)
  • u1632w: When 0xFFFF <= WCHAR_MAX < 0x10FFFF, identical to u16w- above. When WCHAR_MAX >= 0x10FFFF, identical to u32w- above. Unlike u16w- and u32w-, these u1632w- types are always defined on condition that WCHAR_MAX >= 0xFFFF. Types of this prefix are available in SRELL version 2.930 and later.

* For u16w- types and u32w- types, only one of the two is provided, depending on the value of WCHAR_MAX. As I realised later that this affects the portability of code, u1632w- types were introduced in SRELL 2.930.
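
The WCHAR_MAX conditions above can be checked at compile time in user code. The following is a minimal sketch that mirrors those conditions; wide_prefix is a hypothetical helper, not an SRELL function, and only reports which family of wchar_t typedefs the conditions select on the current platform.

```cpp
#include <cwchar>   //  WCHAR_MAX

//  Hypothetical helper: names the wchar_t typedef family that the
//  conditions described above select on this platform.
const char *wide_prefix()
{
#if WCHAR_MAX >= 0x10FFFF
    return "u32w";  //  wchar_t can hold any code point: UTF-32.
#elif WCHAR_MAX >= 0xFFFF
    return "u16w";  //  wchar_t holds UTF-16 code units.
#else
    return "";      //  Neither condition holds; no u16w-/u32w- types.
#endif
}
```

Code that must build on both kinds of platform can use the u1632w- types instead of branching like this.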

Although omitted from the table above, regex_iterator, regex_iterator2, and regex_token_iterator also have typedefs with the u(8c?|16w?|32w?|1632w) prefixes, based on the same rules.

Basic use of Unicode support versions is as follows:

srell::u8regex u8re(u8"UTF-8 Regular Expression");
srell::u8cmatch u8cm;   //  -smatch instead of -cmatch if target string is of basic_string type. And so on.
std::printf("%s\n", srell::regex_search(u8"UTF-8 target string", u8cm, u8re) ? "found!" : "not found...");

srell::u16regex u16re(u"UTF-16 Regular Expression");
srell::u16cmatch u16cm;
std::printf("%s\n", srell::regex_search(u"UTF-16 target string", u16cm, u16re) ? "found!" : "not found...");

srell::u32regex u32re(U"UTF-32 Regular Expression");
srell::u32cmatch u32cm;
std::printf("%s\n", srell::regex_search(U"UTF-32 target string", u32cm, u32re) ? "found!" : "not found...");

srell::u1632wregex u1632wre(L"UTF-16 or UTF-32 Regular Expression");
srell::u1632wcmatch u1632wcm;
std::printf("%s\n", srell::regex_search(L"UTF-16 or UTF-32 target string", u1632wcm, u1632wre) ? "found!" : "not found...");

srell::u16wregex u16wre(L"UTF-16 Regular Expression");
srell::u16wcmatch u16wcm;
std::printf("%s\n", srell::regex_search(L"UTF-16 target string", u16wcm, u16wre) ? "found!" : "not found...");
    //  The three lines above and the ones below are mutually exclusive.
    //  If wchar_t is less than 21-bit, the ones above are available;
    //  if equal to or more than, the ones below are available.
srell::u32wregex u32wre(L"UTF-32 Regular Expression");
srell::u32wcmatch u32wcm;
std::printf("%s\n", srell::regex_search(L"UTF-32 target string", u32wcm, u32wre) ? "found!" : "not found...");
			
syntax_option_type

The following flag options have been added:

namespace regex_constants
{
    static const syntax_option_type dotall; //  (Since SRELL 2.000)
        //  Single-line mode. If specified, the behaviour of '.' is changed.
        //  This flag option corresponds to //s of ECMAScript and Perl 5.

    static const syntax_option_type unicodesets; //  (Since SRELL 4.000)
        //  For using v mode.
}
			

Like the other values of the syntax_option_type type, these values are also defined in basic_regex.

error_type

The following error type values have been added:

namespace regex_constants
{
    static const error_type error_utf8; //  (Since SRELL 2.630)
        //  Invalid UTF-8 sequence was found in a regular expression passed to basic_regex.

    static const error_type error_property; //  (Since SRELL 3.010)
        //  Unknown or unsupported name or value was specified in \p{...} or \P{...}.

    static const error_type error_noescape; //  (Since SRELL 4.000; v mode only)
        //  ( ) [ ] { } / - \ | need to be escaped by using \ in the character class.

    static const error_type error_operator; //  (Since SRELL 4.000; v mode only)
        //  Operation error in the character class. Reserved double punctuators are
        //  found, or different operations are used at the same level of [].

    static const error_type error_complement; //  (Since SRELL 4.000; v mode only)
        //  Complement of strings cannot be used. \P{POSName}, [^\p{POSName}],
        //  or [^\q{strings}] where POSName is a name of property-of-strings was found.

    static const error_type error_modifier; //  (Since SRELL 4.007)
        //  The expression contained the unbounded form of flag modifiers ((?ism-ism))
        //  at a position other than the beginning, or a specific flag modifier appeared
        //  more than once in one pair of brackets.
}
			
No throw/exception mode

Starting with version 4.034, when you #define SRELL_NO_THROW, SRELL does not throw exceptions of the regex_error type. In this mode, an error that would have been thrown during the previous pattern compilation can be obtained by calling basic_regex::ecode(), and an error that would have been thrown during the previous search or match can be obtained by calling match_results::ecode().

In this mode, algorithm functions (regex_search, regex_match) return false when an error occurs.

Defining SRELL_NO_THROW cannot prevent std::vector used in SRELL from throwing std::bad_alloc etc.

Since SRELL 2.600, overloads of regex_search() that take three BidirectionalIterators as parameters have been added:

template <class BidirectionalIterator, class Allocator, class charT, class traits>
bool regex_search(
    BidirectionalIterator first,
    BidirectionalIterator last,
    BidirectionalIterator lookbehind_limit,
    match_results<BidirectionalIterator, Allocator> &m,
    const basic_regex<charT, traits> &e,
    const regex_constants::match_flag_type flags = regex_constants::match_default);

template <class BidirectionalIterator, class charT, class traits>
bool regex_search(
    BidirectionalIterator first,
    BidirectionalIterator last,
    BidirectionalIterator lookbehind_limit,
    const basic_regex<charT, traits> &e,
    const regex_constants::match_flag_type flags = regex_constants::match_default);
			

The third iterator, lookbehind_limit, specifies how far back regex_search() may read the sequence when a lookbehind assertion is performed.

In other words, this three-iterator version starts searching at the position first within the range [lookbehind_limit, last).

const char text[] = "0123456789abcdefghijklmnopqrstuvwxyz";
const char* const begin = text;
const char* const end = text + std::strlen(text);
const char* const first = text + 10;    //  Sets the position of 'a'.
const srell::regex re("(?<=^\\d+).");
srell::cmatch match;

std::printf("matched %d\n", srell::regex_search(first, end, match, re));
    //  Does not match as lookbehind is performed only in the range [first, end).

std::printf("matched %d\n", srell::regex_search(first, end, begin, match, re));
    //  Matches because regex_search is allowed to look behind as far as begin.
    //  I.e., in the three-iterator version, searching against the sequence
    //  [begin, end) begins at first in the sequence.
			

As in the example shown above, in the three-iterator version, ^ matches at begin (the third iterator) instead of first (the first iterator).

When a three-iterator version is called, the position() member of match_results returns the distance from the position passed as the third iterator, while prefix().first of match_results is set to the position passed as the first iterator.

* By introducing these three-iterators overloads, the way used in SRELL 2.300~2.500 has been removed.

match_results

Overload functions for the named capture feature

In SRELL 2.000 and later, the following member functions have been added to the match_results class for the named capture feature:

difference_type length(const string_type &sub) const;
difference_type position(const string_type &sub) const;
string_type str(const string_type &sub) const;
const_reference operator[](const string_type &sub) const;

//  The following ones are available in SRELL 2.650 and later.
difference_type length(const char_type *sub) const;
difference_type position(const char_type *sub) const;
string_type str(const char_type *sub) const;
const_reference operator[](const char_type *sub) const;
				

Basically, these can be used in the same way as the member functions of the same names in <regex>. The only difference is that they take a group name string as the parameter instead of the group number corresponding to a pair of parentheses.

//  Example.
srell::regex e("-(?<digits>\\d+)-");
srell::cmatch m;

if (srell::regex_search("1234-5678-90ab-cdef", m, e))
{
    const std::string by_number(m.str(1));      //  access by paren's number. a feature of std::regex.
    const std::string by_name(m.str("digits")); //  access by paren's name. an extension of SRELL.

    std::printf("results: bynumber=%s byname=%s\n", by_number.c_str(), by_name.c_str());
}
//  results: bynumber=5678 byname=5678
				

Until version 4.033: When a group name that does not exist in the regular expression is passed, error_backref is thrown.
Version 4.034 and later: When a group name that does not exist in the regular expression is passed, a reference to an instance of sub_match whose matched member variable is false is returned.

ecode() const

Returns the error code of the error that would have been thrown during the previous search or match. This member function is intended for the no throw/exception mode supported since 4.034.
The returned value is of the error_type type, the same type that regex_error::code() returns.

If no error occurred in the previous call to an algorithm function, it returns 0.

//  std::regex compatible error handling.
try {
    srell::regex re("a*");
    srell::smatch m;

    regex_search(text, m, re);
} catch (const srell::regex_error &e) {
    //  Error handling.
}
				
//  Error handling in no throw/exception mode.
srell::regex re("a*");
srell::smatch m;

if (!regex_search(text, m, re)) {
    if (m.ecode()) {    //  If not 0, an error occurred.
        //  Error handling.
    }
}
				
basic_regex

Beginning with SRELL 4.009, the following member functions have been added to the basic_regex class of SRELL as extensions.

  • match(): Executes matching like srell::regex_match().
  • search(): Executes searching like srell::regex_search().
  • replace(): Unlike srell::regex_replace(), which does not modify the original string, this actually replaces subsequences that match the regular expression in the passed string with a new string.
  • split(): Splits a string into multiple substrings.

match() const

Does matching like srell::regex_match(). Supposing an instance of the basic_regex type is re, re.match(...) is a shorthand for srell::regex_match(..., re, ...).

The number of overload functions and their orders of parameters are the same as the ones of regex_match(), except that the parameter re is omitted.

template <typename BidirectionalIterator, typename Allocator>
bool match(
	const BidirectionalIterator begin,
	const BidirectionalIterator end,
	match_results<BidirectionalIterator, Allocator> &m,
	const regex_constants::match_flag_type flags = regex_constants::match_default) const;
//  The same as srell::regex_match(begin, end, m, re, flags)

template <typename BidirectionalIterator>
bool match(
	const BidirectionalIterator begin,
	const BidirectionalIterator end,
	const regex_constants::match_flag_type flags = regex_constants::match_default) const;
//  The same as srell::regex_match(begin, end, re, flags)

template <typename Allocator>
bool match(
	const charT *const str,
	match_results<const charT *, Allocator> &m,
	const regex_constants::match_flag_type flags = regex_constants::match_default) const;
//  The same as srell::regex_match(str, m, re, flags)

bool match(
	const charT *const str,
	const regex_constants::match_flag_type flags = regex_constants::match_default) const;
//  The same as srell::regex_match(str, re, flags)

template <typename ST, typename SA, typename MA>
bool match(
	const std::basic_string<charT, ST, SA> &s,
	match_results<typename std::basic_string<charT, ST, SA>::const_iterator, MA> &m,
	const regex_constants::match_flag_type flags = regex_constants::match_default) const;
//  The same as srell::regex_match(s, m, re, flags)

template <typename ST, typename SA>
bool match(
	const std::basic_string<charT, ST, SA> &s,
	const regex_constants::match_flag_type flags = regex_constants::match_default) const;
//  The same as srell::regex_match(s, re, flags)
				

search() const

Does searching like srell::regex_search(). Supposing an instance of the basic_regex type is re, re.search(...) is a shorthand for srell::regex_search(..., re, ...).

The number of overload functions and their orders of parameters are the same as the ones of regex_search(), except that the parameter re is omitted.

template <typename BidirectionalIterator, typename Allocator>
bool search(
	const BidirectionalIterator begin,
	const BidirectionalIterator end,
	match_results<BidirectionalIterator, Allocator> &m,
	const regex_constants::match_flag_type flags = regex_constants::match_default
) const;
//  The same as srell::regex_search(begin, end, m, re, flags)

template <typename BidirectionalIterator>
bool search(
	const BidirectionalIterator begin,
	const BidirectionalIterator end,
	const regex_constants::match_flag_type flags = regex_constants::match_default
) const;
//  The same as srell::regex_search(begin, end, re, flags)

template <typename Allocator>
bool search(
	const charT *const str,
	match_results<const charT *, Allocator> &m,
	const regex_constants::match_flag_type flags = regex_constants::match_default
) const;
//  The same as srell::regex_search(str, m, re, flags)

bool search(
	const charT *const str,
	const regex_constants::match_flag_type flags = regex_constants::match_default
) const;
//  The same as srell::regex_search(str, re, flags)

template <typename ST, typename SA, typename Allocator>
bool search(
	const std::basic_string<charT, ST, SA> &s,
	match_results<typename std::basic_string<charT, ST, SA>::const_iterator, Allocator> &m,
	const regex_constants::match_flag_type flags = regex_constants::match_default
) const;
//  The same as srell::regex_search(s, m, re, flags)

template <typename ST, typename SA>
bool search(
	const std::basic_string<charT, ST, SA> &s,
	const regex_constants::match_flag_type flags = regex_constants::match_default
) const;
//  The same as srell::regex_search(s, re, flags)

//  The following two are not part of std::regex.

template <typename BidirectionalIterator, typename Allocator>
bool search(
	const BidirectionalIterator begin,
	const BidirectionalIterator end,
	const BidirectionalIterator lookbehind_limit,
	match_results<BidirectionalIterator, Allocator> &m,
	const regex_constants::match_flag_type flags = regex_constants::match_default) const;
//  The same as srell::regex_search(begin, end, lookbehind_limit, m, re, flags)

template <typename BidirectionalIterator>
bool search(
	const BidirectionalIterator begin,
	const BidirectionalIterator end,
	const BidirectionalIterator lookbehind_limit,
	const regex_constants::match_flag_type flags = regex_constants::match_default
) const;
//  The same as srell::regex_search(begin, end, lookbehind_limit, re, flags)
				

replace() const

Unlike the two above, this is not a shorthand for srell::regex_replace(). Whereas regex_replace() leaves the passed string untouched, performs the replacement on a copy and returns that copy, replace() rewrites the passed string itself.
Apart from this point, it behaves like String.prototype.replace(regexp-object, newSubStr|callback-function) of ECMAScript.

replace() accepts as its target string an object of the std::basic_string type or of any container type that has the same APIs as std::basic_string (up to 4.010, only the former was accepted).

There are two ways to pass a replacement string: 1) pass a format string, or 2) pass through a callback function.

Replacement by format string

//  charT is the type passed to basic_regex as the first template argument.
template <typename StringLike>
void replace(
    StringLike &s,
    const charT *const fmt_begin,
    const charT *const fmt_end,
    const bool global = false) const;

template <typename StringLike>
void replace(
    StringLike &s,
    const charT *const fmt,
    const bool global = false) const;

template <typename StringLike, typename FST, typename FSA>
void replace(
    StringLike &s,
    const std::basic_string<charT, FST, FSA> &fmt,
    const bool global = false) const;
					

When the global parameter is false, only the first match is replaced. When true, all matched substrings in a string are replaced.

The format of fmt is the same as the one for srell::regex_replace() and follows the ECMAScript specification, Runtime Semantics: GetSubstitution.
The sequences in the following table have a special meaning; any other character is copied to the result as is.

Special symbols for replacement
Symbol | String used as the replacement
$$ | $ itself.
$& | The entire matched substring.
$` | The substring that precedes the matched substring.
$' | The substring that follows the matched substring.
$n (n is one of 1 2 3 4 5 6 7 8 9, not followed by a digit) | The substring captured by the nth pair of round brackets (1-based index) in the regular expression. Replaced with an empty string if nothing is captured. Not replaced if n is greater than the number of capturing brackets in the regular expression.
$nn (nn is any value in the range 01 to 99 inclusive) | The substring captured by the nnth pair of round brackets (1-based index) in the regular expression. Replaced with an empty string if nothing is captured. Not replaced if nn is greater than the number of capturing brackets in the regular expression.
$<NAME> | 1. If the regular expression contains no named group, replacement does not happen. 2. Otherwise, replaced with the substring captured by the pair of round brackets whose group name is NAME. If no capturing group named NAME exists or nothing is captured by that group, replaced with an empty string.

//  Replacement by format template.
#include <cstdio>
#include <string>
#include "srell.hpp"

int main()
{
    const srell::regex re("(\\d)(\\d)");    //  Searches for two consecutive digits.
    std::string text("ab0123456789cd");

    re.replace(text, "($2$1)", true);
        //  Exchanges the order of $1 and $2 and encloses them with a pair of brackets.

    std::printf("Result: %s\n", text.c_str());
    return 0;
}
---- output ----
Result: ab(10)(32)(54)(76)(98)cd
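
For comparison, the non-modifying counterpart can be sketched with std::regex alone. This is only an illustration under the assumption that std::regex_replace() interprets $1, $2 and $& in the same way as the format strings described above; replaced_copy is a hypothetical helper name, not part of SRELL or <regex>, and $<NAME> is not available in std::regex. The global parameter of replace() is mapped here onto the presence or absence of format_first_only.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper: returns a modified copy of text and leaves the
//  argument itself untouched, which is how regex_replace() works.
//  global == true  : every match is replaced.
//  global == false : only the first match is replaced (format_first_only).
std::string replaced_copy(const std::string &text, const std::regex &re,
                          const std::string &fmt, const bool global)
{
    const std::regex_constants::match_flag_type flags = global
        ? std::regex_constants::format_default
        : std::regex_constants::format_first_only;

    return std::regex_replace(text, re, fmt, flags);
}
```

Called with the pattern "(\\d)(\\d)", the text "ab0123456789cd" and the format "($2$1)", the helper returns the same string as the example above when global is true, and "ab(10)23456789cd" when global is false. In both cases the original text is unchanged, which is the point where basic_regex::replace() differs.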
					

Replacement by callback function

When replace() receives a callback function as a parameter instead of a format string, it calls that function every time a substring that matches a regular expression is found.

//  charT is the type passed to basic_regex as the first template argument.
template <typename StringLike, typename RandomAccessIterator, typename MA>
void replace(
    StringLike &s,
    bool (*repfunc)(
        std::basic_string<charT, typename StringLike::traits_type,
            typename StringLike::allocator_type> &replacement_text,
        const match_results<RandomAccessIterator, MA> &m,
        void *),
    void *ptr = NULL) const;

template <typename MatchResults, typename StringLike>
void replace(
    StringLike &s,
    bool (*repfunc)(
        std::basic_string<charT, typename StringLike::traits_type,
            typename StringLike::allocator_type> &replacement_text,
        const MatchResults &m,
        void *),
    void *ptr = NULL) const;
					

The signature of a callback function (repfunc) is as follows:

//  charT is the type passed to basic_regex as the first template argument.
bool replacement_function(
    std::basic_string<charT> &replacement_text,
    const match_results<const charT *> &m,  //  For *cmatch
    void *);

bool replacement_function(
    std::basic_string<charT> &replacement_text,
    const match_results<typename std::basic_string<charT>::const_iterator> &m,  //  For *smatch
    void *);
					

The ranges of matched substrings are written into an instance of match_results, which is passed to the callback function as the second parameter m.

The typedefs of match_results fall into two groups: the *cmatch family based on const charT *, and the *smatch family based on std::basic_string<charT>::const_iterator. Because of this, the callback function has two possible signatures.

The third argument of replace() becomes the third parameter of the callback function. It is useful for passing something to, or returning something from, the callback function via a pointer.

In the callback function, set the new string for replacement to the first parameter replacement_text, which is passed as a reference to an object of basic_string, and then return. When the function returns true, it is called again if a new match is found. When it returns false, it is not called any more.

//  Replacement by callback function.
//  Decode of percent encoding.
#include <cstdio>
#include <cstdlib>  //  For std::strtoul.
#include <string>
#include "srell.hpp"

bool repfunc(std::string &out, const srell::u8cmatch &m, void *) {
    out.push_back(std::strtoul(m[1].str().c_str(), NULL, 16));
    return true;
}

int main() {
    const srell::regex re("%([0-9A-Fa-f]{2})");
    std::string c14("%E3%81%82%E3%81%84%E3%81%86%E3%81%88%E3%81%8A");
    std::string c9803(c14), c11(c14);

    re.replace(c9803, repfunc);  //  C++98/03

    re.template replace<srell::smatch>(c11, [](std::string &out, const srell::smatch &m, void *) -> bool {  //  C++11
         out.push_back(std::strtoul(m[1].str().c_str(), NULL, 16));
         return true;
    });

    re.template replace<srell::smatch>(c14, [](auto &out, const auto &m, auto) -> bool {  //  C++14 and later.
         out.push_back(std::strtoul(m[1].str().c_str(), NULL, 16));
         return true;
    });

    std::printf("Result(C++98/03): %s\n", c9803.c_str());
    std::printf("Result(C++11): %s\n", c11.c_str());
    std::printf("Result(C++14-): %s\n", c14.c_str());
    return 0;
}
---- output ----
Result(C++98/03): あいうえお
Result(C++11): あいうえお
Result(C++14-): あいうえお
					

When a lambda expression is used instead of a pointer to a callback function, the match_results<RandomAccessIterator, Alloc> type to be passed as the second parameter of the lambda must be explicitly specified as a template argument; otherwise, deduction of the template parameters of match_results fails.

Note
  • Until version 4.012, creplace(), which always passes the *cmatch family, and sreplace(), which always passes the *smatch family, were provided. They were removed in 4.013 to simplify the complicated overloads of replace().
  • Until version 4.010, two types needed to be explicitly specified as template arguments, in this order: 1) the custom std::basic_string<charT, ST, SA> type itself, and 2) the match_results<RandomAccessIterator, Alloc> type to be passed to the callback function.

str_clip()

This is a template class inside namespace srell and not a member of basic_regex, but it is explained here because it is intended to be used with replace().

Beginning with SRELL 4.011, a template class, str_clip(), has been added. It is a utility that limits the range of a string within which replace() searches and replaces.

//  Example of str_clip().
#include <cstdio>
#include <string>
#include "srell.hpp"

int main() {
    const srell::regex re(".");
    std::string text("0123456789ABCDEF");

    srell::str_clip<std::string> ctext(text);
    //  As a template argument, specifies the type (std::string)
    //  of an object (text) to be assigned with str_clip.

    //  Clipping by pos and count pair: From offset 4, 6 elements.
    re.replace(ctext.clip(4, 6), "x", true);
    std::printf("By pos&count: %s\n", text.c_str());  //  "0123xxxxxxABCDEF"

    //  Clipping by iterator pair.
    re.replace(ctext.clip(text.begin() + 6, text.end() - 6), "y", true);
    std::printf("By iterators: %s\n", text.c_str());  //  "0123xxyyyyABCDEF"

    re.template replace<srell::cmatch>(ctext.clip(6, 2), [](std::string &out, const srell::cmatch &, void *) {
        out = "Zz";
        return true;
    });
    std::printf("By lambda: %s\n", text.c_str());  //  "0123xxZzZzyyABCDEF"

    return 0;
}
					

split() const

split() splits a string into substrings at each subsequence that matches the regular expression, stores the position range of each substring in an instance of sub_match, and pushes it to the list container (vector, list, etc.) passed by reference to split() as the first parameter.

Except for the following modification, it behaves like String.prototype.split(regexp-object, limit) of ECMAScript:

  • When limit, the maximum number of substrings pushed to the passed list container, is explicitly specified, split() behaves in accordance with the ECMAScript specification for the first limit-1 pushes, and for the last one pushes the whole remainder of the string (the part that has not been searched yet) to the container.

Specifying the maximum number of splits is not a rare feature, but JavaScript's version is a bit peculiar: when the number of splits reaches limit, split() throws away the remainder of the string that has not been searched yet and does not push it to the list container. As I personally do not find this behaviour pleasant, the modification above has been applied.
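
The difference between the two treatments of limit can be sketched with std::regex alone. This is a simplified model and not SRELL's implementation: split_sketch and its keep_remainder switch are hypothetical names, results are copied into std::string instead of sub_match, captured groups are not pushed, and separators that match the empty string are deliberately left out of scope.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

//  Hypothetical model of split() with a limit.
//  keep_remainder == true  : the limit-th substring is the whole unsearched
//                            rest (the modified behaviour described above).
//  keep_remainder == false : the unsearched rest is discarded once limit
//                            substrings have been pushed (JavaScript-like).
std::vector<std::string> split_sketch(const std::string &text, const std::regex &re,
                                      const std::size_t limit, const bool keep_remainder)
{
    std::vector<std::string> out;
    std::string::const_iterator pos = text.begin();
    const std::string::const_iterator end = text.end();
    std::smatch m;

    if (limit == 0)
        return out;

    while (std::regex_search(pos, end, m, re)) {
        if (m[0].first == m[0].second)
            break;  //  Empty matches are outside the scope of this sketch.

        if (keep_remainder && out.size() + 1 == limit)
            break;  //  The last slot is reserved for the remainder.

        out.push_back(std::string(pos, m[0].first));
        if (out.size() == limit)
            return out; //  JavaScript-like: the rest is thrown away.

        pos = m[0].second;
    }
    out.push_back(std::string(pos, end));
    return out;
}
```

With the separator ":" and the text "01:23:45", a limit of 2 gives { "01", "23:45" } when keep_remainder is true and { "01", "23" } when it is false.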

template <typename container, typename ST, typename SA>
void split(
    container &c,
    const std::basic_string<charT, ST, SA> &s,
    const std::size_t limit = static_cast<std::size_t>(-1)) const;

//  The following two are available since version 4.011.
template <typename container, typename BidirectionalIterator>
void split(
    container &c,
    const BidirectionalIterator begin,  //  The same as or convertible to container::value_type::iterator.
    const BidirectionalIterator end,
    const std::size_t limit = static_cast<std::size_t>(-1)) const;

template <typename container>
void split(
    container &c,
    const charT *const str,
    const std::size_t limit = static_cast<std::size_t>(-1)) const;
				

For c, any container type that has push_back() as a member function can be used to receive the results.

If the regular expression contains capturing round brackets, the substrings captured by them are also pushed into the list container. Even when a pair of brackets captures nothing, the push is not skipped; an empty string is pushed instead.

#include <cstdio>
#include <string>
#include <vector>
#include "srell.hpp"

template <typename Container>
void print(const Container &c) {
    for (typename Container::size_type i = 0; i < c.size(); ++i)
        std::printf("%s\"%s\"", i == 0 ? "{ " : ", ", c[i].str().c_str());
    std::puts(" }");
}

int main() {
    std::string text("01:23:45");
    srell::regex re(":");
    std::vector<srell::csub_match> res;  //  Or srell::ssub_match.

    re.split(res, text);    //  Unlimited splitting.
    print(res);     //  { "01", "23", "45" }

    res.clear();    //  Note: split() does not call clear()
    re.split(res, text, 2); //  Splits into two.
    print(res);     //  { "01", "23:45" }
                    //  split() of JavaScript returns { "01", "23" }

    re.assign("(?<=(\\d?)):(?=(\\d?))");  //  Captures a string before and after ':'
    res.clear();
    re.split(res, text);
    print(res);     //  { "01", "1", "2", "23", "3", "4", "45" }

    text.assign("caf\xC3\xA9");     //  "café"
    re.assign("");

    res.clear();
    re.split(res, text);    //  Splits by element of char.
    print(res);     //  { "c", "a", "f", "\xC3", "\xA9" }

    srell::u8cregex u8re("");
    res.clear();
    u8re.split(res, text);  //  Splits by character of UTF-8.
    print(res);     //  { "c", "a", "f", "é" }

    return 0;
}
				

ecode() const

Returns the error code of the error that would have been thrown during the previous pattern compilation. This member function is intended for the no throw/exception mode supported since 4.034.
The returned value is of the error_type type, the same type that regex_error::code() returns.

If no error occurred in the previous pattern compilation, it returns 0.

//  std::regex compatible error handling.
try {
    srell::regex re("a{2,1}");
} catch (const srell::regex_error &e) {
    //  e.code() == srell::regex_constants::error_badbrace
}
				
//  Error handling in no throw/exception mode.

srell::regex re("a{2,1}");
//  re.ecode() == srell::regex_constants::error_badbrace
				
regex_iterator2

Since 4.013, SRELL has regex_iterator2. It is a modification of regex_iterator, to which the following changes have been applied:

  • Removal of the special handling when the iterator holds a zero-length match. By this change, a result of replacement using this iterator becomes consistent with basic_regex::replace() above, i.e. JavaScript compatible (example shown later).
  • Addition of assign() for re-use of the object.
  • Addition of helper functions for replacement and splitting.
template <typename BidirectionalIterator,
    typename BasicRegex = basic_regex<typename std::iterator_traits<BidirectionalIterator>::value_type,
        regex_traits<typename std::iterator_traits<BidirectionalIterator>::value_type> >,
    typename MatchResults = match_results<BidirectionalIterator> >
class regex_iterator2;
			

The second template parameter is a type of basic_regex, and the third one is a type of match_results. They are simpler than the template parameters of regex_iterator.
Modelled on regex_iterator, the following typedefs are provided:

typedef regex_iterator2<const char *> cregex_iterator2;
typedef regex_iterator2<const wchar_t *> wcregex_iterator2;
typedef regex_iterator2<std::string::const_iterator> sregex_iterator2;
typedef regex_iterator2<std::wstring::const_iterator> wsregex_iterator2;

//  For UTF-8 with char.
typedef regex_iterator2<const char *, u8cregex> u8ccregex_iterator2;
typedef regex_iterator2<std::string::const_iterator, u8cregex> u8csregex_iterator2;

//  Defined only when char16_t, char32_t are available.
typedef regex_iterator2<const char16_t *> u16cregex_iterator2;
typedef regex_iterator2<const char32_t *> u32cregex_iterator2;
typedef regex_iterator2<std::u16string::const_iterator> u16sregex_iterator2;
typedef regex_iterator2<std::u32string::const_iterator> u32sregex_iterator2;

//  Defined only when char8_t is available.
typedef regex_iterator2<const char8_t *> u8cregex_iterator2;
//  Defined only when std::u8string is available.
typedef regex_iterator2<std::u8string::const_iterator> u8sregex_iterator2;

//  Defined only when char8_t is NOT available.
typedef u8ccregex_iterator2 u8cregex_iterator2;
//  Defined only when std::u8string is NOT available.
typedef u8csregex_iterator2 u8sregex_iterator2;

//  Defined only when WCHAR_MAX >= 0x10FFFF.
typedef wcregex_iterator2 u32wcregex_iterator2;
typedef wsregex_iterator2 u32wsregex_iterator2;
typedef u32wcregex_iterator2 u1632wcregex_iterator2;
typedef u32wsregex_iterator2 u1632wsregex_iterator2;

//  Defined only when 0x10FFFF > WCHAR_MAX >= 0xFFFF.
typedef regex_iterator2<const wchar_t *, u16wregex> u16wcregex_iterator2;
typedef regex_iterator2<std::wstring::const_iterator, u16wregex> u16wsregex_iterator2;
typedef u16wcregex_iterator2 u1632wcregex_iterator2;
typedef u16wsregex_iterator2 u1632wsregex_iterator2;
			

done() const

As with regex_iterator, iteration can be performed with a for-loop that compares the iterator with the end-of-sequence iterator created with no arguments. As a simpler check, regex_iterator2 also has the member function done(), which reports whether the iterator has already reached the end.

srell::sregex_iterator2 eit;
srell::sregex_iterator2 it(text.begin(), text.end(), re);

//  for (; it != eit; ++it) {   //  The same as below.
for (; !it.done(); ++it) {
    //  Does something.
}
				

replace()

If the range passed to the constructor is a part of an object of std::basic_string, and the object has not been resized since then (the given area of memory has not been changed elsewhere), the current matched range of the iterator ((*it)[0]) can be replaced with a new string by calling the replace() member function of the iterator.

regex_iterator2::replace() takes the entire string object of std::basic_string as the first parameter, and a replacement string as the second parameter:

//  Replaces [(*it)[0].first, (*it)[0].second) in
//  [entire.begin(), entire.end()) with replacement or [begin, end).

template <typename ST, typename SA>
void replace(std::basic_string<char_type, ST, SA> &entire,
    const std::basic_string<char_type, ST, SA> &replacement);

template <typename ST, typename SA>
void replace(std::basic_string<char_type, ST, SA> &entire,
    BidirectionalIterator begin, BidirectionalIterator end);

template <typename ST, typename SA>
void replace(std::basic_string<char_type, ST, SA> &entire,
    const char_type *const replacement);
				

If the size of entire is lengthened or shortened by the replacement, the position information inside the iterator is adjusted accordingly, and if the given area of memory changes, all stashed internal iterators are recreated automatically.

The following example shows regex_iterator2::replace(), how its results differ from those of regex_iterator, and how they are consistent with basic_regex::replace():

#include <cstdio>
#include <string>
#include <regex>
#include "srell.hpp"

template <typename Iterator, typename Regex>
void replace(const Regex &re, const std::string &text, const char *const title) {
    std::string::const_iterator prevend = text.begin();
    Iterator it(text.begin(), text.end(), re), eit;
    std::string out;

    for (; it != eit; ++it) {
        out += it->prefix();
        out += ".";
        prevend = (*it)[0].second;
    }

    const std::string::const_iterator end = text.end();
    out.append(prevend, end);
    std::printf("[%s] by %s\n", out.c_str(), title);
}

int main() {
    std::string text("a1b");
    std::regex re1("\\d*?");
    srell::regex re2("\\d*?");

    replace<std::sregex_iterator>(re1, text, "std::sregex_iterator");
    replace<srell::sregex_iterator>(re2, text, "srell::sregex_iterator");
    replace<srell::sregex_iterator2>(re2, text, "srell::sregex_iterator2");

    srell::sregex_iterator2 it(text, re2);
    for (; !it.done(); ++it)
        it.replace(text, ".");  //  Use of replace().
    std::printf("[%s] by srell::sregex_iterator2::replace()\n", text.c_str());

    text = "a1b";  //  Restores text because it was replaced above.
    re2.replace(text, ".", true);
    std::printf("[%s] by srell::basic_regex::replace()\n", text.c_str());

    return 0;
}
---- output ----
[.a...b.] by std::sregex_iterator
[.a...b.] by srell::sregex_iterator
[.a.1.b.] by srell::sregex_iterator2
[.a.1.b.] by srell::sregex_iterator2::replace()
[.a.1.b.] by srell::basic_regex::replace()
				

Because of the special handling mentioned above, "1" was replaced in the first two examples, which use regex_iterator, whereas it remained unchanged in the last three examples, whose replacement is compatible with JavaScript.

Helpers for splitting

Gathering the prefixes of the matches that the iterator points to, plus the suffix of the final match, is equivalent to what split() does (cf. the table below, where it means an iterator):

Positions pointed to by (*it)[0] and it->prefix()
Subject | Unmatch | First match | Unmatch | Second match | Unmatch
Iterator it | it->prefix() of 1st match | (*it)[0] | it->prefix() of 2nd match | (*it)[0] | it->suffix() of 2nd match

So, the following helper functions are provided for easily gathering the unmatched portions (the prefixes and the final suffix):

  • bool split_ready(): Returns whether the current it->prefix() points to a range that can be treated as a split subsequence. The criterion is in accordance with the method defined for split() of ECMAScript (it->prefix().first != (*it)[0].second).
  • const typename value_type::value_type &remainder(bool only_after_match = false): Returns a subsequence equivalent to "it->suffix() of 2nd match" in the table above. When an iterator it has never once matched anything, it->suffix() returns an undefined value, whereas it.remainder() always returns a valid range.
    When the argument is true and the previous match has succeeded, returns [(*it)[0].second, endOfSequence); otherwise returns [it->prefix().first, endOfSequence).

Example of a simple split operation:

for (; !it.done(); ++it) {
    if (it.split_ready())
        list.push_back(it->prefix());
}
list.push_back(it.remainder());
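
The same idea can be tried out with the standard std::sregex_iterator as a stand-in, under the assumption that the separator never matches the empty string. std::sregex_iterator has no done(), split_ready() or remainder(), so the sketch below compares against an end-of-sequence iterator, writes out the criterion quoted for split_ready() by hand, and tracks the end of the last match itself. split_by_prefixes is a hypothetical name used only for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

//  Hypothetical helper: collects the prefixes of all matches and the part
//  that follows the final match, i.e. what a simple split would return.
std::vector<std::string> split_by_prefixes(const std::string &text, const std::regex &re)
{
    std::vector<std::string> list;
    std::string::const_iterator tail = text.begin();
    std::sregex_iterator it(text.begin(), text.end(), re);
    const std::sregex_iterator eit;

    for (; it != eit; ++it) {
        //  The same test as the criterion quoted for split_ready().
        if (it->prefix().first != (*it)[0].second)
            list.push_back(it->prefix().str());
        tail = (*it)[0].second; //  Stand-in for what remainder() would begin at.
    }
    list.push_back(std::string(tail, text.end()));
    return list;
}
```

For the text "01:23:45" and the separator ":", this collects "01", "23" and "45". When nothing matches, the whole text is returned as the only element.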
				

Another example of split, which also pushes submatches when the regular expression contains capturing round brackets and supports specifying the maximum number (LIMIT) of split chunks, as seen in split() in other languages:

for (std::size_t count = 0; !it.done(); ++it) {
    if (it.split_ready()) {
        if (++count == LIMIT)
            break;
        list.push_back(it->prefix());   //  *1
        for (std::size_t i = 1; i < it->size(); ++i) {
            if (++count == LIMIT) {
                list.push_back(it.remainder(true));
                //  true to exclude the range of prefix()
                //  that has already been pushed above (*1).
                return;
            }
            list.push_back((*it)[i]);
        }
    }
}
list.push_back(it.remainder());
				

Even with the helper functions, the code is still lengthy. Thus, more helper functions are provided, with which the code above can be written as follows:

std::size_t count = 0;
for (it.split_begin(); !it.done(); it.split_next()) {
    if (++count == LIMIT)
        break;
    list.push_back(it.split_range());
}
list.push_back(it.split_remainder());   //  Note: not remainder(), but split_remainder().
				
  • void split_begin(): Moves to a first subsequence for which split_ready() returns true. This should be called only once before beginning iterating (or after calling rewind()).
  • bool split_next(): Moves to a next subsequence for which split_ready() returns true. If such a subsequence is found, returns true, otherwise false. This member function is intended to be used instead of the ordinary increment operator (++).
  • const typename value_type::value_type &split_range() const: Returns a current subsequence to which the iterator points.
  • const typename value_type::value_type &split_remainder(): Returns the final subsequence immediately following the last match range. This should be called after iterating is complete or aborted. Unlike remainder() above, a boolean value corresponding to only_after_match is automatically calculated.

Incidentally, split support is just an extra. If supporting it turned out to be an obstacle to, or incompatible with, the iterator's main duty of walking through a sequence and finding substrings that match the regular expression, these helpers might become undocumented features or even be dropped. However, because updataout3.cpp included in the zip archive of SRELL has begun to use this feature, removal of these helpers will probably not happen.

Measures against long time thinking

The regular expression engine of ECMAScript (and also that of Perl, on which it is based) usually uses the backtracking algorithm for matching. Backtracking can require exponential time when searching with a regular expression that includes 1) nested quantifiers, or 2) consecutive expressions that each have a quantifier, where what each expression matches and what its adjacent expressions match are not mutually exclusive but overlap. The following patterns are well-known examples:

  • "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /(a*)*b/
  • "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/

Unfortunately, no fundamental measure against this problem that can be applied in every situation has been found yet. So, to avoid holding control for a long time, SRELL throws regex_error(regex_constants::error_complexity) when matching from a particular position fails repeatedly more than a certain number of times.

The default value of this limit is 16777216 (256 to the third power). It can be changed by assigning an arbitrary value to the limit_counter member variable of the basic_regex instance passed to an algorithm function such as regex_search() or regex_match().
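The idea of giving up after a fixed budget can be illustrated with a toy backtracking matcher. This is a sketch only and is not SRELL's implementation: the class toy_matcher is hypothetical, it supports nothing but literal characters and "x?", and it counts every matching step, whereas SRELL counts repeated failures from a particular position.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

//  Hypothetical toy matcher, not SRELL code. It supports only literal
//  characters and "x?" (an optional literal), and must match the whole text.
class toy_matcher
{
public:
    explicit toy_matcher(const std::size_t limit) : limit_(limit), steps_(0)
    {
        assert(limit > 0);
    }

    bool match(const std::string &pattern, const std::string &text)
    {
        steps_ = 0;
        return match_here(pattern, 0, text, 0);
    }

    std::size_t steps() const { return steps_; }

private:
    bool match_here(const std::string &p, const std::size_t pi,
                    const std::string &t, const std::size_t ti)
    {
        if (++steps_ > limit_)  //  Give up instead of searching for too long.
            throw std::runtime_error("complexity limit exceeded");

        if (pi == p.size())
            return ti == t.size();

        const bool same = ti < t.size() && t[ti] == p[pi];

        if (pi + 1 < p.size() && p[pi + 1] == '?') {
            //  Greedy: first try to consume the character, then backtrack and skip it.
            if (same && match_here(p, pi + 2, t, ti + 1))
                return true;
            return match_here(p, pi + 2, t, ti);
        }
        return same && match_here(p, pi + 1, t, ti + 1);
    }

    std::size_t limit_;     //  Plays the role that limit_counter plays in SRELL.
    std::size_t steps_;
};
```

With a small budget, a pattern built like the second example above (twenty "a?" followed by twenty "a", matched against twenty "a") makes match() throw, while a short pattern such as "a?a?a?aaa" against "aaa" still succeeds.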

Differences between std::regex and SRELL

Regular expression engines and flags

<regex> has six regular expression engines, including an ECMAScript-based one, whereas SRELL has a single engine that is compatible with RegExp of ECMAScript. Because of this difference, the following flag options defined in <regex> are ignored in SRELL even if specified:

syntax_option_type (and flag_type of basic_regex)

  • nosubs, optimize, collate, basic, extended, awk, grep, egrep (all but icase and multiline)

match_flag_type

  • match_any, format_sed

The comparison between the ECMAScript mode of <regex> and SRELL is as follows:

  • <regex>'s ECMAScript mode consists of the expressions defined in the ECMAScript specification third edition
    - (MINUS) Unicode dependent matters (such as what \s matches)
    + (PLUS) locale dependent matters
    + (PLUS) [:class name:], [.class name.], [=class name=] expressions.

  • SRELL 2.000 and later: consists of the expressions defined in the ECMAScript 2018 or later specification.
  • SRELL 1.nnn: consists of the expressions defined in the ECMAScript 2017 (ES8) specification
    + (PLUS) fixed-length lookbehind assertions.

Although both are based on ECMAScript's regular expression specification, neither <regex> nor SRELL is a superset of the other.

Simplification

The implementations of the following functions in SRELL have been simplified to avoid redundant overheads:

  • basic_regex::assign(): In <regex> when an exception is thrown (when compiling a regular expression string fails) *this remains unchanged (cf. 11 in [re.regex.assign]), whereas *this is cleared in SRELL. This is because when SRELL begins to compile a new pattern, it does not keep the old contents anywhere.
  • Until version 4.033: match_results::operator[](size_type n): While <regex> guarantees safety even when n >= match_results::size() (i.e., out-of-range access) (cf. 8 in [re.results.acc]), SRELL did not do so until version 4.033. Guaranteeing safety requires an additional dummy member of the sub_match type solely to handle out-of-range access.

Tips

For better performance

1. When matching or searching with the same regular expression pattern is performed multiple times, it is recommended to construct the regular expression object (of basic_regex) as static const so that the pattern is compiled only once.

//  A function called multiple times in the program.
bool is_included(const std::string &text, const std::string &regex)
{
    static const srell::regex r1("a*b");    //  OK. Pattern compile is executed only at the first time.
//  srell::regex r2("a*b");     //  Compiled every time.
    ...
			

2. When an algorithm function (regex_search, regex_match) is called multiple times in a loop, it is recommended to pass a match_results object to the function for better performance, even if you do not need the results.

std::vector<std::string> lines;
std::size_t count = 0;  //  Number of lines that contain something that looks like URL.
srell::regex re("https?://\\S+");   //  Matches something that looks like URL.
srell::smatch match;    //  typedef of match_results<std::string::const_iterator>.

//  Reads text into lines here.

for (std::size_t i = 0; i < lines.size(); ++i)
{
    //  Very slow because a disposable match_results object
    //  is prepared in regex_search() every time.
//  if (srell::regex_search(lines[i], re))  //  *1

    if (srell::regex_search(lines[i], match, re))   //  *2
        ++count;
    ...
			

*2 performs better because match_results contains the stack used while regex matching is performed. In *1 above, each time the function is called, 1) a disposable match_results object is prepared, 2) memory for its stack is allocated, and 3) that memory is freed, whereas in *2 the memory is reused in subsequent calls once it has been allocated. So *2 can be more than twice as fast as *1 when the number of iterations is large.

For smaller binary size

Features that you do not need can be cut off by defining one or more of the following macros before including srell.hpp. This makes the output binary smaller and shortens compile time (of the C++ source code, not of a regex pattern).

SRELL_NO_UNICODE_ICASE

Prevents Unicode case folding data used for icase (case-insensitive) matching from being output into a resulting binary. In this case, only the ASCII characters are case-folded when icase matching is performed ([A-Z] -> [a-z] only).

SRELL_NO_UNICODE_PROPERTY

Prevents Unicode property data from being output into a resulting binary. In this case, \p{...} and \P{...} are not available.
Moreover, the name for a named capturing group is not parsed strictly, but any character except '\' and '>' is accepted as a letter that can be used for the group name.

When this macro is defined, SRELL_NO_VMODE below is also defined implicitly.

SRELL_NO_UNICODE_DATA

Defines both SRELL_NO_UNICODE_ICASE and SRELL_NO_UNICODE_PROPERTY.

SRELL_NO_NAMEDCAPTURE

Cuts off the code for named capturing groups.

SRELL_NO_VMODE

Cuts off the code for v-mode (v-flag, unicodeset flag).

Miscellaneous information

SRELL with char, wchar_t

Among typedefs of basic_regex, types that do not have any Unicode prefix (u8-, u8c-, u16-, u16w-, u1632w-, u32-, u32w-) treat an input string as a sequence of Unicode values.

For example, when CHAR_BIT is 8, srell::regex (typedef of srell::basic_regex<char>) interprets 0x00-0xFF in an input string as U+0000-U+00FF, respectively. Because U+0000-U+00FF in Unicode are compatible with ISO-8859-1, as a result, it can be assumed that srell::regex supports ISO-8859-1.

srell::regex can therefore also be used to find a specific pattern of bytes in binary data.

This applies also to srell::wregex (typedef of srell::basic_regex<wchar_t>). It interprets an input as a sequence of Unicode values in the range 0x00-WCHAR_MAX.

The suitable type to use with the W functions of WinAPI is srell::u16wregex or srell::u1632wregex which supports UTF-16, not srell::wregex that virtually supports UCS-2.

C++98/03 and UTF-8, UTF-16, UTF-32

In compilers prior to C++11, only "u8c-" types and "u16w-" types are available if wchar_t is at least 16 bits and less than 21 bits wide, and only "u8c-" types and "u32w-" types are available if wchar_t is at least 21 bits wide.
However, even in such environments, "u8c-", "u16-" and "u32-" types are available if such code as below is put before including SRELL:

typedef unsigned short char16_t;    //  Do typedef for a type that can have a 16-bit value.
typedef unsigned long char32_t;    //  Do typedef for a type that can have a 32-bit value.

namespace std
{
    typedef basic_string<char16_t> u16string;
    typedef basic_string<char32_t> u32string;
}

#define SRELL_CPP11_CHAR1632_ENABLED    //  Make them available manually.
			

Incidentally, UTF-8 and UTF-16 are handled by u8regex_traits and u16regex_traits, respectively, which are passed to basic_regex as a template argument. By using these classes it is possible, for example, to make a class that handles UTF-16 strings stored in an array of uint32_t, such as basic_regex<uint32_t, u16regex_traits<uint32_t> >.

Compiler

Although I develop and check SRELL mainly on VC++ and MinGW, according to some tests on Compiler Explorer as of May 2019, at least the following compilers can also compile sample code that performs a regex search with srell::u16regex of SRELL 2.200 against UTF-16 strings, and generate assembly output:

  • GCC 9.1 (x86-64), 8.2 (ARM64, ARM)
  • Clang 8.0.0 (x86-64)
  • ICC (Intel C++ Compiler) 19.0.1 (x86-64)

Sample code that uses char or wchar_t can be compiled by x86-64 gcc 4.1.2, which seems to be the oldest compiler available in Compiler Explorer.

Breaking changes

match_lblim_avail flag and match_results.lookbehind_limit member

SRELL version 2.300~2.500 had the following extensions:

If the match_lblim_avail flag option is set, then when a lookbehind assertion is performed, the lookbehind_limit member of the match_results instance passed to an algorithm function is treated as "the position back to which the algorithm function may look behind".

const char text[] = "0123456789abcdefghijklmnopqrstuvwxyz";
const char* const begin = text;
const char* const end = text + std::strlen(text);
const char* const first = text + 10;    //  Sets the position of 'a'.
const srell::regex re("(?<=^\\d+).");
srell::cmatch match;

match.lookbehind_limit = begin;

std::printf("matched %d\n", srell::regex_search(first, end, match, re));
    //  Does not match as lookbehind is performed only in the range [first, end).

std::printf("matched %d\n", srell::regex_search(first, end, match, re, srell::regex_constants::match_lblim_avail));
    //  Matches because regex_search is allowed to look behind as far back as match.lookbehind_limit.
    //  I.e., when match_lblim_avail is specified, the search is performed against the sequence
    //  [match.lookbehind_limit, end), but begins at first within that sequence.
		

As shown in the example above, when match_lblim_avail is specified, ^ matches at match.lookbehind_limit instead of at first.

In SRELL 2.600 and later, the limit position back to which regex_search is allowed to look behind can be specified as an argument passed to the function. Thus, the mechanism described above was removed.
The reason why "the three iterators way" introduced in 2.600 was not chosen at first is that there are two possible ways to pass the arguments, as follows:

//  Option 1
bool regex_search(
    BidirectionalIterator first,
    BidirectionalIterator last,
    BidirectionalIterator lookbehind_limit,
    match_results<BidirectionalIterator, Allocator>& m,
    const basic_regex<charT, traits>& e,
    regex_constants::match_flag_type flags = regex_constants::match_default);

//  Option 2
bool regex_search(
    BidirectionalIterator lookbehind_limit,
    BidirectionalIterator first,
    BidirectionalIterator last,
    match_results<BidirectionalIterator, Allocator>& m,
    const basic_regex<charT, traits>& e,
    regex_constants::match_flag_type flags = regex_constants::match_default);
		

In Option 1, the limit of lookbehind is passed as an addition to an ordinary [first, last) range. In Option 2, the three iterators are sorted in ascending order.

In both options the parameter types appear in exactly the same order, so C++ compilers cannot distinguish between them. If SRELL were to adopt Option 1 and regex_search of the C++ Standard were later to adopt Option 2, an incompatibility between std::regex and srell::regex would arise that does not cause a compile error.
Once that happened, it would not be easy to fix. If the order of the parameters in SRELL were rearranged to agree with the C++ Standard, it would be a breaking change that users could not easily notice, because no compile error would occur. Whether it was fixed or not, SRELL would be left with a troublesome problem.
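The silent mismatch can be reproduced with two stand-in functions that take three pointers in the two orders. This is a minimal sketch: the namespaces option1 and option2 and the function lookbehind_room are hypothetical and exist only for this illustration; they are not part of SRELL or of <regex>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

//  Hypothetical stand-ins for the two candidate signatures. Each returns how many
//  characters a lookbehind could inspect before the search start position.
namespace option1
{
    //  (first, last, lookbehind_limit)
    inline std::ptrdiff_t lookbehind_room(const char *first, const char *last, const char *lookbehind_limit)
    {
        (void)last;
        return first - lookbehind_limit;
    }
}

namespace option2
{
    //  (lookbehind_limit, first, last)
    inline std::ptrdiff_t lookbehind_room(const char *lookbehind_limit, const char *first, const char *last)
    {
        (void)last;
        return first - lookbehind_limit;
    }
}

//  The caller writes its arguments in Option 1 order. Both calls compile,
//  but Option 2 silently reads the same arguments with different meanings.
inline void compare_options()
{
    const char text[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    const char *const begin = text;
    const char *const end = text + std::strlen(text);
    const char *const first = text + 10;    //  The position of 'a'.

    assert(option1::lookbehind_room(first, end, begin) == 10);  //  As intended.
    assert(option2::lookbehind_room(first, end, begin) == 26);  //  Wrong, yet no compile error.
}
```

Because all three parameters have the same type, nothing in the type system can flag the second call, which is exactly the risk described above.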

Thus, I submitted a proposal to ask whether the C++ Committee had an interest in the enhancement of <regex>, and implemented the additional members of match_results mentioned above as a temporary means until the outcome of the proposal became clear. (Because I intentionally chose an unpolished way, it was very unlikely that this addition would conflict with a future enhancement of C++'s match_results.)

As it turned out that the C++ committee no longer has any intention to improve or to enhance <regex>, I implemented Option 1 in SRELL 2.600 and removed the temporary means.

u8-prefix versus u8c-prefix

Regardless of the C++ version, "u8-" prefix means that "This class can handle u8"..." string literals", whereas "u8c-" prefix means that "This class handles a sequence of the char type as a UTF-8 string". Until C++17, this distinction was not necessary because the type of u8"..." was const char[N].

However, as char8_t was introduced in C++20, the type of u8"..." was changed to const char8_t[N]. Because of this, SRELL needed to distinguish a UTF-8 sequence based on char8_t from the traditional one based on char.
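The change in the literal's type can be observed directly with standard type traits. The following is a small sketch that relies only on the standard feature-test macro __cpp_char8_t; the names u8_code_unit, u8_literal_is_char_based and check_u8_literal_type are made up for this illustration.

```cpp
#include <cassert>
#include <type_traits>

//  The code unit type of a u8"..." literal as seen by the current compiler.
typedef std::remove_cv<std::remove_reference<decltype(u8"a"[0])>::type>::type u8_code_unit;

//  True when u8"..." is char-based (C++11 to C++17), i.e. when such literals
//  would be handled by the "u8c-" types rather than by char8_t-based ones.
inline bool u8_literal_is_char_based()
{
    return std::is_same<u8_code_unit, char>::value;
}

inline void check_u8_literal_type()
{
#if defined(__cpp_char8_t)
    assert(!u8_literal_is_char_based());    //  C++20 and later: const char8_t[N].
#else
    assert(u8_literal_is_char_based());     //  Until C++17: const char[N].
#endif
}
```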

Normally, a new prefix would be given to the newly introduced specialisations that handle a UTF-8 sequence based on char8_t, but it is expected that the C++ standard library will use the "u8-" prefix for specialisations for the char8_t type, as std::u8string already does.
Thus, to avoid inconsistency with the naming convention of the standard library, SRELL introduced the "u8c-" prefix and has used it to mean that "This class handles a sequence of char as a UTF-8 string" since version 2.100.

List of classes whose prefix has been changed from u8- to u8c-

  • basic_regex: u8cregex
  • match_results: u8ccmatch, u8csmatch
  • sub_match: u8ccsub_match, u8cssub_match
  • regex_iterator: u8ccregex_iterator, u8csregex_iterator
  • regex_token_iterator: u8ccregex_token_iterator, u8csregex_token_iterator

The freed "u8-" prefix is now associated with the char8_t type. But if your compiler does not support char8_t, for backwards compatibility, class names having the "u8-" prefix are also provided as aliases (i.e., typedef) of the corresponding classes that have the "u8c-" prefix in their names, respectively.

u8- and u8c-
Prefix | SRELL -2.002 | SRELL 2.100- (when char8_t not supported by compiler) | SRELL 2.100- (when char8_t supported by compiler)
u8- | handles a sequence of char as UTF-8 | handles a sequence of char as UTF-8 | handles a sequence of char8_t as UTF-8
u8c- | (Prefix did not exist) | handles a sequence of char as UTF-8 | handles a sequence of char as UTF-8

Lookbehind

SRELL version 1.nnn supported the regular expressions defined in ECMAScript 2017 (ES8) Specification 21.2 RegExp (Regular Expression) Objects plus fixed-length lookbehind assertions.

For the following reason, SRELL version 1.nnn behaved differently from SRELL version 2.000 and later when a lookbehind assertion was used.

TC39, which maintains the ECMAScript standard, adopted variable-length lookbehind assertions for its RegExp, instead of the fixed-length ones that are supported by many script languages, such as Perl5, Python, etc. At first glance, the former may seem to be just a superset of the latter, but the two in fact return different results in some cases:

"abcd" =~ /(?<=(.){2})./
//  Fixed-length lookbehind: ["c", "b"].
//  As the automaton runs from left to right even in a lookbehind assertion,
//  "b" just before "c" is the last one that $1 captured.

//  Variable-length lookbehind: ["c", "a"].
//  As the automaton runs from right to left in a lookbehind assertion,
//  "a", two positions before "c", is the last one that $1 captured.
		

While SRELL 1 supported fixed-length lookbehind assertions as an extension, SRELL 2.000 and later support variable-length lookbehind assertions, following the enhancement of RegExp of JavaScript. Thus, there was a breaking change between SRELL 1.401 and SRELL 2.000.
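The difference in the "abcd" example comes down to the order in which the two characters before the match position are visited. The following simulation is only a sketch of that ordering, not a regex engine; the function last_captured and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

//  Simulates which character $1 holds after (?<=(.){2}) succeeds just before
//  position pos. Only the visiting order of the two characters differs.
inline char last_captured(const std::string &text, const std::size_t pos, const bool right_to_left)
{
    assert(pos >= 2 && pos <= text.size());

    char captured = '\0';
    for (std::size_t n = 0; n < 2; ++n) {
        //  Left to right visits pos-2 then pos-1; right to left visits pos-1 then pos-2.
        const std::size_t index = right_to_left ? pos - 1 - n : pos - 2 + n;
        captured = text[index];     //  Each repetition of (.) overwrites the previous capture.
    }
    return captured;
}
```

For "abcd" with the final '.' matching "c" (pos is 2), the left-to-right order leaves "b" in $1 and the right-to-left order leaves "a", which agrees with the two results listed above.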

External Links

RegExp of ECMAScript (JavaScript)

Proposals (Updated: 26 Jan 2024)

In principle, SRELL begins to support a proposed feature once the proposal's champion(s) attempt to advance the proposal to Stage 4.

Performance