List of breaking changes in SRELL

Breaking changes

basic_regex's extension APIs

In SRELL 4.009-4.056, the basic_regex class had the following member functions as extension APIs:

replace() const

Whereas regex_replace() leaves the passed string untouched, performs the replacement on a copy, and returns that copy, replace() rewrites the passed string itself.
Apart from this point, it behaves like String.prototype.replace(regexp-object, newSubStr|callback-function) of ECMAScript.

replace() accepts an object of the std::basic_string type or any container type that has the same APIs as std::basic_string as a target string (Up to 4.010, only the former type was accepted).

There are two ways to pass a replacement string: 1) pass a format string, or 2) pass through a callback function.

Replacement by format string
//  charT is the type passed to basic_regex as the first template argument.
template <typename StringLike>
void replace(
StringLike &s,
const charT *const fmt_begin,
const charT *const fmt_end,
const bool global = false) const;

template <typename StringLike>
void replace(
StringLike &s,
const charT *const fmt,
const bool global = false) const;

template <typename StringLike, typename FST, typename FSA>
void replace(
StringLike &s,
const std::basic_string<charT, FST, FSA> &fmt,
const bool global = false) const;
				

When the global parameter is false, only the first match is replaced. When true, all matched substrings in a string are replaced.
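For readers who know only the standard library, the effect of the global parameter can be approximated with std::regex. The following is a minimal sketch, not SRELL code: replace_in_place is a hypothetical helper that emulates in-place rewriting by assigning the result of std::regex_replace() back to the string, and it maps global == false to std::regex_constants::format_first_only.

```cpp
#include <regex>
#include <string>

//  Hypothetical helper, not part of SRELL or the standard library.
//  Emulates the documented behaviour of replace(): the passed string itself
//  is rewritten, and only the first match is replaced unless global is true.
void replace_in_place(std::string &s, const std::regex &re,
                      const std::string &fmt, const bool global = false)
{
    const std::regex_constants::match_flag_type flags = global
        ? std::regex_constants::format_default
        : std::regex_constants::format_first_only;

    //  std::regex_replace() returns a copy; assigning it back rewrites s.
    s = std::regex_replace(s, re, fmt, flags);
}
```

With the text "ab0123456789cd", the pattern (\d)(\d) and the format ($2$1), this helper gives "ab(10)23456789cd" when global is false and "ab(10)(32)(54)(76)(98)cd" when it is true.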

The format of fmt is the same as that of srell::regex_replace(); it follows the ECMAScript specification, Runtime Semantics: GetSubstitution.
The sequences listed in the table in that section have special meanings; any other character is used as is as a substitution character.

//  Replacement by format template.
#include <cstdio>
#include <string>
#include "srell.hpp"

int main()
{
const srell::regex re("(\\d)(\\d)");    //  Searches for two consecutive digits.
std::string text("ab0123456789cd");

re.replace(text, "($2$1)", true);
    //  Exchanges the order of $1 and $2 and encloses them with a pair of brackets.

std::printf("Result: %s\n", text.c_str());
return 0;
}
---- output ----
Result: ab(10)(32)(54)(76)(98)cd
				
Replacement by callback function

When replace() receives a callback function as a parameter instead of a format string, it calls that function every time a substring that matches a regular expression is found.

//  charT is the type passed to basic_regex as the first template argument.
template <typename StringLike, typename RandomAccessIterator, typename MA>
void replace(
StringLike &s,
bool (*repfunc)(
    std::basic_string<charT, typename StringLike::traits_type,
        typename StringLike::allocator_type> &replacement_text,
    const match_results<RandomAccessIterator, MA> &m,
    void *),
void *ptr = NULL) const;

template <typename MatchResults, typename StringLike>
void replace(
StringLike &s,
bool (*repfunc)(
    std::basic_string<charT, typename StringLike::traits_type,
        typename StringLike::allocator_type> &replacement_text,
    const MatchResults &m,
    void *),
void *ptr = NULL) const;
				

The signature of a callback function (repfunc) is as follows:

//  charT is the type passed to basic_regex as the first template argument.
bool replacement_function(
std::basic_string<charT> &replacement_text,
const match_results<const charT *> &m,  //  For *cmatch
void *);

bool replacement_function(
std::basic_string<charT> &replacement_text,
const match_results<typename std::basic_string<charT>::const_iterator> &m,  //  For *smatch
void *);
				

The ranges of matched substrings are written into an instance of match_results, and passed to the callback function as the second parameter m.

For typedefs of match_results, there are two groups, the *cmatch family based on const charT * and the *smatch family based on std::basic_string<charT>::const_iterator. Because of this, the callback function has two types of signature.

The third argument of replace() becomes the third parameter value of the callback function. It is useful to pass/return something to/from a callback function via a pointer.

In the callback function, set the new replacement string to the first parameter (a reference to a basic_string object), then return. If the function returns true, it is called again when another match is found. If it returns false, it is not called again.
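The calling protocol described above can be sketched with the standard library alone. This is not SRELL's implementation: replace_with_callback and decode_percent are hypothetical names, the void * parameter is omitted, and the sketch assumes that the replacement text produced by a call that returns false is still applied (the text above does not say).

```cpp
#include <cstdlib>
#include <regex>
#include <string>

//  Hypothetical emulation of callback-based replacement using std::regex.
//  cb(out, m) fills out with the replacement for match m; returning false
//  stops further calls. Assumption: the text of that last call is still used.
template <typename Callback>
void replace_with_callback(std::string &s, const std::regex &re, Callback cb)
{
    const std::string &src = s;
    std::string result;
    std::string::const_iterator last = src.begin();
    const std::sregex_iterator end;

    for (std::sregex_iterator it(src.begin(), src.end(), re); it != end; ++it) {
        const std::smatch &m = *it;
        result.append(last, m[0].first);    //  Text between matches is kept.

        std::string replacement;
        const bool keep_going = cb(replacement, m);
        result += replacement;
        last = m[0].second;

        if (!keep_going)
            break;                          //  No more calls; the rest is copied as is.
    }
    result.append(last, src.end());
    s.swap(result);
}

//  Example callback: decodes one %XX sequence into a single byte.
bool decode_percent(std::string &out, const std::smatch &m)
{
    out.push_back(static_cast<char>(std::strtoul(m[1].str().c_str(), NULL, 16)));
    return true;
}
```

Because the callback is a template parameter here, a lambda can be passed without naming a match_results type; the sketch sidesteps the deduction issue that the function-pointer signature of replace() has.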

//  Replacement by callback function.
//  Decode of percent encoding.
#include <cstdio>
#include <cstdlib>  //  std::strtoul
#include <string>
#include "srell.hpp"

bool repfunc(std::string &out, const srell::u8cmatch &m, void *) {
out.push_back(std::strtoul(m[1].str().c_str(), NULL, 16));
return true;
}

int main() {
const srell::regex re("%([0-9A-Fa-f]{2})");
std::string c14("%E3%81%82%E3%81%84%E3%81%86%E3%81%88%E3%81%8A");
std::string c9803(c14), c11(c14);

re.replace(c9803, repfunc);  //  C++98/03

re.template replace<srell::smatch>(c11, [](std::string &out, const srell::smatch &m, void *) -> bool {  //  C++11
     out.push_back(std::strtoul(m[1].str().c_str(), NULL, 16));
     return true;
});

re.template replace<srell::smatch>(c14, [](auto &out, const auto &m, auto) -> bool {  //  C++14 and later.
     out.push_back(std::strtoul(m[1].str().c_str(), NULL, 16));
     return true;
});

std::printf("Result(C++98/03): %s\n", c9803.c_str());
std::printf("Result(C++11): %s\n", c11.c_str());
std::printf("Result(C++14-): %s\n", c14.c_str());
return 0;
}
---- output ----
Result(C++98/03): あいうえお
Result(C++11): あいうえお
Result(C++14-): あいうえお
				

When a lambda expression is used instead of a pointer to a callback function, the match_results<RandomAccessIterator, Alloc> type to be passed as the second parameter of the lambda must be specified explicitly as a template argument; otherwise, deduction of the template parameters of match_results fails.

Note
  • Until version 4.012, creplace(), which always passes the *cmatch family, and sreplace(), which always passes the *smatch family, were provided. They were removed in 4.013 to simplify the complicated overloads of replace().
  • Until version 4.010, two types had to be specified explicitly as template arguments, in this order: 1) the custom std::basic_string<charT, ST, SA> type itself, and 2) the match_results<RandomAccessIterator, Alloc> type to be passed to the callback function.

str_clip()

This is a template class in namespace srell, not a member of basic_regex, but it is explained here because it is intended to be used with replace().

The template class str_clip() was added in SRELL 4.011. It is a utility that limits the range of a string within which replace() searches and replaces.

//  Example of str_clip().
#include <cstdio>
#include <string>
#include "srell.hpp"

int main() {
const srell::regex re(".");
std::string text("0123456789ABCDEF");

srell::str_clip<std::string> ctext(text);
//  As a template argument, specifies the type (std::string)
//  of an object (text) to be assigned with str_clip.

//  Clipping by pos and count pair: From offset 4, 6 elements.
re.replace(ctext.clip(4, 6), "x", true);
std::printf("By pos&count: %s\n", text.c_str());  //  "0123xxxxxxABCDEF"

//  Clipping by iterator pair.
re.replace(ctext.clip(text.begin() + 6, text.end() - 6), "y", true);
std::printf("By iterators: %s\n", text.c_str());  //  "0123xxyyyyABCDEF"

re.template replace<srell::cmatch>(ctext.clip(6, 2), [](std::string &out, const srell::cmatch &, void *) {
    out = "Zz";
    return true;
});
std::printf("By lambda: %s\n", text.c_str());  //  "0123xxZzZzyyABCDEF"

return 0;
}
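The idea of replacing only inside a clipped range can be approximated with the standard library as follows. This is a sketch under assumptions, not how str_clip is implemented: replace_in_range is a hypothetical helper that copies the clipped part, replaces all matches in the copy, and splices the result back, so the pattern cannot see any text outside the clip.

```cpp
#include <cstddef>
#include <regex>
#include <string>

//  Hypothetical helper: replaces all matches of re, but only inside
//  [pos, pos + count) of s. The rest of s is left untouched.
void replace_in_range(std::string &s, const std::size_t pos, const std::size_t count,
                      const std::regex &re, const std::string &fmt)
{
    //  substr() throws std::out_of_range if pos > s.size() and shortens
    //  count when the range would run past the end of s.
    const std::string clipped = s.substr(pos, count);
    const std::string replaced = std::regex_replace(clipped, re, fmt);

    //  The replaced text may be longer or shorter than the clipped part.
    s.replace(pos, clipped.size(), replaced);
}
```

Applied to "0123456789ABCDEF" with the pattern "." and the calls (4, 6, "x"), (6, 4, "y") and (6, 2, "Zz") in this order, the helper produces the same three strings as the comments in the str_clip() example above.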
				

split() const

split() splits a string at every subsequence that matches the regular expression, stores the position range of each resulting substring in an instance of sub_match, and pushes it to the list container (vector, list, etc.) passed by reference as the first parameter.

Except for the following modification, it behaves like String.prototype.split(regexp-object, limit) of ECMAScript:

  • When limit, the maximum number of substrings pushed to the passed list container, is explicitly specified, split() behaves in accordance with the ECMAScript specification for the first limit-1 pushes, and for the last one pushes the whole remainder of the string (the part that has not been searched yet) to the container.

Although limiting the number of splits is not a rare feature, JavaScript's version is a bit peculiar: when the number of splits reaches limit, split() throws away the remainder of the string that has not been searched yet and does not push it to the list container. As I personally find this behaviour unpleasant, the modification above has been applied.
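The difference between the two limit behaviours can be sketched with std::regex. Both functions below are hypothetical helpers, not SRELL APIs; they return strings instead of sub_match objects, and they assume a separator that never matches an empty string and contains no capturing brackets. Returning an empty result for limit == 0 is likewise an assumption.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

//  Modified behaviour described above: the last element holds the whole
//  remainder of the string.
std::vector<std::string> split_keep_rest(
    const std::string &s, const std::regex &sep,
    const std::size_t limit = static_cast<std::size_t>(-1))
{
    std::vector<std::string> parts;
    if (limit == 0)
        return parts;

    std::string::const_iterator pos = s.begin();
    const std::sregex_iterator end;

    //  Stops splitting when only one slot is left for the remainder.
    for (std::sregex_iterator it(s.begin(), s.end(), sep);
         it != end && parts.size() + 1 < limit; ++it) {
        parts.push_back(std::string(pos, (*it)[0].first));
        pos = (*it)[0].second;
    }
    parts.push_back(std::string(pos, s.end()));
    return parts;
}

//  JavaScript-like behaviour: everything after the limit-th element is dropped.
std::vector<std::string> split_drop_rest(
    const std::string &s, const std::regex &sep, const std::size_t limit)
{
    std::vector<std::string> parts = split_keep_rest(s, sep);
    if (parts.size() > limit)
        parts.resize(limit);
    return parts;
}
```

For "01:23:45" split at ":" with limit 2, split_keep_rest() gives "01" and "23:45", whereas split_drop_rest() gives "01" and "23".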

template <typename container, typename ST, typename SA>
void split(
container &c,
const std::basic_string<charT, ST, SA> &s,
const std::size_t limit = static_cast<std::size_t>(-1)) const;

//  The following two are available since version 4.011.
template <typename container, typename BidirectionalIterator>
void split(
container &c,
const BidirectionalIterator begin,  //  The same as or convertible to container::value_type::iterator.
const BidirectionalIterator end,
const std::size_t limit = static_cast<std::size_t>(-1)) const;

template <typename container>
void split(
container &c,
const charT *const str,
const std::size_t limit = static_cast<std::size_t>(-1)) const;
			

For c, any container type can be used to receive results if it has push_back() as a member function.

If a regular expression contains capturing round brackets, the substring captured by each pair is also pushed into the list container. Even when a pair of brackets does not capture anything, pushing is not skipped; an empty string is pushed instead.

#include <cstdio>
#include <string>
#include <vector>
#include "srell.hpp"

template <typename Container>
void print(const Container &c) {
for (typename Container::size_type i = 0; i < c.size(); ++i)
    std::printf("%s\"%s\"", i == 0 ? "{ " : ", ", c[i].str().c_str());
std::puts(" }");
}

int main() {
std::string text("01:23:45");
srell::regex re(":");
std::vector<srell::csub_match> res;  //  Or srell::ssub_match.

re.split(res, text);    //  Unlimited splitting.
print(res);     //  { "01", "23", "45" }

res.clear();    //  Note: split() does not call clear()
re.split(res, text, 2); //  Splits into two.
print(res);     //  { "01", "23:45" }
                //  split() of JavaScript returns { "01", "23" }

re.assign("(?<=(\\d?)):(?=(\\d?))");  //  Captures a string before and after ':'
res.clear();
re.split(res, text);
print(res);     //  { "01", "1", "2", "23", "3", "4", "45" }

text.assign("caf\xC3\xA9");     //  "café"
re.assign("");

res.clear();
re.split(res, text);    //  Splits by element of char.
print(res);     //  { "c", "a", "f", "\xC3", "\xA9" }

srell::u8cregex u8re("");
res.clear();
u8re.split(res, text);  //  Splits by character of UTF-8.
print(res);     //  { "c", "a", "f", "é" }

return 0;
}
			

Algorithms (regex_search)

In SRELL 2.600-4.064, there was the following overload as an extension API:

template <class BidirectionalIterator, class charT, class traits>
bool regex_search(
    BidirectionalIterator first,
    BidirectionalIterator last,
    BidirectionalIterator lookbehind_limit,
    const basic_regex<charT, traits> &e,
    const regex_constants::match_flag_type flags = regex_constants::match_default);
		

As I personally never use the match/search overloads that do not take match_results as a parameter, I removed it to simplify overloading.
There is no plan to remove the other three-iterator overload, which does take match_results as a parameter, nor the APIs that are compatible with std::regex.

match_lblim_avail flag and match_results.lookbehind_limit member

SRELL versions 2.300-2.500 had the following extension:

If the match_lblim_avail flag option is set, then when a lookbehind assertion is performed, the lookbehind_limit member of the match_results instance passed to any of the regular expression algorithms is treated as the limit of the sequence up to which the algorithm can look behind.

const char text[] = "0123456789abcdefghijklmnopqrstuvwxyz";
const char* const begin = text;
const char* const end = text + std::strlen(text);
const char* const first = text + 10;    //  Sets the position of 'a'.
const srell::regex re("(?<=^\\d+).");
srell::cmatch match;

match.lookbehind_limit = begin;

std::printf("matched %d\n", srell::regex_search(first, end, match, re));
    //  Does not match as lookbehind is performed only in the range [first, end).

std::printf("matched %d\n", srell::regex_search(first, end, match, re, srell::regex_constants::match_lblim_avail));
    //  Matches because regex_search() is allowed to lookbehind until match.lookbehind_limit.
    //  I.e., when match_lblim_avail is specified, searching against the sequence
    //  [match.lookbehind_limit, end) begins at first.
		

As shown in the example above, when match_lblim_avail is specified, ^ matches at match.lookbehind_limit instead of at first.

In SRELL 2.600 and later, the limit position up to which regex_search() is allowed to look behind can be specified as an argument passed to the function. Thus, the mechanism described above was removed.
The reason why the "three iterators" way introduced in 2.600 was not chosen at first is that there are two possible ways to order the arguments, as follows:

//  Option 1
bool regex_search(
    BidirectionalIterator first,
    BidirectionalIterator last,
    BidirectionalIterator lookbehind_limit,
    match_results<BidirectionalIterator, Allocator>& m,
    const basic_regex<charT, traits>& e,
    regex_constants::match_flag_type flags = regex_constants::match_default);

//  Option 2
bool regex_search(
    BidirectionalIterator lookbehind_limit,
    BidirectionalIterator first,
    BidirectionalIterator last,
    match_results<BidirectionalIterator, Allocator>& m,
    const basic_regex<charT, traits>& e,
    regex_constants::match_flag_type flags = regex_constants::match_default);
		

In Option 1, the limit of lookbehind is passed in addition to an ordinary [first, last) range. In Option 2, the three iterators are sorted in ascending order.

In both options, the orders of the parameter types are exactly the same, so C++ compilers cannot distinguish them. If SRELL adopted Option 1 and regex_search() of the C++ Standard later adopted Option 2, there would be an incompatibility between std::regex and srell::regex that does not cause a compilation error.
Once this happened, fixing it would not be easy. If the order of the parameters in SRELL were rearranged to agree with the C++ Standard, it would be a breaking change that users could not easily notice, because there would be no compilation error. Whether it were fixed or not, SRELL would have a troublesome problem.
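The silent incompatibility can be illustrated without any regex library. The two functions below are hypothetical stand-ins for the two parameter orders; they do no searching and only report how long the callee believes the range [first, last) to be.

```cpp
#include <cstddef>

//  Stand-in for Option 1: (first, last, lookbehind_limit).
std::ptrdiff_t search_length_opt1(const char *first, const char *last,
                                  const char *lookbehind_limit)
{
    (void)lookbehind_limit;     //  Unused in this sketch.
    return last - first;
}

//  Stand-in for Option 2: (lookbehind_limit, first, last).
std::ptrdiff_t search_length_opt2(const char *lookbehind_limit, const char *first,
                                  const char *last)
{
    (void)lookbehind_limit;     //  Unused in this sketch.
    return last - first;
}
```

A call written for Option 1, such as f(text + 10, end, text), compiles against either declaration because all three arguments have the same type. Under Option 2, however, the callee takes end as first and text as last, so the range it sees is reversed and no diagnostic is issued.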

Thus, I submitted a proposal to ask whether the C++ Committee had any interest in enhancing <regex>. At that time, however, it was not rare for a proposal document to go undiscussed for a long time, and I did not want to leave this problem (not being able to specify the lookbehind limit position) unsettled. So I implemented match_results::lookbehind_limit and the match_lblim_avail flag mentioned above in SRELL as a temporary measure until the outcome of the proposal became clear. Because I intentionally chose an unpolished API, it was very unlikely that this addition would conflict with a future enhancement of C++'s match_results.

As it turned out that the C++ committee no longer has any intention to improve or to enhance <regex>, I implemented Option 1 in SRELL 2.600 and removed the temporary means.

u8-prefix versus u8c-prefix

Regardless of the C++ version, "u8-" prefix means that "This class can handle u8"..." string literals", whereas "u8c-" prefix means that "This class handles a sequence of the char type as a UTF-8 string". Until C++17, this distinction was not necessary because the type of u8"..." was const char[N].

However, char8_t was introduced in C++20, and the type of u8"..." was changed to const char8_t[N]. Because of this, SRELL needed to distinguish a UTF-8 sequence based on char8_t from a traditional one based on char.
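The change in the literal's type can be checked at compile time. The following sketch uses only the standard library; the names u8_literal_char_type and u8_literal_is_plain_char are made up for illustration.

```cpp
#include <type_traits>

//  Element type of the array that u8"a" denotes, with the reference, the
//  array extent and const stripped: char up to C++17, char8_t since C++20.
typedef std::remove_const<
    std::remove_extent<
        std::remove_reference<decltype(u8"a")>::type
    >::type
>::type u8_literal_char_type;

//  True when u8"..." is still a sequence of plain char.
constexpr bool u8_literal_is_plain_char()
{
    return std::is_same<u8_literal_char_type, char>::value;
}
```

A compiler that defines the feature-test macro __cpp_char8_t makes this function return false; otherwise it returns true.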

Normally, a new prefix would be given to the newly introduced specialisations that handle a UTF-8 sequence based on char8_t, but the C++ standard library is expected to use the "u8-" prefix for specialisations for the char8_t type, as std::u8string already does.
Thus, to avoid inconsistency with the naming convention of the standard library, SRELL introduced the "u8c-" prefix in version 2.100 and has used it since then to mean "This class handles a sequence of char as a UTF-8 string".

List of classes whose prefix has been changed from u8- to u8c-

  • basic_regex: u8cregex
  • match_results: u8ccmatch, u8csmatch
  • sub_match: u8ccsub_match, u8cssub_match
  • regex_iterator: u8ccregex_iterator, u8csregex_iterator
  • regex_token_iterator: u8ccregex_token_iterator, u8csregex_token_iterator

The freed "u8-" prefix is now associated with the char8_t type when SRELL is compiled as C++20 or later. When SRELL is compiled as C++17 or earlier, the classes with the "u8-" prefix are just aliases (typedefs) of the corresponding classes with the "u8c-" prefix.
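A naming scheme of this kind could be set up roughly as follows. This is only an illustration under assumptions, not SRELL's actual source: the namespace sketch, the class template utf8_regex and the use of __cpp_char8_t as the switch are all made up here.

```cpp
#include <type_traits>

namespace sketch
{
    //  Placeholder for a regex class that treats CharT sequences as UTF-8.
    template <typename CharT>
    struct utf8_regex
    {
        typedef CharT char_type;
    };

    //  "u8c-" always means char treated as UTF-8.
    typedef utf8_regex<char> u8cregex;

#if defined(__cpp_char8_t)
    //  With char8_t available, "u8-" is a distinct specialisation.
    typedef utf8_regex<char8_t> u8regex;
#else
    //  Without char8_t, "u8-" is just an alias of "u8c-".
    typedef u8cregex u8regex;
#endif
}

//  True when the "u8-" name is only an alias of the "u8c-" name.
constexpr bool u8_is_alias_of_u8c()
{
    return std::is_same<sketch::u8regex, sketch::u8cregex>::value;
}
```

Code that must keep working with char-based UTF-8 strings regardless of the language version would use the "u8c-" name in such a scheme.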

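u8- and u8c-

Prefix | SRELL -2.002                        | SRELL 2.100-, until C++17           | SRELL 2.100-, since C++20
u8-    | handles a sequence of char as UTF-8 | handles a sequence of char as UTF-8 | handles a sequence of char8_t as UTF-8
u8c-   | (Prefix did not exist)              | handles a sequence of char as UTF-8 | handles a sequence of char as UTF-8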

Lookbehind

SRELL version 1.nnn supported the regular expressions defined in ECMAScript 2017 (ES8) Specification 21.2 RegExp (Regular Expression) Objects plus fixed-length lookbehind assertions.

For the following reason, SRELL version 1.nnn behaved differently from SRELL version 2.000 and later when a lookbehind assertion was used.

TC39, which maintains the ECMAScript standard, adopted variable-length lookbehind assertions for its RegExp instead of the fixed-length ones supported by many scripting languages, such as Perl 5 and Python. At first glance, the former may seem to be just a superset of the latter, but the two in fact return different results in some cases:

"abcd" =~ /(?<=(.){2})./
//  Fixed-length lookbehind: ["c", "b"].
//  As the automaton runs from left to right even in a lookbehind assertion,
//  "b" just before "c" is the last one that $1 captured.

//  Variable-length lookbehind: ["c", "a"].
//  As the automaton runs from right to left in a lookbehind assertion,
//  "a", two positions before "c", is the last one that $1 captured.
		

Whereas SRELL 1 supported fixed-length lookbehind assertions as an extension, SRELL 2.000 and later support variable-length lookbehind assertions, following the enhancement of JavaScript's RegExp. Thus, there was a breaking change between SRELL 1.401 and SRELL 2.000.
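The difference in what $1 holds in the "abcd" example above comes only from the direction in which the repeated group consumes characters. The toy model below is not a regex engine; last_capture is a hypothetical function that walks over the two characters covered by (.){2} in either direction and returns the one captured last.

```cpp
#include <string>

//  Toy model of (.){2} inside a lookbehind: each step overwrites the capture,
//  so the result is the character visited last in the chosen direction.
char last_capture(const std::string &window, const bool right_to_left)
{
    char captured = '\0';

    if (right_to_left) {
        //  Variable-length lookbehind as adopted by ECMAScript: runs backwards.
        for (std::string::const_reverse_iterator it = window.rbegin();
             it != window.rend(); ++it)
            captured = *it;
    } else {
        //  Fixed-length lookbehind: runs forwards, as outside the assertion.
        for (std::string::const_iterator it = window.begin();
             it != window.end(); ++it)
            captured = *it;
    }
    return captured;
}
```

For the window "ab" that precedes "c", the forward walk ends on 'b' and the backward walk ends on 'a', which corresponds to the two results listed in the example.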