C++ Regular Expressions Functions

regex_match

regex_match returns true and only if the entire input sequence [start, end) is matched. For example, the output of

void test_regex_match(const std::string& subject, const std::string& re_str)
{
  regex re{re_str};

  cout << "Does subject of: '" << subject << "'. Match the regex of '" << re_str << "'" << endl;

  string msg { regex_match(subject, re) ? "yes" : "no" };

  cout << "Answer: " << msg << endl;
}

regex re{string{R"(\d\d/\d\d/\d\d\d\d)"}}; // four digit month/day/year date

test_regex_match(string{"5/31/2000"}, R"(\d\d/\d\d/\d\d\d\d)"); // four digit month/day/year date

test_regex_match(string{"05/31/2000"}, R"(\d\d/\d\d/\d\d\d\d)"); // four digit month/day/year date

is

Does subject of: '5/31/2000'. Match the regex of '\d\d/\d\d/\d\d\d\d'
Answer: no
Does subject of: '05/31/2000'. Match the regex of '\d\d/\d\d/\d\d\d\d'
Answer: yes

regex_search and smatch

While regex_match returns true only if the regular expression matches the entire input sequence, regex_search will succeed even if only a sub-sequence matches the regular expression. And regex_search will return both the matched regex and any submatches, any captures, within the regex. Given code such as

void test_regex_search()
{
  // [:w:] stands for "word character". It matches the ASCII characters [A-Za-z0-9_].
  // This regex matches a
  regex re{R"(([[:w:]\.]+)@([[:w:]]+)\.com)"};

  smatch m;

  string s{R"(<prefix>[email protected]<suffix>[email protected]|[email protected])"};

  if (regex_search(s, m, re)) {

      for (auto i = 0; i < m.size(); ++i) {

          cout << "The " << i << "th submatch using m[i].str()     is: " <<  m[i].str() << endl;
          cout << "The " << i << "th submatch using m.str(i)       is: " <<  m.str(i) << endl;
          cout << "The " << i << "th submatch using *(begin() + n) is: " <<  *(m.begin() + i) << endl;
      }

      cout << "m.prefix().str() = " << m.prefix().str() << endl;

      cout << "m.suffix().str() = " << m.suffix().str() << endl;
  }
}

the first “submatch”, m[0].str() and m.str(0), is always the match of the entire regex , while is the first submatch is returned in m[1].str(). Submatches can be returned various ways that are all equivalent to one another

m[i].str()
m.str(i)
*(m.begin() + i)

In addition to returning submatches, regex_search returns everything before the matched expression in m.prefix().str(), and everything after the matched expression in m.suffix().str(). Therefore m.prefix().str() contains <prefix> and m.suffix().cstr() contains <suffix>.

In the code above regex_search() only found the first email address.

You can use regex_search() in a loop, but you must supply m.suffix().str() as the new input each time through the loop:

void test_regex_search2()  // See https://www.youtube.com/watch?v=nkjUpUu3dFk
{
  regex re{R"(([[:w:]\.]+)@([[:w:]]+)\.com)"};

  smatch m;

  string s{R"(<prefix>[email protected]<suffix>[email protected]|[email protected])"};

  for(;(regex_search(s, m, re) ); s = m.suffix().str())  {

      for (auto i = 0; i < m.size(); ++i) {

          cout << "The " << i << "th submatch using m[i].str()     is: " <<  m[i].str() << endl;
          cout << "The " << i << "th submatch using m.str(i)       is: " <<  m.str(i) << endl;
          cout << "The " << i << "th submatch using *(begin() + n) is: " <<  *(m.begin() + i) << endl;
      }

      cout << "m.prefix().str() = " << m.prefix().str() << endl;

      cout << "m.suffix().str() = " << m.suffix().str() << endl;
    }
}

The output is

The 0th submatch using m[i].str()     is: [email protected]
The 0th submatch using m.str(i)       is: [email protected]
The 0th submatch using *(begin() + n) is: [email protected]
The 1th submatch using m[i].str()     is: kurt.krueckeberg
The 1th submatch using m.str(i)       is: kurt.krueckeberg
The 1th submatch using *(begin() + n) is: kurt.krueckeberg
The 2th submatch using m[i].str()     is: gmail
The 2th submatch using m.str(i)       is: gmail
The 2th submatch using *(begin() + n) is: gmail
m.prefix().str() = 
m.suffix().str() = [email protected]|[email protected]
The 0th submatch using m[i].str()     is: [email protected]
The 0th submatch using m.str(i)       is: [email protected]
The 0th submatch using *(begin() + n) is: [email protected]
The 1th submatch using m[i].str()     is: joe.smith
The 1th submatch using m.str(i)       is: joe.smith
The 1th submatch using *(begin() + n) is: joe.smith
The 2th submatch using m[i].str()     is: aol
The 2th submatch using m.str(i)       is: aol
The 2th submatch using *(begin() + n) is: aol
m.prefix().str() = 
m.suffix().str() = |[email protected]

regex iterators

regex_iterator

If we change s in the previous code above to be

string s{R"(<prefix1>[email protected]<suffix1>  <prefix2>[email protected]<suffix2>)"};

smatch m;

auto found = regex_search(s, m, re);

regex_search() will only find the first occurance in the string. To do repeated, iterative searching involving regex_search(), we must use the while loop idiom shown in test_regex_search2():

while(regex_search(s, m, re) ) {

  for (auto& x : m) {

     // handle submatches here...
  }
  s = m.suffix().str();
}

regex_iterator is a more natural alternative:

void test_regex_iterator()
{
  string s{R"(<prefix1>[email protected]<suffix1>  <prefix2>[email protected]<suffix2>)"};

  regex re{R"(([[:w:]\.]+)@([[:w:]]+)\.com)"};

  sregex_iterator re_iter(s.begin(), s.end(), re);

  sregex_iterator re_end;

  for (; re_iter != re_end; ++re_iter) {

      cout << "The 0th match using re_iter->str(0)\t is: " <<  re_iter->str(0) << " or the entire matched expression." << endl;
      cout << "The 1th submatch using re_iter->str(1)  is: " <<  re_iter->str(1) << endl;
      cout << "The 2th submatch using re_iter->str(2)  is: " <<  re_iter->str(2) << endl;

      cout << "The 2th submatch using re_iter->prefix()  is: " <<  re_iter->prefix() << endl;
      cout << "The 2th submatch using re_iter->suffix()  is: " <<  re_iter->suffix() << endl;
  }
}

whose output is:

The 0th match using re_iter->str(0)      is: [email protected] or the entire matched expression.
The 1th submatch using re_iter->str(1)  is: kurt.krueckeberg
The 2th submatch using re_iter->str(2)  is: gmail
The 2th submatch using re_iter->prefix()  is: 
The 2th submatch using re_iter->suffix()  is:   [email protected]
The 0th match using re_iter->str(0)      is: [email protected] or the entire matched expression.
The 1th submatch using re_iter->str(1)  is: kathafalk
The 2th submatch using re_iter->str(2)  is: yahoo
The 2th submatch using re_iter->prefix()  is:   
The 2th submatch using re_iter->suffix()  is: 

regex_token_iterator

The other type of regex iterator is regex_token_iterator. regex_iterator points to matched results. While regex_token_iterator points to submatches. Its str() method, unlike regex_iterator, cannot take a index. Thus regex_iter->str(i) works for regex_iterator objects, but for regex_token_iterator, we are limited to regex_iter->str(). For exmaple

void test_regex_token_iterator()
{
  string s{R"(<prefix1>[email protected]<suffix1>  <prefix2>[email protected]<suffix2>)"};

  regex re{R"(([[:w:]\.]+)@([[:w:]]+)\.com)"};

  sregex_token_iterator re_iter(s.begin(), s.end(), re);

  sregex_token_iterator re_end;

  for (; re_iter != re_end; ++re_iter) {

      cout << "re_iter->str() is: " <<  re_iter->str() << endl;
  }
}

whose output is:

[ADD OUTPUT HERE]

In addition regex_token_iterator can take a 4th parameter to control the way the matched result is returned. It can be either an int, indicating the submatch to return, or an initializer list of ints, indicating the set of submatches to return. For example:

void run_regex_token_iterator()
{
  string s {"this subject has a submarine as a subsequence"};

  cout << "\nSubject: " << s << "\n";

  try {

      cout << "\nregex: " << R"(\b(sub)([^ ]*))" << "\n" << endl;

      regex re {R"(\b(sub)([^ ]*))"};

      sregex_token_iterator rend;

      string subject ("This subject has a submarine as a subsequence");

      show_regex( sregex_token_iterator{subject.begin(), subject.end(), re}, rend, "entire matchesubject");

      show_regex( sregex_token_iterator{subject.begin(), subject.end(), re, 0}, rend, "0 returns entire matched result ");

      show_regex( sregex_token_iterator{subject.begin(), subject.end(), re, 2}, rend, "2 returns 2nd submatch");

      show_regex( sregex_token_iterator{subject.begin(), subject.end(), re, {1, 2}}, rend, "{1, 2} returns 1st and 2nd submatches");

      show_regex( sregex_token_iterator{subject.begin(), subject.end(), re, -1}, rend, "-1 returns nonmatched text");

      show_regex( sregex_token_iterator{subject.begin(), subject.end(), re, {-1, 0}}, rend, "{-1, 0} returns nonmatched text followed by entire match");

      show_regex( sregex_token_iterator{subject.begin(), subject.end(), re, {-1, 0, 1}}, rend, "{-1, 0, 1} matchesubject");

  } catch (exception& e) {

      cout << "Exception thrown. \n" << e.what() << endl;
  }
}

whose output is:

[ADD OUTPUT HERE]

A regex replace callback

sregex_iterator can be used together with the methods of smatch to do conditional replacement using a callback.

This code example capitalizes all titles–mr, ms, mrs, and dr–as well as ensures that all sentences begin with an uppercase word. It does his by initially putting all the text in lowercase–since it may be all capitalized. Then, it uses a lamda function that takes a const smatch& to capitalize the proper letter. It relies on the smatch position(int) method to do so:

// Initially ensure all text is all lowercase.
for_each(s.begin(), s.end(), [&](char& c) { \
                                 c = tolower(c, locale()); }\
);

// Capitalize titles and the personal pronoun i
regex re_titles{ R"(\b(?:dr)|(?:mr)|(?:ms)|(?:mrs)|(?:i)\b)" };

sregex_iterator titles_iter(s.begin(), s.end(), re_titles);

sregex_iterator titles_end;

int index = 0;

auto lambda_toupper = [&](const smatch& sm) {

     auto pos = sm.position(index);

     s[pos] = toupper(s[pos], std::locale());
};

for_each(titles_iter, titles_end, lambda_toupper);

// Capitalize first word of sentences...

// ...Do first character manually
s[0] = toupper(s[0], locale());

regex re_ucfirst{R"((?:\.|\?|!)\s+([a-z]))"};

sregex_iterator ucfirst_iter(s.begin(), s.end(), re_ucfirst);

sregex_iterator ucfirst_end;

index = 1;

// Do remaining sentences using regex and lambda_toupper function
for_each(ucfirst_iter, ucfirst_end, lambda_toupper);

return;