New flags

About flags

XRegExp provides four new flags (n, s, x, A), which can be combined with native flags and arranged in any order. Unlike native flags, non-native flags do not show up as properties on regular expression objects.

Explicit capture (n)

Specifies that the only valid captures are explicitly named groups of the form (?<name>…). This allows unnamed (…) parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…).
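
The effect on capture numbering can be sketched with native named groups (ES2018) instead of XRegExp itself. This assumes that, under flag n, an unnamed (…) behaves exactly like (?:…); the sample pattern and input string are made up for illustration.

```javascript
// Without n: both groups capture, so the named group is capture 2.
const plain = /(\d+)-(?<year>\d{4})/.exec('12-2024');

// Presumed equivalent of XRegExp('(\\d+)-(?<year>\\d{4})', 'n'):
// the unnamed group is written out as (?:\d+), so only the named group captures.
const explicit = /(?:\d+)-(?<year>\d{4})/.exec('12-2024');

console.log(plain.length, plain[1], plain[2]);                   // 3 12 2024
console.log(explicit.length, explicit[1], explicit.groups.year); // 2 2024 2024
```

With only named groups capturing, adding or removing grouping parentheses elsewhere in the pattern does not shift the numbers of the captures you care about.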

Dot matches all (s)

Usually, a dot does not match newlines. However, a mode in which dots match any code unit (including newlines) can be as useful as one where dots don't. The s flag allows the mode to be selected on a per-regex basis. Escaped dots (\.) and dots within character classes ([.]) are always equivalent to literal dots. The newline code points are U+000A (line feed), U+000D (carriage return), U+2028 (line separator), and U+2029 (paragraph separator).

Annotations

When using XRegExp's Unicode Properties addon, you can match any code point without using the s flag via \p{Any}.
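
The dot behavior described above can be checked with native regexes. This is a minimal sketch that assumes an ES2018 engine, whose native s flag and \p{Any} property escape are used here as stand-ins for XRegExp's s flag and Unicode Properties addon; the variable names are illustrative only.

```javascript
// The four code points that a native dot skips without the s flag.
const newlines = ['\n', '\r', '\u2028', '\u2029'];

const dotDefault = newlines.map((ch) => /^.$/.test(ch));
const dotAll = newlines.map((ch) => /^.$/s.test(ch));
const anyProperty = newlines.map((ch) => /^\p{Any}$/u.test(ch));

// Escaped dots and dots in character classes stay literal, even with s.
const literalDots = [/^\.$/s.test('\n'), /^[.]$/s.test('\n')];

console.log(dotDefault);  // [ false, false, false, false ]
console.log(dotAll);      // [ true, true, true, true ]
console.log(anyProperty); // [ true, true, true, true ]
console.log(literalDots); // [ false, false ]
```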

Free-spacing and line comments (x)

This flag has two complementary effects. First, it causes most whitespace to be ignored, so you can free-format the regex pattern for readability. Second, it allows comments with a leading #. Specifically, it turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within character classes (which means that classes are not free-format, even with x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace.

It might be better to think of whitespace and comments as do-nothing (rather than ignore-me) metacharacters. This distinction is important with something like \12 3, which with the x flag is taken as \12 followed by 3, and not \123. However, quantifiers following whitespace or comments apply to the preceding token, so x + is equivalent to x+.
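
To illustrate the do-nothing idea, here is a rough sketch of a free-spacing preprocessor for native regexes. It is a simplified approximation written for this page, not XRegExp's implementation: the helper name freeSpacing is hypothetical, it emits an empty group (?:) wherever whitespace or a comment was removed (unless a quantifier follows), and it ignores edge cases such as whitespace inside (?…) constructs.

```javascript
// Hypothetical helper: approximates the x flag for a native RegExp source string.
function freeSpacing(pattern) {
  const quantifierStart = new Set(['*', '+', '?', '{']);
  const isIgnorable = (ch) => ch === '#' || /\s/.test(ch);
  let out = '';
  let inClass = false;
  let i = 0;
  while (i < pattern.length) {
    const ch = pattern[i];
    if (ch === '\\') {
      // Escapes (including escaped whitespace and \#) are copied verbatim.
      out += pattern.slice(i, i + 2);
      i += 2;
    } else if (inClass) {
      // Character classes are not free-format.
      if (ch === ']') inClass = false;
      out += ch;
      i += 1;
    } else if (ch === '[') {
      inClass = true;
      out += ch;
      i += 1;
    } else if (isIgnorable(ch)) {
      // Swallow a whole run of whitespace and line comments.
      while (i < pattern.length && isIgnorable(pattern[i])) {
        if (pattern[i] === '#') {
          while (i < pattern.length && pattern[i] !== '\n') i += 1;
        } else {
          i += 1;
        }
      }
      // Leave a do-nothing token behind, unless a quantifier follows.
      if (!quantifierStart.has(pattern[i])) out += '(?:)';
    } else {
      out += ch;
      i += 1;
    }
  }
  return out;
}

console.log(freeSpacing('x +'));             // x+
console.log(freeSpacing(String.raw`\12 3`)); // \12(?:)3
console.log(freeSpacing('[ #]+'));           // [ #]+

const dateSource = freeSpacing(String.raw`
  (\d{4})  # year
  -        # separator
  (\d{2})  # month
`);
const dateMatch = new RegExp('^' + dateSource + '$').exec('2024-05');
console.log(dateMatch[1], dateMatch[2]);     // 2024 05
```

The empty group is what keeps \12 and 3 apart as separate tokens, while dropping it before a quantifier lets x + collapse to x+.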

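The ignored whitespace characters are those matched natively by \s. ES3 whitespace is based on Unicode 2.1.0 or later. ES5 whitespace is based on Unicode 3.0.0 or later, plus U+FEFF. Following are the code points that should be matched by \s according to ES5 and Unicode 4.0.1–6.1.0 (not yet updated for later versions): U+0009–U+000D (tab, line feed, vertical tab, form feed, carriage return), U+0020 (space), U+00A0 (no-break space), U+1680 (Ogham space mark), U+180E (Mongolian vowel separator), U+2000–U+200A (en quad through hair space), U+2028 (line separator), U+2029 (paragraph separator), U+202F (narrow no-break space), U+205F (medium mathematical space), U+3000 (ideographic space), and U+FEFF (zero width no-break space).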

Annotations

Unicode 1.1.5–4.0.0 assigned code point U+200B (ZWSP) to the Zs (Space separator) category, which means that some browsers or regex engines might include this additional code point in those matched by \s, etc. Unicode 4.0.1 moved ZWSP to the Cf (Format) category.

Unicode 1.1.5 assigned code point U+FEFF (ZWNBSP) to the Zs category. Unicode 2.0.14 moved ZWNBSP to the Cf category. ES5 explicitly includes ZWNBSP in its list of whitespace characters, even though this does not match any version of the Unicode standard since 1996.

U+180E (Mongolian vowel separator) was introduced in Unicode 3.0.0, which assigned it the Cf category. Unicode 4.0.0 moved it into the Zs category, and Unicode 6.3.0 moved it back to the Cf category.

JavaScript's \s is similar but not equivalent to \p{Z} (the Separator category) from regex libraries that support Unicode categories, including XRegExp's own Unicode Categories addon. The difference is that \s includes code points U+0009–U+000D and U+FEFF, which are not assigned the Separator category in the Unicode character database.

JavaScript's \s is nearly equivalent to \p{White_Space} from the Unicode Properties addon. The differences are: 1. \p{White_Space} does not include U+FEFF (ZWNBSP). 2. \p{White_Space} includes U+0085 (NEL), which is not assigned the Separator category in the Unicode character database.
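
The differences between \s, \p{Z}, and \p{White_Space} can be spot-checked on a few code points. This sketch assumes an ES2018 engine and uses its native Unicode property escapes (with the u flag) rather than XRegExp's addons, on the assumption that both classify these particular code points the same way.

```javascript
// Sample code points taken from the two comparisons above.
const samples = [
  ['U+0009 tab', '\u0009'],
  ['U+0085 NEL', '\u0085'],
  ['U+00A0 NBSP', '\u00A0'],
  ['U+FEFF ZWNBSP', '\uFEFF'],
];

const rows = samples.map(([name, ch]) => ({
  name,
  s: /\s/.test(ch),
  Z: /\p{Z}/u.test(ch),
  White_Space: /\p{White_Space}/u.test(ch),
}));

for (const row of rows) console.log(row);
// tab:    s true,  Z false, White_Space true
// NEL:    s false, Z false, White_Space true
// NBSP:   s true,  Z true,  White_Space true
// ZWNBSP: s true,  Z false, White_Space false
```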

Aside: Not all JavaScript regex syntax is Unicode-aware. According to JavaScript specs, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary. Many browsers get some of these details wrong.
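
The split between Unicode-aware and ASCII-only syntax can be observed directly. The checks below are a small sketch for a current spec-compliant engine such as Node.js; the sample characters are arbitrary choices, and older browsers may give different answers.

```javascript
const checks = {
  // ASCII-only: \d, \w, and \b ignore non-ASCII digits and letters.
  arabicIndicDigit: /\d/.test('\u0663'),      // false
  accentedLetter: /\w/.test('\u00E9'),        // false
  boundaryBeforeAccent: /caf\b/.test('caf\u00E9'), // true (é counts as a non-word character)
  // Unicode-based: \s, dot, and ^ recognize non-ASCII whitespace and newlines.
  emSpace: /\s/.test('\u2003'),               // true
  dotLineSeparator: /^.$/.test('\u2028'),     // false
  caretAfterParagraphSeparator: /^b/m.test('a\u2029b'), // true
};

console.log(checks);
```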

For more details, see JavaScript, Regex, and Unicode.

Astral (A)

Requires the Unicode Base addon.

By default, \p{…} and \P{…} support the Basic Multilingual Plane (i.e. code points up to U+FFFF). You can opt in to full 21-bit Unicode support (with code points up to U+10FFFF) on a per-regex basis by using flag A. In XRegExp, this is called astral mode. You can automatically add flag A for all new regexes by running XRegExp.install('astral'). When in astral mode, \p{…} and \P{…} always match a full code point rather than a code unit, using surrogate pairs for code points above U+FFFF.

// Using flag A to match astral code points
XRegExp('^\\pS$').test('💩'); // -> false
XRegExp('^\\pS$', 'A').test('💩'); // -> true
XRegExp('(?A)^\\pS$').test('💩'); // -> true
// Using surrogate pair U+D83D U+DCA9 to represent U+1F4A9 (pile of poo)
XRegExp('(?A)^\\pS$').test('\uD83D\uDCA9'); // -> true

// Implicit flag A
XRegExp.install('astral');
XRegExp('^\\pS$').test('💩'); // -> true
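
The surrogate pair used above can be derived by hand. The following sketch shows the standard UTF-16 arithmetic; the helper name toSurrogatePair is hypothetical, and the native u flag is used only as a rough analogue of astral mode to show the code unit versus code point distinction.

```javascript
// Hypothetical helper: split an astral code point into UTF-16 surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F4A9);
const hex = [high, low].map((n) => n.toString(16).toUpperCase());
const poo = String.fromCharCode(high, low);

console.log(hex);                 // [ 'D83D', 'DCA9' ]
console.log(poo === '\u{1F4A9}'); // true
console.log(poo.length);          // 2 (code units)
console.log([...poo].length);     // 1 (code point)
console.log(/^.$/.test(poo));     // false: dot matches one code unit
console.log(/^.$/u.test(poo));    // true: dot matches one code point
```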

Opting in to astral mode disables the use of \p{…} and \P{…} within character classes. In astral mode, use e.g. (\pL|[0-9_])+ instead of [\pL0-9_]+.

Annotations