New flags
About flags
XRegExp provides four new flags (n, s, x, A), which can be combined with native flags and arranged in any order. Unlike native flags, non-native flags do not show up as properties on regular expression objects.
- New flags
  - n: Explicit captures
  - s: Dot matches all (aka singleline mode). Added as a native flag in ES2018
  - x: Free-spacing and line comments (aka extended mode)
  - A: Astral (requires the Unicode Base addon)
- Native flags
  - g: All matches, or advance lastIndex after matches (global)
  - i: Case insensitive (ignoreCase)
  - m: ^ and $ match at newlines (multiline)
  - u: Handle surrogate pairs as code points and enable \u{…} (unicode). Requires native ES6 support
  - y: Matches must start at lastIndex (sticky). Requires Firefox 3+ or native ES6 support
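The native flags listed above can be checked with plain JavaScript, without XRegExp. The following is a minimal sketch that assumes an ES6-capable engine; the variable names are illustrative only.

```javascript
// g: a successful match advances lastIndex past the match.
const globalRe = /o/g;
globalRe.test('foo');                      // first 'o' is at index 1
const afterFirst = globalRe.lastIndex;     // 2

// y: the match must start exactly at lastIndex.
const stickyRe = /o/y;
const stickyAtZero = stickyRe.test('foo'); // false, index 0 holds 'f'
stickyRe.lastIndex = 1;
const stickyAtOne = stickyRe.test('foo');  // true, index 1 holds 'o'

// m: ^ and $ also match at line breaks.
const withM = /^b$/m.test('a\nb');         // true
const withoutM = /^b$/.test('a\nb');       // false

// u: \u{…} escapes and code point matching.
const codePoint = /^\u{1F4A9}$/u.test('\uD83D\uDCA9'); // true

// Native flags show up as properties on the regex object.
console.log(globalRe.global, stickyRe.sticky); // true true
```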
Explicit capture (n)
Specifies that the only valid captures are explicitly named groups of the form (?<name>…). This allows unnamed (…) parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…).
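To make the difference concrete, here is a small sketch in native JavaScript. It assumes an ES2018+ engine with native named groups. Native regexes have no n flag, so the noncapturing behavior is written out by hand with (?:…). Under the n flag, the plain (…) form is described above as behaving like the second pattern.

```javascript
// Unnamed parentheses capture by default: the month lands in slot 2.
const capturing = /(?<year>\d{4})-(\d{2})/.exec('2024-05');

// Hand-written noncapturing group: only the named group is captured.
const groupingOnly = /(?<year>\d{4})-(?:\d{2})/.exec('2024-05');

console.log(capturing.length, groupingOnly.length); // 3 2
console.log(groupingOnly.groups.year);              // 2024
```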
Annotations
- Rationale: Backreference capturing adds performance overhead and is needed far less often than simple grouping. The n flag frees the (…) syntax from its often-undesired capturing side effect, while still allowing explicitly-named capturing groups.
- Compatibility: No known problems; the n flag is illegal in native JavaScript regular expressions.
- Prior art: The n flag comes from .NET.
Dot matches all (s)
Usually, a dot does not match newlines. However, a mode in which dots match any code unit (including newlines) can be as useful as one where dots don't. The s flag allows the mode to be selected on a per-regex basis. Escaped dots (\.) and dots within character classes ([.]) are always equivalent to literal dots. The newline code points are as follows:
- U+000A: Line feed (\n)
- U+000D: Carriage return (\r)
- U+2028: Line separator
- U+2029: Paragraph separator
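The effect on these four code points can be observed with the native s flag. This is a sketch that assumes an ES2018+ engine and uses native regex literals rather than XRegExp. The array names are illustrative.

```javascript
// The four newline code points listed above.
const newlines = ['\n', '\r', '\u2028', '\u2029'];

// Without s, a dot matches none of them.
const withoutS = newlines.map((ch) => /^.$/.test(ch));

// With s, a dot matches all of them.
const withS = newlines.map((ch) => /^.$/s.test(ch));

// Escaped dots and dots in character classes stay literal in both modes.
const literalDot = /^\.$/s.test('\n') || /^[.]$/s.test('\n');

console.log(withoutS);   // [ false, false, false, false ]
console.log(withS);      // [ true, true, true, true ]
console.log(literalDot); // false
```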
Annotations
- Rationale: All popular Perl-style regular expression flavors except JavaScript (before ES2018) include a flag that allows dots to match newlines. Without this mode, matching any single code unit requires, e.g., [\s\S], [\0-\uFFFF], [^] (JavaScript only; doesn't work in some browsers without XRegExp), or god forbid (.|\s).
- Compatibility: No known problems; the s flag was illegal in native JavaScript regular expressions until ES2018 added it as a native flag.
- Prior art: The s flag comes from Perl.
When using XRegExp's Unicode Properties addon, you can match any code point without using the s flag via \p{Any}.
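The workarounds named in the rationale can be compared side by side. The sketch below uses only native syntax and assumes an ES2018+ engine. The \p{Any} line relies on the native Unicode property escape with the u flag, which stands in here for the XRegExp addon.

```javascript
// Each pattern is meant to match exactly one code unit, newline included.
const anyCodeUnit = [
  /^[\s\S]$/,
  /^[\0-\uFFFF]$/,
  /^[^]$/,
  /^(.|\s)$/,
  /^.$/s,
];
const allMatchNewline = anyCodeUnit.every((re) => re.test('\n'));

// A code unit is not a code point: U+1F4A9 takes two code units.
const astral = '\u{1F4A9}';
const classMatchesAstral = /^[\s\S]$/.test(astral);
const anyMatchesAstral = /^\p{Any}$/u.test(astral);

console.log(allMatchNewline);    // true
console.log(classMatchesAstral); // false
console.log(anyMatchesAstral);   // true
```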
Free-spacing and line comments (x)
This flag has two complementary effects. First, it causes most whitespace to be ignored, so you can free-format the regex pattern for readability. Second, it allows comments with a leading #. Specifically, it turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within character classes (which means that classes are not free-format, even with x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace.
It might be better to think of whitespace and comments as do-nothing (rather than ignore-me) metacharacters. This distinction is important with something like \12 3, which with the x flag is taken as \12 followed by 3, and not \123. However, quantifiers following whitespace or comments apply to the preceding token, so x + is equivalent to x+.
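The rules above can be emulated in a few lines of plain JavaScript. The helper below, stripFreeSpacing, is hypothetical and deliberately simplified; it is not XRegExp's implementation. As an assumption of this sketch, an empty group (?:) stands in for the do-nothing metacharacter, and it is omitted before a quantifier so that the quantifier still applies to the preceding token.

```javascript
// Hypothetical helper: converts a free-spacing pattern to a native one.
function stripFreeSpacing(pattern) {
  let out = '';
  let inClass = false; // true while inside [...]
  let i = 0;
  while (i < pattern.length) {
    const ch = pattern[i];
    if (ch === '\\' && i + 1 < pattern.length) {
      // Escaped characters, including escaped whitespace and \#, stay literal.
      out += ch + pattern[i + 1];
      i += 2;
    } else if (inClass || (ch !== '#' && !/\s/.test(ch))) {
      // Ordinary tokens, and everything inside a character class, are copied.
      if (ch === '[') inClass = true;
      if (ch === ']') inClass = false;
      out += ch;
      i += 1;
    } else {
      // Skip a run of whitespace and # comments (each runs to the next newline).
      while (i < pattern.length && (pattern[i] === '#' || /\s/.test(pattern[i]))) {
        if (pattern[i] === '#') {
          while (i < pattern.length && pattern[i] !== '\n') i += 1;
        } else {
          i += 1;
        }
      }
      // Leave a do-nothing token behind, except at the edges or before a quantifier.
      const next = pattern[i];
      if (out !== '' && next !== undefined && !/[?*+{]/.test(next)) out += '(?:)';
    }
  }
  return out;
}

const source = `
  (\\d{4})  # year
  -         # separator
  (\\d{2})  # month
`;
const dateRe = new RegExp(stripFreeSpacing(source));

console.log(stripFreeSpacing('x +'));     // x+
console.log(stripFreeSpacing('\\12 3'));  // \12(?:)3
console.log(dateRe.exec('2024-05')[2]);   // 05
```

Keeping \12 and 3 apart with an empty group is one way to preserve the token boundary; other separators would work equally well in a sketch like this.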
The ignored whitespace characters are those matched natively by \s. ES3 whitespace is based on Unicode 2.1.0 or later. ES5 whitespace is based on Unicode 3.0.0 or later, plus U+FEFF. Following are the code points that should be matched by \s according to ES5 and Unicode 4.0.1–6.1.0 (not yet updated for later versions):
- U+0009: Tab (\t)
- U+000A: Line feed (\n)
- U+000B: Vertical tab (\v)
- U+000C: Form feed (\f)
- U+000D: Carriage return (\r)
- U+0020: Space
- U+00A0: No-break space
- U+1680: Ogham space mark
- U+180E: Mongolian vowel separator
- U+2000: En quad
- U+2001: Em quad
- U+2002: En space
- U+2003: Em space
- U+2004: Three-per-em space
- U+2005: Four-per-em space
- U+2006: Six-per-em space
- U+2007: Figure space
- U+2008: Punctuation space
- U+2009: Thin space
- U+200A: Hair space
- U+2028: Line separator
- U+2029: Paragraph separator
- U+202F: Narrow no-break space
- U+205F: Medium mathematical space
- U+3000: Ideographic space
- U+FEFF: Zero width no-break space
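Since the list reflects Unicode 4.0.1–6.1.0, a given engine may disagree with it. A short check along the following lines reports which listed code points the local engine's \s does not match. This is a sketch with illustrative variable names. An engine tracking a later Unicode version would be expected to report U+180E, so no fixed output is shown.

```javascript
// The code points from the list above.
const listed = [
  0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x00A0, 0x1680, 0x180E,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
  0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000, 0xFEFF,
];

// Collect the ones this engine's \s rejects, formatted as U+XXXX.
const notMatched = listed
  .filter((cp) => !/\s/.test(String.fromCodePoint(cp)))
  .map((cp) => 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'));

console.log(notMatched);
```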
Annotations
- Rationale: Regular expressions are notoriously hard to read; adding whitespace and comments makes regular expressions easier to read.
- Compatibility: No known problems; the x flag is illegal in native JavaScript regular expressions.
- Prior art: The x flag comes from Perl, and was originally inspired by Jeffrey Friedl's pretty-printing of complex regexes.
Unicode 1.1.5–4.0.0 assigned code point U+200B (ZWSP) to the Zs (Space separator) category, which means that some browsers or regex engines might include this additional code point in those matched by \s, etc. Unicode 4.0.1 moved ZWSP to the Cf (Format) category.
Unicode 1.1.5 assigned code point U+FEFF (ZWNBSP) to the Zs category. Unicode 2.0.14 moved ZWNBSP to the Cf category. ES5 explicitly includes ZWNBSP in its list of whitespace characters, even though this does not match any version of the Unicode standard since 1996.
U+180E (Mongolian vowel separator) was introduced in Unicode 3.0.0, which assigned it the Cf category. Unicode 4.0.0 moved it into the Zs category, and Unicode 6.3.0 moved it back to the Cf category.
JavaScript's \s is similar but not equivalent to \p{Z} (the Separator category) from regex libraries that support Unicode categories, including XRegExp's own Unicode Categories addon. The difference is that \s includes code points U+0009–U+000D and U+FEFF, which are not assigned the Separator category in the Unicode character database.
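The difference can be verified with native Unicode property escapes. The sketch assumes an ES2018+ engine and uses the native \p{Z} with the u flag in place of the XRegExp addon.

```javascript
// Code points matched by \s but not in the Separator category.
const extras = ['\t', '\n', '\v', '\f', '\r', '\uFEFF'];

const allMatchedByS = extras.every((ch) => /\s/.test(ch));
const anyMatchedByZ = extras.some((ch) => /\p{Z}/u.test(ch));

// A code point in both sets, for contrast: U+00A0 (no-break space).
const nbspInBoth = /\s/.test('\u00A0') && /\p{Z}/u.test('\u00A0');

console.log(allMatchedByS, anyMatchedByZ, nbspInBoth); // true false true
```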
JavaScript's \s is nearly equivalent to \p{White_Space} from the Unicode Properties addon. The differences are: 1. \p{White_Space} does not include U+FEFF (ZWNBSP). 2. \p{White_Space} includes U+0085 (NEL), which is not assigned the Separator category in the Unicode character database.
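Both differences show up in a direct comparison. This is a sketch that assumes an ES2018+ engine and uses the native \p{White_Space} with the u flag, not the XRegExp addon.

```javascript
const nel = '\u0085';    // NEL (next line)
const zwnbsp = '\uFEFF'; // ZWNBSP

const comparison = {
  nelMatchedByS: /\s/.test(nel),
  nelMatchedByWhiteSpace: /\p{White_Space}/u.test(nel),
  zwnbspMatchedByS: /\s/.test(zwnbsp),
  zwnbspMatchedByWhiteSpace: /\p{White_Space}/u.test(zwnbsp),
};

console.log(comparison);
// { nelMatchedByS: false, nelMatchedByWhiteSpace: true,
//   zwnbspMatchedByS: true, zwnbspMatchedByWhiteSpace: false }
```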
Aside: Not all JavaScript regex syntax is Unicode-aware. According to JavaScript specs, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary. Many browsers get some of these details wrong.
For more details, see JavaScript, Regex, and Unicode.
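The split between Unicode-aware and ASCII-only tokens described in the aside can be seen in a few native checks. The sketch assumes an ES2018+ engine for the \p{Nd} line, and the sample strings are illustrative.

```javascript
const arabicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE
const cafe = 'caf\u00E9';     // "café" with a precomposed é

const checks = {
  // \s is Unicode-aware: it matches the ideographic space.
  ideographicSpace: /\s/.test('\u3000'),
  // \d and \w are ASCII-only.
  unicodeDigitWithD: /\d/.test(arabicThree),
  unicodeDigitWithNd: /\p{Nd}/u.test(arabicThree),
  accentedLetterWithW: /\w/.test('\u00E9'),
  // \b sees a boundary between "f" and "é", so "caf" counts as a whole word.
  boundaryInsideWord: /\bcaf\b/.test(cafe),
};

console.log(checks);
// { ideographicSpace: true, unicodeDigitWithD: false, unicodeDigitWithNd: true,
//   accentedLetterWithW: false, boundaryInsideWord: true }
```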
Astral (A)
Requires the Unicode Base addon.
By default, \p{…} and \P{…} support the Basic Multilingual Plane (i.e. code points up to U+FFFF). You can opt in to full 21-bit Unicode support (with code points up to U+10FFFF) on a per-regex basis by using flag A. In XRegExp, this is called astral mode. You can automatically add flag A for all new regexes by running XRegExp.install('astral'). When in astral mode, \p{…} and \P{…} always match a full code point rather than a code unit, using surrogate pairs for code points above U+FFFF.
// Using flag A to match astral code points
XRegExp('^\\pS$').test('💩'); // -> false
XRegExp('^\\pS$', 'A').test('💩'); // -> true
XRegExp('(?A)^\\pS$').test('💩'); // -> true

// Using surrogate pair U+D83D U+DCA9 to represent U+1F4A9 (pile of poo)
XRegExp('(?A)^\\pS$').test('\uD83D\uDCA9'); // -> true

// Implicit flag A
XRegExp.install('astral');
XRegExp('^\\pS$').test('💩'); // -> true
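The distinction between code units and code points that astral mode deals with can also be inspected in native JavaScript alone. This is a minimal sketch that assumes an ES6-capable engine. It uses the native u flag for comparison and does not involve XRegExp's A flag.

```javascript
const poo = '\u{1F4A9}'; // U+1F4A9, stored as a surrogate pair

const info = {
  codeUnits: poo.length,
  high: poo.charCodeAt(0).toString(16),
  low: poo.charCodeAt(1).toString(16),
  codePoint: poo.codePointAt(0).toString(16),
  sameAsPair: poo === '\uD83D\uDCA9',
  // Without u, a dot matches one code unit; with u, one code point.
  dotWithoutU: /^.$/.test(poo),
  dotWithU: /^.$/u.test(poo),
};

console.log(info);
// { codeUnits: 2, high: 'd83d', low: 'dca9', codePoint: '1f4a9',
//   sameAsPair: true, dotWithoutU: false, dotWithU: true }
```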
Opting in to astral mode disables the use of \p{…} and \P{…} within character classes. In astral mode, use e.g. (\pL|[0-9_])+ instead of [\pL0-9_]+.
Annotations
- Rationale: Astral code point matching uses surrogate pairs and is somewhat slower than BMP-only matching. Enabling astral code point matching on a per-regex basis can therefore be useful.
- Compatibility: No known problems; the A flag is illegal in native JavaScript regular expressions.
- Prior art: None.