New syntax
Named capture
XRegExp includes comprehensive support for named capture. Following are the details of XRegExp's named capture syntax:
- Capture: (?<name>…)
- Backreference in regex: \k<name>
- Backreference in replacement text: $<name>
- Backreference stored at: result.groups.name
- Backreference numbering: Sequential (i.e., left to right for both named and unnamed capturing groups)
- Multiple groups with same name: SyntaxError
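The tokens listed above can be tried without loading XRegExp in any runtime that supports ES2018 named capture. The following is a minimal sketch using only native regular expressions; the patterns and variable names are illustrative and do not come from the XRegExp documentation.

```javascript
// Native ES2018+ regexes (not XRegExp): named and unnamed groups share
// one left-to-right numbering.
const datePattern = /(?<year>\d{4})-(\d{2})-(?<day>\d{2})/;
const dateResult = datePattern.exec('2008-04-20');

// Numbered access sees all three groups in order.
const numbered = [dateResult[1], dateResult[2], dateResult[3]];
// Named access sees only the named groups.
const named = { year: dateResult.groups.year, day: dateResult.groups.day };

// \k<name> inside the pattern, $<name> in the replacement text.
const doubledLetter = /(?<ch>[a-z])\k<ch>/;
const collapsed = 'aabc'.replace(doubledLetter, '$<ch>');

// Two groups with the same name in the same alternative are rejected.
let duplicateNameError = null;
try {
  new RegExp('(?<a>x)(?<a>y)');
} catch (err) {
  duplicateNameError = err;
}
```

Here `numbered` holds the year, month, and day in order, while `named` exposes only the two named groups; `collapsed` has the doubled letter reduced to a single one.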
Notes
- See additional details and compare to named capture in other regex flavors here: Named capture comparison.
- JavaScript added native support for named capture in ES2018. XRegExp support predates this, and it extends this support into pre-ES2018 browsers.
- Capture names can use a wide range of Unicode characters (see the definition of RegExpIdentifierName).
Example
const repeatedWords = XRegExp.tag('gi')`\b(?<word>[a-z]+)\s+\k<word>\b`;
// Alternatively: XRegExp('\\b(?<word>[a-z]+)\\s+\\k<word>\\b', 'gi');

// Check for repeated words
repeatedWords.test('The the test data');
// -> true

// Remove any repeated words
const withoutRepeated = XRegExp.replace('The the test data', repeatedWords, '${word}');
// -> 'The test data'

const url = XRegExp(`^(?<scheme> [^:/?]+ ) ://   # aka protocol
                     (?<host>   [^/?]+  )       # domain name/IP
                     (?<path>   [^?]*   ) \\??  # optional path
                     (?<query>  .*      )       # optional query`, 'x');

// Get the URL parts
const parts = XRegExp.exec('https://google.com/path/to/file?q=1', url);
// parts -> ['https://google.com/path/to/file?q=1', 'https', 'google.com', '/path/to/file', 'q=1']
// parts.groups.scheme -> 'https'
// parts.groups.host -> 'google.com'
// parts.groups.path -> '/path/to/file'
// parts.groups.query -> 'q=1'

// Named backreferences are available in replacement functions as properties of the last argument
XRegExp.replace('https://google.com/path/to/file?q=1', url, (match, ...args) => {
  const groups = args.pop();
  return match.replace(groups.host, 'xregexp.com');
});
// -> 'https://xregexp.com/path/to/file?q=1'
Regexes that use named capture work with all native methods. However, you need to use XRegExp.exec and XRegExp.replace for access to named backreferences; otherwise, only numbered backreferences are available.
Annotations
- Rationale: Named capture can help make regular expressions and related code self-documenting, and thereby easier to read and use.
- Compatibility: The named capture syntax is illegal in pre-ES2018 native JavaScript regular expressions and hence does not cause problems. Backreferences to undefined named groups throw a SyntaxError.
- Compatibility with deprecated features: XRegExp's named capture functionality does not support the lastMatch property of the global RegExp object or the RegExp.prototype.compile method, since those features were deprecated in JavaScript 1.5.
- Prior art: Comes from Python (feature) and .NET (syntax).
Inline comments
Inline comments use the syntax (?#comment). They are an alternative to the line comments allowed in free-spacing mode.
Comments are a do-nothing (rather than ignore-me) metasequence. This distinction is important with something like \1(?#comment)2, which is taken as \1 followed by 2, and not \12. However, quantifiers following comments apply to the preceding token, so x(?#comment)+ is equivalent to x+.
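The do-nothing semantics described above can be approximated on top of native regular expressions. The sketch below is an illustration only: stripInlineComments is a hypothetical helper, not part of the XRegExp API and not a description of how XRegExp is implemented, and it ignores escaped parentheses and character classes.

```javascript
// Hypothetical helper (not part of XRegExp): rewrites (?#…) comments so the
// pattern can be compiled by the native RegExp constructor.
// Simplification: escaped parentheses and character classes are not handled.
function stripInlineComments(pattern) {
  return pattern.replace(/\(\?#[^)]*\)/g, (comment, offset, source) => {
    const rest = source.slice(offset + comment.length);
    // A quantifier right after the comment must bind to the preceding token,
    // so the comment is removed entirely in that case.
    const quantifierFollows = /^(?:[?*+]|\{\d+(?:,\d*)?\})/.test(rest);
    // Otherwise an empty group keeps the neighboring tokens apart.
    return quantifierFollows ? '' : '(?:)';
  });
}

// The backreference stays separate from the digit that follows the comment.
const backrefThenDigit = stripInlineComments('(a)\\1(?#comment)2');
// The quantifier applies to x, as if the comment were not there.
const quantifiedToken = stripInlineComments('x(?#comment)+');

const matchesBackrefThenTwo = new RegExp(`^${backrefThenDigit}$`).test('aa2');
```

With this sketch, the first pattern becomes (a)\1(?:)2 and the second becomes x+, which mirrors the two cases in the paragraph above.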
Example
const regex = XRegExp('^(?#month)\\d{1,2}/(?#day)\\d{1,2}/(?#year)(\\d{2}){1,2}', 'n');
const isDate = regex.test('04/20/2008');
// -> true

// Can still be useful when combined with free-spacing, because inline comments
// don't need to end with \n
// (named differently from the regex above so both declarations can share a scope)
const regexWithSpacing = XRegExp('^ \\d{1,2}      (?#month)' +
                                 '/ \\d{1,2}      (?#day  )' +
                                 '/ (\\d{2}){1,2} (?#year )', 'nx');
Annotations
- Rationale: Comments make regular expressions more readable.
- Compatibility: No known problems with this syntax; it is illegal in native JavaScript regular expressions.
- Prior art: The syntax comes from Perl. It is also available in .NET, PCRE, Python, Ruby, and Tcl, among other regular expression flavors.
Leading mode modifier
A mode modifier uses the syntax (?imnsuxA), where imnsuxA is any combination of XRegExp flags except g or y. Mode modifiers provide an alternate way to enable the specified flags. XRegExp allows the use of a single mode modifier at the very beginning of a pattern only.
Example
const regex = XRegExp('(?im)^[a-z]+$');
regex.ignoreCase; // -> true
regex.multiline; // -> true
When creating a regex, it's okay to include flags in a mode modifier that are also provided via the separate flags argument. For instance, XRegExp('(?s).+', 's') is valid.
Flags g and y cannot be included in a mode modifier, or an error is thrown. This is because g and y, unlike all other flags, have no impact on the meaning of a regex. Rather, they change how particular methods choose to apply the regex. In fact, XRegExp methods provide arguments such as scope, sticky, and pos that allow you to use and change such functionality on a per-run rather than per-regex basis. Also consider that it makes sense to apply all other flags to a particular subsection of a regex, whereas flags g and y only make sense when applied to the regex as a whole. Allowing g and y in a mode modifier might therefore create future compatibility problems.
The use of unknown flags in a mode modifier causes an error to be thrown. However, XRegExp addons can add new flags that are then automatically valid within mode modifiers.
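The rules above (a single leading modifier, merging with the flags argument, and errors for g, y, and unknown flags) can be sketched against native regular expressions. compileWithModifier is a hypothetical helper written for this illustration; it is not XRegExp's implementation, and it recognizes only the native flags i, m, s, and u rather than the full XRegExp flag set.

```javascript
// Hypothetical helper (not XRegExp's implementation): handles one leading
// (?flags) group and maps it onto native flags.
// Assumption: only the native flags i, m, s, and u are treated as known.
function compileWithModifier(pattern, flags = '') {
  const modifier = /^\(\?([A-Za-z]+)\)/.exec(pattern);
  if (!modifier) {
    return new RegExp(pattern, flags);
  }
  const modifierFlags = modifier[1];
  if (/[gy]/.test(modifierFlags)) {
    throw new SyntaxError('Flags g and y cannot be used in a mode modifier');
  }
  const unknownFlags = modifierFlags.replace(/[imsu]/g, '');
  if (unknownFlags !== '') {
    throw new SyntaxError(`Unknown flag(s) in mode modifier: ${unknownFlags}`);
  }
  // A flag given both in the modifier and in the flags argument is applied once.
  const mergedFlags = [...new Set(flags + modifierFlags)].join('');
  return new RegExp(pattern.slice(modifier[0].length), mergedFlags);
}

const lettersOnly = compileWithModifier('(?im)^[a-z]+$');
const anyChars = compileWithModifier('(?s).+', 's');

let globalModifierError = null;
try {
  compileWithModifier('(?g)a');
} catch (err) {
  globalModifierError = err;
}
```

In this sketch `lettersOnly` ends up with the i and m flags set and the modifier removed from its source, `anyChars` carries the s flag once, and the (?g) pattern is rejected with a SyntaxError.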
Annotations
- Rationale: Mode modifiers allow you to enable flags in situations where a regex pattern can be provided as a string only. They can also improve readability, since flags are read first rather than after the pattern.
- Compatibility: No known problems with this syntax; it is illegal in native JavaScript regular expressions.
- Compatibility with other regex flavors: Some regex flavors support the use of multiple mode modifiers anywhere in a pattern, and allow extended syntax for unsetting flags via (?-i), simultaneously setting and unsetting flags via (?i-m), and enabling flags for subpatterns only via (?i:…). XRegExp does not support these extended options.
- Prior art: The syntax comes from Perl. It is also available in .NET, Java, PCRE, Python, Ruby, and Tcl, among other regular expression flavors.
Stricter error handling
XRegExp makes any escaped letters or numbers a SyntaxError unless they form a valid and complete metasequence or backreference. This helps to catch errors early, and makes it safe for future versions of ES or XRegExp to introduce new escape sequences. It also means that octal escapes are always an error in XRegExp. ES3/5 do not allow octal escapes, but browsers support them anyway for backward compatibility, which often leads to unintended behavior.
XRegExp requires all backreferences, whether written as \n, \k<n>, or \k<name>, to appear to the right of the opening parenthesis of the group they reference.
XRegExp never allows \n-style backreferences to be followed by literal numbers. To match backreference 1 followed by a literal 2 character, you can use, e.g., (a)\k<1>2, (?x)(a)\1 2, or (a)\1(?#)2.
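To see the kind of unintended behavior that octal escapes cause, the following sketch uses native regular expressions only (no XRegExp calls). It shows how a native engine without the u flag reads \12 when the pattern has a single capturing group, and one native way to keep the backreference and the digit apart. The variable names are illustrative.

```javascript
// Native behavior only (no XRegExp involved), without the u flag.
// With a single capturing group, \12 is read as an octal escape for
// character code 10 (a line feed), not as \1 followed by "2".
const ambiguousPattern = /^(a)\12$/;
const matchesLineFeed = ambiguousPattern.test('a\n');
const matchesRepeatThenTwo = ambiguousPattern.test('aa2');

// An empty non-capturing group separates the backreference from the digit.
const explicitPattern = /^(a)\1(?:)2$/;
const explicitMatches = explicitPattern.test('aa2');

// The native u flag rejects the ambiguous form instead of guessing.
let unicodeModeError = null;
try {
  new RegExp('(a)\\12', 'u');
} catch (err) {
  unicodeModeError = err;
}
```

Here the ambiguous pattern matches 'a' followed by a line feed and fails on 'aa2', while the explicit pattern matches 'aa2'. XRegExp's stricter handling turns the ambiguous form into an error up front.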
Unicode
XRegExp supports matching Unicode categories, scripts, and other properties via addon scripts. Such tokens are matched using \p{…}, \P{…}, and \p{^…}. See XRegExp Unicode addons for more details.
XRegExp additionally supports the \u{N…} syntax for matching individual code points. In ES6 this is supported natively, but only when using the u flag. XRegExp supports this syntax for code points 0–FFFF even when not using the u flag, and it supports the complete Unicode range 0–10FFFF when using u.
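For context on why support without the u flag is an extension, this short sketch shows the native behavior of \u{…} with and without u. It uses native regular expressions only, and the variable names are illustrative.

```javascript
// Native behavior only (no XRegExp involved).
// With the u flag, \u{…} matches the given code point.
const matchesWithUnicodeFlag = /^\u{1F600}$/u.test('\u{1F600}');

// Without the u flag, \u is just an escaped "u" and {3} is a quantifier,
// so the pattern matches three "u" characters rather than code point 3.
const withoutUnicodeFlag = /^\u{3}$/;
const matchesThreeUs = withoutUnicodeFlag.test('uuu');
const matchesCodePointThree = withoutUnicodeFlag.test('\u0003');
```

Natively, only the first test and the 'uuu' test succeed; the flagless pattern does not match code point 3.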
Replacement text
XRegExp's replacement text syntax is used by the XRegExp.replace function. It adds $0 as a synonym of $& (to refer to the entire match), and adds $<n> and ${n} for backreferences to named and numbered capturing groups (in addition to $1, etc.). When the braces syntax is used for numbered backreferences, it allows numbers with three or more digits (not possible natively) and allows separating a backreference from an immediately-following digit (not always possible natively). XRegExp uses stricter replacement text error handling than native JavaScript, to help you catch errors earlier (e.g., the use of a $ character that isn't part of a valid metasequence causes an error to be thrown).
Following are the special tokens that can be used in XRegExp replacement strings:
- $$ - Inserts a literal $ character.
- $&, $0 - Inserts the matched substring.
- $` - Inserts the string that precedes the matched substring (left context).
- $' - Inserts the string that follows the matched substring (right context).
- $n, $nn - Where n/nn are digits referencing an existing capturing group, inserts backreference n/nn.
- $<n>, ${n} - Where n is a name or any number of digits that reference an existing capturing group, inserts backreference n.
XRegExp behavior for $<n> and ${n}:
- Backreference to numbered capture, if n is an integer. Use 0 for the entire match. Any number of leading zeros may be used.
- Backreference to named capture n, if it exists. Does not overlap with numbered capture since XRegExp does not allow named capture to use a bare integer as the name.
- If the name or number does not refer to an existing capturing group, it's an error.
XRegExp behavior for $n and $nn:
- Backreferences without curly braces end after 1 or 2 digits. Use ${…} for more digits.
- $1 is an error if there are no capturing groups.
- $10 is an error if there are fewer than 10 capturing groups. Use ${1}0 instead.
- $01 is equivalent to $1 if a capturing group exists, otherwise it's an error.
- $0 (not followed by 1-9) and $00 are the entire match.
For comparison, following is JavaScript's native behavior for $n and $nn:
- Backreferences end after 1 or 2 digits. Cannot use backreference to capturing group 100+.
- $1 is a literal $1 if there are no capturing groups.
- $10 is $1 followed by a literal 0 if there are fewer than 10 capturing groups.
- $01 is equivalent to $1 if a capturing group exists, otherwise it's a literal $01.
- $0 is a literal $0.
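The native rules listed above can be checked directly with String.prototype.replace. This sketch uses no XRegExp calls; the input string and the brackets in the replacement text are only there to make the inserted text easy to spot.

```javascript
// Native String.prototype.replace, shown for contrast with XRegExp.replace.

// No capturing groups: $1 is kept as literal text.
const noGroups = 'abc'.replace(/b/, '[$1]');

// One capturing group: $10 is read as $1 followed by a literal 0.
const tenWithOneGroup = 'abc'.replace(/(b)/, '[$10]');

// $01 refers to group 1 when that group exists...
const leadingZero = 'abc'.replace(/(b)/, '[$01]');
// ...and is literal text when it does not.
const leadingZeroNoGroup = 'abc'.replace(/b/, '[$01]');

// $0 is always literal text natively.
const dollarZero = 'abc'.replace(/(b)/, '[$0]');
```

The results are 'a[$1]c', 'a[b0]c', 'a[b]c', 'a[$01]c', and 'a[$0]c' respectively. Per the XRegExp rules given earlier, the cases that fall back to literal text here are errors in XRegExp.replace, except $0, which XRegExp treats as the entire match.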