This document uses syntax diagrams to visually explain PCRE syntax. PCRE is available as an extension for Apache, NGINX, PHP, Python, Go, and others.
The overall syntax of PCRE is the following:
A pattern is one or more branches to try, each separated by a vertical bar. When there are two or more branches, this is known as alternation. Branches are tried from left to right until one of them matches. If none of the branches match, the pattern fails. An empty branch always matches.
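The left-to-right rule can be checked with a small sketch. It uses Python's built-in re module purely for illustration, which is an assumption on our part: re is not PCRE, but it tries alternation branches in the same order.

```python
import re

# Branches are tried left to right; the first branch that matches wins,
# even when a later branch would match more text.
first = re.match(r"a|ab", "ab").group()

# Reordering the branches changes the result.
second = re.match(r"ab|a", "ab").group()

# An empty branch matches the empty string, so this pattern cannot fail.
empty = re.match(r"x|", "abc").group()

print(repr(first), repr(second), repr(empty))
```

Because the first successful branch wins, longer alternatives usually belong before their own prefixes.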
A branch contains metacharacters and literal characters (nonmetacharacters) that make up what to match. Metacharacters have special meanings that may involve a sequence of one or more characters. Literal characters match themselves.
Within a branch, any character that is not one of the following metacharacters is treated as a literal character:
( ) [ | * + ? \ . $ ^
Additionally, { is considered a metacharacter if it forms a valid quantifier; otherwise it is treated as a literal character.
By default, a literal character matches a single character. A literal character may be followed by a quantifier to change this behavior.
Preceding a metacharacter with a backslash forms an escape sequence that matches the metacharacter as a literal character instead. See character quoting.
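As a minimal sketch of the difference between a metacharacter and its escaped form, the following uses Python's re module as a stand-in (an assumption; it is not PCRE, but it handles a backslash before + the same way).

```python
import re

# Unescaped, + is a quantifier: the pattern means one or more "1"
# followed by another "1", so it cannot match a string containing "+".
unescaped = re.fullmatch(r"1+1", "1+1")

# Escaped, \+ matches a literal plus sign.
escaped = re.fullmatch(r"1\+1", "1+1")

print(unescaped is None, escaped is not None)
```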
.
Within a branch, the dot metacharacter matches any character except if the character is the start of a valid newline sequence. The dot may be followed by a quantifier to change how many characters it matches.
In dotall mode, the dot matches any character.
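The effect of dotall mode can be sketched as follows. Python's re module is used as an assumption for illustration: it is not PCRE, and it recognizes only \n as a newline here, whereas PCRE's newline sequence is configurable.

```python
import re

text = "a\nb"

# By default the dot does not match a newline, so the match fails.
default = re.fullmatch(r"a.b", text)

# In dotall mode (re.DOTALL in Python) the dot matches any character.
dotall = re.fullmatch(r"a.b", text, re.DOTALL)

print(default is None, dotall is not None)
```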
\N behaves like the dot wildcard with the exception that dotall mode does not affect it; it never matches the start of a valid newline sequence.
Within a branch, a character class matches a single character within a set of characters. A quantifier may come immediately after the closing bracket of the character class to change how many characters it matches.
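The following escape sequences may appear within a character class: non-printing characters (\a \b \c \e \f \n \o \r \t \x), character types (\d \D \h \H \p \P \s \S \v \V \w \W, but not \C \N \R \X), and character quoting.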
\b is treated as a backspace character within a character class (same as \x08).
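The two meanings of \b are easy to confuse, so here is a short sketch. It assumes Python's re module as a stand-in for PCRE; the two engines agree on the points shown, but this is not PCRE itself.

```python
import re

# A character class matches one character from the set; a quantifier
# after the closing bracket repeats it.
hex_word = re.fullmatch(r"[0-9a-f]+", "c0ffee")

# Inside a character class, \b is a backspace character (\x08).
backspace = re.fullmatch(r"[\b]", "\x08")

# Outside a class, \b is a word-boundary assertion and consumes nothing.
boundary = re.search(r"\bcat\b", "a cat sat")

print(hex_word.group(), backspace is not None, boundary.span())
```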
A POSIX character class may appear within a character class.
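[:alnum:]: equivalent to [:alpha:][:digit:]
[:digit:]: decimal digits (same as \d)
[:print:]: equivalent to [:graph:][:space:], except not any controls
[:space:]: white space (same as \s since v8.34)
[:word:]: word characters (same as \w)
[:xdigit:]: equivalent to [0-9A-Fa-f]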
An escape sequence may appear within a branch. It may also appear within a character class, depending on the escape sequence. A quantifier may come after an escape sequence appearing in a branch, depending on the escape sequence.
An escaped assertion sequence may appear within a branch. A quantifier cannot come after an escaped assertion sequence. An escaped assertion sequence cannot appear within a character class.
\b and \B extend to include Unicode characters as well.
A non-printing character escape sequence may appear within a branch or character class. A quantifier may come after it to change how many times it must match.
\b: Within a character class, \b will match a backspace character. Within a branch, it asserts a word boundary. See escape sequence: assertion.
\o{...}: The valid octal number range is 0-377 in ASCII mode and 0-4177777 in Unicode mode (it must also be a valid Unicode codepoint).
\x{...}: The valid hexadecimal number range is 0-FF in ASCII mode and 0-10FFFF in Unicode mode (it must also be a valid Unicode codepoint).
Use \g for backreferences. Use \o for octal codepoints.
A character type escape sequence may appear within a branch. All but \C \N \R \X may appear within a character class. It matches a single character that falls within the character type. A quantifier may come after a character type escape sequence to change how many times it must match.
\p{...} and \P{...}: Match a character that either does (\p) or doesn't (\P) have the underlying property.
\p{...} and \P{...}: Match a character that does (\p) or doesn't (\P) belong to the character script:
Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi
A backreference matches an existing subresult of the most recent subpattern with the given name or offset. A quantifier may come after an escaped backreference to change how many times it must match.
Use \g for backreferences. Use \o for octal codepoints.
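A backreference matches the text a subpattern captured, not the subpattern itself. The sketch below illustrates this with Python's re module, which is an assumption for illustration only: re does not accept \g inside a pattern, so the example falls back to \1 and the (?P=name) form.

```python
import re

# \1 refers back to whatever group 1 matched, so this finds a doubled letter.
doubled = re.search(r"(\w)\1", "hello")

# A named group can be referenced by name; the closing quote must be the
# same character that the group "q" captured.
pattern = r"(?P<q>['\"]).*(?P=q)"
same_quotes = re.fullmatch(pattern, "'it works'")
mixed_quotes = re.fullmatch(pattern, "'it fails\"")

print(doubled.group(), same_quotes is not None, mixed_quotes is None)
```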
A subroutine call matches an existing subpattern with the given name or offset at the current position. A quantifier may come after a subroutine call to change how many times it must match.
With \K, any previously matched characters will not be included in the result of the regular expression (e.g. foo\Kbar will match foobar but only return bar as the result of the match). Any subpatterns that come before this sequence will still be captured as subresults/backreferences; that is, this escape sequence only affects the main result and not subresults. A quantifier cannot come after this escape sequence.
This escape sequence is ignored within a negative assertion but works within a positive assertion. Normally, assertions are not included in the final result. (?<=fo\Ko)bar will return obar as the result when matching against foobar instead of just bar. ((?<=fo\Ko)bar) will return bar as the first subresult and obar as the result when matching against foobar.
Character quoting lets you turn metacharacters into literal characters to be matched. A quantifier may come after these escape sequences, but a quantifier after a \Q...\E sequence applies only to the last quoted character rather than to the whole sequence.
\Q and \E: All characters between these two escape sequences will be treated as literal characters. If the regular expression ends before \E, a \E will be implied. If \E occurs before \Q within the regular expression, it will be ignored.
The regular expression engine will fail when these escape sequences are encountered.
If the PCRE_JAVASCRIPT_COMPAT option is set, \U will be treated like an unrecognized escape sequence and \u takes on the syntax below. If it does not follow the below syntax, \u is also treated like an unrecognized escape sequence.
The following escape sequences match the literal characters being escaped; that is, they are not special escape sequences within PCRE. However, using them is strongly discouraged, as future versions of PCRE may change this behavior. To make the engine fail when these escape sequences are used, set PCRE_EXTRA mode.
A quantifier allows for the repetition of what comes before it. A quantifier may come after a literal character, an escape sequence, a subexpression, a character class, or the dot wildcard.
An ill-formed quantifier is treated as literal characters (e.g. {1,5.0} would be matched as literal characters instead of being treated as a quantifier). This behavior is different from other regular expression constructs, which cause the engine to fail if they are ill-formed.
{0} is a valid quantifier. It behaves as if the previous item were not present. It is useful for defining subroutines.
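The quantifier rules above can be sketched as follows. Python's re module is assumed here as a stand-in: it is not PCRE and has no subroutine calls, but it treats an ill-formed brace expression as literal text and accepts {0} in the same way.

```python
import re

# {1,5} is a well-formed quantifier: one to five repetitions of "a".
valid = re.fullmatch(r"a{1,5}", "aaa")

# {1,5.0} is not a well-formed quantifier, so the braces are literal.
# (The dot inside is still the dot wildcard; here it matches ".".)
literal = re.fullmatch(r"a{1,5.0}", "a{1,5.0}")

# {0} behaves as if the preceding item were absent.
zero = re.fullmatch(r"a(?:bc){0}d", "ad")

print(valid is not None, literal is not None, zero is not None)
```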
^
The caret metacharacter is a zero-width assertion that asserts the match is happening at the start of the subject. In multiline mode, it asserts the match is happening at the start of the subject or after a newline sequence (i.e. the start of a line).
$
The dollar sign metacharacter is a zero-width assertion that asserts the match is happening at the end of the subject or before a newline sequence at the end of the subject. In multiline mode, it asserts the match is happening at the end of the subject or before a newline sequence (i.e. the end of a line).
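A short sketch of both anchors with and without multiline mode follows. It assumes Python's re module for illustration; re is not PCRE and only treats \n as a newline, but its default and multiline anchor behavior matches the description above.

```python
import re

text = "one\ntwo\n"

# Default mode: ^ matches only at the start of the subject, and $ matches
# at the very end or just before a newline at the end of the subject.
starts = re.findall(r"^\w+", text)
ends = re.findall(r"\w+$", text)

# Multiline mode: ^ and $ also match at the start and end of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(starts, ends, line_starts, line_ends)
```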
A subexpression may appear within a branch. A quantifier may appear after the closing parenthesis of the subexpression to repeat it, unless the subexpression is a comment or backtracking control.
The following syntax is used within a subexpression and within a conditional branching subexpression.
The following syntax is the conditional part of a conditional branching subexpression.
Change the options of the pattern following it. For example, (?i) makes the rest of the pattern containing this construct case-insensitive, and (?i:ABC) makes only the ABC pattern within the subexpression case-insensitive.
m (multiline mode): changes the ^ and $ metacharacters to also match after and before any newline sequence, respectively.
s (dotall mode): changes the . metacharacter to also match characters that form a valid newline sequence.
x (extended mode): # will be treated as the start of an inline comment that is terminated by a newline sequence. Comment subexpressions would no longer be valid because of this.
U (ungreedy mode): changes the ? quantifier suffix from enabling minimal greediness to enabling maximal greediness.
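The option syntax can be tried out with the sketch below. It assumes Python's re module as a stand-in for PCRE: re supports the i, m, s, and x letters shown here but has no U option, and recent Python versions only accept an unscoped option group at the very start of the pattern.

```python
import re

# (?i) at the start applies to the whole pattern.
whole = re.fullmatch(r"(?i)abc", "ABC")

# (?i:...) applies only inside the subexpression.
scoped_ok = re.fullmatch(r"(?i:abc)def", "ABCdef")
scoped_fail = re.fullmatch(r"(?i:abc)def", "ABCDEF")

# (?m) makes ^ and $ match at the start and end of every line.
lines = re.findall(r"(?m)^\w+$", "one\ntwo")

# (?x) ignores pattern whitespace and treats # as the start of a comment.
extended = re.fullmatch(r"(?x) a b c  # spaces and this comment are ignored", "abc")

print(whole is not None, scoped_ok is not None, scoped_fail is None, lines)
```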
Control verbs don't match any characters. They control the engine by providing hints, optimizations, and jumps. See the PCRE documentation on control verbs.
(*FAIL) is equivalent to (?!).
These special options must appear right at the start of the regular expression. They are not Perl-compatible. These options are available to pattern writers who are not able to change the program that processes the patterns.
By default, the engine uses a newline sequence of one or more characters (e.g. \n on Unix-like systems, \r\n on Windows). The newline sequence may be explicitly controlled using special options.