PCRE: Perl Compatible Regular Expressions

This document uses syntax diagrams to visually explain PCRE syntax. PCRE is available as an extension for Apache, NGINX, PHP, Python, Go, and others.

regular expression

The overall syntax of PCRE is the following:

pattern

A pattern is one or more branches to try, each separated by a vertical bar. When there are two or more branches, this is known as alternation. Branches are tried from left to right until one of them matches. If none of the branches matches, the pattern fails. An empty branch always matches.
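The left-to-right rule can be sketched with Python's built-in re module. This is an assumption for illustration only: re is not PCRE, but it tries alternation branches in the same order.

```python
import re

# Branches are tried left to right; the first branch that matches wins,
# even when a later branch would match more text.
first = re.match(r"cat|category", "category")   # matches 'cat'

# No branch matches, so the whole pattern fails.
no_match = re.fullmatch(r"cat|dog", "bird")     # None

# The empty second branch matches the empty string.
empty = re.fullmatch(r"cat|", "")               # matches ''
```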

branch

A branch contains metacharacters and literal characters (nonmetacharacters) that make up what to match. Metacharacters have special meanings that may involve a sequence of one or more characters. Literal characters match themselves.

A quantifier cannot follow a comment subexpression, an assertion escape sequence, or the reset match start escape sequence.

literal character (nonmetacharacter)

Within a branch, any character that is not one of the following metacharacters is treated as a literal character:

( ) [ | * + ? \ . $ ^

Additionally, { is considered a metacharacter if it forms a valid quantifier, otherwise it is treated as a literal character.

By default, a literal character matches a single character. A literal character may be followed by a quantifier to change this behavior.

Preceding a metacharacter with a backslash forms an escape sequence that makes it match the literal character instead. See character quoting.
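A minimal sketch of the difference between a metacharacter and its escaped form, using Python's re module as a stand-in for a PCRE engine (an assumption; the two agree on this behavior):

```python
import re

# Unescaped, + is a quantifier: "1+1" means one or more "1" followed by "1".
as_quantifier = re.fullmatch(r"1+1", "111")    # matches
not_literal = re.fullmatch(r"1+1", "1+1")      # None

# Escaped, \+ matches a literal plus sign.
as_literal = re.fullmatch(r"1\+1", "1+1")      # matches
```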

wildcard `.`

Within a branch, the dot metacharacter matches any single character except one that starts a valid newline sequence. The dot may be followed by a quantifier to change how many characters it matches.

In dotall mode, the dot matches any character.

Note: The escape sequence \N behaves like the dot wildcard with the exception that dotall mode does not affect it; it never matches the start of a valid newline sequence.
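The effect of dotall mode on the dot can be sketched with Python's re module. This is an approximation: re treats only \n as a newline and has no \N escape of this kind, so only the dot is shown.

```python
import re

text = "abc a\nc"

# By default the dot does not match the newline in "a\nc".
default_mode = re.findall(r"a.c", text)                   # ['abc']

# In dotall mode the dot matches any character, including the newline.
dotall_mode = re.findall(r"a.c", text, flags=re.DOTALL)   # ['abc', 'a\nc']
```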

character class

Within a branch, a character class matches a single character within a set of characters. A quantifier may come immediately after the closing bracket of the character class to change how many characters it matches.

escape sequence: valid escape sequences include non-printing characters (\a\b\c\e\f\n\o\r\t\x), character types (\d\D\h\H\p\P\s\S\v\V\w\W but not \C\N\R\X), and character quoting. \b is treated as a backspace character within a character class (same as \x08).
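As a hedged illustration in Python's re module (not PCRE, but equivalent for these cases), a class matches one character from its set, may contain character types, and treats \b as a backspace:

```python
import re

# A class matches exactly one character from the set.
vowels = re.findall(r"[aeiou]", "regex")       # ['e', 'e']

# Character types such as \d and \s may appear inside a class.
mixed = re.findall(r"[\d\s]", "a1 b2")         # ['1', ' ', '2']

# Inside a class, \b is a backspace (same as \x08), not a word boundary.
backspace = re.fullmatch(r"[\b]", "\x08")      # matches
```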

POSIX character class

A POSIX character class may appear within a character class.

alnum: same as [:alpha:][:digit:]
alpha: letters
ascii: ASCII codepoints 0 through 127
blank: space or tab only
cntrl: control characters
digit: decimal digits (same as \d)
graph: printing characters, excluding space (same as [:alnum:][:punct:])
lower: lowercase letters
print: printing characters, including space (same as [:graph:] plus the space character; no control characters)
punct: printing characters, excluding letters, digits, and space
space: whitespace (same as \s since v8.34)
upper: uppercase letters
word: "word" characters (same as \w)
xdigit: hexadecimal digit, same as [0-9A-Fa-f]

escape sequence

An escape sequence may appear within a branch. Additionally, they may appear within a character class depending on the escape sequence. A quantifier may come after an escape sequence appearing in a branch depending on the escape sequence.

escape sequence: assertion/anchor

An escaped assertion sequence may appear within a branch. A quantifier cannot come after an escaped assertion sequence. An escaped assertion sequence cannot appear within a character class.

Unicode note: In Unicode mode, the behavior of \b and \B extends to include Unicode characters as well.

See start anchor and end anchor for other simple assertions.
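A short sketch of the word-boundary assertions, assuming Python's re module behaves like PCRE here (it does for \b and \B outside a character class):

```python
import re

text = "cat concat cats cat."

# \b asserts a position between a word character and a non-word character.
whole_words = re.findall(r"\bcat\b", text)     # ['cat', 'cat']

# \B asserts the opposite: "cat" must not start at a word boundary.
inside_word = re.findall(r"\Bcat", text)       # ['cat'] (from "concat")
```

The assertions consume no characters, which is why a quantifier after them is not meaningful.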

escape sequence: non-printing character

A non-printing character escape sequence may appear within a branch or character class. A quantifier may come after it to change how many times it must match.

Notes on \b: Within a character class, \b will match a backspace character. Within a branch, it asserts a word boundary. See escape sequence: assertion.

Notes on \o{...}: The valid octal number range is 0-377 in ASCII mode and 0-4177777 in Unicode mode (it must also be a valid Unicode codepoint).
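The octal limits are easier to check once converted. The following arithmetic sketch uses plain Python and does not involve any regex engine:

```python
import sys

# Upper bound in ASCII (8-bit) mode: octal 377 is the largest single byte.
ascii_max = int("377", 8)          # 255, i.e. 0xFF

# Upper bound in Unicode mode: octal 4177777 is the last Unicode codepoint.
unicode_max = int("4177777", 8)    # 1114111, i.e. 0x10FFFF
```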

Notes on \x{...}: The valid hexadecimal number range is 0-FF in ASCII mode and 0-10FFFF in Unicode mode (it must also be a valid Unicode codepoint).

Warning: It is highly recommended that you do not use a backslash followed by digits as it may be treated as an octal codepoint, an escaped backreference, or a literal number depending on the number and other conditions. Use \g for backreferences. Use \o for octal codepoints.

escape sequence: character type

A character type escape sequence may appear within a branch. All but \C\N\R\X may appear within a character class. It matches a single character that falls within the character type. A quantifier may come after a character type escape sequence to change how many times it must match.

Unicode note: In Unicode mode, the behavior of most of these escape sequences extends to include Unicode characters as well.
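The difference between the ASCII-only and Unicode-aware behavior can be sketched in Python's re module. This is an analogy, not PCRE itself: in re, string patterns are Unicode-aware by default and the re.ASCII flag restricts \w, \d, and \s to ASCII.

```python
import re

text = "na\u00efve caf\u00e9 42"   # "naïve café 42"

# Unicode-aware: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)                  # ['naïve', 'café', '42']

# ASCII-only: accented letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)    # ['na', 've', 'caf', '42']

# \d matches decimal digits; \S matches any non-whitespace character.
digits = re.findall(r"\d+", text)                         # ['42']
chunks = re.findall(r"\S+", text)                         # ['naïve', 'café', '42']
```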

Properties for `\p{...}` and `\P{...}`

Match a character that either does (\p) or doesn't (\P) have the underlying property.

Scripts for `\p{...}` and `\P{...}`

Match a character that does (\p) or doesn't (\P) belong to the character script:

Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi

escape sequence: backreference

A backreference matches the text most recently captured by the subpattern with the given name or offset. A quantifier may come after an escaped backreference to change how many times it must match.
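A minimal sketch of backreferences in Python's re module. Note the assumptions: re does not accept the \g form inside a pattern, so the numbered form \1 and the named form (?P=name), which PCRE also accepts, are used instead.

```python
import re

# \1 matches the same text that group 1 captured.
doubled = re.search(r"(\w)\1", "hello")                      # matches 'll'

# Named backreference to the group called "ch".
named = re.search(r"(?P<ch>\w)(?P=ch)", "bookkeeper")        # matches 'oo'

# A backreference repeats the captured text, not the subpattern.
different = re.fullmatch(r"(\d)-\1", "1-2")                  # None

# A quantifier after the backreference repeats it.
repeated = re.fullmatch(r"(ab)\1+", "ababab")                # matches
```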

escape sequence: subroutine call

A subroutine call matches an existing subpattern with the given name or offset at the current position. A quantifier may come after a subroutine call to change how many times it must match.

escape sequence: reset match start

Any previously matched characters will not be included in the result of the regular expression (e.g. foo\Kbar will match foobar but only return bar as the result of the match). Any subpatterns that come before this sequence will still be captured as subresults/backreferences; that is, this escape sequence only affects the main result and not subresults. A quantifier CANNOT come after this escape sequence.

This escape sequence is ignored within a negative assertion but works within a positive assertion. Normally, assertions are not included in the final result. (?<=fo\Ko)bar will return obar as the result when matching against foobar instead of just bar. ((?<=fo\Ko)bar) will return bar as the first subresult and obar as the result when matching against foobar.

escape sequence: character quoting

Character quoting lets you turn metacharacters into literal characters to be matched. A quantifier may come after these escape sequences, but a quantifier after a \Q \E sequence applies only to the last quoted character rather than to the whole sequence.

Notes on \Q and \E: All characters between these two escape sequences will be treated as literal characters. If the regular expression ends before \E, a \E will be implied. If \E occurs before \Q within the regular expression, it will be ignored.
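Python's re module has no \Q and \E, so the sketch below uses re.escape as a rough analogue (an assumption for illustration): it backslash-quotes every metacharacter in a string so that the whole string is matched literally.

```python
import re

literal = "1.5*(x+y)"

# Quote every metacharacter so the text matches itself exactly.
quoted = re.escape(literal)
exact = re.fullmatch(quoted, literal)        # matches

# Unquoted, the same text is a pattern with a wildcard, two quantifiers,
# and a group, and it does not match itself.
unquoted = re.fullmatch(literal, literal)    # None
```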

unsupported escape sequence

The regular expression engine will fail when these escape sequences are encountered.

If the PCRE_JAVASCRIPT_COMPAT option is set, \U will be treated like an unrecognized escape sequence and \u takes on the syntax below. If it does not follow the below syntax, \u is also treated like an unrecognized escape sequence.

unrecognized escape sequence

The following escape sequences match the literal characters being escaped; that is, they are not special escape sequences within PCRE. However, using these escape sequences is strongly discouraged because future versions of PCRE may change this behavior. To make the engine fail when these escape sequences are used, set PCRE_EXTRA mode.

quantifier

A quantifier allows for the repetition of what comes before it. A quantifier may come after a literal character, an escape sequence, a subexpression, a character class, or the dot wildcard.

Warning: If a braced quantifier is ill-formed, then the initial brace will be matched as a literal character instead, and the regular expression engine will continue after that character (e.g. {1,5.0} would be matched as literal characters instead of treated as a quantifier). This behavior is different from other regular expression constructs, which cause the engine to fail if they are ill-formed.
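The fallback to literal matching can be sketched with Python's re module, which happens to handle ill-formed braces the same way. That equivalence is an assumption about re rather than a statement about every PCRE version.

```python
import re

# Well-formed: {1,5} is a quantifier on "a".
well_formed = re.fullmatch(r"a{1,5}", "aaa")             # matches

# Ill-formed: "{1,5.0}" is not a quantifier, so "{" is matched literally and
# parsing continues after it. The "." is still the dot wildcard.
ill_formed = re.fullmatch(r"a{1,5.0}", "a{1,5.0}")       # matches
wildcard_dot = re.fullmatch(r"a{1,5.0}", "a{1,5x0}")     # matches
not_repeated = re.fullmatch(r"a{1,5.0}", "aaa")          # None
```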

optimizing speed of regex engine with greediness:
- maximal (greedy) means attempt to match maximum quantity first, then backtrack if necessary
- minimal (lazy) means attempt to match minimum quantity first, then advance if necessary
- possessive means match maximum quantity possible and don't backtrack, even if the match fails

braced quantifier {n,m}:
- the first set of digits n must be 0 ≤ n ≤ 65535
- the second set of digits m must be n ≤ m ≤ 65535
- {0} is a valid quantifier. It behaves as if the previous item was not present. It is useful for defining subroutines.
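Greedy and lazy repetition can be compared in a short sketch with Python's re module, used here as an assumption in place of PCRE. Possessive quantifiers are left out because re only supports them from Python 3.11.

```python
import re

text = "<a><b>"

# Greedy: take as much as possible, then backtrack to find the final ">".
greedy = re.match(r"<.+>", text).group()       # '<a><b>'

# Lazy: take as little as possible, then advance until ">" is found.
lazy = re.match(r"<.+?>", text).group()        # '<a>'

# A braced quantifier bounds the repetition count.
too_long = re.fullmatch(r"\d{2,4}", "12345")   # None

# {0} behaves as if the quantified item were not present.
zero = re.fullmatch(r"a(?:bc){0}d", "ad")      # matches
```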

start anchor `^`

The caret metacharacter is a zero-width assertion that asserts the match is happening at the start of the subject. In multiline mode, it asserts the match is happening at the start of the subject or after a newline sequence (i.e. the start of a line).

end anchor `$`

The dollar sign metacharacter is a zero-width assertion that asserts the match is happening at the end of the subject or before a newline sequence at the end of the subject. In multiline mode, it asserts the match is happening at the end of the subject or before a newline sequence (i.e. the end of a line).
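Both anchors, with and without multiline mode, can be sketched in Python's re module. This is an approximation of PCRE in which only \n counts as a newline.

```python
import re

text = "one\ntwo\n"

# ^ matches only at the start of the subject by default.
starts = re.findall(r"^\w+", text)                         # ['one']
line_starts = re.findall(r"^\w+", text, flags=re.M)        # ['one', 'two']

# $ matches at the very end, or just before a newline at the very end.
ends = re.findall(r"\w+$", text)                           # ['two']
line_ends = re.findall(r"\w+$", text, flags=re.M)          # ['one', 'two']
```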

subexpression

A subexpression may appear within a branch. A quantifier may appear after the closing parenthesis of the subexpression to repeat it, unless the subexpression is a comment or backtracking control.

subexpression: assertion

The following syntax is used within a subexpression and within a conditional branching subexpression.

subexpression: condition

The following syntax is the conditional part of a conditional branching subexpression.

If condition is DEFINE, the no-branch is not allowed and the condition is always treated as false.

If there's a named capture with the name DEFINE, R, R1, R2, etc, then those override the default behavior of those identifiers.
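A conditional branching subexpression that tests whether a capture group took part in the match can be sketched in Python's re module, which supports the (?(1)yes|no) form. DEFINE and the R conditions are not available in re, so they are not shown.

```python
import re

# If group 1 matched an opening "<", require a closing ">".
# The no-branch is omitted, so nothing extra is required otherwise.
tag = re.compile(r"(<)?\w+(?(1)>)")

bracketed = tag.fullmatch("<tag>")     # matches
bare = tag.fullmatch("tag")            # matches
unclosed = tag.fullmatch("<tag")       # None
unopened = tag.fullmatch("tag>")       # None
```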

subexpression: internal option setting

Change the options of the pattern that follows. For example, (?i) switches to case-insensitive matching for the rest of the pattern containing this construct, while (?i:ABC) switches to case-insensitive matching only for the ABC pattern within the subexpression.
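The two forms can be compared in a sketch with Python's re module. One difference to keep in mind is an assumption about re, not PCRE: recent Python versions only accept the unscoped (?i) form at the very start of the pattern.

```python
import re

# Scoped: only "abc" is case-insensitive; "def" stays case-sensitive.
scoped = re.compile(r"(?i:abc)def")
scoped_ok = scoped.fullmatch("ABCdef")        # matches
scoped_fail = scoped.fullmatch("ABCDEF")      # None

# Unscoped: the option applies to everything that follows it.
whole = re.compile(r"(?i)abcdef")
whole_ok = whole.fullmatch("ABCDEF")          # matches
```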

options

i (PCRE_CASELESS): Any alphabetic character within the pattern will match either lowercase or uppercase.
m (PCRE_MULTILINE): Extend the ^ and $ metacharacters to also match after and before any newline sequence, respectively.
s (PCRE_DOTALL): Extend the . metacharacter to also match characters that form a valid newline sequence.
x (PCRE_EXTENDED): Whitespace in the pattern is ignored unless it is escaped or inside a character class. Any unescaped # outside a character class is treated as the start of an inline comment that is terminated by a newline sequence. Comment subexpressions would no longer be valid because of this.
J (PCRE_DUPNAMES): (Not Perl compatible) By default, names must be unique within the regular expression. This relaxes that rule.
U (PCRE_UNGREEDY): (Not Perl compatible) Converts quantifiers from having maximal greediness to having minimal greediness (lazy) by default. Also, changes meaning of ? from enabling minimal greediness to enabling maximal greediness.
X (PCRE_EXTRA): (Not Perl compatible) Causes regular expression engine to fail if it encounters any undefined escape sequence.
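Extended mode is the option that most changes how a pattern is written. A hedged sketch using re.VERBOSE, Python's counterpart to the x option:

```python
import re

# Whitespace is ignored and "#" starts a comment, so the pattern can be
# laid out over several lines.
date = re.compile(
    r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    [ ]       # a literal space must be escaped or placed in a class
    UTC
    """,
    re.VERBOSE,
)

match = date.fullmatch("2024-05 UTC")
parts = match.groups()                 # ('2024', '05')
```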

control verb

Control verbs don't match any characters. They let the pattern steer the engine through hints, optimizations, and jumps. See the PCRE documentation on control verbs.

THEN: Autopossessification. Give up matching the current branch within an alternation, try the next branch. If the current pattern only consists of one branch (that is, it is not an alternation), then it goes to the next branch of the outer pattern if the current pattern is within another pattern.
ACCEPT: This verb acts immediately. This verb causes the pattern to match successfully, skipping the remainder of the pattern. Within a subroutine, it causes only the subroutine to end successfully. Within a positive assertion, only the assertion ends successfully. Within a negative assertion, the assertion fails.
F and FAIL: This verb acts immediately. If the engine reaches this control verb within a branch, it immediately fails and begins to backtrack. It is the same as (?!).
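Python's re module has no control verbs, but the (?!) equivalent of FAIL mentioned above does work there, so a forced failure can be sketched as follows (the re module is an assumption standing in for PCRE):

```python
import re

# (?!) is an empty negative lookahead: it can never succeed, so the engine
# fails at that point and backtracks.
never = re.search(r"a(?!)", "aaa")         # None

# The first branch always fails after "a", so only the second can match.
fallback = re.search(r"a(?!)|b", "ab")     # matches 'b'
```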

special options

These special options must appear right at the start of the regular expression. They are not Perl-compatible. These options are available to pattern writers who are not able to change the program that processes the patterns.

newline sequence

By default, the engine uses a newline sequence of one or more characters (e.g. \n on Unix-like systems, \r\n on Windows). The newline sequence may be explicitly controlled using special options.