Regular Expression Syntax

package java.util.regex;

This page was last updated on 9 April 2009 and much of the content is from Sun's Java 2 Platform SE 5.0. See Java's Pattern Class for more details on regular expressions and their usage.

Brief Background
A regular expression is a character string in which some characters are given special meaning for pattern matching. Regular expressions have been in use since the early days of computing, and provide a powerful and efficient way to parse, interpret, search, and replace text within an application.

Supported Syntax
Within a regular expression, the following characters have special meaning:

Boundary Operators

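^ matches at the beginning of a line (only at the beginning of the input unless MULTILINE mode is enabled)
$ matches at the end of a line (only at the end of the input, or before its final line terminator, unless MULTILINE mode is enabled)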
\A matches the start of the entire string
\b matches a word boundary
\B matches a non-word boundary
\G matches the end of the previous match
\Z matches the end of the entire string, except for the final terminator, if any
\z matches the end of the entire string
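
The boundary operators above can be tried out with a short sketch like the following. The class name BoundaryDemo and the helper starts are illustrative names, not part of java.util.regex; the sample text is made up. The sketch assumes the default flags except where Pattern.MULTILINE is passed explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Hypothetical helper: returns the start offset of every match of regex in input.
    static List<Integer> starts(String regex, int flags, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "one two\nthree\n";
        // Without MULTILINE, ^ matches only at the start of the input.
        System.out.println(starts("^\\w+", 0, text));                 // [0]
        // With MULTILINE, ^ also matches after a line terminator.
        System.out.println(starts("^\\w+", Pattern.MULTILINE, text)); // [0, 8]
        // \b on both sides restricts the match to the whole word "two".
        System.out.println(starts("\\btwo\\b", 0, text));             // [4]
        // \z matches only at the very end of the input.
        System.out.println(starts("\\z", 0, text));                   // [14]
        // \Z can already match before the final line terminator.
        System.out.println(starts("\\Z", 0, text).get(0));            // 13
    }
}
```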

One-Character Operators

. matches any single character (line terminators are matched only when DOTALL mode is enabled)
\\ matches a backslash character
\0n matches the character with octal value 0n (0 <= n <= 7)
\0nn matches the character with octal value 0nn (0 <= n <= 7)
\0mnn matches the character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\a matches an alert (bell) character ('\u0007')
\cx matches the control character corresponding to x
\d matches any decimal digit: [0-9]
\D matches any non-digit: [^0-9]
\e matches an escape character ('\u001B')
\f matches a form-feed character ('\u000C')
\n matches a newline (line feed) character ('\u000A')
\r matches a return character ('\u000D')
\s matches any whitespace character: [ \t\n\x0B\f\r]
\S matches any non-whitespace character: [^\s]
\t matches a horizontal tab character ('\u0009')
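\w matches any word character (letter, digit, or underscore): [a-zA-Z_0-9]
\W matches any non-word character: [^\w]
\x matches the character x literally, if x is a non-alphabetic character (for example \. or \*); a backslash before an alphabetic character that is not one of the escape sequences listed above is an error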
\xhh matches the character with hexadecimal value 0xhh
\uhhhh matches the character with hexadecimal value 0xhhhh
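
A point that often causes confusion is that each backslash in a pattern has to be doubled inside a Java string literal. The following is a minimal sketch of a few of the operators above; EscapeDemo and its matches helper are illustrative names only.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Two digits, a hyphen, then one or more word characters.
        System.out.println(matches("\\d\\d-\\w+", "42-abc_9"));   // true
        // \s matches a tab.
        System.out.println(matches("a\\sb", "a\tb"));             // true
        // Hexadecimal, Unicode, and octal escapes for 'A', 'B', 'C'.
        System.out.println(matches("\\x41\\u0042\\0103", "ABC")); // true
        // By default . does not match a newline; (?s) turns DOTALL on.
        System.out.println(matches("a.b", "a\nb"));               // false
        System.out.println(matches("(?s)a.b", "a\nb"));           // true
        // The regex \\ (four backslashes in Java source) matches one backslash.
        System.out.println(matches("\\\\", "\\"));                // true
    }
}
```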

Character Class Operator

[abc] matches any character in the set a, b or c
[^abc] matches any character not in the set a, b or c
[a-zA-Z] matches any character in the range a through z or A through Z (range)
[a-d[m-p]] matches any character in the range a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] matches d, e, or f (intersection)
[a-z&&[^bc]] matches any character in the range a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] matches any character in the range a through z, and not m through p: [a-lq-z] (subtraction)
A leading or trailing dash inside a character class is interpreted literally.
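
The union, intersection, and subtraction forms above can be checked one character at a time. This is a sketch only; CharClassDemo and keep are illustrative names, and the sample input is arbitrary.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: keeps only the characters of input matched by charClass.
    static String keep(String charClass, String input) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "abcdefmnopqz-";
        System.out.println(keep("[a-d[m-p]]", input));    // abcdmnop (union)
        System.out.println(keep("[a-z&&[def]]", input));  // def (intersection)
        System.out.println(keep("[a-z&&[^bc]]", input));  // adefmnopqz (subtraction)
        System.out.println(keep("[a-z&&[^m-p]]", input)); // abcdefqz (subtraction)
        System.out.println(keep("[a-z-]", input));        // abcdefmnopqz- (trailing dash is literal)
    }
}
```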

POSIX character classes (US-ASCII only)

\p{Lower} matches a lower-case alphabetic character: [a-z]
\p{Upper} matches an upper-case alphabetic character: [A-Z]
\p{ASCII} matches any ASCII character: [\x00-\x7F]
\p{Alpha} matches an alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} matches a decimal digit: [0-9]
\p{Alnum} matches an alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} matches punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} matches a visible character: [\p{Alnum}\p{Punct}]
\p{Print} matches a printable character: [\p{Graph}\x20]
\p{Blank} matches a space or a tab: [ \t]
\p{Cntrl} matches a control character: [\x00-\x1F\x7F]
\p{XDigit} matches a hexadecimal digit: [0-9a-fA-F]
\p{Space} matches a whitespace character: [ \t\n\x0B\f\r]
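
As a rough illustration of the POSIX classes, the sketch below validates a hexadecimal string and strips ASCII punctuation. PosixClassDemo, isHex, and stripPunct are made-up names for this example; only String.matches and String.replaceAll come from the standard library.

```java
public class PosixClassDemo {
    // Hypothetical helper: true when s consists only of hexadecimal digits.
    static boolean isHex(String s) {
        return s.matches("\\p{XDigit}+");
    }

    // Hypothetical helper: removes every ASCII punctuation character.
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        System.out.println(isHex("CafeBabe"));            // true
        System.out.println(isHex("xyz"));                 // false
        System.out.println(stripPunct("Hello, world!"));  // Hello world
        // \p{Blank} covers only space and tab, so a newline is not matched.
        System.out.println("\n".matches("\\p{Blank}"));   // false
        System.out.println("\n".matches("\\p{Space}"));   // true
    }
}
```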

java.lang.Character classes (simple Java character type)

\p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
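
The practical difference from the POSIX classes is that these classes follow the java.lang.Character methods and so are not limited to US-ASCII. A small sketch, assuming the default match flags; JavaCharClassDemo and matchesClass are illustrative names.

```java
public class JavaCharClassDemo {
    // Hypothetical helper: true when the one-character string ch matches the class.
    static boolean matchesClass(String propertyClass, String ch) {
        return ch.matches(propertyClass);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // lower-case e with acute accent
        // The POSIX class is ASCII-only, the java.lang.Character class is not.
        System.out.println(matchesClass("\\p{Lower}", eAcute));         // false
        System.out.println(matchesClass("\\p{javaLowerCase}", eAcute)); // true
        // U+2003 (em space) is whitespace for Character.isWhitespace() but not for \s.
        System.out.println(matchesClass("\\s", "\u2003"));                // false
        System.out.println(matchesClass("\\p{javaWhitespace}", "\u2003")); // true
        // An opening parenthesis is a mirrored character.
        System.out.println(matchesClass("\\p{javaMirrored}", "("));     // true
    }
}
```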

Classes for Unicode blocks and categories

\p{InGreek} A character in the Greek block (simple block)
\p{Lu} An uppercase letter (simple category)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
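
The block and category classes can be exercised by collecting every matching character from a sample string. The names UnicodeClassDemo and keepMatching below are illustrative, and non-ASCII characters are written as \u escapes so the source does not depend on file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeClassDemo {
    // Hypothetical helper: concatenates every match of regex found in input.
    static String keepMatching(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Greek block: keeps alpha, beta, gamma and drops the Latin letters.
        System.out.println(keepMatching("\\p{InGreek}", "abc \u03b1\u03b2\u03b3"));
        // Uppercase letters only.
        System.out.println(keepMatching("\\p{Lu}", "Hello World"));              // HW
        // Currency symbols: dollar sign and euro sign (U+20AC).
        System.out.println(keepMatching("\\p{Sc}", "$5 or \u20ac4"));
        // Any letter except an uppercase letter.
        System.out.println(keepMatching("[\\p{L}&&[^\\p{Lu}]]", "Hello World")); // elloorld
    }
}
```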

Greedy quantifiers
These quantifiers match as much as possible at first, and give characters back only when that is needed for the overall match to succeed.
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
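
The counting forms above are easiest to see with whole-input matches. The following sketch uses an illustrative class QuantifierCountDemo with a hypothetical matches helper; the last line shows a greedy quantifier giving back one character so that the rest of the pattern can match.

```java
import java.util.regex.Pattern;

public class QuantifierCountDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));    // true  (u is optional)
        System.out.println(matches("colou?r", "colour"));   // true
        System.out.println(matches("\\d{3}", "12"));        // false (exactly three digits)
        System.out.println(matches("\\d{2,}", "12345"));    // true  (at least two)
        System.out.println(matches("\\d{2,4}", "12345"));   // false (at most four)
        // a* first takes all three a's, then backs off one so that "ab" can match.
        System.out.println(matches("a*ab", "aaab"));        // true
    }
}
```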

Reluctant quantifiers
These quantifiers match as little as possible, and extend the match only when that is needed for the overall match to succeed.
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
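
The difference between a greedy quantifier and its reluctant counterpart shows up in what the first match contains. A minimal sketch, with GreedyReluctantDemo and firstMatch as illustrative names and a made-up sample string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyReluctantDemo {
    // Hypothetical helper: returns the first substring matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        // Greedy: .+ runs to the last '>' in the input.
        System.out.println(firstMatch("<.+>", html));      // <b>bold</b> and <i>italic</i>
        // Reluctant: .+? stops at the first '>' that lets the match succeed.
        System.out.println(firstMatch("<.+?>", html));     // <b>
        // The same contrast with a bounded count.
        System.out.println(firstMatch("a{2,3}", "aaaa"));  // aaa
        System.out.println(firstMatch("a{2,3}?", "aaaa")); // aa
    }
}
```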

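Possessive quantifiers
These quantifiers match as much as possible and never give characters back, even when doing so would allow the overall match to succeed.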

X?+ X, once or not at all
X*+ X, zero or more times
X++ X, one or more times
X{n}+ X, exactly n times
X{n,}+ X, at least n times
X{n,m}+ X, at least n but not more than m times
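
A possessive quantifier behaves like a greedy one that refuses to backtrack, which can make an otherwise matching pattern fail. The sketch below contrasts the two; PossessiveDemo and matches are illustrative names.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy a* gives back one 'a' so that "ab" can still match.
        System.out.println(matches("a*ab", "aaab"));            // true
        // Possessive a*+ keeps all three a's, leaving nothing for the 'a' in "ab".
        System.out.println(matches("a*+ab", "aaab"));           // false
        // .*+ swallows the closing quote and never returns it.
        System.out.println(matches("\".*+\"", "\"abc\""));      // false
        // Excluding the quote from the repeated class makes the possessive form safe.
        System.out.println(matches("\"[^\"]*+\"", "\"abc\""));  // true
    }
}
```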

Logical operators

XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group
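
Concatenation, alternation, and capturing groups usually appear together. The following sketch reads back what each group captured; GroupDemo and groups are illustrative names and the input string is invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: returns the groups of the first match,
    // with index 0 holding the whole match; empty array if nothing matches.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Group 1 captures the alternative that matched, group 2 the number.
        String[] g = groups("(cat|dog)s? (\\d+)", "3 pets: dogs 12");
        System.out.println(String.join(" | ", g));        // dogs 12 | dog | 12
        // Alternation binds loosest: ab|cd means "ab" or "cd", not a(b|c)d.
        System.out.println(Pattern.matches("ab|cd", "abd"));   // false
        System.out.println(Pattern.matches("a(b|c)d", "abd")); // true
    }
}
```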

Back references

\n Whatever the nth capturing group matched
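
A back reference matches the same text that the group captured, not the group's pattern again. A sketch of two common uses follows; BackrefDemo, collapseRepeats, and firstQuoted are illustrative names. Note that in a replacement string the captured group is written $1 rather than \1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackrefDemo {
    // Hypothetical helper: collapses an immediately repeated word ("is is") into one.
    static String collapseRepeats(String s) {
        return s.replaceAll("\\b(\\w+) \\1\\b", "$1");
    }

    // Hypothetical helper: returns the first quoted run whose closing quote
    // is the same character as its opening quote, or null if there is none.
    static String firstQuoted(String s) {
        Matcher m = Pattern.compile("(['\"]).*?\\1").matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(collapseRepeats("this is is a test")); // this is a test
        System.out.println(firstQuoted("say \"hi\" or 'yo'"));    // "hi"
    }
}
```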

Quotation

\ Nothing, but quotes the following character
\Q Nothing, but quotes all characters until \E
\E Nothing, but ends quoting started by \Q
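
Quoting matters whenever literal text contains characters that are otherwise operators. The sketch below shows the same literal matched three ways; QuoteDemo and matchesLiteral are illustrative names, while Pattern.quote is the standard library method that wraps a string so it is treated literally.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: true when input equals literal, compared through a quoted regex.
    static boolean matchesLiteral(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unquoted, + is a quantifier: the pattern means "one or more 1, then 1=2".
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));         // false
        System.out.println(Pattern.matches("1+1=2", "11=2"));          // true
        // A single backslash quotes just the following character.
        System.out.println(Pattern.matches("1\\+1=2", "1+1=2"));       // true
        // \Q...\E quotes everything in between.
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2"));   // true
        // Pattern.quote does the quoting for arbitrary input.
        System.out.println(matchesLiteral("1+1=2", "1+1=2"));          // true
    }
}
```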

Special constructs (non-capturing)

(?:X) X, as a non-capturing group
(?idmsux-idmsux) Nothing, but turns match flags on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
(?<=X) X, via zero-width positive lookbehind
(?<!X) X, via zero-width negative lookbehind
(?>X) X, as an independent, non-capturing group
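
The constructs above either match without consuming input (lookahead and lookbehind), change flags inline, or group without capturing. The following is a sketch under made-up sample strings; SpecialConstructDemo and first are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialConstructDemo {
    // Hypothetical helper: returns the first substring matched by regex, or null.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String prices = "pay 100 USD or 90 EUR";
        // Positive lookahead: a number followed by " USD"; the suffix is not part of the match.
        System.out.println(first("\\d+(?= USD)", prices));              // 100
        // Negative lookahead: a whole number not followed by " USD".
        System.out.println(first("\\b\\d+\\b(?! USD)", prices));        // 90
        // Positive lookbehind: digits preceded by a dollar sign.
        System.out.println(first("(?<=\\$)\\d+", "cost $45 for 3"));    // 45
        // Negative lookbehind: a number not preceded by a dollar sign.
        System.out.println(first("(?<!\\$)\\b\\d+", "cost $45 for 3")); // 3
        // (?i) turns case-insensitive matching on for the rest of the pattern.
        System.out.println(Pattern.matches("(?i)hello", "HeLLo"));      // true
        // (?i:h) limits the flag to the group, so "ELLO" must still be lower case.
        System.out.println(Pattern.matches("(?i:h)ello", "HELLO"));     // false
        // (?:ab) groups without capturing, so only (c) counts as a group.
        System.out.println(Pattern.compile("(?:ab)+(c)").matcher("").groupCount()); // 1
        // An independent group never gives characters back once it has matched.
        System.out.println(Pattern.matches("(?>a+)ab", "aaab"));        // false
    }
}
```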