Regular Expression Syntax
package java.util.regex;
This page was last updated on 9 April 2009 and much of the content is from Sun's Java 2 Platform SE 5.0. See Java's Pattern Class for more details on regular expressions and their usage.
Brief Background
A regular expression is a character string in which some characters are given special meaning for pattern matching. Regular expressions have been in use since the early days of computing, and provide a powerful and efficient way to parse, interpret, search and replace text within an application.
Supported Syntax
Within a regular expression, the following characters have special meaning:
- Boundary Operators
^   matches at the beginning of a line; without MULTILINE mode, only at the beginning of the entire input
$   matches at the end of a line; without MULTILINE mode, only at the end of the entire input
\A  matches the start of the entire string
\b  matches a word boundary
\B  matches a non-word boundary
\G  matches the end of the previous match
\Z  matches the end of the entire string, except for the final terminator, if any
\z  matches the end of the entire string
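The boundary operators above can be tried out with a minimal sketch like the following. The class BoundaryDemo and its helper findAll are illustrative names, not part of java.util.regex, and the sample text is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class BoundaryDemo {
    // Returns every substring of input matched by regex, compiled with the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "one two\nthree four";
        // Default mode: ^ matches only at the start of the whole input.
        System.out.println(findAll("^\\w+", 0, text));                 // [one]
        // MULTILINE mode: ^ also matches just after a line terminator.
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, text)); // [one, three]
        // A word boundary keeps the match at the start of a word.
        System.out.println(findAll("\\bt\\w*", 0, text));              // [two, three]
        // Upper-case Z tolerates a final line terminator, lower-case z does not.
        System.out.println(findAll("four\\Z", 0, text + "\n"));        // [four]
        System.out.println(findAll("four\\z", 0, text + "\n"));        // []
    }
}
```

Each regex backslash is doubled because the pattern is written as a Java string literal.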
- One-Character Operators
.       matches any single character (line terminators are matched only in DOTALL mode)
\\      matches a backslash character
\0n     matches the character with octal value 0n (0 <= n <= 7)
\0nn    matches the character with octal value 0nn (0 <= n <= 7)
\0mnn   matches the character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\a      matches an alert (bell) character ('\u0007')
\cx     matches the control character corresponding to x
\d      matches any decimal digit: [0-9]
\D      matches any non-digit: [^0-9]
\e      matches an escape character ('\u001B')
\f      matches a form-feed character ('\u000C')
\n      matches a newline (line feed) character ('\u000A')
\r      matches a return character ('\u000D')
\s      matches any whitespace character: [ \t\n\x0B\f\r]
\S      matches any non-whitespace character: [^\s]
\t      matches a horizontal tab character ('\u0009')
\w      matches any word (alphanumeric) character: [a-zA-Z_0-9]
\W      matches any non-word (alphanumeric) character: [^\w]
\x      matches the character x literally when x is non-alphabetic; a backslash before an alphabetic character that is not one of the escape sequences in this list is an error
\xhh    matches the character with hexadecimal value 0xhh
\uhhhh  matches the character with hexadecimal value 0xhhhh
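A short sketch of how a few of these one-character operators behave. EscapeDemo and fullMatch are hypothetical names chosen for this example, and the inputs are arbitrary.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class EscapeDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Each regex backslash is doubled inside a Java string literal.
        System.out.println(fullMatch("\\d\\d:\\d\\d", "09:30"));       // true: four digits around a colon
        System.out.println(fullMatch("\\w+\\s\\w+", "hello world"));  // true: word, whitespace, word
        System.out.println(fullMatch("\\x41\\u0042", "AB"));          // true: hexadecimal escapes for A and B
        System.out.println(fullMatch("a\\.b", "a.b"));                // true: escaped dot is a literal dot
        System.out.println(fullMatch("a\\.b", "axb"));                // false
        System.out.println(fullMatch("a.b", "a\nb"));                 // false: dot skips line terminators by default
        System.out.println(fullMatch("(?s)a.b", "a\nb"));             // true: DOTALL enabled with the embedded flag s
    }
}
```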
- Character Class Operator
[abc]          matches any character in the set a, b or c
[^abc]         matches any character not in the set a, b or c
[a-zA-Z]       matches any character in the range a through z or A through Z (range)
[a-d[m-p]]     matches any character in the range a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]   matches d, e, or f (intersection)
[a-z&&[^bc]]   matches any character in the range a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  matches any character in the range a through z, and not m through p: [a-lq-z] (subtraction)
A leading or trailing dash inside a character class is interpreted literally.
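The union, intersection and subtraction forms can be checked with a small sketch. CharClassDemo, fullMatch and maskConsonants are illustrative names, not an established API.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class CharClassDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Replaces every lower-case consonant with '*' by subtracting the vowels from a-z.
    static String maskConsonants(String input) {
        return input.replaceAll("[a-z&&[^aeiou]]", "*");
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[a-d[m-p]]", "n"));    // true: union
        System.out.println(fullMatch("[a-z&&[def]]", "e"));  // true: intersection
        System.out.println(fullMatch("[a-z&&[def]]", "a"));  // false
        System.out.println(fullMatch("[a-z&&[^bc]]", "b"));  // false: subtraction
        System.out.println(fullMatch("[a-z-]+", "a-b"));     // true: trailing dash is literal
        System.out.println(maskConsonants("hello world"));   // *e**o *o***
    }
}
```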
- POSIX character classes (US-ASCII only)
\p{Lower}   matches a lower-case alphabetic character: [a-z]
\p{Upper}   matches an upper-case alphabetic character: [A-Z]
\p{ASCII}   matches all ASCII: [\x00-\x7F]
\p{Alpha}   matches an alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}   matches a decimal digit: [0-9]
\p{Alnum}   matches an alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}   matches punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}   matches a visible character: [\p{Alnum}\p{Punct}]
\p{Print}   matches a printable character: [\p{Graph}\x20]
\p{Blank}   matches a space or a tab: [ \t]
\p{Cntrl}   matches a control character: [\x00-\x1F\x7F]
\p{XDigit}  matches a hexadecimal digit: [0-9a-fA-F]
\p{Space}   matches a whitespace character: [ \t\n\x0B\f\r]
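As a rough illustration of the POSIX classes, the sketch below validates, splits and cleans strings. PosixDemo and its three helper methods are hypothetical names introduced only for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class PosixDemo {
    // True when the input consists only of hexadecimal digits.
    static boolean isHex(String s) {
        return Pattern.matches("\\p{XDigit}+", s);
    }

    // Splits the input on runs of ASCII punctuation.
    static List<String> splitOnPunct(String s) {
        return Arrays.asList(s.split("\\p{Punct}+"));
    }

    // Removes ASCII control characters, including tab and bell.
    static String stripControl(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        System.out.println(isHex("CAFE42"));            // true
        System.out.println(isHex("xyz"));               // false
        System.out.println(splitOnPunct("a,b;;c"));     // [a, b, c]
        System.out.println(stripControl("a\u0007b\tc")); // abc
    }
}
```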
- java.lang.Character classes (simple java character type)
\p{javaLowerCase}   Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}   Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}  Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}    Equivalent to java.lang.Character.isMirrored()
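The difference between the US-ASCII POSIX classes and the java.lang.Character classes shows up with non-ASCII letters. The sketch below assumes default compile flags; JavaCharClassDemo and fullMatch are illustrative names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class JavaCharClassDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Lower-case e with an acute accent, written as a Java string escape.
        String eAcute = "\u00e9";
        // The POSIX class covers only a-z under default flags.
        System.out.println(fullMatch("\\p{Lower}", eAcute));               // false
        // The java class defers to Character.isLowerCase.
        System.out.println(fullMatch("\\p{javaLowerCase}", eAcute));       // true
        System.out.println(Character.isLowerCase(eAcute.charAt(0)));       // true
        System.out.println(fullMatch("\\p{javaWhitespace}+", " \t\n"));    // true
    }
}
```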
- Classes for Unicode blocks and categories
\p{InGreek}         A character in the Greek block (simple block)
\p{Lu}              An uppercase letter (simple category)
\p{Sc}              A currency symbol
\P{InGreek}         Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)
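A minimal sketch of the block and category classes, using Java string escapes for the non-ASCII sample characters. UnicodeDemo and fullMatch are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class UnicodeDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // alpha, beta, gamma
        String euro = "\u20ac";              // euro sign
        System.out.println(fullMatch("\\p{InGreek}+", greek));            // true: simple block
        System.out.println(fullMatch("\\P{InGreek}+", "abc"));            // true: negated block
        System.out.println(fullMatch("\\p{Lu}", "A"));                    // true: uppercase letter
        System.out.println(fullMatch("\\p{Sc}", euro));                   // true: currency symbol
        System.out.println(fullMatch("[\\p{L}&&[^\\p{Lu}]]+", "abc"));    // true: letters minus uppercase
        System.out.println(fullMatch("[\\p{L}&&[^\\p{Lu}]]+", "aBc"));    // false
    }
}
```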
- Greedy quantifiers
These quantifiers continue to match as much as possible, even when stopping would allow the overall match to succeed.
X?      X, once or not at all
X*      X, zero or more times
X+      X, one or more times
X{n}    X, exactly n times
X{n,}   X, at least n times
X{n,m}  X, at least n but not more than m times
- Reluctant quantifiers
These quantifiers will stop matching if doing so will allow the overall match to succeed.
X??      X, once or not at all
X*?      X, zero or more times
X+?      X, one or more times
X{n}?    X, exactly n times
X{n,}?   X, at least n times
X{n,m}?  X, at least n but not more than m times
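The contrast between greedy and reluctant quantifiers is easiest to see on the same input. In this sketch, QuantifierDemo and firstMatch are illustrative names and the sample strings are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class QuantifierDemo {
    // Returns the first substring matched by regex, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        // Greedy: runs to the last '>' that still lets the match succeed.
        System.out.println(firstMatch("<.+>", html));      // <b>bold</b>
        // Reluctant: stops at the first '>' that lets the match succeed.
        System.out.println(firstMatch("<.+?>", html));     // <b>
        System.out.println(firstMatch("a{2,3}", "aaaa"));  // aaa
        System.out.println(firstMatch("a{2,3}?", "aaaa")); // aa
    }
}
```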
- Possessive quantifiers
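These quantifiers match as much as possible and never give back what they have matched, even when doing so would allow the overall match to succeed.
X?+      X, once or not at all
X*+      X, zero or more times
X++      X, one or more times
X{n}+    X, exactly n times
X{n,}+   X, at least n times
X{n,m}+  X, at least n but not more than m times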
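A possessive quantifier never backtracks, which can make an otherwise matching pattern fail. The sketch below compares the three quantifier kinds on one input; PossessiveDemo and fullMatch are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class PossessiveDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy: consumes everything, then backs off so that "foo" can match.
        System.out.println(fullMatch(".*foo", "xfoo"));   // true
        // Reluctant: consumes as little as possible before "foo".
        System.out.println(fullMatch(".*?foo", "xfoo"));  // true
        // Possessive: consumes everything and keeps it, so "foo" can never match.
        System.out.println(fullMatch(".*+foo", "xfoo"));  // false
        // Possessive quantifiers still succeed when no backtracking is needed.
        System.out.println(fullMatch("\\d++a", "123a"));  // true
    }
}
```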
- Logical operators
XY   X followed by Y
X|Y  Either X or Y
(X)  X, as a capturing group
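Concatenation, alternation and a capturing group can be combined in one small pattern. In this sketch, GroupDemo and vowelOf are illustrative names and the pattern gr(a|e)y is just a sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class GroupDemo {
    // Returns the text captured by group 1, or null when the whole input does not match.
    static String vowelOf(String input) {
        Matcher m = Pattern.compile("gr(a|e)y").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(vowelOf("gray")); // a
        System.out.println(vowelOf("grey")); // e
        System.out.println(vowelOf("groy")); // null
    }
}
```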
- Back references
\n  Whatever the nth capturing group matched
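A common use of a back reference is spotting a doubled word. The following is a sketch under that assumption; BackrefDemo, firstRepeatedWord and collapse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class BackrefDemo {
    // Returns the first word that is immediately repeated, or null when there is none.
    static String firstRepeatedWord(String input) {
        // Group 1 captures a word; the back reference requires the same text again.
        Matcher m = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Collapses runs of a repeated word into a single occurrence.
    static String collapse(String input) {
        return input.replaceAll("\\b(\\w+)(\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("it was the the best")); // the
        System.out.println(firstRepeatedWord("no repeats here"));     // null
        System.out.println(collapse("the the the cat"));              // the cat
    }
}
```

In the replacement string, $1 refers to the same group that \1 refers to inside the pattern.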
- Quotation
\   Nothing, but quotes the following character
\Q  Nothing, but quotes all characters until \E
\E  Nothing, but ends quoting started by \Q
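Quoting matters whenever literal text contains operator characters. This sketch matches the literal text 1+1=2 three ways; QuoteDemo and fullMatch are illustrative names, while Pattern.quote is the standard library method that wraps a string in \Q and \E.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class QuoteDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String literal = "1+1=2";
        // Unquoted, the plus sign is a quantifier, so the literal text does not match itself.
        System.out.println(fullMatch("1+1=2", literal));                // false
        System.out.println(fullMatch("1+1=2", "11=2"));                 // true
        // A single backslash quotes the one character that follows it.
        System.out.println(fullMatch("1\\+1=2", literal));              // true
        // Everything between the Q and E markers is taken literally.
        System.out.println(fullMatch("\\Q1+1=2\\E", literal));          // true
        // Pattern.quote builds the quoted form for arbitrary text.
        System.out.println(fullMatch(Pattern.quote(literal), literal)); // true
    }
}
```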
- Special constructs (non-capturing)
(?:X)                X, as a non-capturing group
(?idmsux-idmsux)     Nothing, but turns match flags on - off
(?idmsux-idmsux:X)   X, as a non-capturing group with the given flags on - off
(?=X)                X, via zero-width positive lookahead
(?!X)                X, via zero-width negative lookahead
(?<=X)               X, via zero-width positive lookbehind
(?<!X)               X, via zero-width negative lookbehind
(?>X)                X, as an independent, non-capturing group
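The special constructs are zero-width or non-capturing, so they shape what matches without adding groups. A minimal sketch follows, with LookaroundDemo and firstMatch as hypothetical names and made-up sample strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class LookaroundDemo {
    // Returns the first substring matched by regex, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive lookahead: digits only when followed by " USD"; the suffix is not consumed.
        System.out.println(firstMatch("\\d+(?= USD)", "costs 100 USD"));    // 100
        // Positive lookbehind: digits only when preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", "price: $42"));       // 42
        // Negative lookahead: "foo" not followed by "bar".
        System.out.println(firstMatch("foo(?!bar)\\w*", "foobar foobaz"));  // foobaz
        // Embedded flag i turns on case-insensitive matching.
        System.out.println(firstMatch("(?i)hello", "Say HeLLo"));           // HeLLo
        // A non-capturing group adds no capturing groups to the pattern.
        System.out.println(Pattern.compile("(?:ab)+").matcher("abab").groupCount()); // 0
        // An independent group keeps what it matched, so the trailing "ab" cannot match.
        System.out.println(firstMatch("a+ab", "aaab"));                     // aaab
        System.out.println(firstMatch("(?>a+)ab", "aaab"));                 // null
    }
}
```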