A refresher on the syntax of regular expressions and the rules for constructing them.
See also Configuring an incremental index of a staged website.
Syntax of regular expressions
Category | Syntax | Meaning |
---|---|---|
Text | . | Any single character |
Text | [chars] | Character class: one of chars |
Text | [^chars] | Character class: none of chars |
Text | text1\|text2 | Alternative: text1 or text2 |
Quantifiers | ? | 0 or 1 occurrences of the preceding text |
Quantifiers | * | 0 or more occurrences of the preceding text |
Quantifiers | + | 1 or more occurrences of the preceding text |
Grouping | (text) | Grouping of text, either to set the borders of an alternative or to make back references, where the Nth group is used on the right-hand side of a RewriteRule with $N |
Anchors | ^ | Start-of-line anchor |
Anchors | $ | End-of-line anchor |
Escaping | \char | Escape the particular char, for example to specify the characters . [ ] ( ) and so forth |
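As a quick check of the table above, the following sketch exercises each construct with Python's re module. Python is an assumption here, since the source does not name an engine, and its dialect differs in places: for instance, a back reference in a replacement is written \1 rather than $1. All sample strings and paths are invented for illustration.

```python
import re

# Text: "." matches any single character except a newline.
assert re.fullmatch(r"c.t", "cat")
assert re.fullmatch(r"c.t", "c\nt") is None

# Character classes: one of, or none of, the listed characters.
assert re.fullmatch(r"[abc]", "b")
assert re.fullmatch(r"[^abc]", "d")
assert re.fullmatch(r"[^abc]", "a") is None

# Alternative: text1 or text2.
assert re.fullmatch(r"cat|dog", "dog")

# Quantifiers apply to the preceding text.
assert re.fullmatch(r"colou?r", "color")      # ? : 0 or 1
assert re.fullmatch(r"ab*", "a")              # * : 0 or more
assert re.fullmatch(r"ab+", "a") is None      # + : 1 or more
assert re.fullmatch(r"ab+", "abbb")

# Grouping: set the borders of an alternative and capture text for reuse.
m = re.fullmatch(r"(jpe?g|png)-([0-9]+)", "jpeg-42")
assert m.group(1) == "jpeg"
assert m.group(2) == "42"

# Back reference to group 1 in a replacement (Python spells $1 as \1).
rewritten = re.sub(r"^/old/(.+)$", r"/new/\1", "/old/page.html")
assert rewritten == "/new/page.html"

# Anchors: start of line and end of line.
assert re.search(r"^index", "index.html")
assert re.search(r"^html", "index.html") is None
assert re.search(r"html$", "index.html")

# Escaping: "\." is a literal period, while "." is any character.
assert re.search(r"\.html$", "indexxhtml") is None
assert re.search(r".html$", "indexxhtml")
```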
Rules about regular expressions
An ordinary character—not one of the special characters described below—is a one-character regular expression that matches itself.
A backslash (\) followed by any special character is a one-character regular expression that matches the special character itself. Special characters include the following:
. (period), * (asterisk), ? (question mark), + (plus sign), [ (left square bracket), | (vertical pipe), and \ (backslash) are always special characters, except when they appear within square brackets.
^ (caret or circumflex) is special at the beginning of a regular expression, or when it immediately follows the left bracket of a pair of square brackets.
$ (dollar sign) is special at the end of a regular expression.
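The escaping rule above can be illustrated with a short sketch, again assuming Python's re module rather than any engine named by the source. One known difference: Python still treats the backslash as an escape character inside square brackets, so the bracket example below leaves out \ and [.

```python
import re

# A backslash turns a special character into a literal one.
literal_dot = re.compile(r"a\.b")
any_char = re.compile(r"a.b")

assert literal_dot.fullmatch("a.b")
assert literal_dot.fullmatch("axb") is None
assert any_char.fullmatch("axb")

# Inside square brackets, ".", "*", "?", "+", and "|" stand for themselves.
in_brackets = re.compile(r"[.*?+|]+")
assert in_brackets.fullmatch("*.+?|")
assert in_brackets.fullmatch("abc") is None

# re.escape() backslash-escapes the special characters in a string,
# which is handy when the text to match comes from user input.
escaped = re.escape("1+1=2?")
assert re.fullmatch(escaped, "1+1=2?")

# Without escaping, "+" and "?" act as quantifiers and the match fails.
assert re.fullmatch("1+1=2?", "1+1=2?") is None
```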
. (period) is a one-character regular expression that matches any character, including supplementary code set characters, with the exception of new-line.
A non-empty string of characters enclosed in [ ] (left and right square brackets) is a one-character regular expression that matches one character, including supplementary code set characters, in that string.
If, however, the first character of the string is a ^ (circumflex), the one-character regular expression matches any character, including supplementary code set characters, with the exception of new-line and the remaining characters in the string.
The ^ has this special meaning only if it occurs first in the string. You can use - (minus sign) to indicate a range of consecutive characters, including supplementary code set characters. For example, [0-9] is equivalent to [0123456789].
Characters specifying the range must be from the same code set. When the characters are from different code sets, one of the characters specifying the range is matched. The - loses this special meaning if it occurs first (after an initial ^, if any) or last in the string. The ] (right square bracket) does not terminate such a string when it is the first character within it, after an initial ^, if any. For example, []a-f] matches either a ] (right square bracket) or one of the ASCII letters a through f inclusive. The characters listed above as always special stand for themselves within such a string of characters.
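The bracket rules above (ranges, negation, and the positions where - and ] are literal) can be checked with a small sketch. It assumes Python's re module, which follows the same positional rules for these cases. Note that a negated set in Python also matches a new-line, unlike the rule described above.

```python
import re

# A range written with "-" is shorthand for listing every character.
assert re.fullmatch(r"[0-9]+", "2024")
assert re.fullmatch(r"[0123456789]+", "2024")

# A leading "^" negates the set; anywhere else it is an ordinary character.
assert re.fullmatch(r"[^0-9]", "x")
assert re.fullmatch(r"[^0-9]", "7") is None
assert re.fullmatch(r"[0-9^]", "^")

# "-" is literal when it is first or last in the set.
assert re.fullmatch(r"[-+]?[0-9]+", "-15")
assert re.fullmatch(r"[a-]", "-")
assert re.fullmatch(r"[a-]", "b") is None

# "]" is literal when it is the first character of the set.
right_or_a_to_f = re.compile(r"[]a-f]")
assert right_or_a_to_f.fullmatch("]")
assert right_or_a_to_f.fullmatch("c")
assert right_or_a_to_f.fullmatch("g") is None
```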
Rules for constructing regular expressions from one-character regular expressions
You can use the following rules to construct regular expressions from one-character regular expressions:
A one-character regular expression followed by * (asterisk) is a regular expression that matches zero or more occurrences of the one-character regular expression, which may be a supplementary code set character. If there is any choice, the longest leftmost string that permits a match is chosen.
A one-character regular expression followed by ? (question mark) is a regular expression that matches zero or one occurrences of the one-character regular expression, which may be a supplementary code set character. If there is any choice, the longest leftmost string that permits a match is chosen.
A one-character regular expression followed by + (plus sign) is a regular expression that matches one or more occurrences of the one-character regular expression, which may be a supplementary code set character. If there is any choice, the longest leftmost string that permits a match is chosen.
A one-character regular expression followed by {m}, {m,}, or {m,n} is a regular expression that matches a range of occurrences of the one-character regular expression. The values of m and n must be non-negative integers less than 256; {m} matches exactly m occurrences; {m,} matches at least m occurrences; {m,n} matches any number of occurrences between m and n inclusive. Whenever a choice exists, the regular expression matches as many occurrences as possible.
A regular expression followed by | (vertical pipe) and another regular expression is a regular expression that matches either the first regular expression (before the vertical pipe) or the second regular expression (after the vertical pipe).
You can also constrain a regular expression to match only an initial segment or final segment of a line, or both.
^ (circumflex) at the beginning of a regular expression constrains that regular expression to match an initial segment of a line.
$ (dollar sign) at the end of an entire regular expression constrains that regular expression to match a final segment of a line.
There are some predefined character class names that you can use in place of complex bracketed regular expressions. For example, a digit can be represented by the one-character regular expression [0-9] or by the character class one-character regular expression [[:digit:]].
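The quantifier, interval, alternation, and anchor rules above can be sketched as follows, assuming Python's re module. The examples avoid one known difference: when several alternatives match at the same position, Python takes the first one that succeeds rather than the longest. The paths and words are invented sample data.

```python
import re

# "*", "?", and "+" follow a one-character regular expression.
assert re.fullmatch(r"go*gle", "ggle")          # zero or more "o"
assert re.fullmatch(r"go?gle", "gogle")         # zero or one "o"
assert re.fullmatch(r"go+gle", "ggle") is None  # one or more "o"

# Interval expressions: {m}, {m,}, and {m,n}.
assert re.fullmatch(r"[0-9]{4}", "2024")
assert re.fullmatch(r"[0-9]{2,}", "7") is None
assert re.fullmatch(r"[0-9]{2,4}", "123")
assert re.fullmatch(r"[0-9]{2,4}", "12345") is None

# Whenever a choice exists, as many occurrences as possible are matched.
greedy = re.search(r"a+", "caaandy")
assert greedy.group(0) == "aaa"

# Alternation matches the expression on either side of the vertical pipe.
assert re.fullmatch(r"gif|png", "png")

# "^" constrains the match to an initial segment, "$" to a final segment.
assert re.search(r"^/products/", "/products/shoes.html")
assert re.search(r"^/products/", "/en/products/shoes.html") is None
assert re.search(r"\.html$", "/products/shoes.html")

# Both anchors together require the whole line to match.
assert re.search(r"^[0-9]+$", "12345")
assert re.search(r"^[0-9]+$", "123a5") is None
```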
The predefined character classes and their meanings are the following:
Character class | Meaning |
---|---|
[[:alnum:]] | An alphabetic character or a digit. |
[[:alpha:]] | An alphabetic character. |
[[:blank:]] | A space or a tab. |
[[:cntrl:]] | A control code; non-printing character. |
[[:digit:]] | A digit. |
[[:graph:]] | Any printing character except space. |
[[:lower:]] | A lower-case alphabetic character. |
[[:print:]] | Any printing character including space. |
[[:punct:]] | Punctuation. |
[[:space:]] | White space such as a space, a tab, or an end-of-line. |
[[:upper:]] | An upper-case alphabetic character. |
[[:xdigit:]] | A hexadecimal digit, upper- or lower-case. |
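Python's re module does not recognize these class names, so the sketch below translates them into explicit ranges before compiling. The helper expand_posix_classes and the POSIX_CLASSES table are hypothetical names introduced for illustration, and the ranges are an ASCII-only approximation; an engine that supports the names natively may also honor locale-specific or supplementary characters.

```python
import re

# Hypothetical ASCII-only equivalents of the predefined class names.
# Each value is the body of a bracket expression, without the outer [ ].
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": " \\t",
    "cntrl": "\\x00-\\x1f\\x7f",
    "digit": "0-9",
    "graph": "\\x21-\\x7e",
    "lower": "a-z",
    "print": "\\x20-\\x7e",
    "punct": "!-/:-@\\[-`{-~",
    "space": " \\t\\n\\r\\f\\v",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}


def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] in a bracket expression with explicit ranges."""
    return re.sub(
        r"\[:([a-z]+):\]",
        lambda match: POSIX_CLASSES[match.group(1)],
        pattern,
    )


# [[:digit:]] becomes [0-9].
digits_only = expand_posix_classes(r"^[[:digit:]]+$")
assert digits_only == "^[0-9]+$"
assert re.search(digits_only, "2024")
assert re.search(digits_only, "20x4") is None

# Class names can be combined with other characters in one bracket pair.
sku = expand_posix_classes(r"[[:upper:][:digit:]_]+")
assert sku == "[A-Z0-9_]+"
assert re.fullmatch(sku, "SKU_42")
assert re.fullmatch(sku, "sku") is None

# A six-digit hexadecimal color code.
hex_color = expand_posix_classes(r"#[[:xdigit:]]{6}")
assert re.fullmatch(hex_color, "#1a2B3c")
```

An unknown class name raises KeyError in this sketch, which makes typos visible instead of silently compiling a different pattern.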
Two special character class names match the null space at the start and the end of a word. In other words, they do not match an actual character. A word is considered to be any sequence of alphabetic characters, digits, or underscores (_).
Character class | Meaning |
---|---|
[[:<:]] | Start of a word |
[[:>:]] | End of a word |
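Python's re module has no [[:<:]] or [[:>:]], but the same zero-width behavior can be approximated with \b plus a lookaround. The helper expand_word_anchors below is a hypothetical name used only for this sketch, and the re.ASCII flag is an assumption that keeps the definition of a word to ASCII letters, digits, and underscores.

```python
import re


def expand_word_anchors(pattern: str) -> str:
    """Rewrite [[:<:]] and [[:>:]] as Python zero-width assertions."""
    return (
        pattern
        .replace("[[:<:]]", r"\b(?=\w)")   # boundary followed by a word char
        .replace("[[:>:]]", r"\b(?<=\w)")  # boundary preceded by a word char
    )


# Match "cat" only as a whole word.
whole_word = re.compile(expand_word_anchors(r"[[:<:]]cat[[:>:]]"), re.ASCII)
assert whole_word.search("the cat sat")
assert whole_word.search("concatenate") is None
assert whole_word.search("cat_food") is None  # "_" counts as a word character

# Match "cat" only at the start of a word.
word_start = re.compile(expand_word_anchors(r"[[:<:]]cat"), re.ASCII)
assert word_start.search("catalog")
assert word_start.search("bobcat") is None

# The anchors consume no characters: they match positions, not text.
start_pattern = re.compile(expand_word_anchors(r"[[:<:]]"), re.ASCII)
end_pattern = re.compile(expand_word_anchors(r"[[:>:]]"), re.ASCII)
word_starts = [m.start() for m in start_pattern.finditer("a bc")]
word_ends = [m.start() for m in end_pattern.finditer("a bc")]
assert word_starts == [0, 2]
assert word_ends == [1, 4]
```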