Regular Expressions

A refresher on regular expression syntax and the rules for constructing regular expressions.

See also Configuring an incremental index of a staged website.

Syntax of regular expressions

Text

.

Any single character

[chars]

Character class: One of chars

[^chars]

Character class: None of chars

text1|text2

Alternative: text1 or text2

Quantifiers

?

0 or 1 of the preceding text

*

0 or more occurrences of the preceding text

+

1 or more occurrences of the preceding text

Grouping

(text)

Grouping of text, either to set the borders of an alternative or to make back references (where the Nth group is used on the RHS of a RewriteRule with $N)

Anchors

^

Start of line anchor.

$

End of line anchor.

Escaping

\char

Escape the particular char. For example, to specify the chars ".[]()" and so forth.
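
As a rough illustration of the syntax elements above, the following sketch uses Python's standard re module. This is an assumption made for demonstration only: the engine described on this page is not Python, and details can differ (for example, Python writes a back reference in a replacement as \1 rather than $1). All sample strings and paths are made up.

```python
import re

# Any single character: "." matches exactly one character (not a newline).
assert re.fullmatch(r"c.t", "cat")
assert not re.fullmatch(r"c.t", "ct")

# Character classes: [chars] matches one of chars, [^chars] none of them.
assert re.fullmatch(r"gr[ae]y", "grey")
assert not re.fullmatch(r"gr[^ae]y", "grey")

# Alternative: text1|text2.
assert re.fullmatch(r"cat|dog", "dog")

# Quantifiers: ? (0 or 1), * (0 or more), + (1 or more).
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"ab*c", "ac")
assert not re.fullmatch(r"ab+c", "ac")

# Grouping: parentheses set the borders of an alternative ...
assert re.fullmatch(r"(jpe?g|png)", "jpeg")
# ... and capture text for back references (\1 here, $1 in a RewriteRule).
swapped = re.sub(r"^/old/(.+)$", r"/new/\1", "/old/page.html")

# Anchors: ^ and $ tie the match to the start and end of the line.
assert re.search(r"^/docs", "/docs/index.html")
assert not re.search(r"\.html$", "/docs/index.html.bak")

# Escaping: a backslash makes a special character literal.
assert re.fullmatch(r"index\.html", "index.html")
assert not re.fullmatch(r"index\.html", "indexXhtml")
```

Here swapped ends up as "/new/page.html", because the text captured by the first group is reused in the replacement.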

Rules about regular expressions

  • An ordinary character—not one of the special characters described below—is a one-character regular expression that matches itself.

  • A backslash (\) followed by any special character is a one-character regular expression that matches the special character itself. Special characters include the following:

    • . (period), * (asterisk), ? (question mark), + (plus sign), [ (left square bracket), | (vertical pipe), and \ (backslash) are always special characters, except when they appear within square brackets.

    • ^ (caret or circumflex) is special at the beginning of a regular expression, or when it immediately follows the left bracket of a pair of square brackets.

    • $ (dollar sign) is special at the end of a regular expression.

    • . (period) is a one-character regular expression that matches any character, including supplementary code set characters with the exception of new-line.

    • A non-empty string of characters enclosed in [ ] (left and right square brackets) is a one-character regular expression that matches one character, including supplementary code set characters, in that string.

      If, however, the first character of the string is a ^ (circumflex), the one-character regular expression matches any character, including supplementary code set characters, with the exception of new-line and the remaining characters in the string.

      The ^ has this special meaning only if it occurs first in the string. You can use - (minus sign) to indicate a range of consecutive characters, including supplementary code set characters. For example, [0-9] is equivalent to [0123456789].

      Characters specifying the range must be from the same code set. When the characters are from different code sets, one of the characters specifying the range is matched. The - loses this special meaning if it occurs first (after an initial ^, if any) or last in the string.

      The ] (right square bracket) does not terminate such a string when it is the first character within it (after an initial ^, if any). For example, []a-f] matches either a ] (right square bracket) or one of the ASCII letters a through f inclusive. The characters listed above as always special stand for themselves within such a string of characters.
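
The bracket rules above can be tried out with a short sketch. It uses Python's re module as a stand-in, which is an assumption: Python follows the same conventions for ranges and for the placement of ^, ] and - inside brackets, but it is not the engine this page documents, and it differs in some details (for instance, a negated set in Python also matches a new-line).

```python
import re

# A range is shorthand for listing every character in it.
digits = "0123456789"
assert all(re.fullmatch(r"[0-9]", d) for d in digits)
assert all(re.fullmatch(r"[0123456789]", d) for d in digits)

# A leading ^ negates the set; elsewhere in the set it is an ordinary character.
assert re.fullmatch(r"[^0-9]", "x")
assert not re.fullmatch(r"[^0-9]", "7")
assert re.fullmatch(r"[0-9^]", "^")

# A ] placed first (after an optional ^) does not close the set.
assert re.fullmatch(r"[]a-f]", "]")
assert re.fullmatch(r"[]a-f]", "c")
assert not re.fullmatch(r"[]a-f]", "g")

# A - placed first or last is a literal minus sign, not a range operator.
assert re.fullmatch(r"[-ab]", "-")
assert re.fullmatch(r"[ab-]", "-")

# Characters such as . * ? + | stand for themselves inside the brackets.
specials_found = re.findall(r"[.*?+|]", "a.b*c?d+e|f")
```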

Rules for constructing regular expressions from one-character regular expressions

You can use the following rules to construct regular expressions from one-character regular expressions:

  • A one-character regular expression is a regular expression that matches whatever the one-character regular expression matches.
  • A one-character regular expression followed by a * (asterisk) is a regular expression that matches zero or more occurrences of the one-character regular expression, which may be a supplementary code set character. If there is any choice, the longest leftmost string that permits a match is chosen.
  • A one-character regular expression followed by a ? (question mark) is a regular expression that matches zero or one occurrences of the one-character regular expression, which may be a supplementary code set character. If there is any choice, the longest leftmost string that permits a match is chosen.
  • A one-character regular expression followed by a + (plus sign) is a regular expression that matches one or more occurrences of the one-character regular expression, which may be a supplementary code set character. If there is any choice, the longest leftmost string that permits a match is chosen.
  • A one-character regular expression followed by {m}, {m,}, or {m,n} is a regular expression that matches a range of occurrences of the one-character regular expression. The values of m and n must be non-negative integers less than 256; {m} matches exactly m occurrences; {m,} matches at least m occurrences; {m,n} matches any number of occurrences between m and n inclusive. Whenever a choice exists, the regular expression matches as many occurrences as possible.
  • The concatenation of regular expressions is a regular expression that matches the concatenation of the strings matched by each component of the regular expression.
  • A regular expression enclosed between the character sequences ( and ) is a regular expression that matches whatever the unadorned regular expression matches.
  • A regular expression followed by a | (vertical pipe) followed by a regular expression is a regular expression that matches either the first regular expression (before the vertical pipe) or the second regular expression (after the vertical pipe).
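
The construction rules above can be sketched as follows, again using Python's re module purely as an assumed stand-in for the documented engine. One known difference: when alternatives overlap, Python takes the first alternative that succeeds rather than the longest one, so the examples avoid that case.

```python
import re

# * matches zero or more occurrences and takes as many as it can.
assert re.match(r"a*", "aaab").group() == "aaa"
assert re.match(r"a*", "bbb").group() == ""

# ? matches zero or one occurrence; + needs at least one.
assert re.fullmatch(r"ab?", "a")
assert re.match(r"a+", "bbb") is None

# {m}, {m,} and {m,n} bound the number of occurrences.
assert re.fullmatch(r"[0-9]{4}", "2024")
assert re.fullmatch(r"[0-9]{2,}", "12345")
assert re.match(r"a{2,3}", "aaaa").group() == "aaa"  # as many as allowed

# Concatenation matches the pieces one after another.
assert re.fullmatch(r"[a-z]+[0-9]+", "abc123")

# Parentheses and | combine whole expressions.
assert re.fullmatch(r"(cat|dog)s?", "dogs")

# When several matches are possible, the leftmost one is found first.
leftmost = re.search(r"o+", "foo boooo").group()
```

The search stops at the first run of o characters it meets, so leftmost holds the two-character run from "foo", not the longer run later in the string.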

You can also constrain a regular expression to match only an initial segment or final segment of a line, or both.

  • A ^ (circumflex) at the beginning of a regular expression constrains that regular expression to match an initial segment of a line.
  • A $ (dollar sign) at the end of an entire regular expression constrains that regular expression to match a final segment of a line.
  • The construction ^regular expression$ constrains the regular expression to match the entire line.
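
A minimal sketch of the three anchoring forms, assuming Python's re module and a made-up log line; single-line input is used so that Python's handling of trailing new-lines does not come into play.

```python
import re

line = "error: disk full"

# ^ constrains the match to an initial segment of the line.
assert re.search(r"^error", line)
assert not re.search(r"^disk", line)

# $ constrains the match to a final segment of the line.
assert re.search(r"full$", line)
assert not re.search(r"disk$", line)

# ^...$ together require the expression to match the entire line.
assert re.search(r"^error: .*$", line)
assert not re.search(r"^error$", line)

# Without anchors the expression may match anywhere in the line.
unanchored = re.search(r"disk", line).start()
```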

There are some predefined character class names that you can use in place of complex bracketed regular expressions. For example, a digit can be represented by the one-character regular expression [0-9] or by the character class expression [[:digit:]].

The predefined character classes and their meanings are the following:

Character class    Meaning

[[:alnum:]]        An alphabetic character or a digit.
[[:alpha:]]        An alphabetic character.
[[:blank:]]        A space or a tab.
[[:cntrl:]]        A control code; non-printing character.
[[:digit:]]        A digit.
[[:graph:]]        Any printing character except space.
[[:lower:]]        A lower-case alphabetic character.
[[:print:]]        Any printing character including space.
[[:punct:]]        Punctuation.
[[:space:]]        White space such as a space, a tab, or an end-of-line.
[[:upper:]]        An upper-case alphabetic character.
[[:xdigit:]]       A hexadecimal digit, upper- or lower-case.
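
Not every regular expression engine accepts these class names. Python's re module, for example, does not, so the sketch below rewrites them into explicit ranges before compiling. The helper expand_posix_classes and the ASCII-only ranges are assumptions made for illustration; they are not part of the product, and a locale-aware engine may treat non-ASCII letters and digits differently.

```python
import re

# ASCII-only stand-ins for the predefined classes (an assumption; a
# locale-aware engine may also accept non-ASCII letters and digits).
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "cntrl": r"\x00-\x1f\x7f",
    "digit": "0-9",
    "graph": r"\x21-\x7e",
    "lower": "a-z",
    "print": r"\x20-\x7e",
    "punct": r"!-/:-@\[-`{-~",
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}


def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] token with an equivalent character range.

    Hypothetical helper: it assumes the tokens only appear inside
    bracket expressions, as in [[:digit:]] or [[:alpha:]_].
    """
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(f"[:{name}:]", chars)
    return pattern


year = re.compile(expand_posix_classes(r"[[:digit:]]{4}"))
identifier = re.compile(expand_posix_classes(r"[[:alpha:]_][[:alnum:]_]*"))
hex_color = re.compile(expand_posix_classes(r"#[[:xdigit:]]{6}"))
```

With these definitions, year accepts "2024", identifier accepts "name_1" but not "1name", and hex_color accepts "#1A2b3C".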

Two special character class names match the null space at the start and the end of a word. In other words, they do not match an actual character. A word is considered to be any sequence of alphabetic characters, digits, or underscores (_).

Character class    Meaning

[[:<:]]            Start of a word.
[[:>:]]            End of a word.
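
To show what matching "the null space" at a word edge means, here is a sketch in Python, which has no [[:<:]] or [[:>:]] syntax. The two stand-in patterns below combine \b with a lookaround and are an assumption, not the documented engine's behavior; re.ASCII keeps the definition of a word to ASCII letters, digits, and underscores, as described above.

```python
import re

# Zero-width stand-ins built from \b plus a lookaround (an assumption,
# not the engine's own syntax).
WORD_START = r"\b(?=\w)"   # boundary followed by a word character
WORD_END = r"\b(?<=\w)"    # boundary preceded by a word character

text = "cat concat cat_2"

# Mark every word start and word end; nothing is consumed by the match.
starts = re.sub(WORD_START, "<", text, flags=re.ASCII)
ends = re.sub(WORD_END, ">", text, flags=re.ASCII)

# "cat" as a whole word only: not inside "concat" or "cat_2".
whole_words = re.findall(WORD_START + "cat" + WORD_END, text, flags=re.ASCII)
```

Only the first "cat" is a whole word, because the underscore in "cat_2" counts as a word character.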
