Regular Expressions Primer

About this Primer

The Regular Expressions Primer is a tutorial for those completely new to regular expressions. It starts with the simple building blocks of the syntax and, through examples, builds up to expressions useful for solving real everyday problems, including searching for and replacing text.

A regular expression is often called a “regex”, “rx” or “re”. This primer uses the terms “regular expression” and “regex”.

Unless otherwise stated, the examples in this primer are generic, and will apply to most programming languages and tools. However, each language and tool has its own implementation of regular expressions, so quoting conventions, metacharacters, special sequences, and modifiers may vary (e.g. Perl, Python, grep, sed, and Vi have slight variations on standard regex syntax). Consult the regular expression documentation for your language or application for details.

What are regular expressions?

Regular expressions are a syntactical shorthand for describing patterns. They are used to find text that matches a pattern, and to replace matched strings with other strings. They can be used to parse files and other input, or to provide a powerful way to search and replace. Here’s a short example in Python:

import re
n = re.compile(r'\bw[a-z]*', re.IGNORECASE)
print(n.findall('will match all words beginning with the letter w.'))

Here’s a more advanced regular expression from the Python Tutorial:

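import re

# "cgs" is not defined in this excerpt. The value below is a hypothetical
# stand-in: a list of (open, close) comment delimiter pairs.
cgs = [('#', ''), ('<!--', '-->')]

# Generate statement parsing regexes (raw strings keep the backslashes intact).
stmts = [r'#\s*(?P<op>if|elif|ifdef|ifndef)\s+(?P<expr>.*?)',
       r'#\s*(?P<op>else|endif)',
       r'#\s*(?P<op>error)\s+(?P<error>.*?)',
       r'#\s*(?P<op>define)\s+(?P<var>[^\s]*?)(\s+(?P<val>.+?))?',
       r'#\s*(?P<op>undef)\s+(?P<var>[^\s]*?)']
# Wrap each statement in the escaped comment delimiters.
patterns = [r'^\s*%s\s*%s\s*%s\s*$'
          % (re.escape(cg[0]), stmt, re.escape(cg[1]))
          for cg in cgs for stmt in stmts]
stmtRes = [re.compile(p) for p in patterns]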

Komodo can accept Python syntax regular expressions in its various Search features.

Komodo IDE’s Rx Toolkit can help you build and test regular expressions. See Using Rx Toolkit for more information.

Matching: Searching for a String

Regular expressions can be used to find a particular pattern, or to find a pattern and replace it with something else (substitution). Since the syntax is the same for the “find” part of the regex, we’ll start with matching.

Literal Match

The simplest type of regex is a literal match. Letters, numbers and most symbols in the expression will match themselves in the text being searched; an “a” matches an “a”, “cat” matches “cat”, “123” matches “123” and so on. For example:

Example: Search for the string “at”.

  • Regex:
    at
    
  • Matches:
    at
    
  • Doesn't Match:
    it
    a-t
    At
    

Note: Regular expressions are case sensitive unless a modifier is used.
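
The literal match above can be tried out with a minimal Python sketch. It assumes only the standard "re" module; the variable names are illustrative. "re.search" looks for the pattern anywhere in the text, and "re.IGNORECASE" is one example of the modifier mentioned in the note.

```python
import re

pattern = re.compile(r"at")

# A literal pattern matches only the exact characters, in the same case.
results = {text: bool(pattern.search(text)) for text in ["at", "it", "a-t", "At"]}
print(results)

# With the IGNORECASE flag, "At" matches as well.
ignore_case = bool(re.search(r"at", "At", re.IGNORECASE))
print(ignore_case)
```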

Wildcards

Regex characters that perform a special function instead of matching themselves literally are called "metacharacters". One such metacharacter is the dot ".", or wildcard. When used in a regular expression, "." matches any single character (in most implementations, any character except a newline unless a modifier is used).

Using "." to match any character.

Example: Using '.' to find certain types of words.

  • Regex:
    t...s
    
  • Matches:
    trees
    trams
    teens
    
  • Doesn't Match:
    trucks
    trains
    beans
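
As a rough illustration in Python, the word list above can be checked against "t...s". This sketch uses "re.fullmatch" (an assumption made here so that the whole word must fit the pattern, as in the lists above); the names are illustrative.

```python
import re

words = ["trees", "trams", "teens", "trucks", "trains", "beans"]

# fullmatch requires the whole word to fit the pattern:
# a "t", any three characters, then an "s".
matched = [w for w in words if re.fullmatch(r"t...s", w)]
print(matched)  # ['trees', 'trams', 'teens']
```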
    

Escaping Metacharacters

Many non-alphanumeric characters, like the "." mentioned above, are treated as special characters with specific functions in regular expressions. These special characters are called metacharacters. To search for a literal occurrence of a metacharacter (i.e. ignoring its special regex attribute), precede it with a backslash "\". For example:

  • Regex:
    c:\\readme\.txt
    
  • Matches:
    c:\readme.txt
    
  • Doesn't Match:
    c:\\readme.txt
    c:\readme_txt
    

Precede the following metacharacters with a backslash "\" to search for them as literal characters:

^ $ + * ? . | ( ) { } [ ] \

These metacharacters take on a special function (covered below) unless they are escaped. Conversely, some characters take on special functions (i.e. become metacharacters) when they are preceded by a backslash (e.g. "\d" for "any digit" or "\n" for "newline"). These special sequences vary from language to language; consult your language documentation for a comprehensive list.
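
In Python, escaping can be done by hand or with the standard "re.escape" function, which adds the backslashes for you. The following is a small sketch using the path from the example above; the variable names are illustrative.

```python
import re

path = r"c:\readme.txt"

# Hand-escaped pattern from the example above.
manual = re.compile(r"c:\\readme\.txt")

# re.escape() builds an equivalent pattern from the literal text.
automatic = re.compile(re.escape(path))

print(bool(manual.fullmatch(path)))
print(bool(automatic.fullmatch(path)))

# The escaped dot no longer acts as a wildcard, so "_" is rejected.
print(bool(manual.search(r"c:\readme_txt")))

# "\d" is a special sequence: backslash plus "d" means "any digit".
digits = re.findall(r"\d", "a1b22")
print(digits)
```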

Quantifiers

Quantifiers specify how many instances of the preceding element (which can be a character or a group) must appear in order to match.

Question mark

The "?" matches 0 or 1 instances of the previous element. In other words, it makes the element optional; it can be present, but it doesn't have to be. For example:

  • Regex:
    colou?r
    
  • Matches:
    colour
    color
    
  • Doesn't Match:
    colouur
    colur
    

Asterisk

The "*" matches 0 or more instances of the previous element. For example:

  • Regex:
    www\.my.*\.com
    
  • Matches:
    www.my.com
    www.mypage.com
    www.mysite.com then text with spaces ftp.example.com
    
  • Doesn't Match:
    www.oursite.com
    mypage.com
    

As the third match illustrates, using ".*" can be dangerous. It will match any number of any character (including spaces and non-alphanumeric characters). The quantifier is "greedy" and will match as much text as possible. To make a quantifier "non-greedy" (matching as few characters as possible), add a "?" after the "*". Applied to the example above, the expression "www\.my.*?\.com" would match just "www.mysite.com", not the longer string.
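
The difference between the greedy and non-greedy forms can be seen in a short Python sketch. It assumes the standard "re" module and reuses the third line from the list above.

```python
import re

text = "www.mysite.com then text with spaces ftp.example.com"

# Greedy: ".*" runs on to the last ".com" in the text.
greedy = re.search(r"www\.my.*\.com", text).group()

# Non-greedy: ".*?" stops at the first ".com" it can reach.
lazy = re.search(r"www\.my.*?\.com", text).group()

print(greedy)
print(lazy)  # www.mysite.com
```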

Plus

The "+" matches 1 or more instances of the previous element. Like "*", it is greedy and will match as much as possible unless it is followed by a "?".

  • Regex:
    bob5+@foo\.com
    
  • Matches:
    bob5@foo.com
    bob5555@foo.com
    
  • Doesn't Match:
    bob@foo.com
    bob65555@foo.com
    

Number: ‘{N}’

To match an element a specific number of times, add that number enclosed in curly braces after the element. For example:

  • Regex:
    w{3}\.mydomain\.com
    
  • Matches:
    www.mydomain.com
    
  • Doesn't Match:
    web.mydomain.com
    w3.mydomain.com
    

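Ranges: ‘{min,max}’

To specify the minimum number of matches to find and the maximum number of matches to allow, use a number range inside curly braces. Do not put a space after the comma: some implementations, Python among them, treat "{3, 5}" as literal text rather than as a quantifier. For example: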

  • Regex:
    60{3,5} years
    
  • Matches:
    6000 years
    60000 years
    600000 years
    
  • Doesn't Match:
    60 years
    6000000 years
    

Quantifier Summary

Quantifier   Description
?            Matches the preceding element 0 or 1 times.
*            Matches the preceding element 0 or more times.
+            Matches the preceding element 1 or more times.
{num}        Matches the preceding element exactly num times.
{min,max}    Matches the preceding element at least min times, but not more than max times.
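
The quantifier examples from this section can be checked together in one Python sketch. The helper "matches" is hypothetical (it is not part of any library); it uses "re.fullmatch" so that the whole candidate must fit the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# "?": the "u" is optional.
optional = matches(r"colou?r", ["colour", "color", "colouur", "colur"])

# "+": at least one "5" is required.
one_or_more = matches(r"bob5+@foo\.com",
                      ["bob5@foo.com", "bob5555@foo.com", "bob@foo.com"])

# "{3}": exactly three "w" characters.
exact = matches(r"w{3}\.mydomain\.com",
                ["www.mydomain.com", "web.mydomain.com"])

# "{3,5}": between three and five zeros after the "6".
ranged = matches(r"60{3,5} years",
                 ["60 years", "6000 years", "600000 years", "6000000 years"])

print(optional, one_or_more, exact, ranged)
```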

Alternation

The vertical bar "|" is used to represent an "OR" condition. Use it to separate alternate patterns or characters for matching. For example:

  • Regex:
    perl|python
    
  • Matches:
    perl
    python
    
  • Doesn't Match:
    ruby
    

Grouping with Parentheses

Parentheses “()” are used to group characters and expressions within larger, more complex regular expressions. Quantifiers that immediately follow the group apply to the whole group. For example:

  • Regex:
    (abc){2,3}
    
  • Matches:
    abcabc
    abcabcabc
    
  • Doesn't Match:
    abc
    abccc
    

Groups can be used in conjunction with alternation. For example:

  • Regex:
    gr(a|e)y
    
  • Matches:
    gray
    grey
    
  • Doesn't Match:
    graey
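
Alternation and grouping from the examples above can be sketched in Python as follows. The variable names are illustrative, and "fullmatch" is used so that the whole word must fit the pattern.

```python
import re

# Without a group, "|" splits the whole pattern into alternatives.
language = re.compile(r"perl|python")

# Inside a group, the alternation is limited to the group.
colour = re.compile(r"gr(a|e)y")

langs = [w for w in ["perl", "python", "ruby"] if language.fullmatch(w)]
greys = [w for w in ["gray", "grey", "graey"] if colour.fullmatch(w)]

# A quantifier after a group applies to the whole group.
repeats = [s for s in ["abc", "abcabc", "abcabcabc", "abccc"]
           if re.fullmatch(r"(abc){2,3}", s)]

print(langs, greys, repeats)
```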
    

Strings that match these groups are stored, or "captured", for use in substitutions or later in the same expression. The first group can be referred to as "\1", the second as "\2" and so on (these references are called backreferences). For example:

  • Regex:
    (.{2,5}) (.{2,8}) <\1_\2@example\.com>
    
  • Matches:
    Joe Smith <Joe_Smith@example.com>
    jane doe <jane_doe@example.com>
    459 33154 <459_33154@example.com>
    
  • Doesn't Match:
    john doe <doe_john@example.com>
    Jane Doe <janie88@example.com>
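
In Python, the saved groups are also available on the match object. The sketch below reuses the expression above; "group(1)" and "group(2)" are standard methods of the "re" module's match objects, and the other names are illustrative.

```python
import re

pattern = re.compile(r"(.{2,5}) (.{2,8}) <\1_\2@example\.com>")

# "\1" and "\2" must repeat exactly what the two groups matched.
match = pattern.search("Joe Smith <Joe_Smith@example.com>")
first, last = match.group(1), match.group(2)
print(first, last)  # Joe Smith

# The names appear in the wrong order in the address, so there is no match.
no_match = pattern.search("john doe <doe_john@example.com>")
print(no_match)  # None
```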
    

Character Classes

Character classes indicate a set of characters to match. Enclosing a set of characters in square brackets "[...]" means "match any one of these characters". For example:

  • Regex:
    [cbe]at
    
  • Matches:
    cat
    bat
    eat
    
  • Doesn't Match:
    sat
    beat (as a whole word; an unanchored search would still find "eat" inside it)
    

Since a character class on its own only applies to one character in the match, combine it with a quantifier to search for multiple instances of the class. For example:

  • Regex:
    [0123456789]{3}
    
  • Matches:
    123
    999
    376
    
  • Doesn't Match:
    W3C
    2_4
    

If we were to try the same thing with letters, we would have to enter all 26 letters in upper and lower case. Fortunately, we can specify a range instead using a hyphen. For example:

  • Regex:
    [a-zA-Z]{4}
    
  • Matches:
    Perl
    ruby
    SETL
    
  • Doesn't Match:
    1234
    AT&T
    

Most languages have special patterns for representing the most commonly used character classes. For example, Python uses "\d" to represent any digit (same as "[0-9]") and "\w" to represent any alphanumeric, or "word" character (same as "[a-zA-Z0-9_]"). See your language documentation for the special sequences applicable to the language you use.
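
As a quick check in Python, the shorthand sequences can be compared with explicit character classes. The sample text is made up for illustration; note that for plain ASCII text such as this, Python's "\w" behaves like "[a-zA-Z0-9_]".

```python
import re

text = "Room 101, build_7 at W3C"

# "\d" and "[0-9]" find the same runs of digits here.
digits_class = re.findall(r"[0-9]+", text)
digits_short = re.findall(r"\d+", text)

# "\w" covers letters, digits and the underscore.
words_short = re.findall(r"\w+", text)
words_class = re.findall(r"[a-zA-Z0-9_]+", text)

print(digits_short)
print(words_short)
```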

Negated Character Classes

To define a group of characters you do not want to match, use a negated character class. Adding a caret "^" to the beginning of the character class (i.e. [^...]) means "match any character except these". For example:

  • Regex:
    [^a-zA-Z]{4}
    
  • Matches:
    1234
    $.25
    #77;
    
  • Doesn't Match:
    Perl
    AT&T
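
The negated class above can be tried in Python with the sketch below. It also shows, as an aside, that "^" only negates when it is the first character inside square brackets; the names are illustrative.

```python
import re

non_letters = re.compile(r"[^a-zA-Z]{4}")
candidates = ["1234", "$.25", "#77;", "Perl", "AT&T"]

# Keep the candidates that contain four non-letters in a row.
matched = [c for c in candidates if non_letters.search(c)]
print(matched)  # ['1234', '$.25', '#77;']

# Outside square brackets, "^" is an anchor, not a negation.
anchored = bool(re.search(r"^a", "banana"))
negated = re.findall(r"[^a]", "banana")
print(anchored, negated)
```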
    

Anchors: Matching at Specific Locations

Anchors are used to specify where in a string or line to look for a match. The “^” metacharacter (when not used at the beginning of a negated character class) specifies the beginning of the string or line:

  • Regex:
    ^From: root@server.*
    
  • Matches:
    From: root@server.example.com
    
  • Doesn't Match:
    I got this From: root@server.example.com yesterday
    >> From: root@server.example.com
    

The "$" metacharacter specifies the end of a string or line:

  • Regex:
    .*\/index\.php$
    
  • Matches:
    www.example.org/index.php
    the file is /tmp/index.php
    
  • Doesn't Match:
    www.example.org/index.php?id=245
    www.example.org/index.php4
    

Sometimes it's useful to anchor both the beginning and end of a regular expression. This not only makes the expression more specific, it often improves the performance of the search.

  • Regex:
    ^To: .*example\.org$
    
  • Matches:
    To: feedback@example.org
    To: hr@example.net, qa@example.org
    
  • Doesn't Match:
    To: qa@example.org, hr@example.net
    Send a Message To: example.org
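
The anchors can be explored in Python with the sketch below. By default Python anchors "^" and "$" to the whole string; the "re.MULTILINE" flag makes them apply to each line. The sample "log" text is made up for illustration, and the dot in "example\.org" is escaped so that it matches only a literal dot.

```python
import re

pattern = re.compile(r"^To: .*example\.org$")

# The address must come last for "$" to be satisfied.
whole = bool(pattern.search("To: hr@example.net, qa@example.org"))
reversed_order = bool(pattern.search("To: qa@example.org, hr@example.net"))
print(whole, reversed_order)

# Without MULTILINE, "^" only matches at the very start of the string.
log = "Subject: hi\nFrom: root@server.example.com\n"
default_hit = re.search(r"^From: root@server.*", log)
multiline_hit = re.search(r"^From: root@server.*", log, re.MULTILINE)
print(default_hit)
print(multiline_hit.group())
```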
    

Substitution: Searching and Replacing

Regular expressions can be used as a "search and replace" tool. This aspect of regex use is known as substitution.

There are many variations in substitution syntax depending on the language used. This primer uses the "s/search/replacement/modifiers" convention used in Perl. In simple substitutions, the "search" text will be a regex like the ones we've examined above, and the "replacement" value will be a string.

For example, to search for an old domain name and replace it with the new domain name:

  • Regex Substitution:
    s/http:\/\/www\.old-domain\.com/http://www.new-domain.com/
    
  • Search for:
    http://www.old-domain.com
    
  • Replace with:
    http://www.new-domain.com
    

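Notice that the "/" and "." characters are not escaped in the replacement string. The replacement is a plain string, not a regex, so "." has no special meaning there. In fact, where the replacement is supplied separately from the search pattern (in Python, for instance), preceding these characters with backslashes would make the backslashes appear in the substitution literally (i.e. "http:\/\/www\.new-domain\.com"). Where "/" is also the delimiter of the substitution command itself (e.g. in Perl or sed), a literal "/" must still be escaped, or a different delimiter chosen.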
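
A rough Python equivalent of this substitution uses "re.sub", where the pattern and the replacement are passed as separate arguments. The sample text is made up for illustration.

```python
import re

text = "Visit http://www.old-domain.com today"
search = r"http://www\.old-domain\.com"

# The replacement is a plain string: "/" and "." need no escaping.
result = re.sub(search, "http://www.new-domain.com", text)
print(result)

# Backslashes before "/" and "." are carried into the output as-is.
escaped = re.sub(search, r"http:\/\/www\.new-domain\.com", text)
print(escaped)
```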

In a replacement string, the main use of the backslash "\" is to insert saved matches using "\num". For example:

  • Substitution Regex:
    s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/
    
  • Target Text:
    http://old-domain.com
    
  • Result:
    http://new-domain.com
    

This regex will actually match a number of URLs other than "http://old-domain.com". If we had a list of URLs with various permutations, we could replace all of them with related versions of the new domain name (e.g. "ftp://old-domain.net" would become "ftp://new-domain.net"). To do this we need to use a modifier.

Modifiers

Modifiers alter the behavior of the regular expression. The previous substitution example replaces only the first occurrence of the search string; once it finds a match, it performs the substitution and stops. To modify this regex in order to replace all matches in the string, we need to add the "g" modifier.

  • Substitution Regex:
    s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/g
    
  • Target Text:
    http://old-domain.com and ftp://old-domain.net
    
  • Result:
    http://new-domain.com and ftp://new-domain.net
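
Python has no "g" modifier; as a sketch of the same idea, "re.sub" replaces every match by default, and its "count" argument limits the number of replacements. The variable names are illustrative.

```python
import re

pattern = re.compile(r"(ftp|http)://old-domain\.(com|net|org)")
text = "http://old-domain.com and ftp://old-domain.net"
replacement = r"\1://new-domain.\2"

# count=1 stops after the first match, like a substitution without "g".
first_only = pattern.sub(replacement, text, count=1)

# With no count, every match is replaced, like the "g" modifier.
replace_all = pattern.sub(replacement, text)

print(first_only)
print(replace_all)
```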
    

The "i" modifier causes the match to ignore the case of alphabetic characters. For example:

  • Regex:
    /ActiveState\.com/i
    
  • Matches:
    activestate.com
    ActiveState.com
    ACTIVESTATE.COM
    

Modifier Summary

Modifier   Meaning
i          Ignore case when matching exact strings.
m          Treat string as multiple lines. Allow "^" and "$" to match next to newline characters.
s          Treat string as single line. Allow "." to match a newline character.
x          Ignore whitespace and newline characters in the regular expression. Allow comments.
o          Compile regular expression once only.
g          Match all instances of the pattern in the target string.
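
In Python, the first four modifiers correspond to flags in the standard "re" module, as the sketch below illustrates. Python has no flags for "o" and "g": patterns can be compiled once with "re.compile", and functions such as "re.findall" and "re.sub" already work on all matches. The sample strings are made up for illustration.

```python
import re

# "i": ignore case.
ignore_case = bool(re.search(r"activestate\.com", "ACTIVESTATE.COM", re.IGNORECASE))

# "m": "^" and "$" also match at line boundaries.
line_starts = re.findall(r"^\w+", "one two\nthree four", re.MULTILINE)

# "s": "." also matches a newline.
dotall = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# "x": whitespace and comments inside the pattern are ignored.
verbose = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.VERBOSE)
phone = verbose.search("call 555-1234").group()

print(ignore_case, line_starts, dotall, phone)
```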

Python Regex Syntax

Komodo's Search features (including "Find...", "Replace..." and "Find in Files...") can accept plain text, glob style matching (called "wildcards" in the drop list, but using "." and "?" differently than regex wildcards), and Python regular expressions. A complete guide to regexes in Python can be found in the Python documentation. The Regular Expression HOWTO by A.M. Kuchling is a good introduction to regular expressions in Python.
