Regular Expressions Primer
About this Primer
The Regular Expressions Primer is a tutorial for those completely new to regular expressions. It starts with the simple building blocks of the syntax and, through examples, builds up to expressions that solve real, everyday problems such as searching for and replacing text.
A regular expression is often called a “regex”, “rx” or “re”. This primer uses the terms “regular expression” and “regex”.
Unless otherwise stated, the examples in this primer are generic, and will apply to most programming languages and tools. However, each language and tool has its own implementation of regular expressions, so quoting conventions, metacharacters, special sequences, and modifiers may vary (e.g. Perl, Python, grep, sed, and Vi have slight variations on standard regex syntax). Consult the regular expression documentation for your language or application for details.
What are regular expressions?
Regular expressions are a syntactical shorthand for describing patterns. They are used to find text that matches a pattern, and to replace matched strings with other strings. They can be used to parse files and other input, or to provide a powerful way to search and replace. Here’s a short example in Python:
import re
n = re.compile(r'\bw[a-z]*', re.IGNORECASE)
print(n.findall('will match all words beginning with the letter w.'))
Here’s a more advanced regular expression from the Python Tutorial:
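import re

# Generate statement parsing regexes.
# Assumption: cgs is not defined in the original excerpt; here it is taken to be
# a list of (open, close) comment delimiter pairs, chosen only for illustration.
cgs = [('<!--', '-->'), ('/*', '*/')]
stmts = [r'#\s*(?P<op>if|elif|ifdef|ifndef)\s+(?P<expr>.*?)',
         r'#\s*(?P<op>else|endif)',
         r'#\s*(?P<op>error)\s+(?P<error>.*?)',
         r'#\s*(?P<op>define)\s+(?P<var>[^\s]*?)(\s+(?P<val>.+?))?',
         r'#\s*(?P<op>undef)\s+(?P<var>[^\s]*?)']
patterns = [r'^\s*%s\s*%s\s*%s\s*$'
            % (re.escape(cg[0]), stmt, re.escape(cg[1]))
            for cg in cgs for stmt in stmts]
stmtRes = [re.compile(p) for p in patterns]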
Komodo can accept Python syntax regular expressions in its various Search features.
Komodo IDE’s Rx Toolkit can help you build and test regular expressions. See Using Rx Toolkit for more information.
Matching: Searching for a String
Regular expressions can be used to find a particular pattern, or to find a pattern and replace it with something else (substitution). Since the syntax is the same for the “find” part of the regex, we’ll start with matching.
Literal Match
The simplest type of regex is a literal match. Letters, numbers and most symbols in the expression will match themselves in the text being searched; an “a” matches an “a”, “cat” matches “cat”, “123” matches “123” and so on. For example:
Example: Search for the string “at”.
-
Regex:
at
-
Matches:
at
-
Doesn't Match:
it a-t At
Note: Regular expressions are case sensitive unless a modifier is used.
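As a minimal sketch, assuming Python's re module, the literal match above can be checked as follows. The sample strings come from the example; the variable names are illustrative only.

```python
import re

# A literal pattern: each character matches itself.
pattern = re.compile(r"at")

samples = ["at", "it", "a-t", "At"]
results = {s: bool(pattern.search(s)) for s in samples}
print(results)

# Matching is case sensitive unless a modifier (a "flag" in Python) is used.
case_insensitive = bool(re.search(r"at", "At", re.IGNORECASE))
print(case_insensitive)
```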
Wildcards
Regex characters that perform a special function instead of matching themselves literally are called "metacharacters". One such metacharacter is the dot ".", or wildcard. When used in a regular expression, "." can match any single character.
Example: Using '.' to find certain types of words.
-
Regex:
t...s
-
Matches:
trees trams teens
-
Doesn't Match:
trucks trains beans
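The wildcard example could be tried in Python roughly like this. It is a sketch that assumes each word is tested on its own with fullmatch(), so the pattern has to cover the entire word.

```python
import re

# "." matches any single character, so "t...s" means:
# "t", then any three characters, then "s".
pattern = re.compile(r"t...s")

words = ["trees", "trams", "teens", "trucks", "trains", "beans"]
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)
```

Note that in Python "." does not match a newline character unless the re.DOTALL flag is set.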
Escaping Metacharacters
Many non-alphanumeric characters, like the "." mentioned above, are treated as special characters with specific functions in regular expressions. These special characters are called metacharacters. To search for a literal occurrence of a metacharacter (i.e. ignoring its special regex attribute), precede it with a backslash "\". For example:
-
Regex:
c:\\readme\.txt
-
Matches:
c:\readme.txt
-
Doesn't Match:
c:\\readme.txt c:\readme_txt
Precede the following metacharacters with a backslash "\" to search for them as literal characters:
^ $ + * ? . | ( ) { } [ ] \
These metacharacters take on a special function (covered below) unless they are escaped. Conversely, some characters take on special functions (i.e. become metacharacters) when they are preceded by a backslash (e.g. "\d" for "any digit" or "\n" for "newline"). These special sequences vary from language to language; consult your language documentation for a comprehensive list.
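A short sketch of escaping in Python, assuming the re module and raw strings (r"...") so that backslashes reach the regex engine unchanged. The re.escape() helper is one way to escape an arbitrary literal string without doing it by hand.

```python
import re

# Escape "\" and "." by hand so that both are matched literally.
pattern = re.compile(r"c:\\readme\.txt")

single = bool(pattern.fullmatch(r"c:\readme.txt"))       # one backslash, literal dot
double = bool(pattern.fullmatch(r"c:\\readme.txt"))      # two backslashes
underscore = bool(pattern.fullmatch(r"c:\readme_txt"))   # "_" instead of "."
print(single, double, underscore)

# re.escape() produces an escaped pattern from a literal string.
auto_pattern = re.compile(re.escape(r"c:\readme.txt"))
auto = bool(auto_pattern.fullmatch(r"c:\readme.txt"))
print(auto)
```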
Quantifiers
Quantifiers specify how many instances of the preceding element (which can be a character or a group) must appear in order to match.
Question mark
The "?" matches 0 or 1 instances of the previous element. In other words, it makes the element optional; it can be present, but it doesn't have to be. For example:
-
Regex:
colou?r
-
Matches:
colour color
-
Doesn't Match:
colouur colur
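In Python, the optional "u" could be checked with a sketch like the following, which tests each word as a whole with fullmatch().

```python
import re

# "u?" makes the "u" optional: zero or one occurrence.
pattern = re.compile(r"colou?r")

words = ["colour", "color", "colouur", "colur"]
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)
```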
Asterisk
The "*" matches 0 or more instances of the previous element. For example:
-
Regex:
www\.my.*\.com
-
Matches:
www.my.com www.mypage.com www.mysite.com then text with spaces ftp.example.com
-
Doesn't Match:
www.oursite.com mypage.com
As the third match illustrates, using ".*" can be dangerous. It will match any number of any character (including spaces and non-alphanumeric characters). The quantifier is "greedy" and will match as much text as possible. To make a quantifier "non-greedy" (matching as few characters as possible), add a "?" after the "*". Applied to the example above, the expression "www\.my.*?\.com" would match just "www.mysite.com", not the longer string.
Plus
The "+" matches 1 or more instances of the previous element. Like "*", it is greedy and will match as much as possible unless it is followed by a "?".
-
Regex:
bob5+@foo\.com
-
Matches:
bob5@foo.com bob5555@foo.com
-
Doesn't Match:
bob@foo.com bob65555@foo.com
Number: ‘{N}’
To match a character a specific number of times, add that number enclosed in curly braces after the element. For example:
-
Regex:
w{3}\.mydomain\.com
-
Matches:
www.mydomain.com
-
Doesn't Match:
web.mydomain.com w3.mydomain.com
Ranges: ‘{min,max}’
To specify the minimum number of matches to find and the maximum number of matches to allow, use a number range inside curly braces. For example:
-
Regex:
60{3,5} years
-
Matches:
6000 years 60000 years 600000 years
-
Doesn't Match:
60 years 6000000 years
Quantifier Summary
Quantifier | Description |
? | Matches the preceding element 0 or 1 times. |
* | Matches the preceding element 0 or more times. |
+ | Matches the preceding element 1 or more times. |
{num} | Matches the preceding element num times. |
{min,max} | Matches the preceding element at least min times, but not more than max times. |
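The counted quantifiers can be tried in Python with a sketch like this one. It assumes whole-string tests with fullmatch(); the last check illustrates that Python's re module expects no space inside the braces and otherwise treats them as literal text.

```python
import re

# Exactly three "w" characters.
exact = re.compile(r"w{3}\.mydomain\.com")
hosts = ["www.mydomain.com", "web.mydomain.com", "w3.mydomain.com"]
exact_hits = [h for h in hosts if exact.fullmatch(h)]
print(exact_hits)

# A "6" followed by three to five "0" characters.
ranged = re.compile(r"60{3,5} years")
ages = ["60 years", "6000 years", "60000 years", "600000 years", "6000000 years"]
ranged_hits = [a for a in ages if ranged.fullmatch(a)]
print(ranged_hits)

# With a space after the comma, Python takes "{3, 5}" literally,
# so this is None rather than a match.
spaced_hit = re.fullmatch(r"60{3, 5} years", "6000 years")
print(spaced_hit)
```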
Alternation
The vertical bar "|" is used to represent an "OR" condition. Use it to separate alternate patterns or characters for matching. For example:
-
Regex:
perl|python
-
Matches:
perl python
-
Doesn't Match:
ruby
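A small sketch of alternation in Python, assuming the re module. The sentence passed to findall() is made up for illustration.

```python
import re

# "|" means "either the pattern on the left or the one on the right".
pattern = re.compile(r"perl|python")

languages = ["perl", "python", "ruby"]
whole_word_hits = [name for name in languages if pattern.fullmatch(name)]
print(whole_word_hits)

# findall() returns every non-overlapping match in a longer text.
found = pattern.findall("I use perl, python and ruby")
print(found)
```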
Grouping with Parentheses
Parentheses “()” are used to group characters and expressions within larger, more complex regular expressions. Quantifiers that immediately follow the group apply to the whole group. For example:
-
Regex:
(abc){2,3}
-
Matches:
abcabc abcabcabc
-
Doesn't Match:
abc abccc
Groups can be used in conjunction with alternation. For example:
-
Regex:
gr(a|e)y
-
Matches:
gray grey
-
Doesn't Match:
graey
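Both grouping examples can be checked in Python with a sketch along these lines, again testing each sample as a whole string.

```python
import re

# The quantifier applies to the whole group "abc", not just to "c".
repeated = re.compile(r"(abc){2,3}")
repeat_hits = [s for s in ["abcabc", "abcabcabc", "abc", "abccc"]
               if repeated.fullmatch(s)]
print(repeat_hits)

# A group limits the scope of the alternation to "a" or "e".
either = re.compile(r"gr(a|e)y")
either_hits = [s for s in ["gray", "grey", "graey"] if either.fullmatch(s)]
print(either_hits)
```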
Strings that match these groups are stored ("captured") for use in substitutions or later in the same expression. The first group is available as the backreference "\1", the second as "\2" and so on. For example:
-
Regex:
(.{2,5}) (.{2,8}) <\1_\2@example\.com>
-
Matches:
Joe Smith <Joe_Smith@example.com> jane doe <jane_doe@example.com> 459 33154 <459_33154@example.com>
-
Doesn't Match:
john doe <doe_john@example.com> Jane Doe <janie88@example.com>
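As a sketch, the backreference example could be run in Python as follows. The helper function capture() is hypothetical and exists only to keep the example short; it returns the captured groups, or None when the line does not match.

```python
import re

# \1 and \2 must repeat exactly what the first and second groups captured.
pattern = re.compile(r"(.{2,5}) (.{2,8}) <\1_\2@example\.com>")

def capture(line):
    """Return the captured groups for a matching line, otherwise None."""
    m = pattern.fullmatch(line)
    return m.groups() if m else None

lines = [
    "Joe Smith <Joe_Smith@example.com>",
    "jane doe <jane_doe@example.com>",
    "john doe <doe_john@example.com>",
    "Jane Doe <janie88@example.com>",
]
captured = [capture(line) for line in lines]
print(captured)
```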
Character Classes
Character classes indicate a set of characters to match. Enclosing a set of characters in square brackets "[...]" means "match any one of these characters". For example:
-
Regex:
[cbe]at
-
Matches:
cat bat eat
-
Doesn't Match:
sat beat
Since a character class on its own only applies to one character in the match, combine it with a quantifier to search for multiple instances of the class. For example:
-
Regex:
[0123456789]{3}
-
Matches:
123 999 376
-
Doesn't Match:
W3C 2_4
If we were to try the same thing with letters, we would have to enter all 26 letters in upper and lower case. Fortunately, we can specify a range instead using a hyphen. For example:
-
Regex:
[a-zA-Z]{4}
-
Matches:
Perl ruby SETL
-
Doesn't Match:
1234 AT&T
Most languages have special patterns for representing the most commonly used character classes. For example, Python uses "\d" to represent any digit (same as "[0-9]" for ASCII text) and "\w" to represent any alphanumeric, or "word" character (same as "[a-zA-Z0-9_]" for ASCII text). See your language documentation for the special sequences applicable to the language you use.
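The character class examples can be sketched in Python like this, assuming whole-string tests with fullmatch(). The last check shows that "\w" accepts digits and the underscore as well as letters.

```python
import re

def hits(regex, samples):
    """Return the samples that the regex matches in full."""
    return [s for s in samples if re.fullmatch(regex, s)]

class_hits = hits(r"[cbe]at", ["cat", "bat", "eat", "sat", "beat"])
digit_hits = hits(r"[0-9]{3}", ["123", "999", "376", "W3C", "2_4"])
shorthand_hits = hits(r"\d{3}", ["123", "999", "376", "W3C", "2_4"])
letter_hits = hits(r"[a-zA-Z]{4}", ["Perl", "ruby", "SETL", "1234", "AT&T"])
word_hits = hits(r"\w{3}", ["W3C", "2_4", "a-b"])

print(class_hits, digit_hits, shorthand_hits, letter_hits, word_hits)
```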
Negated Character Classes
To define a set of characters you do not want to match, use a negated character class. Adding a caret "^" to the beginning of the character class (i.e. [^...]) means "match any character except these". For example:
-
Regex:
[^a-zA-Z]{4}
-
Matches:
1234 $.25 #77;
-
Doesn't Match:
Perl AT&T
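A minimal Python sketch of the negated class, using the sample strings from the example and testing each one as a whole:

```python
import re

# "[^a-zA-Z]" matches any single character that is not an ASCII letter.
pattern = re.compile(r"[^a-zA-Z]{4}")

samples = ["1234", "$.25", "#77;", "Perl", "AT&T"]
matched = [s for s in samples if pattern.fullmatch(s)]
print(matched)
```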
Anchors: Matching at Specific Locations
Anchors are used to specify where in a string or line to look for a match. The “^” metacharacter (when not used at the beginning of a negated character class) specifies the beginning of the string or line:
-
Regex:
^From: root@server.*
-
Matches:
From: root@server.example.com
-
Doesn't Match:
I got this From: root@server.example.com yesterday >> From: root@server.example.com
The "$" metacharacter specifies the end of a string or line:
-
Regex:
.*\/index\.php$
-
Matches:
www.example.org/index.php the file is /tmp/index.php
-
Doesn't Match:
www.example.org/index.php?id=245 www.example.org/index.php4
Sometimes it's useful to anchor both the beginning and end of a regular expression. This not only makes the expression more specific, it often improves the performance of the search.
-
Regex:
^To: .*example\.org$
-
Matches:
To: feedback@example.org To: hr@example.net, qa@example.org
-
Doesn't Match:
To: qa@example.org, hr@example.net Send a Message To: example.org
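The anchors can be sketched in Python as follows. The "." before "php" is escaped here so that it matches only a literal dot, and the two-line log string is invented for illustration. By default "^" and "$" refer to the start and end of the whole string; the re.MULTILINE flag makes them apply to each line.

```python
import re

start = re.compile(r"^From: root@server.*")
end = re.compile(r".*/index\.php$")

start_ok = bool(start.search("From: root@server.example.com"))
start_quoted = bool(start.search(">> From: root@server.example.com"))
end_ok = bool(end.search("the file is /tmp/index.php"))
end_query = bool(end.search("www.example.org/index.php?id=245"))
print(start_ok, start_quoted, end_ok, end_query)

# Without re.MULTILINE, "^" only matches at the very start of the string.
log = "Subject: hi\nFrom: root@server.example.com\n"
single_line_hits = re.findall(r"^From: root@server.*$", log)
multiline_hits = re.findall(r"^From: root@server.*$", log, re.MULTILINE)
print(single_line_hits, multiline_hits)
```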
Substitution: Searching and Replacing
Regular expressions can be used as a "search and replace" tool. This aspect of regex use is known as substitution.
There are many variations in substitution syntax depending on the language used. This primer uses the "s/search/replacement/modifier" convention used in Perl. In simple substitutions, the "search" text will be a regex like the ones we've examined above, and the "replacement" value will be a string.
For example, to search for an old domain name and replace it with the new domain name:
-
Regex Substitution:
s/http:\/\/www\.old-domain\.com/http://www.new-domain.com/
-
Search for:
http://www.old-domain.com
-
Replace with:
http://www.new-domain.com
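Notice that the "." characters are not escaped in the replacement string. The replacement is a plain string, not a pattern, so metacharacters have no special meaning there. The "/" characters are also shown unescaped for readability, which works in tools that take the search and replacement text as separate fields; in actual Perl code, where "/" is the delimiter, escape them or choose a different delimiter (e.g. "s{...}{...}"). In some tools (Python's re.sub, for example), backslashes placed before these characters would appear in the result literally (i.e. "http:\/\/www\.new-domain\.com").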
The one way you can use the backslash "\" is to put saved matches in the substitution using "\num". For example:
-
Substitution Regex:
s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/
-
Target Text:
http://old-domain.com
-
Result:
http://new-domain.com
This regex will actually match a number of URLs other than "http://old-domain.com". If we had a list of URLs with various permutations, we could replace all of them with related versions of the new domain name (e.g. "ftp://old-domain.net" would become "ftp://new-domain.net"). To do this we need to use a modifier.
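In Python, the two substitutions above could be written with re.sub() roughly as follows. This is a sketch, not a translation of Perl's behavior in every detail: re.sub() replaces every match by default, so count=1 is passed here to replace only the first one. The sample sentences are invented for illustration.

```python
import re

# Plain substitution: the replacement is an ordinary string.
simple = re.sub(r"http://www\.old-domain\.com",
                "http://www.new-domain.com",
                "See http://www.old-domain.com for details",
                count=1)
print(simple)

# Backreferences: \1 and \2 reuse the captured scheme and top-level domain.
pattern = re.compile(r"(ftp|http)://old-domain\.(com|net|org)")
text = "http://old-domain.com and ftp://old-domain.net"
first_only = pattern.sub(r"\1://new-domain.\2", text, count=1)
print(first_only)
```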
Modifiers
Modifiers alter the behavior of the regular expression. The previous substitution example replaces only the first occurrence of the search string; once it finds a match, it performs the substitution and stops. To modify this regex in order to replace all matches in the string, we need to add the "g" modifier.
-
Substitution Regex:
s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/g
-
Target Text:
http://old-domain.com and ftp://old-domain.net
-
Result:
http://new-domain.com and ftp://new-domain.net
The "i" modifier causes the match to ignore the case of alphabetic characters. For example:
-
Regex:
/ActiveState\.com/i
-
Matches:
activestate.com ActiveState.com ACTIVESTATE.COM
Modifier Summary
Modifier | Meaning |
i | Ignore case when matching exact strings. |
m | Treat string as multiple lines. Allow "^" and "$" to match next to newline characters. |
s | Treat string as single line. Allow "." to match a newline character. |
x | Ignore whitespace and newline characters in the regular expression. Allow comments. |
o | Compile regular expression once only. |
g | Match all instances of the pattern in the target string. |
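Python expresses most of these modifiers as flags passed to the re functions: "i" corresponds to re.IGNORECASE, "m" to re.MULTILINE, "s" to re.DOTALL and "x" to re.VERBOSE. There is no flag for "g" or "o"; re.sub() and re.findall() already process every match, and re.compile() builds a reusable pattern object. The sketch below uses made-up sample strings.

```python
import re

# "g": re.sub() replaces every match unless a count is given.
pattern = re.compile(r"(ftp|http)://old-domain\.(com|net|org)")
text = "http://old-domain.com and ftp://old-domain.net"
replaced_all = pattern.sub(r"\1://new-domain.\2", text)

# "i": ignore case.
names = ["activestate.com", "ActiveState.com", "ACTIVESTATE.COM"]
ignore_case_hits = [n for n in names
                    if re.fullmatch(r"ActiveState\.com", n, re.IGNORECASE)]

# "m": "^" also matches at the start of each line.
multiline_hits = re.findall(r"^b", "a\nb", re.MULTILINE)

# "s": "." also matches a newline.
default_hit = re.search(r"a.b", "a\nb") is not None
dotall_hit = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# "x": whitespace in the pattern is ignored and comments are allowed.
verbose = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.VERBOSE)
verbose_hit = verbose.fullmatch("555-1234") is not None

print(replaced_all, ignore_case_hits, multiline_hits,
      default_hit, dotall_hit, verbose_hit)
```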
Python Regex Syntax
Komodo's Search features (including "Find...", "Replace..." and "Find in Files...") can accept plain text, glob style matching (called "wildcards" in the drop list, but using "." and "?" differently than regex wildcards), and Python regular expressions. A complete guide to regexes in Python can be found in the Python documentation. The Regular Expression HOWTO by A.M. Kuchling is a good introduction to regular expressions in Python.
More Regex Resources
Beginner:
- Python Standard Library: re - Regular Expression Operations
- ActiveState Code regular expression recipes
- Five Habits for Successful Regular Expressions, The O'Reilly ONLamp Resource Center
- Beginner's Introduction to Perl - Part 3, The O'Reilly Perl Resource Center
Intermediate:
- Regexp Power, The O'Reilly Perl Resource Center
Advanced:
- Power Regexps, Part II, The O'Reilly Perl Resource Center
Language-Specific:
- Perl: http://perldoc.perl.org/perlre.html
- PHP: http://www.php.net/manual/en/ref.pcre.php
- Python: https://docs.python.org/3.6/library/re.html
- Ruby: http://www.ruby-doc.org/core/Regexp.html
- Tcl: http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm
- JavaScript: https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions