Regular Expressions Primer

About this Primer

The Regular Expressions Primer is a tutorial for those completely new to regular expressions. It starts with the simple building blocks of the syntax and, through examples, builds up to expressions useful for solving real everyday problems, including searching for and replacing text.

A regular expression is often called a “regex”, “rx” or “re”. This primer uses the terms “regular expression” and “regex”.

Unless otherwise stated, the examples in this primer are generic, and will apply to most programming languages and tools. However, each language and tool has its own implementation of regular expressions, so quoting conventions, metacharacters, special sequences, and modifiers may vary (e.g. Perl, Python, grep, sed, and Vi have slight variations on standard regex syntax). Consult the regular expression documentation for your language or application for details.

What are regular expressions?

Regular expressions are a syntactical shorthand for describing patterns. They are used to find text that matches a pattern, and to replace matched strings with other strings. They can be used to parse files and other input, or to provide a powerful way to search and replace. Here’s a short example in Python:

import re
n = re.compile(r'\bw[a-z]*', re.IGNORECASE)
print(n.findall('will match all words beginning with the letter w.'))

Here’s a more advanced regular expression from the Python Tutorial:

import re

# cgs is not defined in this excerpt; a hypothetical list of
# (comment-start, comment-end) delimiter pairs is assumed here.
cgs = [('<!--', '-->')]

# Generate statement parsing regexes.
stmts = [r'#\s*(?P<op>if|elif|ifdef|ifndef)\s+(?P<expr>.*?)',
       r'#\s*(?P<op>else|endif)',
       r'#\s*(?P<op>error)\s+(?P<error>.*?)',
       r'#\s*(?P<op>define)\s+(?P<var>[^\s]*?)(\s+(?P<val>.+?))?',
       r'#\s*(?P<op>undef)\s+(?P<var>[^\s]*?)']
# Wrap each statement pattern in the escaped comment delimiters.
patterns = [r'^\s*%s\s*%s\s*%s\s*$'
          % (re.escape(cg[0]), stmt, re.escape(cg[1]))
          for cg in cgs for stmt in stmts]
stmtRes = [re.compile(p) for p in patterns]

Komodo can accept Python syntax regular expressions in its various Search features.

Komodo IDE’s Rx Toolkit can help you build and test regular expressions. See Using Rx Toolkit for more information.

Matching: Searching for a String

Regular expressions can be used to find a particular pattern, or to find a pattern and replace it with something else (substitution). Since the syntax is the same for the “find” part of the regex, we’ll start with matching.

Literal Match

The simplest type of regex is a literal match. Letters, numbers and most symbols in the expression will match themselves in the text being searched; an “a” matches an “a”, “cat” matches “cat”, “123” matches “123” and so on. For example:

Example: Search for the string “at”.

Regex:
```
at
```
Matches:
```
at
```
Doesn't Match:
```
it
a-t
At
```

Note: Regular expressions are case sensitive unless a modifier is used.
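
As a minimal sketch, the literal match above can be tried with Python's `re` module. The sample strings come from the example; `re.search` and the `re.IGNORECASE` flag are how Python exposes searching and case-insensitive matching, and other tools may spell the same idea differently.

```python
import re

# Sample strings taken from the example above.
samples = ["at", "it", "a-t", "At"]

# re.search returns a match object when the pattern occurs anywhere in the
# text, and None otherwise.
literal_hits = [s for s in samples if re.search("at", s)]

# With re.IGNORECASE the capitalised "At" is accepted as well.
nocase_hits = [s for s in samples if re.search("at", s, re.IGNORECASE)]

print(literal_hits)  # ['at']
print(nocase_hits)   # ['at', 'At']
```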

Wildcards

Regex characters that perform a special function instead of matching themselves literally are called "metacharacters". One such metacharacter is the dot ".", or wildcard. When used in a regular expression, "." can match any single character.

Using "." to match any character.

Example: Using '.' to find certain types of words.

Regex:
```
t...s
```
Matches:
```
trees
trams
teens
```
Doesn't Match:
```
trucks
trains
beans
```
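
The wildcard example can be checked with a short Python sketch. The word list is the one used above; the last line illustrates a Python-specific default (an assumption that may not hold in other tools): "." does not match a newline unless a modifier is used.

```python
import re

words = ["trees", "trams", "teens", "trucks", "trains", "beans"]

# "t...s" needs a "t", exactly three characters of any kind, then an "s".
wildcard_hits = [w for w in words if re.search(r"t...s", w)]
print(wildcard_hits)  # ['trees', 'trams', 'teens']

# In Python, "." skips newline characters by default.
newline_hit = re.search(r"a.b", "a\nb")
print(newline_hit)  # None
```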

Escaping Metacharacters

Many non-alphanumeric characters, like the "." mentioned above, are treated as special characters with specific functions in regular expressions. These special characters are called metacharacters. To search for a literal occurrence of a metacharacter (i.e. ignoring its special regex attribute), precede it with a backslash "\". For example:

Regex:
```
c:\\readme\.txt
```
Matches:
```
c:\readme.txt
```
Doesn't Match:
```
c:\\readme.txt
c:\readme_txt
```

Precede the following metacharacters with a backslash "\" to search for them as literal characters:

^ $ + * ? . | ( ) { } [ ] \

These metacharacters take on a special function (covered below) unless they are escaped. Conversely, some characters take on special functions (i.e. become metacharacters) when they are preceded by a backslash (e.g. "\d" for "any digit" or "\n" for "newline"). These special sequences vary from language to language; consult your language documentation for a comprehensive list.
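
The sketch below shows the escaping rule in Python, under the assumption that the text to find is the file path from the example. Escaping by hand and calling the standard `re.escape` helper give equivalent patterns; the final check shows what goes wrong when the "." is left unescaped.

```python
import re

path = r"c:\readme.txt"

# Escaped by hand, as in the example above.
manual = re.compile(r"c:\\readme\.txt")

# re.escape builds an equivalent pattern from the literal text.
automatic = re.compile(re.escape(path))

manual_ok = bool(manual.search(path))
automatic_ok = bool(automatic.search(path))

# With the "." escaped, an underscore in that position is rejected...
escaped_rejects = manual.search(r"c:\readme_txt") is None
# ...but an unescaped "." acts as a wildcard and accepts it.
unescaped_accepts = re.search(r"c:\\readme.txt", r"c:\readme_txt") is not None

print(manual_ok, automatic_ok, escaped_rejects, unescaped_accepts)
```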

Quantifiers

Quantifiers specify how many instances of the preceding element (which can be a character or a group) must appear in order to match.

Question mark

The "?" matches 0 or 1 instances of the previous element. In other words, it makes the element optional; it can be present, but it doesn't have to be. For example:

Regex:
```
colou?r
```
Matches:
```
colour
color
```
Doesn't Match:
```
colouur
colur
```

Asterisk

The "*" matches 0 or more instances of the previous element. For example:

Regex:
```
www\.my.*\.com
```

Matches:

www.my.com
www.mypage.com
www.mysite.com then text with spaces ftp.example.com

Doesn't Match:
```
www.oursite.com
mypage.com
```

As the third match illustrates, using ".*" can be dangerous. It will match any number of any character (including spaces and non-alphanumeric characters). The quantifier is "greedy" and will match as much text as possible. To make a quantifier "non-greedy" (matching as few characters as possible), add a "?" after the "*". Applied to the example above, the expression "www\.my.*?\.com" would match just "www.mysite.com", not the longer string.
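
A small Python sketch makes the difference between greedy and non-greedy matching visible. It uses the third sample line from the example; `group()` returns the text that was actually matched.

```python
import re

text = "www.mysite.com then text with spaces ftp.example.com"

# Greedy: ".*" runs to the last ".com" it can find.
greedy = re.search(r"www\.my.*\.com", text).group()

# Non-greedy: ".*?" stops at the first ".com".
lazy = re.search(r"www\.my.*?\.com", text).group()

print(greedy)  # www.mysite.com then text with spaces ftp.example.com
print(lazy)    # www.mysite.com
```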

Plus

The "+" matches 1 or more instances of the previous element. Like "*", it is greedy and will match as much as possible unless it is followed by a "?".

Regex:
```
bob5+@foo\.com
```
Matches:
```
bob5@foo.com
bob5555@foo.com
```
Doesn't Match:
```
bob@foo.com
bob65555@foo.com
```

Number: ‘{N}’

To match an element a specific number of times, add that number enclosed in curly braces after the element. For example:

Regex:
```
w{3}\.mydomain\.com
```
Matches:
```
www.mydomain.com
```
Doesn't Match:
```
web.mydomain.com
w3.mydomain.com
```

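Ranges: ‘{min,max}’

To specify the minimum number of matches to find and the maximum number of matches to allow, use a number range inside curly braces. Write the range without a space after the comma; some implementations (Python's, for example) otherwise treat the braces as literal text. For example: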

Regex:
```
60{3,5} years
```
Matches:
```
6000 years
60000 years
600000 years
```
Doesn't Match:
```
60 years
6000000 years
```

Quantifier Summary

Quantifier	Description
?	Matches the preceding element 0 or 1 times.
*	Matches the preceding element 0 or more times.
+	Matches the preceding element 1 or more times.
{num}	Matches the preceding element num times.
{min,max}	Matches the preceding element at least min times, but not more than max times.
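
The quantifier examples above can be replayed in Python with a small helper. This is only a sketch: the helper name `matches` is made up for illustration, and it uses `re.fullmatch`, which requires the whole string to fit the pattern, so that the counts are easy to see.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

optional = matches(r"colou?r", ["colour", "color", "colouur", "colur"])
one_or_more = matches(r"bob5+@foo\.com",
                      ["bob5@foo.com", "bob5555@foo.com", "bob@foo.com"])
exact = matches(r"w{3}\.mydomain\.com",
                ["www.mydomain.com", "web.mydomain.com"])
ranged = matches(r"60{3,5} years",
                 ["60 years", "6000 years", "600000 years", "6000000 years"])

print(optional)     # ['colour', 'color']
print(one_or_more)  # ['bob5@foo.com', 'bob5555@foo.com']
print(exact)        # ['www.mydomain.com']
print(ranged)       # ['6000 years', '600000 years']
```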

Alternation

The vertical bar "|" is used to represent an "OR" condition. Use it to separate alternate patterns or characters for matching. For example:

Regex:
```
perl|python
```
Matches:
```
perl
python
```
Doesn't Match:
```
ruby
```

Grouping with Parentheses

Parentheses “()” are used to group characters and expressions within larger, more complex regular expressions. Quantifiers that immediately follow the group apply to the whole group. For example:

Regex:
```
(abc){2,3}
```
Matches:
```
abcabc
abcabcabc
```
Doesn't Match:
```
abc
abccc
```

Groups can be used in conjunction with alternation. For example:

Regex:
```
gr(a|e)y
```
Matches:
```
gray
grey
```
Doesn't Match:
```
graey
```
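
The following Python sketch replays the alternation and grouping examples. It again assumes whole-string matching via `re.fullmatch` to keep the results unambiguous, and it contrasts a quantified group with the same quantifier applied to a single character.

```python
import re

# Alternation on its own: either whole word is accepted.
languages = [w for w in ["perl", "python", "ruby"]
             if re.fullmatch(r"perl|python", w)]

# Alternation inside a group: only one letter varies.
gray = re.compile(r"gr(a|e)y")
gray_hits = [w for w in ["gray", "grey", "graey"] if gray.fullmatch(w)]
chosen = gray.fullmatch("grey").group(1)  # the alternative that matched

# A quantifier after a group repeats the whole group...
grouped = bool(re.fullmatch(r"(abc){2,3}", "abcabc"))
# ...while without parentheses it repeats only the preceding character.
ungrouped = bool(re.fullmatch(r"abc{2,3}", "abccc"))

print(languages)  # ['perl', 'python']
print(gray_hits)  # ['gray', 'grey']
print(chosen)     # e
print(grouped, ungrouped)  # True True
```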

Strings that match these groups are stored, or "captured", for use in substitutions or later in the same expression. The first group can be recalled with "\1", the second with "\2" and so on. For example:

Regex:
```
(.{2,5}) (.{2,8}) <\1_\2@example\.com>
```

Matches:

Joe Smith <Joe_Smith@example.com>
jane doe <jane_doe@example.com>
459 33154 <459_33154@example.com>

Doesn't Match:

john doe <doe_john@example.com>
Jane Doe <janie88@example.com>
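
To see which text each group stores, the example can be run in Python as sketched below. The input lines are taken from the lists above; `groups()` returns the stored strings in order, so the first entry corresponds to "\1" and the second to "\2".

```python
import re

pattern = re.compile(r"(.{2,5}) (.{2,8}) <\1_\2@example\.com>")

lines = [
    "Joe Smith <Joe_Smith@example.com>",
    "jane doe <jane_doe@example.com>",
    "john doe <doe_john@example.com>",
]

# Map each line to its stored groups, or None when there is no match.
results = {}
for line in lines:
    m = pattern.search(line)
    results[line] = m.groups() if m else None

for line, groups in results.items():
    print(line, "->", groups)
```

The third line fails because the address repeats the two names in the opposite order from the one the groups stored.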

Character Classes

Character classes indicate a set of characters to match. Enclosing a set of characters in square brackets "[...]" means "match any one of these characters". For example:

Regex:
```
[cbe]at
```
Matches:
```
cat
bat
eat
```
Doesn't Match:
```
sat
hat
```

Since a character class on its own only applies to one character in the match, combine it with a quantifier to search for multiple instances of the class. For example:

Regex:
```
[0123456789]{3}
```
Matches:
```
123
999
376
```
Doesn't Match:
```
W3C
2_4
```

If we were to try the same thing with letters, we would have to enter all 26 letters in upper and lower case. Fortunately, we can specify a range instead using a hyphen. For example:

Regex:
```
[a-zA-Z]{4}
```
Matches:
```
Perl
ruby
SETL
```
Doesn't Match:
```
1234
AT&T
```

Most languages have special patterns for representing the most commonly used character classes. For example, Python uses "\d" to represent any digit (same as "[0-9]" for ASCII text) and "\w" to represent any alphanumeric, or "word" character (same as "[a-zA-Z0-9_]" for ASCII text). See your language documentation for the special sequences applicable to the language you use.
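
A brief Python sketch compares a spelled-out character class with its shorthand. The digit samples come from the examples above; the sentence passed to `re.findall` is made up for illustration. Note that in Python 3 these shorthands also accept non-ASCII digits and letters unless the `re.ASCII` flag is given.

```python
import re

samples = ["123", "999", "W3C", "2_4"]

# The explicit class and the "\d" shorthand accept the same samples.
long_form = [s for s in samples if re.fullmatch(r"[0123456789]{3}", s)]
short_form = [s for s in samples if re.fullmatch(r"\d{3}", s)]

# "\w" covers letters, digits and the underscore, but not "&" or ",".
word_runs = re.findall(r"\w+", "snake_case, AT&T, v2")

print(long_form)   # ['123', '999']
print(short_form)  # ['123', '999']
print(word_runs)   # ['snake_case', 'AT', 'T', 'v2']
```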

Negated Character Classes

To define a group of characters you do not want to match, use a negated character class. Adding a caret "^" to the beginning of the character class (i.e. [^...]) means "match any character except these". For example:

Regex:
```
[^a-zA-Z]{4}
```
Matches:
```
1234
$.25
#77;
```
Doesn't Match:
```
Perl
AT&T
```
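
The negated class can be checked the same way. In this sketch the samples are the ones listed above, and the last line shows that the caret only negates when it is the first character inside the brackets; anywhere else it is an ordinary character.

```python
import re

samples = ["1234", "$.25", "#77;", "Perl", "AT&T"]

# Four consecutive characters, none of which is a letter.
not_letters = [s for s in samples if re.search(r"[^a-zA-Z]{4}", s)]

# A caret that is not first in the class is matched literally.
caret_literal = re.findall(r"[a^]", "a^b")

print(not_letters)    # ['1234', '$.25', '#77;']
print(caret_literal)  # ['a', '^']
```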

Anchors: Matching at Specific Locations

Anchors are used to specify where in a string or line to look for a match. The “^” metacharacter (when not used at the beginning of a negated character class) specifies the beginning of the string or line:

Regex:
```
^From: root@server.*
```
Matches:
```
From: root@server.example.com
```

Doesn't Match:

I got this From: root@server.example.com yesterday
>> From: root@server.example.com

The "$" metacharacter specifies the end of a string or line:

Regex:
```
.*\/index\.php$
```

Matches:

www.example.org/index.php
the file is /tmp/index.php

Doesn't Match:

www.example.org/index.php?id=245
www.example.org/index.php4

Sometimes it's useful to anchor both the beginning and end of a regular expression. This not only makes the expression more specific, it often improves the performance of the search.

Regex:
```
^To: .*example\.org$
```

Matches:

To: feedback@example.org
To: hr@example.net, qa@example.org

Doesn't Match:

To: qa@example.org, hr@example.net
Send a Message To: example.org
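
The anchors can be tried in Python as sketched below. One Python-specific detail is assumed here: by default "^" and "$" anchor to the start and end of the whole string, and the `re.MULTILINE` flag is needed for them to apply to each line. The three-line `message` text is made up from the sample lines above.

```python
import re

from_line = re.compile(r"^From: root@server.*")

at_start = bool(from_line.search("From: root@server.example.com"))
quoted = bool(from_line.search(">> From: root@server.example.com"))
print(at_start, quoted)  # True False

message = (
    "Subject: hi\n"
    "To: feedback@example.org\n"
    "To: qa@example.org, hr@example.net"
)

# Without MULTILINE, "^" only matches at the very start of the string.
whole_string = re.findall(r"^To: .*example\.org$", message)

# With MULTILINE, each line is tested against both anchors.
per_line = re.findall(r"^To: .*example\.org$", message, re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['To: feedback@example.org']
```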

Substitution: Searching and Replacing

Regular expressions can be used as a "search and replace" tool. This aspect of regex use is known as substitution.

There are many variations in substitution syntax depending on the language used. This primer uses the "s/search/replacement/modifier" convention used in Perl. In simple substitutions, the "search" text will be a regex like the ones we've examined above, and the "replacement" value will be a string.

For example, to search for an old domain name and replace it with the new domain name:

Regex Substitution:

s/http:\/\/www\.old-domain\.com/http://www.new-domain.com/

Search for:
```
http://www.old-domain.com
```
Replace with:
```
http://www.new-domain.com
```

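Notice that the "." characters are not escaped in the replacement string. The replacement is a plain string rather than a pattern, so metacharacters have no special meaning there and do not need to be escaped. In some implementations (Python's re.sub, for example), a backslash placed before them would be carried into the result literally (i.e. "http:\/\/www\.new-domain\.com"). The "/" characters are also left unescaped here for readability; this works in tools that take the search and replacement text in separate fields, but in Perl itself "/" is the delimiter, so a "/" inside either part must be escaped or a different delimiter chosen.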

The one way you can use the backslash "\" is to put saved matches in the substitution using "\num". For example:

Substitution Regex:

s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/

Target Text:
```
http://old-domain.com
```
Result:
```
http://new-domain.com
```
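
In Python the same substitution is written with `re.sub` (or the `sub` method of a compiled pattern), which takes the search pattern and the replacement as separate arguments. The sketch below assumes a small made-up list of URLs; the third entry is there only to show that non-matching text is returned unchanged.

```python
import re

pattern = re.compile(r"(ftp|http)://old-domain\.(com|net|org)")

# "\1" and "\2" insert the text stored by the first and second group.
replacement = r"\1://new-domain.\2"

urls = ["http://old-domain.com", "ftp://old-domain.net", "http://other.org"]
updated = [pattern.sub(replacement, u) for u in urls]

print(updated)
# ['http://new-domain.com', 'ftp://new-domain.net', 'http://other.org']
```

Python also accepts the longer form "\g<1>" in replacement strings, which is useful when a group reference is followed directly by a digit.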

This regex will actually match a number of URLs other than "http://old-domain.com". If we had a list of URLs with various permutations, we could replace all of them with related versions of the new domain name (e.g. "ftp://old-domain.net" would become "ftp://new-domain.net"). To do this we need to use a modifier.

Modifiers

Modifiers alter the behavior of the regular expression. The previous substitution example replaces only the first occurrence of the search string; once it finds a match, it performs the substitution and stops. To modify this regex in order to replace all matches in the string, we need to add the "g" modifier.

Substitution Regex:

s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/g

Target Text:

http://old-domain.com and ftp://old-domain.net

Result:

http://new-domain.com and ftp://new-domain.net

The "i" modifier causes the match to ignore the case of alphabetic characters. For example:

Regex:
```
/ActiveState\.com/i
```

Matches:

activestate.com
ActiveState.com
ACTIVESTATE.COM

Modifier Summary

Modifier	Meaning
i	Ignore case when matching exact strings.
m	Treat string as multiple lines. Allow "^" and "$" to match next to newline characters.
s	Treat string as single line. Allow "." to match a newline character.
x	Ignore whitespace and newline characters in the regular expression. Allow comments.
o	Compile regular expression once only.
g	Match all instances of the pattern in the target string.
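
Python expresses most of these modifiers as flags passed to the `re` functions rather than as letters after the pattern. The sketch below shows rough equivalents for "i", "s", "x" and "g"; the mapping is an approximation offered for orientation, not a complete translation of Perl's behavior.

```python
import re

# "i": ignore case.
ignore_case = bool(re.search(r"ActiveState\.com", "ACTIVESTATE.COM",
                             re.IGNORECASE))

# "s": let "." match a newline as well.
dot_default = bool(re.search(r"a.b", "a\nb"))
dot_all = bool(re.search(r"a.b", "a\nb", re.DOTALL))

# "x": ignore whitespace in the pattern and allow comments.
domain = re.compile(r"""
    (ftp|http)      # group 1: the scheme
    ://old-domain\.
    (com|net|org)   # group 2: the top-level domain
""", re.VERBOSE)

text = "http://old-domain.com and ftp://old-domain.net"
replacement = r"\1://new-domain.\2"

# "g": sub() already replaces every match; count=1 stops after the first.
replace_all = domain.sub(replacement, text)
replace_first = domain.sub(replacement, text, count=1)

print(ignore_case, dot_default, dot_all)  # True False True
print(replace_all)    # http://new-domain.com and ftp://new-domain.net
print(replace_first)  # http://new-domain.com and ftp://old-domain.net
```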

Python Regex Syntax

Komodo's Search features (including "Find...", "Replace..." and "Find in Files...") can accept plain text, glob-style matching (called "wildcards" in the drop list, but using "." and "?" differently than regex wildcards), and Python regular expressions. A complete guide to regexes in Python can be found in the Python documentation. The Regular Expression HOWTO by A.M. Kuchling is a good introduction to regular expressions in Python.

More Regex Resources

Beginner:

Python Standard Library: re - Regular Expression Operations
ActiveState Code regular expression recipes
Five Habits for Successful Regular Expressions, The O'Reilly ONLamp Resource Center
Beginner's Introduction to Perl - Part 3, The O'Reilly Perl Resource Center

Intermediate:

Regexp Power, The O'Reilly Perl Resource Center

Advanced:

Power Regexps, Part II, The O'Reilly Perl Resource Center

Language-Specific:

Perl: http://perldoc.perl.org/perlre.html

PHP: http://www.php.net/manual/en/ref.pcre.php

Python: https://docs.python.org/3.6/library/re.html

Ruby: http://www.ruby-doc.org/core/Regexp.html

Tcl: http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm

JavaScript: https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions