If you don’t know what a regular expression (also known as “regex” or “regexp”) is, it’s an unbelievably powerful language for executing a search and/or replace through any kind of text. Regex has had a long and glorious history (its origins date back to the ‘50s) and even now, you use it every day … you just might not know it. Regex is baked into most major programming languages and computer systems these days and is used in a wide variety of apps.
Regex is no longer just for programmers: it’s showing up in all sorts of places today. Some of the places you might have seen it so far are matching URLs in Google Analytics, search & replace in your favorite text editor (some popular editors supporting regex include Sublime, Atom, Cloud9, Notepad++, Google Docs and Microsoft Word … although Word has a regex syntax that is very non-standard) and even matching file names in our awesome WordPress Membership Plugin, MemberPress.
Even though regex is now showing up in surprisingly new places, its seemingly cryptic nature will probably always make it more predominantly used by programmers and “power users.”
So here's the point in this article where you need to decide if you’re going to continue life without knowing the untold power of regex, or if you’ll swallow the red pill and wield the sword of a power user! In this post, I’ll give you the basic knowledge you’ll need to start putting this extremely useful tool to work for you. So let’s get started!
Some Examples
What do regex statements look like? If you saw one without having any idea what it was, it would look like complete gobbledygook.
Here’s a sample regexp statement to match a phone number:
\(?\d{3}\)?[- ]\d{3}-\d{4}
Here’s one that matches a URL:
https?:\/\/[\w-]+(\.[\w-]+)*(:\d{1,5})?
And one that can match an email address:
[\w\d._%+-]+@[\w\d.-]+\.[\w]{2,4}
Finally, here’s a simple search and replace that will find any domain names with the ‘test’ subdomain and change them to a ‘www’ subdomain:
Search: test\.([^\.]+)\.com
Replace: www.$1.com
Intimidated yet? Don’t be. I’ll go through how regex statements are interpreted, breaking down what we’re doing in each of these statements and explaining how to create your own.
How to Read Regex
Regular expressions are like normal find statements, but on steroids. For instance, if you wanted to find the name “Harry” on a web page, you’d press Control-F and type “Harry” to find it. With regex you can do more than simply search for “Harry.” You could search for “Harry” or “Bob” with (Harry|Bob) or you could search for any word starting with “Ha” with Ha\S*.
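To see those two patterns at work, here's a minimal sketch using Python's built-in re module. The sample sentence is made up for illustration.

```python
import re

text = "Harry met Bob and Hannah at the Harbor."

# Alternation: match either name.
names = re.findall(r"(Harry|Bob)", text)

# "Ha" followed by zero or more non-whitespace characters.
ha_words = re.findall(r"Ha\S*", text)

print(names)     # ['Harry', 'Bob']
print(ha_words)  # ['Harry', 'Hannah', 'Harbor.']
```

Notice that Ha\S* happily swallows the trailing period in "Harbor." because a period is a non-whitespace character too.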
To your computer, text is represented as a “string” of characters. In fact, if you’ve ever overheard programmers talking about code you’ve probably heard them use the term “string” when referring to some variable.
Regular expressions read left to right (just like English and many other languages) and they explain to the computer what to match as it scans a string left to right. As the computer looks at each character it follows this flow:
If you look at this flowchart, as complicated as it might seem, you’ll see something very familiar here. It’s pretty much identical to how you’d expect any search to happen on a computer and how searches that you’re currently comfortable with happen already. Regex just adds some extremely powerful features on top of normal search.
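As a rough sketch of that left-to-right scan, here is how a plain (non-regex) search might be written by hand in Python. The function name find_literal is made up for illustration, and real regex engines are considerably more sophisticated; the point is only the "try here, move on one character if it fails" loop.

```python
def find_literal(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    # Try every possible starting position, left to right.
    for start in range(len(text) - len(pattern) + 1):
        matched = 0
        # Compare character by character until a mismatch or a full match.
        while matched < len(pattern) and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            return start
    return -1

print(find_literal("Harry", "Dirty Harry"))  # 6
print(find_literal("Bob", "Dirty Harry"))    # -1
```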
The Basics
Now before we start dissecting our example regex statements from above, let’s get through some of the very basic command characters of regex:
Repeaters: *, + and {…}
The asterisk (*) and plus (+) characters are repeaters. A repeater is only used after another character or enclosed statement and tells the computer how many times to match the preceding character. The asterisk will match the preceding character zero or more times; the plus, one or more times; {2} will match it exactly 2 times; {4,6} will match it between 4 and 6 times; and {7,} will match it 7 or more times.
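Here's a small Python sketch of each repeater. The helper whole_match is a made-up name for illustration; it uses re.fullmatch so the pattern has to account for the entire string.

```python
import re

def whole_match(pattern, text):
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(whole_match(r"ab*", "a"))         # True: zero b's is fine for *
print(whole_match(r"ab+", "a"))         # False: + needs at least one b
print(whole_match(r"ab+", "abbb"))      # True
print(whole_match(r"a{2}", "aa"))       # True: exactly two
print(whole_match(r"a{2}", "aaa"))      # False
print(whole_match(r"a{4,6}", "aaaaa"))  # True: five is between four and six
print(whole_match(r"a{7,}", "aaaaaa"))  # False: only six a's
```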
Wildcard: .
You’re probably used to the star symbol being used as a wildcard, but it’s a period (.) in regex. The period will match any single character (in most regex flavors, anything except a line break).
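A quick Python sketch of the wildcard, with a made-up sample string:

```python
import re

# "c", then any one character, then "t".
hits = re.findall(r"c.t", "cat cot c-t ct cart")

print(hits)  # ['cat', 'cot', 'c-t']
```

"ct" and "cart" are skipped because the period stands for exactly one character, no more and no fewer.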
Optional: ?
Like the repeater, the optional character, a question mark (?), is only used after another character or enclosed statement. When this is in place, it tells the computer that the preceding character may or may not be present in search results.
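For instance, here's a short Python sketch (the sample words are just for illustration) where the "u" is optional:

```python
import re

# The "u" may or may not be present.
hits = re.findall(r"colou?r", "color colour colouur")

print(hits)  # ['color', 'colour']
```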
Beginning and End of String: ^ and $
When the caret (^) is used at the beginning of the regex, it indicates that the match must start at the beginning of the string. And the dollar sign ($), when used at the end of the regex, will match the end of the string.
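Here's a small Python sketch of both anchors. The variable names and sample strings are made up for illustration.

```python
import re

starts = re.search(r"^Harry", "Harry Potter") is not None     # True
not_start = re.search(r"^Harry", "Dirty Harry") is not None   # False: Harry isn't first
ends = re.search(r"Harry$", "Dirty Harry") is not None        # True
exact = re.search(r"^Harry$", "Dirty Harry") is not None      # False: extra text before

print(starts, not_start, ends, exact)
```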
Possible Characters: […]
Enclosing characters within brackets indicates the characters that could be matched in this position. For example, if you wanted to match an ‘n’, ‘m’, ‘l’ or ‘_’ then you could add this to your regex: [nml_]. If you just wanted to match any letter or number you could add this: [A-Za-z0-9].
If you want to match any character except the possible characters you can just put a caret (^) after the opening bracket. For instance, if you wanted to match any character except ‘x’, ‘y’, ‘z’ or ‘_’ you could do so with the statement: [^xyz_]. As you can see, this can give you fine-grained control over what you’re matching on a character-by-character basis.
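The three bracket statements above can be tried out with a short Python sketch. The sample strings are invented for illustration.

```python
import re

# Any one of n, m, l or underscore.
picked = re.findall(r"[nml_]", "lemon_lime")

# Runs of letters or numbers.
alnum = re.findall(r"[A-Za-z0-9]+", "order #42, aisle B-7")

# Runs of anything except x, y, z or underscore.
negated = re.findall(r"[^xyz_]+", "xy_abc_zz_def")

print(picked)   # ['l', 'm', 'n', '_', 'l', 'm']
print(alnum)    # ['order', '42', 'aisle', 'B', '7']
print(negated)  # ['abc', 'def']
```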
Enclosed Statements: (…)
Sometimes you want to make sections of your regex behave as a block or save them for later. For this, all you need to do is wrap the statement in parentheses.
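Here's a minimal Python sketch of both uses: treating a section as a block, and saving what it matched. The sample strings and variable names are assumptions for illustration.

```python
import re

# As a block: the + repeats the whole "ab" pair, not just the "b".
repeated_pair = re.fullmatch(r"(ab)+", "ababab") is not None  # True
repeated_b = re.fullmatch(r"ab+", "ababab") is not None       # False

# Saved for later: each pair of parentheses remembers what it matched.
m = re.search(r"([a-z]+)@([a-z]+)", "contact: blair@example")
user, host = m.group(1), m.group(2)

print(repeated_pair, repeated_b)  # True False
print(user, host)                 # blair example
```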
Escape: \
Now, what if you want to match the actual ‘+’, ‘.’, etc. characters? Well, if you put a backslash in front of one, that tells your computer that you’re trying to match the literal character, and to not interpret it as a command character.
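A short Python sketch of the difference an escape makes, using a made-up sample string. Python's re.escape is shown as one convenient way to escape a whole literal string for you.

```python
import re

text = "1+1=2 and 11"

# Unescaped: + is a repeater, so this means "one or more 1s, then a 1".
unescaped = re.findall(r"1+1", text)

# Escaped: \+ is a literal plus sign.
escaped = re.findall(r"1\+1", text)

# re.escape adds the backslashes to a literal string automatically.
auto_escaped = re.findall(re.escape("1+1"), text)

print(unescaped)     # ['11']
print(escaped)       # ['1+1']
print(auto_escaped)  # ['1+1']
```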
Shorthand Characters: \s, \S, \d, \D, \w, \W and \b
These characters are very useful and will help you match certain sets of characters. Here’s how these break down:
\s matches any whitespace character, such as a space or tab
\S matches any non-whitespace character
\d matches any digit character
\D matches any non-digit character
\w matches any word character (letters, digits and the underscore)
\W matches any non-word character
\b matches a word boundary: the position between a word character and a non-word character (such as a space, dash, comma or semi-colon), or the start or end of the string
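Here's a quick Python sketch of the shorthand characters (written with a backslash) on a couple of made-up strings:

```python
import re

text = "Call 555-1234 at 9 am"

digits = re.findall(r"\d+", text)        # runs of digits
words = re.findall(r"\w+", text)         # runs of word characters
pieces = re.split(r"\s+", text)          # split on whitespace
non_digits = re.findall(r"\D+", "a1b22c")  # runs of non-digits

# \b keeps "at" from matching inside "bat" or "atlas".
whole_at = re.findall(r"\bat\b", "at bat atlas at")

print(digits)      # ['555', '1234', '9']
print(words)       # ['Call', '555', '1234', 'at', '9', 'am']
print(pieces)      # ['Call', '555-1234', 'at', '9', 'am']
print(non_digits)  # ['a', 'b', 'c']
print(whole_at)    # ['at', 'at']
```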
Of course there are many other, more powerful things that you can do in regex but these basics will get you the tools you need for 90% of your advanced search needs.
Breakdown of Our Phone Number Example
Okay, now that you have the tools, let’s look through our examples to see how they work, starting with our match for the phone number:
\(?\d{3}\)?[- ]\d{3}-\d{4}
So we’ll break it down by statement:
\(?
This is the first statement that the computer will look for. It starts with an escape character (\) followed by an opening parenthesis (() followed by an optional character (?). This tells the regex to look for an actual opening parenthesis to signal the start of a phone number, but that it may not be present in every instance of a phone number.
\d{3}
The second statement starts with the shorthand character for digit (\d) then it has a repeater ({3}) that will require there to be 3 digits. It does not end with a question mark because these three digits are not optional – they must be present for the search result to qualify as a phone number.
\)?
Now the computer will check for an optional closing parenthesis, but if it’s not present it can still match.
[- ]
After that, we have a possible characters statement that will look for either a space or a dash to match.
\d{3}-\d{4}
Finally, we actually have some statements that will expect 3 digits followed by a dash and then followed by 4 digits.
You can see this in action and play around with it here: http://regex101.com/r/nW9iC5/3
One thing you might notice about this regex is that it will only match two acceptable formats for phone numbers:
(888) 888-8888
888-888-8888
It won’t match a multitude of unacceptable phone number formats such as:
8888888888
88-8888-8888
But it will also match a few formats that might not be appropriate:
(888-888-8888
888) 888-8888
(888)-888-8888
What are some things you think you could do to ensure that this regex only matched the appropriate formats?
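If you'd like to check those lists yourself, here's a Python sketch that runs the phone regex against every format above. It also includes one possible answer to the question: STRICT_PHONE is my own assumption of a tighter pattern (either a fully parenthesized area code followed by a space, or a bare area code followed by a dash), not the only way to do it.

```python
import re

# The phone regex from this article.
PHONE = re.compile(r"\(?\d{3}\)?[- ]\d{3}-\d{4}")

# Hypothetical stricter version: "(ddd) " or "ddd-", then ddd-dddd.
STRICT_PHONE = re.compile(r"(\(\d{3}\) |\d{3}-)\d{3}-\d{4}")

samples = [
    "(888) 888-8888",
    "888-888-8888",
    "8888888888",
    "88-8888-8888",
    "(888-888-8888",
    "888) 888-8888",
    "(888)-888-8888",
]

for s in samples:
    original = PHONE.fullmatch(s) is not None
    strict = STRICT_PHONE.fullmatch(s) is not None
    print(f"{s!r:18} original={original} strict={strict}")
```

With the stricter pattern only the first two samples should pass, because the parentheses are now required to appear as a pair.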
Breakdown of our URL Regex
First, I’d like to state that the following regex is not a comprehensive URL matching regex by any means. It will only match a web address with no arguments and has several other problems, but would serve to match many URLs effectively.
https?:\/\/[\w-]+(\.[\w-]+)*(:\d{1,5})?
Now for the breakdown:
https?:\/\/
This first statement matches the protocol of the URL. Because the “s” has a question mark after it, it’s optional. There are also escape characters in front of the forward slashes; many tools use the forward slash to mark where a regex starts and ends, so a literal slash usually has to be escaped. So really, the only two ways a URL could match this statement would be to begin with http:// or https://.
[\w-]+
Here we have a possible characters statement followed by a plus. This means that the next characters must be 1 or more word characters or dashes.
(\.[\w-]+)*
Here we have the statement that will match a period followed by one or more word characters or dashes. This whole statement is wrapped in parentheses and followed by an asterisk, which tells the computer that we can repeat this whole sequence zero or more times. So, this section could match .memberpress or .memberpress.com or .memberpress.co.uk, and so on.
(:\d{1,5})?
Finally, this statement allows an optional port number at the end of the URL: a colon followed by 1 to 5 digits, all wrapped in parentheses and followed by a question mark to make the entire statement optional.
Here’s where you can play around with this one: http://regex101.com/r/vL5uZ2/2
This regex will match:
- http://memberpress.com
- https://memberpress.com
- http://localhost:3000
BUT it won’t match any parameters trailing a URL like this:
- http://localhost:3000?test=1&page=5
What could you add to this regex to match some parameters at the end of a URL?
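Here's a Python sketch of the URL regex in action, plus one possible answer to that question. URL_WITH_QUERY is a hypothetical extension of my own: it tacks on an optional question mark followed by a deliberately small set of parameter characters, so treat it as a starting point rather than a complete solution.

```python
import re

# The URL regex from this article (escaping / is harmless in Python).
URL = re.compile(r"https?:\/\/[\w-]+(\.[\w-]+)*(:\d{1,5})?")

# Hypothetical extension: an optional "?" followed by simple parameters.
URL_WITH_QUERY = re.compile(
    r"https?:\/\/[\w-]+(\.[\w-]+)*(:\d{1,5})?(\?[\w=&%.+-]*)?"
)

with_params = "http://localhost:3000?test=1&page=5"

# The original pattern stops right before the parameters.
matched_part = URL.match(with_params).group(0)
print(matched_part)  # http://localhost:3000

# The extended pattern covers the whole thing.
print(URL_WITH_QUERY.fullmatch(with_params) is not None)  # True
```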
If you're interested in the most comprehensive URL matching regex (one that will match parameters, Unicode characters, large TLDs, etc.) there are plenty that you can look at, but apparently the most comprehensive and accurate is the one created by Diego Perini, which you can look at on his GitHub gist.
Breakdown of our Email Regex
Another common use of regex is to match email addresses. Our email matching regex will actually match a substantial number of email addresses:
[\w\d\._%+-]+@[\w\d.-]+\.[\w]{2,4}
Here’s the breakdown:
[\w\d\._%+-]+
This statement will match one or more word characters, digits, periods, underscores, percentage symbols, plus symbols, or dashes.
@
Every email address needs an ‘@’ symbol and that’s what this matches of course.
[\w\d.-]+
After the ‘@’ symbol we need to start on the domain name. This statement will match one or more word characters, digits, periods, or dashes.
\.[\w]{2,4}
Now we need to match the top level domain, which is a period followed by 2 to 4 word characters.
You can also play around with this regex here: http://regex101.com/r/yQ2wP9/1
Even though this regex isn’t 100% comprehensive it should be able to match most addresses.
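To get a feel for what "most addresses" means, here's a Python sketch that tries the email regex on a few made-up addresses. The helper name is_email is an assumption for illustration.

```python
import re

# The email regex from this article.
EMAIL = re.compile(r"[\w\d\._%+-]+@[\w\d.-]+\.[\w]{2,4}")

def is_email(text):
    """Return True when the entire text matches the email pattern."""
    return EMAIL.fullmatch(text) is not None

print(is_email("blair@memberpress.com"))              # True
print(is_email("first.last+tag@mail.example.co.uk"))  # True
print(is_email("no-at-sign.com"))                     # False
print(is_email("user@example.museum"))                # False: TLD is longer than 4
```

The last example shows one of the gaps: any top level domain longer than four characters is rejected by the {2,4} repeater.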
Regex Search and Replace
In many cases you’ll want not only to match patterns in your text but also to use the power of search and replace with regex.
Let's use the search and replace example from above:
Search: test\.([^\.]+)\.com
Replace: www.$1.com
Here's the breakdown, starting with the search pattern:
test\.
This first part will match any part of the string that starts with “test.”
([^\.]+)
This statement will match one or more characters that are anything except a period. The entire statement is wrapped in parentheses for a different reason than we’ve seen so far: this time it’s to save whatever is matched by this statement to be used in the replacement.
\.com
Finally, it must match a “.com” at the end.
Now, in the replacement we can use “$1” to include whatever we saved between the parentheses in the search pattern. If you had a second statement like this in the search pattern you could use it in the replacement by using “$2.”
You can play around with this search and replace regex here: http://regex101.com/r/bW5vX8/1
This search pattern has some real limitations though. It can only match URLs starting with “test.” and ending with “.com”. How do you think you could modify it to match URLs ending in “.com”, “.net” or “.org”?
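Here's how this search and replace could look in Python. One thing to note: Python's re.sub writes saved groups as \1 and \2 in the replacement rather than $1 and $2. The second pattern is one possible answer to the question above, using a second saved group for the ending; the sample sentences are made up.

```python
import re

# The search and replace from this article.
result = re.sub(
    r"test\.([^\.]+)\.com",
    r"www.\1.com",
    "Visit test.memberpress.com today",
)
print(result)  # Visit www.memberpress.com today

# Hypothetical extension: save the ending too and reuse it as \2.
result_any = re.sub(
    r"test\.([^\.]+)\.(com|net|org)",
    r"www.\1.\2",
    "test.example.org and test.example.net",
)
print(result_any)  # www.example.org and www.example.net
```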
Learning more about Regex
This primer has given you a crash course on regular expressions but there’s plenty more to learn. You can get more info about regex on Wikipedia or regular-expressions.info.
Hey Blair, the one thing I like in your posts is the uniqueness of the subjects. Nowadays you read the same things over and over and over…all these rewritten no one knows how many times articles. I read every single post of yours because you use two things that not many marketers use- expertise and professionalism. Thanks for the interesting posts and please, keep on in this way. We all need people like you 🙂
Thanks Ivo …
I have a problem with forms that lock the user into a specific format for telephone number. I notice that on many contact forms where companies offer technical services to international users, a telephone number is required. However, outside of USA, telephone numbers can have different formats.
For example:
+45 8765-4321
+43 1-123-456-7890
+46 987-654-321
In many cases, the user is forced to provide a fake number in order to complete the form. Let’s take the first example; it coincidentally fits into the typical forced format on many on-line forms. However, the person reading the form will believe the number is to Oregon and not Denmark. You would think businesses promoting themselves internationally would be aware of this. How would you handle this?
Laurence, thanks for the comment … you raise a good point. In this article, my regular expression is quite rigid and tied to US phone numbers. But the great thing about Regex is that you can make it as loose or rigid as you want it. For instance, if I just wanted to match integers, spaces and dashes possibly preceded by a plus I could do this:
This regex would match each of your examples above but is really liberal. It would also match nonsense patterns like:
+-- --- --- ---- ---
+4-9-1-2-0-1-3-0-2
600-001-001-005-3838-38-28383
Possibly, a better approach would be to not allow any dashes, spaces or parentheses in the number at all if you want to support international formats. An example of this would be:
That would match all of your numbers above without dashes:
+4587654321
+4311234567890
+46987654321
This solution would be great but is also extremely liberal … you’d have no way of validating the country code or anything like that.
Another possible solution, if you know which countries you’re targeting, would be to address all the possible formats you’re interested in supporting in your regex like so:
This pattern would match each of your three examples above but would not match any US phone numbers, etc. It provides better validation than the regex statements above because it’s more rigid … but will also not validate any numbers that are not part of the 45, 43 and 46 country codes.
Perhaps if you google around a bit you could find a master regex for validating phone numbers from across the world. I haven’t found one yet … so if you do turn one up, let me know. Alternatively, you could write your own comprehensive regex for this (I’m sure it would be quite popular) … you’d just have to use a technique like I did in the last regex here … but that encompasses all of the numerical formats from around the world. A good start would be to look at this page on wikipedia.
At any rate, I hope that helps … I know it’s not a silver bullet but hopefully it can at least point you in the right direction.
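For anyone who wants to experiment, here is a rough Python sketch of the three approaches described in this reply. All three patterns (LOOSE, DIGITS_ONLY and BY_COUNTRY) are assumptions written to fit the descriptions above, not the exact patterns from the original reply, and the country formats are taken only from the three example numbers.

```python
import re

# Approach 1 (assumed): digits, spaces and dashes, optionally after a plus.
LOOSE = re.compile(r"\+?[\d -]+")

# Approach 2 (assumed): an optional plus followed by digits only.
DIGITS_ONLY = re.compile(r"\+?\d+")

# Approach 3 (assumed): one explicit format per country code 45, 43, 46.
BY_COUNTRY = re.compile(
    r"\+(45 \d{4}-\d{4}|43 \d-\d{3}-\d{3}-\d{4}|46 \d{3}-\d{3}-\d{3})"
)

examples = ["+45 8765-4321", "+43 1-123-456-7890", "+46 987-654-321"]

for number in examples:
    print(
        number,
        LOOSE.fullmatch(number) is not None,
        BY_COUNTRY.fullmatch(number) is not None,
    )

# The loose pattern also accepts nonsense, as noted above.
print(LOOSE.fullmatch("+4-9-1-2-0-1-3-0-2") is not None)  # True
```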
Thank you for such an informative and much needed article. I find that regex coding is often taken for granted by people who know it, and that they usually expect that everyone else knows it too. Now, I have an understandable resource I can refer to.
I understand that controlling that the digits entered conform to a preset standard is a feature meant to minimize errors. As for international telephone numbers in a form field, I suspect it might be easier to just trust that the user will check that they have typed in the correct telephone number than to try and micromanage the field for the many possibilities. BTW, the most important use of regex for me has thus far been in the .htaccess file.
I appreciate that you not only answer comments on your blog, but that your answers are well thought out and meaningful. Great job.
Thanks Laurence.
Great point … but that’s part of the beauty of Regex. I mean you don’t need to use it at all … but as I said in my first reply, you could do an extremely light-weight regex that just enforces they enter digits and dashes or something.
Yeah, using Regex with the RedirectMatch directive is extremely useful … it makes it so easy to map url structures. It’s especially important to use this when you’re migrating a site where the links will all change … I mean if you don’t use some kind of url mapping like RedirectMatch your SEO can suffer dramatically when migrating your site.
Blair,
Thanks for this post! I always avoid regular expressions because I’ve never taken the time to figure out the syntax. This really breaks it down and makes it simple.
I’m not afraid of regular expressions anymore 🙂
Awesome Jamie! Glad to be of some help!