Regular Expressions explained
Quantifiers
Before I start explaining the syntax you might want to jump to the last page to learn which programs you can use to test out the examples in this article.
The contents of an expression are, as explained earlier, a combination of alphanumeric characters and metacharacters. An alphanumeric character is either a letter of the alphabet or a digit.
Actually, in the world of regular expressions any character which is not a metacharacter matches itself (such characters are often called literal characters); most of the time, however, you are concerned with the alphanumeric ones. A very special character is the backslash \. It turns a metacharacter into a literal character, and turns certain alphanumeric characters into a sort of metacharacter or sequence. The metacharacters are:
\ | ( ) [ { ^ $ * + ? . < >
With that said, normal characters don't sound too interesting, so let's jump to our very first metacharacter.
The full stop, or dot, . needs explaining first since it often leads to confusion. This character will not, as many might think, match the full stop in a line; it is instead a special metacharacter which matches any character (in most engines, any character except a newline). Using it where you wanted to find the end of the line or the decimal point in a floating-point number will lead to strange results. As explained above, you need to put a backslash in front of it to get the literal meaning. For instance, the expression
    1.23
will match the number 1.23 in a text as you might have guessed, but it will also match text such as 1x23 or 1-23, because the dot accepts any character in that position. To make the expression match only the floating-point number we change it to
    1\.23
Remember this, it's very important. Now with that said we can get the show going.
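The difference between the bare dot and the escaped dot can be checked with a short sketch. It uses Python's re module, which is only one of many engines and is our choice here; the sample strings are made up for illustration.

```python
import re

# Made-up sample texts; only the first contains a real decimal point.
samples = ["1.23", "1x23", "1-23", "1 23"]

# Unescaped dot: matches any character between the 1 and the 23.
unescaped = [s for s in samples if re.search(r"1.23", s)]

# Escaped dot: matches only a literal full stop.
escaped = [s for s in samples if re.search(r"1\.23", s)]

print(unescaped)           # ['1.23', '1x23', '1-23', '1 23']
print(escaped)             # ['1.23']

# re.escape adds the backslash for you when the text is meant literally.
print(re.escape("1.23"))   # 1\.23
```

If a pattern is built from text that should be taken literally, letting re.escape add the backslashes is usually less error-prone than typing them by hand.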
Two heavily recurring metacharacters are
    * +
They are called quantifiers and tell the engine to look for several occurrences of a character; a quantifier always follows the character it applies to. The * character matches zero or more occurrences of the character in a row; the + character is similar but matches one or more.
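A minimal sketch of the two quantifiers, again assuming Python's re module. The patterns ab*c and ab+c and the sample strings are hypothetical; in both patterns the quantifier is written after the b it applies to.

```python
import re

# Hypothetical sample strings with zero, one and three b's.
samples = ["ac", "abc", "abbbc"]

# b* allows zero or more b's between a and c.
star = [s for s in samples if re.fullmatch(r"ab*c", s)]

# b+ requires at least one b between a and c.
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]

print(star)  # ['ac', 'abc', 'abbbc']
print(plus)  # ['abc', 'abbbc']
```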
So if you decided to find words which have the character c in them, you might be tempted to write:
    c*
What might come as a surprise is that you will find an enormous number of matches; even words with no c in them will match. How so, you ask? The answer is simple. Recall that the * character matches zero or more occurrences, and that is exactly what was matched: zero occurrences.
You see, in regular expressions it is possible to match what is called the empty string, which is simply a string of zero length. This empty string can actually be found in all texts; for instance the word
    go
contains three empty strings. They are located right before the g, between the g and the o, and after the o. An empty string itself contains exactly one empty string. At first this might seem like a really silly thing to do, but you'll learn later on how it is used in more complex expressions.
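The three empty strings can be made visible by asking an engine where a pattern that is allowed to match nothing succeeds. This is a small sketch using Python's re module (our choice of tool) and the pattern c* as an example of such a pattern.

```python
import re

# c* can match zero characters, so it succeeds at every position in "go":
# before the g, between the g and the o, and after the o.
positions = [m.start() for m in re.finditer(r"c*", "go")]
print(positions)        # [0, 1, 2]

# An empty text still contains exactly one empty string.
empty_in_empty = re.findall(r"c*", "")
print(empty_in_empty)   # ['']
```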
So with this knowledge we might want to change our expression to:
    c+
and voilà, we get only words with c in them.
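The effect of switching from * to + can be sketched as follows, assuming Python's re module and a made-up word list.

```python
import re

# Made-up word list; only two of the words contain the letter c.
words = ["cat", "dog", "accent", "go"]

# c* also succeeds on the empty string, so every word "matches".
with_star = [w for w in words if re.search(r"c*", w)]

# c+ needs at least one c, so only words containing a c remain.
with_plus = [w for w in words if re.search(r"c+", w)]

print(with_star)  # ['cat', 'dog', 'accent', 'go']
print(with_plus)  # ['cat', 'accent']
```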
The next metacharacter you'll learn is:
    ?
This simply tells the engine to match the preceding character either once or not at all (zero or one). For instance the expression:
will match any of these lines:
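As an illustration with a pattern of our own choosing (colou?r is a hypothetical example, not necessarily the one the article used), the following sketch in Python's re module shows the zero-or-one behaviour.

```python
import re

# Hypothetical pattern: the u is optional because of the ? that follows it.
pattern = r"colou?r"

# Made-up candidates with zero, one and two u's.
candidates = ["color", "colour", "colouur"]

results = {c: bool(re.fullmatch(pattern, c)) for c in candidates}
print(results)  # {'color': True, 'colour': True, 'colouur': False}
```

Two u's are rejected because ? allows at most one occurrence.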
These three metacharacters are simply special cases of the more general quantifier
    {n,m}
where n and m are respectively the minimum and maximum number of repetitions. For instance
    {1,5}
means match at least one and at most five occurrences. You can also omit m to allow an unlimited number of repetitions:
    {1,}
which matches one or more occurrences. This is exactly what the + character does. So now you see the connection: * is equal to {0,}, + is equal to {1,} and ? is equal to {0,1}.
The last thing you can do with the quantifier is to also omit the comma,
    {5}
which means match exactly five occurrences, no more, no less.
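The equivalences between the short quantifiers and the brace form can be checked mechanically. The sketch below assumes Python's re module; the helper accepted and the sample strings are hypothetical names introduced only for this check.

```python
import re

# Runs of the letter a of length 0, 1, 2, 5 and 6.
samples = ["", "a", "aa", "aaaaa", "aaaaaa"]

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# Each short quantifier accepts the same samples as its brace form.
star_same = accepted(r"a*") == accepted(r"a{0,}")
plus_same = accepted(r"a+") == accepted(r"a{1,}")
opt_same = accepted(r"a?") == accepted(r"a{0,1}")
print(star_same, plus_same, opt_same)  # True True True

one_to_five = accepted(r"a{1,5}")
exactly_five = accepted(r"a{5}")
print(one_to_five)   # ['a', 'aa', 'aaaaa']
print(exactly_five)  # ['aaaaa']
```

Agreement on a handful of samples is only a spot check, not a proof, but it shows the correspondence described above.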
Comment List

Topic: another great regexp tool
Author: S Church
Time: 01.03.2005 16:16
There's a free-as-in-beer development environment for Windows called HTML-Kit that's just great for writing scripts and web code. The Find or Find / Replace functions have a check box for Regexps, with a "Find All" button to highlight every instance matched by a regexp. The only drawback is that it assumes /is (case insensitivity and multiline).
VisualREGEXP mentioned in the article says it has no required supporting files, that the standalone executable is all that's needed. However, most Windows machines don't have the TCL/TK component "wish," which the README file claims is necessary for operation. Wish might be available somewhere online as a precompiled binary without having to install all of TCL/TK, but I'm not motivated enough to google it at the moment.

Topic: Email match
Author: David Robarts
Time: 15.01.2005 22:45
Some valid email addresses will fail this expression (and some invalid addresses pass).
[a-z0-9_-]+(.[a-z0-9_-]+)*@[a-z0-9_-]+(.[a-z0-9_-]+)+
The underscore character is not allowed in the domain part of the email address and some additional characters are allowed in the username part.
This might be better:
[a-z0-9_-]+(.[a-z0-9_+-]+)*@[a-z0-9-]+(.[a-z0-9-]+)+

Topic: can't see the graphic
Author: x x
Time: 02.11.2001 01:59
I can't see the graphic towards the bottom to demonstrate the usage of < >