Larger Regular Expressions
Warning: The regular expressions contained in this article are of some length and quite illegible to the inexperienced. They can, however, be extremely useful and time-saving if you are trying to add structure to or search through large text documents. Proceed with caution.
Preface
If you are unfamiliar with the concept of regular expressions, or you need a reference, I suggest you read Jan Borsodi's article "Regular Expressions explained" on this site. The purpose of this article is to show a few real-life examples of regular expressions at work. The case is as follows: I had a huge (400 KB) broken XML document that needed some serious fixing up. I needed to do a lot of search and replace operations on multiple lines of text at a time. My first go at it was Sed, the Stream Editor.
Sed
Parts of my document could look like this:
<td><span
style='font:times;align:center;color:blue'>Some blue text</td>
As you can see, the span tag wasn't always closed, and a tag could contain newlines. I needed to preserve the color. I had to transform it into this:
<td><font color="blue">Some blue text</font></td>
This is the Sed command that did the trick:
sed -e :a -e "s/\(<span[ \t\n\r]*style='.*color:\)\([a-z]*\)\(.*'>\)\([^<]*\)/<font color=\"\ 2\">\ 4<\/font>/g; /</N; //ba" old.xml > new.xml
Since I needed my search to span multiple lines, I had to use a special Sed syntax. The expression
sed -e :a -e "s/search-text/replace-text/g; /</N; //ba" old.xml > new.xml
basically means: "Search all the text of old.xml for search-text, and replace any matches with replace-text. Write the result to new.xml." The :a label, the N command and the ba branch form a loop that appends following lines to the text being searched, which is what lets a match cross a line break. The interesting part is the regular expression itself, i.e. what is contained within s///g. I use sub-matches, contained within parentheses (), to keep parts of the text while discarding others.
\(<span[ \t\n\r]*style='.*color:\) matches the span tag up to, but not including, the value of color.
\([a-z]*\) matches the value of color.
\(.*'>\) matches the end of the span tag.
\([^<]*\) matches the text up to, but not including, the next tag.
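To see what each sub-match captures, the following sketch runs the four-part pattern over a single-line version of the sample and prints the captures in brackets. It is only an illustration under a few assumptions: the variable names line and groups are made up for this example, [[:space:]] stands in for [ \t\n\r], and a leading and trailing .* are added so that only the captures are printed.

```shell
#!/bin/sh
# Single-line version of the sample from this article (illustrative input).
line="<td><span style='font:times;align:center;color:blue'>Some blue text</td>"

# Print each sub-match in brackets instead of building the font tag.
# [[:space:]] is an assumption standing in for [ \t\n\r]; the leading and
# trailing .* swallow the text outside the four sub-matches.
groups=$(printf '%s\n' "$line" | sed -n \
  "s/.*\(<span[[:space:]]*style='.*color:\)\([a-z]*\)\(.*'>\)\([^<]*\).*/1=[\1] 2=[\2] 3=[\3] 4=[\4]/p")

printf '%s\n' "$groups"
# 1=[<span style='font:times;align:center;color:] 2=[blue] 3=['>] 4=[Some blue text]
```

Note that the third sub-match is only '> here, because color happens to be the last property in the style attribute.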
The replace part of the regular expression inserts parts of the matches into the new text.
<font color=\"\ 2\">\ 4<\/font>
This means that the second sub-match is inserted into the color attribute, and the fourth sub-match is inserted between the font start and end tags. The point is that a number preceded by a backslash refers to a sub-match. \ 2 refers to the second sub-match, and so on. (There is a space between the backslash and the number in my examples. This is due to an HTTP POST problem. Do not use this space in your own code; write the number directly after the backslash!)
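Putting the pieces together, here is a small sketch of a multi-line search and replace with sub-matches. It is not the exact command from this article: it uses a common variant of the loop that first reads the whole input into the pattern space and only then runs s///g, and it uses a simplified pattern with two sub-matches instead of four. The function name fix_span and the variable names are assumptions made up for the example.

```shell
#!/bin/sh
# Sample input: the span tag is split across two lines and never closed.
input="<td><span
style='font:times;align:center;color:blue'>Some blue text</td>"

# ':a' defines a label. On every line except the last, 'N' appends the next
# line to the pattern space and 'ba' jumps back to the label. Once the last
# line has been read, the whole input is in the pattern space and s///g
# runs over it, so the match can cross the line break.
# \1 is the color value, \2 is the text up to the next tag.
fix_span() {
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
      -e "s/<span[[:space:]]*style='[^']*color:\([a-z]*\)[^']*'>\([^<]*\)/<font color=\"\1\">\2<\/font>/g"
}

result=$(printf '%s\n' "$input" | fix_span)
printf '%s\n' "$result"
# <td><font color="blue">Some blue text</font></td>
```

Reading the whole input first is simple, but it keeps the entire document in memory, which is acceptable for a file of a few hundred kilobytes.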
Comment List
Topic: Errors in perl code
Author: Patrik Grip-Jansson
Time: 22.08.2001 04:08
The following line is wrong:
$docstring =~ s/(<span[\s]*style='.*color:)([a-z]*)(.*'>)([^<]*)/<font color="$ 2">$ 4<font>/g
To work as intended, it needs to be changed to:
$docstring =~ s/(<span[\s]*style='.*color:)([a-z]*)(.*'>)([^<]*)/<font color="$ 2">$ 4</font>/g;
The regexp is far from optimal. For example, there really isn't a need to use so many capturing parentheses. This should match the same rows and be a bit faster:
<span\s+style='.*?color:(\w+)'>(.*?)</td>