Found at: http://publish.ez.no/article/articleprint/65/

Larger Regular Expressions



Warning: The regular expressions contained in this article are of some length and quite illegible to the inexperienced. They can, however, be extremely useful and time-saving if you are trying to add structure to or search through large text documents. Proceed with caution.

Preface

If you are unfamiliar with the concept of regular expressions, or you need a reference, I suggest you read Jan Borsodi's article "Regular Expressions explained" on this site. The purpose of this article is to show a few real-life examples of regular expressions at work. The case is as follows: I had a large (400 KB), broken XML document that needed some serious fixing up. I needed to do a lot of search and replace operations on multiple lines of text at a time. My first go at it was Sed, the Stream Editor.

Sed

Parts of my document could look like this:

<td><span
style='font:times;align:center;color:blue'>Some blue text</td>


As you can see, the span tag wasn't always closed, and a tag could contain newlines. I needed to preserve the color. I had to transform it into this:

<td><font color="blue">Some blue text</font></td>


This is the Sed command that did the trick:

sed -e :a -e "s/\(<span[ \t\n\r]*style='.*color:\)\([a-z]*\)\(.*'>\)\([^<]*\)/<font color=\"\ 2\">\ 4<\/font>/g; /</N; //ba" old.xml > new.xml


Since I needed my search to span multiple lines, I had to use a special Sed syntax. The expression

sed -e :a -e "s/search-text/replace-text/g; /</N; //ba" old.xml > new.xml

basically means: "Search all the text of old.xml for search-text, and replace any matches with replace-text. Write the result to new.xml." The interesting part is the regular expression itself, i.e. what is contained within s///g. I use sub-matches, enclosed in escaped parentheses \( \), to keep some parts of the text while discarding others.

\(<span[ \t\n\r]*style='.*color:\)

matches the span tag up to, but not including the value of color.

\([a-z]*\)

matches the value of color.

\(.*'>\)

matches the end of the span tag.

\([^<]*\)

matches the text up to, but not including the next tag.
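To see what each of the four sub-matches captures, one can print them with labels instead of building the font tag. This is only a sketch, assuming a POSIX shell; the one-line sample input and the [n:...] labels are made up for illustration and are not part of the original command.

```shell
# Hypothetical one-line sample in the shape of the article's input.
sample="<td><span style='font:times;align:center;color:blue'>Some blue text</td>"

# Same search pattern as in the article; the replacement only labels
# each sub-match so that its content becomes visible.
groups=$(printf '%s\n' "$sample" |
  sed "s/\(<span[ \t\n\r]*style='.*color:\)\([a-z]*\)\(.*'>\)\([^<]*\)/[1:\1][2:\2][3:\3][4:\4]/")

printf '%s\n' "$groups"
# prints: <td>[1:<span style='font:times;align:center;color:][2:blue][3:'>][4:Some blue text]</td>
```

Everything outside the match (here the surrounding td tags) is left untouched, which is why only sub-matches 2 and 4 need to be reused in the real replacement.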

The replace part of the regular expression inserts parts of the matches into the new text.

<font color=\"\ 2\">\ 4<\/font>

This means that the second sub-match is inserted into the color attribute, and the fourth sub-match is inserted between the font start and end tags. A number preceded by a backslash refers to a sub-match: \ 2 refers to the second sub-match, and so on. (There is a space between the backslash and the number in my examples. This is due to an HTTP POST problem. Do not use this space in your own code; write the number directly after the backslash!)
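Putting the pieces together, the whole command can be tried on the two-line sample from the beginning of this section. The sketch below assumes GNU sed (escapes such as \t and \n inside brackets, and printing the buffer when N runs out of input, are GNU behaviour) and a temporary directory created with mktemp. The numbers are written directly after the backslash, and the closing tag is written as <\/font> on the assumption that the result should match the target markup; the slash is escaped because / is also the delimiter of the s command.

```shell
#!/bin/sh
# Assumes GNU sed; the temporary directory is only for this illustration.
workdir=$(mktemp -d)

# Two-line sample taken from the article: the span tag is split by a newline.
printf '%s\n' "<td><span" \
  "style='font:times;align:center;color:blue'>Some blue text</td>" > "$workdir/old.xml"

# :a      label marking the top of the loop
# s///g   the substitution, run on everything collected so far
# /</N    if the buffer contains "<", append the next input line to it
# //ba    the empty regex reuses the last one (/</); on a match, jump back to :a
sed -e :a -e "s/\(<span[ \t\n\r]*style='.*color:\)\([a-z]*\)\(.*'>\)\([^<]*\)/<font color=\"\2\">\4<\/font>/g; /</N; //ba" \
  "$workdir/old.xml" > "$workdir/new.xml"

result=$(cat "$workdir/new.xml")
printf '%s\n' "$result"
# prints: <td><font color="blue">Some blue text</font></td>

rm -r "$workdir"
```

Because the substitution is re-run on the growing buffer every time a line is appended, the amount of text scanned increases with each pass through the loop.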


Perl

One problem with using Sed in this way was that it was too slow. The Sed loop could, for some reason, take up to 25 minutes to finish a multi-line search/replace operation. So I tried doing the same thing in Perl, which took just 30 seconds to process my 400 KB document. Slightly more efficient, I'd say. This is the script:


#!/usr/bin/perl

$inputfile = $ARGV[0];
$outputfile = ">$ARGV[1]";

$docstring = "";

open( INPUT, $inputfile ) or die "Error while opening $inputfile: $!\n";
while( <INPUT> )
{
 $docstring = $docstring . $_;
}
close INPUT;

$docstring =~ s/(<span[\s]*style=\'.*color:)([a-z]*)(.*\'>)([^<]*)/<font color=\"$ 2\">$ 4<\/font>/g;

open( OUTPUT, $outputfile ) or die "Error while opening $outputfile: $!\n";
print OUTPUT $docstring;
close OUTPUT;


This gives the same result as the Sed command shown earlier. Remember to give the input and output files as arguments when you run it. It works as follows: First I assign the two command line arguments to $inputfile and $outputfile. (The ">" before $ARGV[1] means that the file is opened for writing.) Then I open the input file and read its entire content into the variable $docstring. (I had to do this because I needed to search across multiple lines of text. If you are searching one line at a time, you can read a line, run the regular expression, write the line to the output file, and repeat until end of file.) Once the entire file is read, I run the regular expression. The changes are written back to the $docstring variable, so all I have to do is open the output file and write the contents of $docstring to it. Done.

As you can see, the regular expression is almost the same as in the Sed example. The most important differences are that you don't have to escape the parentheses, and that you refer to the sub-matches with the dollar symbol, not backslash. (As in the Sed example, I had some trouble displaying the sub-match numbers. I had to use a space between the dollar symbol and the number. Do not use this space in your own code, write the number directly after the dollar symbol!)

The script is more or less a general purpose search/replace tool. All you have to do is change the regular expression. Feel free to use it as you like. If you have written any particularly smart/unusual/powerful regular expressions that others might find helpful, please consider posting them as comments to this article. Share and enjoy!

