
Larger Regular Expressions




Perl

One problem with using Sed this way was speed. The Sed loop could, for some reason, take up to 25 minutes to finish a multi-line search/replace operation. So I tried doing the same thing in Perl, which took just 30 seconds to process my 400 KB document. Slightly more efficient, I'd say. This is the script:


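#!/usr/bin/perl
use strict;
use warnings;

# Input and output file names come from the command line.
my $inputfile  = $ARGV[0];
my $outputfile = ">$ARGV[1]";   # the leading ">" opens the file for writing

my $docstring = "";

# Read the whole file into one string so that a match can span several lines.
open( INPUT, $inputfile ) or die "Error while opening $inputfile: $!\n";
while( <INPUT> )
{
 $docstring .= $_;
}
close INPUT;

# Turn each coloured <span> into a <font> tag: $2 is the colour, $4 the text.
$docstring =~ s/(<span[\s]*style=\'.*color:)([a-z]*)(.*\'>)([^<]*)/<font color=\"$2\">$4<\/font>/g;

open( OUTPUT, $outputfile ) or die "Error while opening $outputfile: $!\n";
print OUTPUT $docstring;
close OUTPUT;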


This gives the same result as the Sed command on the previous page. Remember to give the input and output files as arguments when you run it. It works as follows: First I assign the two command line arguments to $inputfile and $outputfile. (The ">" before $ARGV[1] means that the file is opened for writing; it is created if it does not exist and overwritten if it does.) Then I open the input file and read its entire content into the variable $docstring. (I had to do this because I need to search across multiple lines of text. If you are only searching one line at a time, you can read one line, run the regular expression, and write the line to the output file. Repeat until end of file.) Once the entire file is read, I run the regular expression. The result is written back to the $docstring variable, so all that is left to do is open the output file and write the contents of $docstring to it. Done.
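For the one-line-at-a-time case described above, a minimal sketch could look like this. The subroutine name replace_per_line, the lexical filehandles, and the three-argument form of open are my own choices for illustration, not part of the script above. It only gives the same result when no span is split across a line break.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): copies $in to $out,
# applying the substitution to each line separately.
sub replace_per_line {
    my ( $in, $out ) = @_;
    open( my $input,  '<', $in )  or die "Error while opening $in: $!\n";
    open( my $output, '>', $out ) or die "Error while opening $out: $!\n";
    while ( my $line = <$input> ) {
        # Same regular expression as in the full-file script.
        $line =~ s/(<span[\s]*style='.*color:)([a-z]*)(.*'>)([^<]*)/<font color="$2">$4<\/font>/g;
        print $output $line;
    }
    close $input;
    close $output or die "Error while closing $out: $!\n";
}

# Run only when called with an input and an output file name.
replace_per_line( @ARGV ) if @ARGV == 2;
```

Because only one line is held in memory at a time, this variant also copes with files that are too large to read into a single string.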

As you can see, the regular expression is almost the same as in the Sed example. The most important differences are that you don't have to escape the parentheses, and that you refer to the sub-matches with a dollar symbol instead of a backslash. (The number must follow the dollar symbol directly, as in $2. A space between them, as in $ 2, is only a display artifact of this site and will not work in your own code.)
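To see how the sub-matches map onto the replacement, here is a small sketch that runs the same substitution on a hard-coded string. The sample markup and the variable name $sample are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample input; note that the span tag breaks across two lines.
my $sample = "<span\nstyle='font-size:10pt;color:blue;'>Some text</span>";

# $1 = everything up to and including "color:", $2 = the colour name,
# $3 = the rest of the style attribute, $4 = the text after the tag.
$sample =~ s/(<span[\s]*style='.*color:)([a-z]*)(.*'>)([^<]*)/<font color="$2">$4<\/font>/g;

print "$sample\n";   # prints: <font color="blue">Some text</font></span>
```

Only $2 and $4 are used in the replacement. The closing </span> is not part of the pattern, so it is left in place.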

The script is more or less a general purpose search/replace tool. All you have to do is change the regular expression. Feel free to use it as you like. If you have written any particularly smart/unusual/powerful regular expressions that others might find helpful, please consider posting them as comments to this article. Share and enjoy!
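If you want to swap expressions without editing the substitution line each time, one possible approach is to pass the pattern in as a compiled regular expression. This is only a sketch: the subroutine search_replace, its callback argument, and the sample markup are hypothetical and not part of the script above. It relies on the /e modifier, which evaluates the replacement as Perl code.

```perl
use strict;
use warnings;

# Hypothetical helper: applies $pattern (a qr// object) to $text and calls
# $replace for every match to build the replacement text.
sub search_replace {
    my ( $text, $pattern, $replace ) = @_;
    # /e runs the right-hand side as code; the callback receives copies
    # of the first two sub-matches.
    $text =~ s/$pattern/ my @m = ( $1, $2 ); $replace->(@m) /ge;
    return $text;
}

# Made-up example: turn <b> and <i> into <strong> and <em>.
my $html = "<b>bold</b> and <i>italic</i>";
my $out  = search_replace(
    $html,
    qr/<(b|i)>([^<]*)<\/\1>/,
    sub {
        my ( $tag, $body ) = @_;
        return $tag eq 'b' ? "<strong>$body</strong>" : "<em>$body</em>";
    }
);

print "$out\n";   # prints: <strong>bold</strong> and <em>italic</em>
```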



Comment List


Topic: Errors in perl code
Author: Patrik Grip-Jansson
Time: 22.08.2001 04:08

The following line is wrong;

$docstring =~ s/(<span[\s]*style='.*color:)([a-z]*)(.*'>)([^<]*)/<font color="$ 2">$ 4<font>/g

To work as intended it needs to be changed to:

$docstring =~ s/(<span[\s]*style='.*color:)([a-z]*)(.*'>)([^<]*)/<font color="$2">$4<\/font>/g;

The regexp is far from optimal. For example, there really is no need for so many capturing parentheses. This should match the same rows and be a bit faster:

<span\s+style='.*?color:(\w+)'>(.*?)</td>



