
Larger Regular Expressions



Warning: The regular expressions contained in this article are of some length and quite illegible to the inexperienced. They can, however, be extremely useful and time-saving if you are trying to add structure to or search through large text documents. Proceed with caution.

Preface

If you are unfamiliar with the concept of regular expressions, or you need a reference, I suggest you read Jan Borsodi's article Regular Expressions explained on this site. The purpose of this article is to show a few real-life examples of regular expressions at work. The case is as follows: I had a huge (400 KB) broken XML document that needed some serious fixing up. I needed to do a lot of search-and-replace operations on multiple lines of text at a time. My first go at it was Sed, the Stream Editor.

Sed

Parts of my document could look like this:

<td><span
style='font:times;align:center;color:blue'>Some blue text</td>


As you can see, the span tag wasn't always closed, and a tag could contain newlines. I needed to preserve the color. I had to transform it into this:

<td><font color="blue">Some blue text</font></td>


This is the Sed command that did the trick:

sed -e :a -e "s/\(<span[ \t\n\r]*style='.*color:\)\([a-z]*\)\(.*'>\)\([^<]*\)/<font color=\"\ 2\">\ 4<\/font>/g; /</N; //ba" old.xml > new.xml


Since I needed my search to span multiple lines, I had to use a special Sed syntax. The expression

sed -e :a -e "s/search-text/replace-text/g; /</N; //ba" old.xml > new.xml

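basically means: "Search all the text of old.xml for search-text, and replace any matches with replace-text. Write the result to new.xml." The multi-line part works as a loop: :a defines a label, /</N appends the next input line to the current text whenever that text contains a <, and //ba reuses the last pattern (<) and, if it matches, branches back to :a so that the substitution is tried again on the joined lines. The interesting part is the regular expression itself, i.e. what is contained within s///g. I use sub-matches, enclosed in escaped parentheses \( \), to keep parts of the text while discarding others.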

\(<span[ \t\n\r]*style='.*color:\)

matches the span tag up to, but not including the value of color.

\([a-z]*\)

matches the value of color.

\(.*'>\)

matches the end of the span tag.

\([^<]*\)

matches the text up to, but not including the next tag.
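To see what each sub-match captures, the four groups can be printed separately. This is only a small sketch: it assumes a single-line version of the sample (so no multi-line loop is needed), and the markers 1=, 2=, 3= and 4= are labels made up for this example.

```shell
# Single-line variant of the broken fragment (an assumption for this demo).
line="<span style='font:times;align:center;color:blue'>Some blue text</td>"

# Same four sub-matches as above; the replacement just echoes each one in brackets.
groups=$(printf '%s\n' "$line" | sed -e "s/\(<span[ \t\n\r]*style='.*color:\)\([a-z]*\)\(.*'>\)\([^<]*\)/1=[\1] 2=[\2] 3=[\3] 4=[\4]/")

printf '%s\n' "$groups"
# prints: 1=[<span style='font:times;align:center;color:] 2=[blue] 3=['>] 4=[Some blue text]</td>
```

The trailing </td> is not part of any sub-match, so it is left untouched after the replaced text.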

The replace part of the regular expression inserts parts of the matches into the new text.

<font color=\"\ 2\">\ 4<\/font>

This means that the second sub-match is inserted into the color attribute, and the fourth sub-match is inserted between the font start and end tags. The slash in the end tag is escaped because / also delimits the parts of the s command. The point is that a number preceded by a backslash refers to a sub-match: \ 2 refers to the second sub-match, and so on. (There is a space between the backslash and the number in my examples. This is due to an HTTP POST problem. Do not use this space in your own code; write the number directly after the backslash!)
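Putting the pieces together, the whole command can be tried on the two-line fragment from the start of this section. This is a sketch under a few assumptions: it relies on GNU sed (for \t, \n and \r inside brackets, and for printing the collected text when N runs out of input), it writes the back-references without the stray space, and it escapes the slash in the closing tag because / is the delimiter of the s command. The input is piped in rather than read from old.xml.

```shell
# :a      label to jump back to
# s///g   the substitution discussed above, with \2 and \4 written without spaces
# /</N    if the current text contains "<", append the next input line
# //ba    reuse the last pattern ("<") and, if it matches, jump back to :a
fixed=$(printf '%s\n' "<td><span" "style='font:times;align:center;color:blue'>Some blue text</td>" |
  sed -e :a -e "s/\(<span[ \t\n\r]*style='.*color:\)\([a-z]*\)\(.*'>\)\([^<]*\)/<font color=\"\2\">\4<\/font>/g; /</N; //ba")

printf '%s\n' "$fixed"
# prints: <td><font color="blue">Some blue text</font></td>
```

On the first pass the substitution fails because the span tag is split across two lines; only after N has joined them does the pattern match.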



Comment List


Topic: Errors in perl code
Author: Patrik Grip-Jansson
Time: 22.08.2001 04:08

The following line is wrong:

$docstring =~ s/(<span[\s]*style='.*color:)([a-z]*)(.*'>)([^<]*)/<font color="$ 2">$ 4<font>/g

To work as intended, it needs to be changed to the following (the slash in the closing tag is escaped because / is also the delimiter):

$docstring =~ s/(<span[\s]*style='.*color:)([a-z]*)(.*'>)([^<]*)/<font color="$ 2">$ 4<\/font>/g;

The regexp is far from optimal. For example, there really isn't a need for so many capturing parentheses. This should match the same rows and be a bit faster:

<span\s+style='.*?color:(\w+)'>(.*?)</td>



