Re: Regular Expressions (Torben Ægidius Mogensen)
12 Oct 2004 00:56:25 -0400

          From comp.compilers


From: (Torben Ægidius Mogensen)
Newsgroups: comp.compilers
Date: 12 Oct 2004 00:56:25 -0400
Organization: Department of Computer Science, University of Copenhagen
References: 04-10-069
Keywords: lex
Posted-Date: 12 Oct 2004 00:56:25 EDT

(Mark) writes:

> I just can't seem to figure out how to invent a regular expression
> that will strip all HTML tags (except TABLE tags) out of a string and
> leave the rest of the text. When a TABLE tag is encountered i need to
> strip everything under it.
> This will strip all HTML out <[^>]*>

It will recognize HTML tags, but whether they are dropped, preserved,
or handled some other way depends on the action attached to the
pattern. Also, IIRC, you may have > inside a tag if it is enclosed in
quotes, as in <a href=">>">xx</a>, in which case this pattern ends the
tag too early.
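
To illustrate the quoting problem, here is a minimal sketch in Python
(my choice of tool, not something the question specified). The names
NAIVE and TAG are made up for the example, and the second pattern is
only one plausible way to let quoted attribute values contain >.

```python
import re

# The pattern from the question: a tag ends at the first '>'.
NAIVE = re.compile(r"<[^>]*>")

# Assumed alternative: inside a tag, accept either an ordinary character
# or a complete single- or double-quoted string (which may contain '>').
TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

sample = '<a href=">>">xx</a>'

naive_result = NAIVE.sub("", sample)   # the tag is cut at the first '>'
quoted_result = TAG.sub("", sample)    # the whole opening tag is removed

print(repr(naive_result))
print(repr(quoted_result))
```

With the naive pattern, part of the attribute value leaks into the
output; with the quote-aware one only the text xx remains.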

> But how do I make it also strip entire TABLE elements?
> Perhaps something like <table[^</table>]*</table>|<[^>]*>

A "^" at the start of a bracket expression negates the set, so
[^</table>] matches any single character other than <, /, t, a, b, l,
e, or >. The text between <table and </table> could therefore contain
none of these characters: the first one that appears would have to be
the start of </table>. Also, by listing both table and non-table on
the same line you force them to have the same action (so either both
will be skipped or both will be preserved).
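
A short Python sketch of both points, under some assumptions: BAD is
the pattern from the question, while TOKEN, action, and the "[table]"
placeholder are hypothetical names invented here. The non-greedy .*?
is a Perl/Python feature that lex does not have, and it does not cope
with nested tables.

```python
import re

# The pattern from the question. [^</table>] is a character class,
# not "anything up to </table>".
BAD = re.compile(r"<table[^</table>]*</table>")

# Fails on a real table: the very next character, '>', is in the set.
bad_on_real_table = BAD.search("<table><tr><td>1</td></tr></table>")

# Only matches when nothing from the set appears before </table>.
bad_on_odd_input = BAD.search("<table 123</table>")

# One named group per alternative, so each can get its own action.
TOKEN = re.compile(
    r"(?P<table><table\b.*?</table>)|(?P<tag><[^>]*>)",
    re.IGNORECASE | re.DOTALL,
)

def action(match):
    """Pick the replacement according to which alternative matched."""
    if match.lastgroup == "table":
        return "[table]"   # hypothetical marker; use "" to drop it
    return ""              # ordinary tag: strip it

result = TOKEN.sub(
    action, "<p>a</p><table><tr><td>1</td></tr></table><b>c</b>"
)
print(bad_on_real_table, bad_on_odd_input is not None, result)
```

In lex the same separation would be two rules, each with its own
action, rather than one rule containing both alternatives.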

And then there is the possibility of nested tables. If you don't
take care of this, a regular expression will think the outer table has
ended when the inner end tag is read. A regular expression cannot
handle arbitrary nesting depths, so you would either need to use a
counter in the action of the regular expression or limit yourself to a
fixed limit on the number of nested tables and write a regular
expression for each level of nesting. How this is best done depends
on which tool you use (lex, Perl, etc.).
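
The counter approach might look like the following Python sketch. The
function name strip_html and the simplified tag pattern are
assumptions made for the example: the pattern ignores comments,
doctype declarations, and > inside quoted attributes.

```python
import re

# Simplified tag pattern: optional '/', a tag name, then anything up to '>'.
TAG = re.compile(r"<(/?)(\w+)[^>]*>")

def strip_html(text):
    """Remove all tags, and everything inside (possibly nested) tables."""
    out = []
    depth = 0   # number of currently open <table> elements
    pos = 0     # end of the previous tag
    for m in TAG.finditer(text):
        if depth == 0:
            # Text before this tag lies outside every table: keep it.
            out.append(text[pos:m.start()])
        closing, name = m.group(1), m.group(2).lower()
        if name == "table":
            if closing:
                depth = max(depth - 1, 0)   # tolerate stray </table>
            else:
                depth += 1
        pos = m.end()
    if depth == 0:
        out.append(text[pos:])
    return "".join(out)

nested = ("a<table><tr><td>x<table><tr><td>y</td></tr></table>"
          "z</td></tr></table>b<i>c</i>")
print(strip_html(nested))
```

The regular expression only finds tags; the nesting is tracked by the
counter in the surrounding code, which is the role the action plays in
a lex rule.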

You could also consider using a parser generator, which eases handling
of matching tags and nested tables.
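
As a rough illustration of letting a parser do the tag matching, here
is a sketch that uses Python's standard html.parser module. This is a
ready-made HTML parser, not a parser generator, so it is a substitute
for the suggestion above rather than an instance of it. The class and
function names are invented for the example.

```python
from html.parser import HTMLParser

class TextOutsideTables(HTMLParser):
    """Collect character data that is not inside any <table> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # current table nesting depth
        self.parts = []   # text fragments found outside tables

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

def strip_with_parser(text):
    parser = TextOutsideTables()
    parser.feed(text)
    parser.close()   # flush any buffered trailing text
    return "".join(parser.parts)

print(strip_with_parser(
    '<a href=">>">xx</a><table><tr><td>1</td></tr></table>yy'
))
```

The parser takes care of quoted attribute values, so only the nesting
counter has to be written by hand.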

