From: ChokSheak Lau <firstname.lastname@example.org>
Date: 21 Oct 2004 22:30:17 -0400
Organization: Georgia Institute of Technology
Posted-Date: 21 Oct 2004 22:30:17 EDT
> Hi everyone
> I just can't seem to figure out how to invent a regular expression
> that will strip all HTML tags (except TABLE tags) out of a string and
> leave the rest of the text. When a TABLE tag is encountered i need to
> strip everything under it.
> This will strip all HTML out <[^>]*>
> But how do I make it also strip entire TABLE elements?
> Perhaps something like <table[^</table>]*</table>|<[^>]*>
as others have pointed out, HTML's nested tags make it a context-free
language, not a regular one, so no regex can capture it 100% of the
time. however, you can use Perl-like regexes to filter out most of what
you don't want.
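
to make the nesting problem concrete, here is a small sketch (the sample string and variable names are made up for illustration). a non-greedy pattern stops at the first </table> it sees, so with a nested table it removes the inner element plus the head of the outer one and leaves the outer tail behind:

```perl
use strict;
use warnings;

# Made-up sample input: one table nested inside another.
my $html = '<table><tr><td><table><tr><td>inner</td></tr></table>'
         . 'tail</td></tr></table>after';

# Non-greedy match from the first <table ...> to the nearest </table>.
(my $stripped = $html) =~ s{<table\b[^>]*>.*?</table>}{}is;

# The outer table's closing tags survive, because the regex cannot
# count nesting depth.
print "$stripped\n";   # prints: tail</td></tr></table>after
```

a greedy `.*` fails the other way: it runs to the last </table> in the string and swallows any text between two separate tables.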
anyway, just to illustrate a little, in Perl (the code has not been
tested, so please assume it doesn't work):
1. find a whole <table> element (non-greedy; /s lets . match newlines)
$s =~ m{<table\b[^>]*>.*?</table>}is;
2. strip a matching pair of tags, keeping the text between them
$s =~ s{<(\w+)[^>]*>([^<]*)</\1>}{$2}i;
so what does that mean? roughly speaking, iterate on the same string
until you can't find any more <table> tags, then strip all tags within
the pre-match and post-match strings until you're done. there are many
details left to be figured out.
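
the loop described above could be sketched roughly like this. the sub names strip_tags and strip_html_drop_tables are made up for illustration, and the sketch assumes tables are not nested and that no '>' appears inside an attribute value; it also uses the simpler <[^>]*> pattern from the original question for the tag-stripping step:

```perl
use strict;
use warnings;

# Hypothetical helper: remove every tag, keep the text between tags.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Hypothetical helper: drop whole <table>...</table> elements and strip
# the tags from everything else. Assumes tables are not nested.
sub strip_html_drop_tables {
    my ($s) = @_;
    my $out = '';
    while ($s =~ m{<table\b[^>]*>.*?</table\s*>}is) {
        # Save the match offsets before running any other regex.
        my ($start, $end) = ($-[0], $+[0]);
        $out .= strip_tags(substr($s, 0, $start));  # pre-match text
        $s = substr($s, $end);                      # continue on post-match
    }
    return $out . strip_tags($s);                   # no tables left
}

print strip_html_drop_tables(
    '<p>a <b>b</b></p><table><tr><td>x</td></tr></table><i>c</i>'
), "\n";   # prints: a bc
```

nested tables, comments and <script> bodies are among the details this leaves unhandled.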
this approach will not always work, but most of the time it will.
if we're looking at a commercial product here, then use a real HTML
parser.