|HTML grammar email@example.com (Israel Antezana Rojas) (1999-09-16)|
|Re: HTML grammar firstname.lastname@example.org (Andrzej Serafin) (1999-09-20)|
|Re: HTML grammar Ralf.Gerlich@t-online.de (Ralf Gerlich) (1999-09-20)|
|From:||Ralf Gerlich <Ralf.Gerlich@t-online.de>|
|Date:||20 Sep 1999 12:01:35 -0400|
> I am trying to build an HTML parser. If somebody has already
> written an HTML grammar, please send it to me!
You can probably find a definition of the HTML "grammar" at the W3C.
In fact HTML has a rather sloppy grammar. Parsing should normally be
done in two levels:
1. First decide which parts of the input are text and which are tags.
Parse the tags by splitting their contents into words and arguments.
2. Then you need a system to check those "erroneous" constructs (which
are in fact supported by the grammar).
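
Step 1 could be sketched roughly as follows. This is only an illustration in Python, not tested against real-world pages: the function name tokenize, the two regular expressions, and the tuple layout of the tokens are all assumptions made for this sketch. Comments and doctype declarations are not recognised as tags and simply pass through as text, and a ">" inside a quoted argument will confuse it.

```python
import re

# One tag: optional "/", a name, then everything up to the next ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
# One argument: a name, optionally followed by = and a quoted or bare value.
ATTR_RE = re.compile(
    r"""([A-Za-z_:][-\w:.]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?"""
)

def tokenize(html):
    """Yield ("text", s), ("start", name, args) or ("end", name) tuples."""
    pos = 0
    for m in TAG_RE.finditer(html):
        if m.start() > pos:
            # Everything between two tags is plain text.
            yield ("text", html[pos:m.start()])
        closing, name, rest = m.groups()
        if closing:
            yield ("end", name.lower())
        else:
            args = {}
            for key, value in ATTR_RE.findall(rest):
                # Arguments without a value (e.g. "alt") map to None.
                args[key.lower()] = value.strip("\"'") if value else None
            yield ("start", name.lower(), args)
        pos = m.end()
    if pos < len(html):
        yield ("text", html[pos:])
```

For example, list(tokenize('<p class="a">Hi<br></p>')) gives a start token for p, the text "Hi", a start token for br, and an end token for p. Nothing is checked yet at this level; that is the job of step 2.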
For that you need a definition for each type of block that contains:
1. The name of the starting command.
2. Whether it is a block at all (just think of IMG tags: they ain't got a "closer").
3. Whether the starting tag, the ending tag, or both may be omitted.
(For an example of such a definition you should perhaps have a look at
how SGML or XML work)
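
Such a definition might look like the table below, again only as a sketch in Python. The names ElementDef, DEFS and needs_end_tag are invented for this example. The flags are meant to mirror the two omission indicators in the element declarations of the HTML 4 DTD (for instance "- O" for P, meaning the start tag is required and the end tag may be left out), but the table is a small hand-picked excerpt, not a complete or authoritative list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ElementDef:
    name: str                     # 1. name of the starting command
    has_content: bool             # 2. False for IMG-like tags with no closer
    start_optional: bool = False  # 3. may the start tag be omitted?
    end_optional: bool = False    # 3. may the end tag be omitted?

# Hypothetical excerpt; a real table would cover every element.
DEFS = {d.name: d for d in [
    ElementDef("html", True, start_optional=True, end_optional=True),
    ElementDef("body", True, start_optional=True, end_optional=True),
    ElementDef("p", True, end_optional=True),
    ElementDef("li", True, end_optional=True),
    ElementDef("em", True),
    ElementDef("img", False),
    ElementDef("br", False),
]}

def needs_end_tag(name):
    """True if a missing end tag for this element is a real error."""
    d = DEFS[name]
    return d.has_content and not d.end_optional
```

With this table, needs_end_tag("em") is true, while needs_end_tag("p") and needs_end_tag("img") are both false, for different reasons: P may drop its closer, IMG never has one.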
According to this definition you can now generate a "parser" which
synchronizes itself by implicitly inserting missing start and end tags.
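
A very reduced sketch of that synchronization, in Python, could look like this. It only covers the end-tag half: inserting omitted start tags needs knowledge of what each element may contain, which is left out here. The sets VOID and END_OPTIONAL, the CLOSED_BY table (which start tags implicitly close which open element), and the token tuples are all assumptions of this example and not taken from any existing tool.

```python
VOID = {"img", "br", "hr"}   # tags with no closer at all
END_OPTIONAL = {"p", "li"}   # tags whose closer may be omitted
# Assumption: an open element (key) ends when one of these start tags arrives.
CLOSED_BY = {"p": {"p", "ul", "li"}, "li": {"li"}}

def _close(stack, out, errors):
    """Emit an implied end tag for the innermost open element."""
    name = stack.pop()
    if name not in END_OPTIONAL:
        errors.append("missing </%s>" % name)
    out.append(("end", name))

def normalize(tokens, errors=None):
    """Return the token list with omitted end tags made explicit."""
    errors = [] if errors is None else errors
    out, stack = [], []
    for tok in tokens:
        kind = tok[0]
        if kind == "start":
            name = tok[1]
            # Close open elements that this start tag implicitly ends.
            while stack and name in CLOSED_BY.get(stack[-1], ()):
                _close(stack, out, errors)
            out.append(tok)
            if name not in VOID:
                stack.append(name)
        elif kind == "end":
            name = tok[1]
            if name not in stack:
                continue  # stray end tag: drop it
            # Close everything still open inside the element being ended.
            while stack[-1] != name:
                _close(stack, out, errors)
            stack.pop()
            out.append(tok)
        else:
            out.append(tok)
    # At end of input, close whatever is still open.
    while stack:
        _close(stack, out, errors)
    return out
```

Given the tokens for "<ul><li>a<li>b</ul>", this returns a list in which each li is followed by an explicit end tag before the next li or the closing ul. The result is the kind of cleaned-up stream that could then be handed to a strict parser.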
A good example of this _may_ be SGMLtools (http://www.sgmltools.org/).
They have C code which _may_ help you (I haven't had a look at it yet,
but they are in fact doing a "pretty print" of the SGML code according
to a definition, adding missing start and end tags where possible, thus
getting correct "code" to send to the real parser).
I hope this helps a bit (sorry I didn't go into more depth, but I don't
have much time to answer, and this is only an idea of mine which has
not been tested or implemented in any way yet).
Ralf Gerlich Ralf.Gerlich@t-online.de
Passionate programmer http://www.d-design.net/rgerlich/