Re: HTML grammar

Ralf Gerlich <>
20 Sep 1999 12:01:35 -0400

          From comp.compilers

Related articles
HTML grammar (Israel Antezana Rojas) (1999-09-16)
Re: HTML grammar (Andrzej Serafin) (1999-09-20)
Re: HTML grammar (Ralf Gerlich) (1999-09-20)

From: Ralf Gerlich <>
Newsgroups: comp.compilers
Date: 20 Sep 1999 12:01:35 -0400
Organization: T-Online
References: 99-09-059
Keywords: parse


> I am trying to build an HTML parser. If somebody has already
> written an HTML grammar, please send it to me!
You can probably find a definition of the HTML "grammar" at the W3C.

In fact, HTML has a rather sloppy grammar. Parsing should normally be
done in two levels:

1. First decide which parts of the input are text and which are tags.
Parse the tags by dividing their contents into words (the tag name and
its arguments).

2. Now you need a system to handle those "erroneous" constructs (which
are in fact permitted by the grammar).
Therefore you need a definition for each type of block that contains
this data:
   1. the name of the starting command
   2. whether it is a block at all (just think of IMG tags: they have
      no "closer")
   3. whether the starting tag, the ending tag, or both may be omitted
(For an example of such a definition, you should perhaps have a look at
how SGML or XML work.)
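
The first level described above (separating text from tags, then
splitting each tag into a name and its arguments) could be sketched in
Python roughly as follows. This is only an illustration: the token
shapes, the regular expressions and the function names tokenize and
parse_attributes are assumptions made for this example, and comments,
doctypes and a ">" inside a quoted attribute value are not handled.

```python
import re

# Hypothetical token shapes used in this sketch:
#   ("text", "some text")
#   ("start", "img", {"src": "a.png"})
#   ("end", "p")
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
ATTR_RE = re.compile(
    r"""([A-Za-z_:][-A-Za-z0-9_:.]*)"""                      # argument name
    r"""(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?"""   # optional value
)


def parse_attributes(raw):
    """Split the inside of a tag (after its name) into name/value pairs."""
    attrs = {}
    for m in ATTR_RE.finditer(raw):
        # Exactly one of the three value groups matches; none means no value.
        value = next((g for g in m.group(2, 3, 4) if g is not None), "")
        attrs[m.group(1).lower()] = value
    return attrs


def tokenize(html):
    """Level 1: decide which input is text and which is a tag."""
    tokens = []
    pos = 0
    for m in TAG_RE.finditer(html):
        if m.start() > pos:
            tokens.append(("text", html[pos:m.start()]))
        closing, name, raw = m.groups()
        if closing:
            tokens.append(("end", name.lower()))
        else:
            tokens.append(("start", name.lower(), parse_attributes(raw)))
        pos = m.end()
    if pos < len(html):
        tokens.append(("text", html[pos:]))
    return tokens
```

Names are lower-cased here because HTML tag and argument names are
case-insensitive; anything that does not look like a tag simply stays
in the text stream.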

According to this definition you can now generate a "parser" which
synchronizes itself by implicitly inserting missing start and end tags
where possible.
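
A minimal Python sketch of such a self-synchronizing pass is given
below, under some loudly stated assumptions. The ElementDef record
holds the three pieces of data listed above; the extra closed_by field
(which start tags imply the end of an open element) is an addition made
for this example, since some such information is needed to decide where
a missing end tag belongs. The small DEFS table is illustrative, not a
complete or authoritative HTML definition, and inserting missing start
tags is not attempted here. Tokens are assumed to be tuples of the form
("text", s), ("start", name, args) and ("end", name).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementDef:
    name: str                     # name of the starting command
    is_block: bool                # False for tags like IMG with no "closer"
    start_optional: bool = False  # recorded only; not acted on in this sketch
    end_optional: bool = False    # may the ending tag be omitted?
    closed_by: frozenset = frozenset()  # assumption: start tags implying our end


# Illustrative definitions only.
DEFS = {d.name: d for d in [
    ElementDef("ul", True),
    ElementDef("li", True, end_optional=True, closed_by=frozenset({"li"})),
    ElementDef("p", True, end_optional=True,
               closed_by=frozenset({"p", "ul", "li"})),
    ElementDef("img", False),
    ElementDef("br", False),
]}


def normalize(tokens, defs=DEFS):
    """Return the token stream with omitted end tags made explicit."""
    out = []
    stack = []  # names of the currently open block elements

    def close_top():
        out.append(("end", stack.pop()))

    for tok in tokens:
        kind = tok[0]
        if kind == "text":
            out.append(tok)
        elif kind == "start":
            name = tok[1]
            # Close open elements whose omitted end tag this start implies.
            while stack:
                top = defs.get(stack[-1])
                if top and top.end_optional and name in top.closed_by:
                    close_top()
                else:
                    break
            out.append(tok)
            d = defs.get(name)
            if d is None or d.is_block:  # unknown elements count as blocks
                stack.append(name)
        elif kind == "end":
            name = tok[1]
            if name not in stack:
                continue  # stray end tag: dropped
            while stack[-1] != name:
                close_top()  # close anything still open inside it
            close_top()
    while stack:  # end of input closes whatever is left open
        close_top()
    return out
```

The output is a stream in which every block has an explicit start and
end, which is the kind of corrected "code" a strict parser could then
consume.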

A good example of this _may_ be SGMLtools.
They have C code which _may_ help you. (I haven't had a look at it yet,
but they are in fact doing a "pretty print" of the SGML code according
to a definition, adding missing start and end tags where possible, and
thus producing correct "code" to send to the real parser.)

I hope this helps a bit. (Sorry I didn't go into more depth, but I don't
have much time to answer, and this is only an idea of mine which has
not been tested or implemented in any way yet.)


Ralf Gerlich
Passionate programmer
