Re: Parsing a text stream

"Paul Robinson" <Postmaster@paul.washington.dc.us>
24 May 2004 00:28:13 -0400

          From comp.compilers

Related articles
Parsing a text stream jacktow@hotmail.com (2004-04-28)
Re: Parsing a text stream mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2004-04-29)
Re: Parsing a text stream gopi@sankhya.com (2004-05-02)
Re: Parsing a text stream pjj@cs.man.ac.uk (Pete Jinks) (2004-05-02)
Re: Parsing a text stream Postmaster@paul.washington.dc.us (Paul Robinson) (2004-05-24)

From: "Paul Robinson" <Postmaster@paul.washington.dc.us>
Newsgroups: comp.compilers
Date: 24 May 2004 00:28:13 -0400
Organization: Compilers Central
References: 04-04-076
Keywords: parse
Posted-Date: 24 May 2004 00:28:13 EDT

Mansoor wrote in message 04-04-076...


>What I'm looking for is a bit just a bit different form what I've
>found so far. For example using a parser such as GOLD Parser with a
>grammar , lets say HTML, we can parse a HTML file and tockenize it.


Is that so that it can work like a clock? Or would you rather
"tokenize" it? (I wouldn't normally pick on someone else's "grammar",
except that we are discussing "text processing" here, so I thought it
would be funny.)


>However!! The problem is this: these parsers only succeed through to
>the end of the data as long as everything goes according to plan. Say
>you have left an HTML tag open, e.g. "<FONT color=black
><I>something</I>"; here the FONT tag is not closed with a corresponding
>">". All the lexical analysers I have found so far cannot handle this
>sort of situation. If you ask why they should, the answer is: in a
>text editor, where somebody is not yet done with the code and the
>syntax-highlighting feature is supposed to ease the writer's task,
>even unfinished tokens must be highlighted.


I'll offer my two cents (my US bias shows, or I'd mention offering "my
2/100 of a Euro"; by the way, what is 2/100 of a Euro called these
days?).


You keep a list of opened items that are not yet closed, along with
their priority or hierarchical level in the scheme of HTML and their
location. When you get a close item, you check that list to see whether
your current close matches, or whether something later of the same or a
lower priority is still pending; if you get a fault, go back to the
point where the unclosed item is and scream bloody murder. To explain
what I'm thinking, consider the following:


<head></head>
<body>
<FONT Color=123456>This is an example of something
<I>while this is italicized<B>and this is bold</i></b>
</body>


<head> has the highest priority; <body> has the next highest; <font>
comes next on the totem pole; and <i>, <b>, <blockquote>, <tt>, etc. are
all at the same level. A priority imbalance (either closing a
higher-priority item before a lower one, or closing an equivalent while
a later one is still pending) tells you that there is an error.
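
A minimal sketch of such a priority table in C (the tag set and the
numeric levels are my own illustration, and I'm assuming tag names
arrive already lower-cased):

    #include <string.h>

    /* Illustrative priority levels: higher numbers enclose lower ones. */
    enum { PRI_INLINE = 0, PRI_FONT = 1, PRI_BODY = 2, PRI_HEAD = 3 };

    static int tag_priority(const char *tag)
    {
        if (strcmp(tag, "head") == 0) return PRI_HEAD;
        if (strcmp(tag, "body") == 0) return PRI_BODY;
        if (strcmp(tag, "font") == 0) return PRI_FONT;
        return PRI_INLINE;   /* i, b, blockquote, tt, ... share one level */
    }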


Now, when you get to the </i>, this is when you yelp, because you have
a pending <b> that is of the same priority as <i> but was opened later
in the stream. If you continue, then when you get to </body> you know
that </font> is missing. It is reasonable to presume a font statement
could end anywhere, but until it does you do not know where that is. It
is unreasonable for an enclosing entry to end before the entry it
encloses when the two are equivalent.


If the item has no blocking, then you ignore it for semantic purposes
(<br> and <p> don't really need closure, though you can self-close the
former as <br/> and explicitly close the latter with </p>). Or you can
use that for checking as well, by giving such tags an "optional" status;
if you close with </p> before a </i>, it's probably an error. Or maybe
not, I don't know.
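
If you did want to experiment with that "optional" status, one hedged
way to represent it (my own guess, choosing the "ignore" reading above)
is a predicate alongside the priority table; the close-check sketch
below simply skips such tags rather than complaining about them:

    /* Hypothetical: tags that never require an explicit close. */
    static int tag_is_optional(const char *tag)
    {
        return strcmp(tag, "br") == 0 || strcmp(tag, "p") == 0;
    }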


The nice thing about this is that it's not hard to understand and
doesn't require anything more complicated than either an array of
records (structs, for C) or two or three parallel arrays to hold the
items if you can't, or don't want to, use an array of structures.
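
To make that concrete, here is a sketch of the array-of-records
approach in C, building on tag_priority() and tag_is_optional() from
the sketches above; the names, sizes, and offset-based error messages
are all assumptions of mine, not anything the original post pins down:

    #include <stdio.h>
    #include <string.h>

    #define MAX_OPEN 128

    struct open_tag {
        char name[16];    /* tag name, e.g. "font"            */
        int  priority;    /* level from tag_priority()        */
        long position;    /* offset in the stream, for errors */
    };

    static struct open_tag pending[MAX_OPEN];
    static int n_pending = 0;

    /* Remember an opening tag. */
    static void note_open(const char *name, long pos)
    {
        if (n_pending == MAX_OPEN)
            return;                      /* sketch: silently drop overflow */
        strncpy(pending[n_pending].name, name,
                sizeof pending[n_pending].name - 1);
        pending[n_pending].name[sizeof pending[n_pending].name - 1] = '\0';
        pending[n_pending].priority = tag_priority(name);
        pending[n_pending].position = pos;
        n_pending++;
    }

    /* Check a closing tag against the list of pending opens. */
    static void note_close(const char *name, long pos)
    {
        int pri = tag_priority(name);
        int i;

        for (i = n_pending - 1; i >= 0; i--) {
            if (strcmp(pending[i].name, name) == 0)
                break;
            /* Same-or-lower priority, opened later, still pending:
               the "scream bloody murder" case. */
            if (pending[i].priority <= pri &&
                !tag_is_optional(pending[i].name))
                fprintf(stderr, "offset %ld: </%s> arrives while <%s> "
                        "(opened at offset %ld) is still open\n",
                        pos, name, pending[i].name, pending[i].position);
        }
        if (i < 0)
            fprintf(stderr, "offset %ld: </%s> closes nothing\n", pos, name);
        else
            n_pending = i;  /* pop the match and everything opened after it */
    }

Fed the example above, note_close() flags the pending <b> when </i>
arrives and flags the still-open <font> when </body> arrives, which is
exactly the behaviour described.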


--
Paul Robinson
"Above all else... We shall go on..."
"...And continue!"

