Re: Parsing a text stream

Dmitry A.Kazakov <mailbox@dmitry-kazakov.de>
29 Apr 2004 12:09:36 -0400

          From comp.compilers

Related articles
Parsing a text stream jacktow@hotmail.com (2004-04-28)
Re: Parsing a text stream mailbox@dmitry-kazakov.de (Dmitry A.Kazakov) (2004-04-29)
Re: Parsing a text stream gopi@sankhya.com (2004-05-02)
Re: Parsing a text stream pjj@cs.man.ac.uk (Pete Jinks) (2004-05-02)
Re: Parsing a text stream Postmaster@paul.washington.dc.us (Paul Robinson) (2004-05-24)

From: Dmitry A.Kazakov <mailbox@dmitry-kazakov.de>
Newsgroups: comp.compilers
Date: 29 Apr 2004 12:09:36 -0400
Organization: Compilers Central
References: 04-04-076
Keywords: parse
Posted-Date: 29 Apr 2004 12:09:36 EDT

On 28 Apr 2004 14:38:41 -0400, jacktow@hotmail.com (Mansoor) wrote:


>Surprisingly, I don't seem to be able to find a clear explanation of
>"How to lexically analyse a chunk of text data".


Using a lexer! (:-))


>What I'm looking for is just a bit different from what I've
>found so far. For example, using a parser such as GOLD Parser with a
>grammar, let's say HTML, we can parse an HTML file and tokenize it.
>
>However, here is the problem. These parsers only succeed to the
>end of the data as long as everything goes according to plan. Say you
>have left an HTML tag open, e.g. "<FONT color=black
><I>something</I>"; here the FONT tag is not closed with a corresponding
>">". All the lexical analysers I have found so far cannot
>handle this sort of situation.


They could, if you formally defined the case "unclosed tag" as
legal rather than as an error.
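As a minimal sketch of that idea (the token names and the tiny grammar here are hypothetical, not from any particular lexer): give the "unclosed tag" its own legal token kind, so the lexer keeps going instead of failing.

```python
import re

# Sketch: an unclosed tag is a token kind of its own, not a lexing error,
# so a highlighter can colour unfinished input and continue.
TOKEN_RE = re.compile(r"""
    (?P<CLOSED_TAG>   <[^<>]*>  )   # e.g. <FONT color=black>
  | (?P<UNCLOSED_TAG> <[^<>]*$  )   # '<' reaching end of line with no '>'
  | (?P<TEXT>         [^<]+     )   # plain text between tags
""", re.VERBOSE | re.MULTILINE)

def tokenize(source):
    """Yield (kind, text) pairs; never fails on an unterminated tag."""
    for m in TOKEN_RE.finditer(source):
        yield m.lastgroup, m.group()
```

On the example from the question, `<FONT color=black` comes out as an UNCLOSED_TAG token rather than aborting the scan.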


>If you ask why should they, then the
>answer is in a text editor where somebody is not done with the code
>yet, and syntax highlighting feature is supposed to ease the writer's
>task, even unfinished tokens must be highlighted.
>
>I already have an idea, which is to use regular expressions. The problem
>with regexes, however, is that we can only search and find a match. We
>can't recognize parts and sections of code, let's say in a C
>program, such as a function body or any other section made of
>logical sub-parts.


This can work if you have something like "immediate assignment" in
SNOBOL. When part of a pattern is matched, one can record the event by
assigning the matched text to a variable, and later analyse what was
actually matched. For example, one can write one pattern for all tags
that stores the tag name and the parameters in separate variables.
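A rough analogue of immediate assignment, sketched with Python named groups rather than SNOBOL (the pattern and function name are illustrative): one pattern matches every tag and captures the tag name and its parameter string into separate "variables".

```python
import re

# Sketch: named groups play the role of SNOBOL's immediate assignment --
# each sub-match is stored under a name for later inspection.
TAG = re.compile(r"<\s*(?P<name>/?\w+)(?P<params>[^>]*)>")

def scan_tags(html):
    """Return a (name, params) pair for every tag matched in the input."""
    return [(m.group("name"), m.group("params").strip())
            for m in TAG.finditer(html)]
```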


There are free implementations of SNOBOL-like patterns:


For example GNAT Ada has SPITBOL patterns, see www.gnat.com


Or (for C/C++) http://www.dmitry-kazakov.de/match/match.htm, which can
recognize a function body (a valid one, of course). If invalid bodies
have to be matched too, you must specify which ones. I think you've got
the idea.


As for regular expressions, they cannot recognize, say, balanced
brackets: that language is not regular. But this is actually a lesser
problem. You can probably use any pattern matcher that has a "fixed
cursor mode", i.e. one that does not skip text when a pattern fails. In
that case you have to move the cursor manually, but in return you have
full control over what happens. (In effect it then becomes a
recursive-descent parser.)
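The fixed-cursor idea can be sketched in a few lines (the function name and the choice of `{`/`}` brackets are mine): the matcher never skips input on failure, the caller owns the cursor, and nesting is handled by recursion, which is exactly a hand-written recursive-descent parser.

```python
# Sketch of "fixed cursor mode": on failure the cursor is left alone and
# None is returned; the caller decides what to do next.
def match_balanced(text, pos=0):
    """If text[pos] opens a '{...}' group, return the position just past
    its matching '}'; otherwise return None."""
    if pos >= len(text) or text[pos] != '{':
        return None
    pos += 1
    while pos < len(text) and text[pos] != '}':
        if text[pos] == '{':
            inner = match_balanced(text, pos)   # recurse on nested group
            if inner is None:
                return None                     # unbalanced inner group
            pos = inner
        else:
            pos += 1                            # ordinary character
    if pos < len(text):                         # text[pos] == '}'
        return pos + 1
    return None                                 # ran out of input: unbalanced
```

For example, on "{a{b}c}" it returns 7 (the index past the outer close), while "{a{b}c" fails with None instead of silently skipping ahead.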
--
Regards,
Dmitry Kazakov
www.dmitry-kazakov.de

