Re: Compiler Books? Parsers?

Chris F Clark <cfc@shell01.TheWorld.com>
3 Dec 2003 19:40:57 -0500

From comp.compilers

Related articles
Compiler Books? vicky7909@rediffmail.com (2003-10-27)
Re: Compiler Books? vbdis@aol.com (2003-10-31)
Re: Compiler Books? Parsers? napi@cs.indiana.edu (2003-11-01)
Re: Compiler Books? Parsers? napi@cs.indiana.edu (2003-11-01)
Re: Compiler Books? Parsers? henry@spsystems.net (2003-11-02)
Re: Compiler Books? Parsers? henry@spsystems.net (2003-11-08)
Re: Compiler Books? Parsers? Jeffrey.Kenton@comcast.net (Jeff Kenton) (2003-11-21)
*Re: Compiler Books? Parsers? cfc@shell01.TheWorld.com (Chris F Clark)* (2003-12-03)**
RE: Compiler Books? Parsers? qjackson@shaw.ca (Quinn Tyler Jackson) (2003-12-03)
Re: Compiler Books? Parsers? rbates@southwind.net (Rodney M. Bates) (2003-12-08)
Re: Compiler Books? Parsers? nick.roberts@acm.org (Nick Roberts) (2003-12-08)
Re: Compiler Books? Parsers? marcov@stack.nl (Marco van de Voort) (2003-12-20)
Re: Compiler Books? Parsers? cfc@world.std.com (Chris F Clark) (2003-12-21)
Re: Compiler Books? Parsers? cdc@maxnet.co.nz (Carl Cerecke) (2003-12-23)
[6 later articles]

| List of all articles for this month |

From:	Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	3 Dec 2003 19:40:57 -0500
Organization:	The World Public Access UNIX, Brookline, MA
References:	03-10-113 03-10-145 03-11-010 03-11-083
Keywords:	tools
Posted-Date:	03 Dec 2003 19:40:57 EST

Mohd Hanafiah Abdullah asked:
> I have never used parser generators such as yacc/bison or antlr.
> Would be nice to know people's opinion on manual implementation vs
> automation of parsers, especially from those who have done both; in
> terms of efficiency, readability, maintainability,
> ease-of-development, and so forth.

Jeff Kenton replied:
> 1. Efficiency: someone I work with, who's often right, claims that you
> can always write a parser by hand that's faster than what yacc and lex
> produce. Other people claim exactly the opposite.

(Preface: my responses are *very* biased as I am the author of a tool
which is "actively" marketed!)

      This is trivially true, if one considers the fact that one can take the
      output of any given tool and then profile and modify it so that the
      resulting hand-written tool is faster.

      However, in contrast, if someone figures out how to write a faster
      output for a specific tool, you may have to rearchitect your
      hand-tweaked to take advantage of those improvements. Note, that
      that doesn't happen very often (in 15 years of working on a tool,
      there were approximately 4 times where we specifically worked on
      speeding the generated parsers up).

      All that being said, generally hand-written tools can be faster
      than machine generated ones (provided you know what you are doing).
      The difference in execution speed may or may not be significant.
      If you don't know what you are doing, machine generated ones are
      likely as fast. More importantly, this can be a case of premature
      optimization. Why hand-craft a fast lexer or parser if that's not
      the bottleneck in your application? You would be better off,
      measuring the real bottleneck and fixing that!

      As evidence of the above, in several years of using automated tools
      to build a variety of projects for clients, I have only twice had
      times where adjusting the lexing/parsing code made a significant
      impact on the client applications and these applications (the ones
      where it didn't matter too) were essentially parsers. Most of the
      time, fixing up malloc/free and other low-level issues were more
      important.

> 2. Readability and maintainability: probably goes to the tools,
> although some of that is superficial. There's often a lot of magic
> hidden in the details that isn't as obvious on second read as you
> thought it was the first time through.
>
> 3. Ease of development: here's where I dislike lex and yacc. The
> actual code produced is impossible to read, and it's really painful to
> debug when you get it wrong. Furthermore, you always have to invent
> trickery to handle certain cases. It's often at the boundary between
> lexing and parsing, or between the parser and whatever calls it, where
> you start to think the tools are hurting you more than they're
> helping.

      These are the real issues. Your end goal should probably be to
      produce a correct lexer/parser for your target language with the
      minimum effort, so you can expend that effort on some other part of
      the tool.

      A real-life eample is particularly illustrative here. I work with
      a group that builds a Verilog compiler for use by chip designers.
      Verilog is a typically nasty language, with a C-like (but not C)
      preprocessor, places where whitespace is important and places where
      it isn't, messy identifier and numeric rules, and what-have you.
      In our product, we have at least two different Verilog parsers.
      One is hand-written and one is generated by a tool.

      The hand-written one was borrowed from a previous product and then
      slowly has slowly gathered more frills, knots, and warts. The
      hand-written one parses only a subset of Verilog (although the
      subset continually grows) and silently skips any text it can't
      parse. That form of error-recovery is important for it. It is used
      to parse texts that may or may not be correct Verilog (e.g. code
      still being written, text documents that aren't intended to be
      Verilog, binary files, what-have-you). It's goal is to extract as
      much valid Verilog as it can out of said documents, syntax
      highlight them and do some other "trivial" tasks--some of the tasks
      not being trivial at all. It has taken a couple of person years to
      get it to parse what it does as correctly as it does, not counting
      the time to initially write the precursor version.

      The machine generated one was created by transliterating the 1995
      IEEE spec into the dialect that the generation tool uses and making
      some adjustments to account for the hard-to-parse parts and for
      local dialect extensions. In addition, the language of the tool
      was used to describe the resulting tree we wanted built. A second
      pass was added to translate the tree that the generator automatic-
      ally builds into a different representation used by the "back-end"
      of our product. The initial coding of the gramar took under a
      month, and since that time we have made only a few changes to it,
      on the order of two-more person months. However, the translation
      phase to the second representation was a far larger effort and has
      consumed on the order of one to two person years. Only about six
      months ago did we feel that we had the translation fully stable.

      At this point, I'm willing to draw the following conclusions:

1) The machine generated parser is generally more reliable, more
      robust, and easier to maintain than the hand-written one.
      Moreover, no one is completely sure what dialect the hand-written
      one actually parses. In addition, the hand-written one has spawned
      numerous spin-offs, where part of the hand-written one was cloned
      and reused in other parts of the product (to parse other small
      fragments of the input).

      However, the hand-written one *DOES* do a far better job of
      responding to ill-formed texts. Also, the architect of the
      hand-written one is far more comfortable with it and finds its code
      simpler to follow. Most parser generators do a good job at
      generating parsers which handle correct input texts or texts with
      some small amounts of errors in them that the user needs to correct
      (and even there certain lexical errors typically destroy the
      parsers ability to recover). They do not do a good job at
      generating parsers that can pick fragments of correct code out of a
      sea of garbage, nor any other tasks they weren't designed to do.
      But, note that some parser generators (e.g. Quinn Tyler Jackson's
      meta-S) are now beginning to support a grep-mode, which may solve
      that problem.

2) The machine generated one was easier to get to a high-level of
      functioning first. For complicated languages it is easier to find
      the right hacks to apply to a machine generated parser than it is
      to write the same correct code by hand. If we had been able to use
      the parse (AST) tree coming out of the parser as our internal
      representation, the parsing part of the project would have been
      done in a trivial amount of time.

      However, a parse tree is often not the correct internal representa-
      tion of the input text. It often takes a lot of work to generate a
      data structure that has the correct semantics (especially when it
      isn't clear what the correct semantics are--several of our internal
      representations have had major shifts in how they model the
      semantics to reflect our evolving understanding of how to map the
      input text into a working abstract model). As far as I can tell,
      parser generators have not eliminated that work. That said, we did
      not used a tool with attribute grammar support (such as Eli) and
      perhaps some of those issues would have been mitigated through
      that. Still, the hard part is often interfacing the output of the
      parser generator with parts of the project that may have stringent
      requirements that aren't "compatible". However, my impression is
      that these tasks are intrinsically hard and that hand-generating
      the parser would not have mitigated them (and would have made them
      harder, since a hand generated parser does not build an AST
      representation automatically, and that was a stepping stone at
      least). I know that's true for the project, I'm working on. In
      fact, I want more automation, not less.

      I have ideas (drawn from my pratical experience) that will be added
      to the tool over time. Some of them will improve that problem.
      None of them are likely to be magic wands.

3) The output of a tool (and its associated library, if any) is more
      opaque than hand-written code. Moreover, the more a tool enables
      you to do (and worse, enables you to do efficiently) only makes the
      problem worse. Doing a task takes code. Doing a generalized
      version of the same task takes even more code. Do an efficient
      generalized version of the same task takes sill more code. Doing
      that and also including options for tasks that aren't being done
      takes even more. . . .

      I have not yet seen a tool that generates totally simple code and
      code that handles all the complicated cases easily. Again, I have
      some ideas here (on how to make the output more tailorable so that
      simple problems can get simpler code generated), but the problem is
      intrinsically difficult. The two items are polar opposites. You
      can't get both at the same time. The best one can hope for is a
      balance that is right for ones own project.

      That said, the tradeoff in having already (mostly?) debugged code
      that implements some of the things that I will need (e.g. nested
      include file support, AST construction, i18n error meesages)
      already in the library, lets me build most things faster and not
      have to debug my own typo errors as I reimplement them. That
      weighs heavily in faovr of a tool (and library) in my book. I'll
      take a little opaqueness to get that.

      Some relevant statistics: I cannot recall the last time I debugged
      the *code* generated by the tool. I debug approximately two
      library problems a year (across the entire tool community's user
      base). I debug approximately four "table generation" problems per
      year, mostly dealing with features in the tool that aren't in the
      underlying theoretical model. Most of the bugs are both obscure
      (happening in corners of the feature set) and obvious (when the
      failure happens the tool or the generated application fails in a
      dramatic way). There has so far been only one bug in the tool that
      has affected the project above I was describing. There was also
      only one case where the grammar did not describe the language
      correctly and the grammar needed to be debugged.

Recommendations:

      If you are doing something trivial or need some obscure ability not
      supported by any tool you can find, don't use a tool. We will
      probably never replace the hand-written parser in our product with
      a machine generated one (although we consider it from time to time,
      especially when needing to parse in a new context), because we
      don't trust the tool we use to gracefully implement a grep/
      maximal-munch-correct-fragments-only type parsing. In addition,
      sometimes, when I have to implement a small state machine, I will
      do so by hand, espcially in embedded contexts where firing up a
      whole object created by the tool would be expensive at runtime and
      would provide no explicit benefits.

      However, if you have a non-trivial language that may grow and/or
      change, get the best (most automated) tool you can. It *will* save
      you work. Especially, if that tool goes beyone simple lexing and
      parsing to AST construction (or other features relevant to your
      application). For example, ask someone about JavaCC with JJTree or
      JTB or ask someone about ANTLR (aka PCCTS ver 2) or about PCCTS ver
      1 and Sorcerer. There are news-groups for both tools
      (comp.compilers.tools.javacc and ==.pccts), so it shouldn't be hard
      to get info and opinions, with the one caveat being that many of
      the posters are "new" users (e.g. students or experimenters doing
      projects, who currently have questions) so it is hard (but not
      impossible) to get a response from a seasoned user. Most seasoned
      users have simply finished their parsing part and moved on to other
      tasks--and that's the point.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
19 Bronte Way #33M voice : (508) 435-5016
Marlboro, MA 01752 USA fax : (508) 251-2347 (24 hours)

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Compiler Books? Parsers?

Chris F Clark <cfc@shell01.TheWorld.com>3 Dec 2003 19:40:57 -0500

Chris F Clark <cfc@shell01.TheWorld.com>
3 Dec 2003 19:40:57 -0500