Re: Maintainable compiler design?

"BGB / cr88192" <cr88192@hotmail.com>
Wed, 29 Jul 2009 10:38:06 -0700

          From comp.compilers

Related articles
Maintainable compiler design? lerno@dragonascendant.com (=?ISO-8859-1?Q?Christoffer_Lern=F6?=) (2009-07-27)
Re: Maintainable compiler design? mhelvens@gmail.com (Michiel) (2009-07-29)
Re: Maintainable compiler design? cr88192@hotmail.com (BGB / cr88192) (2009-07-29)
Re: Maintainable compiler design? martin@gkc.org.uk (Martin Ward) (2009-08-06)
Re: Maintainable compiler design? james.harris.1@googlemail.com (James Harris) (2009-08-06)
| List of all articles for this month |

From: "BGB / cr88192" <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Wed, 29 Jul 2009 10:38:06 -0700
Organization: albasani.net
References: 09-07-102
Keywords: design, practice
Posted-Date: 29 Jul 2009 23:00:57 EDT

"Christoffer Lernv" <lerno@dragonascendant.com> wrote in message
>I am working with my pet language, defining grammar, building runtime
> etc - the usual things.
>
> However, as this is my first complier, I am worried that I'm not
> organising the code well. Aside from the frontend/backend division, I
> feel the compiler books I red are sort of hazy in regards to how to
> solidly organize the compiler to ensure that it is easy to maintain.


"live and learn" I guess...




> Currently I am working with ANTLR on the grammar, while I'm writing
> the runtime in C. My idea was to make the first version of the
> compiler with ANTLR in Java (striking a compromise between ease of
> implementation and performance).
> I have also played around a bit with Flex and Bison as well as Racc
> and Rex on the ruby side, but ANTLR felt pretty easy to work with, esp
> with ANTLRWorks for debugging grammar.


I have never really used these tools...


then again, IME, the parser has never really been the a major part of
the code/effort (except for with small scripting languages). so,
usually I have just hand-written my parsers (actually, more correctly,
I typically copy/paste/edit my parsers...).




> Most compiler books by necessity run rather limited examples where
> code generation is done directly by emitting strings by walking the
> AST.
>


ASTs work well for the language frontend...
IME, things turn a little more malevolent when one gets to the backend...




> That sounds reasonable for small language examples, but would seem
> like it would be quite a nightmare to walk through for a larger
> language. Looking through the ANTLR docs, there are examples where
> they run their generated AST through multiple rule definitions to do
> optimizations and finally generation, which seems like a good idea.
>


emitting strings actually scales fairly well, just the scale of the
instructions gets fairly small, and the logic involved in the internal
"compiler" machinery gets very complex.




> Are there any best practices to keep the compiler easy to understand
> and well organized?
>


not particularly that I am aware of.




however, from personal experience:
divide the task up into stages (for example: preprocessor, parser,
upper-compiler, lower-compiler, assembler, linker);
each stage should avoid, if at all possible, getting involved in things that
are the responsibilities of other stages;
clearly define the input and output of each stage, and avoid use of "magic
data" (AKA: data with an ill-defined structure and purpose, and which does
not relate in an obvious way with other parts of the structure);
similarly, out-of-band communication should be minimized (as much data as
possible should be passed in-line with the inter-stage representation);
...


similarly, if one feels tempted to make use of a "3rd party", which may pass
data between stages without the data being passed explicitly, IMO, this
should be avoided (IME, it tends not to work out well).




now, back to representations:
consider defining a set of conventions for things like types representation,
...
these should be general purpose, and easy to process.


for example, in my case, I represent types as strings, where the conventions
used in the strings are agreed upon by a number of subsystems. however, the
way these strings are represented internally to each subsystem subsystem,
how they are process, ... is allowed to vary.


as an example (from my project):
http://cr88192.dyndns.org/2009-06-02_Meta0.html


as such, the representation should be primary, but the handling secondary.
(this spec is slightly 'stale' now, but it works...). note that most of this
spec can be ignored.




as another example:
note the widespread use of traditional number formatting, where the value of
"3.14159265358793..." is agreed upon almost universally, even despite the
internal representations and processing of numbers varrying greatly between
different systems (computers of various sorts, the human brain, paper, ...).


as I see it, saying "the representation is the sole propery of a specific
piece of code" (and then treating the code as a central authority of sorts),
on the large scale, is not a good way of doing things. it is the
"representations" which should be the authorities, and the code serving
either as specific processors, or as repositories.


granted, differing representations may be used internally, but these
representations will have a decidedly limited scope ("within this subsystem,
types are represented as opaque integer handles", ...).


(granted, in an ontological sense, the implementation is more fundamental
than the representation, but in an engineering sense, the representation
should take precedence over the implementation).




however, as I see it, particular pieces of machinery should typically be
regarded as opaque black-boxes, taking specific input representations, doing
a specific task, and producing a specific output representation (liberty
being given internally as to how this task may be accomplished). this is
applied heirarchically, where we build bigger and more abstract
"black-boxes" from more fundamental ones.


as I see it, just as soon as specific libraries or pieces of code are
regarded as "sole authorities", then one is just asking for trouble.


similarly, centralized "do everything" code (or libraries) should generally
be, IMO, avoided at all costs. (granted, many projects tend to focus on, or
even idolize, centralized do-everything components...). as such, a component
"should" defined in terms of specific representations, operations, and
behaviors (take X, produce Y).


if a component's behavior can't be well defined, or if one finds it doing
things which are essentially unrelated to its initial core behavior (but not
necessarily its external use), then one may be down the road to trouble...


similar goes for dependencies:
a particular piece of code should not depend on (AKA: even touch) any piece
of code or data for which it does not have clear business in doing so (as
such, any dependencies should be both defined and justified, and in general,
dependencies should be kept minimal).




for example: does a compiler really need to depend on, for example, Qt or
GTK+ ?... having a dependency of this sort does little more than to create
annoyance for anyone else who may be forced to deal with the code, and left
wodering why the hell said project's JIT internals (or, even a simple
command-line tool) have a GTK+ dependency...


"because it is convinient" is not usually a good answer to potentially
far-reaching engineering decisions...






it is worth note that a lot of this runs in contrast to what is typically
regarded as "OO" (though, personally, it may just be a limitation of
perspective, but I don't really know how traditional OO practices could be
used effectively in a compiler without producing a complex mess...).


granted, I mean OO more in the sense as usually understood, not in the more
abstract OOA/D sense, rather, OO as understood/promoted by typical C++ and
Java zealots, and as seen in typical C++ and Java coding practices...


who then go and blame C for all of the world's coding problems... like bad
coding practice is intrinsic to using C or something... but, then again, I
guess "good practice" is hard to see, when the solution is to wrap the
problem in templates and use yet more operator overloading...


it is just a different flavor of terrible coding practice...


but, alas, this is its own sort of topic...


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.