Re: (C as) Intermediate Representation


Newsgroups: comp.compilers
From: chased@Eng.Sun.COM (David Chase)
Keywords: C, code, Modula, translator
Organization: Compilers Central
Date: Tue, 14 Aug 90 16:32:58 GMT

Robert Marti writes:


> Many language designers/implementors have turned to using C as an
> intermediate representation language ...
> Modula-3 compilers developed at DEC SRC and the (now defunct?) Olivetti
> Research Center
> ...
> Compilation speed using C as an IRL may not be overwhelming, but this
> solution is certainly portable. Of course, the quality of the generated
> machine code depends substantially on the C compiler used as a backend.


I did the back-end (for better or worse) for the Olivetti Modula-3 system,
and informally compared my experiences with the people doing the portable
Cedar system at Xerox PARC, and with Rodney Farrow's experience with
Linguist (which converts attribute grammars into evaluators and generates C).


Basically, C is not a wonderful intermediate language if you are compiling
something substantially different from C and intend to get good
performance out of what results, though it should run faster than
interpreted code. If your goal is reasonable portability and acceptable
performance, it is probably a good choice. When in doubt, treat it as a
portable assembler, and avoid the type system as much as possible.


For Modula-3, the big difficulties were caused by (1) exception-handling,
(2) garbage collection, and (3) nested procedures.


Exception-handling can be implemented easily enough through use of
setjmp/longjmp, but this tends to inhibit either optimization or
correctness (or both), depending upon the compiler. I believe, though I'm
not certain, that the people at Xerox use the trick of placing guarded
blocks in separate subroutines; thus, walking backwards through a stack
trace serves to determine what guarded blocks are active. (Note that this
is not portable, though it is confined to a small amount of tricky code.)
Procedure inlining breaks this, for obvious reasons.
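

For concreteness, here is a minimal sketch of the setjmp/longjmp scheme
(not the actual SRC or Olivetti code; the handler chain and the names
handler_chain and raise_exception are invented for illustration):


    #include <setjmp.h>
    #include <stdio.h>

    /* One frame of the dynamically nested guarded (TRY) blocks. */
    struct handler {
        jmp_buf         env;
        struct handler *outer;
    };

    static struct handler *handler_chain = 0;  /* innermost active handler */

    static void raise_exception(int code)
    {
        struct handler *h = handler_chain;
        handler_chain = h->outer;         /* pop before transferring control */
        longjmp(h->env, code);
    }

    static void may_fail(int x)
    {
        if (x < 0)
            raise_exception(1);
    }

    int main(void)
    {
        struct handler h;
        h.outer = handler_chain;
        handler_chain = &h;               /* enter the guarded block */
        if (setjmp(h.env) == 0) {
            may_fail(-1);                 /* body of the TRY */
            handler_chain = h.outer;      /* normal exit: unlink the handler */
        } else {
            printf("exception caught\n"); /* the EXCEPT handler */
        }
        return 0;
    }


The trouble shows up at the setjmp call itself: the compiler must assume
control can return there a second time, so locals that are live across the
guarded block must be spilled (or declared volatile), and inlining and
reordering around the call are severely constrained.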


For garbage collection, either you conservatively trace through activation
records (in the style of Boehm & Weiser, or Bartlett) or else you
"register" roots as they appear (in the style of Edelson
[UCSC-CRL-90-19]). Registering of roots takes time, of course, but at
least guarantees that the root pointers will stick around. If you do a
conservative scan, you have to be very wary of what optimizations are
performed. For example, something along the lines of


      #include <string.h>

      void f(char (*g)(char), char *s)
      {
            int i; int l = strlen(s);
            for (i = 0; i < l; i++)
                    s[i] = (*g)(s[i]);
      }


can easily be compiled into code whose frame contains no pointer to s[0].
If the call to g happens to provoke a garbage collection, and if this
activation of f represents the last use of s, then an incorrect re-use of
s could easily result. Worse things can happen; in some situations, it
may profit to use &a[-100] to reference elements of a, or to use
&b[0]-&a[0] to generate references to b. Nowhere in any C standard does
it indicate that these are illegal transformations. Use of "volatile"
will make things safe, but at the expense of all optimization of
addressing expressions using the volatile pointers. Possible workarounds
include assigning each interesting pointer into a volatile shadow variable
at the beginning of the program, or using registration (see Note 2).
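

As a rough sketch of the registration alternative (in the spirit of, but
not copied from, Edelson's scheme; root_frame and gc_root_chain are names
invented here), the earlier example might be compiled as:


    #include <string.h>

    struct root_frame {
        struct root_frame *prev;        /* enclosing frame's record       */
        void             **slot;        /* address of a pointer local     */
    };

    struct root_frame *gc_root_chain = 0;   /* walked by the collector    */

    void f(char (*g)(char), char *s)
    {
        struct root_frame fr;
        int i, l;

        fr.prev = gc_root_chain;        /* register s as a root           */
        fr.slot = (void **)&s;
        gc_root_chain = &fr;

        l = strlen(s);
        for (i = 0; i < l; i++)
            s[i] = (*g)(s[i]);          /* a collection here still sees s */

        gc_root_chain = fr.prev;        /* unregister on procedure exit   */
    }


Because &s is stored where an unknown callee could reach it, the compiler
is obliged to keep the memory holding s up to date across the calls to g;
that is the cost, and the guarantee, of registration.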


The easy way to implement nested procedures is to take the address of a
structure containing pointers to variables shared between a procedure and
those it contains, or to place the variables themselves in the structure
(See Appel and Jim, CS-TR-168-88, Princeton, for a discussion of these
choices.) Whichever choice is made, taking the address and passing it off to
a subroutine typically causes an optimizing C compiler to make very
conservative assumptions about aliasing, even though a simple scan of the
program can determine which procedures can and cannot modify or examine
those variables. Reference parameters in Modula-3 are implemented as
pointer parameters in C; the difference is that a reference parameter may
only be used or modified by the call for which it is a parameter, but a
pointer parameter must be treated as if it were used or modified in ANY
subsequent call to any procedure.
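

A hedged sketch of the first of those lowerings (the record type Outer_env
and the procedure names are mine, not the Olivetti back end's), showing why
the C compiler turns pessimistic:


    /* The outer procedure's shared locals live in an explicit record;
       the nested procedure receives the record's address.              */
    struct Outer_env {
        int i;
        int limit;
    };

    static void Inner(struct Outer_env *env)    /* the nested procedure */
    {
        env->i += 1;                            /* up-level reference   */
    }

    void Outer(int limit)
    {
        struct Outer_env env;

        env.i = 0;
        env.limit = limit;
        while (env.i < env.limit)
            Inner(&env);   /* once &env escapes here, a typical C
                              optimizer must assume any later call might
                              read or write i and limit, so they cannot
                              be kept in registers across calls          */
    }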


The Modula-3 compiler also made frequent use of descriptors, referenced
through pointers, that would never, ever change, but the C compilers
didn't know this and so could not take advantage of it. Use of "const"
might do the trick, but I can't tell whether this is intended as a hint to
the programmer (to be subverted by the appropriate cast) or as a guarantee
to the optimizer.
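

The question can be made concrete with a small invented example (descriptor,
fill, and total_bytes are names made up here): may the load of
d->element_size be hoisted out of the loop, past the calls to fill, on the
strength of "const" alone, or must it be re-fetched after every call because
some other alias might have stored into the descriptor?


    struct descriptor {
        int element_size;                 /* never changes after creation */
    };

    extern void fill(char *dst, int i);   /* opaque to the optimizer      */

    long total_bytes(const struct descriptor *d, char *buf, int n)
    {
        long bytes = 0;
        int i;
        for (i = 0; i < n; i++) {
            fill(buf, i);
            bytes += d->element_size;     /* hoistable, or not?           */
        }
        return bytes;
    }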


NOW, it is possible to get around much of this by doing analysis,
optimization, and clever transformation before generating clever C, but
then that isn't using C as your only intermediate representation, and
generating the right clever C is not necessarily fun or easy. It is
certainly a mistake (and I learned this the hard way) to expect any kind
of a clean mapping from Modula-3 types to C types; "TYPE F = PROCEDURE() :
F" was one loser. Another mistake was attempting to map "ARRAY [0..9] OF
CHAR" into "struct{x char[10];}" to get by-value assignment and
parameter-passing for arrays; some machines impose alignment requirements
on structures, and thus this was Not Portable, since Modula-3 permits
construction of SUBARRAYs. Eventually, I used "bcopy".
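

A reconstruction of the problem and of the eventual fix (the names here are
illustrative, not the original code):


    #include <string.h>

    /* The attempted mapping: wrapping the array in a struct buys
       by-value assignment and parameter passing in C ...                */
    struct CharArray10 { char x[10]; };

    /* ... but SUBARRAY(a, 3, 10) denotes ten CHARs starting at an
       arbitrary byte offset, and on a machine that aligns structures
       that storage cannot legitimately be viewed as a CharArray10.
       Copying the bytes explicitly sidesteps alignment entirely
       (memcpy standing in for the bcopy mentioned above):               */
    void assign_chars(char *dst, const char *src, unsigned n)
    {
        memcpy(dst, src, n);
    }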


It is perhaps better to think of C as a portable assembler, use it at a
very low level, be thankful for the portability, and just not sweat the
performance too much. The people at Xerox took a more low-level approach
to their use of C, and generally had better luck, though they must still
worry about optimizers mangling their garbage collector. The low-level
approach also helps out compilation speed.


Then, there's debugging.


Note 1: There are other problems posed in compilation of Modula-3 to C,
but they are not of such a general nature.


Note 2: With pre-emptively scheduled threads, one must also be careful
about the order in which loads and stores are scheduled by the compiler.
Essentially, the collector introduces an implicit dependence between
assignments to the record that "registers" a root pointer and any use of
any part of the object referenced by that pointer. If


    x = p -> member; p = 0;


is re-ordered into the equivalent of


    t = & (p -> member); p = 0; x = *t;


collection just after "p = 0" could result in the incorrect value being
loaded into x (note that x is "registered", while t, a temporary introduced
by the C compiler, is not).


Jolly fun.


David Chase
Sun Microsystems
[The ANSI standard says that &b[0] - &a[0] is invalid unless a and b point
into the same array, and that &a[-100] is OK only if the result of the
subscripting is an element of the array that a points to. But I can believe
that in many cases hacks like that are necessary to trick a C compiler into
doing what you want it to. -John]