Re: Intermediate forms (again?): worthwhile?

BGB <cr88192@hotmail.com>
Wed, 26 Jan 2011 15:13:35 -0700

From comp.compilers

Related articles
[16 earlier articles]
Re: Intermediate forms (again?): worthwhile? steshaw@gmail.com (Steven Shaw) (2011-01-24)
Re: Intermediate forms (again?): worthwhile? barry.j.kelly@gmail.com (Barry Kelly) (2011-01-24)
Re: Intermediate forms (again?): worthwhile? nospam@myisp.net (Tony) (2011-01-24)
Re: Intermediate forms (again?): worthwhile? thomas.mertes@gmx.at (tm) (2011-01-24)
Re: Intermediate forms (again?): worthwhile? eh@electrichedgehog.net (Chris Dollin) (2011-01-26)
Re: Intermediate forms (again?): worthwhile? bc@freeuk.com (Bartc) (2011-01-26)
*Re: Intermediate forms (again?): worthwhile? cr88192@hotmail.com (BGB)* (2011-01-26)**

| List of all articles for this month |

From:	BGB <cr88192@hotmail.com>
Newsgroups:	comp.compilers
Date:	Wed, 26 Jan 2011 15:13:35 -0700
Organization:	albasani.net
References:	11-01-082 11-01-088 11-01-096 11-01-112
Keywords:	code
Posted-Date:	30 Jan 2011 21:51:27 EST

On 1/24/2011 3:26 AM, Barry Kelly wrote:
> BGB wrote:
>
> (Apologies ahead of time for slightly reordering your quotes.)
>
>> also, admittedly, it doesn't sit well with my design sensibilities. I am
>> personally more of a fan of modularized black-box components,
>
>> personally I have my doubts about LLVM, mostly:
>> it is a not exactly small piece of code;
>
> If you treat it as a black box, you can largely avoid this. Most of what
> one relies on in modern computing is formed out of large pieces of code.
> What's more important is how large the conceptual surface area is, and
> how consistent and orthogonal the component concepts are.
>

ok.

from what I had seen, it looked like much of the code using LLVM and the
design of LLVM itself was largely about digging around in its own internals.

granted, I had not spent a lot of time looking over the code (and most
of when I did look at it was a few years ago).

sadly, my own projects by this point are not exactly small either, grr...

>> it is written in C++ and doesn't provide a plain-C API (and is written
>> in what appears to me to be "traditional Java-style OOP", ...), which
>> may have some additional costs in some cases;
>
> LLVM does have a C API; it's used by the OCaml binding, for example. It
> is limited, but it's not too much work (in my limited experience) to
> extend that C API to access more bits of the C++ API that aren't
> directly exposed.
>

ok, I was not aware of this.
I guess it may seem odd for someone to want C based API's for
everything, and to write a lot of code in C, but there are reasons for this.

interestingly, I typically implement C API's first, and any C++ API's as
wrappers, even if the library itself was written internally in C++.

>> having to get code into SSA form is not exactly an easy task.
>
> Don't get hung up on this; LLVM converts code into SSA for you where
> it's relevant. Let me explain. There are two aspects to the SSA
> transformation as I see it: the simple acyclic layout of an expression
> tree or DAG; and then making sure control flow etc. works out, with phi
> nodes etc.
>
> If you have a simple expression (a + b * c), and view it as a tree:
>
> (+ a (* b c))

snip, converting expression to DAG.

> what we have here is essentially SSA form for the expression. This is
> how you describe expressions to LLVM; each SSA variable is essentially a
> name for a node in the expression tree (or DAG).

granted, I am typically not dealing with expressions by this point, but
rather stack machines.

granted, I could skip the stack machine stage if needed, and use another
representation of mine directly (which I call GAST, but I guess another
project uses the term GAST for a somewhat different representation),
which is essentially a "normalized" AST (its goal is to gloss over most
overt language-specific issues, mostly type-names and usage of modifier
flags at present).

currently, the intended use case for GAST was to help provide a common
"shared" area between my frontends and backends (vs having each frontend
produce its own ASTs, and each backend accepting only the ASTs for its
associated frontend).

some issues remain though, namely that thus far I have not implemented a
sufficiently language-neutral backend (I was working on a "BGBScript2"
language, and associated backend, where the backend is being designed to
be more neutral).

usually, it is the backends which use stack machines.

an issue not currently addressed by GAST though is how to deal with
differences in type-coercion semantics, where for example Java and C#
have different semantics than, say, C, but the format also mostly leaves
type-handling to the backend (in most of my stuff, type and scope
handling are typically delayed, typically until link/load time, or
runtime if need-be), preventing the option of limiting this mostly by
inserting explicit casts (and having the frontend complain about
implicit coercion). at present, the default is for C-like type coercion
semantics.

granted, a possible way to address this would be via a special flag:
noUnsafeConversion="true"

its current/typical representation at present is as a binary-XML format
(GAST is XML based).

> That's expressions out of the way; it should be clear that they're
> trivial to put into SSA form. For local variables, you can allocate them
> with 'alloca'. This relieves you from the requirement to transform all
> references to those local variables into SSA form yourself; you can
> treat them as memory cells that you poke data in and out of via
> dereferencing.
>
> The nice thing is that LLVM has an optimization pass which converts
> these alloca locals into SSA for you!
>

ok, this part is interesting...

so no need to worry about phi?...

this was a major issue in the past, and something I couldn't figure out
how to handle reliably. my own stuff had usually restricted the
situation so that the full general-case of phi handling was not
necessary (the fallback case being simply to dump the variable back onto
the stack and reload from there later, if the codegen couldn't figure
out how to get everything back into the same register from all jumps).

temporary variables were used and just assumed to have a limited
lifespan and potentially an assigned register (in such a case, the
temporary variable would own the register during its lifespan).

the IL was not itself in SSA form (it was stack based), nor was any
explicit SSA stage used (mostly just lots of plain-C logic code, and a
stack machine, where the stack would hold both data related to the
target-code, and also related to temporary state within the codegen, and
did not have a necessarily 1:1 mapping with the CPU stack).

> I have a toy Pascal language currently compiling into LLVM. Here's a
> sample procedure:
>
> procedure R;
> var
> a, b, c: Integer;
> begin
> a := 10;
> b := 20;
> c := 30;
> WriteInt(a + b * c);
> end;
>
> Unoptimized, it compiles into this in LLVM assembly:
>
> define void @R() {
> entry:
> %a = alloca i32
> %b = alloca i32
> %c = alloca i32
> store i32 10, i32* %a
> store i32 20, i32* %b
> store i32 30, i32* %c
> %0 = load i32* %a
> %1 = load i32* %b
> %2 = load i32* %c
> %3 = mul i32 %1, %2
> %4 = add i32 %0, %3
> call void @WriteInt(i32 %4)
> ret void
> }
>
> Trivial, right?
>
> Pass it through the LLVM optimizer (llvm-opt -O3), and you get this:
>
> define void @R() {
> entry:
> tail call void @WriteInt(i32 610)
> ret void
> }
>
> OK, so maybe that's more optimization than you expected to see. Less
> aggressively, just the mem2reg pass:
>
> define void @R() {
> entry:
> %0 = mul i32 20, 30 ;<i32> [#uses=1]
> %1 = add i32 10, %0 ;<i32> [#uses=1]
> call void @WriteInt(i32 %1)
> ret void
> }
>
> It's really quite simple.

yes, although I usually perform constant-folding at the AST level (such
as when converting ASTs to GAST).

a slight issue in the LLVM case may be that one still has to resolve
types prior to submitting code to LLVM, which would mean in my case,
probably after performing dynamic linking.

at the moment, an interpreter will still likely be a simpler option
initially for my BS2VM effort (even as long as it is taking to get the
thing written...), but I may consider LLVM later possibly for the JIT
(the other option being to reuse my old codegen, probably in a somewhat
modified form).

converting GAST directly to LLVM IR could work, but the issue would be
that I couldn't have a "bytecode image" stage in this case, as type
information would have to be available at this point.

GAST could be used as a distributable form, but is not technically a
bytecode, and as-is would be fairly trivial to coerce back into
source-code (while preserving much of the original code structure),
limiting one of the major merits of distributing precompiled code.

so, it all requires a little thought...

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Intermediate forms (again?): worthwhile?

BGB <cr88192@hotmail.com>Wed, 26 Jan 2011 15:13:35 -0700

BGB <cr88192@hotmail.com>
Wed, 26 Jan 2011 15:13:35 -0700