
From: mark@omnifest.uwm.edu (Mark Hopkins)
Newsgroups: comp.compilers
Date: 8 Jun 1996 22:12:31 -0400
Organization: Omnifest
Keywords: C, theory

This is the outline of an approach in which the C programming language is reduced to a primitive form that is directly represented within an extended Lambda Calculus. An evaluation mechanism, analogous to Landin's SECD Machine, can then be rigorously defined which will process an expression in the extended Lambda Calculus exactly as an imperative machine would, when the expression is the representation of a program segment -- all without explicitly being told to do so.

How is the Lambda Calculus extended? By allowing its terms to be infinite in size. The structure of an infinite lambda term will exactly mimic the control flow graph of a program, which is how we get away with defining programs without states or machines. By extending the definition of alpha conversion it is even possible to come up with a new and much more compact notation for expressing lambda terms which also has the benefit of directly embodying pointer semantics. Once again, we get away with doing this without ever having to refer to a machine, or memory state.

The basic correspondence was first noted by Landin in 1965:

X = E; Q <--> (lambda X.Q) E

for which he coined the syntax "let X = E in Q". I'll retain the C-like syntax since it's our intent to show how C itself can be regarded as nothing more than a notation for the extended lambda calculus.
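As an illustrative sketch of the correspondence (my own, in Python rather than C, with Q arbitrarily taken to be X + 1), both readings compute the same value:

```python
# Landin's correspondence, sketched in Python: the imperative sequence
# "X = E; Q" and the application (lambda X.Q) E yield the same result.
# Q is arbitrarily chosen to be X + 1 for illustration.

def imperative(E):
    X = E           # X = E;
    return X + 1    # Q

functional = lambda E: (lambda X: X + 1)(E)   # (lambda X.Q) E

print(imperative(41), functional(41))   # both print 42
```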

There is a common conception about how an imperative program and a functional program differ. One says that "variables can be updated and there are side effects" in an imperative program, whereas in a functional program "variables uniquely denote single values forever". In reality that is an illusory distinction: either semantics can be used with either language type, and indeed one has the following combinations of (imperative/functional) semantics with (imperative/functional) languages:

                  Functional Language            Imperative Language
   Funct. Seman.  The reduction of the           <What I'm about to show>
                  language to the Lambda Calc.
   Imp. Seman.    The evaluation of the          The processing of the
                  language on a SECD-like        language on an imperative
                  machine.                       machine.

So in reality, it's not the language families that are different, but the semantics that are commonly used with them. As an example:

(lambda x. (lambda x.x+2) x+1) x+1

has a functional semantics in which each level of scoping of the variable "x" represents an instance of a DISTINCT variable that just happens to have the same name. This expression could just as well be written:

(lambda u. (lambda v.v+2) u+1) x+1

The imperative semantics of this expression is obtained by applying the Landin correspondence: x = x + 1; x = x + 1; x + 2 -- a "single" variable named x is incremented twice and then the value x + 2 is returned.

Thus, we see that "variable updating" and "name clashes" represent exactly the same thing, just seen under different semantic filters. This provides the theme behind what's to follow: every feature commonly attributed to an imperative language is actually a feature of the lambda calculus in disguise, hidden by the fact that it is being interpreted under a different semantics. What's interesting is that this fact, in virtue of the existence of SECD-like machines of the kind described above, which can evaluate any lambda term reducible in applicative order, is largely reversible: most features of the Lambda Calculus may be given an imperative interpretation.

Also, note that in the example above the corresponding conversion would look like this: u = x + 1; v = u + 1; v + 2. So you can at last see: variables aren't actually being UPDATED (nooooo!); they're really just different variables that happen to have the same name, but different scopes (which in imperative semantics translates as: "different lifetimes").
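To make the point concrete, here is a small Python check (my own illustration, not from the original) that the shadowed-variable reading, the renamed reading, and the update reading all agree:

```python
# (lambda x. (lambda x.x+2) x+1) x+1 under three readings.

# Update reading: one variable x, incremented twice.
def updated(x):
    x = x + 1
    x = x + 1
    return x + 2

# Renamed reading: distinct variables u, v with distinct scopes.
renamed = lambda x: (lambda u: (lambda v: v + 2)(u + 1))(x + 1)

# Shadowed reading: the same name x rebound at each level of scoping.
shadowed = lambda x: (lambda x: (lambda x: x + 2)(x + 1))(x + 1)

assert updated(0) == renamed(0) == shadowed(0) == 4
```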

(1) Rational Infinite Lambda Expressions and Control Structures

But wait, you might say: what about while loops? How many variables are in a while loop that correspond to any variable inside its body which is assigned to? Isn't this where the correspondence breaks down? Indeed, this is exactly where Landin's original 1965 treatment ran aground, and it is why you don't hear nowadays how and why imperative and functional languages are the same thing in different guises. Well, I aim to correct this matter.

The answer is yes. If we're to consistently apply our principles then we must admit that there are indeed an infinite number of distinct variables in the body of a while loop, and that the loop itself is really nothing more than compact notation for an infinitely nested set of conditional expressions. In particular, we make the correspondence:

while (E) S; Q <--> E? (S x): Q

where x denotes a copy of the entire expression on the right: E? (S x): Q, and where (S x) denotes what would result if we further applied this correspondence and others to the (S x) combination. For example:

I = 0; J = 1; while (I < N) { I++; J *= 2; } J

when written in full would look like this:

I = 0; J = 1; (I < N)? (I = I + 1; J = J * 2; (I < N)?...: J): J

or more compactly as:

I = 0; J = 1; x    where x denotes the expression
(I < N)? (I = I + 1; J = J * 2; x): J

An expression such as this, which though infinite in size has only a finite number of distinct subterms, is called rational (infinite). As a general rule: control structures correspond to rational infinite lambda terms.
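One way to see the rational-term reading at work (a Python sketch of my own, not part of the original) is to write the labelled subexpression x directly as a self-referential function and compare it against the loop:

```python
# The loop  I = 0; J = 1; while (I < N) { I++; J *= 2; } J  ...
def loop_version(N):
    I, J = 0, 1
    while I < N:
        I += 1
        J *= 2
    return J

# ... and the rational term  x: (I < N)? (I = I + 1; J = J * 2; x): J,
# with the self-reference to x rendered as a recursive call.
def x(I, J, N):
    return x(I + 1, J * 2, N) if I < N else J

# "I = 0; J = 1; x" corresponds to calling x with I = 0, J = 1.
assert all(loop_version(n) == x(0, 1, n) == 2**n for n in range(10))
```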

Note the use of the label x and its reference within the expression itself. Doesn't this look a lot like a goto label and goto statement? We could have written the while loop in equivalent form like so:

I = 0; J = 1;
x: if (I < N) { I++; J *= 2; goto x; } else goto y;
y: J

Thus we arrive at the root of our correspondence here: goto labels denote subexpressions which comprise both the statement labeled and everything that comes after it. A goto statement corresponds to an embedded copy of the subexpression labeled by the goto label which the statement refers to. What we've been calling "states" all along are actually expressions!

A complete set of correspondences may thus be written down for every possible combination (S Q) of a statement followed by Q, which represents "what comes after" and which can denote any expression or any combination of expressions (or even things which the language itself does not have notation for but which we define anyhow for the sake of carrying out an analysis). The ultimate question of "what comes after" the main program can then be answered: anything that's pertinent to the behavior we're trying to figure out. This could include file streams, peripheral I/O streams, memory blocks or just expressions.

There's no real sense in carrying out the reduction in detail here, since in most cases it's fairly obvious where the labels and "goto"'s go for a given control structure. But what should be pointed out is that everything will then be reduced to one of the primitive forms:

(Expression evaluation followed by Q): E; Q
(Conditional expression): E? Q: Q
(Return): return [E]; Q

In the latter case, as with the other control-breaking statements (goto x, continue, break), the final outcome of the reduction does not involve Q. Another thing that should be pointed out is that a labeled statement followed by Q, (x: S) Q, is interpreted as the definition of a SUBEXPRESSION x: (S Q). So this is the mechanism by which expressions are embedded in other expressions (possibly themselves).

Now reconsider the expression in the example cited above:

I = 0; J = 1; x    where    x: (I < N)? (I = I + 1; J = J * 2; x): J

We noted that a variable such as I or J occurs an infinite number of times, one for each lambda operator in each copy of x (noting that x is actually the lambda expression: (I < N)? (lambda I. (lambda J.x) J*2) I+1: J).

How do we reduce this? Related questions are: how do you substitute a value for a variable in an infinite lambda expression, and which variables are free in that expression (and which are bound)? Fortunately, since in the lambda calculus the definitions of substitution, free and bound variables only rely on the LOCAL structure of an expression, it does not actually matter whether the expression itself is infinite in size or not: the exact same definitions apply! They only look somewhat complicated as the price we pay for writing them in compact, finite, notation. For instance, if you substitute 3 for I in x, you get (3<N)? (I = 3+1; J = J*2; x): J, but if you substitute 32 for N in x, you get a term (I<32)? (I=I+1; J=J*2; y): J where y is the term that resulted from making this substitution (i.e. the substitution propagates all the way down to infinity since there are no lambdas blocking it).

The set of free variables for the subexpression x would have to be defined in a way that reflects the cyclic structure of x:

FR(x) = {N,I,J} union FR(x)

The least solution to this formula (which is {N, I, J}) then gives you FR(x).
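The least solution can be computed mechanically by iterating the equation from the empty set until it stabilizes; the following Python fragment (my own sketch) does exactly that for FR(x) = {N,I,J} union FR(x):

```python
# Least fixed point of a monotone set equation, by iteration from {}.
def least_fixpoint(step):
    s = frozenset()
    while True:
        t = step(s)
        if t == s:      # stabilized: this is the least solution
            return s
        s = t

# FR(x) = {N, I, J} union FR(x); the iteration stabilizes after one step.
FR_x = least_fixpoint(lambda s: frozenset({'N', 'I', 'J'}) | s)
print(sorted(FR_x))   # ['I', 'J', 'N']
```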

Does this remind you of the kinds of equations you would see in variable flow analysis? If one defines USED(S) as the set of variables used by a statement S before being defined, and DEFINED(S) as the set of variables which are defined before being used, then one can write down the equation:

FR(S Q) = USED(S) union (FR(Q) - DEFINED(S))

which (a) defines the correspondence between variable flow analysis and variable analysis in the lambda calculus, (b) shows we could actually *derive* definitions for USED and DEFINED from the definition of FR, and (c) shows that we can therefore dispense with USED and DEFINED entirely and work with FR instead.

(2) Algebraic Infinite Lambda Expressions and Call-Returns

This time consider the following program sequence:

x = 1; P(5); x

where P(n) is defined as the following procedure (written in a C-like syntax):

P(n) { if (n > 0) x *= n, P(n - 1); }

We ask the same question: how many distinct copies of n are there in this program segment? Once again the answer is Infinity, if we are to apply our principles in a consistent manner. Also: what comes after the embedded call to P? Implicitly, we note the existence of the return statement

P(n) { if (n > 0) x *= n, P(n - 1); return; }

so where does the "return" do a "goto" to?

The combination P(n); Q is actually an infinitely sized expression the structure of whose subexpressions is (potentially) so vastly infinite that not even a finite number of labels would suffice to label all the distinct ones. Yet, we can still arrive at a compact notation by allowing labels to carry "parameters" within them which can be substituted by subexpressions, thus yielding a potentially infinite number of distinct subexpressions for that label. In particular, here, the sequence P(n); x will correspond to the instantiation P[n, x], where P is a parametrized label defined by:

P[n, Q]: ((n > 0)? (x = x*n; P[n-1,Q]): Q)

The extra parameter Q is where the return statement "goes to", so each occurrence of "return" (including the implicit one at the end, if need be) is replaced by Q.

The original example therefore will correspond to the infinite lambda expression:

x = 1; P[5, x]

which when written out in full would look something like this:

x = 1; (n>0)? (x = x*n; ((n-1)>0)? (x = x*(n-1); ...): x): x

where the more deeply nested instantiations of P would contain copies of (n-1)-1, ((n-1)-1)-1, etc. respectively, thus generating an infinitely sized lambda expression, now with an infinite number of DISTINCT subterms.
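The parametrized label can also be read as a recursive function in which the accumulator x and the continuation Q are passed explicitly. The following Python sketch (my own; the original procedure mutates a global x instead) checks that P[5, x], starting from x = 1, yields 5!:

```python
# P[n, Q]: (n > 0)? (x = x*n; P[n-1, Q]): Q, with the updated x threaded
# through explicitly and Q modelled as a function receiving the final x.
def P(n, x, Q):
    return P(n - 1, x * n, Q) if n > 0 else Q(x)

# x = 1; P[5, x]  -- the continuation just returns x itself.
result = P(5, 1, lambda x: x)
print(result)   # 120, i.e. 5!
```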

A (possibly) infinite expression whose subterm structure can be specified by the use of parametrized labels is referred to as Algebraic (infinite).

As another example, suppose we had the sequence

x = P(3); x

where now P(n) is defined as the procedure

P(n) { if (n <= 0) return 1; else return P(n-1)*n; }

This time, we represent the procedure by a parametrized label where there are now two extra parameters:

P[n,v,Q]: (n <= 0)? (v = 1; Q): P[n-1, w, (v = w*n; Q)]

Note the appearance of the auxiliary variable w. This arose by rewriting the procedure in the form:

P(n) { if (n <= 0) return 1; else { w = P(n-1); return w*n; } }

first and then applying the transformation.

The example above will then be transformed into the lambda expression

P[3,x,x]

which I'll leave to you to write out in all its infinite glory (since it takes a long time to completely unroll an infinitely big expression from its compact notation).
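The three-parameter label is, in modern terms, a continuation-passing transform: the pair (v, Q) acts as a continuation receiving the returned value. A Python sketch of my own, for the curious who would rather not unroll it by hand:

```python
# P[n, v, Q]: (n <= 0)? (v = 1; Q): P[n-1, w, (v = w*n; Q)]
# with (v, Q) fused into a single continuation k that receives v.
def P(n, k):
    return k(1) if n <= 0 else P(n - 1, lambda w: k(w * n))

# P[3, x, x]: the continuation returns the value bound to x.
result = P(3, lambda v: v)
print(result)   # 6, i.e. 3!
```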

Thus there is a slight difference between the way void procedures and value-returning procedures are made to correspond to infinite lambda expressions.

With the preliminaries out of the way, we can then conclude the following:

    All sequences (S Q) of statements S followed by combinations
    of values in Q may be reduced to an algebraic infinite
    expression involving only the combinations (E; Q) and
    (E? Q: Q), in addition to the lambda operators.

Thus, to define a language such as C it is really only necessary to look at how its expressions behave. The statement part of the language is completely encapsulated in terms of the algebraic infinite lambda terms used.

To carry out this definition, we will have to adopt a convention which will make expressions deterministic: all concurrent "side-effects" are considered to be evaluated from left to right, e.g., (i++ + j++) is regarded as equivalent to the sequence (u = i; i++; v = j; j++; u + v).
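In Python (which has no ++ operator) the convention can only be shown in its desugared form; this sketch of mine spells out the left-to-right sequencing:

```python
# (i++ + j++) under the left-to-right convention:
# u = i; i++; v = j; j++; u + v
def desugared(i, j):
    u = i; i = i + 1     # u = i; i++;
    v = j; j = j + 1     # v = j; j++;
    return u + v, i, j   # value of the sum, plus the updated i and j

print(desugared(1, 2))   # (3, 2, 3)
```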

Systematically, we may apply reductions to the combinations (E; Q) and (E? Q: Q) and these will only spawn smaller combinations of the same type (which may be reduced by the same means) and of the following type (u = E; Q), where u is an "auxiliary variable" which we use solely to carry out the analysis with. Ultimately, everything will reduce to a set of primitive combinations which could even be taken to be a kind of intermediate code. This reduction is indicated in the following charts:

E          E; Q                  E? Q: R                 u = E; Q
---        ----                  -------                 --------
A,B        A; (B;Q)              A; (B?Q:R)              A; (u=B;Q)
A?B:C      A?(B;Q):(C;Q)         A?(B?Q:R):(C?Q:R)       A?(u=B;Q):(u=C;Q)
A||B       A?Q:(B;Q)             A?Q:(B?Q:R)             u = A?1:(B?1:0); Q
A&&B       A?(B;Q):Q             A?(B?Q:R):R             u = A?(B?1:0):0; Q
!A         A; Q                  A? R: Q                 u = A?0:1; Q
A rel B    A; B; Q               u=A; v=B;               v=A; w=B;
                                 u rel v? Q: R           u = v rel w; Q

           rel denotes one of the relative comparison operators
           ==, !=, <, >, <=, >=

A op B     A; B; Q               u=A op B; u? Q: R       v=A; w=B; u=v op w; Q

           op denotes one of the operators + - * / % & | ^ << >>

op A       A; Q                  u = op A; u? Q: R       v = A; u = op v; Q

P(A...Z)   a=A;...;z=Z;          (not allowed)           (not allowed)
           P[a...z,Q]

           where P is a void function.  Parameters are pre-evaluated, thus
           the assignments a=A, ..., z=Z.

P(A...Z)   u=P(A...Z); Q         u=P(A...Z); u?Q:R       a=A;...;z=Z; P[a...z,u,Q]

           where P is a value-returning function.

X          Q                     u=X; u? Q: R            <u=X;Q is in primitive form>

           where X denotes a variable.

C          Q                     either Q or R           <u=C;Q is in primitive form>

           where C denotes a constant.

&L         L; Q                  u=&L; u?Q:R             <see below>
A[B]       A; B; Q               u=A[B]; u?Q:R           v=A; w=B; u=v[w]; Q
A.X        A; Q                  u=A.X; u?Q:R            v=&A; u=v->X; Q
A->X       A; Q                  u=A->X; u?Q:R           v=A; u=v->X; Q
*A         A; Q                  u=*A; u?Q:R             v=A; u=*v; Q
L inc      u=&L [(*u)inc; Q]     u=&L [(*u)inc? Q: R]    v=&L [u=*v; (*v)inc; Q]
inc L      u=&L [(*u)inc; Q]     u=&L [inc(*u)? Q: R]    v=&L [(*v)inc; u=*v; Q]

           where inc denotes one of the operators ++ or --.

L op A     u=&L                  u=(L op A); u? Q: R     v=&L [w=A; (*v) op w;
           [v=A; *u op v; Q]                             u=*v; Q]

           where op denotes any of the assignment operators
           = += -= *= /= %= &= ^= |= <<= or >>=.

The variables v and w, like u, are auxiliary variables used only in the context of the analysis and are intended to be unique to each application of each rule (so as to avoid name clashes and the like). The form u=&L [... *u ...] is indicated only schematically and is completed in different ways depending on which assignable L-value L corresponds to.
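As a sanity check on one row of the chart, here is a small Python verification (my own) that (A && B)? Q: R and its reduction A? (B? Q: R): R agree on all Boolean inputs:

```python
from itertools import product

# Row "A&&B" of the E? Q: R column: (A&&B)? Q: R reduces to A? (B?Q:R): R.
def original(A, B, Q, R):
    return Q if (A and B) else R

def reduced(A, B, Q, R):
    return (Q if B else R) if A else R

# Both short-circuit on B in the same way, and agree on all four cases.
assert all(original(a, b, 'Q', 'R') == reduced(a, b, 'Q', 'R')
           for a, b in product([False, True], repeat=2))
```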

The rest of this reduction will be presented in a followup article, along with the remainder of this presentation.
