Mon, 12 Jan 2009 13:13:02 -0800 (PST)

Related articles |
---|

The Magic Algebra -- The Algebraic Approach to Control Flow Analysis markwh04@yahoo.com (Rock Brentwood) (2009-01-12) |

An Algebra for Control Flow Analysis and Decompilation federation2005@netzero.com (Mark) (2012-07-16) |

Re: An Algebra for Control Flow Analysis and Decompilation federation2005@netzero.com (2012-07-20) |

From: | Rock Brentwood <markwh04@yahoo.com> |

Newsgroups: | comp.compilers |

Date: | Mon, 12 Jan 2009 13:13:02 -0800 (PST) |

Organization: | Compilers Central |

Keywords: | design, question, assembler, optimize |

Posted-Date: | 13 Jan 2009 07:31:49 EST |

0. Background

Back in 2002 I posted a brief note here on a rewriting of a

disassembler (the DAS 8051 disassembler), which put it in a position

where it could (1) be easily ported to other microprocessors (which it

was) and (2) converted into a decompiler (which I started to do --

resulting in DAC).

The irony is that the only remaining extant version of the 2002 DAS

(and DAC) are executables sitting on one of my old machines, which I

just managed to get onto my newer machines. The decompiler and

disassembler need to be decompiled and disassembled; hence the renewed

interest.

The revisions that had been done on them grew partly out of a

formalism I started developing in 2000. The following material

describes this formalism, which never saw light of day beyond my

notes, until now.

First, some history...

Contents:

0. Background

1. Magmas and Algebras

2. Alpha versus Beta Levels; Control Flow by Infinitary Expressions

3. The Extended Landin Correspondence -- Statements as Contexts

4. Object-Lifetimes via the "Free Variable" Predicate

5. A "Free" and "Bound" Calculus for Control Flow Constructs

6. The Magic Algebra

7. Extension to Inter-Procedural Control Flow

8. Multiple Objects and the Extended Magic Algebra

9. Primitive Statements; the Handling of Object-Subobject Relations

10. Conclusion

At the time, I was involved in a project whose very conception raised

a significant amount of attention from the executive staff at the

company I was at. The idea was simply to take all the legacy firmware

that had been written (all in assembly), run a simple decompiler on it

(in fact, the "DAS" 8051 disassembler that is part of the CSD4 8051

archive, retrofitted to produce simple C-like code), and port it over

to a PC development platform.

The company makes LED video displays. That includes everything. Not

only is the hardware constructed and designed in-house, but it

contains virtually no direct support for raster-scanning. It's all

done at the firmware level, which is also in-house. That means: even

the production of different shades of colors, since the individual LED

units are binary and only have an ON or OFF state. Shading is done by

time-dithering. That's atop the CRT-like raster-scanning, which is

also at the firmware level.

At the time, an experimental hardware-programmed card had been

developed by the company's European subsidiary (which, however, also

has only minimal support for raster scanning and color dithering). I

had already ported MY assembly over in a very short time to the PC.

The simplicity of the task arose from the fact that it was written

using the CAS 8051 assembler (which I developed). As its name

suggests, CAS provides one with a source language with many of the

niceties seen in C. So, translating the actual text is trivial.

The non-trivial part was porting over the underlying kernel (the WSMK,

or "World's Smallest Multitasking Kernel", which I also developed). To

call the kernel a microkernel would be understating the point. The

listing of WSMK is about 1 page and it has only 3 1/2 system calls:

SPAWN (to spawn a new process), EXIT (which happens when a process

does a "return" at the top-level), PAUSE (to await an event) and

RESUME (to kick-start someone waiting on an event). Since events can

be triggered by interrupts at multiple priority levels in the

underlying hardware (the 8051, which also provides limited support for

"register windows", amply used in the kernel), this provides the basis

for a co-routining multi-threaded kernel that is nonetheless pre-

emptive.

This was ported to DOS and everything was done on top of DOS.

With this past, the general idea was now to take the legacy firmware,

disassemble it to C, massage it over and run it on the PC card. This

would be the first step in launching an evolutionary transformation of

the entire range of legacy code into C onto a PC platform.

However, given that the legacy is an accrual of code written by

multiple people at various times, and that the original assumptions

that went into its construction were long since broken; the whole

thing had been for long in a morbid tangle. So, this made necessary a

detailed analysis down to the data flow level; as well as an analysis

of the stack usage (which, itself, is extremely important if one wants

to seek out and eliminate all the leakages from strict "call-return"

nesting that existed in the code).

The lynchpin of this analysis was a transformation of the program into

a series of algebraic equations. Each jump

X: jump Y

can be treated as an equation X = Y. Each statement

X: S; Q:

can be treated as an equation X = S[Q], where Q is the address label

following the statement S; and each branch

X: jump on condition C to Y; Q:

can be treated as the equation X = C? Y: Q, where Q is the address

label following the statement.

*IF* properly nested call-return format is restored and all the

leakages eliminated, then a "return" statement for a procedure P that

had been invoked as

X: call P; Q:

would be bound to Q, thus resulting in the equation

return = Q.

Thus, each procedure P is a control-flow expression with a "hole" in

it, marked by the "return" statement. If P is defined with the

statement body B[return], then its call

X: call P; Q:

is treated as the expression B[Q].

There is one equation posed for every single address -- even

overlapping addresses. (The 8051 supports 65536 addresses for code, so

that's 65536 equations). Overlapping addresses are treated as multiple

program segments superposed on top of one another. Out of the entire

set of equations, only those accessible from a designated subset of

"entry" points are then included. The check for overlapping is used by

the DAS disassembler to ensure that the entry points have are

consistently chosen. DAS, itself, does not carry out the full analysis

of setting up 65536 equations, but its design is nonetheless based on

the description just given.

The key point in reconstructing the routines is determining how the

data flows -- particularly how the objects live and die (i.e. are read

from and written to). On top of the system of equations provided by

the control-flow structure, one can formulate a system of algebraic

equations that captures the variable usage.

Once this information is intact, it's then possible to uncover the

formal parameters to be attached to the subprograms (and to make

decisions on which variables ought to be treated as global, static,

auto, etc.)

1. Magmas and Algebras

I'm going to describe an approach, here, to control flow analysis

which eliminates the dynamic element and incorporates an algebra and

calculus into the analysis. The foundation of this approach is that

each thread of control is treated as a single expression -- one that's

infinite in size, as I'm about to explain.

In functional languages, or other notations or calculi founded on the

Lambda calculus, there is a distinction drawn between the "alpha"

level and "beta" level. The former lies in the realm of syntax (more

precisely: in "abstract" syntax). In it, the alpha conversion rule

applies. This carries out a syntactic transformation whereby bound

variables are renamed. The effect of the alpha conversion is such that

not just the MEANING of the term remains the same, but its FORM, at a

suitably abstract level -- hence the designation as a syntactic rule.

The bound variables are "not there". A term is regarded as an

equivalence class of all its concrete forms, each one convertible to

the other by alpha conversion; each one committing to specific names

for the bound variables.

In contrast, at the beta level, one is at the semantic level. The

operative rule, beta conversion still preserves the MEANING of the

term, but not its form. The transformation of form defines the essence

of computation.

In algebraic theory, these differences can be lucidly and simply

explained in a way that removes the fog of philosophical haze. Terms

are elements of an algebra. For a language, like C, the actual

notation of the language is meant to embody a specific many-sorted

algebra that contains certain operators that act on certain types to

produce results of certain types. The language, itself, is then built

around this algebraic theory with the following modifications and

extensions: (a) the introduction of a dynamic element atop the static

algebraic notation -- control flow; and (b) the imposition of

limitations that restrict the algebra to a finite algebra or a finite

approximation of an infinite algebra.

The distinguishing features of an algebra are (1) its set of "sorts"

or types and its "signature", which comprises the listing of its

operations and what types they apply to and result in; (2) a set of

relations or axioms that define the equivalence of terms and serve as

the basis for the transformations that go to make up "computation".

If we suspend the axioms and take only the algebra with its sorts and

signature, but with no axioms, then the result is a generalization of

what's known as a MAGMA. A magma (in this general sense) is simply an

algebraic theory consisting of a signature but no axioms.

The designation "magma" probably arises from that fact that this gives

us the substructure on top of which all algebras are constructed.

Hence, the magma for a group G would be a signature containing the 3

group operations

Identities e in G

Inverses g in G |-> g' in G

Products (g1,g2) in GxG |-> g1 g2 in G.

An equivalent theory can be posed with just the one operation

Left Quotient (g1,g2) in GxG |-> g1\g2 in G

or

Right Quotient (g1,g2) in GxG |-> g1/g2 in G

(provided we also add a postulate that G be a non-empty set; so, we

can get around this by explicitly adding the identity to the

signature). With suitable axioms this can be made to emulate a group

(in fact, the axioms for the quotient are MUCH simpler than for the

product).

Every theory that can be expressed for any language possessing this

signature then comes down to an imposition of an equivalence relation.

This relation is known as a CONGRUENCE. A congruence is an equivalence

relation whose equivalence classes are preserved by all operations.

So, here, a congruence for this signature would be a relation (a==b)

satisfying the properties:

a==a

a==b -> b==a

a==b & b==c -> a==c

a==b -> a'==b'

a==b & a'==b' -> aa' == bb'.

The mapping of a magma M with this signature onto its equivalence

classes M/== = { [m]/==: m in M } then gives us the archetypical case

in point of what's termed a HOMOMORPHISM. The process reduces the

magma M to its quotient M/==, as a homomorphic image.

Since congruence relations are closed under arbitrary intersections,

then it follows that they form a complete lattice. Hence, one can also

talk about taking the "smallest" relation that contains a set of

equations. Out of this comes the notion of an algebra and a

PRESENTATION for an algebra. For instance, to define a group G, we can

take the quotient+identity signature and define the congruence

relation == as that containing the following relations

g/e==g

g/g==e

(g/k)/(h/k)==g/h

The result makes the reduced magma G/== into an algebra equivalent to

a group, in which the group operations are defined by

g' = (g/g)/g

g1 g2 = g1/((g2/g2)/g2)

It then follows that these operations define a group and that, under

these definitions, one also has the identity g/h = g h', as expected;

as well as g/g = e.

Finally, we get to the point: the FREE magma <X> generated by a set X

is none other than the set of terms built up out of the operations of

the given signature, with "variables" taken from the set X. When we go

one step further and add in a functional calculus with a lambda

operator, then this is where the distinction between free and bound

variables is introduced. Then, we would also require that terms be

alpha convertible, even in the free magma.

2. Alpha versus Beta Levels; Control Flow by Infinitary Expressions

So, with this background, we can draw a clear distinction between the

alpha and beta levels as follows:

Alpha Level -- the equivalences at the level of the magma

Beta Level -- the equivalences under the algebraic theory that are

defined atop the magma.

It's at this point that we bump (almost accidentally) right into the

middle of control flow analysis; and it's also here that we can show

where a sharp distinction between recursive functions or subroutines

lies with respect to another concept that will be introduced here --

the control-flow expression.

The question that naturally arises is this: if we define a functional

notation at the beta level, with recursion, we have the workings of

the lambda calculus built atop the given signature. It's out of this

that one normally expects to see a functional notation and language

arise.

If, on the other hand we try to define a fixed-point closure of the

FREE magma, then what do we get? Since the free magma <X> is nothing

more than an algebraic rendering of the concept of a "term language",

whose elements are the "trees" representing the structures of the

individual terms; then what inhabits the fixed point closure?

If we assume closure under finite systems of fixed point relations,

the resulting solutions are infinite extensions of terms whose tree

structures are known in the computer science literature as RATIONAL

TREES. The distinguishing feature of a rational tree is that, though

it may be infinite in size, it contains only a finite number of

distinct sub-trees.

For the expression calculus underlying a language like C, one has a

conditional operator A? B: C. Under a fixed-point closure, one may

conceive of a term of the form

A? (A? (A? (A? ...): C): C): C.

If the entire term is denoted X, then the sub-terms are A, C (and

their sub-terms), as well as A? X: C, which is just X, itself.

This is "control-flow" at its purest. Each expression or sub-

expression may be thought of as a "label". A reference to the sub-

expression is a "goto". A rational expression, if infinite, contains

cyclic references in it -- i.e., cyclic control flow structures.

A larger class, still, of expressions is arrived at by allowing fixed-

point abbreviations with GNU-breviations (i.e. macros that have

infinite self-referential expansions that may unfold into an infinite

number of distinct sub-terms). This embodies the notion of

INTER-procedural control flow, just as the rational expression

embodies the notion of INTRA-procedural control flow.

3. The Extended Landin Correspondence -- Statements as Contexts

It is out of this that a natural SYNTACTIC (not semantic, but

syntactic) representation for control flow arises. When this is

grafted onto an algebra that, itself, is built atop the magma, the

result is an extension of the algebra to a control-flow language of

such a nature, that the syntactic representation leads to direct

realizations of all of the known formalisms for computational

semantics.

The basis of this correspondence is as follows: associated with each

statement S, in the control flow language is a "context" S[] (i.e. a

function defined on the magma) such that the result of substituting an

expression Q into the context is an expression S[Q] that encapsulates

the "value of Q after S is executed".

For assignment statements involving simple variables, the context

arose from the functional<->imperative correspondence first described

by Landin in the 1960's

x = e <-> (lambda x []) e.

One may conceive of a more complex extension of the lambda calculus

where not just simple variables but objects (and subobjects) are

permitted under the binding of the lambda. For instance, if a(x) is

the function underlying an array a[x], then one would have an

equivalence of the form

(lambda a[e].f) = lambda x. (lambda a.f)([e]<-x,a)

where ([e]<-y,a) is the array whose underlying function is given by

([e]<-y,a) = lambda x.(x==e? y? a(x))

A more elaborate set of equivalences can be posed for multiple access;

e.g.

(lambda a[e][f].g) = lambda x. (lambda a[e].g)([f]<-x,a(e)).

Further elaborations can be conceived of in which object-subobject

relations are enforced (e.g. with pointers, dynamic segments, etc.)

The point of all this is not to get into these intricacies, but to

exemplify the primordial case-in-point of the generalized "Landin

correspondence":

S[Q] is the result of applying statement S to an (infinite)

expression Q, where S[] is the context corresponding to S.

More elaborate formulae for contexts can be built up, e.g.

(S1;S2)[Q] = S1[S2[Q]]

(E?S)*[Q] = E? S[(E?S)*[Q]]: Q

S# = S[S#]

(;)[Q] = Q

(E?S)[Q] = E?S[Q]:Q

(E?S1:S2)[Q] = E?S1[Q]:S2[Q]

and so on.

Though we're only looking at single-threaded computation, it's also

natural to extend these deliberations to non-deterministic terms

(i.e. terms that designate entire sets of expressions), with the

following primitives

(exit)[Q] = {}

(spawn S)[Q] = S[Q] union Q

This will actually AID in the analysis to be formulated below, and (in

fact) proves to be a nearly-indispensable ingredient even for the

lesser task of writing an analysis out for the deterministic case.

4. Object-Lifetimes via the "Free Variable" Predicate

The rules we're seeking for are to determine the activity of the

various objects in the language. We will assume that the underlying

lambda notation works with objects and subobjects, as well as simple

variables.

The analysis adopted here is based on the simple primitives: the

notions of FREE and BOUND variables. And it's here that the underlying

syntactic representation in terms of magmas makes its full power felt.

For, even though the usual definitions of "free" and "bound" are posed

only for the FREE magma defining a "term language", they apply intact,

just as well, to fixed-point closed magmas. That is, one can still

distinguish which variables in a given expression are free, and which

ones are bound even when the expression is infinite. That's because

the freeness and boundness is determined by what expressions contain a

given sub-expression. All sub-expressions lie a finite number of

levels below the top-level expression. This is also true even if the

expression is not rational and has an infinity of distinct sub-

expressions.

Hence, the freeness and boundness predicates are well-defined for

infinitary expressions.

We start out by assuming that there are a family of "primitive"

statements; and that the context S[] of each such statement binds some

variables, while adding free variables.

Hence, if we denote F(Q) as the set of free variables of Q, we

postulate the existence of sets F(S) and B(S) such that

F(S[Q]) = F(S) union (F(Q) - B(S)).

For instance, an assignment to a simple variable x of the form x = 0

has F(x=0) = {}, B(x=0) = {x}.

If this, indeed, holds true then we can actually DERIVE the formulas

for the basic control constructs:

F((S1;S2)[Q])

= F(S1[S2[Q]])

= F(S1) u (F[S2[Q]]) - B(S1))

= F(S1) u ((F(S2) u (F(Q) - B(S2))) - B(S1))

= F(S1) u (F(S2) - B(S1)) u (F(Q) - B(S2) - B(S1))

= (F(S1) u (F(S2) - B(S1))) u (F(Q) - (B(S2) u B(S1)))),

(where I'm abbreviating "union" to "u" above, for brevity). Matching

to the formula

F((S1;S2)[Q]) = F(S1;S2) union (F(Q) - B(S1;S2)),

we find that

F(S1;S2) = F(S1) union (F(S2) - B(S1))

B(S1;S2) = B(S1) union B(S2).

Based on the analogy between the formula for F(S1;S2) and F(S[Q]),

we'll also assume that a "bound" operator B(Q) can be defined for

expressions which satisfies the identity

B(S[Q]) = B(S) union B(Q).

5. A "Free" and "Bound" Calculus for Control Flow Constructs

Second, we as assume for the control flow extension that

F(Q1 union Q2) = F(Q1) union F(Q2)

F({}) = {}

Then we may derive the following

F(exit[Q]) = {} = F(exit) union (F(Q) - B(exit))

from which we obtain

F(exit) = {}

B(exit) = Omega = set of ALL objects

In turn, we have

F(spawn S[Q]) = F(S[Q] union Q)

= F(S[Q]) union F(Q)

= (F(S) union (F(Q) - B(S))) union F(Q)

= F(S) union F(Q)

from which we derive

F(spawn S) = F(S)

B(spawn S) = {}.

If we treat the conditional as a derivative of the non-deterministic

branch, e.g. by the equivalence

x? S1: S2 = (spawn (x? S1: exit));(spawn (x? exit: S2))

whenever x is a simple variable, then we can write down a similar rule

F(x? S1: S2) = {x} union F(S1) union F(S2)

B(x? S1: S2) = {x} union (B(S1) INTERSECT B(S2))!

where the last formula is underscored for emphasis.

Out of this arises the following

F(E? S[Q]: Q) = F(E) union (F(S[Q] union Q) - B(E))

= F(E) union ((F(S) union F(Q)) - B(E))

= F(E) union (F(S) - B(E)) union (F(Q) - B(E)).

Hence

F(E?S) = F(E) union (F(S) - B(E)) = F(E;S)

B(E?S) = B(E)

Similarly, we get

F(E?S1:S2) = F(E) u ((F(S1) u F(S2)) - B(E)) = F(E;S1) u F(E;S2)

B(E?S1:S2) = B(E) u (B(S1) n B(S2)) = B(E;S1) n B(E;S2)

(abbreviating "union" and "intersection", resepctively, to "u" and

"n").

In order for this to be consistent with a B() operator on expressions,

we also have to adopt the following axioms:

B(Q1 union Q2) = B(Q1) intersect B(Q2)

B({}) = Omega.

Finally, to deal with fixed point relations, we adopt the postulate

that the infinitary expression corresponding to the expression is the

limiting case of all finite truncations of that expression where sub-

expressions are pruned off and replaced by {}.

Hence, for S#, we have the limiting case containing the set of

expressions

{}, S[{}], S[S[{}]], S[S[S[{}]]], ...

out of which we get for F():

{}, F(S), F(S), F(S), ...

and for B(), we get:

Omega, Omega, Omega, Omega, ...

The limiting relations are then taken to be the values for the

corresponding rational expression:

F(S#) = F(S),

B(S#) = Omega

For the loop construct, taking limits, we arrive at the following:

F((E?S)*) = F(E) union (F(S) - B(E)) = F(E?S)

B((E?S)*) = B(E) = B(E?S)

In summary, we have the following table

F({}) = {} B({}) = Omega

F(Q1 u Q2) = F(Q1) u F(Q2) B(Q1 u Q2) = B(Q2) n B(Q2)

F(S[Q]) = F(S) u (F(Q) - B(S)) B(S[Q]) = B(S) u B(Q)

F(exit) = {} B(exit) = Omega

F(spawn S) = F(S) B(spawn S) = {}

F(S1;S2) = F(S1) u F(S1) B(S1;S2) = B(S1) n B(S2)

F(S#) = F(S) B(S#) = Omega

F((E?S)*) = F(E?S) = F(E;S) B((E?S)*) = B(E?S) = B(E)

F(E?S1:S2) = F(E;S1) u F(E;S2) B(E?S1:S2) = B(E;S1) n B(E;S2)

(again, with the abbreviations "u" and "n").

6. The Magic Algebra

Now, it's at this point that the MAGIC ALGEBRA enters into the

picture. First, pick out a single object x. Using this object, we may

assign any of the following values to an expression Q and statement S:

<Q> = 1 if x is not in F(Q) or B(Q)

<Q> = r if x is in F(Q) - B(Q)

<Q> = w if x is in B(Q) - F(Q)

<Q> = rw if x is in B(Q) intersect F(Q).

Then, the surprise that awaits us is that there is a simple algebra

for {1,r,w,rw} underlying the decomposition formulae just derived. It

is given by the following

<{}> = <exit> = w

<Q1 union Q2> = <Q1> + <Q2>

<S[Q]> = <S><Q>

<S1;S2> = <S1><S2>

<S#> = <S>w

<(E?S)*> = <E?S> = <E>(<S>+1)

<spawn S> = <S>+1

<E?S1:S2> = <E>(<S1>+<S2>),

with operations given by

a+b = least upper bound of {a,b}

1x = x = x1 for all x

rr = r, ww = wr = w

The least upper bound is taken under the partial ordering given by

w < 1 < r and w < rw < r.

Thus, for instance, 1 + rw = r, 1 + w = 1, 1 + r = r. Under addition,

r behaves as infinity, while w behaves as zero. Under multiplication 1

is the identity, while w behaves as a zero ONLY on the left, but not

on the right!

It is also useful to carry the analysis to places where control flow

cannot be easily resolved by static analysis (i.e. indirect calls and

jumps). For this, we add in the extra elements

<S> = 1, R, r, W, RW, rW, w, Rw or rw

with the factor R occurring if x's membership in F(S) is unresolved,

and W occurring if x's membership in B(S) is unresolved.

A suitable extension to the algebra can then be defined. In

particular, the partial ordering relation is the one given by

w < W < 1 < R < r,

along with the ordering this implies for all the factorings

Rw < RW < R, rw < rW < r, w < Rw < rw, W < RW < rW. This results in

the following lattice

1 R r

W WR Wr

w wR wr

where increasing order takes place going up or to the right.

Again, we find that

1x = x = x1, wx = w

for all x. From this, it follows that Rwx = Rw and rwx = rw for all x.

The remaining identities are

RR = R, rR = Rr = rr = r, WR = Wr = RW, WW = W, Ww = w.

Together, this gives us the MAGIC ALGEBRA -- which serves as a basis

for the control flow analysis of F() and B().

7. Extension to Inter-Procedural Control Flow

To expand this to inter-procedural analysis, this is where we may use

of the larger family of infinitary expressions -- those which solve

non-rational fixed-point relations.

The context for a subprogram given by the following definition

define P(x1,x2,...,xn) B[return](x1,x2,...,xn)

when applied to the procedure call

P(e1,e2,...,en)[Q]

is the expression

B[Q](e1,e2,...,en).

This representation treats the "return" statement, itself, as a place-

holder for the context Q in the expression S[Q]. That is, "return" is

treated as an expression.

The context for a value-returning subprogram given by the following

definition

define F(x1,x2,...,xn) B[return[]](x1,x2,...,xn)

when applied to the function-evaluation (itself, in context),

S[F(e1,e2,...,en)]

is the expression

B[S[]](e1,e2,...,en) = (B;S)[](e1,e2,...,en)

Here, the "return" statement is not a place-holder for a context, Q,

but for a statement S[]; "return" is treated as a statement.

A call-by-value semantics for parameter-binding is directly

incorporated by the following equivalences

define P(x1,x2,...,xn) B[return]

define F(x1,x2,...,xn) B[return[]]

becoming

P(e1,e2,...,en)[Q] = (x1=e1;x2=e2;...;xn=en;B)[Q]

S[F(e1,e2,...,en)] = (x1=e1;x2=e2;...;xn=en;B;S)

or, if ordering is to be made indeterminate,

P(e1,e2,...,e2)[Q] = (spawn(x1=e1);spawn(x2=e2); ...; xn=en)[B[Q]]

S[F(e1,e2,...,en)] = (spawn(x1=e1);spawn(x2=e2); ...; xn=en);(B;S)

Both of these schemata are fixed-point equations in the underlying

magma whose solutions yield infinite expressions that contain an

infinite number of distinct sub-terms, in general -- irrational terms.

Here, F() and B() when applied to subprograms will produce a recursive

system of equations between the subprograms. The same prescription may

be used to arrive at a limit definition.

When the MAGIC ALGEBRA is inserted into this analysis, what this

amounts to is solving fixed point equations over the algebra. Though

it hasn't been developed in detail here, it seems fairly obvious and

intuitive that there is both a simple and efficient equation-solving

algorithm for parametric fixed-point systems over both the {1,r}x{1,w}

magic algebra and its extended {1,R,r}x{1,W,w} algebra.

More importantly: we're not talking about something as powerful as the

Boolean algebra. So equation-solving over this algebra should not

degenerate into full-blown NP-completeness.

8. Multiple Objects and the Extended Magic Algebra

Multiple objects are handled together by generalizing the algebra to

include POWERS of r and w. Thus, corresponding to the pair of sets (F

(S), B(S)) will be the expression r^F(S) w^B(S). This places the focus

on the algebra inherited for the exponents, themselves.

The extended algebra is thus obtained by generalizing the sets {1,R,r}

and {1,W,w} to powers of r and w. The primitives adopted here reflect

the relations of the F() and B() operators, and are as follows:

r^a r^b = r^(a|b), w^a r^b = r^(b-a) w^a, w^a w^b = w^(a|b)

r^a w^b + r^c w^d = r^(a|b) w^(cd)

r^0 w^0 = 1

a < b -> r^a < r^b & w^a > w^b.

In order for this to work consistently with associativity and the

identity property for products, this requires that

a|0 = a = 0|a, a-0 = a, 0-a = 0

a|(b|c) = (a|b)|c, (a|b)-c = (a-c)|(b-c), a-(b|c) = (a-c)-b

Note the reversal of ordering in the last identity.

Consistency with the ordering relation requires

0 <= a

a <= b, c <= d -> a|c <= b|d

a <= b, c >= d -> a-c <= b-d

From the last of these, it follows that a-b <= a.

In order that it yield associativity for +, it also requires that

a(bc) = (ab)c.

For commutativity of +, we also need

a|b = b|a, ab = ba.

For distributivity over + on the left, we need

a|b|a = a|b, a|bc = (a|b)(a|c).

For distributivity over + on the right, we need

a|b|(a-c) = b|a|(a-c), ab|c = (a|c)(b|c)

Finally, to recover r and w, respectively, as special cases r^1 and

w^1 (corresponding to r^Omega and w^Omega), we assume that all

exponents lie in the interval [0,1]. Then we also have

a <= 1.

If all these properties are assumed, then it follows that a|b is the

least upper bound operator for the ordering relation <, and that 0 is

the minimal element and 1 as the maximal element.

An example may be obtained by mapping the exponents onto the linearly

ordered real-number interval [0,1]. Here, a|b = max(a,b) and a-b can

be given any of a wide range of definitions consistent with the above

properties, including the following:

(a>b)?(a-b):0, (a>b)?a:0.

9. Primitive Statements; the Handling of Object-Subobject Relations

For the above analysis to take off, we first need a grounding for the F

() and B() sets. These need to be defined for the primitive

statements. These are the assignment statements and (more generally)

statements that contain expression evaluation threaded with embedded

assignments.

At the most general level, an assignment statement of the form X = E

bounds the object X to B(). However, the story doesn't end there. If

the object is part of an object hierarchy, then the SUB-objects are

also bound, and the SUPER-objects are partly bound. The object-

subobject hierarchy imposes a set of constraints on F() and B().

For an assignment statement S = (X = E), one has the following

F(S) = F(&X; E) = F(&X) union F(E) - B(&X)

B(S) = B(&X; E) = B(&X) union F(E)

In the absence of an object-subobject hierarchy, we may simply write

F(&X) = B(&X) = { &X }.

However, in the presence of an object-subobject hierarchy, the sets F

(&X) and B(&X) need not be trivial.

Instead, we have to account for the hierarchy. Suppose &Y contains &X

as a proper subobject. Then, a read from &X also reads &Y, while a

write to &X equates to both a read AND write of &Y (in effect, it's as

if one "reads" the unchanged parts of &Y and writes them back). Hence,

if <S> = r^a w^b for &X then <S> = r^{a|b} w^b for &Y.

This is the constraint that needs to be enforced.

For non-overlapping objects, there are no constraints. For overlapping

objects, things get more complicated, but the occurrence of

overlapping (other than proper inclusion or identity of objects) is

not generally granted by programming languages.

Thus, at its most general level, the value of the statement (X.f = E)

for the object &X.g would be the conditional expression

<X.f = E> = w^{f==g}.

while its value for the object &X would be

<X.f = E> = rw.

If &X.g, itself, had a non-trivial part-whole relation with &X or with

a subobject of &X.f, then the first expression could also acquire an

exponent of r or raise its exponent of w to a higher power.

10. Conclusion

For the task at hand, analysis of assembly code for the 8051, none of

these elements were of direct relevance. The processor is essentially

is a register-based processor (a relatively large 256+128-byte

register space), having little need for non-register variables in most

applications the 8052 is used for. This was (for the most part) the

case with the analysis done in 2000.

However, for more general tasks -- particlarly the problem of

decompilation in its full-blown generality (and even the problem of

*COM*-piling and doing optimization and control-flow analysis, the

considerations just raised here come fully to bear.

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.