Re: Writing Assembler! (Mark Hopkins)
25 May 1997 13:31:31 -0400

          From comp.compilers


From: (Mark Hopkins)
Newsgroups: comp.compilers
Date: 25 May 1997 13:31:31 -0400
Organization: Omnifest
References: 97-05-156
Keywords: assembler

>From (Khoo Kiak Wei):
> I am planning to write a generic assembler, as a work of learning
> flex and bison. However, after reading a book on Lex and Yacc, I am
> still confused on how should I start!?
> If any of you know of any sample code on assembler or good books talking
> on the assembler construction, I would like to hear from you please!
> Thanks in advance.

      What should go into the design of an assembler? I'll give you a
head start by listing what I think the ideal design should look like.

(A) Resolution of Operations
      A subtle fact that took me a long while to learn is that there are
different levels of assembly language. At the bottommost level, one has
what you might call "literal" assembly. Here you have to explicitly
indicate the exact form of every mnemonic and take total responsibility for
making sure that the mnemonic has valid operands. The processors I'm most
familiar with are the 8051 and 8086. In both processors you have different
instructions, for instance, for calling a subroutine or jumping to an
address. With the 8051, for instance, you have these operations:

                                SJMP Relative
                                AJMP Paged         ACALL Paged
                                LJMP Absolute      LCALL Absolute
                                JMP  @A+DPTR

A relative address lies in the range [$ - 0x80, $ + 0x7f], where $ stands
for the address following the end of the instruction. A paged address
lies on the same 0x800-byte "page" as $, i.e. (Paged & 0xf800) == ($ & 0xf800).
An absolute address can lie anywhere. An absolute address can also be
specified through the pointer formed as the sum of A and DPTR. In a "literal"
assembler you have to specify which instruction to use.
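The range rules above can be written down directly. What follows is only a minimal Python sketch, not taken from any real assembler: the names fits_relative, fits_paged and pick_jump are made up for illustration, and it assumes SJMP and AJMP are 2 bytes long and LJMP is 3, so that $ is the jump's own address plus 2.

```python
def fits_relative(target, next_addr):
    """True if target is within the signed 8-bit range around $ (next_addr)."""
    return -0x80 <= target - next_addr <= 0x7F


def fits_paged(target, next_addr):
    """True if target lies on the same 0x800-byte page as $ (next_addr)."""
    return (target & ~0x7FF) == (next_addr & ~0x7FF)


def pick_jump(target, addr):
    """Return the shortest 8051 jump mnemonic reaching target.

    addr is the address of the jump instruction itself (hypothetical
    convention for this sketch).
    """
    if fits_relative(target, addr + 2):   # SJMP: 2 bytes
        return "SJMP"
    if fits_paged(target, addr + 2):      # AJMP: 2 bytes
        return "AJMP"
    return "LJMP"                         # 3 bytes, reaches anywhere


print(pick_jump(0x0110, 0x0100))
print(pick_jump(0x0300, 0x0100))
print(pick_jump(0x0900, 0x0100))
```

A "literal" assembler leaves this choice to the programmer; the higher levels described below make it themselves.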

      With the 8086 you have
                                JMP SHORT Relative
                                JMP NEAR Paged      CALL NEAR Paged
                                JMP FAR Absolute    CALL FAR Absolute

where FAR Absolute is a 32-bit address that can be specified directly
or through a pointer, Paged is an address (specifiable directly or
through a pointer) that lies in the same 16-bit page as $, and
Relative means the same thing as it does on the 8051.

      It's not really a major problem to indicate which form you want:
you always specify the shortest form (SJMP, ACALL) and let the
assembler tell you which ones are out of range and need to be
extended. But it's a minor hassle.

      At the next level, you'll have an assembler which does the
resolution of operations itself. One way to do this, with the 8086
example, is to require the programmer to explicitly designate some
addresses as near and some as far. This runs roughly parallel to what
you do in C by specifying some functions as static (visible only from
within the file) and some as global (visible from all files: the
default), with the C file and 8086 segment being analogous. In the
case of CALL, this is absolutely required since the corresponding
return instruction has to be explicitly indicated as a NEAR or FAR
return.

      With the 8051, an assembler may allow the programmer to just say:

                                JMP Absolute        CALL Absolute

and then resolve the instructions as best as it can.

      At the highest level -- the ideal level -- an assembler will allow
EVERY operand combination for EVERY mnemonic. For example, consider
the following instruction:
                                                XCH X,Y

used to exchange the values in variables X and Y. On the 8051, the only
allowed combinations are:

                                                XCH A,Data
                                                XCH A,@Ri
                                                XCH A,Rn

where Data is an address in "directly addressable" (internal) RAM,
where @Ri (i = 0, 1) points to "indirectly addressable" RAM, and Rn (n
= 0, ..., 7) lies in the Register address space.

      Directly addressable RAM consists of the internal RAM 0 to 0x7f,
and the Special Function Registers (range: 0x80 to 0xff). Indirectly
addressable RAM consists also of the internal RAM 0 to 0x7f, but a
different segment in the range 0x80 to 0xff, which can also be used as
stack space. The Register address space is a "register window" in
internal RAM, with one of the 4 ranges 0 to 7, 8 to 0xf, 0x10 to 0x17
or 0x18 to 0x1f.

      An assembler at the first two levels will force you to make your
code conform to the restrictions placed on the XCH instruction. The
assembler at the 3rd level will insert its own code, as needed, to
make your operands fit the requirements of the instruction. For
instance, you might have the following translations:

                XCH Data,A   --->   XCH A,Data

                XCH @Addr,A  --->   MOV R0,#Addr
                                    XCH A,@R0

                XCH D1,D2    --->   XCH A,D1
                                    XCH A,D2
                                    XCH A,D1

with the middle translation requiring an extra resource (here the
register R0) not explicitly indicated in the instruction.
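To make the idea concrete, here is a rough Python sketch of such an expansion. It is only an illustration under stated assumptions: expand_xch and its scratch parameter are hypothetical names, operands are plain strings, and in the two-variable case both operands are assumed to be forms XCH already accepts next to A (Data, @Ri or Rn).

```python
def is_indirect_reg(op):
    """True for the two pointer forms the hardware accepts: @R0 and @R1."""
    return op in ("@R0", "@R1")


def expand_xch(x, y, scratch="R0"):
    """Rewrite XCH x,y into a list of XCH forms the 8051 really has.

    scratch is the register silently borrowed for the @Addr case
    (an assumption of this sketch; a real tool would have to make
    sure it is free).
    """
    if y == "A":
        x, y = y, x                      # XCH is symmetric
    if x == "A":
        if y.startswith("@") and not is_indirect_reg(y):
            # @Addr: load the address into a pointer register first
            return [f"MOV {scratch},#{y[1:]}", f"XCH A,@{scratch}"]
        return [f"XCH A,{y}"]            # already a legal form
    # Neither operand is A: three exchanges through A swap x and y
    # and leave A with its original value.
    return [f"XCH A,{x}", f"XCH A,{y}", f"XCH A,{x}"]


for ops in (("Data", "A"), ("@Addr", "A"), ("D1", "D2")):
    print(ops, "->", expand_xch(*ops))
```

The silent use of R0 is exactly the "extra resource" problem: the sketch just takes it, where a real assembler would need some way to be told which register may be clobbered.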

      Another kind of instruction which will be resolved in an orthogonal
manner in this kind of assembler is the conditional jump. On the
8051, which has addressable bits, there are only a few different
kinds of conditional jumps. For instance, one has

                                                JB Bit,Relative

which is equivalent to the C statement:

                                                if (Bit) goto Relative;

Typical of these instructions is that they only allow restricted
ranges of addressing. An assembler will normally force you to fit the
address into the indicated range or else flag an error. This is true
even of those assemblers (at level 2) which try to resolve JMP into
SJMP/AJMP/LJMP -- which is a serious inconsistency! If you're going
to allow one kind of jump to be resolved by the assembler, then you
have to let them all be.

      What the assembler might do, if the indicated address is not in
range, is convert it into the sequence:

                JB Bit,Address   --->       JNB Bit,XX
                                            JMP Address
                                        XX:

where the opposite condition is used. In the 8051, there are some
versions of the conditional jump which don't have opposites. They
would have to be translated, when out of range, as in the following

                JBC Bit,Address  --->       JBC Bit,YY
                                            SJMP XX
                                        YY: JMP Address
                                        XX:
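Both rewrites can be sketched in a few lines of Python. This is an illustration only: expand_cond_jump, the OPPOSITE table and the in_range flag are names invented here, the labels XX and YY stand for fresh labels the assembler would generate, and the table covers just the 8051 pairs JB/JNB, JC/JNC and JZ/JNZ.

```python
# Conditional jumps that have an opposite on the 8051.
OPPOSITE = {"JB": "JNB", "JNB": "JB",
            "JC": "JNC", "JNC": "JC",
            "JZ": "JNZ", "JNZ": "JZ"}


def expand_cond_jump(mnemonic, operand, target, in_range,
                     skip="XX", tramp="YY"):
    """Return the instruction sequence for a conditional jump.

    operand is the tested bit (or None for JC, JZ and friends);
    in_range says whether target fits the relative range.
    """
    def fmt(m, dest):
        return f"{m} {operand},{dest}" if operand else f"{m} {dest}"

    if in_range:
        return [fmt(mnemonic, target)]
    if mnemonic in OPPOSITE:
        # Jump around an unconditional JMP using the opposite test.
        return [fmt(OPPOSITE[mnemonic], skip), f"JMP {target}", f"{skip}:"]
    # No opposite (e.g. JBC): branch to a trampoline instead.
    return [fmt(mnemonic, tramp), f"SJMP {skip}",
            f"{tramp}: JMP {target}", f"{skip}:"]


print(expand_cond_jump("JB", "Bit", "Address", in_range=False))
print(expand_cond_jump("JBC", "Bit", "Address", in_range=False))
```

The plain JMP left in the output is itself still to be resolved into SJMP, AJMP or LJMP.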

      So, an assembler at the highest level should allow every operator
to be used with every type-conforming combination of operands,
possibly doing some translation in the process to make the operation
fit the requirements of the hardware.

(B) Resolution of Addresses
      In order to resolve operations, the assembler has to be able to map
your variables and routines to actual addresses. At the lowest level,
the assembler will force the programmer to explicitly indicate every
address by the following devices:

                                                ORG Address

to specify that what follows is to begin at the indicated address, or

                                Var: DS Size

to reserve a space of indicated size for the indicated variable (in the
process incrementing $ by Size), or

                                Label:

to equate Label to $, the current address, or

                                Label EQU Address

to explicitly set the indicated label to the indicated address. The
assembler itself will not do any mapping beyond this.

      In an assembler at the next level, a programmer could specify a
segment for variables or code without explicitly indicating the
starting address. It's then up to the assembler to map this segment
to an appropriate location.

      Normally this kind of assembler will also allow separate assembly
of files (into an object file format), and will come with a linking
phase. It's during linking -- and is the main function of linking --
that the mapping of the segments will take place.

      Since addresses may be referred to before actually being defined,
and since they could even be used in expressions (such as in the EQU
example above), there has to be some way to defer the actual
creation of code until all the information about the address is known.

      In the traditional two-pass assembler, the first run through the
program is made to collect numeric values for addresses. The
mnemonics, like JMP and CALL, are not resolved in the optimal way when
referring to addresses not yet defined, and other operations (like JB,
JBC) could not be resolved in this way at all. Also, there's no easy
way to handle relatively addressed segments this way.

      So two-pass assembly is not really suitable for the ideal design
we're constructing here.

      A one-pass assembler will likewise collect addresses (*relative to
segments*) but also generate some intermediate code during the first
pass. Ideally, this code will be none other than the OBJECT FILE, and
it need not be at all similar to the binary which is finally created.
This is especially true considering how instructions like JB or JMP
(or XCH in our example above) are going to be mapped in the object
file.
      In place of the second pass, to create the actual assembly code,
there will be a linker which will combine the object files, mapping
addresses in the process. As part of the mapping process, operations
can also be resolved using a kind of "shortest-first" strategy, as
described above, to find the optimal fit.
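One way to picture the "shortest-first" strategy is as a loop that starts every jump in its short form and only grows the ones that turn out not to reach. The Python below is a simplified sketch, not a description of any existing linker: the item format and the names layout and relax are invented for the example, and only two forms are modelled (a 2-byte SJMP and a 3-byte LJMP), leaving AJMP out.

```python
SIZE = {"SJMP": 2, "LJMP": 3}


def layout(items, forms):
    """Assign addresses; return (label -> address, jump start addresses)."""
    labels, starts, addr, j = {}, [], 0, 0
    for kind, arg in items:
        if kind == "label":
            labels[arg] = addr
        elif kind == "bytes":
            addr += arg                  # arg bytes of other code or data
        else:                            # "jmp": arg is the target label
            starts.append(addr)
            addr += SIZE[forms[j]]
            j += 1
    return labels, starts


def relax(items):
    """Start all jumps short; grow those out of range until stable."""
    targets = [arg for kind, arg in items if kind == "jmp"]
    forms = ["SJMP"] * len(targets)
    changed = True
    while changed:
        changed = False
        labels, starts = layout(items, forms)
        for j, target in enumerate(targets):
            if forms[j] == "SJMP":
                disp = labels[target] - (starts[j] + 2)
                if not -0x80 <= disp <= 0x7F:
                    forms[j] = "LJMP"    # jumps only ever grow
                    changed = True
    return forms


program = [("jmp", "end"), ("jmp", "far"), ("bytes", 125),
           ("label", "end"), ("bytes", 200), ("label", "far")]
print(relax(program))
```

In this program the first jump fits on the first pass, but growing the second jump pushes its target one byte out of range, which is why the loop has to repeat until nothing changes. Since a jump is never shrunk again, the loop must terminate.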

      Since the yet-to-be-resolved addresses may even be involved in
expressions, the object file format has to be able to list not just
partially resolved addresses, but partially resolved expressions!
Still, it is convenient to restrict the types of expressions to the
following forms:
                                                Address + Number
                                                Address - Number
                                                Number + Address
                                                Address - Address

where a restriction is made that in expressions of the form (A1 - A2),
not only must A1 and A2 be addresses of the same type, but they must
come from the same segment. For purposes of making that distinction,
an absolute address can be considered to lie on the "absolute segment"
of the corresponding type (data, code, bit, etc.), so that A1 and A2
are allowed to be absolute addresses.

      These restrictions are the same as apply to pointer arithmetic in
C, where instead of "segment", you have a static or dynamic array (or a
single variable, which can be considered as a one-unit segment).
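These rules are small enough to sketch as a type check. The Python below is only an illustration: the Addr record and the add and sub helpers are hypothetical names, and a segment named "ABS" is assumed to stand for the absolute segment of a given type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Addr:
    kind: str      # address type: "code", "data", "bit", ...
    segment: str   # segment name; "ABS" marks the absolute segment
    offset: int    # offset from the (possibly still unknown) segment start


def add(a, b):
    """Address + Number, Number + Address, or Number + Number."""
    if isinstance(a, Addr) and isinstance(b, int):
        return Addr(a.kind, a.segment, a.offset + b)
    if isinstance(a, int) and isinstance(b, Addr):
        return add(b, a)
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    raise TypeError("Address + Address is not allowed")


def sub(a, b):
    """Address - Number, Address - Address (same segment), or numbers."""
    if isinstance(a, Addr) and isinstance(b, int):
        return Addr(a.kind, a.segment, a.offset - b)
    if isinstance(a, Addr) and isinstance(b, Addr):
        if (a.kind, a.segment) != (b.kind, b.segment):
            raise TypeError("addresses must share type and segment")
        return a.offset - b.offset       # a plain number, known at once
    if isinstance(a, int) and isinstance(b, int):
        return a - b
    raise TypeError("Number - Address is not allowed")


x = Addr("data", "S", 4)
y = Addr("data", "S", 10)
print(add(x, 3))
print(sub(y, x))
```

Note that Address - Address yields an ordinary number even while the segment's final location is still unknown, which is what makes this form safe to allow.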

      Other expressions may use the "numeric value" of an address. This
should be explicitly disallowed, except for those addresses which lie
on the absolute segment. Alternatively, if A denotes an address which
lies on a segment S, then the assembler can be set up to typecast A to
the number
                                                                A - S0

where S0 denotes the starting address of segment S.

      I know of no assembler which does all of what I've described above.

(C) User-Defined Operations and Directives
      A facility for defining your own operations (MACROS), somewhat
analogous to C's #define, should be included. In the ideal design,
the syntax used for mnemonics should be fully orthogonal, as described
above. Not only should they be fully orthogonal, but there should be
enough power in the ability to define macros that they could even be
user-definable in terms of the basic operation:
                                                                DB Byte

(which maps the given byte expression explicitly to the current
location in the current segment).
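As an illustration of what "defined in terms of DB" could amount to, here is a Python sketch of the byte sequences a user-defined JMP macro might hand to DB on the 8051. The function name jmp_bytes and its calling convention are assumptions made for this example; the encodings used are the standard ones (SJMP is 0x80 followed by the displacement, AJMP packs address bits 10..8 into the top of an opcode ending in 00001, LJMP is 0x02 followed by the 16-bit address, high byte first).

```python
def jmp_bytes(target, addr):
    """Bytes for the shortest jump from addr to target, as DB operands.

    addr is the address where the first byte will be placed.
    """
    nxt = addr + 2                       # $ for the two 2-byte forms
    disp = target - nxt
    if -0x80 <= disp <= 0x7F:
        # SJMP rel: opcode 0x80, then the displacement as one byte
        return [0x80, disp & 0xFF]
    if (target & 0xF800) == (nxt & 0xF800):
        # AJMP addr11: bits 10..8 go in the top three opcode bits
        return [((target >> 8) & 0x07) << 5 | 0x01, target & 0xFF]
    # LJMP addr16: opcode 0x02, high byte, low byte
    return [0x02, (target >> 8) & 0xFF, target & 0xFF]


print([hex(b) for b in jmp_bytes(0x0110, 0x0100)])
print([hex(b) for b in jmp_bytes(0x0345, 0x0100)])
print([hex(b) for b in jmp_bytes(0x1234, 0x0100)])
```

The three-way choice in the function body is exactly why the macro facility needs conditionals that can ask questions about an address.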

      First, this requires the ability for the macro to distinguish
between alternatives -- which implies that one must also have the
conditional statement:
                                                IF (Cond) S1 [ELSE S2]

where Cond is an expression indicating condition, where S1 and S2 are
assembler statements. It implies the ability to group 0, 1, 2 or more
statements together (even on the same line!):

                                                { S1; S2; ...; Sn }

and it implies that conditional expressions should be able to refer to
whether an address (and more generally: an expression) is relative or
absolute, and the type of an expression and the segment it lies on.

      Further, the statement and expression syntax should be fully
orthogonal, so that conditionals can be used in any context any other
expression can, and conditional statements or statement groups can be
used outside of macro definitions, even to group assembly statements
together on one line. Also, since one has to have the ability to
refer to the type of an expression, then the very names of these types
should be able to fully participate in the expression syntax, and
there should even be a type called "type".

      So there should be operators of the form

                                                TYPEOF Expression

to indicate an expression's type (thus: typeof (typeof E) == type),

                                                SEGMENT Address

to indicate the starting location of the indicated address's
segment, and
                                                OFFSET Address

which should be equivalent to (Address - SEGMENT Address). For
absolute addresses, (SEGMENT A) will be equal to the starting location
of the address space -- which normally will be 0.

      Among other things, all of this implies the need to name segments. The
easiest way to handle this is to just allow segments to be defined by the
directive:

[Label:] SEG [TYPE] [AT/ORG Address]

and to allow for the following equivalences:

                SEG Type AT A     <--->     SEG Type
                                            AT/ORG A

                X: SEG Type AT A  <--->     SEG Type AT A
                                            X:

When no starting address is explicitly indicated, the segment is then
considered to be relative and then the value of X will not be known
until the linker maps out segments.

      Since the ; is used to mark comments in many assemblers, including
those for Intel processors such as the 8086 and 8051, some amends have
to be made here to avoid a conflict of syntax. An easy fix is to
require all comments to begin with ;;, and to interpret two consecutive
";"'s as the start of a line comment instead of two statement
terminators. It's also useful to allow the full range of C++ comments
-- the line comments which start with //, and the /* ... */ comments
which may extend across several lines.
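The ;; convention is easy to try out. Here is a small Python sketch of splitting one source line into statements; split_statements is a name made up for this example, and to stay short it assumes there are no string or character literals on the line and ignores /* ... */ comments altogether.

```python
def split_statements(line):
    """Split a source line on ';', treating ';;' and '//' as line comments."""
    cut = len(line)
    for marker in (";;", "//"):
        pos = line.find(marker)
        if pos != -1:
            cut = min(cut, pos)          # the earliest comment marker wins
    code = line[:cut]
    # What is left is plain statements separated by single ';'.
    return [s.strip() for s in code.split(";") if s.strip()]


print(split_statements("MOV A,#1; INC A ;; bump the counter"))
print(split_statements("NOP // C++ style comment"))
print(split_statements(";; only a comment"))
```

One consequence of this rule is that an empty statement can no longer be written as two adjacent semicolons; a space between them is needed.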
