# 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648

## "Edsko de Vries" <devriese@cs.tcd.ie>31 Jan 2006 09:57:37 -0500

From comp.compilers

Related articles
1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 devriese@cs.tcd.ie (Edsko de Vries) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 torbenm@app-0.diku.dk (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 rsc@swtch.com (Russ Cox) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 jm@bourguet.org (Jean-Marc Bourguet) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 gneuner2@comcast.net (George Neuner) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 rivers@dignus.com (Thomas David Rivers) (2006-01-31)
[5 later articles]
| List of all articles for this month |

 From: "Edsko de Vries" Newsgroups: comp.compilers Date: 31 Jan 2006 09:57:37 -0500 Organization: http://groups.google.com Keywords: lex, arithmetic, question Posted-Date: 31 Jan 2006 09:57:37 EST

Hi,

I have a question regarding lexical analysis. I recently came across a
bug in our lexical analyser in phc (www.phpcompiler.org), that I am
unsure how to solve. This is the problem: our current definition for
integer constant looks something like

INT ([1-9][0-9]*)|0

In particular, note that it does not allow for an (optional) "+" or "-"
at the start of the integer. This means that the strings "1 - 1", "1
-1" and "1-1" all generate the same sequence of three tokens INT(1),
OP(-), INT(1), for which the syntax analyser generates the subtree
BIN_OP(-, 1, 1).

For the string "1 - -1", the lexer (unsurprisingly) generates INT(1),
OP(-), OP(-), INT(1). The syntax analyser recognises this as BIN_OP(1,
UNARY_OP(-, 1)). In other words, the second "-" is treated as a unary
operator, rather than as part of the number.

This works fine, with the sole exception of the number "-2147483648".
The problem is, of course, overflow: -2147483648 is a valid negative
number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
valid positive number. Thus, the above method of dealing with "-" as a
unary operator breaks down.

The solution is to interpret the "-" as part of the number, and
generate INT(-2147483648), rather than OP(-), INT(...). However,
changing the definition of INT to

INT [+-]?([1-9][0-9]*)|0

causes "1-1" to be recognised as INT(1), INT(-1), which is incorrect.
Even requiring a space before the operator

INT (" "[+-])?([1-9][0-9]*)|0

does not help, because "1 -1" will be recognised as INT(1), INT(-1)

Is there a good solution that solves both problems? At the moment, I
have added a new rule to the lexer that matches this exact number,
which solves the problem - but it's a hack.. (actually, there are two
rules, one for 32-bit machines, one for 64-bit machines..)

Thanks,

Edsko

Post a followup to this message