Re: Definable Operators

Craig Burley <burley@tweedledumb.cygnus.com>
30 May 1997 23:04:47 -0400

From comp.compilers

From:	Craig Burley <burley@tweedledumb.cygnus.com>
Newsgroups:	comp.compilers
Date:	30 May 1997 23:04:47 -0400
Organization:	Cygnus Support
References:	97-03-037 97-03-076 97-03-112 97-03-115 97-03-141 97-03-162 97-03-184 97-04-027 97-04-095 97-04-113 97-04-130 97-04-164 97-05-053 97-05-119
Keywords:	syntax, design

genew@vip.net (Gene Wirchenko) writes:

> Craig Burley <burley@tweedledumb.cygnus.com> wrote:
>
> [snip]
>
> >I'm in favor of much of that, but until we all agree that a language
> >that allows
> >
> > a = b + c /* whether a semicolon follows or not ;-) */
> >
> >to be redefined such that b or c can be modified, a be referenced
> >before it's modified, and so on is thus _worse_ as a _language_, and
> >that any code that actually performs such redefinitions is "wrong on
> >the face of it", we won't have achieved much. [...]
>
> Get off your high horse about "+" for string concatenation! I
> understand it isn't natural to you. You have made your point.

Actually, `+' for concatenation _is_ natural for _me_. I started
programming in BASIC at age 9. (As I'll explain further below,
this did come back to "haunt" me when I had to write in PL/I for a
few years.)

But designing a language is not best done by implementing just those
constructs "natural" to the author of the language. Care must be
taken to ensure that it doesn't encourage programmers who use it to
type in _valid_ code that means something _different_ than what they
expected, and, similarly, to _misread_ code that means one thing as if
it meant something else.

And, that's exactly what happens when people type in C++ code,
just as it happened when they typed in PL/I code. In PL/I, `+'
means addition. Period. No overloading. And it works on
strings! It does the "natural" thing -- it converts the strings
to numbers and adds them, then string-izes the results. Lots of
thought went into determining that this was the "right" thing
to do.

Problem was, too many of us didn't _intend_ for all this to happen
when writing such code in PL/I. Our typos and thinkos wouldn't be
detected in PL/I -- instead it would be compiled to code that might
work sometimes or often, but be slower than we expected (assuming we
could measure it). That bothered us, even though we had to admit that
the language implemented the "conceptually correct" thing. I think of
it as one of several similar examples of what caused PL/I to encounter
too much resistance to be successful as a language -- the lack of
"responsiveness" in the language<->machine interface. (Specifically,
you can imagine how frustrating it was to finally discover that an
apparently efficient statement like "recs = recs + new_recs;" ended
up calling two slow library routines because you'd typed "new_recs",
the string you read in from the user, instead of "n_new_recs", the
value you'd set from the converted "new_recs".
At least in that case, the bug didn't manifest itself, but in other
cases, your bugs just stayed hidden for a while.)
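
To make the "recs = recs + new_recs;" slip concrete, here is a rough
C++ sketch. The names pli_style_add, intended_add and
typo_goes_unnoticed are invented for illustration, and the conversion
is only a loose stand-in for what a PL/I compiler does (the real
conversion rules for precision and padding are more involved). The
point is just that both spellings give the same answer, so nothing
flags the mistake.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: loosely emulates a `+` that accepts strings by
// converting both operands to numbers, adding, and converting back.
std::string pli_style_add(const std::string& lhs, const std::string& rhs) {
    long sum = std::stol(lhs) + std::stol(rhs);  // two hidden conversions
    return std::to_string(sum);                  // and one more on the way out
}

// What the programmer meant: plain integer addition on the converted value.
long intended_add(long recs, long n_new_recs) {
    return recs + n_new_recs;
}

// Returns true when the string-based spelling and the intended spelling
// agree, which is exactly why the typo stays invisible.
bool typo_goes_unnoticed(long recs, const std::string& new_recs) {
    assert(!new_recs.empty());  // std::stol would throw on an empty string
    long n_new_recs = std::stol(new_recs);
    std::string via_strings = pli_style_add(std::to_string(recs), new_recs);
    return std::stol(via_strings) == intended_add(recs, n_new_recs);
}
```

Calling typo_goes_unnoticed(10, "5") returns true: the result is
right, and only the cost differs.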

In short, whether it's C++ or PL/I or what-have-you, there's a big
difference between DWIM (Do What I Mean) and DWYDISETIMSD (Do What
You've Decided I Said Even Though I Meant Something Different).
People who say "`+' for concatenation of strings is obviously right"
are as guilty of DWYDISETIMSD as are those who said "`+' means
addition regardless of data type". Even though the latter is simpler
and most easily taught, both have problems from a human-languages
point of view.

(In case some of you haven't figured it out yet and I haven't said it
already, I think a good C/Fortran-level language would do what Fortran
already does -- disallow `+' on strings. In particular, the
programmer hasn't _said_ in any clear way that a string contains
digits to be interpreted as decimal, so why should the language/compiler
assume that it does? People who want `+' to mean addition on strings
should ask for, and be required to use, a facility whereby they
express the semantic content of strings, e.g. "string contains integer
expressed as decimal digits", in the same sense that C programmers
indicate whether a "string" of bits is to be interpreted as a signed
or unsigned integer, instead of having programmers encode that
information in any operator or operation where the difference might be
important. People who want `+' to mean concatenation should simply be
told to use the concatenation operator. Neither audience will be
thrilled, but both will be less likely to write code that is buggy in
this area. E.g. I've never encountered a Fortran programmer who made
the kinds of mistakes I, and others, made in PL/I and I expect people
are making in C++, because the _compilers_ caught their mistakes.)
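
One way to picture that arrangement, sketched in C++ under my own
assumptions: Text, DecimalText, concat and has_plus are hypothetical
names, not anything from Fortran or from an existing library. Plain
text gets no `+` at all, addition is available only after the
programmer has declared that the text holds decimal digits, and
concatenation has its own name.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Plain text: deliberately has NO operator+ defined.
struct Text {
    std::string chars;
};

// Text the programmer has explicitly declared to hold decimal digits.
struct DecimalText {
    long value;
    explicit DecimalText(long v) : value(v) {}
    explicit DecimalText(const Text& t) : value(0) {
        assert(!t.chars.empty());  // std::stol would throw on empty input
        value = std::stol(t.chars);
    }
};

// `+` exists only where the operands have been declared numeric.
DecimalText operator+(const DecimalText& a, const DecimalText& b) {
    return DecimalText(a.value + b.value);
}

// Concatenation gets its own name rather than borrowing `+`.
Text concat(const Text& a, const Text& b) {
    return Text{a.chars + b.chars};
}

// Compile-time check that `Text + Text` is rejected.
template <typename T>
auto has_plus(int)
    -> decltype(std::declval<T>() + std::declval<T>(), std::true_type{});
template <typename T>
std::false_type has_plus(...);

static_assert(!decltype(has_plus<Text>(0))::value,
              "plain Text must not support +");
static_assert(decltype(has_plus<DecimalText>(0))::value,
              "DecimalText supports +, meaning addition");
```

With this, DecimalText(Text{"12"}) + DecimalText(Text{"30"}) adds,
concat(Text{"12"}, Text{"30"}) concatenates, and a bare
Text{"12"} + Text{"30"} is a compile-time error.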

> It is natural to me and I am starting to resent that you figure you
> can browbeat me or anyone else about something that is an opinion.
> Your opinion of what is natural is an OPINION, not an objective
> fact. My opinion is the same, BUT I KNOW IT.

Sorry, it is as much "fact" that `+' is "natural" to me as
concatenation when applied to strings as it is that I like chocolate,
or enjoy spending time with my wife, or love listening to Widor's
Symphony #10.

What is also objective fact, _given_ that many programmers find `+'
meaning addition on strings "natural" _and_ that many (more, I think)
find `+' meaning concatenation on strings "natural", is that there is
_no_ "obvious" meaning for `+' on strings. It is my _opinion_ that
the "givens" are true, but a very strongly held one (I've encountered
programmers from both camps), and if it is true, then the conclusion
that there _is_ no "obvious" meaning for `+' on strings is a logical,
factual, conclusion.

And I am, and have been in this thread, completely uninterested in
browbeating _anyone_ into changing what they think is "natural" for
`+' to mean _to them_. It's _precisely_ because I don't intend to try
and repeat that mistake of the designers of PL/I and C++ (to wit,
attempt to re-educate a large population to a meaning for an infix
operator applied to certain types that is different than the one they
already have burned into their heads) that I take the position that
there's no "obvious" meaning for it.

And it's the generalization of that kind of reasoning I've been able
to develop over the decades, to explain (not just _feel_) that PL/I
and C++ made mistakes even though, taken in isolation, each decision
made some sort of "sense", that justifies why arbitrary overloading of
existing, known, widely understood infix operators to mean things
_vastly_ different than they already do (_especially_ when the meaning
changes due to semantic, not syntactic, changes in the program) is
actually a _bad_ thing for a new artificial-language designer to rely
upon.

After all, while I'm not particularly interested in the specifics of
the details of the meaning of the C++ expression (assuming typical
libraries people are using with it these days)

    a = b + c + d;

when a, b, and d are strings but c is an int, I _do_ know what a
PL/I-type person would make of it: if c is the value 341, convert it
to the string "341", then proceed. So, that would be a reasonable
conclusion -- more so than some other meanings I can think of.

And, regardless of what C++ makes of _that_, I would _think_ the
above guess would accurately apply to this expression:

    myfile << b << c << d;

Just what _is_ the conceptual difference between the subexpressions `b
+ c + d' and `b << c << d' above? They both do concatenation, right?
If so, why do they have to _look_ so different? Presumably because
`myfile + b + c + d;' would look nonsensical; `myfile = b + c + d;'
would be even harder to disambiguate in cases like `myfile = c'; and,
similarly, `a = c << 0;' would be a pretty peculiar way to string-ize
the value (c*10) and likely misinterpreted, even though it'd be,
perhaps, valid.
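
For readers who want to poke at this, here is a small sketch of how
differently a few nearby spellings behave. It assumes the standard
library's std::string and std::ostringstream as the "typical
libraries" (my assumption) and an ASCII character set; the function
names are made up for the example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stream insertion: the int is formatted as decimal text.
std::string via_stream(const std::string& b, int c, const std::string& d) {
    std::ostringstream out;
    out << b << c << d;
    return out.str();
}

// Compound assignment: std::string::operator+=(char) accepts the int
// after an implicit conversion, so ONE character is appended, not digits.
std::string via_plus_equals(std::string b, int c) {
    assert(c >= 0 && c < 128);  // keep the demo inside 7-bit ASCII
    b += c;
    return b;
}

// With a C string on the left, `+` is pointer arithmetic: it skips
// the first c characters instead of appending anything.
std::string via_pointer_plus(const char* b, int c) {
    return std::string(b + c);
}

// On plain ints, `<<` is still a bit shift.
int via_shift(int c, int n) {
    return c << n;
}
```

So via_stream("n=", 341, ";") gives "n=341;", via_plus_equals("n=", 65)
gives "n=A" on an ASCII system, via_pointer_plus("hello", 2) gives
"llo", and via_shift(341, 1) gives 682. Four readings of what looks
like much the same idea.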

Now, consider what might have happened had C++ been granted, instead
of _two_ comment lexemes and _no_ concatenation lexeme, one of each,
with `//' meaning concatenate (in the general sense). The above two
expressions might have turned into:

    a = b // c // d;

    *myfile = b // c // d;

(I use the `*' because, presumably, one wants to write into the underlying
stream, not the pointer to the stream, but I see a whole 'nother kettle
of fish there anyway -- it's not pertinent to my point.)

Which pair of expressions seems to more _consistently_ express the
concept of concatenation and, perhaps, the stringizing of `c' (if that
is allowed, which, personally, I don't really recommend)?

I claim the latter two do, and that C++ would have offered its users
and library-designers a _substantially_ superior language given even
this one little change. (Though to really make it useful, careful
thought would have had to be applied to take the advantages further,
e.g. `b // c // d = *myfile;' to _read_ a series of values from a
stream.)

Instead, it seems to me there is _one_ fundamental reason C++ has
no concatenation operator:

    Because _any_ existing operator could be overloaded to mean
    `concatenate' on _certain_ data types, so it seemed "natural"
    to leave, e.g., `+' as a choice for this.

IMO, if it wasn't for what I have called the "stupid pet-trick" of
arbitrary operator overloading being applied to C++ during its design,
C++ _would_ have had a concatenation operator. And an exponentiation
operator. And logical AND and OR operators (instead of forcing a
choice between bitwise and if/then/else-style operators). Etc. And
it would have been vastly better as a language, especially if it had
_mandated_ that the _existing_ (C) meanings of operators could not be
changed _even if_ the mandate could not always be implemented as a
compile-time check. (Because there would have then been new lexemes
introduced explicitly as a grab-bag of operators, which, while not
ideal, would at least not tend to mislead readers as much as the
current grab-bag, `+', `=', and so on, do.)

After all, even BASIC, which has no concatenation operator, has an
exponentiation operator! And, of course, PL/I and Fortran both have
concatenation operators. (IIRC, PL/I has exponentiation as well.) How
can a supposedly "new" language that people have widely hailed as
"replacing" existing languages (including C and Fortran -- even
Fortran 90) not have such important operators? (The answer: because
the author figured any operator could mean them, so it was unimportant
to actually provide them at the lexical level. That's what people
who've been claiming in this thread that arbitrary operator
overloading is _always_ a win have been saying, in effect -- with all
that overloading, you need only a few lexical operators, right?)

The advantage of these extra operators meaning _one_ thing is that
it's simply _vastly_ easier to read code written using them, because
it can be read _without_ first knowing all the semantic details of the
underlying operands. Fortran's type promotions do still bite people
today (e.g. people who wonder why something akin to `R**(2/3)', though
typically using variables, always returns 1.0), but its sins are
trivial compared to the ones C++ added to C, and _those_ sins will be
comparable to the ones perpetuated by pretty much everyone who
approaches new-language design with the attitude that "arbitrary
operator overloading is wonderful".
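
The `R**(2/3)' surprise carries over directly to C and C++, which
have the same integer-division rule. A minimal sketch, with std::pow
standing in for `**' and with function names of my own choosing:

```cpp
#include <cassert>
#include <cmath>

// Mirrors the Fortran slip R**(2/3): both operands of `/` are
// integers, so the quotient is 0, and anything raised to 0 is 1.
double two_thirds_power_buggy(double r) {
    return std::pow(r, 2 / 3);
}

// What was meant: a floating-point exponent of two thirds.
double two_thirds_power(double r) {
    assert(r >= 0.0);  // avoid a fractional power of a negative number
    return std::pow(r, 2.0 / 3.0);
}
```

two_thirds_power_buggy(8.0) returns 1.0, while two_thirds_power(8.0)
returns a value very close to 4.0.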

Another example: do you think it is _obvious_ what the command

    copy a b

does on an arbitrary file-system shell? If you do, you're wrong,
because no matter _what_ you think it means, it means something _else_
to someone else. It might have a _natural_ meaning to _you_, but
it'll have a (dangerously) _different_ natural meaning to someone
else. (To wit: today, most people would say "it means copy file a
into file b", but there have been systems where it meant "copy file a
from file b", and people did use those systems in a natural way. As
some have pointed out to me in the past, back when I too thought I
knew what was "obvious" for everyone, the latter meaning is more
consistent with computer languages in the sense that it is more like
"let a=b". Lesson learned -- it is natural for me to see it as "copy
a into b", but it is natural for some people to interpret it as "copy
a from b". Therefore, there _is_ no "obvious" meaning for it, no
matter how many bearded academic types bang their fists onto their
podiums claiming otherwise.)

What am I advocating as a solution? Well, not being a language or
language-design expert, I _suggest_ a fundamental _component_ of the
solution is to design a language so that it _syntactically_ expresses
meaning in a way that is as "obvious" as possible, for as wide a range
of expressions people are likely to want to use in the language as
possible.

For example, I would _think_ few people would misunderstand what
this is trying to do:

    copy a <- b

Or this:

    copy a -> b

(Notice the introduction of two new lexemes that could be reserved to
mean "copy contents from" and "copy contents into", respectively.
Their visual appearance makes the meaning far clearer than the typical
operand overloading. Of course, the fact that people constantly type
commands into shells makes this a less-than-persuasive example, but,
thankfully, people designing GUIs have not made the mistake of having
the action of dragging a file X into a folder DIR mean "concatenate
all files in DIR and write the result into file X".)
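
A sketch of how a shell might handle the two arrow forms, written in
C++. The command syntax is the hypothetical one above; CopyCommand,
parse_copy and describe are names I made up. The design choice worth
noting is that a bare "copy a b" is rejected instead of being given
one of its two "natural" meanings.

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct CopyCommand {
    bool ok;
    std::string source;
    std::string destination;
};

// Parses the hypothetical forms "copy X -> Y" and "copy X <- Y".
// A bare "copy X Y" is rejected rather than guessed at.
CopyCommand parse_copy(const std::string& line) {
    std::istringstream in(line);
    std::string verb, left, arrow, right, extra;
    CopyCommand bad = {false, "", ""};
    if (!(in >> verb >> left >> arrow >> right) || (in >> extra)) return bad;
    if (verb != "copy") return bad;
    if (arrow == "->") return CopyCommand{true, left, right};  // X into Y
    if (arrow == "<-") return CopyCommand{true, right, left};  // X from Y
    return bad;
}

// Spells out the direction of a successfully parsed command.
std::string describe(const CopyCommand& cmd) {
    assert(cmd.ok);  // only meaningful for a successfully parsed command
    return "read " + cmd.source + ", write " + cmd.destination;
}
```

describe(parse_copy("copy a <- b")) yields "read b, write a", and
parse_copy("copy a b").ok is false.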

Similarly, it's hard to imagine someone misunderstanding

    myfile = open_stream ('myfile', 'r');
    ...
    *myfile = b // c // d;

(although some might have trouble _understanding_ it. It's a Holy
Grail to design a language that is easy for even a total newcomer to
understand when he reads real code written in it. I'm asking for
something quite achievable, which is, that such a newcomer should be
very unlikely to _mis_-understand the code, in that he should either
understand it, or be aware that he's confused by it. It is, in this
sense, why MIT TECO might be superior, as a language, to C++ ;-).)

A guide to doing this kind of design is to avoid reliance on semantic
information -- especially the kind that tends to get put into separate
files or otherwise "far away" from the code that would have said
reliance -- when designing the syntactic (expressive) components on
the language.

> Perhaps you would care to argue which is The One True Programming
> Language or the One True Natural Language?

No, since I already understand that people have naturally different
interpretations of identical expressions due to having different
backgrounds, experiences, and so on. I think we've already created a
Tower of Babel even in the field of artificial languages (witness the
rat-hole of the meaning of `+' on strings) such that it'd be
impossible to design One True Language in the sense that PL/I or Ada
tried to be.

Maybe what I'm arguing for is the One Untrue Programming Language -- a
language designed such that, any time the answer to "<expression>
obviously always means <thing>" can legitimately be "untrue -- some
people feel it naturally means <other-thing>", then <expression> is
either simply disallowed, or suffers some similarly dire fate.

> BTW, by your opinion, the latter can't be English. English
> overloads plus to use it for string concatenation. If I say "Good
> morning." plus "How are you today?", then I've said "Good morning.
> How are you today?".

Yes, you're using semantic information to disambiguate the overloaded
operator. That's often what fans of operator overloading resort to to
show how wonderfully easy it is to disambiguate it, though nobody
(still) has even _tried_ to tell me what `(a) & (b)' means in C (since
they can't), much less attempt to tell me what `a = b + c;' means in
C++. (Just as I have yet to get any further challenge from Fortran
programmers who want .EQ. and .NE. to work for LOGICAL variables, now
that I've written an apparently irrefutable explanation of why not in
the g77 documentation.)
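
For anyone who hasn't run into the `(a) & (b)' case, here is a small
sketch. It compiles as C++, the same two readings exist in C, and the
namespaces and function names are mine, there only so both readings
can sit in one file. The token sequence is identical in both
functions; only the declaration of `a', which could be arbitrarily
far away, decides what it means.

```cpp
#include <cassert>

namespace as_bitwise_and {
// Here `a` names a variable, so `(a) & (b)` is a bitwise AND.
int evaluate(int a, int b) { return (a) & (b); }
}

namespace as_cast {
// Here `a` names a type, so the same tokens take the address of `b`
// and cast that address to type `a`.
typedef int* a;
int* evaluate(int& b) { return (a) & (b); }
}

// Exercises both readings of the same expression text.
void self_check() {
    int x = 5;
    assert(as_bitwise_and::evaluate(6, 3) == 2);  // 110 & 011 == 010
    assert(as_cast::evaluate(x) == &x);           // address of x
}
```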

Q: "How many boxes are in those two storage compartments?"

Which is the better answer:

A1: "Twelve plus ten".

A2: "Twenty-two total".

A3: "Twelve and ten."

A4: "Twelve in one, ten in the other."

A5: "Twelve boxes of napkins, ten of tissues."

Looking carefully at the answers, are you _really_ sure you can cope
with A1 as a useful answer? I claim you can't, because you don't know
whether the person really meant the equivalent of A2, A4, or A5,
_each_ of which is _superior_ to both A1 (because `plus' is, indeed,
overloaded) and A3 (which is, unfortunately, overloaded as well --
`and' sometimes means `plus'). Yet, the _person_ likely knows what he
means -- if he gives A1 or A3 as an answer, he's almost certainly
chosen (perhaps without being aware of it) a more ambiguous response
than he was capable of making. (A4 and A5 are refinements in
knowledge over A2, but of course the person responding might not have
sufficient knowledge to choose the appropriate refinement. I can't
figure out any way he'd know enough to respond with A1 or A3 but not
A2, however, which is why I claim even A2 is superior to A1 and A3,
since it can't be as easily misinterpreted.)

This is much like the question from the tower that was, IMO, a gating
factor in allowing those 300 or so airline passengers to die. The
question "Is everything okay there?" is vastly more ambiguous than "We
show you at 1200 feet altitude, is that really what you intend?". The
tower _meant_ the latter, but the cockpit crew heard something much
more like "Are you sure you guys know how to fly that plane?". Of
course, taking even more human factors into account, the best question
I can think of would have been "What is your current altitude?" asked
in a forceful way, but that gets beyond the issue of overloading.
Anyway, the cockpit crew answered "yup, everything's fine here", and
didn't notice the altitude until they were at something like 300 feet,
when it was too late.

> Since English does this so-called Evil Thing, perhaps you should
> stop using it?

English is already widely understood to have _many_ problems as a
language, this (words and phrases meaning way too many somewhat
similar things) being one of them. Culturally speaking it's wonderful
in a lot of ways, but there apparently are natural languages that
offer a lot less ambiguity in common lexemes and expressions. We're
paying a huge worldwide price for having chosen English over others;
I'm not sure how that price compares to the advantages of _having_ one
language predominate as much as English has, however.

In any case, I can't stop using English because it's my native
language; because it's the world's #1 language for too many things;
and so on.

I'd _like_ to avoid seeing C++ (or any language designed by someone
who thinks arbitrary operator overloading is "cool") end up being the
English of the computer world. Certainly I'm going to do whatever is
reasonable to avoid making C++ a "native" language for me, since I'm
not interested in writing code in a language few people can
effectively read. (I have no doubt I can write in it effectively.)

My other answer to the above-quoted question is:

    I'm not saying people must all stop using languages with too
    much overloading.

    I _am_ saying people who are _designing_ languages must _stop_
    treating overloading as a feature that is legitimate to use in
    any context where it can be correctly parsed by a finite automaton.

    And, I am _trying_ to encourage people who _use_ new languages to
    insist that the languages are easy to describe at the _lexical_
    level in terms of the meaning of the constructs, so that, as
    users of languages, they don't encourage the creation of more
    linguistic failures such as C++. (And the clarity of meaning
    should not trail off after just the lexical level, either.)

> [I don't think the question here is naturalness, it's consistency.
> Arithmetic addition is commutative, string concatenation isn't. But
> I do wish that Kemeny and Kurtz had picked a different concatenation
> operator than + so we could have avoided this whole argument. -John]

Indeed. It'd be nice to have an artificial language that had nice,
distinct, lexical operators for not only the math operators and
concatenation, but things like implication, set membership, and so on.
I don't even care if the language _implements_ all this stuff, as long
as it gives me the tools to express it and then use an implemented
subset of the language to write code that parses those expressions and
implements those expressions as I see fit (but, if I'm doing a good
job, not in ways that are inconsistent with the already-established
_meanings_ of those expressions). C++ already has lots of the nifty
underlying facilities to do this, as do other languages, so I think
the only significant effort in designing such a new language is the
actual design work -- not the implementation -- and, of course, the
promotion in the public space. If I had the role of doing this task,
one major chunk of funding would go to deploying a program that would
_translate_, as cleanly as possible, Fortran and its various dialects
into the new language.

(I'm not sure whether I'd bother with a similar translator for C, and
I'm sure I wouldn't for C++, because users of those languages have
generally decided against the importance of elegance of expression
anyway, while Fortran programmers seem to largely appreciate it, from
my experience. For them, a low-level converter a la `f2c' that simply
illustrates whatever performance improvements the new language might
obtain for them would probably be all that is worth doing.)

The "problem" with `+' is that _enough_ people see it as "naturally"
meaning "addition", "commutative", "only references its operands",
"references only its operands", "has no side effects", and so on, such
that it then _becomes_ inconsistent when it is redefined for some
contexts in ways that conflict with these natural meanings -- and
these contexts often have no clear markers that communicate the
presence of such inconsistencies. (Sorry, but twelve #include
directives at the top of the source listing do _not_ constitute a
"clear marker", though in specific instances, they might be
sufficiently clear for a sufficiently small, "expert" audience. But,
for not-very-different values of "clear" and "small", the same could
be said for all sorts of stuff written in assembly code.)
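
To show that these are not hypothetical worries, here is a sketch of
a perfectly legal C++ `+' that modifies one of its operands, plus a
check that concatenation spelled `+' does not commute. Pile and the
function names are invented for the example; no real library is being
quoted.

```cpp
#include <cassert>
#include <string>

// Hypothetical type whose `+` breaks the usual expectations on purpose.
struct Pile {
    int items;
};

// Legal C++: this `+` empties its right-hand operand as a side effect.
Pile operator+(const Pile& lhs, Pile& rhs) {
    Pile merged = {lhs.items + rhs.items};
    rhs.items = 0;  // an operand is modified by "addition"
    return merged;
}

// Returns true when `a = b + c` left `c` changed.
bool plus_changed_its_operand() {
    Pile b = {12}, c = {10};
    Pile a = b + c;
    assert(a.items == 22);
    return c.items == 0;
}

// Concatenation spelled `+`: does the order of operands matter?
bool concat_commutes(const std::string& x, const std::string& y) {
    return x + y == y + x;
}
```

plus_changed_its_operand() returns true, and concat_commutes("ab",
"cd") returns false. Nothing at the point of use, `a = b + c;', hints
at either fact.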

Ultimately, my question to the language designers out there who think
arbitrary overloading of existing operators is a wonderful thing for
you to rely upon in designing your language is:

    Exactly what is your language giving me, in terms of
    _expressiveness_, that I cannot already achieve, at
    least as portably and efficiently, using tools like
    `lex' and `yacc' (e.g. PCCTS) combined with a good
    code-generator like `gcc' (or even `g++', if I want
    all those nifty libraries with just a more readable
    interface)?

Point being, once you discard the idea that an operator always _means_
one clear thing (as in "`+' means addition"), what use _is_ your
_language_ design?

(Note: in place of "discard" above, I had typed "throw out" -- another
example of evil overloading in English, since some readers might have
taken that as "put forward".)

I know what kind of answer I'm looking for, and what kind of answer
I'll make sure I provide if I ever do design a new general-purpose
programming language, but I'm not going to give it here. This thread
has gone on long enough!
--
James Craig Burley, Software Craftsperson burley@gnu.ai.mit.edu
[This is absolutely definitely the last message on this topic. But
I can't help but point out that in PL/I, "123" + "456" is " ". If
you don't believe me, look it up. -John]
