Re: What stage should entities be resolved?

Kaz Kylheku <480-992-1380@kylheku.com>
Fri, 18 Mar 2022 17:50:22 -0000 (UTC)

          From comp.compilers

Related articles
Re: What stage should entities be resolved? christopher.f.clark@compiler-resources.com (Christopher F Clark) (2022-03-12)
Re: What stage should entities be resolved? DrDiettrich1@netscape.net (Hans-Peter Diettrich) (2022-03-14)
Re: What stage should entities be resolved? costello@mitre.org (Roger L Costello) (2022-03-15)
Re: What stage should entities be resolved? DrDiettrich1@netscape.net (Hans-Peter Diettrich) (2022-03-18)
Re: What stage should entities be resolved? gah4@u.washington.edu (gah4) (2022-03-17)
Re: What stage should entities be resolved? 480-992-1380@kylheku.com (Kaz Kylheku) (2022-03-18)
Re: What stage should entities be resolved? gah4@u.washington.edu (gah4) (2022-03-18)
Re: What stage should entities be resolved? martin@gkc.org.uk (Martin Ward) (2022-03-19)
Re: What stage should entities be resolved? matt.timmermans@gmail.com (matt.ti...@gmail.com) (2022-03-20)

From: Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups: comp.compilers
Date: Fri, 18 Mar 2022 17:50:22 -0000 (UTC)
Organization: A noiseless patient Spider
References: 22-03-019 22-03-025 22-03-032
Keywords: design
Posted-Date: 18 Mar 2022 14:00:37 EDT

On 2022-03-15, Roger L Costello <costello@mitre.org> wrote:
> But if PI is inside a quoted string:
>
> "Today is PI day"
>
> then the preprocessor does not replace PI.


This was not historically true, though. I know for a fact that ancient
versions of the C preprocessor would have replaced PI in character
constants, as in 'PI'. Possibly, some versions of it also didn't
care about quotes.
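
For a concrete (if contrived) illustration, assuming a modern,
standard-conforming preprocessor:

    #define PI 3.14159

    double d = PI;      /* PI is expanded: becomes 3.14159 */
    int    c = 'PI';    /* multi-character constant; a modern preprocessor
                           leaves it alone, but the ancient ones described
                           above would have expanded PI even here */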


> 1. How much knowledge of the language should the preprocessor stage have?


A preprocessor doesn't have to have any knowledge of the language
being preprocessed.


If the preprocessor is universal, like m4, that is practically a given.


If a preprocessor is dedicated to a language, like the C one,
it's a good idea for it to work at the token level: tokenize its
input the same way, or at least very similarly, to the target language.


In C, a preprocessor token is a kind of superset of a proper token.
There are pp-number tokens that are not number tokens, for instance,
but not vice versa.
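
To make that concrete with a contrived example of my own: the characters
0x1E+TWO below, written without spaces, form a single pp-number, which is
not convertible to any proper token, and TWO is never even seen as a
macro name:

    #define TWO 2

    int a = 0x1E + TWO;   /* fine: three tokens, TWO is expanded */
    int b = 0x1E+TWO;     /* error: "0x1E+TWO" is one pp-number, so TWO is
                             not expanded and token conversion fails */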


> 2. How much knowledge of the language should the lexical analysis stage have?
>
> 3. How much knowledge of the language should the syntax analysis stage have?


The traditional lexical/syntax division comes from a blueprint for
programming language processing laid down by computer scientists
in the 1960s and 1970s.


It is not necessary; a context-free grammar can handle the entire syntax
down to the token level, like a decimal integer being a sequence of
digits and whatnot.


The division into lexical analysis and parsing comes from several
pragmatic observations:


1. There are dedicated algorithms that are good for tokenizing.
        Tokens exhibit regular syntax: they can be recognized using
        finite automata (regular expressions).


2. There are efficient parsing techniques like LALR(1) that assume
        that there is one symbol of lookahead. If you use LALR(1)
        with a scanner, you get one *token* of lookahead.
        If you use LALR(1) without a scanner (token == character)
        then you get one *character* of lookahead, which is going to
        be crippling. Thus, separating lexical analysis effectively
        amplifies LALR(1) into LALR(k), for some k related to maximum
        token character length.


E.g. if you design a C parser without a scanner, over the character
level syntax, when that parser sees the characters "do", it
has no idea whether it's looking at the "do" keyword, the
start of the "double" keyword, or something else like "dongle".
One character of lookahead will confirm or eliminate "do",
but not resolve "double" versus "dongle".
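
Here is a minimal sketch (hand-written, with made-up names) of the kind of
maximal-munch classification a separate scanner does, so that the parser's
single symbol of lookahead is a whole keyword or identifier rather than a
single character:

    #include <ctype.h>
    #include <string.h>

    enum token { TOK_DO, TOK_DOUBLE, TOK_IDENT };

    /* Scan one identifier-or-keyword starting at *p; advance *p past it. */
    static enum token scan_word(const char **p)
    {
        const char *start = *p;
        while (isalnum((unsigned char)**p) || **p == '_')
            (*p)++;                          /* maximal munch */
        size_t len = (size_t)(*p - start);
        if (len == 2 && memcmp(start, "do", 2) == 0)
            return TOK_DO;
        if (len == 6 && memcmp(start, "double", 6) == 0)
            return TOK_DOUBLE;
        return TOK_IDENT;                    /* e.g. "dongle" */
    }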


So to answer the questions, if you're assuming that you're going to be
using the traditional framework, with a regular-expression-driven
lexer and an LR parser with 1 token of lookahead, the way you divide
the work, roughly, is by identifying the token-like elements in the
language that have regular syntax. Anything that exhibits nesting,
requiring rule recursion, will be farmed off to the parser. Your
decision could sometimes also be informed by the lookahead concern;
you could choose to clump together several items that might plausibly
be individual tokens into a "supertoken" if there is some parsing
advantage in it, or other simplification.


Other kinds of decisions interact with the language definition.
For instance, what is -1234? In Common Lisp, and most other Lisp
dialects, I suspect, that is a token denoting a negative integer.
It may not be written - 1234. In C, and languages imitating its syntax,
-1234 is a unary expression: the unary - operator applied to the
integer 1234, and so there are two tokens. They may be separated
by whitespace, or a comment.
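
For example, both of these are accepted by a C compiler, precisely because
the - and the 1234 are separate tokens:

    int a = -1234;                  /* unary minus applied to the constant 1234 */
    int b = - /* negate */ 1234;    /* same value: still two tokens */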


This strikes at the language definition, because what it implies is
that C does not have negative constants.


And that causes a subtle problem.


Say the int type is 32 bits wide.


The most negative 32 bit two's complement integer is -2147483648.


But in C this is actually the negation of 2147483648 which does
not fit into the type int. The resulting constant expression will
not have type int.


Defining the INT_MIN constant in <limits.h> such that
the expression has type int and doesn't contain any casts
requires something like:


    #define INT_MIN (-2147483647 - 1)
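
A small check of my own shows the difference (the exact sizes assume a
typical platform where int is 32 bits and the unsuffixed constant
2147483648 ends up in a 64-bit type):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        printf("%zu\n", sizeof (int));          /* typically 4 */
        printf("%zu\n", sizeof (-2147483648));  /* typically 8: not an int */
        printf("%zu\n", sizeof (INT_MIN));      /* 4: (-2147483647 - 1) is an int */
        return 0;
    }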


> 4. How much knowledge of the language should the semantic analysis stage have?


All knowledge that the previous stages do not have, minus
knowledge that only the application domain has. :)


>
> To make the questions concrete, consider this XML:
>
><foo>…</foo>
>
> Should the lexical analysis stage know that the foo in <foo> is a
> start tag (STAG) and the foo in </foo> is an end tag (ETAG)?


XML has structure inside tags, namely attributes. Various implementation
choices present themselves.


One possibility is to recognize the special cases <abc>, </abc>
and <abc/> as special token categories, e.g.


  ELEM_START_TRIVIAL
  ELEM_END
  ELEM_COMPLETE


All three token categories will carry a semantic attribute on the token
which gives the identity of the element, e.g "abc".


Then consider a case like <abc foo="bar">:


The "<abc" part could produce a ELEM_START_OPEN token (distinct from
ELEM_START_TRIVIAL). After this token, zero or or more attributes
are expected, followed by ELEM_START_CLOSE produced by the angle
bracket.


The parser has to recognize that ELEM_START_OPEN has been seen,
and then parse the attribute tokens, looking for ELEM_START_CLOSE.


Note that ELEM_START_TRIVIAL is then not necessary, because
<a> could produce ELEM_START_OPEN followed immediately by
ELEM_START_CLOSE.
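
Putting that together in a rough sketch (the names and the struct are mine,
not any established API), the token stream for <abc foo="bar">hello</abc>
might look like this:

    enum xml_token_kind {
        ELEM_START_OPEN,    /* "<abc"    carries the name "abc"        */
        ATTR_NAME,          /* foo                                     */
        ATTR_VALUE,         /* "bar"     with the quotes stripped      */
        ELEM_START_CLOSE,   /* ">"                                     */
        TEXT,               /* character data between tags             */
        ELEM_END,           /* "</abc>"  carries the name "abc"        */
        ELEM_COMPLETE       /* "<abc/>"  carries the name "abc"        */
    };

    struct xml_token {
        enum xml_token_kind kind;
        const char *text;   /* semantic attribute: name or character data */
    };

    /* Emitted sequence for <abc foo="bar">hello</abc>:
       ELEM_START_OPEN("abc"), ATTR_NAME("foo"), ATTR_VALUE("bar"),
       ELEM_START_CLOSE, TEXT("hello"), ELEM_END("abc")                */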


> Or should the lexical analysis stage simply identify
> the foo in <foo> as a name (NAME) and the foo in </foo> as a name
> (NAME)? That would mean the lexical analysis stage has lesser
> knowledge of the XML language. How much knowledge of the XML language
> should the lexical analysis stage have?


I would say that because the attributes have special syntax with
quoting like foo="bar", it would be better for the lexer to be aware
of the tag structure, so that it can switch to a mode for precisely
handling attributes.


Consider <abc foo="bar">foo="bar"</abc>. Do you want to tokenize the
attribute foo="bar" in the same way as the interior of the element?


Chances are you want to turn the entire content of the <abc> element
into a single text datum, if it doesn't contain any sub-elements.
The quotes and equal sign are not special in any way.


Also, attribute values can contain a closing angle bracket; that is
to say, <abc foo="a>b"> is valid (a literal < is not allowed there, though).


If the tokenizer is oblivious to all this, it will lead to ugly
complexity in the parser.
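
A minimal sketch of that mode switch (hand-written lexer assumed; the names
are made up): outside a tag, '"' and '=' are ordinary character data, and a
'>' inside a quoted attribute value does not end the tag:

    enum lex_mode { MODE_CONTENT, MODE_TAG, MODE_ATTR_VALUE };

    struct lexer {
        const char *p;
        enum lex_mode mode;
    };

    /* Advance one character, updating the mode. A real lexer would emit
       tokens here; this only shows where the mode transitions happen. */
    static int lex_step(struct lexer *lx)
    {
        int c = (unsigned char)*lx->p;
        if (c == '\0')
            return -1;
        lx->p++;

        switch (lx->mode) {
        case MODE_CONTENT:
            if (c == '<')
                lx->mode = MODE_TAG;         /* names, '=', quoted values follow */
            break;
        case MODE_TAG:
            if (c == '"')
                lx->mode = MODE_ATTR_VALUE;  /* '>' loses its special meaning */
            else if (c == '>')
                lx->mode = MODE_CONTENT;     /* back to plain character data */
            break;
        case MODE_ATTR_VALUE:
            if (c == '"')
                lx->mode = MODE_TAG;
            break;
        }
        return c;
    }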


As a final general remark, languages are often defined at multiple
description levels which can be identified as lexical and syntactic.
If you're implementing, it behooves you to follow those separations
in the spec, except when your own experience tells you to adjust the
interpretation here and there: for example, when you recognize that some
syntactic aspect is more naturally handled by your scanner, or vice versa.


The W3C's definition of XML uses one big grammar, without regard
for separation into lexing and parsing. However, you can recognize
regular elements in it. Consider this grammar rule:


      Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'


where:


      Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
               [#x10000-#x10FFFF]   /* any Unicode character,
                                       excluding the surrogate blocks,
                                       FFFE, and FFFF. */


Note that this is entirely regular: we can convert the Comment match
entirely into the rules for a lexical analyzer, and therefore
Comment can be a token. So that would be the main principle here:
scan the tree structure of the XML grammar and look for the tallest
subtrees that are entirely regular, considering those to be token
candidates. Or something like that.
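
For instance, here is my rough translation of the Comment rule into a single
POSIX extended regular expression ([^-] stands in for (Char - '-'); a real
scanner would also restrict Char to the legal code point ranges):

    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t re;
        const char *pat = "^<!--([^-]|-[^-])*-->$";

        if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0)
            return 1;
        /* regexec returns 0 on match, REG_NOMATCH otherwise */
        printf("%d\n", regexec(&re, "<!-- a comment -->", 0, NULL, 0)); /* 0 */
        printf("%d\n", regexec(&re, "<!-- a -- b -->", 0, NULL, 0));    /* nonzero */
        regfree(&re);
        return 0;
    }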


--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

