Example: The Full Syntax Of Prolog (Re: precedences vs. hierarchy)

rockbrentwood@gmail.com
Wed, 24 Aug 2016 15:45:34 -0700 (PDT)

From comp.compilers

Related articles
precedences vs. hierarchy bassobajo@gmail.com (Andreas Schramm) (2016-06-06)
Re: precedences vs. hierarchy federation2005@netzero.com (2016-06-06)
*Example: The Full Syntax Of Prolog (Re: precedences vs. hierarchy) rockbrentwood@gmail.com* (2016-08-24)**

| List of all articles for this month |

From:	rockbrentwood@gmail.com
Newsgroups:	comp.compilers
Date:	Wed, 24 Aug 2016 15:45:34 -0700 (PDT)
Organization:	Compilers Central
References:	16-06-001 16-06-006
Injection-Info:	miucha.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="82272"; mail-complaints-to="abuse@iecc.com"
Keywords:	parse, syntax
Posted-Date:	25 Aug 2016 11:17:15 EDT

On Tuesday, June 7, 2016 at 9:54:57 AM UTC-5, federat...@netzero.com wrote:
> Prolog's expression syntax is non-terminal tagged (the "term" is the only
real
> non-terminal in the language).

For reference I list a detailed Prolog grammar below -- as an indexed
precedence grammar. For those who like to use parser generators I offer the
following challenge to you: either (a) find a way to directly encode this
grammar in your favorite parser generator other than by the brute-force method
(used for instance in the ISO grammars for C and C++) of unrolling all 1201 of
the precedence levels out!; (b) or find a way to modify and/or extend the
capabilities and functionality of your parser generator so as to be able to
accept this grammar or others like it. Keep in mind that the operator
precedences are run-time definable!

I've recently (re)implemented AUTOMATH and am rapidly migrating it to a new
and higher form -- with the intent on subsuming other proof assistants out
there -- and as a first step I'm going to be putting a Prolog operator syntax
on it like that described below. So this is not just for academic interest to
me. (Releases of snapshots of my copy AUTOMATH are available by e-mail.)

The Grammar:
This is an old file I have; it may be partly based on a redaction of the
(mostly defunct) ISO standard or some books on Prolog; I don't remember.

The fundamental syntactic construct in Prolog is the term. At the topmost
level, itself, the terms are interpreted as sentences in the manner described
below. There is a limited ability to define or redefine the syntax of terms,
these facilities, therefore, are also available for defining the syntax of
sentences.

Each term comprises a sequence of tokens; each token, in turn, being treated
as a single symbol that is spelled out in 1, 2 or more characters. The tokens
include those for variables, constants, functors as well as brackets and
punctuation. The token composition of a term is described in further detail
below. The spellings of the tokens are also described in further detail
below.

When a sequence of tokens is read in, it must be terminated by the full-stop
token. Wherever the concatenation of two or more tokens would produce the
spelling of another token, the tokens need to be separated by a layout-text
token in order to prevent the misreading. The layout-text tokens may be
arbitrarily interspersed in a program and serve no other function than to
separate other tokens.

For what follows A* denotes 0, 1, 2 or more or A; [A] denotes 0 or 1 of A;
"..." denotes literals

All the phrase types S = sentences; H = grammar rule head; B = grammar rule
body are specializations of the syntax T for terms and are separated out only
for convenience. The top-level phrase type -- and only real phrase type in the
Prolog syntax is just T itself. The reason for mentioning this is that T is
indexed by precedence levels and all the unary prefix, unary postfix and
binary infix operators have precedence levels -- including those that appear
in the syntax for S, H and B. So I don't list explicit precedence levels for
these three categories. A parser need only parse the syntax for T and pass the
refined interpretations of terms within the categories S, H and B downstream
for the semantic analyzer to handle.

Sentences
S -> a ":" S Sentence labeled by module name
S -> L A sentence in list form
S -> H [":-" B] Clause
If :- B is absent,
H must not be otherwise interpretable as a sentence)
S -> ":-" B Command
S -> "?-" B Query
S -> H "->" B Grammar rule

H -> a ":" H Module label
H -> T Goal; T may not be an X
or have any of , ; : \+ -> as its main operator.

B -> a ":" B Module label
B -> B "->" B ";" B Conditional
B -> B "->" B Elseless conditional
B -> "\+" B
B -> B ";" B Disjunction
B -> B "," B Sequencing
B -> T Goal; T may not be an X
or have any of , ; : \+ -> as its main operator.

In a grammar rule, instances of H or B of the forms L or S are interpreted
respectively as a list of terminals or as a literal terminal; instances of B
of the forms { B } or ! are interpreted as conditions; instances of H or B of
the forms T are interpreted as non-terminals. In H or B, instances of T may
not have the form of a variable X, or contain any of the operators , ; \+ ->
as their main operator.

Terms
As mentioned previously, when a sequence of tokens comprising a term is read
in, it must end in a full-stop. The syntax is: T(1200) full-stop. There are a
set of precedence levels, ranging from 0 to 1200, for terms. This facility is
provided to allow certain operators and punctuation to bind more tightly than
others, and it is available to enable one to partially define or redefine the
syntax of terms.

T -> T(1200);
T(n+1) -> T(n) for all integer n in the half-open interval [0,1200).

T(n) -> fx(n) T(n-1) except for number.
T(n) -> fy(n) T(n)
T(n) -> T(n-1) xfx(n) T(n-1)
T(n) -> T(n-1) xfy(n) T(n)
T(n) -> T(n) yfx(n) T(n-1)
T(n) -> T(n-1) xf(n)
T(n) -> T(n) yf(n)

T(1000) -> T(999) "," T(1000) The comma operator is xfy(1000).
T(0) -> a ["(" T(999) ("," T(999))* ")"] Function or constant.
T(0) -> "(" T(1200) ")'
T(0) -> "{" T(1200) "}"
T(0) -> L Lists
T(0) -> S Strings
T(0) -> N Numeric Constants
T(0) -> X Variables

L -> "[" [T(999) ("," T(999))* ["|" T(999)]] "]"
(The Prolog syntax apparently doesnbt allow two or more T(999)
to precede the separator "|".)

N -> ["+"|"-"] Numeral
N -> ("+"|"-") ("inf" | "nan")

All the tokens fx(n), fy(n), xfx(n), xfy(n), yfx(n), xf(n), yf(n) are
instances of the atom, a, which have been specially declared as the respective
operators of the respective precedences and associativities. The operators of
the respective forms ?f, ?f? and f? are prefix, infix and postfix. Those of
the forms yf* are left-associative and those of the form *fy are
right-associative.

The comma operator "," is fixed in the syntax above as of type xfy(1000). A
term T(1000) of the form A,B is interpreted in the standard syntax as
b,b(A,B); while the terms T(0) of the form {A} and (A) are interpreted,
respectively, as b{}b(A) and A. All the other prefix, infix and postfix
operators can be written in ordinary function syntax like this by quoting
them.

In order to disambiguate between an operator followed by a parenthesis versus
a function-call, a requirement is imposed that "(" must immediately follow the
a of the function in a function call, and "(" must be separated from fx(n) or
fy(n) in an operator application.

For those of you C/C++ programmers who prefer to space out function brackets
-- contravening centuries of mathematical tradition by the way --
PPHHHHTHHHHT!!

Also, in case of ambiguity, the interpretation of an instance of an atom a as
a prefix operator, fx(n) or fy(n), wherever possible, wins out; and the
interpretation of an instance of xf(n) or yf(n) as xfx(n), xfy(n) or yfx(n),
wherever possible, wins out.

Tokens
The following is what a lexer will see, process and pass on down to the
parser. I won't get into internationalization details on the character sets.

As above, I quote literals -- here literal characters. The quote itself " is
quoted as "\"", like in C; while \ is quoted as "\\". The newline is quoted as
"\n", the tab as "\t" and space as " ".

The roster of character categories and their minimal membership are:

White "\t", "\n", " "
Lower "a", "b", "c", "d", "e", "f", "g", "h", "i",
"j", "k". "l", "m", "n", "o", "p", "q", "r",
"s", "t", "u", "v", "w", "x", "y", "z"
Upper "A", "B", "C", "D", "E", "F", "G", "H", "I",
"J", "K". "L", "M", "N", "O", "P", "Q", "R",
"S", "T", "U", "V", "W", "X", "Y", "Z"
Digit "0", "1", "2", "3", "4",
"5", "6", "7", "8", "9"
Symbol "+", "-", "*", "/", "\\", "^", "`", "~", "."
"<", ">", "=", ";", "?", "@", "#", "$", "&"
Solo "!", ";"
Punctuation "%", "(", ")", ",", "|",
"[", "]", "{", "}"
Quote "\"", "'"
Underline "_"

The top-level of the lexer is Token* -- a stream or 0, 1, 2 or more tokens;
with the following lexical syntax.

Token -> a
Token -> Numeral
Token -> X
Token -> S
Token -> Punctuation
Token -> Layout+
Token -> Done

a -> "'" C* "'"
a -> Lower (Upper | Lower | Digit | Underline)*
a -> Symbol+
a -> Solo
a -> "[" Layout* "]"
a -> "{" Layout* "}"

Neither the full-stop nor any symbol sequence starting in "/*" counts as an
a.

Numeral -> Digit+
Numeral -> Digit+ "'" (Upper | Lower | Digit)*
Numeral -> "0" "'" C
Numeral -> Digit+ "." Digit+ [["e"|"E"] ["+"|"-"] Digit+]

Prolog, similar to languages like BC, allows a wide range of radices. That's
what the apostrophe "'" is for. In the base-radix notation R'N, the digits
comprising N must be of value no larger than R, with a,b,b& and A,B,b& both
being treated as being of values 10,11,b&. The base R must in the range
[2,36]. Note that b3 is identified as a numeral, whereas b(3) is an
application of the operator b to the numeral 3.

X -> ("_" | Upper) (Upper | Lower | Digit | Underline)*
S -> "\"" C* "\""

C -> any Char, other than "\\"
C -> "\\" Esc

In the atom a, a quoted "'" must be duplicated. In a string S, "\"" must be
duplicated.

Layout -> White
Layout -> "/*" Char* "*/"
Layout -> "%" Char* "\n"

The character sequences in the "/*" and "%" comments may not contain the
respective ending sequences "*/" or "\n" within them; they may only be used to
close the respective comments -- i.e. no nesting of any comments are permitted
within comments of the same type.

Done -> "."

A full stop must be separated from subsequent tokens by LayoutItems.

Char -> Any character, including White, Lower, Upper, Digit, Underline,
Symbol, Solo, Punctuation and Quote.

Esc -> "a"|"b"|"t"|"n"|"v"|"f"|"r"|"e"|"d"
Esc -> "x" (Upper | Lower | Digit) (Upper | Lower | Digit)
Esc -> Digit [Digit] [Digit]
Esc -> "^" ("?" | Upper | Lower)
Esc -> "c" White*
Esc -> White
Esc -> Char

The single-character escapes are, respectively, the alarm, backspace,
horizontal tab, newline, vertical tab, form feed, carriage return, escape and
delete. The escapes starting in x are hexadecimal escapes; with the following
characters ranging from 0-9, A-F and a-f. The escapes starting in a digit are
octal escapes, and the digits must all be in the range 0-7. The escapes
starting in ^ are control-sequences, with ^? standing for delete (127) and the
corresponding codes for the other controls being taken as the code of the
character following the ^, modulo 32. Escape sequences involving layout are
ignored (and that starting in c is taken with the longest layout sequence that
occurs following it), and the escape sequence of any other character is taken
to stand for the character, itself.

Older implementations of Prolog do not share this lexical syntax for Esc and I
do not believe the lexical syntax for character escape is mandatory.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Example: The Full Syntax Of Prolog (Re: precedences vs. hierarchy)

rockbrentwood@gmail.comWed, 24 Aug 2016 15:45:34 -0700 (PDT)

rockbrentwood@gmail.com
Wed, 24 Aug 2016 15:45:34 -0700 (PDT)