Re: XML Parsing

rkrayhawk@aol.com (RKRayhawk)
21 Mar 2000 23:42:49 -0500

          From comp.compilers

Related articles
XML Parsing ma9vk@bath.ac.uk (Vassilis Kostakos) (2000-03-11)
Re: XML Parsing rkrayhawk@aol.com (2000-03-21)
Re: XML Parsing nugatory@mindspring.com (nugatory) (2000-03-23)
Re: XML Parsing bcrivat@hotmail.com (Bogdan CRIVAT) (2000-04-01)

From: rkrayhawk@aol.com (RKRayhawk)
Newsgroups: comp.compilers
Date: 21 Mar 2000 23:42:49 -0500
Organization: AOL http://www.aol.com
References: 00-03-066
Keywords: parse

Browse around at approximately
http://www.stg.brown.edu/pub/xmlvalid/dist/source/
or
http://www.stg.brown.edu/pub/xmlvalid/dist/
or better
http://www.stg.brown.edu/pub/xmlvalid/Xml.tr98.2.shtml


This material is a validator (as opposed to a grammar with deployment
semantics in the actions). In other words, this work can check XML code
for well-formedness.
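
To make "checking for well-formedness" concrete, here is a minimal sketch.
It is not Mr. Goerwitz's yacc/lex validator; it assumes Python's standard
xml.etree.ElementTree parser as a stand-in, and the helper name
is_well_formed is hypothetical. Note that this parser checks
well-formedness only; it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document.

    Hypothetical helper: it relies on the standard-library parser
    raising ParseError for anything that is not well-formed.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b>1</b></a>"))  # True
print(is_well_formed("<a><b>1</a></b>"))  # False: tags are not nested properly
print(is_well_formed("<a>1</a><a>2</a>"))  # False: more than one root element
```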


The work is by Richard Goerwitz Richard_Goerwitz@Brown.EDU


The work originated in 1998. I think a revised version is in the
offing, but I don't know if it has been posted yet. Mr. Goerwitz has
posted the source under the GNU copyleft, though it seems the GNU
organization does not know of it yet.


The cover HTML pages characterize the work as yacc/lex. His significant
efforts include Unicode capability (with UTF aspects). The docs include
the note "It is particularly well-suited to legacy SGML documents that
are in the process of being converted, along with their associated
DTDs, to XML," which reflects the significant legacy of SGML at Brown
University.


Not only is the work a validator, but the validator is up live and
available for validating your pages!


All of the source is there, in Unix file format. It is an excellent
foundation. The parser is concise enough that it makes a good model
for just studying grammars.


You should also consider going to the W3 Consortium pages directly for
the latest guidance. They have detailed documentation that includes
embedded snippets of grammar. That is, if you do not know it already,
the XML standard itself includes BNF rules with detailed discussion.
Let me know if you need URLs. Be forewarned: some links in the W3 pages
take you to actual XML pages that can break many a current browser, but
the standards material is usually 'downward' compatible with the HTML
browsers most of us still use. (One other thing you will notice right
off in the realm of XML: page authors of the future world know not of
brevity, and pages will take a long time to load.) (Also, for fun, I
will mention for lurkers here who just love computer science as a
science: the XML standards documentation is its own genre of
literature; some of it reinvents abstraction.)


Vassilis Kostakos, ma9vk@bath.ac.uk,
the original poster, asked
"
  I was wondering if I
could have something such as:


    <num>3</num> <op>plus</op> <num>5</num> <op>equal</op>


which would yield as output <num>8</num>


"


Which looks good. The 'yield', however, would be up to the particular
application-specific engine that straps on the XML parsing harness.


Unlike HTML, XML has no meaning per se. So the semantics of the XML
postfix-like language you sketch would be yours to establish. You
could definitely do that sort of thing. You could have two engines
attached to the XML mechanism; the first might present the XML on
screen (or in print) as


  3 +
  5 equal


and the other might interpret/compile that (if validly encoded), and
call the first routine to display


  8.


Yet another set of routines, strapped to the same XML parser,
interpreting the same input, might display


  tres augend
  cinco copula


  ocho


But you would have to write such routines.
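
As a rough illustration of two such engines strapped to one parser, here
is a minimal sketch. Everything in it is an assumption for the sake of
the example: it uses Python's standard xml.etree.ElementTree, it wraps
the original fragment in a hypothetical <expr> root (a well-formed XML
document needs a single root element), the names render and evaluate are
made up, and only the "plus" operator is handled.

```python
import xml.etree.ElementTree as ET

# The original fragment, wrapped in a hypothetical <expr> root element.
SOURCE = "<expr><num>3</num> <op>plus</op> <num>5</num> <op>equal</op></expr>"

# Assumed presentation mapping; operators not listed are shown as written.
SYMBOLS = {"plus": "+"}


def tokens(root):
    """Yield (tag, text) pairs for the direct children of root."""
    for child in root:
        yield child.tag, (child.text or "").strip()


def render(root):
    """Presentation engine: return the expression as plain text."""
    parts = []
    for tag, text in tokens(root):
        parts.append(SYMBOLS.get(text, text) if tag == "op" else text)
    return " ".join(parts)


def evaluate(root):
    """Interpreting engine: compute the sum and return it as XML text."""
    total = 0
    pending = "plus"  # operator to apply to the next number
    for tag, text in tokens(root):
        if tag == "num":
            if pending != "plus":
                raise ValueError(f"unsupported operator: {pending!r}")
            total += int(text)
        elif tag == "op":
            if text == "equal":
                break
            pending = text
        else:
            raise ValueError(f"unexpected element: {tag!r}")
    result = ET.Element("num")
    result.text = str(total)
    return ET.tostring(result, encoding="unicode")


root = ET.fromstring(SOURCE)
print(render(root))    # 3 + 5 equal
print(evaluate(root))  # <num>8</num>
```

Both functions walk the same parsed tree; only the meaning they assign
to the <num> and <op> elements differs, and that meaning lives entirely
in the application code, not in the XML.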


The interesting thing about your conjecture is that many of the
natural-language-like computer languages could be coerced into a
well-formed XML arrangement by source code editors. The XML
representation expands the amount of text, but there could be
important future applications. Editors could maintain the
well-formedness of a program file, even while presenting it in
user-friendly form.


If the editor tags the file as well-formed, it could be compiled by an
XML-organized compiler, perhaps _faster_ than the traditional raw text
of a natural-language-like computer language (where the compiler would
otherwise have to start from scratch).


I could also see a possibility of the DTD including ELEMENTs or
ATTRIBUTEs for diagnostics, so that the compiler could put the
diagnostics into the XML file. Reviewing your error messages is then
just another XML parse, and a recompile commences with a diagnostics
purge.
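
A minimal sketch of that diagnostics round trip, under stated
assumptions: the <diagnostic> element, its severity attribute, the
<expr> root, and the function names are all hypothetical, the "compiler"
is reduced to a check for unknown operators, and Python's standard
xml.etree.ElementTree stands in for the XML mechanism.

```python
import xml.etree.ElementTree as ET

KNOWN_OPS = {"plus", "equal"}  # assumed operator vocabulary


def purge_diagnostics(root):
    """Remove every <diagnostic> element found anywhere under root."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "diagnostic":
                parent.remove(child)


def read_diagnostics(root):
    """Reviewing the messages is just another walk over the parse tree."""
    return [(d.get("severity"), d.text) for d in root.iter("diagnostic")]


def check(root):
    """Stand-in for a compile pass: purge old notes, then attach new ones."""
    purge_diagnostics(root)
    for op in list(root.iter("op")):
        name = (op.text or "").strip()
        if name not in KNOWN_OPS:
            note = ET.SubElement(op, "diagnostic", severity="error")
            note.text = f"unknown operator {name!r}"
    return read_diagnostics(root)


source = "<expr><num>3</num><op>minus</op><num>5</num><op>equal</op></expr>"
root = ET.fromstring(source)
print(check(root))  # [('error', "unknown operator 'minus'")]
print(check(root))  # same single entry: the purge prevents duplicates
```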


But better than that would be a super engine that reads the diagnosed
source and corrects it.


Imposing well-formedness on computer source code files might begin to
orient us to source code in new ways.


Best Wishes,


Robert Rayhawk
RKRayhawk@aol.com

