Re: Source-to-source transformation: best approach?

idbaxter@semdesigns.com
Sat, 11 Aug 2007 12:23:22 -0700

          From comp.compilers

Related articles
Source-to-source transformation: best approach? somedeveloper@gmail.com (SomeDeveloper) (2007-08-04)
Re: Source-to-source transformation: best approach? DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-08-07)
Re: Source-to-source transformation: best approach? Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2007-08-11)
Re: Source-to-source transformation: best approach? idbaxter@semdesigns.com (2007-08-11)
Re: Source-to-source transformation: best approach? cfc@shell01.TheWorld.com (Chris F Clark) (2007-08-12)
Re: Source-to-source transformation: best approach? cordy@cs.queensu.ca (Jim Cordy) (2007-08-14)
| List of all articles for this month |

From: idbaxter@semdesigns.com
Newsgroups: comp.compilers
Date: Sat, 11 Aug 2007 12:23:22 -0700
Organization: Compilers Central
References: 07-08-013
Keywords: tools
Posted-Date: 11 Aug 2007 17:09:19 EDT

> I gather that TXL (http://www.txl.ca/) is being used by many for
> source to source transformations, and though I'm now somewhat
> comfortable with the language, as primarily a non-compilers person,
> I don't know why TXL would be especially suited for source to source
> transformations.


[Warning: This response is from an opinionated transformation tool
vendor. I'm the guy behind the DMS(R) Software Reengineering Toolkit,
a system I started building back in 1996].


Well, a key issue is that the tool is actually capable of encoding
source to source transformations for real languages. There are
several fundamental issues:


1) Can the tool parse the language(s) of interest, in theory?
2) Has the tool got a parser that is usable in practice?
3) Can the transformations be expressed in something that
      actually resembles source code?
4) Can the tool provide additional facts about the parsed
        code that are necessary to support more sophisticated
        transforms?


Frankly, there aren't that many tools out come close to meeting all
these criteria. I know of 3 tools that I think are capable of doing
this for wide classes of language (in the sense that I have looked at
them hard and have implemented one of the three): TXL, ASF/SDF (and
its derivative, Stratego), and my tool, DMS. There are some other
tools out there that make claims of doing general source to source,
but I haven't examined them closely. There are tools out there that
claim to do this for single languages. (PUMA [C++], Jackpot [Java],
...)


> - Is it because, that unlike yacc/bison, it allows ambiguous task-
> specific (agile) grammars?


This goes to issue 1). Yacc is pretty bad at parsing most real
languages, unless you are willing to mangle the grammar and/or add all
kinds of odd support. The GNU folks gave up completely on YACC to
support C++. My personal favorite is GLR because it handles
ambiguities cleanly, and both ASF/SDF and DMS owe much of their
success IHMO to using GLR parsing technology.


[DMS parses full C++ this way just fine.] TXL succeeds by
backtracking over potentially ambiguous parses. There are those that
argue that phrase-ordered backtracking parsers are adequate to handle
ambiguities (the argument used by TXL and some other more specific
systems). When they succeed in parsing C++ without additional hackery
I'll start to believe the argument.


> - Can't we write 'agile' grammars in yacc/bison/antlr (by ignoring
> tokens/constructs that are outside the scope of the task at hand)?


Nope. You're thinking "island grammars", e.g, a grammar which ignores
the part of the language you claim you don't care about. To process
real languages, you'll eventually have to do name/type resolution, and
you'll pretty much find this impossible if you skip over part of the
code. The DMS solution is to only implement full language parsers.
We only have to do this once, for a wide variety of tasks, as opposed
to attempting an island grammar for each task and continually
struggling with island boundaries as we discover what is really
needed. This goes to issue 2).


> - Wouldn't we be able to meet all our agile source transformation
> needs with, for example, Parse:RecDescent, a recursive descent for
> Perl?


I can't speak for how Perl's parser actually works. But if it is just
a recursive descent parser, and it doesn't handle ambiguities, it
won't work adequately for full languages.


> Basically, I do not know what conceptual 'dead-end' I'd run into in
> future if today I were to choose a recursive descent parser such as
> RecDescent for my long-term source to source transformation needs.


I'll be blunt. If you want to transform more than one language, this
approach will fail you.


> Similarly, I don't know what would go wrong (or become difficult /
> unweildy) if I were to choose antlr, specifically for source to
> source transformation tasks?


Does ANTLR pass issue 3? It has a "string templates", a concept that
I don't really understand.


What matters for a practical source-to-source transformation system is
that individual transformations be expressible as syntactically
correct sub-phrases in the languages of interest, that can somehow
express the internal representations (e.g., trees) so that the
transformation engine can match and replace graphs/trees. So, a good
tool has to have some kind of substring parser for the language. If
you don't do this, the user must learn the complete graph/tree
structure used for the individual language being parsed. TXL, ASF/SDF
and DMS all do this by providing parsers for grammar nonterminals,
allowing placeholders for sub-phrases.


DMS is different than TXL and ASF/SDF in that it keeps the multiple
grammars of interest completely isolated. That matters when you are
trying to build a complex code generator that maps from one or more
input languages to one or more output languages. TXL and ASF/SDF make
a "union" grammar having all the rules of all the langauges of
interests. That works great for one language. It is claimed to work
well for two, but I have heartburn thinking about parsers using the
union of COBOL and Prolog grammars; talk about ambiguity troubles...
I can't imagine this working for half a dozen languages. DMS's scheme
has been proven with practical transformation tools involving multiple
languages.


[Soapbox mode on]


But all this parsing stuff pales in comparison to the support
machinery you need to do real transformations on real languages.
Parsing is just the foothills leading to the Himalayas. Most folks
don't get past the foothills. But remember that the Himalayas are
breathtaking in size.


First, real languages are a mess. [Thank the standardization
committees, the fullness of time, the ambition of each language
community to co-opt the other language arenas for this, and especially
the language vendors for this]. Discovery of all the details for a
particular dialect is a major effort. We've got about 10 man-years
invested in our C++ front ends (parse/name-type resolve/disambiguate.
Consider language dialects, the preprocessors, the need to read
compiler command lines to get preprocessor condition definitions, the
500 odd pages in the C++ reference manual defining the semantics of
names and types, and then include the vendor additions.


If you intend to manipulate a real language, and you don't want to
spend *all* your time just getting the language details right, you
want this available out of the box. This alone is a tall order.


Semantic Designs doesn't do this very well in the sense that while we
have front ends for some 30 languages, we don't fully *mature* front
ends for a wide variety of languages, and we've been working on this
for 12 years. [This makes DMS kind of like democracy: the worst
goverment possible except for all the others :-( ] But we believe the
only way this can happen is for a central agent to organize the
collection and maturation of the front ends. [And that's very
expensive, and why DMS is a commercial product]. I'll give the ANTLR
guys credit for trying to collect such a set. That fact that most of
those are academic products hints they are not likely to be very
useful in practice.


Once you get past name and type resolution, you discover the need for
serious static analyses, including at least control and data flow
analysis, to do any real transformation on any procedural language.
This goes to issue 4. Here one ignores what the compiler community
has been doing for the last 50 years at your own risk [which is why I
religiously follow this news group!]. None of the source-to-source
transformation systems I know except DMS, offer any support for
control and data flow analysis. DMS's is mature only for C at this
point, but at least the foundation machinery is generalized so that it
can be hooked to the other domains. At least that's our claim. DMS
also provides a highly structured scheme for defining classes of
static analyses via attribute grammar evaluation; this is remarkably
handy for simpler static analyses.


The TXL transformation model makes no pretense of providing any of
this kind of machinery; their idea is that you compute the facts on
demand, and they've had some success with that. Similarly with
Stratego (but not its underpinnings ASF/SDF). None of the other
systems that I know about provide anything to support construction of
this kind of machinery; they all hand you a tree and let you code
procedurally whatever you want. In theory this is fine. In practice,
your life is far too short.


There are still problems that the transformation community (including
us) have not solved very well. Our mantra is "process what the
programmer sees", e.g., the source code containing the preprocessor
directives. None of the tools that I know about except DMS makes any
attempt to handle the preprocessor. For C, C++, VHDL and Verilog, DMS
captures preprocessor directives that are structured (e.g., they fit
into the grammar structure nicely) or that are common (by making those
special cases). And DMS can apply transformations to these, but with
some restrictions.


Now, you might consider sticking to transforming Java to get rid of
these issues. What you'll discover is that real applications involve
more than one langauge, and you soon trip over HTML, JSP, ... and the
preprocessor stuff will instantly reappear. People just can't resist.


Lastly, there's scale. Many researchers might demonstrate a solution
by applying a few transforms to a toy example. If you aren't going to
synthesize code from a tiny formal spec, you'll have to manipulate the
code that makes up a real system. How big is that? If the task is to
modify a 200 line program, it is small. But then, you can do this by
hand way faster than you can configure a tool, so the economics aren't
there. The economics are there for systems with millions of lines of
code. So, the tool has to handle millions of lines of code. Often in
multiple lanuages. Our experience is that "In big systems, everything
happens". You have to have the parsers essentially perfect. You have
to have the semantics essentially perfect. [You even have to have the
vendor's broken semantics properly emulated]. You have to read tens
of thousands of files. (You can't do a points-to analysis with any
reasonable degree of accuracy without seeing *everything*.) And you
have to produce an answer in period of time that the customer will
tolerate. DMS is implemented in a parallel programming language to
harness SMP computations. It actually compiles attribute grammars
into partial order parallel computations as one way to acquire
parallel execution opportunities.


DMS has collected enough machinery and scale to be used in carrying
out mass transforms on C++ source code, and for extracting code from C
programs with 7500 compilation units (29,000 files) [the CPU time for
this is measured in tens of CPU days]. TXL and ASF/SDF have not been
used for these tasks to my knowledge.


Your mileage may vary if you choose to manipulate only a DSL that is
easy to parse and has trivial name and data flow analysis issues. I
know of extremely few such DSLs. The Haskell community keeps trying
to the make the case that Haskell is just such. And there are even
Haskell community members trying to build real tools with this
mindset. I've never seen them applied in a commercial context.


One legitimate objections to program transformation tools is that in
practice they can be complex to learn to use. So true! TXL and
Stratego attempt to end-run by use a single formalism in which to
encode everything you want to do about transformations. That makes
them easier to learn. But this is rather like deciding to code web
applications by only using Java. Yes, you can do it in theory. But
you'd be crazy to ignore other specialty languages such as SQL, HTML
and JavaScript, because they either make solving other parts of the
problem much easier. DMS follows the latter perspective; it has about
6 DSLs of it own to specifiy various aspects of a full transformation
task. That makes it hard to use, because there's a lot to learn. Just
like web programming. But sticking with just a single language to
solve the task won't get you a modern website, or IMHO, a useful
program transformation tool.


[Soapbox off]


> Since this question of mine spans breadth as well as depth of the
> field of compilers, I'd greatly appreciate if folks here could share
> some insights / info / links on this.


I've been working on *only* program transformation machinery since the
middle 80s (well, and applying them to real tasks). Program
Transformation tools that work on real systems are *hard*. There
won't be very many general purpose ones because of the effort to build
them and make them mature. I'd guess the ones that exist now are the
field plus maybe one or two surprises over the next ten years, and
that's it. My goal is make DMS be one of the best. I'll be the first
to admit DMS isn't the final answer, or even close to ideal.


Hope you found this helpful.


Ira Baxter, CTO, Semantic Designs


> [The main advantage of TXL is that it has a lot of I/O and tree
> building as part of the languge, whereas you have to implement them
> yourself if you used a yacc parser. -John]


You should look at http://www.ii.uib.no/~magne/stsw04-html-2 for a
community wide view of some of the issues.


You can look at www.semdesigns.com/Products/DMS/DMSComparison.html for
our view of how DMS compares to other compiler like-systems, including
TXL and ASF/SDF. [Yes, I'm interested in any feedback or observations
of inaccuracies!]


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.