Re: How to extract grammar from a program?

x@wins.uva.nl
9 Dec 1999 01:04:51 -0500

          From comp.compilers

Related articles
[2 earlier articles]
Re: How to extract grammar from a program? derekross@fisheracre.freeserve.co.uk (Derek Ross) (1999-02-12)
Re: How to extract grammar from a program? eodell@pobox.com (1999-02-15)
Re: How to extract grammar from a program? dib@dera.gov.uk (David Bruce) (1999-02-15)
Re: How to extract grammar from a program? spencer@cc.gatech.edu (1999-02-15)
Re: How to extract grammar from a program? hunk@alpha1.csd.uwm.edu (1999-02-15)
Re: How to extract grammar from a program? hunk@alpha1.csd.uwm.edu (1999-02-15)
Re: How to extract grammar from a program? x@wins.uva.nl (1999-12-09)
| List of all articles for this month |

From: x@wins.uva.nl
Newsgroups: comp.compilers
Date: 9 Dec 1999 01:04:51 -0500
Organization: Compilers Central
Keywords: parse, practice
X-Organisation: Faculty of Mathematics, Computer Science, Physics & Astronomy University of Amsterdam The Netherlands

Some time ago, Rahul Jain <rahulj@iitk.ac.in> wrote in comp.compilers:


> I'm working on a project for which I need some information about some
> reverse engineering method that would help me extract the grammar from
> a set of programs (written in any language). A sufficient grammar will
> be the one which is able to parse all the programs. Now, the question
> is - Does there exist some formal theory for getting the grammar from
> a program. Any heuristic approaches would also solve the purpose.


> [Seems unlikely there's a formal theory, since there's an unlimited
> number of grammars that could describe any particular program. But I
> can imagine that there could be effective heuristics. -John]


There is a way to deal with this issue. Let us assume for the moment
that the program is compiled by a compiler. Then the grammar
knowledge that you need resides in that compiler. What you do is
write a parser that parses the part of the compiler containing the
grammar knowledge. If you are lucky this is easy and you recover the
BNF in a snippet. See [SV99a] for a case study where we recovered
about 20 sublanguages from a proprietary switching system language.
Some 6 BNF dialects were used in that compiler, a recursive decent
parser for 1 of the six types of embedded assembler, and so on. So
this is not a toy case study.


But now you say: hey but what about COBOL??


If you are dealing with a language like IBM COBOL or PL/I for which it
is not possible to obtain the source code of the grammar there is
another option. You can extract the grammar from the manual. No,
really. In fact, for languages that are proprietary this is too
complicated. For, the manuals suck. See [SV99b,SV98] for a case
study where we recovered too many errors from a manual for an obscure
language, to be able to recover a parser from it. However, if you are
dealing with an often used language, such as COBOL the approach to
recover the grammar from a manual works! With my colleague Ralf
Laemmel I am writing a paper on this issue. We recovered a VS COBOL
II grammar from an on-line IBM manual in a snippet. Of course, we had
to parse the syntax diagrams in those manuals. Also it took effort to
repair all the errors, and after parsing about 2 MLOC of VS COBOL II
code we are still finding small errors. However, the amount of work
decreases the more we parse. Normally during reengineering this is
not the case. See for instance [BSV98b] where we explain that
reengineering companies often have grammar reengineering projects.
<grin>. The reaon is that most of these companies use LR parsing
technolgoies and in some cases ppl are 70% of the time resolving
conflicts. However, for people who are interested in the VS COBOL II
grammar, there is no paper yet (draft only) but there is a grammar
that is fairly correct (parsed 2MLOC VS COBOL II code). See
<http://adam.wins.uva.nl/~x/grammars/vs-cobol-ii/> for the grammar.
We are still looking for some code that contains the keyword KANJI.
Any takers?


Best Regards,


    --XX


REFERENCES


[BSV98b]
@inproceedings{BSV98b,
author = {Brand, {M.G.J. van den} and Sellink, M.P.A. and Verhoef, C.},
title = {Current Parsing Techniques in Software Renovation
Considered Harmful},
year = 1998,
editor = {S. Tilley and G. Visaggio},
pages = {108--117},
booktitle = {Proceedings of the sixth International Workshop on Program
    Comprehension},
note = {Available at http://adam.wins.uva.nl/\~{}x/ref/ref.html},
}


[SV99a]
@inproceedings{SV99a,
author = {Sellink, M.P.A. and Verhoef, C.},
title = {Generation of Software Renovation Factories from Compilers},
year = 1999,
editor = {H. Yang and L. White},
booktitle = {Proceedings of the International Conference on Software
Maintenance},
pages = {245--255},
note = {Available via http://adam.wins.uva.nl/\~{}x/com/com.html}
}


[SV99b]
@inproceedings{SV99b,
author = {Sellink, M.P.A. and Verhoef, C.},
title = {Development, Assessment, and Reengineering of Language
        Descriptions},
booktitle = {Proceedings of the Fourth European Conference
                                  on Software Maintenance and Reengineering},
year = {2000},
editor = {J. Ebert and C. Verhoef},
publisher = {IEEE Computer Society},
month = {March},
note = {Full version of \cite{SV98.ase}. To Appear.
                Available at: http://adam.wins.uva.nl/\~{}x/cale/cale.html}
}


[SV98]
@inproceedings{SV98,
author = {Sellink, M.P.A. and Verhoef, C.},
title = {Development, Assessment, and Reengineering of Language
        Descriptions -- Extended Abstract},
year = 1998,
pages = {314--317},
editor = {B. Nuseibeh and D. Redmiles and A. Quilici},
booktitle = {Proceedings of the 13th International Automated Software
Engineering Conference},
note = {For a full version see \cite{SV99.dagstuhl}.
                Available at: http://adam.wins.uva.nl/\~{}x/ase/ase.html}
}


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.