
SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 25(11), 1263–1297 (NOVEMBER 1995)

A PREttier Compiler-Compiler: Generating Higher-order Parsers in C

PETER T. BREUER
DIT, Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid, Ciudad Universitaria, E-28040 Madrid, Spain (email: [email protected]; URL: http://www.dit.upm.es/~ptb/)

AND

JONATHAN P. BOWEN
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, U.K. (email: [email protected]; URL: http://www.comlab.ox.ac.uk/oucl/people/jonathan.bowen.html)

SUMMARY

Top-down (LL) context-sensitive parsers with integrated synthesis and use of attributes are easy to express in functional programming languages, but the elegant functional programming model can also serve as an exact prototype for a more efficient implementation of the technology in ANSI C. The result is a compiler-compiler that takes unlimited lookahead and backtracking, the extended BNF notation, and parameterized grammars with (higher-order) meta-parameters to the world of C programming. This article reports on the utility in question three years after public release. Precc generates standard ANSI C and is ‘plug compatible’ with lex-generated lexical analyzers prepared for the UNIX yacc compiler-compiler. In contrast to yacc, however, the generated code is modular, which allows parts of scripts to be compiled separately and linked together incrementally. The constructed code is relatively efficient, as is demonstrated by the example Occam parser treated in depth here, but the main advantages we claim are ease of use, separation of specification and implementation concerns, and maintainability.

KEY WORDS:

compiler-compiler; parsers; yacc; infinite-lookahead; top-down parsing; grammars; Occam

CCC 0038–0644/95/111263–35 © 1995 by John Wiley & Sons, Ltd.
Received 26 November 1993; revised 6 March 1995 and 4 May 1995

BACKGROUND AND OVERVIEW

No medium to large-scale software engineering project is complete nowadays without an external or internal proprietary language, and the well established UNIX utility yacc19 is still the tool often chosen to transform the language design into a working parser. This is not only because it genuinely permits the implementation (and therefore later maintenance and version control) to be carried out at a higher level, but because it has traditionally been seen as conferring some degree of respectability upon the final product. Though the perspective of leading-edge compiler writers may be different, to our knowledge yacc is ubiquitously engaged as a technology of first choice by many writers of language-oriented applications in the industrial arena, and certainly by many software engineers and researchers requiring parsing tool support. In connection with COBOL, for

example, an author of this article has several times been approached by technical companies of national and international standing seeking to build parsers with yacc, and indeed, we have both taken part in the recent past in a large international consortium in which at least 20 per cent of the effort and many person-years were devoted to developing yacc-based COBOL parsers. It was largely out of frustration at that state of affairs that the original version of precc8,9,10 was developed, and later publicly released.

On the face of it, after all, yacc offers attractive programming options: it implements a variety of the BNF grammar specification language, and compiles definition scripts into a C21 program which implements a finite-state automaton, thus converting specification into implementation via well-understood theory.1 In practice, however, the drawbacks of the utility may be more obvious than its attractions. The supported BNF is an extremely impoverished variety; for example, every BNF construct has to be expanded by hand into basic ‘sequential’ and ‘alternation’ components before it can be incorporated into a yacc script. The practice obscures the specification and makes maintenance difficult. This is true too of more modern implementations of yacc and also of bison (a similar, widely distributed utility available from the Free Software Foundation), which use improved internal algorithms (see the work of Sippu,30 for example, for the theory) but possess essentially the same scripting language and functionality.

One approach to improving on yacc, in order to be able to parse non-LR(k) grammars, is to modify yacc itself. This is the approach taken by Merrill25 in order to handle C++33 adequately, by allowing dynamic stack allocation, but it is not very elegant.
Even the original author of yacc has produced a prototype tool, y++,20 to handle C++ using the notion of attribute grammars.13 The approach described in this paper is to use a considerably more versatile tool that is designed to accept higher-level grammar descriptions.

More modern languages such as C++, which require more complicated parsing than older languages, may be a key to the widespread acceptance of improved parser technology, although the two issues drive each other. It may also be argued by some that C++ as it is now is inherently a bad language, which explains why it is so difficult to parse either by humans or by machines, and that we should simplify the language, obviating the need for better parser generators. But this argument does not generalize. Languages which are difficult or impossible to parse with standard technologies are not necessarily worse than those that are easy to parse: natural language is an obvious case, and other counterexamples are supplied by languages like Z11,31 or Occam,5,18 which, at least in the case of the former, evolved for reading by people rather than computers. So we dispute the view that if it cannot be done with yacc, then the language is wrong. It is precisely that point of view that this article seeks to counter.

We should state at once, however, for the parser experts, that the solution that we propose is not LL parsing of simple grammars – a parser for a simple LL grammar can be generated by the well-known Earley algorithm,14 and will have at worst cubic order complexity – but, simply, LL parsing of inherited and synthetic attribute grammars. In particular, we expect to be able to parse any language with a decision algorithm for its valid phrases, and therefore may generate utilities with the complexity of any Turing machine, which puts the problem out of the domain of the Earley algorithm.
We do expect to be able to describe the language simply, concisely, and accurately, and to automatically generate from that description a parser which requires no additional ‘semantic phase’. The grammar descriptions, for example, may be explicitly parameterized upon such things as lexical tables, or other grammars, and the parser that is generated may use and alter the tables during the parse, if the description so states.


    #include "ccx.h"
    MAIN(fibber)

    @ fibber    = { fibs $! }*                    /* accept till EOL many times */
    @
    @ fibs      = fib(1, 1)\k                     /* attribute is count of terms */
    @             {: printf("%d terms OK\n", (int)$k); :}
    @
    @ fib(a, b) = number(a) ⟨','⟩ fib(b, a+b)\k
    @             {@ $k + 1 @}                    /* attribute is count plus one */
    @           | ⟨'.'⟩ ⟨'.'⟩
    @             {: printf("Next terms are %d, %d, ..\n", (int)a, (int)b); :}
    @             {@ 0 @}                         /* attribute is zero count */
    @
    @ number(n) = digit(n)                        /* concrete representation of n */
    @           | digit(HEAD(n)) number(TAIL(n))
    @
    @ digit(n)  = ⟨n+'0'⟩                         /* single digit n */

Figure 1. (a). A precc parser which accepts only the Fibonacci sequence as input and pinpoints errors to within a character (HEAD and TAIL are C macros)

    1, 1, 2, 3, 5, ..
    Next terms are 8, 13, ..
    5 terms OK

    1, 1, 2, 3, 5, 8, 13, 21, 34, 51, 85, ..
    (line 2 error) failed parse : probable error at ⟨⟩1, 85, ..

Figure 1. (b). Two input lines and responses

In the remainder of this section we will discuss general aspects of parsing as they relate to the precc utility and the yacc standard. In the following section, the design and technology of precc will be discussed in more detail, with a particular emphasis on distinguishing the standard aspects of the design from its more innovative features. We then explore the application of precc to the Occam programming language as an example. The Implementation section sets out briefly the precise technique used to implement the design. Finally we detail the results of operational tests carried out on precc and yacc. Note that an axiomatic trace/refusal semantics has been formulated for precc – desirable if the tool is to be used in critical applications, where correctness is paramount7 – and this is available elsewhere.9

Declarative language issues

In a particular sense, yacc does not ‘correctly’ implement the fragment of the BNF syntax that it can express. Only those definition scripts that do not implicitly require the automaton to look ahead more than one token (1TLA) can be implemented, but others can be specified. In effect, this means that the user has to be aware of the limitations of the implementation when designing the specification. We do believe that specification and implementation issues should be separated, which is an argument against accepting this state of affairs, but consider


for a moment the issue of correctness. Yacc certainly does report violations of the 1TLA restriction as ‘shift/reduce’ clashes at compile-time (or ‘shift/shift’ or ‘reduce/reduce’), which amounts to a static check for implementability, so it may be argued that this restricts the range of expressible constructs to those that can be implemented. But it is perfectly possible to write a plain BNF script that yacc cannot implement without reporting clashes. Indeed, it is usual to accept scripts which contain shift/reduce or reduce/reduce clashes, because it is either too dangerous (from the point of view of maintainability) to rewrite the script or plain impossible to avoid them for the language in question. Moreover, the conflict reports refer to the underlying automaton, and can be extremely difficult to relate to what might be wrong with the script itself. This can be particularly frustrating for software engineers hoping to use yacc for rapid prototyping, because time has to be spent debugging the automaton as well as debugging the script itself.

Frequently, then, conflicts do have to be accepted, and since these are resolved in favour of shifts by default, extra (and unspecified) productions are introduced: a yacc script with a shift instead of a reduction implements a parser that is looser than the specification. So, in practice, the user has an automaton that differs from the specification. To argue that the user who accepts such an automaton is disregarding ‘the fine print’ is true but spurious: the fine print is incomprehensible and (practically speaking) unavoidable, so the product has to be accepted, but it is not exactly what was specified. The combination of automaton and warnings is technically correct, but the warnings are not part of the final deliverable and are not usually very informative, which is what makes the ‘fine print’ of a legal contract an apt analogy.
Consider, for example, a yacc script for a comma-separated sequence of x’s:

    phooey : x
           | x ',' phooey
           ;

This may generate a shift/reduce conflict report if any other production rule contains phooey ',' because yacc does not ‘know’ whether to begin looking for an x to follow the comma (shift for more phooey) or to jump with what it already has into the other production (reduce). The problem here can be construed as a lack of referential transparency: the same ‘phooey ','’ construction can evoke different effects in different scripts. Nowadays, interest in the software community has shifted towards syntax-directed techniques which produce high-quality software that has few errors and is easy to maintain.24 These are the issues which are addressed by precc,8 and the very areas in which yacc technology shows its weakest aspect.

Non-functional considerations

Some non-functional aspects of the yacc utility lead to further problems when changes to a design are required, whether for debugging purposes, as genuine developments, or as part of the maintenance process.6

First of all, yacc scripts have to be a single unit, and a change to any part requires the whole script to be recompiled again. This can be a slow process, because the original UNIX versions of yacc are not fast; and even if modern algorithms for LALR(1) parsing have speeded up yacc implementations,23 the monolithic output still has to be sent to the C compiler for conversion to native machine code, which can also be a slow process (the compiled C code is always a very large file, and C compilers are not normally well suited to the compilation of that kind of code, containing as it does a very large loop and switch/case statement). Hardware speed has improved greatly in the last few years, and with it compilation times, but there is still a long turn-around time involved in the basic development cycle under yacc. This would be much easier if a yacc script could be split into modules.

Secondly, the resulting executable may also be very large, particularly under older versions of yacc, because of the nature of the automaton, and can cause swapping problems (or worse) in the runtime environment which result in a poor perceived performance for the application as a whole. Again, operating system development over the past few years has improved paging and swapping behaviour in general, but the improvement is not uniform over all operating systems. The final performance figures can be more than disappointing for a software project which has invested many programmer-years of effort in the application, and devised the front-end for the application in the industry-standard style: a yacc-generated automaton.

Malign influences

It may be observed that lex and yacc have together acted to limit the style of modern computer languages to those that can be handled easily by the pair. Conforming languages have to have syntactic constructs that are well ‘signposted’ fore and aft with distinctive tokens, and it is noticeable that languages invented before the utilities became popular are not parsable by them because they lack this design.
Neither COBOL nor FORTRAN can be parsed by 1TLA technology (in practice, this means that a prior pass is required to add scoping constructs and distinctive tokens which can make a second 1TLA pass successful). In FORTRAN, for example, the scope of the logical conditional if (...) ... is a single following statement, but a following then makes multiple statements permissible before an endif terminates the scope. Logically, the then comes too late in the stream to signal to a 1TLA parser that everything is correct. It is possible to overcome the difficulties by setting flags as a side-effect, but the parser specification rapidly loses its comprehensibility as the declarative content dwindles. Moreover, it is possible in FORTRAN to declare variables with the same identifiers as keywords, which makes keyword identification at the lexical stage impossible, and raises obvious difficulties for a lex/yacc generated parser (the solution is to modify the lexer to use lookup tables which are modified by the parser). For COBOL, the scope of record structure declarations is only determined by numbering (and, conventionally, by indentation), which leads to equal difficulties.

A declarative approach

Developments in declarative languages such as Prolog12 and ML26 have led to the incorporation of parser suites into the associated development environment, and these allow more expressive use of BNF-style descriptions than yacc, and thus more languages can be parsed precisely. When these declarative environments are compilable, the results from their built-in compiler-compiler facilities are very acceptable relative to yacc non-functional


overheads, and much more usable from the point of view of the convenience, succinctness, and accuracy of the definitional language itself. Indeed the op(...) pseudo-clause of Prolog allows enough flexibility in the definition of pre/in/post-fix operators, together with their associativity and precedence, for an abstract tree input to a compiler to be sufficiently readable that it is usable directly, at least for rapid-prototyping purposes.3,4

In a functional language, it is normally possible to express the comma-separated list of x’s given as a yacc script above in the following style:

    phooey = x ⊕ x ⨾ literal ',' ⨾ phooey

(assuming appropriate definitions of the operators ‘⨾’, ‘literal’ and ‘⊕’). This definition will be absolutely independent of whatever other rules are defined because of the referential transparency of functional languages. Such a definition could be written yet more succinctly using an equivalent of the common BNF construct ‘{1 ’,’ x}’:

    phooey = many1 (literal ',', x)

having made the obvious abstraction of many1 from the body of the original phooey:

    many1 (separator, separated) = separated ⨾ separator ⨾ many1 (separator, separated)
                                 ⊕ separated

(This is more cumbersomely expressed than it might be, but is presented this way to provide an exact comparison with the yacc construction.)

However, higher-order language suites normally compile into native machine code, with a consequent lack of portability between different architectures, and usually carry the considerable overhead of at least some part of the development environment with them into the final product. Even if the overhead is not considerable in terms of machine loading, it still may be considerable in financial terms; and there is always some doubt about the effectiveness of the essentially recursive code, no matter how good the optimization is, because of the number of layers of translation involved. Therefore a niche remains to be filled by a dedicated stand-alone and portable compiler-compiler targeted at ANSI C21 which avoids some of yacc’s infelicities and introduces no more of its own. This is what is claimed for the precc compiler-compiler. So how is it done?

Design and implementation

The basis, well known to many functional programmers, behind the declarative compiler-compiler suites is top-down recursive descent parsing15 and LL(∞) infinite look-ahead technology, together with the parameterization that comes with the use of higher-order languages (and some additional algebraic theory described later). Heavily recursive in execution, the parser definitions can nevertheless be made as elegant as one cares to make them.

In a functional language, all that has to be done is to implement the higher-order combinators that make the appropriate compound parsers out of simpler component parsers. The basic combinators can correspond to the operators and separators


in BNF notation, so the gap between the specification language (BNF) and the implementation as a working parser can be small – especially if the functional language allows the direct binding of the appropriate mixfix symbols to these higher-order functions. Moreover, extensions to BNF (EBNF) can be implemented using the same kind of ideas. The basic method has been set out with great clarity by Fairbairn,15 but runtime considerations do not figure heavily in the account, and the theory known at that time has recently found its place in a broader setting which has allowed us to take account of parse-tree attributes and parameters in an integrated way.

The precc design sprang from the belief that the functional method of writing parsers and compilers would be natural in C, and would result in much more efficient and practical parsers than is possible in a functional programming environment. Interestingly, the method does turn out to transfer quite easily and naturally, and the generated C code closely follows the form of the intended grammar, providing a natural modularity. We always intended to investigate whether the use of a more powerful scripting language could compensate by its greater succinctness for the extra speed of an automaton, and we believe the answer is yes.

Note that precc, as designed, cannot use regular algebra, or even the Earley algorithm,14 and define an automaton like yacc’s because (1) conditionals∗ are allowed in precc scripts, (2) it permits context-dependent grammars, in that each grammar clause may be parameterized, and (3) the parameters may even be other non-terminals or terminals, which are then implanted dynamically into the clause. Furthermore, (4) precc supports the generation of synthetic attributes during a parse which may be used as contexts during the remainder of the parse. Termination and correctness for grammars and the parsers compiled from them is a major concern.
Precc has a fully specified model semantics expressed in terms of traces (phrases accepted by the constructed parser) and refusals (phrases rejected by the parser) that enables the grammar specification to serve as the basis for formally verified statements about the constructed parsers. The reader is referred to the precc technical manual9 for further details of the logic and methods of reasoning that may be used to establish safety or liveness properties for precc parsers.

Efficiency

Modularity and higher-order definitions were intended to compensate for deficiencies in execution speed with respect to yacc. After all, most of the effort in writing languages does go into maintenance of one kind or another, so it is reasonable to trade off execution speed against maintenance time, especially as speed improves with hardware and the latter improves rapidly nowadays. But, surprisingly, precc-generated executables turned out to run nearly as fast as those generated by yacc. There is no big disadvantage discernible for the parsers that have been constructed – they are not ten times slower and can be raised to the same speed or better with a little optimization. It may be that LL parsers have obtained a poor image from old technology, and that their capabilities ought to be re-evaluated in the light of more modern hardware technology. With hindsight, this can be attributed to the fact that current C compilers implement

∗ The following example defines a conditional parse of gum if the condition bar fails, else a parse of dee:

    @ foo = )!bar( gum | dee


subroutine calls, and therefore recursions, very efficiently. Presumably this was not necessarily the case at the time that lex and yacc were becoming popular. UNIX environments ought ideally to provide an environment in which large call-stacks are invisible both to the user and the programmer, being handled by the operating system through swapping and virtual memory if necessary.

The space overhead claimed by precc is not even very great: 32KBytes of call-stack is required to process 40-deep nested constructions in Occam 218 (and 40-deep is the maximum in Occam because there are eighty characters to a line, and each nesting must be indented by at least two characters from the last). This corresponds to about 2,500 stacked function calls from the kernel. Those memory-space requirements are not unreasonable nowadays. Precc executables also turn out to be no larger than yacc-generated ones, with no sign of swapping problems. Again, with hindsight, this can be attributed to the higher-order and modular nature of precc output code, which ensures that fewer subroutines are necessary, and that subroutines have small bodies which never approach a memory page or even a memory cache in size. There is no monolithic virtual machine, and, in the most recent versions of precc, not even an interpreter in the kernel.

Moreover, the functional model followed by precc offers a way out of the debugging morass of yacc’s ‘shift/reduce’ errors: all BNF constructs are implemented correctly in the model and therefore they are implemented correctly in precc, so there can be no implementation conflicts to report. Precc validly abstracts the recursion equations which hold in the functional programming model. When precc gets an ‘answer’, it is the predicted answer. There is some extra strictness∗ introduced in data values, but not in the higher-order combinators themselves.

In one practical matter too, precc seems more amenable than yacc.
It is easier by far to debug, simply because the C code produced both follows the structure of the BNF source closely, and is conserved through re-compilations. Local changes in the source force only local changes in the generated C code – another aspect of modularity – so breakpoints and the like can be relocated quickly.

Development

The initial development effort furnished a definition of precc in its own language, which then made further development easy, after the initial bootstrap. First, a generic rewrite engine was constructed using the ‘Brief editor for the PC’ macro language, and the engine was loaded with rewrite rules which then constructed the first C code for precc from BNF, resulting in the first self-compiling executable. Improvements in the initial definition script quickly allowed an order of magnitude improvement in the execution speed (this does not affect the speed of parsers and compilers built by precc, but it illustrates the advantages offered by a clear and succinct definitional language), and precc is now known to be of linear complexity with respect to its input scripts. It runs at hundreds of lines of input per second on standard platforms such as Sun SPARCstations and HP9000 workstations, and the output code will compile under any ANSI-compliant C compiler. Moreover, the output is indeed completely modular: one may split definition scripts any which way, pass them through precc separately, run the C compiler on them separately, and then link together the object modules.

∗ Strictness: a combination parser p(x) q(y) will only evaluate the argument y to q if the first parser p(x) succeeds. Otherwise, y remains unevaluated. That is non-strict behaviour on the part of the relevant parser combinator. However, x will always be fully evaluated, even if p does not look at it. That is extra strictness in the data.
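The footnote's distinction can be made concrete in C. The following is a hedged sketch of how a generated sequence might be laid out, with all names (p, q, expensive_x, expensive_y, p_then_q) invented for illustration rather than taken from precc's actual output:

```c
typedef enum { FAIL, SUCCESS } STATUS;

static int x_evaluated, y_evaluated;

/* data arguments instrumented so we can observe when they are computed */
static int expensive_x(void) { x_evaluated++; return 1; }
static int expensive_y(void) { y_evaluated++; return 2; }

static STATUS p(int x) { (void)x; return FAIL; }     /* p ignores x and fails */
static STATUS q(int y) { (void)y; return SUCCESS; }

/* how a sequence  p(x) q(y)  is laid out: q(y) is reached only if p
   succeeds (non-strictness in the combinator), but x is computed before
   p is entered even though p never looks at it (strictness in the data) */
static STATUS p_then_q(void)
{
    if (p(expensive_x()) != SUCCESS)    /* expensive_x() always runs */
        return FAIL;
    return q(expensive_y());            /* expensive_y() only runs on success */
}
```

Since p fails here, a call to p_then_q leaves expensive_x evaluated once and expensive_y untouched, which is exactly the mixture of strict data and non-strict combinators the footnote describes.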


In this respect, precc fulfils its design goals; it reduces the turn-around time in language development, both because it requires only local recompilations when changes are made, and also through the expressive power of its definitional language. It has been used to make full scale parsers and scanners for COBOL 74,29 and to implement the programming language Uniform,32 for example. The definition scripts for these languages comprise between one and two thousand lines, involving between one and two hundred parser definitions. The number of keywords alone in COBOL is over two hundred; and the resulting executables are efficient, both in time and in space. The worries over the efficiency of this kind of technology continue to lessen with developments in hardware.

Contexts and the synthesis of attributes

The Fibonacci sequence parser in Figure 1(a) provides an example of the use of both contexts and synthetic attributes. A pair of consecutive integers a, b in a Fibonacci sequence forms a context within which the grammar for the remainder of the sequence is defined:

    @ fib(a, b) = number(a) ⟨','⟩
    @             fib(b, a+b) ...

In addition, the grammar synthesizes an attribute for the production. The attribute is the count k of how many members of the sequence will be seen to the right. If we see a number a and then a sequence beginning with b, a+b containing k members, k+1 will be seen in total:

    ... fib(b, a+b)\k {@ $k + 1 @}
    @
    @           | ⟨'.'⟩ ⟨'.'⟩ {@ 0 @}

On the other hand, if we see just the terminating pair of dots, then the count 0 is returned as the attribute. The ‘\k’ introduces k as a name for the attribute of the term that it follows, and the ‘$k’ dereferences it for use.

Higher-order features

In addition to the above, precc provides higher-order programming facilities in the C runtime environment.
This is a by-product of C’s lax typing restrictions, which mean that functions (such as other parsers) can be freely passed along with other parameters to the parsers which precc builds. Functional parameters are passed as integer-sized addresses of functions, and re-interpreted as functions upon receipt by the callee. But the result is that precc quite transparently supports a higher-order ‘macro’ language, and frequently used grammar constructions can be defined as precc ‘macros’ in order to improve the clarity and maintainability of a script. For example, the precc code for the many1 functional abstraction (i.e., a separated list) can be precisely:

    @ many1(separator, separated)
    @    = separated separator many1(separator, separated)
    @    | separated

although

    @ many1(separator, separated)
    @    = separated { separator separated }?


is a rather more efficient version. In both of these, separator and separated are other parser identifiers, which is what makes the language higher-order. In the C code that is generated, separator and separated are the addresses of functions.

Literacy and compatibility

Other features include the permitted embedding of C code and C preprocessor macros or instructions anywhere in a precc script, with the resulting mix of ‘advantages’ from precc’s own referentially transparent higher-order functions and the referentially opaque C ones. In general, precc scripts are literate programs in the sense of Knuth,22 in that only certain parts of the script are visible to the utility itself, and these visible parts can therefore be embedded in ordinary text (i.e., literate prose) or other computer language instructions, typically C function definitions and preprocessor directives. Hence leading @’s are used in the program text to distinguish it as such.

Precc also provides a simple hook for tokenizing pre-filters such as the UNIX lex19 utility, to which it presents an interface that is intended to be ‘plug compatible’ with yacc’s. Most lexers aimed at a yacc should be usable as they stand with precc. The majority of yacc scripts can be converted to precc scripts too. There are some difficulties with yacc’s %precedence operator, but operator precedences can be encoded explicitly.

Disadvantages

One disadvantage precc does have at runtime is its use of the C stack, which can grow large in deeply structured and recursive grammar descriptions. This is a limitation (of C) under some memory models on restricted architectures, such as an 8086-based PC with its 64KByte segment limit. The script-writer must remember to place ‘cuts’ (‘!’ marks) – which have the effect of jettisoning the stack and resetting the buffers, and which make backtracking across the cut illegal – at judiciously spaced points.
Note also that although any C union, simple or structure data type may be declared as the synthetic attribute type, the data type should not be varied through the script, and precc does no type-checking.

DESIGN

The model that precc follows holds that a compiler may be considered as a modified parser, and a parser is a function which consumes an input stream of tokens, returning a status value to indicate whether or not the input satisfies it, together with the unconsumed portion of the input:

    PARSER == [TOKEN] → ([TOKEN], STATUS)

and the fundamental idea behind the C implementation is that the side-effecting part of the functionality – the consuming of the input stream – can be implemented invisibly as a side-effect in C:

    typedef STATUS PARSER();

A PRETTIER COMPILER-COMPILER

1273

Only the result type need be declared in C, and the side-effect will always be to move the global pointer pstr to a new position in the input stream (if the parse returns with SUCCESS status). A raw stream cannot be rewound by moving the pointer, but precc buffers input in order that rewinding may be implemented∗ correctly by fallback to an earlier value of the pointer pstr, and thus remains faithful to the functional representation above. As stated above, the functional model is based on an exposition by Fairbairn.15 To implement a BNF description as a parser, it suffices to know how to:

1. implement the description of a sequence ab as the parser which accepts the input accepted by a, followed by the input accepted by b, given that it is known how to implement the descriptions a and b correctly as parsers which accept precisely what the descriptions a and b specify;
2. do the same for the description of an ‘or parse’ a|b, which accepts input which a accepts, or, if that fails, accepts input which b accepts.

We can improve upon the original exposition (not in clarity, but) through the use of modern developments in the application of monadic constructions.34 Hutton17 has noticed this connection, but this paper expands on the idea. The status type is, in the simplest case, an ‘exception-type’ disjoint sum:

    STATUS = 1 + 1

in which the exception value is FAIL, and the single ordinary result is SUCCESS. In general, however, the status type can deliver both more detailed reports of failure and of success:

    STATUS = E + X

The failure reports can range over the type E and the success reports over the type X. Let inl and inr be the formal functions which inject values into the union type E + X.
For fixed error type E, we define the auto-rewinding parsers of type X as the type that either produces a successful result inr x with x in X or an error inl e with e in E:

    P(X) == [TOKEN] → ([TOKEN], E + X)

But the member functions of this type must satisfy the equation:

    p ts = (ts′, inl e)  ⇒  ts = ts′

which says that parsers which fail to parse their input rewind the input to the state in which they found it. Let ‘()’ be the unique member of the singleton type. Then the simplest example is the general type specialized to return the single value inr () on success.

∗

Only when actions are executed (where the ‘! ’ instruction occurs in the specification) is the historical stream input jettisoned, because accurate backtracking is no longer possible – the design justification is that unknown side-effects may have been triggered which are certainly impossible to recall, so it is not worth recalling anything from now on.

1274

P. T. BREUER AND J. P. BOWEN

    PARSER == P(1)

Note that the general type, with the equational restriction, is isomorphic to the type of ‘side effects or exceptions’, which has no additional semantic restriction, and is therefore mathematically more tractable:

    [TOKEN] → (E + ([TOKEN], X))

But the isomorphism is artificial: in practice, side-effects are unavoidable, even if they are trivial, so the type P(X) is the more realistic model. We continue the explanation of the precc design in two parts. The first part deals with standard and well-understood aspects of the higher-order combinator approach and its translation into C; the second describes the extension to monadic ideas in detail, and shows how it gives rise naturally to a method for synthesizing attributes. Also, we fix the error type E = 1 to be the singleton for the remainder of the discussion. The only error value will be inl(), signifying a failure of a kind not specified further here.

Standard aspects

Two parsers which follow on from each other can be combined using the sequential combinator:

    p ⨟ q

This combinator has to run p, but save the input stream beforehand so that if p does not raise an exception, but q does, then it can rewind the stream to the initial state, as required by the constraining equation of the auto-rewinding parser type. The auto-rewind from q alone would only recover the output state from p, not the input:

    (p ⨟ q) ts =
        let (ts′, rp) = p ts in
        case rp in
          inl e : (ts, inl e)
          inr u : let (ts″, rq) = q ts′ in
                  case rq in
                    inl f : (ts, inl f)
                    inr v : (ts″, inr v)
                  esac
        esac

The combinator implements the ‘and parse’ combination of two parsers. If p implements specification ‘a’ and q implements specification ‘b’, then p ⨟ q implements ‘a b’. The operator has type

    ⨟ :: P(X) → P(Y) → P(Y)
in general, and precc’s kernel library duplicates the functionality (for result type X = Y) by implementing the C prototype:

    STATUS thenparse0(PARSER, PARSER);

replacing explicit references to the stream states ts and ts′ with real side-effects on the stream pointer, the global variable pstr. The C code is very close to the functional specification:

    STATUS thenparse0(PARSER p, PARSER q)
    {
        void *saved = pstr;
        static STATUS tok;
        if GOOD(tok=p())
            if GOOD(tok=q())
                return tok;
            else
                pstr = saved;
        return tok;
    }

and requires only a suitable macro for GOOD to complete it. The real code in the kernel is not much more complicated than this, and differs essentially only in that it saves a bigger frame, performs some stack manipulations itself, and handles the cases when there are parameters to the parsers p and q. Similarly, the precc library function orparse0 implements the a | b specification via a combinator. The orparse0 C code in the kernel is again close to a functional specification:

    STATUS orparse0(PARSER p, PARSER q)
    {
        static STATUS tok;
        if GOOD(tok=p())
            return tok;
        return q();
    }

and relies in the same way on the rewinding of the parse stream by p when it fails.

Extension to synthetic attributes

In this section, an algebraic theory of parsing with attributes is set out, together with an implementation technique based on combinators. The novel aspect of the treatment is that it deals in a seamless way with attributes. It is simple, given the semantics of the previous section, to prove associativity for ‘⨟’, the sequential parser combinator:

    (p ⨟ q) ⨟ r = p ⨟ (q ⨟ r)

but unnecessary, because p ⨟ q is a trivial case of a more general associative combinator, called the ‘star’ operator, with properties from which associativity for ‘⨟’ follows. The latter takes an attribute attached to p into account, instead of running q regardless of what p returned with. The ‘star’ operator is one of two basic operations on auto-rewinding parsers, and the type P with its operators together form the distinctive abstract typed algebra of a monad.34 The laws are as shown in Figure 2. The ‘eta’ combinator provides a way of attaching synthetic attributes to parses, and will be dealt with later. Recall that a function of an argument x is written λx.Q where Q is the expression in x that computes the result.

    P( )  ::  type → type
    η     ::  X → P(X)
    ?     ::  P(X) → (X → P(Y)) → P(Y)

    (p ? λx.q) ? λy.r  =  p ? λx.(q ? λy.r)     (1)
    η a ? λx.q  =  q[a/x]                       (2)
    p ? λx.η x  =  p                            (3)

Figure 2. The abstract typed auto-rewinding parser constructor, operators and laws. The notation q[a/x] means ‘expression q with x replaced by a throughout’. λx.q means ‘the function with argument x that returns the expression q (of x)’

The operational semantics is shown in Figure 3. For the purposes of translation into C, the type has to be instantiated with X fixed throughout a script. In precc this is accomplished by declaring #define VALUE = . . . at the top of a script, which fixes X = VALUE. This will be the type of attributes that may be synthesized during a parse. An expression ‘p ? λx.q(x)’ may be interpreted as a parser ‘p ⨟ q(xₚ)’ where xₚ is the attribute value delivered by parser p (as inr xₚ) if it succeeded. Because the value xₚ is unknowable before execution, ‘?’ is not expressible in terms of ‘⨟’. But the ‘⨟’ operator can be expressed in terms of the ‘?’ operation, and the ‘?’ associativity law (1) then reduces to associativity for ‘⨟’. Set ‘⨟’ to be:

    p ⨟ q = p ? λx.q
where x is a variable that does not appear in q, and this reduces to the directly defined semantics. The implementation of the ? operator is not via a kernel function. Instead, the b ? λx.c(x) construct is always embedded as in-line code. This is in order to make it possible to use call-by-value semantics for the x variable (which captures the result of the call to b), using the value and not a reference to it in the call to c. Here the in-line code is represented by the ‘STAR’ macro:

    STATUS a()
    {
        PARSER b, c;
        PARAM x;
        static STATUS tok;
        STAR(x, b(), c(x))
        return tok;
    }

The macro expands to C code as follows, beginning with a save of the parsing context in case it needs to be restored (i.e., if the call to b succeeds but the call to c fails):

    void *saved = pstr;
    tok = b();
    if BAD(tok) return tok;
    x = InSTATUS(tok);
    tok = c(x);
    if BAD(tok) pstr = saved;

The value of the attribute x is encoded within the value of tok, the return status of the call to the parser, and the deconstructor InSTATUS extracts it. In the real implementation, the encoding is not by means of structure (i.e., compound) types, which would cost extra bytes of stack space, but by a simple technique. If the attribute interpreted as an integer is non-negative then it is incremented before stacking. That leaves the zero value to denote the failure mode of the call to the parser. In case the parser succeeds, a non-zero value will be returned, and it can then be decoded by InSTATUS, which decrements it as an integer if it is positive and then reads it as an attribute value. Using this approach loses some values (those that map to the maximum positive integer) from the attribute value space, but saves the extra byte that would otherwise be placed on the stack at every call if a structure type were used. The user can opt to use structures instead. The remaining ‘eta’ operator simply attaches a value as an attribute without examining or moving the parse stream.
The C code is as follows:

    STATUS eta0(VALUE x)
    {
        static STATUS tok;
        ToSTATUS(tok,x);
        return tok;
    }

The ToSTATUS macro performs the necessary encoding of the value x so that it can later be decoded by InSTATUS and stores the code in tok.

1278

P. T. BREUER AND J. P. BOWEN

    (p ? λx.Q) ts =
        let (ts′, ra) = p ts in
        case ra in
          inl e : (ts, inl e)
          inr x : let (ts″, rb) = Q ts′ in
                  case rb in
                    inl f : (ts, inl f)
                    inr v : (ts″, inr v)
                  esac
        esac

    η x ts = (ts, inr x)

Figure 3. The semantics of the basic combinators on auto-rewinding parsers

The orparse0 parser combinator, which tries each of p and q in turn and accepts the first (in this order) to succeed, can be represented algebraically as an ‘addition’ p ⊕ q. The ‘?’ operator (and hence ‘⨟’) distributes over the ‘addition’, and there is a ‘zero’ operator too, with which ‘?’ also interacts well. The zero is a parser that always fails. All these operators are embedded in precc’s kernel. The precise laws which hold are shown in Figure 4, and the semantics of the plus and zero are shown in Figure 5, and it may be verified that the given semantics satisfies the asserted laws. Note that, although ‘⊕’ has to be associative, it is not necessarily commutative. The laws state that ‘?’ distributes over ‘⊕’ and ‘0’, and the latter two comprise a semi-group.

Higher-order features

The thenparse0 and orparse0 combinators are essentially higher-order functions, and C supports this functionality: one may pass the addresses of functions (rather than their machine code) whenever a datum of type PARSER is required. A precc construct like:

    p(q) q(p)

(although this particular example is hard to imagine in practical use) is perfectly allowable: the parser p takes the parser q as a parameter, and vice versa. Parameterization is explored further below.

Parameterization

As noted above, the description is complicated further in the real implementation by the presence of parameters to grammar descriptions, which requires that the thenparse0 and orparse0 combinator implementations in C also accept arguments which correspond to those parameters. This is because the parameters cannot be supplied to the parser arguments without evaluating them in C, which would make the compounded parser stricter than the

    0  ::  P(X)
    ⊕  ::  P(X) → P(X) → P(X)

    a ⊕ (b ⊕ c)  =  (a ⊕ b) ⊕ c                     (4)
    a ⊕ 0  =  0 ⊕ a  =  a                           (5)

    (p ⊕ q) ? λx.r  =  p ? λx.r  ⊕  q ? λx.r        (6)
    p ? λx.(q ⊕ r)  =  p ? λx.q  ⊕  p ? λx.r        (7)
    0 ? λx.q  =  0                                  (8)
    p ? λx.0  =  0                                  (9)

Figure 4. Extra laws for a parsing monad over a semi-group

functional description given here. To avoid the problem, the equivalent of the functional programming concept of a closure is required. Every precc kernel routine, such as thenparse0, which we have depicted as taking two arguments:

    STATUS thenparse0(PARSER p, PARSER q)

in fact has an analogue which takes a variable number of arguments, making use of the declaration supported by the stdarg.h ANSI C header file:

    STATUS thenparse0n(PARSER p, ...)

and a call for a sequential compound using parameters is supported by a call of thenparse0n:

    thenparse0n(p, 2, pparm1, pparm2, q, 3, qparm1, qparm2, qparm3)

    (p ⊕ q) ts =
        let (ts′, ra) = p ts in
        case ra in
          inl e : q ts
          inr v : (ts′, inr v)
        esac

    0 ts = (ts, inl ())

Figure 5. The parser semantics for the semi-group operators


corresponding to a p(pparm1, pparm2) q(qparm1, qparm2, qparm3) expression in the precc specification script. The thenparse0n code extracts the parameters off the C call stack and reconstructs the intended calls to the two parsers using the C stdarg routines, then processes them, as may be required.

Compilers

Compilers and interpreters differ from parsers in that they invoke actions which have effects on some external state (such as the data written into a file of generated object code, or the image on a display screen). One way of extending the functional model to treat compilers is to consider them to be the type:

    COMPILER == (STATE, [TOKEN]) → (STATE, [TOKEN], STATUS)

and these can be implemented in C by making the side-effects on state and the input stream into real side-effects, which means that compilers have the same C type as parsers, because the extra effect on STATE cannot be made explicit:

    typedef PARSER COMPILER;

We define an action to be a side-effecting function in the functional model:

    ACTION == STATE → STATE

and implement these in C as functions which take no arguments and return no results:

    typedef void ACTION();

This format of prototype permits arguments to the function call, and that is exactly what is required in order to support the translation of parameterized BNF definitions. A typically higher-order way of building a compiler is to attach an action to a parser, letting the action be executed precisely when the parse succeeds:

    attach :: (PARSER, ACTION) → COMPILER
    attach(p,f) (w,ts) =
        let (ts′, rp) = p ts in
        case rp in
          inl e : (w, ts, inl e)
          inr v : (f w, ts′, inr v)
        esac

This method can be duplicated in C using a higher-order function with prototype:

    STATUS attach(PARSER, ACTION);

which moves the side-effects out of the parameter list and implements them directly. One important complication arises, however. Whereas the input stream can be rewound in the real world if the parser parameter fails, side-effects cannot be. Therefore it is important that, if the parser succeeds, instead of triggering the action immediately, attach should


simply place the address of its action parameter onto a fifo queue for later execution. The queue is emptied whenever a cut ‘!’ command is reached in the grammar script (or at the end of stream input, if the parse is successful, if this occurs first). So the C code includes a ‘PUSHACTION’ command instead of a direct execution of the action:

    STATUS attach(PARSER p, ACTION a)
    {
        if GOOD(p()) {
            PUSHACTION(a);
            return SUCCESS;
        }
        return FAILURE;
    }

Again, C passes addresses of functions as parameters, not code, so there must be a C function declaration which corresponds to the action parameter to the attach routine, and precc builds the C code for it explicitly. When the form ‘a {: code :}’ is encountered in a script – which means ‘execute code if parser a succeeds’ – the action

    static void hiddenfn1() { code }

is written into the generated code, making the code accessible as a C function. It is followed by the following parser definition:

    static STATUS hiddenfn2()
    {
        PARSER a;
        ACTION hiddenfn1;
        return attach(a, hiddenfn1);
    }

This will push the function hiddenfn1 onto the action stack for later execution, provided that parser ‘a’ succeeds. Further aspects of the action semantics are considered in the next sections. It should be noted that precc properly interprets all inherited attribute grammar descriptions. This means that it allows parameters to all the parsers/compilers it constructs.∗ It properly interprets all synthetic attribute grammar descriptions too, provided that the attribute values are of a kind that can be placed on the C call stack. Precc can also feed back synthesized attributes as inherited attributes to sub-grammars (data flow in the other direction is also possible and, indeed, usual in precc definition scripts).

Complexity

Parsing n input tokens causes precc to stack at most n actions for later execution. Some of these actions are trivial or skipped but the nett effect is that precc builds a program of

∗

There is a hard-coded limit to the number of parameters allowed in the precc library routines, but it should not be a problem in practice. Up to sixteen parameters are allowed at present.


size that is linear in its uncut input stream length. It also necessarily stacks C calls on the C call stack to a depth again linearly (directly) bounded by the input stream, but the depth is usually much less than the theoretical maximum, especially if non-recursive constructions are favoured in the script over recursive ones. The call depth depends on the structure of the specification script. The call stack should be jettisoned after passing a cut mark (‘!’) in the script. The designer should use cuts and avoid using recursion too heavily if the call stack is physically limited (as it is under MS-DOS, for example). About twenty bytes of stack are required per kernel call, on average, and a 64Kbyte call stack therefore permits a recursion depth of 3200 calls. These will all be combinators: thenparse0 and orparse0 calls, for example. Precc jettisons the stack whenever it returns to its top level parser, as well as while recovering from an error, such as an attempted backtrack through a cut point. It repeats the top level parser until input is exhausted, so scripts are usually designed to be of the form top = comment | code | . . . , with the implied arbitrary repeat of top. The design inherently reduces call stack usage. Under operating systems such as UNIX, however, where stack size is not a practical constraint, no particular care need be taken.

More on actions

Actions are pieces of C code (terminated by a semi-colon) placed between a pair of bracket-colons (‘{:. . . :}’) in the grammar definition script. For example:

    @ addexpr = expr\x ⟨'+'⟩ expr\y {: printf("%d",$x+$y); :}
    @

is not unreasonable. The ‘$x’ is the value attached to the leftmost term, and ‘$y’ is that attached to the rightmost term. Side-effecting actions can be sequenced as follows (recall that no side-effect takes place until after the parse, by which time precc has built an internal list of actions to execute which corresponds exactly to the successful trace through the grammar).
The specification:

    @ abc = a b c    {: printf("D"); :}
    @ a   = ⟨'a'⟩    {: printf("A"); :}
    @ b   = ⟨'b'⟩    {: printf("B"); :}
    @ c   = ⟨'c'⟩    {: printf("C"); :}

will, upon receiving input "abc", generate the C program equivalent:

    printf("A"); printf("B"); printf("C"); printf("D");

to be executed later. Thus, actions attached to a sequence expression may be thought of as occurring immediately after the actions attached to sub-expressions, and so on down.

A WORKED EXAMPLE

As an examination of the practicality and usability of precc, the language Occam has provided a ready-made test.5 Occam’s razor:


Entia non sunt multiplicanda praeter necessitatem
(No more things should be presumed to exist than are absolutely necessary)

reminds us that a thing which can be done without, ought not to be used. Fortunately, it seems that precc may be needed in order to build a parser for Occam, because it is certainly not an easy language for LALR(1)-based parsers to handle. In particular, it requires the lexical indentation of constructs from the left-hand edge to be recognized by the parser since this often delimits their scope, in place of the perhaps more familiar BEGIN/END delimiters of Pascal and other block structured languages (or ‘{’ and ‘}’ of C). As a result, knowledge of the current indentation may be needed at many points during parsing, and backtracking to an alternative interpretation may be necessary as the scope of (possibly several) constructs is ended. For example, the two programs below35 differ only in the indentation of one line, but the program on the left yields y=10, while that on the right results in y=0:

    SEQ                          SEQ
      x,y,z := 10,1,0              x,y,z := 10,1,0
      WHILE x>0                    WHILE x>0
        SEQ                          SEQ
          z := z+x                     z := z+x
          x := x-1                     x := x-1
      y := y*x                         y := y*x

The scope of the second SEQ is terminated by the ‘offside’ (less indented) statement. Occam could theoretically be handled by an LR(40) grammar, with a 40 token lookahead, since the indentation may be limited to 80 columns (2 spaces per indentation). So yacc could handle all practical Occam programs, once the LR(40) specification is expanded out into LR(1) form. However, this would be very wasteful of space for the automaton’s tables, much of which would be duplicated. The use of purely synthetic attribute grammars is in any case slightly problematic because several syntactic constructs may be terminated by a single lexical group. For example, in

    SEQ
      SEQ
        SEQ
          x:=y
      y:=z

the group y:=z belongs to the outermost SEQ by virtue of its indentation position, which terminates the two innermost SEQ constructs, so it is not clear what token type it should present. A prior pass may be necessary to insert BEGIN and END tokens which can serve as an unambiguous delineation of the block structure, but this amounts to admitting that yacc-type technology is not well-suited to this kind of parsing problem.

A realistic parser for Occam

This section presents a precc parser for a significant subset of Occam 2, and demonstrates how closely the parser can be related to the BNF-style syntax specification in Appendix G of the official Occam manual,18 a sample of which is given in Figure 6. In the notation used


    process         =  SKIP
                    |  STOP
                    |  construction
                    |  ...

    construction    =  sequence
                    |  conditional
                    |  loop
                    |  ...

    conditional     =  IF
                         { choices }

    choice          =  guarded.choice
                    |  conditional

    guarded.choice  =  boolean
                         process

    loop            =  WHILE boolean
                         process

Figure 6. Sample from Occam 2 syntax summary

there and in the manual, the indentation of constructs is specified by indenting the more regular part of the BNF description where appropriate. While not pure BNF, and not completely formal, the metalanguage provides a clear specification of the intended concrete syntax. The corresponding specification in precc is shown in Figure 7. This is one of the more complicated and therefore more interesting parts of the specification as far as parsing is concerned.

The precc specification differs in several respects. Firstly and most obviously, all the non-terminals are parameterized by a context. They take the context n as the parameter, denoting the absolute value of the indentation of the concrete representation, so that conditional(5) denotes an IF construct with the initial I in column 5. Where the concrete specification denotes extra indentation by spacing, the precc specification consistently uses an incremented parameter instead. Thus choices(n+2) is included where the concrete syntax instead has choices two spaces further to the right than the governing IF keyword.

Secondly, the precc specification employs special matches for the ‘end of line’. Although precc can match the literal ASCII line feed control character directly, the default lexical analyzer which comes with precc generates the special zero token at an end of line, and it is this that is matched by the ‘$!’ and ‘$’ constructs (see below). Though precc can use lexers generated by the lex utility directly, in exactly the same way as yacc does, no special lexical analysis is required for the purposes of this trial, and the default lexer can be used. This lexer passes every character unchanged, except that it replaces a line feed control character with the special zero token.
The rationale behind the design of the default lexer is that ‘end of line’ frequently marks a natural break point at which special action must be taken, and yacc-compatible lexers conventionally signal special conditions with the zero token. Actions which check for end of file conditions can be attached to the match for the zero token. The ‘$’ is a match for end of line which can be backtracked over, and the ‘$!’ is a match for one which cannot. It incorporates the special cut symbol, ‘!’, which disables


backtracking. It is best to include as many cuts as possible in a precc script, because in the event of an error being detected in the parsed stream, they prevent exponential searches through all the possible alternative parses. The cut is one of the innovative implementational aspects of precc in comparison with other LL(∞) parsers; the use of the C call stack is another.

    @ process(n)       = SKIP | STOP | construction(n) | ...
    @ construction(n)  = sequence(n) | conditional(n) | loop(n) | ...
    @ conditional(n)   = IF $!
    @                      [ choices(n+2) ]
    @ choice(n)        = [ specifications(n) ] indent(n,chosen)
    @ chosen(n)        = guardedchoice(n) | conditional(n)
    @ guardedchoice(n) = boolean $
    @                      indent(n+2,process)
    @ loop(n)          = WHILE somespace boolean
    @                      indent(n+2,process)
    @ clause(n)        = indent(n,process)

Figure 7. Corresponding precc specification

It is certain that an IF begins a conditional statement and nothing else, so a cut may be inserted after the IF, but it is conceivable that it may be necessary to backtrack over the newline following a boolean expression in case the following stream does not match. In fact, this is never going to be the case in normal circumstances, but it is safest to allow for such a possibility, so a ‘$’ is advisable at this point.

Thirdly, the precc specification uses the square brackets syntax [ foo ] to indicate an optional inclusion, where the concrete syntax uses curly brackets instead. This is a trivial difference – precc reserves the curly brackets for use as grouping parentheses.

Fourthly, where the concrete specification has ‘process’, the precc specification instead has ‘indent(n,process)’ or ‘indent(n+2,process)’. The ‘n + 2’ indicates extra indentation to the right and the construction has been introduced in order to codify the extra indentation required in a subordinate block of code. It is convenient to define clause(n) = indent(n+2,process) which might equally well be placed directly in the specification, but, for the purposes of comparison, the expanded version has been left in place here. Giving a name to the construction certainly would make the specification more maintainable because it eliminates repetition and narrows the range over which changes may have to be made at some future point. It also makes the resulting code more compact, at the cost of a tiny loss in runtime performance which may often be optimized away by the C compiler. (Experiments have shown that function calls with simple bodies – which this construction induces – are usually copied into in-line code by optimizing C compilers.) All this is possible because of precc’s


declarative semantics. An expression may be substituted by a name for the expression anywhere in the text without altering the semantics.

Why, in detail, does the precc specification need an indent(n+2,process) construct at all, when the concrete syntax includes nothing of the sort? It is because the n spaces of context which appear to the left of each production must be taken into account. The representation in Figure 6 drops these from the succeeding lines, showing only the relative indentation. Inserting them explicitly gives the concrete specification the shape

    guardedchoice(n)
          ↓
    guarded.choice = boolean
                     ␣ ␣ ... ␣ ␣  process
                     \__________/ \_____/
                      blank ? n   process(n+2)

and the precc guardedchoice(n) definition is to represent the construction starting with a boolean in column n. The full precc specification across the two lines has to include the n blank spaces before the process part:

    @ guardedchoice(n) = boolean $
    @                      blank ? (n+2) process(n+2)
    @

and the whole lower line of this definition is replaced in the actual script by indent(n+2, process). The latter is a higher-order construction:

    @ indent(n,p) = blank ? n p(n)

The pattern is frequently reused in the Occam specification. Note that ‘blank ? n’ is the precc syntax for ‘blank repeated n times’ (and a ‘?’ without a following expression means ‘repeated zero or more times’). One example of the use of indent is in the definition of clause(n):

    @ clause(n) = indent(n,process)

The script can be made even more flexible if required. The precc parser can allow the process to start beyond the mandatory two spaces farther to the right, and can compute any indentations to come relative to the actual position in which process starts (although this flexibility is disallowed by the standard Occam language, perhaps because of the extra complexity of handling it). This may be achieved by substituting indented(n,p) for p(n), and defining the indented grammars so that they detect any extra blanks and incorporate them into the indentation count:

    @ indented(n,p) = blank indented(n+1,p) | p(n)
    @
    @ indent(n,p)   = blank ? n indented(n,p)


There is some inefficiency introduced here, because a malformed p following some blanks may be (partially) scanned twice or more, but it may be considered worthwhile because of the extra flexibility in the acceptable syntax that it affords. There are also good reasons for inserting a test into the specification to ensure that n does not grow too large: @ indented(n,p) = blank )n