Compilers: Principles, Techniques, and Tools



Second Edition

Alfred V. Aho, Columbia University
Monica S. Lam, Stanford University
Ravi Sethi, Avaya
Jeffrey D. Ullman, Stanford University

Boston San Francisco New York London Toronto Sydney Tokyo Singapore Madrid Mexico City Munich Paris Cape Town Hong Kong Montreal

Publisher: Greg Tobin
Executive Editor: Michael Hirsch
Acquisitions Editor: Matt Goldstein
Project Editor: Katherine Harutunian
Associate Managing Editor: Jeffrey Holcomb
Cover Designer: Joyce Cosentino Wells
Digital Assets Manager: Marianne Groth
Media Producer: Bethany Tidd
Senior Marketing Manager: Michelle Brown
Marketing Assistant: Sarah Milmore
Senior Author Support/Technology Specialist: Joe Vetere
Senior Manufacturing Buyer: Carol Melville
Cover Image: Scott Ullman of Strange Tonic Productions (www.strangetonic.com)

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps. The interior of this book was composed in LaTeX.

Library of Congress Cataloging-in-Publication Data

Compilers : principles, techniques, and tools / Alfred V. Aho ... [et al.]. -- 2nd ed.
p. cm.
Rev. ed. of: Compilers, principles, techniques, and tools / Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. 1986.
ISBN 0-321-48681-1 (alk. paper)
1. Compilers (Computer programs) I. Aho, Alfred V. II. Aho, Alfred V. Compilers, principles, techniques, and tools.
QA76.76.C65A37 2007
005.4'53--dc22
2006024333

Copyright © 2007 Pearson Education, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.

For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contracts Department, 75 Arlington Street, Suite 300, Boston, MA 02116, fax your request to 617-848-7047, or e-mail at http://www.pearsoned.com/legal/permissions.htm.

2 3 4 5 6 7 8 9 10-CW-10 09 08 07 06

Preface

In the time since the 1986 edition of this book, the world of compiler design has changed significantly. Programming languages have evolved to present new compilation problems. Computer architectures offer a variety of resources of which the compiler designer must take advantage. Perhaps most interestingly, the venerable technology of code optimization has found use outside compilers. It is now used in tools that find bugs in software, and most importantly, find security holes in existing code. And much of the "front-end" technology - grammars, regular expressions, parsers, and syntax-directed translators - is still in wide use.

Thus, our philosophy from previous versions of the book has not changed. We recognize that few readers will build, or even maintain, a compiler for a major programming language. Yet the models, theory, and algorithms associated with a compiler can be applied to a wide range of problems in software design and software development. We therefore emphasize problems that are most commonly encountered in designing a language processor, regardless of the source language or target machine.

Use of the Book

It takes at least two quarters or even two semesters to cover all or most of the material in this book. It is common to cover the first half in an undergraduate course and the second half of the book - stressing code optimization - in a second course at the graduate or mezzanine level. Here is an outline of the chapters:

Chapter 1 contains motivational material and also presents some background issues in computer architecture and programming-language principles.

Chapter 2 develops a miniature compiler and introduces many of the important concepts, which are then developed in later chapters. The compiler itself appears in the appendix.

Chapter 3 covers lexical analysis, regular expressions, finite-state machines, and scanner-generator tools. This material is fundamental to text-processing of all sorts.


Chapter 4 covers the major parsing methods, top-down (recursive-descent, LL) and bottom-up (LR and its variants).

Chapter 5 introduces the principal ideas in syntax-directed definitions and syntax-directed translations.

Chapter 6 takes the theory of Chapter 5 and shows how to use it to generate intermediate code for a typical programming language.

Chapter 7 covers run-time environments, especially management of the run-time stack and garbage collection.

Chapter 8 is on object-code generation. It covers construction of basic blocks, generation of code from expressions and basic blocks, and register-allocation techniques.

Chapter 9 introduces the technology of code optimization, including flow graphs, data-flow frameworks, and iterative algorithms for solving these frameworks.

Chapter 10 covers instruction-level optimization. The emphasis is on the extraction of parallelism from small sequences of instructions and scheduling them on single processors that can do more than one thing at once.

Chapter 11 talks about larger-scale parallelism detection and exploitation. Here, the emphasis is on numeric codes that have many tight loops that range over multidimensional arrays.

Chapter 12 is on interprocedural analysis. It covers pointer analysis, aliasing, and data-flow analysis that takes into account the sequence of procedure calls that reach a given point in the code.

Courses from material in this book have been taught at Columbia, Harvard, and Stanford. At Columbia, a senior/first-year graduate course on programming languages and translators has been regularly offered using material from the first eight chapters. A highlight of this course is a semester-long project in which students work in small teams to create and implement a little language of their own design. The student-created languages have covered diverse application domains including quantum computation, music synthesis, computer graphics, gaming, matrix operations, and many other areas. Students use compiler-component generators such as ANTLR, Lex, and Yacc and the syntax-directed translation techniques discussed in chapters two and five to build their compilers. A follow-on graduate course has focused on material in Chapters 9 through 12, emphasizing code generation and optimization for contemporary machines including network processors and multiprocessor architectures.

At Stanford, a one-quarter introductory course covers roughly the material in Chapters 1 through 8, although there is an introduction to global code optimization from Chapter 9. The second compiler course covers Chapters 9 through 12, plus the more advanced material on garbage collection from Chapter 7. Students use a locally developed, Java-based system called Joeq for implementing data-flow analysis algorithms.


Prerequisites

The reader should possess some "computer-science sophistication," including at least a second course on programming, and courses in data structures and discrete mathematics. Knowledge of several different programming languages is useful.

Exercises

The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.

Gradiance On-Line Homeworks

A feature of the new edition is that there is an accompanying set of on-line homeworks using a technology developed by Gradiance Corp. Instructors may assign these homeworks to their class, or students not enrolled in a class may enroll in an "omnibus class" that allows them to do the homeworks as a tutorial (without an instructor-created class). Gradiance questions look like ordinary questions, but your solutions are sampled. If you make an incorrect choice you are given specific advice or feedback to help you correct your solution. If your instructor permits, you are allowed to try again, until you get a perfect score.

A subscription to the Gradiance service is offered with all new copies of this text sold in North America. For more information, visit the Addison-Wesley web site www.aw.com/gradiance or send email to computing@aw.com.

Support on the World Wide Web

The book's home page is dragonbook.stanford.edu

Here, you will find errata as we learn of them, and backup materials. We hope to make available the notes for each offering of compiler-related courses as we teach them, including homeworks, solutions, and exams. We also plan to post descriptions of important compilers written by their implementers.

Acknowledgements

Cover art is by S. D. Ullman of Strange Tonic Productions. Jon Bentley gave us extensive comments on a number of chapters of an earlier draft of this book. Helpful comments and errata were received from:


Domenico Bianculli, Peter Bosch, Marcio Buss, Marc Eaddy, Stephen Edwards, Vibhav Garg, Kim Hazelwood, Gaurav Kc, Wei Li, Mike Smith, Art Stamness, Krysta Svore, Olivier Tardieu, and Jia Zeng. The help of all these people is gratefully acknowledged. Remaining errors are ours, of course.

In addition, Monica would like to thank her colleagues on the SUIF compiler team for an 18-year lesson on compiling: Gerald Aigner, Dzintars Avots, Saman Amarasinghe, Jennifer Anderson, Michael Carbin, Gerald Cheong, Amer Diwan, Robert French, Anwar Ghuloum, Mary Hall, John Hennessy, David Heine, Shih-Wei Liao, Amy Lim, Benjamin Livshits, Michael Martin, Dror Maydan, Todd Mowry, Brian Murphy, Jeffrey Oplinger, Karen Pieper, Martin Rinard, Olatunji Ruwase, Constantine Sapuntzakis, Patrick Sathyanathan, Michael Smith, Steven Tjiang, Chau-Wen Tseng, Christopher Unkel, John Whaley, Robert Wilson, Christopher Wilson, and Michael Wolf.

A. V. A., Chatham NJ
M. S. L., Menlo Park CA
R. S., Far Hills NJ
J. D. U., Stanford CA
June, 2006

Table of Contents

1 Introduction
    1.1 Language Processors
    1.2 The Structure of a Compiler
    1.3 The Evolution of Programming Languages
    1.4 The Science of Building a Compiler
    1.5 Applications of Compiler Technology
    1.6 Programming Language Basics
    1.7 Summary of Chapter 1
    1.8 References for Chapter 1

2 A Simple Syntax-Directed Translator
    2.1 Introduction
    2.2 Syntax Definition
    2.3 Syntax-Directed Translation
    2.4 Parsing
    2.5 A Translator for Simple Expressions
    2.6 Lexical Analysis
    2.7 Symbol Tables
    2.8 Intermediate Code Generation
    2.9 Summary of Chapter 2

3 Lexical Analysis
    3.1 The Role of the Lexical Analyzer
    3.2 Input Buffering
    3.3 Specification of Tokens
    3.4 Recognition of Tokens
    3.5 The Lexical-Analyzer Generator Lex
    3.6 Finite Automata
    3.7 From Regular Expressions to Automata
    3.8 Design of a Lexical-Analyzer Generator
    3.9 Optimization of DFA-Based Pattern Matchers
    3.10 Summary of Chapter 3
    3.11 References for Chapter 3

4 Syntax Analysis
    4.1 Introduction
    4.2 Context-Free Grammars
    4.3 Writing a Grammar
    4.4 Top-Down Parsing
    4.5 Bottom-Up Parsing
    4.6 Introduction to LR Parsing: Simple LR
    4.7 More Powerful LR Parsers
    4.8 Using Ambiguous Grammars
    4.9 Parser Generators
    4.10 Summary of Chapter 4
    4.11 References for Chapter 4

5 Syntax-Directed Translation
    5.1 Syntax-Directed Definitions
    5.2 Evaluation Orders for SDD's
    5.3 Applications of Syntax-Directed Translation
    5.4 Syntax-Directed Translation Schemes
    5.5 Implementing L-Attributed SDD's
    5.6 Summary of Chapter 5
    5.7 References for Chapter 5

6 Intermediate-Code Generation
    6.1 Variants of Syntax Trees
    6.2 Three-Address Code
    6.3 Types and Declarations
    6.4 Translation of Expressions
    6.5 Type Checking
    6.6 Control Flow
    6.7 Backpatching
    6.8 Switch-Statements
    6.9 Intermediate Code for Procedures
    6.10 Summary of Chapter 6
    6.11 References for Chapter 6

7 Run-Time Environments
    7.1 Storage Organization
    7.2 Stack Allocation of Space
    7.3 Access to Nonlocal Data on the Stack
    7.4 Heap Management
    7.5 Introduction to Garbage Collection
    7.6 Introduction to Trace-Based Collection
    7.7 Short-Pause Garbage Collection
    7.8 Advanced Topics in Garbage Collection
    7.9 Summary of Chapter 7
    7.10 References for Chapter 7

8 Code Generation
    8.1 Issues in the Design of a Code Generator
    8.2 The Target Language
    8.3 Addresses in the Target Code
    8.4 Basic Blocks and Flow Graphs
    8.5 Optimization of Basic Blocks
    8.6 A Simple Code Generator
    8.7 Peephole Optimization
    8.8 Register Allocation and Assignment
    8.9 Instruction Selection by Tree Rewriting
    8.10 Optimal Code Generation for Expressions
    8.11 Dynamic Programming Code-Generation
    8.12 Summary of Chapter 8
    8.13 References for Chapter 8

9 Machine-Independent Optimizations
    9.1 The Principal Sources of Optimization
    9.2 Introduction to Data-Flow Analysis
    9.3 Foundations of Data-Flow Analysis
    9.4 Constant Propagation
    9.5 Partial-Redundancy Elimination
    9.6 Loops in Flow Graphs
    9.7 Region-Based Analysis
    9.8 Symbolic Analysis
    9.9 Summary of Chapter 9
    9.10 References for Chapter 9

10 Instruction-Level Parallelism
    10.1 Processor Architectures
    10.2 Code-Scheduling Constraints
    10.3 Basic-Block Scheduling
    10.4 Global Code Scheduling
    10.5 Software Pipelining
    10.6 Summary of Chapter 10
    10.7 References for Chapter 10

11 Optimizing for Parallelism and Locality
    11.1 Basic Concepts
    11.2 Matrix Multiply: An In-Depth Example
    11.3 Iteration Spaces
    11.4 Affine Array Indexes
    11.5 Data Reuse
    11.6 Array Data-Dependence Analysis
    11.7 Finding Synchronization-Free Parallelism
    11.8 Synchronization Between Parallel Loops
    11.9 Pipelining
    11.10 Locality Optimizations
    11.11 Other Uses of Affine Transforms
    11.12 Summary of Chapter 11
    11.13 References for Chapter 11

12 Interprocedural Analysis
    12.1 Basic Concepts
    12.2 Why Interprocedural Analysis?
    12.3 A Logical Representation of Data Flow
    12.4 A Simple Pointer-Analysis Algorithm
    12.5 Context-Insensitive Interprocedural Analysis
    12.6 Context-Sensitive Pointer Analysis
    12.7 Datalog Implementation by BDD's
    12.8 Summary of Chapter 12
    12.9 References for Chapter 12

A A Complete Front End
    A.1 The Source Language
    A.2 Main
    A.3 Lexical Analyzer
    A.4 Symbol Tables and Types
    A.5 Intermediate Code for Expressions
    A.6 Jumping Code for Boolean Expressions
    A.7 Intermediate Code for Statements
    A.8 Parser
    A.9 Creating the Front End

B Finding Linearly Independent Solutions

Index

Chapter 1

Introduction

Programming languages are notations for describing computations to people and to machines. The world as we know it depends on programming languages, because all the software running on all the computers was written in some programming language. But, before a program can be run, it first must be translated into a form in which it can be executed by a computer. The software systems that do this translation are called compilers.

This book is about how to design and implement compilers. We shall discover that a few basic ideas can be used to construct translators for a wide variety of languages and machines. Besides compilers, the principles and techniques for compiler design are applicable to so many other domains that they are likely to be reused many times in the career of a computer scientist. The study of compiler writing touches upon programming languages, machine architecture, language theory, algorithms, and software engineering.

In this preliminary chapter, we introduce the different forms of language translators, give a high-level overview of the structure of a typical compiler, and discuss the trends in programming languages and machine architecture that are shaping compilers. We include some observations on the relationship between compiler design and computer-science theory and an outline of the applications of compiler technology that go beyond compilation. We end with a brief outline of key programming-language concepts that will be needed for our study of compilers.

1.1 Language Processors

Simply stated, a compiler is a program that can read a program in one language - the source language - and translate it into an equivalent program in another language - the target language; see Fig. 1.1. An important role of the compiler is to report any errors in the source program that it detects during the translation process.

[Diagram: source program -> Compiler -> target program]

Figure 1.1: A compiler

If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs; see Fig. 1.2.

[Diagram: input -> Target Program -> output]

Figure 1.2: Running the target program

An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user, as shown in Fig. 1.3.

[Diagram: source program and input -> Interpreter -> output]

Figure 1.3: An interpreter

The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.

Example 1.1: Java language processors combine compilation and interpretation, as shown in Fig. 1.4. A Java source program may first be compiled into an intermediate form called bytecodes. The bytecodes are then interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine can be interpreted on another machine, perhaps across a network.

In order to achieve faster processing of inputs to outputs, some Java compilers, called just-in-time compilers, translate the bytecodes into machine language immediately before they run the intermediate program to process the input.

[Diagram: source program -> Translator -> intermediate program; intermediate program and input -> Virtual Machine -> output]

Figure 1.4: A hybrid compiler

In addition to a compiler, several other programs may be required to create an executable target program, as shown in Fig. 1.5. A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands, called macros, into source language statements.

The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.

Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine. The linker resolves external memory addresses, where the code in one file may refer to a location in another file. The loader then puts together all of the executable object files into memory for execution.

1.1.1 Exercises for Section 1.1

Exercise 1.1.1: What is the difference between a compiler and an interpreter?

Exercise 1.1.2: What are the advantages of (a) a compiler over an interpreter (b) an interpreter over a compiler?

Exercise 1.1.3: What advantages are there to a language-processing system in which the compiler produces assembly language rather than machine language?

Exercise 1.1.4: A compiler that translates a high-level language into another high-level language is called a source-to-source translator. What advantages are there to using C as a target language for a compiler?

Exercise 1.1.5: Describe some of the tasks that an assembler needs to perform.

[Diagram: source program -> Preprocessor -> modified source program -> Compiler -> target assembly program -> Assembler -> relocatable machine code -> Linker/Loader (together with library files and relocatable object files) -> target machine code]

Figure 1.5: A language-processing system

1.2 The Structure of a Compiler

Up to this point we have treated a compiler as a single box that maps a source program into a semantically equivalent target program. If we open up this box a little, we see that there are two parts to this mapping: analysis and synthesis.

The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them. It then uses this structure to create an intermediate representation of the source program. If the analysis part detects that the source program is either syntactically ill formed or semantically unsound, then it must provide informative messages, so the user can take corrective action. The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part.

The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table. The analysis part is often called the front end of the compiler; the synthesis part is the back end.

If we examine the compilation process in more detail, we see that it operates as a sequence of phases, each of which transforms one representation of the source program to another. A typical decomposition of a compiler into phases is shown in Fig. 1.6. In practice, several phases may be grouped together, and the intermediate representations between the grouped phases need not be constructed explicitly. The symbol table, which stores information about the entire source program, is used by all phases of the compiler.

[Diagram: character stream -> Lexical Analyzer -> token stream -> Syntax Analyzer -> syntax tree -> Semantic Analyzer -> syntax tree -> Intermediate Code Generator -> intermediate representation -> Machine-Independent Code Optimizer -> intermediate representation -> Code Generator -> target-machine code -> Machine-Dependent Code Optimizer -> target-machine code; the Symbol Table is shared by all phases]

Figure 1.6: Phases of a compiler

Some compilers have a machine-independent optimization phase between the front end and the back end. The purpose of this optimization phase is to perform transformations on the intermediate representation, so that the back end can produce a better target program than it would have otherwise produced from an unoptimized intermediate representation. Since optimization is optional, one or the other of the two optimization phases shown in Fig. 1.6 may be missing.

1.2.1 Lexical Analysis

The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form

    ⟨token-name, attribute-value⟩

that it passes on to the subsequent phase, syntax analysis. In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.

For example, suppose a source program contains the assignment statement

    position = initial + rate * 60                        (1.1)

The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:

1. position is a lexeme that would be mapped into a token ⟨id, 1⟩, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.

2. The assignment symbol = is a lexeme that is mapped into the token ⟨=⟩. Since this token needs no attribute-value, we have omitted the second component. We could have used any abstract symbol such as assign for the token-name, but for notational convenience we have chosen to use the lexeme itself as the name of the abstract symbol.

3. initial is a lexeme that is mapped into the token ⟨id, 2⟩, where 2 points to the symbol-table entry for initial.

4. + is a lexeme that is mapped into the token ⟨+⟩.

5. rate is a lexeme that is mapped into the token ⟨id, 3⟩, where 3 points to the symbol-table entry for rate.

6. * is a lexeme that is mapped into the token ⟨*⟩.

7. 60 is a lexeme that is mapped into the token ⟨60⟩.¹

Blanks separating the lexemes would be discarded by the lexical analyzer. Figure 1.7 shows the representation of the assignment statement (1.1) after lexical analysis as the sequence of tokens

    ⟨id, 1⟩ ⟨=⟩ ⟨id, 2⟩ ⟨+⟩ ⟨id, 3⟩ ⟨*⟩ ⟨60⟩              (1.2)

In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and multiplication operators, respectively.

¹Technically speaking, for the lexeme 60 we should make up a token like ⟨number, 4⟩, where 4 points to the symbol table for the internal representation of integer 60, but we shall defer the discussion of tokens for numbers until Chapter 2. Chapter 3 discusses techniques for building lexical analyzers.
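To make the mapping from characters to lexemes and tokens concrete, here is a minimal lexer sketch in Java. It is illustrative only, not the book's code (lexical analyzers are built properly in Chapters 2 and 3); the Token and ToyLexer names are invented for this example. It recognizes only identifiers, integer constants, and single-character operators, and it numbers identifiers in order of first appearance to stand in for symbol-table entries.

    import java.util.*;

    class Token {
        final String name;        // abstract symbol: "id", "60", "=", "+", "*"
        final Integer attribute;  // symbol-table index for identifiers, null otherwise
        Token(String name, Integer attribute) { this.name = name; this.attribute = attribute; }
        public String toString() {
            return attribute == null ? "<" + name + ">" : "<" + name + "," + attribute + ">";
        }
    }

    class ToyLexer {
        // Maps an identifier's lexeme to its symbol-table index (1, 2, 3, ...).
        private final Map<String, Integer> symbols = new LinkedHashMap<>();

        List<Token> scan(String input) {
            List<Token> tokens = new ArrayList<>();
            int i = 0;
            while (i < input.length()) {
                char c = input.charAt(i);
                if (Character.isWhitespace(c)) {        // blanks are discarded
                    i++;
                } else if (Character.isLetter(c)) {     // an identifier lexeme
                    int start = i;
                    while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                    String lexeme = input.substring(start, i);
                    symbols.putIfAbsent(lexeme, symbols.size() + 1);
                    tokens.add(new Token("id", symbols.get(lexeme)));
                } else if (Character.isDigit(c)) {      // an integer constant
                    int start = i;
                    while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                    tokens.add(new Token(input.substring(start, i), null));
                } else {                                // a single-character operator
                    tokens.add(new Token(String.valueOf(c), null));
                    i++;
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(new ToyLexer().scan("position = initial + rate * 60"));
            // prints [<id,1>, <=>, <id,2>, <+>, <id,3>, <*>, <60>]
        }
    }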

[Diagram: the statement position = initial + rate * 60 after each phase of Fig. 1.6 - the token stream (1.2), the syntax tree produced by the parser, the tree produced by the semantic analyzer with an inttofloat node wrapping 60, the three-address code (1.3), and target machine code using LDF, MULF, ADDF, and STF - together with the symbol table holding position, initial, and rate.]

Figure 1.7: Translation of an assignment statement


1.2.2 Syntax Analysis

The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical structure of the token stream. A typical representation is a syntax tree in which each interior node represents an operation and the children of the node represent the arguments of the operation. A syntax tree for the token stream (1.2) is shown as the output of the syntactic analyzer in Fig. 1.7.

This tree shows the order in which the operations in the assignment

    position = initial + rate * 60

are to be performed. The tree has an interior node labeled * with ⟨id, 3⟩ as its left child and the integer 60 as its right child. The node ⟨id, 3⟩ represents the identifier rate. The node labeled * makes it explicit that we must first multiply the value of rate by 60. The node labeled + indicates that we must add the result of this multiplication to the value of initial. The root of the tree, labeled =, indicates that we must store the result of this addition into the location for the identifier position. This ordering of operations is consistent with the usual conventions of arithmetic, which tell us that multiplication has higher precedence than addition, and hence that the multiplication is to be performed before the addition.

The subsequent phases of the compiler use the grammatical structure to help analyze the source program and generate the target program. In Chapter 4 we shall use context-free grammars to specify the grammatical structure of programming languages and discuss algorithms for constructing efficient syntax analyzers automatically from certain classes of grammars. In Chapters 2 and 5 we shall see that syntax-directed definitions can help specify the translation of programming-language constructs.
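The syntax tree just described is easy to represent as a small class hierarchy: leaves hold tokens, and interior nodes hold an operator and its operand subtrees. The sketch below is an illustration, not the book's data structures (Chapter 2 and Appendix A define their own node classes); building the tree for (1.1) bottom-up mirrors the fact that * binds more tightly than +.

    // A minimal syntax-tree representation; class names are illustrative only.
    abstract class Node {
        abstract String postfix();          // print the subtree in postfix order
    }

    class Leaf extends Node {               // a leaf holds a token, e.g. <id,3> or <60>
        final String token;
        Leaf(String token) { this.token = token; }
        String postfix() { return token; }
    }

    class Op extends Node {                 // an interior node holds an operator and its operands
        final char operator;
        final Node left, right;
        Op(char operator, Node left, Node right) {
            this.operator = operator; this.left = left; this.right = right;
        }
        String postfix() { return left.postfix() + " " + right.postfix() + " " + operator; }
    }

    class BuildTree {
        public static void main(String[] args) {
            // position = initial + rate * 60, with * applied before +
            Node tree =
                new Op('=', new Leaf("<id,1>"),                      // position
                    new Op('+', new Leaf("<id,2>"),                  // initial
                        new Op('*', new Leaf("<id,3>"),              // rate
                                    new Leaf("<60>"))));
            System.out.println(tree.postfix());
            // prints <id,1> <id,2> <id,3> <60> * + =
        }
    }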

1.2.3 Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.

An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used to index an array.

The language specification may permit some type conversions called coercions. For example, a binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert or coerce the integer into a floating-point number.


Such a coercion appears in Fig. 1.7. Suppose that position, initial, and rate have been declared to be floating-point numbers, and that the lexeme 60 by itself forms an integer. The type checker in the semantic analyzer in Fig. 1.7 discovers that the operator * is applied to a floating-point number rate and an integer 60. In this case, the integer may be converted into a floating-point number. In Fig. 1.7, notice that the output of the semantic analyzer has an extra node for the operator inttofloat, which explicitly converts its integer argument into a floating-point number. Type checking and semantic analysis are discussed in Chapter 6.
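The coercion rule at work here - an integer operand of a mixed arithmetic operation is converted to floating point - can be phrased as a tiny type-checking helper. This is a sketch of the general idea only, not the book's type checker (Chapter 6 treats type checking in full); it distinguishes just the two types involved in the example.

    // A minimal sketch of the coercion rule for binary arithmetic; illustrative only.
    enum Type { INT, FLOAT }

    class Coercion {
        // The result type of applying an arithmetic operator to operands of these types.
        static Type resultType(Type left, Type right) {
            return (left == Type.FLOAT || right == Type.FLOAT) ? Type.FLOAT : Type.INT;
        }

        // True if an operand of this type must be wrapped in an inttofloat node.
        static boolean needsIntToFloat(Type operand, Type result) {
            return operand == Type.INT && result == Type.FLOAT;
        }

        public static void main(String[] args) {
            // rate * 60: rate is FLOAT and 60 is INT, so the result is FLOAT
            Type result = resultType(Type.FLOAT, Type.INT);
            System.out.println(result);                             // FLOAT
            System.out.println(needsIntToFloat(Type.INT, result));  // true: insert inttofloat(60)
        }
    }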

1 .2 .4

Intermediate Code Generation

In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which can have a variety of forms. Syntax trees are a form of intermediate representation; they are commonly used during syntax and semantic analysis. After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce and it should be easy to translate into the target machine. In Chapter 6, we consider an intermediate form called three-address code, which consists of a sequence of assembly-like instructions with three operands per instruction. Each operand can act like a register. The output of the intermediate code generator in Fig. 1.7 consists of the three-address code sequence

    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3                                          (1.3)

There are several points worth noting about three-address instructions. First, each three-address assignment instruction has at most one operator on the right side. Thus, these instructions fix the order in which operations are to be done; the multiplication precedes the addition in the source program (1.1). Second, the compiler must generate a temporary name to hold the value computed by a three-address instruction. Third, some "three-address instructions" like the first and last in the sequence (1.3), above, have fewer than three operands. In Chapter 6, we cover the principal intermediate representations used in compilers. Chapter 5 introduces techniques for syntax-directed translation that are applied in Chapter 6 to type checking and intermediate-code generation for typical programming language constructs such as expressions, flow-of-control constructs, and procedure calls.
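How such a sequence is produced can be sketched as a bottom-up walk of the syntax tree that allocates a fresh temporary for each interior node. The classes below are illustrative stand-ins for the tree of Fig. 1.7 after the coercion has been applied:

    // A minimal sketch of emitting three-address code from an expression tree.
    abstract class Expr { abstract String gen(); }

    class Leaf extends Expr {                 // an identifier or a constant
        String name;
        Leaf(String name) { this.name = name; }
        String gen() { return name; }
    }

    class OpNode extends Expr {               // an interior operator node
        static int temps = 0;
        String op; Expr left, right;
        OpNode(String op, Expr left, Expr right) {
            this.op = op; this.left = left; this.right = right;
        }
        String gen() {                        // emit code for the children, then this node
            String l = left.gen(), r = right.gen();
            String t = "t" + (++temps);       // fresh temporary for the result
            System.out.println(t + " = " + l + " " + op + " " + r);
            return t;
        }
    }

    class GenDemo {
        public static void main(String[] args) {
            // id1 = id2 + id3 * 60.0  (coercion already folded into the constant)
            Expr rhs = new OpNode("+", new Leaf("id2"),
                           new OpNode("*", new Leaf("id3"), new Leaf("60.0")));
            System.out.println("id1 = " + rhs.gen());
        }
    }

Running this prints the two operator instructions followed by the final copy into id1, in the same order as the sequence above.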

1.2.5 Code Optimization

The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power. For example, a straightforward algorithm generates the intermediate code (1.3), using an instruction for each operator in the tree representation that comes from the semantic analyzer. A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate good target code. The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is used only once to transmit its value to id1, so the optimizer can transform (1.3) into the shorter sequence

    t1 = id3 * 60.0
    id1 = id2 + t1                                    (1.4)

There is a great variation in the amount of code optimization different com­ pilers perform. In those that do the most, the so-called "optimizing compilers," a significant amount of time is spent on this phase. There are simple opti­ mizations that significantly improve the running time of the target program without slowing down compilation too much. The chapters from 8 on discuss machine-independent and machine-dependent optimizations in detail.
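As a tiny illustration of the first improvement above, the following sketch folds inttofloat of an integer constant in a list of three-address instructions represented as strings. This string-based form is an assumption made only for brevity; a real optimizer works on a structured intermediate representation and would also perform the copy propagation that removes t3:

    // A minimal, illustrative constant-folding pass: replace
    // "t = inttofloat(<digits>)" by "t = <digits>.0".
    import java.util.ArrayList;
    import java.util.List;

    class FoldInttofloat {
        static List<String> fold(List<String> code) {
            List<String> out = new ArrayList<>();
            for (String instr : code) {
                int p = instr.indexOf("= inttofloat(");
                if (p >= 0 && instr.endsWith(")")) {
                    String dest = instr.substring(0, p).trim();
                    String digits = instr.substring(p + "= inttofloat(".length(),
                                                    instr.length() - 1);
                    // emit a constant copy; a later pass can propagate the
                    // constant 60.0 into its use and remove t1 entirely
                    out.add(dest + " = " + digits + ".0");
                } else {
                    out.add(instr);
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> code = List.of("t1 = inttofloat(60)",
                                        "t2 = id3 * t1",
                                        "t3 = id2 + t2",
                                        "id1 = t3");
            fold(code).forEach(System.out::println);
        }
    }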

1.2.6 Code Generation

The code generator takes as input an intermediate representation of the source program and maps it into the target language. If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task. A crucial aspect of code generation is the judicious assignment of registers to hold variables. For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into the machine code

    LDF  R2, id3
    MULF R2, R2, #60.0
    LDF  R1, id2
    ADDF R1, R1, R2
    STF  id1, R1                                      (1.5)

The first operand of each instruction specifies a destination. The F in each instruction tells us that it deals with floating-point numbers. The code in


(1.5) loads the contents of address id3 into register R2, then multiplies it with floating-point constant 60.0. The # signifies that 60.0 is to be treated as an immediate constant. The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is stored into the address of id1, so the code correctly implements the assignment statement (1.1). Chapter 8 covers code generation. This discussion of code generation has ignored the important issue of storage allocation for the identifiers in the source program. As we shall see in Chapter 7, the organization of storage at run-time depends on the language being compiled. Storage-allocation decisions are made either during intermediate code generation or during code generation.

1.2.7 Symbol-Table Management

An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name. These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value may be used ) , and in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference ) , and the type returned. The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name. The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly. Symbol tables are discussed in Chapter 2.
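At its simplest, such a data structure is a map from names to attribute records. The field names below are illustrative assumptions; Chapter 2 develops the version actually used in the book's translator:

    // A minimal sketch of a symbol table: a hash map from names to a record
    // of attributes such as type and relative address.
    import java.util.HashMap;
    import java.util.Map;

    class SymbolTable {
        static class Symbol {
            String type;      // e.g., "float"
            int offset;       // relative address of the storage for the name
            Symbol(String type, int offset) { this.type = type; this.offset = offset; }
        }

        private final Map<String, Symbol> table = new HashMap<>();

        void put(String name, Symbol s) { table.put(name, s); }
        Symbol get(String name) { return table.get(name); }

        public static void main(String[] args) {
            SymbolTable st = new SymbolTable();
            st.put("position", new Symbol("float", 0));
            st.put("initial",  new Symbol("float", 8));
            st.put("rate",     new Symbol("float", 16));
            System.out.println("rate has type " + st.get("rate").type);
        }
    }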

1.2.8 The Grouping of Phases into Passes

The discussion of phases deals with the logical organization of a compiler. In an implementation, activities from several phases may be grouped together into a pass that reads an input file and writes an output file. For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped together into one pass. Code optimization might be an optional pass. Then there could be a back-end pass consisting of code generation for a particular target machine. Some compiler collections have been created around carefully designed in­ termediate representations that allow the front end for a particular language to interface with the back end for a certain target machine. With these collections, we can produce compilers for different source languages for one target machine by combining different front ends with the back end for that target machine. Similarly, we can produce compilers for different target machines, by combining a front end with back ends for different target machines.


1.2.9 Compiler-Construction Tools

The compiler writer, like any software developer, can profitably use modern software development environments containing tools such as language editors, debuggers, version managers, profilers, test harnesses, and so on. In addition to these general software-development tools, other more specialized tools have been created to help implement various phases of a compiler.

These tools use specialized languages for specifying and implementing specific components, and many use quite sophisticated algorithms. The most successful tools are those that hide the details of the generation algorithm and produce components that can be easily integrated into the remainder of the compiler. Some commonly used compiler-construction tools include

1. Parser generators that automatically produce syntax analyzers from a grammatical description of a programming language.

2. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of a language.

3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and generating intermediate code.

4. Code-generator generators that produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine.

5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted from one part of a program to each other part. Data-flow analysis is a key part of code optimization.

6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler.

We shall describe many of these tools throughout this book.

1.3 The Evolution of Programming Languages

The first electronic computers appeared in the 1940's and were programmed in machine language by sequences of 0's and 1's that explicitly told the computer what operations to execute and in what order. The operations themselves were very low level: move data from one location to another, add the contents of two registers, compare two values, and so on. Needless to say, this kind of programming was slow, tedious, and error prone. And once written, the programs were hard to understand and modify.

1.3.1 The Move to Higher-level Languages

The first step towards more people-friendly programming languages was the development of mnemonic assembly languages in the early 1950's. Initially, the instructions in an assembly language were just mnemonic representations of machine instructions. Later, macro instructions were added to assembly languages so that a programmer could define parameterized shorthands for frequently used sequences of machine instructions. A major step towards higher-level languages was made in the latter half of the 1950's with the development of Fortran for scientific computation, Cobol for business data processing, and Lisp for symbolic computation. The philos­ ophy behind these languages was to create higher-level notations with which programmers could more easily write numerical computations, business appli­ cations, and symbolic programs. These languages were so successful that they are still in use today. In the following decades, many more languages were created with innovative features to help make programming easier, more natural, and more robust. Later in this chapter, we shall discuss some key features that are common to many modern programming languages. Today, there are thousands of programming languages. They can be classi­ fied in a variety of ways. One classification is by generation. First-generation languages are the machine languages, second-generation the assembly languages, and third-generation the higher-level languages like Fortran, Cobol, Lisp, C, C++, C#, and Java. Fourth-generation languages are languages designed for specific applications like NOMAD for report generation, SQL for database queries, and Postscript for text formatting. The term fifth-generation language has been applied to logic- and constraint-based languages like Prolog and OPS5. Another classification of languages uses the term imperative for languages in which a program specifies how a computation is to be done and declarative for languages in which a program specifies what computation is to be done. Languages such as C, C++, C#, and Java are imperative languages. In imper­ ative languages there is a notion of program state and statements that change the state. Functional languages such as ML and Haskell and constraint logic languages such as Prolog are often considered to be declarative languages. The term von Neumann language is applied to programming languages whose computational model is based on the von Neumann computer archi­ tecture. Many of today's languages, such as Fortran and C are von Neumann languages. An object-oriented language is one that supports object-oriented program­ ming, a programming style in which a program consists of a collection of objects that interact with one another. Simula 67 and Smalltalk are the earliest major object-oriented languages. Languages such as C++, C#, Java, and Ruby are more recent object-oriented languages. Scripting languages are interpreted languages with high-level operators de­ signed for "gluing together" computations. These computations were originally


called "scripts." Awk, JavaScript, Perl, PHP, Python, Ruby, and Tcl are popular examples of scripting languages. Programs written in scripting languages are often much shorter than equivalent programs written in languages like C.

1.3.2 Impacts on Compilers

Since the design of programming languages and compilers are intimately related, the advances in programming languages placed new demands on compiler writ­ ers. They had to devise algorithms and representations to translate and support the new language features. Since the 1940's, computer architecture has evolved as well. Not only did the compiler writers have to track new language fea­ tures, they also had to devise translation algorithms that would take maximal advantage of the new hardware capabilities. Compilers can help promote the use of high-level languages by minimizing the execution overhead of the programs written in these languages. Compilers are also critical in making high-performance computer architectures effective on users' applications. In fact, the performance of a computer system is so dependent on compiler technology that compilers are used as a tool in evaluating architectural concepts before a computer is built. Compiler writing is challenging. A compiler by itself is a large program. Moreover, many modern language-processing systems handle several source lan­ guages and target machines within the same framework; that is, they serve as collections of compilers, possibly consisting of millions of lines of code. Con­ sequently, good software-engineering techniques are essential for creating and evolving modern language processors. A compiler must translate correctly the potentially infinite set of programs that could be written in the source language. The problem of generating the optimal target code from a source program is undecidable in general; thus, compiler writers must evaluate tradeoffs about what problems to tackle and what heuristics to use to approach the problem of generating efficient code. A study of compilers is also a study of how theory meets practice, as we shall see in Section 1 .4. The purpose of this text is to teach the methodology and fundamental ideas used in compiler design. It is not the intention of this text to teach all the algorithms and techniques that could be used for building a state-of-the-art language-processing system. However, readers of this text will acquire the basic knowledge and understanding to learn how to build a compiler relatively easily.

1.3.3 Exercises for Section 1.3

Exercise 1.3.1: Indicate which of the following terms:

    a) imperative           b) declarative          c) von Neumann
    d) object-oriented      e) functional           f) third-generation
    g) fourth-generation    h) scripting

apply to which of the following languages:

    1) C        2) C++      3) Cobol    4) Fortran   5) Java
    6) Lisp     7) ML       8) Perl     9) Python    10) VB.

1.4 The Science of Building a Compiler

Compiler design is full of beautiful examples where complicated real-world prob­ lems are solved by abstracting the essence of the problem mathematically. These serve as excellent illustrations of how abstractions can be used to solve prob­ lems: take a problem, formulate a mathematical abstraction that captures the key characteristics, and solve it using mathematical techniques. The problem formulation must be grounded in a solid understanding of the characteristics of computer programs, and the solution must be validated and refined empirically. A compiler must accept all source programs that conform to the specification of the language; the set of source programs is infinite and any program can be very large, consisting of possibly millions of lines of code. Any transformation performed by the compiler while translating a source program must preserve the meaning of the program being compiled. Compiler writers thus have influence over not just the compilers they create, but all the programs that their com­ pilers compile. This leverage makes writing compilers particularly rewarding; however, it also makes compiler development challenging.

1.4.1 Modeling in Compiler Design and Implementation

The study of compilers is mainly a study of how we design the right mathematical models and choose the right algorithms, while balancing the need for generality and power against simplicity and efficiency. Some of the most fundamental models are finite-state machines and regular expressions, which we shall meet in Chapter 3. These models are useful for describing the lexical units of programs (keywords, identifiers, and such) and for describing the algorithms used by the compiler to recognize those units. Also among the most fundamental models are context-free grammars, used to describe the syntactic structure of programming languages such as the nesting of parentheses or control constructs. We shall study grammars in Chapter 4. Similarly, trees are an important model for representing the structure of programs and their translation into object code, as we shall see in Chapter 5.
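To make the first of these models concrete, here is a minimal sketch of a finite-state machine that recognizes identifiers of the form letter (letter | digit)*; the class name and state encoding are illustrative, and Chapter 3 develops the real machinery:

    // A two-state machine: state 0 before any input, state 1 after reading a
    // letter followed by any number of letters or digits.
    class IdentifierDFA {
        static boolean accepts(String s) {
            int state = 0;
            for (char c : s.toCharArray()) {
                if (state == 0 && Character.isLetter(c)) state = 1;
                else if (state == 1 && Character.isLetterOrDigit(c)) state = 1;
                else return false;                  // no transition: reject
            }
            return state == 1;                      // must have read at least one letter
        }

        public static void main(String[] args) {
            System.out.println(accepts("rate"));    // true
            System.out.println(accepts("60"));      // false: starts with a digit
        }
    }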

1.4.2 The Science of Code Optimization

The term "optimization" in compiler design refers to the attempts that a com­ piler makes to produce code that is more efficient than the obvious code. "Op­ timization" is thus a misnomer, since there is no way that the code produced by a compiler can be guaranteed to be as fast or faster than any other code that performs the same task.


In modern times, the optimization of code that a compiler performs has become both more important and more complex. It is more complex because processor architectures have become more complex, yielding more opportunities to improve the way code executes. It is more important because massively parallel computers require substantial optimization, or their performance suffers by orders of magnitude. With the likely prevalence of multicore machines (computers with chips that have large numbers of processors on them), all compilers will have to face the problem of taking advantage of multiprocessor machines.

It is hard, if not impossible, to build a robust compiler out of "hacks." Thus, an extensive and useful theory has been built up around the problem of optimizing code. The use of a rigorous mathematical foundation allows us to show that an optimization is correct and that it produces the desirable effect for all possible inputs. We shall see, starting in Chapter 9, how models such as graphs, matrices, and linear programs are necessary if the compiler is to produce well-optimized code.

On the other hand, pure theory alone is insufficient. Like many real-world problems, there are no perfect answers. In fact, most of the questions that we ask in compiler optimization are undecidable. One of the most important skills in compiler design is the ability to formulate the right problem to solve. We need a good understanding of the behavior of programs to start with and thorough experimentation and evaluation to validate our intuitions. Compiler optimizations must meet the following design objectives:

• The optimization must be correct, that is, preserve the meaning of the compiled program,

• The optimization must improve the performance of many programs,

• The compilation time must be kept reasonable, and

• The engineering effort required must be manageable.

It is impossible to overemphasize the importance of correctness. It is trivial to write a compiler that generates fast code if the generated code need not be correct! Optimizing compilers are so difficult to get right that we dare say that no optimizing compiler is completely error-free! Thus, the most important objective in writing a compiler is that it is correct.

The second goal is that the compiler must be effective in improving the performance of many input programs. Normally, performance means the speed of the program execution. Especially in embedded applications, we may also wish to minimize the size of the generated code. And in the case of mobile devices, it is also desirable that the code minimizes power consumption. Typically, the same optimizations that speed up execution time also conserve power. Besides performance, usability aspects such as error reporting and debugging are also important.

Third, we need to keep the compilation time short to support a rapid development and debugging cycle. This requirement has become easier to meet as machines get faster. Often, a program is first developed and debugged without program optimizations. Not only is the compilation time reduced, but more importantly, unoptimized programs are easier to debug, because the optimizations introduced by a compiler often obscure the relationship between the source code and the object code. Turning on optimizations in the compiler sometimes exposes new problems in the source program; thus testing must again be performed on the optimized code. The need for additional testing sometimes deters the use of optimizations in applications, especially if their performance is not critical.

Finally, a compiler is a complex system; we must keep the system simple to assure that the engineering and maintenance costs of the compiler are manageable. There is an infinite number of program optimizations that we could implement, and it takes a nontrivial amount of effort to create a correct and effective optimization. We must prioritize the optimizations, implementing only those that lead to the greatest benefits on source programs encountered in practice.

Thus, in studying compilers, we learn not only how to build a compiler, but also the general methodology of solving complex and open-ended problems. The approach used in compiler development involves both theory and experimentation. We normally start by formulating the problem based on our intuitions on what the important issues are.

1.5 Applications of Compiler Technology

Compiler design is not only about compilers, and many people use the technol­ ogy learned by studying compilers in school, yet have never, strictly speaking, written (even part of) a compiler for a major programming language. Compiler technology has other important uses as well. Additionally, compiler design im­ pacts several other areas of computer science. In this section, we review the most important interactions and applications of the technology.

1.5.1 Implementation of High-Level Programming Languages

A high-level programming language defines a programming abstraction: the programmer expresses an algorithm using the language, and the compiler must translate that program to the target language. Generally, higher-level programming languages are easier to program in, but are less efficient, that is, the target programs run more slowly. Programmers using a low-level language have more control over a computation and can, in principle, produce more efficient code. Unfortunately, lower-level programs are harder to write and, worse still, less portable, more prone to errors, and harder to maintain. Optimizing compilers include techniques to improve the performance of generated code, thus offsetting the inefficiency introduced by high-level abstractions.


Example 1 . 2 : The register keyword in the C programming language is an early example of the interaction between compiler technology and language evo­ lution. When the C language was created in the mid 1970s, it was considered necessary to let a programmer control which program variables reside in regis­ ters. This control became unnecessary as effective register-allocation techniques were developed, and most modern programs no longer use this language feature. In fact, programs that use the register keyword may lose efficiency, because programmers often are not the best judge of very low-level matters like register allocation. The optimal choice of register allocation depends greatly on the specifics of a machine architecture. Hardwiring low-level resource-management decisions like register allocation may in fact hurt performance, especially if the program is run on machines other than the one for which it was written. 0

The many shifts in the popular choice of programming languages have been in the direction of increased levels of abstraction. C was the predominant systems programming language of the 80's; many of the new projects started in the 90's chose C++; Java, introduced in 1995, gained popularity quickly in the late 90's. The new programming-language features introduced in each round spurred new research in compiler optimization. In the following, we give an overview on the main language features that have stimulated significant advances in compiler technology.

Practically all common programming languages, including C, Fortran and Cobol, support user-defined aggregate data types, such as arrays and structures, and high-level control flow, such as loops and procedure invocations. If we just take each high-level construct or data-access operation and translate it directly to machine code, the result would be very inefficient. A body of compiler optimizations, known as data-flow optimizations, has been developed to analyze the flow of data through the program and remove redundancies across these constructs. They are effective in generating code that resembles code written by a skilled programmer at a lower level.

Object orientation was first introduced in Simula in 1967, and has been incorporated in languages such as Smalltalk, C++, C#, and Java. The key ideas behind object orientation are

1. Data abstraction and

2. Inheritance of properties,

both of which have been found to make programs more modular and easier to maintain. Object-oriented programs are different from those written in many other languages, in that they consist of many more, but smaller, procedures (called methods in object-oriented terms). Thus, compiler optimizations must be able to perform well across the procedural boundaries of the source program. Procedure inlining, which is the replacement of a procedure call by the body of the procedure, is particularly useful here. Optimizations to speed up virtual method dispatches have also been developed.


Java has many features that make programming easier, many of which have been introduced previously in other languages. The Java language is type-safe; that is, an object cannot be used as an object of an unrelated type. All array accesses are checked to ensure that they lie within the bounds of the array. Java has no pointers and does not allow pointer arithmetic. It has a built-in garbage-collection facility that automatically frees the memory of variables that are no longer in use. While all these features make programming easier, they incur a run-time overhead. Compiler optimizations have been developed to reduce the overhead, for example, by eliminating unnecessary range checks and by allocating objects that are not accessible beyond a procedure on the stack instead of the heap. Effective algorithms also have been developed to minimize the overhead of garbage collection. In addition, Java is designed to support portable and mobile code. Programs are distributed as Java bytecode, which must either be interpreted or compiled into native code dynamically, that is, at run time. Dynamic compilation has also been studied in other contexts, where information is extracted dynamically at run time and used to produce better-optimized code. In dynamic optimization, it is important to minimize the compilation time as it is part of the execution overhead. A common technique used is to only compile and optimize those parts of the program that will be frequently executed.

1.5.2 Optimizations for Computer Architectures

The rapid evolution of computer architectures has also led to an insatiable demand for new compiler technology. Almost all high-performance systems take advantage of the same two basic techniques: parallelism and memory hierarchies. Parallelism can be found at several levels: at the instruction level, where multiple operations are executed simultaneously and at the processor level, where different threads of the same application are run on different processors. Memory hierarchies are a response to the basic limitation that we can build very fast storage or very large storage, but not storage that is both fast and large.

Parallelism

All modern microprocessors exploit instruction-level parallelism. However, this parallelism can be hidden from the programmer. Programs are written as if all instructions were executed in sequence; the hardware dynamically checks for dependencies in the sequential instruction stream and issues them in parallel when possible. In some cases, the machine includes a hardware scheduler that can change the instruction ordering to increase the parallelism in the program. Whether the hardware reorders the instructions or not, compilers can rearrange the instructions to make instruction-level parallelism more effective. Instruction-level parallelism can also appear explicitly in the instruction set. VLIW (Very Long Instruction Word) machines have instructions that can issue


multiple operations in parallel. The Intel IA64 is a well-known example of such an architecture. All high-performance, general-purpose microprocessors also include instructions that can operate on a vector of data at the same time. Compiler techniques have been developed to generate code automatically for such machines from sequential programs.

Multiprocessors have also become prevalent; even personal computers often have multiple processors. Programmers can write multithreaded code for multiprocessors, or parallel code can be automatically generated by a compiler from conventional sequential programs. Such a compiler hides from the programmers the details of finding parallelism in a program, distributing the computation across the machine, and minimizing synchronization and communication among the processors. Many scientific-computing and engineering applications are computation-intensive and can benefit greatly from parallel processing. Parallelization techniques have been developed to translate automatically sequential scientific programs into multiprocessor code.

Memory Hierarchies

A memory hierarchy consists of several levels of storage with different speeds and sizes, with the level closest to the processor being the fastest but smallest. The average memory-access time of a program is reduced if most of its accesses are satisfied by the faster levels of the hierarchy. Both parallelism and the existence of a memory hierarchy improve the potential performance of a machine, but they must be harnessed effectively by the compiler to deliver real performance on an application.

Memory hierarchies are found in all machines. A processor usually has a small number of registers consisting of hundreds of bytes, several levels of caches containing kilobytes to megabytes, physical memory containing megabytes to gigabytes, and finally secondary storage that contains gigabytes and beyond. Correspondingly, the speed of accesses between adjacent levels of the hierarchy can differ by two or three orders of magnitude. The performance of a system is often limited not by the speed of the processor but by the performance of the memory subsystem. While compilers traditionally focus on optimizing the processor execution, more emphasis is now placed on making the memory hierarchy more effective.

Using registers effectively is probably the single most important problem in optimizing a program. Unlike registers that have to be managed explicitly in software, caches and physical memories are hidden from the instruction set and are managed by hardware. It has been found that cache-management policies implemented by hardware are not effective in some cases, especially in scientific code that has large data structures (arrays, typically). It is possible to improve the effectiveness of the memory hierarchy by changing the layout of the data, or changing the order of instructions accessing the data. We can also change the layout of code to improve the effectiveness of instruction caches.

1.5.3 Design of New Computer Architectures

In the early days of computer architecture design, compilers were developed after the machines were built. That has changed. Since programming in high-level languages is the norm, the performance of a computer system is determined not by its raw speed but also by how well compilers can exploit its features. Thus, in modern computer architecture development, compilers are developed in the processor-design stage, and compiled code, running on simulators, is used to evaluate the proposed architectural features.

RISC

One of the best known examples of how compilers influenced the design of computer architecture was the invention of the RISC (Reduced Instruction-Set Computer) architecture. Prior to this invention, the trend was to develop progressively complex instruction sets intended to make assembly programming easier; these architectures were known as CISC (Complex Instruction-Set Computer). For example, CISC instruction sets include complex memory-addressing modes to support data-structure accesses and procedure-invocation instructions that save registers and pass parameters on the stack.

Compiler optimizations often can reduce these instructions to a small number of simpler operations by eliminating the redundancies across complex instructions. Thus, it is desirable to build simple instruction sets; compilers can use them effectively and the hardware is much easier to optimize.

Most general-purpose processor architectures, including PowerPC, SPARC, MIPS, Alpha, and PA-RISC, are based on the RISC concept. Although the x86 architecture-the most popular microprocessor-has a CISC instruction set, many of the ideas developed for RISC machines are used in the implementation of the processor itself. Moreover, the most effective way to use a high-performance x86 machine is to use just its simple instructions.

Specialized Architectures

Over the last three decades, many architectural concepts have been proposed. They include data flow machines, vector machines, VLIW ( Very Long Instruc­ tion Word ) machines, SIMD ( Single Instruction, Multiple Data) arrays of pro­ cessors, systolic arrays, multiprocessors with shared memory, and multiproces­ sors with distributed memory. The development of each of these architectural concepts was accompanied by the research and development of corresponding compiler technology. Some of these ideas have made their way into the designs of embedded machines. Since entire systems can fit on a single chip, processors need no longer be prepackaged commodity units, but can be tailored to achieve better cost-effectiveness for a particular application. Thus, in contrast to general­ purpose processors, where economies of scale have led computer architectures


to converge, application-specific processors exhibit a diversity of computer ar­ chitectures. Compiler technology is needed not only to support programming for these architectures, but also to evaluate proposed architectural designs.

1.5.4 Program Translations

While we normally think of compiling as a translation from a high-level language to the machine level, the same technology can be applied to translate between different kinds of languages. The following are some of the important applications of program-translation techniques.

Binary Translation

Compiler technology can be used to translate the binary code for one machine to that of another, allowing a machine to run programs originally compiled for another instruction set. Binary translation technology has been used by various computer companies to increase the availability of software for their machines.

In particular, because of the domination of the x86 personal-computer market, most software titles are available as x86 code. Binary translators have been developed to convert x86 code into both Alpha and Sparc code. Binary translation was also used by Transmeta Inc. in their implementation of the x86 instruction set. Instead of executing the complex x86 instruction set directly in hardware, the Transmeta Crusoe processor is a VLIW processor that relies on binary translation to convert x86 code into native VLIW code.

Binary translation can also be used to provide backward compatibility. When the processor in the Apple Macintosh was changed from the Motorola MC 68040 to the PowerPC in 1994, binary translation was used to allow PowerPC processors to run legacy MC 68040 code.

Hardware Synthesis

Not only is most software written in high-level languages; even hardware designs are mostly described in high-level hardware description languages like Verilog and VHDL (Very high-speed integrated circuit Hardware Description Language). Hardware designs are typically described at the register transfer level (RTL), where variables represent registers and expressions represent combinational logic. Hardware-synthesis tools translate RTL descriptions automatically into gates, which are then mapped to transistors and eventually to a physical layout. Unlike compilers for programming languages, these tools often take hours optimizing the circuit. Techniques to translate designs at higher levels, such as the behavior or functional level, also exist.

Database Query Interpreters

Besides specifying software and hardware, languages are useful in many other applications. For example, query languages, especially SQL (Structured Query


Language), are used to search databases. Database queries consist of predicates containing relational and boolean operators. They can be interpreted or compiled into commands to search a database for records satisfying that predicate.

Compiled Simulation

Simulation is a general technique used in many scientific and engineering disci­ plines to understand a phenomenon or to validate a design. Inputs to a simula­ tor usually include the description of the design and specific input parameters for that particular simulation run. Simulations can be very expensive. We typi­ cally need to simulate many possible design alternatives on many different input sets, and each experiment may take days to complete on a high-performance machine. Instead of writing a simulator that interprets the design, it is faster to compile the design to produce machine code that simulates that particular design natively. Compiled simulation can run orders of magnitude faster than an interpreter-based approach. Compiled simulation is used in many state-of­ the-art tools that simulate designs written in Verilog or VHDL.

1.5.5 Software Productivity Tools

Programs are arguably the most complicated engineering artifacts ever produced; they consist of many, many details, every one of which must be correct before the program will work completely. As a result, errors are rampant in programs; errors may crash a system, produce wrong results, render a system vulnerable to security attacks, or even lead to catastrophic failures in critical systems. Testing is the primary technique for locating errors in programs.

An interesting and promising complementary approach is to use data-flow analysis to locate errors statically (that is, before the program is run). Data-flow analysis can find errors along all the possible execution paths, and not just those exercised by the input data sets, as in the case of program testing. Many of the data-flow-analysis techniques, originally developed for compiler optimizations, can be used to create tools that assist programmers in their software engineering tasks.

The problem of finding all program errors is undecidable. A data-flow analysis may be designed to warn the programmers of all possible statements violating a particular category of errors. But if most of these warnings are false alarms, users will not use the tool. Thus, practical error detectors are often neither sound nor complete. That is, they may not find all the errors in the program, and not all errors reported are guaranteed to be real errors. Nonetheless, various static analyses have been developed and shown to be effective in finding errors, such as dereferencing null or freed pointers, in real programs. The fact that error detectors may be unsound makes them significantly different from compiler optimizations. Optimizers must be conservative and cannot alter the semantics of the program under any circumstances.


In the balance of this section, we shall mention several ways in which program analysis, building upon techniques originally developed to optimize code in compilers, has improved software productivity. Of special importance are techniques that detect statically when a program might have a security vulnerability.

Type Checking

Type checking is an effective and well-established technique to catch inconsistencies in programs. It can be used to catch errors, for example, where an operation is applied to the wrong type of object, or if parameters passed to a procedure do not match the signature of the procedure. Program analysis can go beyond finding type errors by analyzing the flow of data through a program. For example, if a pointer is assigned null and then immediately dereferenced, the program is clearly in error.

The same technology can be used to catch a variety of security holes, in which an attacker supplies a string or other data that is used carelessly by the program. A user-supplied string can be labeled with a type "dangerous." If this string is not checked for proper format, then it remains "dangerous," and if a string of this type is able to influence the control-flow of the code at some point in the program, then there is a potential security flaw.
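The labeling idea can be pictured with a value that carries a "dangerous" bit. The analysis described here propagates such labels statically, at compile time; the following run-time simulation (all names, such as Tainted, sanitize, and runQuery, are hypothetical) merely illustrates the propagation rule:

    // Taint propagates through operations on a value; checking the format
    // clears it; a sensitive sink inspects it.
    class Tainted {
        final String value;
        final boolean dangerous;
        Tainted(String value, boolean dangerous) {
            this.value = value; this.dangerous = dangerous;
        }
        Tainted concat(Tainted other) {        // taint propagates through concatenation
            return new Tainted(value + other.value, dangerous || other.dangerous);
        }
        Tainted sanitize() {                   // checking the format clears the label
            return new Tainted(value, false);
        }
    }

    class TaintDemo {
        static void runQuery(Tainted query) {  // a sensitive sink
            System.out.println(query.dangerous
                ? "potential security flaw: " + query.value
                : "ok: " + query.value);
        }
        public static void main(String[] args) {
            Tainted user   = new Tainted("'; DROP TABLE users; --", true);
            Tainted prefix = new Tainted("SELECT * FROM t WHERE name = ", false);
            runQuery(prefix.concat(user));              // flagged
            runQuery(prefix.concat(user.sanitize()));   // accepted after checking
        }
    }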

Bounds Checking

It is easier to make mistakes when programming in a lower-level language than a higher-level one. For example, many security breaches in systems are caused by buffer overflows in programs written in C. Because C does not have array-bounds checks, it is up to the user to ensure that the arrays are not accessed out of bounds. Failing to check that the data supplied by the user can overflow a buffer, the program may be tricked into storing user data outside of the buffer. An attacker can manipulate the input data that causes the program to misbehave and compromise the security of the system. Techniques have been developed to find buffer overflows in programs, but with limited success.

Had the program been written in a safe language that includes automatic range checking, this problem would not have occurred. The same data-flow analysis that is used to eliminate redundant range checks can also be used to locate buffer overflows. The major difference, however, is that failing to eliminate a range check would only result in a small run-time cost, while failing to identify a potential buffer overflow may compromise the security of the system. Thus, while it is adequate to use simple techniques to optimize range checks, sophisticated analyses, such as tracking the values of pointers across procedures, are needed to get high-quality results in error detection tools.


Memory-Management Tools

Garbage collection is another excellent example of the tradeoff between efficiency and a combination of ease of programming and software reliability. Automatic memory management obliterates all memory-management errors (e.g., "memory leaks"), which are a major source of problems in C and C++ programs. Various tools have been developed to help programmers find memory management errors. For example, Purify is a widely used tool that dynamically catches memory management errors as they occur. Tools that help identify some of these problems statically have also been developed.

1.6 Programming Language Basics

In this section, we shall cover the most important terminology and distinctions that appear in the study of programming languages. It is not our purpose to cover all concepts or all the popular programming languages. We assume that the reader is familiar with at least one of C, C++, C#, or Java, and may have encountered other languages as well.

1.6.1 The Static/Dynamic Distinction

Among the most important issues that we face when designing a compiler for a language is what decisions can the compiler make about a program. If a language uses a policy that allows the compiler to decide an issue, then we say that the language uses a static policy or that the issue can be decided at compile time. On the other hand, a policy that only allows a decision to be made when we execute the program is said to be a dynamic policy or to require a decision at run time.

One issue on which we shall concentrate is the scope of declarations. The scope of a declaration of x is the region of the program in which uses of x refer to this declaration. A language uses static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the program. Otherwise, the language uses dynamic scope. With dynamic scope, as the program runs, the same use of x could refer to any of several different declarations of x. Most languages, such as C and Java, use static scope. We shall discuss static scoping in Section 1.6.3.

Example 1.3: As another example of the static/dynamic distinction, consider the use of the term "static" as it applies to data in a Java class declaration. In Java, a variable is a name for a location in memory used to hold a data value. Here, "static" refers not to the scope of the variable, but rather to the ability of the compiler to determine the location in memory where the declared variable can be found. A declaration like

    public static int x;

makes x a class variable and says that there is only one copy of x, no matter how many objects of this class are created. Moreover, the compiler can determine a location in memory where this integer x will be held. In contrast, had "static" been omitted from this declaration, then each object of the class would have its own location where x would be held, and the compiler could not determine all these places in advance of running the program. □
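A compilable version of the contrast drawn in Example 1.3 (the class name Counter is only illustrative):

    class Counter {
        public static int n;    // one copy, location decidable at compile time
        public int m;           // one copy per object, created at run time
    }

    class StaticDemo {
        public static void main(String[] args) {
            Counter a = new Counter(), b = new Counter();
            Counter.n = 7;       // shared by every Counter object
            a.m = 1;
            b.m = 2;             // distinct locations for a.m and b.m
            System.out.println(Counter.n + " " + a.m + " " + b.m);   // prints 7 1 2
        }
    }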

1.6.2 Environments and States

Another important distinction we must make when discussing programming languages is whether changes occurring as the program runs affect the values of data elements or affect the interpretation of names for that data. For example, the execution of an assignment such as x = y + 1 changes the value denoted by the name x. More specifically, the assignment changes the value in whatever location is denoted by x.

It may be less clear that the location denoted by x can change at run time. For instance, as we discussed in Example 1.3, if x is not a static (or "class") variable, then every object of the class has its own location for an instance of variable x. In that case, the assignment to x can change any of those "instance" variables, depending on the object to which a method containing that assignment is applied.

              environment                  state
    names ------------------> locations ------------------> values
                             (variables)

         Figure 1.8: Two-stage mapping from names to values

The association of names with locations in memory (the store) and then with values can be described by two mappings that change as the program runs (see Fig. 1.8):

1. The environment is a mapping from names to locations in the store. Since variables refer to locations ("l-values" in the terminology of C), we could alternatively define an environment as a mapping from names to variables.

2. The state is a mapping from locations in store to their values. That is, the state maps l-values to their corresponding r-values, in the terminology of C.

Environments change according to the scope rules of a language.

Example 1.4: Consider the C program fragment in Fig. 1.9. Integer i is declared a global variable, and also declared as a variable local to function f.

    int i;                   /* global i */
    ...
    void f(...) {
        int i;               /* local i */
        ...
        i = 3;               /* use of local i */
        ...
    }
    ...
    x = i + 1;               /* use of global i */

         Figure 1.9: Two declarations of the name i

When f is executing, the environment adjusts so that name i refers to the location reserved for the i that is local to f, and any use of i, such as the assignment i = 3 shown explicitly, refers to that location. Typically, the local i is given a place on the run-time stack.

Whenever a function g other than f is executing, uses of i cannot refer to the i that is local to f. Uses of name i in g must be within the scope of some other declaration of i. An example is the explicitly shown statement x = i + 1, which is inside some procedure whose definition is not shown. The i in i + 1 presumably refers to the global i. As in most languages, declarations in C must precede their use, so a function that comes before the global i cannot refer to it. □

The environment and state mappings in Fig. 1.8 are dynamic, but there are a few exceptions:

1. Static versus dynamic binding of names to locations. Most binding of names to locations is dynamic, and we discuss several approaches to this binding throughout the section. Some declarations, such as the global i in Fig. 1.9, can be given a location in the store once and for all, as the compiler generates object code.²

2. Static versus dynamic binding of locations to values. The binding of locations to values (the second stage in Fig. 1.8) is generally dynamic as well, since we cannot tell the value in a location until we run the program. Declared constants are an exception. For instance, the C definition

    #define ARRAYSIZE 1000

binds the name ARRAYSIZE to the value 1000 statically. We can determine this binding by looking at the statement, and we know that it is impossible for this binding to change when the program executes.

Names, Identifiers, and Variables

Although the terms "name" and "variable" often refer to the same thing, we use them carefully to distinguish between compile-time names and the run-time locations denoted by names. An identifier is a string of characters, typically letters or digits, that refers to (identifies) an entity, such as a data object, a procedure, a class, or a type. All identifiers are names, but not all names are identifiers. Names can also be expressions. For example, the name x.y might denote the field y of a structure denoted by x. Here, x and y are identifiers, while x.y is a name, but not an identifier. Composite names like x.y are called qualified names. A variable refers to a particular location of the store. It is common for the same identifier to be declared more than once; each such declaration introduces a new variable. Even if each identifier is declared just once, an identifier local to a recursive procedure will refer to different locations of the store at different times.

²Technically, the C compiler will assign a location in virtual memory for the global i, leaving it to the loader and the operating system to determine where in the physical memory of the machine i will be located. However, we shall not worry about "relocation" issues such as these, which have no impact on compiling. Instead, we treat the address space that the compiler uses for its output code as if it gave physical memory locations.
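The two-stage mapping of Fig. 1.8 can be modeled directly with two maps. The following sketch is illustrative only; the integer "locations" are arbitrary stand-ins for l-values:

    // environment: name -> location;  state: location -> value
    import java.util.HashMap;
    import java.util.Map;

    class EnvAndState {
        public static void main(String[] args) {
            Map<String, Integer> environment = new HashMap<>();
            Map<Integer, Integer> state      = new HashMap<>();

            environment.put("x", 100);          // x denotes location 100
            state.put(100, 5);                  // that location holds the value 5

            // executing x = x + 1 changes the state, not the environment
            int loc = environment.get("x");
            state.put(loc, state.get(loc) + 1);
            System.out.println(state.get(environment.get("x")));   // prints 6
        }
    }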

1.6.3 Static Scope and Block Structure

Most languages, including C and its family, use static scope. The scope rules for C are based on program structure; the scope of a declaration is determined implicitly by where the declaration appears in the program. Later languages, such as C++, Java, and C#, also provide explicit control over scopes through the use of keywords like public, private, and protected. In this section we consider static-scope rules for a language with blocks, where a block is a grouping of declarations and statements. C uses braces { and } to delimit a block; the alternative use of begin and end for the same purpose dates back to Algol. Example 1 . 5 : To a first approximation, the C static-scope policy is as follows:

1. A C program consists of a sequence of top-level declarations of variables and functions. 2. Functions may have variable declarations within them, where variables include local variables and parameters. The scope of each such declaration is restricted to the function in which it appears.


Procedures, Functions, and Methods To avoid saying "procedures, functions, or methods," each time we want to talk about a subprogram that may be called, we shall usually refer to all of them as "procedures." The exception is that when talking explicitly of programs in languages like C that have only functions, we shall refer to them as "functions." Or, if we are discussing a language like Java that has only methods, we shall use that term instead. A function generally returns a value of some type (the "return type" ) , while a procedure does not return any value. C and similar languages, which have only functions, treat procedures as functions that have a special return type "void," to signify no return value. Object-oriented languages like Java and C++ use the term "methods." These can behave like either functions or procedures, but are associated with a particular class.

3. The scope of a top-level declaration of a name x consists of the entire program that follows, with the exception of those statements that lie within a function that also has a declaration of x . The additional detail regarding the C static-scope policy deals with variable declarations within statements. We examine such declarations next and in Example 1.6. 0 In C , the syntax of blocks is given by

1. One type of statement is a block. Blocks can appear anywhere that other types of statements, such as assignment statements, can appear.

2. A block is a sequence of declarations followed by a sequence of statements, all surrounded by braces.

Note that this syntax allows blocks to be nested inside each other. This nesting property is referred to as block structure. The C family of languages has block structure, except that a function may not be defined inside another function. We say that a declaration D "belongs" to a block B if B is the most closely nested block containing D; that is, D is located within B, but not within any block that is nested within B.

The static-scope rule for variable declarations in a block-structured language is as follows. If declaration D of name x belongs to block B, then the scope of D is all of B, except for any blocks B' nested to any depth within B, in which x is redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x belongs to B'.


An equivalent way to express this rule is to focus on a use of a name x. Let B1, B2, ..., Bk be all the blocks that surround this use of x, with Bk the smallest, nested within Bk-1, which is nested within Bk-2, and so on. Search for the largest i such that there is a declaration of x belonging to Bi. This use of x refers to the declaration in Bi. Alternatively, this use of x is within the scope of the declaration in Bi.
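This lookup rule is exactly what a chained symbol table implements: each block keeps a table of its own declarations plus a pointer to the table of the enclosing block, and a use of a name is resolved in the most closely nested block that declares it. A minimal sketch follows (class and method names are hypothetical; Chapter 2 develops a similar structure), using the blocks of Example 1.6 below:

    import java.util.HashMap;
    import java.util.Map;

    class Block {
        private final Block enclosing;                  // null for the outermost block
        private final Map<String, String> decls = new HashMap<>();

        Block(Block enclosing) { this.enclosing = enclosing; }
        void declare(String name, String info) { decls.put(name, info); }

        String lookup(String name) {                    // search Bk, Bk-1, ..., B1
            for (Block b = this; b != null; b = b.enclosing) {
                if (b.decls.containsKey(name)) return b.decls.get(name);
            }
            return null;                                // undeclared: an error
        }

        public static void main(String[] args) {
            Block b1 = new Block(null);  b1.declare("a", "int a = 1");
            Block b2 = new Block(b1);    b2.declare("b", "int b = 2");
            Block b4 = new Block(b2);    b4.declare("b", "int b = 4");
            // the uses of a and b in block B4 of Fig. 1.10 below:
            System.out.println(b4.lookup("a"));   // int a = 1 (found in B1)
            System.out.println(b4.lookup("b"));   // int b = 4 (found in B4)
        }
    }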

    main() {
        int a = 1;                      // block B1 begins
        int b = 1;
        {
            int b = 2;                  // block B2
            {
                int a = 3;              // block B3
                cout << a << b;
            }
            {
                int b = 4;              // block B4
                cout << a << b;
            }
            cout << a << b;
        }
        cout << a << b;
    }

         Figure 1.10: Blocks in a C++ program

Example 1.6: The C++ program in Fig. 1.10 has four blocks, with several

definitions of variables a and b. As a memory aid, each declaration initializes its variable to the number of the block to which it belongs. For instance, consider th� declaration 1 in block Bl - Its scope is all of Bl , except for those blocks nested ( perhaps deeply ) within Bl that have their own declaration of a . B2 , nested immediately within B1 , does not have a declaration of a , but B3 does. B4 does not have a declaration of a, so block B3 is the only place in the entire program that is outside the scope of the declaration of the name a that belongs to B1 . That is, this scope includes B4 and all of B2 except for the part of B2 that is within B3 . The scopes of all five declarations are summarized in Fig. 1 . 1 1 . From another point of view, let u s consider the output statement i n block B4 and bind the variables a and b used there to the proper declarations. The list of surrounding blocks, in order of increasing size, is B4 , B2 , B1 • Note that B3 does not surround the point in question. B4 has a declaration of b, so it is to this declaration that this use of b refers, and the value of b printed is 4. However, B4 does not have a declaration of a , so we next look at B2 . That block does not have a declaration of a either, so we proceed to B1 - Fortunately,

    DECLARATION      SCOPE
    int a = 1;       B1 - B3
    int b = 1;       B1 - B2
    int b = 2;       B2 - B4
    int a = 3;       B3
    int b = 4;       B4

Figure 1.11: Scopes of declarations in Example 1.6

there is a declaration int a = 1 belonging to that block, so the value of a printed is 1. Had there been no such declaration, the program would have been erroneous. □

1 .6.4

Explicit Access Control

Classes and structures introduce a new scope for their members. If p is an object of a class with a field (member) x, then the use of x in p.x refers to field x in the class definition. In analogy with block structure, the scope of a member declaration x in a class C extends to any subclass C', except if C' has a local declaration of the same name x.

Through the use of keywords like public, private, and protected, object-oriented languages such as C++ or Java provide explicit control over access to member names in a superclass. These keywords support encapsulation by restricting access. Thus, private names are purposely given a scope that includes only the method declarations and definitions associated with that class and any "friend" classes (the C++ term). Protected names are accessible to subclasses. Public names are accessible from outside the class.

In C++, a class definition may be separated from the definitions of some or all of its methods. Therefore, a name x associated with the class C may have a region of the code that is outside its scope, followed by another region (a method definition) that is within its scope. In fact, regions inside and outside the scope may alternate, until all the methods have been defined.
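The following minimal C++ sketch (not from the text; the class and member names are invented for illustration) shows how public, private, and protected restrict where a member name is in scope:

    #include <iostream>

    class Account {
    public:
        void deposit(int amount) { balance += amount; }  // public: usable everywhere
    protected:
        int balance = 0;            // protected: in scope in Account and its subclasses
    private:
        int auditCode = 42;         // private: in scope only inside Account (and friends)
    };

    class SavingsAccount : public Account {
    public:
        int peek() { return balance; }      // OK: protected member is visible in a subclass
        // int bad() { return auditCode; }  // error: private member is not in scope here
    };

    int main() {
        SavingsAccount s;
        s.deposit(10);                  // OK: public member
        // s.balance = 5;               // error: protected member is not accessible outside the class
        std::cout << s.peek() << "\n";  // prints 10
    }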

1.6.5

Dynamic Scope

Technically, any scoping policy is dynamic if it is based on factor(s) that can be known only when the program executes. The term dynamic scope, however, usually refers to the following policy: a use of a name x refers to the declaration of x in the most recently called procedure with such a declaration. Dynamic scoping of this type appears only in special situations. We shall consider two examples of dynamic policies: macro expansion in the C preprocessor and method resolution in object-oriented programming.


Declarations and Definitions

The apparently similar terms "declaration" and "definition" for programming-language concepts are actually quite different. Declarations tell us about the types of things, while definitions tell us about their values. Thus, int i is a declaration of i, while i = 1 is a definition of i.

The difference is more significant when we deal with methods or other procedures. In C++, a method is declared in a class definition, by giving the types of the arguments and result of the method (often called the signature for the method). The method is then defined, i.e., the code for executing the method is given, in another place. Similarly, it is common to define a C function in one file and declare it in other files where the function is used.

Example 1.7: In the C program of Fig. 1.12, identifier a is a macro that stands for expression (x + 1). But what is x? We cannot resolve x statically,

that is, in terms of the program text.

    #define a (x+1)
    int x = 2;
    void b() { int x = 1; printf("%d\n", a); }
    void c() { printf("%d\n", a); }
    void main() { b(); c(); }

Figure 1.12: A macro whose names must be scoped dynamically

In fact, in order to interpret x, we must use the usual dynamic-scope rule. We examine all the function calls that are currently active, and we take the most recently called function that has a declaration of x. It is to this declaration that the use of x refers.

In the example of Fig. 1.12, the function main first calls function b. As b executes, it prints the value of the macro a. Since (x + 1) must be substituted for a, we resolve this use of x to the declaration int x = 1 in function b. The reason is that b has a declaration of x, so the (x + 1) in the printf in b refers to this x. Thus, the value printed is 1.

After b finishes, and c is called, we again need to print the value of macro a. However, the only x accessible to c is the global x. The printf statement in c thus refers to this declaration of x, and value 2 is printed. □

Dynamic scope resolution is also essential for polymorphic procedures, those that have two or more definitions for the same name, depending only on the


Analogy Between Static and Dynamic Scoping

While there could be any number of static or dynamic policies for scoping, there is an interesting relationship between the normal (block-structured) static scoping rule and the normal dynamic policy. In a sense, the dynamic rule is to time as the static rule is to space. While the static rule asks us to find the declaration whose unit (block) most closely surrounds the physical location of the use, the dynamic rule asks us to find the declaration whose unit (procedure invocation) most closely surrounds the time of the use.

types of the arguments. In some languages, such as ML (see Section 7.3.3), it is possible to determine types statically for all uses of names, in which case the compiler can replace each use of a procedure name p by a reference to the code for the proper procedure. However, in other languages, such as Java and C++, there are times when the compiler cannot make that determination.

Example 1.8: A distinguishing feature of object-oriented programming is the ability of each object to invoke the appropriate method in response to a message. In other words, the procedure called when x.m() is executed depends on the class of the object denoted by x at that time. A typical example is as follows:

1. There is a class C with a method named m().

2. D is a subclass of C, and D has its own method named m().

3. There is a use of m of the form x.m(), where x is an object of class C.

Normally, it is impossible to tell at compile time whether x will be of class C or of the subclass D. If the method application occurs several times, it is highly likely that some will be on objects denoted by x that are in class C but not D, while others will be in class D. It is not until run-time that it can be decided which definition of m is the right one. Thus, the code generated by the compiler must determine the class of the object x, and call one or the other method named m. □
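As a small illustration of Example 1.8, the C++ sketch below (the class names follow the example; everything else is invented for this sketch) uses a virtual function so that the choice between C::m() and D::m() is made at run time from the class of the object that x denotes:

    #include <iostream>

    class C {
    public:
        virtual void m() { std::cout << "C::m()\n"; }  // method m() of class C
        virtual ~C() = default;
    };

    class D : public C {
    public:
        void m() override { std::cout << "D::m()\n"; } // subclass D has its own m()
    };

    int main() {
        C c;
        D d;
        C* x = &c;
        x->m();    // x denotes a C object: prints C::m()
        x = &d;
        x->m();    // same call site, but x now denotes a D object: prints D::m()
    }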

1 .6.6

Parameter Passing Mechanisms

All programming languages have a notion of a procedure, but they can differ in how these procedures get their arguments. In this section, we shall consider how the actual parameters (the parameters used in the call of a procedure) are associated with the formal parameters (those used in the procedure definition). Which mechanism is used determines how the calling-sequence code treats parameters. The great majority of languages use either "call-by-value," or "call-by-reference," or both. We shall explain these terms, and another method known as "call-by-name," that is primarily of historical interest.


Call-by-Value

In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable). The value is placed in the location belonging to the corresponding formal parameter of the called procedure. This method is used in C and Java, and is a common option in C++, as well as in most other languages. Call-by-value has the effect that all computation involving the formal parameters done by the called procedure is local to that procedure, and the actual parameters themselves cannot be changed.

Note, however, that in C we can pass a pointer to a variable to allow that variable to be changed by the callee. Likewise, array names passed as parameters in C, C++, or Java give the called procedure what is in effect a pointer or reference to the array itself. Thus, if a is the name of an array of the calling procedure, and it is passed by value to corresponding formal parameter x, then an assignment such as x[i] = 2 really changes the array element a[i]. The reason is that, although x gets a copy of the value of a, that value is really a pointer to the beginning of the area of the store where the array named a is located.

Similarly, in Java, many variables are really references, or pointers, to the things they stand for. This observation applies to arrays, strings, and objects of all classes. Even though Java uses call-by-value exclusively, whenever we pass the name of an object to a called procedure, the value received by that procedure is in effect a pointer to the object. Thus, the called procedure is able to affect the value of the object itself.
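A small C++ sketch (names invented) of the point just made: the array name is copied, but the copy is a pointer to the caller's storage, so the callee can change the caller's elements:

    #include <cstdio>

    void update(int x[], int i) {   // x receives a copy of a pointer to the caller's array
        x[i] = 2;                   // writes through that pointer into the caller's storage
    }

    int main() {
        int a[3] = {7, 8, 9};
        update(a, 0);                // a "decays" to a pointer; only the pointer is copied
        std::printf("%d\n", a[0]);   // prints 2: the caller's array element was changed
    }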

Call-by-Reference

In call-by-reference, the address of the actual parameter is passed to the callee as the value of the corresponding formal parameter. Uses of the formal parameter in the code of the callee are implemented by following this pointer to the location indicated by the caller. Changes to the formal parameter thus appear as changes to the actual parameter.

If the actual parameter is an expression, however, then the expression is evaluated before the call, and its value stored in a location of its own. Changes to the formal parameter change this location, but can have no effect on the data of the caller.

Call-by-reference is used for "ref" parameters in C++ and is an option in many other languages. It is almost essential when the formal parameter is a large object, array, or structure. The reason is that strict call-by-value requires that the caller copy the entire actual parameter into the space belonging to the corresponding formal parameter. This copying gets expensive when the parameter is large. As we noted when discussing call-by-value, languages such as Java solve the problem of passing arrays, strings, or other objects by copying only a reference to those objects. The effect is that Java behaves as if it used call-by-reference for anything other than a basic type such as an integer or real.
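A minimal C++ sketch of the contrast (function names invented): a reference parameter lets the callee change the caller's variable, while a value parameter does not:

    #include <cstdio>

    void increment(int& n) {     // reference ("ref") parameter: the caller's location is used
        n = n + 1;               // the change appears in the caller's variable
    }

    void incrementCopy(int n) {  // call-by-value for comparison: only a local copy changes
        n = n + 1;
    }

    int main() {
        int a = 5;
        incrementCopy(a);
        std::printf("%d\n", a);  // prints 5: a is unchanged
        increment(a);
        std::printf("%d\n", a);  // prints 6: a was changed through the reference
    }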


Call-by-Name

A third mechanism - call-by-name - was used in the early programming language Algol 60. It requires that the callee execute as if the actual parameter were substituted literally for the formal parameter in the code of the callee, as if the formal parameter were a macro standing for the actual parameter (with renaming of local names in the called procedure, to keep them distinct) . When the actual parameter is an expression rather than a variable, some unintuitive behaviors occur, which is one reason this mechanism is not favored today.

1 .6.7

Aliasing

There is an interesting consequence of call-by-reference parameter passing or its simulation, as in Java, where references to objects are passed by value. It is possible that two formal parameters can refer to the same location; such variables are said to be aliases of one another. As a result, any two variables, which may appear to take their values from two distinct formal parameters, can become aliases of each other, as well.

Example 1.9: Suppose a is an array belonging to a procedure p, and p calls another procedure q(x, y) with a call q(a, a). Suppose also that parameters are passed by value, but that array names are really references to the location where the array is stored, as in C or similar languages. Now, x and y have become aliases of each other. The important point is that if within q there is an assignment x[10] = 2, then the value of y[10] also becomes 2. □
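A concrete C++ rendering of the scenario in Example 1.9 (a minimal sketch; the array size and names are invented):

    #include <cstdio>

    // Both array parameters are really pointers into the caller's storage.
    void q(int x[], int y[]) {
        x[10] = 2;
        std::printf("%d\n", y[10]);  // prints 2 when x and y alias the same array
    }

    int main() {
        int a[20] = {0};
        q(a, a);   // both formal parameters now refer to the same array: x and y are aliases
    }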

It turns out that understanding aliasing and the mechanisms that create it is essential if a compiler is to optimize a program. As we shall see starting in Chapter 9, there are many situations where we can only optimize code if we can be sure certain variables are not aliased. For instance, we might determine that x = 2 is the only place that variable x is ever assigned. If so, then we can replace a use of x by a use of 2; for example, replace a = x+3 by the simpler a = 5. But suppose there were another variable y that was aliased to x. Then an assignment y = 4 might have the unexpected effect of changing x. It might also mean that replacing a = x+3 by a = 5 was a mistake; the proper value of a could be 7 there.

1 .6 .8

Exercises for Section 1 .6

Exercise 1.6.1: For the block-structured C code of Fig. 1.13(a), indicate the values assigned to w, x, y, and z.

Exercise 1 .6.2 : Repeat Exercise 1 .6.1 for the code of Fig. 1 . 13(b) . Exercise 1 .6 . 3 : For the block-structured code of Fig. 1 . 14, assuming the usual

static scoping of declarations, give the scope for each of the twelve declarations.

    int w, x, y, z;
    int i = 4; int j = 5;
    {
        int j = 7;
        i = 6;
        w = i + j;
    }
    x = i + j;
    {
        int i = 8;
        y = i + j;
    }
    z = i + j;

    (a) Code for Exercise 1.6.1

    int w, x, y, z;
    int i = 3; int j = 4;
    {
        int i = 5;
        w = i + j;
    }
    x = i + j;
    {
        int j = 6;
        i = 7;
        y = i + j;
    }
    z = i + j;

    (b) Code for Exercise 1.6.2

Figure 1.13: Block-structured code

    {
        int w, x, y, z;      /* Block B1 */
        {
            int x, z;        /* Block B2 */
            {
                int w, x;    /* Block B3 */
            }
        }
        {
            int w, x;        /* Block B4 */
            {
                int y, z;    /* Block B5 */
            }
        }
    }

Figure 1.14: Block-structured code for Exercise 1.6.3

Exercise 1.6.4: What is printed by the following C code?

    #define a (x+1)
    int x = 2;
    void b() { x = a; printf("%d\n", x); }
    void c() { int x = 1; printf("%d\n", a); }
    void main() { b(); c(); }

1.7

Summary of Chapter 1

♦ Language Processors. An integrated software development environment includes many different kinds of language processors such as compilers, interpreters, assemblers, linkers, loaders, debuggers, profilers.

♦ Compiler Phases. A compiler operates as a sequence of phases, each of which transforms the source program from one intermediate representation to another.


♦ Machine and Assembly Languages. Machine languages were the first-generation programming languages, followed by assembly languages. Programming in these languages was time consuming and error prone.

♦ Modeling in Compiler Design. Compiler design is one of the places where theory has had the most impact on practice. Models that have been found useful include automata, grammars, regular expressions, trees, and many others.

♦ Code Optimization. Although code cannot truly be "optimized," the science of improving the efficiency of code is both complex and very important. It is a major portion of the study of compilation.

♦ Higher-Level Languages. As time goes on, programming languages take on progressively more of the tasks that formerly were left to the programmer, such as memory management, type-consistency checking, or parallel execution of code.

♦ Compilers and Computer Architecture. Compiler technology influences computer architecture, as well as being influenced by the advances in architecture. Many modern innovations in architecture depend on compilers being able to extract from source programs the opportunities to use the hardware capabilities effectively.

♦ Software Productivity and Software Security. The same technology that allows compilers to optimize code can be used for a variety of program-analysis tasks, ranging from detecting common program bugs to discovering that a program is vulnerable to one of the many kinds of intrusions that "hackers" have discovered.

♦ Scope Rules. The scope of a declaration of x is the context in which uses of x refer to this declaration. A language uses static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the program. Otherwise, the language uses dynamic scope.

♦ Environments. The association of names with locations in memory and then with values can be described in terms of environments, which map names to locations in store, and states, which map locations to their values.

♦ Block Structure. Languages that allow blocks to be nested are said to have block structure. A name x in a nested block B is in the scope of a declaration D of x in an enclosing block if there is no other declaration of x in an intervening block.

♦ Parameter Passing. Parameters are passed from a calling procedure to the callee either by value or by reference. When large objects are passed by value, the values passed are really references to the objects themselves, resulting in an effective call-by-reference.

♦ Aliasing. When parameters are (effectively) passed by reference, two formal parameters can refer to the same object. This possibility allows a change in one variable to change another.

1.8

References for Chapter 1

For the development of programming languages that were created and in use by 1967, including Fortran, Algol, Lisp, and Simula, see [7]. For languages that were created by 1982, including C, C++, Pascal, and Smalltalk, see [1].

The GNU Compiler Collection, gcc, is a popular source of open-source compilers for C, C++, Fortran, Java, and other languages [2]. Phoenix is a compiler-construction toolkit that provides an integrated framework for building the program analysis, code generation, and code optimization phases of compilers discussed in this book [3].

For more information about programming language concepts, we recommend [5, 6]. For more on computer architecture and how it impacts compiling, we suggest [4].

1. Bergin, T. J. and R. G. Gibson, History of Programming Languages, ACM Press, New York, 1996.

2. http://gcc.gnu.org/.

3. http://research.microsoft.com/phoenix/default.aspx.

4. Hennessy, J. L. and D. A. Patterson, Computer Organization and Design: The Hardware/Software Interface, Morgan-Kaufmann, San Francisco, CA, 2004.

5. Scott, M. L., Programming Language Pragmatics, second edition, Morgan-Kaufmann, San Francisco, CA, 2006.

6. Sethi, R., Programming Languages: Concepts and Constructs, Addison-Wesley, 1996.

7. Wexelblat, R. L., History of Programming Languages, Academic Press, New York, 1981.

Chapter 2

A Simple Syntax-Directed Translator

This chapter is an introduction to the compiling techniques in Chapters 3 through 6 of this book. It illustrates the techniques by developing a working Java program that translates representative programming language statements into three-address code, an intermediate representation. In this chapter, the emphasis is on the front end of a compiler, in particular on lexical analysis, parsing, and intermediate code generation. Chapters 7 and 8 show how to generate machine instructions from three-address code.

We start small by creating a syntax-directed translator that maps infix arithmetic expressions into postfix expressions. We then extend this translator to map code fragments as shown in Fig. 2.1 into three-address code of the form in Fig. 2.2.

The working Java translator appears in Appendix A. The use of Java is convenient, but not essential. In fact, the ideas in this chapter predate the creation of both Java and C.

    {
        int i; int j; float[100] a; float v; float x;
        while ( true ) {
            do i = i+1; while ( a[i] < v );
            do j = j-1; while ( a[j] > v );
            if ( i >= j ) break;
            x = a[i]; a[i] = a[j]; a[j] = x;
        }
    }

Figure 2.1: A code fragment to be translated

     1:  i = i + 1
     2:  t1 = a [ i ]
     3:  if t1 < v goto 1
     4:  j = j - 1
     5:  t2 = a [ j ]
     6:  if t2 > v goto 4
     7:  ifFalse i >= j goto 9
     8:  goto 14
     9:  x = a [ i ]
    10:  t3 = a [ j ]
    11:  a [ i ] = t3
    12:  a [ j ] = x
    13:  goto 1
    14:

Figure 2.2: Simplified intermediate code for the program fragment in Fig. 2.1

2.1

Introduction

The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation for it, called intermediate code. The synthesis phase translates the intermediate code into the target program.

Analysis is organized around the "syntax" of the language to be compiled. The syntax of a programming language describes the proper form of its programs, while the semantics of the language defines what its programs mean; that is, what each program does when it executes. For specifying syntax, we present a widely used notation, called context-free grammars or BNF (for Backus-Naur Form) in Section 2.2. With the notations currently available, the semantics of a language is much more difficult to describe than the syntax. For specifying semantics, we shall therefore use informal descriptions and suggestive examples.

Besides specifying the syntax of a language, a context-free grammar can be used to help guide the translation of programs. In Section 2.3, we introduce a grammar-oriented compiling technique known as syntax-directed translation. Parsing or syntax analysis is introduced in Section 2.4.

The rest of this chapter is a quick tour through the model of a compiler front end in Fig. 2.3. We begin with the parser. For simplicity, we consider the syntax-directed translation of infix expressions to postfix form, a notation in which operators appear after their operands. For example, the postfix form of the expression 9 - 5 + 2 is 95-2+. Translation into postfix form is rich enough to illustrate syntax analysis, yet simple enough that the translator is shown in full in Section 2.5. The simple translator handles expressions like 9 - 5 + 2, consisting of digits separated by plus and minus signs. One reason for starting with such simple expressions is that the syntax analyzer can work directly with the individual characters for operators and operands.


    source program → [Lexical Analyzer] → tokens → [Parser]
        → syntax tree → [Intermediate Code Generator] → three-address code

    (all three components share the Symbol Table)

Figure 2.3: A model of a compiler front end

A lexical analyzer allows a translator to handle multicharacter constructs like identifiers, which are written as sequences of characters, but are treated as units called tokens during syntax analysis; for example, in the expression count + 1, the identifier count is treated as a unit. The lexical analyzer in Section 2.6 allows numbers, identifiers, and "white space" (blanks, tabs, and newlines) to appear within expressions.

Next, we consider intermediate-code generation. Two forms of intermediate code are illustrated in Fig. 2.4. One form, called abstract syntax trees or simply syntax trees, represents the hierarchical syntactic structure of the source program. In the model in Fig. 2.3, the parser produces a syntax tree that is further translated into three-address code. Some compilers combine parsing and intermediate-code generation into one component.

    (a) An abstract syntax tree whose root is labeled do-while; the left child of
        the root is the loop body, the assignment i = i + 1, and the right child
        is the condition a[i] < v.

    (b) Three-address code:
            1:  i = i + 1
            2:  t1 = a [ i ]
            3:  if t1 < v goto 1

Figure 2.4: Intermediate code for "do i = i + 1; while ( a[i] < v );"

The root of the abstract syntax tree in Fig. 2.4(a) represents an entire do-while loop. The left child of the root represents the body of the loop, which consists of only the assignment i = i + 1. The right child of the root represents the condition a[i] < v. An implementation of syntax trees appears in Section 2.8. The other common intermediate representation, shown in Fig. 2.4(b), is a


sequence of "three-address" instructions; a more complete example appears in Fig. 2.2. This form of intermediate code takes its name from instructions of the form x = y op Z, where op is a binary operator, y and z the are addresses for the operands, and x is the address for the result of the operation. A three­ address instruction carries out at most one operation, typically a computation, a comparison, or a branch. In Appendix A, we put the techniques in this chapter together to build a compiler front end in Java. The front end translates statements into assembly­ level instructions. 2.2

2.2

Syntax Definition

In this section, we introduce a notation - the "context-free grammar," or "grammar" for short - that is used to specify the syntax of a language. Gram­ mars will be used throughout this book to organize compiler front ends. A grammar naturally describes the hierarchical structure of most program­ ming language constructs. For example, an if-else statement in Java can have the form if ( expression ) statement else statement

That is, an if-else statement is the concatenation of the keyword if, an open­ ing parenthesis, an expression, a closing parenthesis, a statement, the keyword else, and another statement. Using the variable expr to denote an expres­ sion and the variable stmt to denote a statement, this structuring rule can be expressed as

    stmt → if ( expr ) stmt else stmt

in which the arrow may be read as "can have the form." Such a rule is called a production. In a production, lexical elements like the keyword if and the paren­ theses are called terminals. Variables like expr and stmt represent sequences of terminals and are called nonterminals.

2.2.1 Definition of Grammars

A context-free grammar has four components:

1. A set of terminal symbols, sometimes referred to as "tokens." The terminals are the elementary symbols of the language defined by the grammar.

2. A set of nonterminals, sometimes called "syntactic variables." Each non­ terminal represents a set of strings of terminals, in a manner we shall describe. 3. A set of productions, where each production consists of a nonterminal, called the head or left side of the production, an arrow, and a sequence of


terminals and/or nonterminals, called the body or right side of the production. The intuitive intent of a production is to specify one of the written forms of a construct; if the head nonterminal represents a construct, then the body represents a written form of the construct.

Tokens Versus Terminals

In a compiler, the lexical analyzer reads the characters of the source program, groups them into lexically meaningful units called lexemes, and produces as output tokens representing these lexemes. A token consists of two components, a token name and an attribute value. The token names are abstract symbols that are used by the parser for syntax analysis. Often, we shall call these token names terminals, since they appear as terminal symbols in the grammar for a programming language. The attribute value, if present, is a pointer to the symbol table that contains additional information about the token. This additional information is not part of the grammar, so in our discussion of syntax analysis, often we refer to tokens and terminals synonymously.

4. A designation of one of the nonterminals as the start symbol.

We specify grammars by listing their productions, with the productions for the start symbol listed first.


into appropriate lexemes. Which lexemes should get associated lexical values, and what should those values be? 3.2

Input Buffering

Before discussing the problem of recognizing lexemes in the input, let us examine some ways that the simple but important task of reading the source program can be speeded. This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. The box on "Tricky Problems When Recognizing Tokens" in Section 3.1 gave an extreme example, but there are many situations where we need to look at least one additional character ahead. For instance, we cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id. In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=.

            else if ( c == '>' ) state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1:
        ...
        case 8: retract();
            retToken.attribute = GT;
            return(retToken);
        }
      }
    }

Figure 3.18: Sketch of implementation of relop transition diagram

the true beginning of the unprocessed input. It might then change the value of state to be the start state for another transition diagram, which will search for another token. Alternatively, if there is no other transition diagram that remains unused, fail() could initiate an error-correction phase that will try to repair the input and find a lexeme, as discussed in Section 3.1.4.

We also show the action for state 8 in Fig. 3.18. Because state 8 bears a *, we must retract the input pointer one position (i.e., put c back on the input stream). That task is accomplished by the function retract(). Since state 8 represents the recognition of lexeme >=, we set the second component of the returned object, which we suppose is named attribute, to GT, the code for this operator. □

To place the simulation of one transition diagram in perspective, let us consider the ways code like Fig. 3.18 could fit into the entire lexical analyzer.

1. We could arrange for the transition diagrams for each token to be tried sequentially. Then, the function fail() of Example 3.10 resets the pointer forward and starts the next transition diagram, each time it is called. This method allows us to use transition diagrams for the individual keywords, like the one suggested in Fig. 3.15. We have only to use these before we use the diagram for id, in order for the keywords to be reserved words.


2. We could run the various transition diagrams "in parallel," feeding the next input character to all of them and allowing each one to make whatever transitions it required. If we use this strategy, we must be careful to resolve the case where one diagram finds a lexeme that matches its pattern, while one or more other diagrams are still able to process input. The normal strategy is to take the longest prefix of the input that matches any pattern. That rule allows us to prefer identifier thenext to keyword then, or the operator -> to -, for example.

3. The preferred approach, and the one we shall take up in the following sections, is to combine all the transition diagrams into one. We allow the transition diagram to read input until there is no possible next state, and then take the longest lexeme that matched any pattern, as we discussed in item (2) above. In our running example, this combination is easy, because no two tokens can start with the same character; i.e., the first character immediately tells us which token we are looking for. Thus, we could simply combine states 0, 9, 12, and 22 into one start state, leaving other transitions intact. However, in general, the problem of combining transition diagrams for several tokens is more complex, as we shall see shortly. (A small illustrative sketch of the longest-prefix idea appears after this list.)
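The sketch below is a small, hand-written C++ illustration of the longest-prefix idea (it is not Fig. 3.18; the token names and the function are invented for this example). From the state reached after '<', it looks one character ahead and consumes it only if it extends the lexeme; leaving the lookahead unconsumed plays the role of retract():

    #include <cstddef>
    #include <iostream>
    #include <string>

    // Recognizes <, <=, or <> starting at position pos, preferring the longest match.
    std::string relop(const std::string& input, std::size_t& pos) {
        if (pos >= input.size() || input[pos] != '<') return "fail";
        ++pos;                                              // consumed '<'; now in the middle state
        if (pos < input.size() && input[pos] == '=') { ++pos; return "LE"; }
        if (pos < input.size() && input[pos] == '>') { ++pos; return "NE"; }
        return "LT";                                        // lookahead character is not consumed
    }

    int main() {
        std::size_t pos = 0;
        std::cout << relop("<=x", pos) << "\n";  // prints LE; pos now points at 'x'
        pos = 0;
        std::cout << relop("<x", pos) << "\n";   // prints LT
    }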

3.4.5 Exercises for Section 3.4

Exercise 3.4.1: Provide transition diagrams to recognize the same languages as each of the regular expressions in Exercise 3.3.2.

Exercise 3.4.2: Provide transition diagrams to recognize the same languages as each of the regular expressions in Exercise 3.3.5.

The following exercises, up to Exercise 3.4.12, introduce the Aho-Corasick algorithm for recognizing a collection of keywords in a text string in time proportional to the length of the text and the sum of the length of the keywords. This algorithm uses a special form of transition diagram called a trie. A trie is a tree-structured transition diagram with distinct labels on the edges leading from a node to its children. Leaves of the trie represent recognized keywords.

Knuth, Morris, and Pratt presented an algorithm for recognizing a single keyword b1b2···bn in a text string. Here the trie is a transition diagram with n states, 0 through n. State 0 is the initial state, and state n represents acceptance, that is, discovery of the keyword. From each state s from 0 through n - 1, there is a transition to state s + 1, labeled by symbol bs+1. For example, the trie for the keyword ababaa is:

    0 --a--> 1 --b--> 2 --a--> 3 --b--> 4 --a--> 5 --a--> 6

In order to process text strings rapidly and search those strings for a keyword, it is useful to define, for keyword b1b2···bn and position s in that keyword (corresponding to state s of its trie), a failure function, f(s), computed as in


Fig. 3.19. The objective is that b1b2···bf(s) is the longest proper prefix of b1b2···bs that is also a suffix of b1b2···bs. The reason f(s) is important is that if we are trying to match a text string for b1b2···bn, and we have matched the first s positions, but we then fail (i.e., the next position of the text string does not hold bs+1), then f(s) is the longest prefix of b1b2···bn that could possibly match the text string up to the point we are at. Of course, the next character of the text string must be bf(s)+1, or else we still have problems and must consider a yet shorter prefix, which will be bf(f(s)).

    1) t = 0;
    2) f(1) = 0;
    3) for (s = 1; s < n; s++) {
    4)     while (t > 0 && bs+1 != bt+1) t = f(t);
    5)     if (bs+1 == bt+1) {
    6)         t = t + 1;
    7)         f(s+1) = t;
           }
    8)     else f(s+1) = 0;
       }

Figure 3.19: Algorithm to compute the failure function for keyword b1b2···bn

As an example, the failure function for the trie constructed above for ababaa is:

    s       1   2   3   4   5   6
    f(s)    0   0   1   2   3   1

For instance, states 3 and 1 represent prefixes aba and a, respectively. f (3) = 1 because a is the longest proper prefix of aba that is also a suffix of aba. Also, f(2) = 0, because the longest proper prefix of ab that is also a suffix is the empty string.
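The following C++ sketch computes the same failure function with 0-based indexing (a translation of the idea in Fig. 3.19, not the book's code; fail[s] here corresponds to f(s+1) in the 1-based notation above):

    #include <iostream>
    #include <string>
    #include <vector>

    // fail[s] = length of the longest proper prefix of b[0..s] that is also a suffix of b[0..s].
    std::vector<int> computeFailure(const std::string& b) {
        int n = static_cast<int>(b.size());
        std::vector<int> fail(n, 0);
        int t = 0;                                    // length matched so far
        for (int s = 1; s < n; ++s) {
            while (t > 0 && b[s] != b[t]) t = fail[t - 1];
            if (b[s] == b[t]) ++t;
            fail[s] = t;
        }
        return fail;
    }

    int main() {
        for (int v : computeFailure("ababaa"))
            std::cout << v << ' ';                    // prints 0 0 1 2 3 1
        std::cout << '\n';
    }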

Exercise 3.4.3: Construct the failure function for the strings:

a) abababaab.

b) aaaaaa.

c) abbaabb.

! Exercise 3.4.4: Prove, by induction on s, that the algorithm of Fig. 3.19 correctly computes the failure function.

!! Exercise 3.4.5: Show that the assignment t = f(t) in line (4) of Fig. 3.19 is executed at most n times. Show that therefore, the entire algorithm takes only O(n) time on a keyword of length n.


Having computed the failure function for a keyword b1b2···bn, we can scan a string a1a2···am in time O(m) to tell whether the keyword occurs in the string. The algorithm, shown in Fig. 3.20, slides the keyword along the string, trying to make progress by matching the next character of the keyword with the next character of the string. If it cannot do so after matching s characters, then it "slides" the keyword right s - f(s) positions, so only the first f(s) characters of the keyword are considered matched with the string.

    1) s = 0;
    2) for (i = 1; i <= m; i++) {
    3)     while (s > 0 && ai != bs+1) s = f(s);
    4)     if (ai == bs+1) s = s + 1;
    5)     if (s == n) return "yes";
       }
    6) return "no";

Figure 3.20: The KMP algorithm tests whether string a1a2···am contains a single keyword b1b2···bn as a substring in O(m + n) time
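A self-contained C++ sketch of the same scan with 0-based indices (again a translation of the idea, not the book's code; the test strings are invented):

    #include <iostream>
    #include <string>
    #include <vector>

    // Returns true if keyword b occurs as a substring of text a, in O(|a| + |b|) time.
    bool kmpContains(const std::string& a, const std::string& b) {
        int n = static_cast<int>(b.size());
        std::vector<int> fail(n, 0);
        for (int s = 1, t = 0; s < n; ++s) {          // failure function, as in Fig. 3.19
            while (t > 0 && b[s] != b[t]) t = fail[t - 1];
            if (b[s] == b[t]) ++t;
            fail[s] = t;
        }
        int s = 0;                                    // number of keyword characters matched so far
        for (char c : a) {
            while (s > 0 && c != b[s]) s = fail[s - 1];
            if (c == b[s]) ++s;
            if (s == n) return true;                  // the whole keyword has been matched
        }
        return false;
    }

    int main() {
        std::cout << std::boolalpha
                  << kmpContains("xyzababaaxyz", "ababaa") << '\n'   // true
                  << kmpContains("abababab", "ababaa") << '\n';      // false
    }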

Exercise 3.4.6: Apply Algorithm KMP to test whether keyword ababaa is a substring of:

a) abababaab.

b) abababbaa.

!! Exercise 3.4.7: Show that the algorithm of Fig. 3.20 correctly tells whether the keyword is a substring of the given string. Hint: proceed by induction on i. Show that for all i, the value of s after line (4) is the length of the longest prefix of the keyword that is a suffix of a1a2···ai.

!! Exercise 3.4.8: Show that the algorithm of Fig. 3.20 runs in time O(m + n), assuming that function f is already computed and its values stored in an array indexed by s.

Exercise 3.4.9: The Fibonacci strings are defined as follows:

1. s1 = b.

2. s2 = a.

3. sk = sk-1 sk-2 for k > 2.

For example, s3 = ab, s4 = aba, and s5 = abaab.

a) What is the length of sn?

b) Construct the failure function for s6.

c) Construct the failure function for s7.

!! d) Show that the failure function for any sn can be expressed by f(1) = f(2) = 0, and for 2 < j ≤ |sn|, f(j) is j - |sk-1|, where k is the largest integer such that |sk| ≤ j + 1.

! ! e) In the KMP algorithm, what is the largest number of consecutive applica­

tions of the failure function, when we try to determine whether keyword Sk appears in text string Sk+ l ?

Aho and Corasick generalized the KMP algorithm to recognize any of a set of keywords in a text string. In this case, the trie is a true tree, with branching from the root. There is one state for every string that is a prefix (not necessarily proper) of any keyword. The parent of a state corresponding to string b1b2···bk is the state that corresponds to b1b2···bk-1. A state is accepting if it corresponds to a complete keyword. For example, Fig. 3.21 shows the trie for the keywords he, she, his, and hers.

    0 --h--> 1 --e--> 2 --r--> 8 --s--> 9
             1 --i--> 6 --s--> 7
    0 --s--> 3 --h--> 4 --e--> 5

    (states 2, 5, 7, and 9, corresponding to he, she, his, and hers, are accepting)

s

Figure 3.21: Trie for keywords he, she , his, hers The failure function for the general trie is defined as follows. Suppose S is the state that corresponds to string bl b2 bn . Then f ( s ) is the state that corresponds to the longest proper suffix of b l b2 bn that is also a prefix of some keyword. For example, the failure function for the trie of Fig. 3.21 is: • •

.

.

.



! Exercise 3.4.10 : Modify the algorithm of Fig. 3.19 to compute the failure

function for general tries. Hint: The major difference is that we cannot simply test for equality or inequality of bs+ 1 and bt + 1 in lines (4) and (5) of Fig. 3.19. Rather, from any state there may be several transitions out on several charac­ ters, as there are transitions on both e and i from state 1 in Fig. 3.21. Any of

140

CHAPTER 3. LEXICAL ANALYSIS

those transitions could lead to a state that represents the longest suffix that is also a prefix.

Exercise 3.4. 1 1 : Construct the tries and compute the failure function for the following sets of keywords: a) aaa, abaaa, and ababaaa. b) all , f all, f atal, llama, and lame. c) pipe, pet , item, temper, and perpetual. ! Exercise 3.4.12 : Show that your algorithm from Exercise 3.4.10 still runs in

time that is linear in the sum of the lengths of the keywords. 3.5

The L exical- Analyzer G enerator Lex

In this section, we introduce a tool called Lex, or in a more recent implemen­ tation Flex, that allows one to specify a lexical analyzer by specifying regular expressions to describe patterns for tokens. The input notation for the Lex tool is referred to as the Lex language and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex . yy c, that simulates this tran­ sition diagram. The mechanics of how this translation from regular expressions to transition diagrams occurs is the subject of the next sections; here we only learn the Lex language. .

Use of Lex Figure 3.22 suggests how Lex is used. An input file, which we call lex . 1 , is 3.5.1

written in the Lex language and describes the lexical analyzer to be generated. The Lex compiler transforms lex . 1 to a C program, in a file that is always named lex . yy . c . The latter file is compiled by the C compiler into a file called a . out , as always. The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens. The normal use of the compiled C program, referred to as a . out in Fig. 3.22, is as a subroutine of the parser. It is a C function that returns an integer, which is a code for one of the possible token names. The attribute value, whether it be another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, 2 which is shared between the lexical analyzer and parser, thereby making it simple to return both the name and an attribute value of a token. 2 Incidentally, the yy that appears in yylval and lex . yy . c refers to the Yacc parser­ generator, which we shall describe in Section 4.9, and which is commonly used in conjunction with Lex.

3.5. THE LEXICAL-ANALYZER GENERATOR LEX

4

Lex source program

lex . l

lex . yy . c

Input stream

-1 -1

Lex compiler

C

OI;P

iler

a . out



� �

141

lex . yy . c

a . out

Sequence of tokens

Figure 3.22: Creating a lexical analyzer with Lex

3 .5.2

Structure of Lex Programs

A Lex program has the following form: declarations %% translation rules %% auxiliary functions The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g. , the name of a token) , and regular definitions , in the style of Section 3.3.4. The translation rules each have the form Pattern { Action } Each pattern is a regular expression, which may use the regular definitions of the declaration section. The actions are fragments of code, typically written in C , although many variants of Lex using other languages have been created. The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer. The lexical analyzer created by Lex behaves in concert with the parser as follows. When called by the parser, the lexical analyzer begins reading its remaining input, one character at a time, until it finds the longest prefix of the input that matches one of the patterns Pi' It then executes the associated action Ai . Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or comments ) , then the lexical analyzer proceeds to find additional lexemes, until one of the corresponding actions causes a return to the parser. The lexical analyzer returns a single value, the token name, to the parser, but uses the shared, integer variable yyl val to pass additional information about the lexeme found, if needed.

142

CHAPTER 3. LEXICAL ANALYSIS

Example 3.1 1 : Figure 3.23 is a Lex program that recognizes the tokens of Fig. 3.12 and returns the token found. A few observations about this code will introduce us to many of the important features of Lex. In the declarations section we see a pair of special brackets, %{ and %}. Anything within these brackets is copied directly to the file lex . y y . c , and is not treated as a regular definition. It is common to place there the definitions of the manifest constants, using C #def ine statements to associate unique integer codes with each of the manifest constants. In our example, we have listed in a comment the names of the manifest constants, L T, IF, and so on, but have not shown them defined to be particular integers. 3 Also in the declarations section is a sequence of regular definitions. These use the extended notation for regular expressions described in Section 3.3.5. Regular definitions that are used in later definitions or in the patterns of the translation rules are surrounded by curly braces. Thus, for instance, delim is defined to be a shorthand for the character class consisting of the blank, the tab, and the newline; the latter two are represented, as in all UNIX commands, by backslash followed by t or n, respectively. Then, ws is defined to be one or more delimiters, by the regular expression {delim}+. Notice that in the definition of id and number, parentheses are used as grouping metasymbols and do not stand for themselves. In contrast, E in the definition of number stands for itself. If we wish to use one of the Lex meta­ symbols, such as any of the parentheses, +, * , or ?, to stand for themselves, we may precede them with a backslash. For instance, we see \ . in the definition of number, to represent the dot, since that character is a metasymbol representing "any character," as usual in UNIX regular expressions. In the auxiliary-function section, we see two such functions, installID 0 and installNum ( ) . Like the portion of the declaration section that appears between %{ . . . %} , everything in the auxiliary section is copied directly to file lex . yy . c, but may be used in the actions. Finally, let us examine some of the patterns and rules in the middle section of Fig. 3.23. First, WS, an identifier declared in the first section, has an associated empty action. If we find whitespace, we do not return to the parser, but look for another lexeme. The second token has the simple regular expression pattern if . Should we see the two letters if on the input, and they are not followed by another letter or digit ( which would cause the lexical analyzer to find a longer prefix of the input matching the pattern for id) , then the lexical analyzer consumes these two letters from the input and returns the token name IF, that is, the integer for which the manifest constant IF stands. Keywords then and else are treated similarly. The fifth token has the pattern defined by id. Note that, although keywords like if match this pattern as well as an earlier pattern, Lex chooses whichever 3 If Lex is used along with Yacc, then it would be normal to define the manifest constants in the Yacc program and use them without definition in the Lex program. Since lex . yy . c is compiled with the Yacc output, the constants thus will be available to the actions in the Lex program.

3.5. THE LEXICAL-ANALYZER GENERATOR LEX

143

%{ / * def init ions of manifest constant s LT , LE , EQ , NE , GT , GE , IF , THEN , ELSE , ID , NUMBER , RELOP */ %} / * regular def init ions */ del im [ \t \n] ws {delim}+ letter [A-Za-z] digit [0-9] id {letter} ({letter} l {digit } ) * number {digit}+ (\ . {digit}+ ) ? (E [+-] ?{digit }+ ) ? %% {ws} if then else { id} {number} "= "

{/* no act ion and no return */} {return ( IF) ; } {return (THEN ) ; } {return (ELSE) ; } {yylval = ( int ) installID ( ) ; rettirn ( ID ) ; } {yylval = ( int ) installNum ( ) ; return (NUMBER) ; } {yylval = LT ; return (RELOP ) ; } {yylval = LE ; return (RELOP ) ; } {yylval = EQ ; return (RELOP ) j } {yylval = NE ; return (RELOP ) ; } {yylval = GT ; return (RELOP ) ; } {yylval = GE ; return (RELOP ) ; }

%% int installID ( ) {/* funct ion to install the lexeme , whose f irst character is pointed to by yytext , arid whose length is yyleng , into the symbol table and return a pointer thereto */ } int installNum ( ) {/* s imilar to installID , but puts numer­ i cal constants into a separate table * / }

Figure 3.23: Lex program for the tokens of Fig. 3.12

144

CHAPTER 3. LEXICAL ANALYSIS

pattern is listed first in situations where the longest matching prefix matches two or more patterns. The action taken when id is matched is threefold: 1. Function installID 0 is called to place the lexeme found in the symbol table.

2. This function returns a pointer to the symbol table, which is placed in global variable yylval , where it can be used by the parser or a later component of the compiler. Note that install ID 0 has available to it two variables that are set automatically by the lexical analyzer that Lex generates: ( a) yytext is a pointer to the beginning of the lexeme, analogous to lexemeBegin in Fig. 3.3.

( b ) yyleng is the length of the lexeme found.

3. The token name ID is returned to the parser. The action taken when a lexeme matching the pattern number is similar, using the auxiliary function installNum 0 . 0

3.5.3

Conflict Resolution in

Lex

We have alluded to the two rules that Lex uses to decide on the proper lexeme to select, when several prefixes of the input match one or more patterns: 1 . Always prefer a longer prefix to a shorter prefix. 2.

If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program.

Example 3. 12 : The first rule tells us to continue reading letters and digits to find the longest prefix of these characters to group as an identifier. It also tells us to treat a"(,B. The symbol => means, "derives in one step." When a sequence of derivation steps al => a 2 => . . . => an rewrites al to an , we say at derives an· Often, we wish to say, "derives in zero or more steps." For this purpose, we can use the symbol * . Thus, 1. a * a, for any string a, and

2. If a * ,B and ,B => ,,( , then a * "(. Likewise, :t means, "derives in one or more steps." If S * a, where S is the start symbol of a grammar G, we say that a is a sentential form of G. Note that a sentential form may contain both terminals �nd nonterminals, and may be empty. A sentence of G is a sentential form with no nonterminals. The language generated by a grammar is its set of sentences. Thus, a string of terminals w is in L( G) , the language generated by G, if and only if w is a sentence of G (or S * w ) . A language that can be generated by a grammar is said to be a context-free language. If two grammars generate the same language, the grammars are said to be equivalent. The string - (id + id) is a sentence of grammar (4.7) because there is a derivation

E => -E => - (E)

=>

- (E + E)

=>

- (id + E)

=>

- (id + id)

(4.8)

The strings E, -E, -(E) , . . . , - (id + id) are all sentential forms of this grarn­ mq,r. We write E * - (id + id) to indicate that - (id + id) can be derived from E. . At each step in a derivation, there are two choices to be made. We need to choose which nonterminal to replace, and having made this choice, we must pick a production with that n�nterminal as head. For example, the following alternative derivation of - (id + id) differs from derivation (4.8) in the last two steps:

E => -E => - (E)

=>

- (E + E)

=>

- (E + id)

=>

- (id + id)

(4.9)

201

4.2. CONTEXT-FREE GRAMMARS

Each nonterminal is replaced by the same body in the two derivations, but the order of replacements is different. To understand how parsers work, we shall consider derivations in which the nonterminal to be replaced at each step is chosen as follows: 1. In leftmost derivations, the leftmost nonterminal in each sentential is al­ ways chosen. If a =? 13 is a step in which the leftmost nonterminal in a is replaced, we write a =? 13· lm

2. In rightmost derivations, the rightmost nonterminal is always chosen; we write a =? 13 in this case. rm

Derivation (4.8) is leftmost, so it can be rewritten as E

=? lm

-E

=? lm

- (E)

=? lm

- (E + E)

=? lm

- (id + E)

=? lm

- (id + id)

Note that (4.9) is a rightmost derivation. Using our notational conventions, every leftmost step can be written as wA-y =? w8",,( , where w consists of terminals only, A -+ 8 is the production lm applied, and ""( is a string of grammar symbols. To emphasize that a derives (3 by a leftmost derivation, we write a * (3. If S * a, then we say that a is a lm lm left-sentential form of the grammar at hand. Analogous definitions hold for rightmost derivations. Rightmost derivations are sometimes called canonical derivations.

4.2.4

Parse Trees and Derivations

parse tree is a graphical representation of a derivation that filters out the order in which productions are applied to replace nonterminals. Each interior node of a parse tree represents the application of a production. The interior node is labeled with the nonterminal A in the head of the production; the children of the node are labeled, from left to right, by the symbols in the body of the production by which this A was replaced during the derivation. For example, the parse tree for (id + id) in Fig. 4.3, results from the derivation (4.8) as well as derivation (4.9) . The leaves of a parse tree are labeled by nonterminals or terminals and, read from 'left to right, constitute a sentential form, called the yield or frontier of the tree. To see the relationship between derivations and parse trees, consider any derivation al =? a2 =? . . . =? an , where al is a single nonterminal A. For each sentential form ai in the derivation, we can construct a parse tree whose yield is ai . The process is an induction on i. A

-

BASIS: The tree for al

=

A is a single node labeled A.

202

CHAPTER 4. SYNTAX ANALYSIS E

/ � ( E

I

E

/ I � E

)

+

E

/ I �

id

I

id

Figure 4.3: Parse tree for - (id + id)

INDUCTION: Suppose we already have constructed a parse tree with yield ai -l = X1 X2 Xk (note that according to our notational conventions, each grammar symbol Xi is either a nonterminal or a terminal) . Suppose (}i is derived from ai -l by replacing Xj , a nonterminal, by f3 Yi 1'2 Ym . That is, at the ith step of the derivation, production Xj -+ f3 is applied to a i -l to derive a i = X1 X2 Xj- 1 f3Xj + l . . . Xk . To model this step of the derivation, find the jth leaf from the left in the current parse tree. This leaf is labeled Xj . Give this leaf m children, labeled Y1 , Y2 , , Ym , from the left. As a special case, if m = 0, then f3 = E, and we give the jth leaf one child labeled E . •

.



=

.







'

"

.



Example 4.10: The sequence of parse trees constructed from the derivation (4.8) is shown in Fig. 4.4. In the first step of the derivation, E ⇒ -E. To model this step, add two children, labeled - and E, to the root E of the initial tree. The result is the second tree. In the second step of the derivation, -E ⇒ -(E). Consequently, add three children, labeled (, E, and ), to the leaf labeled E of the second tree, to obtain the third tree with yield -(E). Continuing in this fashion we obtain the complete parse tree as the sixth tree. □

-

,

Since a parse tree ignores variations in the order in which symbols in senten­ tial forms are replaced, there is a many-to-one relationship between derivations and parse trees. For example, both derivations (4.8) and (4.9) , are associated with the same final parse tree of Fig. 4.4. In what follows, we shall frequently parse by producing a leftmost or a rightmost derivation, since there is a one-to-orie relationship between parse trees and either leftmost or rightmost derivations. Both leftmost and rightmost derivations pick a particular order for replacing symbols in sentential forms, so they too filter out variations in the order. It is not hard to show that every parse tree has associated with it a unique leftmost and a unique rightmost derivation.

203

4.2. CONTEXT-FREE GRAMMARS E

E

:::}

/ �E

/ �E /I� ( /1 � ) E E + E

:::}

E

/ �E /I� ( /1� ) E

E

+

I

:::}

id

Figure 4.4: Sequence of parse trees for derivation (4.8)

4.2.5

Ambiguity

From Section 2.2.4, a grammar that produces more than one parse tree for some sentence is said to be ambiguous. Put another way, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for the same sentence.

Example 4.11: The arithmetic expression grammar (4.3) permits two distinct leftmost derivations for the sentence id + id * id:

    E ⇒ E + E                    E ⇒ E * E
      ⇒ id + E                     ⇒ E + E * E
      ⇒ id + E * E                 ⇒ id + E * E
      ⇒ id + id * E                ⇒ id + id * E
      ⇒ id + id * id               ⇒ id + id * id

The corresponding parse trees appear in Fig. 4.5. Note that the parse tree of Fig. 4.5 (a) reflects the commonly assumed prece­ dence of + and * , while the tree of Fig. 4.5 (b) does not. That is, it is customary to treat operator * as having higher precedence than +, corresponding to the fact that we would normally evaluate an expression like a + b * c as a + (b * c) , rather than as ( a + b) * c. 0 For most parsers, it is desirable that the grammar be made unambiguous, for if it is not , we cannot uniquely determine which parse tree to select for a sentence. In other cases, it is convenient to use carefully chosen ambiguous grammars, together with disambiguating rules that "throw away" undesirable parse trees, leaving only one tree for each sentence.

         E                          E
       / | \                      / | \
      E  +  E                    E  *  E
      |   / | \                / | \   |
     id  E  *  E              E  +  E  id
         |     |              |     |
        id    id             id    id

        (a)                        (b)

Figure 4.5: Two parse trees for id + id * id

4.2.6

Verifying the Language Generated by a Grammar

Although compiler designers rarely do so for a complete programming-language grammar, it is useful to be able to reason that a given set of productions gener­ ates a particular language. Troublesome constructs can be studied by writing a concise, abstract grammar and studying the language that it generates. We shall construct such a grammar for conditional statements below. A proof that a grammar G generates a language L has two parts: show that every string generated by G is in L, and conversely that every string in L can indeed be generated by G.

Example 4.12: Consider the following grammar:

    S → ( S ) S | ε        (4.13)

It may not be initially apparent, but this simple grammar generates all strings of balanced parentheses, and only such strings. To see why, we shall show first that every sentence derivable from S is balanced, and then that every balanced string is derivable from S . To show that every sentence derivable from S is balanced, we use an inductive proof on the number of steps n in a derivation. BASIS: The basis is n = 1. The only string of terminals derivable from S in one step is the empty string, which surely is balanced. INDUCTION: Now assume that all derivations of fewer than n steps produce balanced sentences, and consider a leftmost derivation of exactly n steps. Such a derivation must be of the form

    S ⇒lm (S)S ⇒*lm (x)S ⇒*lm (x)y

The derivations of x and y from S take fewer than n steps, so by the inductive hypothesis x and y are balanced. Therefore, the string (x)y must be balanced. That is, it has an equal number of left and right parentheses, and every prefix has at least as many left parentheses as right.


Having thus shown that any string derivable from S is balanced, we must next show that every balanced string is derivable from S. To do so, use induction on the length of a string.

BASIS: If the string is of length 0, it must be ε, which is balanced.

INDUCTION: First, observe that every balanced string has even length. Assume that every balanced string of length less than 2n is derivable from S, and consider a balanced string w of length 2n, n ≥ 1. Surely w begins with a left parenthesis. Let (x) be the shortest nonempty prefix of w having an equal number of left and right parentheses. Then w can be written as w = (x)y where both x and y are balanced. Since x and y are of length less than 2n, they are derivable from S by the inductive hypothesis. Thus, we can find a derivation of the form

    S ⇒ (S)S ⇒* (x)S ⇒* (x)y

proving that w = (x)y is also derivable from S. □

4.2.7 Context-Free Grammars Versus Regular Expressions

Before leaving this section on grammars and their properties, we establish that grammars are a more powerful notation than regular expressions. Every construct that can be described by a regular expression can be described by a grammar, but not vice versa. Alternatively, every regular language is a context-free language, but not vice versa. For example, the regular expression (a|b)*abb and the grammar

    A0 → a A0 | b A0 | a A1
    A1 → b A2
    A2 → b A3
    A3 → ε

describe the same language, the set of strings of a's and b's ending in abb.

We can construct mechanically a grammar to recognize the same language as a nondeterministic finite automaton (NFA). The grammar above was constructed from the NFA in Fig. 3.24 using the following construction:

1. For each state i of the NFA, create a nonterminal Ai.

2. If state i has a transition to state j on input a, add the production Ai → aAj. If state i goes to state j on input ε, add the production Ai → Aj.

3. If i is an accepting state, add Ai → ε.

4. If i is the start state, make Ai the start symbol of the grammar.
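This construction is mechanical enough to automate. The following Python sketch (not from the book) applies the four rules to an NFA given as a list of transitions; the function name nfa_to_grammar, the triple-based transition encoding, and the A0, A1, ... naming are assumptions made for this illustration.

def nfa_to_grammar(states, transitions, accepting, start):
    """Build right-linear productions A_i -> a A_j from an NFA.

    transitions: list of (i, symbol, j) triples; symbol may be "" for an epsilon move.
    Returns (start nonterminal, dict mapping each nonterminal to its bodies).
    """
    prods = {f"A{i}": [] for i in states}          # rule 1: one nonterminal per state
    for (i, a, j) in transitions:
        if a == "":                                # epsilon move: A_i -> A_j
            prods[f"A{i}"].append(f"A{j}")
        else:                                      # rule 2: A_i -> a A_j
            prods[f"A{i}"].append(f"{a} A{j}")
    for i in accepting:                            # rule 3: A_i -> epsilon
        prods[f"A{i}"].append("")
    return f"A{start}", prods                      # rule 4: start symbol

# The NFA for (a|b)*abb, with states 0..3, yields the grammar shown above.
start, prods = nfa_to_grammar(
    states=[0, 1, 2, 3],
    transitions=[(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)],
    accepting=[3], start=0)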


On the other hand, the language L = {a^n b^n | n ≥ 1}, with an equal number of a's and b's, is a prototypical example of a language that can be described by a grammar but not by a regular expression. To see why, suppose L were the language defined by some regular expression. We could construct a DFA D with a finite number of states, say k, to accept L. Since D has only k states, for an input beginning with more than k a's, D must enter some state twice, say s_i, as in Fig. 4.6. Suppose that the path from s_i back to itself is labeled with a sequence a^(j-i). Since a^i b^i is in the language, there must be a path labeled b^i from s_i to an accepting state f. But then there is also a path from the initial state s_0 through s_i to f labeled a^j b^i, as shown in Fig. 4.6. Thus, D also accepts a^j b^i, which is not in the language, contradicting the assumption that L is the language accepted by D.

Figure 4.6: DFA D accepting both a^i b^i and a^j b^i

Colloquially, we say that "finite automata cannot count," meaning that a finite automaton cannot accept a language like {a^n b^n | n ≥ 1} that would require it to keep count of the number of a's before it sees the b's. Likewise, "a grammar can count two items but not three," as we shall see when we consider non-context-free language constructs in Section 4.3.5.

4.2.8 Exercises for Section 4.2

Exercise 4.2.1: Consider the context-free grammar

    S → S S + | S S * | a

and the string aa + a*.

a) Give a leftmost derivation for the string.

b) Give a rightmost derivation for the string.

c) Give a parse tree for the string.

! d) Is the grammar ambiguous or unambiguous? Justify your answer.

! e) Describe the language generated by this grammar.

Exercise 4.2.2: Repeat Exercise 4.2.1 for each of the following grammars and strings:

a) S → 0 S 1 | 0 1 with string 000111.

b) S → + S S | * S S | a with string + * aaa.

! c) S → S ( S ) S | ε with string (()()).

! d) S → S + S | S S | ( S ) | S * | a with string (a + a) * a.

! e) S → ( L ) | a and L → L , S | S with string ((a,a),a,(a)).

!! f) S → a S b S | b S a S | ε with string aabbab.

! g) The following grammar for boolean expressions:

    bexpr   → bexpr or bterm | bterm
    bterm   → bterm and bfactor | bfactor
    bfactor → not bfactor | ( bexpr ) | true | false

Exercise 4.2.3: Design grammars for the following languages:

a) The set of all strings of 0s and 1s such that every 0 is immediately followed by at least one 1.

! b) The set of all strings of 0s and 1s that are palindromes; that is, the string reads the same backward as forward.

! c) The set of all strings of 0s and 1s with an equal number of 0s and 1s.

!! d) The set of all strings of 0s and 1s with an unequal number of 0s and 1s.

! e) The set of all strings of 0s and 1s in which 011 does not appear as a substring.

!! f) The set of all strings of 0s and 1s of the form xy, where x ≠ y and x and y are of the same length.

! Exercise 4.2.4: There is an extended grammar notation in common use. In this notation, square and curly braces in production bodies are metasymbols (like → or |) with the following meanings:

i) Square braces around a grammar symbol or symbols denote that these constructs are optional. Thus, production A → X [Y] Z has the same effect as the two productions A → X Y Z and A → X Z.

ii) Curly braces around a grammar symbol or symbols say that these symbols may be repeated any number of times, including zero times. Thus, A → X {Y Z} has the same effect as the infinite sequence of productions A → X, A → X Y Z, A → X Y Z Y Z, and so on.


Show that these two extensions do not add power to grammars; that is, any language that can be generated by a grammar with these extensions can be generated by a grammar without the extensions.

Exercise 4.2.5: Use the braces described in Exercise 4.2.4 to simplify the following grammar for statement blocks and conditional statements:

    stmt     → if expr then stmt else stmt
             | if expr then stmt
             | begin stmtList end
    stmtList → stmt ; stmtList | stmt

! Exercise 4.2.6: Extend the idea of Exercise 4.2.4 to allow any regular expression of grammar symbols in the body of a production. Show that this extension does not allow grammars to define any new languages.

! Exercise 4.2.7: A grammar symbol X (terminal or nonterminal) is useless if there is no derivation of the form S ⇒* wXy ⇒* wxy. That is, X can never appear in the derivation of any sentence.

a) Give an algorithm to eliminate from a grammar all productions containing useless symbols.

b) Apply your algorithm to the grammar:

    S → 0 | A
    A → A B
    B → 1

Exercise 4.2.8: The grammar in Fig. 4.7 generates declarations for a single numerical identifier; these declarations involve four different, independent properties of numbers.

    stmt       → declare id optionList
    optionList → optionList option | ε
    option     → mode | scale | precision | base
    mode       → real | complex
    scale      → fixed | floating
    precision  → single | double
    base       → binary | decimal

Figure 4.7: A grammar for multi-attribute declarations

a) Generalize the grammar of Fig. 4.7 by allowing n options Ai, for some fixed n and for i = 1, 2, ..., n, where Ai can be either ai or bi. Your grammar should use only O(n) grammar symbols and have a total length of productions that is O(n).


! b) The grammar of Fig. 4.7 and its generalization in part (a) allow declarations that are contradictory and/or redundant, such as:

    declare foo real fixed real floating

We could insist that the syntax of the language forbid such declarations; that is, every declaration generated by the grammar has exactly one value for each of the n options. If we do, then for any fixed n there is only a finite number of legal declarations. The language of legal declarations thus has a grammar (and also a regular expression), as any finite language does. The obvious grammar, in which the start symbol has a production for every legal declaration, has n! productions and a total production length of O(n × n!). You must do better: a total production length that is O(n2^n).

!! c) Show that any grammar for part (b) must have a total production length of at least 2^n.

d) What does part (c) say about the feasibility of enforcing nonredundancy and noncontradiction among options in declarations via the syntax of the programming language?

4.3 Writing a Grammar

Grammars are capable of describing most, but not all, of the syntax of programming languages. For instance, the requirement that identifiers be declared before they are used cannot be described by a context-free grammar. Therefore, the sequences of tokens accepted by a parser form a superset of the programming language; subsequent phases of the compiler must analyze the output of the parser to ensure compliance with rules that are not checked by the parser.

This section begins with a discussion of how to divide work between a lexical analyzer and a parser. We then consider several transformations that could be applied to get a grammar more suitable for parsing. One technique can eliminate ambiguity in the grammar, and other techniques - left-recursion elimination and left factoring - are useful for rewriting grammars so they become suitable for top-down parsing. We conclude this section by considering some programming language constructs that cannot be described by any grammar.

4.3.1 Lexical Versus Syntactic Analysis

As we observed in Section 4.2.7, everything that can be described by a regular expression can also be described by a grammar. We may therefore reasonably ask: "Why use regular expressions to define the lexical syntax of a language?" There are several reasons.


1. Separating the syntactic structure of a language into lexical and nonlexical parts provides a convenient way of modularizing the front end of a compiler into two manageable-sized components.

2. The lexical rules of a language are frequently quite simple, and to describe them we do not need a notation as powerful as grammars.

3. Regular expressions generally provide a more concise and easier-to-understand notation for tokens than grammars.

4. More efficient lexical analyzers can be constructed automatically from regular expressions than from arbitrary grammars.

There are no firm guidelines as to what to put into the lexical rules, as opposed to the syntactic rules. Regular expressions are most useful for describing the structure of constructs such as identifiers, constants, keywords, and white space. Grammars, on the other hand, are most useful for describing nested structures such as balanced parentheses, matching begin-end's, corresponding if-then-else's, and so on. These nested structures cannot be described by regular expressions.

4.3.2 Eliminating Ambiguity

Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity. As an example, we shall eliminate the ambiguity from the following "dangling-else" grammar:

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | other                                            (4.14)

Here "other" stands for any other statement. According to this grammar, the compound conditional statement

    if E1 then S1 else if E2 then S2 else S3

4.3.3 Elimination of Left Recursion

A grammar is left recursive if it has a nonterminal A such that there is a derivation A ⇒+ Aα (one or more steps) for some string α. Immediate left recursion, a production of the form A → Aα, can be eliminated by replacing the pair A → Aα | β with A → βA' and A' → αA' | ε, where A' is a new nonterminal. Consider, for example, the grammar

    S → A a | b
    A → A c | S d | ε                                       (4.18)

The nonterminal S is left recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive. Algorithm 4.19, below, systematically eliminates left recursion from a grammar. It is guaranteed to work if the grammar has no cycles (derivations of the form A ⇒+ A) or ε-productions (productions of the form A → ε). Cycles can be eliminated systematically from a grammar, as can ε-productions (see Exercises 4.4.6 and 4.4.7).

Algorithm 4.19: Eliminating left recursion.

INPUT: Grammar G with no cycles or ε-productions.

OUTPUT: An equivalent grammar with no left recursion.

METHOD: Apply the algorithm in Fig. 4.11 to G. Note that the resulting non-left-recursive grammar may have ε-productions. □

1)  arrange the nonterminals in some order A1, A2, ..., An.
2)  for ( each i from 1 to n ) {
3)      for ( each j from 1 to i − 1 ) {
4)          replace each production of the form Ai → Aj γ by the
            productions Ai → δ1 γ | δ2 γ | ... | δk γ, where
            Aj → δ1 | δ2 | ... | δk are all current Aj-productions
5)      }
6)      eliminate the immediate left recursion among the Ai-productions
7)  }

Figure 4.11: Algorithm to eliminate left recursion from a grammar

The procedure in Fig. 4.11 works as follows. In the first iteration, for i = 1, the outer for-loop of lines (2) through (7) eliminates any immediate left recursion among A1-productions. Any remaining A1-productions of the form A1 → Aℓ α must therefore have ℓ > 1. After the (i − 1)st iteration of the outer for-loop, all nonterminals Ak, where k < i, are "cleaned"; that is, any production Ak → Aℓ α must have ℓ > k. As a result, on the ith iteration, the inner loop


of lines (3) through (5) progressively raises the lower limit in any production Ai → Am α, until we have m ≥ i. Then, eliminating immediate left recursion for the Ai-productions at line (6) forces m to be greater than i.

Example 4.20: Let us apply Algorithm 4.19 to the grammar (4.18). Technically, the algorithm is not guaranteed to work, because of the ε-production, but in this case the production A → ε turns out to be harmless.

We order the nonterminals S, A. There is no immediate left recursion among the S-productions, so nothing happens during the outer loop for i = 1. For i = 2, we substitute for S in A → S d to obtain the following A-productions:

    A → A c | A a d | b d | ε

Eliminating the immediate left recursion among these A-productions yields the following grammar:

    S  → A a | b
    A  → b d A' | A'
    A' → c A' | a d A' | ε                                  □
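As an illustration, here is a small Python sketch of Algorithm 4.19. The grammar representation (a dictionary from each nonterminal to a list of bodies, where a body is a list of symbols and the empty list stands for ε) and the helper names are assumptions of this sketch, not part of the book's pseudocode; as in Example 4.20, an ε-production is tolerated even though the algorithm does not guarantee correctness in its presence.

def eliminate_immediate(A, bodies):
    """Split A's bodies into A -> beta A' and A' -> alpha A' | epsilon."""
    recursive = [b[1:] for b in bodies if b and b[0] == A]       # A -> A alpha
    others    = [b for b in bodies if not b or b[0] != A]        # A -> beta
    if not recursive:
        return {A: bodies}
    A1 = A + "'"
    return {A:  [b + [A1] for b in others],
            A1: [a + [A1] for a in recursive] + [[]]}            # [] is epsilon

def eliminate_left_recursion(grammar, order):
    """Algorithm 4.19 over a dict: nonterminal -> list of bodies (lists of symbols)."""
    g = {a: [list(b) for b in bs] for a, bs in grammar.items()}
    for i, Ai in enumerate(order):
        for Aj in order[:i]:                                     # lines (3)-(5)
            new_bodies = []
            for body in g[Ai]:
                if body and body[0] == Aj:                       # Ai -> Aj gamma
                    new_bodies += [d + body[1:] for d in g[Aj]]
                else:
                    new_bodies.append(body)
            g[Ai] = new_bodies
        g.update(eliminate_immediate(Ai, g[Ai]))                 # line (6)
    return g

# Grammar (4.18): S -> A a | b,  A -> A c | S d | epsilon
g = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
print(eliminate_left_recursion(g, ["S", "A"]))
# Reproduces the result of Example 4.20: S -> Aa | b,  A -> bdA' | A',  A' -> cA' | adA' | eps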

4.3.4 Left Factoring

Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive, or top-down, parsing. When the choice between two alternative A-productions is not clear, we may be able to rewrite the productions to defer the decision until enough of the input has been seen that we can make the right choice. For example, if we have the two productions

    stmt → if expr then stmt else stmt
         | if expr then stmt

on seeing the input if, we cannot immediately tell which production to choose to expand stmt. In general, if A → αβ1 | αβ2 are two A-productions, and the input begins with a nonempty string derived from α, we do not know whether to expand A to αβ1 or to αβ2. However, we may defer the decision by expanding A to αA'. Then, after seeing the input derived from α, we expand A' to β1 or to β2. That is, left-factored, the original productions become

    A  → α A'
    A' → β1 | β2

Algorithm 4.21: Left factoring a grammar.

INPUT: Grammar G.

OUTPUT: An equivalent left-factored grammar.


METHOD: For each nonterminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε - that is, there is a nontrivial common prefix - replace all of the A-productions A → αβ1 | αβ2 | ... | αβn | γ, where γ represents all alternatives that do not begin with α, by

    A  → α A' | γ
    A' → β1 | β2 | ... | βn

Here A' is a new nonterminal. Repeatedly apply this transformation until no two alternatives for a nonterminal have a common prefix. □

Example 4.22: The following grammar abstracts the "dangling-else" problem:

    S → i E t S | i E t S e S | a
    E → b                                                   (4.23)

Here, i, t, and e stand for if, then, and else; E and S stand for "conditional expression" and "statement." Left-factored, this grammar becomes:

    S  → i E t S S' | a
    S' → e S | ε
    E  → b                                                  (4.24)

Thus, we may expand S to iEtSS' on input i, and wait until iEtS has been seen to decide whether to expand S' to eS or to ε. Of course, these grammars are both ambiguous, and on input e, it will not be clear which alternative for S' should be chosen. Example 4.33 discusses a way out of this dilemma. □
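A one-pass approximation of Algorithm 4.21 can be sketched in Python in the same style as the earlier fragments. The representation (nonterminal → list of bodies, each a list of symbols, [] standing for ε) and the names are assumptions of this illustration, and a full implementation would repeat the transformation until no common prefixes remain.

def common_prefix(x, y):
    """Longest common prefix of two symbol lists."""
    i = 0
    while i < len(x) and i < len(y) and x[i] == y[i]:
        i += 1
    return x[:i]

def left_factor(grammar):
    """One pass of Algorithm 4.21 over each nonterminal."""
    result = {}
    for A, bodies in grammar.items():
        # Longest prefix common to two or more alternatives of A.
        best = []
        for i in range(len(bodies)):
            for j in range(i + 1, len(bodies)):
                p = common_prefix(bodies[i], bodies[j])
                if len(p) > len(best):
                    best = p
        if not best:                               # alpha = epsilon: nothing to factor
            result[A] = bodies
            continue
        A1 = A + "'"
        with_prefix = [b[len(best):] for b in bodies if b[:len(best)] == best]
        without     = [b for b in bodies if b[:len(best)] != best]
        result[A]  = [best + [A1]] + without       # A  -> alpha A' | gamma
        result[A1] = with_prefix                   # A' -> beta1 | ... | betan
    return result

# Grammar (4.23): S -> i E t S | i E t S e S | a,   E -> b
g = {"S": [["i","E","t","S"], ["i","E","t","S","e","S"], ["a"]], "E": [["b"]]}
print(left_factor(g))   # yields grammar (4.24): S -> i E t S S' | a,  S' -> e S | eps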

4.3.5 Non-Context-Free Language Constructs

A few syntactic constructs found in typical programming languages cannot be specified using grammars alone. Here, we consider two of these constructs, using simple abstract languages to illustrate the difficulties.

Example 4.25: The language in this example abstracts the problem of checking that identifiers are declared before they are used in a program. The language consists of strings of the form wcw, where the first w represents the declaration of an identifier w, c represents an intervening program fragment, and the second w represents the use of the identifier.

The abstract language is L1 = {wcw | w is in (a|b)*}. L1 consists of all words composed of a repeated string of a's and b's separated by c, such as aabcaab. While it is beyond the scope of this book to prove it, the non-context-freedom of L1 directly implies the non-context-freedom of programming languages like C and Java, which require declaration of identifiers before their use and which allow identifiers of arbitrary length.

For this reason, a grammar for C or Java does not distinguish among identifiers that are different character strings. Instead, all identifiers are represented


by a token such as id in the grammar. In a compiler for such a language, the semantic-analysis phase checks that identifiers are declared before they are used. □

Example 4.26: The non-context-free language in this example abstracts the problem of checking that the number of formal parameters in the declaration of a function agrees with the number of actual parameters in a use of the function. The language consists of strings of the form a^n b^m c^n d^m. (Recall a^n means a written n times.) Here a^n and b^m could represent the formal-parameter lists of two functions declared to have n and m arguments, respectively, while c^n and d^m represent the actual-parameter lists in calls to these two functions.

The abstract language is L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}. That is, L2 consists of strings in the language generated by the regular expression a*b*c*d* such that the number of a's and c's are equal and the number of b's and d's are equal. This language is not context free.

Again, the typical syntax of function declarations and uses does not concern itself with counting the number of parameters. For example, a function call in a C-like language might be specified by

    stmt      → id ( expr_list )
    expr_list → expr_list , expr
              | expr

with suitable productions for expr. Checking that the number of parameters in a call is correct is usually done during the semantic-analysis phase. □

4.3.6 Exercises for Section 4.3

Exercise 4.3.1: The following is a grammar for regular expressions over symbols a and b only, using + in place of | for union, to avoid conflict with the use of vertical bar as a metasymbol in grammars:

    rexpr    → rexpr + rterm | rterm
    rterm    → rterm rfactor | rfactor
    rfactor  → rfactor * | rprimary
    rprimary → a | b

a) Left factor this grammar.

b) Does left factoring make the grammar suitable for top-down parsing?

c) In addition to left factoring, eliminate left recursion from the original grammar.

d) Is the resulting grammar suitable for top-down parsing?

Exercise 4.3.2: Repeat Exercise 4.3.1 on the following grammars:


a) The grammar of Exercise 4.2.1.

b) The grammar of Exercise 4.2.2(a).

c) The grammar of Exercise 4.2.2(c).

d) The grammar of Exercise 4.2.2(e).

e) The grammar of Exercise 4.2.2(g).

! Exercise 4.3.3: The following grammar is proposed to remove the "dangling-else ambiguity" discussed in Section 4.3.2:

    stmt        → if expr then stmt
                | matchedStmt
    matchedStmt → if expr then matchedStmt else stmt
                | other

Show that this grammar is still ambiguous.

4.4 Top-Down Parsing

Top-down parsing can be viewed as the problem of constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first, as discussed in Section 2.3.4) . Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an input string.

Example 4.27: The sequence of parse trees in Fig. 4.12 for the input id + id * id is a top-down parse according to grammar (4.2), repeated here:

    E  → T E'
    E' → + T E' | ε
    T  → F T'
    T' → * F T' | ε
    F  → ( E ) | id                                         (4.28)

This sequence of trees corresponds to a leftmost derivation of the input. □

At each step of a top-down parse, the key problem is that of determining the production to be applied for a nonterminal, say A. Once an A-production is chosen, the rest of the parsing process consists of "matching" the terminal symbols in the production body with the input string.

The section begins with a general form of top-down parsing, called recursive-descent parsing, which may require backtracking to find the correct A-production to be applied. Section 2.4.2 introduced predictive parsing, a special case of recursive-descent parsing, where no backtracking is required. Predictive parsing chooses the correct A-production by looking ahead at the input a fixed number of symbols; typically, we may look only at one (that is, the next input symbol).

Figure 4.12: Top-down parse for id + id * id

For example, consider the top-down parse in Fig. 4.12, which constructs a tree with two nodes labeled E'. At the first E' node (in preorder), the production E' → +TE' is chosen; at the second E' node, the production E' → ε is chosen. A predictive parser can choose between E'-productions by looking at the next input symbol.

The class of grammars for which we can construct predictive parsers looking k symbols ahead in the input is sometimes called the LL(k) class. We discuss the LL(1) class in Section 4.4.3, but introduce certain computations, called FIRST and FOLLOW, in a preliminary Section 4.4.2. From the FIRST and FOLLOW sets for a grammar, we shall construct "predictive parsing tables," which make explicit the choice of production during top-down parsing. These sets are also useful during bottom-up parsing.

In Section 4.4.4 we give a nonrecursive parsing algorithm that maintains a stack explicitly, rather than implicitly via recursive calls. Finally, in Section 4.4.5 we discuss error recovery during top-down parsing.


4.4.1 Recursive-Descent Parsing

    void A() {
1)      choose an A-production, A → X1 X2 ··· Xk;
2)      for ( i = 1 to k ) {
3)          if ( Xi is a nonterminal )
4)              call procedure Xi();
5)          else if ( Xi equals the current input symbol a )
6)              advance the input to the next symbol;
7)          else /* an error has occurred */;
        }
    }

Figure 4.13: A typical procedure for a nonterminal in a top-down parser

A recursive-descent parsing program consists of a set of procedures, one for each nonterminal. Execution begins with the procedure for the start symbol, which halts and announces success if its procedure body scans the entire input string. Pseudocode for a typical nonterminal appears in Fig. 4.13. Note that this pseudocode is nondeterministic, since it begins by choosing the A-production to apply in a manner that is not specified.

General recursive-descent may require backtracking; that is, it may require repeated scans over the input. However, backtracking is rarely needed to parse programming language constructs, so backtracking parsers are not seen frequently. Even for situations like natural language parsing, backtracking is not very efficient, and tabular methods such as the dynamic programming algorithm of Exercise 4.4.9 or the method of Earley (see the bibliographic notes) are preferred.

To allow backtracking, the code of Fig. 4.13 needs to be modified. First, we cannot choose a unique A-production at line (1), so we must try each of several productions in some order. Then, failure at line (7) is not ultimate failure, but suggests only that we need to return to line (1) and try another A-production. Only if there are no more A-productions to try do we declare that an input error has been found. In order to try another A-production, we need to be able to reset the input pointer to where it was when we first reached line (1). Thus, a local variable is needed to store this input pointer for future use.

Example 4.29: Consider the grammar

    S → c A d
    A → a b | a

To construct a parse tree top-down for the input string w = cad, begin with a tree consisting of a single node labeled S, and the input pointer pointing to c, the first symbol of w. S has only one production, so we use it to expand S and


obtain the tree of Fig. 4.14(a). The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the input pointer to a, the second symbol of w, and consider the next leaf, labeled A.

Figure 4.14: Steps in a top-down parse

Now, we expand A using the first alternative A → a b to obtain the tree of Fig. 4.14(b). We have a match for the second input symbol, a, so we advance the input pointer to d, the third input symbol, and compare d against the next leaf, labeled b. Since b does not match d, we report failure and go back to A to see whether there is another alternative for A that has not been tried, but that might produce a match.

In going back to A, we must reset the input pointer to position 2, the position it had when we first came to A, which means that the procedure for A must store the input pointer in a local variable.

The second alternative for A produces the tree of Fig. 4.14(c). The leaf a matches the second symbol of w and the leaf d matches the third symbol. Since we have produced a parse tree for w, we halt and announce successful completion of parsing. □

A left-recursive grammar can cause a recursive-descent parser, even one with backtracking, to go into an infinite loop. That is, when we try to expand a nonterminal A, we may eventually find ourselves again trying to expand A without having consumed any input.
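For concreteness, here is a minimal Python sketch of a backtracking recursive-descent parser for the grammar of Example 4.29; the function names and the convention of returning the new input position (or None on failure) are choices made for this illustration, not the book's code.

# Grammar of Example 4.29:  S -> c A d,   A -> a b | a
def parse_S(w, pos):
    """Try to match S starting at w[pos]; return the position after S, or None."""
    if pos < len(w) and w[pos] == 'c':              # match c
        p = parse_A(w, pos + 1)
        if p is not None and p < len(w) and w[p] == 'd':
            return p + 1                            # match d
    return None

def parse_A(w, pos):
    """Try A -> a b first; on failure, reset the input pointer to pos and try A -> a."""
    if pos + 1 < len(w) and w[pos] == 'a' and w[pos + 1] == 'b':
        return pos + 2
    if pos < len(w) and w[pos] == 'a':              # backtracking alternative
        return pos + 1
    return None

def parse(w):
    end = parse_S(w, 0)
    return end == len(w)                            # success only if all input is consumed

print(parse("cad"))    # True, via the second alternative for A
print(parse("cabd"))   # True, via the first alternative
print(parse("cbd"))    # False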

4.4.2 FIRST and FOLLOW

The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G. During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol. During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens.

Define FIRST(α), where α is any string of grammar symbols, to be the set of terminals that begin strings derived from α. If α ⇒* ε, then ε is also in FIRST(α). For example, in Fig. 4.15, A ⇒* cγ, so c is in FIRST(A).

For a preview of how FIRST can be used during predictive parsing, consider two A-productions A → α | β, where FIRST(α) and FIRST(β) are disjoint sets. We can then choose between these A-productions by looking at the next input


Figure 4.15: Terminal c is in FIRST(A) and a is in FOLLOW(A)

symbol a, since a can be in at most one of FIRST(α) and FIRST(β), not both. For instance, if a is in FIRST(β), choose the production A → β. This idea will be explored when LL(1) grammars are defined in Section 4.4.3.

Define FOLLOW(A), for nonterminal A, to be the set of terminals a that can appear immediately to the right of A in some sentential form; that is, the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ, for some α and β, as in Fig. 4.15. Note that there may have been symbols between A and a at some time during the derivation, but if so, they derived ε and disappeared. In addition, if A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A); recall that $ is a special "endmarker" symbol that is assumed not to be a symbol of any grammar.

To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set.

1. If X is a terminal, then FIRST(X) = {X}.

2. If X is a nonterminal and X → Y1 Y2 ··· Yk is a production for some k ≥ 1, then place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1), ..., FIRST(Yi−1); that is, Y1 ··· Yi−1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X). For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2), and so on.


3. If X → ε is a production, then add ε to FIRST(X).



.

• •



1 . Place $ in FOLLOW(S) , where S is the start symbol, and $ is the input right endmarker.

222

CHAPTER 4. SYNTAX ANALYSIS

2. If there is a production A is in FOLLOW(B) .

-+

aBj3, then everything in FIRST(j3) except

t

3. If there is a production A -+ aB, or a production A -+ aBj3, where FIRST(j3) contains t, then everything in FOLLOW(A) is in FOLLOW(B) .

Example 4.30 : Consider again the non-left-recursive grammar (4.28) . Then: 1. FIRST(F) = FIRST(T) = FIRST(E) = { ( , id}. To see why, note that the two productions for F have bodies that start with these two terminal symbols, id and the left parenthesis. T has only one production, and its body starts with F. Since F does not derive t, FIRST(T) must be the same as FIRST(F) . The same argument covers FIRST(E) . 2. FIRST(E') = {+, t} . The reason is that one of the two productions for E' has a body that begins with terminal + , and the other's body is t. When­ ever a nonterminal derives t, we place t in FIRST for that nonterminal. 3. FIRST(T')

=

{ * , t} . The reasoning is analogous to that for FIRST(E' ) .

4 . FOLLOW(E) = FOLLOW(E') = {), $ } . Since E i s the start symbol, FOLLOW(E) must contain $. The production body ( E ) explains why the right parenthesis is in FOLLOW(E) . For E' , note that this nonterminal appears only at the ends of bodies of E-productions. Thus, FOLLOW(E') must be the same as FOLLOW(E) . 5. FOLLOW(T) = FOLLOW(T') = {+, ) , $ } . Notice that T appears in bodies only followed by E' . Thus, everything except t that is in FIRST(E') must be in FOLLOW(T) ; that explains the symbol +. However, since FIRST(E') contains t (i.e. , E' =* t) , and E' is the entire string following T in the bodies of the E-productions, everything in FOLLOW(E) must also be in FOLLOW(T) . That explains the symbols $ and the right parenthesis. As for T' , since it appears only at the ends of the T-productions, it must be that FOLLOW(T') = FOLLOW(T) . 6. FOLLOW(F) point (5) ,

=

{+, *, ) , $ } . The reasoning is analogous to that for T in

o

4.4.3 LL(1) Grammars

Predictive parsers, that is, recursive-descent parsers needing no backtracking, can be constructed for a class of grammars called LL(1). The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step to make parsing action decisions.


Transition Diagrams for Predictive Parsers

Transition diagrams are useful for visualizing predictive parsers. For example, the transition diagrams for nonterminals E and E' of grammar (4.28) appear in Fig. 4.16(a). To construct the transition diagram from a grammar, first eliminate left recursion and then left factor the grammar. Then, for each nonterminal A,

1. Create an initial and final (return) state.

2. For each production A → X1X2 ··· Xk, create a path from the initial to the final state, with edges labeled X1, X2, ..., Xk. If A → ε, the path is an edge labeled ε.



.







Transition diagrams for predictive parsers differ from those for lexical analyzers. Parsers have one diagram for each nonterminal. The labels of edges can be tokens or nonterminals. A transition on a token (terminal) means that we take that transition if that token is the next input symbol. A transition on a nonterminal A is a call of the procedure for A. With an LL(1) grammar, the ambiguity of whether or not to take an ε-edge can be resolved by making ε-transitions the default choice.

Transition diagrams can be simplified, provided the sequence of grammar symbols along paths is preserved. We may also substitute the diagram for a nonterminal A in place of an edge labeled A. The diagrams in Fig. 4.16(a) and (b) are equivalent: if we trace paths from E to an accepting state and substitute for E', then, in both sets of diagrams, the grammar symbols along the paths make up strings of the form T + T + ··· + T. The diagram in (b) can be obtained from (a) by transformations akin to those in Section 2.5.4, where we used tail-recursion removal and substitution of procedure bodies to optimize the procedure for a nonterminal.

The class of LL(1) grammars is rich enough to cover most programming constructs, although care is needed in writing a suitable grammar for the source language. For example, no left-recursive or ambiguous grammar can be LL(1).

A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:

1. For no terminal a do both α and β derive strings beginning with a.

2. At most one of α and β can derive the empty string.

3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).


Figure 4.16: Transition diagrams for nonterminals E and E' of grammar (4.28)

The first two conditions are equivalent to the statement that FIRST(α) and FIRST(β) are disjoint sets. The third condition is equivalent to stating that if ε is in FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if ε is in FIRST(α).

Predictive parsers can be constructed for LL(1) grammars since the proper production to apply for a nonterminal can be selected by looking only at the current input symbol. Flow-of-control constructs, with their distinguishing keywords, generally satisfy the LL(1) constraints. For instance, if we have the productions

    stmt → if ( expr ) stmt else stmt
         | while ( expr ) stmt
         | { stmt_list }

then the keywords if, while, and the symbol { tell us which alternative is the only one that could possibly succeed if we are to find a statement.

The next algorithm collects the information from FIRST and FOLLOW sets into a predictive parsing table M[A, a], a two-dimensional array, where A is a nonterminal, and a is a terminal or the symbol $, the input endmarker. The algorithm is based on the following idea: the production A → α is chosen if the next input symbol a is in FIRST(α). The only complication occurs when α = ε or, more generally, α ⇒* ε. In this case, we should again choose A → α if the current input symbol is in FOLLOW(A), or if the $ on the input has been reached and $ is in FOLLOW(A).

Algorithm 4.31: Construction of a predictive parsing table.

INPUT: Grammar G.

OUTPUT: Parsing table M.

METHOD: For each production A → α of the grammar, do the following:

1. For each terminal a in FIRST(α), add A → α to M[A, a].

2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A, b]. If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $] as well.


If, after performing the above, there is no production at all in M[A, a], then set M[A, a] to error (which we normally represent by an empty entry in the table). □
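Continuing the earlier FIRST sketch, Algorithm 4.31 translates into a short table-building loop in Python; the helper first_of_string, the dictionary-of-lists table layout, and the assumption that FOLLOW sets have already been computed by the rules above are choices made for this illustration.

def first_of_string(body, first, EPS="eps"):
    """FIRST of a sequence of grammar symbols."""
    out = set()
    for X in body:
        out |= first[X] - {EPS}
        if EPS not in first[X]:
            return out
    out.add(EPS)                                  # every symbol in body can derive epsilon
    return out

def build_table(grammar, first, follow, EPS="eps"):
    """Algorithm 4.31: M[(A, a)] lists the productions A -> body to apply on input a."""
    M = {}
    for A, bodies in grammar.items():
        for body in bodies:
            fa = first_of_string(body, first)
            for a in fa - {EPS}:                  # rule 1
                M.setdefault((A, a), []).append((A, body))
            if EPS in fa:                         # rule 2; follow[A] may contain "$"
                for b in follow[A]:
                    M.setdefault((A, b), []).append((A, body))
    return M                                      # multiply defined entries reveal non-LL(1) spots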

Example 4.32 : For the expression grammar (4.28) , Algorithm 4.31 produces the parsing table in Fig. 4.17. Blanks are error entries; nonblanks indicate a production with which to expand a nonterminal. INPUT SYMBOL

NON TERMINAL

id

E

E -+ TE'

*

+

E -+ TE' E' -+ +TE'

E'

T' -+

T'



T' -+ * FT'



E' -+

T' -+



T'



-+ €

F -+ (E)

F -+ id

F

E' -+ T -+ FT'

T -+ FT'

T

$

)

(

Figure 4.17: Parsing table M for Example 4.32 Consider production E

-+

T E'. Since

FIRST(T E')

FIRST(T)

{ (, id} this production is added to M[E, (] and M[E, id] . Production E' -+ +T E' is added to M[E' , +] since FIRST(+TE') = { + } . Since FOLLOW(E') = { ) , $ } , production E' -+ E is added to M[E', )] and M[E' , $) . 0 =

=

Algorithm 4.31 can be applied to any grammar G to produce a parsing table M. For every LL(l) grammar, each parsing-table entry uniquely identifies a production or signals an error. For some grammars, however, M may have some entries that are multiply defined. For example, if G is left-recursive or ambiguous, then M will have at least one multiply defined entry. Although left­ recursion elimination and left factoring are easy to do, there are some grammars for which no amount of alteration will produce an LL(l) grammar. The language in the following example has no LL(l) grammar at all.

Example 4.33 : The following grammar, which abstracts the dangling-else problem, is repeated here from Example 4.22: S

S'

E

-+

-+

-+

iEtSS' I a eS I E b

The parsing table for this grammar appears in Fig. 4. 18. The entry for M[S', e] contains both s' -+ eS and S' -+ E . The grammar is ambiguous and the ambiguity is manifested by a choice in what production to use when an e (else) is seen. We can resolve this ambiguity

226

CHAPTER 4. SYNTAX ANALYSIS

NON TERMINAL

S

INPUT SYMBOL a

b

e

S -t a Sf --+ E Sf -t eS

Sf E

i S -t iEtSSf

$

t

Sf

--+ E

E --+ b Figure 4. 18: Parsing table M for Example 4.33

by choosing Sf -t eS. This choice corresponds to associating an else with the closest previous then. Note that the choice Sf -t E would prevent e from ever being put on the stack or removed from the input, and is surely wrong. 0

4.4.4

N onrecursive Predictive Parsing

A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls. The parser mimics a leftmost deriva­ tion. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols a such that

S � wet lm

The table-driven parser in Fig. 4.19 has an input buffer, a stack containing a sequence of grammar symbols, a parsing table constructed by Algorithm 4.31 , and an output stream. The input buffer contains the string to be parsed, followed by the endmarker $. We reuse the symbol $ to mark the bottom of the stack, which initially contains the start symbol of the grammar on top of $. The parser is controlled by a program that considers X, the symbol on top of the stack, and a , the current input symbol. If X is a nonterminal, the parser chooses an X-production by consulting entry M[X, a] of the parsing table M. (Additional code could be executed here, for example, code to const:ruct a node in a parse tree.) Otherwise, it checks for a match between the terminal X and current input symbol a. The behavior of the parser can be described in terms of its configurations, which give the stack contents and the remaining input. The next algorithm describes how configurations are manipulated.

Algorithm 4.34 : Table-driven predictive parsing.

INPUT: A string w and a parsing table M for grammar G.

OUTPUT: If w is ih L(G) , a leftmost derivation of W ; otherwise, an error indication.

4.4.

227

TOP-DOWN PARSING Input

Stack

Predictive Parsing Program

X y

Output

Z

$

Parsing Table M

Figure 4.19: Model of a table-driven predictive parser METHOD: Initially, the parser is in a configuration with w$ in the input buffer and the start symbol S of G on top of the stack, above $. The program in Fig. 4.20 uses the predictive parsing table M to produce a predictive parse for the input. 0

set ip to point to the first symbol of w ; set X to the top stack symbol; while ( X =I=- $ ) { / * stack is not empty * / if ( X is a ) pop the stack and advance ip; else if ( X is a terminal ) errorO ; else if ( M [X, a] is an error entry ) errorO ; else if ( M[X, a] = X -t Y1 Y2 Yk ) { output the production X -t Y1 Y2 Yk ; pop the stack; push Yk , Yk - l , . . . , Y1 onto the stack, with Y1 on top; } set X to the top stack symbol; } •

.



• •



Figure 4.20: Predictive parsing algorithm

Example 4.35 : Consider grammar (4.28) ; we have already seen its the parsing table in Fig. 4.17. On input id + id * id, the nonrecursive predictive parser of Algorithm 4.34 makes the sequence of moves in Fig. 4.21. These moves correspond to a leftmost derivation (see Fig. 4.12 for the full derivation) : E =} TE' 1m

=} 1m

FT' E'

=} 1m

id T' E'

=} lm

id E' =} id + T E' lm

=} 1m

228

CHAPTER 4. SYNTAX ANALYSIS MATCHED

STACK

id id id id + id + id + id + id id + id id + id * id + id * id + id * id id + id * id id + id * id

E$ TE'$ FT'E'$ id T'E' $ T'E'$ E'$ + TE'$ TE'$ FT'E'$ id T'E'$ T'E'$ FT'E'$ * FT'E'$ id T'E' $ T'E'$ E'$ $

INPUT

id + id * idS id + id * idS id + id * idS id + id * idS + id * idS + id * idS + id * idS id * idS id * idS id * idS * id$ * id$ idS idS $ $ $

ACTION

output E -+ T E' output T -+ FT' output F -+ id match id output T' -+ t output E' -+ + T E' match + output T -+ FT' output F -+ id match id output T' -+ * FT' match * output F -+ id match id output T' -+ t output E' -+ t

Figure 4.21: Moves made by a predictive parser on input id + id * id Note that the sentential forms in this derivation correspond to the input that has already been matched (in column MATCHED) followed by the stack contents. The matched input is shown only to highlight the correspondence. For the same reason, the top of the stack is to the left; when we consider bottom-up parsing, it will be more natural to show the top of the stack to the right. The input pointer points to the leftmost symbol of the string in the INPUT column. 0

4.4.5

Error Recovery in Predictive Parsing

This discussion of error recovery refers to the stack of a table-driven predictive parser, since it makes explicit the terminals and nonterminals that the parser hopes to match with the remainder of the input; the techniques can also be used with recursive-descent parsing. An error is detected during predictive parsing when the terminal on top of the stack does not match the next input symbol or when nonterminal A is on top of the stack, a is the next input symbol, and M[A, a] is error (i.e. , the parsing-table entry is empty) . Panic Mode

Panic-mode error recovery is based on the idea of skipping symbols on the the input until a token in a selected set of synchronizing tokens appears. Its

4 . 4.

TOP-DOWN PARSING

229

effectiveness depends on the choice of synchronizing set. The sets should be chosen so that the parser recovers quickly from errors that are likely to occur in practice. Some heuristics are as follows:

1. As a starting point, place all symbols in FOLLOW(A) into the synchro­ nizing set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing can continue. 2. It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if semicolons terminate statements, as in C, then keywords that begin statements may not appear in the FOLLOW set of the nontermi­ nal representing expressions. A missing semicolon after an assignment may therefore result in the keyword beginning the next statement be­ ing skipped. Often, there is a hierarchical structure on constructs in a language; for example, expressions appear within statements, which ap­ pear within blocks, and so on. We can add to the synchronizing set of a lower-level construct the symbols that begin higher-level constructs. For example, we might add keywords that begin statements to the synchro­ nizing sets for the nonterminals generating expressions. 3. If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.

4. If a nonterminal can generate the empty string, then the production de­ riving t can be used as a default. Doing so may postpone some error detection, but cannot cause an error to be missed. This approach reduces the number of nonterminals that have to be considered during error re­ covery. 5.

a terminal on top of the stack cannot be matched, a simple idea is to pop the terminal, issue a message saying that the terminal was inserted, and continue parsing. In effect, this approach takes the synchronizing set of a token to consist of all other tokens. If

Example 4.36 : Using FIRST and FOLLOW symbols as synchronizing tokens

works reasonably well when expressions are parsed according to the usual gram­ mar ( 4.28) . The parsing table for this grammar in Fig. 4.17 is repeated in Fig. 4.22, with "synch" indicating synchronizing tokens obtained from the FOLLOW set of the nonterminal in question. The FOLLOW sets for the non­ terminals are obtained from Example 4.30. The table in Fig. 4.22 is to be used as follows. If the parser looks up entry M[A, a] and finds that it is blank, then the input symbol a is skipped. If the entry is "synch," then the nonterminal on top of the stack is popped in an attempt to resume parsing. If a token on top of the stack does not match the input symbol, then we pop the token from the stack, as mentioned above.

230

CHAPTER 4. SYNTAX ANALYSIS INPUT SYMBOL

NON TERMINAL

id

E

E -t TE'

E' T

*

( E -t TE'

E -t +TE' T -t FT'

T' F

+

F -t id

)

$

synch

synch

E -t f. E -t f.

synch

T -t FT'

T' -t f.

T' -t * FT'

synch

synch

synch

synch

T' -t f. T' -t f. F -t ( E )

synch

synch

Figure 4.22: Synchronizing tokens added to the parsing table of Fig. 4.17 On the erroneous input ) id * + id, the parser and error recovery mechanism of Fig. 4.22 behave as in Fig. 4.23. 0

STACK

E$ E$ TE' $ FT'E' $ id T'E'$ T'E' $ * FT'E' $ FT'E' $ T'E' $ E' $ + TE' $ TE' $ FT'E' $ id T'E' $ T'E' $ E' $ $

INPUT

REMARK

error, skip ) id * + id $ id is in FIRST ( E) id * + id $ id * + id $ id * + id $ * + id $ * + id $ + id $ error, M[F, +] synch + id $ F has been popped + id $ + id $ id $ id $ id $ $ $ $

) id * + id $

=

Figure 4.23: Parsing and error recovery moves made by a predictive parser The above discussion of panic-mode recovery does not address the important issue of error messages. The compiler designer must supply informative error messages that not only describe the error, they must draw attention to where the error was discovered.

4 . 4.

TOP-DOWN PARSING

231

Phrase-level Recovery

Phrase-level error recovery is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines. These routines may change, insert, or delete symbols on the input and issue appropriate error messages. They may also pop from the stack. Alteration of stack symbols or the pushing of new symbols onto the stack is questionable for several reasons. First, the steps carried out by the parser might then not correspond to the derivation of any word in the language at all. Second, we must ensure that there is no possibility of an infinite loop. Checking that any recovery action eventually results in an input symbol being consumed (or the stack being shortened if the end of the input has been reached) is a good way to protect against such loops.

Exercises for Section 4.4 Exercise 4.4. 1 : For each of the following grammars, devise predictive parsers

4.4.6

and show the parsing tables. You may left-factor and/or eliminate left-recursion from your grammars first. a) The grammar of Exercise 4.2.2(a) . b ) The grammar of Exercise 4.2.2(b) . c) The grammar of Exercise 4.2.2(c) .

d) The grammar of Exercise 4.2.2(d) .

e) The grammar of Exercise 4.2.2(e) . f) The grammar of Exercise 4.2.2(g) . ! ! Exercise 4.4.2 : Is it possible, by modifying the grammar in any way, to con­

struct a predictive parser for the language of Exercise 4.2.1 (postfix expressions with operand a ) ? Exercise 4.4�3 : Compute FIRST and FOLLOW for the grammar of Exercise

4.2.1.

Exercise 4.4.4 : Compute FIRST and FOLLOW for each of the grammars of Exercise 4.2.2. Exercise 4.4.5 : The grammar S

� a S a I a a generates all even-length strings of a's. We can devise a recursive-descent parser with backtrack for this grammar. If we choose to expand by production S � a a first, then we shall only recognize the string aa. Thus, any reasonable recursive-descent parser wili try S � a S a first.

a) Show that this recursive-descent parser recognizes inputs aa, aaaa, and aaaaaaaa, but not aaaaaa.

232

CHAPTER 4. SYNTAX ANALYSIS

! ! b) What language does this recursive-descent parser recognize?

The following exercises are useful steps in the construction of a "Ohomsky Normal Form" grammar from arbitrary grammars, as defined in Exercise 4.4.8. ! Exercise 4.4.6 : A grammar is E-free if no production body is E (called an

E -production) .

a) Give an algorithm to convert any grammar into an E-free grammar that generates the same language (with the possible exception of the empty string - no E-free grammar can generate E ) . b) Apply your algorithm to the grammar S -+ aSbS I bSaS I E. Hint: First find all the nonterminals that are nullable, meaning that they generate E, perhaps by a long derivation. ! Exercise 4.4. 7 : A single production is a production whose body is a single

nonterminal, i.e., a production of the form A -+ A.

a) Give an algorithm to convert any grammar into an E-free grammar, with no single productions, that generates the same language (with the possible exception of the empty string) Hint: First eliminate E-productions, and then find for which pairs of nonterminals A and B does A * B by a sequence of single productions. b) Apply your algorithm to the grammar (4. 1) in Section 4.1.2. c) Show that, as a consequence of part (a) , we can convert a grammar into an equivalent grammar that has no cycles (derivations of one or more steps in which A * A for some nonterminal A) .

! ! Exercise 4.4. 8 : A grammar is said to be in Chomsky Normal Form (ONF) if every production is either of the form A ---t BC or of the form A -+ a, where A, B, and C are nonterminals, and a is a terminal. Show how to convert any grammar into a ONF grammar for the same language (with the possible exceptiori of the empty string - no ONF grammar can generate E) . ! Exercise 4.4.9 : Every language that has a context-free grammar can be rec­

ognized in at most O(n3 ) time for strings of length n. A simple way to do so, called the Cocke- Younger-Kasami (or CYK) algorithm is based on dynamic pro­ gramming. That is, given a string a l a2 ' " a n , we construct an n-by-n table T such that Tij is the set of nonterminals that generate the substring aiai+ l . . . aj . If the underlying grammar is in CNF (see Exercise 4.4.8) , then one table entry can be filled in in O (n) time, provided we fill the entries in the proper order: lowest value of j - i first. Write an algorithm that correctly fills in the entries of the table, and show that your algorithm takes O(n3 ) time. Having filled in the table, how do you determine whether a l a2 . . . an is in the language?

4.5.

233

BOTTOM-UP PARSING

! Exercise 4.4.10 : Show how, having filled in the table as in Exercise 4.4.9,

we can in O (n) time recover a parse tree for a l a2 . . . an - Hint: modify the table so it records, for each nonterminal A in each table entry Tij , some pair of nonterminals in other table entries that justified putting A in Tij .

! Exercise 4.4. 1 1 : Modify your algorithm of Exercise 4.4.9 so that it will find,

for any string, the smallest number of insert, delete, and mutate errors (each error a single character) needed to turn the string into a string in the language of the underlying grammar.

stmtTail

-+

if e then stmt stmtTail while e do stmt begin list end s else stmt

list list Tail

-+ -+

stmt list Tail ; list

-+

stmt

I I

I

I

E

-+

E

Figure 4.24: A grammar for certain kinds of statements ! Exercise 4.4. 1 2 : In Fig. 4.24 is a grammar for certain statements. You may take e and s to be terminals standing for conditional expressions and "other

statements," respectively. If we resolve the conflict regarding expansion of the optional "else" (nonterminal stmt Tain by preferring to consume an else from the input whenever we see one, we can build a predictive parser for this grammar. Using the idea of synchroni:dng symbols described in Section 4.4.5: a) Build an error-correcting predictive parsing table for the grammar. b) Show the behavior of your parser on the following inputs:

(i) (ii) 4.5

if e then s ; if e then s end while e do begin s ; if e then s ; end

Bottom-Up Parsing

A bottom-up parse corresponds to the construction of a parse tree for an input string beginning at the le'aves (the bottom) and working up towards the root (the top) . It is convenient to describe parsing as the process of building parse trees, although a front end may in fact carry out a translation directly without building an explicit tree. The sequence of tree snapshots in Fig. 4.25 illustrates

234 id

CHAPTER 4. SYNTAX ANALYSIS *

id

F

1

*

id

id

T

1

*

id

T

1

F

F

id

id

1

1

*

F

1

id

T

E

11\

T

1

*

F

1

id

1

F

1

id

T

11\

T

1

F

1

*

F

1

id

id

Figure 4.25: A bottom-up parse for id * id a bottom-up parse of the token stream id * id, with respect to the expression grammar (4. 1). This section introduces a general style of bottom-up parsing known as shift­ reduce parsing. The largest class of grammars for which shift-reduce parsers can be built, the LR grammars, will be discussed in Sections 4.6 and 4.7. Although it is too much work to build an LR parser by hand, tools called automatic parser generators make it easy to construct efficient LR parsers from suitable gram­ mars. The concepts in this section are helpful for writing suitable grammars to make effective use of an LR parset generator. Algorithms for implementing patser generators appear in Section 4.7.

4.5.1 Reductions

We can think of bottom-up parsing as the process of "reducing" a string w to the start symbol of the grammar. At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of that production. The key decisions during bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds. Example 4.37 : The snapshots in Fig. 4.25 illustrate a sequence of reductions; the grammar is the expression grammar (4. 1). The reductions will be discussed in terms of the sequence of strings id * id, F * id, T * id, T * F, T, E

The strings in this sequence are formed from the roots of all the subtrees in the

snapshots. The sequence starts with the input string id * id. The first reduction produces F * id by reducing the leftmost id to F, lising the production F ---+ id. The second reduction produces T * id by reducing F to T. Now, we have a choice between reducing the string T, which is the body of E ---+ T, and the string consisting of the second id, which is the body of F ---+ id. Rather than reduce T to E, the second id is reduced to T, resulting in the string T * F. This string then reduces to T . The parse completes with the reduction of T to the start symboi E. 0


By definition, a reduction is the reverse of a step in a derivation (recall that in a derivation, a nonterminal in a sentential form is replaced by the body of one of its productions) . The goal of bottom-up parsing is therefore to construct a derivation in reverse. The following derivation corresponds to the parse in Fig. 4.25:

E => T => T * F => T * id => F * id => id * id This derivation is in fact a rightmost derivation.

4.5.2 Handle Pruning

Bottom-up parsing during a left-to-right scan of the input constructs a rightmost derivation in reverse. Informally, a "handle" is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation. For example, adding subscripts to the tokens id for clarity, the handles during the parse of id1 * id2 according to the expression grammar (4.1) are as in Fig. 4.26. Although T is the body of the production E → T, the symbol T is not a handle in the sentential form T * id2. If T were indeed replaced by E, we would get the string E * id2, which cannot be derived from the start symbol E. Thus, the leftmost substring that matches the body of some production need not be a handle.

RIGHT SENTENTIAL FORM    HANDLE     REDUCING PRODUCTION
id1 * id2                id1        F → id
F * id2                  F          T → F
T * id2                  id2        F → id
T * F                    T * F      T → T * F

Figure 4.26: Handles during a parse of id1 * id2

Formally, if S ⇒*rm αAw ⇒rm αβw, as in Fig. 4.27, then production A → β in the position following α is a handle of αβw. Alternatively, a handle of a right-sentential form γ is a production A → β and a position of γ where the string β may be found, such that replacing β at that position by A produces the previous right-sentential form in a rightmost derivation of γ. Notice that the string w to the right of the handle must contain only terminal symbols. For convenience, we refer to the body β rather than A → β as a handle. Note we say "a handle" rather than "the handle," because the grammar could be ambiguous, with more than one rightmost derivation of αβw. If a grammar is unambiguous, then every right-sentential form of the grammar has exactly one handle.

A rightmost derivation in reverse can be obtained by "handle pruning." That is, we start with a string of terminals w to be parsed.


Figure 4.27: A handle A → β in the parse tree for αβw

If w is a sentence of the grammar at hand, then let w = γn, where γn is the nth right-sentential form of some as yet unknown rightmost derivation

S = γ0 ⇒rm γ1 ⇒rm γ2 ⇒rm ··· ⇒rm γn−1 ⇒rm γn = w

To reconstruct this derivation in reverse order, we locate the handle βn in γn and replace βn by the head of the relevant production An → βn to obtain the previous right-sentential form γn−1. Note that we do not yet know how handles are to be found, but we shall see methods of doing so shortly. We then repeat this process. That is, we locate the handle βn−1 in γn−1 and reduce this handle to obtain the right-sentential form γn−2. If by continuing this process we produce a right-sentential form consisting only of the start symbol S, then we halt and announce successful completion of parsing. The reverse of the sequence of productions used in the reductions is a rightmost derivation for the input string.

4.5.3 Shift-Reduce Parsing

Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed. As we shall see, the handle always appears at the top of the stack just before it is identified as the handle. We use $ to mark the bottom of the stack and also the right end of the input. Conventionally, when discussing bottom-up parsing, we show the top of the stack on the right, rather than on the left as we did for top-down parsing. Initially, the stack is empty, and the string w is on the input, as follows:

STACK    INPUT
$        w$

During a left-to-right scan of the input string, the parser shifts zero or more input symbols onto the stack, until it is ready to reduce a string {3 of grammar symbols on top of the stack. It then reduces {3 to the head of the appropriate production. The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the input is empty:

STACK    INPUT
$ S      $

Upon entering this configuration, the parser halts and announces successful completion of parsing. Figure 4.28 steps through the actions a shift-reduce parser might take in parsing the input string id1 * id2 according to the expression grammar (4.1).

STACK          INPUT           ACTION
$              id1 * id2 $     shift
$ id1          * id2 $         reduce by F → id
$ F            * id2 $         reduce by T → F
$ T            * id2 $         shift
$ T *          id2 $           shift
$ T * id2      $               reduce by F → id
$ T * F        $               reduce by T → T * F
$ T            $               reduce by E → T
$ E            $               accept

Figure 4.28: Configurations of a shift-reduce parser on input id1 * id2

While the primary operations are shift and reduce, there are actually four possible actions a shift-reduce parser can make: (1) shift, (2) reduce, (3) accept, and (4) error.

1. Shift. Shift the next input symbol onto the top of the stack.

2. Reduce. The right end of the string to be reduced must be at the top of the stack. Locate the left end of the string within the stack and decide with what nonterminal to replace the string.

3. Accept. Announce successful completion of parsing.

4. Error. Discover a syntax error and call an error recovery routine.

The use of a stack in shift-reduce parsing is justified by an important fact: the handle will always eventually appear on top of the stack, never inside. This fact can be shown by considering the possible forms of two successive steps in any rightmost derivation. Figure 4.29 illustrates the two possible cases. In case (1), A is replaced by βBy, and then the rightmost nonterminal B in the body βBy is replaced by γ. In case (2), A is again expanded first, but this time the body is a string y of terminals only. The next rightmost nonterminal B will be somewhere to the left of y. In other words:

(1) S ⇒*rm αAz ⇒rm αβByz ⇒rm αβγyz
(2) S ⇒*rm αBxAz ⇒rm αBxyz ⇒rm αγxyz


Figure 4.29: Cases for two successive steps of a rightmost derivation

Consider case (1) in reverse, where a shift-reduce parser has just reached the configuration

STACK     INPUT
$ αβγ     yz$

The parser reduces the handle γ to B to reach the configuration

STACK     INPUT
$ αβB     yz$

The parser can now shift the string y onto the stack by a sequence of zero or more shift moves to reach the configuration

STACK     INPUT
$ αβBy    z$

with the handle βBy on top of the stack, and it gets reduced to A. Now consider case (2). In configuration

STACK     INPUT
$ αγ      xyz$

the handle γ is on top of the stack. After reducing the handle γ to B, the parser can shift the string xy to get the next handle y on top of the stack, ready to be reduced to A:

STACK     INPUT
$ αBxy    z$

In both cases, after making a reduction the parser had to shift zero or more symbols to get the next handle onto the stack. It never had to go into the stack to find the handle.
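For concreteness, the following Python sketch shows one way a shift-reduce parser for the expression grammar (4.1) could be organized. It is only an illustration under assumed names: the decide() function hard-codes, with one symbol of lookahead, the shift/reduce choices that the LR techniques of the next sections derive automatically from the grammar. On id * id it reproduces the configurations of Fig. 4.28.

    # Illustrative shift-reduce parser for grammar (4.1):
    #   E -> E + T | T    T -> T * F | F    F -> ( E ) | id
    def decide(stack, lookahead):
        """Return 'shift', 'accept', or ('reduce', head, body_length)."""
        if stack[-1] == 'id':
            return ('reduce', 'F', 1)                 # F -> id
        if stack[-3:] == ['(', 'E', ')']:
            return ('reduce', 'F', 3)                 # F -> ( E )
        if stack[-1] == 'F':
            if stack[-3:] == ['T', '*', 'F']:
                return ('reduce', 'T', 3)             # T -> T * F
            return ('reduce', 'T', 1)                 # T -> F
        if stack[-1] == 'T' and lookahead != '*':
            if stack[-3:] == ['E', '+', 'T']:
                return ('reduce', 'E', 3)             # E -> E + T
            return ('reduce', 'E', 1)                 # E -> T
        if stack == ['$', 'E'] and lookahead == '$':
            return 'accept'
        return 'shift'

    def parse(tokens):
        stack, rest = ['$'], list(tokens) + ['$']
        while True:
            action = decide(stack, rest[0])
            print(' '.join(stack).ljust(12), ' '.join(rest).ljust(15), action)
            if action == 'accept':
                return
            if action == 'shift':
                if rest[0] == '$':
                    raise SyntaxError('unexpected end of input')
                stack.append(rest.pop(0))
            else:
                _, head, n = action                   # a reduction
                del stack[-n:]                        # pop the handle
                stack.append(head)                    # push the head

    parse(['id', '*', 'id'])                          # the moves of Fig. 4.28

The hard part lies entirely inside decide(): recognizing when the top of the stack is a handle. The LR methods of the following sections construct that decision procedure mechanically from the grammar.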

4.5.4 Conflicts During Shift-Reduce Parsing

There are context-free grammars for which shift-reduce parsing cannot be used. Every shift-reduce parser for such a grammar can reach a configuration in which the parser, knowing the entire stack contents and the next input symbol, cannot decide whether to shift or to reduce ( a shift/reduce conflict) , or cannot decide


which of several reductions to make (a reduce/reduce conflict). We now give some examples of syntactic constructs that give rise to such grammars. Technically, these grammars are not in the LR(k) class of grammars defined in Section 4.7; we refer to them as non-LR grammars. The k in LR(k) refers to the number of symbols of lookahead on the input. Grammars used in compiling usually fall in the LR(1) class, with one symbol of lookahead at most.

Example 4.38: An ambiguous grammar can never be LR. For example, consider the dangling-else grammar (4.14) of Section 4.3:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

If we have a shift-reduce parser in configuration

STACK                         INPUT
... if expr then stmt         else ... $

we cannot tell whether if expr then stmt is the handle, no matter what appears below it on the stack. Here there is a shift/reduce conflict. Depending on what follows the else on the input, it might be correct to reduce if expr then stmt to stmt, or it might be correct to shift else and then to look for another stmt to complete the alternative if expr then stmt else stmt.

Note that shift-reduce parsing can be adapted to parse certain ambiguous grammars, such as the if-then-else grammar above. If we resolve the shift/reduce conflict on else in favor of shifting, the parser will behave as we expect, associating each else with the previous unmatched then. We discuss parsers for such ambiguous grammars in Section 4.8. □

Another common setting for conflicts occurs when we know we have a handle, but the stack contents and the next input symbol are insufficient to determine which production should be used in a reduction. The next example illustrates this situation.

Example 4.39: Suppose we have a lexical analyzer that returns the token name id for all names, regardless of their type. Suppose also that our language invokes procedures by giving their names, with parameters surrounded by parentheses, and that arrays are referenced by the same syntax. Since the translation of indices in array references and parameters in procedure calls are different, we want to use different productions to generate lists of actual parameters and indices. Our grammar might therefore have (among others) productions such as those in Fig. 4.30.

(1) stmt → id ( parameter_list )
(2) stmt → expr := expr
(3) parameter_list → parameter_list , parameter
(4) parameter_list → parameter
(5) parameter → id
(6) expr → id ( expr_list )
(7) expr → id
(8) expr_list → expr_list , expr
(9) expr_list → expr

Figure 4.30: Productions involving procedure calls and array references

A statement beginning with p(i, j) would appear as the token stream id(id, id) to the parser. After shifting the first three tokens onto the stack, a shift-reduce parser would be in configuration

STACK               INPUT
... id ( id         , id ) ...

It is evident that the id on top of the stack must be reduced, but by which production? The correct choice is production (5) if p is a procedure, but production (7) if p is an array. The stack does not tell which; information in the symbol table obtained from the declaration of p must be used.

One solution is to change the token id in production (1) to procid and to use a more sophisticated lexical analyzer that returns the token name procid when it recognizes a lexeme that is the name of a procedure. Doing so would require the lexical analyzer to consult the symbol table before returning a token. If we made this modification, then on processing p(i, j) the parser would be either in the configuration

STACK                   INPUT
... procid ( id         , id ) ...

or in the configuration above. In the former case, we choose reduction by production (5) ; in the latter case by production (7) . Notice how the symbol third from the top of the stack determines the reduction to be made, even though it is not involved in the reduction. Shift-reduce parsing can utilize information far down in the stack to guide the parse. 0

4.5.5 Exercises for Section 4.5

Exercise 4.5.1: For the grammar S → 0 S 1 | 0 1 of Exercise 4.2.2(a), indicate the handle in each of the following right-sentential forms:

a) 000111.

b) 00S11.

Exercise 4.5.2: Repeat Exercise 4.5.1 for the grammar S → S S + | S S * | a of Exercise 4.2.1 and the following right-sentential forms:


a) SSS + a * +.

b) SS + a * a +.

c) aaa * a + +.

Exercise 4.5.3: Give bottom-up parses for the following input strings and grammars:

a) The input 000111 according to the grammar of Exercise 4.5.1.

b) The input aaa * a + + according to the grammar of Exercise 4.5.2.

4.6 Introduction to LR Parsing: Simple LR

The most prevalent type of bottom-up parser today is based on a concept called LR(k) parsing; the "L" is for left-to-right scanning of the input, the "R" for constructing a rightmost derivation in reverse, and the k for the number of input symbols of lookahead that are used in making parsing decisions. The cases k = 0 or k = 1 are of practical interest, and we shall only consider LR parsers with k ≤ 1 here. When (k) is omitted, k is assumed to be 1.

This section introduces the basic concepts of LR parsing and the easiest method for constructing shift-reduce parsers, called "simple LR" (or SLR, for short). Some familiarity with the basic concepts is helpful even if the LR parser itself is constructed using an automatic parser generator. We begin with "items" and "parser states;" the diagnostic output from an LR parser generator typically includes parser states, which can be used to isolate the sources of parsing conflicts. Section 4.7 introduces two more complex methods - canonical-LR and LALR - that are used in the majority of LR parsers.

4.6.1 Why LR Parsers?

LR parsers are table-driven, much like the nonrecursive LL parsers of Section 4.4.4. A grammar for which we can construct a parsing table using one of the methods in this section and the next is said to be an LR grammar. Intuitively, for a grammar to be LR it is sufficient that a left-to-right shift-reduce parser be able to recognize handles of right-sentential forms when they appear on top of the stack.

LR parsing is attractive for a variety of reasons:

• LR parsers can be constructed to recognize virtually all programming-language constructs for which context-free grammars can be written. Non-LR context-free grammars exist, but these can generally be avoided for typical programming-language constructs.


• The LR-parsing method is the most general nonbacktracking shift-reduce parsing method known, yet it can be implemented as efficiently as other, more primitive shift-reduce methods (see the bibliographic notes).

• An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.

• The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars, where we must be able to recognize the use of a production seeing only the first k symbols of what its right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.

The principal drawback of the LR method is that it is too much work to construct an LR parser by hand for a typical programming-language grammar. A specialized tool, an LR parser generator, is needed. Fortunately, many such generators are available, and we shall discuss one of the most commonly used ones, Yacc , in Section 4.9. Such a generator takes a context-free grammar and automatically produces a parser for that grammar. If the grammar contains ambiguities or other constructs that are difficult to parse in a left-to-right scan of the input, then the parser generator locates these constructs and provides detailed diagnostic messages.

4.6.2 Items and the LR(0) Automaton

How does a shift-reduce parser know when to shift and when to reduce? For example, with stack contents $ T and next input symbol * in Fig. 4.28, how does the parser know that T on the top of the stack is not a handle, so the appropriate action is to shift and not to reduce T to E?

An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse. States represent sets of "items." An LR(0) item (item for short) of a grammar G is a production of G with a dot at some position of the body. Thus, production A → XYZ yields the four items

A → ·XYZ
A → X·YZ
A → XY·Z
A → XYZ·

The production A → ε generates only one item, A → ·.

Intuitively, an item indicates how much of a production we have seen at a given point in the parsing process. For example, the item A → ·XYZ indicates that we hope to see a string derivable from XYZ next on the input.


Representing Item Sets

A parser generator that produces a bottom-up parser may need to represent items and sets of items conveniently. Note that an item can be represented by a pair of integers, the first of which is the number of one of the productions of the underlying grammar, and the second of which is the position of the dot. Sets of items can be represented by a list of these pairs. However, as we shall see, the necessary sets of items often include "closure" items, where the dot is at the beginning of the body. These can always be reconstructed from the other items in the set, and we do not have to include them in the list.

Item A → X·YZ indicates that we have just seen on the input a string derivable from X and that we hope next to see a string derivable from YZ. Item A → XYZ· indicates that we have seen the body XYZ and that it may be time to reduce XYZ to A.

One collection of sets of LR(0) items, called the canonical LR(0) collection, provides the basis for constructing a deterministic finite automaton that is used to make parsing decisions. Such an automaton is called an LR(0) automaton (see footnote 3). In particular, each state of the LR(0) automaton represents a set of items in the canonical LR(0) collection. The automaton for the expression grammar (4.1), shown in Fig. 4.31, will serve as the running example for discussing the canonical LR(0) collection for a grammar.

To construct the canonical LR(0) collection for a grammar, we define an augmented grammar and two functions, CLOSURE and GOTO. If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a new start symbol S' and production S' → S. The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input. That is, acceptance occurs when and only when the parser is about to reduce by S' → S.

Closure of Item Sets

If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by the two rules:

1. Initially, add every item in I to CLOSURE(I).

2. If A → α·Bβ is in CLOSURE(I) and B → γ is a production, then add the item B → ·γ to CLOSURE(I), if it is not already there. Apply this rule until no more new items can be added to CLOSURE(I).

(Footnote 3: Technically, the automaton misses being deterministic according to the definition of Section 3.6.4, because we do not have a dead state, corresponding to the empty set of items. As a result, there are some state-input pairs for which no next state exists.)

Figure 4.31: LR(0) automaton for the expression grammar (4.1)

Intuitively, A → α·Bβ in CLOSURE(I) indicates that, at some point in the parsing process, we think we might next see a substring derivable from Bβ as input. The substring derivable from Bβ will have a prefix derivable from B by applying one of the B-productions. We therefore add items for all the B-productions; that is, if B → γ is a production, we also include B → ·γ in CLOSURE(I).

Example 4.40: Consider the augmented expression grammar:

E' → E
E → E + T | T
T → T * F | F
F → ( E ) | id

If I is the set of one item {[E' → ·E]}, then CLOSURE(I) contains the set of items I0 in Fig. 4.31.

To see how the closure is computed, E' → ·E is put in CLOSURE(I) by rule (1). Since there is an E immediately to the right of a dot, we add the E-productions with dots at the left ends: E → ·E + T and E → ·T. Now there is a T immediately to the right of a dot in the latter item, so we add T → ·T * F and T → ·F. Next, the F to the right of a dot forces us to add F → ·(E) and F → ·id, but no other items need to be added. □

The closure can be computed as in Fig. 4.32. A convenient way to implement the function closure is to keep a boolean array added, indexed by the nonterminals of G, such that added[B] is set to true if and when we add the item B → ·γ for each B-production B → γ.

SetOfItems CLOSURE(I) {
    J = I;
    repeat
        for ( each item A → α·Bβ in J )
            for ( each production B → γ of G )
                if ( B → ·γ is not in J )
                    add B → ·γ to J;
    until no more items are added to J on one round;
    return J;
}

Figure 4.32: Computation of CLOSURE

Note that if one B-production is added to the closure of I with the dot at the left end, then all B-productions will be similarly added to the closure. Hence, it is not necessary in some circumstances actually to list the items B → ·γ added to I by CLOSURE. A list of the nonterminals B whose productions were so added will suffice.

We divide all the sets of items of interest into two classes:

1. Kernel items: the initial item, S' → ·S, and all items whose dots are not at the left end.

2. Nonkernel items: all items with their dots at the left end, except for S' → ·S.

Moreover, each set of items of interest is formed by taking the closure of a set of kernel items; the items added in the closure can never be kernel items, of course. Thus, we can represent the sets of items we are really interested in with very little storage if we throw away all nonkernel items, knowing that they could be regenerated by the closure process. In Fig. 4.31, nonkernel items are in the shaded part of the box for a state.
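As a companion to Fig. 4.32, here is a small Python sketch of CLOSURE for the augmented expression grammar. It is an illustration under our own naming conventions, not code from the text; an item is represented as a pair (production number, dot position), as suggested in the box "Representing Item Sets."

    # Illustrative CLOSURE over LR(0) items; an item is (production index, dot).
    GRAMMAR = [                      # production 0 is the augmenting E' -> E
        ("E'", ("E",)),              # 0: E' -> E
        ("E",  ("E", "+", "T")),     # 1: E  -> E + T
        ("E",  ("T",)),              # 2: E  -> T
        ("T",  ("T", "*", "F")),     # 3: T  -> T * F
        ("T",  ("F",)),              # 4: T  -> F
        ("F",  ("(", "E", ")")),     # 5: F  -> ( E )
        ("F",  ("id",)),             # 6: F  -> id
    ]
    NONTERMINALS = {head for head, _ in GRAMMAR}

    def closure(items):
        # Rule (1): keep every item of I.  Rule (2): whenever the dot stands
        # before a nonterminal B, add B -> .gamma for every B-production.
        result = set(items)
        changed = True
        while changed:
            changed = False
            for prod, dot in list(result):
                _, body = GRAMMAR[prod]
                if dot < len(body) and body[dot] in NONTERMINALS:
                    for i, (head, _) in enumerate(GRAMMAR):
                        if head == body[dot] and (i, 0) not in result:
                            result.add((i, 0))
                            changed = True
        return frozenset(result)

    # CLOSURE({[E' -> .E]}) yields the seven items of state I0 in Fig. 4.31:
    for prod, dot in sorted(closure({(0, 0)})):
        head, body = GRAMMAR[prod]
        print(head, "->", " ".join(body[:dot]) + " . " + " ".join(body[dot:]))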


The Function GOTO

The second useful function is GOTO(I, X), where I is a set of items and X is a grammar symbol. GOTO(I, X) is defined to be the closure of the set of all items [A → αX·β] such that [A → α·Xβ] is in I. Intuitively, the GOTO function is used to define the transitions in the LR(0) automaton for a grammar. The states of the automaton correspond to sets of items, and GOTO(I, X) specifies the transition from the state for I under input X.

Example 4.41: If I is the set of two items {[E' → E·], [E → E·+ T]}, then GOTO(I, +) contains the items

E → E + ·T
T → ·T * F
T → ·F
F → ·(E)
F → ·id

We computed GOTO(I, +) by examining I for items with + immediately to the right of the dot. E' → E· is not such an item, but E → E·+ T is. We moved the dot over the + to get E → E + ·T and then took the closure of this singleton set. □

We are now ready for the algorithm to construct C, the canonical collection of sets of LR(0) items for an augmented grammar G'; the algorithm is shown in Fig. 4.33.

void items(G') {
    C = CLOSURE({[S' → ·S]});
    repeat
        for ( each set of items I in C )
            for ( each grammar symbol X )
                if ( GOTO(I, X) is not empty and not in C )
                    add GOTO(I, X) to C;
    until no new sets of items are added to C on a round;
}

Figure 4.33: Computation of the canonical collection of sets of LR(0) items

Example 4.42: The canonical collection of sets of LR(0) items for grammar (4.1) and the GOTO function are shown in Fig. 4.31. GOTO is encoded by the transitions in the figure. □
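Continuing the sketch begun after Fig. 4.32, GOTO and the canonical collection of Fig. 4.33 can be rendered as follows. Again this is only an illustration with our own names, reusing GRAMMAR, NONTERMINALS, and closure() from the earlier fragment.

    def goto(items, symbol):
        # Move the dot over `symbol` in every item that has the dot just
        # before `symbol`, then take the closure of the result.
        moved = {(prod, dot + 1) for prod, dot in items
                 if dot < len(GRAMMAR[prod][1]) and GRAMMAR[prod][1][dot] == symbol}
        return closure(moved) if moved else frozenset()

    def canonical_collection():
        # items(G') of Fig. 4.33: start from CLOSURE({[E' -> .E]}) and keep
        # adding nonempty GOTO targets until no new set of items appears.
        symbols = {sym for _, body in GRAMMAR for sym in body}
        start = closure({(0, 0)})
        collection, work = [start], [start]
        while work:
            state = work.pop()
            for x in symbols:
                target = goto(state, x)
                if target and target not in collection:
                    collection.append(target)
                    work.append(target)
        return collection

    print(len(canonical_collection()))   # 12 item sets, I0 through I11 of Fig. 4.31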


Use of the LR(0) Automaton

The central idea behind "Simple LR," or SLR, parsing is the construction from the grammar of the LR(0) automaton. The states of this automaton are the sets of items from the canonical LR(0) collection, and the transitions are given by the GOTO function. The LR(0) automaton for the expression grammar (4.1) appeared earlier in Fig. 4.31.

The start state of the LR(0) automaton is CLOSURE({[S' → ·S]}), where S' is the start symbol of the augmented grammar. All states are accepting states. We say "state j" to refer to the state corresponding to the set of items Ij.

How can LR(0) automata help with shift-reduce decisions? Shift-reduce decisions can be made as follows. Suppose that the string γ of grammar symbols takes the LR(0) automaton from the start state 0 to some state j. Then, shift on next input symbol a if state j has a transition on a. Otherwise, we choose to reduce; the items in state j will tell us which production to use.

The LR-parsing algorithm to be introduced in Section 4.6.3 uses its stack to keep track of states as well as grammar symbols; in fact, the grammar symbol can be recovered from the state, so the stack holds states. The next example gives a preview of how an LR(0) automaton and a stack of states can be used to make shift-reduce parsing decisions.

Example 4.43: Figure 4.34 illustrates the actions of a shift-reduce parser on input id * id, using the LR(0) automaton in Fig. 4.31. We use a stack to hold states; for clarity, the grammar symbols corresponding to the states on the stack appear in column SYMBOLS. At line (1), the stack holds the start state 0 of the automaton; the corresponding symbol is the bottom-of-stack marker $.

LINE   STACK        SYMBOLS     INPUT        ACTION
(1)    0            $           id * id $    shift to 5
(2)    0 5          $ id        * id $       reduce by F → id
(3)    0 3          $ F         * id $       reduce by T → F
(4)    0 2          $ T         * id $       shift to 7
(5)    0 2 7        $ T *       id $         shift to 5
(6)    0 2 7 5      $ T * id    $            reduce by F → id
(7)    0 2 7 10     $ T * F     $            reduce by T → T * F
(8)    0 2          $ T         $            reduce by E → T
(9)    0 1          $ E         $            accept

Figure 4.34: The parse of id * id

The next input symbol is id and state 0 has a transition on id to state 5. We therefore shift. At line (2), state 5 (symbol id) has been pushed onto the stack. There is no transition from state 5 on input *, so we reduce. From item [F → id·] in state 5, the reduction is by production F → id.


With symbols, a reduction is implemented by popping the body of the production from the stack (on line (2), the body is id) and pushing the head of the production (in this case, F). With states, we pop state 5 for symbol id, which brings state 0 to the top, and look for a transition on F, the head of the production. In Fig. 4.31, state 0 has a transition on F to state 3, so we push state 3, with corresponding symbol F; see line (3).

As another example, consider line (5), with state 7 (symbol *) on top of the stack. This state has a transition to state 5 on input id, so we push state 5 (symbol id). State 5 has no transitions, so we reduce by F → id. When we pop state 5 for the body id, state 7 comes to the top of the stack. Since state 7 has a transition on F to state 10, we push state 10 (symbol F). □

4.6.3 The LR-Parsing Algorithm

A schematic of an LR parser is shown in Fig. 4.35. It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (ACTION and GOTO). The driver program is the same for all LR parsers; only the parsing table changes from one parser to another. The parsing program reads characters from an input buffer one at a time. Where a shift-reduce parser would shift a symbol, an LR parser shifts a state. Each state summarizes the information contained in the stack below it.


Figure 4.35: Model of an LR parser

The stack holds a sequence of states, s0 s1 ··· sm, where sm is on top. In the SLR method, the stack holds states from the LR(0) automaton; the canonical-LR and LALR methods are similar. By construction, each state has a corresponding grammar symbol. Recall that states correspond to sets of items, and that there is a transition from state i to state j if GOTO(Ii, X) = Ij. All transitions to state j must be for the same grammar symbol X. Thus, each state, except the start state 0, has a unique grammar symbol associated with it. (Footnote 4: The converse need not hold; that is, more than one state may have the same grammar symbol. See for example states 1 and 8 in the LR(0) automaton in Fig. 4.31, which are both entered by transitions on E, or states 2 and 9, which are both entered by transitions on T.)


Structure of the LR Parsing Table

The parsing table consists of two parts: a parsing-action function ACTION and a goto function GOTO .

1. The ACTION function takes as arguments a state i and a terminal a (or $, the input endmarker). The value of ACTION[i, a] can have one of four forms:

   (a) Shift j, where j is a state. The action taken by the parser effectively shifts input a to the stack, but uses state j to represent a.

   (b) Reduce A → β. The action of the parser effectively reduces β on the top of the stack to head A.

   (c) Accept. The parser accepts the input and finishes parsing.

   (d) Error. The parser discovers an error in its input and takes some corrective action. We shall have more to say about how such error-recovery routines work in Sections 4.8.3 and 4.9.4.

2. We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii, A] = Ij, then GOTO also maps a state i and a nonterminal A to state j.

LR-Parser Configurations

To describe the behavior of an LR parser, it helps to have a notation representing the complete state of the parser: its stack and the remaining input. A configuration of an LR parser is a pair:

( s0 s1 ··· sm,  ai ai+1 ··· an $ )

where the first component is the stack contents (top on the right), and the second component is the remaining input. This configuration represents the right-sentential form

X1 X2 ··· Xm ai ai+1 ··· an

in essentially the same way as a shift-reduce parser would; the only difference is that instead of grammar symbols, the stack holds states from which grammar symbols can be recovered. That is, Xi is the grammar symbol represented by state si. Note that s0, the start state of the parser, does not represent a grammar symbol, and serves as a bottom-of-stack marker, as well as playing an important role in the parse.


Behavior of the LR Parser

The next move of the parser from the configuration above is determined by reading ai, the current input symbol, and sm, the state on top of the stack, and then consulting the entry ACTION[sm, ai] in the parsing action table. The configurations resulting after each of the four types of move are as follows:

1. If ACTION[sm, ai] = shift s, the parser executes a shift move; it shifts the next state s onto the stack, entering the configuration

   ( s0 s1 ··· sm s,  ai+1 ··· an $ )

   The symbol ai need not be held on the stack, since it can be recovered from s, if needed (which in practice it never is). The current input symbol is now ai+1.

2. If ACTION[sm, ai] = reduce A → β, then the parser executes a reduce move, entering the configuration

   ( s0 s1 ··· sm−r s,  ai ai+1 ··· an $ )

   where r is the length of β, and s = GOTO[sm−r, A]. Here the parser first popped r state symbols off the stack, exposing state sm−r. The parser then pushed s, the entry for GOTO[sm−r, A], onto the stack. The current input symbol is not changed in a reduce move. For the LR parsers we shall construct, Xm−r+1 ··· Xm, the sequence of grammar symbols corresponding to the states popped off the stack, will always match β, the right side of the reducing production.

   The output of an LR parser is generated after a reduce move by executing the semantic action associated with the reducing production. For the time being, we shall assume the output consists of just printing the reducing production.

3. If ACTION[sm, ai] = accept, parsing is completed.

4. If ACTION[sm, ai] = error, the parser has discovered an error and calls an error recovery routine.

The LR-parsing algorithm is summarized below. All LR parsers behave in this fashion; the only difference between one LR parser and another is the information in the ACTION and GOTO fields of the parsing table.

Algorithm 4.44: LR-parsing algorithm.

INPUT: An input string w and an LR-parsing table with functions ACTION and GOTO for a grammar G.


OUTPUT: If w is in L(G), the reduction steps of a bottom-up parse for w; otherwise, an error indication.

METHOD: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input buffer. The parser then executes the program in Fig. 4.36. □

let a be the first symbol of w$;
while (1) { /* repeat forever */
    let s be the state on top of the stack;
    if ( ACTION[s, a] = shift t ) {
        push t onto the stack;
        let a be the next input symbol;
    } else if ( ACTION[s, a] = reduce A → β ) {
        pop |β| symbols off the stack;
        let state t now be on top of the stack;
        push GOTO[t, A] onto the stack;
        output the production A → β;
    } else if ( ACTION[s, a] = accept ) break; /* parsing is done */
    else call error-recovery routine;
}

Figure 4.36: LR-parsing program

Example 4.45: Figure 4.37 shows the ACTION and GOTO functions of an LR-parsing table for the expression grammar (4.1), repeated here with the productions numbered:

(1) E → E + T        (4) T → F
(2) E → T            (5) F → ( E )
(3) T → T * F        (6) F → id

The codes for the actions are:

1. si means shift and stack state i,

2. rj means reduce by the production numbered j,

3. acc means accept,

4. blank means error.

Note that the value of GOTO[s, a] for terminal a is found in the ACTION field connected with the shift action on input a for state s. The GOTO field gives GOTO[s, A] for nonterminals A. Although we have not yet explained how the entries for Fig. 4.37 were selected, we shall deal with this issue shortly.

STATE   ACTION                                      GOTO
        id    +     *     (     )     $            E    T    F
  0     s5                s4                        1    2    3
  1           s6                      acc
  2           r2    s7          r2    r2
  3           r4    r4          r4    r4
  4     s5                s4                        8    2    3
  5           r6    r6          r6    r6
  6     s5                s4                             9    3
  7     s5                s4                                  10
  8           s6                s11
  9           r1    s7          r1    r1
 10           r3    r3          r3    r3
 11           r5    r5          r5    r5

Figure 4.37: Parsing table for expression grammar

On input id * id + id, the sequence of stack and input contents is shown in Fig. 4.38. Also shown for clarity are the sequences of grammar symbols corresponding to the states held on the stack. For example, at line (1) the LR parser is in state 0, the initial state with no grammar symbol, and with id the first input symbol. The action in row 0 and column id of the ACTION field of Fig. 4.37 is s5, meaning shift by pushing state 5. That is what has happened at line (2): the state symbol 5 has been pushed onto the stack, and id has been removed from the input.

Then, * becomes the current input symbol, and the action of state 5 on input * is to reduce by F → id. One state symbol is popped off the stack. State 0 is then exposed. Since the goto of state 0 on F is 3, state 3 is pushed onto the stack. We now have the configuration in line (3). Each of the remaining moves is determined similarly. □
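For concreteness, here is one possible Python rendering of the driver of Fig. 4.36, run against the table of Fig. 4.37. The table encoding ('s5', 'r6', 'acc') and the helper names are illustrative assumptions, not notation from the text; on id * id + id it follows the moves of Fig. 4.38.

    # Illustrative LR-parsing driver using the SLR table of Fig. 4.37.
    PRODUCTIONS = [None,                                  # numbered from 1, as in Example 4.45
        ("E", 3), ("E", 1), ("T", 3), ("T", 1), ("F", 3), ("F", 1)]
    ACTION = {
        (0, "id"): "s5", (0, "("): "s4",
        (1, "+"): "s6", (1, "$"): "acc",
        (2, "+"): "r2", (2, "*"): "s7", (2, ")"): "r2", (2, "$"): "r2",
        (3, "+"): "r4", (3, "*"): "r4", (3, ")"): "r4", (3, "$"): "r4",
        (4, "id"): "s5", (4, "("): "s4",
        (5, "+"): "r6", (5, "*"): "r6", (5, ")"): "r6", (5, "$"): "r6",
        (6, "id"): "s5", (6, "("): "s4",
        (7, "id"): "s5", (7, "("): "s4",
        (8, "+"): "s6", (8, ")"): "s11",
        (9, "+"): "r1", (9, "*"): "s7", (9, ")"): "r1", (9, "$"): "r1",
        (10, "+"): "r3", (10, "*"): "r3", (10, ")"): "r3", (10, "$"): "r3",
        (11, "+"): "r5", (11, "*"): "r5", (11, ")"): "r5", (11, "$"): "r5"}
    GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
            (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

    def lr_parse(tokens):
        stack, rest = [0], list(tokens) + ["$"]
        while True:
            act = ACTION.get((stack[-1], rest[0]))
            if act is None:
                raise SyntaxError(f"error in state {stack[-1]} on {rest[0]!r}")
            if act == "acc":
                return                                    # parsing is done
            if act[0] == "s":
                stack.append(int(act[1:]))                # shift: push state t
                rest.pop(0)
            else:
                head, length = PRODUCTIONS[int(act[1:])]  # reduce A -> beta
                del stack[-length:]                       # pop |beta| states
                stack.append(GOTO[(stack[-1], head)])     # push GOTO[t, A]
                print("output production number", act[1:])

    lr_parse(["id", "*", "id", "+", "id"])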

4.6.4 Constructing SLR-Parsing Tables

The SLR method for constructing parsing tables is a good starting point for studying LR parsing. We shall refer to the parsing table constructed by this method as an SLR table, and to an LR parser using an SLR-parsing table as an SLR parser. The other two methods augment the SLR method with lookahead information. The SLR method begins with LR(O) items and LR(O) automata, introduced in Section 4.5. That is, given a grammar, G, we augment G to produce G' , with a new start symbol S'. From G', we construct C, the canonical collection of sets of items for G' together with the GOTO function.

LINE    STACK         SYMBOLS     INPUT              ACTION
 (1)    0                         id * id + id $     shift
 (2)    0 5           id          * id + id $        reduce by F → id
 (3)    0 3           F           * id + id $        reduce by T → F
 (4)    0 2           T           * id + id $        shift
 (5)    0 2 7         T *         id + id $          shift
 (6)    0 2 7 5       T * id      + id $             reduce by F → id
 (7)    0 2 7 10      T * F       + id $             reduce by T → T * F
 (8)    0 2           T           + id $             reduce by E → T
 (9)    0 1           E           + id $             shift
(10)    0 1 6         E +         id $               shift
(11)    0 1 6 5       E + id      $                  reduce by F → id
(12)    0 1 6 3       E + F       $                  reduce by T → F
(13)    0 1 6 9       E + T       $                  reduce by E → E + T
(14)    0 1           E           $                  accept

Figure 4.38: Moves of an LR parser on id * id + id

The ACTION and GOTO entries in the parsing table are then constructed using the following algorithm. It requires us to know FOLLOW(A) for each nonterminal A of a grammar (see Section 4.4) . Algorithm 4.46 : Constructing an SLR-parsing table.

INPUT: An augmented grammar G'.

OUTPUT: The SLR-parsing table functions ACTION and GOTO for G'.

METHOD:

1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(0) items for G'.

2. State i is constructed from Ii. The parsing actions for state i are determined as follows:

   (a) If [A → α·aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j." Here a must be a terminal.

   (b) If [A → α·] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may not be S'.

   (c) If [S' → S·] is in Ii, then set ACTION[i, $] to "accept."

   If any conflicting actions result from the above rules, we say the grammar is not SLR(1). The algorithm fails to produce a parser in this case.


3. The goto transitions for state i are constructed for all nonterminals A using the rule: If GOTO(Ii, A) = Ij, then GOTO[i, A] = j.

4. All entries not defined by rules (2) and (3) are made "error."

5. The initial state of the parser is the one constructed from the set of items containing [S' → ·S]. □
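A compact Python sketch of Algorithm 4.46 is given below. It reuses the GRAMMAR, closure() and goto() fragments sketched in Section 4.6.2 and assumes a FOLLOW table computed as in Section 4.4; all names are illustrative, not the book's.

    def slr_table(collection, follow, terminals, nonterminals):
        action, goto_table = {}, {}

        def set_action(state, symbol, act):
            # Rule 2: conflicting entries mean the grammar is not SLR(1).
            if action.get((state, symbol), act) != act:
                raise ValueError(f"not SLR(1): conflict in state {state} on {symbol!r}")
            action[(state, symbol)] = act

        for i, items in enumerate(collection):
            for prod, dot in items:
                head, body = GRAMMAR[prod]
                if dot < len(body) and body[dot] in terminals:      # rule 2(a): shift
                    j = collection.index(goto(items, body[dot]))
                    set_action(i, body[dot], ("shift", j))
                elif dot == len(body) and prod == 0:                # rule 2(c): accept
                    set_action(i, "$", ("accept",))
                elif dot == len(body):                              # rule 2(b): reduce
                    for a in follow[head]:
                        set_action(i, a, ("reduce", prod))
            for A in nonterminals:                                  # rule 3: goto entries
                target = goto(items, A)
                if target:
                    goto_table[(i, A)] = collection.index(target)
        return action, goto_table                                   # rule 4: absent entries are errors

Applied to the collection computed earlier for grammar (4.1), this fills in the entries of Fig. 4.37, up to the numbering of the states.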

The parsing table consisting of the ACTION and GOTO functions determined by Algorithm 4.46 is called the SLR(1) table for G. An LR parser using the SLR(1) table for G is called the SLR(1) parser for G, and a grammar having an SLR(1) parsing table is said to be SLR(1). We usually omit the "(1)" after the "SLR," since we shall not deal here with parsers having more than one symbol of lookahead.

Example 4.47: Let us construct the SLR table for the augmented expression grammar. The canonical collection of sets of LR(0) items for the grammar was shown in Fig. 4.31. First consider the set of items I0:

E' → ·E
E → ·E + T
E → ·T
T → ·T * F
T → ·F
F → ·(E)
F → ·id

The item F → ·(E) gives rise to the entry ACTION[0, (] = shift 4, and the item F → ·id to the entry ACTION[0, id] = shift 5. Other items in I0 yield no actions. Now consider I1:

E' → E·
E → E·+ T

The first item yields ACTION[l , $] = shift 6. Next consider 12 :

=

accept, and the second yields ACTION[l , +]

E � T· T � T· * F

Since FOLLOW(E)

=

ACTION[2, $]

{$, +, ) } , the first item makes =

ACTION[2, +]

=

ACTION[2, )]

=

reduce E � T

The second item makes ACTION[2, *] = shift 7. Continuing in this fashion we obtain the ACTION and GOTO tables that were shown in Fig. 4.31. In that figure, the numbers of productions in reduce actions are the same as the order in which they appear in the original grammar (4. 1 ) . That is, E � E + T is number 1, E � T is 2, and so on. 0




Example 4.48: Every SLR(1) grammar is unambiguous, but there are many

unambiguous grammars that are not SLR(l) . Consider the grammar with productions 8 L R

-t -t -t

L=R I R *R I id L

( 4.49)

Think of L and R as standing for l-value and r-value, respectively, and * as an operator indicating "contents of" (see footnote 5). The canonical collection of sets of LR(0) items for grammar (4.49) is shown in Fig. 4.39.

I0: S' → ·S
    S → ·L = R
    S → ·R
    L → ·* R
    L → ·id
    R → ·L

15 :

L -t id·

16 :

S -t L = ·R R -t ·L L -t · * R L -t ·id

h:

8' -t S·

17 :

L -t *R·

h:

8 -t L · = R R -t L·

18 :

R -t L·

19 :

S -t L = R·

h:

8 -t R·

14 :

L -t *·R R -t ·L L -t · * R L -t ·id

10 :

Figure 4.39: Canonical LR(O) collection for grammar (4.49) Consider the set of items 12 , The first item in this set makes ACTION[2, =] be "shift 6." Since FOLLOW(R) contains = (to see why, consider the derivation S =:} L = R =:} *R = R) , the second item sets ACTION[2, =] to "reduce R -t L." Since there is both a shift and a reduce entry in ACTION[2, =] , state 2 has a shift/reduce conflict on input symbol =. Grammar (4.49) is not ambiguous. This shift/reduce conflict arises from the fact that the SLR parser construction method is not powerful enough to remember enough left context to decide what action the parser should take on input =, having seen a string reducible to L. The canonical and LALR methods, to be discussed next, will succeed on a larger collection of grammars, including 5 As in Section 2.8.3, an l-value designates a location and an r-value is a value that can be stored in a location.

256

CHAPTER 4. SYNTAX ANALYSIS

grammar (4.49) . Note, however, that there are unambiguous grammars for which every LR parser construction method will produce a parsing action table with parsing action conflicts. Fortunately, such grammars can generally be avoided in programming language applications. 0

4.6.5

Viable Prefixes

Why can LR(O) automata be used to make shift-reduce decisions? The LR(O) automaton for a grammar characterizes the strings of grammar symbols that can appear on the stack of a shift-reduce parser for the grammar. The stack contents must be a prefix of a right-sentential form. If the stack holds a and the rest of the input is x, then a sequence of reductions will take ax to 8. In terms of derivations, 8 � ax. rm Not all prefixes of right-"sentential forms can appear on the stack, however, since the parser must not shift past the handle. For example, suppose E � F * id => (E) * id rm

rm

Then, at various times during the parse, the stack will hold (, (E, and (E) , but it must not hold (E) *, since (E) is a handle, which the parser must reduce to F before shifting * . The prefixes of right sentential forms that can appear on the stack of a shift­ reduce parser are called viable prefixes. They are defined as follows: a viable prefix is a prefix of a right-sentential form that does not continue past the right end of the rightmost handle of that sentential form. By this definition, it is always possible to add terminal symbols to the end of a viable prefix to obtain a right'-sentential form. SLR parsing is based on the fact that LR(O) automata recognize viable prefixes. We say item A -+ (31 ' (32 is valid for a viable prefix a(31 if there is a derivation 8' � aAw => a(31 (32w. In general, an item will be valid for many rm rm viable prefixes. The fact that A -+ (31 · (32 is valid for a(31 tells us a lot about whether to shift or reduce when we find a(31 on the parsing stack. In particular, if i32 f:. E, then it suggests that we have not yet shifted the handle onto the stack, so shift is our move. If (32 = E, then it looks as if A -+ (31 is the handle, and we should reduce by this production. Of course, two vaiid items may tell us to do different things for the same viable prefix. Some of these conflicts can be resolved by looking at the next input symbol, and others can be resolved by the methods of Section 4.8, but we should not suppose that all parsing action conflicts can be resolved if the LR method is applied to an arbitrary grammar. We can easily compute the set of valid items for each viable prefix that can appear on the stack of an LR parser. In fact, it is a central theorem of LR-parsing theory that the set of valid items for a viable prefix 'Y is exactly the set of items reached from the initial state along the path labeled 'Y in the LR(O) automaton for the grammar. In essence, the set of valid items embodies

4.6. INTRODUCTION TO LR PARSING: SIMPLE LR

257

Items as States of an NFA A nondeterministic finite automaton N for recognizing viable prefixes can be constructed by treating the items themselves as states. There is a transition from A -+ a· X (3 to A -+ aX · (3 labeled X , and there is a transition from A -+ a·B(3 to B -+ . '"'( labeled t. T hen CLOSURE ( I) for set of items (states of N) I is exactly the t-closure of a set of NFA states defined in Section 3.7.1. Thus, GOTo (I, X) gives the transition from I on symbol X in the DFA constructed from N by the subset construction. Viewed in this way, the procedure items( G') in Fig. 4.33 is just the subset construction itself applied to the NFA N with items as states.

all the useful information that can be gleaned from the stack. While we shall not prove this theorem here, we shall give an example. Example 4.50 : Let us consider the augmented expression grammar again, whose sets of items and GOTO function are exhibited in Fig. 4.31. Clearly, the string E + T* is a viable prefix of the grammar. The automaton of Fig. 4.31 will be in state 7 after having read E + T*. State 7 contains the items T -+ T * ·F F -+ · (E) F -+ ·id

which are precisely the items valid for E+T* . To see why, consider the following three rightmost derivations E' => E rm => E + T rm => E + T * F rm

E' => E rm :::} E + T rm => E + T * F rm :::} E + T * (E) rm

E' => E rm => E + T rm => E + T * F rm => E + T * id rm

The first derivation shows the validity of T -+ T * ·F, the second the validity of F -+ · (E) , and the third the validity of F -+ ·id. It can be shown that there are no other valid items for E + T*, although we shall not prove that fact here. o
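Example 4.50 can also be checked mechanically: the valid items for a viable prefix are just the items of the state reached by reading the prefix in the LR(0) automaton. A short illustration, reusing the GRAMMAR, closure() and goto() fragments sketched in Section 4.6.2 (the names are ours):

    def valid_items(prefix):
        state = closure({(0, 0)})            # start state I0
        for symbol in prefix:
            state = goto(state, symbol)      # follow the path labeled by the prefix
        return state

    for prod, dot in sorted(valid_items(["E", "+", "T", "*"])):
        head, body = GRAMMAR[prod]
        print(head, "->", " ".join(body[:dot]) + " . " + " ".join(body[dot:]))
    # Prints the three items of state 7: T -> T * .F, F -> .( E ), F -> .id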

4.6.6

Exercises for Section 4.6

Exercise 4.6.1 : Describe all the viable prefixes for the following grammars:

a) The grammar S

-+

0 S 1 I 0 1 of Exercise 4.2.2(a) .

258

CHAPTER 4. SYNTAX ANALYSIS

! b) The grammar

S

-+

SS + I SS

! c ) The grammar

S

-+

S(S) I



*

I a of Exercise 4.2. 1 .

of Exercise 4;2.2(c) .

Exercise 4.6',2 : Construct the SLR sets of items for the (augmented) grammar of Exercise 4;2. 1 . Compute the GO TO function for these sets of items. Show

the parsing table for this grammar. Is the grammar SLR? Exercise 4. 6.8 : Show the actions of your parsing table from Exercise 4.6.2 on

the input aa * a+.

Exercise 4.6.4 : For each of the (augmented) grammars of Exercise 4.2. 2 (a)­

(g) : a) Construct the SLR sets of items and their GOTO function. b) Indicate any action conflicts in your sets of items. c) Construct the SLR-parsing table, if one exists. Exercise 4.6.5 : Show that the following grammar:

S A B

-+

-+

-+

AaAblBbBa





is LL ( 1 ) but not SLR( I ) . Exercise 4.6.6 : Show that the following grammar:

S A

-+

-+

SAI A a

is SLR(l) but not Lt( 1 ) . ! ! Exercise 4.6.7 : Consider the family o f grammars Gn defined by: -+

S Ai

-+

Ai bi aj Ai I aj

for 1 ::; i ::; n for 1 ::; i, j ::; n and i =I- j

Show that: a) Gn has 2n 2

-n

productions.

b) Gn has 2 n + n2 + n sets of LR(O) items. c) Gn is SLR(l ) . What does this analysis say about how large L R parsers can get?

4. 7.

MORE POWERFUL LR PARSERS

259

! Exercise 4.6.8 : We suggested that individual items could be regarded as states of a nondeterministic finite automaton, while sets of valid items are the states of a deterministic finite automaton (see the box on "Items as States of an NFA" in Section 4.6.5) . For the grammar S --+ S S + I S S * I a of Exercise 4.2 . 1 : a) Draw the transition diagram (NFA) for the valid items of this grammar according to the rule given in the box cited above. b) Apply the subset construction (Algorithm 3.20) to your NFA from part (a) . How does the resulting DFA compare to the set of LR(O) items for the grammar?

! ! c) Show that in all cases, the subset construction applied to the NFA that comes from the valid items for a grammar produces the LR(O) sets of items. ! Exercise 4.6.9 : The following is an ambiguous grammar:

S A

--+

--+

ASib SAIa

Construct for this grammar its collection of sets of LR(O) items. If we try to build an LR-parsing table for the grammar, there are certain conflicting actions. What are they? Suppose we tried to use the parsing table by nondeterminis­ tically choosing a possible action whenever there is a conflict. Show all the possible sequences of actions on input abab. 4.7

More Powerful L R Parsers

In this section, we shall extend the previous LR parsing techniques to use one symbol of lookahead on the input. There are two different methods: 1 . The "canonical-LR" or just "LR" method, which makes full use of the

lookahead symbol(s) . This method uses a large set of items, called the LR(l) items.

2. The "lookahead-LR" or "LALR" method, which is based on the LR(O)

sets of items, and has many fewer states than typical parsers based on the LR(l) items. By carefully introducing lookaheads into the LR(O) items, we can handle many more grammars with the LALR method than with the SLR method, and build parsing tables that are no bigger than the SLR tables. LALR is the method of choice in most situations.

After introducing both these methods, we conclude with a discussion of how to compact LR parsing tables for environments with limited memory.

CHAPTER 4. SYNTAX ANALYSIS

260

4 . 7. 1

Canonical LR(1 ) Items

We shall now present the most general technique for constructing an LR parsing table from a grammar. Recall that in the SLR method, state i calls for reduction by A -+ a if the set of items Ii contains item [A -+ a·] and a is in FOLLOW(A) . In some situations, however, when state i appears on top of the stack, the viable prefix /3a on the stack is such that /3A cannot be followed by a in any right-sentential form. Thus, the reduction by A -+ a should be invalid on input

a.

Example 4.51 : Let us reconsider Example 4.48, where in state 2 we had item R -+ L·, which could correspond to A -+ a above, and a could be the = sign, which is in FOLLOW(R) . Thus, the SLR parser calls for reduction by R -+ L in state 2 with = as the next input (the shift action is also called for, because of item S -+ L · =R in state 2). However, there is no right-sentential form of the grammar in Example 4.48 that begins R = . . . . Thus state 2, which is the state corresponding to viable prefix L only, should not really call for reduction of that L to R. 0

It is possible to carry more information in the state that will allow us to rule out some of these invalid reductions by A -+ a . By splitting states when necessary, we can arrange to have each state of an LR parser indicate exactly which input symbols can follow a handle a for which there is a possible reduction to A. The extra information is incorporated into the state by redefining items to include a terminal symbol as a second component. The general form of an item becomes [A -+ a . /3, a] , where A -+ a/3 is a production and a is a terminal or the right endmarker $. We call such an object an LR(l) item. The 1 refers to the length of the second component, called the lookahead of the item. 6 The lookahead has no effect in an item of the form [A -+ a ·/3, a] , where j3 is not E, but an item of the form [A -+ a · , a] calls for a reduction by A -+ a only if the next input symbol is a. Thus, we are compelled to reduce by A -+ a only on those input symbols a for which [A -+ a·, a] is an LR(l) item in the state on top of the stack. The set of such a's will always be a subset of FOLLOW(A) , but it could be a proper subset, as in Example 4.5l. Formally, we say LR(l) item [A -+ a·/3, a] is valid for a viable prefix I' if there is a derivation S =* 8Aw => 8aj3w, where rm

1.

I' =

rm

8a, and

2. Either

a

is the first symbol of w, or w is E and a is $.

Example 4.52 : Let us consider the grammar 6 Lookaheads that are strings of length greater than one are possible, of course, but we shall not consider such lookaheads here.

4. 7. MORE POWERFUL LR PARSERS

261

S -+ B B B -+ a B I b

There is a rightmost derivation S * aaBab => aaaBab. We see that item [B -+ rm rm a·B, a] is valid for a viable prefix 'Y = aaa by letting c5 = aa, A = B, W = ab, a = a, and j3 = B in the above definition. There is also a rightmost derivation S * BaB => BaaB. From this derivation we see that item [B -+ a·B, $] is rm rm valid for viable prefix Baa. 0

4.7.2

Constructing LR( l) Sets of Items

The method for building the collection of sets of valid LR ( l ) items is essentially the same as the one for building the canonical collection of sets of LR ( O ) items. We need only to modify the two procedures CLOSURE and GOTO. SetOfItems CLOSURE(I) {

repeat for ( each item [A -+ a·Bj3, a] in I ) for ( each production B -+ 'Y in G' ) for ( each terminal b in FIRST(j3a) ) add [B -+ ''Y, b] to set I; until no more items are added to I; return I;

} SetOfItems GOTO (I, X) { initialize J to be the empty set; for ( each item [A -+ a ·Xj3, a] in I ) add item [A -+ aX ·j3, a] to set J; return CLOSURE( J ) ; } void items(G') {

}

initialize C to CLOSURE ( { [S' -+ ·S, $] } ) ; repeat for ( each set of items I in C ) for ( each grammar symbol X ) if ( GOTO (I, X) is not empty and not in C ) add GOTo(I, X) to C; until no new sets of items are added to C;

Figure 4.40: Sets-of-LR ( l ) -items construction for grammar G'
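For comparison with the LR(0) version sketched in Section 4.6.2, the LR(1) CLOSURE of Fig. 4.40 might be rendered as follows. An item now carries a lookahead terminal as its third component; first_of() stands for the FIRST computation of Section 4.4 and, like the other names, is an assumption of this sketch rather than anything defined in the text.

    def closure_lr1(items):
        # items are triples (production index, dot position, lookahead terminal)
        result = set(items)
        changed = True
        while changed:
            changed = False
            for prod, dot, a in list(result):
                _, body = GRAMMAR[prod]
                if dot < len(body) and body[dot] in NONTERMINALS:
                    beta = list(body[dot + 1:])
                    for b in first_of(beta + [a]):          # b ranges over FIRST(beta a)
                        for i, (head, _) in enumerate(GRAMMAR):
                            if head == body[dot] and (i, 0, b) not in result:
                                result.add((i, 0, b))
                                changed = True
        return frozenset(result)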

CHAPTER 4. SYNTAX ANALYSIS

262

To appreciate the new definition of the CLOSURE operation, in particular, why b must be in FIRsT(,8a) , consider an item of the form [A -t a·B,B, a] in the set of items valid for some viable prefix 1. Then there is a rightmost derivation S =* oAax ::::} oaB,Bax, where 1 oa. Suppose ,Bax derives terminal string 'I'm 'I' m by. Then for each production of the form B -t 1] for some 1], we have derivation S =* 1Bby ::::} 11]by. Thus, [B -t ''f}, b] is valid for 1. Note that b can be the 'I' m 'I' m first terminal derived from ,B, or it is possible that f3 derives E in the derivation ,Bax =* by , and b can therefore be a. To summarize both possibilities we say 'I'm

that b can be any terminal in FIRST(,Bax), where FIRST is the function from Section 4.4. Note that x cannot contain the first terminal of by, so FIRST(,Bax) = FIRsT(,Ba) . We now give the LR( l) sets of items construction. 10

It S' -+ S· , $

S' -+ · S, $ S -+ ·CC, $ C -t ·cC, c/d

12

C -+ ·d, c d

h S -+ CC·, $

S -+ C . C, $

C '-------....f C -+ 'cC, $

19

C -+ ·d1 $

16 c ,--__--!1110-1 C -+ c ' C, $

C -+ cC · , $

C -+ 'cC, $ C -+ ·d, $

d

c

d

Figure 4.41: The GOTO graph for grammar (4.55)

Algorithm 4.53 : Construction of the sets of LR(1) items. INPUT:

An augmented grammar G' .

OUTPUT: The sets of LR( l) items that are the set of items valid for one or more viable prefixes of G' .

4. 7.

MORE POWERFUL LR PARSERS

263

METHOD: The procedures CLOSURE and GOTO and the main routine items for constructing the sets of items were shown in Fig. 4.40. 0 Example 4.54 : Consider the following augmented grammar.

S' S C

-+ -+ -+

S CC cC I d

( 4.55)

We begin by computing the closure of { [Sf -+ ·S, $] } . To close, we match the item [Sf -+ ·S, $] with the item [A -+ (};·Bj3, a] in the procedure CLOSURE. That is, A = Sf, (}; = E, B = S, j3 = E, and a = $. Function CLOSURE tells us to add [B -+ ''Y, b] for each production B -+ 'Y and terminal b in FIRST(j3a) . In terms of the present grammar, B -+ 'Y must be S -+ CC, and since j3 is E and a is $ , b may only be $ . Thus we add [S -+ ·CC, $] . We continue to compute the closure by ad ding all items [C -+ ''Y, b] for b in FIRST(C$) . That is, matching [S -+ ·CC, $] against [A -+ (};·B j3 , a] , we have A = S, (}; = E, B = C, j3 = C, and a = $ . Since C does not derive the empty string, FIRST(C$) = FIRST(C) . Since FIRST(C) contains terminals c and d, we add items [C -+ 'cC, c] , [C -+ 'cC, d] , [C -+ ·d, c] and [C -+ ·d, d] . None of the new items has a nonterminal immediately to the right of the dot, so we have completed our first set of LR(l) items. The initial set of items is 1o :

S -+ ·S, $ S -+ ·CC, $ C -+ 'cC, c/ d C -+ ·d, c/ d

The brackets have been omitted for notational convenience, and we use the notation [C -+ ' cC, c/d] as a shorthand for the two items [C -+ 'cC, c] and [C -+ 'cC, d] . Now we compute GOTo(Io , X) for the various values of X . For X = S we must close the item [Sf -+ S ' , $] . No additional closure is possible, since the dot is at the right end. Thus we have the next set of items h :

S' -+ S ' , $

For X = C we close [S -+ C·C, $] . We add the C-productions with second component $ and then can add no more, yielding 12 :

S -+ C·C, $ C -+ 'cC, $ C -+ ·d, $

Next, let X = c. We must close { [C -+ c'C, c/d] } . We add the C-productions with second component c/ d, yielding

CHAPTER 4. SYNTAX ANALYSIS

264 13 :

Finally, let X

=

d,

C -+ c·C, c/d 0 -+ ·cC, c/d 0 -+ ·d, c/d

and we wind up with the set of items 14 :

0 -+ d·, c/d

We have finished considering GOTO on 10 . We get no new sets from /1 , but 12 has goto's on 0, c, and d. For GOTO(h , 0) we get Is :

S

-t

00·, $

no closure being needed. To compute GOTO(I2 , c) we take the closure of { [O -t c·O, $]} , to obtain 16 :

0 -+ c·O, $ o -t ·cC, $ 0 -+ ·d, $

Note that 16 differs from 13 only in second components. We shall see that it is common for several sets of LR(l) items for a grammar to have the same first components and differ in their second components� When we construct the collection of sets of LR(O) items for the same gramrpar, each set of LR(O) items will coincide with the set of first components of one or more sets of LR(l) items. We shall have more to say about this phenomenon when we discuss LALR parsing. Continuing with the GOTO function for 12 , GOTo(h , d) is seen to be h :

0 -+ d· , $

Turning now to la , the GOTO'S of 13 on c and d are 13 and 14 , respectively, and GOTO(I3 , 0) is 18 :

0 -t cO·, c/d

14 and Is have no GOTO'S, since all items have their dots at the right end. The GOTO's of 16 on c and d are 16 and h, respectively, and GOTo(I6 , C) is

19 :

C -+ cC·, $

The remaining sets of items yield no GOTO's, so we are done. Figure 4.41 shows the ten sets of items with their goto's. 0

4. 7.

MORE POWERFUL LR PARSERS

4.7.3

265

Canonical LR(l) Parsing Tables

We now give the rules for constructing the LR(l) ACTION and GOTO functions from the sets of LR(l) items. These functions are represented by a table, as before. The only difference is in the values of the entries. Algorithm 4.56 : Construction of canonical-LR parsing tables.

INPUT: An augmented grammar G' . OUTPUT: The canonical-LR parsing table functions ACTION and GOTO for G' . METHOD: 1 . Cop.struct C' G' .

=

{Io. , II , ' . . , In } , the collection of sets of LR(l) items for

2. State i of the parser is constructed from Ii . The parsing action for state i is determined as follows. (a) If [A -+ n·a{3, b] is in Ii and GOTO (Ii , a) = Ij , then set ACTION[i , a] to "shift j." Here a must be a terminal. (b) If [A -+ n·, a] is in Ii , A f:. S' , then set ACTION [i , a] to "reduce A -+ n." (c) If [S' -+ S·, $] is in Ii, then set ACTION[i , $] to "accept."

If any conflicting actions result from the above rules, we say the grammar is not LR(l) . The algorithm fails to produce a parser in this case. 3. The goto transitions for state i are constructed for all nonterminals A using the rule: If GOTO(Ii , A) = Ij , then GOTO [i , A] = j .

4 . All entries not defined by rules (2) and (3) are made "error." 5. The initial state of the parser is the one constructed from the set of items containing [S ' -+ · S, $] . o

The table formed from the parsing action and goto functions produced by Algorithm 4.44 is called the canonical LR(l) parsing table. An LR parser using this table is called a canonical-LR(l) parser. If the parsing action function has no multiply defined entries, then the given grammar is called an LR(l) grammar. As before, we omit the " (I)" if it is understood. Example 4.57 : The canonical parsing table for grammar (4.55) is shown in Fig. 4.42. Productions 1, 2, alld 3 are S -+ CC, C -+ cC, and C -+ d, respectively. 0

Every SLR(l) grammar is an LR(l) grammar, but for an SLR(l) grammar the canonical LR parser may have more states than the SLR parser for the same grammar. The grammar of the previous examples is SLR and has an SLR parser with seven states, compared with the ten of Fig. 4.42.

266

CHAPTER 4. SYNTAX ANALYSIS STATE

ACTION

c s3

d s4

2 3

s6 s3 r3

s7 s4 r3

6 7

s6

s7

r2

r2

0 1

4 5 8 9

GOTO $

acc

S 1

C 2 5 8

r1

r3

9

r2

Figure 4.42: Canonical parsing table for grammar (4.55)

4.7.4

Constructing LALR Parsing Tables

We now introduce our last parser construction method, the LALR (lookahead­ LR) technique. This method is often used in practice, because the tables ob­ taIned by it are considerably smaller than the canonical LR tables, yet most common syntactic constructs of programming languages can be expressed con­ veniently by an LALR grammar. The same is almost true for SLR grammars, but there are a few constructs that cannot be conveniently handled by SLR techniques ( see Example 4.48, for example ) . For a comparison of parser size, the SLR and LALR tables for a grammar always have the same number of states, and this number is typically several hundred states for a language like C. The canonical LR table would typically have several thousand states for the same-size language. Thus, it is much easier and more economical to construct SLR and LALR tables than the canonical LR tables. By way of introduction, let us again consider grammar (4.55) , whose sets of LR(l) items were shown in Fig. 4.41. Take a pair of similar looking states, such as 14 and h . Each of these states has only items with first component C -+ d · . In 14 i the lookaheads are c or d; in 17 , $ is the only lookahead. To see the difference between the roles of 14 and 17 in the parser, note that the grammar generates the regular language c*dc * d. When reading an input cc . . . cdcc . . . cd, the parser shifts the first group of c's and their following d onto the stack, entering state 4 after reading the d, The parser then calls for a reduction by C -+ d, provided the next input symbol is c or d. The requirement that c or d follow makes sense, since these are the symbols that could begin strings in c*d. If $ follows the first d, we have an input like ccd, which is not in the language, and state 4 correctly declares an error if $ is the next input. The parser enters state 7 after reading the second d. Then, the parser must

4. 7.

MORE POWERFUL LR PARSERS

267

see $ on the input, or it started with a string not of the form c * dc * d. It thus makes sense that state 7 should reduce by C -+ d on input $ and declare error on inputs c or d. Let us now replace 14 and h by 147 , the union of 14 and h, consisting of the set of three items represented by [C -+ d · , c/d/$] . The goto's on d to 14 or 17 from 10 , h , Is , and h now enter 147 . The action of state 47 is to reduce on any input. The revised parser behaves essentially like the original, although it might reduce d to C in circumstances where the original would declare error, for example, on input like ccd or cdcdc. The error will eventually be caught; in fact, it will be caught before any more input symbols are shifted. More generally, we can look for sets of LR(l) items having the same core, that is, set of first components, and we may merge these sets with common cores into one set of items. For example, in Fig. 4.41, 14 and h form such a pair, with core {C -+ d · } . Similarly, 13 and h form another pair, with core {C -+ c · C, C -+ · cC, C -+ · d} . There is one more pair, 18 and 19 , with common core {C -+ cC· } . Note that, in general, a core is a set of LR(O) items for the grammar at hand, and that an LR(l) grammar may produce more than two sets of items with the same core. Since the core of GOTO(I, X) depends only on the core of I, the goto's of merged sets can themselves be merged. Thus, there is no problem revising the goto function as we merge sets of items. The action functions are modified to reflect the non-error actions of all sets of items in the merger. Suppose we have an LR(l) grammar, that is, one whose sets of LR(l) items produce no parsing-action conflicts. If we replace all states having the same core with their union, it is possible that the resulting union will have a conflict, but it is unlikely for the following reason: Suppose in the union there is a conflict on lookahead a because there is an item [A -+ a·, a] calling for a reduction by A -+ a, and there is another item [B -+ jJ · a"'(, b) calling for a shift. Then some set of items from which the union was formed has item [A -+ a · , a] , and since the cores of all these states are the same, it must have an item [B -+ jJ·a"'(, c] for some c. But then this state has the same shift / reduce conflict on a, and the grammar was not LR(l) as we assumed. Thus, the merging of states with common cores can never produce a shift / reduce conflict that was not present in one of the original states, because shift actions depend only on the core, not the lookahead. It is possible, however, that a merger will produce a reduce / reduce conflict, as the following example shows. Example 4.58 : Consider the grammar 8' 8 A B

-+ -+

-+

-+

8 aAd I bBd I aBe I bAe c c

which generates the four strings acd, ace, bcd, and bce. The reader can check that the grammar is LR(l) by constructing the sets of items. Upon doing so,

268

CHAPTER 4. SYNTAX ANALYSIS

we find the set of items { [A -+ C' , d) , [B -+ C' , en valid for viable prefix ac and { [A -+ C' , e] , [B -+ C' , d) } valid for bc. Neither of these sets has a conflict, and their cores are the same. However, their union, which is A -+ C' , die B -+ C ' , die

generates a reduce / reduce conflict, since reductions by both A -+ C and B -+ C are called for on inputs d and e. 0 We are now prepared to give the first of two LALR table-construction al­ gorithms. The general idea is to construct the sets of LR ( l ) items, and if no conflicts arise, merge sets with common cores. We then construct the parsing table from the collection of merged sets of items. The method we are about to describe serves primarily as a definition of LALR ( l ) grammars. Constructing the entire collection of LR ( l ) sets of items requires too much space and time to be useful in practice. Algorithm 4.59 : An easy, but space-consuming LALR table construction.

INPUT: An augmented grammar G' . OUTPUT: The LALR parsing-table functions ACTION and GOTO for G'. METHOD:

1. Construct C = {1o , II , . . . , In } , the collection of sets of LR( l ) items. 2. For each core present among the set of LR ( l ) items, find all sets having that core, and replace these sets by their union. 3. Let C'

= { Jo , J1 , , Jm } be the resulting sets of LR ( l ) items. The parsing actions for state i are constructed from Ji in the same manner as in Algorithm 4.56. If there is a parsing action conflict, the algorithm fails to produce a parser, and the grammar is said not to be LALR ( l ) . . • .

4. The GOTO table is constructed as follows. If J is the union of one or more sets of LR ( l ) items, that is, J = It n 12 n . . . n h , then the cores of GOTo(I1 , X ) , GOTO (I2 , X) , . . . , GOTo (Ik , X) are the same, since II , 12 , , Ik all have the same core. Let K be the union of all sets of items having the same core as GOTO(Il , X) , Then GOTO (J, X) = K. • • •

o

The table produced by Algorithm 4.59 is called the LALR parsing table for G. If there are no parsing action conflicts, then the given grammar is said to be an LALR (l) grammar. The collection of sets of items constructed in step (3) is called the LALR(l) collection.

269

4. 7. MORE POWERFUL LR PARSERS

Example 4.60 : Again consider grammar (4.55) whose GOTO graph was shown

in Fig. 4.41. As we mentioned, there are three pairs of sets of items that can be merged. 13 and 16 are replaced by their union: 136 :

G ---t c·G, c/d/$ G ---t · cG, c/d/$ G ---t · d, c/ d/$

14 and h are replaced by their union:

and 18 and 19 are replaced by their union: 189 :

G ---t cG · , c/d/$

The LALR action and goto functions for the condensed sets of items are shown in Fig. 4.43. STATE 0 1

2 36 47 5

89

ACTION $ d c s36 s47 acc s36 s47 s36 s47 r3 r3 r3 rl r2 r2 r2

GOTO S G 2 1 5

89

Figure 4.43: LALR parsing table for the grammar of Example 4.54 To see how the GOTO's are computed, consider GOTO (I36 , G) . In the original set of LR(l) items, GOTO(I3 , G) = 18 , and 18 is now part of 189 , so we make GOTo(136 , G) be 189 . We could have arrived at the same conclusion if we considered h , the other part of 136 , That is, GOTO(I6 , G) = 19 , and 19 is now part of 189 , For another example, consider GOTo(12 , c) , an entry that is exercised after the shift action of 12 on input c. In the original sets of LR(l) items, GOTO(I2 , c) = h . Since 16 is now part of 136 , GOTO(I2 , c) becomes 136 . Thus, the entry in Fig. 4.43 for state 2 and input c is made s36, meaning shift and push state 36 onto the stack. 0 When presented with a string from the language c* dc * d, both the LR parser of Fig. 4.42 and the LALR parser of Fig. 4.43 make exactly the same sequence of shifts and reductions, although the names of the states on the stack may differ. For instance, if the LR parser puts 13 or h on the stack, the LALR

270

CHAPTER 4. SYNTAX ANALYSIS

parser will put 136 on the stack. This relationship holds in general for an LALR grammar. The LR and LALR parsers will mimic one another on correct inputs. When presented with erroneous input, the LALR parser may proceed to do some reductions after the LR parser has declared an error. However, the LALR parser will never shift another symbol after the LR parser declares an error. For example, on input ccd followed by $, the LR parser of Fig. 4.42 will put 0334

on the stack, and in state 4 will discover an error, because $ is the next input symbol and state 4 has action error on $. In contrast, the LALR parser of Fig. 4.43 will make the corresponding moves, putting 0 36 36 47

on the stack. But state 47 on input $ has action reduce C parser will thus change its stack to

--+

d. The LALR

o 36 36 89 Now the action of state 89 on input $ is reduce G --+ cG . The stack becomes 0 36 89

whereupon a similar reduction is called for, obtaining stack 02

Finally, state 2 has action error on input $, so the error is now discovered.

4.7.5

Efficient Construction of LALR Parsing Tables

There are several modifications we can make to Algorithm 4.59 to avoid con­ structing the full collection of sets of LR ( l ) items in the process of creating an LALR ( l ) parsing table. •

First, we can represent any set of LR ( O ) or LR ( l ) items 1 by its kernel, [8' --+ · 8] or that is, by those items that are ' either the initial item [8' --+ ·8, $] or that have the dot somewhere other than at the beginning of the production body. -

-





We can construct the LALR ( l ) -item kernels from the LR ( O ) -item kernels by a process of propagation and spontaneous generation of lookaheads, that we shall describe shortly. If we have the LALR ( l ) kernels, we can generate the LALR ( l ) parsing table by closing each kernel, using the function CLOSURE of Fig. 4.40, and then computing table entries by Algorithm 4.56, as if the LALR ( l ) sets of items were canonical LR ( l ) sets of items.

271

4. 7. MORE POWERFUL LR PARSERS

Example 4.61 : We shall use as an example of the efficient LALR ( l ) table­ construction method the non-SLR grammar from Example 4.48, which we re­ produce below in its augmented form: 8' 8 L R

--+

--+

--+

--+

S

L=R I R *R I id L

The complete sets of LR ( O ) items for this grammar were shown in Fig. 4.39. The kernels of these items are shown in Fig. 4.44. 0 10 :

8'

--+

·8

15 :

L --+ id·

II :

S'

--+



16 :

8 --+ L = ·R

12 :

8 --+ L· = R R --+ L·

17 :

L --+ *R·

13 :

8 --+ R·

Is :

R --+ L ·

14 :

L --+ *·R

19 :

8 --+ L = R ·

Figure 4.44: Kernels of the sets of LR ( O ) items for grammar ( 4.49 ) Now we must attach the proper lookaheads to the LR ( O ) items in the kernels, to create the kernels of the sets of LALR ( l ) items. There are two ways a lookahead b can get attached to an LR ( O ) item B --+ "'I"b in some set of LALR ( l ) items J: 1. There is a set of items I, with a kernel item A

GOTo(I, X ) , and the construction of

--+

a· (3, a , and J =

GOTO (CLOSURE({[A --+ a·{3, a]}) , X)

as given in Fig. 4.40, contains [B --+ "'1"8, b] , regardless of a. Such a looka­ head b is said to be generated spontaneously for B --+ 1 . 8. 2. As a special case, lookahead $ is generated spontaneously for the item

8'

--+

·8 in the initial set of items.

3. All is as in ( 1 ) , but a = b, and GOTO (CLOSURE({[A --+ a·{3, b] } ) , X) , as given in Fig. 4.40, contains [B --+ "'1"8, b) only because A --+ a·{3 has b as one of its associated lookaheads. In such a case, we say that lookaheads propagate from A --+ a·{3 in the kernel of I to B --+ 1 . 8 in the kernel of J. Note that propagation does not depend on the particular lookahead symbol; either all lookaheads propagate from one item to another, or none do.

272

CHAPTER

4.

SYNTAX ANALYSIS

We need to determine the spontaneously generated lookaheads for each set of LR(O) items, and also to determine which items propagate lookaheads from which. The test is actually quite simple. Let # be a symbol not in the grammar at hand. Let A -t 0, . (3 be a kernel LR(O) item in set I. Compute, for each X , J == GOTO (CLOSURE({[A -t O,·(3, #] }) , X) . For each kernel item in J , we examine its set of lookaheads. If # is a lookahead, then lookaheads propagate to that item from A -t 0, . (3. Any other lookahead is spontaneously generated. These ideas are made precise in the following algorithm, which also makes use of the fact that the only kernel items in J must have X immeqiately to the left of the dot; that is, they must be of the form B -t "YX ,0. Algorithm 4.62 : Determining lookaheads.

INPUT: The kernel K of a set of LR(O) items I and a grammar symbol X . OUTPUT: The lookaheads spontaneously generated by items i n I for kernel items in GOTO(I, X) and the items in I from which lookaheads are propagated to kernel items in GOTO(I, X ) . METHOD: The algorithm is given in Fig. 4.45.

0

for ( each item A -t 0, . (3 in K ) { J : = CLOSURE( { [A -t (t ' (3,#] ) ) ; if ( [B -t "Y·Xo, a] is in J, and a is not # )

conclude that lookahead a is generated spontaneously for item B -t "YX·o in GOTO(I, X) ; if ( [B -t l' X 0, #] is in J ) conclude that lookaheads propagate from A -t (t . (3 in I to . B -t "YX·o in GOTo(I, X);

}

Figure 4.45: Discovering propagated and spontaneous lookaheads We are now ready to attach lookaheads to the kernels of the sets of LR(O) items to form the sets of LALR(I) items. First, we know that $ is a looka­ head for 8' -t ·8 in the initial set of LR(O) items. Algorithm 4.62 gives us all the lookaheads generated spontaneously. After listing all those lookaheads, we must Ci,llow them to propagate until no further propagation is possible. There are many different approaches, all of which in some sense keep track of "new" lookaheads that have propagated into an item but which have not yet propa­ gated out. The next algorithm describes one technique to propagate lookaheads to all items. Algorithm 4.63 : Efficient computation of the kernels of the LALR(I) collec-:­ tion of sets of items. INPUT: An augmented grammar G' .

4. 7.

MORE POWERFUL LR PARSERS

273

OUTPUT: The kernels of the LALR ( l ) collection of sets of items for G' . METHOD: 1 . Construct the kernels of the sets of LR ( O ) items for G. If space is not at

a premium, the simplest way is to construct the LR ( O ) sets of items, as in Section 4.6.2, and then remove the nonkernel items. If space is severely constrained, we may wish instead to store only the kernel items for each set, and compute GOTO for a set of items 1 by first computing the closure of 1.

2. Apply Algorithm 4.62 to the kernel of each set of LR ( O) items and gram­ mar symbol X to determine which lookaheads are spontaneously gener­ ated for kernel items in GOTo (l, X), and from which items in 1 lookaheads are propagated to kernel items in GOTo(l, X) . 3. Initialize a table that gives, for each kernel item in· each set of items, the associated lookaheads. Initially, each item has associated with it only those lookaheads that we determined in step (2) were generated sponta­ neously. 4. Make repeated passes over the kernel items in all sets. When we visit an item i, we look up the kernel items to which i propagates its lookaheads, using information tabulated in step (2) . The current set of lookaheads for i is added to those already associated with each of the items to which i propagates its lookaheads. We continue making passes over the kernel items until no more new lookaheads are propagated. o

Example 4 .64 : Let us construct the kernels of the LALR ( l ) items for the grammar of Example 4.61. The kernels of the LR ( O ) items were shown in Fig. 4.44. When we apply Algorithm 4.62 to the kernel of set of items 10 , we first compute CLOSURE( { [S' -+ ·S, #] }), which is

S' -+ ·S, # S -+ ·L = R, # S -+ ·R, #

L -+ . * R, #/ = L -+ ·id, #/ = R -+ ·L, #

Among the items in the closure, we see two where the lookahead = has been generated spontaneously. The first of these is L -+ . * R . This item, with * to the right of the dot, gives rise to [L -+ *·R, = ] . That is, = is a spontaneously generated lookahead for L -+ *·R, which is in set of items 14 . Similarly, [L -+ ·id, = ] tells us that = is a spontaneously generated lookahead for L -+ id· in Is .

As # is a lookahead for all six items in the closure, we determine that the item S' -+ ·S in 10 propagates lookaheads to the following six items:

274

CHAPTER 4. SYNTAX ANALYSIS 8' -+ 8· in h 8 -+ L· = R in 12 8 -+ R· in 13

10 :

12 : 14 :

16 :

FROM 8' -+ ·8

L -+ *·R in 14 L -+ id· in h R -+ L· in 12

II :

12 : 12 : 13 : 14 : h: 8 -+ L· = R 16 : 14 : L -+ *·R h: 17 : 18 : 8 -+ L = ·R 14 : h: 18 : 19 :

To 8' -+ 8· 8 -+ L· = R R -+ L· 8 -+ R· L -+ *·R L -+ id· 8 -+ L = ·R L -+ *·R L -+ id· L -+ *R· R -+ L· L -+ *·R L -+ iq· R -+ L· 8 -+ L = R·

Figure 4.46: Propagation of lookaheads In Fig. 4.47, we show steps (3) and (4) of Algorithm 4.63. The column labeled INIT shows the spontaI!-eously generated lookaheads for each kernel item. These are only the two occurrences of = discussed earlier, and the spontaneous lookahead $ for the initial item 8' -+ ·8. On the first pass, the lookahead $ propagates from 8' -+ 8 in 10 to the six items listed in Fig. 4.46. The lookahead = propagates from L -+ *·R in 14 to items L -+ * R· in h and R -+ L· in 18 , It also propagates to itself and to L -+ id · in h , but these lookaheads are already present. In the second and third passes, the only new lookahead propagated is $, discovered for the successors of 12 and 14 on pass 2 and for the successor of 16 on pass 3. No new lookaheads are rightmost propagated on pass 4, so the final . set of lookaheads is shown in the column of Fig. 4.47. Note that the shift/reduce conflict found in Example 4.48 using the SLR method has disappeared with the LALR technique. The reason is that only lookahead $ is associated with R -+ L· in 12 , so there is n() conflict with the parsing action of shift on = generated by item 8 -+ L·=R in 12 , D

275

4. 7. MORE POWERFUL LR PARSERS SET

ITEM

INIT

LOOKAHEADS PASS 1 PASS 2

PASS 3

$

$

$



$

$

$

h:

S --+ L· = R R --+ L·

$ $

$ $

$ $

13 :

S --+ R·

$

$

$

14 :

L

--+

*·R

=/$

=/$

=/$

Is :

L

--+

id·

=/$

=/$

=/$

16 :

S

--+

L = ·R

$

$

h:

L --+ *R·

=/$

=/$

18 :

R --+ L·

=/$

=/$

19 :

S --+ L = R·

10 :

Sf

--+

·S

h:

Sf

--+

$

-

-

$

Figure 4.47: Computation of lookaheads

4.7.6

Compaction of LR Parsing Tables

A typical programming language grammar with 50 to 100 terminals and 100 productions may have an LALR parsing table with several hundred states. The action function may easily have 20,000 entries, each requiring at least 8 bits to encode. On small devices, a more efficient encoding than a two-dimensional array may be important. We shall mention briefly a few techniques that have been used to compress the ACTION and GO TO fields of an LR parsing table. One useful technique for compacting the action field is to recognize that usually many rows of the action table are identical. For example, in Fig. 4.42, states 0 and 3 have identical action entries, and so do 2 and 6. We can therefore save considerable space, at little cost in time, if we create a pointer for each state into a one-dimensional array. Pointers for states with the same actions point to the same location. To access information from this array, we assign each terminal a number from zero to one less than the number of terminals, and we use this integer as an offset from the pointer value for each state. In a given state, the parsing action for the ith terminal will be found i locations past the pointer value for that state. Further space efficiency can be achieved at the expense of a somewhat slower parser by creating a list for the actions of each state. The list consists of (terminal-symbol, action ) pairs. The most frequent action for a state can be

276

CHAPTER

4.

SYNTAX ANALYSIS

placed at the end of the list, and in place of a terminal we may use the notation "any," meaning that if the current input symbol has not been found so far on the list, we should do that action no matter what the input is. Moreover, error entries can safely be replaced by reduce actions, for further uniformity along a row. The errors will be detected later, before a shift move.

Example 4.65 : Consider the parsing table of Fig. 4.37. First, note that the actions for states 0, 4, 6, and 7 agree. We can represent them all by the list S YM B O L id

(

any

ACTION s5 s4 error

State 1 has a similar list: + $

any

s6 acc error

In state 2, we can replace the error entries by r2, so reduction by production 2 will occur on any input but * . Thus the list for state 2 is *

any

s7 r2

State 3 has only error and r4 entries. We can replace the former by the latter, so the list for state 3 consists of only the pair (any, r4 ) . States 5, 10, and 11 can be treated similarly. The list for state 8 is +

)

any

s6 sll error

and for state 9 *

any

s7 s11 r1

0

We can also encode the GOTO table by a list, but here it appears more efficient to make a list of pairs for each nonterminal A. Each pair on the list for A is of the form (currentState, nextState) , indicating GOTo[currentState, A]

=

next State

4. 7.

277

MORE POWERFUL LR PARSERS

This technique is useful because there tend to be rather few states in any one column of the GOTO table. The reason is that the GOTO on nonterminal A can only be a state derivable from a set of items in which some items have A immediately to the left of a dot. No set has items with X and Y immediately to the left of a dot if X ::j:. Y. Thus, each state appears in at most one GOTO column. For more space reduction, we note that the error entries in the goto table are never consulted. We can therefore replace each error entry by the most common non-error entry in its column. This entry becomes the default; it is represented in the list for each column by one pair with any in place of currentState. Example 4.66 : Consider Fig. 4.37 again. The column for F has entry 10 for state 7, and all other entries are either 3 or error. We may replace error by 3 and create for column F the list CURRENTSTATE NEXTSTATE 7 any

10 3

Similarly, a suitable list for column T is 6 any

9 2

For column E we may choose either 1 or 8 to be the default; two entries are necessary in either case. For example, we might create for column E the list 4 any

8 1

o

This space savings in these small examples may be misleading, because the total number of entries in the lists created in this example and the previous one together with the pointers from states to action lists and from nonterminals to next-state lists, result in unimpressive space savings over the matrix imple­ mentation of Fig. 4.37. For practical grammars, the space needed for the list representation is typically less than ten percent of that needed for the matrix representation. The table-compression methods for finite automata that were discussed in Section 3.9.8 can also be used to represent LR parsing tables.

4.7.7

Exercises for Section 4.7

Exercise 4.7. 1 : Construct the

a) canonical LR, and b) LALR

278

CHAPTER 4. SYNTAX ANALYSIS

sets of items for the grammar S

-+

SS + ISS

*

I a of Exercise 4.2 . 1 .

Exercise 4.7.2 : Repeat Exercise 4.7.1 for each of the ( augmented ) grammars of Exercise 4.2.2 ( a) - (g ) .

! Exercise 4.7.3 : For the grammar of Exercise 4.7. 1 , use Algorithm 4.63 to compute the collection of LALR sets of items from the kernels of the LR ( O) sets of items. ! Exercise 4.7.4 : Show that the following grammar S A

-+ -+

A a l bAc l dc l bda d

is LALR ( l ) but not SLR ( l ) .

! Exercise 4.7.5 : Show that the following grammar

S

A B

-+

-+

-+

AalbAclBclbBa d d

is LR ( l ) but not LALR ( l ) . 4.8

Using Ambiguous G rammars

It is a fact that every ambiguous grammar fails to be LR and thus is not in any of the classes of grammars discussed in the previous two sections. How­ ever, certain types of ambiguous grammars are quite useful in the specification and implementation of languages. For language constructs like expressions, an ambiguous grammar provides a shorter, more natural specification than any equivalent unambiguous grammar. Another use of ambiguous grammars is in isolating commonly occurring syntactic constructs for special-case optimiza­ tion. With an ambiguous grammar, we can specify the special-case constructs by carefully adding new productions to the grammar. Although the grammars we use are ambiguous, in all cases we specify dis­ ambiguating rules that allow only one parse tree for each sentence. In this way, the overall language specification becomes unambiguous, and sometimes it be­ comes possible to design an LR parser that follows the same ambiguity-resolving choices. We stress that ambiguous constructs should be used sparingly and in a strictly controlled fashion; otherwise, there can be no guarantee as to what language is recognized by a parser.

279

4.8. USING AMBIGUOUS GRAMMARS

4.8.1

Precedence and Associativity to Resolve Conflicts

Consider the ambiguous grammar (4.3) for expressions with operators + and *, repeated here for convenience: E

--+

E + E I E * E I (E) I id

This grammar is ambiguous because it does not specify the associativity or precedence of the operators + and *. The unambiguous grammar (4. 1 ) , which includes productions E --+ E + T and T --+ T * F, generates the same language, but gives + lower precedence than * , and makes both operators left associative. There are two reasons why we might prefer to use the ambiguous grammar. First, as we shall see, we can easily change the associativity and precedence of the operators + and * without disturbing the productions of (4.3) or the number of states in the resulting parser. Second, the parser for the unam­ biguous grammar will spend a substantial fraction of its time reducing by the productions E --+ T and T --+ F, whose sole function is to enforce associativity and precedence. The parser for the ambiguous grammar (4.3) will not waste time reducing by these single productions ( productions whose body consists of a single nonterminal ) . The sets of LR(O) items for the ambiguous expression grammar (4.3) aug­ mented by E' --+ E are shown in Fig. 4.48. Since grammar (4.3) is ambiguous, there will be parsing-action conflicts when we try to produce an LR parsing table from the sets of items. The states corresponding to sets of items h and I8 generate these conflicts. Suppose we use the SLR approach to constructing the parsing action table. The conflict generated by 17 between reduction by E -t E + E and shift on + or * cannot be resolved, because + and * are each in FOLLOW(E ) . Thus both actions would be called for on inputs + and * . A similar conflict is generated by I8 , between reduction by E --+ E * E and shift on inputs + and *. In fact, each of our LR parsing table-construction methods will generate these conflicts. However, these problems can be resolved using the precedence and associa­ tivity information for + and *. Consider the input id + id * id, which causes a parser based on Fig. 4.48 to enter state 7 after processing id + id; in particular the parser reaches a configuration PREFIX E+E

STACK

0147

INPUT * id $

For convenience, the symbols corresponding to the states 1 , 4, and 7 are also shown under PREFIX. If * takes precedence over +, we know the parser should shift * onto the stack, preparing to reduce the * and its surrounding id symbols to an expression. This choice was made by the SLR parser of Fig. 4.37, based on an unambiguous grammar for the same language. On the other hand, if + takes precedence over *, we know the parser should reduce E + E to E. Thus the relative precedence

280

CHAPTER 4. SYNTAX ANALYSIS 10 :

E' -t ·E E -t ·E + E E -t ·E * E E -t · (E) E -t ·id

Is :

-t E * ·E E -t ·E + E E -t ·E * E E -t · (E) E -t ·id

h:

E' -t E· E -t E· + E E -t E· * E

16 :

E -t (E·) E -+ E· + E E -t E· * E

h:

E -+ (·E) E -t ·E + E E -t ·E * E E -t · (E) E -t ·id

h:

E -t E + E· E -t E· + E E -t E· * E

Is :

E -t E * E· E -t E· + E E -t E· * E

19 :

E -t (E) ·

13 :

E -t id·

14 :

E -t E + ·E E -t ·E + E E -t ·E * E E -t · (E) E -t ·id

E

Figure 4.48: Sets of LR(O) items for an augmented expression grammar of + followed by * uniquely determines how the parsing action conflict between reduci�g E -t E + E and shifting on * in state 7 should be resolved. If the input had been id + id + id instead, the parser would still reach a configuration in which it had stack 0 1 4 7 after processing input id + id. On input + there is again a shift/reduce conflict in state 7. Now, however, the associativity of the + operator determines how this conflict should be resolved. If + is left associativ�; the correct action is to reduce by E -t E + E. That is, the id symbols surrounding the first + must be grouped first. Again this choice coincides with what the SLR parser for the unambiguous grammar would do. In summary, assuming + is left associative, the action of state 7 on input + should be to reduce by E -t E + E, and assuming that * takes precedence over +, the action of state 7 on input * should be to shift. Similarly, assuming that * is left associative and takes precedence over + , we can argue that state 8, which can appear on top of the stack only when E * E are the top three grammar symbols, should have the action reduce E -t E * E on both + and * inputs. In the case of input + , the reason is that * takes precedence over + , while in the case of input *, the rationale is that * is left associative.

4.8.

281

USING AMBIGUOUS GRAMMARS

Proceeding in this way, we obtain the LR parsing table shown in Fig. 4.49. Productions i through 4 are E � E + E, E � E * E, � (E) , and E � id, respectively. It is interesting that a similar parsing action table would be produced by eliminating the reductions by the single productions E � T and T � F from the SLR table for the unambiguous expression grammar (4.1) shown in Fig. 4.37. Ambiguous grammars like the one for expressions can be handled in a similar way in the context of LALIt and canonical LR parsing. GOTO

ACTION

STATE

+

id

0 1 2 3 4 5 6 7 8 9

*

(

)

$

s2

s3 s3 s3 s3

s4

s5

r4

r4

s4 r1 r2 r3

s5 s5 r2 r3

s2 s2 s2

acc r4

r4

s9 r1 r1 t2 r2 r3 r3

E 1 6 7 8

Figure 4.49: Parsing table for grammar (4.3)

4.8.2

The " D angling-Else" Ambiguity

Consider again the following grammar for conditional statements: stmt



I I

if expr then stmt else stmt if expr then stmt other

As we noted in Section 4.3.2, this grammar is ambiguous because it does not resolve the dangling-else ambiguity. To simplify the discussion, let us consider an abstraction of this grammar, where i stands for if expr then, e stands for else, and a stands for "all other productions." We can then write the grammar, with augmenting production 8' � 8, as S' S





S iSeS l iS l a

(4.67)

The sets of LR(O) items for grammar (4.67) are shown in Fig. 4.50. The ambi­ guity in (4.67) gives rise to a shift/reduce conflict in [4 . There, S � is·eS calls for a shift of e and, since FOLLOW(S) = {e, $}, item S � is· calls for reduction by S � is on input e. Translating back to the if-then-else terminology, given

282

CHAPTER 4. SYNTAX ANALYSIS 10 :

11 : 12 :

S' -+ ·8 8 -+ ·i8e8 8 -+ �i8 8 -+ ·a

8' -+ 8· 8 -+ i·8e8 8 -+ i·8 8 -+ ·i8e8 8 --t ·i8 8 -+ ·a

13 :

8 -+

14 :

8 -+ i8 · e8

h:

8 -+ i8e·8 8 -+ ·i8e$ 8 -+ ·i8 8 -+ ·a

16 :

8 -t i8e8·



Figure 4.50: LR(O) states for augmented grammar (4.67) if expr then stmt

on the stack and else as the first input symbol, should we shift else onto the stack ( i.e., shift e) or reduce if expr tlIen stmt ( i.e, reduce by 8 -+ i8)? The answer is that we should shift else, because it is "associated" with the previous then. In the terminology of grammar (4.67) , the e on the input, standing for else, can only form part of the body beginning with the i8 now on the top of the stack. If what follows e on the input cannot be parsed as an 8, completing body i8 e8, then it can be shown that there is no other parse possible. We conclude that the shift / reduce conflict in 14 should be resolved in favor of shift on input e. The SLR parsing table constructed from the sets of items of Fig. 4.48, using this resolution of the parsing-action conflict in 14 on input e, is shown in Fig. 4.51 . Productions 1 through 3 are 8 -+ i8e8, 8 -+ i8, and 8 -+ a, respectively. STATE

0 1 2

3

4

5

6

GOTO

ACTION

i s2 s2 s2

e

r3 s5 r1

a s3 s3 s3

$

acc r3 r2 r1

8 1 4 6

Figure 4.51: LR parsing table for the "dangling-else" grammar

4.8.

283

USING AMBIGUOUS GRAMMARS

For example, on input iiaea, the parser makes the moves shown in Fig. 4.52, corresponding to the correct resolution of the "dangling-else." At line (5) , state 4 selects the shift action on input e, whereas at line (9) , state 4 calls for reduction by S -+ is on input $ . ( 1) (2 ) (3 ) ( 4) (5) (6 )

(7) ( 8) (9) ( 10)

STACK 0 02 022 0223 0224 02245 022453 022456 024 0 1

SYMBOLS i ii iia iiS ii Se ii Se a iiSeS is S

INPUT iiaea$ iaea$ aea$ ea$ ea$ a$ $ $ $ $

ACTION

shift shift shift shift reduce by S -+ a shift reduce by S -+ a reduce by S -+ iSeS reduce by S -+ is accept

Figure 4.52: Parsing actions on input iiaea By way of comparison, if we are unable to use an ambiguous grammar to specify conditional statements, then we would have to use a bulkier grammar along the lines of Example 4.16.

4.8.3

Error Recovery in LR Parsing

An LR parser will detect an error when it consults the parsing action table and finds an error entry. Errors are never detected by consulting the goto table. An LR parser will announce an error as soon as there is no valid continuation for the portion of the input thus far scanned. A canonical LR parser will not make even a single reduction before announcing an error. SLR and LALR parsers may make several reductions before announcing an error, but they will never shift an erroneous input symbol onto the stack. In LR parsing, we can implement panic-mode error recovery as follows. We scan down the stack until a state s with a goto on a particular nonterminal A is found. Zero or more input symbols are then discarded until a symbol a is found that can legitimately follow A. The parser then stacks the state GOTO ( s, A) and resumes normal parsing. There might be more than one choice for the nonterminal A. Normally these would be nonterminals representing major program pieces, such as an expression, statement, or block. For example, if A is the nonterminal stmt, a might be semicolon or } , which marks the end of a statement sequence. This method of recovery attempts to eliminate the phrase containing the syntactic error. The parser determines that a string derivable from A contains an error. Part of that string has already been processed, and the result of this

284

CHAPTER 4. SYNTAX ANALYSIS

processing is a sequence of states on top of the stack. The remainder of the string is still in the input, and the parser attempts to skip over the remainder of this string by looking for a symbol on the input that can legitimately follow A. By removing states from the stack, skipping over the input, and pushing GOTO (s , A) on the stack, the parser pretends that it has found an instance of A and resumes normal parsing. Phrase-level recovery is implemented by examining each error entry in the LR parsing table and deciding on the basis of language usage the most likely programmer error that would give rise to that error. An appropriate recovery procedure can then be constructed; presumably the top of the stack and / or first input symbols would be modified in a way deemed appropriate for each error entry. In designing specific error-handling routines for an LR parser, we can fill in each blank entry in the action field with a pointer to an error routine that will take the appropriate action selected by the compiler designer. The actions may include insertion or deletion of symbols from the stack or the input or both, or alteration and transposition of input symbols. We must make our choices so that the LR parser will not get into an infinite loop. A safe strategy will assure that at least one input symbol will be removed or shifted eventually, or that the stack will eventually shrink if the end of the input has been reached. Popping a stack state that covers a nonterminal should be avoided, because this modification eliminates from the stack a construct that has already been successfully parsed. Example 4.68 : Consider again the expression grammar

E -+ E + E I E * E I (E) I id

Figure 4.53 shows the LR parsing table from Fig. 4.49 for this grammar, modified for error detection and recovery. We have changed each state that calls for a particular reduction on some input symbols by replacing error entries in that state by the reduction. This change has the effect of postponing the error detection until one or more reductions are made, but the error will still be caught before any shift move takes place. The remaining blank entries from Fig. 4.49 have been replaced by calls to error routines. The error routines are as follows. e l : This routine is called from states 0, 2, 4 and 5, all of which expect the beginning of an operand, either an id or a left parenthesis. Instead, +, * , or the end of the input was found. push state 3 (the goto of states 0, 2, 4 and 5 on id) ; issue diagnostic "missing operand." e2 : Called from states 0, 1 , 2, 4 and 5 on finding a right parenthesis.

remove the right parenthesis from the input; issue diagnostic "unbalanced right parenthesis."

4.8.

285

USING AMBIGUOUS GRAMMARS STATE 0 1

2 3 4 5

6 7

8 9

GOTO

ACTION

id s3 e3 s3 r4 s3 s3 e3 rl r2 r3

+ el s4 el r4 el el s4 rl r2 r3

*

el s5 el r4 el el s5 s5 r2 r3

(

) s2 e2 e3 e2 s2 e2 r4 r4 s2 e2 s2 e2 e3 s9 rl rl r2 r2 r3 r3

$

E

el acc el r4 el el e4 rl r2 r3

1 6 7

8

Figure 4.53: LR parsing table with error routines e3: Called from states 1 or 6 when expecting an operator, and an id or right

parenthesis is found.

push state 4 (corresponding to symbol +) onto the stack; issue diagnostic "missing operator." e4: Called from state 6 when the end of the input is found.

push state 9 (for a right parenthesis) onto the stack; issue diagnostic "missing right parenthesis." On the erroneous input id + ) , the sequence of configurations entered by the parser is shown in Fig. 4.54. 0

4.8.4

Exercises for Section 4.8

! Exercise 4.8. 1 : The following is an ambiguous grammar for expressions with n binary, infix operators, at n different levels of precedence: E

-+

E (}1 E I E (}2 E I ' " E (}n E I ( E ) I id

a) As a function of n, what are the SLR sets of items? b) How would you resolve the conflicts in the SLR items so that all oper­ ators are left associative, and (}1 takes precedence over (}2 , which takes precedence over (}3 , and so on? c) Show the SLR parsing table that results from your decisions in part (b) .

286

CHAPTER 4. SYNTAX ANALYSIS STACK 0 03 o1 014

id E E+

INPUT id + ) $ +)$ +)$ )$

014

E+

$

0143 o147 o1

E + id E+ E+

$ $ $

SYMBOLS

ACTION

"unbalanced right parenthesis" e2 removes right parenthesis "missing operand" e1 pushes state 3 onto stack

Figure 4.54: Parsing and error recovery moves made by an LR parser d) Repeat parts (a) and (c) for the unambiguous grammar, which defines the same set of expressions, shown in Fig. 4.55. e) How do the counts of the number of sets of items and the sizes of the tables for the two (ambiguous and unambiguous) grammars compare? What does that comparison tell you about the use of ambiguous expression grammars? -+

El E2

-+

En En+ 1

-+

-+

El () E2 I E2 E2 () E3 I E3 En () En+1 I En+l ( El ) I id

Figure 4.55: Unambiguous grammar for n operators

! Exercise 4.8.2 : In Fig. 4.56 is a grammar for certain statements, similar to that discussed in Exercise 4.4.12. Again, e and s are terminals standing for conditional expressions and "other statements," respectively. a) Build an LR parsing table for this grammar, resolving conflicts in the usual way for the dangling-else problem. b) Implement error correction by filling in the blank entries in the parsing table with extra reduce-actions or suitable error-recovery routines. c) Show the behavior of your parser on the following inputs: (i) (ii)

if e then s ; if e then s end while e do begin s ; if e then s ; end

287

4.9. PARSER GENERATORS stmt

list

-+

I I

I I

if e then stmt if e then stmt else stmt while e do stmt begin list end s list ; stmt stmt

-+

I

Figure 4.56: A grammar for certain kinds of statements 4.9

Parser G enerat ors

This sectiop shows how a parser generator can be used to facilitate the construc­ tion of the front end of a compiler. We shall use the LALR parser generator Yacc as the basis of our discussion, since it implements many of the concepts discussed iT!- the previous two sections and it is widely available. Yacc stands for "yet another compiler-cpmpiler," reflecting the popularity of parser generators in the early 1970s when the first version of Yacc was created by S. C. Johnson. Yacc is available as a command on the UNIX system, and has been used to help implement many production compilers.

4.9. 1

The Parser Generator

Yac c

A translator can be constructed using Yacc in the manner illustrated in Fig. 4.57. First, a file, say translate . y, containing a Yacc specification of the translator is prepared. The UNIX system command yacc translate . y

transforms ttte file translat e . y into a C program called y . tab . c using the LALR method outlined in Algorithm 4.63. The program y . t ab ! c is a repre­ sentation of an LALR parser written in C, along with other C routines that the user may have prepared. The LALR parsing table is compacted as described in Section 4.7. By compiling y . tab . c along with the ly library tttat contains the LR parsing program using the command cc y . tab . c -ly

we obtain the desired object program a. out that performs the translation spec­ ified by the original Yacc program.7 If other procedures are needed, they cap be compiled or loaded with y . tab . c, just as with any C program. A Yacc source program has three parts: 7The name 1y is system dependent.

288

CHAPTER 4. SYNTAX ANALYSIS

I ��� I I � I �I

Yacc specification

translat e . y y . tab . c

input

i er



CO



com iler a . out

..

y . tab . c



a . out

..

output

Figure 4.57: Creating an input/output translator with Yacc declarations %% translation rules %% supporting C routines

Example 4.69 : Td illustrate how to prepare a Yacc source program, let us construct a simple desk calculator that reads an arithmetic expression, evaluates it, and then prints its numeric value. We shall build the desk calculator starting with the with the following grammar for arithmetic expressions: E

T F

-+ -+ -+

E + T I T T * F I F ( E ) I digit

The token digit is a single digit between b and 9. A Yacc desk calculator program derived from this grammar is shown in Fig. 4.58� 0

The Declarations Part There are . two sections in the declarations part of a Yacc program; both are optional. In the first section, we put ordinary C declarations, delimited by % { and % }. Here we place declarations of any temporaries used by the translation rules or procedures of the second and third sections. In Fig. 4.58, this section contains only the include-statement #include < ctype . h>

that causes the C preprocessor to include the standard header file < ctype . h> that contains the predicate i sdigi t . Also in the declarations part are declarations of grammar tokens. In Fig. 4.58, the statement %token DIGIT

289

4.9. PARSER GENERATORS %{ #inc lude < ctype , h> %} % t oken DIG IT %% l ine

expr ' \n '

{ printf ( "%d\ n " , $ 1 ) ; }

expr

expr ' + ' t erm term

{ $$ = $ 1 + $3 ; }

t erm

term ' * ' f actor f actor

{ $$

$ 1 * $3 ; }

' ( ' expr ' ) '

{ $$

$2 ; }

f actor

DIGIT %% yylex ( ) { int c ; c = get char 0 ; if ( i sdigit ( c » { yylval = c- ' O ' ; return DIGIT ; } return c ; }

Figure 4.58: Yacc specification of a simple desk calculator

declares DIGIT to be a token. Tokens declared in this section can then be used in the second and third parts of the Yacc specification. , If Lex is used to create the lexical analyzer that passes token to the Yacc parser, then these token declarations are also made available to the analyzer generated by L�x, as discussed in Section 3.5.2. The Translation Rules Part

In the part of the Yacc specification after the first %% pair, we put the translation rules . Each rule consists of a grammar production and the associated serp.antic action. A set of productions that we have been writing:

(head )

-+

(body h

would be written in Yacc as

( body h I . . . I (bodY ) n

290

CHAPTER (head)

4.

SYNTAX ANALYSIS

(bodY) l (bodY)2

{ (semantic actionh } { (semantic action)2 }

(bodY) n

{ (semantic action)n }

In a Yacc production, unquoted strings of letters and digits hot declared to be tokens are taken to be nontermirials. A quoted single character, e.g. ' t ' , is taken to be the terminal symbol c , as well as the integer code for the token represented by that character (i.e., Lex would return the character code for ' c ' to the parser, as an int�ger) . Alternative bodies can be separated by a vertical bar, and a semicolon follows each head with its aiternatives and their semantic actions. The first head is taken to be the start symbol. A Yacc semantic action is a sequence of C statements. In a semaritic action, the symbol $$ refers to the attribute value associated with the nonterminal of the head, while $i refers to the value associated with the ith grammar symbol (terminal or nonterminal) of the body. The semantic action is performed when­ ever we reduce by the associated production, so normally the semantic action computes a value for $$ in terms of the $i's. In the Yact specification, we have written the two E-productions

E -t E + T I T and their associated semantic actions as: expr

expr term

'+'

term

{ $$

$1 + $3 ; }

Note that the nonterminal term in the first production is the third grammar symbol of the body, while + is the second. The semantic action associated with the first production adds the value of the expr and the term of the body and assigns the result as the value for the nonterminal expr of the head. We have omitted the semantic actioh for the second production altogether, since copying the value is the default action for productions with a single grammar symbol in the body. In general, { $$ = $ 1 ; } is the default semantie action. Notice that we have added a new starting production line : expr ' \n '

{ printf C " %d\n" , $ 1 ) ; }

to the Yacc specification. This production says that an input to the desk calculator is to be an expression followed by a newline character. The semantic action associated with this production p rints the decimal value of the expression followed by a newline character.

291

4.9. PARSER GENERATORS The Supporting C-Routines Part

The third part of a Yacc specification consists of supporting C-routines. A lexical analyzer by the name yylex 0 must be provided. Using Lex to produce yylex 0 is a common choice; see Section 4.9.3. Other procedures such as error recovery routines may be added as necessary. The lexical analyzer yylex 0 produces tokens consisting of a token name and its associated attribute value. If a token name such as DIGIT is returned, the token name must be declared in the first section of the Yacc specification. The attribute value associated with a token is communicated to the parser through a Yacc-defined variable yylval. The lexical analyzer in Fig. 4.58 is very crude. It reads input characters one at a time using the C-function get char O . If the character is a digit, the value of the digit is stored in the variable yylval, and the token name DIGIT is returned. Otherwise, the character itself is returned as the token name.

4.9.2

Using

Yac c

with Ambiguous Grammars

Let us now modify the Yacc specification so that the resulting desk calculator becomes more useful. First, we shall allow the desk calculator to evaluate a sequence of expressions, one to a line. We shall also allow blank lines between expressions. We do so by changing the first rule to lines

lines expr ' \n ' lines ' \n ' / * empty */

{ printf ( " %g\n" , $2) ; }

In Yacc, an empty alternative, as the third line is, denotes Eo Second, we shall enlarge the class of expressions to include numbers instead of single digits and to include the arithmetic operators +, - , ( both binary and unary ) , * , and /. The easiest way to specify this class of expressions is to use the ambiguous grammar

E --+ E + E I E - E I E

*

E I E / E I - E I number

The resulting Yacc specification is shown in Fig. 4.59. Since the grammar in the Yacc specification in Fig. 4.59 is ambiguous, the LALR algorithm will generate parsing-action conflicts. Yacc reports the num­ ber of parsing-action conflicts that are generated. A description of the sets of items and the parsing-action conflicts can be obtained by invoking Yacc with a -v option. This option generates an additional file y . output that contains the kernels of the sets of items found for the grammar, a description of the parsing action conflicts generated by the LALR algorithm, and a readable represen­ tation of the LR parsing table showing how the parsing action conflicts were resolved. Whenever Yacc reports that it has found parsing-action conflicts, it

CHAPTER 4. SYNTAX ANALYSIS

292

%{ #include #include #def ine YYSTYPE double %} %token NUMBER

1* double type f or Yacc stack *1

%left ' + ' , - , %left ' * ' ' I ' %right UMINUS %% lines

lines expr ' \n ' lines ' \n ' 1 * empty *1

expr

expr ' + ' expr , - , expr ' * ' expr ' I ' ' ( ' expr , - , expr NUMBER

{ printf ( " %g\n " , $2 ) j }

{ $$ expr { $$ expr { $$ expr { $$ expr { $$ ,) , %prec UMINUS

= $ 1 + $3 j } = $ 1 - $3 j } = $ 1 * $3 j } = $ 1 I $3 j } $2 ; } { $$ = - $2 j }

%% yylex ( ) { int C j while ( ( c = get char ( ) ) == , , ) j if ( ( c == ' . ' ) I I ( isdigit ( c ) ) ) { unget c ( c , stdin) ; scanf ( " %lf " , &yylval) j return NUMBER j } return C j }

Figure 4.59: Yacc specification for a more advanced desk calculator.

4 . 9.



293

PARSER GENERAT RS

is wise to create and consult the file y . output to see why the parsing-action conflicts were generated and to see whether they were resolved correctly. Unless otherwise instructed Yacc will resolve all parsing action conflicts using the following two rules: 1 . A reduce / reduce conflict is resolved by choosing the conflicting production listed first in the Yacc specification. 2. A shift / reduce conflict is resolved in favor of shift. This rule resolves the

shift / reduce conflict arising from the dangling-else ambiguity correctly.

Since these default rules may not always be what the compiler writer wants,

Yacc provides a general mechanism for resolving shift / reduce conflicts. In the

declarations portion, we can assign precedences and associativities to terminals. The declaration %left ' + ' , - ,

makes + and - be of the same precedence and be left associative. We can declare an operator to be right associative by writing %right , � ,

and we can force an operator to be a nonassociative binary operator (Le. , two occurrences of the operator cannot be combined at all ) by writing %nonassoc ' < '

The tokens are given precedences in the order in which they appear in the declarations part, lowest first. Tokens in the same declaration have the same precedence. Thus, the declaration %right UMINUS

in Fig. 4.59 gives the token UMINUS a precedence level higher than that of the five preceding terminals. Yacc resolves shift / reduce conflicts by attaching a precedence and associa­ tivity to each production involved in a conflict, as well as to each terminal involved in a conflict. If it must choose between shifting input symbol a and re­ ducing by production A -+ 0:', Yacc reduces if the precedence of the production is greater than that of a, or if the precedences are the same and the associativity of the production is left. Otherwise, shift is the chosen action. Normally, the precedence of a production is taken to be the same as that of its rightmost terminal. This is the sensible decision in most cases. For example, given productions

E

-+

E+E I E+E

294

CHAPTER 4. SYNTAX ANALYSIS

we would prefer to reduce by E -+ E + E with lookahead +, because the + in the body has the same precedence as the lookahead, but is left associative. With lookahead * , we would prefer to shift, because the lookahead has higher precedence than the + in the production. In those situations where the rightmost terminal does not supply the proper precedence to a production, we can force a precedence by appending to a pro­ duction the tag %prec (terminal)

The precedence and associativity of the production will then be the same as that of the terminal, which presumably is defined in the declaration section. Yacc does not report shift/reduce conflicts that are resolved using this precedence and associativity mechanism. This "terminal" can be a placeholder, like UMINUS in Fig. 4.59; this termi­ nal is not returned by the lexical analyzer, but is declared solely to define a precedence for a production. In Fig. 4.59, the declaration %right UMINUS

assigns to the token UMINUS a precedence that is higher than that of * and /. In the translation rules part, the tag

    %prec UMINUS

at the end of the production

    expr : '-' expr

makes the unary-minus operator in this production have a higher precedence than any other operator.

4.9.3 Creating Yacc Lexical Analyzers with Lex

Lex was designed to produce lexical analyzers that could be used with Yacc. The Lex library -ll will provide a driver program named yylex(), the name required by Yacc for its lexical analyzer. If Lex is used to produce the lexical analyzer, we replace the routine yylex() in the third part of the Yacc specification by the statement

    #include "lex.yy.c"

and we have each Lex action return a terminal known to Yacc. By using the #include "lex.yy.c" statement, the program yylex has access to Yacc's names for tokens, since the Lex output file is compiled as part of the Yacc output file y.tab.c. Under the UNIX system, if the Lex specification is in the file first.l and the Yacc specification in second.y, we can say

    lex first.l
    yacc second.y
    cc y.tab.c -ly -ll

to obtain the desired translator. The Lex specification in Fig. 4.60 can be used in place of the lexical analyzer in Fig. 4.59. The last pattern, meaning "any character," must be written \n|. since the dot in Lex matches any character except newline.

number    [0-9]+\.?|[0-9]*\.[0-9]+
%%
[ ]        { /* skip blanks */ }
{number}   { sscanf(yytext, "%lf", &yylval); return NUMBER; }
\n|.       { return yytext[0]; }

Figure 4.60: Lex specification for yylex() in Fig. 4.59

4.9.4 Error Recovery in Yacc

In Yacc, error recovery uses a form of error productions. First, the user decides what "major" nonterminals will have error recovery associated with them. Typical choices are some subset of the nonterminals generating expressions, statements, blocks, and functions. The user then adds to the grammar error productions of the form A → error α, where A is a major nonterminal and α is a string of grammar symbols, perhaps the empty string; error is a Yacc reserved word. Yacc will generate a parser from such a specification, treating the error productions as ordinary productions.

However, when the parser generated by Yacc encounters an error, it treats the states whose sets of items contain error productions in a special way. On encountering an error, Yacc pops symbols from its stack until it finds the topmost state on its stack whose underlying set of items includes an item of the form A → · error α. The parser then "shifts" a fictitious token error onto the stack, as though it saw the token error on its input.

When α is ε, a reduction to A occurs immediately and the semantic action associated with the production A → error (which might be a user-specified error-recovery routine) is invoked. The parser then discards input symbols until it finds an input symbol on which normal parsing can proceed.

If α is not empty, Yacc skips ahead on the input looking for a substring that can be reduced to α. If α consists entirely of terminals, then it looks for this string of terminals on the input, and "reduces" them by shifting them onto the stack. At this point, the parser will have error α on top of its stack. The parser will then reduce error α to A and resume normal parsing. For example, an error production of the form


%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double  /* double type for Yacc stack */
%}

%token NUMBER

%left '+' '-'
%left '*' '/'
%right UMINUS

%%

lines : lines expr '\n'  { printf("%g\n", $2); }
      | lines '\n'
      | /* empty */
      | error '\n'  { yyerror("reenter previous line:"); yyerrok; }
      ;
expr  : expr '+' expr          { $$ = $1 + $3; }
      | expr '-' expr          { $$ = $1 - $3; }
      | expr '*' expr          { $$ = $1 * $3; }
      | expr '/' expr          { $$ = $1 / $3; }
      | '(' expr ')'           { $$ = $2; }
      | '-' expr  %prec UMINUS { $$ = - $2; }
      | NUMBER
      ;

%%

#include "lex.yy.c"

Figure 4.61: Desk calculator with error recovery

    stmt → error ;

would specify to the parser that it should skip just beyond the next semicolon on seeing an error, and assume that a statement had been found. The semantic routine for this error production would not need to manipulate the input, but could generate a diagnostic message and set a flag to inhibit generation of object code, for example.
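As a concrete sketch (the stmt nonterminal, the diagnostic text, and the suppress_codegen flag here are all invented for illustration, not part of any particular grammar), such a production and its action might be written in a Yacc specification as:

    stmt : error ';'
             { yyerror("syntax error - statement discarded");
               suppress_codegen = 1;  /* hypothetical flag checked by the code generator */
             }
         ;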

Example 4 . 70 : Figure 4.61 shows the Yacc desk calculator of Fig. 4.59 with the error production

    lines : error '\n'

This error production causes the desk calculator to suspend normal parsing when a syntax error is found on an input line. On encountering the error,


the parser in the desk calculator starts popping symbols from its stack until it encounters a state that has a shift action on the token error. State 0 is such a state (in this example, it's the only such state), since its items include

    lines → · error '\n'

Also, state 0 is always on the bottom of the stack. The parser shifts the token error onto the stack, and then proceeds to skip ahead in the input until it has found a newline character. At this point the parser shifts the newline onto the stack, reduces error ' \n ' to lines, and emits the diagnostic message "reenter previous line:" . The special Yacc routine yyerrok resets the parser to its normal mode of operation. 0

4.9.5 Exercises for Section 4.9

! Exercise 4.9.1: Write a Yacc program that takes boolean expressions as input

[as given by the grammar of Exercise 4.2.2(g)] and produces the truth value of the expressions.

! Exercise 4.9.2 : Write a Yacc program that takes lists (as defined by the

grammar of Exercise 4.2.2(e) , but with any single character as an element, not just a ) and produces as output a linear representation of the same list; i.e., a single list of the elements, in the same order that they appear in the input.

! Exercise 4.9.3: Write a Yacc program that tells whether its input is a palindrome (sequence of characters that read the same forward and backward).

!! Exercise 4.9.4: Write a Yacc program that takes regular expressions (as defined by the grammar of Exercise 4.2.2(d), but with any single character as an argument, not just a) and produces as output a transition table for a nondeterministic finite automaton recognizing the same language.

4.10 Summary of Chapter 4

• Parsers. A parser takes as input tokens from the lexical analyzer and treats the token names as terminal symbols of a context-free grammar. The parser then constructs a parse tree for its input sequence of tokens; the parse tree may be constructed figuratively (by going through the corresponding derivation steps) or literally.

• Context-Free Grammars. A grammar specifies a set of terminal symbols (inputs), another set of nonterminals (symbols representing syntactic constructs), and a set of productions, each of which gives a way in which strings represented by one nonterminal can be constructed from terminal symbols and strings represented by certain other nonterminals. A production consists of a head (the nonterminal to be replaced) and a body (the replacing string of grammar symbols).


• Derivations. The process of starting with the start-nonterminal of a grammar and successively replacing it by the body of one of its productions is called a derivation. If the leftmost (or rightmost) nonterminal is always replaced, then the derivation is called leftmost (respectively, rightmost).

• Parse Trees. A parse tree is a picture of a derivation, in which there is a node for each nonterminal that appears in the derivation. The children of a node are the symbols by which that nonterminal is replaced in the derivation. There is a one-to-one correspondence between parse trees, leftmost derivations, and rightmost derivations of the same terminal string.

• Ambiguity. A grammar for which some terminal string has two or more different parse trees, or equivalently two or more leftmost derivations or two or more rightmost derivations, is said to be ambiguous. In most cases of practical interest, it is possible to redesign an ambiguous grammar so it becomes an unambiguous grammar for the same language. However, ambiguous grammars with certain tricks applied sometimes lead to more efficient parsers.

• Top-Down and Bottom-Up Parsing. Parsers are generally distinguished by whether they work top-down (start with the grammar's start symbol and construct the parse tree from the top) or bottom-up (start with the terminal symbols that form the leaves of the parse tree and build the tree from the bottom). Top-down parsers include recursive-descent and LL parsers, while the most common forms of bottom-up parsers are LR parsers.

• Design of Grammars. Grammars suitable for top-down parsing often are harder to design than those used by bottom-up parsers. It is necessary to eliminate left-recursion, a situation where one nonterminal derives a string that begins with the same nonterminal. We also must left-factor, that is, group productions for the same nonterminal that have a common prefix in the body.

• Recursive-Descent Parsers. These parsers use a procedure for each nonterminal. The procedure looks at its input and decides which production to apply for its nonterminal. Terminals in the body of the production are matched to the input at the appropriate time, while nonterminals in the body result in calls to their procedure. Backtracking, in the case when the wrong production was chosen, is a possibility.

• LL(1) Parsers. A grammar such that it is possible to choose the correct production with which to expand a given nonterminal, looking only at the next input symbol, is called LL(1). These grammars allow us to construct a predictive parsing table that gives, for each nonterminal and each lookahead symbol, the correct choice of production. Error correction can be facilitated by placing error routines in some or all of the table entries that have no legitimate production.


• Shift-Reduce Parsing. Bottom-up parsers generally operate by choosing, on the basis of the next input symbol (lookahead symbol) and the contents of the stack, whether to shift the next input onto the stack, or to reduce some symbols at the top of the stack. A reduce step takes a production body at the top of the stack and replaces it by the head of the production.

• Viable Prefixes. In shift-reduce parsing, the stack contents are always a viable prefix, that is, a prefix of some right-sentential form that ends no further right than the end of the handle of that right-sentential form. The handle is the substring that was introduced in the last step of the rightmost derivation of that sentential form.

• Valid Items. An item is a production with a dot somewhere in the body. An item is valid for a viable prefix if the production of that item is used to generate the handle, and the viable prefix includes all those symbols to the left of the dot, but not those below.

• LR Parsers. Each of the several kinds of LR parsers operates by first constructing the sets of valid items (called LR states) for all possible viable prefixes, and keeping track of the state for each prefix on the stack. The set of valid items guides the shift-reduce parsing decision. We prefer to reduce if there is a valid item with the dot at the right end of the body, and we prefer to shift the lookahead symbol onto the stack if that symbol appears immediately to the right of the dot in some valid item.

• Simple LR Parsers. In an SLR parser, we perform a reduction implied by a valid item with a dot at the right end, provided the lookahead symbol can follow the head of that production in some sentential form. The grammar is SLR, and this method can be applied, if there are no parsing-action conflicts; that is, for no set of items, and for no lookahead symbol, are there two productions to reduce by, nor is there the option to reduce or to shift.

• Canonical-LR Parsers. This more complex form of LR parser uses items that are augmented by the set of lookahead symbols that can follow the use of the underlying production. Reductions are only chosen when there is a valid item with the dot at the right end, and the current lookahead symbol is one of those allowed for this item. A canonical-LR parser can avoid some of the parsing-action conflicts that are present in SLR parsers, but often has many more states than the SLR parser for the same grammar.

• Lookahead-LR Parsers. LALR parsers offer many of the advantages of SLR and canonical-LR parsers, by combining the states that have the same kernels (sets of items, ignoring the associated lookahead sets). Thus, the number of states is the same as that of the SLR parser, but some parsing-action conflicts present in the SLR parser may be removed in the LALR parser. LALR parsers have become the method of choice in practice.


• Bottom-Up Parsing of Ambiguous Grammars. In many important situations, such as parsing arithmetic expressions, we can use an ambiguous grammar, and exploit side information such as the precedence of operators to resolve conflicts between shifting and reducing, or between reduction by two different productions. Thus, LR parsing techniques extend to many ambiguous grammars.

• Yacc. The parser-generator Yacc takes a (possibly) ambiguous grammar and conflict-resolution information and constructs the LALR states. It then produces a function that uses these states to perform a bottom-up parse and call an associated function each time a reduction is performed.

4.11 References for Chapter 4

The context-free grammar formalism originated with Chomsky [5] , as part of a study on natural language. The idea also was used in the syntax description of two early languages: Fortran by Backus [2] and Algol 60 by N aur [26] . The scholar Panini devised an equivalent syntactic notation to specify the rules of Sanskrit grammar between 400 B.C. and 200 B.C. [19] . The phenomenon of ambiguity was observed first by Cantor [4] and Floyd [13] . Chomsky Normal Form (Exercise 4.4.8) is from [6] . The theory of context­ free grammars is summarized in [17] . Recursive-descent parsing was the method of choice for early compilers, such as [16] , and compiler-writing systems, such as META [28] and TMG [25] . LL grammars were introduced by Lewis and Stearns [24] . Exercise 4.4.5, the linear-time simulation of recursive-descent, is from [3] . One of the earliest parsing techniques, due to Floyd [14] , involved the prece­ dence of operators. The idea was generalized to parts of the language that do not involve operators by Wirth and Weber [29] . These techniques are rarely used today, but can be seen as leading in a chain of improvements to LR parsing. LR parsers were introduced by Knuth [22] , and the canonical-LR parsing tables originated there. This approach was not considered practical, because the parsing tables were larger than the main memories of typical computers of the day, until Korenjak [23] gave a method for producing reasonably sized parsing tables for typical programming languages. DeRemer developed the LALR [8] and SLR [9] methods that are in use today. The construction of LR parsing tables for ambiguous grammars came from [1] and [12] . Johnson's Yacc very quickly demonstrated the practicality of generating parsers with an LALR parser generator for production compilers. The manual for the Yacc parser generator is found in [20] . The open-source version, Bison, is described in [10] . A similar LALR-based parser generator called CUP [18] supports actions written in Java. Top-down parser generators incude Antlr [27] , a recursive-descent parser generator that accepts actions in C++, Java, or C#, and LLGen [15] , which is an LL(I)-based generator. Dain [7] gives a bibliography on syntax-error handling.


The general-purpose dynamic-programming parsing algorithm described in Exercise 4.4.9 was invented independently by J. Cocke (unpublished), by Younger [30], and by Kasami [21]; hence the "CYK algorithm." There is a more complex, general-purpose algorithm due to Earley [11] that tabulates LR-items for each substring of the given input; this algorithm, while also O(n^3) in general, is only O(n^2) on unambiguous grammars.

1 . Aho, A. V., S. C. Johnson, and J. D . Ullman, "Deterministic parsing of ambiguous grammars," Comm. A CM 18:8 ( Aug., 1975) , pp. 441-452. 2. Backus, J.W, "The syntax and semantics of the proposed international algebraic language of the Zurich-ACM-GAMM Conference," Proc. Inti. Conf. Information Processing, UNESCO, Paris, (1959) pp. 125-132. 3. Birman, A. and J. D. Ullman, "Parsing algorithms with backtrack," In­ formation and Control 23: 1 (1973) , pp. 1-34. 4. Cantor, D. C . , "On the ambiguity problem of Backus systems," J. A CM 9 :4 (1962) , pp. 477-479. 5. Chomsky, N., "Three models for the description of language," IRE Trans. on Information Theory IT-2:3 ( 1956) , pp. 1 13-124. 6. Chomsky, N., "On certain formal properties of grammars," Information and Control 2:2 (1959) , pp. 137-167. 7. Dain, J., "Bibliography on Syntax Error Handling in Language Transla­ tion Systems," 1991. Available from the comp . compilers newsgroup; see http : //compilers . iecc . com/ comparch/art i cle/91 -04-050 .

8. DeRemer, F., "Practical Translators for LR ( k ) Languages," Ph.D. thesis, MIT, Cambridge, MA, 1969. 9. DeRemer, F., "Simple LR ( k ) grammars," Comm. A CM 14:7 ( July, 1971 ) , pp. 453-460. 10. Donnelly, C. and R. Stallman, "Bison: The YACC-compatible Parser Generator," http : //www . gnu . org/ software/bison/manual/ . 1 1 . Earley, J . , "An efficient context-free parsing algorithm," Comm. A CM 13:2 ( Feb. , 1970) , pp. 94-102. 12. Earley, J., "Ambiguity and precedence in syntax description," Acta In­ formatica 4:2 ( 1975), pp. 183-192. 13. Floyd, R. W., "On ambiguity in phrase-structure languages," Comm. A CM 5 : 10 ( Oct., 1962) , pp. 526-534. 14. Floyd, R. W., "Syntactic analysis and operator precedence," J. A CM 10:3 (1963) , pp. 316-333.

302

CHAPTER 4. SYNTAX ANALYSIS

15. Grune, D and C. J. H. Jacobs, "A programmer-friendly LL(l) parser generator," Software Practice and Experience 18:1 (Jan., 1988) , pp. 2938. See also http : //www . cs . vu . nl/- ceriel/LLgen . html . 16. Hoare, C. A. R., "Report on the Elliott Algol translator," Computer J. 5:2 (1962), pp. 127-129. 17. Hopcroft, J. E., R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Boston MA, 2001. 18. Hudson, S. E. et al. , "CUP LALR Parser Generator in Java," Available at http : //www2 . cs . tum . edu/proj ect s/cup/ .

19. Ingerman, P. Z., "Panini-Backus form suggested," Comm. A CM 10:3 (March 1967) , p. 137. 20. Johnson, S. C., "Yacc - Yet Another Compiler Compiler," Computing Science Technical Report 32, Bell Laboratories, Murray Hill, NJ, 1975. Available at http : //dinosaur . comp ilertools . net/yacc/ . 21. Kasami, T., "An efficient recognition and syntax analysis algorithm for context-free languages," AFCRL-65-758, Air Force Cambridge Research Laboratory, Bedford, MA, 1965. 22. Knuth, D. E., "On the translation of languages from left to right," Infor­ mation and Control 8:6 (1965) , pp. 607-639. 23. Korenjak, A. J., "A practical method for constructing LR(k) processors," Comm. A CM 12:11 (Nov., 1969) , pp. 613-623. 24. Lewis, P. M. II and R. E. Stearns, "syntax-directed transduction," J. A CM 15:3 (1968) , pp. 465-488. 25. McClure, R. M., "TMG - a syntax-directed compiler," proc. 20th A CM Natl. Conf. (1965) , pp. 262-274. 26. Naur, P. et al., "Report on the algorithmic language ALGOL 60," Comm. A CM 3:5 (May, 1960) , pp. 299-314. See also Comm. A CM 6:1 (Jan., 1963) , pp. 1-17. 27. Parr, T., "ANTLR," http : //www . ant lr . org/ . 28. Schorre, D. V., "Meta-II: a syntax-oriented compiler writing language," Proc. 19th A CM Natl. Conj. (1964) pp. D 1 .3-1-D1.3-11. 29. Wirth, N. and H. Weber, "Euler: a generalization of Algol and its formal definition: Part I," Comm. A CM 9:1 (Jan., 1966) , pp. 13-23. 30. Younger, D.H., "Recognition and parsing of context-free languages in time n3 ," Information and Control 10:2 (1967) , pp. 189-208.

Chapter 5

Syntax-Directed Translation

This chapter develops the theme of Section 2.3: the translation of languages guided by context-free grammars. The translation techniques in this chapter will be applied in Chapter 6 to type checking and intermediate-code generation. The techniques are also useful for implementing little languages for specialized tasks; this chapter includes an example from typesetting.

We associate information with a language construct by attaching attributes to the grammar symbol(s) representing the construct, as discussed in Section 2.3.2. A syntax-directed definition specifies the values of attributes by associating semantic rules with the grammar productions. For example, an infix-to-postfix translator might have a production and rule

    PRODUCTION            SEMANTIC RULE
    E → E1 + T            E.code = E1.code || T.code || '+'        (5.1)

This production has two nonterminals, E and T; the subscript in E1 distinguishes the occurrence of E in the production body from the occurrence of E as the head. Both E and T have a string-valued attribute code. The semantic rule specifies that the string E.code is formed by concatenating E1.code, T.code, and the character '+'. While the rule makes it explicit that the translation of E is built up from the translations of E1, T, and '+', it may be inefficient to implement the translation directly by manipulating strings.

From Section 2.3.5, a syntax-directed translation scheme embeds program fragments called semantic actions within production bodies, as in

    E → E1 + T   { print '+' }        (5.2)

By convention, semantic actions are enclosed within curly braces. (If curly braces occur as grammar symbols, we enclose them within single quotes, as in


'{' and '}'.) The position of a semantic action in a production body determines the order in which the action is executed. In production (5.2), the action occurs at the end, after all the grammar symbols; in general, semantic actions may occur at any position in a production body.

Between the two notations, syntax-directed definitions can be more readable, and hence more useful for specifications. However, translation schemes can be more efficient, and hence more useful for implementations.

The most general approach to syntax-directed translation is to construct a parse tree or a syntax tree, and then to compute the values of attributes at the nodes of the tree by visiting the nodes of the tree. In many cases, translation can be done during parsing, without building an explicit tree. We shall therefore study a class of syntax-directed translations called "L-attributed translations" (L for left-to-right), which encompass virtually all translations that can be performed during parsing. We also study a smaller class, called "S-attributed translations" (S for synthesized), which can be performed easily in connection with a bottom-up parse.

5.1 Syntax-Directed Definitions

A syntax-directed definition ( SDD ) is a context-free grammar together with attributes and rules. Attributes are associated with grammar symbols and rules are associated with productions. If X is a symbol and a is one of its attributes, then we write X.a to denote the value of a at a particular parse-tree node labeled X. If we implement the nodes of the parse tree by records or objects, then the attributes of X can be implemented by data fields in the records that represent the nodes for X. Attributes may be of any kind: numbers, types, table references, or strings, for instance. The strings may even be long sequences of code, say code in the intermediate language used by a compiler.
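For instance, under the record implementation just mentioned, a parse-tree node for a symbol with a numeric attribute val and a string attribute code might be declared as follows. This is only a sketch; the field names are illustrative rather than prescribed by the SDD formalism.

    struct ptnode {
        char           *symbol;     /* grammar symbol X labeling this node   */
        int             nchildren;  /* number of children in the parse tree  */
        struct ptnode **children;   /* the child nodes, left to right        */
        double          val;        /* attribute X.val, if X has one         */
        char           *code;       /* attribute X.code, e.g. a code string  */
    };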

5.1.1 Inherited and Synthesized Attributes

We shall deal with two kinds of attributes for nonterminals:

1. A synthesized attribute for a nonterminal A at a parse-tree node N is defined by a semantic rule associated with the production at N. Note that the production must have A as its head. A synthesized attribute at node N is defined only in terms of attribute values at the children of N and at N itself.

2. An inherited attribute for a nonterminal B at a parse-tree node N is defined by a semantic rule associated with the production at the parent of N. Note that the production must have B as a symbol in its body. An inherited attribute at node N is defined only in terms of attribute values at N's parent, N itself, and N's siblings.


An Alternative Definition of Inherited Attributes

No additional translations are enabled if we allow an inherited attribute B.c at a node N to be defined in terms of attribute values at the children of N, as well as at N itself, at its parent, and at its siblings. Such rules can be "simulated" by creating additional attributes of B, say B.c1, B.c2, . . . . These are synthesized attributes that copy the needed attributes of the children of the node labeled B. We then compute B.c as an inherited attribute, using the attributes B.c1, B.c2, . . . in place of attributes at the children. Such attributes are rarely needed in practice.






While we do not allow an inherited attribute at node N to be defined in terms of attribute values at the children of node N, we do allow a synthesized attribute at node N to be defined in terms of inherited attribute values at node N itself.

Terminals can have synthesized attributes, but not inherited attributes. Attributes for terminals have lexical values that are supplied by the lexical analyzer; there are no semantic rules in the SDD itself for computing the value of an attribute for a terminal.

Example 5.1: The SDD in Fig. 5.1 is based on our familiar grammar for arithmetic expressions with operators + and *. It evaluates expressions terminated by an endmarker n. In the SDD, each of the nonterminals has a single synthesized attribute, called val. We also suppose that the terminal digit has a synthesized attribute lexval, which is an integer value returned by the lexical analyzer.

        PRODUCTION          SEMANTIC RULES
    1)  L → E n             L.val = E.val
    2)  E → E1 + T          E.val = E1.val + T.val
    3)  E → T               E.val = T.val
    4)  T → T1 * F          T.val = T1.val × F.val
    5)  T → F               T.val = F.val
    6)  F → ( E )           F.val = E.val
    7)  F → digit           F.val = digit.lexval






Figure 5.1: Syntax-directed definition of a simple desk calculator

The rule for production 1, L → E n, sets L.val to E.val, which we shall see is the numerical value of the entire expression.

Production 2, E → E1 + T, also has one rule, which computes the val attribute for the head E as the sum of the values at E1 and T. At any parse-


tree node N labeled E, the value of val for E is the sum of the values of val at the children of node N labeled E and T.

Production 3, E → T, has a single rule that defines the value of val for E to be the same as the value of val at the child for T. Production 4 is similar to the second production; its rule multiplies the values at the children instead of adding them. The rules for productions 5 and 6 copy values at a child, like that for the third production. Production 7 gives F.val the value of a digit, that is, the numerical value of the token digit that the lexical analyzer returned. 0

An SDD that involves only synthesized attributes is called S-attributed; the SDD in Fig. 5.1 has this property. In an S-attributed SDD, each rule computes an attribute for the nonterminal at the head of a production from attributes taken from the body of the production.

For simplicity, the examples in this section have semantic rules without side effects. In practice, it is convenient to allow SDD's to have limited side effects, such as printing the result computed by a desk calculator or interacting with a symbol table. Once the order of evaluation of attributes is discussed in Section 5.2, we shall allow semantic rules to compute arbitrary functions, possibly involving side effects.

An SDD can be implemented naturally in conjunction with an LR parser when it is S-attributed. In fact, the SDD in Fig. 5.1 mirrors the Yacc program of Fig. 4.58, which illustrates translation during LR parsing. The difference is that, in the rule for production 1, the Yacc program prints the value E.val as a side effect, instead of defining the attribute L.val.

An SDD without side effects is sometimes called an attribute grammar. The rules in an attribute grammar define the value of an attribute purely in terms of the values of other attributes and constants.
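To make the correspondence with LR parsing concrete, here is a sketch of how the rules of Fig. 5.1 might appear in Yacc notation. It is only an illustration, not the program of Fig. 4.58; the nonterminal names, the DIGIT token, and the assumption that the stack type YYSTYPE holds the integer val values are ours.

    lines : lines expr '\n'   { printf("%d\n", $2); }  /* side effect in place of L.val = E.val */
          ;
    expr  : expr '+' term     { $$ = $1 + $3; }        /* E.val = E1.val + T.val  */
          | term              { $$ = $1; }             /* E.val = T.val           */
          ;
    term  : term '*' factor   { $$ = $1 * $3; }        /* T.val = T1.val x F.val  */
          | factor            { $$ = $1; }             /* T.val = F.val           */
          ;
    factor: '(' expr ')'      { $$ = $2; }             /* F.val = E.val           */
          | DIGIT             { $$ = $1; }             /* F.val = digit.lexval    */
          ;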

5.1.2 Evaluating an SDD at the Nodes of a Parse Tree

To visualize the translation specified by an SDD, it helps to work with parse trees, even though a translator need not actually build a parse tree. Imagine therefore that the rules of an SDD are applied by first constructing a parse tree and then using the rules to evaluate all of the attributes at each of the nodes of the parse tree. A parse tree, showing the value(s) of its attribute(s) is called an annotated parse tree. How do we construct an annotated parse tree? In what order do we evaluate attributes? Before we can evaluate an attribute at a node of a parse tree, we must evaluate all the attributes upon which its value depends . . For example, if all attributes are synthesized, as in Example 5.1, then we must evaluate the val attributes at all of the children of a node before we can evaluate the val attribute at the node itself. With synthesized attributes, we can evaluate attributes in any bottom-up order, such as that of a postorder traversal of the parse tree; the evaluation of S-attributed definitions is discussed in Section 5.2.3.


For SDD's with both inherited and synthesized attributes, there is no guarantee that there is even one order in which to evaluate attributes at nodes. For instance, consider nonterminals A and B, with synthesized and inherited attributes A.s and B.i, respectively, along with the production and rules

    PRODUCTION          SEMANTIC RULES
    A → B               A.s = B.i;  B.i = A.s + 1


These rules are circular; it is impossible to evaluate either A.s at a node N or B.i at the child of N without first evaluating the other. The circular dependency of A.s and B.i at some pair of nodes in a parse tree is suggested by Fig. 5.2.

Figure 5.2: The circular dependency of A.s and B.i on one another

It is computationally difficult to determine whether or not there exist any circularities in any of the parse trees that a given SDD could have to translate.1 Fortunately, there are useful subclasses of SDD's that are sufficient to guarantee that an order of evaluation exists, as we shall see in Section 5.2.

Example 5.2: Figure 5.3 shows an annotated parse tree for the input string 3 * 5 + 4 n, constructed using the grammar and rules of Fig. 5.1. The values of lexval are presumed supplied by the lexical analyzer. Each of the nodes for the nonterminals has attribute val computed in a bottom-up order, and we see the resulting values associated with each node. For instance, at the node with a child labeled *, after computing T.val = 3 and F.val = 5 at its first and third children, we apply the rule that says T.val is the product of these two values, or 15. 0

Inherited attributes are useful when the structure of a parse tree does not "match" the abstract syntax of the source code. The next example shows how inherited attributes can be used to overcome such a mismatch due to a grammar designed for parsing rather than translation.

1 Without going into details, while the problem is decidable, it cannot be solved by a polynomial-time algorithm, even if P = NP, since it has exponential time complexity.


[The annotated parse tree of Figure 5.3 shows L.val = 19 at the root; E.val = 19, computed from E.val = 15 and T.val = 4 around +; T.val = 15, computed from T.val = 3 and F.val = 5 around *; and leaves with digit.lexval = 3, 5, and 4.]

digit. lexval = 4 5

digit . lexval = 3

Figure 5.3: Annotated parse tree for 3 * 5 + 4 n

Example 5 . 3 : The SDD in Fig. 5.4 computes terms like 3 * 5 and 3 * 5 * 7. The top-down parse of input 3 * 5 begins with the production T -+ F T'. Here, F generates the digit 3, but the operator * is generated by T'. Thus, the left operand 3 appears in a different subtree of the parse tree from * . An inherited attribute will therefore be used to pass the operand to the operator. The grammar in this example is an excerpt from a non-left-recursive version of the familiar expression grammar; we used such a grammar as a running example to illustrate top-down parsing in Section 4.4. PRODUCTION

SEMANTIC RULES

1)

T -+ F T'

T'. inh = F.val T. val = T' .syn

2)

T' -+ * F T{

T{ .inh = T'.inh x F. val T' . syn = T{ . syn

3)

T'

T'.syn = T' .inh

4)

F -+ digit

-+ E

F.val = digit.lexval

Figure 5.4: An SDD based on a grammar suitable for top-down parsing

Each of the nonterminals T and F has a synthesized attribute val; the terminal digit has a synthesized attribute lexval. The nonterminal T' has two attributes: an inherited attribute inh and a synthesized attribute syn.

5. 1 .

309

SYNTAX-DIRECTED DEFINITIONS

The semantic rules are based on the idea that the left operand of the operator * is inherited. More precisely, the head T' of the production T' --+ * F T{ inherits the left operand of * in the production body. Given a term x * y * z , the root of the subtree for * y * z inherits x. Then, the root of the subtree for * z inherits the value of x * y, and so on, if there are more factors in the term. Once all the factors have been accumulated, the result is passed back up the tree using synthesized attributes. To see how the semantic rules are used, consider the annotated parse tree for 3 * 5 in Fig. 5.5. The leftmost leaf in the parse tree, labeled digit, has attribute value lexval 3, where the 3 is supplied by the lexical analyzer. Its parent is for production 4, F -+ digit. The only semantic rule associated with this production defines F. val = digit. lexval, which equals 3.

=

= 15 3 /� T. S/h ==15� I 3 / p l = 5 T{.T1·smynh =15 115 =5 T. val

F. val

°

do.g•t . lexva l

=

*

. val

digit . lexval



Figure 5.5: Annotated parse tree for 3 * 5

At the second child of the root, the inherited attribute T'. inh is defined by the semantic rule T'. inh = F. val associated with production 1. Thus, the left operand, 3, for the * operator is passed from left to right across the children of the root. The production at the node for T' is T' -+ * FT{ . ( We retain the subscript 1 in the annotated parse tree to distinguish between the two nodes for T' . ) The inherited attribute T{ . inh is defined by the semantic rule T{ , inh = T', inh x F. val associated with production 2. With T'. inh = 3 and F. val = 5 , we get T{ .inh = 15. At the lower node for T{ , the production is T' -+ t. The semantic rule T'. syn = T'. inh defines T{ .syn = 15. The syn attributes at the nodes for T' pass the value 15 up the tree to the node for T, where T. val = 15. 0

5.1.3 Exercises for Section 5.1

Exercise 5.1.1: For the SDD of Fig. 5.1, give annotated parse trees for the following expressions:


b) 1 * 2 * 3 * (4 + 5) n. c ) (9 + 8 * (7 + 6) + 5) * 4 n.

Exercise 5.1.2: Extend the SDD of Fig. 5.4 to handle expressions as in Fig. 5.1.

Exercise 5.1.3: Repeat Exercise 5.1.1, using your SDD from Exercise 5.1.2.

5.2 Evaluation Orders for SDD's

"Dependency graphs" are a useful tool for determining an evaluation order for the attribute instances in a given parse tree. While an annotated parse tree shows the values of attributes, a dependency graph helps us determine how those values can be computed. In this section, in addition to dependency graphs, we define two impor­ tant classes of SDD's: the "S-attributed" and the more general "L-attributed" SDD's. The translations specified by these two classes fit well with the parsing methods we have studied, and most translations encountered in practice can be written to conform to the requirements of at least one of these classes.

5.2.1 Dependency Graphs

A dependency graph depicts the flow of information among the attribute instances in a particular parse tree; an edge from one attribute instance to another means that the value of the first is needed to compute the second. Edges express constraints implied by the semantic rules. In more detail:

For each parse-tree node, say a node labeled by grammar symbol X, the dependency graph has a node for each attribute associated with X.



Suppose that a semantic rule associated with a production p defines the value of synthesized attribute A.b in terms of the value of X.C (the rule may define A.b in terms of other attributes in addition to X.c) . Then, the dependency graph has an edge from X.C to A.b. More precisely, at every node N labeled A where production p is applied, create an edge to attribute b at N, from the attribute c at the child of N corresponding to this instance of the symbol X in the body of the production. 2



Suppose that a semantic rule associated with a production p defines the value of inherited attribute B.c in terms of the value of X.a. Then, the dependency graph has an edge from X.a to B.c. For each node N labeled B that corresponds to an occurrence of this B in the body of production p, create an edge to attribute c at N from the attribute a at the node M

2 Since a node N can have several children labeled X, we again assume that subscripts distinguish among uses of the same symbol at different places in the production.

311

5.2. EVALUATION ORDERS FOR SDD 'S

that corresponds to this occurrence of X . Note that M could be either the parent or a sibling of N .

Example 5 .4 : Consider the following production and rule: SEMANTIC RULE

PRODUCTION

E.val

E --+ El + T

=

El .val + T.val

At every node N labeled E, with children corresponding to the body of this production, the synthesized q,ttribute val at N is computed using the values of val at the two children, labeled E and T. Thus, a portion of the dependency g�aph for every parse tree in which this production is used looks like Fig. 5.6. As a convention, we shall show the parse tree edges as dotted lines, while the edges of the dependency graph are solid. 0

. '

EI

. '

.

.

.

val

/ .

.

.

.

.

.

.

E

.

.



.

.

val .

.

.

. "

. .

"

+

T

val

Figure 5.6: E. val is synthesized from E1 .val and E2 . val

Example 5 . 5 : An example of a complete dependency graph appears in Fig. 5.7. The nodes of the dependency graph, represented by the numbers 1 through 9, correspond to the attributes in the annotated parse tree in Fig. 5.5.

digit

2

lexval

Figure 5.7: Dependency graph for the annotated parse tree of Fig. 5.5 Nodes 1 and 2 represent the attribute lexval associated with the two leaves labeled digit. Nodes 3 and 4 represent the attribute val associated with the two nodes labeled F. The edges to node 3 from 1 and to node 4 from 2 result

312

CHAPTER 5. SYNTAX-DIRECTED TRANSLATION

from the semantic rule that defines F.val in terms of digit.lexval. In fact, F. val equals digit.lexval, but the edge represents dependence, not equality. Nodes 5 and 6 represent the inherited attribute T'. inh associated with each of the occurrences of nonterminal T'. The edge to 5 from 3 is due to the rule T'. inh = F. val, which defines T'. inh at the right child of the root from F. val at the left child. We see edges to 6 from node 5 for T'. inh and from node 4 for F. val, because these values are multiplied to evaluate the attribute inh at node 6. Nodes 7 and 8 represent the synthesized attribute syn associated with the occurrences of T'. The edge to node 7 from 6 is due to the semantic rule T'.syn = T'. inh associated with production 3 in Fig. 5.4. The edge to node 8 from 7 is due to a semantic rule associated with production 2. Finally, node 9 represents the attribute T. val. The edge to 9 from 8 is due to the semantic rule, T. val = T'.syn, associated with production 1. 0

5.2.2 Ordering the Evaluation of Attributes

The dependency graph characterizes the possible orders in which we can evaluate the attributes at the various nodes of a parse tree. If the dependency graph has an edge from node M to node N, then the attribute corresponding to M must be evaluated before the attribute of N. Thus, the only allowable orders of evaluation are those sequences of nodes N1, N2, . . . , Nk such that if there is an edge of the dependency graph from Ni to Nj, then i < j. Such an ordering embeds a directed graph into a linear order, and is called a topological sort of the graph.

If there is any cycle in the graph, then there are no topological sorts; that is, there is no way to evaluate the SDD on this parse tree. If there are no cycles, however, then there is always at least one topological sort. To see why, since there are no cycles, we can surely find a node with no edge entering. For if there were no such node, we could proceed from predecessor to predecessor until we came back to some node we had already seen, yielding a cycle. Make this node the first in the topological order, remove it from the dependency graph, and repeat the process on the remaining nodes.
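The procedure sketched in the last paragraph can be written directly. The following C function is a minimal illustration only; the adjacency-matrix representation and the bound MAXN are assumptions made for the sketch, not part of the SDD machinery.

    #define MAXN 100

    int edge[MAXN][MAXN];  /* edge[i][j] != 0: attribute i must be evaluated before j */
    int n;                 /* number of attribute instances (dependency-graph nodes)  */

    /* Store a topological order of nodes 0..n-1 in order[]; return 0 if the
     * graph has a cycle, so that no evaluation order exists.                */
    int topsort(int order[])
    {
        int indegree[MAXN], done[MAXN], i, j, k;

        for (j = 0; j < n; j++) {
            indegree[j] = 0;
            done[j] = 0;
            for (i = 0; i < n; i++)
                if (edge[i][j]) indegree[j]++;
        }
        for (k = 0; k < n; k++) {
            /* find a node not yet output that has no entering edge */
            for (i = 0; i < n && (done[i] || indegree[i] > 0); i++)
                ;
            if (i == n) return 0;    /* every remaining node has a predecessor: cycle */
            order[k] = i;
            done[i] = 1;
            for (j = 0; j < n; j++)  /* "remove" node i from the graph */
                if (edge[i][j]) indegree[j]--;
        }
        return 1;
    }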





Example 5 . 6 : The dependency graph of Fig. 5.7 has no cycles. One topologi­ cal sort is the order in which the nodes have already been numbered: 1 , 2, . . . , 9. Notice that every edge of the graph goes from a node to a higher-numbered node, so this order is surely a topological sort. There are other topological sorts as well, such as 1, 3, 5, 2, 4, 6, 7, 8, 9. 0

5.2.3 S-Attributed Definitions

As mentioned earlier, given an SDD, it is very hard to tell whether there exist any parse trees whose dependency graphs have cycles. In practice, translations can be implemented using classes of SDD's that guarantee an evaluation order,

5.2. EVALUATION ORDERS FOR SDD'S

313

since they do not permit dependency graphs with cycles. Moreover, the two classes introduced in this section can be implemented efficiently in connection with top-down or bottom-up parsing. The first class is defined as follows: •

An SDD is S-attributed if every attribute is synthesized.

Example 5 . 7 : The SDD of Fig. 5.1 is an example of an S-attributed definition. Each attribute, L. val, E.val, T. val, and F. val is synthesized. 0 When an .SDD is S-attributed, we can evaluate its attributes in arty bottom­ up order of the nodes of the parse tree. It is often especially simple to evaluate the attributes by performing a postorder traversal of the parse tree and evalu­ ating the attributes at a node N when the traversal leaves N for the last time. That is, we apply the function postorder, defined below, to the root of the parse tree ( see also the box "Preorder and Postorder Traversals" in Section 2.3.4) :

    postorder(N) {
        for ( each child C of N, from the left ) postorder(C);
        evaluate the attributes associated with node N;
    }

S-attributed definitions can be implemented during bottom-up parsing, since a bottom-up parse corresponds to a postorder traversal. Specifically, postorder corresponds exactly to the order in which an LR parser reduces a production body to its head. This fact will be used in Section 5.4.2 to evaluate synthesized attributes and store them on the stack during LR parsing, without creating the tree nodes explicitly.

5.2.4

L-Attributed Definitions

The second class of SDD's is called L-attributed definitions. The idea behind this class is that, between the attributes associated with a production body, dependency-graph edges can go from left to right, but not from right to left (hence "L-attributed" ) . More precisely, each attribute must be either 1.

Synthesized, or

2. Inherited, but with the rules limited as foliows. Suppose that there is a production A -t XIX2 · . . Xn , and that there is an inherited attribute Xi · a computed by a rule associated with this production. Then the rule may use only: ( a) Inherited attributes associated with the head A.

( b ) Either inherited or synthesized attributes associated with the occur­

rences of symbols Xl , X2 ,







, Xi- l located to the left of Xi .

314

CHAPTER 5. SYNTAX-DIRECTED TRANSLATION (c ) Inherited or synthesized attributes associated with this occurrence

· of Xi itself, but only in such a way that there are no cycles in a dependency graph formed by the attributes of this Xi '

Example 5.8 : The SDD in Fig. 5.4 is L-attributed. To see why, consider the semantic rules for inherited attributes, which are repeated here for convenience: PRODUCTION

T -t F T' T' -t * F T{

SEMANTIC RULE

T'. inh = F.val T{ . inh = T'.inh x F.val

The first of these rules defines the inherited attribute T'. inh using only F. val, and F appears to the left of T' in the production body, as required. The second rule defines T{ . inh using the inherited attribute T' . inh associated with the head, and F. val, where F appears to the left of T{ in the production body. In each of these cases, the rules use i:q.formation "from above or from the left," as required by the class. The remaining attributes are synthesized. Hence, the SDD is L-attributed. 0

Example 5.9 : Any SDD containing the following production and rules cannot be L-attributed: PRODUCTION

A -t B C

SEMANTIC RULES

A.s = B.b; B.i = f (C.c, A.s)

The first rule, A.s = B.b, is a legitimate rule in either an S-attributed or L­ attributed SDD. It defines a synthesized attribute A.s in terms of an attribute at a child (that is, a symbol within the production body ) . The second rule defines an inherited attribute B.i, so the entire SDD cannot be S-attributed. Further, although the rule is legal, the SDD cannot be L­ attributed, because the attribute C.c is used to help define B.i, and C is to the right of B in the production body. While attributes at siblings in a parse tree may be used in L-attributed SDD 's, they must be to the left of the symbol whose attribute is being defined. 0

5.2.5

Semantic Rules with Controlled Side Effects

In practice, translations involve side effects: a desk calculator might print a result; a code generator might enter the type of an identifier into a symbol table. With SDD's, we strike a balance between attribute grammars and translation schemes. Attribute grammars have no side effects and allow any evaluation order consistent with the dependency graph. Translation scheme� impose left­ fragment; to-right evaluation and allow semantic actions to contain any program . translation schemes are discussed in Section 5.4. We shall control side effects in SDD's in one of the following ways:

5.2. EVALUATION ORDERS FOR SDD 'S

315



Permit incidental side effects that do not constrain attribute evaluation. In other words, permit side effects when attribute evaluation based on any topological sort of the dependency graph produces a "correct" translation, where "correct" depends on the application .



Constrain the allowable evaluation orders, so that the same translation is produced for any allowable order. The constraints can be thought of as implicit edges added to the dependency graph.

As an example of an incidental side effect, let us modify the desk calculator of Example 5.1 to print a result. Instead of the rule L. val = E. val, which saves the result in the synthesized attribute L. val, consider:

1)

PRODUCTION

SEMANTIC RULE

L -+ E n

print(E. val)

Semantic rules that are executed for their side effects, such as print(E. val) , will be treated as the definitions of dummy synthesized attributes associated with the head of the production. The modified SDD produces the same translation under any topological sort, since the print statement is executed at the end, after the result is computed into E. val.

Example 5 . 1 0 : The SDD in Fig. 5.8 takes a simple declaration D consisting of a basic type T followed by a list L of identifiers. T can be int or float. For each identifier on the list, the type is entered into the symbol-table entry for the identifier. We assume that entering the type for one identifier does not affect the symbol-table entry for any other identifier. Thus, entries can be updated in any order. This SDD does not check whether an identifier is declared more than once; it can be modified to do so. PRODUCTION

1) 2) 3) 4)

D -+ T L T -+ int T -+ float L -+ L1 , id

5)

L -+ id

SEMANTIC RULES

L. inh = T. type T. type = integer T. type = float L1 . inh = L. inh addType(id. entry, L. inh) addType(id. entry, L. inh)

Figure 5.8: Syntax-directed definition for simple type declarations Nonterminal D represents a declaration, which, from production 1 , consists of a type T followed by a list L of identifiers. T has one attribute, T. type, which is the type in the declaration D . Nonterminal L also has one attribute, which we call inh to emphasize that it is an inherited attribute. The purpose of L. inh

CHAPTER 5. SYNTAX-DIRECTED TRANSLATION

316

is to pass the declared type down the list of identifiers, so that it can be added to the appropriate symbol-table entries. Productions 2 and 3 each evaluate the synthesized attribute T. type, giving it the appropriate value, integer or float. This type is passed to the attribute L. inh in the rule for production 1 . Production 4 passes L. inh down the parse tree. That is, the value L1 . inh is computed at a parse-tree node by copying the value of L.inh from the parent of that node; the parent corresponds to the head of the production. Productions 4 and 5 also have a rule in which a function addType is called with two arguments:

1. id. entry, a lexical value that points to a symbol-table object, and 2. L. inh, the type being assigned to every identifier on the list. We suppose that function addType properly installs the type L.inh as the type of the represented identifier. A dependency graph for the input string float id 1 , id2 , id3 appears in Fig. 5.9. Numbers 1 through 10 represent the nodes of the dependency graph. Nodes 1, 2, and 3 represent the attribute entry associated with each of the leaves labeled id. Nodes 6, 8, and 10 are the dummy attributes that represent the application of the function addType to a type and one of these entry values. D

/ 5 06 � .. /7 0 8 �

...�. . . . � ·· .4 type / . ... . real .

inh

9

/> . . . L

10

�t

/

,

.

· · · · · id2

,

2

. . . . . ida 3 entry

entry

entry

id1 1 entry

Figure 5.9: Dependency graph for a declaration float id 1 , id2 , id3 Node 4 represents the attribute T. type, and is actually where attribute eval­ uation begins. This type is then passed to nodes 5, 7, and 9 representing L.inh associated with each of the occurrences of the nonterminal L. 0

5.2. EVALUATION ORDERS FOR SDD'S

317

5.2.6 Exercises for Section 5.2 Exercise 5.2 . 1 : What are all the topological sorts for the dependency graph

of Fig. 5.7?

Exercise 5 . 2.2 : For the SDD of Fig. 5.8, give annotated parse trees for the following expressions: a) int a , b ,

b ) f loat

w,

c. x,

y,

z.

Exercise 5 .2.3 : Suppose that we have a production A -+ BCD. Each of the four nonterminals A, B, C , and D have two attributes: s is a synthesized attribute, and i is an inherited attribute. For each of the sets of rules below, tell whether (i) the rules are consistent with an S-attributed definition (ii) the rules are consistent with an L-attributed definition, and (iii) whether the rules are consistent with any evaluation order at all? a) A.s = B.i + C.s. b ) A.s = B.i + C.s and D .i = A.i + B.s. c ) A.s = B.s + D.s.

! d ) A.s = D .i, B.i

=

A.s + C.s, C.i = B.s, and D .i = B.i + C.i.

! Exercise 5 .2.4 : This grammar generates binary numbers with a "decimal"

point:

S -+ L . L I L L -+ L B I B B -+ O l l Design an L-attributed SDD to compute B.val , the decimal-number value of an input string. For example, the translation of string 1 0 1 . 10 1 should be the decimal number 5.625. Hint: use an inherited attribute L.side that tells which side of the decimal point a bit is on. !! Exercise 5.2.5 : Design an S-attributed SDD for the grammar and translation

described in Exercise 5.2.4.

!! Exercise 5 .2.6 : Implement Algorithm 3.23, which converts a regular expres­

sion into a nondeterministic finite automaton, by an L-attributed SDD on a top-down parsable grammar. Assume that there is a token char representing any character, and that char.lexval is the character it represents. You may also assume the existence of a function new O that returns a new state, that is, a state never before returned by this function. Use any convenient notation to specify the transitions of the NFA .

CHAPTER 5. SYNTAX-DIRECTED TRANSLATION

318 5.3

Applications of Syntax-D irected Translat ion

The syntax-directed translation techniques in this chapter will be applied in Chapter 6 to type checking and intermediate-code generation. Here, we consider selected examples to illustrate some representative SDD's. The main application in this section is the construction of syntax trees. Since some compilers use syntax trees as an intermediate representation, a common form of SDD turns its input string into a tree. To complete the translation to intermediate code, the compiler may then walk the syntax tree, using another set of rules that are in effect an SDD on the syntax tree rather than the parse tree. (Chapter 6 also discusses approaches to intermediate-code generation that apply an SDD without ever constructing a tree explicitly.) We consider two SDD's for constructing syntax trees for expressions. The first, an S-attributed definition, is suitable for use during bottom-up parsing. The second, L-attributed, is suitable for use during top-down parsing. The final example of this section is an L-attributed definition that deals with basic and array types.

5.3.1 Construction of Syntax Trees

As discussed in Section 2.8.2, each node in a syntax tree represents a construct;

the children of the node represent the meaningful components of the construct. A syntax-tree node representing an expression El + E2 has label + and two children representing the sub expressions El and E2 · We shall implement the nodes of a syntax tree by objects with a suitable number of fields. Each object will have an op field that is the label of the node. The objects will have additional fields as follows: •

If the node is a leaf, an additional field holds the lexical value for the leaf. A constructor function Leaf( op, val) creates a leaf object. Alternatively, if nodes are viewed as records, then Leaf returns a pointer to a new record for a leaf.



If the node is an interior node, there are as many additional fields as the node has children in the syntax tree. A constructor function Node takes two or more arguments: Node( op, Cl , C2 , . . . , Ck ) creates an object with first field op and k additional fields for the k children Cl , . . . , Ck ·
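A minimal C sketch of these objects follows. The book describes Leaf and Node abstractly; the struct layout, the explicit child count passed to Node, and the use of varargs are our own assumptions for the illustration.

    #include <stdarg.h>
    #include <stdlib.h>

    struct stnode {
        int             op;      /* label of the node                        */
        int             val;     /* lexical value, meaningful only at leaves */
        int             nkids;   /* number of children; 0 for a leaf         */
        struct stnode **kids;    /* children of an interior node             */
    };

    struct stnode *Leaf(int op, int val)          /* Leaf(op, val)          */
    {
        struct stnode *n = malloc(sizeof *n);
        n->op = op; n->val = val; n->nkids = 0; n->kids = NULL;
        return n;
    }

    struct stnode *Node(int op, int k, ...)       /* Node(op, c1, ..., ck)  */
    {
        va_list ap;
        int i;
        struct stnode *n = malloc(sizeof *n);
        n->op = op; n->val = 0; n->nkids = k;
        n->kids = malloc(k * sizeof n->kids[0]);
        va_start(ap, k);
        for (i = 0; i < k; i++)
            n->kids[i] = va_arg(ap, struct stnode *);
        va_end(ap);
        return n;
    }

With this layout, step 3) of Fig. 5.12 below would read p3 = Node('-', 2, p1, p2).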

Example 5 . 1 1 : The S-attributed definition in Fig. 5.10 constructs syntax trees for a simple expression grammar involving only the binary operators + and - . As usual, these operators are at the same precedence level and are jointly left associative. All nonterminals have one synthesized attribute node, which represents a node of the syntax tree. Every time the first production E � El + T is used, its rule creates a node with ' + ' for op and two children, El . node and T. node, for the subexpressions. The second production has a similar rule.

5 . 3.

APPLICATIONS OF SYNTAX-DIRECTED TRANSLATION PRODUCTION

1) 2) 3)

4) 5) 6)

E -+ El + T E -+ El - T E -+ T T -+ ( E ) T -+ id T -+ num

319

SEMANTIC RULES

E. node = new Node(' + ', E1 .node, T. node) E. node = new Node(' - ', E1 . node, T. node) E. node = T. node T. node = E. node T.node = new Leaf(id, id. entry) T. node = new Leaf(num, num. val)

Figure 5. 10: Constructing syntax trees for simple expressions For production 3, E -+ T, no node is created, since E. node is the same as T. node. Similarly, no node is created for production 4, T -+ ( E ) . The value of T. node is the same as E. node, since parentheses are used only for grouping; they influence the structure of the parse tree and the syntax tree, but once their job is done, there is no further need to retain them in the syntax tree. The last two T-productions have a single terminal on the right. We use the constructor Leaf to create a suitable node, which becomes the value of T. node. Figure 5 . 1 1 shows the construction of a syntax tree for the input a 4 + c. The nodes of the syntax tree are shown as records, with the op field first. Syntax-tree edges are now shown as solid lines. The underlying parse tree, which need not actually be constructed, is shown with dotted edges. The third type of line, shown dashed, represents the values of E. node and T. node; each line points to the appropriate synt�x-tree node. At the bottom we see leaves for a, 4 and c, constructed by Leaf. We suppose that the lexical value id. entry points into the symbol table, and the lexical value num. val is the numerical value of a constant. These leaves, or pointers to them, become the vqlue of T.node at the three parse-tree nodes labeled T, according to rules 5 and 6. Note thqt by rule 3, the pointer to the leaf for a is also the value of E. node for the leftmost E in the parse tree. Rule 2 causes us to create a node with op equal to the minus sign and pointers to the first two leaves. Then, rule 1 produces the root node of the syntax tree by combining the nod e for with the third leaf. If the rules are evaluated during a postorder traversal of the parse tree, or with reductions during a bottom-up parse, then the sequence of steps shown in Fig. 5.12 ends with P5 pointing to the root of the constructed syntax tree. 0 -



With a grammar designed for top-down parsing, the same syntax trees are constructed, using the same sequence of steps, even though the structure of the parse trees differs significantly from that of syntax trees.

Example 5.12: The L-attributed definition in Fig. 5.13 performs the same translation as the S-attributed definition in Fig. 5.10. The attributes for the grammar symbols E, T, id, and num are as discussed in Example 5.11.

[Figure 5.11 shows the parse tree for a - 4 + c (dotted edges), the syntax tree beneath it (solid edges), dashed lines from each E.node and T.node to the corresponding syntax-tree node, and leaves for a and c pointing to their symbol-table entries.]

Figure 5.11: Syntax tree for a - 4 + c

1) p1 = new Leaf(id, entry-a);
2) p2 = new Leaf(num, 4);
3) p3 = new Node('-', p1, p2);
4) p4 = new Leaf(id, entry-c);
5) p5 = new Node('+', p3, p4);

Figure 5.12: Steps in the construction of the syntax tree for a - 4 + c

The rules for building syntax trees in this example are similar to the rules for the desk calculator in Example 5.3. In the desk-calculator example, a term x * y was evaluated by passing x as an inherited attribute, since x and * y appeared in different portions of the parse tree. Here, the idea is to build a syntax tree for x + y by passing x as an inherited attribute, since x and + y appear in different subtrees. Nonterminal E' is the counterpart of nonterminal T' in Example 5.3. Compare the dependency graph for a - 4 + c in Fig. 5.14 with that for 3 * 5 in Fig. 5.7.

Nonterminal E' has an inherited attribute inh and a synthesized attribute syn. Attribute E'.inh represents the partial syntax tree constructed so far. Specifically, it represents the root of the tree for the prefix of the input string that is to the left of the subtree for E'. At node 5 in the dependency graph in Fig. 5.14, E'.inh denotes the root of the partial syntax tree for the identifier a; that is, the leaf for a. At node 6, E'.inh denotes the root for the partial syntax


PRODUCTION            SEMANTIC RULES

1) E → T E'           E.node = E'.syn
                      E'.inh = T.node

2) E' → + T E1'       E1'.inh = new Node('+', E'.inh, T.node)
                      E'.syn = E1'.syn

3) E' → - T E1'       E1'.inh = new Node('-', E'.inh, T.node)
                      E'.syn = E1'.syn

4) E' → ε             E'.syn = E'.inh

5) T → ( E )          T.node = E.node

6) T → id             T.node = new Leaf(id, id.entry)

7) T → num            T.node = new Leaf(num, num.val)

Figure 5.13: Constructing syntax trees during top-down parsing

Figure 5.14: Dependency graph for a - 4 + c, with the SDD of Fig. 5.13

tree for the input a - 4. At node 9, E'.inh denotes the syntax tree for a - 4 + c. Since there is no more input, at node 9, E'.inh points to the root of the entire syntax tree. The syn attributes pass this value back up the parse tree until it becomes the value of E.node. Specifically, the attribute value at node 10 is defined by the rule E'.syn = E'.inh associated with the production E' → ε. The attribute value at node 11 is defined by the rule E'.syn = E1'.syn associated with production 2 in Fig. 5.13. Similar rules define the attribute values at nodes 12 and 13. □
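As a hedged illustration of how such an L-attributed definition can be realized, the following Java sketch builds syntax trees by recursive descent for the grammar of Fig. 5.13, passing E'.inh as a parameter and returning E'.syn. It assumes the Node and Leaf classes sketched earlier; the token handling (a plain list of strings, parentheses omitted) is a simplifying assumption, not part of the text.

    import java.util.List;

    // Sketch: top-down syntax-tree construction following the SDD of Fig. 5.13.
    class TopDownTreeBuilder {
        private final List<String> tokens;   // e.g., ["a", "-", "4", "+", "c"]
        private int pos = 0;

        TopDownTreeBuilder(List<String> tokens) { this.tokens = tokens; }

        Node parseE() {                       // E -> T E'
            Node t = parseT();
            return parseEPrime(t);            // E'.inh = T.node; E.node = E'.syn
        }

        private Node parseEPrime(Node inh) {  // E' -> + T E1' | - T E1' | epsilon
            if (pos < tokens.size()
                    && (tokens.get(pos).equals("+") || tokens.get(pos).equals("-"))) {
                String op = tokens.get(pos++);
                Node t = parseT();
                // E1'.inh = new Node(op, E'.inh, T.node)
                return parseEPrime(new Node(op, inh, t));
            }
            return inh;                       // E' -> epsilon: E'.syn = E'.inh
        }

        private Node parseT() {               // T -> id | num (parentheses omitted)
            String tok = tokens.get(pos++);
            return new Leaf(Character.isDigit(tok.charAt(0)) ? "num" : "id", tok);
        }
    }

Running parseE on the tokens of a - 4 + c yields the same tree that the steps of Fig. 5.12 construct bottom-up.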

5.3.2 The Structure of a Type

Inherited attributes are useful when the structure of the parse tree differs from the abstract syntax of the input; attributes can then be used to carry information from one part of the parse tree to another. The next example shows how a mismatch in structure can be due to the design of the language, and not due to constraints imposed by the parsing method.

Example 5.13: In C, the type int [2][3] can be read as, "array of 2 arrays of 3 integers." The corresponding type expression array(2, array(3, integer)) is represented by the tree in Fig. 5.15. The operator array takes two parameters, a number and a type. If types are represented by trees, then this operator returns a tree node labeled array with two children for a number and a type.

Figure 5.15: Type expression for int [2][3]

With the SDD in Fig. 5.16, nonterminal T generates either a basic type or an array type. Nonterminal B generates one of the basic types int and float. T generates a basic type when T derives B C and C derives ε. Otherwise, C generates array components consisting of a sequence of integers, each integer surrounded by brackets.

PRODUCTION            SEMANTIC RULES

T → B C               T.t = C.t
                      C.b = B.t

B → int               B.t = integer

B → float             B.t = float

C → [ num ] C1        C.t = array(num.val, C1.t)
                      C1.b = C.b

C → ε                 C.t = C.b

Figure 5.16: T generates either a basic type or an array type

The nonterminals B and T have a synthesized attribute t representing a type. The nonterminal C has two attributes: an inherited attribute b and a synthesized attribute t. The inherited b attributes pass a basic type down the tree, and the synthesized t attributes accumulate the result.

An annotated parse tree for the input string int [ 2 ] [ 3 ] is shown in Fig. 5.17. The corresponding type expression in Fig. 5.15 is constructed by passing the type integer from B, down the chain of C's through the inherited attributes b. The array type is synthesized up the chain of C's through the attributes t.

In more detail, at the root for T → B C, nonterminal C inherits the type from B, using the inherited attribute C.b. At the rightmost node for C, the


production is C → ε, so C.t equals C.b. The semantic rules for the production C → [ num ] C1 form C.t by applying the operator array to the operands num.val and C1.t. □

[Annotated parse tree for int [ 2 ] [ 3 ], with T.t = array(2, array(3, integer)) at the root and B.t = integer passed down the chain of C's.]

Figure 5.17: Syntax-directed translation of array types
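A small sketch may help make Example 5.13 concrete. The Java code below is an illustration rather than the book's own code: it represents type expressions as objects and builds array(2, array(3, integer)) the way the SDD of Fig. 5.16 does, passing the basic type down (like C.b) and applying the array operator on the way back up (like C.t). The class and method names are assumptions.

    import java.util.List;

    // Type expressions of Example 5.13 as objects: basic types and array(size, elem).
    abstract class TypeExpr { }

    class BasicType extends TypeExpr {
        final String name;
        BasicType(String name) { this.name = name; }
        public String toString() { return name; }
    }

    class ArrayType extends TypeExpr {
        final int size;
        final TypeExpr element;
        ArrayType(int size, TypeExpr element) { this.size = size; this.element = element; }
        public String toString() { return "array(" + size + ", " + element + ")"; }
    }

    class TypeBuilder {
        // Mirrors the C-productions: the basic type b plays the role of the inherited
        // attribute C.b, and the returned value plays the role of the synthesized C.t.
        static TypeExpr buildType(TypeExpr b, List<Integer> dims, int i) {
            if (i == dims.size()) return b;                                  // C -> epsilon
            return new ArrayType(dims.get(i), buildType(b, dims, i + 1));   // C -> [num] C1
        }

        public static void main(String[] args) {
            // int [2][3]  ==>  array(2, array(3, integer))
            System.out.println(buildType(new BasicType("integer"), List.of(2, 3), 0));
        }
    }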

5.3.3 Exercises for Section 5.3

Exercise 5.3.1: Below is a grammar for expressions involving operator + and integer or floating-point operands. Floating-point numbers are distinguished by having a decimal point.

E → E + T | T
T → num . num | num

a) Give an SDD to determine the type of each term T and expression E.

b) Extend your SDD of (a) to translate expressions into postfix notation. Use the unary operator intToFloat to turn an integer into an equivalent float.

! Exercise 5.3.2: Give an SDD to translate infix expressions with + and * into

equivalent expressions without redundant parentheses. For example, since both operators associate from the left, and * takes precedence over +, ((a* (b+c) ) * (d)) translates into a * (b + c) * d.

! Exercise 5.3.3: Give an SDD to differentiate expressions such as x * (3 * x + x * x) involving the operators + and *, the variable x, and constants. Assume

that no simplification occurs, so that, for example, 3 * x will be translated into 3 * 1 + 0 * x.

5.4 Syntax-Directed Translation Schemes

Syntax-directed translation schemes are a complementary notation to syntax-directed definitions. All of the applications of syntax-directed definitions in Section 5.3 can be implemented using syntax-directed translation schemes.

From Section 2.3.5, a syntax-directed translation scheme (SDT) is a context-free grammar with program fragments embedded within production bodies. The program fragments are called semantic actions and can appear at any position within a production body. By convention, we place curly braces around actions; if braces are needed as grammar symbols, then we quote them.

Any SDT can be implemented by first building a parse tree and then performing the actions in a left-to-right depth-first order; that is, during a preorder traversal. An example appears in Section 5.4.3. Typically, SDT's are implemented during parsing, without building a parse tree. In this section, we focus on the use of SDT's to implement two important classes of SDD's:

1. The underlying grammar is LR-parsable, and the SDD is S-attributed.

2. The underlying grammar is LL-parsable, and the SDD is L-attributed.

We shall see how, in both these cases, the semantic rules in an SDD can be converted into an SDT with actions that are executed at the right time. During parsing, an action in a production body is executed as soon as all the grammar symbols to the left of the action have been matched.

SDT's that can be implemented during parsing can be characterized by introducing distinct marker nonterminals in place of each embedded action; each marker M has only one production, M → ε. If the grammar with marker nonterminals can be parsed by a given method, then the SDT can be implemented during parsing.

5.4.1 Postfix Translation Schemes

By far the simplest SDD implementation occurs when we can parse the grammar bottom-up and the SDD is S-attributed. In that case, we can construct an SDT in which each action is placed at the end of the production and is executed along with the reduction of the body to the head of that production. SDT's with all actions at the right ends of the production bodies are called postfix SDT's.

Example 5.14 : The postfix SDT in Fig. 5.18 implements the desk calculator SDD of Fig. 5.1, with one change: the action for the first production prints a value. The remaining actions are exact counterparts of the semantic rules. Since the underlying grammar is LR, and the SDD is S-attributed, these actions can be correctly performed along with the reduction steps of the parser. 0

L → E n           { print(E.val); }
E → E1 + T        { E.val = E1.val + T.val; }
E → T             { E.val = T.val; }
T → T1 * F        { T.val = T1.val × F.val; }
T → F             { T.val = F.val; }
F → ( E )         { F.val = E.val; }
F → digit         { F.val = digit.lexval; }

Figure 5.18: Postfix SDT implementing the desk calculator

5.4.2 Parser-Stack Implementation of Postfix SDT's

Postfix SDT's can be implemented during LR parsing by executing the actions when reductions occur. The attribute ( s ) of each grammar symbol can be put on the stack in a place where they can be found during the reduction. The best plan is to place the attributes along with the grammar symbols ( or the LR states that represent these symbols ) in records on the stack itself. In Fig. 5.19, the parser stack contains records with a field for a grammar symbol ( or parser state) and, below it, a field for an attribute. The three grammar symbols X Y Z are on top of the stack; perhaps they are about to be reduced according to a production like A -t X Y Z. Here, we show X.x as the one attribute of X, and so on. In general, we can allow for more attributes, either by making the records large enough or by putting pointers to records on the stack. With small attributes, it may be simpler to make the records large enough, even if some fields go unused some of the time. However, if one or more attributes are of unbounded size - say, they are character strings - then it would be better to put a pointer to the attribute's value in the stack record and store the actual value in some larger, shared storage area that is not part of the stack.

[Stack records, each with a state/grammar-symbol field and a field for synthesized attribute(s); the records for X, Y, and Z, with attributes X.x, Y.y, and Z.z, are nearest the top of the stack.]

Figure 5.19: Parser stack with a field for synthesized attributes

If the attributes are all synthesized, and the actions occur at the ends of the productions, then we can compute the attributes for the head when we reduce the body to the head. If we reduce by a production such as A → X Y Z, then we have all the attributes of X, Y, and Z available, at known positions on the stack, as in Fig. 5.19. After the action, A and its attributes are at the top of the stack, in the position of the record for X.

Example 5.15: Let us rewrite the actions of the desk-calculator SDT of Ex-


ample 5.14 so that they manipulate the parser stack explicitly. Such stack manipulation is usually done automatically by the parser.

PRODUCTION        ACTIONS

L → E n           { print(stack[top - 1].val);
                    top = top - 1; }

E → E1 + T        { stack[top - 2].val = stack[top - 2].val + stack[top].val;
                    top = top - 2; }

E → T

T → T1 * F        { stack[top - 2].val = stack[top - 2].val × stack[top].val;
                    top = top - 2; }

T → F

F → ( E )         { stack[top - 2].val = stack[top - 1].val;
                    top = top - 2; }

F → digit

Figure 5.20: Implementing the desk calculator on a bottom-up parsing stack

Suppose that the stack is kept in an array of records called stack, with top a cursor to the top of the stack. Thus, stack[top] refers to the top record on the stack, stack[top - 1] to the record below that, and so on. Also, we assume that each record has a field called val, which holds the attribute of whatever grammar symbol is represented in that record. Thus, we may refer to the attribute E.val that appears at the third position on the stack as stack[top - 2].val. The entire SDT is shown in Fig. 5.20.

For instance, in the second production, E → E1 + T, we go two positions below the top to get the value of E1, and we find the value of T at the top. The resulting sum is placed where the head E will appear after the reduction, that is, two positions below the current top. The reason is that after the reduction, the three topmost stack symbols are replaced by one. After computing E.val, we pop two symbols off the top of the stack, so the record where we placed E.val will now be at the top of the stack.

In the third production, E → T, no action is necessary, because the length of the stack does not change, and the value of T.val at the stack top will simply become the value of E.val. The same observation applies to the productions T → F and F → digit. Production F → ( E ) is slightly different. Although the value does not change, two positions are removed from the stack during the reduction, so the value has to move to the position after the reduction.

Note that we have omitted the steps that manipulate the first field of the stack records - the field that gives the LR state or otherwise represents the grammar symbol. If we are performing an LR parse, the parsing table tells us what the new state is every time we reduce; see Algorithm 4.44. Thus, we may


simply place that state in the record for the new top of stack. □
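A minimal sketch of this stack discipline in Java appears below. It shows only the attribute (val) side of the records of Fig. 5.20, using a plain array in place of state/attribute records, and it omits the LR driver that decides when to shift and reduce; all names are illustrative.

    // Sketch of the explicit stack manipulation of Fig. 5.20 for the desk calculator.
    class ValueStack {
        private final double[] stack = new double[100];
        private int top = -1;

        void shiftDigit(double lexval) { stack[++top] = lexval; }  // F -> digit
        void shiftTerminal() { stack[++top] = 0; }  // '+', '*', '(', ')': val unused

        // E -> E1 + T : E1 sits two below the top, '+' one below, T on top.
        void reduceAdd() { stack[top - 2] = stack[top - 2] + stack[top]; top -= 2; }

        // T -> T1 * F : same layout, with multiplication.
        void reduceMul() { stack[top - 2] = stack[top - 2] * stack[top]; top -= 2; }

        // F -> ( E ) : the value does not change, but three records shrink to one.
        void reduceParens() { stack[top - 2] = stack[top - 1]; top -= 2; }

        double result() { return stack[top]; }
    }

For input 3 * 5 + 4, the calls shiftDigit(3), shiftTerminal(), shiftDigit(5), reduceMul(), shiftTerminal(), shiftDigit(4), reduceAdd() leave 19 on top of the stack; the unit reductions E → T, T → F, and F → digit require no work.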

5.4.3 SDT's With Actions Inside Productions

An action may be placed at any position within the body of a production. It is performed immediately after all symbols to its left are processed. Thus, if we have a production B → X {a} Y, the action a is done after we have recognized X (if X is a terminal) or all the terminals derived from X (if X is a nonterminal). More precisely,

• If the parse is bottom-up, then we perform action a as soon as this occurrence of X appears on the top of the parsing stack.

• If the parse is top-down, we perform a just before we attempt to expand this occurrence of Y (if Y is a nonterminal) or check for Y on the input (if Y is a terminal).

SDT's that can be implemented during parsing include postfix SDT's and a class of SDT's considered in Section 5.5 that implements L-attributed definitions. Not all SDT's can be implemented during parsing, as we shall see in the next example.

Example 5.16: As an extreme example of a problematic SDT, suppose that

we turn our desk-calculator running example into an SDT that prints the prefix form of an expression, rather than evaluating the expression. The productions and actions are shown in Fig. 5.21.

1) L → E n
2) E → { print('+'); } E1 + T
3) E → T
4) T → { print('*'); } T1 * F
5) T → F
6) F → ( E )
7) F → digit { print(digit.lexval); }

Figure 5.21: Problematic SDT for infix-to-prefix translation during parsing

Unfortunately, it is impossible to implement this SDT during either top-down or bottom-up parsing, because the parser would have to perform critical actions, like printing instances of * or +, long before it knows whether these symbols will appear in its input. Using marker nonterminals M2 and M4 for the actions in productions 2 and 4, respectively, on input 3, a shift-reduce parser (see Section 4.5.3) has conflicts between reducing by M2 → ε, reducing by M4 → ε, and shifting the digit. □

Any SDT can be implemented as follows:

1. Ignoring the actions, parse the input and produce a parse tree as a result.

2. Then, examine each interior node N, say one for production A → α. Add additional children to N for the actions in α, so the children of N from left to right have exactly the symbols and actions of α.

3. Perform a preorder traversal (see Section 2.3.4) of the tree, and as soon as a node labeled by an action is visited, perform that action.

For instance, Fig. 5.22 shows the parse tree for expression 3 * 5 + 4 with actions inserted. If we visit the nodes in preorder, we get the prefix form of the expression: + * 3 5 4. A sketch of such a traversal appears after Fig. 5.22.


Figure 5.22: Parse tree with actions embedded
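Steps (2) and (3) above can be sketched in code: parse-tree nodes whose children may be either grammar-symbol nodes or embedded actions, visited in preorder, with each action performed as soon as its node is reached. The Java sketch below is illustrative; the interface and class names are assumptions.

    import java.util.List;

    // A parse-tree child is either a grammar-symbol node or an embedded action.
    interface ParseTreeChild { }

    class ParseNode implements ParseTreeChild {
        final String symbol;
        final List<ParseTreeChild> children;
        ParseNode(String symbol, List<ParseTreeChild> children) {
            this.symbol = symbol;
            this.children = children;
        }
    }

    class ActionNode implements ParseTreeChild {
        final Runnable action;                 // e.g., () -> System.out.print("+ ")
        ActionNode(Runnable action) { this.action = action; }
    }

    class PreorderExecutor {
        // Visit nodes in preorder; perform each action as soon as it is reached.
        static void run(ParseTreeChild n) {
            if (n instanceof ActionNode) {
                ((ActionNode) n).action.run();
            } else if (n instanceof ParseNode) {
                for (ParseTreeChild c : ((ParseNode) n).children) {
                    run(c);
                }
            }
        }
    }

Building the tree of Fig. 5.22, with ActionNode children placed where the braces appear, and calling PreorderExecutor.run on the root would print + * 3 5 4.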

5.4.4 Eliminating Left Recursion From SDT's

Since no grammar with left recursion can be parsed deterministically top-down, we examined left-recursion elimination in Section 4.3.3. When the grammar is part of an SDT, we also need to worry about how the actions are handled.

First, consider the simple case, in which the only thing we care about is the order in which the actions in an SDT are performed. For example, if each action simply prints a string, we care only about the order in which the strings are printed. In this case, the following principle can guide us:

• When transforming the grammar, treat the actions as if they were terminal symbols.


This principle is based on the idea that the grammar transformation preserves the order of the terminals in the generated string. The actions are therefore executed in the same order in any left-to-right parse, top-down or bottom-up. The "trick" for eliminating left recursion is to take two productions

A → Aα | β

that generate strings consisting of a β and any number of α's, and replace them by productions that generate the same strings using a new nonterminal R (for "remainder") of the first production:

A → βR
R → αR | ε

If β does not begin with A, then A no longer has a left-recursive production. In regular-definition terms, with both sets of productions, A is defined by β(α)*. See Section 4.3.3 for the handling of situations where A has more recursive or nonrecursive productions.

Example 5.17: Consider the following E-productions from an SDT for translating infix expressions into postfix notation:

E → E + T { print('+'); }
E → T

If we apply the standard transformation to E, the remainder of the left-recursive production is

α = + T { print('+'); }

and β, the body of the other production, is T. If we introduce R for the remainder of E, we get the set of productions:

E → T R
R → + T { print('+'); } R
R → ε
□
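Because the actions now follow non-left-recursive productions, they can be executed during a top-down parse. The following Java sketch parses the transformed productions by recursive descent and prints the postfix form on the fly; it is an illustration under simplifying assumptions (T is a single digit token, and the token list and names are not from the text).

    import java.util.List;

    // Sketch of the transformed SDT above, parsed by recursive descent.
    class PostfixPrinter {
        private final List<String> tokens;   // e.g., ["9", "+", "5", "+", "2"]
        private int pos = 0;

        PostfixPrinter(List<String> tokens) { this.tokens = tokens; }

        void parseE() { parseT(); parseR(); }            // E -> T R

        private void parseR() {                          // R -> + T { print('+') } R | epsilon
            if (pos < tokens.size() && tokens.get(pos).equals("+")) {
                pos++;                                   // match '+'
                parseT();
                System.out.print("+ ");                  // the embedded action
                parseR();
            }
        }

        private void parseT() {                          // T -> digit (simplified)
            System.out.print(tokens.get(pos++) + " ");
        }

        public static void main(String[] args) {
            new PostfixPrinter(List.of("9", "+", "5", "+", "2")).parseE();  // prints: 9 5 + 2 +
        }
    }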

When the actions of an SDD compute attributes rather than merely printing output, we must be more careful about how we eliminate left recursion from a grammar. However, if the SDD is S-attributed, then we can always construct an SDT by placing attribute-computing actions at appropriate positions in the new productions. We shall give a general schema for the case of a single recursive production, a single nonrecursive production, and a single attribute of the left-recursive nonterminal; the generalization to many productions of each type is not hard, but is notationally cumbersome. Suppose that the two productions are

A → A1 Y    { A.a = g(A1.a, Y.y) }
A → X       { A.a = f(X.x) }

Here, A.a is the synthesized attribute of left-recursive nonterminal A, and X and Y are single grammar symbols with synthesized attributes X.x and Y.y, respectively. These could represent a string of several grammar symbols, each with its own attribute(s), since the schema has an arbitrary function g computing A.a in the recursive production and an arbitrary function f computing A.a in the second production. In each case, f and g take as arguments whatever attributes they are allowed to access if the SDD is S-attributed. We want to turn the underlying grammar into

A → X R
R → Y R | ε

Figure 5.23 suggests what the SDT on the new grammar must do. In (a) we see the effect of the postfix SDT on the original grammar. We apply f once, corresponding to the use of production A → X, and then apply g as many times as we use the production A → A Y. Since R generates a "remainder" of Y's, its translation depends on the string to its left, a string of the form XYY ... Y. Each use of the production R → Y R results in an application of g. For R, we use an inherited attribute R.i to accumulate the result of successively applying g, starting with the value of A.a.

[In (a), the postfix SDT on the original grammar computes A.a bottom-up: f(X.x) at the lowest A, then g(f(X.x), Y1.y), then g(g(f(X.x), Y1.y), Y2.y) at the root. In (b), the new grammar computes the same values left-to-right in the inherited attribute R.i.]

Figure 5.23: Eliminating left recursion from a postfix SDT

In addition, R has a synthesized attribute R.s, not shown in Fig. 5.23. This attribute is first computed when R ends its generation of Y symbols, as signaled by the use of production R → ε. R.s is then copied up the tree, so it can become the value of A.a for the entire expression XYY ... Y. The case where A generates XYY is shown in Fig. 5.23, and we see that the value of A.a at the root of (a) has two uses of g. So does R.i at the bottom of tree (b), and it is this value of R.s that gets copied up that tree. To accomplish this translation, we use the following SDT:

A → X {R.i = f(X.x)} R {A.a = R.s}
R → Y {R1.i = g(R.i, Y.y)} R1 {R.s = R1.s}
R → ε {R.s = R.i}

Notice that the inherited attribute R.i is evaluated immediately before a use of R in the body, while the synthesized attributes A.a and R.s are evaluated at the ends of the productions. Thus, whatever values are needed to compute these attributes will be available from what has been computed to the left.
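One way to picture this SDT in code is as a pair of recursive-descent functions in which R.i is a parameter and R.s is the return value. The Java sketch below uses placeholder definitions of f and g and treats X and Y as single integer-valued tokens; these are assumptions for illustration only.

    import java.util.Iterator;
    import java.util.List;

    // Sketch of the SDT above as recursive-descent functions.
    class LeftRecursionEliminated {
        static int f(int x) { return x; }            // placeholder for f(X.x)
        static int g(int a, int y) { return a + y; } // placeholder for g(A.a, Y.y)

        // A -> X R { A.a = R.s }, with R.i = f(X.x) computed before the call to R.
        static int parseA(Iterator<Integer> tokens) {
            int x = tokens.next();                   // parse X
            return parseR(f(x), tokens);             // R.i = f(X.x); A.a = R.s
        }

        // R -> Y R1 { R.s = R1.s } with R1.i = g(R.i, Y.y), or R -> epsilon { R.s = R.i }.
        static int parseR(int inh, Iterator<Integer> tokens) {
            if (!tokens.hasNext()) return inh;       // R -> epsilon
            int y = tokens.next();                   // parse Y
            return parseR(g(inh, y), tokens);        // R1.i = g(R.i, Y.y)
        }

        public static void main(String[] args) {
            // X Y Y with values 1, 2, 3 yields g(g(f(1), 2), 3) = 6 for these placeholders.
            System.out.println(parseA(List.of(1, 2, 3).iterator()));
        }
    }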

5.4.5 SDT's for L-Attributed Definitions

In Section 5.4.1, we converted S-attributed SDD's into postfix SDT's, with actions at the right ends of productions. As long as the underlying grammar is LR, postfix SDT's can be parsed and translated bottom-up.

Now, we consider the more general case of an L-attributed SDD. We shall assume that the underlying grammar can be parsed top-down, for if not it is frequently impossible to perform the translation in connection with either an LL or an LR parser. With any grammar, the technique below can be implemented by attaching actions to a parse tree and executing them during a preorder traversal of the tree. The rules for turning an L-attributed SDD into an SDT are as follows:

1. Embed the action that computes the inherited attributes for a nonterminal A immediately before that occurrence of A in the body of the production. If several inherited attributes for A depend on one another in an acyclic fashion, order the evaluation of attributes so that those needed first are computed first.

2. Place the actions that compute a synthesized attribute for the head of a production at the end of the body of that production.

We shall illustrate these principles with two extended examples. The first involves typesetting. It illustrates how the techniques of compiling can be used in language processing for applications other than what we normally think of as programming languages. The second example is about the generation of intermediate code for a typical programming-language construct: a form of while-statement.

Example 5.18: This example is motivated by languages for typesetting mathematical formulas. Eqn is an early example of such a language; ideas from Eqn are still found in the TeX typesetting system, which was used to produce this book. We shall concentrate on only the capability to define subscripts, subscripts of subscripts, and so on, ignoring superscripts, built-up fractions, and all other mathematical features. In the Eqn language, one writes a sub i sub j to set the expression a_{i_j}. A simple grammar for boxes (elements of text bounded by a rectangle) is

B → B1 B2 | B1 sub B2 | ( B ) | text

Corresponding to these four productions, a box can be either

1. Two boxes, juxtaposed, with the first, B1, to the left of the other, B2.

2. A box and a subscript box. The second box appears in a smaller size, lower, and to the right of the first box.

3. A parenthesized box, for grouping of boxes and subscripts. Eqn and TeX both use curly braces for grouping, but we shall use ordinary, round parentheses to avoid confusion with the braces that surround actions in SDT's.

4. A text string, that is, any string of characters.

This grammar is ambiguous, but we can still use it to parse bottom-up if we make subscripting and juxtaposition right associative, with sub taking precedence over juxtaposition.

Expressions will be typeset by constructing larger boxes out of smaller ones. In Fig. 5.24, the boxes for E1 and .height are about to be juxtaposed to form the box for E1.height. The left box for E1 is itself constructed from the box for E and the subscript 1. The subscript 1 is handled by shrinking its box by about 30%, lowering it, and placing it after the box for E. Although we shall treat .height as a text string, the rectangles within its box show how it can be constructed from boxes for the individual letters.


Figure 5.24: Constructing larger boxes from smaller ones

In this example, we concentrate on the vertical geometry of boxes only. The horizontal geometry - the widths of boxes - is also interesting, especially when different characters have different widths. It may not be readily apparent, but each of the distinct characters in Fig. 5.24 has a different width.

The values associated with the vertical geometry of boxes are as follows:

a) The point size is used to set text within a box. We shall assume that characters not in subscripts are set in 10 point type, the size of type in this book. Further, we assume that if a box has point size p, then its subscript box has the smaller point size 0.7p. Inherited attribute B.ps will represent the point size of block B. This attribute must be inherited, because the context determines by how much a given box needs to be shrunk, due to the number of levels of subscripting.


b) Each box has a baseline, which is a vertical position that corresponds to the bottoms of lines of text, not counting any letters, like "g", that extend below the normal baseline. In Fig. 5.24, the dotted line represents the baseline for the boxes E, .height, and the entire expression. The baseline for the box containing the subscript 1 is adjusted to lower the subscript.

c) A box has a height, which is the distance from the top of the box to the baseline. Synthesized attribute B.ht gives the height of box B.

d) A box has a depth, which is the distance from the baseline to the bottom of the box. Synthesized attribute B.dp gives the depth of box B.

The SDD in Fig. 5.25 gives rules for computing point sizes, heights, and depths. Production 1 is used to assign B.ps the initial value 10.

PRODUCTION            SEMANTIC RULES

1) S → B              B.ps = 10

2) B → B1 B2          B1.ps = B.ps
                      B2.ps = B.ps
                      B.ht = max(B1.ht, B2.ht)
                      B.dp = max(B1.dp, B2.dp)

3) B → B1 sub B2      B1.ps = B.ps
                      B2.ps = 0.7 × B.ps
                      B.ht = max(B1.ht, B2.ht - 0.25 × B.ps)
                      B.dp = max(B1.dp, B2.dp + 0.25 × B.ps)

4) B → ( B1 )         B1.ps = B.ps
                      B.ht = B1.ht
                      B.dp = B1.dp

5) B → text           B.ht = getHt(B.ps, text.lexval)
                      B.dp = getDp(B.ps, text.lexval)

Figure 5.25: SDD for typesetting boxes

Production 2 handles juxtaposition. Point sizes are copied down the parse tree; that is, two sub-boxes of a box inherit the same point size from the larger box. Heights and depths are computed up the tree by taking the maximum. That is, the height of the larger box is the maximum of the heights of its two components, and similarly for the depth.

Production 3 handles subscripting and is the most subtle. In this greatly simplified example, we assume that the point size of a subscripted box is 70% of the point size of its parent. Reality is much more complex, since subscripts cannot shrink indefinitely; in practice, after a few levels, the sizes of subscripts


shrink hardly at all. Further, we assume that the baseline of a subscript box drops by 25% of the parent's point size; again, reality is more complex.

Production 4 copies attributes appropriately when parentheses are used. Finally, production 5 handles the leaves that represent text boxes. In this matter too, the true situation is complicated, so we merely show two unspecified functions getHt and getDp that examine tables created with each font to determine the maximum height and maximum depth of any characters in the text string. The string itself is presumed to be provided as the attribute lexval of terminal text.

Our last task is to turn this SDD into an SDT, following the rules for an L-attributed SDD, which Fig. 5.25 is. The appropriate SDT is shown in Fig. 5.26. For readability, since production bodies become long, we split them across lines and line up the actions. Production bodies therefore consist of the contents of all lines up to the head of the next production. □

PRODUCTION            ACTIONS

1) S →                { B.ps = 10; }
      B

2) B →                { B1.ps = B.ps; }
      B1              { B2.ps = B.ps; }
      B2              { B.ht = max(B1.ht, B2.ht);
                        B.dp = max(B1.dp, B2.dp); }

3) B →                { B1.ps = B.ps; }
      B1 sub          { B2.ps = 0.7 × B.ps; }
      B2              { B.ht = max(B1.ht, B2.ht - 0.25 × B.ps);
                        B.dp = max(B1.dp, B2.dp + 0.25 × B.ps); }

4) B → (              { B1.ps = B.ps; }
      B1 )            { B.ht = B1.ht; B.dp = B1.dp; }

5) B → text           { B.ht = getHt(B.ps, text.lexval);
                        B.dp = getDp(B.ps, text.lexval); }

Figure 5.26: SDT for typesetting boxes

Our next example concentrates on a simple while-statement and the generation of intermediate code for this type of statement. Intermediate code will be treated as a string-valued attribute. Later, we shall explore techniques that involve the writing of pieces of a string-valued attribute as we parse, thus avoiding the copying of long strings to build even longer strings. The technique was introduced in Example 5.17, where we generated the postfix form of an infix


expression "on-the-fly," rather than computing it as an attribute. However, in our first formulation, we create a string-valued attribute by concatenation. Example 5 . 1 9 : For this example, we only need one production:

S → while ( C ) S1

Here, S is the nonterminal that generates all kinds of statements, presumably including if-statements, assignment statements, and others. In this example, C stands for a conditional expression - a boolean expression that evaluates to true or false.

In this flow-of-control example, the only things we ever generate are labels. All the other intermediate-code instructions are assumed to be generated by parts of the SDT that are not shown. Specifically, we generate explicit instructions of the form label L, where L is an identifier, to indicate that L is the label of the instruction that follows. We assume that the intermediate code is like that introduced in Section 2.8.4.

The meaning of our while-statement is that the conditional C is evaluated. If it is true, control goes to the beginning of the code for S1. If false, then control goes to the code that follows the while-statement's code. The code for S1 must be designed to jump to the beginning of the code for the while-statement when finished; the jump to the beginning of the code that evaluates C is not shown in Fig. 5.27.

We use the following attributes to generate the proper intermediate code:

1. The inherited attribute S.next labels the beginning of the code that must be executed after S is finished.

2. The synthesized attribute S.code is the sequence of intermediate-code steps that implements a statement S and ends with a jump to S.next.

3. The inherited attribute C.true labels the beginning of the code that must be executed if C is true.

4. The inherited attribute C.false labels the beginning of the code that must be executed if C is false.

5. The synthesized attribute C.code is the sequence of intermediate-code steps that implements the condition C and jumps either to C.true or to C.false, depending on whether C is true or false.

The SDD that computes these attributes for the while-statement is shown in Fig. 5.27. A number of points merit explanation:

• The function new generates new labels.



• The variables L1 and L2 hold labels that we need in the code. L1 is the beginning of the code for the while-statement, and we need to arrange

S → while ( C ) S1      L1 = new(); L2 = new();
                        S1.next = L1;
                        C.false = S.next;
                        C.true = L2;
                        S.code = label || L1 || C.code || label || L2 || S1.code

Figure 5.27: SDD for while-statements

that S1 jumps there after it finishes. That is why we set S1.next to L1. L2 is the beginning of the code for S1, and it becomes the value of C.true, because we branch there when C is true.

• Notice that C.false is set to S.next, because when the condition is false, we execute whatever code must follow the code for S.



We use " as the symbol for �oncatenation of intermediate-code fragments. The value of S. code thus begins with the label Ll, then the code for condition C, another label L2, and the code for 81 .

This SDD is L-attributed. When we convert it into an SDT, the only remaining issue is how to handle the labels L1 and L2, which are variables, and not attributes. If we treat actions as dummy nonterminals, then such variables can be treated as the synthesized attributes of dummy nonterminals. Since L1 and L2 do not depend on any other attributes, they can be assigned to the first action in the production. The resulting SDT with embedded actions that implements this L-attributed definition is shown in Fig. 5.28. □

S → while (     { L1 = new(); L2 = new(); C.false = S.next; C.true = L2; }
      C )       { S1.next = L1; }
      S1        { S.code = label || L1 || C.code || label || L2 || S1.code; }

Figure 5.28: SDT for while-statements

5.4.6 Exercises for Section 5.4

Exercise 5.4.1: We mentioned in Section 5.4.2 that it is possible to deduce, from the LR state on the parsing stack, what grammar symbol is represented by the state. How would we discover this information?

Exercise 5.4.2: Rewrite the following SDT:

A → A {a} B | A B {b} | 0
B → B {c} A | B A {d} | 1

so that the underlying grammar becomes non-left-recursive. Here, a, b, c, and d are actions, and 0 and 1 are terminals.

! Exercise 5.4.3: The following SDT computes the value of a string of 0's and 1's interpreted as a positive, binary integer.

B → B1 0 {B.val = 2 × B1.val}
  | B1 1 {B.val = 2 × B1.val + 1}
  | 1    {B.val = 1}

Rewrite this SDT so the underlying grammar is not left recursive, and yet the same value of B.val is computed for the entire input string.

! Exercise 5.4.4: Write L-attributed SDD's analogous to that of Example 5.19 for the following productions, each of which represents a familiar flow-of-control construct, as in the programming language C. You may need to generate a three-address statement to jump to a particular label L, in which case you should generate goto L.

a) S → if ( C ) S1 else S2

b) S → do S1 while ( C )

c) S → '{' L '}';  L → L S | ε

Note that any statement in the list can have a jump from its middle to the next statement, so it is not sufficient simply to generate code for each statement in order.

Exercise 5.4.5: Convert each of your SDD's from Exercise 5.4.4 to an SDT in the manner of Example 5.19.

Exercise 5.4.6: Modify the SDD of Fig. 5.25 to include a synthesized attribute B.le, the length of a box. The length of the concatenation of two boxes is the sum of the lengths of each. Then add your new rules to the proper positions in the SDT of Fig. 5.26.

Exercise 5.4.7: Modify the SDD of Fig. 5.25 to include superscripts denoted by operator sup between boxes. If box B2 is a superscript of box B1, then position the baseline of B2 0.6 times the point size of B1 above the baseline of B1. Add the new production and rules to the SDT of Fig. 5.26.

5.5 Implementing L-Attributed SDD's

Since many translation applications can be addressed using L-attributed definitions, we shall consider their implementation in more detail in this section. The following methods do translation by traversing a parse tree:


1. Build the parse tree and annotate. This method works for any noncircular SDD whatsoever. We introduced annotated parse trees in Section 5.1.2.

2. Build the parse tree, add actions, and execute the actions in preorder. This approach works for any L-attributed definition. We discussed how to turn an L-attributed SDD into an SDT in Section 5.4.5; in particular, we discussed how to embed actions into productions based on the semantic rules of such an SDD.

In this section, we discuss the following methods for translation during parsing:

3. Use a recursive-descent parser with one function for each nonterminal. The function for nonterminal A receives the inherited attributes of A as arguments and returns the synthesized attributes of A.

4. Generate code on the fly, using a recursive-descent parser.

5. Implement an SDT in conjunction with an LL-parser. The attributes are kept on the parsing stack, and the rules fetch the needed attributes from known locations on the stack.

6. Implement an SDT in conjunction with an LR-parser. This method may be surprising, since the SDT for an L-attributed SDD typically has actions in the middle of productions, and we cannot be sure during an LR parse that we are even in that production until its entire body has been constructed. We shall see, however, that if the underlying grammar is LL, we can always handle both the parsing and translation bottom-up.

5.5.1 Translation During Recursive-Descent Parsing

A recursive-descent parser has a function A for each nonterminal A, as discussed in Section 4.4.1. We can extend the parser into a translator as follows:

a) The arguments of function A are the inherited attributes of nonterminal A.

b) The return-value of function A is the collection of synthesized attributes of nonterminal A.

In the body of function A, we need to both parse and handle attributes:

1. Decide upon the production used to expand A.

2. Check that each terminal appears on the input when it is required. We shall assume that no backtracking is needed, but the extension to recursive-descent parsing with backtracking can be done by restoring the input position upon failure, as discussed in Section 4.4.1.


3. Preserve, in local variables, the values of all attributes needed to compute inherited attributes for nonterminals in the body or synthesized attributes for the head nonterminal.

4. Call functions corresponding to nonterminals in the body of the selected production, providing them with the proper arguments. Since the underlying SDD is L-attributed, we have already computed these attributes and stored them in local variables.

Example 5.20: Let us consider the SDD and SDT of Example 5.19 for while-

statements. A pseudocode rendition of the relevant parts of the function S appears in Fig. 5.29.

string S(label next) {
    string Scode, Ccode;   /* local variables holding code fragments */
    label L1, L2;          /* the local labels */
    if ( current input == token while ) {
        advance input;
        check '(' is next on the input, and advance;
        L1 = new();
        L2 = new();
        Ccode = C(next, L2);
        check ')' is next on the input, and advance;
        Scode = S(L1);
        return("label" || L1 || Ccode || "label" || L2 || Scode);
    }
    else /* other statement types */
}

Figure 5.29: Implementing while-statements with a recursive-descent parser

We show S as storing and returning long strings. In practice, it would be far more efficient for functions like S and C to return pointers to records that represent these strings. Then, the return-statement in function S would not physically concatenate the components shown, but rather would construct a record, or perhaps tree of records, expressing the concatenation of the strings represented by Scode and Ccode, the labels L1 and L2, and the two occurrences of the literal string "label". □

Example 5.21: Now, let us take up the SDT of Fig. 5.26 for typesetting

boxes. First, we address parsing, since the underlying grammar in Fig. 5.26 is ambiguous. The following transformed grammar makes juxtaposition and subscripting right associative, with sub taking precedence over juxtaposition:

S → B
B → T B1 | T
T → F sub T1 | F
F → ( B ) | text

The two new nonterminals, T and F, are motivated by terms and factors in expressions. Here, a "factor," generated by F, is either a parenthesized box or a text string. A "term," generated by T, is a "factor" with a sequence of subscripts, and a box generated by B is a sequence of juxtaposed "terms."

The attributes of B carry over to T and F, since the new nonterminals also denote boxes; they were introduced simply to aid parsing. Thus, both T and F have an inherited attribute ps and synthesized attributes ht and dp, with semantic actions that can be adapted from the SDT in Fig. 5.26.

The grammar is not yet ready for top-down parsing, since the productions for B and T have common prefixes. Consider T, for instance. A top-down parser cannot choose between the two productions for T by looking one symbol ahead in the input. Fortunately, we can use a form of left-factoring, discussed in Section 4.3.4, to make the grammar ready. With SDT's, the notion of common prefix applies to actions as well. Both productions for T begin with the nonterminal F inheriting attribute ps from T.

The pseudocode in Fig. 5.30 for T(ps) folds in the code for F(ps). After left-factoring is applied to T → F sub T1 | F, there is only one call to F; the pseudocode shows the result of substituting the code for F in place of this call.

The function T will be called as T(10.0) by the function for B, which we do not show. It returns a pair consisting of the height and depth of the box generated by nonterminal T; in practice, it would return a record containing the height and depth. Function T begins by checking for a left parenthesis, in which case it must have the production F → ( B ) to work with. It saves whatever the B inside the parentheses returns, but if that B is not followed by a right parenthesis, then there is a syntax error, which must be handled in a manner not shown. Otherwise, if the current input is text, then the function T uses getHt and getDp to determine the height and depth of this text.

T then decides whether the next box is a subscript and adjusts the point size, if so. We use the actions associated with the production B → B sub B in Fig. 5.26 for the height and depth of the larger box. Otherwise, we simply return what F would have returned: (h1, d1). □

5.5.2 On-The-Fly Code Generation

The construction of long strings of code that are attribute values, as in Example 5.20, is undesirable for several reasons, including the time it could take to copy or move long strings. In common cases such as our running code-generation example, we can instead incrementally generate pieces of the code into an array or output file by executing actions in an SDT. The elements we need to make this technique work are:


(float, float) T(float ps) {
    float h1, h2, d1, d2;   /* locals to hold heights and depths */
    /* start code for F(ps) */
    if ( current input == '(' ) {
        advance input;
        (h1, d1) = B(ps);
        if ( current input != ')' ) syntax error: expected ')';
        advance input;
    }
    else if ( current input == text ) {
        let lexical value text.lexval be t;
        advance input;
        h1 = getHt(ps, t);
        d1 = getDp(ps, t);
    }
    else syntax error: expected text or '(';
    /* end code for F(ps) */
    if ( current input == sub ) {
        advance input;
        (h2, d2) = T(0.7 * ps);
        return (max(h1, h2 - 0.25 * ps), max(d1, d2 + 0.25 * ps));
    }
    return (h1, d1);
}

Figure 5.30: Recursive-descent typesetting of boxes

1. There is, for one or more nonterminals, a main attribute. For convenience, we shall assume that the main attributes are all string valued. In Example 5.20, the attributes S.code and C.code are main attributes; the other attributes are not.

2. The main attributes are synthesized.

3. The rules that evaluate the main attribute(s) ensure that

(a) The main attribute is the concatenation of main attributes of nonterminals appearing in the body of the production involved, perhaps with other elements that are not main attributes, such as the string label or the values of labels L1 and L2.

(b) The main attributes of nonterminals appear in the rule in the same order as the nonterminals themselves appear in the production body.

As a consequence of the above conditions, the main attribute can be constructed by emitting the non-main-attribute elements of the concatenation. We can rely


The Type of Main Attributes

Our simplifying assumption that main attributes are of string type is really too restrictive. The true requirement is that the type of all the main attributes must have values that can be constructed by concatenation of elements. For instance, a list of objects of any type would be appropriate, as long as we represent these lists in a way that allows elements to be efficiently appended to the end of the list. Thus, if the purpose of the main attribute is to represent a sequence of intermediate-code statements, we could produce the intermediate code by writing statements to the end of an array of objects. Of course the requirements stated in Section 5.5.2 still apply to lists; for example, main attributes must be assembled from other main attributes by concatenation in order.

on the recursive calls to the functions for the nonterminals in a production body to emit the value of their main attribute incrementally.

Example 5.22: We can modify the function of Fig. 5.29 to emit elements of the main translation S.code instead of saving them for concatenation into a return value of S.code. The revised function S appears in Fig. 5.31.

void S(label next) {
    label L1, L2;   /* the local labels */
    if ( current input == token while ) {
        advance input;
        check '(' is next on the input, and advance;
        L1 = new();
        L2 = new();
        print("label", L1);
        C(next, L2);
        check ')' is next on the input, and advance;
        print("label", L2);
        S(L1);
    }
    else /* other statement types */
}

Figure 5.31: On-the-fly recursive-descent code generation for while-statements

In Fig. 5.31, S and C now have no return value, since their only synthesized attributes are produced by printing. Further, the position of the print statements is significant. The order in which output is printed is: first label L1, then the code for C (which is the same as the value of Ccode in Fig. 5.29), then


label L2, and finally the code from the recursive call to S (which is the same as Scode in Fig. 5.29). Thus, the code printed by this call to S is exactly the same as the value of Scode that is returned in Fig. 5.29. □

Incidentally, we can make the same change to the underlying SDT: turn the construction of a main attribute into actions that emit the elements of that attribute. In Fig. 5.32 we see the SDT of Fig. 5.28 revised to generate code on the fly.

S → while (     { L1 = new(); L2 = new(); C.false = S.next;
                  C.true = L2; print("label", L1); }
      C )       { S1.next = L1; print("label", L2); }
      S1

Figure 5.32: SDT for on-the-fly code generation for while statements

5.5.3 L-Attributed SDD's and LL Parsing

Suppose that an L-attributed SDD is based on an LL-grammar and that we have converted it to an SDT with actions embedded in the productions, as described in Section 5.4.5. We can then perform the translation during LL parsing by extending the parser stack to hold actions and certain data items needed for attribute evaluation. Typically, the data items are copies of attributes.

In addition to records representing terminals and nonterminals, the parser stack will hold action-records representing actions to be executed and synthesize-records to hold the synthesized attributes for nonterminals. We use the following two principles to manage attributes on the stack:

• The inherited attributes of a nonterminal A are placed in the stack record that represents that nonterminal. The code to evaluate these attributes will usually be represented by an action-record immediately above the stack record for A; in fact, the conversion of L-attributed SDD's to SDT's ensures that the action-record will be immediately above A.

• The synthesized attributes for a nonterminal A are placed in a separate synthesize-record that is immediately below the record for A on the stack.

This strategy places records of several types on the parsing stack, trusting that these variant record types can be managed properly as subclasses of a "stack-record" class. In practice, we might combine several records into one, but the ideas are perhaps best explained by separating data used for different purposes into different records.

Action-records contain pointers to code to be executed. Actions may also appear in synthesize-records; these actions typically place copies of the synthesized attribute(s) in other records further down the stack, where the value of


that attribute will be needed after the synthesize-record and its attributes are popped off the stack.

Let us take a brief look at LL parsing to see the need to make temporary copies of attributes. From Section 4.4.4, a table-driven LL parser mimics a leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S ⇒*lm wα, where S is the start symbol. When the parser expands by a production A → B C, it replaces A on top of the stack by B C.

Suppose nonterminal C has an inherited attribute C.i. With A → B C, the inherited attribute C.i may depend not only on the inherited attributes of A, but on all the attributes of B. Thus, we may need to process B completely before C.i can be evaluated. We therefore save temporary copies of all the attributes needed to evaluate C.i in the action-record that evaluates C.i. Otherwise, when the parser replaces A on top of the stack by B C, the inherited attributes of A will have disappeared, along with its stack record.

Since the underlying SDD is L-attributed, we can be sure that the values of the inherited attributes of A are available when A rises to the top of the stack. The values will therefore be available in time to be copied into the action-record that evaluates the inherited attributes of C. Furthermore, space for the synthesized attributes of A is not a problem, since the space is in the synthesize-record for A, which remains on the stack, below B and C, when the parser expands by A → B C.

As B is processed, we can perform actions (through a record just above B on the stack) that copy its inherited attributes for use by C, as needed, and after B is processed, the synthesize-record for B can copy its synthesized attributes for use by C, if needed. Likewise, synthesized attributes of A may need temporaries to help compute their value, and these can be copied to the synthesize-record for A as B and then C are processed. The principle that makes all this copying of attributes work is:

All copying takes place among the records that are created during one expansion of one nonterminal. Thus, each of these records knows how far below it on the stack each other record is, and can write values into the records below safely.
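As a hedged illustration of the record types just described, the following Java sketch shows one way the variant records might be declared as subclasses of a common stack-record class. The field names, the use of maps for attributes, and the Consumer used for the action code are illustrative choices, not the book's own design.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Consumer;

    // Variant record types that can all live on one LL parser stack.
    abstract class StackRecord { }

    class SymbolRecord extends StackRecord {
        final String symbol;                                  // terminal or nonterminal
        final Map<String, Object> inherited = new HashMap<>(); // e.g., S.next, C.true
        SymbolRecord(String symbol) { this.symbol = symbol; }
    }

    class ActionRecord extends StackRecord {
        final Map<String, Object> copies = new HashMap<>();   // temporary copies of attributes
        final Consumer<StackRecord[]> code;                   // run when this record reaches the top
        ActionRecord(Consumer<StackRecord[]> code) { this.code = code; }
    }

    class SynthesizeRecord extends StackRecord {
        final Map<String, Object> synthesized = new HashMap<>(); // e.g., S.code
    }

When a nonterminal is expanded, its record is replaced by the sequence of symbol-, action-, and synthesize-records for the chosen production body, as the next example illustrates.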

The next example illustrates the implementation of inherited attributes during LL parsing by diligently copying attribute values. Shortcuts or optimizations are possible, particularly with copy rules, which simply copy the value of one attribute into another. Shortcuts are deferred until Example 5.24, which also illustrates synthesize-records.

Example 5.23: This example implements the SDT of Fig. 5.32, which generates code on the fly for the while-production. This SDT does not have synthesized attributes, except for dummy attributes that represent labels.

Figure 5.33(a) shows the situation as we are about to use the while-production to expand S, presumably because the lookahead symbol on the input is


while. The record at the top of stack is for S, and it contains only the inherited

attribute S.next, which we suppose has the value x. Since we are now parsing top-down, we show the stack top at the left, according to our usual convention.

[Figure: in (a), the record for S, with next = x, is on top of the stack. In (b), S has been replaced by records for while, (, an action-record (with field snext = x), C (fields true and false still unknown), ), a second action-record (fields al1 and al2), and S1; the first action's code sets L1 and L2, fills C.false and C.true, copies L1 and L2 into al1 and al2, and prints label L1, while the second action's code sets S1.next from al1 and prints label al2.]

Figure 5.33: Expansion of S according to the while-statement production

Figure 5.33(b) shows the situation immediately after we have expanded S. There are action-records in front of the nonterminals C and S1, corresponding to the actions in the underlying SDT of Fig. 5.32. The record for C has room for inherited attributes true and false, while the record for S1 has room for attribute next, as all S-records must. We show values for these fields as ?, because we do not yet know their values.

The parser next recognizes while and ( on the input and pops their records off the stack. Now, the first action is at the top, and it must be executed. This action-record has a field snext, which holds a copy of the inherited attribute S.next. When S is popped from the stack, the value of S.next is copied into the field snext for use during the evaluation of the inherited attributes for C.

The code for the first action generates new values for L1 and L2, which we shall suppose are y and z, respectively. The next step is to make z the value of C.true. The assignment stack[top - 1].true = L2 is written knowing it is only executed when this action-record is at the top of stack, so top - 1 refers to the record below it - the record for C.

The first action-record then copies L1 into field al1 in the second action, where it will be used to evaluate S1.next. It also copies L2 into a field called al2 of the second action; this value is needed for that action-record to print its output properly. Finally, the first action-record prints label y to the output.

The situation after completing the first action and popping its record off


Figure 5.34: After the action above C is performed

the stack is shown in Fig. 5.34. The values of inherited attributes in the record for C have been filled in properly, as have the temporaries al1 and al2 in the second action record. At this point, C is expanded, and we presume that the code to implement its test, containing jumps to labels x and z, as appropriate, is generated. When the C-record is popped from the stack, the record for ) becomes top and causes the parser to check for ) on its input.

With the action above S1 at the top of the stack, its code sets S1.next and emits label z. When that is done, the record for S1 becomes the top of stack, and as it is expanded, we presume it correctly generates code that implements whatever kind of statement it is and then jumps to label y. □

Example 5.24: Now, let us consider the same while-statement, but with a translation that produces the output S.code as a synthesized attribute, rather than by on-the-fly generation. In order to follow the explanation, it is useful to bear in mind the following invariant or inductive hypothesis, which we assume is followed for every nonterminal:

Every nonterminal that has code associated with it leaves that code, as a string, in the synthesize-record just below it on the stack.

Assuming this statement is true, we shall handle the while-production so it maintains this statement as an invariant.

Figure 5.35(a) shows the situation just before S is expanded using the production for while-statements. At the top of the stack we see the record for S; it has a field for its inherited attribute S.next, as in Example 5.23. Immediately below that record is the synthesize-record for this occurrence of S. The latter has a field for S.code, as all synthesize-records for S must have. We also show it with some other fields for local storage and actions, since the SDT for the while production in Fig. 5.28 is surely part of a larger SDT.

Our expansion of S is based on the SDT of Fig. 5.28, and it is shown in Fig. 5.35(b). As a shortcut, during the expansion, we assume that the inherited attribute S.next is assigned directly to C.false, rather than being placed in the first action and then copied into the record for C. Let us examine what each record does when it becomes the top of stack. First, the while record causes the token while to be matched with the input,

[Figure 5.35(a): the top of stack holds the record for S with next = x; just below it is the synthesize-record for S.code, with code = ? and room for other local data and actions.]

[Figure 5.35(b): after expanding S, the stack holds, from the top: while, (, an action-record with fields L1 = ? and L2 = ? carrying the code
    L1 = new(); L2 = new();
    stack[top - 1].true = L2;
    stack[top - 4].next = L1;
    stack[top - 5].l1 = L1;
    stack[top - 5].l2 = L2;
then the record for C (fields true and false), the synthesize-record for C.code carrying the action
    stack[top - 3].Ccode = code;
then the record for ), the record for S1 (field next), the synthesize-record for S1.code (fields code, Ccode, l1, l2) carrying the action
    stack[top - 1].code = "label" || l1 || Ccode || "label" || l2 || code;
and finally the original synthesize-record for S.code.]

Figure 5.35: Expansion of S with synthesized attribute constructed on the stack

which it must, or else we would not have expanded S in this way. After while and ( are popped off the stack, the code for the action-record is executed. It generates values for L1 and L2, and we take the shortcut of copying them directly to the inherited attributes that need them: S1.next and C.true. The last two steps of the action cause L1 and L2 to be copied into the record called "Synthesize S1.code."

The synthesize-record for S1 does double duty: not only will it hold the synthesized attribute S1.code, but it will also serve as an action-record to complete the evaluation of the attributes for the entire production S → while ( C ) S1. In particular, when it gets to the top, it will compute the synthesized attribute S.code and place its value in the synthesize-record for the head S.

When C becomes the top of the stack, it has both its inherited attributes computed. By the inductive hypothesis stated above, we suppose it correctly generates code to execute its condition and jump to the proper label. We also assume that the actions performed during the expansion of C correctly place this code in the record below, as the value of synthesized attribute C.code. After C is popped, the synthesize-record for C.code becomes the top. Its code is needed in the synthesize-record for S1.code, because that is where we concatenate all the code elements to form S.code. The synthesize-record for C.code therefore has an action to copy C.code into the synthesize-record for S1.code. After doing so, the record for token ) reaches the top of stack, and causes a check for ) on the input. Assuming that test succeeds, the record for S1 becomes the top of stack. By our inductive hypothesis, this nonterminal is

Can We Handle L-Attributed SDD's on LR Grammars?

In Section 5.4.1, we saw that every S-attributed SDD on an LR grammar can be implemented during a bottom-up parse. From Section 5.5.3 every L-attributed SDD on an LL grammar can be parsed top-down. Since LL grammars are a proper subset of the LR grammars, and the S-attributed SDD's are a proper subset of the L-attributed SDD's, can we handle every LR grammar and L-attributed SDD bottom-up? We cannot, as the following intuitive argument shows.

Suppose we have a production A → B C in an LR-grammar, and there is an inherited attribute B.i that depends on inherited attributes of A. When we reduce to B, we still have not seen the input that C generates, so we cannot be sure that we have a body of production A → B C. Thus, we cannot compute B.i yet, since we are unsure whether to use the rule associated with this production.

Perhaps we could wait until we have reduced to C, and know that we must reduce B C to A. However, even then, we do not know the inherited attributes of A, because even after reduction, we may not be sure of the production body that contains this A. We could reason that this decision, too, should be deferred, and therefore further defer the computation of B.i. If we keep reasoning this way, we soon realize that we cannot make any decisions until the entire input is parsed. Essentially, we have reached the strategy of "build the parse tree first and then perform the translation."

expanded, and the net effect is that its code is correctly constructed and placed in the field for code in the synthesize-record for S1. Now, all the data fields of the synthesize-record for S1 have been filled in, so when it becomes the top of stack, the action in that record can be executed. The action causes the labels and code from C.code and S1.code to be concatenated in the proper order. The resulting string is placed in the record below; that is, in the synthesize-record for S. We have now correctly computed S.code, and when the synthesize-record for S becomes the top, that code is available for placement in another record further down the stack, where it will eventually be assembled into a larger string of code implementing a program element of which this S is a part. □
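The record-and-stack machinery of Examples 5.23 and 5.24 can be prototyped in a few lines. The sketch below is not the book's code; it is a minimal Java illustration of Example 5.23, replaying the two action-records of Fig. 5.33(b) on a list-based stack so that the notation stack[top - k] carries over directly. All class, field, and method names (Rec, emit, newLabel, al1, al2) are assumptions made for the illustration.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WhileExpansionSketch {
        // A parser-stack record: a name plus a map of attribute fields.
        static class Rec {
            final String name;
            final Map<String, String> f = new HashMap<>();
            Rec(String name) { this.name = name; }
        }

        static int labels = 0;
        static String newLabel() { return "L" + (++labels); }
        static void emit(String s) { System.out.println(s); }

        public static void main(String[] args) {
            List<Rec> stack = new ArrayList<>();
            Rec s1 = new Rec("S1"), action2 = new Rec("action2"),
                rparen = new Rec(")"), c = new Rec("C"), action1 = new Rec("action1");
            action1.f.put("snext", "x");                   // copy of the inherited S.next

            // Stack contents after 'while' and '(' have been matched and popped
            // (the last element of the list is the top of stack).
            stack.add(s1); stack.add(action2); stack.add(rparen); stack.add(c); stack.add(action1);
            int top = stack.size() - 1;

            // Code of the first action-record, now on top (cf. Fig. 5.33(b)).
            String L1 = newLabel(), L2 = newLabel();
            stack.get(top - 1).f.put("false", action1.f.get("snext"));  // C.false = snext
            stack.get(top - 1).f.put("true", L2);                       // C.true  = L2
            stack.get(top - 3).f.put("al1", L1);                        // into the second action-record
            stack.get(top - 3).f.put("al2", L2);
            emit("label " + L1);
            stack.remove(top); top--;                      // pop the first action-record

            // The records for C and ) would be processed here; the sketch just pops them.
            stack.remove(top); top--;
            stack.remove(top); top--;

            // Code of the second action-record, now on top.
            Rec a2 = stack.get(top);
            stack.get(top - 1).f.put("next", a2.f.get("al1"));          // S1.next = al1
            emit("label " + a2.f.get("al2"));
            stack.remove(top);
        }
    }

Running the sketch prints the two labels in the order the action-records emit them; the synthesized-attribute variant of Example 5.24 would instead accumulate code strings in the fields named code.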

5.5.4 Bottom-Up Parsing of L-Attributed SDD's

We can do bottom-up every translation that we can do top-down. More precisely, given an L-attributed SDD on an LL grammar, we can adapt the grammar to compute the same SDD on the new grammar during an LR parse. The "trick" has three parts:

1. Start with the SDT constructed as in Section 5.4.5, which places embedded actions before each nonterminal to compute its inherited attributes and an action at the end of the production to compute synthesized attributes.

2. Introduce into the grammar a marker nonterminal in place of each embedded action. Each such place gets a distinct marker, and there is one production for any marker M, namely M → ε.

3. Modify the action a if marker nonterminal M replaces it in some production A → α {a} β, and associate with M → ε an action a' that

(a) Copies, as inherited attributes of M, any attributes of A or symbols of α that action a needs.

(b) Computes attributes in the same way as a, but makes those attributes be synthesized attributes of M.

This change appears illegal, since typically the action associated with production M → ε will have to access attributes belonging to grammar symbols that do not appear in this production. However, we shall implement the actions on the LR parsing stack, so the necessary attributes will always be available a known number of positions down the stack.

Example 5.25: Suppose that there is a production A → B C in an LL grammar, and the inherited attribute B.i is computed from inherited attribute A.i by some formula B.i = f(A.i). That is, the fragment of an SDT we care about is

    A → {B.i = f(A.i);} B C

We introduce marker M with inherited attribute M.i and synthesized attribute M.s. The former will be a copy of A.i and the latter will be B.i. The SDT will be written

    A → M B C
    M → {M.i = A.i; M.s = f(M.i);}

Notice that the rule for M does not have A.i available to it, but in fact we shall arrange that every inherited attribute for a nonterminal such as A appears on the stack immediately below where the reduction to A will later take place. Thus, when we reduce ε to M, we shall find A.i immediately below it, from where it may be read. Also, the value of M.s, which is left on the stack along with M, is really B.i and properly is found right below where the reduction to B will later occur. □

Example 5.26: Let us turn the SDT of Fig. 5.28 into an SDT that can operate with an LR parse of the revised grammar. We introduce a marker M before C and a marker N before S1, so the underlying grammar becomes

Why Markers Work

Markers are nonterminals that derive only ε and that appear only once among all the bodies of all productions. We shall not give a formal proof that, when a grammar is LL, marker nonterminals can be added at any position in the body, and the resulting grammar will still be LR. The intuition, however, is as follows. If a grammar is LL, then we can determine that a string w on the input is derived from nonterminal A, in a derivation that starts with production A → α, by seeing only the first symbol of w (or the following symbol if w = ε). Thus, if we parse w bottom-up, then the fact that a prefix of w must be reduced to α and then to A is known as soon as the beginning of w appears on the input. In particular, if we insert markers anywhere in α, the LR states will incorporate the fact that this marker has to be there, and will reduce ε to the marker at the appropriate point on the input.

    S → while ( M C ) N S1
    M → ε
    N → ε

Before we discuss the actions that are associated with markers M and N, let us outline the "inductive hypothesis" about where attributes are stored.

1. Below the entire body of the while-production - that is, below while on the stack - will be the inherited attribute S.next. We may not know the nonterminal or parser state associated with this stack record, but we can be sure that it will have a field, in a fixed position of the record, that holds S.next before we begin to recognize what is derived from this S.

2. Inherited attributes C.true and C.false will be just below the stack record for C. Since the grammar is presumed to be LL, the appearance of while on the input assures us that the while-production is the only one that can be recognized, so we can be sure that M will appear immediately below C on the stack, and M's record will hold the inherited attributes of C.

3. Similarly, the inherited attribute S1.next must appear immediately below S1 on the stack, so we may place that attribute in the record for N.

4. The synthesized attribute C.code will appear in the record for C. As always when we have a long string as an attribute value, we expect that in practice a pointer to (an object representing) the string will appear in the record, while the string itself is outside the stack.

5. Similarly, the synthesized attribute S1.code will appear in the record for S1.

Let us follow the parsing process for a while-statement. Suppose that a record holding S.next appears on the top of the stack, and the next input is the terminal while. We shift this terminal onto the stack. It is then certain that the production being recognized is the while-production, so the LR parser can shift "(" and determine that its next step must be to reduce ε to M. The stack at this time is shown in Fig. 5.36. We also show in that figure the action that is associated with the reduction to M. We create values for L1 and L2, which live in fields of the M-record. Also in that record are fields for C.true and C.false. These attributes must be in the second and third fields of the record, for consistency with other stack records that might appear below C in other contexts and also must provide these attributes for C. The action completes by assigning values to C.true and C.false, one from the L2 just generated, and the other by reaching down the stack to where we know S.next is found.

[Figure 5.36: the stack holds, from the bottom: a record containing S.next, the record for while, the record for (, and on top the record for M, whose fields are C.true, C.false, L1, and L2. The code executed during the reduction of ε to M is
    L1 = new(); L2 = new();
    C.true = L2;
    C.false = stack[top - 3].next; ]

Figure 5.36: LR parsing stack after reduction of ε to M

We presume that the next inputs are properly reduced to C. The synthesized attribute C.code is therefore placed in the record for C. This change to the stack is shown in Fig. 5.37, which also incorporates the next several records that are later placed above C on the stack.

[Figure 5.37: the stack holds, from the bottom: the record containing S.next, while, (, M (with fields C.true, C.false, L1, L2), C (with C.code), ), N (with S1.next), and on top S1 (with S1.code).]

Figure 5.37: Stack just before reduction of the while-production body to S

Continuing with the recognition of the while-statement, the parser should next find ")" on the input, which it pushes onto the stack in a record of its own. At that point, the parser, which knows it is working on a while-statement because the grammar is LL, will reduce ε to N. The single piece of data associated with N is the inherited attribute S1.next. Note that this attribute needs

to be in the record for N because that will be just below the record for S1. The code that is executed to compute the value of S1.next is

    S1.next = stack[top - 3].L1;

This action reaches three records below N, which is at the top of stack when the code is executed, and retrieves the value of L1. Next, the parser reduces some prefix of the remaining input to S, which we have consistently referred to as S1 to distinguish it from the S at the head of the production. The value of S1.code is computed and appears in the stack record for S1. This step takes us to the condition that is illustrated in Fig. 5.37. At this point, the parser will reduce everything from while to S1 to S. The code that is executed during this reduction is:

    tempCode = "label" || stack[top - 4].L1 || stack[top - 3].code ||
               "label" || stack[top - 4].L2 || stack[top].code;
    top = top - 6;
    stack[top].code = tempCode;

That is, we construct the value of S.code in a variable tempCode. That code is the usual one, consisting of the two labels L1 and L2, the code for C, and the code for S1. The stack is popped, so S appears where while was. The value of the code for S is placed in the code field of that record, where it can be interpreted as the synthesized attribute S.code. Note that we do not show, in any of this discussion, the manipulation of LR states, which must also appear on the stack in the field that we have populated with grammar symbols. □
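As a companion to this walk-through, the sketch below replays the three reductions of Example 5.26 in Java on a list-based stack, so the stack[top - k] arithmetic above can be checked by running it. It is only an illustration under the assumptions of this section (LR states omitted, attribute fields kept in a map); the names Rec, push, top, pop, and newLabel are invented here.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MarkerReductionSketch {
        // One stack record: the grammar symbol plus its attribute fields.
        static class Rec {
            final String sym;
            final Map<String, String> f = new HashMap<>();
            Rec(String sym) { this.sym = sym; }
        }

        static final List<Rec> stack = new ArrayList<>();
        static int labels = 0;

        static String newLabel() { return "L" + (++labels); }
        static Rec top(int k) { return stack.get(stack.size() - 1 - k); }   // stack[top - k]
        static void push(String sym) { stack.add(new Rec(sym)); }
        static void pop(int n) { for (int i = 0; i < n; i++) stack.remove(stack.size() - 1); }

        public static void main(String[] args) {
            // The record below the while-production body holds S.next (here, label "x").
            push("below"); top(0).f.put("next", "x");
            push("while"); push("(");

            // Reduce epsilon to M (Fig. 5.36); M holds L1, L2 and C's inherited attributes.
            push("M");
            top(0).f.put("L1", newLabel());
            top(0).f.put("L2", newLabel());
            top(0).f.put("C.true", top(0).f.get("L2"));
            top(0).f.put("C.false", top(3).f.get("next"));   // stack[top - 3].next

            // Assume the condition has been reduced to C, leaving C.code in its record.
            push("C"); top(0).f.put("code", "<code for C>");
            push(")");

            // Reduce epsilon to N; N holds the inherited attribute S1.next.
            push("N");
            top(0).f.put("S1.next", top(3).f.get("L1"));     // stack[top - 3].L1

            // Assume the loop body has been reduced to S1, leaving S1.code in its record.
            push("S1"); top(0).f.put("code", "<code for S1>");

            // Reduce the whole body  while ( M C ) N S1  to S.
            String tempCode = "label " + top(4).f.get("L1") + " " + top(3).f.get("code")
                            + " label " + top(4).f.get("L2") + " " + top(0).f.get("code");
            pop(7);                                          // pop the seven body records
            push("S");                                       // S appears where while was
            top(0).f.put("code", tempCode);
            System.out.println(top(0).f.get("code"));
        }
    }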

5.5.5 Exercises for Section 5.5

Exercise 5.5.1: Implement each of your SDD's of Exercise 5.4.4 as a recursive-descent parser in the style of Section 5.5.1.

Exercise 5.5.2: Implement each of your SDD's of Exercise 5.4.4 as a recursive-descent parser in the style of Section 5.5.2.

Exercise 5.5.3: Implement each of your SDD's of Exercise 5.4.4 with an LL parser in the style of Section 5.5.3, with code generated "on the fly."

Exercise 5.5.4: Implement each of your SDD's of Exercise 5.4.4 with an LL parser in the style of Section 5.5.3, but with code (or pointers to the code) stored on the stack.

Exercise 5.5.5: Implement each of your SDD's of Exercise 5.4.4 with an LR parser in the style of Section 5.5.4.

Exercise 5.5.6: Implement your SDD of Exercise 5.2.4 in the style of Section 5.5.1. Would an implementation in the style of Section 5.5.2 be any different?

5.6 Summary of Chapter 5

• Inherited and Synthesized Attributes: Syntax-directed definitions may use two kinds of attributes. A synthesized attribute at a parse-tree node is computed from attributes at its children. An inherited attribute at a node is computed from attributes at its parent and/or siblings.

• Dependency Graphs: Given a parse tree and an SDD, we draw edges among the attribute instances associated with each parse-tree node to denote that the value of the attribute at the head of the edge is computed in terms of the value of the attribute at the tail of the edge.

• Cyclic Definitions: In problematic SDD's, we find that there are some parse trees for which it is impossible to find an order in which we can compute all the attributes at all nodes. These parse trees have cycles in their associated dependency graphs. It is intractable to decide whether an SDD has such circular dependency graphs.

• S-Attributed Definitions: In an S-attributed SDD, all attributes are synthesized.

• L-Attributed Definitions: In an L-attributed SDD, attributes may be inherited or synthesized. However, inherited attributes at a parse-tree node may depend only on inherited attributes of its parent and on (any) attributes of siblings to its left.

• Syntax Trees: Each node in a syntax tree represents a construct; the children of the node represent the meaningful components of the construct.

• Implementing S-Attributed SDD's: An S-attributed definition can be implemented by an SDT in which all actions are at the end of the production (a "postfix" SDT). The actions compute the synthesized attributes of the production head in terms of synthesized attributes of the symbols in the body. If the underlying grammar is LR, then this SDT can be implemented on the LR parser stack.

• Eliminating Left Recursion From SDT's: If an SDT has only side-effects (no attributes are computed), then the standard left-recursion-elimination algorithm for grammars allows us to carry the actions along as if they were terminals. When attributes are computed, we can still eliminate left recursion if the SDT is a postfix SDT.

• Implementing L-Attributed SDD's by Recursive-Descent Parsing: If we have an L-attributed definition on a top-down parsable grammar, we can build a recursive-descent parser with no backtracking to implement the translation. Inherited attributes become arguments of the functions for their nonterminals, and synthesized attributes are returned by that function.

• Implementing L-Attributed SDD's on an LL Grammar: Every L-attributed definition with an underlying LL grammar can be implemented along with the parse. Records to hold the synthesized attributes for a nonterminal are placed below that nonterminal on the stack, while inherited attributes for a nonterminal are stored with that nonterminal on the stack. Action records are also placed on the stack to compute attributes at the appropriate time.

• Implementing L-Attributed SDD's on an LL Grammar, Bottom-Up: An L-attributed definition with an underlying LL grammar can be converted to a translation on an LR grammar and the translation performed in connection with a bottom-up parse. The grammar transformation introduces "marker" nonterminals that appear on the bottom-up parser's stack and hold inherited attributes of the nonterminal above it on the stack. Synthesized attributes are kept with their nonterminal on the stack.

5.7 References for Chapter 5

Syntax-directed definitions are a form of inductive definition in which the induction is on the syntactic structure. As such they have long been used informally in mathematics. Their application to programming languages came with the use of a grammar to structure the Algol 60 report.

The idea of a parser that calls for semantic actions can be found in Samelson and Bauer [8] and Brooker and Morris [1]. Irons [2] constructed one of the first syntax-directed compilers, using synthesized attributes. The class of L-attributed definitions comes from [6].

Inherited attributes, dependency graphs, and a test for circularity of SDD's (that is, whether or not there is some parse tree with no order in which the attributes can be computed) are from Knuth [5]. Jazayeri, Ogden, and Rounds [3] showed that testing circularity requires exponential time, as a function of the size of the SDD.

Parser generators such as Yacc [4] (see also the bibliographic notes in Chapter 4) support attribute evaluation during parsing. The survey by Paakki [7] is a starting point for accessing the extensive literature on syntax-directed definitions and translations.

1. Brooker, R. A. and D. Morris, "A general translation program for phrase structure languages," J. ACM 9:1 (1962), pp. 1-10.

2. Irons, E. T., "A syntax directed compiler for Algol 60," Comm. ACM 4:1 (1961), pp. 51-55.

3. Jazayeri, M., W. F. Ogden, and W. C. Rounds, "The intrinsic exponential complexity of the circularity problem for attribute grammars," Comm. ACM 18:12 (1975), pp. 697-706.

4. Johnson, S. C., "Yacc - Yet Another Compiler Compiler," Computing Science Technical Report 32, Bell Laboratories, Murray Hill, NJ, 1975. Available at http://dinosaur.compilertools.net/yacc/ .

5. Knuth, D. E., "Semantics of context-free languages," Mathematical Systems Theory 2:2 (1968), pp. 127-145. See also Mathematical Systems Theory 5:1 (1971), pp. 95-96.

6. Lewis, P. M. II, D. J. Rosenkrantz, and R. E. Stearns, "Attributed translations," J. Computer and System Sciences 9:3 (1974), pp. 279-307.

7. Paakki, J., "Attribute grammar paradigms - a high-level methodology in language implementation," Computing Surveys 27:2 (1995), pp. 196-255.

8. Samelson, K. and F. L. Bauer, "Sequential formula translation," Comm. ACM 3:2 (1960), pp. 76-83.

Chapter 6

Intermediate-Code Generation

In the analysis-synthesis model of a compiler, the front end analyzes a source program and creates an intermediate representation, from which the back end generates target code. Ideally, details of the source language are confined to the front end, and details of the target machine to the back end. With a suitably defined intermediate representation, a compiler for language i and machine j can then be built by combining the front end for language i with the back end for machine j. This approach to creating suites of compilers can save a considerable amount of effort: m × n compilers can be built by writing just m front ends and n back ends.

This chapter deals with intermediate representations, static type checking, and intermediate code generation. For simplicity, we assume that a compiler front end is organized as in Fig. 6.1, where parsing, static checking, and intermediate-code generation are done sequentially; sometimes they can be combined and folded into parsing. We shall use the syntax-directed formalisms of Chapters 2 and 5 to specify checking and translation. Many of the translation schemes can be implemented during either bottom-up or top-down parsing, using the techniques of Chapter 5. All schemes can be implemented by creating a syntax tree and then walking the tree.

[Figure 6.1: the front end as a pipeline - Parser, Static Checker, Intermediate Code Generator - producing intermediate code that is handed to the Code Generator of the back end.]

Figure 6.1: Logical structure of a compiler front end

Static checking includes type checking, which ensures that operators are applied to compatible operands. It also includes any syntactic checks that remain

after parsing. For example, static checking assures that a break-statement in C is enclosed within a while-, for-, or switch-statement; an error is reported if such an enclosing statement does not exist.

The approach in this chapter can be used for a wide range of intermediate representations, including syntax trees and three-address code, both of which were introduced in Section 2.8. The term "three-address code" comes from instructions of the general form x = y op z with three addresses: two for the operands y and z and one for the result x.

In the process of translating a program in a given source language into code for a given target machine, a compiler may construct a sequence of intermediate representations, as in Fig. 6.2. High-level representations are close to the source language and low-level representations are close to the target machine. Syntax trees are high level; they depict the natural hierarchical structure of the source program and are well suited to tasks like static type checking.

[Figure 6.2: Source Program → High-Level Intermediate Representation → ... → Low-Level Intermediate Representation → Target Code.]

Figure 6.2: A compiler might use a sequence of intermediate representations

A low-level representation is suitable for machine-dependent tasks like register allocation and instruction selection. Three-address code can range from high- to low-level, depending on the choice of operators. For expressions, the differences between syntax trees and three-address code are superficial, as we shall see in Section 6.2.3. For looping statements, for example, a syntax tree represents the components of a statement, whereas three-address code contains labels and jump instructions to represent the flow of control, as in machine language.

The choice or design of an intermediate representation varies from compiler to compiler. An intermediate representation may either be an actual language or it may consist of internal data structures that are shared by phases of the compiler. C is a programming language, yet it is often used as an intermediate form because it is flexible, it compiles into efficient machine code, and its compilers are widely available. The original C++ compiler consisted of a front end that generated C, treating a C compiler as a back end.

6.1 Variants of Syntax Trees

Nodes in a syntax tree represent constructs in the source program; the children of a node represent the meaningful components of a construct. A directed acyclic graph (hereafter called a DAG) for an expression identifies the common subexpressions (subexpressions that occur more than once) of the expression. As we shall see in this section, DAG's can be constructed by using the same techniques that construct syntax trees.

6.1.1 Directed Acyclic Graphs for Expressions

Like the syntax tree for an expression, a DAG has leaves corresponding to atomic operands and interior nodes corresponding to operators. The difference is that a node N in a DAG has more than one parent if N represents a common subexpression; in a syntax tree, the tree for the common subexpression would be replicated as many times as the subexpression appears in the original expression. Thus, a DAG not only represents expressions more succinctly, it gives the compiler important clues regarding the generation of efficient code to evaluate the expressions.

Example 6.1: Figure 6.3 shows the DAG for the expression

a + a * (b - c) + (b - c) * d

The leaf for a has two parents, because a appears twice in the expression. More interestingly, the two occurrences of the common subexpression b-c are represented by one node, the node labeled -. That node has two parents, representing its two uses in the subexpressions a*(b-c) and (b-c)*d. Even though b and c appear twice in the complete expression, their nodes each have one parent, since both uses are in the common subexpression b-c. □

[Figure 6.3: the DAG, in which the leaf for a and the - node for b - c each have two parents, as described in Example 6.1.]

Figure 6.3: DAG for the expression a + a * (b - c) + (b - c) * d

The SDD of Fig. 6.4 can construct either syntax trees or DAG's. It was used to construct syntax trees in Example 5.11, where functions Leaf and Node created a fresh node each time they were called. It will construct a DAG if, before creating a new node, these functions first check whether an identical node already exists. If a previously created identical node exists, the existing node is returned. For instance, before constructing a new node Node(op, left, right), we check whether there is already a node with label op, and children left and right, in that order. If so, Node returns the existing node; otherwise, it creates a new node.

Example 6.2: The sequence of steps shown in Fig. 6.5 constructs the DAG in Fig. 6.3, provided Node and Leaf return an existing node, if possible, as

       PRODUCTION             SEMANTIC RULES
    1) E → E1 + T             E.node = new Node('+', E1.node, T.node)
    2) E → E1 - T             E.node = new Node('-', E1.node, T.node)
    3) E → T                  E.node = T.node
    4) T → ( E )              T.node = E.node
    5) T → id                 T.node = new Leaf(id, id.entry)
    6) T → num                T.node = new Leaf(num, num.val)

Figure 6.4: Syntax-directed definition to produce syntax trees or DAG's

     1) p1 = Leaf(id, entry-a)
     2) p2 = Leaf(id, entry-a) = p1
     3) p3 = Leaf(id, entry-b)
     4) p4 = Leaf(id, entry-c)
     5) p5 = Node('-', p3, p4)
     6) p6 = Node('*', p1, p5)
     7) p7 = Node('+', p1, p6)
     8) p8 = Leaf(id, entry-b) = p3
     9) p9 = Leaf(id, entry-c) = p4
    10) p10 = Node('-', p3, p4) = p5
    11) p11 = Leaf(id, entry-d)
    12) p12 = Node('*', p5, p11)
    13) p13 = Node('+', p7, p12)

Figure 6.5: Steps for constructing the DAG of Fig. 6.3

discussed above. We assume that entry-a points to the symbol-table entry for a, and similarly for the other identifiers. When the call to Leaf(id, entry-a) is repeated at step 2, the node created by the previous call is returned, so p2 = p1. Similarly, the nodes returned at steps 8 and 9 are the same as those returned at steps 3 and 4 (i.e., p8 = p3 and p9 = p4). Hence the node returned at step 10 must be the same as that returned at step 5; i.e., p10 = p5. □
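A minimal sketch of this idea, assuming a Java setting: Node and Leaf keep a table of previously built nodes, keyed by label and children, and return the existing node when the same call is repeated. The class names and the key format are invented for the illustration; the next section describes the more systematic value-number approach.

    import java.util.HashMap;
    import java.util.Map;

    public class DagBuilderSketch {
        // A node records its label and (for interior nodes) its two children.
        static class N {
            final String label; final N left, right;
            N(String label, N left, N right) { this.label = label; this.left = left; this.right = right; }
        }

        // Table of existing nodes, keyed by label and the identities of the children
        // (identityHashCode stands in here for a value number).
        static final Map<String, N> existing = new HashMap<>();

        static N leaf(String label, String entry) {
            return existing.computeIfAbsent(label + "#" + entry, k -> new N(label + ":" + entry, null, null));
        }
        static N node(String op, N left, N right) {
            String key = op + "#" + System.identityHashCode(left) + "#" + System.identityHashCode(right);
            return existing.computeIfAbsent(key, k -> new N(op, left, right));
        }

        public static void main(String[] args) {
            // A few of the steps of Fig. 6.5 for a + a * (b - c) + (b - c) * d.
            N p1 = leaf("id", "entry-a");
            N p2 = leaf("id", "entry-a");        // returns p1 again
            N p5 = node("-", leaf("id", "entry-b"), leaf("id", "entry-c"));
            N p10 = node("-", leaf("id", "entry-b"), leaf("id", "entry-c"));  // returns p5 again
            System.out.println(p1 == p2);        // true
            System.out.println(p5 == p10);       // true
        }
    }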

6.1.2 The Value-Number Method for Constructing DAG's

Often, the nodes of a syntax tree or DAG are stored in an array of records, as suggested by Fig. 6.6. Each row of the array represents one record, and therefore one node. In each record, the first field is an operation code, indicating the label of the node. In Fig. 6.6(b), leaves have one additional field, which holds the lexical value (either a symbol-table pointer or a constant, in this case), and

interior nodes have two additional fields indicating the left and right children.

[Figure 6.6(a): the DAG for i = i + 10. Figure 6.6(b): the array of records, in which value numbers index the rows:
    1   id    (points to the symbol-table entry for i)
    2   num   10
    3   +     1   2
    4   =     1   3
    5   ... ]

Figure 6.6: Nodes of a DAG for i = i + 10 allocated in an array

In this array, we refer to nodes by giving the integer index of the record for that node within the array. This integer historically has been called the value number for the node or for the expression represented by the node. For instance, in Fig. 6.6, the node labeled + has value number 3, and its left and right children have value numbers 1 and 2, respectively. In practice, we could use pointers to records or references to objects instead of integer indexes, but we shall still refer to the reference to a node as its "value number."

If stored in an appropriate data structure, value numbers help us construct expression DAG's efficiently; the next algorithm shows how. Suppose that nodes are stored in an array, as in Fig. 6.6, and each node is referred to by its value number. Let the signature of an interior node be the triple (op, l, r), where op is the label, l its left child's value number, and r its right child's value number. A unary operator may be assumed to have r = 0.

Algorithm 6.3: The value-number method for constructing the nodes of a

DAG.

INPUT: Label op, node l, and node r.

OUTPUT: The value number of a node in the array with signature (op, l, r).

METHOD: Search the array for a node M with label op, left child l, and right child r. If there is such a node, return the value number of M. If not, create in the array a new node N with label op, left child l, and right child r, and return its value number. □

While Algorithm 6.3 yields the desired output, searching the entire array every time we are asked to locate one node is expensive, especially if the array holds expressions from an entire program. A more efficient approach is to use a hash table, in which the nodes are put into "buckets," each of which typically will have only a few nodes. The hash table is one of several data structures that support dictionaries efficiently.¹ A dictionary is an abstract data type that allows us to insert and delete elements of a set, and to determine whether a given element is currently in the set. A good data structure for dictionaries, such as a hash table, performs each of these operations in time that is constant or close to constant, independent of the size of the set.

To construct a hash table for the nodes of a DAG, we need a hash function h that computes the index of the bucket for a signature (op, l, r), in a way that distributes the signatures across buckets, so that it is unlikely that any one bucket will get much more than a fair share of the nodes. The bucket index h(op, l, r) is computed deterministically from op, l, and r, so that we may repeat the calculation and always get to the same bucket index for node (op, l, r).

The buckets can be implemented as linked lists, as in Fig. 6.7. An array, indexed by hash value, holds the bucket headers, each of which points to the first cell of a list. Within the linked list for a bucket, each cell holds the value number of one of the nodes that hash to that bucket. That is, node (op, l, r) can be found on the list whose header is at index h(op, l, r) of the array.

¹ See Aho, A. V., J. E. Hopcroft, and J. D. Ullman, Data Structures and Algorithms, Addison-Wesley, 1983, for a discussion of data structures supporting dictionaries.

[Figure 6.7: an array of bucket headers, indexed by hash value; each header points to a linked list of cells, and each cell holds the value number (e.g., 9 or 20) of one node that hashes to that bucket.]

Figure 6.7: Data structure for searching buckets

Thus, given the input node op, l, and r, we compute the bucket index h(op, l, r) and search the list of cells in this bucket for the given input node. Typically, there are enough buckets so that no list has more than a few cells. We may need to look at all the cells within a bucket, however, and for each value number v found in a cell, we must check whether the signature (op, l, r) of the input node matches the node with value number v in the list of cells (as in Fig. 6.7). If we find a match, we return v. If we find no match, we know no such node can exist in any other bucket, so we create a new cell, add it to the list of cells for bucket index h(op, l, r), and return the value number in that new cell.
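The sketch below is one way this might look in Java, with nodes in an ArrayList (so the value number is the index) and a HashMap taking the place of the explicit bucket array and lists. The class name, method name, and key format are assumptions made for the illustration, not the book's code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ValueNumberSketch {
        static class Node {
            final String op; final int l, r;   // the signature; leaves reuse l for the lexical value, r = 0
            Node(String op, int l, int r) { this.op = op; this.l = l; this.r = r; }
        }

        final List<Node> nodes = new ArrayList<>();          // value number = index into this array
        final Map<String, Integer> table = new HashMap<>();  // signature -> value number (the "buckets")

        int node(String op, int l, int r) {
            String signature = op + "#" + l + "#" + r;
            Integer v = table.get(signature);
            if (v != null) return v;                         // an identical node already exists
            nodes.add(new Node(op, l, r));
            int value = nodes.size() - 1;
            table.put(signature, value);
            return value;
        }

        public static void main(String[] args) {
            ValueNumberSketch dag = new ValueNumberSketch();
            // Build the DAG for i = i + 10 (cf. Fig. 6.6); value numbers here are 0-based.
            int i   = dag.node("id", 1, 0);                  // 1 stands in for i's symbol-table entry
            int ten = dag.node("num", 10, 0);
            int sum = dag.node("+", i, ten);
            int asg = dag.node("=", i, sum);
            int again = dag.node("+", i, ten);               // returns the existing + node
            System.out.println(sum == again);                // true: the node was not duplicated
        }
    }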

6.1.3 Exercises for Section 6.1

Exercise 6.1.1: Construct the DAG for the expression

    ((x + y) - ((x + y) * (x - y))) + ((x + y) * (x - y))

Exercise 6.1.2: Construct the DAG and identify the value numbers for the subexpressions of the following expressions, assuming + associates from the left.

a) a + b + (a + b).

b) a + b + a + b.

c) a + a + (a + a + a + (a + a + a + a)).

6.2 Three-Address Code

In three-address code, there is at most one operator on the right side of an instruction; that is, no built-up arithmetic expressions are permitted. Thus a source-language expression like x+y*z might be translated into the sequence of three-address instructions

    t1 = y * z
    t2 = x + t1

where t1 and t2 are compiler-generated temporary names. This unraveling of multi-operator arithmetic expressions and of nested flow-of-control statements makes three-address code desirable for target-code generation and optimization, as discussed in Chapters 8 and 9. The use of names for the intermediate values computed by a program allows three-address code to be rearranged easily.

Example 6.4: Three-address code is a linearized representation of a syntax tree or a DAG in which explicit names correspond to the interior nodes of the graph. The DAG in Fig. 6.3 is repeated in Fig. 6.8, together with a corresponding three-address code sequence. □

[Figure 6.8(a): the DAG of Fig. 6.3. Figure 6.8(b): the corresponding three-address code:
    t1 = b - c
    t2 = a * t1
    t3 = a + t2
    t4 = t1 * d
    t5 = t3 + t4 ]

Figure 6.8: A DAG and its corresponding three-address code

6.2.1 Addresses and Instructions

Three-address code is built from two concepts: addresses and instructions. In object-oriented terms, these concepts correspond to classes, and the various kinds of addresses and instructions correspond to appropriate subclasses. Alternatively, three-address code can be implemented using records with fields for the addresses; records called quadruples and triples are discussed briefly in Section 6.2.2. An address can be one of the following:

• A name. For convenience, we allow source-program names to appear as addresses in three-address code. In an implementation, a source name is replaced by a pointer to its symbol-table entry, where all information about the name is kept.

• A constant. In practice, a compiler must deal with many different types of constants and variables. Type conversions within expressions are considered in Section 6.5.2.

• A compiler-generated temporary. It is useful, especially in optimizing compilers, to create a distinct name each time a temporary is needed. These temporaries can be combined, if possible, when registers are allocated to variables.

We now consider the common three-address instructions used in the rest of this book. Symbolic labels will be used by instructions that alter the flow of control. A symbolic label represents the index of a three-address instruction in the sequence of instructions. Actual indexes can be substituted for the labels, either by making a separate pass or by "backpatching," discussed in Section 6.7. Here is a list of the common three-address instruction forms:

1. Assignment instructions of the form x = y op z, where op is a binary arithmetic or logical operation, and x, y, and z are addresses.

2. Assignments of the form x = op y, where op is a unary operation. Essential unary operations include unary minus, logical negation, shift operators, and conversion operators that, for example, convert an integer to a floating-point number.

3. Copy instructions of the form x = y, where x is assigned the value of y.

4. An unconditional jump goto L. The three-address instruction with label L is the next to be executed.

5. Conditional jumps of the form if x goto L and ifFalse x goto L. These instructions execute the instruction with label L next if x is true and false, respectively. Otherwise, the following three-address instruction in sequence is executed next, as usual.

6. Conditional jumps such as if x relop y goto L, which apply a relational operator (<, ==, >=, etc.) to x and y, and execute the instruction with label L next if x stands in relation relop to y. If not, the three-address instruction following if x relop y goto L is executed next, in sequence.

7. Procedure calls and returns are implemented using the following instructions: param x for parameters; call p, n and y = call p, n for procedure and function calls, respectively; and return y, where y, representing a returned value, is optional. Their typical use is as the sequence of three-address instructions

        param x1
        param x2
        ...
        param xn
        call p, n

   generated as part of a call of the procedure p(x1, x2, ..., xn). The integer n, indicating the number of actual parameters in "call p, n," is not redundant because calls can be nested. That is, some of the first param statements could be parameters of a call that comes after p returns its value; that value becomes another parameter of the later call. The implementation of procedure calls is outlined in Section 6.9.

8. Indexed copy instructions of the form x = y[i] and x[i] = y. The instruction x = y[i] sets x to the value in the location i memory units beyond location y. The instruction x[i] = y sets the contents of the location i units beyond x to the value of y.

9. Address and pointer assignments of the form x = &y, x = *y, and *x = y. The instruction x = &y sets the r-value of x to be the location (l-value) of y.² Presumably y is a name, perhaps a temporary, that denotes an expression with an l-value such as A[i][j], and x is a pointer name or temporary. In the instruction x = *y, presumably y is a pointer or a temporary whose r-value is a location. The r-value of x is made equal to the contents of that location. Finally, *x = y sets the r-value of the object pointed to by x to the r-value of y.

Example 6.5: Consider the statement

    do i = i+1; while ( a[i] < v );

Two possible translations of this statement are shown in Fig. 6.9. The translation in Fig. 6.9(a) uses a symbolic label L, attached to the first instruction. The translation in (b) shows position numbers for the instructions, starting arbitrarily at position 100. In both translations, the last instruction is a conditional jump to the first instruction. The multiplication i * 8 is appropriate for an array of elements that each take 8 units of space. □

² From Section 2.8.3, l- and r-values are appropriate on the left and right sides of assignments, respectively.

100: 101: 102: 103: 104:

tl

= i + 1 i = tl t2 = i * 8 t 3 = a [ t2 ] if t 3 < v goto 100

(b) Position numbers.

Figure 6.9: Two ways of assigning labels to three-address statements The choice of allowable operators is an important issue in the design of an intermediate form. The operator set clearly must be rich enough to implement the operations in the source language. Operators that are close to machine instructions make it easier to implement the intermediate form on a target machine. However, if the front end must generate long sequences of instructions for some source-language operations, then the optimizer and code generator may have to work harder to rediscover the structure and generate good code for these operations.

6.2.2

Quadruples

The description of three-address instructions specifies the components of each type of instruction, but it does not specify the representation of these instruc­ tions in a data structure. In a compiler, these instructions can be implemented as objects or as records with fields for the operator and the operands. Three such representations are called "quadruples," "triples," and "indirect triples." A quadruple (or just "quaff' ) has four fields, which we call op, argl ' arg2 , and result. The op field contains an internal code for the operator. For instance, the three-address instruction x = y + Z is represented by placing + in op, y in argl ' z in arg2 , and x in result. The following are some exceptions to this rule: 1. Instructions with unary operators like x = minus y or x = y do not use arg2 . Note that for a copy statement like x = y, op is =, while for most other operations, the assignment operator is implied. 2. Operators like param use neither arg2 nor result. 3. Conditional and unconditional jumps put the target label in result.

Example 6.6 : Three-address code for the assignment a = b * - c + b * - c ; appears in Fig. 6.10(a) . The special operator minus is used to distinguish the

367

6.2. THREE-ADDRESS CODE

unary minus operator, as in c, from the binary minus operator, as in b c . Note that the unary-minus "three-address" statement has only two addresses, as does the copy statement a = t 5 · The quadruples in Fig. 6.10(b) implement the three-address code in ( a) . 0 -

-

op tl t2 t3 t4 t5 a

minus b

*

tl

minus

0 minus 1

c

t3 t 2 + t4 t5

b =

c

*

(a) Three- address code

*

2 minus 3

4

5

*

+ =

arg1 arg2 result c I tl I b I tl I t 2 I c I I t3 I t3 b I I t4 I t4 t I t5 I 2 I a I I I t5

I

..

.

(b) Quadruples

Figure 6.10: Three-address code and its quadruple representation For readability, we use actual identifiers like a, b, and c in the fields arg1 , arg2 , and result in Fig. 6.10(b ) , instead of pointers to their symbol-table entries.

Temporary names can either by entered into the symbol table like programmer­ defined names, or they can be implemented as objects of a class Temp with its own methods.

6.2.3 Triples A triple has only three fields, which we call op, arg1 , and arg2 ' Note that the result field in Fig. 6.10 ( b ) is used primarily for temporary names. Using

triples, we refer to the result of an operation x op y by its position, rather than by an explicit temporary name. Thus, instead of the temporary t l in Fig. 6. 10(b ) , a triple representation would refer to position (0 ) . Parenthesized numbers represent pointers into the triple structure itself. In Section 6 . 1 .2, positions or pointers to positions were called value numbers. Triples are equivalent to signatures in Algorithm 6.3. Hence, the DAG and triple representations of expressions are equivalent. The equivalence ends with expressions, since syntax-tree variants and three-address code represent control flow quite differently. Example 6 . 7 : The syntax tree and triples in Fig. 6.11 correspond to the three-address code and quadruples in Fig. 6.10. In the triple representation in Fig. 6.11 ( b ) , the copy statement a = t5 is encoded in the triple representation by placing a in the argl field and (4) in the arg2 field. 0

A ternary operation like x [iJ = y requires two entries in the triple structure; for example, we can put x and i in one triple and y in the next. Similarly, x = y [iJ can implemented by treating it as if it were the two instructions

368

CHAPTER 6. INTERMEDIATE-CODE GENERATION

Why Do We Need Copy Instructions? A simple algorithm for translating expressions generates copy instructions for assignments, as in Fig. 6.10(a) , where we copy t5 into a rather than assigning t 2 + t4 to a directly. Each subexpression typically gets its own, new temporary to hold its result, and only when the assignment operator = is processed do we learn where to put the value of the complete expression. A code-optimization pass, perhaps using the DAG of Section 6.1 . 1 as an intermediate form, can discover that t5 can be replaced by a.

a

/ "" *

b

+

/ ""

/ \

minus b

/

*

I

C

I

1

I

b

I

c

I

I

b

I

I

(1 )

I

*

2 minus * 3

""

c

0 minus

minus

I

4 5

+ =

Figure 6.11: Representations of a + a * (b

a

I

c

(a) Syntax tree

I

. . .

I

(0) (2)

(3 )

(4)

(b) Triples -

) + (b

c

-

) d

c �

t = y [i] and x = t, where t is a compiler-generated temporary. Note that the temporary t does not actually appear in a triple, since temporary values are referred to by their position in the triple structure. A benefit of quadruples over triples can be seen in an optimizing compiler, where instructions are oftep moved around. With quadruples, if we move an instruction that computes a temporary t, then the instructions that use t require no change. With triples, the result of an operation is referred to by its position, so moving an instruction may require us to change all references to that result. This problem does not occur with indirect triples, which we consider next. Indirect triples consist of a listing of pointers to triples, rather than a listing of triples themselves. For example, let us use an array instruction to list pointers to triples in the desired order. Then, the triples in Fig. 6.11 (b) might be represented as in Fig. 6.12. With indirect triples, an optimizing compiler can move an instruction by reordering the instruction list, without affecting the triples themselves. When implemented in Java, an array of instruction objects is analogous to an indi­ rect triple representation, since Java treats the array elements as references to objects.

369

6.2. THREE-ADDRESS CODE instruction 35 (0) 36 ( 1 ) 37 ( 2 ) 38 ( 3) 39 (4) 4 0 ( 5)

0 minus I c * 1 I b 2 minus

3

*

4

+

5

=

c

I

I

I

I

b

(1) . .

a

.

I

I

(0)

I

(2)

I

I

( 3)

I

(4 )

Figure 6.12: Indirect triples representation of three-address code
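A rough Java sketch of the two ideas - again with invented names, not the book's code: a Triple refers to earlier results by object reference (the analogue of a parenthesized position), and the instruction list is just a list of references that can be reordered without touching the triples themselves.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class IndirectTripleSketch {
        static class Triple {
            final String op;
            final Object arg1, arg2;          // an argument is a name (String) or another Triple
            Triple(String op, Object arg1, Object arg2) { this.op = op; this.arg1 = arg1; this.arg2 = arg2; }
            public String toString() { return op; }
        }

        public static void main(String[] args) {
            // Triples for a = b * - c + b * - c, as in Fig. 6.11(b).
            Triple t0 = new Triple("minus", "c", null);
            Triple t1 = new Triple("*", "b", t0);
            Triple t2 = new Triple("minus", "c", null);
            Triple t3 = new Triple("*", "b", t2);
            Triple t4 = new Triple("+", t1, t3);
            Triple t5 = new Triple("=", "a", t4);

            // The indirect-triples instruction list (cf. Fig. 6.12): reordering this
            // list moves instructions without changing any references inside the triples.
            List<Triple> instruction = new ArrayList<>(List.of(t0, t1, t2, t3, t4, t5));
            Collections.swap(instruction, 0, 2);     // swap two independent instructions
            System.out.println(instruction);         // the triples themselves are untouched
        }
    }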

6.2.4 Static Single-Assignment Form

Static single-assignment form (SSA) is an intermediate representation that facilitates certain code optimizations. Two distinctive aspects distinguish SSA from three-address code. The first is that all assignments in SSA are to variables with distinct names; hence the term static single-assignment. Figure 6.13 shows the same intermediate program in three-address code and in static single-assignment form. Note that subscripts distinguish each definition of variables p and q in the SSA representation.

    p = a + b                    p1 = a + b
    q = p - c                    q1 = p1 - c
    p = q * d                    p2 = q1 * d
    p = e - p                    p3 = e - p2
    q = p + q                    q2 = p3 + q1

    (a) Three-address code.      (b) Static single-assignment form.

Figure 6.13: Intermediate program in three-address code and SSA

The same variable may be defined in two different control-flow paths in a program. For example, the source program

    if ( flag ) x = -1; else x = 1;
    y = x * a;

has two control-flow paths in which the variable x gets defined. If we use different names for x in the true part and the false part of the conditional statement, then which name should we use in the assignment y = x * a? Here is where the second distinctive aspect of SSA comes into play. SSA uses a notational convention called the φ-function to combine the two definitions of x:

    if ( flag ) x1 = -1; else x2 = 1;
    x3 = φ(x1, x2);

Here, φ(x1, x2) has the value x1 if the control flow passes through the true part of the conditional and the value x2 if the control flow passes through the false part. That is to say, the φ-function returns the value of its argument that corresponds to the control-flow path that was taken to get to the assignment-statement containing the φ-function.

6.2.5 Exercises for Section 6.2

Exercise 6.2.1: Translate the arithmetic expression a + -(b + c) into:

a) A syntax tree.

b) Quadruples.

c) Triples.

d) Indirect triples.

Exercise 6.2.2: Repeat Exercise 6.2.1 for the following assignment statements:

i. a = b[i] + c[j].

iii. x = f(y+1) + 2.

iv. x = *p + &y.

! Exercise 6.2.3: Show how to transform a three-address code sequence into

one in which each defined variable gets a unique variable name.

6.3 Types and Declarations

The applications of types can be grouped under checking and translation:

• Type checking uses logical rules to reason about the behavior of a program at run time. Specifically, it ensures that the types of the operands match the type expected by an operator. For example, the && operator in Java expects its two operands to be booleans; the result is also of type boolean.

• Translation Applications. From the type of a name, a compiler can determine the storage that will be needed for that name at run time. Type information is also needed to calculate the address denoted by an array reference, to insert explicit type conversions, and to choose the right version of an arithmetic operator, among other things.


In this section, we examine types and storage layout for names declared within a procedure or a class. The actual storage for a procedure call or an object is allocated at run time, when the procedure is called or the object is created. As we examine local declarations at compile time, we can, however, lay out relative addresses, where the relative address of a name or a component of a data structure is an offset from the start of a data area.

6.3.1 Type Expressions

Types have structure, which we shall represent using type expressions: a type expression is either a basic type or is formed by applying an operator called a type constructor to a type expression. The sets of basic types and constructors depend on the language to be checked.

Example 6.8: The array type int[2][3] can be read as "array of 2 arrays of 3 integers each" and written as a type expression array(2, array(3, integer)). This type is represented by the tree in Fig. 6.14. The operator array takes two parameters, a number and a type. □

        array
        /    \
       2    array
            /    \
           3    integer

Figure 6.14: Type expression for int[2][3]

We shall use the following definition of type expressions:

• A basic type is a type expression. Typical basic types for a language include boolean, char, integer, float, and void; the latter denotes "the absence of a value."

• A type name is a type expression.

• A type expression can be formed by applying the array type constructor to a number and a type expression.

• A record is a data structure with named fields. A type expression can be formed by applying the record type constructor to the field names and their types. Record types will be implemented in Section 6.3.6 by applying the constructor record to a symbol table containing entries for the fields.

• A type expression can be formed by using the type constructor → for function types. We write s → t for "function from type s to type t." Function types will be useful when type checking is discussed in Section 6.5.

Type Names and Recursive Types

Once a class is defined, its name can be used as a type name in C++ or Java; for example, consider Node in the program fragment

    public class Node { ... }
    ...
    public Node n;

Names can be used to define recursive types, which are needed for data structures such as linked lists. The pseudocode for a list element

    class Cell { int info; Cell next; ... }

defines the recursive type Cell as a class that contains a field info and a field next of type Cell. Similar recursive types can be defined in C using records and pointers. The techniques in this chapter carry over to recursive types.

• If s and t are type expressions, then their Cartesian product s × t is a type expression. Products are introduced for completeness; they can be used to represent a list or tuple of types (e.g., for function parameters). We assume that × associates to the left and that it has higher precedence than →.

• Type expressions may contain variables whose values are type expressions. Compiler-generated type variables will be used in Section 6.5.4.

A convenient way to represent a type expression is to use a graph. The value-number method of Section 6.1.2 can be adapted to construct a DAG for a type expression, with interior nodes for type constructors and leaves for basic types, type names, and type variables; for example, see the tree in Fig. 6.14.³

6.3.2 Type Equivalence

When are two type expressions equivalent? Many type-checking rules have the form, "if two type expressions are equal then return a certain type else error." Potential ambiguities arise when names are given to type expressions and the names are then used in subsequent type expressions. The key issue is whether a name in a type expression stands for itself or whether it is an abbreviation for another type expression.

³ Since type names denote type expressions, they can set up implicit cycles; see the box on "Type Names and Recursive Types." If edges to type names are redirected to the type expressions denoted by the names, then the resulting graph can have cycles due to recursive types.

When type expressions are represented by graphs, two types are structurally equivalent if and only if one of the following conditions is true:

• They are the same basic type.

• They are formed by applying the same constructor to structurally equivalent types.

• One is a type name that denotes the other.

If type names are treated as standing for themselves, then the first two conditions in the above definition lead to name equivalence of type expressions. Name-equivalent expressions are assigned the same value number, if we use Algorithm 6.3. Structural equivalence can be tested using the unification algorithm in Section 6.5.5.
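A small sketch, under the assumption that type expressions are represented as Java objects (the class and method names below are invented for the illustration): basic types and the array constructor build the expression, and a recursive test implements the first two structural-equivalence conditions. Type names that abbreviate other types would be replaced by the expressions they denote before this test is applied.

    public class TypeEquivSketch {
        // Type expressions: a basic type, or array(size, element).
        static abstract class Type { }
        static class Basic extends Type {
            final String name;
            Basic(String name) { this.name = name; }
        }
        static class Arr extends Type {
            final int size; final Type element;
            Arr(int size, Type element) { this.size = size; this.element = element; }
        }

        // Structural equivalence: same basic type, or same constructor applied to
        // structurally equivalent operands.
        static boolean equiv(Type s, Type t) {
            if (s instanceof Basic && t instanceof Basic)
                return ((Basic) s).name.equals(((Basic) t).name);
            if (s instanceof Arr && t instanceof Arr) {
                Arr a = (Arr) s, b = (Arr) t;
                return a.size == b.size && equiv(a.element, b.element);
            }
            return false;
        }

        public static void main(String[] args) {
            Type t1 = new Arr(2, new Arr(3, new Basic("integer")));   // int[2][3]
            Type t2 = new Arr(2, new Arr(3, new Basic("integer")));
            System.out.println(equiv(t1, t2));    // true, even though the nodes are distinct objects
        }
    }

This recursion suffices for acyclic type expressions; recursive types, which as noted above can introduce cycles, require additional care to avoid infinite recursion.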

6.3.3 Declarations

We shall study types and declarations using a simplified grammar that declares just one name at a time; declarations with lists of names can be handled as discussed in Example 5.10. The grammar is

    D → T id ; D | ε
    T → B C | record '{' D '}'
    B → int | float
    C → ε | [ num ] C1

The fragment of the above grammar that deals with basic and array types was used to illustrate inherited attributes in Section 5.3.2. The difference in this section is that we consider storage layout as well as types.

Nonterminal D generates a sequence of declarations. Nonterminal T generates basic, array, or record types. Nonterminal B generates one of the basic types int and float. Nonterminal C, for "component," generates strings of zero or more integers, each integer surrounded by brackets. An array type consists of a basic type specified by B, followed by array components specified by nonterminal C. A record type (the second production for T) is a sequence of declarations for the fields of the record, all surrounded by curly braces.

6.3.4 Storage Layout for Local Names

From the type of a name, we can determine the amount of storage that will be needed for the name at run time. At compile time, we can use these amounts to assign each name a relative address. The type and relative address are saved in the symbol-table entry for the name. Data of varying length, such as strings, or data whose size cannot be determined until run time, such as dynamic arrays, is handled by reserving a known fixed amount of storage for a pointer to the data. Run-time storage management is discussed in Chapter 7.


Address Alignment

The storage layout for data objects is strongly influenced by the addressing constraints of the target machine. For example, instructions to add integers may expect integers to be aligned, that is, placed at certain positions in memory such as an address divisible by 4. Although an array of ten characters needs only enough bytes to hold ten characters, a compiler may therefore allocate 12 bytes (the next multiple of 4), leaving 2 bytes unused. Space left unused due to alignment considerations is referred to as padding. When space is at a premium, a compiler may pack data so that no padding is left; additional instructions may then need to be executed at run time to position packed data so that it can be operated on as if it were properly aligned.
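A minimal sketch of the padding computation just described, assuming a 4-byte alignment boundary; the function name is made up for the example.

    def padded_width(width, alignment=4):
        # round width up to the next multiple of the alignment boundary
        return (width + alignment - 1) // alignment * alignment

    print(padded_width(10))   # 12: an array of ten characters gets 2 bytes of padding
    print(padded_width(12))   # 12: already a multiple of 4, no padding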

Suppose that storage comes in blocks of contiguous bytes, where a byte is the smallest unit of addressable memory. Typically, a byte is eight bits, and some number of bytes form a machine word. Multibyte objects are stored in consecutive bytes and given the address of the first byte. The width of a type is the number of storage units needed for objects of that type. A basic type, such as a character, integer, or float, requires an integral number of bytes. For easy access, storage for aggregates such as arrays and classes is allocated in one contiguous block of bytes.⁴

The translation scheme (SDT) in Fig. 6.15 computes types and their widths for basic and array types; record types will be discussed in Section 6.3.6. The SDT uses synthesized attributes type and width for each nonterminal and two variables t and w to pass type and width information from a B node in a parse tree to the node for the production C → ε. In a syntax-directed definition, t and w would be inherited attributes for C.

The body of the T-production consists of nonterminal B, an action, and nonterminal C, which appears on the next line. The action between B and C sets t to B.type and w to B.width. If B → int then B.type is set to integer and B.width is set to 4, the width of an integer. Similarly, if B → float then B.type is float and B.width is 8, the width of a float.

The productions for C determine whether T generates a basic type or an array type. If C → ε, then t becomes C.type and w becomes C.width. Otherwise, C specifies an array component. The action for C → [ num ] C1 forms C.type by applying the type constructor array to the operands num.value and C1.type. For instance, the result of applying array might be a tree structure such as Fig. 6.14.

⁴ Storage allocation for pointers in C and C++ is simpler if all pointers have the same width. The reason is that the storage for a pointer may need to be allocated before we learn the type of the objects it can point to.

    T  →  B              { t = B.type; w = B.width; }
          C              { T.type = C.type; T.width = C.width; }

    B  →  int            { B.type = integer; B.width = 4; }

    B  →  float          { B.type = float; B.width = 8; }

    C  →  ε              { C.type = t; C.width = w; }

    C  →  [ num ] C1     { C.type = array(num.value, C1.type);
                           C.width = num.value × C1.width; }

Figure 6.15: Computing types and their widths

The width of an array is obtained by multiplying the width of an element by the number of elements in the array. If addresses of consecutive integers differ by 4, then address calculations for an array of integers will include multiplications by 4. Such multiplications provide opportunities for optimization, so it is helpful for the front end to make them explicit. In this chapter, we ignore other machine dependencies such as the alignment of data objects on word boundaries.

Example 6.9: The parse tree for the type int[2][3] is shown by dotted lines in Fig. 6.16. The solid lines show how the type and width are passed from B, down the chain of C's through variables t and w, and then back up the chain as synthesized attributes type and width. The variables t and w are assigned the values of B.type and B.width, respectively, before the subtree with the C nodes is examined. The values of t and w are used at the node for C → ε to start the evaluation of the synthesized attributes up the chain of C nodes. □

[Figure 6.16: Syntax-directed translation of array types. The annotated parse tree for int[2][3]: B passes t = integer and w = 4 down the chain of C nodes; the innermost C (for ε) has type = integer and width = 4, the next C has type = array(3, integer) and width = 12, and the outermost C and T have type = array(2, array(3, integer)) and width = 24.]
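The width calculation of Fig. 6.15 can be sketched directly; the tuple encoding ("array", n, elem), the widths 4 and 8 for integer and float, and the helper array_type that builds the nested type for a declaration such as int[2][3] are all assumptions of this sketch, not the book's data structures.

    BASIC_WIDTH = {"integer": 4, "float": 8}

    def width(t):
        # width of a basic type, or n times the width of the element type
        if isinstance(t, str):
            return BASIC_WIDTH[t]
        _, n, elem = t
        return n * width(elem)

    def array_type(basic, dims):
        # dims [2, 3] builds array(2, array(3, basic)), as for int[2][3]
        t = basic
        for n in reversed(dims):
            t = ("array", n, t)
        return t

    t = array_type("integer", [2, 3])
    print(t)          # ('array', 2, ('array', 3, 'integer'))
    print(width(t))   # 24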


6.3.5 Sequences of Declarations

Languages such as C and Java allow all the declarations in a single procedure to be processed as a group. The declarations may be distributed within a Java procedure, but they can still be processed when the procedure is analyzed. Therefore, we can use a variable, say offset, to keep track of the next available relative address. The translation scheme of Fig. 6.17 deals with a sequence of declarations of the form T id, where T generates a type as in Fig. 6.15. Before the first declaration is considered, offset is set to 0. As each new name x is seen, x is entered into the symbol table with its relative address set to the current value of offset, which is then incremented by the width of the type of x.

    P  →                 { offset = 0; }
          D

    D  →  T id ;         { top.put(id.lexeme, T.type, offset);
                           offset = offset + T.width; }
          D1

    D  →  ε

Figure 6.17: Computing the relative addresses of declared names

The semantic action within the production D → T id ; D1 creates a symbol-table entry by executing top.put(id.lexeme, T.type, offset). Here top denotes the current symbol table. The method top.put creates a symbol-table entry for id.lexeme, with type T.type and relative address offset in its data area.

The initialization of offset in Fig. 6.17 is more evident if the first production appears on one line as:

    P  →  { offset = 0; }  D        (6.1)

Nonterminals generating ε, called marker nonterminals, can be used to rewrite productions so that all actions appear at the ends of right sides; see Section 5.5.4. Using a marker nonterminal M, (6.1) can be restated as:

    P  →  M D
    M  →  ε              { offset = 0; }
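A small sketch of the bookkeeping in Fig. 6.17: each declared name is entered with the current offset, which then grows by the width of the declared type. The dictionary standing in for the symbol table and the sample declaration list are assumptions of the sketch.

    def layout(declarations):
        top, offset = {}, 0                  # symbol table and next free offset
        for name, type_, width in declarations:
            top[name] = (type_, offset)      # top.put(id.lexeme, T.type, offset)
            offset += width                  # offset = offset + T.width
        return top, offset

    table, total = layout([("i", "integer", 4), ("x", "float", 8), ("j", "integer", 4)])
    print(table)   # {'i': ('integer', 0), 'x': ('float', 4), 'j': ('integer', 12)}
    print(total)   # 16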

6.3.6 Fields in Records and Classes

The translation of declarations in Fig. 6.17 carries over to fields in records and classes. Record types can be added to the grammar in Fig. 6.15 by adding the following production:

    T  →  record '{' D '}'


The fields in this record type are specified by the sequence of declarations generated by D. The approach of Fig. 6.17 can be used to determine the types and relative addresses of fields, provided we are careful about two things:

• The field names within a record must be distinct; that is, a name may appear at most once in the declarations generated by D.

• The offset or relative address for a field name is relative to the data area for that record.

Example 6.10: The use of a name x for a field within a record does not conflict with other uses of the name outside the record. Thus, the three uses of x in the following declarations are distinct and do not conflict with each other:

    float x;
    record { float x; float y; } p;
    record { int tag; float x; float y; } q;

A subsequent assignment x = p.x + q.x; sets variable x to the sum of the fields named x in the records p and q. Note that the relative address of x in p differs from the relative address of x in q. □

For convenience, record types will encode both the types and relative addresses of their fields, using a symbol table for the record type. A record type has the form record(t), where record is the type constructor, and t is a symbol-table object that holds information about the fields of this record type.

The translation scheme in Fig. 6.18 consists of a single production to be added to the productions for T in Fig. 6.15. This production has two semantic actions. The embedded action before D saves the existing symbol table, denoted by top, and sets top to a fresh symbol table. It also saves the current offset, and sets offset to 0. The declarations generated by D will result in types and relative addresses being put in the fresh symbol table. The action after D creates a record type using top, before restoring the saved symbol table and offset.

    T  →  record '{'     { Env.push(top); top = new Env();
                           Stack.push(offset); offset = 0; }
          D '}'          { T.type = record(top); T.width = offset;
                           top = Env.pop(); offset = Stack.pop(); }

Figure 6.18: Handling of field names in records

For concreteness, the actions in Fig. 6.18 give pseudocode for a specific implementation. Let class Env implement symbol tables. The call Env.push(top) pushes the current symbol table denoted by top onto a stack. Variable top is then set to a new symbol table. Similarly, offset is pushed onto a stack called Stack. Variable offset is then set to 0.


After the declarations in D have been translated, the symbol table top holds the types and relative addresses of the fields in this record. Further, offset gives the storage needed for all the fields. The second action sets T.type to record(top) and T.width to offset. Variables top and offset are then restored to their pushed values to complete the translation of this record type. This discussion of storage for record types carries over to classes, since no storage is reserved for methods. See Exercise 6.3.2.
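A sketch of the save-and-restore discipline of Fig. 6.18, with plain Python lists playing the roles of class Env and Stack; the field list, widths, and state dictionary are illustrative, not the book's implementation.

    def record_type(field_decls, state, env_stack, offset_stack):
        env_stack.append(state["top"]); state["top"] = {}        # Env.push(top); top = new Env()
        offset_stack.append(state["offset"]); state["offset"] = 0
        for name, type_, width in field_decls:                   # translate the declarations D
            state["top"][name] = (type_, state["offset"])
            state["offset"] += width
        t, w = ("record", state["top"]), state["offset"]         # T.type, T.width
        state["top"] = env_stack.pop()                           # restore enclosing table
        state["offset"] = offset_stack.pop()                     # restore enclosing offset
        return t, w

    state = {"top": {}, "offset": 0}
    t, w = record_type([("tag", "integer", 4), ("x", "float", 8), ("y", "float", 8)],
                       state, [], [])
    print(w)            # 20: storage needed for the fields of this record
    print(t[1]["x"])    # ('float', 4): type and relative address of field x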

6.3.7 Exercises for Section 6.3

Exercise 6.3.1: Determine the types and relative addresses for the identifiers in the following sequence of declarations:

    float x;
    record { float x; float y; } p;
    record { int tag; float x; float y; } q;

! Exercise 6.3.2: Extend the handling of field names in Fig. 6.18 to classes and single-inheritance class hierarchies.

a) Give an implementation of class Env that allows linked symbol tables, so that a subclass can either redefine a field name or refer directly to a field name in a superclass.

b) Give a translation scheme that allocates a contiguous data area for the fields in a class, including inherited fields. Inherited fields must maintain the relative addresses they were assigned in the layout for the superclass.

6.4 Translation of Expressions

The rest of this chapter explores issues that arise during the translation of expressions and statements. We begin in this section with the translation of expressions into three-address code. An expression with more than one operator, like a + b * c, will translate into instructions with at most one operator per instruction. An array reference A[i][j] will expand into a sequence of three-address instructions that calculate an address for the reference. We shall consider type checking of expressions in Section 6.5 and the use of boolean expressions to direct the flow of control through a program in Section 6.6.

6.4.1 Operations Within Expressions

The syntax-directed definition in Fig. 6.19 builds up the three-address code for an assignment statement S using attribute code for S and attributes addr and code for an expression E. Attributes S.code and E.code denote the three-address code for S and E, respectively. Attribute E.addr denotes the address that will hold the value of E.

    PRODUCTION             SEMANTIC RULES

    S  →  id = E ;         S.code = E.code || gen(top.get(id.lexeme) '=' E.addr)

    E  →  E1 + E2          E.addr = new Temp()
                           E.code = E1.code || E2.code || gen(E.addr '=' E1.addr '+' E2.addr)

       |  - E1             E.addr = new Temp()
                           E.code = E1.code || gen(E.addr '=' 'minus' E1.addr)

       |  ( E1 )           E.addr = E1.addr
                           E.code = E1.code

       |  id               E.addr = top.get(id.lexeme)
                           E.code = ''

Figure 6.19: Three-address code for expressions

Recall from Section 6.2.1 that an address can be a name, a constant, or a compiler-generated temporary.

Consider the last production, E → id, in the syntax-directed definition in Fig. 6.19. When an expression is a single identifier, say x, then x itself holds the value of the expression. The semantic rules for this production define E.addr to point to the symbol-table entry for this instance of id. Let top denote the current symbol table. Function top.get retrieves the entry when it is applied to the string representation id.lexeme of this instance of id. E.code is set to the empty string.

When E → ( E1 ), the translation of E is the same as that of the subexpression E1. Hence, E.addr equals E1.addr, and E.code equals E1.code.

The operators + and unary - in Fig. 6.19 are representative of the operators in a typical language. The semantic rules for E → E1 + E2 generate code to compute the value of E from the values of E1 and E2. Values are computed into newly generated temporary names. If E1 is computed into E1.addr and E2 into E2.addr, then E1 + E2 translates into t = E1.addr + E2.addr, where t is a new temporary name. E.addr is set to t. A sequence of distinct temporary names t1, t2, ... is created by successively executing new Temp().

For convenience, we use the notation gen(x '=' y '+' z) to represent the three-address instruction x = y + z. Expressions appearing in place of variables like x, y, and z are evaluated when passed to gen, and quoted strings like '=' are taken literally.⁵ Other three-address instructions will be built up similarly by applying gen to a combination of expressions and strings.

⁵ In syntax-directed definitions, gen builds an instruction and returns it. In translation schemes, gen builds an instruction and incrementally emits it by putting it into the stream of generated instructions.


When we translate the production E → E1 + E2, the semantic rules in Fig. 6.19 build up E.code by concatenating E1.code, E2.code, and an instruction that adds the values of E1 and E2. The instruction puts the result of the addition into a new temporary name for E, denoted by E.addr.

The translation of E → - E1 is similar. The rules create a new temporary for E and generate an instruction to perform the unary minus operation.

Finally, the production S → id = E ; generates instructions that assign the value of expression E to the identifier id. The semantic rule for this production uses function top.get to determine the address of the identifier represented by id, as in the rules for E → id. S.code consists of the instructions to compute the value of E into an address given by E.addr, followed by an assignment to the address top.get(id.lexeme) for this instance of id.

Example 6.11: The syntax-directed definition in Fig. 6.19 translates the assignment statement a = b + - c; into the three-address code sequence

    t1 = minus c
    t2 = b + t1
    a = t2

□

6.4.2 Incremental Translation

Code attributes can be long strings, so they are usually generated incrementally, as discussed in Section 5.5.2. Thus, instead of building up E.code as in Fig. 6.19, we can arrange to generate only the new three-address instructions, as in the translation scheme of Fig. 6.20. In the incremental approach, gen not only constructs a three-address instruction, it appends the instruction to the sequence of instructions generated so far. The sequence may either be retained in memory for further processing, or it may be output incrementally.

The translation scheme in Fig. 6.20 generates the same code as the syntax-directed definition in Fig. 6.19. With the incremental approach, the code attribute is not used, since there is a single sequence of instructions that is created by successive calls to gen. For example, the semantic rule for E → E1 + E2 in Fig. 6.20 simply calls gen to generate an add instruction; the instructions to compute E1 into E1.addr and E2 into E2.addr have already been generated.

The approach of Fig. 6.20 can also be used to build a syntax tree. The new semantic action for E → E1 + E2 creates a node by using a constructor, as in

    E  →  E1 + E2    { E.addr = new Node('+', E1.addr, E2.addr); }

Here, attribute addr represents the address of a node rather than a variable or constant.
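The incremental approach can be sketched as follows; gen appends to a global instruction list rather than concatenating code strings, and the tuple encoding of expressions, the temporary-name scheme, and translate itself are assumptions of the sketch, not the book's classes.

    code, _count = [], 0

    def new_temp():
        global _count
        _count += 1
        return "t" + str(_count)

    def gen(*parts):                       # append one three-address instruction
        code.append(" ".join(str(p) for p in parts))

    def translate(e):
        # e is an identifier (string), ('minus', operand), or (op, left, right)
        if isinstance(e, str):
            return e                       # E.addr = top.get(id.lexeme)
        if e[0] == "minus":
            a = translate(e[1])
            t = new_temp()
            gen(t, "=", "minus", a)
            return t
        a1, a2 = translate(e[1]), translate(e[2])
        t = new_temp()
        gen(t, "=", a1, e[0], a2)
        return t

    gen("a", "=", translate(("+", "b", ("minus", "c"))))
    print("\n".join(code))                 # t1 = minus c / t2 = b + t1 / a = t2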

    S  →  id = E ;    { gen(top.get(id.lexeme) '=' E.addr); }

    E  →  E1 + E2     { E.addr = new Temp();
                        gen(E.addr '=' E1.addr '+' E2.addr); }

       |  - E1        { E.addr = new Temp();
                        gen(E.addr '=' 'minus' E1.addr); }

       |  ( E1 )      { E.addr = E1.addr; }

       |  id          { E.addr = top.get(id.lexeme); }

Figure 6.20: Generating three-address code for expressions incrementally

6.4.3 Addressing Array Elements

Array elements can be accessed quickly if they are stored in a block of consecutive locations. In C and Java, array elements are numbered 0, 1, ..., n − 1, for an array with n elements. If the width of each array element is w, then the ith element of array A begins in location

    base + i × w        (6.2)

where base is the relative address of the storage allocated for the array. That is, base is the relative address of A[0].

The formula (6.2) generalizes to two or more dimensions. In two dimensions, we write A[i1][i2] in C and Java for element i2 in row i1. Let w1 be the width of a row and let w2 be the width of an element in a row. The relative address of A[i1][i2] can then be calculated by the formula

    base + i1 × w1 + i2 × w2        (6.3)

In k dimensions, the formula is

    base + i1 × w1 + i2 × w2 + ··· + ik × wk        (6.4)

where wj, for 1 ≤ j ≤ k, is the generalization of w1 and w2 in (6.3).

Alternatively, the relative address of an array reference can be calculated in terms of the numbers of elements nj along dimension j of the array and the width w = wk of a single element of the array. In two dimensions (i.e., k = 2 and w = w2), the location for A[i1][i2] is given by

    base + (i1 × n2 + i2) × w        (6.5)

In k dimensions, the following formula calculates the same address as (6.4):

    base + (( ··· ((i1 × n2 + i2) × n3 + i3) ··· ) × nk + ik) × w        (6.6)


More generally, array elements need not be numbered starting at 0. In a one-dimensional array, the array elements are numbered low, low + 1, ..., high and base is the relative address of A[low]. Formula (6.2) for the address of A[i] is replaced by:

    base + (i − low) × w        (6.7)

The expressions (6.2) and (6.7) can both be rewritten as i × w + c, where the subexpression c = base − low × w can be precalculated at compile time. Note that c = base when low is 0. We assume that c is saved in the symbol table entry for A, so the relative address of A[i] is obtained by simply adding i × w to c.

Compile-time precalculation can also be applied to address calculations for elements of multidimensional arrays; see Exercise 6.4.5. However, there is one situation where we cannot use compile-time precalculation: when the array's size is dynamic. If we do not know the values of low and high (or their generalizations in many dimensions) at compile time, then we cannot compute constants such as c. Then, formulas like (6.7) must be evaluated as they are written, when the program executes.

The above address calculations are based on row-major layout for arrays, which is used in C and Java. A two-dimensional array is normally stored in one of two forms, either row-major (row-by-row) or column-major (column-by-column). Figure 6.21 shows the layout of a 2 × 3 array A in (a) row-major form and (b) column-major form. Column-major form is used in the Fortran family of languages.

    (a) Row-major:     A[1,1]  A[1,2]  A[1,3]  A[2,1]  A[2,2]  A[2,3]
    (b) Column-major:  A[1,1]  A[2,1]  A[1,2]  A[2,2]  A[1,3]  A[2,3]

Figure 6.21: Layouts for a two-dimensional array

We can generalize row- or column-major form to many dimensions. The generalization of row-major form is to store the elements in such a way that, as we scan down a block of storage, the rightmost subscripts appear to vary fastest, like the numbers on an odometer. Column-major form generalizes to the opposite arrangement, with the leftmost subscripts varying fastest.
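A sketch of the address calculations above: given lower bounds, extents, and the element width, the offset is accumulated Horner-style as in (6.6), with the dimensions reversed for column-major layout. The function name and argument conventions are assumptions of the sketch.

    def element_address(base, indexes, lows, extents, width, row_major=True):
        rel = [i - low for i, low in zip(indexes, lows)]   # renumber from 0, as in (6.7)
        if not row_major:                                  # leftmost subscript varies fastest
            rel, extents = rel[::-1], extents[::-1]
        offset = 0
        for r, n in zip(rel, extents):                     # Horner-style evaluation of (6.6)
            offset = offset * n + r
        return base + offset * width

    # The 2 x 3 array of Fig. 6.21 with 4-byte elements and indexes starting at 1:
    print(element_address(0, [1, 3], [1, 1], [2, 3], 4))                    # 8  (row-major)
    print(element_address(0, [1, 3], [1, 1], [2, 3], 4, row_major=False))   # 16 (column-major)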


6.4.4 Translation of Array References

The chief problem in generating code for array references is to relate the address-calculation formulas in Section 6.4.3 to a grammar for array references. Let nonterminal L generate an array name followed by a sequence of index expressions:

    L  →  L [ E ]  |  id [ E ]

As in C and Java, assume that the lowest-numbered array element is 0. Let us calculate addresses based on widths, using the formula (6.4), rather than on numbers of elements, as in (6.6). The translation scheme in Fig. 6.22 generates three-address code for expressions with array references. It consists of the productions and semantic actions from Fig. 6.20, together with productions involving nonterminal L.

    S  →  id = E ;    { gen(top.get(id.lexeme) '=' E.addr); }

       |  L = E ;     { gen(L.array.base '[' L.addr ']' '=' E.addr); }

    E  →  E1 + E2     { E.addr = new Temp();
                        gen(E.addr '=' E1.addr '+' E2.addr); }

       |  id          { E.addr = top.get(id.lexeme); }

       |  L           { E.addr = new Temp();
                        gen(E.addr '=' L.array.base '[' L.addr ']'); }

    L  →  id [ E ]    { L.array = top.get(id.lexeme);
                        L.type = L.array.type.elem;
                        L.addr = new Temp();
                        gen(L.addr '=' E.addr '*' L.type.width); }

       |  L1 [ E ]    { L.array = L1.array;
                        L.type = L1.type.elem;
                        t = new Temp();
                        L.addr = new Temp();
                        gen(t '=' E.addr '*' L.type.width);
                        gen(L.addr '=' L1.addr '+' t); }

Figure 6.22: Semantic actions for array references

Nonterminal L has three synthesized attributes:

1. L.addr denotes a temporary that is used while computing the offset for the array reference by summing the terms ij × wj in (6.4).


2. L.array is a pointer to the symbol-table entry for the array name. The base address of the array, say, L.array.base is used to determine the actual l-value of an array reference after all the index expressions are analyzed.

3. L.type is the type of the subarray generated by L. For any type t, we assume that its width is given by t.width. We use types as attributes, rather than widths, since types are needed anyway for type checking. For any array type t, suppose that t.elem gives the element type.

The production S → id = E ; represents an assignment to a nonarray variable, which is handled as usual. The semantic action for S → L = E ; generates an indexed copy instruction to assign the value denoted by expression E to the location denoted by the array reference L. Recall that attribute L.array gives the symbol-table entry for the array. The array's base address, the address of its 0th element, is given by L.array.base. Attribute L.addr denotes the temporary that holds the offset for the array reference generated by L. The location for the array reference is therefore L.array.base[L.addr]. The generated instruction copies the r-value from address E.addr into the location for L.

Productions E → E1 + E2 and E → id are the same as before. The semantic action for the new production E → L generates code to copy the value from the location denoted by L into a new temporary. This location is L.array.base[L.addr], as discussed above for the production S → L = E ;. Again, attribute L.array gives the array name, and L.array.base gives its base address. Attribute L.addr denotes the temporary that holds the offset. The code for the array reference places the r-value at the location designated by the base and offset into a new temporary denoted by E.addr.

Example 6.12: Let a denote a 2 × 3 array of integers, and let c, i, and j all denote integers. Then, the type of a is array(2, array(3, integer)). Its width w is 24, assuming that the width of an integer is 4. The type of a[i] is array(3, integer), of width w1 = 12. The type of a[i][j] is integer.

An annotated parse tree for the expression c + a[i][j] is shown in Fig. 6.23. The expression is translated into the sequence of three-address instructions in Fig. 6.24. As usual, we have used the name of each identifier to refer to its symbol-table entry. □
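Following the strategy of Fig. 6.22, the offset for a reference like a[i][j] can be built up one index at a time, multiplying each index by the width of the subarray it selects; the type representation, the 4-byte integer width, and the helper names below are assumptions of the sketch.

    def width(t):
        return 4 if t == "integer" else t[1] * width(t[2])   # ("array", n, elem)

    def translate_reference(name, array_type, index_addrs):
        code, temps = [], ("t%d" % i for i in range(1, 100))
        t, offset = array_type, None
        for idx in index_addrs:
            t = t[2]                                          # L.type = L.type.elem
            part = next(temps)
            code.append("%s = %s * %d" % (part, idx, width(t)))
            if offset is None:
                offset = part
            else:
                total = next(temps)
                code.append("%s = %s + %s" % (total, offset, part))
                offset = total
        value = next(temps)
        code.append("%s = %s [ %s ]" % (value, name, offset))
        return value, code

    a = ("array", 2, ("array", 3, "integer"))
    addr, code = translate_reference("a", a, ["i", "j"])
    print("\n".join(code))    # t1 = i * 12 / t2 = j * 4 / t3 = t1 + t2 / t4 = a [ t3 ]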

6.4.5 Exercises for Section 6.4

Exercise 6.4.1: Add to the translation of Fig. 6.19 rules for the following productions:

a) E → E1 * E2.

b) E → + E1 (unary plus).

Exercise 6.4.2: Repeat Exercise 6.4.1 for the incremental translation of Fig. 6.20.

[Figure 6.23: Annotated parse tree for c + a[i][j]. At the root, E.addr = t5; its children are E.addr = c and E.addr = t4 (the array reference). The L node for a[i] has L.array = a, L.type = array(3, integer), L.addr = t1; the L node for a[i][j] has L.array = a, L.type = integer, L.addr = t3; the identifier a has type array(2, array(3, integer)); the index expressions have E.addr = i and E.addr = j.]

    t1 = i * 12
    t2 = j * 4
    t3 = t1 + t2
    t4 = a [ t3 ]
    t5 = c + t4

Figure 6.24: Three-address code for expression c + a[i][j]

Exercise 6.4.3: Use the translation of Fig. 6.22 to translate the following

assignments:

a) x = a[i] + b[j].

b) x = a[i][j] + b[i][j].

! c) x = a[b[i][j]][c[k]].

! Exercise 6.4.4: Revise the translation of Fig. 6.22 for array references of the Fortran style, that is, id[E1, E2, ..., En] for an n-dimensional array.



Exercise 6.4.5: Generalize formula (6.7) to multidimensional arrays, and indicate what values can be stored in the symbol table and used to compute offsets. Consider the following cases:

a) An array A of two dimensions, in row-major form. The first dimension has indexes running from l1 to h1, and the second dimension has indexes from l2 to h2. The width of a single array element is w.


Symbolic Type Widths

The intermediate code should be relatively independent of the target machine, so the optimizer does not have to change much if the code generator is replaced by one for a different machine. However, as we have described the calculation of type widths, an assumption about the sizes of basic types is built into the translation scheme. For instance, Example 6.12 assumes that each element of an integer array takes four bytes. Some intermediate codes, e.g., P-code for Pascal, leave it to the code generator to fill in the size of array elements, so the intermediate code is independent of the size of a machine word. We could have done the same in our translation scheme if we replaced 4 (as the width of an integer) by a symbolic constant.

b) The same as (a), but with the array stored in column-major form.

! c) An array A of k dimensions, stored in row-major form, with elements of size w. The jth dimension has indexes running from lj to hj.

! d) The same as (c) but with the array stored in column-major form.

Exercise 6.4.6: An integer array A[i, j] has index i ranging from 1 to 10 and index j ranging from 1 to 20. Integers take 4 bytes each. Suppose array A is stored starting at byte 0. Find the location of:

a) A[4, 5]    b) A[10, 8]    c) A[3, 17].

Exercise 6.4.7: Repeat Exercise 6.4.6 if A is stored in column-major order.

Exercise 6.4.8: A real array A[i, j, k] has index i ranging from 1 to 4, index j ranging from 0 to 4, and index k ranging from 5 to 10. Reals take 8 bytes each. Suppose array A is stored starting at byte 0. Find the location of:

a) A[3, 4, 5]    b) A[1, 2, 7]    c) A[4, 3, 9].

Exercise 6.4.9: Repeat Exercise 6.4.8 if A is stored in column-major order.

6.5 Type Checking

To do type checking a compiler needs to assign a type expression to each component of the source program. The compiler must then determine that these type expressions conform to a collection of logical rules that is called the type system for the source language.

Type checking has the potential for catching errors in programs. In principle, any check can be done dynamically, if the target code carries the type of an element along with the value of the element.


A sound type system eliminates the need for dynamic checking for type errors, because it allows us to determine statically that these errors cannot occur when the target program runs. An implementation of a language is strongly typed if a compiler guarantees that the programs it accepts will run without type errors.

Besides their use for compiling, ideas from type checking have been used to improve the security of systems that allow software modules to be imported and executed. Java programs compile into machine-independent bytecodes that include detailed type information about the operations in the bytecodes. Imported code is checked before it is allowed to execute, to guard against both inadvertent errors and malicious misbehavior.

6.5.1 Rules for Type Checking

Type checking can take on two forms: synthesis and inference. Type synthesis builds up the type of an expression from the types of its subexpressions. It requires names to be declared before they are used. The type of E1 + E2 is defined in terms of the types of E1 and E2. A typical rule for type synthesis has the form

    if f has type s → t and x has type s,
    then expression f(x) has type t        (6.8)

Here, f and x denote expressions, and s → t denotes a function from s to t. This rule for functions with one argument carries over to functions with several arguments. The rule (6.8) can be adapted for E1 + E2 by viewing it as a function application add(E1, E2).⁶

Type inference determines the type of a language construct from the way it is used. Looking ahead to the examples in Section 6.5.4, let null be a function that tests whether a list is empty. Then, from the usage null(x), we can tell that x must be a list. The type of the elements of x is not known; all we know is that x must be a list of elements of some type that is presently unknown.

Variables representing type expressions allow us to talk about unknown types. We shall use Greek letters α, β, ... for type variables in type expressions. A typical rule for type inference has the form

    if f(x) is an expression,
    then for some α and β, f has type α → β and x has type α        (6.9)

Type inference is needed for languages like ML, which check types, but do not require names to be declared.

⁶ We shall use the term "synthesis" even if some context information is used to determine types. With overloaded functions, where the same name is given to more than one function, the context of E1 + E2 may also need to be considered in some languages.


In this section, we consider type checking of expressions. The rules for checking statements are similar to those for expressions. For example, we treat the conditional statement "if (E) S;" as if it were the application of a function if to E and S. Let the special type void denote the absence of a value. Then function if expects to be applied to a boolean and a void; the result of the application is a void.

6.5.2 Type Conversions

Consider expressions like x + i, where x is of type float and i is of type integer. Since the representation of integers and floating-point numbers is different within a computer and different machine instructions are used for operations on integers and floats, the compiler may need to convert one of the operands of + to ensure that both operands are of the same type when the addition occurs.

Suppose that integers are converted to floats when necessary, using a unary operator (float). For example, the integer 2 is converted to a float in the code for the expression 2 * 3.14:

    t1 = (float) 2
    t2 = t1 * 3.14

We can extend such examples to consider integer and float versions of the operators; for example, int * for integer operands and float * for floats.

Type synthesis will be illustrated by extending the scheme in Section 6.4.2 for translating expressions. We introduce another attribute E.type, whose value is either integer or float. The rule associated with E → E1 + E2 builds on the pseudocode

    if ( E1.type = integer and E2.type = integer )  E.type = integer;
    else if ( E1.type = float and E2.type = integer )  E.type = float;
    ...

As the number of types subject to conversion increases, the number of cases increases rapidly. Therefore, with large numbers of types, careful organization of the semantic actions becomes important.

Type conversion rules vary from language to language. The rules for Java in Fig. 6.25 distinguish between widening conversions, which are intended to preserve information, and narrowing conversions, which can lose information. The widening rules are given by the hierarchy in Fig. 6.25(a): any type lower in the hierarchy can be widened to a higher type. Thus, a char can be widened to an int or to a float, but a char cannot be widened to a short. The narrowing rules are illustrated by the graph in Fig. 6.25(b): a type s can be narrowed to a type t if there is a path from s to t. Note that char, short, and byte are pairwise convertible to each other.

Conversion from one type to another is said to be implicit if it is done automatically by the compiler.

[Figure 6.25: Conversions between primitive types in Java. (a) The widening hierarchy, from lowest to highest: byte, short, int, long, float, double, with char also widening to int. (b) The narrowing graph: double narrows to float, float to long, long to int, and int to char, short, and byte, which are pairwise convertible to each other.]

Implicit type conversions, also called coercions, are limited in many languages to widening conversions. Conversion is said to be explicit if the programmer must write something to cause the conversion. Explicit conversions are also called casts.

The semantic action for checking E → E1 + E2 uses two functions:

1. max(t1, t2) takes two types t1 and t2 and returns the maximum (or least upper bound) of the two types in the widening hierarchy. It declares an error if either t1 or t2 is not in the hierarchy; e.g., if either type is an array or a pointer type.

2. widen(a, t, w) generates type conversions if needed to widen an address a of type t into a value of type w. It returns a itself if t and w are the same type. Otherwise, it generates an instruction to do the conversion and place the result in a temporary t, which is returned as the result. Pseudocode for widen, assuming that the only types are integer and float, appears in Fig. 6.26.

    Addr widen(Addr a, Type t, Type w) {
        if ( t = w ) return a;
        else if ( t = integer and w = float ) {
            temp = new Temp();
            gen(temp '=' '(float)' a);
            return temp;
        }
        else error;
    }

Figure 6.26: Pseudocode for function widen


The semantic action for E → E1 + E2 in Fig. 6.27 illustrates how type conversions can be added to the scheme in Fig. 6.20 for translating expressions. In the semantic action, temporary variable a1 is either E1.addr, if the type of E1 does not need to be converted to the type of E, or a new temporary variable returned by widen if this conversion is necessary. Similarly, a2 is either E2.addr or a new temporary holding the type-converted value of E2. Neither conversion is needed if both types are integer or both are float. In general, however, we could find that the only way to add values of two different types is to convert them both to a third type.

    E  →  E1 + E2    { E.type = max(E1.type, E2.type);
                       a1 = widen(E1.addr, E1.type, E.type);
                       a2 = widen(E2.addr, E2.type, E.type);
                       E.addr = new Temp();
                       gen(E.addr '=' a1 '+' a2); }

Figure 6.27: Introducing type conversions into expression evaluation
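The two helpers can be sketched as follows for just the types integer and float, as in Fig. 6.26; the instruction strings and the temporary-name generator are illustrative, not the book's gen.

    WIDENING = ["integer", "float"]              # lower to higher in the hierarchy

    def max_type(t1, t2):
        if t1 not in WIDENING or t2 not in WIDENING:
            raise TypeError("no common type for %s and %s" % (t1, t2))
        return max(t1, t2, key=WIDENING.index)

    def widen(code, temps, addr, t, w):
        if t == w:
            return addr                          # no conversion needed
        if t == "integer" and w == "float":
            temp = next(temps)
            code.append("%s = (float) %s" % (temp, addr))
            return temp
        raise TypeError("cannot widen %s to %s" % (t, w))

    code, temps = [], ("t%d" % i for i in range(1, 100))
    w = max_type("integer", "float")             # float
    a1 = widen(code, temps, "2", "integer", w)   # emits t1 = (float) 2
    a2 = widen(code, temps, "3.14", "float", w)  # already a float
    code.append("%s = %s * %s" % (next(temps), a1, a2))
    print("\n".join(code))                       # t1 = (float) 2 / t2 = t1 * 3.14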

6.5.3 Overloading of Functions and Operators

An overloaded symbol has different meanings depending on its context. Overloading is resolved when a unique meaning is determined for each occurrence of a name. In this section, we restrict attention to overloading that can be resolved by looking only at the arguments of a function, as in Java.

Example 6.13: The + operator in Java denotes either string concatenation or addition, depending on the types of its operands. User-defined functions can be overloaded as well, as in

    void err() { ... }
    void err(String s) { ... }

Note that we can choose between these two versions of a function err by looking at their arguments. □

The following is a type-synthesis rule for overloaded functions:

    if f can have type si → ti, for 1 ≤ i ≤ n, where si ≠ sj for i ≠ j
    and x has type sk, for some 1 ≤ k ≤ n
    then expression f(x) has type tk        (6.10)

The value-number method of Section 6.1.2 can be applied to type expressions to resolve overloading based on argument types efficiently. In a DAG representing a type expression, we assign an integer index, called a value number, to each node. Using Algorithm 6.3, we construct a signature for a node,


consisting of its label and the value numbers of its children, in order from left to right. The signature for a function consists of the function name and the types of its arguments. The assumption that we can resolve overloading based on the types of arguments is equivalent to saying that we can resolve overloading based on signatures.

It is not always possible to resolve overloading by looking only at the arguments of a function. In Ada, instead of a single type, a subexpression standing alone may have a set of possible types for which the context must provide sufficient information to narrow the choice down to a single type (see Exercise 6.5.2).
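Rule (6.10) amounts to a lookup keyed on the argument type; here is a tiny sketch with an invented candidate table.

    def resolve(candidates, arg_type):
        # candidates: list of (parameter type, result type) pairs for one name f
        matches = [t for (s, t) in candidates if s == arg_type]
        if len(matches) != 1:
            raise TypeError("cannot resolve overload for argument type " + str(arg_type))
        return matches[0]

    f_types = [("integer", "integer"), ("float", "float"), ("string", "string")]
    print(resolve(f_types, "float"))   # float: the version with parameter type float is chosen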

6.5.4 Type Inference and Polymorphic Functions

Type inference is useful for a language like ML, which is strongly typed, but does not require names to be declared before they are used. Type inference ensures that names are used consistently.

The term "polymorphic" refers to any code fragment that can be executed with arguments of different types. In this section, we consider parametric polymorphism, where the polymorphism is characterized by parameters or type variables. The running example is the ML program in Fig. 6.28, which defines a function length. The type of length can be described as, "for any type α, length maps a list of elements of type α to an integer."

    fun length(x) = if null(x) then 0 else length(tl(x)) + 1;

Figure 6.28: ML program for the length of a list

Example 6.14: In Fig. 6.28, the keyword fun introduces a function definition; functions can be recursive. The program fragment defines function length with one parameter x. The body of the function consists of a conditional expression. The predefined function null tests whether a list is empty, and the predefined function tl (short for "tail") returns the remainder of a list after the first element is removed.

The function length determines the length or number of elements of a list x. All elements of a list must have the same type, but length can be applied to lists whose elements are of any one type. In the following expression, length is applied to two different types of lists (list elements are enclosed within "[" and "]"):

    length(["sun", "mon", "tue"]) + length([10, 9, 8, 7])        (6.11)

The list of strings has length 3 and the list of integers has length 4, so expression (6.11) evaluates to 7. □


Using the symbol ∀ (read as "for any type") and the type constructor list, the type of length can be written as

    ∀α. list(α) → integer        (6.12)

The ∀ symbol is the universal quantifier, and the type variable to which it is applied is said to be bound by it. Bound variables can be renamed at will, provided all occurrences of the variable are renamed. Thus, the type expression

    ∀β. list(β) → integer

is equivalent to (6.12). A type expression with a ∀ symbol in it will be referred to informally as a "polymorphic type."

Each time a polymorphic function is applied, its bound type variables can denote a different type. During type checking, at each use of a polymorphic type we replace the bound variables by fresh variables and remove the universal quantifiers.

The next example informally infers a type for length, implicitly using type inference rules like (6.9), which is repeated here:

    if f(x) is an expression,
    then for some α and β, f has type α → β and x has type α

Example 6.15: The abstract syntax tree in Fig. 6.29 represents the definition of length in Fig. 6.28. The root of the tree, labeled fun, represents the function definition. The remaining nonleaf nodes can be viewed as function applications. The node labeled + represents the application of the operator + to a pair of children. Similarly, the node labeled if represents the application of an operator if to a triple formed by its children (for type checking, it does not matter that either the then or the else part will be evaluated, but not both).

[Figure 6.29: Abstract syntax tree for the function definition in Fig. 6.28. The root, labeled fun, has children length, x, and an if node; the if node's children are apply(null, x), 0, and a + node whose children are apply(length, apply(tl, x)) and 1.]

From the body of function length, we can infer its type. Consider the children of the node labeled if, from left to right. Since null expects to be applied to lists, x must be a list. Let us use variable α as a placeholder for the type of the list elements; that is, x has type "list of α."


Substitutions, Instances, and Unification

If t is a type expression and S is a substitution (a mapping from type variables to type expressions), then we write S(t) for the result of consistently replacing all occurrences of each type variable α in t by S(α). S(t) is called an instance of t. For example, list(integer) is an instance of list(α), since it is the result of substituting integer for α in list(α). Note, however, that integer → float is not an instance of α → α, since a substitution must replace all occurrences of α by the same type expression.

Substitution S is a unifier of type expressions t1 and t2 if S(t1) = S(t2). S is the most general unifier of t1 and t2 if for any other unifier of t1 and t2, say S', it is the case that for any t, S'(t) is an instance of S(t). In words, S' imposes more constraints on t than S does.

If null(x) is true, then length(x) is 0. Thus, the type of length must be "function from list of α to integer." This inferred type is consistent with the usage of length in the else part, length(tl(x)) + 1. □

Since variables can appear in type expressions, we have to re-examine the notion of equivalence of types. Suppose E1 of type s → s' is applied to E2 of type t. Instead of simply determining the equality of s and t, we must "unify" them. Informally, we determine whether s and t can be made structurally equivalent by replacing the type variables in s and t by type expressions.

A substitution is a mapping from type variables to type expressions. We write S(t) for the result of applying the substitution S to the variables in type expression t; see the box on "Substitutions, Instances, and Unification." Two type expressions t1 and t2 unify if there exists some substitution S such that S(t1) = S(t2). In practice, we are interested in the most general unifier, which is a substitution that imposes the fewest constraints on the variables in the expressions. See Section 6.5.5 for a unification algorithm.

Algorithm 6.16: Type inference for polymorphic functions.

INPUT: A program consisting of a sequence of function definitions followed by an expression to be evaluated. An expression is made up of function applications and names, where names can have predefined polymorphic types.

OUTPUT: Inferred types for the names in the program.

METHOD: For simplicity, we shall deal with unary functions only. The type of a function f(x1, x2) with two parameters can be represented by a type expression s1 × s2 → t, where s1 and s2 are the types of x1 and x2, respectively, and t is the type of the result f(x1, x2). An expression f(a, b) can be checked by matching the type of a with s1 and the type of b with s2.


Check the function definitions and the expression in the input sequence. Use the inferred type of a function if it is subsequently used in an expression.

• For a function definition fun id1(id2) = E, create fresh type variables α and β. Associate the type α → β with the function id1, and the type α with the parameter id2. Then, infer a type for expression E. Suppose α denotes type s and β denotes type t after type inference for E. The inferred type of function id1 is s → t. Bind any type variables that remain unconstrained in s → t by ∀ quantifiers.

• For a function application E1(E2), infer types for E1 and E2. Since E1 is used as a function, its type must have the form s → s'. (Technically, the type of E1 must unify with β → γ, where β and γ are new type variables.) Let t be the inferred type of E2. Unify s and t. If unification fails, the expression has a type error. Otherwise, the inferred type of E1(E2) is s'.

• For each occurrence of a polymorphic function, replace the bound variables in its type by distinct fresh variables and remove the ∀ quantifiers. The resulting type expression is the inferred type of this occurrence.

• For a name that is encountered for the first time, introduce a fresh variable for its type.

□

Example 6.17: In Fig. 6.30, we infer a type for function length. The root of the syntax tree in Fig. 6.29 is for a function definition, so we introduce variables β and γ, associate the type β → γ with function length, and the type β with x; see lines 1-2 of Fig. 6.30.

At the right child of the root, we view if as a polymorphic function that is applied to a triple, consisting of a boolean and two expressions that represent the then and else parts. Its type is ∀α. boolean × α × α → α. Each application of a polymorphic function can be to a different type, so we make up a fresh variable αi (where i is from "if") and remove the ∀; see line 3 of Fig. 6.30. The type of the left child of if must unify with boolean, and the types of its other two children must unify with αi.

The predefined function null has type ∀α. list(α) → boolean. We use a fresh type variable αn (where n is for "null") in place of the bound variable α; see line 4. From the application of null to x, we infer that the type β of x must match list(αn); see line 5. At the first child of if, the type boolean for null(x) matches the type expected by if. At the second child, the type αi unifies with integer; see line 6.

Now, consider the subexpression length(tl(x)) + 1. We make up a fresh variable αt (where t is for "tail") for the bound variable α in the type of tl; see line 8. From the application tl(x), we infer list(αt) = β = list(αn); see line 9. Since length(tl(x)) is an operand of +, its type γ must unify with integer; see line 10. It follows that the type of length is list(αn) → integer.

    LINE   EXPRESSION            TYPE                              UNIFY
    1)     length                β → γ
    2)     x                     β
    3)     if                    boolean × αi × αi → αi
    4)     null                  list(αn) → boolean
    5)     null(x)               boolean                           list(αn) = β
    6)     0                     integer                           αi = integer
    7)     +                     integer × integer → integer
    8)     tl                    list(αt) → list(αt)
    9)     tl(x)                 list(αt)                          list(αt) = list(αn)
    10)    length(tl(x))         γ                                 γ = integer
    11)    1                     integer
    12)    length(tl(x)) + 1     integer
    13)    if( ··· )             integer

Figure 6.30: Inferring a type for the function length of Fig. 6.28

After the function definition is checked, the type variable αn remains in the type of length. Since no assumptions were made about αn, any type can be substituted for it when the function is used. We therefore make it a bound variable and write

    ∀αn. list(αn) → integer

for the type of length. □

6.5.5 An Algorithm for Unification

Informally, unification is the problem of determining whether two expressions s and t can be made identical by substituting expressions for the variables in s and t. Testing equality of expressions is a special case of unification; if s and t have constants but no variables, then s and t unify if and only if they are identical. The unification algorithm in this section extends to graphs with cycles, so it can be used to test structural equivalence of circular types.⁷

⁷ In some applications, it is an error to unify a variable with an expression containing that variable. Algorithm 6.19 permits such substitutions.

We shall implement a graph-theoretic formulation of unification, where types are represented by graphs. Type variables are represented by leaves and type constructors are represented by interior nodes. Nodes are grouped into equivalence classes; if two nodes are in the same equivalence class, then the type expressions they represent must unify. Thus, all interior nodes in the same class must be for the same type constructor, and their corresponding children must be equivalent.

Example 6.18: Consider the two type expressions


    ((α1 → α2) × list(α3)) → list(α2)
    ((α3 → α4) × list(α3)) → α5

The following substitution S is the most general unifier for these expressions:

    x      S(x)
    α1     α1
    α2     α2
    α3     α1
    α4     α2
    α5     list(α2)

This substitution maps the two type expressions to the following expression:

    ((α1 → α2) × list(α1)) → list(α2)

The two expressions are represented by the two nodes labeled →: 1 in Fig. 6.31. The integers at the nodes indicate the equivalence classes that the nodes belong to after the nodes numbered 1 are unified. □

[Figure 6.31: Equivalence classes after unification. The two type expressions drawn as a dag, with each node annotated by the number of its equivalence class after the two root nodes, both labeled →: 1, are unified.]

Algorithm 6.19: Unification of a pair of nodes in a type graph.

INPUT: A graph representing a type and a pair of nodes m and n to be unified.

OUTPUT: Boolean value true if the expressions represented by the nodes m and n unify; false, otherwise.

METHOD: A node is implemented by a record with fields for a binary operator and pointers to the left and right children. The sets of equivalent nodes are maintained using the set field. One node in each equivalence class is chosen to be the unique representative of the equivalence class by making its set field contain a null pointer. The set fields of the remaining nodes in the equivalence class will point (possibly indirectly through other nodes in the set) to the representative. Initially, each node n is in an equivalence class by itself, with n as its own representative node.

The unification algorithm, shown in Fig. 6.32, uses the following two operations on nodes:


    boolean unify(Node m, Node n) {
        s = find(m); t = find(n);
        if ( s = t ) return true;
        else if ( nodes s and t represent the same basic type ) return true;
        else if ( s is an op-node with children s1 and s2 and
                  t is an op-node with children t1 and t2 ) {
            union(s, t);
            return unify(s1, t1) and unify(s2, t2);
        }
        else if ( s or t represents a variable ) {
            union(s, t);
            return true;
        }
        else return false;
    }

Figure 6.32: Unification algorithm

• find(n) returns the representative node of the equivalence class currently containing node n.

• union(m, n) merges the equivalence classes containing nodes m and n. If one of the representatives for the equivalence classes of m and n is a nonvariable node, union makes that nonvariable node be the representative for the merged equivalence class; otherwise, union makes one or the other of the original representatives be the new representative. This asymmetry in the specification of union is important because a variable cannot be used as the representative for an equivalence class for an expression containing a type constructor or basic type. Otherwise, two inequivalent expressions may be unified through that variable.

The union operation on sets is implemented by simply changing the set field of the representative of one equivalence class so that it points to the representative of the other. To find the equivalence class that a node belongs to, we follow the set pointers of nodes until the representative (the node with a null pointer in the set field) is reached.

Note that the algorithm in Fig. 6.32 uses s = find(m) and t = find(n) rather than m and n, respectively. The representative nodes s and t are equal if m and n are in the same equivalence class. If s and t represent the same basic type, the call unify(m, n) returns true. If s and t are both interior nodes for a binary type constructor, we merge their equivalence classes on speculation and recursively check that their respective children are equivalent. By merging first, we decrease the number of equivalence classes before recursively checking the children, so the algorithm terminates.


The substitution of an expression for a variable is implemented by adding the leaf for the variable to the equivalence class containing the node for that expression. Suppose either m or n is a leaf for a variable. Suppose also that this leaf has been put into an equivalence class with a node representing an expression with a type constructor or a basic type. Then find will return a representative that reflects that type constructor or basic type, so that a variable cannot be unified with two different expressions. □

Example 6.20: Suppose that the two expressions in Example 6.18 are represented by the initial graph in Fig. 6.33, where each node is in its own equivalence class. When Algorithm 6.19 is applied to compute unify(1, 9), it notes that nodes 1 and 9 both represent the same operator. It therefore merges 1 and 9 into the same equivalence class and calls unify(2, 10) and unify(8, 14). The result of computing unify(1, 9) is the graph previously shown in Fig. 6.31. □

[Figure 6.33: Initial graph with each node in its own equivalence class. The two type expressions of Example 6.18 drawn as dags: nodes 1 and 9 are the root → nodes, with × nodes, → nodes, list nodes, and the leaves α1, α2, α3, α4, α5 numbered 2 through 14.]

If Algorithm 6.19 returns true, we can construct a substitution S that acts as the unifier, as follows. For each variable α, find(α) gives the node n that is the representative of the equivalence class of α. The expression represented by n is S(α). For example, in Fig. 6.31, we see that the representative for α3 is node 4, which represents α1. The representative for α5 is node 8, which represents list(α2). The resulting substitution S is as in Example 6.18.
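A compact sketch of Algorithm 6.19 using a dictionary-based union-find; type expressions are tuples (constructor, children...), basic types are plain strings, and variables are strings beginning with a quote, all of which are assumptions of the sketch rather than the book's node records.

    parent = {}

    def is_var(t):
        return isinstance(t, str) and t.startswith("'")

    def find(t):
        while parent.get(t, t) != t:
            t = parent[t]
        return t

    def union(s, t):
        # a variable must not represent a class containing a constructor or basic type
        if is_var(s):
            parent[s] = t
        else:
            parent[t] = s

    def unify(m, n):
        s, t = find(m), find(n)
        if s == t:
            return True
        if isinstance(s, tuple) and isinstance(t, tuple) and s[0] == t[0] and len(s) == len(t):
            union(s, t)                        # merge first, then check the children
            return all(unify(a, b) for a, b in zip(s[1:], t[1:]))
        if is_var(s) or is_var(t):
            union(s, t)
            return True
        return False

    # The two expressions of Example 6.18, with x and list written as constructors:
    e1 = ("->", ("x", ("->", "'a1", "'a2"), ("list", "'a3")), ("list", "'a2"))
    e2 = ("->", ("x", ("->", "'a3", "'a4"), ("list", "'a3")), "'a5")
    print(unify(e1, e2))       # True
    print(find("'a5"))         # ('list', "'a2"): the class of a5 now contains list(a2)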

6.5.6 Exercises for Section 6.5

Exercise 6.5.1: Assuming that function widen in Fig. 6.26 can handle any of the types in the hierarchy of Fig. 6.25(a), translate the expressions below. Assume that c and d are characters, s and t are short integers, i and j are integers, and x is a float.

a) x = s + c.

b) i = s + c.

c) x = (s + c) * (t + d).


Exercise 6.5.2: As in Ada, suppose that each expression must have a unique type, but that from a subexpression, by itself, all we can deduce is a set of possible types. That is, the application of function E1 to argument E2, represented by E → E1 ( E2 ), has the associated rule

    E.type = { t | for some s in E2.type, s → t is in E1.type }

Describe an SDD that determines a unique type for each subexpression by using an attribute type to synthesize a set of possible types bottom-up, and, once the unique type of the overall expression is determined, proceeds top-down to determine attribute unique for the type of each subexpression.

6.6 Control Flow

The translation of statements such as if-else-statements and while-statements is tied to the translation of boolean expressions. In programming languages, boolean expressions are often used to 1 . Alter the flow of control. Boolean expressions are used as conditional expressions in statements that alter the flow of control. The value of such boolean expressions is implicit in a position reached in a program. For example, in if (E ) S, the expression E must be true if statement S is reached. 2. Compute logical values. A boolean expression can represent true or false as values. Such boolean expressions can be evaluated in analogy to arith­ metic expressions using three-address instructions with logical operators. The intended use of boolean expressions is determined by its syntactic con­ text . For example, an expression following the keyword if is used to alter the flow of control, while an expression on the right side of an assignment is used to denote a logical value. Such syntactic contexts can be specified in a number of ways: we may use two different nonterminals, use inherited attributes, or set a flag during parsing. Alternatively we may build a syntax tree and invoke different procedures for the two different uses of boolean expressions. This section concentrates on the use of boolean expressions to alter the flow of control. For clarity, we introduce a new nonterminal B for this purpose. In Section 6.6.6, we consider how a compiler can allow boolean expressions to represent logical values. 6.6 . 1

6.6.1 Boolean Expressions

Boolean expressions are composed of the boolean operators (which we denote &&, ||, and !, using the C convention for the operators AND, OR, and NOT, respectively) applied to elements that are boolean variables or relational expressions. Relational expressions are of the form E1 rel E2, where E1 and


E2 are arithmetic expressions. In this section, we consider boolean expressions generated by the following grammar:

    B → B || B | B && B | ! B | ( B ) | E rel E | true | false

We use the attribute rel.op to indicate which of the six comparison operators <, <=, =, !=, >, or >= is represented by rel. As is customary, we assume that || and && are left-associative, and that || has lowest precedence, then &&, then !.

Given the expression B1 || B2, if we determine that B1 is true, then we can conclude that the entire expression is true without having to evaluate B2. Similarly, given B1 && B2, if B1 is false, then the entire expression is false. The semantic definition of the programming language determines whether all parts of a boolean expression must be evaluated. If the language definition permits (or requires) portions of a boolean expression to go unevaluated, then the compiler can optimize the evaluation of boolean expressions by computing only enough of an expression to determine its value. Thus, in an expression such as B1 || B2, neither B1 nor B2 is necessarily evaluated fully. If either B1 or B2 is an expression with side effects (e.g., it contains a function that changes a global variable), then an unexpected answer may be obtained.

6.6.2 Short-Circuit Code

Example 6.21 : The statement

    if ( x < 100 || x > 200 && x != y ) x = 0;

might be translated into the code of Fig. 6.34. In this translation, the boolean expression is true if control reaches label L2. If the expression is false, control goes immediately to L1, skipping L2 and the assignment x = 0. □

        if x < 100 goto L2
        ifFalse x > 200 goto L1
        ifFalse x != y goto L1
L2:     x = 0
L1:

Figure 6.34: Jumping code


6.6.3 Flow-of-Control Statements

We now consider the translation of boolean expressions into three-address code in the context of statements such as those generated by the following grammar:

    S → if ( B ) S1
    S → if ( B ) S1 else S2
    S → while ( B ) S1

In these productions, nonterminal B represents a boolean expression and nonterminal S represents a statement.

This grammar generalizes the running example of while expressions that we introduced in Example 5.19. As in that example, both B and S have a synthesized attribute code, which gives the translation into three-address instructions. For simplicity, we build up the translations B.code and S.code as strings, using syntax-directed definitions. The semantic rules defining the code attributes could be implemented instead by building up syntax trees and then emitting code during a tree traversal, or by any of the approaches outlined in Section 5.5.

The translation of if (B) S1 consists of B.code followed by S1.code, as illustrated in Fig. 6.35(a). Within B.code are jumps based on the value of B. If B is true, control flows to the first instruction of S1.code, and if B is false, control flows to the instruction immediately following S1.code.


Figure 6.35: Code for if-, if-else-, and while-statements

The labels for the jumps in B. code and S. code are managed using inherited attributes. With a boolean expression B, we associate two labels: B. true, the


label to which control flows if B is true, and B.false, the label to which control flows if B is false. With a statement S, we associate an inherited attribute S.next denoting a label for the instruction immediately after the code for S. In some cases, the instruction immediately following S.code is a jump to some label L. A jump to a jump to L from within S.code is avoided using S.next.

The syntax-directed definition in Figs. 6.36-6.37 produces three-address code for boolean expressions in the context of if-, if-else-, and while-statements.

PRODUCTION                      SEMANTIC RULES

P → S                           S.next = newlabel()
                                P.code = S.code || label(S.next)

S → assign                      S.code = assign.code

S → if ( B ) S1                 B.true = newlabel()
                                B.false = S1.next = S.next
                                S.code = B.code || label(B.true) || S1.code

S → if ( B ) S1 else S2         B.true = newlabel()
                                B.false = newlabel()
                                S1.next = S2.next = S.next
                                S.code = B.code || label(B.true) || S1.code
                                         || gen('goto' S.next)
                                         || label(B.false) || S2.code

S → while ( B ) S1              begin = newlabel()
                                B.true = newlabel()
                                B.false = S.next
                                S1.next = begin
                                S.code = label(begin) || B.code
                                         || label(B.true) || S1.code
                                         || gen('goto' begin)

S → S1 S2                       S1.next = newlabel()
                                S2.next = S.next
                                S.code = S1.code || label(S1.next) || S2.code

Figure 6.36: Syntax-directed definition for flow-of-control statements

We assume that newlabel() creates a new label each time it is called, and that label(L) attaches label L to the next three-address instruction to be generated.8

8 If implemented literally, the semantic rules will generate lots of labels and may attach more than one label to a three-address instruction. The backpatching approach of Section 6.7 creates labels only when they are needed. Alternatively, unnecessary labels can be eliminated during a subsequent optimization phase.


A program consists of a statement generated by P → S. The semantic rules associated with this production initialize S.next to a new label. P.code consists of S.code followed by the new label S.next. Token assign in the production S → assign is a placeholder for assignment statements. The translation of assignments is as discussed in Section 6.4; for this discussion of control flow, S.code is simply assign.code.

In translating S → if (B) S1, the semantic rules in Fig. 6.36 create a new label B.true and attach it to the first three-address instruction generated for the statement S1, as illustrated in Fig. 6.35(a). Thus, jumps to B.true within the code for B will go to the code for S1. Further, by setting B.false to S.next, we ensure that control will skip the code for S1 if B evaluates to false.

In translating the if-else-statement S → if (B) S1 else S2, the code for the boolean expression B has jumps out of it to the first instruction of the code for S1 if B is true, and to the first instruction of the code for S2 if B is false, as illustrated in Fig. 6.35(b). Further, control flows from both S1 and S2 to the three-address instruction immediately following the code for S; its label is given by the inherited attribute S.next. An explicit goto S.next appears after the code for S1 to skip over the code for S2. No goto is needed after S2, since S2.next is the same as S.next.

The code for S → while (B) S1 is formed from B.code and S1.code as shown in Fig. 6.35(c). We use a local variable begin to hold a new label attached to the first instruction for this while-statement, which is also the first instruction for B. We use a variable rather than an attribute, because begin is local to the semantic rules for this production. The inherited label S.next marks the instruction that control must flow to if B is false; hence, B.false is set to be S.next. A new label B.true is attached to the first instruction for S1; the code for B generates a jump to this label if B is true. After the code for S1 we place the instruction goto begin, which causes a jump back to the beginning of the code for the boolean expression. Note that S1.next is set to this label begin, so jumps from within S1.code can go directly to begin.

The code for S → S1 S2 consists of the code for S1 followed by the code for S2. The semantic rules manage the labels; the first instruction after the code for S1 is the beginning of the code for S2; and the instruction after the code for S2 is also the instruction after the code for S.

We discuss the translation of flow-of-control statements further in Section 6.7. There we shall see an alternative method, called "backpatching," which emits code for statements in one pass.

6.6.4 Control-Flow Translation of Boolean Expressions

The semantic rules for boolean expressions in Fig. 6.37 complement the semantic rules for statements in Fig. 6.36. As in the code layout of Fig. 6.35, a boolean expression B is translated into three-address instructions that evaluate B using


conditional and unconditional jumps to one of two labels: B.true if B is true, and B.false if B is false.

PRODUCTION            SEMANTIC RULES

B → B1 || B2          B1.true = B.true
                      B1.false = newlabel()
                      B2.true = B.true
                      B2.false = B.false
                      B.code = B1.code || label(B1.false) || B2.code

B → B1 && B2          B1.true = newlabel()
                      B1.false = B.false
                      B2.true = B.true
                      B2.false = B.false
                      B.code = B1.code || label(B1.true) || B2.code

B → ! B1              B1.true = B.false
                      B1.false = B.true
                      B.code = B1.code

B → E1 rel E2         B.code = E1.code || E2.code
                               || gen('if' E1.addr rel.op E2.addr 'goto' B.true)
                               || gen('goto' B.false)

B → true              B.code = gen('goto' B.true)

B → false             B.code = gen('goto' B.false)

Figure 6.37: Generating three-address code for booleans

The fourth production in Fig. 6.37, B → E1 rel E2, is translated directly into a comparison three-address instruction with jumps to the appropriate places. For instance, B of the form a < b translates into:

    if a < b goto B.true
    goto B.false

The remaining productions for B are translated as follows:

1. Suppose B is of the form B1 || B2. If B1 is true, then we immediately know that B itself is true, so B1.true is the same as B.true. If B1 is false, then B2 must be evaluated, so we make B1.false be the label of the first instruction in the code for B2. The true and false exits of B2 are the same as the true and false exits of B, respectively.

2. The translation of B1 && B2 is similar.

3. No code is needed for an expression B of the form ! B1: just interchange the true and false exits of B to get the true and false exits of B1.

4. The constants true and false translate into jumps to B.true and B.false, respectively.

Example 6.22 : Consider again the following statement from Example 6.21:

    if ( x < 100 || x > 200 && x != y ) x = 0;          (6.13)

Using the syntax-directed definitions in Figs. 6.36 and 6.37 we would obtain the code in Fig. 6.38.

        if x < 100 goto L2
        goto L3
L3:     if x > 200 goto L4
        goto L1
L4:     if x != y goto L2
        goto L1
L2:     x = 0
L1:

Figure 6.38: Control-flow translation of a simple if-statement

The statement (6.13) constitutes a program generated by P → S from Fig. 6.36. The semantic rules for the production generate a new label L1 for the instruction after the code for S. Statement S has the form if (B) S1, where S1 is x = 0;, so the rules in Fig. 6.36 generate a new label L2 and attach it to the first (and only, in this case) instruction in S1.code, which is x = 0.

Since || has lower precedence than &&, the boolean expression in (6.13) has the form B1 || B2, where B1 is x < 100. Following the rules in Fig. 6.37, B1.true is L2, the label of the assignment x = 0;. B1.false is a new label L3, attached to the first instruction in the code for B2.

Note that the code generated is not optimal, in that the translation has three more instructions (goto's) than the code in Example 6.21. The instruction goto L3 is redundant, since L3 is the label of the very next instruction. The two goto L1 instructions can be eliminated by using ifFalse instead of if instructions, as in Example 6.21. □

6.6.5 Avoiding Redundant Gotos

In Example 6.22, the comparison x > 200 translates into the code fragment:

        if x > 200 goto L4
        goto L1

Instead, consider the instruction:

        ifFalse x > 200 goto L1

This ifFalse instruction takes advantage of the natural flow from one instruction to the next in sequence, so control simply "falls through" to label L4 if x > 200 is true, thereby avoiding a jump.

In the code layouts for if- and while-statements in Fig. 6.35, the code for statement S1 immediately follows the code for the boolean expression B. By using a special label fall (i.e., "don't generate any jump"), we can adapt the semantic rules in Figs. 6.36 and 6.37 to allow control to fall through from the code for B to the code for S1. The new rules for S → if (B) S1 in Fig. 6.36 set B.true to fall:

    B.true = fall
    B.false = S1.next = S.next
    S.code = B.code || S1.code

Similarly, the rules for if-else- and while-statements also set B.true to fall.

We now adapt the semantic rules for boolean expressions to allow control to fall through whenever possible. The new rules for B → E1 rel E2 in Fig. 6.39 generate two instructions, as in Fig. 6.37, if both B.true and B.false are explicit labels; that is, neither equals fall. Otherwise, if B.true is an explicit label, then B.false must be fall, so they generate an if instruction that lets control fall through if the condition is false. Conversely, if B.false is an explicit label, then they generate an ifFalse instruction. In the remaining case, both B.true and B.false are fall, so no jump is generated.9

In the new rules for B → B1 || B2 in Fig. 6.40, note that the meaning of label fall for B is different from its meaning for B1. Suppose B.true is fall; i.e., control falls through B, if B evaluates to true. Although B evaluates to true if B1 does, B1.true must ensure that control jumps over the code for B2 to get to the next instruction after B. On the other hand, if B1 evaluates to false, the truth-value of B is determined by the value of B2, so the rules in Fig. 6.40 ensure that B1.false corresponds to control falling through from B1 to the code for B2.

The semantic rules for B → B1 && B2 are similar to those in Fig. 6.40. We leave them as an exercise.

9 In C and Java, expressions may contain assignments within them, so code must be generated for the subexpressions E1 and E2, even if both B.true and B.false are fall. If desired, dead code can be eliminated during an optimization phase.

test = E1.addr rel.op E2.addr

s = if B.true ≠ fall and B.false ≠ fall
        then gen('if' test 'goto' B.true) || gen('goto' B.false)
    else if B.true ≠ fall then gen('if' test 'goto' B.true)
    else if B.false ≠ fall then gen('ifFalse' test 'goto' B.false)
    else ''

B.code = E1.code || E2.code || s

Figure 6.39: Semantic rules for B → E1 rel E2

B1.true  = if B.true ≠ fall then B.true else newlabel()
B1.false = fall
B2.true  = B.true
B2.false = B.false
B.code   = if B.true ≠ fall then B1.code || B2.code
           else B1.code || B2.code || label(B1.true)

Figure 6.40: Semantic rules for B → B1 || B2

Example 6.23 : With the new rules using the special label fall, the program (6.13) from Example 6.21,

    if ( x < 100 || x > 200 && x != y ) x = 0;

translates into the code of Fig. 6.41.

        if x < 100 goto L2
        ifFalse x > 200 goto L1
        ifFalse x != y goto L1
L2:     x = 0
L1:

Figure 6.41: If-statement translated using the fall-through technique

As in Example 6.22, the rules for P → S create label L1. The difference from Example 6.22 is that the inherited attribute B.true is fall when the semantic rules for B → B1 || B2 are applied (B.false is L1). The rules in Fig. 6.40 create a new label L2 to allow a jump over the code for B2 if B1 evaluates to true. Thus, B1.true is L2 and B1.false is fall, since B2 must be evaluated if B1 is false.

The production B → E1 rel E2 that generates x < 100 is therefore reached with B.true = L2 and B.false = fall. With these inherited labels, the rules in Fig. 6.39 therefore generate a single instruction if x < 100 goto L2. □


6.6.6 Boolean Values and Jumping Code

The focus in this section has been on the use of boolean expressions to alter the flow of control in statements. A boolean expression may also be evaluated for its value, as in assignment statements such as x = true; or x = a < b. Such an assignment can be handled by computing the value of the boolean expression into a temporary t and then assigning t to x, as in Fig. 6.42:

        ifFalse a < b goto L1
        t = true
        goto L2
L1:     t = false
L2:     x = t

Figure 6.42: Translating a boolean assignment by computing the value of a temporary

6.6.7 Exercises for Section 6.6

Exercise 6.6.2 : Modern machines try to execute many instructions at the same time, including branching instructions. Thus, there is a severe cost if the machine speculatively follows one branch, when control actually goes another way (all the speculative work is thrown away). It is therefore desirable to minimize the number of branches. Notice that the implementation of a while-loop in Fig. 6.35(c) has two branches per iteration: one to enter the body from the condition B and the other to jump back to the code for B. As a result, it is usually preferable to implement while (B) S as if it were if (B) { repeat S until !(B) }. Show what the code layout looks like for this translation, and revise the rule for while-loops in Fig. 6.36.

! Exercise 6.6.3 : Suppose that there were an "exclusive-or" operator (true if

and only if exactly one of its two arguments is true ) in C. Write the rule for this operator in the style of Fig. 6.37.

Exercise 6.6.4 : Translate the following expressions using the goto-avoiding

translation scheme of Section 6.6.5:

a) if ( a==b && c==d || e==f ) x = 1;

b) if ( a==b || c==d || e==f ) x = 1;

c) if ( a==b && c==d && e==f ) x = 1;

Exercise 6.6.5 : Give a translation scheme based on the syntax-directed definition in Figs. 6.36 and 6.37.

Exercise 6.6.6 : Adapt the semantic rules in Figs. 6.36 and 6.37 to allow control to fall through, using rules like the ones in Figs. 6.39 and 6.40.

! Exercise 6.6.7 : The semantic rules for statements in Exercise 6.6.6 generate

unnecessary labels. Modify the rules for statements in Fig. 6.36 to create labels as needed, using a special label deferred to mean that a label has not yet been created. Your rules must generate code similar to that in Example 6.21.


!! Exercise 6.6.8 : Section 6.6.5 talks about using fall-through code to minimize the number of jumps in the generated intermediate code. However, it does not take advantage of the option to replace a condition by its complement, e.g., replace if a < b goto L1; goto L2 by if b >= a goto L2; goto L1. Develop an SDD that does take advantage of this option when needed.

6.7 Backpatching

A key problem when generating code for boolean expressions and flow-of-control statements is that of matching a jump instruction with the target of the jump. For example, the translation of the boolean expression B in if ( B ) S contains a jump, for when B is false, to the instruction following the code for S. In a one-pass translation, B must be translated before S is examined. What then is the target of the goto that jumps over the code for S? In Section 6.6 we addressed this problem by passing labels as inherited attributes to where the relevant jump instructions were generated. But a separate pass is then needed to bind labels to addresses. This section takes a complementary approach, called backpatching, in which lists of jumps are passed as synthesized attributes. Specifically, when a jump is generated, the target of the jump is temporarily left unspecified. Each such jump is put on a list of jumps whose labels are to be filled in when the proper label can be determined. All of the jumps on a list have the same target label.

6.7.1 One-Pass Code Generation Using Backpatching

Backpatching can be used to generate code for boolean expressions and flow-of-control statements in one pass. The translations we generate will be of the same form as those in Section 6.6, except for how we manage labels.

In this section, synthesized attributes truelist and falselist of nonterminal B are used to manage labels in jumping code for boolean expressions. In particular, B.truelist will be a list of jump or conditional jump instructions into which we must insert the label to which control goes if B is true. B.falselist likewise is the list of instructions that eventually get the label to which control goes when B is false. As code is generated for B, jumps to the true and false exits are left incomplete, with the label field unfilled. These incomplete jumps are placed on lists pointed to by B.truelist and B.falselist, as appropriate. Similarly, a statement S has a synthesized attribute S.nextlist, denoting a list of jumps to the instruction immediately following the code for S.

For specificity, we generate instructions into an instruction array, and labels will be indices into this array. To manipulate lists of jumps, we use three functions:

1. makelist(i) creates a new list containing only i, an index into the array of instructions; makelist returns a pointer to the newly created list.


2. merge(p1, p2) concatenates the lists pointed to by p1 and p2, and returns a pointer to the concatenated list.

3. backpatch(p, i) inserts i as the target label for each of the instructions on the list pointed to by p (see the sketch after this list).
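The following is a minimal sketch (not the book's implementation) of these three operations, assuming the instructions are kept in a growable array and a "list" is simply a list of indices of jump instructions whose targets are still written as _:

    import java.util.ArrayList;
    import java.util.List;

    class JumpLists {
        static List<String> instr = new ArrayList<>();       // generated three-address code

        static List<Integer> makelist(int i) {                // a list containing only i
            List<Integer> l = new ArrayList<>();
            l.add(i);
            return l;
        }

        static List<Integer> merge(List<Integer> p1, List<Integer> p2) {
            List<Integer> l = new ArrayList<>(p1);            // concatenation of p1 and p2
            l.addAll(p2);
            return l;
        }

        static void backpatch(List<Integer> p, int target) {  // fill in each blank target
            for (int i : p)
                instr.set(i, instr.get(i).replace("_", String.valueOf(target)));
        }
    }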

6.7.2 Backpatching for Boolean Expressions

We now construct a translation scheme suitable for generating code for boolean expressions during bottom-up parsing. A marker nonterminal M in the grammar causes a semantic action to pick up, at appropriate times, the index of the next instruction to be generated. The grammar is as follows:

    B → B1 || M B2 | B1 && M B2 | ! B1 | ( B1 ) | E1 rel E2 | true | false
    M → ε

The translation scheme is in Fig. 6.43.

1) B → B1 || M B2      { backpatch(B1.falselist, M.instr);
                          B.truelist = merge(B1.truelist, B2.truelist);
                          B.falselist = B2.falselist; }

2) B → B1 && M B2      { backpatch(B1.truelist, M.instr);
                          B.truelist = B2.truelist;
                          B.falselist = merge(B1.falselist, B2.falselist); }

3) B → ! B1            { B.truelist = B1.falselist;
                          B.falselist = B1.truelist; }

4) B → ( B1 )          { B.truelist = B1.truelist;
                          B.falselist = B1.falselist; }

5) B → E1 rel E2       { B.truelist = makelist(nextinstr);
                          B.falselist = makelist(nextinstr + 1);
                          emit('if' E1.addr rel.op E2.addr 'goto _');
                          emit('goto _'); }

6) B → true            { B.truelist = makelist(nextinstr);
                          emit('goto _'); }

7) B → false           { B.falselist = makelist(nextinstr);
                          emit('goto _'); }

8) M → ε               { M.instr = nextinstr; }

Figure 6.43: Translation scheme for boolean expressions

Consider semantic action (1) for the production B → B1 || M B2. If B1 is true, then B is also true, so the jumps on B1.truelist become part of B.truelist. If B1 is false, however, we must next test B2, so the target for the jumps


B1.falselist must be the beginning of the code generated for B2. This target is obtained using the marker nonterminal M. That nonterminal produces, as a synthesized attribute M.instr, the index of the next instruction, just before B2 code starts being generated. To obtain that instruction index, we associate with the production M → ε the semantic action

    { M.instr = nextinstr; }

The variable nextinstr holds the index of the next instruction to follow. This value will be backpatched onto the B1.falselist (i.e., each instruction on the list B1.falselist will receive M.instr as its target label) when we have seen the remainder of the production B → B1 || M B2.

Semantic action (2) for B → B1 && M B2 is similar to (1). Action (3) for B → ! B1 swaps the true and false lists. Action (4) ignores parentheses.

For simplicity, semantic action (5) generates two instructions, a conditional goto and an unconditional one. Neither has its target filled in. These instructions are put on new lists, pointed to by B.truelist and B.falselist, respectively.
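As a concrete illustration, here is a minimal sketch of what actions (5) and (1) might look like when implemented directly; the class and method names are hypothetical, and a jump list is just a list of indices of instructions whose goto targets are written as _ until they are backpatched:

    import java.util.ArrayList;
    import java.util.List;

    class BoolActions {
        static List<String> instr = new ArrayList<>();           // generated instructions

        static class B {                                          // synthesized attributes of B
            List<Integer> truelist = new ArrayList<>();
            List<Integer> falselist = new ArrayList<>();
        }

        // Action (5): B -> E1 rel E2.  Emit a conditional and an unconditional jump,
        // both with blank targets, and record their indices on the two lists.
        static B relational(String addr1, String op, String addr2) {
            B b = new B();
            b.truelist.add(instr.size());                         // index of the 'if' below
            b.falselist.add(instr.size() + 1);                    // index of the 'goto' below
            instr.add("if " + addr1 + " " + op + " " + addr2 + " goto _");
            instr.add("goto _");
            return b;
        }

        // Action (1): B -> B1 || M B2, where mInstr plays the role of M.instr.
        static B or(B b1, int mInstr, B b2) {
            for (int i : b1.falselist)                            // backpatch: B1 false -> B2
                instr.set(i, instr.get(i).replace("_", String.valueOf(mInstr)));
            B b = new B();
            b.truelist.addAll(b1.truelist);                       // merge of the two truelists
            b.truelist.addAll(b2.truelist);
            b.falselist = b2.falselist;
            return b;
        }
    }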

= =

{lOO , l04 } {l0 3 , l0 5 }

�/ M\� I / � �oo /.1 M\� I

B.t B.!

=

=

{ lOa} { lOl}

B.t B.!

E

B.t B.! x

=

{l0 2 } {l0 3 }

>

200

=

=

=

{l04 } {l0 3, l0 5 }

B.t B.!

E

/ I \

x

Figure 6.44: Annotated parse tree for x

<

100

I I x >

=

{l0 4 } {l0 5 }

!=

Y

=

/ I \

200 && x ! = y

Example 6.24 : Consider again the expression

    x < 100 || x > 200 && x != y

An annotated parse tree is shown in Fig. 6.44; for readability, attributes truelist, falselist, and instr are represented by their initial letters. The actions are performed during a depth-first traversal of the tree. Since all actions appear at the ends of right sides, they can be performed in conjunction with reductions during a bottom-up parse. In response to the reduction of x < 100 to B by production (5), the two instructions

100:    if x < 100 goto _
101:    goto _

are generated. (We arbitrarily start instruction numbers at 100.) The marker nonterminal M in the production

    B → B1 || M B2

records the value of nextinstr, which at this time is 102. The reduction of x > 200 to B by production (5) generates the instructions

102:    if x > 200 goto _
103:    goto _

The subexpression x > 200 corresponds to B1 in the production

    B → B1 && M B2

The marker nonterminal M records the current value of nextinstr, which is now 104. Reducing x != y into B by production (5) generates

104:    if x != y goto _
105:    goto _

We now reduce by B → B1 && M B2. The corresponding semantic action calls backpatch(B1.truelist, M.instr) to bind the true exit of B1 to the first instruction of B2. Since B1.truelist is {102} and M.instr is 104, this call to backpatch fills in 104 in instruction 102. The six instructions generated so far are thus as shown in Fig. 6.45(a).

The semantic action associated with the final reduction by B → B1 || M B2 calls backpatch({101}, 102), which leaves the instructions as in Fig. 6.45(b).

The entire expression is true if and only if the gotos of instructions 100 or 104 are reached, and is false if and only if the gotos of instructions 103 or 105 are reached. These instructions will have their targets filled in later in the compilation, when it is seen what must be done depending on the truth or falsehood of the expression. □

6.7.3 Flow-of-Control Statements

We now use backpatching to translate flow-of-control statements in one pass. Consider statements generated by the following grammar:

    S → if ( B ) S | if ( B ) S else S | while ( B ) S | { L } | A ;
    L → L S | S

Here S denotes a statement, L a statement list, A an assignment-statement, and B a boolean expression. Note that there must be other productions, such as

100:    if x < 100 goto _
101:    goto _
102:    if x > 200 goto 104
103:    goto _
104:    if x != y goto _
105:    goto _

(a) After backpatching 104 into instruction 102.

100:    if x < 100 goto _
101:    goto 102
102:    if x > 200 goto 104
103:    goto _
104:    if x != y goto _
105:    goto _

(b) After backpatching 102 into instruction 101.

Figure 6.45: Steps in the backpatch process

those for assignment-statements. The productions given, however, are sufficient to illustrate the techniques used to translate flow-of-control statements.

The code layout for if-, if-else-, and while-statements is the same as in Section 6.6. We make the tacit assumption that the code sequence in the instruction array reflects the natural flow of control from one instruction to the next. If not, then explicit jumps must be inserted to implement the natural sequential flow of control.

The translation scheme in Fig. 6.46 maintains lists of jumps that are filled in when their targets are found. As in Fig. 6.43, boolean expressions generated by nonterminal B have two lists of jumps, B.truelist and B.falselist, corresponding to the true and false exits from the code for B, respectively. Statements generated by nonterminals S and L have a list of unfilled jumps, given by attribute nextlist, that must eventually be completed by backpatching. S.nextlist is a list of all conditional and unconditional jumps to the instruction following the code for statement S in execution order. L.nextlist is defined similarly.

Consider the semantic action (3) in Fig. 6.46. The code layout for production S → while ( B ) S1 is as in Fig. 6.35(c). The two occurrences of the marker nonterminal M in the production

    S → while M1 ( B ) M2 S1

record the instruction numbers of the beginning of the code for B and the beginning of the code for S1. The corresponding labels in Fig. 6.35(c) are begin and B.true, respectively.

1) S → if ( B ) M S1               { backpatch(B.truelist, M.instr);
                                      S.nextlist = merge(B.falselist, S1.nextlist); }

2) S → if ( B ) M1 S1 N else M2 S2 { backpatch(B.truelist, M1.instr);
                                      backpatch(B.falselist, M2.instr);
                                      temp = merge(S1.nextlist, N.nextlist);
                                      S.nextlist = merge(temp, S2.nextlist); }

3) S → while M1 ( B ) M2 S1        { backpatch(S1.nextlist, M1.instr);
                                      backpatch(B.truelist, M2.instr);
                                      S.nextlist = B.falselist;
                                      emit('goto' M1.instr); }

4) S → { L }                       { S.nextlist = L.nextlist; }

5) S → A ;                         { S.nextlist = null; }

6) M → ε                           { M.instr = nextinstr; }

7) N → ε                           { N.nextlist = makelist(nextinstr);
                                      emit('goto _'); }

8) L → L1 M S                      { backpatch(L1.nextlist, M.instr);
                                      L.nextlist = S.nextlist; }

9) L → S                           { L.nextlist = S.nextlist; }

Figure 6.46: Translation of statements

Again, the only production for M is M → ε. Action (6) in Fig. 6.46 sets attribute M.instr to the number of the next instruction. After the body S1 of the while-statement is executed, control flows to the beginning. Therefore, when we reduce while M1 ( B ) M2 S1 to S, we backpatch S1.nextlist to make all targets on that list be M1.instr. An explicit jump to the beginning of the code for B is appended after the code for S1 because control may also "fall out the bottom." B.truelist is backpatched to go to the beginning of S1 by making jumps on B.truelist go to M2.instr.

A more compelling argument for using S.nextlist and L.nextlist comes when code is generated for the conditional statement if ( B ) S1 else S2. If control "falls out the bottom" of S1, as when S1 is an assignment, we must include at the end of the code for S1 a jump over the code for S2. We use another marker nonterminal to generate this jump after S1. Let nonterminal N be this


marker with production N → ε. N has attribute N.nextlist, which will be a list consisting of the instruction number of the jump goto _ that is generated by the semantic action (7) for N.

Semantic action (2) in Fig. 6.46 deals with if-else-statements with the syntax

    S → if ( B ) M1 S1 N else M2 S2

We backpatch the jumps when B is true to the instruction M1.instr; the latter is the beginning of the code for S1. Similarly, we backpatch jumps when B is false to go to the beginning of the code for S2. The list S.nextlist includes all jumps out of S1 and S2, as well as the jump generated by N. (Variable temp is a temporary that is used only for merging lists.)

Semantic actions (8) and (9) handle sequences of statements. In

    L → L1 M S

the instruction following the code for L1 in order of execution is the beginning of S. Thus the L1.nextlist list is backpatched to the beginning of the code for S, which is given by M.instr. In L → S, L.nextlist is the same as S.nextlist.

Note that no new instructions are generated anywhere in these semantic rules, except for rules (3) and (7). All other code is generated by the semantic actions associated with assignment-statements and expressions. The flow of control causes the proper backpatching so that the assignments and boolean expression evaluations will connect properly.
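For instance, rule (3) for while-statements could be implemented along the following lines; this is only a sketch, the parameter names are hypothetical, and backpatch fills the blank target of each listed jump:

    import java.util.ArrayList;
    import java.util.List;

    class WhileAction {
        static List<String> instr = new ArrayList<>();   // generated instructions

        // Rule (3) of Fig. 6.46 for  S -> while M1 ( B ) M2 S1 : returns S.nextlist.
        static List<Integer> whileStmt(int m1Instr, List<Integer> bTruelist,
                                       List<Integer> bFalselist, int m2Instr,
                                       List<Integer> s1Nextlist) {
            backpatch(s1Nextlist, m1Instr);    // end of the body jumps back to the test
            backpatch(bTruelist, m2Instr);     // true exit of B enters the body
            instr.add("goto " + m1Instr);      // falling out of the body loops again
            return bFalselist;                 // where to go once B is false
        }

        static void backpatch(List<Integer> p, int target) {
            for (int i : p)
                instr.set(i, instr.get(i).replace("_", String.valueOf(target)));
        }
    }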

6.7.4 Break-, Continue-, and Goto-Statements

The most elementary programming language construct for changing the flow of control in a program is the goto-statement. In C, a statement like goto L sends control to the statement labeled L - there must be precisely one statement with label L in this scope. Goto-statements can be implemented by maintaining a list of unfilled jumps for each label and then backpatching the target when it is known.

Java does away with goto-statements. However, Java does permit disciplined jumps called break-statements, which send control out of an enclosing construct, and continue-statements, which trigger the next iteration of an enclosing loop. The following excerpt from a lexical analyzer illustrates simple break- and continue-statements:

1)  for ( ; ; readch() ) {
2)      if ( peek == ' ' || peek == '\t' ) continue;
3)      else if ( peek == '\n' ) line = line + 1;
4)      else break;
5)  }

Control jumps from the break-statement on line 4 to the next statement after the enclosing for loop. Control jumps from the continue-statement on line 2 to code to evaluate readch() and then to the if-statement on line 2.


If S is the enclosing construct, then a break-statement is a jump to the first instruction after the code for S. We can generate code for the break by (1) keeping track of the enclosing statement S, (2) generating an unfilled jump for the break-statement, and (3) putting this unfilled jump on S.nextlist, where nextlist is as discussed in Section 6.7.3.

In a two-pass front end that builds syntax trees, S.nextlist can be implemented as a field in the node for S. We can keep track of S by using the symbol table to map a special identifier break to the node for the enclosing statement S. This approach will also handle labeled break-statements in Java, since the symbol table can be used to map the label to the syntax-tree node for the enclosing construct.

Alternatively, instead of using the symbol table to access the node for S, we can put a pointer to S.nextlist in the symbol table. Now, when a break-statement is reached, we generate an unfilled jump, look up nextlist through the symbol table, and add the jump to the list, where it will be backpatched as discussed in Section 6.7.3.

Continue-statements can be handled in a manner analogous to the break-statement. The main difference between the two is that the target of the generated jump is different.
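A minimal sketch of the second approach, with a symbol table mapping the special name break to the enclosing statement's nextlist (all names here are hypothetical, and only one enclosing level is handled; a real front end would save and restore the entry for nested constructs):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class BreakTranslator {
        List<String> instr = new ArrayList<>();                  // generated instructions
        Map<String, List<Integer>> table = new HashMap<>();      // symbol table

        void enterLoop() {                                       // on entering construct S
            table.put("break", new ArrayList<>());               // S.nextlist, reachable via "break"
        }

        void translateBreak() {                                  // on seeing a break-statement
            List<Integer> nextlist = table.get("break");         // look up the enclosing S
            nextlist.add(instr.size());                          // remember this unfilled jump
            instr.add("goto _");                                 // target filled in later
        }

        void exitLoop(int target) {                              // once S.next becomes known
            for (int i : table.get("break"))
                instr.set(i, "goto " + target);                  // backpatch the list
        }
    }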

6.7.5 Exercises for Section 6.7

Exercise 6.7.1 : Using the translation of Fig. 6.43, translate each of the following expressions. Show the true and false lists for each subexpression. You may assume the address of the first instruction generated is 100.

a) a==b && ( c==d || e==f )

b) ( a==b || c==d ) || e==f

c) ( a==b && c==d ) && e==f

Exercise 6.7.2 : In Fig. 6.47(a) is the outline of a program, and Fig. 6.47(b) sketches the structure of the generated three-address code, using the backpatching translation of Fig. 6.46. Here, i1 through i8 are the labels of the generated instructions that begin each of the "Code" sections. When we implement this translation, we maintain, for each boolean expression E, two lists of places in the code for E, which we denote by E.true and E.false. The places on list E.true are those places where we eventually put the label of the statement to which control must flow whenever E is true; E.false similarly lists the places where we put the label that control flows to when E is found to be false. Also, we maintain for each statement S, a list of places where we must put the label to which control flows when S is finished. Give the value (one of i1 through i8) that eventually replaces each place on each of the following lists:

(a) E3.false (b) S2.next (c) E4.false (d) S1.next (e) E2.true

    while (E1) {
        if (E2)
            while (E3)
                S1;
        else {
            if (E4)
                S2;
            S3;
        }
    }

    (a)

    i1:  Code for E1
    i2:  Code for E2
    i3:  Code for E3
    i4:  Code for S1
    i5:  Code for E4
    i6:  Code for S2
    i7:  Code for S3
    i8:  ...

    (b)

Figure 6.47: Control-flow structure of program for Exercise 6.7.2

Exercise 6.7.3 : When performing the translation of Fig. 6.47 using the scheme of Fig. 6.46, we create lists S.next for each statement, starting with the assignment-statements S1, S2, and S3, and proceeding to progressively larger if-statements, if-else-statements, while-statements, and statement blocks. There are five constructed statements of this type in Fig. 6.47:

S4: while (E3) S1.
S5: if (E4) S2.
S6: The block consisting of S5 and S3.
S7: The statement if (E2) S4 else S6.
S8: The entire program.

For each of these constructed statements, there is a rule that allows us to construct Si.next in terms of other Sj.next lists, and the lists Ek.true and Ek.false for the expressions in the program. Give the rules for

(a) S4.next (b) S5.next (c) S6.next (d) S7.next (e) S8.next

6.8 Switch-Statements

The "switch" or "case" statement is available in a variety of languages. Our switch-statemerit syntax is shown in Fig. 6.48. There is a selector expression E, which is to be evaluated, followed by n constant values VI , V2 , . . . , Vn that the expression might take, perhaps including a default "value," which always matches the expression if no other value does.

    switch ( E ) {
        case V1: S1
        case V2: S2
        ...
        case Vn-1: Sn-1
        default: Sn
    }

Figure 6.48: Switch-statement syntax

6.8.1 Translation of Switch-Statements

The intended translation of a switch is code to:

1. Evaluate the expression E.

2. Find the value Vj in the list of cases that is the same as the value of the expression. Recall that the default value matches the expression if none of the values explicitly mentioned in cases does.

3. Execute the statement Sj associated with the value found.

Step (2) is an n-way branch, which can be implemented in one of several ways. If the number of cases is small, say 10 at most, then it is reasonable to use a sequence of conditional jumps, each of which tests for an individual value and transfers to the code for the corresponding statement.

A compact way to implement this sequence of conditional jumps is to create a table of pairs, each pair consisting of a value and a label for the corresponding statement's code. The value of the expression itself, paired with the label for the default statement, is placed at the end of the table at run time. A simple loop generated by the compiler compares the value of the expression with each value in the table, being assured that if no other match is found, the last (default) entry is sure to match.

If the number of values exceeds 10 or so, it is more efficient to construct a hash table for the values, with the labels of the various statements as entries. If no entry for the value possessed by the switch expression is found, a jump to the default statement is generated.

There is a common special case that can be implemented even more efficiently than by an n-way branch. If the values all lie in some small range, say min to max, and the number of different values is a reasonable fraction of max - min, then we can construct an array of max - min "buckets," where bucket j - min contains the label of the statement with value j; any bucket that would otherwise remain unfilled contains the default label. To perform the switch, evaluate the expression to obtain the value j; check that it is in the range min to max and transfer indirectly to the table entry at offset j - min. For example, if the expression is of type character, a table of,


say, 128 entries (depending on the character set) may be created and transferred through with no range testing.
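A minimal sketch of the bucket-array idea, with hypothetical names; each bucket holds the action for one value in the range, and empty buckets fall back to the default:

    import java.util.Map;

    class JumpTable {
        // Build a table indexed by (value - min); unfilled buckets get the default.
        static Runnable[] build(int min, int max, Map<Integer, Runnable> cases, Runnable dflt) {
            Runnable[] table = new Runnable[max - min + 1];
            for (int v = min; v <= max; v++)
                table[v - min] = cases.getOrDefault(v, dflt);
            return table;
        }

        // Dispatch on the selector value j with a single range check and one indexed transfer.
        static void dispatch(Runnable[] table, int min, int max, int j, Runnable dflt) {
            if (j < min || j > max) dflt.run();
            else table[j - min].run();
        }
    }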

6.8.2 Syntax-Directed Translation of Switch-Statements

The intermediate code in Fig. 6.49 is a convenient translation of the switch-statement in Fig. 6.48. The tests all appear at the end so that a simple code generator can recognize the multiway branch and generate efficient code for it, using the most appropriate implementation suggested at the beginning of this section.

            code to evaluate E into t
            goto test
L1:         code for S1
            goto next
L2:         code for S2
            goto next
            ...
Ln-1:       code for Sn-1
            goto next
Ln:         code for Sn
            goto next
test:       if t = V1 goto L1
            if t = V2 goto L2
            ...
            if t = Vn-1 goto Ln-1
            goto Ln
next:

Figure 6.49: Translation of a switch-statement

The more straightforward sequence shown in Fig. 6.50 would require the compiler to do extensive analysis to find the most efficient implementation. Note that it is inconvenient in a one-pass compiler to place the branching statements at the beginning, because the compiler could not then emit code for each of the statements Si as it saw them.

To translate into the form of Fig. 6.49, when we see the keyword switch, we generate two new labels test and next, and a new temporary t. Then, as we parse the expression E, we generate code to evaluate E into t. After processing E, we generate the jump goto test. Then, as we see each case keyword, we create a new label and enter it into the symbol table. We place in a queue, used only to store cases, a value-label pair consisting of the value Vi of the case constant and Li (or a pointer to the symbol-table entry for Li). We process each statement case Vi : Si by emitting the label Li attached to the code for Si, followed by the jump goto next.


            code to evaluate E into t
            if t != V1 goto L1
            code for S1
            goto next
L1:         if t != V2 goto L2
            code for S2
            goto next
L2:         ...
Ln-2:       if t != Vn-1 goto Ln-1
            code for Sn-1
            goto next
Ln-1:       code for Sn
next:

Figure 6.50: Another translation of a switch statement

When the end of the switch is found, we are ready to generate the code for the n-way branch. Reading the queue of value-label pairs, we can generate a sequence of three-address statements of the form shown in Fig. 6.51. There, t is the temporary holding the value of the selector expression E, and Ln is the label for the default statement.

        case t V1 L1
        case t V2 L2
        ...
        case t Vn-1 Ln-1
        case t t Ln
        label next

Figure 6.51: Case three-address-code instructions used to translate a switch-statement

The case t Vi Li instruction is a synonym for if t = Vi goto Li in Fig. 6.49, but the case instruction is easier for the final code generator to detect as a candidate for special treatment. At the code-generation phase, these sequences of case statements can be translated into an n-way branch of the most efficient type, depending on how many there are and whether the values fall into a small range.
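A minimal sketch (hypothetical names) of the queue-driven emission: pairs are collected as each case is parsed, and the case instructions of Fig. 6.51 are emitted when the end of the switch is reached:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    class SwitchTranslator {
        static class Pair {
            final String value, label;
            Pair(String v, String l) { value = v; label = l; }
        }
        private final Queue<Pair> cases = new ArrayDeque<>();
        final List<String> code = new ArrayList<>();

        void onCase(String value, String label) {        // called for each "case Vi :"
            cases.add(new Pair(value, label));
        }

        void onSwitchEnd(String t, String defaultLabel, String next) {
            for (Pair p : cases)                          // one case instruction per pair
                code.add("case " + t + " " + p.value + " " + p.label);
            code.add("case " + t + " " + t + " " + defaultLabel);  // always matches: default
            code.add("label " + next);
        }
    }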

6.8.3 Exercises for Section 6.8

! Exercise 6.8.1 : In order to translate a switch-statement into a sequence of

case-statements as in Fig. 6.51, the translator needs to create the list of value-


label pairs, as it processes the source code for the switch. We can do so, using an additional translation that accumulates just the pairs. Sketch a syntax-directed definition that produces the list of pairs, while also emitting code for the statements Si that are the actions for each case.

6.9 Intermediate Code for Procedures

Procedures and their implementation will be discussed at length in Chapter 7, along with the run-time management of storage for names. We use the term function in this section for a procedure that returns a value. We briefly discuss function declarations and three-address code for function calls. In three-address code, a function call is unraveled into the evaluation of parameters in preparation for a call, followed by the call itself. For simplicity, we assume that parameters are passed by value; parameter-passing methods are discussed in Section 1.6.6.

Example 6.25 : Suppose that a is an array of integers, and that f is a function from integers to integers. Then, the assignment

    n = f(a[i]);

might translate into the following three-address code:

    1)  t1 = i * 4
    2)  t2 = a [ t1 ]
    3)  param t2
    4)  t3 = call f, 1
    5)  n = t3

The first two lines compute the value of the expression a[i] into temporary t2, as discussed in Section 6.4. Line 3 makes t2 an actual parameter for the call on line 4 of f with one parameter. Line 4 assigns the value returned by the function call to t3. Line 5 assigns the returned value to n. □

The productions in Fig. 6.52 allow function definitions and function calls. (The syntax generates unwanted commas after the last parameter, but is good enough for illustrating translation.) Nonterminals D and T generate declarations and types, respectively, as in Section 6.3. A function definition generated by D consists of keyword define, a return type, the function name, formal parameters in parentheses and a function body consisting of a statement. Nonterminal F generates zero or more formal parameters, where a formal parameter consists of a type followed by an identifier. Nonterminals S and E generate statements and expressions, respectively. The production for S adds a statement that returns the value of an expression. The production for E adds function calls, with actual parameters generated by A. An actual parameter is an expression.

    D → define T id ( F ) { S }
    F → ε | T id , F
    S → return E ;
    E → id ( A )
    A → ε | E , A

Figure 6.52: Adding functions to the source language

Function definitions and function calls can be translated using concepts that have already been introduced in this chapter.

• Function types. The type of a function must encode the return type and the types of the formal parameters. Let void be a special type that represents no parameter or no return type. The type of a function pop() that returns an integer is therefore "function from void to integer." Function types can be represented by using a constructor fun applied to the return type and an ordered list of types for the parameters.

• Symbol tables. Let s be the top symbol table when the function definition is reached. The function name is entered into s for use in the rest of the program. The formal parameters of a function can be handled in analogy with field names in a record (see Fig. 6.18). In the production for D, after seeing define and the function name, we push s and set up a new symbol table:

      Env.push(top); top = new Env(top);

  Call the new symbol table t. Note that top is passed as a parameter in new Env(top), so the new symbol table t can be linked to the previous one, s. The new table t is used to translate the function body. We revert to the previous symbol table s after the function body is translated.

• Type checking. Within expressions, a function is treated like any other operator. The discussion of type checking in Section 6.5.2 therefore carries over, including the rules for coercions. For example, if f is a function with a parameter of type real, then the integer 2 is coerced to a real in the call f(2).

• Function calls. When generating three-address instructions for a function call id(E, E, ..., E), it is sufficient to generate the three-address instructions for evaluating or reducing the parameters E to addresses, followed by a param instruction for each parameter. If we do not want to mix the parameter-evaluating instructions with the param instructions, the attribute E.addr for each expression E can be saved in a data structure such as a queue. Once all the expressions are translated, the param instructions can be generated as the queue is emptied (a minimal sketch appears below).
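The following is a minimal sketch of this strategy, with hypothetical helper names; the argument addresses are assumed to have been computed already, so only the param instructions and the call itself remain to be emitted:

    import java.util.ArrayList;
    import java.util.List;

    class CallTranslator {
        final List<String> code = new ArrayList<>();   // three-address instructions
        private int temps = 0;

        private String newTemp() { return "t" + (++temps); }

        // argAddrs plays the role of the queue of saved E.addr values
        String translateCall(String f, List<String> argAddrs) {
            for (String a : argAddrs)                   // one param instruction per argument
                code.add("param " + a);
            String t = newTemp();
            code.add(t + " = call " + f + ", " + argAddrs.size());
            return t;                                   // address holding the returned value
        }
    }

For the call in Example 6.25, passing the single address t2 would emit param t2 followed by a call instruction whose result lands in a fresh temporary, as in lines 3 and 4 of that example.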

The procedure is such an important and frequently used programming construct that it is imperative for a compiler to generate good code for procedure calls and returns. The run-time routines that handle procedure parameter passing, calls, and returns are part of the run-time support package. Mechanisms for run-time support are discussed in Chapter 7.

6.10 Summary of Chapter 6

The techniques in this chapter can be combined to build a simple compiler front end, like the one in Appendix A. The front end can be built incrementally:

• Pick an intermediate representation: An intermediate representation is typically some combination of a graphical notation and three-address code. As in syntax trees, a node in a graphical notation represents a construct; the children of a node represent its subconstructs. Three-address code takes its name from instructions of the form x = y op z, with at most one operator per instruction. There are additional instructions for control flow.

• Translate expressions: Expressions with built-up operations can be unwound into a sequence of individual operations by attaching actions to each production of the form E → E1 op E2. The action either creates a node for E with the nodes for E1 and E2 as children, or it generates a three-address instruction that applies op to the addresses for E1 and E2 and puts the result into a new temporary name, which becomes the address for E.

• Check types: The type of an expression E1 op E2 is determined by the operator op and the types of E1 and E2. A coercion is an implicit type conversion, such as from integer to float. Intermediate code contains explicit type conversions to ensure an exact match between operand types and the types expected by an operator.

• Use a symbol table to implement declarations: A declaration specifies the type of a name. The width of a type is the amount of storage needed for a name with that type. Using widths, the relative address of a name at run time can be computed as an offset from the start of a data area. The type and relative address of a name are put into the symbol table due to a declaration, so the translator can subsequently get them when the name appears in an expression.

• Flatten arrays: For quick access, array elements are stored in consecutive locations. Arrays of arrays are flattened so they can be treated as a one-dimensional array of individual elements. The type of an array is used to calculate the address of an array element relative to the base of the array.

• Generate jumping code for boolean expressions: In short-circuit or jumping code, the value of a boolean expression is implicit in the position reached in the code. Jumping code is useful because a boolean expression B is typically used for control flow, as in if (B) S. Boolean values can be computed by jumping to t = true or t = false, as appropriate, where t is a temporary name. Using labels for jumps, a boolean expression can be translated by inheriting labels corresponding to its true and false exits. The constants true and false translate into a jump to the true and false exits, respectively.

• Implement statements using control flow: Statements can be translated by inheriting a label next, where next marks the first instruction after the code for this statement. The conditional S → if (B) S1 can be translated by attaching a new label marking the beginning of the code for S1 and passing the new label and S.next for the true and false exits, respectively, of B.

• Alternatively, use backpatching: Backpatching is a technique for generating code for boolean expressions and statements in one pass. The idea is to maintain lists of incomplete jumps, where all the jump instructions on a list have the same target. When the target becomes known, all the instructions on its list are completed by filling in the target.

• Implement records: Field names in a record or class can be treated as a sequence of declarations. A record type encodes the types and relative addresses of the fields. A symbol table object can be used for this purpose.

6.11 References for Chapter 6

Most of the techniques in this chapter stem from the flurry of design and implementation activity around Algol 60. Syntax-directed translation into intermediate code was well established by the time Pascal [11] and C [6, 9] were created.

UNCOL (for Universal Compiler Oriented Language) is a mythical universal intermediate language, sought since the mid 1950's. Given an UNCOL, compilers could be constructed by hooking a front end for a given source language with a back end for a given target language [10]. The bootstrapping techniques given in the report [10] are routinely used to retarget compilers.

The UNCOL ideal of mixing and matching front ends with back ends has been approached in a number of ways. A retargetable compiler consists of one front end that can be put together with several back ends to implement a given language on several machines. Neliac was an early example of a language with a retargetable compiler [5] written in its own language. Another approach is to


retrofit a front end for a new language onto an existing compiler. Feldman [2] describes the addition of a Fortran 77 front end to the C compilers [6] and [9]. GCC, the GNU Compiler Collection [3], supports front ends for C, Objective-C, Fortran, Java, and Ada.

Value numbers and their implementation by hashing are from Ershov [1]. The use of type information to improve the security of Java bytecodes is described by Gosling [4].

Type inference by using unification to solve sets of equations has been rediscovered several times; its application to ML is described by Milner [7]. See Pierce [8] for a comprehensive treatment of types.

1. Ershov, A., "On programming of arithmetic operations," Comm. ACM 1:8 (1958), pp. 3-6. See also Comm. ACM 1:9 (1958), p. 16.

2. Feldman, S. I., "Implementation of a portable Fortran 77 compiler using modern tools," ACM SIGPLAN Notices 14:8 (1979), pp. 98-106.

3. GCC home page http://gcc.gnu.org/, Free Software Foundation.

4. Gosling, J., "Java intermediate bytecodes," Proc. ACM SIGPLAN Workshop on Intermediate Representations (1995), pp. 111-118.

5. Huskey, H. D., M. H. Halstead, and R. McArthur, "Neliac - a dialect of Algol," Comm. ACM 3:8 (1960), pp. 463-468.

6. Johnson, S. C., "A tour through the portable C compiler," Bell Telephone Laboratories, Inc., Murray Hill, N. J., 1979.

7. Milner, R., "A theory of type polymorphism in programming," J. Computer and System Sciences 17:3 (1978), pp. 348-375.

8. Pierce, B. C., Types and Programming Languages, MIT Press, Cambridge, Mass., 2002.

9. Ritchie, D. M., "A tour through the UNIX C compiler," Bell Telephone Laboratories, Inc., Murray Hill, N. J., 1979.

10. Strong, J., J. Wegstein, A. Tritter, J. Olsztyn, O. Mock, and T. Steel, "The problem of programming communication with changing machines: a proposed solution," Comm. ACM 1:8 (1958), pp. 12-18. Part 2: 1:9 (1958), pp. 9-15. Report of the Share Ad-Hoc Committee on Universal Languages.

11. Wirth, N., "The design of a Pascal compiler," Software-Practice and Experience 1:4 (1971), pp. 309-333.

Chapter 7

Run-Time Environments

A compiler must accurately implement the abstractions embodied in the source-language definition. These abstractions typically include the concepts we discussed in Section 1.6 such as names, scopes, bindings, data types, operators, procedures, parameters, and flow-of-control constructs. The compiler must cooperate with the operating system and other systems software to support these abstractions on the target machine.

To do so, the compiler creates and manages a run-time environment in which it assumes its target programs are being executed. This environment deals with a variety of issues such as the layout and allocation of storage locations for the objects named in the source program, the mechanisms used by the target program to access variables, the linkages between procedures, the mechanisms for passing parameters, and the interfaces to the operating system, input/output devices, and other programs.

The two themes in this chapter are the allocation of storage locations and access to variables and data. We shall discuss memory management in some detail, including stack allocation, heap management, and garbage collection. In the next chapter, we present techniques for generating target code for many common language constructs.

7.1 Storage Organization

From the perspective of the compiler writer, the executing target program runs in its own logical address space in which each program value has a location. The management and organization of this logical address space is shared between the compiler, operating system, and target machine. The operating system maps the logical addresses into physical addresses, which are usually spread throughout memory.

The run-time representation of an object program in the logical address space consists of data and program areas as shown in Fig. 7.1. A compiler for a

428

CHAPTER 7. RUN-TIME ENVIRONMENTS

language like C++ on an operating system like Linux might subdivide memory in this way. Code Static Heap

t Free Memory

• Stack

Figure 7.1: Typical subdivision of run-time memory into code and data areas

Throughout this book, we assume the run-time storage comes in blocks of contiguous bytes, where a byte is the smallest unit of addressable memory. A byte is eight bits and four bytes form a machine word. Multibyte objects are stored in consecutive bytes and given the address of the first byte.
As discussed in Chapter 6, the amount of storage needed for a name is determined from its type. An elementary data type, such as a character, integer, or float, can be stored in an integral number of bytes. Storage for an aggregate type, such as an array or structure, must be large enough to hold all its components.
The storage layout for data objects is strongly influenced by the addressing constraints of the target machine. On many machines, instructions to add integers may expect integers to be aligned, that is, placed at an address divisible by 4. Although an array of ten characters needs only enough bytes to hold ten characters, a compiler may allocate 12 bytes to get the proper alignment, leaving 2 bytes unused. Space left unused due to alignment considerations is referred to as padding. When space is at a premium, a compiler may pack data so that no padding is left; additional instructions may then need to be executed at run time to position packed data so that it can be operated on as if it were properly aligned.
The size of the generated target code is fixed at compile time, so the compiler can place the executable target code in a statically determined area Code, usually in the low end of memory. Similarly, the size of some program data objects, such as global constants, and data generated by the compiler, such as information to support garbage collection, may be known at compile time, and these data objects can be placed in another statically determined area called Static. One reason for statically allocating as many data objects as possible is that the addresses of these objects can be compiled into the target code. In early versions of Fortran, all data objects could be allocated statically.
To maximize the utilization of space at run time, the other two areas, Stack and Heap, are at the opposite ends of the remainder of the address space. These areas are dynamic; their size can change as the program executes. These areas grow towards each other as needed. The stack is used to store data structures called activation records that get generated during procedure calls. In practice, the stack grows towards lower addresses, the heap towards higher. However, throughout this chapter and the next we shall assume that the stack grows towards higher addresses so that we can use positive offsets for notational convenience in all our examples.
As we shall see in the next section, an activation record is used to store information about the status of the machine, such as the value of the program counter and machine registers, when a procedure call occurs. When control returns from the call, the activation of the calling procedure can be restarted after restoring the values of relevant registers and setting the program counter to the point immediately after the call. Data objects whose lifetimes are contained in that of an activation can be allocated on the stack along with other information associated with the activation.
Many programming languages allow the programmer to allocate and deallocate data under program control. For example, C has the functions malloc and free that can be used to obtain and give back arbitrary chunks of storage. The heap is used to manage this kind of long-lived data. Section 7.4 will discuss various memory-management algorithms that can be used to maintain the heap.
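The alignment and padding described above can be observed directly. The following small C program is only an illustration of ours; the struct name, the field sizes, and the amount of padding are machine- and compiler-dependent assumptions, not facts fixed by the language.

#include <stdio.h>

struct S {
    char c;   /* occupies 1 byte                                 */
    int  i;   /* typically 4 bytes, aligned on a 4-byte boundary */
};

int main(void) {
    /* On many machines this prints 8 rather than 5: three bytes of
       padding follow c so that i starts at an aligned address.    */
    printf("sizeof(struct S) = %zu\n", sizeof(struct S));
    return 0;
}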

7.1.1 Static Versus Dynamic Storage Allocation

The layout and allocation of data to memory locations in the run-time environment are key issues in storage management. These issues are tricky because the same name in a program text can refer to multiple locations at run time.
The two adjectives static and dynamic distinguish between compile time and run time, respectively. We say that a storage-allocation decision is static if it can be made by the compiler looking only at the text of the program, not at what the program does when it executes. Conversely, a decision is dynamic if it can be decided only while the program is running. Many compilers use some combination of the following two strategies for dynamic storage allocation:

1. Stack storage. Names local to a procedure are allocated space on a stack. We discuss the "run-time stack" starting in Section 7.2. The stack supports the normal call/return policy for procedures.

2. Heap storage. Data that may outlive the call to the procedure that created it is usually allocated on a "heap" of reusable storage. We discuss heap management starting in Section 7.4. The heap is an area of virtual memory that allows objects or other data elements to obtain storage when they are created and to return that storage when they are invalidated.

To support heap management, "garbage collection" enables the run-time system to detect useless data elements and reuse their storage, even if the programmer does not return their space explicitly. Automatic garbage collection is an essential feature of many modern languages, despite it being a difficult operation to do efficiently; it may not even be possible for some languages.
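In a language like C, which has no garbage collector, heap data that outlives its creating procedure must eventually be freed explicitly. The sketch below is only our own illustration of item (2) above; the names node and make_node are ours, not part of the text.

#include <stdio.h>
#include <stdlib.h>

struct node { int value; };

/* The activation record of make_node disappears when it returns,
   but the heap-allocated node it creates lives on.               */
struct node *make_node(int v) {
    struct node *p = malloc(sizeof *p);
    if (p != NULL)
        p->value = v;
    return p;
}

int main(void) {
    struct node *n = make_node(42);
    if (n != NULL) {
        printf("%d\n", n->value);   /* still valid: storage is on the heap */
        free(n);                    /* manual deallocation, as in C        */
    }
    return 0;
}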

7.2 Stack Allocation of Space

Almost all compilers for languages that use procedures, functions, or methods as units of user-defined actions manage at least part of their run-time memory as a stack. Each time a procedure1 is called, space for its local variables is pushed onto a stack, and when the procedure terminates, that space is popped off the stack. As we shall see, this arrangement not only allows space to be shared by procedure calls whose durations do not overlap in time, but it allows us to compile code for a procedure in such a way that the relative addresses of its nonlocal variables are always the same, regardless of the sequence of procedure calls.

7.2.1 Activation Trees

Stack allocation would not be feasible if procedure calls, or activations of procedures, did not nest in time. The following example illustrates nesting of procedure calls.

Example 7.1: Figure 7.2 contains a sketch of a program that reads nine integers into an array a and sorts them using the recursive quicksort algorithm. The main function has three tasks. It calls readArray, sets the sentinels, and then calls quicksort on the entire data array. Figure 7.3 suggests a sequence of calls that might result from an execution of the program. In this execution, the call to partition(1, 9) returns 4, so a[1] through a[3] hold elements less than its chosen separator value v, while the larger elements are in a[5] through a[9]. □

In this example, as is true in general, procedure activations are nested in time. If an activation of procedure p calls procedure q, then that activation of q must end before the activation of p can end. There are three common cases:

1. The activation of q terminates normally. Then in essentially any language, control resumes just after the point of p at which the call to q was made.

2. The activation of q, or some procedure q called, either directly or indirectly, aborts; i.e., it becomes impossible for execution to continue. In that case, p ends simultaneously with q.

1 Recall we use "procedure" as a generic term for function, procedure, method, or subroutine.


int a[11];

void readArray() { /* Reads 9 integers into a[1], ..., a[9]. */
    int i;
    ...
}

int partition(int m, int n) {
    /* Picks a separator value v, and partitions a[m..n] so that a[m..p-1]
       are less than v, a[p] = v, and a[p+1..n] are equal to or greater
       than v. Returns p. */
    ...
}

void quicksort(int m, int n) {
    int i;
    if (n > m) {
        i = partition(m, n);
        quicksort(m, i-1);
        quicksort(i+1, n);
    }
}

main() {
    readArray();
    a[0] = -9999;
    a[10] = 9999;
    quicksort(1, 9);
}

Figure 7.2: Sketch of a quicksort program

3. The activation of q terminates because of an exception that q cannot handle. Procedure p may handle the exception, in which case the activation of q has terminated while the activation of p continues, although not necessarily from the point at which the call to q was made. If p cannot handle the exception, then this activation of p terminates at the same time as the activation of q, and presumably the exception will be handled by some other open activation of a procedure.

We therefore can represent the activations of procedures during the running of an entire program by a tree, called an activation tree. Each node corresponds to one activation, and the root is the activation of the "main" procedure that initiates execution of the program. At a node for an activation of procedure p, the children correspond to activations of the procedures called by this activation of p. We show these activations in the order that they are called, from left to right. Notice that one child must finish before the activation to its right can begin.


A Version of Quicksort

The sketch of a quicksort program in Fig. 7.2 uses two auxiliary functions readArray and partition. The function readArray is used only to load the data into the array a. The first and last elements of a are not used for data, but rather for "sentinels" set in the main function. We assume a[0] is set to a value lower than any possible data value, and a[10] is set to a value higher than any data value.
The function partition divides a portion of the array, delimited by the arguments m and n, so the low elements of a[m] through a[n] are at the beginning, and the high elements are at the end, although neither group is necessarily in sorted order. We shall not go into the way partition works, except that it may rely on the existence of the sentinels. One possible algorithm for partition is suggested by the more detailed code in Fig. 9.1.
Recursive procedure quicksort first decides if it needs to sort more than one element of the array. Note that one element is always "sorted," so quicksort has nothing to do in that case. If there are elements to sort, quicksort first calls partition, which returns an index i to separate the low and high elements. These two groups of elements are then sorted by two recursive calls to quicksort.

Example 7.2: One possible activation tree that completes the sequence of calls and returns suggested in Fig. 7.3 is shown in Fig. 7.4. Functions are represented by the first letters of their names. Remember that this tree is only one possibility, since the arguments of subsequent calls, and also the number of calls along any branch, are influenced by the values returned by partition. □

The use of a run-time stack is enabled by several useful relationships between the activation tree and the behavior of the program:

1. The sequence of procedure calls corresponds to a preorder traversal of the activation tree.

2. The sequence of returns corresponds to a postorder traversal of the activation tree.

3. Suppose that control lies within a particular activation of some procedure, corresponding to a node N of the activation tree. Then the activations that are currently open (live) are those that correspond to node N and its ancestors. The order in which these activations were called is the order in which they appear along the path to N, starting at the root, and they will return in the reverse of that order.
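Relationships (1) and (2) can be seen concretely by instrumenting each procedure with messages on entry and exit. The C sketch below is our own illustration: it uses a stand-in partition that merely returns the midpoint (the book deliberately leaves partition unspecified), so the particular numbers differ from the run shown in Fig. 7.3, but the enter lines always appear in preorder and the leave lines in postorder of the activation tree.

#include <stdio.h>

/* Stand-in partition: returns the midpoint without reordering anything.
   It exists only to drive the call/return trace.                        */
static int partition(int m, int n) {
    printf("enter partition(%d,%d)\n", m, n);
    int p = (m + n) / 2;
    printf("leave partition(%d,%d)\n", m, n);
    return p;
}

static void quicksort(int m, int n) {
    printf("enter quicksort(%d,%d)\n", m, n);   /* preorder: at the call    */
    if (n > m) {
        int i = partition(m, n);
        quicksort(m, i-1);
        quicksort(i+1, n);
    }
    printf("leave quicksort(%d,%d)\n", m, n);   /* postorder: at the return */
}

int main(void) {
    quicksort(1, 9);
    return 0;
}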


enter main()
enter readArray()
leave readArray()
enter quicksort(1, 9)
enter partition(1, 9)
leave partition(1, 9)
enter quicksort(1, 3)
leave quicksort(1, 3)
enter quicksort(5, 9)
leave quicksort(5, 9)
leave quicksort(1, 9)
leave main()

Figure 7.3: Possible activations for the program of Fig. 7.2

m
 +-- r
 +-- q(1,9)
      +-- p(1,9)
      +-- q(1,3)
      |    +-- p(1,3)
      |    +-- q(1,0)
      |    +-- q(2,3)
      |         +-- p(2,3)
      |         +-- q(2,1)
      |         +-- q(3,3)
      +-- q(5,9)
           +-- p(5,9)
           +-- q(5,5)
           +-- q(7,9)
                +-- p(7,9)
                +-- q(7,7)
                +-- q(9,9)

Figure 7.4: Activation tree representing calls during an execution of quicksort

7.2.2 Activation Records

Procedure calls and returns are usually managed by a run-time stack called the control stack. Each live activation has an activation record (sometimes called a frame) on the control stack, with the root of the activation tree at the bottom, and the entire sequence of activation records on the stack corresponding to the path in the activation tree to the activation where control currently resides. The latter activation has its record at the top of the stack.

Example 7.3: If control is currently in the activation q(2, 3) of the tree of Fig. 7.4, then the activation record for q(2, 3) is at the top of the control stack. Just below is the activation record for q(1, 3), the parent of q(2, 3) in the tree. Below that is the activation record q(1, 9), and at the bottom is the activation record for m, the main function and root of the activation tree. □


We shall conventionally draw control stacks with the bottom of the stack higher than the top, so the elements in an activation record that appear lowest on the page are actually closest to the top of the stack.
The contents of activation records vary with the language being implemented. Here is a list of the kinds of data that might appear in an activation record (see Fig. 7.5 for a summary and possible order for these elements):

    Actual parameters
    Returned values
    Control link
    Access link
    Saved machine status
    Local data
    Temporaries

Figure 7.5: A general activation record

1. Temporary values, such as those arising from the evaluation of expressions, in cases where those temporaries cannot be held in registers.

2. Local data belonging to the procedure whose activation record this is.

3. A saved machine status, with information about the state of the machine just before the call to the procedure. This information typically includes the return address (value of the program counter, to which the called procedure must return) and the contents of registers that were used by the calling procedure and that must be restored when the return occurs.

4. An "access link" may be needed to locate data needed by the called procedure but found elsewhere, e.g., in another activation record. Access links are discussed in Section 7.3.5.

5. A control link, pointing to the activation record of the caller.

6. Space for the return value of the called function, if any. Again, not all called procedures return a value, and if one does, we may prefer to place that value in a register for efficiency. 7. The actual parameters used by the calling procedure. Commonly, these values are not placed in the activation record but rather in registers, when possible, for greater efficiency. However, we show a space for them to be completely general.
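The layout of Fig. 7.5 can be pictured as a record. The C struct below is a rough, hypothetical sketch for illustration only: real frames are laid out directly in stack memory by the compiler, and every field name and array size here is an assumption of ours rather than part of any actual run-time system.

#include <stdint.h>

/* A hypothetical activation record following the order of Fig. 7.5. */
struct activation_record {
    int64_t  actual_params[4];   /* actual parameters (often kept in registers) */
    int64_t  returned_value;     /* space for the return value, if any          */
    struct activation_record *control_link;  /* record of the caller            */
    struct activation_record *access_link;   /* for nonlocal data (Sec. 7.3.5)  */
    void    *return_address;     /* part of the saved machine status            */
    int64_t  saved_registers[8]; /* rest of the saved machine status            */
    int64_t  local_data[8];      /* locals of this activation                   */
    int64_t  temporaries[8];     /* temporaries that spill from registers       */
};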


Example 7.4: Figure 7.6 shows snapshots of the run-time stack as control flows through the activation tree of Fig. 7.4. Dashed lines in the partial trees go to activations that have ended. Since array a is global, space is allocated for it before execution begins with an activation of procedure main, as shown in Fig. 7.6(a).

(a) Frame for main
(b) r is activated
(c) r has been popped and q(1, 9) pushed
(d) Control returns to q(1, 3)

Figure 7.6: Downward-growing stack of activation records

When control reaches the first call in the body of main, procedure r is activated, and its activation record is pushed onto the stack (Fig. 7.6(b)). The activation record for r contains space for local variable i. Recall that the top of stack is at the bottom of diagrams. When control returns from this activation, its record is popped, leaving just the record for main on the stack.
Control then reaches the call to q (quicksort) with actual parameters 1 and 9, and an activation record for this call is placed on the top of the stack, as in Fig. 7.6(c). The activation record for q contains space for the parameters m and n and the local variable i, following the general layout in Fig. 7.5. Notice that space once used by the call of r is reused on the stack. No trace of data local to r will be available to q(1, 9). When q(1, 9) returns, the stack again has only the activation record for main.
Several activations occur between the last two snapshots in Fig. 7.6. A recursive call to q(1, 3) was made. Activations p(1, 3) and q(1, 0) have begun and ended during the lifetime of q(1, 3), leaving the activation record for q(1, 3) on top (Fig. 7.6(d)). Notice that when a procedure is recursive, it is normal to have several of its activation records on the stack at the same time. □

7.2.3 Calling Sequences

Procedure calls are implemented by what are known as calling sequences, which consist of code that allocates an activation record on the stack and enters information into its fields. A return sequence is similar code to restore the state of the machine so the calling procedure can continue its execution after the call.
Calling sequences and the layout of activation records may differ greatly, even among implementations of the same language. The code in a calling sequence is often divided between the calling procedure (the "caller") and the procedure it calls (the "callee"). There is no exact division of run-time tasks between caller and callee; the source language, the target machine, and the operating system impose requirements that may favor one solution over another. In general, if a procedure is called from n different points, then the portion of the calling sequence assigned to the caller is generated n times. However, the portion assigned to the callee is generated only once. Hence, it is desirable to put as much of the calling sequence into the callee as possible - whatever the callee can be relied upon to know. We shall see, however, that the callee cannot know everything.
When designing calling sequences and the layout of activation records, the following principles are helpful:

1. Values communicated between caller and callee are generally placed at the beginning of the callee's activation record, so they are as close as possible to the caller's activation record. The motivation is that the caller can compute the values of the actual parameters of the call and place them on top of its own activation record, without having to create the entire activation record of the callee, or even to know the layout of that record. Moreover, it allows for the use of procedures that do not always take the same number or type of arguments, such as C's printf function. The callee knows where to place the return value, relative to its own activation record, while however many arguments are present will appear sequentially below that place on the stack.

2. Fixed-length items are generally placed in the middle. From Fig. 7.5, such items typically include the control link, the access link, and the machine status fields. If exactly the same components of the machine status are saved for each call, then the same code can do the saving and restoring for each. Moreover, if we standardize the machine's status information, then programs such as debuggers will have an easier time deciphering the stack contents if an error occurs.

3. Items whose size may not be known early enough are placed at the end of the activation record. Most local variables have a fixed length, which can be determined by the compiler by examining the type of the variable. However, some local variables have a size that cannot be determined until the program executes; the most common example is a dynamically sized array, where the value of one of the callee's parameters determines the length of the array. Moreover, the amount of space needed for temporaries usually depends on how successful the code-generation phase is in keeping temporaries in registers. Thus, while the space needed for temporaries is eventually known to the compiler, it may not be known when the intermediate code is first generated.

4. We must locate the top-of-stack pointer judiciously. A common approach is to have it point to the end of the fixed-length fields in the activation record. Fixed-length data can then be accessed by fixed offsets, known to the intermediate-code generator, relative to the top-of-stack pointer. A consequence of this approach is that variable-length fields in the activation records are actually "above" the top-of-stack. Their offsets need to be calculated at run time, but they too can be accessed from the top-of-stack pointer, by using a positive offset.

    Caller's activation record:
        . . .
        Parameters and returned value
        Control link
        Links and saved status
        Temporaries and local data       <-- caller's responsibility starts here
    Callee's activation record:
        Parameters and returned value
        Control link
        Links and saved status           <-- top_sp; caller's responsibility ends here
        Temporaries and local data       <-- callee's responsibility

Figure 7.7: Division of tasks between caller and callee


An example of how caller and callee might cooperate in managing the stack is suggested by Fig. 7.7. A register top_sp points to the end of the machine-status field in the current top activation record. This position within the callee's activation record is known to the caller, so the caller can be made responsible for setting top_sp before control is passed to the callee. The calling sequence and its division between caller and callee is as follows:

1. The caller evaluates the actual parameters.


2. The caller stores a return address and the old value of top_sp into the callee's activation record. The caller then increments top_sp to the position shown in Fig. 7.7. That is, top_sp is moved past the caller's local data and temporaries and the callee's parameters and status fields.

3. The callee saves the register values and other status information. 4. The callee initializes its local data and begins execution. A suitable, corresponding return sequence is:

1. The callee places the return value next to the parameters, as in Fig. 7.5. 2. Using information in the machine-status field, the callee restores top_sp and other registers, and then branches to the return address that the caller placed in the status field.

3. Although top_sp has been decremented, the caller knows where the return value is, relative to the current value of top_sp; the caller therefore may use that value.

The above calling and return sequences allow the number of arguments of the called procedure to vary from call to call (e.g., as in C's printf function). Note that at compile time, the target code of the caller knows the number and types of arguments it is supplying to the callee. Hence the caller knows the size of the parameter area. The target code of the callee, however, must be prepared to handle other calls as well, so it waits until it is called and then examines the parameter field. Using the organization of Fig. 7.7, information describing the parameters must be placed next to the status field, so the callee can find it. For example, in the printf function of C, the first argument describes the remaining arguments, so once the first argument has been located, the callee can find whatever other arguments there are.
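The division of labor above can be mimicked in ordinary C. The toy program below is purely our own simulation of the calling and return sequences - not generated code and not a real machine - using an array as the stack, an index top_sp, and a three-word "activation record" (parameter, return-value slot, saved top_sp); it computes a factorial so the bookkeeping can be checked.

#include <stdio.h>

#define STACK_WORDS 256

/* Record layout, growing upward from base:
     [0] parameter n           -- written by the caller
     [1] slot for return value -- written by the callee on return
     [2] saved top_sp          -- stands in for the saved machine status
   top_sp points just past the status field of the current record.     */
static long stk[STACK_WORDS];
static long top_sp = 0;

static void callee_fact(void);

/* Caller side: build the callee's record, pass control, fetch the result. */
static long call_fact(long n) {
    long base = top_sp;          /* callee's record starts here            */
    stk[base + 0] = n;           /* 1. evaluate and store actual parameter */
    stk[base + 2] = top_sp;      /* 2. store old top_sp ...                */
    top_sp = base + 3;           /*    ... and advance top_sp              */
    callee_fact();               /*    control passes to the callee        */
    /* Return step 3: top_sp has been restored; the return value sits at a
       known offset just above it.                                         */
    return stk[top_sp + 1];
}

/* Callee side: do the local work, leave the return value, restore top_sp. */
static void callee_fact(void) {
    long base = top_sp - 3;      /* start of this activation record        */
    long n = stk[base + 0];
    long result = (n <= 1) ? 1 : n * call_fact(n - 1);
    stk[base + 1] = result;      /* return step 1: value next to params    */
    top_sp = stk[base + 2];      /* return step 2: restore top_sp          */
}

int main(void) {
    printf("5! = %ld\n", call_fact(5));   /* prints 5! = 120 */
    return 0;
}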

7.2.4 Variable-Length Data on the Stack

The run-time memory-management system must deal frequently with the allocation of space for objects the sizes of which are not known at compile time, but which are local to a procedure and thus may be allocated on the stack. In modern languages, objects whose size cannot be determined at compile time are allocated space in the heap, the storage structure that we discuss in Section 7.4. However, it is also possible to allocate objects, arrays, or other structures of unknown size on the stack, and we discuss here how to do so. The reason to prefer placing objects on the stack if possible is that we avoid the expense of garbage collecting their space. Note that the stack can be used only for an object if it is local to a procedure and becomes inaccessible when the procedure returns.
A common strategy for allocating variable-length arrays (i.e., arrays whose size depends on the value of one or more parameters of the called procedure) is shown in Fig. 7.8. The same scheme works for objects of any type if they are local to the procedure called and have a size that depends on the parameters of the call.
In Fig. 7.8, procedure p has three local arrays, whose sizes we suppose cannot be determined at compile time. The storage for these arrays is not part of the activation record for p, although it does appear on the stack. Only a pointer to the beginning of each array appears in the activation record itself. Thus, when p is executing, these pointers are at known offsets from the top-of-stack pointer, so the target code can access array elements through these pointers.

    Activation record for p:
        Control link and saved status
        Pointer to a
        Pointer to b
        Pointer to c
        . . .
    Array a
    Array b
    Array c
    Activation record for procedure q called by p:
        Control link and saved status      <-- top_sp
        . . .
    Arrays of q                            <-- top

Figure 7.8: Access to dynamically allocated arrays

Also shown in Fig. 7.8 is the activation record for a procedure q, called by p. The activation record for q begins after the arrays of p, and any variable-length arrays of q are located beyond that.
Access to the data on the stack is through two pointers, top and top_sp. Here, top marks the actual top of stack; it points to the position at which the next activation record will begin. The second, top_sp, is used to find local, fixed-length fields of the top activation record. For consistency with Fig. 7.7, we shall suppose that top_sp points to the end of the machine-status field. In Fig. 7.8, top_sp points to the end of this field in the activation record for q. From there, we can find the control-link field for q, which leads us to the place in the activation record for p where top_sp pointed when p was on top.
The code to reposition top and top_sp can be generated at compile time,


in terms of sizes that will become known at run time. When q returns, top_sp can be restored from the saved control link in the activation record for q. The new value of top is (the old unrestored value of) top_sp minus the length of the machine-status, control and access link, return-value, and parameter fields (as in Fig. 7.5) in q's activation record. This length is known at compile time to the caller, although it may depend on the caller, if the number of parameters can vary across calls to q.
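C99 variable-length arrays give a familiar instance of this strategy. The sketch below is only an illustration of ours; how a particular compiler lays out the array and the pointer or offset used to reach it is an implementation detail we are assuming, not something dictated by the language standard.

#include <stdio.h>

/* The local array's length depends on parameter n, so its storage cannot
   be laid out at compile time; it is carved out of the stack at run time
   and reached through a computed address rather than a fixed offset.     */
static long sum_first(int n) {
    long a[n];                    /* C99 variable-length array       */
    long total = 0;
    for (int i = 0; i < n; i++)
        a[i] = i + 1;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;                 /* a's storage vanishes on return   */
}

int main(void) {
    printf("%ld\n", sum_first(9));   /* prints 45 */
    return 0;
}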

7.2.5 Exercises for Section 7.2

Exercise 7.2.1: Suppose that the program of Fig. 7.2 uses a partition function that always picks a[m] as the separator v. Also, when the array a[m], ..., a[n] is reordered, assume that the order is preserved as much as possible. That is, first come all the elements less than v, in their original order, then all elements equal to v, and finally all elements greater than v, in their original order.

a) Draw the activation tree when the numbers 9, 8, 7, 6, 5, 4, 3, 2, 1 are sorted.

b) What is the largest number of activation records that ever appear together

on the stack?

Exercise 7.2.2: Repeat Exercise 7.2.1 when the initial order of the numbers is 1, 3, 5, 7, 9, 2, 4, 6, 8.

Exercise 7.2.3: In Fig. 7.9 is C code to compute Fibonacci numbers recursively. Suppose that the activation record for f includes the following elements in order: (return value, argument n, local s, local t); there will normally be other elements in the activation record as well. The questions below assume that the initial call is f(5).

a) Show the complete activation tree.

b) What does the stack and its activation records look like the first time f(1) is about to return?

! c) What does the stack and its activation records look like the fifth time f(1) is about to return?

Exercise 7.2.4: Here is a sketch of two C functions f and g:

int f(int x) { int i; ... return i+1; ... }
int g(int y) { int j; ... f(j+1) ... }


That is, function g calls f. Draw the top of the stack, starting with the activation record for g, after g calls f, and f is about to return. You can consider only return values, parameters, control links, and space for local variables; you do not have to consider stored state or temporary or local values not shown in the code sketch. However, you should indicate:


int f(int n) {
    int t, s;
    if (n < 2) return 1;
    s = f(n-1);
    t = f(n-2);
    return s+t;
}

Figure 7.9: Fibonacci program for Exercise 7.2.3

a) Which function creates the space on the stack for each element?
b) Which function writes the value of each element?
c) To which activation record does the element belong?

Exercise 7.2.5: In a language that passes parameters by reference, there is a function f(x, y) that does the following:

x = x + 1; y = y + 2; return x+y;

If a is assigned the value 3, and then f(a, a) is called, what is returned?

Exercise 7.2.6: The C function f is defined by:

int f(int x, int *py, int **ppz) {
    **ppz += 1; *py += 2; x += 3;
    return x+y+z;
}

Variable a is a pointer to b; variable b is a pointer to c, and c is an integer currently with value 4. If we call f(c, b, a), what is returned?

7.3 Access to Nonlocal Data on the Stack

In this section, we consider how procedures access their data. Especially important is the mechanism for finding data used within a procedure p but that does not belong to p. Access becomes more complicated in languages where procedures can be declared inside other procedures. We therefore begin with the simple case of C functions, and then introduce a language, ML, that permits both nested function declarations and functions as "first-class objects;" that is, functions can take functions as arguments and return functions as values. This capability can be supported by modifying the implementation of the run-time stack, and we shall consider several options for modifying the stack frames of Section 7.2.


7.3.1 Data Access Without Nested Procedures

In the C family of languages, all variables are defined either within a single function or outside any function ("globally"). Most importantly, it is impossible to declare one procedure whose scope is entirely within another procedure. Rather, a global variable v has a scope consisting of all the functions that follow the declaration of v, except where there is a local definition of the identifier v. Variables declared within a function have a scope consisting of that function only, or part of it, if the function has nested blocks, as discussed in Section 1.6.3.
For languages that do not allow nested procedure declarations, allocation of storage for variables and access to those variables is simple:

1. Global variables are allocated static storage. The locations of these variables remain fixed and are known at compile time. So to access any variable that is not local to the currently executing procedure, we simply use the statically determined address.

2. Any other name must be local to the activation at the top of the stack. We may access these variables through the top_sp pointer of the stack.

An important benefit of static allocation for globals is that declared procedures may be passed as parameters or returned as results (in C, a pointer to the function is passed), with no substantial change in the data-access strategy. With the C static-scoping rule, and without nested procedures, any name nonlocal to one procedure is nonlocal to all procedures, regardless of how they are activated. Similarly, if a procedure is returned as a result, then any nonlocal name refers to the storage statically allocated for it.
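Both points are visible in a short C example of ours (the names counter, bump, and apply_twice are illustrative only): the global is reached through its statically known address no matter which procedure is executing, and passing the procedure bump as a parameter - a function pointer in C - changes nothing about how that access is compiled.

#include <stdio.h>

int counter = 0;        /* global: static storage, address fixed at compile time */

void bump(void) {
    counter++;          /* nonlocal access uses the statically determined address */
}

/* A procedure passed as a parameter: C passes a pointer to the function. */
void apply_twice(void (*p)(void)) {
    p();
    p();
}

int main(void) {
    apply_twice(bump);
    printf("%d\n", counter);   /* prints 2 */
    return 0;
}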

7.3.2 Issues With Nested Procedures

Access becomes far more complicated when a language allows procedure declarations to be nested and also uses the normal static scoping rule; that is, a procedure can access variables of the procedures whose declarations surround its own declaration, following the nested scoping rule described for blocks in Section 1.6.3. The reason is that knowing at compile time that the declaration of p is immediately nested within q does not tell us the relative positions of their activation records at run time. In fact, since either p or q or both may be recursive, there may be several activation records of p and/or q on the stack.
Finding the declaration that applies to a nonlocal name x in a nested procedure p is a static decision; it can be done by an extension of the static-scope rule for blocks. Suppose x is declared in the enclosing procedure q. Finding the relevant activation of q from an activation of p is a dynamic decision; it requires additional run-time information about activations. One possible solution to this problem is to use "access links," which we introduce in Section 7.3.5.

7.3.3 A Language With Nested Procedure Declarations

The C family of languages, and many other familiar languages do not support nested procedures, so we introduce one that does. The history of nested procedures in languages is long. Algol 60, an ancestor of C, had this capability, as did its descendant Pascal, a once-popular teaching language. Of the later languages with nested procedures, one of the most influential is ML, and it is this language whose syntax and semantics we shall borrow (see the box on "More about ML" for some of the interesting features of ML):

• ML is a functional language, meaning that variables, once declared and initialized, are not changed. There are only a few exceptions, such as the array, whose elements can be changed by special function calls.

• Variables are defined, and have their unchangeable values initialized, by a statement of the form:

    val (name) = (expression)

• Functions are defined using the syntax:

    fun (name) ( (arguments) ) = (body)

• For function bodies we shall use let-statements of the form:

    let (list of definitions) in (statements) end

The definitions are normally val or fun statements. The scope of each such definition consists of all following definitions, up to the in, and all the statements up to the end. Most importantly, function definitions can be nested. For example, the body of a function p can contain a let-statement that includes the definition of another (nested) function q. Similarly, q can have function definitions within its own body, leading to arbitrarily deep nesting of functions.

7.3.4 Nesting Depth

Let us give nesting depth 1 to procedures that are not nested within any other procedure. For example, all C functions are at nesting depth 1. However, if a procedure p is defined immediately within a procedure at nesting depth i, then give p the nesting depth i + 1.

Example 7.5: Figure 7.10 contains a sketch in ML of our running quicksort example. The only function at nesting depth 1 is the outermost function, sort, which reads an array a of 9 integers and sorts them using the quicksort algorithm. Defined within sort, at line (2), is the array a itself. Notice the form


More About ML

In addition to being almost purely functional, ML presents a number of other surprises to the programmer who is used to C and its family.

• ML supports higher-order functions. That is, a function can take functions as arguments, and can construct and return other functions. Those functions, in turn, can take functions as arguments, to any level.

• ML has essentially no iteration, as in C's for- and while-statements, for instance. Rather, the effect of iteration is achieved by recursion. This approach is essential in a functional language, since we cannot change the value of an iteration variable like i in "for (i=0; i<10; i++)" of C. Instead, ML would make i a function argument, and the function would call itself with progressively higher values of i until the limit was reached.

• ML supports lists and labeled tree structures as primitive data types.

• ML does not require declaration of variable types. Rather, it deduces types at compile time, and treats it as an error if it cannot. For example, val x = 1 evidently makes x have integer type, and if we also see val y = 2*x, then we know y is also an integer.

of the ML declaration. The first argument of array says we want the array to have 11 elements; all ML arrays are indexed by integers starting with 0, so this array is quite similar to the C array a from Fig. 7.2. The second argument of array says that initially, all elements of the array a hold the value 0. This choice of initial value lets the ML compiler deduce that a is an integer array, since 0 is an integer, so we never have to declare a type for a.
Also declared within sort are several functions: readArray, exchange, and quicksort. On lines (4) and (6) we suggest that readArray and exchange each access the array a. Note that in ML, array accesses can violate the functional nature of the language, and both these functions actually change values of a's elements, as in the C version of quicksort. Since each of these three functions is defined immediately within a function at nesting depth 1, their nesting depths are all 2.
Lines (7) through (11) show some of the detail of quicksort. Local value v, the pivot for the partition, is declared at line (8). Function partition is defined at line (9). In line (10) we suggest that partition accesses both the array a and the pivot value v, and also calls the function exchange. Since partition is defined immediately within a function at nesting depth 2, it is at depth 3.


 1) fun sort(inputFile, outputFile) =
 2)     let val a = array(11,0);
 3)         fun readArray(inputFile) =
 4)             ... a ... ;
 5)         fun exchange(i,j) =
 6)             ... a ... ;
 7)         fun quicksort(m,n) =
 8)             let val v = ... ;
 9)                 fun partition(y,z) =
10)                     ... a ... v ... exchange ...
11)             in ... a ... v ... partition ... quicksort ... end
12)     in ... a ... readArray ... quicksort ... end;

Figure 7.10: A version of quicksort, in ML style, using nested functions

Line (11) suggests that quicksort accesses variables a and v, the function partition, and itself recursively. Line (12) suggests that the outer function sort accesses a and calls the two procedures readArray and quicksort. □

7.3.5 Access Links

A direct implementation of the normal static scope rule for nested functions is obtained by adding a pointer called the access link to each activation record. If procedure p is nested immediately within procedure q in the source code, then the access link in any activation of p points to the most recent activation of q. Note that the nesting depth of q must be exactly one less than the nesting depth of p. Access links form a chain from the activation record at the top of the stack to a sequence of activations at progressively lower nesting depths. Along this chain are all the activations whose data and procedures are accessible to the currently executing procedure.
Suppose that the procedure p at the top of the stack is at nesting depth np, and p needs to access x, which is an element defined within some procedure q that surrounds p and has nesting depth nq. Note that nq ≤ np, with equality only if p and q are the same procedure. To find x, we start at the activation record for p at the top of the stack and follow the access link np - nq times, from activation record to activation record. Finally, we wind up at an activation record for q, and it will always be the most recent (highest) activation record


for q that currently appears on the stack. This activation record contains the element x that we want. Since the compiler knows the layout of activation records, x will be found at some fixed offset from the position in q's activation record that we can reach by following the last access link.

Example 7.6: Figure 7.11 shows a sequence of stacks that might result from

execution of the function sort of Fig. 7.10. As before, we represent function names by their first letters, and we show some of the data that might appear in the various activation records, as well as the access link for each activation. In Fig. 7.11(a), we see the situation after sort has called readArray to load input into the array a and then called quicksort(1, 9) to sort the array. The access link from quicksort(1, 9) points to the activation record for sort, not because sort called quicksort but because sort is the most closely nested function surrounding quicksort in the program of Fig. 7.10.

(a) s, q(1,9)
(b) s, q(1,9), q(1,3)
(c) s, q(1,9), q(1,3), p(1,3)
(d) s, q(1,9), q(1,3), p(1,3), e(1,3)

(Each record carries its access link; the access links of q(1,9), q(1,3), and e(1,3) point to the record for s, while the access link of p(1,3) points to q(1,3).)

Figure 7.11: Access links for finding nonlocal data

In successive steps of Fig. 7.11 we see a recursive call to quicksort(1, 3), followed by a call to partition, which calls exchange. Notice that quicksort(1, 3)'s access link points to sort, for the same reason that quicksort(1, 9)'s does.
In Fig. 7.11(d), the access link for exchange bypasses the activation records for quicksort and partition, since exchange is nested immediately within sort. That arrangement is fine, since exchange needs to access only the array a, and the two elements it must swap are indicated by its own parameters i and j. □

7.3.6 Manipulating Access Links

How are access links determined? The simple case occurs when a procedure call is to a particular procedure whose name is given explicitly in the procedure call. The harder case is when the call is to a procedure-parameter; in that case, the particular procedure being called is not known until run time, and the nesting depth of the called procedure may differ in different executions of the call. Thus, let us first consider what should happen when a procedure q calls procedure p, explicitly. There are three cases:

1. Procedure p is at a higher nesting depth than q. Then p must be defined immediately within q, or the call by q would not be at a position that is within the scope of the procedure name p. Thus, the nesting depth of p is exactly one greater than that of q, and the access link from p must lead to q. It is a simple matter for the calling sequence to include a step that places in the access link for p a pointer to the activation record of q. Examples include the call of quicksort by sort to set up Fig. 7.11(a), and the call of partition by quicksort to create Fig. 7.11(c).

2. The call is recursive, that is, p = q.2 Then the access link for the new activation record is the same as that of the activation record below it. An example is the call of quicksort(1, 3) by quicksort(1, 9) to set up Fig. 7.11(b).

3. The nesting depth np of p is less than the nesting depth nq of q. In order for the call within q to be in the scope of name p, procedure q must be nested within some procedure r, while p is a procedure defined immediately within r. The top activation record for r can therefore be found by following the chain of access links, starting in the activation record for q, for nq - np + 1 hops. Then, the access link for p must go to this activation of r.

Example 7.7: For an example of case (3), notice how we go from Fig. 7.11(c)

to Fig. 7.11(d). The nesting depth 2 of the called function exchange is one less than the depth 3 of the calling function partition. Thus, we start at the activation record for partition and follow 3 - 2 + 1 = 2 access links, which takes us from partition's activation record to that of quicksort(1, 3) to that of sort. The access link for exchange therefore goes to the activation record for sort, as we see in Fig. 7.11(d).
An equivalent way to discover this access link is simply to follow access links for nq - np hops, and copy the access link found in that record. In our example, we would go one hop to the activation record for quicksort(1, 3) and copy its access link to sort. Notice that this access link is correct for exchange, even though exchange is not in the scope of quicksort, these being sibling functions nested within sort. □

2 ML allows mutually recursive functions, which would be handled the same way.
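To make the counting concrete, the following C fragment is a toy model of ours (a compiler would not build such structs; real frames live on the run-time stack): it links three records for sort, quicksort, and partition at depths 1, 2, and 3, and resolves a name declared in sort from inside partition by following the access link np - nq = 2 times.

#include <stdio.h>
#include <stddef.h>

/* Toy stand-in for an activation record, used only to illustrate
   how access links are followed.                                  */
struct frame {
    int depth;               /* nesting depth of the procedure              */
    struct frame *access;    /* access link: most recent activation of the
                                immediately enclosing procedure             */
    int x;                   /* stand-in for data stored in the frame       */
};

/* From the running frame fp, reach the frame of the enclosing procedure
   declared at depth nq: follow np - nq access links.                      */
static struct frame *resolve(struct frame *fp, int nq) {
    int hops = fp->depth - nq;
    while (hops-- > 0)
        fp = fp->access;
    return fp;
}

int main(void) {
    struct frame sort_f = { 1, NULL,    42 };   /* sort, depth 1      */
    struct frame qs_f   = { 2, &sort_f,  0 };   /* quicksort, depth 2 */
    struct frame part_f = { 3, &qs_f,    0 };   /* partition, depth 3 */
    /* From partition (depth 3), fetch data belonging to sort (depth 1). */
    printf("%d\n", resolve(&part_f, 1)->x);     /* prints 42 */
    return 0;
}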

7.3.7 Access Links for Procedure Parameters

When a procedure p is passed to another procedure q as a parameter, and q then calls its parameter (and therefore calls p in this activation of q), it is possible that q does not know the context in which p appears in the program. If so, it is impossible for q to know how to set the access link for p. The solution to this problem is as follows: when procedures are used as parameters, the caller needs to pass, along with the name of the procedure-parameter, the proper access link for that parameter.
The caller always knows the link, since if p is passed by procedure r as an actual parameter, then p must be a name accessible to r, and therefore, r can determine the access link for p exactly as if p were being called by r directly. That is, we use the rules for constructing access links given in Section 7.3.6.

Example 7.8: In Fig. 7.12 we see a sketch of an ML function a that has functions b and c nested within it. Function b has a function-valued parameter f, which it calls. Function c defines within it a function d, and c then calls b with actual parameter d.

fun a(x) =
    let
        fun b(f) =
            ... f ... ;
        fun c(y) =
            let
                fun d(z) = ...
            in
                ... b(d) ...
            end
    in
        c(1)
    end;

Figure 7.12: Sketch of ML program that uses function-parameters

Let us trace what happens when a is executed. First, a calls c, so we place an activation record for c above that for a on the stack. The access link for c points to the record for a, since c is defined immediately within a. Then c calls b(d). The calling sequence sets up an activation record for b, as shown in Fig. 7.13(a). Within this activation record is the actual parameter d and its access link, which together form the value of formal parameter f in the activation record for b. Notice that c knows about d, since d is defined within c, and therefore c passes a pointer to its own activation record as the access link. No matter where d was defined, if c is in the scope of that definition, then one of the three rules of Section 7.3.6 must apply, and c can provide the link.

(a) The stack after c calls b(d): records for a, c, and b; the value of b's formal parameter f is the pair (d, access link to the record for c).
(b) The stack after b calls its parameter f (that is, d): a record for d is pushed, and its access link, taken from the value of f, points to the record for c.

Figure 7.13: Actual parameters carry their access link with them

Now, let us look at what b does. We know that at some point, it uses its parameter f, which has the effect of calling d. An activation record for d appears on the stack, as shown in Fig. 7.13(b). The proper access link to place in this activation record is found in the value for parameter f; the link is to the activation record for c, since c immediately surrounds the definition of d. Notice that b is capable of setting up the proper link, even though b is not in the scope of c's definition. □

7.3.8 Displays

One problem with the access-link approach to nonlocal data is that if the nesting depth gets large, we may have to follow long chains of links to reach the data we need. A more efficient implementation uses an auxiliary array d, called the display, which consists of one pointer for each nesting depth. We arrange that, at all times, d[i] is a pointer to the highest activation record on the stack for any procedure at nesting depth i.
Examples of a display are shown in Fig. 7.14. For instance, in Fig. 7.14(d), we see the display d, with d[1] holding a pointer to the activation record for sort, the highest (and only) activation record for a function at nesting depth 1. Also, d[2] holds a pointer to the activation record for exchange, the highest record at depth 2, and d[3] points to partition, the highest record at depth 3.
The advantage of using a display is that if procedure p is executing, and it needs to access element x belonging to some procedure q, we need to look only in d[i], where i is the nesting depth of q; we follow the pointer d[i] to the activation record for q, wherein x is found at a known offset. The compiler knows what i is, so it can generate code to access x using d[i] and the offset of


(a) d[1] -> s, d[2] -> q(1,9); stack: s, q(1,9) (the saved d[2] in q(1,9) is null)
(b) d[2] -> q(1,3); stack: s, q(1,9), q(1,3) (q(1,3) holds saved d[2], pointing to q(1,9))
(c) d[3] -> p(1,3); stack: s, q(1,9), q(1,3), p(1,3) (the saved d[3] in p(1,3) is null)
(d) d[2] -> e(1,3), d[3] -> p(1,3); stack: s, q(1,9), q(1,3), p(1,3), e(1,3) (e(1,3) holds saved d[2], pointing to q(1,3))

Figure 7.14: Maintaining the display


x from the top of the activation record for q. Thus, the code never needs to

follow a long chain of access links.
In order to maintain the display correctly, we need to save previous values of display entries in new activation records. If procedure p at depth np is called, and its activation record is not the first on the stack for a procedure at depth np, then the activation record for p needs to hold the previous value of d[np], while d[np] itself is set to point to this activation of p. When p returns, and its activation record is removed from the stack, we restore d[np] to have its value prior to the call of p.

Example 7.9: Several steps of manipulating the display are illustrated in Fig. 7.14. In Fig. 7.14(a), sort at depth 1 has called quicksort(1, 9) at depth 2. The activation record for quicksort has a place to store the old value of d[2], indicated as saved d[2], although in this case since there was no prior activation record at depth 2, this pointer is null.
In Fig. 7.14(b), quicksort(1, 9) calls quicksort(1, 3). Since the activation records for both calls are at depth 2, we must store the pointer to quicksort(1, 9), which was in d[2], in the record for quicksort(1, 3). Then, d[2] is made to point to quicksort(1, 3).
Next, partition is called. This function is at depth 3, so we use the slot d[3] in the display for the first time, and make it point to the activation record for partition. The record for partition has a slot for a former value of d[3], but in this case there is none, so the pointer remains null. The display and stack at this time are shown in Fig. 7.14(c).
Then, partition calls exchange. That function is at depth 2, so its activation record stores the old pointer d[2], which goes to the activation record for quicksort(1, 3). Notice that the display pointers "cross"; that is, d[3] points further down the stack than d[2] does. However, that is a proper situation; exchange can only access its own data and that of sort, via d[1]. □
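A minimal C sketch of ours (not the book's, and not how real generated code looks) shows the bookkeeping: on entry at depth np, the callee saves the old d[np] in its own record and overwrites d[np]; on exit it restores the saved value.

#include <stdio.h>

#define MAX_DEPTH 8

struct frame {
    int depth;                      /* nesting depth of the procedure   */
    struct frame *saved_display;    /* previous value of display[depth] */
};

static struct frame *display[MAX_DEPTH + 1];   /* d[1..MAX_DEPTH] */

/* Entry code: save the old display entry and install this activation. */
static void enter(struct frame *f, int depth) {
    f->depth = depth;
    f->saved_display = display[depth];
    display[depth] = f;
}

/* Exit code: restore the display entry that was saved on entry. */
static void leave(struct frame *f) {
    display[f->depth] = f->saved_display;
}

/* A quicksort-like recursion at depth 2, nested inside a depth-1 "sort". */
static void quicksort_like(int n) {
    struct frame f;
    enter(&f, 2);
    if (n > 0)
        quicksort_like(n - 1);
    /* After any inner calls return, display[2] again names this activation. */
    printf("n = %d: display[2] == &f is %d\n", n, display[2] == &f);
    leave(&f);
}

int main(void) {
    struct frame sort_f;
    enter(&sort_f, 1);
    quicksort_like(2);
    leave(&sort_f);
    return 0;
}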

7.3.9 Exercises for Section 7.3

Exercise 7.3.1: In Fig. 7.15 is an ML function main that computes Fibonacci

numbers in a nonstandard way. Function fib0 will compute the nth Fibonacci number for any n ≥ 0. Nested within it is fib1, which computes the nth Fibonacci number on the assumption n ≥ 2, and nested within fib1 is fib2, which assumes n ≥ 4. Note that neither fib1 nor fib2 need to check for the basis cases. Show the stack of activation records that result from a call to main, up until the time that the first call (to fib0(1)) is about to return. Show the access link in each of the activation records on the stack.

Exercise 7.3.2: Suppose that we implement the functions of Fig. 7.15 using a display. Show the display at the moment the first call to fib0(1) is about to

return. Also, indicate the saved display entry in each of the activation records on the stack at that time.

fun main () {
    let fun fib0(n) =
            let fun fib1(n) =
                    let fun fib2(n) = fib1(n-1) + fib1(n-2)
                    in  if n >= 4 then fib2(n)
                        else fib0(n-1) + fib0(n-2)
                    end
            in  if n >= 2 then fib1(n)
                else 1
            end
    in  fib0(4)
    end;
}

Figure 7.15: Nested functions computing Fibonacci numbers

7.4 Heap Management

The heap is the portion of the store that is used for data that lives indefinitely, or until the program explicitly deletes it. While local variables typically become inaccessible when their procedures end, many languages enable us to create objects or other data whose existence is not tied to the procedure activation that creates them. For example, both C++ and Java give the programmer new to create objects that may be passed - or pointers to them may be passed - from procedure to procedure, so they continue to exist long after the procedure that created them is gone. Such objects are stored on a heap.
In this section, we discuss the memory manager, the subsystem that allocates and deallocates space within the heap; it serves as an interface between application programs and the operating system. For languages like C or C++ that deallocate chunks of storage manually (i.e., by explicit statements of the program, such as free or delete), the memory manager is also responsible for implementing deallocation.
In Section 7.5, we discuss garbage collection, which is the process of finding spaces within the heap that are no longer used by the program and can therefore be reallocated to house other data items. For languages like Java, it is the garbage collector that deallocates memory. When it is required, the garbage collector is an important subsystem of the memory manager.

7.4.1 The Memory Manager

The memory manager keeps track of all the free space in heap storage at all times. It performs two basic functions:3

• Allocation. When a program requests memory for a variable or object, the memory manager produces a chunk of contiguous heap memory of the requested size. If possible, it satisfies an allocation request using free space in the heap; if no chunk of the needed size is available, it seeks to increase the heap storage space by getting consecutive bytes of virtual memory from the operating system. If space is exhausted, the memory manager passes that information back to the application program.

• Deallocation. The memory manager returns deallocated space to the pool of free space, so it can reuse the space to satisfy other allocation requests. Memory managers typically do not return memory to the operating system, even if the program's heap usage drops.
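A deliberately tiny allocator makes the two functions concrete. The sketch below is entirely our own illustration, under strong simplifying assumptions: it is first-fit over one fixed array, it never splits or coalesces chunks, it never asks the operating system for more memory, and it glosses over alignment subtleties that a real memory manager must handle.

#include <stddef.h>
#include <stdio.h>

#define HEAP_BYTES 4096

struct chunk {
    size_t size;            /* payload size in bytes       */
    int    in_use;          /* 0 if available, 1 if in use */
    struct chunk *next;     /* next chunk in address order */
};

static _Alignas(max_align_t) unsigned char heap[HEAP_BYTES];
static struct chunk *first = NULL;

static void *allocate(size_t n) {
    if (first == NULL) {                    /* lazily create one big free chunk */
        first = (struct chunk *)heap;
        first->size = HEAP_BYTES - sizeof(struct chunk);
        first->in_use = 0;
        first->next = NULL;
    }
    for (struct chunk *c = first; c != NULL; c = c->next)
        if (!c->in_use && c->size >= n) {   /* first fit */
            c->in_use = 1;
            return (unsigned char *)c + sizeof(struct chunk);
        }
    return NULL;                            /* space exhausted: report failure */
}

static void deallocate(void *p) {
    struct chunk *c = (struct chunk *)((unsigned char *)p - sizeof(struct chunk));
    c->in_use = 0;                          /* back to the pool of free space  */
}

int main(void) {
    int *a = allocate(100 * sizeof(int));
    printf("%s\n", a != NULL ? "allocated" : "failed");
    if (a != NULL)
        deallocate(a);                      /* space can now serve new requests */
    return 0;
}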

Memory management would be simpler if (a) all allocation requests were for chunks of the same size, and (b) storage were released predictably, say, first-allocated first-deallocated. There are some languages, such as Lisp, for which condition (a) holds; pure Lisp uses only one data element - a two-pointer cell - from which all data structures are built. Condition (b) also holds in some situations, the most common being data that can be allocated on the run-time stack. However, in most languages, neither (a) nor (b) holds in general. Rather, data elements of different sizes are allocated, and there is no good way to predict the lifetimes of all allocated objects.
Thus, the memory manager must be prepared to service, in any order, allocation and deallocation requests of any size, ranging from one byte to as large as the program's entire address space. Here are the properties we desire of memory managers:

Space Efficiency. A memory manager should minimize the total heap space needed by a program. Doing so allows larger programs to run in a fixed virtual address space. Space efficiency is achieved by minimizing "fragmentation," discussed in Section 7.4.4.



• Program Efficiency. A memory manager should make good use of the memory subsystem to allow programs to run faster. As we shall see in Section 7.4.2, the time taken to execute an instruction can vary widely depending on where objects are placed in memory. Fortunately, programs tend to exhibit "locality," a phenomenon discussed in Section 7.4.3, which refers to the nonrandom, clustered way in which typical programs access memory. By paying attention to the placement of objects in memory, the memory manager can make better use of space and, hopefully, make the program run faster.

3 In what follows, we shall refer to things requiring memory space as "objects," even if they are not true objects in the "object-oriented programming" sense.


• Low Overhead. Because memory allocations and deallocations are frequent operations in many programs, it is important that these operations be as efficient as possible. That is, we wish to minimize the overhead - the fraction of execution time spent performing allocation and deallocation. Notice that the cost of allocations is dominated by small requests; the overhead of managing large objects is less important, because it usually can be amortized over a larger amount of computation.

7.4.2 The Memory Hierarchy of a Computer

Memory management and compiler optimization must be done with an awareness of how memory behaves. Modern machines are designed so that programmers can write correct programs without concerning themselves with the details of the memory subsystem. However, the efficiency of a program is determined not just by the number of instructions executed, but also by how long it takes to execute each of these instructions. The time taken to execute an instruction can vary significantly, since the time taken to access different parts of memory can vary from nanoseconds to milliseconds. Data-intensive programs can therefore benefit significantly from optimizations that make good use of the memory subsystem. As we shall see in Section 7.4.3, they can take advantage of the phenomenon of "locality" - the nonrandom behavior of typical programs.

The large variance in memory access times is due to a fundamental limitation in hardware technology: we can build small and fast storage, or large and slow storage, but not storage that is both large and fast. It is simply impossible today to build gigabytes of storage with nanosecond access times, which is how fast high-performance processors run. Therefore, practically all modern computers arrange their storage as a memory hierarchy. A memory hierarchy, as shown in Fig. 7.16, consists of a series of storage elements, with the smaller, faster ones "closer" to the processor, and the larger, slower ones further away.

Typically, a processor has a small number of registers, whose contents are under software control. Next, it has one or more levels of cache, usually made out of static RAM, that are kilobytes to several megabytes in size. The next level of the hierarchy is the physical (main) memory, made out of hundreds of megabytes or gigabytes of dynamic RAM. The physical memory is then backed up by virtual memory, which is implemented by gigabytes of disks. Upon a memory access, the machine first looks for the data in the closest (lowest-level) storage and, if the data is not there, looks in the next higher level, and so on.

Registers are scarce, so register usage is tailored for the specific applications and managed by the code that a compiler generates. All the other levels of the hierarchy are managed automatically; in this way, not only is the programming task simplified, but the same program can work effectively across machines with different memory configurations. With each memory access, the machine searches each level of the memory in succession, starting with the lowest level, until it locates the data. Caches are managed exclusively in hardware, in order to keep up with the relatively fast RAM access times.

                           Typical Access Times    Typical Sizes

    Virtual Memory (Disk)  3 - 15 ms               > 2GB
    Physical Memory        100 - 150 ns            256MB - 2GB
    2nd-Level Cache        40 - 60 ns               128KB - 4MB
    1st-Level Cache        5 - 10 ns                16 - 64KB
    Registers (Processor)  1 ns                     32 Words

Figure 7.16: Typical Memory Hierarchy Configurations

Because disks are relatively slow, the virtual memory is managed by the operating system, with the assistance of a hardware structure known as the "translation lookaside buffer."

Data is transferred as blocks of contiguous storage. To amortize the cost of access, larger blocks are used with the slower levels of the hierarchy. Between main memory and cache, data is transferred in blocks known as cache lines, which are typically from 32 to 256 bytes long. Between virtual memory (disk) and main memory, data is transferred in blocks known as pages, typically between 4K and 64K bytes in size.

7.4.3 Locality in Programs

Most programs exhibit a high degree of locality; that is, they spend most of their time executing a relatively small fraction of the code and touching only a small fraction of the data. We say that a program has temporal locality if the memory locations it accesses are likely to be accessed again within a short period of time. We say that a program has spatial locality if memory locations close to the location accessed are likely also to be accessed within a short period of time.

The conventional wisdom is that programs spend 90% of their time executing 10% of the code. Here is why:

• Programs often contain many instructions that are never executed. Programs built with components and libraries use only a small fraction of the provided functionality. Also, as requirements change and programs evolve, legacy systems often contain many instructions that are no longer used.


Static and Dynamic RAM

Most random-access memory is dynamic, which means that it is built of very simple electronic circuits that lose their charge (and thus "forget" the bit they were storing) in a short time. These circuits need to be refreshed - that is, their bits read and rewritten - periodically. On the other hand, static RAM is designed with a more complex circuit for each bit, and consequently the bit stored can stay indefinitely, until it is changed. Evidently, a chip can store more bits if it uses dynamic-RAM circuits than if it uses static-RAM circuits, so we tend to see large main memories of the dynamic variety, while smaller memories, like caches, are made from static circuits.



• Only a small fraction of the code that could be invoked is actually executed in a typical run of the program. For example, instructions to handle illegal inputs and exceptional cases, though critical to the correctness of the program, are seldom invoked on any particular run.



• The typical program spends most of its time executing innermost loops and tight recursive cycles in a program.

Locality allows us to take advantage of the memory hierarchy of a modern computer, as shown in Fig. 7.16. By placing the most common instructions and data in the fast-but-small storage, while leaving the rest in the slow-but-large storage, we can lower the average memory-access time of a program significantly.

It has been found that many programs exhibit both temporal and spatial locality in how they access both instructions and data. Data-access patterns, however, generally show a greater variance than instruction-access patterns. Policies such as keeping the most recently used data in the fastest hierarchy work well for common programs but may not work well for some data-intensive programs - ones that cycle through very large arrays, for example.

We often cannot tell, just from looking at the code, which sections of the code will be heavily used, especially for a particular input. Even if we know which instructions are executed heavily, the fastest cache often is not large enough to hold all of them at the same time. We must therefore adjust the contents of the fastest storage dynamically and use it to hold instructions that are likely to be used heavily in the near future.

Optimization Using the Memory Hierarchy

The policy of keeping the most recently used instructions in the cache tends to work well; in other words, the past is generally a good predictor of future memory usage. When a new instruction is executed, there is a high probability that the next instruction also will be executed. This phenomenon is an example of spatial locality.


Cache Architectures

How do we know if a cache line is in a cache? It would be too expensive to check every single line in the cache, so it is common practice to restrict the placement of a cache line within the cache. This restriction is known as set associativity. A cache is k-way set associative if a cache line can reside only in k locations. The simplest cache is a 1-way associative cache, also known as a direct-mapped cache. In a direct-mapped cache, data with memory address n can be placed only in cache address n mod s, where s is the size of the cache. Similarly, a k-way set associative cache is divided into k sets, where a datum with address n can be mapped only to the location n mod (s/k) in each set. Most instruction and data caches have associativity between 1 and 8. When a cache line is brought into the cache, and all the possible locations that can hold the line are occupied, it is typical to evict the line that has been the least recently used.
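A small numeric sketch of the placement rule in the box above may help; the function and variable names are illustrative, and the cache parameters are made-up values.

#include <stdio.h>

/* With a cache of s lines organized k-way set-associatively, a line with
 * block address n may reside only at index n mod (s/k) within each of the
 * k sets, i.e., there are k candidate locations. */
unsigned placement_index(unsigned n, unsigned s, unsigned k) {
    return n % (s / k);        /* the one admissible index per set */
}

int main(void) {
    unsigned s = 512;          /* total lines in the cache              */
    unsigned n = 70000;        /* block address of the line being placed */
    printf("direct-mapped (k=1): index %u, 1 candidate location\n",
           placement_index(n, s, 1));
    printf("4-way (k=4):         index %u, 4 candidate locations\n",
           placement_index(n, s, 4));
    return 0;
}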

One effective technique to improve the spatial locality of instructions is to have the compiler place basic blocks (sequences of instructions that are always executed sequentially) that are likely to follow each other contiguously - on the same page, or even the same cache line, if possible. Instructions belonging to the same loop or same function also have a high probability of being executed together.4

4 As a machine fetches a word in memory, it is relatively inexpensive to prefetch the next several contiguous words of memory as well. Thus, a common memory-hierarchy feature is that a multiword block is fetched from a level of memory each time that level is accessed.

We can also improve the temporal and spatial locality of data accesses in a program by changing the data layout or the order of the computation. For example, programs that visit large amounts of data repeatedly, each time performing a small amount of computation, do not perform well. It is better if we can bring some data from a slow level of the memory hierarchy to a faster level (e.g., disk to main memory) once, and perform all the necessary computations on this data while it resides at the faster level. This concept can be applied recursively to reuse data in physical memory, in the caches, and in the registers.
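As a hedged illustration of this idea, the C sketch below processes a large array in cache-sized blocks so that both operations touch an element while it is still resident in a fast level, instead of streaming the entire array twice; the array and block sizes are arbitrary placeholders.

#include <stddef.h>

#define N     (1 << 24)      /* a data set far bigger than any cache        */
#define BLOCK (1 << 13)      /* a piece small enough to stay cache-resident */

void two_passes(double *a) {             /* poor temporal locality */
    for (size_t i = 0; i < N; i++) a[i] = a[i] * 3.0;
    for (size_t i = 0; i < N; i++) a[i] = a[i] + 1.0;
}

void blocked(double *a) {                /* same result, better data reuse */
    for (size_t base = 0; base < N; base += BLOCK) {
        size_t end = base + BLOCK;       /* N is a multiple of BLOCK here  */
        for (size_t i = base; i < end; i++) a[i] = a[i] * 3.0;
        for (size_t i = base; i < end; i++) a[i] = a[i] + 1.0;
    }
}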

7.4.4 Reducing Fragmentation

At the beginning of program execution, the heap is one contiguous unit of free space. As the program allocates and deallocates memory, this space is broken up into free and used chunks of memory, and the free chunks need not reside in a contiguous area of the heap. We refer to the free chunks of memory as holes. With each allocation request, the memory manager must place the requested chunk of memory into a large-enough hole. Unless a hole of exactly the right size is found, we need to split some hole, creating a yet smaller hole.


With each deallocation request, the freed chunks of memory are added back to the pool of free space. We coalesce contiguous holes into larger holes, as the holes can only get smaller otherwise. If we are not careful, the memory may end up getting fragmented, consisting of large numbers of small, noncontiguous holes. It is then possible that no hole is large enough to satisfy a future request, even though there may be sufficient aggregate free space.

Best-Fit and Next-Fit Object Placement

We reduce fragmentation by controlling how the memory manager places new objects in the heap. It has been found empirically that a good strategy for minimizing fragmentation for real-life programs is to allocate the requested memory in the smallest available hole that is large enough. This best-fit algorithm tends to spare the large holes to satisfy subsequent, larger requests. An alternative, called first-fit, where an object is placed in the first (lowest-address) hole in which it fits, takes less time to place objects, but has been found inferior to best-fit in overall performance.

To implement best-fit placement more efficiently, we can separate free space into bins, according to their sizes. One practical idea is to have many more bins for the smaller sizes, because there are usually many more small objects. For example, the Lea memory manager, used in the GNU C compiler gcc, aligns all chunks to 8-byte boundaries. There is a bin for every multiple of 8-byte chunks from 16 bytes to 512 bytes. Larger-sized bins are logarithmically spaced (i.e., the minimum size for each bin is twice that of the previous bin), and within each of these bins the chunks are ordered by their size. There is always a chunk of free space that can be extended by requesting more pages from the operating system. Called the wilderness chunk, this chunk is treated by Lea as the largest-sized bin because of its extensibility. Binning makes it easy to find the best-fit chunk; a sketch of such a binned search appears after the following list.

• If, as for small sizes requested from the Lea memory manager, there is a bin for chunks of that size only, we may take any chunk from that bin.



• For sizes that do not have a private bin, we find the one bin that is allowed to include chunks of the desired size. Within that bin, we can use either a first-fit or a best-fit strategy; i.e., we either look for and select the first chunk that is sufficiently large, or we spend more time and find the smallest chunk that is sufficiently large. Note that when the fit is not exact, the remainder of the chunk will generally need to be placed in a bin with smaller sizes.



• However, it may be that the target bin is empty, or all chunks in that bin are too small to satisfy the request for space. In that case, we simply repeat the search, using the bin for the next larger size(s). Eventually, we either find a chunk we can use, or we reach the "wilderness" chunk, from which we can surely obtain the needed space, possibly by going to the operating system and getting additional pages for the heap.
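The following C sketch illustrates such a binned search under simplified assumptions: exact 8-byte bins up to 512 bytes, coarser bins above that, and no splitting of the remainder. The bin layout and names are illustrative and are not the actual Lea allocator.

#include <stddef.h>

#define NBINS 72
typedef struct FreeChunk { size_t size; struct FreeChunk *next; } FreeChunk;
static FreeChunk *bins[NBINS];            /* one free list per bin */

/* Map a request size to a bin: one bin per multiple of 8 bytes up to 512,
 * coarser (roughly logarithmic) bins above that. */
static int bin_index(size_t size) {
    if (size <= 512) return (int)((size + 7) / 8);   /* exact-size bins */
    int b = 65;
    size_t limit = 1024;
    while (size > limit && b < NBINS - 1) { b++; limit *= 2; }
    return b;                                        /* logarithmically spaced */
}

void *bin_allocate(size_t size) {
    for (int b = bin_index(size); b < NBINS; b++) {  /* try larger bins too */
        FreeChunk **p = &bins[b];
        for (; *p != NULL; p = &(*p)->next) {
            if ((*p)->size >= size) {
                FreeChunk *c = *p;
                *p = c->next;             /* remove it from its free list  */
                return (void *)(c + 1);   /* remainder splitting omitted   */
            }
        }
    }
    return NULL;   /* a real manager would now carve space from the wilderness chunk */
}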


While best-fit placement tends to improve space utilization, it may not be the best in terms of spatial locality. Chunks allocated at about the same time by a program tend to have similar reference patterns and to have similar lifetimes. Placing them close together thus improves the program's spatial locality. One useful adaptation of the best-fit algorithm is to modify the placement in the case when a chunk of the exact requested size cannot be found. In this case, we use a next-fit strategy, trying to allocate the object in the chunk that has last been split, whenever enough space for the new object remains in that chunk. Next-fit also tends to improve the speed of the allocation operation.

Managing and Coalescing Free Space

When an object is deallocated manually, the memory manager must make its chunk free, so it can be allocated again. In some circumstances, it may also be possible to combine (coalesce) that chunk with adjacent chunks of the heap, to form a larger chunk. There is an advantage to doing so, since we can always use a large chunk to do the work of small chunks of equal total size, but many small chunks cannot hold one large object, as the combined chunk could.

If we keep a bin for chunks of one fixed size, as Lea does for small sizes, then we may prefer not to coalesce adjacent blocks of that size into a chunk of double the size. It is simpler to keep all the chunks of one size in as many pages as we need, and never coalesce them. Then, a simple allocation/deallocation scheme is to keep a bitmap, with one bit for each chunk in the bin. A 1 indicates the chunk is occupied; 0 indicates it is free. When a chunk is deallocated, we change its 1 to a 0. When we need to allocate a chunk, we find any chunk with a 0 bit, change that bit to a 1, and use the corresponding chunk. If there are no free chunks, we get a new page, divide it into chunks of the appropriate size, and extend the bit vector.

Matters are more complex when the heap is managed as a whole, without binning, or if we are willing to coalesce adjacent chunks and move the resulting chunk to a different bin if necessary. There are two data structures that are useful to support coalescing of adjacent free blocks:

• Boundary Tags. At both the low and high ends of each chunk, whether free or allocated, we keep vital information. At both ends, we keep a free/used bit that tells whether or not the block is currently allocated (used) or available (free). Adjacent to each free/used bit is a count of the total number of bytes in the chunk.



• A Doubly Linked, Embedded Free List. The free chunks (but not the allocated chunks) are also linked in a doubly linked list. The pointers for this list are within the blocks themselves, say adjacent to the boundary tags at either end. Thus, no additional space is needed for the free list, although its existence does place a lower bound on how small chunks can get; they must accommodate two boundary tags and two pointers, even if the object is a single byte. The order of chunks on the free list is left

unspecified. For example, the list could be sorted by size, thus facilitating best-fit placement.

Example 7.10: Figure 7.17 shows part of a heap with three adjacent chunks, A, B, and C. Chunk B, of size 100, has just been deallocated and returned to the free list. Since we know the beginning (left end) of B, we also know the end of the chunk that happens to be immediately to B's left, namely A in this example. The free/used bit at the right end of A is currently 0, so A too is free. We may therefore coalesce A and B into one chunk of 300 bytes.

Figure 7.17: Part of a heap and a doubly linked free list (chunks A, B, and C)

It might be the case that chunk C, the chunk immediately to B's right, is also free, in which case we can combine all of A, B, and C. Note that if we always coalesce chunks when we can, then there can never be two adjacent free chunks, so we never have to look further than the two chunks adjacent to the one being deallocated. In the current case, we find the beginning of C by starting at the left end of B, which we know, and finding the total number of bytes in B, which is found in the left boundary tag of B and is 100 bytes. With this information, we find the right end of B and the beginning of the chunk to its right. At that point, we examine the free/used bit of C and find that it is 1 for used; hence, C is not available for coalescing.

Since we must coalesce A and B, we need to remove one of them from the free list. The doubly linked free-list structure lets us find the chunks before and after each of A and B. Notice that it should not be assumed that physical neighbors A and B are also adjacent on the free list. Knowing the chunks preceding and following A and B on the free list, it is straightforward to manipulate pointers on the list to replace A and B by one coalesced chunk. □

Automatic garbage collection can eliminate fragmentation altogether if it moves all the allocated objects to contiguous storage. The interaction between garbage collection and memory management is discussed in more detail in Section 7.6.4.
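To make the boundary-tag manipulation of Example 7.10 concrete, here is a hedged C sketch of coalescing on deallocation. It assumes each chunk begins and ends with a (size, free/used) tag whose size field counts the whole chunk; heap-boundary checks and the free-list relinking discussed above are omitted, and the names are illustrative.

#include <stddef.h>
#include <stdbool.h>

typedef struct { size_t size; bool used; } Tag;   /* one tag at each end of a chunk */

void free_chunk(Tag *low) {          /* low = tag at the low end of the freed chunk */
    size_t size = low->size;

    Tag *left_high = low - 1;        /* high-end tag of the chunk to the left */
    if (!left_high->used) {          /* left neighbor is free: absorb it       */
        size += left_high->size;
        low = (Tag *)((char *)low - left_high->size);
    }

    Tag *right_low = (Tag *)((char *)low + size);  /* low-end tag of the right neighbor */
    if (!right_low->used)            /* right neighbor is free: absorb it too   */
        size += right_low->size;

    low->size = size;  low->used = false;          /* rewrite both boundary tags */
    Tag *high = (Tag *)((char *)low + size) - 1;
    high->size = size; high->used = false;
}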

7.4.5 Manual Deallocation Requests

We close this section with manual memory management, where the programmer must explicitly arrange for the deallocation of data, as in C and C++. Ideally, any storage that will no longer be accessed should be deleted. Conversely, any storage that may be referenced must not be deleted. Unfortunately, it is hard to enforce either of these properties. In addition to considering the difficulties with


manual deallocation, we shall describe some of the techniques programmers use to help with the difficulties.

Problems with Manual Deallocation

Manual memory management is error-prone. The common mistakes take two forms: failing ever to delete data that cannot be referenced is called a memory-leak error, and referencing deleted data is a dangling-pointer-dereference error.

It is hard for programmers to tell if a program will never refer to some storage in the future, so the first common mistake is not deleting storage that will never be referenced. Note that although memory leaks may slow down the execution of a program due to increased memory usage, they do not affect program correctness, as long as the machine does not run out of memory. Many programs can tolerate memory leaks, especially if the leakage is slow. However, for long-running programs, and especially nonstop programs like operating systems or server code, it is critical that they not have leaks.

Automatic garbage collection gets rid of memory leaks by deallocating all the garbage. Even with automatic garbage collection, a program may still use more memory than necessary. A programmer may know that an object will never be referenced, even though references to that object exist somewhere. In that case, the programmer must deliberately remove references to objects that will never be referenced, so the objects can be deallocated automatically.

Being overly zealous about deleting objects can lead to even worse problems than memory leaks. The second common mistake is to delete some storage and then try to refer to the data in the deallocated storage. Pointers to storage that has been deallocated are known as dangling pointers. Once the freed storage has been reallocated to a new variable, any read, write, or deallocation via the dangling pointer can produce seemingly random effects. We refer to any operation, such as read, write, or deallocate, that follows a pointer and tries to use the object it points to, as dereferencing the pointer.

Notice that reading through a dangling pointer may return an arbitrary value. Writing through a dangling pointer arbitrarily changes the value of the new variable. Deallocating a dangling pointer's storage means that the storage of the new variable may be allocated to yet another variable, and actions on the old and new variables may conflict with each other. Unlike memory leaks, dereferencing a dangling pointer after the freed storage is reallocated almost always creates a program error that is hard to debug. As a result, programmers are more inclined not to deallocate a variable if they are not certain it is unreferencable.

A related form of programming error is to access an illegal address. Common examples of such errors include dereferencing null pointers and accessing an out-of-bounds array element. It is better for such errors to be detected than to have the program silently corrupt the results. In fact, many security violations exploit programming errors of this type, where certain program inputs allow unintended access to data, leading to a "hacker" taking control of the program and machine.
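Both kinds of mistakes are easy to write down in C; the fragment below, with illustrative names, leaks one chunk and then writes through a dangling pointer.

#include <stdlib.h>
#include <string.h>

void leak_and_dangle(void) {
    char *report = malloc(100);    /* never freed: a memory leak              */
    strcpy(report, "lost");

    char *buf = malloc(100);
    free(buf);                     /* buf is now a dangling pointer           */
    char *other = malloc(100);     /* may reuse the freed storage             */
    buf[0] = 'x';                  /* write through a dangling pointer: may
                                      silently corrupt whatever *other holds  */
    (void)other;
}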


An Example: Purify

Rational's Purify is one of the most popular commercial tools that helps programmers find memory access errors and memory leaks in programs. Purify instruments binary code by adding additional instructions to check for errors as the program executes. It keeps a map of memory to indicate where all the freed and used spaces are. Each allocated object is bracketed with extra space; accesses to unallocated locations or to spaces between objects are flagged as errors. This approach finds some dangling pointer references, but not when the memory has been reallocated and a valid object is sitting in its place. This approach also finds some out-of-bounds array accesses, if they happen to land in the space inserted at the end of the objects.

Purify also finds memory leaks at the end of a program execution. It searches the contents of all the allocated objects for possible pointer values. Any object without a pointer to it is a leaked chunk of memory. Purify reports the amount of memory leaked and the locations of the leaked objects. We may compare Purify to a "conservative garbage collector," which will be discussed in Section 7.8.3.

One antidote is to have the compiler insert checks with every access, to make sure it is within bounds. The compiler's optimizer can discover and remove those checks that are not really necessary because the optimizer can deduce that the access must be within bounds.

Programming Conventions and Tools

We now present a few of the most popular conventions and tools that have been developed to help programmers cope with the complexity in managing memory:

• Object ownership is useful when an object's lifetime can be statically reasoned about. The idea is to associate an owner with each object at all times. The owner is a pointer to that object, presumably belonging to some function invocation. The owner (i.e., its function) is responsible for either deleting the object or for passing the object to another owner. It is possible to have other, nonowning pointers to the same object; these pointers can be overwritten any time, and no deletes should ever be applied through them. This convention eliminates memory leaks, as well as attempts to delete the same object twice. However, it does not help solve the dangling-pointer-reference problem, because it is possible to follow a nonowning pointer to an object that has been deleted.



• Reference counting is useful when an object's lifetime needs to be determined dynamically. The idea is to associate a count with each dynamically


allocated object. Whenever a reference to the object is created, we increment the reference count; whenever a reference is removed, we decrement the reference count. When the count goes to zero, the object can no longer be referenced and can therefore be deleted. This technique, however, does not catch useless, circular data structures, where a collection of objects cannot be accessed, but their reference counts are not zero, since they refer to each other. For an illustration of this problem, see Example 7.11. Reference counting does eradicate all dangling-pointer references, since there are no outstanding references to any deleted objects. Reference counting is expensive because it imposes an overhead on every operation that stores a pointer.

• Region-based allocation is useful for collections of objects whose lifetimes are tied to specific phases in a computation. When objects are created to be used only within some step of a computation, we can allocate all such objects in the same region. We then delete the entire region once that computation step completes. This region-based allocation technique has limited applicability. However, it is very efficient whenever it can be used; instead of deallocating objects one at a time, it deletes all objects in the region in a wholesale fashion.

7.4.6 Exercises for Section 7.4

Exercise 7.4.1: Suppose the heap consists of seven chunks, starting at address 0. The sizes of the chunks, in order, are 80, 30, 60, 50, 70, 20, 40 bytes. When we place an object in a chunk, we put it at the high end if there is enough space remaining to form a smaller chunk (so that the smaller chunk can easily remain on the linked list of free space). However, we cannot tolerate chunks of fewer than 8 bytes, so if an object is almost as large as the selected chunk, we give it the entire chunk and place the object at the low end of the chunk. If we request space for objects of the following sizes: 32, 64, 48, 16, in that order, what does the free space list look like after satisfying the requests, if the method of selecting chunks is

a) First fit.

b) Best fit.

7.5 Introduction to Garbage Collection

Data that cannot be referenced is generally known as garbage. Many high-level programming languages remove the burden of manual memory management from the programmer by offering automatic garbage collection, which deallocates unreachable data. Garbage collection dates back to the initial implementation of Lisp in 1958. Other significant languages that offer garbage collection include Java, Perl, ML, Modula-3, Prolog, and Smalltalk.


In this section, we introduce many of the concepts of garbage collection. The notion of an object being "reachable" is perhaps intuitive, but we need to be precise; the exact rules are discussed in Section 7.5.2. We also discuss, in Section 7.5.3, a simple, but imperfect, method of automatic garbage collection: reference counting, which is based on the idea that once a program has lost all references to an object, it simply cannot and so will not reference the storage. Section 7.6 covers trace-based collectors, which are algorithms that discover all the objects that are still useful, and then turn all the other chunks of the heap into free space.

7.5.1 Design Goals for Garbage Collectors

Garbage collection is the reclamation of chunks of storage holding objects that can no longer be accessed by a program. We need to assume that objects have a type that can be determined by the garbage collector at run time. From the type information, we can tell how large the object is and which components of the object contain references (pointers) to other objects. We also assume that references to objects are always to the address of the beginning of the object, never pointers to places within the object. Thus, all references to an object have the same value and can be identified easily.

A user program, which we shall refer to as the mutator, modifies the collection of objects in the heap. The mutator creates objects by acquiring space from the memory manager, and the mutator may introduce and drop references to existing objects. Objects become garbage when the mutator program cannot "reach" them, in the sense made precise in Section 7.5.2. The garbage collector finds these unreachable objects and reclaims their space by handing them to the memory manager, which keeps track of the free space.

A Basic Requirement: Type Safety

Not all languages are good candidates for automatic garbage collection. For a garbage collector to work, it must be able to tell whether any given data element or component of a data element is, or could be used as, a pointer to a chunk of allocated memory space. A language in which the type of any data component can be determined is said to be type safe. There are type-safe languages like ML, for which we can determine types at compile time. There are other type-safe languages, like Java, whose types cannot be determined at compile time, but can be determined at run time. The latter are called dynamically typed languages. If a language is neither statically nor dynamically type safe, then it is said to be unsafe.

Unsafe languages, which unfortunately include some of the most important languages such as C and C++, are bad candidates for automatic garbage collection. In unsafe languages, memory addresses can be manipulated arbitrarily: arbitrary arithmetic operations can be applied to pointers to create new pointers, and arbitrary integers can be cast as pointers. Thus a program


theoretically could refer to any location in memory at any time. Consequently, no memory location can be considered to be inaccessible, and no storage can ever be reclaimed safely. In practice, most C and C++ programs do not generate pointers arbitrarily, and a theoretically unsound garbage collector that works well empirically has been developed and used. We shall discuss conservative garbage collection for C and C++ in Section 7.8.3.

Performance Metrics

Garbage collection is often so expensive that, although it was invented decades ago and absolutely prevents memory leaks, it has yet to be adopted by many mainstream programming languages. Many different approaches have been proposed over the years, and there is not one clearly best garbage-collection algorithm. Before exploring the options, let us first enumerate the performance metrics that must be considered when designing a garbage collector.

• Overall Execution Time. Garbage collection can be very slow. It is important that it not significantly increase the total run time of an application. Since the garbage collector necessarily must touch a lot of data, its performance is determined greatly by how it leverages the memory subsystem.



• Space Usage. It is important that garbage collection avoid fragmentation and make the best use of the available memory.



• Pause Time. Simple garbage collectors are notorious for causing programs - the mutators - to pause suddenly for an extremely long time, as garbage collection kicks in without warning. Thus, besides minimizing the overall execution time, it is desirable that the maximum pause time be minimized. As an important special case, real-time applications require certain computations to be completed within a time limit. We must either suppress garbage collection while performing real-time tasks, or restrict maximum pause time. Thus, garbage collection is seldom used in real-time applications.



• Program Locality. We cannot evaluate the speed of a garbage collector solely by its running time. The garbage collector controls the placement of data and thus influences the data locality of the mutator program. It can improve a mutator's temporal locality by freeing up space and reusing it; it can improve the mutator's spatial locality by relocating data used together in the same cache or pages.

Some of these design goals conflict with one another, and tradeoffs must be made carefully by considering how programs typically behave. Also objects of different characteristics may favor different treatments, requiring a collector to use different techniques for different kinds of objects.


For example, the number of objects allocated is dominated by small objects, so allocation of small objects must not incur a large overhead. On the other hand, consider garbage collectors that relocate reachable objects. Relocation is expensive when dealing with large objects, but less so with small objects.

As another example, in general, the longer we wait to collect garbage in a trace-based collector, the larger the fraction of objects that can be collected. The reason is that objects often "die young," so if we wait a while, many of the newly allocated objects will become unreachable. Such a collector thus costs less on the average, per unreachable object collected. On the other hand, infrequent collection increases a program's memory usage, decreases its data locality, and increases the length of the pauses.

In contrast, a reference-counting collector, by introducing a constant overhead to many of the mutator's operations, can slow down the overall execution of a program significantly. On the other hand, reference counting does not create long pauses, and it is memory efficient, because it finds garbage as soon as it is produced (with the exception of certain cyclic structures discussed in Section 7.5.3).

Language design can also affect the characteristics of memory usage. Some languages encourage a programming style that generates a lot of garbage. For example, programs in functional or almost functional programming languages create more objects to avoid mutating existing objects. In Java, all objects, other than base types like integers and references, are allocated on the heap and not the stack, even if their lifetimes are confined to that of one function invocation. This design frees the programmer from worrying about the lifetimes of variables, at the expense of generating more garbage. Compiler optimizations have been developed to analyze the lifetimes of variables and allocate them on the stack whenever possible.

7.5.2 Reachability

We refer to all the data that can be accessed directly by a program, without having to dereference any pointer, as the root set. For example, in Java the root set of a program consists of all the static field members and all the variables on its stack. A program obviously can reach any member of its root set at any time. Recursively, any object with a reference that is stored in the field members or array elements of any reachable object is itself reachable.

Reachability becomes a bit more complex when the program has been optimized by the compiler. First, a compiler may keep reference variables in registers. These references must also be considered part of the root set. Second, even though in a type-safe language programmers do not get to manipulate memory addresses directly, a compiler often does so for the sake of speeding up the code. Thus, registers in compiled code may point to the middle of an object or an array, or they may contain a value to which an offset will be applied to compute a legal address. Here are some things an optimizing compiler can do to enable the garbage collector to find the correct root set:




• The compiler can restrict the invocation of garbage collection to only certain code points in the program, when no "hidden" references exist.



• The compiler can write out information that the garbage collector can use to recover all the references, such as specifying which registers contain references, or how to compute the base address of an object that is given an internal address.



• The compiler can assure that there is a reference to the base address of all reachable objects whenever the garbage collector may be invoked.

The set of reachable objects changes as a program executes. It grows as new objects get created and shrinks as objects become unreachable. It is important to remember that once an object becomes unreachable, it cannot become reachable again. There are four basic operations that a mutator performs to change the set of reachable objects:

• Object Allocations. These are performed by the memory manager, which returns a reference to each newly allocated chunk of memory. This operation adds members to the set of reachable objects.



• Parameter Passing and Return Values. References to objects are passed from the actual input parameter to the corresponding formal parameter, and from the returned result back to the caller. Objects pointed to by these references remain reachable.



• Reference Assignments. Assignments of the form u = v, where u and v are references, have two effects. First, u is now a reference to the object referred to by v. As long as u is reachable, the object it refers to is surely reachable. Second, the original reference in u is lost. If this reference is the last to some reachable object, then that object becomes unreachable. Any time an object becomes unreachable, all objects that are reachable only through references contained in that object also become unreachable.



• Procedure Returns. As a procedure exits, the frame holding its local variables is popped off the stack. If the frame holds the only reachable reference to any object, that object becomes unreachable. Again, if the now unreachable objects hold the only references to other objects, they too become unreachable, and so on.

In summary, new objects are introduced through object allocations. Parameter passing and assignments can propagate reachability; assignments and ends of procedures can terminate reachability. As an object becomes unreachable, it can cause more objects to become unreachable.

There are two basic ways to find unreachable objects. Either we catch the transitions as reachable objects turn unreachable, or we periodically locate all the reachable objects and then infer that all the other objects are unreachable. Reference counting, introduced in Section 7.4.5, is a well-known approximation to the first approach.


Survival of Stack Objects When a procedure is called, a local variable v, whose object is allocated on the stack, may have pointers to v placed in nonlocal variables. These pointers will continue to exist after the procedure returns, yet the space for v disappears, resulting in a dangling-reference situation. Should we ever allocate a local like v on the stack, as C does for example? The answer is that the semantics of many languages requires that local variables cease to exist when their procedure returns. Retaining a reference to such a variable is a programming error, and the compiler is not required to fix the bug in the program.
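The situation described in the box can be shown in a few lines of C; the names are illustrative, and the final dereference is exactly the programming error in question.

int *escaped;                 /* nonlocal (global) variable              */

void callee(void) {
    int v = 42;               /* v lives in callee's stack frame         */
    escaped = &v;             /* a pointer to v escapes                  */
}                             /* frame popped: escaped is now dangling   */

int use(void) {
    callee();
    return *escaped;          /* programming error: undefined behavior   */
}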

We maintain a count of the references to an object, as the mutator performs actions that may change the reachability set. When the count goes to zero, the object becomes unreachable. We discuss this approach in more detail in Section 7.5.3.

The second approach computes reachability by tracing all the references transitively. A trace-based garbage collector starts by labeling ("marking") all objects in the root set as "reachable," examines iteratively all the references in reachable objects to find more reachable objects, and labels them as such. This approach must trace all the references before it can determine any object to be unreachable. But once the reachable set is computed, it can find many unreachable objects all at once and locate a good deal of free storage at the same time. Because all the references must be analyzed at the same time, we have an option to relocate the reachable objects and thereby reduce fragmentation. There are many different trace-based algorithms, and we discuss the options in Sections 7.6 and 7.7.1.

7.5.3 Reference Counting Garbage Collectors

We now consider a simple, although imperfect, garbage collector, based on reference counting, which identifies garbage as an object changes from being reachable to unreachable; the object can be deleted when its count drops to zero. With a reference-counting garbage collector, every object must have a field for the reference count. Reference counts can be maintained as follows:

1. Object Allocation. The reference count of the new object is set to 1.

2. Parameter Passing. The reference count of each object passed into a procedure is incremented.

3. Reference Assignments. For statement u = v, where u and v are references, the reference count of the object referred to by v goes up by one, and the count for the old object referred to by u goes down by one.


4. Procedure Returns. As a procedure exits, all the references held by the local variables of that procedure activation record must also be decremented. If several local variables hold references to the same object, that object's count must be decremented once for each such reference.

5. Transitive Loss of Reachability. Whenever the reference count of an object becomes zero, we must also decrement the count of each object pointed to by a reference within the object.

Reference counting has two main disadvantages: it cannot collect unreachable, cyclic data structures, and it is expensive. Cyclic data structures are quite plausible; data structures often point back to their parent nodes, or point to each other as cross references.

Example 7.11: Figure 7.18 shows three objects with references among them, but no references from anywhere else. If none of these objects is part of the root set, then they are all garbage, but their reference counts are each greater than 0. Such a situation is tantamount to a memory leak if we use reference counting for garbage collection, since then this garbage and any structures like it are never deallocated. □


Figure 7.18: An unreachable, cyclic data structure (no pointers from outside)

The overhead of reference counting is high because additional operations are introduced with each reference assignment, and at procedure entries and exits. This overhead is proportional to the amount of computation in the program, and not just to the number of objects in the system. Of particular concern are the updates made to references in the root set of a program. The concept of deferred reference counting has been proposed as a means to eliminate the overhead associated with updating the reference counts due to local stack accesses. That is, reference counts do not include references from the root set of the program. An object is not considered to be garbage until the entire root set is scanned and no references to the object are found.

The advantage of reference counting, on the other hand, is that garbage collection is performed in an incremental fashion. Even though the total overhead can be large, the operations are spread throughout the mutator's computation.


Although removing one reference may render a large number of objects unreachable, the operation of recursively modifying reference counts can easily be deferred and performed piecemeal across time. Thus, reference counting is a particularly attractive algorithm when timing deadlines must be met, as well as for interactive applications where long, sudden pauses are unacceptable. Another advantage is that garbage is collected immediately, keeping space usage low.
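A minimal C sketch of reference-count maintenance, following the numbered rules above, might look as follows. The Object layout and helper names are assumptions made for illustration; rule 5 is handled by a simple recursive decrement rather than the deferred, piecemeal processing just described.

#include <stdlib.h>

typedef struct Object {
    int rc;                      /* the reference-count field              */
    struct Object *fields[4];    /* references held inside the object      */
} Object;

static void inc_ref(Object *o) { if (o) o->rc++; }

static void dec_ref(Object *o) {
    if (o == NULL || --o->rc > 0) return;
    for (int i = 0; i < 4; i++)  /* rule 5: the object died, release what it holds */
        dec_ref(o->fields[i]);
    free(o);                     /* hand the chunk back to the memory manager      */
}

/* Rule 3: the assignment u = v, where u and v are references. */
static void assign(Object **u, Object *v) {
    inc_ref(v);      /* v's referent gains a reference                    */
    dec_ref(*u);     /* the object u used to refer to loses one           */
    *u = v;
}

Object *new_object(void) {       /* rule 1: a new object starts with count 1 */
    Object *o = calloc(1, sizeof(Object));
    o->rc = 1;
    return o;
}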


Figure 7.19: A network of objects

7.5.4 Exercises for Section 7.5

Exercise 7.5.1: What happens to the reference counts of the objects in Fig. 7.19 if:

a) The pointer from A to B is deleted.

b) The pointer from X to A is deleted.

c) The node C is deleted.

Exercise 7.5.2: What happens to reference counts when the pointer from A to D in Fig. 7.20 is deleted?

7.6 Introduction to Trace-Based Collection

Instead of collecting garbage as it is created, trace-based collectors run periodically to find unreachable objects and reclaim their space. Typically, we run the trace-based collector whenever the free space is exhausted or its amount drops below some threshold.


Figure 7.20: Another network of objects

We begin this section by introducing the simplest "mark-and-sweep" garbage collection algorithm. We then describe the variety of trace-based algorithms in terms of four states that chunks of memory can be put in. This section also contains a number of improvements on the basic algorithm, including those in which object relocation is a part of the garbage-collection function.

7.6.1 A Basic Mark-and-Sweep Collector

Mark-and-sweep garbage-collection algorithms are straightforward, stop-the-world algorithms that find all the unreachable objects, and put them on the list of free space. Algorithm 7.12 visits and "marks" all the reachable objects in the first tracing step and then "sweeps" the entire heap to free up unreachable objects. Algorithm 7.14, which we consider after introducing a general framework for trace-based algorithms, is an optimization of Algorithm 7.12. By using an additional list to hold all the allocated objects, it visits the reachable objects only once.

Algorithm 7.12: Mark-and-sweep garbage collection.

INPUT: A root set of objects, a heap, and a free list, called Free, with all the unallocated chunks of the heap. As in Section 7.4.4, all chunks of space are marked with boundary tags to indicate their free/used status and size.

OUTPUT: A modified Free list after all the garbage has been removed.

METHOD: The algorithm, shown in Fig. 7.21, uses several simple data structures. List Free holds objects known to be free. A list called Unscanned holds objects that we have determined are reached, but whose successors we have not yet considered. That is, we have not scanned these objects to see what other objects can be reached through them.

        /* marking phase */
 1)     set the reached-bit to 1 and add to list Unscanned each
            object referenced by the root set;
 2)     while (Unscanned ≠ ∅) {
 3)         remove some object o from Unscanned;
 4)         for (each object o' referenced in o) {
 5)             if (o' is unreached; i.e., its reached-bit is 0) {
 6)                 set the reached-bit of o' to 1;
 7)                 put o' in Unscanned;
                }
            }
        }

        /* sweeping phase */
 8)     Free = ∅;
 9)     for (each chunk of memory o in the heap) {
10)         if (o is unreached, i.e., its reached-bit is 0) add o to Free;
11)         else set the reached-bit of o to 0;
        }

Figure 7.21: A Mark-and-Sweep Garbage Collector

The Unscanned list is empty initially. Additionally, each object includes a bit to indicate whether it has been reached (the reached-bit). Before the algorithm begins, all allocated objects have the reached-bit set to 0.

In line (1) of Fig. 7.21, we initialize the Unscanned list by placing there all the objects referenced by the root set. The reached-bit for these objects is also set to 1. Lines (2) through (7) are a loop, in which we, in turn, examine each object o that is ever placed on the Unscanned list. The for-loop of lines (4) through (7) implements the scanning of object o. We examine each object o' for which we find a reference within o. If o' has already been reached (its reached-bit is 1), then there is no need to do anything about o'; it either has been scanned previously, or it is on the Unscanned list to be scanned later. However, if o' was not reached already, then we need to set its reached-bit to 1 in line (6) and add o' to the Unscanned list in line (7).

Figure 7.22 illustrates this process. It shows an Unscanned list with four objects. The first object on this list, corresponding to object o in the discussion above, is in the process of being scanned. The dashed lines correspond to the three kinds of objects that might be reached from o:

1. A previously scanned object that need not be scanned again.

2. An object currently on the Unscanned list.

3. An item that is reachable, but was previously thought to be unreached.


Figure 7.22: The relationships among objects during the marking phase of a mark-and-sweep garbage collector (free and unreached objects have reached-bit 0; unscanned and previously scanned objects have reached-bit 1)

Lines (8) through (11), the sweeping phase, reclaim the space of all the objects that remain unreached at the end of the marking phase. Note that these will include any objects that were on the Free list originally. Because the set of unreached objects cannot be enumerated directly, the algorithm sweeps through the entire heap. Line (10) puts free and unreached objects on the Free list, one at a time. Line (11) handles the reachable objects. We set their reached-bit to 0, in order to maintain the proper preconditions for the next execution of the garbage-collection algorithm. □
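For readers who prefer running code to pseudocode, here is a compact C sketch of Algorithm 7.12 under a toy object representation in which every object lists its outgoing references explicitly; the names mirror the algorithm's lists and reached-bit, and the representation itself is an assumption made for illustration.

#include <stddef.h>

enum { MAXREFS = 4, HEAPSIZE = 1024 };

typedef struct Object {
    int reached;                        /* the reached-bit                */
    struct Object *refs[MAXREFS];       /* references held by the object  */
    int allocated;                      /* is this chunk in use at all?   */
} Object;

static Object heap[HEAPSIZE];           /* the whole heap, swept below    */
static Object *unscanned[HEAPSIZE];     /* list Unscanned, kept as a stack */
static Object *freelist[HEAPSIZE];      /* list Free                      */
static int nunscanned, nfree;

void mark_and_sweep(Object **root, int nroots) {
    /* marking phase: lines (1)-(7) of Fig. 7.21 */
    for (int i = 0; i < nroots; i++)
        if (root[i] && !root[i]->reached) {
            root[i]->reached = 1;
            unscanned[nunscanned++] = root[i];
        }
    while (nunscanned > 0) {
        Object *o = unscanned[--nunscanned];     /* remove some object o  */
        for (int i = 0; i < MAXREFS; i++) {
            Object *o2 = o->refs[i];
            if (o2 && !o2->reached) {            /* o2 is newly reached   */
                o2->reached = 1;
                unscanned[nunscanned++] = o2;
            }
        }
    }
    /* sweeping phase: lines (8)-(11) */
    nfree = 0;
    for (int i = 0; i < HEAPSIZE; i++) {
        if (!heap[i].reached) {                  /* unreached: reclaim it */
            heap[i].allocated = 0;
            freelist[nfree++] = &heap[i];
        } else {
            heap[i].reached = 0;                 /* reset for next round  */
        }
    }
}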

7.6.2 Basic Abstraction

All trace-based algorithms compute the set of reachable objects and then take the complement of this set. Memory is therefore recycled as follows:

a) The program or mutator runs and makes allocation requests.

b) The garbage collector discovers reachability by tracing.

c) The garbage collector reclaims the storage for unreachable objects.

This cycle is illustrated in Fig. 7.23 in terms of four states for chunks of memory: Free, Unreached, Unscanned, and Scanned. The state of a chunk might be stored in the chunk itself, or it might be implicit in the data structures used by the garbage-collection algorithm.

Figure 7.23: States of memory in a garbage collection cycle: (a) Before tracing: action of mutator; (b) Discovering reachability by tracing; (c) Reclaiming storage

While trace-based algorithms may differ in their implementation, they can all be described in terms of the following states:

1. Free. A chunk is in the Free state if it is ready to be allocated. Thus, a Free chunk must not hold a reachable object.

2. Unreached. Chunks are presumed unreachable, unless proven reachable by tracing. A chunk is in the Unreached state at any point during garbage collection if its reachability has not yet been established. Whenever a chunk is allocated by the memory manager, its state is set to Unreached, as illustrated in Fig. 7.23(a). Also, after a round of garbage collection, the state of a reachable object is reset to Unreached to get ready for the next round; see the transition from Scanned to Unreached, which is shown dashed to emphasize that it prepares for the next round.

3. Unscanned. Chunks that are known to be reachable are either in state Unscanned or state Scanned. A chunk is in the Unscanned state if it is known to be reachable, but its pointers have not yet been scanned. The transition to Unscanned from Unreached occurs when we discover that a chunk is reachable; see Fig. 7.23(b).

4. Scanned. Every Unscanned object will eventually be scanned and transition to the Scanned state. To scan an object, we examine each of the pointers within it and follow those pointers to the objects to which they refer. If a reference is to an Unreached object, then that object is put in the Unscanned state. When the scan of an object is completed, that object is placed in the Scanned state; see the lower transition in Fig. 7.23(b). A Scanned object can only contain references to other Scanned or Unscanned objects, and never to Unreached objects.


When no objects are left in the Unscanned state, the computation of reachability is complete. Objects left in the Unreached state at the end are truly unreachable. The garbage collector reclaims the space they occupy and places the chunks in the Free state, as illustrated by the solid transition in Fig. 7.23(c). To get ready for the next cycle of garbage collection, objects in the Scanned state are returned to the Unreached state; see the dashed transition in Fig. 7.23(c). Again, remember that these objects really are reachable right now. The Unreached state is appropriate because we shall want to start all objects out in this state when garbage collection next begins, by which time any of the currently reachable objects may indeed have been rendered unreachable.

Example 7.13: Let us see how the data structures of Algorithm 7.12 relate

to the four states introduced above. Using the reached-bit and membership on lists Free and Unscanned, we can distinguish among all four states. The table of Fig. 7.24 summarizes the characterization of the four states in terms of the data structure for Algorithm 7.12. □

    STATE        ON Free   ON Unscanned   REACHED-BIT
    Free         Yes       No             0
    Unreached    No        No             0
    Unscanned    No        Yes            1
    Scanned      No        No             1

Figure 7.24: Representation of states in Algorithm 7.12

7.6.3 Optimizing Mark-and-Sweep

The final step in the basic mark-and-sweep algorithm is expensive because there is no easy way to find only the unreachable objects without examining the entire heap. An improved algorithm, due to Baker, keeps a list of all allocated objects. To find the set of unreachable objects, which we must return to free space, we take the set difference of the allocated objects and the reached objects.

Algorithm 7.14: Baker's mark-and-sweep collector.

INPUT: A root set of objects, a heap, a free list Free, and a list of allocated objects, which we refer to as Unreached.

OUTPUT: Modified lists Free and Unreached, the latter holding the allocated objects.

METHOD: In this algorithm, shown in Fig. 7.25, the data structure for garbage collection is four lists named Free, Unreached, Unscanned, and Scanned, each of which holds all the objects in the state of the same name. These lists may be implemented by embedded, doubly linked lists, as was discussed in Section 7.4.4. A reached-bit in objects is not used, but we assume that each object


contains bits telling which of the four states it is in. Initially, Free is the free list maintained by the memory manager, and all allocated objects are on the Unreached list (also maintained by the memory manager as it allocates chunks to objects).

 1)     Scanned = ∅;
 2)     Unscanned = set of objects referenced in the root set;
 3)     while (Unscanned ≠ ∅) {
 4)         move object o from Unscanned to Scanned;
 5)         for (each object o' referenced in o) {
 6)             if (o' is in Unreached)
 7)                 move o' from Unreached to Unscanned;
            }
        }
 8)     Free = Free ∪ Unreached;
 9)     Unreached = Scanned;

Figure 7.25: Baker's mark-and-sweep algorithm

Lines (1) and (2) initialize Scanned to be the empty list, and Unscanned to have only the objects reached from the root set. Note that these objects were presumably on the list Unreached and must be removed from there. Lines (3) through (7) are a straightforward implementation of the basic marking algorithm, using these lists. That is, the for-loop of lines (5) through (7) examines the references in one unscanned object o, and if any of those references o' have not yet been reached, line (7) changes o' to the Unscanned state. At the end, line (8) takes those objects that are still on the Unreached list and deallocates their chunks, by moving them to the Free list. Then, line (9) takes all the objects in state Scanned, which are the reachable objects, and reinitializes the Unreached list to be exactly those objects. Presumably, as the memory manager creates new objects, those too will be added to the Unreached list and removed from the Free list. □

In both algorithms of this section, we have assumed that chunks returned to the free list remain as they were before deallocation. However, as discussed in Section 7.4.4, it is often advantageous to combine adjacent free chunks into larger chunks. If we wish to do so, then every time we return a chunk to the free list, either at line (10) of Fig. 7.21 or line (8) of Fig. 7.25, we examine the chunks to its left and right, and merge if one is free.
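The list manipulations of Algorithm 7.14 can also be seen in executable form. The following Python fragment is only an illustrative sketch of the same idea, not code from this book; the class Obj and the use of plain Python lists for the four states are assumptions made for the example.

class Obj:
    """A heap object; refs holds the references (other Obj instances) it contains."""
    def __init__(self, name, refs=None):
        self.name = name
        self.refs = refs or []

def baker_collect(root_set, free, unreached):
    # Lines (1)-(2): objects referenced by the root set start out Unscanned.
    scanned, unscanned = [], []
    for o in root_set:
        if o in unreached:
            unreached.remove(o)
            unscanned.append(o)
    # Lines (3)-(7): trace, moving objects Unscanned -> Scanned and promoting
    # any Unreached object referenced by a scanned object to Unscanned.
    while unscanned:
        o = unscanned.pop()
        scanned.append(o)
        for r in o.refs:
            if r in unreached:
                unreached.remove(r)
                unscanned.append(r)
    # Line (8): everything still Unreached is garbage; reclaim it.
    free.extend(unreached)
    # Line (9): the Scanned (reachable) objects become the new Unreached list.
    return free, scanned

A caller would pass the current Free and Unreached lists and use the returned pair as their new contents.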

7.6.4 Mark-and-Compact Garbage Collectors

Relocating collectors move reachable objects around in the heap to eliminate

memory fragmentation. It is common that the space occupied by reachable objects is much smaller than the freed space. Thus, after identifying all the holes,


instead of freeing them individually, one attractive alternative is to relocate all the reachable objects into one end of the heap, leaving the entire rest of the heap as one free chunk. After all, the garbage collector has already analyzed every reference within the reachable objects, so updating them to point to the new locations does not require much more work. These, plus the references in the root set, are all the references we need to change.

Having all the reachable objects in contiguous locations reduces fragmentation of the memory space, making it easier to house large objects. Also, by making the data occupy fewer cache lines and pages, relocation improves a program's temporal and spatial locality, since new objects created at about the same time are allocated nearby chunks. Objects in nearby chunks can benefit from prefetching if they are used together. Further, the data structure for maintaining free space is simplified; instead of a free list, all we need is a pointer free to the beginning of the one free block.

Relocating collectors vary in whether they relocate in place or reserve space ahead of time for the relocation:

• A mark-and-compact collector, described in this section, compacts objects in place. Relocating in place reduces memory usage.



• The more efficient and popular copying collector in Section 7.6.5 moves objects from one region of memory to another. Reserving extra space for relocation allows reachable objects to be moved as they are discovered.

The mark-and-compact collector in Algorithm 7.15 has three phases:

1. First is a marking phase, similar to that of the mark-and-sweep algorithms described previously.

2. Second, the algorithm scans the allocated section of the heap and computes a new address for each of the reachable objects. New addresses are assigned from the low end of the heap, so there are no holes between reachable objects. The new address for each object is recorded in a structure called NewLocation.

3. Finally, the algorithm copies objects to their new locations, updating all references in the objects to point to the corresponding new locations. The needed addresses are found in NewLocation.

Algorithm 7.15: A mark-and-compact garbage collector.

INPUT: A root set of objects, a heap, and free, a pointer marking the start of free space.

OUTPUT: The new value of pointer free.

METHOD: The algorithm is in Fig. 7.26; it uses the following data structures:

1. An Unscanned list, as in Algorithm 7.12.


2. Reached bits in all objects, also as in Algorithm 7.12. To keep our description simple, we refer to objects as "reached" or "unreached," when we mean that their reached-bit is 1 or 0, respectively. Initially, all objects are unreached.

3. The pointer free, which marks the beginning of unallocated space in the heap.

4. The table NewLocation. This structure could be a hash table, search tree, or another structure that implements the two operations:

   (a) Set NewLocation(o) to a new address for object o.

   (b) Given object o, get the value of NewLocation(o).

We shall not concern ourselves with the exact structure used, although you may assume that NewLocation is a hash table, and therefore, the "set" and "get" operations are each performed in average constant time, independent of how many objects are in the heap.

The first, or marking, phase of lines (1) through (7) is essentially the same as the first phase of Algorithm 7.12. The second phase, lines (8) through (12), visits each chunk in the allocated part of the heap, from the left, or low end. As a result, chunks are assigned new addresses that increase in the same order as their old addresses. This ordering is important, since when we relocate objects, we can do so in a way that assures we only move objects left, into space that was formerly occupied by objects we have moved already.

Line (8) starts the free pointer at the low end of the heap. In this phase, we use free to indicate the first available new address. We create a new address only for those objects o that are marked as reached. Object o is given the next available address at line (10), and at line (11) we increment free by the amount of storage that object o requires, so free again points to the beginning of free space.

In the final phase, lines (13) through (17), we again visit the reached objects, in the same from-the-left order as in the second phase. Lines (15) and (16) replace all internal pointers of a reached object o by their proper new values, using the NewLocation table to determine the replacement. Then, line (17) moves the object o, with the revised internal references, to its new location. Finally, lines (18) and (19) retarget pointers in the elements of the root set that are not themselves heap objects, e.g., statically allocated or stack-allocated objects. Figure 7.27 suggests how the reachable objects (those that are not shaded) are moved down the heap, while the internal pointers are changed to point to the new locations of the reached objects. □
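As an illustration of the second and third phases, the following Python sketch models chunks as small records laid out in address order and uses an ordinary dictionary for NewLocation; it is a simplification under these assumptions, not the book's implementation.

from dataclasses import dataclass, field

@dataclass
class Chunk:
    addr: int
    size: int
    reached: bool
    refs: dict = field(default_factory=dict)   # field name -> target chunk address

def compact(heap_chunks):
    # Second phase, lines (8)-(12) of Fig. 7.26: assign new addresses from the
    # low end of the heap, visiting chunks in increasing address order.
    new_location = {}
    free = 0
    for o in sorted(heap_chunks, key=lambda c: c.addr):
        if o.reached:
            new_location[o.addr] = free
            free += o.size
    # Third phase, lines (13)-(17): retarget internal references, then "move"
    # each reached chunk by giving it its new address.
    for o in heap_chunks:
        if o.reached:
            o.refs = {f: new_location[t] for f, t in o.refs.items()}
            o.addr = new_location[o.addr]
    return free, new_location   # free now marks the start of the free area

Retargeting the references in the root set, as in lines (18) and (19), would use the same new_location table.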

7.6.5 Copying collectors

A copying collector reserves, ahead of time, space to which the objects can move, thus breaking the dependency between tracing and finding free space.

    /* mark */
1)  Unscanned = set of objects referenced by the root set;
2)  while (Unscanned ≠ ∅) {
3)      remove object o from Unscanned;
4)      for (each object o' referenced in o) {
5)          if (o' is unreached) {
6)              mark o' as reached;
7)              put o' on list Unscanned;
            }
        }
    }
    /* compute new locations */
8)  free = starting location of heap storage;
9)  for (each chunk of memory o in the heap, from the low end) {
10)     if (o is reached) {
11)         NewLocation(o) = free;
12)         free = free + sizeof(o);
        }
    }
    /* retarget references and move reached objects */
13) for (each chunk of memory o in the heap, from the low end) {
14)     if (o is reached) {
15)         for (each reference o.r in o)
16)             o.r = NewLocation(o.r);
17)         copy o to NewLocation(o);
        }
    }
18) for (each reference r in the root set)
19)     r = NewLocation(r);

Figure 7.26: A Mark-and-Compact Collector

The memory space is partitioned into two semispaces, A and B. The mutator allocates memory in one semispace, say A, until it fills up, at which point the mutator is stopped and the garbage collector copies the reachable objects to the other space, say B. When garbage collection completes, the roles of the semispaces are reversed: The mutator is allowed to resume and allocate objects in space B, and the next round of garbage collection moves reachable objects to space A. The following algorithm is due to C. J. Cheney.

Algorithm 7.16: Cheney's copying collector.

INPUT: A root set of objects, and a heap consisting of the From semispace, containing allocated objects, and the To semispace, all of which is free.


Figure 7.27: Moving reached objects to the front of the heap, while preserving internal pointers

OUTPUT: At the end, the To semispace holds the allocated objects. A free pointer indicates the start of free space remaining in the To semispace. The From semispace is completely free.

METHOD: The algorithm is shown in Fig. 7.28. Cheney's algorithm finds reachable objects in the From semispace and copies them, as soon as they are reached, to the To semispace. This placement groups related objects together and may improve spatial locality.

Before examining the algorithm itself, which is the function CopyingCollector in Fig. 7.28, consider the auxiliary function LookupNewLocation in lines (11) through (16). This function takes an object o and finds a new location for it in the To space if o has no location there yet. All new locations are recorded in a structure NewLocation, and a value of NULL indicates o has no assigned location.5 As in Algorithm 7.15, the exact form of structure NewLocation may vary, but it is fine to assume that it is a hash table. If we find at line (12) that o has no location, then it is assigned the beginning of the free space within the To semispace, at line (13). Line (14) increments the free pointer by the amount of space taken by o, and at line (15) we copy o from the From space to the To space. Thus, the movement of objects from one semispace to the other occurs as a side effect, the first time we look up the new location for the object. Regardless of whether the location of o was or was not previously established, line (16) returns the location of o in the To space.

Now, we can consider the algorithm itself. Line (2) establishes that none of the objects in the From space have new addresses yet. At line (3), we initialize two pointers, unscanned and free, to the beginning of the To semispace. Pointer free will always indicate the beginning of free space within the To space. As we add objects to the To space, those with addresses below unscanned will be in the Scanned state, while those between unscanned and free are in the Unscanned

5 In a typical data structure, such as a hash table, if o is not assigned a location, then there simply would be no mention of it in the structure.

1)  CopyingCollector() {
2)      for (all objects o in From space) NewLocation(o) = NULL;
3)      unscanned = free = starting address of To space;
4)      for (each reference r in the root set)
5)          replace r with LookupNewLocation(r);
6)      while (unscanned ≠ free) {
7)          o = object at location unscanned;
8)          for (each reference o.r within o)
9)              o.r = LookupNewLocation(o.r);
10)         unscanned = unscanned + sizeof(o);
        }
    }

    /* Look up the new location for object if it has been moved. */
    /* Place object in Unscanned state otherwise. */
11) LookupNewLocation(o) {
12)     if (NewLocation(o) = NULL) {
13)         NewLocation(o) = free;
14)         free = free + sizeof(o);
15)         copy o to NewLocation(o);
        }
16)     return NewLocation(o);
    }

Figure 7.28: A Copying Garbage Collector

state. Thus, free always leads unscanned, and when the latter catches up to the former, there are no more Unscanned objects, and we are done with the garbage collection. Notice that we do our work within the To space, although all references within objects examined at line (8) lead us back to the From space.

Lines (4) and (5) handle the objects reachable from the root set. Note that as a side effect, some of the calls to LookupNewLocation at line (5) will increase free, as chunks for these objects are allocated within To. Thus, the loop of lines (6) through (10) will be entered the first time it is reached, unless there are no objects referenced by the root set (in which case the entire heap is garbage). This loop then scans each of the objects that has been added to To and is in the Unscanned state. Line (7) takes the next unscanned object, o. Then, at lines (8) and (9), each reference within o is translated from its value in the From semispace to its value in the To semispace. Notice that, as a side effect, if a reference within o is to an object we have not reached previously, then the call to LookupNewLocation at line (9) creates space for that object in the To space and moves the object there. Finally, line (10) increments unscanned to point to the next object, just beyond o in the To space. □
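The control structure of Cheney's collector can be mimicked in a few lines of Python. In this sketch, "addresses" in the To space are simply list indices, objects are dictionaries with a refs field, and NewLocation is a dictionary keyed by object identity; all of these are assumptions of the example rather than details of the algorithm itself.

def cheney_collect(root_set):
    """Objects are dicts whose 'refs' entry maps names to other objects."""
    to_space = []                # the To semispace, filled from the low end
    new_location = {}            # maps id(old object) -> index in to_space

    def lookup_new_location(o):  # lines (11)-(16) of Fig. 7.28
        if id(o) not in new_location:
            new_location[id(o)] = len(to_space)
            to_space.append(o)   # copy o into the To space
        return new_location[id(o)]

    for r in root_set:           # lines (4)-(5): roots are moved first
        lookup_new_location(r)
    unscanned = 0                # line (3): unscanned and free start together
    while unscanned < len(to_space):          # lines (6)-(10)
        o = to_space[unscanned]
        for name, target in o["refs"].items():
            o["refs"][name] = lookup_new_location(target)
        unscanned += 1           # free is implicitly len(to_space)
    return to_space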

7.6.6 Comparing Costs

Cheney's algorithm has the advantage that it does not touch any of the unreachable objects. On the other hand, a copying garbage collector must move the contents of all the reachable objects. This process is especially expensive for large objects and for long-lived objects that survive multiple rounds of garbage collection. We can summarize the running time of each of the four algorithms described in this section, as follows. Each estimate ignores the cost of processing the root set.

• Basic Mark-and-Sweep (Algorithm 7.12): Proportional to the number of chunks in the heap.

• Baker's Mark-and-Sweep (Algorithm 7.14): Proportional to the number of reached objects.

• Basic Mark-and-Compact (Algorithm 7.15): Proportional to the number of chunks in the heap plus the total size of the reached objects.

• Cheney's Copying Collector (Algorithm 7.16): Proportional to the total size of the reached objects.

7.6.7 Exercises for Section 7.6

Exercise 7.6.1: Show the steps of a mark-and-sweep garbage collector on

a) Fig. 7.19 with the pointer A → B deleted.

b) Fig. 7.19 with the pointer A → C deleted.

c) Fig. 7.20 with the pointer A → D deleted.

d) Fig. 7.20 with the object B deleted.

Exercise 7.6.2: The Baker mark-and-sweep algorithm moves objects among four lists: Free, Unreached, Unscanned, and Scanned. For each of the object networks of Exercise 7.6.1, indicate for each object the sequence of lists on which it finds itself from just before garbage collection begins until just after it finishes.

Exercise 7.6.3: Suppose we perform a mark-and-compact garbage collection on each of the networks of Exercise 7.6.1. Also, suppose that

i. Each object has size 100 bytes, and

ii. Initially, the nine objects in the heap are arranged in alphabetical order, starting at byte 0 of the heap.

What is the address of each object after garbage collection?


Exercise 7.6.4: Suppose we execute Cheney's copying garbage collection algorithm on each of the networks of Exercise 7.6.1. Also, suppose that

i. Each object has size 100 bytes,

ii. The unscanned list is managed as a queue, and when an object has more than one pointer, the reached objects are added to the queue in alphabetical order, and

iii. The From semispace starts at location 0, and the To semispace starts at location 10,000.

What is the value of NewLocation(o) for each object o that remains after garbage collection?

7.7 Short-Pause Garbage Collection

Simple trace-based collectors do stop-the-world-style garbage collection, which may introduce long pauses into the execution of user programs. We can reduce the length of the pauses by performing garbage collection one part at a time. We can divide the work in time, by interleaving garbage collection with the mutation, or we can divide the work in space by collecting a subset of the garbage at a time. The former is known as incremental collection and the latter is known as partial collection.

An incremental collector breaks up the reachability analysis into smaller units, allowing the mutator to run between these execution units. The reachable set changes as the mutator executes, so incremental collection is complex. As we shall see in Section 7.7.1, finding a slightly conservative answer can make tracing more efficient.

The best known of partial-collection algorithms is generational garbage collection; it partitions objects according to how long they have been allocated and collects the newly created objects more often because they tend to have a shorter lifetime. An alternative algorithm, the train algorithm, also collects a subset of garbage at a time, and is best applied to more mature objects. These two algorithms can be used together to create a partial collector that handles younger and older objects differently. We discuss the basic algorithm behind partial collection in Section 7.7.3, and then describe in more detail how the generational and train algorithms work.

Ideas from both incremental and partial collection can be adapted to create an algorithm that collects objects in parallel on a multiprocessor; see Section 7.8.1.

7.7.1 Incremental Garbage Collection

Incremental collectors are conservative. While a garbage collector must not collect objects that are not garbage, it does not have to collect all the garbage


in each round. We refer to the garbage left behind after collection as floating garbage. Of course it is desirable to minimize floating garbage. In particular, an incremental collector should not leave behind any garbage that was not reachable at the beginning of a collection cycle. If we can be sure of such a collection guarantee, then any garbage not collected in one round will be collected in the next, and no memory is leaked because of this approach to garbage collection.

In other words, incremental collectors play it safe by overestimating the set of reachable objects. They first process the program's root set atomically, without interference from the mutator. After finding the initial set of unscanned objects, the mutator's actions are interleaved with the tracing step. During this period, any of the mutator's actions that may change reachability are recorded succinctly, in a side table, so that the collector can make the necessary adjustments when it resumes execution. If space is exhausted before tracing completes, the collector completes the tracing process, without allowing the mutator to execute. In any event, when tracing is done, space is reclaimed atomically.

Precision of Incremental Collection

Once an object becomes unreachable, it is not possible for the object to become reachable again. Thus, as garbage collection and mutation proceed, the set of reachable objects can only

1. Grow due to new objects allocated after garbage collection starts, and

2. Shrink by losing references to allocated objects.

Let the set of reachable objects at the beginning of garbage collection be R; let New be the set of allocated objects during garbage collection, and let Lost be the set of objects that have become unreachable due to lost references since tracing began. The set of objects reachable when tracing completes is

(R ∪ New) − Lost.

It is expensive to reestablish an object's reachability every time a mutator loses a reference to the object, so incremental collectors do not attempt to collect all the garbage at the end of tracing. Any garbage left behind - floating garbage - should be a subset of the Lost objects. Expressed formally, the set S of objects found by tracing must satisfy

(R ∪ New) − Lost ⊆ S ⊆ (R ∪ New)

Simple Incremental Tracing

We first describe a straightforward tracing algorithm that finds the upper bound R ∪ New. The behavior of the mutator is modified during the tracing as follows:




• All references that existed before garbage collection are preserved; that is, before the mutator overwrites a reference, its old value is remembered and treated like an additional unscanned object containing just that reference.



• All objects created are considered reachable immediately and are placed in the Unscanned state.

This scheme is conservative but correct, because it finds R, the set of all the objects reachable before garbage collection, plus New, the set of all the newly allocated objects. However, the cost is high, because the algorithm intercepts all write operations and remembers all the overwritten references. Some of this work is unnecessary because it may involve objects that are unreachable at the end of garbage collection. We could avoid some of this work and also improve the algorithm's precision if we could detect when the overwritten references point to objects that are unreachable when this round of garbage collection ends. The next algorithm goes fairly far in these two directions.
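A minimal sketch of these two mutator modifications, assuming each object carries a refs dictionary and that the collector shares its Unscanned list with the mutator, might look like this in Python; the names write_ref and allocate are illustrative only, not from the book.

tracing_active = False   # set by the collector while tracing is in progress
unscanned = []           # shared with the collector's Unscanned list

def write_ref(obj, field, new_target):
    # Before the mutator overwrites a reference, remember its old value and
    # treat it like one more unscanned object (the first bullet above).
    old = obj.refs.get(field)
    if tracing_active and old is not None:
        unscanned.append(old)
    obj.refs[field] = new_target

def allocate(make_object):
    # Newly created objects are considered reachable immediately
    # (the second bullet above).
    o = make_object()
    if tracing_active:
        unscanned.append(o)
    return o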

7.7.2 Incremental Reachability Analysis

If we interleave the mutator with a basic tracing algorithm, such as Algorithm 7.12, then some reachable objects may be misclassified as unreachable. The problem is that the actions of the mutator can violate a key invariant of the algorithm; namely, a Scanned object can only contain references to other Scanned or Unscanned objects, never to Unreached objects. Consider the following scenario:

1. The garbage collector finds object o1 reachable and scans the pointers within o1, thereby putting o1 in the Scanned state.

2. The mutator stores a reference to an Unreached (but reachable) object o into the Scanned object o1. It does so by copying a reference to o from an object o2 that is currently in the Unreached or Unscanned state.

3. The mutator loses the reference to o in object o2. It may have overwritten o2's reference to o before the reference is scanned, or o2 may have become unreachable and never have reached the Unscanned state to have its references scanned.

Now, o is reachable through object o1, but the garbage collector may have seen neither the reference to o in o1 nor the reference to o in o2. The key to a more precise, yet correct, incremental trace is that we must note all copies of references to currently unreached objects from an object that has not been scanned to one that has. To intercept problematic transfers of references, the algorithm can modify the mutator's action during tracing in any of the following ways:


• Write Barriers. Intercept writes of references into a Scanned object o1, when the reference is to an Unreached object o. In this case, classify o as reachable and place it in the Unscanned set. Alternatively, place the written object o1 back in the Unscanned set so we can rescan it.



• Read Barriers. Intercept the reads of references in Unreached or Unscanned objects. Whenever the mutator reads a reference to an object o from an object in either the Unreached or Unscanned state, classify o as reachable and place it in the Unscanned set.



• Transfer Barriers. Intercept the loss of the original reference in an Unreached or Unscanned object. Whenever the mutator overwrites a reference in an Unreached or Unscanned object, save the reference being overwritten, classify it as reachable, and place the reference itself in the Unscanned set.

None of the options above finds the smallest set of reachable objects. If the tracing process determines an object to be reachable, it stays reachable even though all references to it are overwritten before tracing completes. That is, the set of reachable objects found is between (R ∪ New) − Lost and (R ∪ New).

Write barriers are the most efficient of the options outlined above. Read barriers are more expensive because typically there are many more reads than there are writes. Transfer barriers are not competitive; because many objects "die young," this approach would retain many unreachable objects.

Implementing Write Barriers

We can implement write barriers in two ways. The first approach is to remember, during a mutation phase, all new references written into the Scanned objects. We can place all these references in a list; the size of the list is proportional to the number of write operations to Scanned objects, unless duplicates are removed from the list. Note that references on the list may later be overwritten themselves and potentially could be ignored.

The second, more efficient approach is to remember the locations where the writes occur. We may remember them as a list of locations written, possibly with duplicates eliminated. Note it is not important that we pinpoint the exact locations written, as long as all the locations that have been written are rescanned. Thus, there are several techniques that allow us to remember less detail about exactly where the rewritten locations are.

• Instead of remembering the exact address or the object and field that is written, we can remember just the objects that hold the written fields.



• We can divide the address space into fixed-size blocks, known as cards, and use a bit array to remember the cards that have been written into.


• We can choose to remember the pages that contain the written locations. We can simply protect the pages containing Scanned objects. Then, any writes into Scanned objects will be detected without executing any explicit instructions, because they will cause a protection violation, and the operating system will raise a program exception.

In general, by coarsening the granularity at which we remember the written locations, less storage is needed, at the expense of increasing the amount of rescanning performed. In the first scheme, all references in the modified objects will have to be rescanned, regardless of which reference was actually modified. In the last two schemes, all reachable objects in the modified cards or modified pages need to be rescanned at the end of the tracing process.
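For example, the card-based variant can be sketched as follows in Python; the heap size, card size, and use of a bytearray are arbitrary choices for the illustration, not part of the book's presentation.

HEAP_SIZE = 1 << 20                   # 1 MB heap, purely for illustration
CARD_SIZE = 512                       # bytes per card; an arbitrary choice
card_table = bytearray(HEAP_SIZE // CARD_SIZE)   # 0 = clean, 1 = dirty

def write_barrier(address):
    # Called on every store of a reference at the given heap address:
    # mark the enclosing card dirty so it will be rescanned later.
    card_table[address // CARD_SIZE] = 1

def cards_to_rescan():
    # At the end of tracing, only the dirty cards need to be rescanned.
    return [i for i, dirty in enumerate(card_table) if dirty]

Remembering whole objects or whole pages follows the same pattern, with a different unit in place of CARD_SIZE.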

Combining Incremental and Copying Techniques

The above methods are sufficient for mark-and-sweep garbage collection. Copying collection is slightly more complicated, because of its interaction with the mutator. Objects in the Scanned or Unscanned states have two addresses, one in the From semispace and one in the To semispace. As in Algorithm 7.16, we must keep a mapping from the old address of an object to its relocated address.

There are two choices for how we update the references. First, we can have the mutator make all the changes in the From space, and only at the end of garbage collection do we update all the pointers and copy all the contents over to the To space. Second, we can instead make changes to the representation in the To space. Whenever the mutator dereferences a pointer to the From space, the pointer is translated to a new location in the To space if one exists. All the pointers need to be translated to point to the To space in the end.

7.7.3 Partial-Collection Basics

The fundamental fact is that objects typically "die young." It has been found that usually between 80% and 98% of all newly allocated objects die within a few million instructions, or before another megabyte has been allocated. That is, objects often become unreachable before any garbage collection is invoked. Thus, it is quite cost effective to garbage collect new objects frequently.

Yet, objects that survive a collection once are likely to survive many more collections. With the garbage collectors described so far, the same mature objects will be found to be reachable over and over again and, in the case of copying collectors, copied over and over again, in every round of garbage collection. Generational garbage collection works most frequently on the area of the heap that contains the youngest objects, so it tends to collect a lot of garbage for relatively little work. The train algorithm, on the other hand, does not spend a large proportion of time on young objects, but it does limit the pauses due to garbage collection. Thus, a good combination of strategies is to use generational collection for young objects, and once an object becomes


sufficiently mature, to "promote" it to a separate heap that is managed by the train algorithm.

We refer to the set of objects to be collected on one round of partial collection as the target set and the rest of the objects as the stable set. Ideally, a partial collector should reclaim all objects in the target set that are unreachable from the program's root set. However, doing so would require tracing all objects, which is what we try to avoid in the first place. Instead, partial collectors conservatively reclaim only those objects that cannot be reached through either the root set of the program or the stable set. Since some objects in the stable set may themselves be unreachable, it is possible that we shall treat as reachable some objects in the target set that really have no path from the root set.

We can adapt the garbage collectors described in Sections 7.6.1 and 7.6.4 to work in a partial manner by changing the definition of the "root set." Instead of referring to just the objects held in the registers, stack and global variables, the root set now also includes all the objects in the stable set that point to objects in the target set. References from target objects to other target objects are traced as before to find all the reachable objects. We can ignore all pointers to stable objects, because these objects are all considered reachable in this round of partial collection.

To identify those stable objects that reference target objects, we can adopt techniques similar to those used in incremental garbage collection. In incremental collection, we need to remember all the writes of references from scanned objects to unreached objects during the tracing process. Here we need to remember all the writes of references from the stable objects to the target objects throughout the mutator's execution. Whenever the mutator stores into a stable object a reference to an object in the target set, we remember either the reference or the location of the write. We refer to the set of objects holding references from the stable to the target objects as the remembered set for this set of target objects. As discussed in Section 7.7.2, we can compress the representation of a remembered set by recording only the card or page in which the written object is found.

Partial garbage collectors are often implemented as copying garbage collectors. Noncopying collectors can also be implemented by using linked lists to keep track of the reachable objects. The "generational" scheme described below is an example of how copying may be combined with partial collection.
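The mutator-side bookkeeping for the remembered set can be sketched as follows; this Python fragment is only an illustration, and the refs dictionary on objects, the predicate in_target_set, and the use of a plain set are assumptions made for the example.

def store_reference(src, field, target, in_target_set, remembered_set):
    # in_target_set(o) tells whether o lives in the partitions being collected.
    # When a stable object is made to point into the target set, remember it;
    # such objects are added to the root set of the next partial collection.
    if not in_target_set(src) and in_target_set(target):
        remembered_set.add(src)
    src.refs[field] = target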

7.7.4 Generational Garbage Collection

Generational garbage collection is an effective way to exploit the property that most objects die young. The heap storage in generational garbage collection is separated into a series of partitions. We shall use the convention of numbering them 0, 1, 2, ..., n, with the lower-numbered partitions holding the younger objects. Objects are first created in partition 0. When this partition fills up, it is garbage collected, and its reachable objects are moved into partition 1. Now, with partition 0 empty again, we resume allocating new objects in that


partition. When partition 0 again fills,6 it is garbage collected and its reachable objects copied into partition 1, where they join the previously copied objects. This pattern repeats until partition 1 also fills up, at which point garbage collection is applied to partitions 0 and 1.

In general, each round of garbage collection is applied to all partitions numbered i or below, for some i; the proper i to choose is the highest-numbered partition that is currently full. Each time an object survives a collection (i.e., it is found to be reachable), it is promoted to the next higher partition from the one it occupies, until it reaches the oldest partition, the one numbered n.

Using the terminology introduced in Section 7.7.3, when partitions i and below are garbage collected, the partitions from 0 through i make up the target set, and all partitions above i comprise the stable set. To support finding root sets for all possible partial collections, we keep for each partition i a remembered set, consisting of all the objects in partitions above i that point to objects in set i. The root set for a partial collection invoked on set i includes the remembered sets for partition i and below. In this scheme, all partitions below i are collected whenever we collect i. There are two reasons for this policy:

1. Since younger generations contain more garbage and are collected more often anyway, we may as well collect them along with an older generation.

2. Following this strategy, we need to remember only the references pointing from an older generation to a newer generation. That is, neither writes to objects in the youngest generation nor promoting objects to the next generation causes updates to any remembered set. If we were to collect a partition without a younger one, the younger generation would become part of the stable set, and we would have to remember references that point from younger to older generations as well.

In summary, this scheme collects younger generations more often, and collections of these generations are particularly cost effective, since "objects die young." Garbage collection of older generations takes more time, since it includes the collection of all the younger generations and contains proportionally less garbage. Nonetheless, older generations do need to be collected once in a while to remove unreachable objects. The oldest generation holds the most mature objects; its collection is expensive because it is equivalent to a full collection. That is, generational collectors occasionally require that the full tracing step be performed and therefore can introduce long pauses into a program's execution. An alternative for handling mature objects only is discussed next.

6 Technically, partitions do not "fill," since they can be expanded with additional disk blocks by the memory manager, if desired. However, there is normally a limit on the size of a partition, other than the last. We shall refer to reaching this limit as "filling" the partition.
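A sketch of this collection policy appears below. It assumes that partitions are plain Python lists, that trace(...) performs the reachability analysis for the chosen target set and returns the set of surviving objects, and that remembered sets are ordinary Python sets; none of these names come from the book.

def collect(partitions, remembered_sets, is_full, trace):
    # i = the highest-numbered partition that is currently full; collect 0..i.
    i = max(k for k in range(len(partitions)) if is_full(partitions[k]))
    # Root set for this partial collection: the remembered sets for
    # partition i and below (together with the program's own root set).
    roots = set().union(*(remembered_sets[k] for k in range(i + 1)))
    live = trace(roots, partitions[:i + 1])   # assumed: returns surviving objects
    oldest = len(partitions) - 1
    # Each survivor is promoted one partition higher, capped at the oldest.
    promoted = [[o for o in partitions[k] if o in live] for k in range(i + 1)]
    for k in range(i + 1):
        partitions[k] = []
    for k, survivors in enumerate(promoted):
        partitions[min(k + 1, oldest)].extend(survivors)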

7.7.5 The Train Algorithm

While the generational approach is very efficient for the handling of immature objects, it is less efficient for the mature objects, since mature objects are moved every time there is a collection involving them, and they are quite unlikely to be garbage. A different approach to incremental collection, called the train algorithm, was developed to improve the handling of mature objects. It can be used for collecting all garbage, but it is probably better to use the generational approach for immature objects and, only after they have survived a few rounds of collection, "promote" them to another heap, managed by the train algorithm. Another advantage to the train algorithm is that we never have to do a complete garbage collection, as we do occasionally for generational garbage collection.

To motivate the train algorithm, let us look at a simple example of why it is necessary, in the generational approach, to have occasional all-inclusive rounds of garbage collection. Figure 7.29 shows two mutually linked objects in two partitions i and j, where j > i. Since both objects have pointers from outside their partition, a collection of only partition i or only partition j could never collect either of these objects. Yet they may in fact be part of a cyclic garbage structure with no links from the outside. In general, the "links" between the objects shown may involve many objects and long chains of references.


Figure 7.29: A cyclic structure across partitions that may be cyclic garbage

In generational garbage collection, we eventually collect partition j, and since i < j, we also collect i at that time. Then, the cyclic structure will be completely contained in the portion of the heap being collected, and we can tell if it truly is garbage. However, if we never have a round of collection that includes both i and j, we would have a problem with cyclic garbage, just as we did with reference counting for garbage collection.

The train algorithm uses fixed-length partitions, called cars; a car might be a single disk block, provided there are no objects larger than disk blocks, or the car size could be larger, but it is fixed once and for all. Cars are organized into trains. There is no limit to the number of cars in a train, and no limit to the number of trains. There is a lexicographic order to cars: first order by train number, and within a train, order by car number, as in Fig. 7.30.

Figure 7.30: Organization of the heap for the train algorithm

There are two ways that garbage is collected by the train algorithm:

• The first car in lexicographic order (that is, the first remaining car of the first remaining train) is collected in one incremental garbage-collection step. This step is similar to collection of the first partition in the generational algorithm, since we maintain a "remembered" list of all pointers from outside the car. Here, we identify objects with no references at all, as well as garbage cycles that are contained completely within this car. Reachable objects in the car are always moved to some other car, so each garbage-collected car becomes empty and can be removed from the train.

• Sometimes, the first train has no external references. That is, there are no pointers from the root set to any car of the train, and the remembered sets for the cars contain only references from other cars in the train, not from other trains. In this situation, the train is a huge collection of cyclic garbage, and we delete the entire train.

Remembered Sets

We now give the details of the train algorithm. Each car has a remembered set consisting of all references to objects in the car from

a) Objects in higher-numbered cars of the same train, and

b) Objects in higher-numbered trains.

In addition, each train has a remembered set consisting of all references from higher-numbered trains. That is, the remembered set for a train is the union of the remembered sets for its cars, except for those references that are internal to the train. It is thus possible to represent both kinds of remembered sets by dividing the remembered sets for the cars into "internal" (same train) and "external" (other trains) portions.

Note that references to objects can come from anywhere, not just from lexicographically higher cars. However, the two garbage-collection processes deal with the first car of the first train, and the entire first train, respectively. Thus, when it is time to use the remembered sets in a garbage collection, there is nothing earlier from which references could come, and therefore there is no point in remembering references to higher cars at any time. We must be careful, of course, to manage the remembered sets properly, changing them whenever the mutator modifies references in any object.


Managing Trains

Our objective is to draw out of the first train all objects that are not cyclic garbage. Then, the first train either becomes nothing but cyclic garbage and is therefore collected at the next round of garbage collection, or if the garbage is not cyclic, then its cars may be collected one at a time. We therefore need to start new trains occasionally, even though there is no limit on the number of cars in one train, and we could in principle simply add new cars to a single train, every time we needed more space. For example, we could start a new train after every k object creations, for some k. That is, in general, a new object is placed in the last car of the last train, if there is room, or in a new car that is added to the end of the last train, if there is no room. However, periodically, we instead start a new train with one car, and place the new object there.
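The allocation policy just described might be sketched as follows in Python; the Car class, its fixed capacity, and the choice of k are assumptions made only for this illustration.

class Car:
    def __init__(self):
        self.objects = []
    def has_room(self):
        return len(self.objects) < 64       # fixed car capacity, chosen arbitrarily

def allocate(trains, obj, allocation_count, k=1000):
    # Every k-th allocation starts a brand-new train with a single car;
    # otherwise the object goes in the last car of the last train, or in a
    # fresh car appended to that train if the last car is full.
    if allocation_count % k == 0 or not trains:
        trains.append([Car()])
    last_train = trains[-1]
    if not last_train[-1].has_room():
        last_train.append(Car())
    last_train[-1].objects.append(obj)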

Garbage Collecting a Car

The heart of the train algorithm is how we process the first car of the first train during a round of garbage collection. Initially, the reachable set is taken to be the objects of that car with references from the root set and those with references in the remembered set for that car. We then scan these objects as in a mark-and-sweep collector, but we do not scan any reached objects outside the one car being collected. After this tracing, some objects in the car may be identified as garbage. There is no need to reclaim their space, because the entire car is going to disappear anyway. However, there are likely to be some reachable objects in the car, and these must be moved somewhere else. The rules for moving an object are:

• If there is a reference in the remembered set from any other train (which will be higher-numbered than the train of the car being collected), then move the object to one of those trains. If there is room, the object can go in some existing car of the train from which a reference emanates, or it can go in a new, last car if there is no room.



• If there is no reference from other trains, but there are references from the root set or from the first train, then move the object to any other car of the same train, creating a new, last car if there is no room. If possible, pick a car from which there is a reference, to help bring cyclic structures to a single car.

After moving all the reachable objects from the first car, we delete that car.

Panic Mode

There is one problem with the rules above. In order to be sure that all garbage will eventually be collected, we need to be sure that every train eventually becomes the first train, and if this train is not cyclic garbage, then eventually


all cars of that train are removed and the train disappears one car at a time. However, by rule (2) above, collecting the first car of the first train can produce a new last car. It cannot produce two or more new cars, since surely all the objects of the first car can fit in the new, last car. However, could we be in a situation where each collection step for a train results in a new car being added, and we never get finished with this train and move on to the other trains? The answer is, unfortunately, that such a situation is possible. The problem arises if we have a large, cyclic, nongarbage structure, and the mutator manages to change references in such a way that we never see, at the time we collect a car, any references from higher trains in the remembered set. If even one object is removed from the train during the collection of a car, then we are OK, since no new objects are added to the first train, and therefore the first train will surely run out of objects eventually. However, there may be no garbage at all that we can collect at a stage, and we run the risk of a loop where we perpetually garbage collect only the current first train. To avoid this problem, we need to behave differently whenever we encounter a futile garbage collection, that is, a car from which not even one object can be deleted as garbage or moved to another train. In this "panic mode," we make two changes:

1. When a reference to an object in the first train is rewritten, we maintain the reference as a new member of the root set.

2. When garbage collecting, if an object in the first car has a reference from the root set, including dummy references set up by point (1), then we move that object to another train, even if it has no references from other trains. It is not important which train we move it to, as long as it is not the first train.

In this way, if there are any references from outside the first train to objects in the first train, these references are considered as we collect every car, and eventually some object will be removed from that train. We can then leave panic mode and proceed normally, sure that the current first train is now smaller than it was.

7.7.6 Exercises for Section 7.7

Exercise 7.7.1: Suppose that the network of objects from Fig. 7.20 is managed by an incremental algorithm that uses the four lists Unreached, Unscanned, Scanned, and Free, as in Baker's algorithm. To be specific, the Unscanned list is managed as a queue, and when more than one object is to be placed on this list due to the scanning of one object, we do so in alphabetical order. Suppose also that we use write barriers to assure that no reachable object is made garbage. Starting with A and B on the Unscanned list, suppose the following events occur:

i. A is scanned.


ii. The pointer A → D is rewritten to be A → H.

iii. B is scanned.

iv. D is scanned.

v. The pointer B → C is rewritten to be B → I.

Simulate the entire incremental garbage collection, assuming no more pointers are rewritten. Which objects are garbage? Which objects are placed on the Free list?

Exercise 7.7.2: Repeat Exercise 7.7.1 on the assumption that

a) Events (ii) and (v) are interchanged in order.

b) Events (ii) and (v) occur before (i), (iii), and (iv).

Exercise 7.7.3: Suppose the heap consists of exactly the nine cars on three trains shown in Fig. 7.30 (i.e., ignore the ellipses). Object o in car 11 has references from cars 12, 23, and 32. When we garbage collect car 11, where might o wind up?

Exercise 7.7.4: Repeat Exercise 7.7.3 for the cases that o has

a) Only references from cars 22 and 31.

b) No references other than from car 11.

Exercise 7.7.5: Suppose the heap consists of exactly the nine cars on three trains shown in Fig. 7.30 (i.e., ignore the ellipses). We are currently in panic mode. Object o1 in car 11 has only one reference, from object o2 in car 12. That reference is rewritten. When we garbage collect car 11, what could happen to o1?

7.8 Advanced Topics in Garbage Collection

We close our investigation of garbage collection with brief treatments of four additional topics:

1. Garbage collection in parallel environments.

2. Partial relocations of objects.

3. Garbage collection for languages that are not type-safe.

4. The interaction between programmer-controlled and automatic garbage collection.

7.8.1 Parallel and Concurrent Garbage Collection

Garbage collection becomes even more challenging when applied to applications running in parallel on a multiprocessor machine. It is not uncommon for server applications to have thousands of threads running at the same time; each of these threads is a mutator. Typically, the heap will consist of gigabytes of memory. Scalable garbage-collection algorithms must take advantage of the presence of multiple processors. We say a garbage collector is parallel if it uses multiple threads; it is concurrent if it runs simultaneously with the mutator.

We shall describe a parallel, and mostly concurrent, collector that uses a concurrent and parallel phase that does most of the tracing work, and then a stop-the-world phase that guarantees all the reachable objects are found and reclaims the storage. This algorithm introduces no new basic concepts in garbage collection per se; it shows how we can combine the ideas described so far to create a full solution to the parallel-and-concurrent collection problem. However, there are some new implementation issues that arise due to the nature of parallel execution. We shall discuss how this algorithm coordinates multiple threads in a parallel computation using a rather common work-queue model.

To understand the design of the algorithm we must keep in mind the scale of the problem. Even the root set of a parallel application is much larger, consisting of every thread's stack, register set and globally accessible variables. The amount of heap storage can be very large, and so is the amount of reachable data. The rate at which mutations take place is also much greater.

To reduce the pause time, we can adapt the basic ideas developed for incremental analysis to overlap garbage collection with mutation. Recall that an incremental analysis, as discussed in Section 7.7, performs the following three steps:

1. Find the root set. This step is normally performed atomically, that is, with the mutator(s) stopped.

2. Interleave the tracing of the reachable objects with the execution of the mutator(s). In this period, every time a mutator writes a reference that points from a Scanned object to an Unreached object, we remember that reference. As discussed in Section 7.7.2, we have options regarding the granularity with which these references are remembered. In this section, we shall assume the card-based scheme, where we divide the heap into sections called "cards" and maintain a bit map indicating which cards are dirty (have had one or more references within them rewritten).

3. Stop the mutator(s) again to rescan all the cards that may hold references to unreached objects.

For a large multithreaded application, the set of objects reached by the root set can be very large. It is infeasible to take the time and space to visit all such objects while all mutations cease. Also, due to the large heap and the large


number of mutation threads, many cards may need to be rescanned after all objects have been scanned once. It is thus advisable to scan some of these cards in parallel, while the mutators are allowed to continue to execute concurrently.

To implement the tracing of step (2) above, in parallel, we shall use multiple garbage-collecting threads concurrently with the mutator threads to trace most of the reachable objects. Then, to implement step (3), we stop the mutators and use parallel threads to ensure that all reachable objects are found.

The tracing of step (2) is carried out by having each mutator thread perform part of the garbage collection, along with its own work. In addition, we use threads that are dedicated purely to collecting garbage. Once garbage collection has been initiated, whenever a mutator thread performs some memory-allocation operation, it also performs some tracing computation. The pure garbage-collecting threads are put to use only when a machine has idle cycles. As in incremental analysis, whenever a mutator writes a reference that points from a Scanned object to an Unreached object, the card that holds this reference is marked dirty and needs to be rescanned.

Here is an outline of the parallel, concurrent garbage-collection algorithm.

1. Scan the root set for each mutator thread, and put all objects directly reachable from that thread into the Unscanned state. The simplest incremental approach to this step is to wait until a mutator thread calls the memory manager, and have it scan its own root set if that has not already been done. If some mutator thread has not called a memory allocation function, but all the rest of tracing is done, then this thread must be interrupted to have its root set scanned.

2. Scan objects that are in the Unscanned state. To support parallel computation, we use a work queue of fixed-size work packets, each of which holds a number of Unscanned objects. Unscanned objects are placed in work packets as they are discovered. Threads looking for work will dequeue these work packets and trace the Unscanned objects therein. This strategy allows the work to be spread evenly among workers in the tracing process. If the system runs out of space, and we cannot find the space to create these work packets, we simply mark the cards holding the objects to force them to be scanned. The latter is always possible because the bit array holding the marks for the cards has already been allocated.

3. Scan the objects in dirty cards. When there are no more Unscanned objects left in the work queue, and all threads' root sets have been scanned, the cards are rescanned for reachable objects. As long as the mutators continue to execute, dirty cards continue to be produced. Thus, we need to stop the tracing process using some criterion, such as allowing cards to be rescanned only once or a fixed number of times, or when the number of outstanding cards is reduced to some threshold. As a result, this parallel and concurrent step normally terminates before completing the trace, which is finished by the final step, below.


4. The final step guarantees that all reachable objects are marked as reached. With all the mutators stopped, the root sets for all the threads can now be found quickly using all the processors in the system. Because the reachability of most objects has been traced, only a small number of objects are expected to be placed in the Unscanned state. All the threads then participate in tracing the rest of the reachable objects and rescanning all the cards.

It is important that we control the rate at which tracing takes place. The tracing phase is like a race. The mutators create new objects and new references that must be scanned, and the tracing tries to scan all the reachable objects and rescan the dirty cards generated in the meanwhile. It is not desirable to start the tracing too much before a garbage collection is needed, because that will increase the amount of floating garbage. On the other hand, we cannot wait until the memory is exhausted before the tracing starts, because then mutators will not be able to make forward progress and the situation degenerates to that of a stop-the-world collector. Thus, the algorithm must choose the time to commence the collection and the rate of tracing appropriately. An estimate of the mutation rate from previous cycles of collection can be used to help in the decision. The tracing rate is dynamically adjusted to account for the work performed by the pure garbage-collecting threads.
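To make the work queue of step (2) concrete, here is a small Python sketch built on a thread-safe queue of fixed-size packets. The packet size is arbitrary, current_packet would really be kept per thread, and a production collector would manage termination far more carefully; the sketch only shows the shape of the idea and is not from the book.

import queue

PACKET_SIZE = 256
work_queue = queue.Queue()    # each entry is a list of up to PACKET_SIZE objects
current_packet = []           # in a real collector this would be per-thread

def publish(obj):
    # Called when an object enters the Unscanned state.
    global current_packet
    current_packet.append(obj)
    if len(current_packet) == PACKET_SIZE:
        work_queue.put(current_packet)
        current_packet = []

def tracing_worker(scan):
    # Run both by mutator threads (a little at a time) and by dedicated
    # garbage-collecting threads when the machine has idle cycles.
    while True:
        try:
            packet = work_queue.get_nowait()
        except queue.Empty:
            return
        for obj in packet:
            scan(obj)         # scanning may publish() more Unscanned objects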

7.8.2 Partial Object Relocation

As discussed starting in Section 7.6.4, copying or compacting collectors are

advantageous because they eliminate fragmentation. However, these collectors have nontrivial overheads. A compacting collector requires moving all objects and updating all the references at the end of garbage collection. A copying collector figures out where the reachable objects go as tracing proceeds; if tracing is performed incrementally, we need either to translate a mutator's every reference, or to move all the objects and update their references at the end. Both options are very expensive, especially for a large heap.

We can instead use a copying generational garbage collector. It is effective in collecting immature objects and reducing fragmentation, but can be expensive when collecting mature objects. We can use the train algorithm to limit the amount of mature data analyzed each time. However, the overhead of the train algorithm is sensitive to the size of the remembered set for each partition.

There is a hybrid collection scheme that uses concurrent tracing to reclaim all the unreachable objects and at the same time moves only a part of the objects. This method reduces fragmentation without incurring the full cost of relocation in each collection cycle.

1. Before tracing begins, choose a part of the heap that will be evacuated.

2. As the reachable objects are marked, also remember all the references

pointing to objects in the designated area.


3. When tracing is complete, sweep the storage in parallel to reclaim the space occupied by unreachable objects.

4. Finally, evacuate the reachable objects occupying the designated area and fix up the references to the evacuated objects.

7.8.3 Conservative Collection for Unsafe Languages

As discussed in Section 7.5.1, it is impossible to build a garbage collector that is

guaranteed to work for all C and C++ programs. Since we can always compute an address with arithmetic operations, no memory locations in C and C++ can ever be shown to be unreachable. However, many C or C++ programs never fabricate addresses in this way. It has been demonstrated that a conservative garbage collector - one that does not necessarily discard all garbage - can be built to work well in practice for this class of programs.

A conservative garbage collector assumes that we cannot fabricate an address, or derive the address of an allocated chunk of memory without an address pointing somewhere in the same chunk. We can find all the garbage in programs satisfying such an assumption by treating as a valid address any bit pattern found anywhere in reachable memory, as long as that bit pattern may be construed as a memory location. This scheme may classify some data erroneously as addresses. It is correct, however, since it only causes the collector to be conservative and keep more data than necessary.

Object relocation, requiring all references to the old locations be updated to point to the new locations, is incompatible with conservative garbage collection. Since a conservative garbage collector does not know if a particular bit pattern refers to an actual address, it cannot change these patterns to point to new addresses.

Here is how a conservative garbage collector works. First, the memory manager is modified to keep a data map of all the allocated chunks of memory. This map allows us to find easily the starting and ending boundary of the chunk of memory that spans a certain address. The tracing starts by scanning the program's root set to find any bit pattern that looks like a memory location, without worrying about its type. By looking up these potential addresses in the data map, we can find the starting addresses of those chunks of memory that might be reached, and place them in the Unscanned state. We then scan all the unscanned chunks, find more (presumably) reachable chunks of memory, and place them on the work list until the work list becomes empty. After tracing is done, we sweep through the heap storage using the data map to locate and free all the unreachable chunks of memory.
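The data map at the heart of this scheme can be sketched as follows; the use of Python's bisect module over sorted chunk start addresses is an implementation choice of this example, not something prescribed by the book.

import bisect

class DataMap:
    """Maps any value, treated as a potential address, to the allocated chunk
    that spans it, if one does."""
    def __init__(self):
        self.starts = []    # sorted chunk start addresses
        self.sizes = {}     # start address -> chunk size

    def add_chunk(self, start, size):
        bisect.insort(self.starts, start)
        self.sizes[start] = size

    def chunk_containing(self, value):
        # Return the start of the chunk spanning value, or None if value
        # points into no allocated chunk.
        i = bisect.bisect_right(self.starts, value) - 1
        if i >= 0 and self.starts[i] <= value < self.starts[i] + self.sizes[self.starts[i]]:
            return self.starts[i]
        return None

During tracing, every word found in a reachable chunk would be passed to chunk_containing; a non-None result identifies a chunk to be placed in the Unscanned state.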

7.8.4  Weak References

Sometimes, programmers use a language with garbage collection, but also wish to manage memory, or parts of memory, themselves. That is, a programmer may know that certain objects are never going to be accessed again, even though


references to the objects remain. An example from compiling will suggest the problem. Example 7.17 : We have seen that the lexical analyzer often manages a sym­

bol table by creating an object for each identifier it sees. These objects may appear as lexical values attached to leaves of the parse tree representing those identifiers, for instance. However, it is also useful to create a hash table, keyed by the identifier's string, to locate these objects. That table makes it easier for the lexical analyzer to find the object when it encounters a lexeme that is an identifier. When the compiler passes the scope of an identifier I, its symbol-table object no longer has any references from the parse tree, or probably any other intermediate structure used by the compiler. However, a reference to the object is still sitting in the hash table. Since the hash table is part of the root set of the compiler, the object cannot be garbage collected. If another identifier with the same lexeme as I is encountered, then it will be discovered that I is out of scope, and the reference to its object will be deleted. However, if no other identifier with this lexeme is encountered, then 1's object may remain as uncollectable, yet useless, throughout compilation. 0 If the problem suggested by Example 7.17 is important, then the compiler writer could arrange to delete from the hash table all references to objects as soon as their scope ends. However, a technique known as weak references allows the programmer to rely on automatic garbage collection, and yet not have the heap burdened with reachable, yet truly unused, objects. Such a system allows certain references to be declared "weak." An example would be all the references in the hash table we have been discussing. When the garbage collector scans an object, it does not follow weak references within that object, and does not make the objects they point to reachable. Of course, such an object may still be reachable if there is another reference to it that is not weak.

7.8.5  Exercises for Section 7.8

! Exercise 7.8.1: In Section 7.8.3 we suggested that it was possible to garbage

collect for C programs that do not fabricate expressions that point to a place within a chunk unless there is an address that points somewhere within that same chunk. Thus, we rule out code like

    p = 12345;
    x = *p;

because, while p might point to some chunk accidentally, there could be no other pointer to that chunk. On the other hand, with the code above, it is more likely that p points nowhere, and executing that code will result in a segmentation fault. However, in C it is possible to write code such that a variable like p is guaranteed to point to some chunk, and yet there is no pointer to that chunk. Write such a program.

7.9  Summary of Chapter 7

+ Run- Time Organization. To implement the abstractions embodied in the source language, a compiler creates and manages a run-time environment in concert with the operating system and the target machine. The run­ time environment has static data areas for the object code and the static data objects created at compile time. It also has dynamic stack and heap areas for managing objects created and destroyed as the target program executes. + Control Stack. Procedure calls and returns are usually managed by a run­ time stack called the control stack. We can use a stack because procedure calls or activations nest in time; that is, if p calls q, then this activation of q is nested within this activation of p. + Stack Allocation. Storage for local variables can allocated on a run-time stack for languages that allow or require local variables to become inacces­ sible when their procedures end. For such languages, each live activation has an activation record ( or frame) on the control stack, with the root of the activation tree at the bottom, and the entire sequence of activation records on the stack corresponding to the path in the activation tree to the activation where control currently resides. The latter activation has its record at the top of the stack. + Access to Nonlocal Data on the Stack. For languages like C that do not allow nested procedure declarations, the location for a variable is either global or found in the activation record on top of the run-time stack. For languages with nested procedures, we can access nonlocal data on the stack through access links, which are pointers added to each activation record. The desired nonlocal data is found by following a chain of access links to the appropriate activation record. A display is an auxiliary array, used in conjunction with access links, that provides an efficient short-cut alternative to a chain of access links. + Heap Management. The heap is the portion of the store that is used for data that can live indefinitely, or until the program deletes it explicitly. The memory manager allocates and deallocates space within the heap. Garbage collection finds spaces within the heap that are no longer in use and can therefore be reallocated to house other data items. For languages that require it, the garbage collector is an important subsystem of the memory manager. + Exploiting Locality. By making good use of the memory hierarchy, mem­ ory managers can influence the run time of a program. The time taken to access different parts of memory can vary from nanoseconds to millisec­ onds. Fortunately, most programs spend most of their time executing a relatively small fraction of the code and touching only a small fraction of


the data. A program has temporal locality if it is likely to access the same memory locations again soon; it has spatial locality if it is likely to access nearby memory locations soon.

.. Reducing Fragmentation. As the program allocates and deallocates mem­ ory, the heap may get fragmented, or broken into large numbers of small noncontiguous free spaces or holes. The best fit strategy - allocate the smallest available hole that satisfies a request - has been found empir­ ically to work well. While best fit tends to improve space utilization, it may not be best for spatial locality. Fragmentation can be reduced by combining or coalescing adjacent holes. .. Manual Deallocation. Manual memory management has two common failings: not deleting data that can not be referenced is a memory-leak error, and referencing deleted data is a dangling-pointer-dereference error . .. Reachability. Garbage is data that cannot be referenced or reached. There are two basic ways of finding unreachable objects: either catch the tran­ sition as a reachable object turns unreachable, or periodically locate all reachable objects and infer that all remaining objects are unreachable . .. Reference- Counting Collectors maintain a count of the references to an ob­ ject; when the count transitions to zero, the object becomes unreachable. Such collectors introduce the overhead of maintaining references and can fail to find "cyclic" garbage, which consists of unreachable objects that reference each other, perhaps through a chain of references . .. Trace-Based Garbage Collectors iteratively examine or trace all references to find reachable objects, starting with the root set consisting of objects that can be accessed directly without having to dereference any pointers . .. Mark-and-Sweep Collectors visit and mark all reachable objects in a first tracing step and then sweep the heap to free up unreachable objects . .. Mark-and- Compact Collectors improve upon mark-and-sweep; they relo­ cate reachable objects in the heap to eliminate memory fragmentation. .. Copying Collectors break the dependency between tracing and finding free space. They partition the memory into two semispaces, A and B . Allocation requests are satisfied from one semispace, say A, until it fills up, at which point the garbage collector takes over, copies the reachable objects to the other space, say B, and reverses the roles of the semispaces . .. Incremental Collectors. Simple trace-based collectors stop the user pro­ gram while garbage is collected. Incremental collectors interleave the actions of the garbage collector and the mutator or user program. The mutator can interfere with incremental reachability analysis, since it can

change the references within previously scanned objects. Incremental collectors therefore play it safe by overestimating the set of reachable objects; any "floating garbage" can be picked up in the next round of collection.

.. Partial Collectors also reduce pauses; they collect a subset of the garbage at a time. The best known of partial-collection algorithms, generational garbage collection, partitions objects according to how long they have been allocated and collects the newly created objects more often because they tend to have shorter lifetimes. An alternative algorithm, the train algorithm, uses fixed length partitions, called cars, that are collected into trains. Each collection step is applied to the first remaining car of the first remaining train. When a car is collected, reachable objects are moved out to other cars, so this car is left with garbage and can be removed from the train. These two algorithms can be used together to create a partial collector that applies the generational algorithm to younger objects and the train algorithm to more mature objects. 7. 1 0

7.10  References for Chapter 7

In mathematical logic, scope rules and parameter passing by substitution date back to Frege [8]. Church's lambda calculus [3] uses lexical scope; it has been used as a model for studying programming languages. Algol 60 and its succes­ sors, including C and Java, use lexical scope. Once introduced by the initial implementation of Lisp, dynamic scope became a feature of the language; Mc­ Carthy [14] gives the history. Many of the concepts related to stack allocation were stimulated by blocks and recursion in Algol 60. The idea of a display for accessing nonlocals in a lexically scoped language is due to Dijkstra [5] . A detailed description of stack allocation, the use of a display, and dynamic allocation of arrays appears in Randell and Russell [16] . Johnson and Ritchie [10] discuss the design of a calling sequence that allows the number of arguments of a procedure to vary from call to call. Garbage collection has been an active area of investigation; see for example Wilson [17] . Reference counting dates back to Collins [4] . Trace-based collection dates back to McCarthy [1 3] , who describes a mark-sweep algorithm for fixed­ length cells. The boundary-tag for managing free space was designed by Knuth in 1962 and published in [11] . Algorithm 7.14 is based on Baker [1] . Algorithm 7.16 is based on Cheney's [2] nonrecursive version of Fenichel and Yochelson's [7] copying collector. Incremental reachability analysis is explored by Dijkstra et al. [6] . Lieber­ man and Hewitt [12] present a generational collector as an extension of copying collection. The train algorithm began with Hudson and Moss [9] . 1 . Baker, H. G. Jr., "The treadmill: real-time garbage collection without motion sickness," A CM SIGPLAN Notices 27:3 (Mar., 1992) , pp. 66-70.


2. Cheney, C. J., "A nonrecursive list compacting algorithm," Comm. A CM 13:11 (Nov., 1970) , pp. 677-678. 3. Church, A., The Calculi of Lambda Conversion, Annals of Math. Studies, No. 6, Princeton University Press, Princeton, N. J., 1941 . 4. Collins, G. E., "A method for overlapping and erasure of lists," Comm. A CM 2:12 (Dec., 1960) , pp. 655-657. 5. Dijkstra, E. W., "Recursive programming," Numerische Math. 2 (1960) , pp. 312-318. 6. Dijkstra, E. W., L. Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens, "On-the-fly garbage collection: an exercise in cooperation," Comm. A CM 2 1 : 1 1 (1978) , pp. 966-975. 7.

Fenichel, R. R. and J. C. Yochelson, "A Lisp garbage-collector for virtual­ memory computer systems" , Comm. A CM 12:11 (1969) , pp. 61 1-612.

8. Frege, G., "Begriffsschrift, a formula language, modeled upon that of arithmetic, for pure thought," (1879) . In J. van Heijenoort, From Frege to Cadel, Harvard Univ. Press, Cambridge MA, 1967. 9. Hudson, R. L. and J. E. B. Moss, "Incremental Collection of Mature Objects" , Proc. Inti. Workshop on Memory Management, Lecture Notes In Computer Science 637 (1992) , pp. 388-403. 10. Johnson, S. C. and D. M. Ritchie, "The C language calling sequence," Computing Science Technical Report 102, Bell Laboratories, Murray Hill NJ , 1981 . 1 1 . Knuth, D . E., Art of Computer Programming, Volume Algorithms, Addison-Wesley, Boston MA, 1968.

1:

Fundamental

12. Lieberman, B . and C. Hewitt, "A real-time garbage collector based on the lifetimes of objects," Comm. A CM 26:6 (June 1983) , pp. 419-429. 13. McCarthy, J., "Recursive functions of symbolic expressions and their com­ putation by machine," Comm. A CM 3:4 (Apr., 1960) , pp. 184-195. 14. McCarthy, J., "History of Lisp." See pp. 173-185 in R. L. Wexelblat (ed.) , History of Programming Languages, Academic Press, New York, 1981. 15. Minsky, M., "A LISP garbage collector algorithm using secondary stor­ age," A. 1. Memo 58, MIT Project MAC , Cambridge MA, 1963. 16. Randell, B. and L . J. Russell, Algol New York, 1964.

60

Implementation, Academic Press,

17. Wilson, P. R., "Uniprocessor garbage collection techniques," ftp : //ftp . cs . utexas . edu/pub/garbage/bigsurv . ps

Chapter

8

Code Generation The final phase in our compiler model is the code generator. It takes as input the intermediate representation ( IR) produced by the front end of the com­ piler, along with relevant symbol table information, and produces as output a semantically equivalent target program, as shown in Fig. 8.l. The requirements imposed on a code generator are severe. The target pro­ gram must preserve the semantic meaning of the source program and be of high quality; that is, it must make effective use of the available resources of the target machine. Moreover, the code generator itself must run efficiently. The challenge is that, mathematically, the problem of generating an optimal target program for a given source program is undecidable; many of the subprob­ lems encountered iIi code generation such as register allocation are computa­ tionally intractable. In practice, we must be content with heuristic techniques that generate good, but not necessarily optimal, code. Fortunately, heuristics have matured enough that a carefully designed code generator can produce code that is several times faster than code produced by a naive one. Compilers that need to produce efficient target programs, include an op­ timization phase prior to code generation. The optimizer maps the IR into IR from which more efficient code can be generated. In general, the code­ optimization and code-generation phases of a compiler, often referred to as the back end, may make multiple passes over the IR before generating the target program. Code optimization is discussed in detail in Chapter 9. The tech­ niques presented in this chapter can be used whether or not an optimization phase occurs before code generation. A code generator has three primary tasks: instruction selection, register 1 - - - - - - - 1 1

Figure 8.1: Position of code generator


allqcation and assignment, and instruction ordering. The importance of these tasks is outlined in Section 8.1 . Instruction selection involves choosing appro­ priate target-machine instructions to implement the IR statements. Register allocation and assignment involves deciding what values to keep in which reg­ isters. Instruction ordering involves deciding in what order to schedule the execution of instructions. This chapter presents algorithms that code generators can use to trans­ late the IR into a sequence of target language instructions for simple register machines. The algorithms will be illustrated by using the machine model in Sec­ tion 8.2. Chapter 10 covers the problem of code generation for complex modern machines that support a great deal of parallelism within a single instruction. After discussing the broad issue$ in the design of a code generator, we show what kind of target code a compiler needs to generate to support the abstrac­ tions embodied in a typical source language. In Section 8.3, we outline imple­ mentations of static and stack allocation of data areas, and show how names in the IR can be converted into addresses in the target code. Many code generators partition IR instructions into "basic blocks," which consist of sequences of instructions that are always executed together. The partitioning of the IR into basic blocks is the subject of Section 8.4. The following section presents simple local transformations that can be used to transform basic blocks into modified basic bloGks from which more efficient code can be generated. These transformations are a rudimentary form of code optimization, although tlle deeper theory of code optimization will not be taken up until Chapter 9. A� example of a useful, local transformation is the discovery of cornmon sub expressions at the level of intermediate code and the resultant replacement of arithmetic operations by simpler copy operations. Section 8.6 presents a simple code-generation algorithm that generates code for each statement in turn, keeping operands in registers as long as possible. The output of this kind of code generator can be readily improved by peephole optimization techniques such as those discussed in the following Section 8.7. The remaining sections explore instruction selection and register allocation. 8.1

8.1  Issues in the Design of a Code Generator

While the details are dependent on the specifics of the intermediate represen­ tation, the target language, and the run-time system, tasks such as instruction selection, register allocation and assignment, and instruction ordering are en­ countered in the design of almost all co de generators. The most important criterion for a code generator is that it produce cor­ rect code. Correctness takes on special significance beca"ijse of the number of special cases that a code generator might face. Given the premium on correct­ ness, designing a code generator so it can be easily implemented, tested, and maintained is �n important design goal.

8.1.1  Input to the Code Generator

The input to the code generator is the intermediate representation of the source program produced by the front end, along with information in the symbol table that is used to determine the run-time addresses of the data objects denoted by the names in the lR. The many choices for the IR include three-address representations such as quadruples, triples, indirect triples; virtual machine representations such as bytecodes and stack-machine code; linear representations such as postfix no­ tation; and graphical representations such as syntax trees and DAG's. Many of the algorithms in this chapter are couched in terms of the representations considered in Chapter 6: three-address code, trees, and DAG's. The techniques we discuss can be applied, however, to the other intermediate representations as well. In this chapter, we assume that the front end has scanned, parsed, and translated the source program into a relatively low-level IR, so that the values of the names appearing in the IR can be represented by quantities that the target machine can directly manipulate, such as integers and floating-point numbers. We also assume that all syntactic and static semantic errors have been detected, that the necessary type checking has taken place, and that type­ conversion operators have been inserted wherever necessary. The code generator can therefore proceed on the assumption that its input is free of these kinds of errors. 8 . 1 .2

The Target Program

The instruction-set architecture of the target machine has a significant im­ pact on the difficulty of constructing a good code generator that produces high-quality machine code. The most common target-machine architectures are RISe (reduced instruction set computer) , elSe (complex instruction set computer) , and stack based. A RISe machine typically has many registers, three-address instructions, simple addressing modes, and a relatively simple instruction-set architecture. In contrast, a elSe machine typically has few registers, two-address instruc­ tions, a variety of addressing modes, several register classes, variable-length instructions, and instructions with side effects. In a stack-based machine, operations are done by pushing operands onto a stack and then performing the operations on the operands at the top of the stack. To achieve high performance the top of the stack is typically kept in registers. Stack-based machines almost disappeared because it was felt that the stack organization was too limiting and required too many swap and copy operations. However, stack-based architectures were revived with the introduction of the Java Virtual Machine (JVM) . The JVM is a software interpreter for Java bytecodes, an intermediate language produced by Java compilers. The inter-

508

CHAPTER 8. CODE GENERATION

preter provides software compatibility across multiple platforms, a major factor in the success of Java. To overcome the high performance penalty of interpretation, which can be on the order of a factor of 10, just-in-time (JIT) Java compilers have been created. These JIT compilers translate bytecodes during run time to the native hardware instruction set of the target machine. Another approach to improving Java performance is to build a compiler that compiles directly into the machine instructions of the target machine, bypassing the Java bytecodes entirely. Producing an absolute machine-language program as output has the ad­ vantage that it can be placed in a fixed location in memory and immediately executed. Programs can be compiled and executed quickly. Producing a relocatable machine-language program (often called an object module) as output allows subprograms to be compiled separately. A set of relocatable object modules can be linked together and loaded for execution by a linking loader. Although we must pay the added expense of linking and loading if we produce relocatable object modules, we gain a great deal of flexibility in being able to compile subroutines separately and to call other previously compiled programs from an object module. If the target machine does not handle relocation automatically, the compiler must provide explicit relocation information to the loader to link the separately compiled program modules. Producing an assembly-language program as output makes the process of code generation somewhat easier. We can generate symbolic instructions and use the macro facilities of the assembler to help generate code. The price paid is the assembly step after code generation. In this chapter, we shall use a very simple RISe-like computer as our target machine. We add to it some elSe-like addressing modes so that we can also discuss code-generation techniques for elSe machines. For readability, we use assembly code as the target language . As long as addresses can be calculated from offsets and other information stored in the symbol table, the code gener­ ator can produce relocatable or absolute addresses for names just as easily as symbolic addresses.

8.1.3  Instruction Selection

The code generator must map the IR program into a code sequence that can be executed by the target machine. The complexity of performing this mapping is determined by factors such as

• the level of the IR

• the nature of the instruction-set architecture

• the desired quality of the generated code.

If the IR is high level, the code generator may translate each IR statement into a sequence of machine instructions using code templates. Such statement­ by-statement code generation, however, often produces poor code that needs


further optimization. If the IR reflects some of the low-level details of the underlying machine, then the code generator can use this information to generate more efficient code sequences.

The nature of the instruction set of the target machine has a strong effect on the difficulty of instruction selection. For example, the uniformity and completeness of the instruction set are important factors. If the target machine does not support each data type in a uniform manner, then each exception to the general rule requires special handling. On some machines, for example, floating-point operations are done using separate registers. Instruction speeds and machine idioms are other important factors.

If we do not care about the efficiency of the target program, instruction selection is straightforward. For each type of three-address statement, we can design a code skeleton that defines the target code to be generated for that construct. For example, every three-address statement of the form x = y + z, where x, y, and z are statically allocated, can be translated into the code sequence

    LD  R0, y        // R0 = y         (load y into register R0)
    ADD R0, R0, z    // R0 = R0 + z    (add z to R0)
    ST  x, R0        // x = R0         (store R0 into x)

This strategy often produces redundant loads and stores. For example, the sequence of three-address statements

    a = b + c
    d = a + e

would be translated into

    LD  R0, b        // R0 = b
    ADD R0, R0, c    // R0 = R0 + c
    ST  a, R0        // a = R0
    LD  R0, a        // R0 = a
    ADD R0, R0, e    // R0 = R0 + e
    ST  d, R0        // d = R0

Here, the fourth statement is redundant since it loads a value that has just been stored, and so is the third if a is not subsequently used.

The quality of the generated code is usually determined by its speed and size. On most machines, a given IR program can be implemented by many different code sequences, with significant cost differences between the different implementations. A naive translation of the intermediate code may therefore lead to correct but unacceptably inefficient target code. For example, if the target machine has an "increment" instruction (INC), then the three-address statement a = a + 1 may be implemented more efficiently by the single instruction INC a, rather than by a more obvious sequence that loads a into a register, adds one to the register, and then stores the result back into a:

    LD  R0, a        // R0 = a
    ADD R0, R0, #1   // R0 = R0 + 1
    ST  a, R0        // a = R0
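Returning to the a = b + c; d = a + e example above, if a is not needed after these two statements, then dropping the redundant third and fourth instructions, as suggested there, might leave a sequence such as the following (a sketch only; the register choice is illustrative):

    LD  R0, b        // R0 = b
    ADD R0, R0, c    // R0 = b + c, the value of a
    ADD R0, R0, e    // R0 = a + e
    ST  d, R0        // d = R0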

We need to know instruction costs in order to design good code sequences but, unfortunately, accurate cost information is often difficult to obtain. De­ ciding which machine-code sequence is best for a given three-address construct may also require knowledge about the context in which that construct appears. In Section 8.9 we shall see that instruction selection can be modeled as a tree-pattern matching process in which we represent the IR and the machine instructions as trees. We then attempt to "tile" an IR tree with a set of sub­ trees that correspond to machine instructions. If we associate a cost with each machine-instruction subtree, we can use dynamic programming to generate op­ timal code sequences. Dynamic programming is discussed in Section 8.11.

8.1.4  Register Allocation

A key problem in code generation is deciding what values to hold in what registers. Registers are the fastest computational unit on the target machine, but we usually do not have enough of them to hold all values. Values not held in registers need to reside in memory. Instructions involving register operands are invariably shorter and faster than those involving operands in memory, so efficient utilization of registers is particularly important. The use of registers is often subdivided into two subproblems:

1. Register allocation, during which we select the set of variables that will reside in registers at each point in the program. 2. Register assignment, during which we pick the specific register that a variable will reside in. Finding an optimal assignment of registers to variables is difficult, even with single-register machines. Mathematically, the problem is NP-complete. The problem is further complicated because the hardware and / or the operating system of the target machine may require that certain register-usage conventions be observed. Example 8 . 1 : Certain machines require register-pairs ( an even and next odd­ numbered register ) for some operands and results. For example, on some ma­ chines, integer multiplication and integer division involve register pairs. The multiplication instruction is of the form

    M x, y

where x, the multiplicand, is the even register of an even / odd register pair and y, the multiplier, is the odd register. The product occupies the entire even / odd

register pair. The division instruction is of the form

    D x, y

where the dividend occupies an even/odd register pair whose even register is x; the divisor is y. After division, the even register holds the remainder and the odd register the quotient.

Now, consider the two three-address code sequences in Fig. 8.2 in which the only difference in (a) and (b) is the operator in the second statement. The shortest assembly-code sequences for (a) and (b) are given in Fig. 8.3.

    t = a + b        t = a + b
    t = t * c        t = t + c
    t = t / d        t = t / d
       (a)              (b)

Figure 8.2: Two three-address code sequences

    L   R1, a        L    R0, a
    A   R1, b        A    R0, b
    M   R0, c        A    R0, c
    D   R0, d        SRDA R0, 32
    ST  R1, t        D    R0, d
                     ST   R1, t
       (a)              (b)

Figure 8.3: Optimal machine-code sequences

Ri stands for register i. SRDA stands for Shift-Right-Double-Arithmetic, and SRDA R0, 32 shifts the dividend into R1 and clears R0 so that all its bits equal the sign bit. L, ST, and A stand for load, store, and add, respectively. Note that the optimal choice for the register into which a is to be loaded depends on what will ultimately happen to t. □

Strategies for register allocation and assignment are discussed in Section 8.8. Section 8.10 shows that for certain classes of machines we can construct code sequences that evaluate expressions using as few registers as possible.

8.1.5  Evaluation Order

The order in which computations are performeq can affect the efficiency of the target code. As we shall see, some computation orders require fewer registers to hold intermediate results than others. However, picking a best order in the general case is a difficult NP-complete problem. Initially, we shall avoid


the problem by generating code for the three-address statements in the order in which they have been produced by the intermediate code generator. In Chapter 10, we shall study code scheduling for pipelined machines that can execute several operations in a single clock cycle. 8.2

8.2  The Target Language

Familiarity with the target machine and its instruction set is a prerequisite for designing a good code generator. Unfortunately, in a general discussion of code generation it is not possible to describe any target machine in sufficient detail to generate good code for a complete language on that machine. In this chapter, we shall use as a target language assembly code for a simple computer that is representative of many register machines. However, the code­ generation techniques presented in this chapter can be used on many other classes of machines as well.

8.2.1  A Simple Target Machine Model

Our target computer models a three-address machine with load and store oper­ ations, computation operations, jump operations, and conditional jumps. The underlying computer is a byte-addressable machine with n general-purpose reg­ isters, RO, Rl, . . . , Rn - 1 . A full-fledged assembly language would have scores of instructions. To avoid hiding the concepts in a myriad of details, we shall use a very limited set of instructions and assume that all operands are integers. Most instructions consists of an operator, followed by a target, followed by a list of source operands. A label may precede an instruction. We assume the following kinds of instructions are available: •

• Load operations: The instruction LD dst, addr loads the value in location addr into location dst. This instruction denotes the assignment dst = addr. The most common form of this instruction is LD r, x, which loads the value in location x into register r. An instruction of the form LD r1, r2 is a register-to-register copy in which the contents of register r2 are copied into register r1.

• Store operations: The instruction ST x, r stores the value in register r into the location x. This instruction denotes the assignment x = r.



Computation operations of the form OP dst, srCl , srC2 , where OP is a op­ erator like ADD or SUB, and dst, srCl , and srC2 are locations, not necessarily distinct. The effect of this machine instruction is to apply the operation represented by OP to the values in locations srCI and srC2 , and place the result of this operation in location dst. For example, SUB rl , r2 , 1'3 com­ putes ri = r2 - 1'3 . Any value formerly stored in rl is lost, but if 1'1 is r2 or 1'3 , the old value is read first. Unary operators that take only one operand do not have a src2 .

• Unconditional jumps: The instruction BR L causes control to branch to the machine instruction with label L. (BR stands for branch.)

• Conditional jumps of the form Bcond r, L, where r is a register, L is a label, and cond stands for any of the common tests on values in the register r. For example, BLTZ r, L causes a jump to label L if the value in register r is less than zero, and allows control to pass to the next machine instruction if not.

We assume our target machine has a variety of addressing modes: •



In instructions, a location can be a variable name x referring to the mem­ ory location that is reserved for x (that is, the I-value of x) . A location can also be an indexed address of the form a ( r ) , where a is a variable and r is a register. The memory location denoted by a (r ) is computed by taking the I-value of a and adding to it the value in register r. For example, the instruction LD Rl , a (R2) has the effect of setting Rl = contents ( a + contents (R2) ) , where contents(x) denotes the contents of the register or memory location represented by x. This addressing mode is useful for accessing arrays, where a is the base address of the array (that is, the address of the first element ) , and r holds the number of bytes past that address we wish to go to reach one of the elements of arrCLY a.



A memory location can be an integer indexed by a register. For ex­ ample, LD Rl , 100 (R2) has the effect of setting Rl = contents (100 + contents (R2)) , that is, of loading into Rl the value in the memory loca­ tion obtained by adding 100 to the contents of register R2. This feature is useful for following pointers, as we shall see in the example below.



We also allow two indirect addressing modes: *r means the memory lo­ cation found in the location rep'resented by the contents of register r and * 100 (r ) means the memory location found in the location obtained by adding 100 to the contents of r. For example, LD Rl , * 100 (R2) has the effect of setting Rl = contents(contents (100 + contents(R2))) , that is, of loading into Rl the value in the memory location stored in the memory location obtained by adding 100 to the contents of register R2.



Finally, we allow an immediate constant addressing mode. The constant is prefixed by #. The instruction LD R1, #100 loads the integer 100 into register R1, and ADD R1, R1, #100 adds the integer 100 to register R1.

Comments at the end of instructions are preceded by //.

Example 8.2: The three-address statement x = y - z can be implemented by the machine instructions:

    LD  R1, y        // R1 = y
    LD  R2, z        // R2 = z
    SUB R1, R1, R2   // R1 = R1 - R2
    ST  x, R1        // x = R1

We can do better, perhaps. One of the goals of a good code-generation algorithm is to avoid using all four of these instructions, whenever possible. For example, y and/or z may have been computed in a register, and if so we can avoid the LD step(s). Likewise, we might be able to avoid ever storing x if its value is used within the register set and is not subsequently needed.

Suppose a is an array whose elements are 8-byte values, perhaps real numbers. Also assume elements of a are indexed starting at 0. We may execute the three-address instruction b = a[i] by the machine instructions:

    LD  R1, i        // R1 = i
    MUL R1, R1, 8    // R1 = R1 * 8
    LD  R2, a(R1)    // R2 = contents(a + contents(R1))
    ST  b, R2        // b = R2

R1 , c R2 , j R2 , R2 , 8 a (R2) , R1

II II II II

R1 = c R2 j R2 R2 * 8 content s ( a +

content s (R2) ) = R 1

To implement a simple pointer indirection, such as the three-address state­ ment x = *p, we can use machine instructions like: LD R1 , P LD R2 , O (R 1 ) S T x , R2

I I R1 = p II R2 = content s (O + content s (R1 ) ) II x = R2

The assignment through a pointer *p code by: LD R1 , P LD R2 , Y ST O (R 1 ) , R2

y is similarly implemented in machine

II R1 = p II R2 = y II content s ( O + content s (R1 ) ) = R2

Finally, consider a conditional-jump three-address instruction like if x < Y goto L

8.2. THE TARGET LANG UAGE

515

The machine-code equivalent would be something like: LD LD SUB BLTZ

R1 , R2 , R1 , R1 ,

x Y R1 , R2

M

II II II II

R1 = x R2 y R1 - R2 R1 if R1 < 0 j ump to M

Here, M is the label that represents the first machine instruction generated from the three-address instruction that has label 1. As for any three-address instruc­ tion, we hope that we can save some of these machine instructions because the needed operands are already in registers or because the result need never be stored. 0 8 . 2 .2

Program and Instruction Costs

We often associate a cost with compiling and running a program. Depending on what aspect of a program we are interested in optimizing, some common cost measures are the length of compilation time and the size, running time and power consumption of the target program. Determining the actual cost of compiling and running a program is a com­ plex problem. Finding an optimal target program for a given source program is an undecidable problem in general, and many of the subproblems involved are NP-hard. As we have indicated, in code generation we must often be content with heuristic techniques that produce good but not necessarily optimal target programs. For the remainder of this chapter, we shall assume each target-language instruction has an associated cost. For simplicity, we take the cost of an in­ struction to be one plus the costs associated with the addressing modes of the operands. This cost corresponds to the length in words of the instruction. Addressing modes involving registers have zero additional cost, while those in­ volving a memory location or constant in them have an additional cost of one, because such operands have to be stored in the words following the instruction. Some examples: •

The instruction LD RO , R1 copies the contents of register R1 into register RO . This instruction has a cost of one because no additional memory

words are required. •

The instruction LD RO , M loads the contents of memory location M into register RO. The cost is two since the address of memory location M is in the word following the instruction.



The instruction LD R1 , * 1 00 (R2) loads into register R1 the value given by contents( contents( 1 00 + contents(R2) ) ) . The cost is three because the constant 100 is stored in the word following the instruction.

516

CHAPTER 8 . CODE GENERATION

In this chapter we assume the cost of a target-language program on a given input is the sum of costs of the individual instructions executed when the pro­ gram is run on that input. Good code-generation algorithms seek to minimize the sum of the costs of the instructions executed by the generated target pro­ gram on typical inputs. We shall see that in some situations we can actually generate optimal code for expressions on certain classes of register machines. 8.2.3

Exercises for Section 8 . 2

Exercise 8 . 2 . 1 : Generate code for the following three-address statements as­ suming all variables are stored in memory locations.

a) x = 1 b) x = a c) x = a + 1 d) x = a + b e ) The two statements x = b * c = a + x

Y

Exercise 8.2.2 : Generate code for the following three-address statements as­ suming a and b are arrays whose elements are 4-byte values.

a) The four-statement sequence x = a [i] y = b [j ] a [i] y b [j ] = x

b ) The three-statement sequence x = a [i] b [i] y z = x * y

c ) The three-statement sequence x = a [i] y = b [x] a [i] = y

8.2. THE TARGET LANGUAGE

517

Exercise 8.2.3 : Generate code for the following three-address sequence as­ suming that p and q are in memory locations:

y = *q q = q + 4 *p = Y P = P + 4 Exercise 8.2.4 : Generate code for the following sequence assuming that x, y,

and

z

are in memory locations:

if x < Y goto L 1 = ° goto L2 L1 : z = 1 z

Exercise 8.2.5 : Generate code for the following sequence assuming hat n is

in a memory location:

= ° i = ° L 1 : if i > n goto L2 s = s + i i i + 1 goto L 1 L2 : s

=

Exercise 8.2.6 : Determine the costs of the following instruction sequences:

a)

b)

c)

d)

LD LD ADD 8T

RO , Y R1 , z RO , RO , R1 x , RO

LD MUL LD 8T

RO , i RO , RO , 8 R1 , a eRO) b , R1

LD LD MUL 8T

RO , c R1 , i R1 , R1 , 8 a (R1 ) , RO

LD RO , P LD R1 , O (RO) 8T x , R1

518

CHAPTER 8. CODE GENERATION

e)

LD RO , P LD R1 , x ST a eRO) , R1

f)

LD LD SUB BLTZ

8.3

RO , x R1 , Y RO , RO , R1 *R3 , RO

Addresses in the Target Code

In this section, we show how names in the IR can be converted into addresses in the target code by looking at code generation for simple procedure calls and returns using static and stack allocation. In Section 7.1, we described how each executing program runs in its own logical address space that was partitioned into four code and data areas:

1. A statically determined area Code that holds the executable target code. The size of the target code can be determined at compile time. 2. A statically determined data area Static for holding global constants and other data generated by the compiler. The size of the global constants and compiler data can also be determined at compile time. 3. A dynamically managed area Heap for holding data objects that are allo­ cated and freed during program execution. The size of the Heap cannot be determined at compile time. 4. A dynamically managed area Stack for holding activation records as they are created and destroyed during procedure calls and returns. Like the Heap, the size of the Stack cannot be determined at compile time.

8. 3 . 1

Static Allocation

To illustrate code generation for simplified procedure calls and returns, we shall focus on the following three-address statements: •

call callee



return



halt



act ion, which is a placeholder for other three-address statements.

The size and layout of activation records are determined by the code gener­ ator via the information about names stored in the symbol table. We shall first illustrate how to store the return address in an activation record on a procedure

8.3.

ADDRESSES IN THE TARGET CODE

519

call and how to return control to it after the procedure call. For convenience, we assume the first location in the activation holds the return address. Let us first consider the code needed to implement the simplest case, static allocation. Here, a call callee statement in the intermediate code can be im­ plemented by a sequence of two target-machine ihstructions: 8T BR

callee. staticArea , #here + 20 callee. codeA rea

The 8T instruction saves the return address at the beginning of the activation record for .callee, and the BR transfers control to the target code for the called procedure callee. The attribute before callee. staticArea is a constant that gives the address of the beginning of the activation record for callee, and the attribute callee. codeA rea is a constant referring to the address of the first instruction of the called procedure callee in the Code area of the run-time memory. The operand #here + 20 in the 8T instruction is the literal return address; it is the address of the instruction following the BR instruction. We assume that #here is the address of the current instruction and that the three constants plus the two instructions in the calling sequence have a length of 5 words or 20 bytes. The code for a procedure ends with a return to the calling procedure, except that the first procedure has no caller, so its final instruction is HALT, which returns control to the operating system. A return callee statement can be implemented by a simple jump instruction BR

* callee. staticA rea

which transfers control to the address saved at the beginning of the activation record for callee. Example 8 . 3 : Suppose we have the following three-address code:

act ionl call p act ion2 halt act ion3 return

/ / code for c

/ / code for p

Figure 8.4 shows the target program for this three-address code. We use the pseudoinstruction ACTION to represent the sequence of machine instructions to execute the statement act ion, which represents three-address code that is not relevant for this discussion. We arbitrarily start the code for procedure c at address 100 and for procedure p at address 200. We that assume each ACTION instruction takes 20 bytes. We further assume that the activation records for these procedures are statically allocated starting at locations 300 and 364; re­ spectively. The instructions starting at address 100 implement the statements

520

CHAPTER 8. CODE GENERATION act ionl ; call p ; act ion2 ; halt

of the first procedure c. Execution therefore starts with the instruction ACTION1 at address 100. The 8T instruction at address 120 saves the return address 140 in the machine-status field, which is the first word in the activation record of p. The BR instruction at address 132 transfers control the first instruction in the target code of the called procedure p.

100: ACTION1 120: 8T 364 , # 140 132: BR 200 140: ACTION2 160: HALT 200: 220:

ACTION3 BR *364

II II II II

code for c code for act i onl save return address 140 in location 364 call p

II return to operating system I I code for p II return to address saved in location 364

300: 304:

II 300-363 hold activation record for I I return address II local data for c

364: 368:

II 364-451 hold activation record for p II return address II local data for p

c

Figure 8.4: Target code for static allocation After executing ACTION3 , the jump instruction at location 220 is executed. Since location 140 was saved at address 364 by the call sequence above, *364 represents 140 when the BR statement at address 220 is executed. Therefore, when procedure p terminates, control returns to address 140 and execution of procedure c resumes. 0

8.3.2

Stack Allocation

Static allocation can become stack allocation by using relative addresses for storage in activation records. In stack allocation, however, the position of an activation record for a procedure is not known until run time. This position is usually stored in a register, so words in the activation record can be accessed as offsets from the value in this register. The indexed address mode of our target machine is convenient for this purpose. Relative addresses in an activation record can be taken as offsets from any known position in the activation record, as we saw in Chapter 7. For conve-

8.3. ADDRESSES IN THE TARGET CODE

521

nience, we shall use positive offsets by maintaining in a register SP a pointer to the beginning of the activation record on top of the stack. When a procedure call occurs, the calling procedure increments SP and transfers control to the called procedure. After control returns to the caller, we decrement SP , thereby deallocating the activation record of the called procedure. The code for the first procedure initializes the stack by setting SP to the start of the stack area in memory: LD

SP , #stackStart

code for the first procedure

HALT

/ / initialize the stack / / terminate execution

A procedure call sequence increments SP, saves the return address, and transfers control to the called procedure: ADD ST

BR

SP , SP , # caller. recordSize *SP , #here + 16 callee. codeA rea

/ / increment stack pointer / / save return address / / return to caller

The operand # caller. recordSize represents the size of an activation record, so the ADD instruction makes SP point to the next activation record. The operand #here + 16 in the ST instruction is the address of the instruction following BR; it is saved in the address pointed to by SP . The return sequence consists of two parts. The called procedure transfers control to the return address using BR *O (SP)

/ / return to caller

The reason for using *0 (SP) in the BR instruction is that we need two levels of indirection: 0 (SP) is the address of the first word in the activation record and * O ( SP ) is the return address saved there. The second part of the return sequence is in the caller, which decrements SP , thereby restoring SP to its previous value. That is, after the subtraction SP points to the beginning of the activation record of the caller: SUB

SP , SP , # caller. recordSize

/ / decrement stack pointer

Chapter 7 contains a broader discussion of calling sequences and the trade­ offs in the division of labor between the calling and called procedures. Example 8.4 : The program in Fig. 8.5 is an abstraction of the quicksort program in the previous chapter. Procedure q is recursive, so more than one activation of q can be alive at the same time. Suppose that the sizes of the activation records for procedures ffi, p, and q have been determined to be msize, psize, and qsize, respectively. The first word in each activation record will hold a return address. We arbitrarily assume that the code for these procedures starts at addresses 100, 200, and 300, respectively,

522

CHAPTER 8. CODE GENERATION act ion! call q act ion2 halt act ion3 return act ion4 call p act ion5 call q act ion6 call q return

/ / code for m

/ / code for p / / code for 0,

We say that x is also an induction variable, though not a basic one. Note that the formula above does not apply if i = 0. 4. In all other cases, f i (m) (x) = NAA. To find the effect of executing a fixed number of iterations, we simply replace

i above by that number. In the case where the number of iterations is unknown,

the value at the start of the last iteration is given by f* . In this case, the only variables whose values can still be expressed in the affine form are the loop­ invariant variables. f* (m) (v) =

{ m(v) NAA

if f (m) (v) = m(v) otherwise

Example 9 . 6 2 : For the innermost loop in Example 9.58, the effect of executing i iterations, i > 0, is summarized by From the definition of we see that a and b are symbolic constants, C is a basic induction variable as it is

f13 •

fB3 '

696

CHAPTER 9. MACHINE-INDEPENDENT OPTIMIZATIONS

1

incremented by one every iteration. d is an induction variable because it is an affine function the symbolic constant b and basic induction variable c. Thus,

fBi 3 (m ) (v ) =

m ( a) m ( b) m ( c) + i m ( b) + m ( c) + i

if v = a if v = b if v = c if v = d.

If we could not tell how many times the loop of block B3 iterated, then we could not use f i and would have to use f * to express the conditions at the end of the loop. In this case, we would have

1 ;��

m ( a)

fB3 (m ) (v ) =

NAA

if v = a if v = b if v = c if v = d.

o

A Region-Based Algorithm Algorithm INPUT:

9 . 63 :

Region-based symbolic analysis.

A reducible flow graph G.

OUTPUT:

Symbolic maps IN [B] for each block B of G.

METHOD :

We make the following modifications to Algorithm 9.53.

1 . We change . ho� we construct the transfer function for a loop region. In the original algorithm we use the fR,IN[S] transfer function to map the symbolic map at the entry of loop region R to a symbolic map at the entry of loop body S after executing an unknown number of iterations. It is defined to be the closure of the transfer function representing all paths leading back to the entry of the loop, as shown in Fig. 9.50 ( b ) . Here we define fR, i , IN [S] to represent the effect of execution from the start of the loop region to the entry of the ith iteration. Thus,

fR, i, IN [S] =

(

/\

predecessors B in R of the header of S

fS, Q UT[B] )

i- I

2. If the number of iterations of a region is known, the summary of the region is computed by replacing i with the actual count. 3. In the top-down pass, we compute fR, i , IN [B] to find the symbolic map associated with the entry of the ith iteration of a loop.

9.S.

697

SYMBOLIC ANALYSIS

4. In the case where the input value of a variable m ( v ) is used on the right­ hand-side of a symbolic map in region R, and m ( v ) = NAA upon entry to th�region, we introduce a new reference variable t, add assignment t = v to the beginning of region R, and all references of m ( v ) are replaced by t. If we did not introduce a reference variable at this point, the NAA value held by v would penetrate into inner loops. D

fR5 , j ,IN[B3 ] fR5,j , OUT[B3] fR6 ,IN[B2 ] fR6 ,IN[R5] fR6 , OUT [B4 ] fR7,i,IN[R6 ] fR7,i,OUT [B4] fRs ,IN[B1 ] fRs ,IN[R7 ] fRs ,OUT[B4]

fBj-3 1 f13 I fB2 10

fR5 ,lO,OUT [B3] 0 fB2

ft�OUT[B4 ] f�6 ,OUT[B4] I fBl fR7,lOO,OUT[B4] 0 fBl

Figure 9.62: Transfer fmiction relations in the bottom-up pass for Example 9.58. Example 9 . 64 : For Example 9.58, we show how the transfer functions for the program are computed in the bottom-up pass in Fig. 9.62. Region R5 is the inner loop, with body B5 . The transfer function representing the path from the entry of region R5 to the beginning of the jth iteration, j 2:: 1, is f1� 1 . The transfer function representing the path to the end of the jth iteration, j 2:: 1, I. S fBj 3 ' Region R6 consists of blocks B2 and B4 , with loop region R5 in the middle. The transfer functions from the entry of B2 and R5 can be computed in the same way as in the original algorithm. Transfer function fR6 , OUT[B3 ] represents the composition of block B2 and the entire execution of the inner loop, since fB4 is the identity function. Since the inner loop is known to iterate 10 times, we can replace j by 10 to summarize the effect of the inner loop precisely. The

698

CHAPTER 9. MACHINE-INDEPENDENT OPTIMIZATIONS f (m) (b) m(b) m(b)

f (m) (c) f (m) (d) m(c) + j - 1 NAA m(c) + j m(b) + m(c)+ j-1 m(b) m(c) m(a) m (d) fR6 ,IN[B2] m(d) m(a) + 1 10m(a) + 10 0 fR6 ,IN[R5] 1 10 10m(a) + 10 10m(a) + 9 m(a) + B fR6,OUTr 41 N AA NAA N AA m(a) + i I fR7, i ,IN[1l6] 10m(a)+ 10m(a) + 10i 10 fR7, i ,OUT[B4] m(a) + i 10i + 9 m(d) m(c) m(b) m(a) fRs ,IN[B1 ] m(d) m(b) 0 m (c) fRs ,IN[ R7] 1009 10 1000 100 fRs ,OUTfE41

f (m) (a) f m(a) fR5,j,IN[Bs] fR5,;,OUT[Bs] m(a)

Figure 9.63: Transfer functions cOnIputed in the bottom-up pass for Exam­ ple 9.58 rest of the transfer functions can be computed in a similar m&nner. The actual transfer functions computed are shown in Fig. 9.63. The symbolic map at the entry of the program is simply mNAA ' We use the top-down pass to compute the symbolic map to the entry to successively nested regions llntil we find all the symbolic maps for ev�ry basic block. We start by computing the data-flow values for block Bl in region R8 : IN[B1] = m NAA oUT[B1] = fBI (IN[Bl ]) Descending down to regions R7 and R6 , we get INi [B2] fR7, i ,IN[R6] (ouT[B1]) OUTi [B2] = fB2 (IN dB2])

Finally, in region R5 , we get 1Ni ,j [B3] OUTi ,j [B3]

fR5 ,j,IN [ B s] (OUTi [B2] ) fBs (INi ,j [B3])

Not surprisingly, these equations produce the results we showed in Fig. 9.58. o

Example 9.58 shows a simple program where every variable used in the symbolic map has an affine expression. We use Example 9.65 to illustrate why and how we introduce reference variables in Algorithm 9�63.


    for (i = 1; i < n; i++) {
        a = input();
        for (j = 1; j < 10; j++) {
            a = a - 1;
            b = j + a;
            a = a + 1;
        }
    }

(a) A loop where a fluctuates.

    for (i = 1; i < n; i++) {
        a = input();
        t = a;
        for (j = 1; j < 10; j++) {
            a = t - 1;
            b = t - 1 + j;
            a = t;
        }
    }

(b) A reference variable t makes b an induction variable.

Figure 9.64: The need to introduce reference variables

Example 9.65: Consider the simple example in Fig. 9.64(a). Let f_j be the transfer function summarizing the effect of executing j iterations of the inner loop. Even though the value of a may fluctuate during the execution of the loop, we see that b is an induction variable based on the value of a on entry of the loop; that is, f_j(m)(b) = m(a) - 1 + j. Because a is assigned an input value, the symbolic map upon entry to the inner loop maps a to NAA. We introduce a new reference variable t to save the value of a upon entry, and perform the substitutions as in Fig. 9.64(b). □

9.8.4

Exercises for Section 9.8

Exercise 9.8.1: For the flow graph of Fig. 9.10 (see the exercises for Section 9.1), give the transfer functions for

a) Block B2.
b) Block B4.
c) Block B5.


Exercise 9.8.2: Consider the inner loop of Fig. 9.10, consisting of blocks B3 and B4. If i represents the number of times around the loop, and f is the transfer function for the loop body (i.e., excluding the edge from B4 to B3) from the entry of the loop (i.e., the beginning of B3) to the exit from B4, then what is f^i? Remember that f takes as argument a map m, and m assigns a value to each of variables a, b, d, and e. We denote these values m(a), and so on, although we do not know their values.

! Exercise 9.8.3: Now consider the outer loop of Fig. 9.10, consisting of blocks B2, B3, B4, and B5. Let g be the transfer function for the loop body, from the entry of the loop at B2 to its exit at B5. Let i measure the number of iterations of the inner loop of B3 and B4 (which count of iterations we cannot know), and let j measure the number of iterations of the outer loop (which we also cannot know). What is g^j?

9.9 Summary of Chapter 9

• Global Common Subexpressions: An important optimization is finding computations of the same expression in two different basic blocks. If one precedes the other, we can store the result the first time it is computed and use the stored result on subsequent occurrences.

• Copy Propagation: A copy statement, u = v, assigns one variable v to another, u. In some circumstances, we can replace all uses of u by v, thus eliminating both the assignment and u.

• Code Motion: Another optimization is to move a computation outside the loop in which it appears. This change is only correct if the computation produces the same value each time around the loop.

• Induction Variables: Many loops have induction variables, variables that take on a linear sequence of values each time around the loop. Some of these are used only to count iterations, and they often can be eliminated, thus reducing the time it takes to go around the loop.

• Data-Flow Analysis: A data-flow analysis schema defines a value at each point in the program. Statements of the program have associated transfer functions that relate the value before the statement to the value after. Statements with more than one predecessor must have their value defined by combining the values at the predecessors, using a meet (or confluence) operator.

• Data-Flow Analysis on Basic Blocks: Because the propagation of data-flow values within a block is usually quite simple, data-flow equations are generally set up to have two variables for each block, called IN and OUT, that represent the data-flow values at the beginning and end of the block, respectively. The transfer functions for the statements in a block are composed to get the transfer function for the block as a whole.

• Reaching Definitions: The reaching-definitions data-flow framework has values that are sets of statements in the program that define values for one or more variables. The transfer function for a block kills definitions of variables that are definitely redefined in the block and adds ("generates") those definitions of variables that occur within the block. The confluence operator is union, since definitions reach a point if they reach any predecessor of that point.

• Live Variables: Another important data-flow framework computes the variables that are live (will be used before redefinition) at each point. The framework is similar to reaching definitions, except that the transfer function runs backward. A variable is live at the beginning of a block if it is either used before definition in the block or is live at the end of the block and not redefined in the block.

• Available Expressions: To discover global common subexpressions, we determine the available expressions at each point - expressions that have been computed and neither of the expression's arguments were redefined after the last computation. The data-flow framework is similar to reaching definitions, but the confluence operator is intersection rather than union.

• Abstraction of Data-Flow Problems: Common data-flow problems, such as those already mentioned, can be expressed in a common mathematical structure. The values are members of a semilattice, whose meet is the confluence operator. Transfer functions map lattice elements to lattice elements. The set of allowed transfer functions must be closed under composition and include the identity function.

• Monotone Frameworks: A semilattice has a ≤ relation defined by a ≤ b if and only if a ∧ b = a. Monotone frameworks have the property that each transfer function preserves the ≤ relationship; that is, a ≤ b implies f(a) ≤ f(b), for all lattice elements a and b and transfer functions f.

• Distributive Frameworks: These frameworks satisfy the condition that f(a ∧ b) = f(a) ∧ f(b), for all lattice elements a and b and transfer functions f. It can be shown that the distributive condition implies the monotone condition.

• Iterative Solution to Abstract Frameworks: All monotone data-flow frameworks can be solved by an iterative algorithm, in which the IN and OUT values for each block are initialized appropriately (depending on the framework), and new values for these variables are repeatedly computed by applying the transfer and confluence operations. This solution is always safe (optimizations that it suggests will not change what the program does), but the solution is certain to be the best possible only if the framework is distributive.

• The Constant Propagation Framework: While the basic frameworks such as reaching definitions are distributive, there are interesting monotone-but-not-distributive frameworks as well. One involves propagating constants by using a semilattice whose elements are mappings from the program variables to constants, plus two special values that represent "no information" and "definitely not a constant."

• Partial-Redundancy Elimination: Many useful optimizations, such as code motion and global common-subexpression elimination, can be generalized to a single problem called partial-redundancy elimination. Expressions that are needed, but are available along only some of the paths to a point, are computed only along the paths where they are not available. The correct application of this idea requires the solution to a sequence of four different data-flow problems plus other operations.

• Dominators: A node in a flow graph dominates another if every path to the latter must go through the former. A proper dominator is a dominator other than the node itself. Each node except the entry node has an immediate dominator - that one of its proper dominators that is dominated by all the other proper dominators.

• Depth-First Ordering of Flow Graphs: If we perform a depth-first search of a flow graph, starting at its entry, we produce a depth-first spanning tree. The depth-first order of the nodes is the reverse of a postorder traversal of this tree.

• Classification of Edges: When we construct a depth-first spanning tree, all the edges of the flow graph can be divided into three groups: advancing edges (those that go from ancestor to proper descendant), retreating edges (those from descendant to ancestor), and cross edges (others). An important property is that all the cross edges go from right to left in the tree. Another important property is that of these edges, only the retreating edges have a head lower than their tail in the depth-first order (reverse postorder).

• Back Edges: A back edge is one whose head dominates its tail. Every back edge is a retreating edge, regardless of which depth-first spanning tree for its flow graph is chosen.

• Reducible Flow Graphs: If every retreating edge is a back edge, regardless of which depth-first spanning tree is chosen, then the flow graph is said to be reducible. The vast majority of flow graphs are reducible; those whose only control-flow statements are the usual loop-forming and branching statements are certainly reducible.


• Natural Loops: A natural loop is a set of nodes with a header node that dominates all the nodes in the set and has at least one back edge entering that node. Given any back edge, we can construct its natural loop by taking the head of the edge plus all nodes that can reach the tail of the edge without going through the head. Two natural loops with different headers are either disjoint or one is completely contained in the other; this fact lets us talk about a hierarchy of nested loops, as long as "loops" are taken to be natural loops.

• Depth-First Order Makes the Iterative Algorithm Efficient: The iterative algorithm requires few passes, as long as propagation of information along acyclic paths is sufficient; i.e., cycles add nothing. If we visit nodes in depth-first order, any data-flow framework that propagates information forward, e.g., reaching definitions, will converge in no more than 2 plus the largest number of retreating edges on any acyclic path. The same holds for backward-propagating frameworks, like live variables, if we visit in the reverse of depth-first order (i.e., in postorder).

• Regions: Regions are sets of nodes and edges with a header h that dominates all nodes in the region. The predecessors of any node other than h in the region must also be in the region. The edges of the region are all that go between nodes of the region, with the possible exception of some or all that enter the header.

• Regions and Reducible Flow Graphs: Reducible flow graphs can be parsed into a hierarchy of regions. These regions are either loop regions, which include all the edges into the header, or body regions that have no edges into the header.

• Region-Based Data-Flow Analysis: An alternative to the iterative approach to data-flow analysis is to work up and down the region hierarchy, computing transfer functions from the header of each region to each node in that region.

• Region-Based Induction Variable Detection: An important application of region-based analysis is in a data-flow framework that tries to compute formulas for each variable in a loop region whose value is an affine (linear) function of the number of times around the loop.

9.10 References for Chapter 9

Two early compilers that did extensive code optimization are Alpha [7] and Fortran H [16] . The fundamental treatise on techniques for loop optimization (e.g. , code motion) is [1] , although earlier versions of some of these ideas appear in [8] . An informally distributed book [4] was influential in disseminating code­ optimization ideas.


The first description of the iterative algorithm for data-flow analysis is from the unpublished technical report of Vyssotsky and Wegner [20]. The scientific study of data-flow analysis is said to begin with a pair of papers by Allen [2] and Cocke [3].

The lattice-theoretic abstraction described here is based on the work of Kildall [13]. These frameworks assumed distributivity, which many frameworks do not satisfy. After a number of such frameworks came to light, the monotonicity condition was embedded in the model by [5] and [11].

Partial-redundancy elimination was pioneered by [17]. The lazy-code-motion algorithm described in this chapter is based on [14].

Dominators were first used in the compiler described in [13]. However, the idea dates back to [18].

The notion of reducible flow graphs comes from [2]. The structure of these flow graphs, as presented here, is from [9] and [10]. [12] and [15] first connected reducibility of flow graphs to the common nested control-flow structures, which explains why this class of flow graphs is so common.

The definition of reducibility by T1-T2 reduction, as used in region-based analysis, is from [19]. The region-based approach was first used in a compiler described in [21].

The static single-assignment (SSA) form of intermediate representation introduced in Section 6.1 incorporates both data flow and control flow into its representation. SSA facilitates the implementation of many optimizing transformations from a common framework [6].

1. Allen, F. E., "Program optimization," Annual Review in Automatic Programming 5 (1969), pp. 239-307.

2. Allen, F. E., "Control flow analysis," ACM Sigplan Notices 5:7 (1970), pp. 1-19.

3. Cocke, J., "Global common subexpression elimination," ACM SIGPLAN Notices 5:7 (1970), pp. 20-24.

4. Cocke, J. and J. T. Schwartz, Programming Languages and Their Compilers: Preliminary Notes, Courant Institute of Mathematical Sciences, New York Univ., New York, 1970.

5. Cousot, P. and R. Cousot, "Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints," Fourth ACM Symposium on Principles of Programming Languages (1977), pp. 238-252.

6. Cytron, R., J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, "Efficiently computing static single assignment form and the control dependence graph," ACM Transactions on Programming Languages and Systems 13:4 (1991), pp. 451-490.


7. Ershov, A. P., "Alpha - an automatic programming system of high efficiency," J. ACM 13:1 (1966), pp. 17-24.

8. Gear, C. W., "High speed compilation of efficient object code," Comm. ACM 8:8 (1965), pp. 483-488.

9. Hecht, M. S. and J. D. Ullman, "Flow graph reducibility," SIAM J. Computing 1 (1972), pp. 188-202.

10. Hecht, M. S. and J. D. Ullman, "Characterizations of reducible flow graphs," J. ACM 21 (1974), pp. 367-375.

11. Kam, J. B. and J. D. Ullman, "Monotone data flow analysis frameworks," Acta Informatica 7:3 (1977), pp. 305-318.

12. Kasami, T., W. W. Peterson, and N. Tokura, "On the capabilities of while, repeat, and exit statements," Comm. ACM 16:8 (1973), pp. 503-512.

13. Kildall, G., "A unified approach to global program optimization," ACM Symposium on Principles of Programming Languages (1973), pp. 194-206.

14. Knoop, J., "Lazy code motion," Proc. ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, pp. 224-234.

15. Kosaraju, S. R., "Analysis of structured programs," J. Computer and System Sciences 9:3 (1974), pp. 232-255.

16. Lowry, E. S. and C. W. Medlock, "Object code optimization," Comm. ACM 12:1 (1969), pp. 13-22.

17. Morel, E. and C. Renvoise, "Global optimization by suppression of partial redundancies," Comm. ACM 22 (1979), pp. 96-103.

18. Prosser, R. T., "Application of boolean matrices to the analysis of flow diagrams," AFIPS Eastern Joint Computer Conference (1959), Spartan Books, Baltimore MD, pp. 133-138.

19. Ullman, J. D., "Fast algorithms for the elimination of common subexpressions," Acta Informatica 2 (1973), pp. 191-213.

20. Vyssotsky, V. and P. Wegner, "A graph theoretical Fortran source language analyzer," unpublished technical report, Bell Laboratories, Murray Hill NJ, 1963.

21. Wulf, W. A., R. K. Johnson, C. B. Weinstock, S. O. Hobbs, and C. M. Geschke, The Design of an Optimizing Compiler, Elsevier, New York, 1975.

Chapter 10

Instruction-Level Parallelism

Every modern high-performance processor can execute several operations in a single clock cycle. The "billion-dollar question" is how fast can a program be run on a processor with instruction-level parallelism? The answer depends on:

1. The potential parallelism in the program.

2. The available parallelism on the processor.

3. Our ability to extract parallelism from the original sequential program.

4. Our ability to find the best parallel schedule given scheduling constraints.

If all the operations in a program are highly dependent upon one another, then no amount of hardware or parallelization techniques can make the program run fast in parallel. There has been a lot of research on understanding the limits of parallelization. Typical nonnumeric applications have many inherent dependences. For example, these programs have many data-dependent branches that make it hard even to predict which instructions are to be executed, let alone decide which operations can be executed in parallel. Therefore, work in this area has focused on relaxing the scheduling constraints, including the introduction of new architectural features, rather than the scheduling techniques themselves.

Numeric applications, such as scientific computing and signal processing, tend to have more parallelism. These applications deal with large aggregate data structures; operations on distinct elements of the structure are often independent of one another and can be executed in parallel. Additional hardware resources can take advantage of such parallelism and are provided in high-performance, general-purpose machines and digital signal processors. These programs tend to have simple control structures and regular data-access patterns, and static techniques have been developed to extract the available parallelism from these programs. Code scheduling for such applications is interesting


and significant, as they offer a large number of independent operations to be mapped onto a large number of resources.

Both parallelism extraction and scheduling for parallel execution can be performed either statically in software, or dynamically in hardware. In fact, even machines with hardware scheduling can be aided by software scheduling. This chapter starts by explaining the fundamental issues in using instruction-level parallelism, which is the same regardless of whether the parallelism is managed by software or hardware. We then motivate the basic data-dependence analyses needed for the extraction of parallelism. These analyses are useful for many optimizations other than instruction-level parallelism as we shall see in Chapter 11.

Finally, we present the basic ideas in code scheduling. We describe a technique for scheduling basic blocks, a method for handling highly data-dependent control flow found in general-purpose programs, and finally a technique called software pipelining that is used primarily for scheduling numeric programs.

10.1 Processor Architectures

When we think of instruction-level parallelism, we usually imagine a processor issuing several operations in a single clock cycle. In fact, it is possible for a machine to issue just one operation per clock and yet achieve instruction-level parallelism using the concept of pipelining. In the following, we shall first explain pipelining and then discuss multiple-instruction issue.

10.1.1 Instruction Pipelines and Branch Delays

Practically every processor, be it a high-performance supercomputer or a stan­ dard machine, uses an instruction pipeline. With an instruction pipeline, a new instruction can be fetched every clock while preceding instructions are still going through the pipeline. Shown in Fig. 10.1 is a simple 5-stage instruction pipeline: it first fetches the instruction (IF) , decodes it (ID), executes the op­ eration (EX) , accesses the memory (MEM) , and writes back the result (WB). The figure shows how instructions i , i + 1, i + 2, i + 3, and i + 4 can execute at the same time. Each row corresponds to a clock tick, and each column in the figure specifies the stage each instruction occupies at each clock tick. If the result from an instruction is available by the time the succeeding in­ struction needs the data, the processor can issue an instruction every clock. Branch instructions are especially problematic because until they are fetched, decoded and executed, the processor does not know which instruction will ex­ ecute next. Many processors speculatively fetch and decode the immediately succeeding instructions in case a branch is not taken. When a branch is found to be taken, the instruction pipeline is emptied and the branch target is fetched. 1 We shall refer to a clock "tick" or clock cycle simply as a "clock," when the intent is clear.

            i      i+1    i+2    i+3    i+4
    1.      IF
    2.      ID     IF
    3.      EX     ID     IF
    4.      MEM    EX     ID     IF
    5.      WB     MEM    EX     ID     IF
    6.             WB     MEM    EX     ID
    7.                    WB     MEM    EX
    8.                           WB     MEM
    9.                                  WB

Figure 10.1: Five consecutive instructions in a 5-stage instruction pipeline

Thus, taken branches introduce a delay in the fetch of the branch target and introduce "hiccups" in the instruction pipeline. Advanced processors use hardware to predict the outcomes of branches based on their execution history and to prefetch from the predicted target locations. Branch delays are nonetheless observed if branches are mispredicted.

10.1.2 Pipelined Execution

Some instructions take several clocks to execute. One common example is the memory-load operation. Even when a memory access hits in the cache, it usu­ ally takes several clocks for the cache to return the data. We say that the execution of an instruction is pipelined if succeeding instructions not dependent on the result are allowed to proceed. Thus, even if a processor can issue only one operation per clock, several operations might be in their execution stages at the same time. If the deepest execution pipeline has n stages, potentially n operations can be "in flight" at the same time. Note that not all instruc­ tions are fully pipelined. While floating-point adds and multiplies often are fully pipelined, floating-point divides, being more complex and less frequently executed, often are not. Most general-purpose processors dynamically detect dependences between consecutive instructions and automatically stall the execution of instructions if their operands are not available. Some processors, especially those embedded in hand-held devices, leave the dependence checking to the software in order to keep the hardware simple and power consumption low. In this case, the compiler is responsible for inserting "no-op" instructions in the code if necessary to assure that the results are available when needed.

10.1.3 Multiple Instruction Issue

By issuing several operations per clock, processors can keep even more opera­ tions in flight. The largest number of operations that can be executed simul­ taneously can be computed by multiplying the instruction issue width by the average number of stages in the execution pipeline. Like pipelining, parallelism on multiple-issue machines can be managed ei­ ther by software or hardware. Machines that rely on software to manage their parallelism are known as VLIW ( Very-Long-Instruction-Word ) machines, while those that manage their parallelism with hardware are known as superscalar machines. VLIW machines, as their name implies, have wider than normal instruction words that encode the operations to be issued in a single clock. The compiler decides which operations are to be issued in parallel and encodes the information in the machine code explicitly. Superscalar machines, on the other hand, have a regular instruction set with an ordinary sequential-execution semantics. Superscalar machines automatically detect dependences among in­ structions and issue them as their operands become available. Some processors include both VLIW and superscalar functionality. Simple hardware schedulers execute instructions in the order in which they are fetched. If a scheduler comes across a dependent instruction, it and all instructions that follow must wait until the dependences are resolved (Le. , the needed results are available) . Such machines obviously can benefit from having a static scheduler that places independent operations next to each other in the order of execution. More sophisticated schedulers can execute instructions "out of order." Op­ erations are independently stalled and not allowed to execute until all the values they depend on have been produced. Even these schedulers benefit from static scheduling, because hardware schedulers have only a limited space in which to buffer operations that must be stalled. Static scheduling can place independent operations close together to allow better hardware utilization. More impor­ tantly, regardless how sophisticated a dynamic scheduler is, it cannot execute instructions it has not fetched. When the processor has to take an unexpected branch, it can only find parallelism among the newly fetched instructions. The compiler can enhance the performance of the dynamic scheduler by ensuring that these newly fetched instructions can execute in parallel. 10.2

Code-Scheduling Constraints

Code scheduling is a form of program optimization that applies to the machine code that is produced by the code generator. Code scheduling is subject to three kinds of constraints: 1. Control-dependence constraints. All the operations executed in the origi­ nal program must be executed in the optimized one.


2. Data-dependence constraints. The operations in the optimized program must produce the same results as the corresponding ones in the original program. 3. Resource constraints. The schedule must not oversubscribe the resources on the machine. These scheduling constraints guarantee that the optimized program pro­ duces the same results as the original. However, because code scheduling changes the order in which the operations execute, the state of the memory at any one point may not match any of the memory states in a sequential ex­ ecution. This situation is a problem if a program's execution is interrupted by, for example, a thrown exception or a user-inserted breakpoint. Optimized programs are therefore harder to debug. Note that this problem is not specific to code scheduling but applies to all other optimizations, including partial­ redundancy elimination ( Section 9.5) and register allocation ( Section 8.8) .

10.2.1 Data Dependence

It is easy to see that if we change the execution order of two operations that do not touch any of the same variables, we cannot possibly affect their results. In fact, even if these two operations read the same variable, we can still permute their execution. Only if an operation writes to a variable read or written by another can changing their execution order alter their results. Such pairs of operations are said to share a data dependence, and their relative execution order must be preserved. There are three flavors of data dependence:

1. True dependence: read after write. If a write is followed by a read of the same location, the read depends on the value written; such a dependence is known as a true dependence.

2. Antidependence: write after read. If a read is followed by a write to the same location, we say that there is an antidependence from the read to the write. The write does not depend on the read per se, but if the write happens before the read, then the read operation will pick up the wrong value. Antidependence is a byproduct of imperative programming, where the same memory locations are used to store different values. It is not a "true" dependence and potentially can be eliminated by storing the values in different locations.

3. Output dependence: write after write. Two writes to the same location share an output dependence. If the dependence is violated, the memory location will have the wrong value after both operations are performed.

Antidependence and output dependences are referred to as storage-related dependences. These are not "true" dependences and can be eliminated by using


different locations to store different values. Note that data dependences apply to both memory accesses and register accesses.
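To make the three flavors concrete, here is a small C fragment of our own (the variable names are invented for illustration, not taken from the book); the comments mark one dependence of each kind and show how renaming the reused storage removes the two storage-related ones:

    #include <stdio.h>

    int main(void) {
        int x = 1, y = 2, z = 3;

        int a = x + y;     /* S1                                              */
        int b = a * 2;     /* S2: true dependence S1 -> S2 (S2 reads a)       */
        x = 7;             /* S3: antidependence S1 -> S3 (S1 reads x first)  */
        a = z + 1;         /* S4: output dependence S1 -> S4 (both write a)   */

        /* Renaming the reused storage removes the storage-related
           dependences, leaving only the true dependence S1 -> S2:            */
        int x2 = 7;        /* S3': writes a fresh location instead of x       */
        int a2 = z + 1;    /* S4': writes a fresh location instead of a       */

        printf("%d %d %d %d\n", b, x2, a2, a);
        return 0;
    }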

10.2.2 Finding Dependences Among Memory Accesses

To check if two memory accesses share a data dependence, we only need to tell if they can refer to the same location; we do not need to know which location is being accessed. For example, we can tell that the two accesses *p and *(p+4) cannot refer to the same location, even though we may not know what p points to. Data dependence is generally undecidable at compile time. The compiler must assume that operations may refer to the same location unless it can prove otherwise.

Example 10.1: Given the code sequence

    1) a = 1;
    2) *p = 2;
    3) x = a;

unless the compiler knows that p cannot possibly point to a, it must conclude that the three operations need to execute serially. There is an output dependence flowing from statement (1) to statement (2), and there are two true dependences flowing from statements (1) and (2) to statement (3). □

Data-dependence analysis is highly sensitive to the programming language used in the program. For type-unsafe languages like C and C++, where a pointer can be cast to point to any kind of object, sophisticated analysis is necessary to prove independence between any pair of pointer-based memory accesses. Even local or global scalar variables can be accessed indirectly unless we can prove that their addresses have not been stored anywhere by any instruction in the program. In type-safe languages like Java, objects of different types are necessarily distinct from each other. Similarly, local primitive variables on the stack cannot be aliased with accesses through other names.

A correct discovery of data dependences requires a number of different forms of analysis. We shall focus on the major questions that must be resolved if the compiler is to detect all the dependences that exist in a program, and how to use this information in code scheduling. Later chapters show how these analyses are performed.
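As a tiny illustration of the point about indirect access to scalars (our own example, not the book's), a compiler that cannot prove *p and the global g are distinct must treat the store through p as a possible write to g:

    int g;

    int touch(int *p) {
        g = 1;       /* write g directly                                    */
        *p = 2;      /* may also write g if the caller passed &g            */
        return g;    /* cannot be folded to 1 unless *p is known not to
                        alias g; the compiler must reload g                  */
    }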

Array Data-Dependence Analysis

Array data dependence is the problem of disambiguating between the values of indexes in array-element accesses. For example, the loop

    for (i = 0; i < n; i++)
        A[2*i] = A[2*i+1];


copies odd elements in the array A to the even elements just preceding them. Because all the read and written locations in the loop are distinct from each other, there are no dependences between the accesses, and all the iterations in the loop can execute in parallel. Array data-dependence analysis, often referred to simply as data-dependence analysis, is very important for the optimization of numerical applications. This topic will be discussed in detail in Section 11.6.
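As a quick, brute-force sanity check of this claim (our own sketch; a real compiler reasons about the index expressions symbolically, as Section 11.6 explains), one can verify that no write index 2*i ever equals a read index 2*j+1:

    #include <stdio.h>

    /* 2*i is always even and 2*j+1 is always odd, so the write A[2*i]
       can never touch the element read as A[2*j+1], for any i and j.   */
    int main(void) {
        int n = 100, collisions = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (2*i == 2*j + 1)
                    collisions++;
        printf("collisions between writes and reads: %d\n", collisions);
        return 0;
    }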

Pointer-Alias Analysis

We say that two pointers are aliased if they can refer to the same object. Pointer-alias analysis is difficult because there are many potentially aliased pointers in a program, and they can each point to an unbounded number of dynamic objects over time. To get any precision, pointer-alias analysis must be applied across all the functions in a program. This topic is discussed starting in Section 12.4.

Interprocedural Analysis

For languages that pass parameters by reference, interprocedural analysis is needed to determine if the same variable is passed as two or more different arguments. Such aliases can create dependences between seemingly distinct parameters. Similarly, global variables can be used as parameters and thus create dependences between parameter accesses and global variable accesses. Interprocedural analysis, discussed in Chapter 12, is necessary to determine these aliases.

10.2.3

Tradeoff Between Register Usage and Parallelism

In this chapter we shall assume that the machine-independent intermediate rep­ resentation of the source program uses an unbounded number of pseudoregisters to represent variables that can be allocated to registers. These variables include scalar variables in the source program that cannot be referred to by any other names, as well as temporary variables that are generated by the compiler to hold the partial results in expressions. Unlike memory locations, registers are uniquely named. Thus precise data-dependence constraints can be generated for register accesses easily. The unbounded number of pseudoregisters used in the intermediate repre­ sentation must eventually be mapped to the small number of physical registers available on the target machine. Mapping several pseudoregisters to the same physical register has the unfortunate side effect of creating artificial storage dependences that constrain instruction-level parallelism. Conversely, executing instructions in parallel creates the need for more storage to hold the values being computed simultaneously. Thus, the goal of minimizing the number of registers used conflicts directly with the goal of maximizing instruction-level parallelism. Examples 10. 2 and 10.3 below illustrate this classic trade-off between storage and parallelism.


Hardware Register Renaming Instruction-level parallelism was first used in computer architectures as a means to speed up ordinary sequential machine code. Compilers at the time were not aware of the instruction-level parallelism in the machine and were designed to optimize the use of registers. They deliberately reordered instructions to minimize the number of registers used, and as a result, also minimized the amount of parallelism available. Example 10.3 illustrates how minimizing register usage in the computation of expression trees also limits its parallelism. There was so little parallelism left in the sequential code that com­ puter architects invented the concept of hardware register renaming to undo the effects of register optimization in compilers. Hardware register renaming dynamically changes the assignment of registers as the program runs. It interprets the machine code, stores values intended for the same register in different internal registers, and updates all their uses to refer to the right registers accordingly. Since the artificial register-dependence constraints were introduced by the compiler in the first place, they can be eliminated by using a register-allocation algorithm that is cognizant of instruction-level paral­ lelism. Hardware register renaming is still useful in the case when a ma­ chine's instruction set can only refer to a small number of registers. This capability allows an implementation of the architecture to map the small number of architectural registers in the code to a much larger number of internal registers dynamically.

Example 1 0 . 2 : The code below copies the values of variables in locations a and c to variables in locations b and d, respectively, using pseudoregisters t 1 and t2.

    LD t1, a      // t1 = a
    ST b, t1      // b = t1
    LD t2, c      // t2 = c
    ST d, t2      // d = t2

If all the memory locations accessed are known to be distinct from each other, then the copies can proceed in parallel. However, if t1 and t2 are assigned the same register so as to minimize the number of registers used, the copies are necessarily serialized. □

Example 1 0 . 3 : Traditional register-allocation techniques aim to minimize the number of registers used when performing a computation. Consider the expression


(a + b) + c + (d + e)

                +
              /   \
             +     +
            / \   / \
           +   c d   e
          / \
         a   b

Figure 10.2: Expression tree in Example 10.3

shown as a syntax tree in Fig. 10.2. It is possible to perform this computation using three registers, as illustrated by the machine code in Fig. 10.3.

    LD  r1, a         // r1 = a
    LD  r2, b         // r2 = b
    ADD r1, r1, r2    // r1 = r1 + r2
    LD  r2, c         // r2 = c
    ADD r1, r1, r2    // r1 = r1 + r2
    LD  r2, d         // r2 = d
    LD  r3, e         // r3 = e
    ADD r2, r2, r3    // r2 = r2 + r3
    ADD r1, r1, r2    // r1 = r1 + r2

Figure 10.3: Machine code for expression of Fig. 10.2

The reuse of registers, however, serializes the computation. The only operations allowed to execute in parallel are the loads of the values in locations a and b, and the loads of the values in locations d and e. It thus takes a total of 7 steps to complete the computation in parallel. Had we used different registers for every partial sum, the expression could be evaluated in 4 steps, which is the height of the expression tree in Fig. 10.2. The parallel computation is suggested by Fig. 10.4. □

    r1 = a       r2 = b       r3 = c       r4 = d       r5 = e
    r6 = r1+r2                             r7 = r4+r5
    r8 = r6+r3
    r9 = r8+r7

Figure 10.4: Parallel evaluation of the expression of Fig. 10.2

10.2.4 Phase Ordering Between Register Allocation and Code Scheduling

If registers are allocated before scheduling, the resulting code tends to have many storage dependences that limit code scheduling. On the other hand, if code is scheduled before register allocation, the schedule created may require so many registers that register spilling ( storing the contents of a register in a memory location, so the register can be used for some other purpose ) may negate the advantages of instruction-level parallelism. Should a compiler allo­ cate registers first before it schedules the code? Or should it be the other way round? Or, do we need to address these two problems at the same time? To answer the questions above, we must consider the characteristics of the programs being compiled. Many nonnumeric applications do not have that much available parallelism. It suffices to dedicate a small number of registers for holding temporary results in expressions. We can first apply a coloring algorithm, as in Section 8.8.4, to allocate registers for all the nontemporary variables, then schedule the code, and finally assign registers to the temporary variables. This approach does not work for numeric applications where there are many more large expressions. We can use a hierarchical approach where code is op­ timized inside out, starting with the innermost loops. Instructions are first scheduled assuming that every pseudoregister will be allocated its own physical register. Register allocation is applied after scheduling and spill code is added where necessary, and the code is then rescheduled. This process is repeated for the code in the outer loops. When several inner loops are considered together in a common outer loop, the same variable may have been assigned different registers. We can change the register assignment to avoid having to copy the values from one register to another. In Section 10.5, we shall discuss the in­ teraction between register allocation and scheduling further in the context of a specific scheduling algorithm.

10.2.5

Control Dependence

Scheduling operations within a basic block is relatively easy because all the instructions are guaranteed to execute once control flow reaches the beginning of the block. Instructions in a basic block can be reordered arbitrarily, as long as all the data dependences are satisfied. Unfortunately, basic blocks, especially in nonnumeric programs, are typically very small; on average, there are only about five instructions in a basic block. In addition, operations in the same block are often highly related and thus have little parallelism. Exploiting parallelism across basic blocks is therefore crucial. An optimized program must execute all the operations in the original pro­ gram. It can execute more instructions than the original, as long as the extra instructions do not change what the program does. Why would executing extra instructions speed up a program's execution? If we know that an instruction


is likely to be executed, and an idle resource is available to perform the operation "for free," we can execute the instruction speculatively. The program runs faster when the speculation turns out to be correct.

An instruction i1 is said to be control-dependent on instruction i2 if the outcome of i2 determines whether i1 is to be executed. The notion of control dependence corresponds to the concept of nesting levels in block-structured programs. Specifically, in the if-else statement

    if (c) s1; else s2;

s1 and s2 are control dependent on c. Similarly, in the while-statement

    while (c) s;

the body s is control dependent on c.

Example 10.4: In the code fragment

    if (a > t)
        b = a*a;
    d = a+c;

the statements b = a*a and d = a+c have no data dependence with any other part of the fragment. The statement b = a*a depends on the comparison a > t . The statement d = a+c , however, does not depend on the comparison and can be executed any time. Assuming that the multiplication a * a does not cause any side effects, it can be performed speculatively, as long as b is written only after a is found to be greater than t. D
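One way a compiler might exploit this observation is sketched below; the rewriting is our own (with an invented temporary t1), not the book's: the multiplication is performed speculatively into a temporary, and b is overwritten only once the guard is known to hold.

    /* A sketch (not from the book) of a speculative form of the fragment
       in Example 10.4.  The multiply goes into a temporary, and b is
       committed only after the test succeeds.                             */
    void fragment(int a, int t, int c, int *b, int *d) {
        int t1 = a*a;   /* speculative: a*a has no side effects            */
        *d = a + c;     /* not control dependent; can be scheduled freely  */
        if (a > t)
            *b = t1;    /* commit the speculated value only if a > t       */
    }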

10.2.6

Speculative Execution Support

Memory loads are one type of instruction that can benefit greatly from specula­ tive execution. Memory loads are quite common, of course. They have relatively long execution latencies, addresses used in the loads are commonly available in advance, and the result can be stored in a new temporary variable without destroying the value of any other variable. Unfortunately, memory loads can raise exceptions if their addresses are illegal, so speculatively accessing illegal addresses may cause a correct program to halt unexpectedly. Besides, mispre­ dicted memory loads can cause extra cache misses and page faults, which are extremely costly. Example 1 0 . 5 : In the fragment if (p ! = null) q = *p ;

dereferencing p speculatively will cause this correct program to halt in error if p is null. □

Many high-performance processors provide special features to support spec­ ulative memory accesses. We mention the most important ones next.


Prefetching

The prefetch instruction was invented to bring data from memory to the cache before it is used. A prefetch instruction indicates to the processor that the program is likely to use a particular memory word in the near future. If the location specified is invalid or if accessing it causes a page fault, the processor can simply ignore the operation. Otherwise, the processor will bring the data from memory to the cache if it is not already there.
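For example, GCC and Clang expose the prefetch instruction through the __builtin_prefetch intrinsic; the loop below is a small sketch (the look-ahead distance of 8 is an arbitrary illustrative choice) of issuing prefetches a few iterations ahead:

    #include <stddef.h>

    /* Sum an array while prefetching data several iterations ahead.
       __builtin_prefetch is a GCC/Clang builtin; if the address turns
       out to be invalid, the prefetch is simply dropped, as described
       above.                                                           */
    long sum_with_prefetch(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8]);  /* hint only; no fault */
            s += a[i];
        }
        return s;
    }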

Poison Bits

Another architectural feature called poison bits was invented to allow speculative load of data from memory into the register file. Each register on the machine is augmented with a poison bit. If illegal memory is accessed or the accessed page is not in memory, the processor does not raise the exception immediately but instead just sets the poison bit of the destination register. An exception is raised only if the contents of the register with a marked poison bit are used.

Predicated Execution

Because branches are expensive, and mispredicted branches are even more so ( see Section 10. 1 ) , predicated instructions were invented to reduce the number of branches in a program. A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution; the instruction is executed only if the predicate is found to be true. As an example, a conditional move instruction CMOVZ R2 , R3 , R1 has the semantics that the contents of register R3 are moved to register R2 only if register R1 is zero. Code such as if (a == 0) b = c+d ;

can be implemented with two machine instructions, assuming that a, b, c, and d are allocated to registers R1, R2, R4, and R5, respectively, as follows:

    ADD R3, R4, R5
    CMOVZ R2, R3, R1

This conversion replaces a series of instructions sharing a control dependence with instructions sharing only data dependences. These instructions can then be combined with adjacent basic blocks to create a larger basic block. More importantly, with this code, the processor does not have a chance to mispredict, thus guaranteeing that the instruction pipeline will run smoothly. Predicated execution does come with a cost. Predicated instructions are fetched and decoded, even though they may not be executed in the end. Static schedulers must reserve all the resources needed for their execution and ensure


that all the potential data dependences are satisfied. Predicated execution should not be used aggressively unless the machine has many more resources than can possibly be used otherwise.

Dynamically Scheduled Machines

The instruction set of a statically scheduled machine explicitly defines what can execute in parallel. However, recall from Section 10.1.2 that some machine architectures allow the decision to be made at run time about what can be executed in parallel. With dynamic scheduling, the same machine code can be run on different members of the same family (machines that implement the same instruction set) that have varying amounts of parallel-execution support. In fact, machine-code compatibility is one of the major advantages of dynamically scheduled machines.

Static schedulers, implemented in the compiler by software, can help dynamic schedulers (implemented in the machine's hardware) better utilize machine resources. To build a static scheduler for a dynamically scheduled machine, we can use almost the same scheduling algorithm as for statically scheduled machines, except that no-op instructions left in the schedule need not be generated explicitly. The matter is discussed further in Section 10.4.7.

10.2.7

A Basic Machine Model

Many machines can be represented using the following simple model. A machine M = (R, T) consists of:

1. A set of operation types T, such as loads, stores, arithmetic operations, and so on.

2. A vector R = [r1, r2, ...] representing hardware resources, where ri is the number of units available of the ith kind of resource. Examples of typical resource types include: memory access units, ALU's, and floating-point functional units.

Each operation has a set of input operands, a set of output operands, and a resource requirement. Associated with each input operand is an input latency indicating when the input value must be available (relative to the start of the operation) . Typical input operands have zero latency, meaning that the values are needed immediately, at the clock when the operation is issued. Similarly, associated with each output operand is an output latency, which indicates when the result is available, relative to the start of the operation. Resource usage for each machine operation type t is modeled by a two­ dimensional resource-reservation table, RTt . The width of the table is the


number of kinds of resources in the machine, and its length is the duration over which resources are used by the operation. Entry RTt [i, j] is the number of units of the jth resource used by an operation of type t, i clocks after it is issued. For notational simplicity, we assume RTt [i, j] = 0 if i refers to a nonex­ istent entry in the table (i.e. , i is greater than the number of clocks it takes to execute the operation) . Of course, for any t, i, and j, RTt [i, j] must be less than or equal to R[j] , the number of resources of type j that the machine has. Typical machine operations occupy only one unit of resource at the time an operation is issued. Some operations may use more than one functional unit. For example, a multiply-and-add operation may use a multiplier in the first clock and an adder in the second. Some operations, such as a divide, may need to occupy a resource for several clocks. Fully pipelined operations are those that can be issued every clock, even though their results are not available until some number of clocks later. We need not model the resources of every stage of a pipeline explicitly; one single unit to represent the first stage will do. Any operation occupying the first stage of a pipeline is guaranteed the right to proceed to subsequent stages in subsequent clocks.
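A tiny sketch of how this model might be encoded is shown below; the resource kinds, counts, and the two-clock multiply-and-add timing are invented values, chosen only to illustrate the shape of a resource-reservation table:

    /* A toy instance of the machine model M = (R, T).  The resource
       kinds and counts below are made-up illustrative values.           */
    enum { ALU, MUL, MEM, N_RES };

    static const int R[N_RES] = { 1, 1, 1 };   /* one unit of each kind  */

    /* RT for a hypothetical multiply-and-add: it occupies the multiplier
       in the clock it is issued and the adder one clock later.          */
    static const int RT_muladd[2][N_RES] = {
        /*            ALU  MUL  MEM */
        /* clock 0 */ { 0,   1,   0 },
        /* clock 1 */ { 1,   0,   0 },
    };

    /* A fully pipelined load needs only its first-stage row, as the text
       explains: one row saying it uses the memory unit when issued.     */
    static const int RT_load[1][N_RES] = {
        /* clock 0 */ { 0,   0,   1 },
    };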

    1) a = b
    2) c = d
    3) b = c
    4) d = a
    5) c = d
    6) a = b

Figure 10.5: A sequence of assignments exhibiting data dependences

10.2.8

Exercises for Section 10.2

Exercise 10.2.1: The assignments in Fig. 10.5 have certain dependences. For each of the following pairs of statements, classify the dependence as (i) true dependence, (ii) antidependence, (iii) output dependence, or (iv) no dependence (i.e., the instructions can appear in either order):

a) Statements (1) and (4).
b) Statements (3) and (5).
c) Statements (1) and (6).
d) Statements (3) and (6).
e) Statements (4) and (6).

Exercise 10.2.2: Evaluate the expression ((u+v) + (w+x)) + (y+z) exactly as parenthesized (i.e., do not use the commutative or associative laws to reorder the


additions). Give register-level machine code to provide the maximum possible parallelism.

Exercise 10.2.3: Repeat Exercise 10.2.2 for the following expressions:

a) (u + (v + (w + x))) + (y + z).
b) (u + (v + w)) + (x + (y + z)).

If instead of maximizing the parallelism, we minimized the number of registers, how many steps would the computation take? How many steps do we save by using maximal parallelism?

Exercise 10.2.4: The expression of Exercise 10.2.2 can be executed by the sequence of instructions shown in Fig. 10.6. If we have as much parallelism as we need, how many steps are needed to execute the instructions?

    LD  r1, u         // r1 = u
    LD  r2, v         // r2 = v
    ADD r1, r1, r2    // r1 = r1 + r2
    LD  r2, w         // r2 = w
    LD  r3, x         // r3 = x
    ADD r2, r2, r3    // r2 = r2 + r3
    ADD r1, r1, r2    // r1 = r1 + r2
    LD  r2, y         // r2 = y
    LD  r3, z         // r3 = z
    ADD r2, r2, r3    // r2 = r2 + r3
    ADD r1, r1, r2    // r1 = r1 + r2

Figure 10.6: Minimal-register implementation of an arithmetic expression

! Exercise 10.2.5: Translate the code fragment discussed in Example 10.4, using the CMOVZ conditional copy instruction of Section 10.2.6. What are the data dependences in your machine code?

10.3

Basic-Block Scheduling

We are now ready to start talking about code-scheduling algorithms. We start with the easiest problem: scheduling operations in a basic block consisting of machine instructions. Solving this problem optimally is NP-complete. But in practice, a typical basic block has only a small number of highly constrained operations, so simple scheduling techniques suffice. We shall introduce a simple but highly effective algorithm, called list scheduling, for this problem.


10.3. 1

Data-Dependence Graphs

We represent each basic block of machine instructions by a data-dependence graph, G = (N, E), having a set of nodes N representing the operations in the machine instructions in the block and a set of directed edges E representing the data-dependence constraints among the operations. The nodes and edges of G are constructed as follows:

1. Each operation n in N has a resource-reservation table RTn, whose value is simply the resource-reservation table associated with the operation type of n.

2. Each edge e in E is labeled with delay de indicating that the destination node must be issued no earlier than de clocks after the source node is issued. Suppose operation n1 is followed by operation n2, and the same location is accessed by both, with latencies l1 and l2 respectively. That is, the location's value is produced l1 clocks after the first instruction begins, and the value is needed by the second instruction l2 clocks after that instruction begins (note l1 = 1 and l2 = 0 is typical). Then, there is an edge n1 → n2 in E labeled with delay l1 - l2.

Example 10.6: Consider a simple machine that can execute two operations every clock. The first must be either a branch operation or an ALU operation of the form:

    OP dst, src1, src2

The second must be a load or store operation of the form:

    LD dst, addr
    ST addr, src

The load operation (LD) is fully pipelined and takes two clocks. However, a load can be followed immediately by a store ST that writes to the memory location read. All other operations complete in one clock.

Shown in Fig. 10.7 is the dependence graph of an example of a basic block and its resource requirements. We might imagine that R1 is a stack pointer, used to access data on the stack with offsets such as 0 or 12. The first instruction loads register R2, and the value loaded is not available until two clocks later. This observation explains the label 2 on the edges from the first instruction to the second and fifth instructions, each of which needs the value of R2. Similarly, there is a delay of 2 on the edge from the third instruction to the fourth; the value loaded into R3 is needed by the fourth instruction, and not available until two clocks after the third begins.

Since we do not know how the values of R1 and R7 relate, we have to consider the possibility that an address like 8(R1) is the same as the address 0(R7).

[Figure 10.7, not reproducible here, shows the block's instructions as nodes of a data-dependence graph, each node annotated with its resource-reservation table (an alu column and a mem column), and the edges labeled with their delays of 1 or 2.]

Figure 10.7: Data-dependence graph for Example 10.6

That is, the last instruction may be storing into the same address that the third instruction loads from. The machine model we are using allows us to store into a location one clock after we load from that location, even though the value to be loaded will not appear in a register until one clock later. This observation explains the label 1 on the edge from the third instruction to the last. The same reasoning explains the edges and labels from the first instruction to the last. The other edges with label 1 are explained by a dependence or possible dependence conditioned on the value of R7. □

10.3.2

List Scheduling of Basic Blocks

The simplest approach to scheduling basic blocks involves visiting each node of the data-dependence graph in "prioritized topological order." Since there can be no cycles in a data-dependence graph, there is always at least one topological order for the nodes. However, among the possible topological orders, some may be preferable to others. We discuss in Section 10.3.3 some of the strategies for


Pictorial Resource-Reservation Tables It is frequently useful to visualize a resource-reservation table for an oper­ ation by a grid of solid and open squares. Each column corresponds to one of the resources of the machine, and each row corresponds to one of the clocks during which the operation executes. Assuming that the operation never needs more than one unit of any one resource, we may represent 1 's by solid squares, and O's by open squares. In addition, if the operation is fully pipelined, then we only need to indicate the resources used at the first row, and the resource-reservation table becomes a single row. This representation is used, for instance, in Example 10.6. In Fig. 10.7 we see resource-reservation tables as rows. The two addition operations require the "alu" resource, while the loads and stores require the "mem" resource.

picking a topological order, but for the moment, we just assume that there is some algorithm for picking a preferred order.

The list-scheduling algorithm we shall describe next visits the nodes in the chosen prioritized topological order. The nodes may or may not wind up being scheduled in the same order as they are visited. But the instructions are placed in the schedule as early as possible, so there is a tendency for instructions to be scheduled in approximately the order visited. In more detail, the algorithm computes the earliest time slot in which each node can be executed, according to its data-dependence constraints with the previously scheduled nodes. Next, the resources needed by the node are checked against a resource-reservation table that collects all the resources committed so far. The node is scheduled in the earliest time slot that has sufficient resources.

Algorithm 10.7: List scheduling a basic block.

INPUT: A machine-resource vector R = [r1, r2, . . .], where ri is the number of units available of the ith kind of resource, and a data-dependence graph G = (N, E). Each operation n in N is labeled with its resource-reservation table RTn; each edge e = n1 → n2 in E is labeled with de indicating that n2 must execute no earlier than de clocks after n1.

OUTPUT: A schedule S that maps the operations in N into time slots in which the operations can be initiated satisfying all the data and resource constraints.

METHOD: Execute the program in Fig. 10.8. A discussion of what the "prioritized topological order" might be follows in Section 10.3.3. □


    RT = an empty reservation table;
    for (each n in N in prioritized topological order) {
        s = max over edges e = p→n in E of (S(p) + de);
            /* Find the earliest time this instruction could begin,
               given when its predecessors started. */
        while (there exists i such that RT[s + i] + RTn[i] > R)
            s = s + 1;
            /* Delay the instruction further until the needed
               resources are available. */
        S(n) = s;
        for (all i)
            RT[s + i] = RT[s + i] + RTn[i];
    }

Figure 10.8: A list scheduling algorithm
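
The figure's pseudocode translates almost directly into executable form. The following Python sketch is ours, not the book's: it assumes each operation n carries a per-clock resource-reservation table rt[n], that preds[n] lists (predecessor, delay) pairs, and that ops is already given in a prioritized topological order.

    # A sketch of Algorithm 10.7 (list scheduling of one basic block).
    def list_schedule(ops, rt, preds, R):
        S = {}        # operation -> start clock
        RT = []       # global reservation table, grows on demand
        def fits(s, n):
            for i, row in enumerate(rt[n]):
                while len(RT) <= s + i:
                    RT.append([0] * len(R))
                if any(RT[s + i][j] + row[j] > R[j] for j in range(len(R))):
                    return False
            return True
        for n in ops:
            # earliest start permitted by already-scheduled predecessors
            s = max((S[p] + d for p, d in preds[n]), default=0)
            while not fits(s, n):       # delay until resources are available
                s += 1
            S[n] = s
            for i, row in enumerate(rt[n]):    # commit the resources
                RT[s + i] = [RT[s + i][j] + row[j] for j in range(len(R))]
        return S

The helper grows the global reservation table on demand; a real implementation would bound its length by the critical path of the block.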

10.3.3 Prioritized Topological Orders

List scheduling does not backtrack; it schedules each node once and only once. It uses a heuristic priority function to choose among the nodes that are ready to be scheduled next. Here are some observations about possible prioritized orderings of the nodes:

• Without resource constraints, the shortest schedule is given by the critical path, the longest path through the data-dependence graph. A metric useful as a priority function is the height of the node, which is the length of a longest path in the graph originating from the node. (A small sketch of this computation follows the list.)

• On the other hand, if all operations are independent, then the length of the schedule is constrained by the resources available. The critical resource is the one with the largest ratio of uses to the number of units of that resource available. Operations using more critical resources may be given higher priority.

• Finally, we can use the source ordering to break ties between operations; the operation that shows up earlier in the source program should be scheduled first.
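
As a small illustration of the first heuristic, here is one way to compute node heights. The formulation below (one clock for the node itself plus the delays along a longest outgoing path) is our own reading of the definition above, not code from the book.

    import functools

    def heights(nodes, succs):
        """succs[n]: list of (m, delay) pairs for edges n -> m."""
        @functools.lru_cache(maxsize=None)
        def h(n):
            out = succs.get(n, [])
            if not out:
                return 1                       # the node's own final clock
            return max(d + h(m) for m, d in out)
        return {n: h(n) for n in nodes}

Nodes are then visited in topological order, breaking ties by descending height and then by source position, as the bullets above suggest.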

Example 10.8: For the data-dependence graph in Fig. 10.7, the critical path, including the time to execute the last instruction, is 6 clocks. That is, the critical path is the last five nodes, from the load of R3 to the store of R7. The total of the delays on the edges along this path is 5, to which we add 1 for the clock needed for the last instruction.

Using the height as the priority function, Algorithm 10.7 finds an optimal schedule as shown in Fig. 10.9. Notice that we schedule the load of R3 first, since it has the greatest height.

The add of R3 and R4 has the resources to be scheduled at the second clock, but the delay of 2 for a load forces us to wait until the third clock to schedule this add. That is, we cannot be sure that R3 will have its needed value until the beginning of clock 3. □

    clock 1:   LD  R3, 8(R1)
    clock 2:   LD  R2, 0(R1)
    clock 3:   ADD R3, R3, R4
    clock 4:   ADD R3, R3, R2     ST 4(R1), R2
    clock 5:   ST  12(R1), R3
    clock 6:   ST  0(R7), R7

Figure 10.9: Result of applying list scheduling to the example in Fig. 10.7

    1)  LD  R1, a           LD  R1, a           LD  R1, a
    2)  LD  R2, b           LD  R2, b           LD  R2, b
    3)  SUB R3, R1, R2      SUB R1, R1, R2      SUB R3, R1, R2
    4)  ADD R2, R1, R2      ADD R2, R1, R2      ADD R4, R1, R2
    5)  ST  a, R3           ST  a, R1           ST  a, R3
    6)  ST  b, R2           ST  b, R2           ST  b, R4
            (a)                   (b)                 (c)

Figure 10.10: Machine code for Exercise 10.3.1

10.3.4 Exercises for Section 10.3

Exercise 10.3.1: For each of the code fragments of Fig. 10.10, draw the data-dependence graph.

Exercise 10.3.2: Assume a machine with one ALU resource (for the ADD and SUB operations) and one MEM resource (for the LD and ST operations). Assume that all operations require one clock, except for the LD, which requires two. However, as in Example 10.6, an ST on the same memory location can commence one clock after an LD on that location commences. Find a shortest schedule for each of the fragments in Fig. 10.10.


Exercise 10.3.3: Repeat Exercise 10.3.2 assuming:

i. The machine has one ALU resource and two MEM resources.

ii. The machine has two ALU resources and one MEM resource.

iii. The machine has two ALU resources and two MEM resources.

    1)  LD R1, a
    2)  ST b, R1
    3)  LD R2, c
    4)  ST c, R1
    5)  LD R1, d
    6)  ST d, R2
    7)  ST a, R1

Figure 10.11: Machine code for Exercise 10.3.4

Exercise 10.3.4: Assuming the machine model of Example 10.6 (as in Exercise 10.3.2):

a) Draw the data-dependence graph for the code of Fig. 10.11.

b) What are all the critical paths in your graph from part (a)?

! c) Assuming unlimited MEM resources, what are all the possible schedules for the seven instructions?

10.4 Global Code Scheduling

For a machine with a moderate amount of instruction-level parallelism, schedules created by compacting individual basic blocks tend to leave many resources idle. In order to make better use of machine resources, it is necessary to consider code-generation strategies that move instructions from one basic block to another. Strategies that consider more than one basic block at a time are referred to as global scheduling algorithms. To do global scheduling correctly, we must consider not only data dependences but also control dependences. We must ensure that

1. All instructions in the original program are executed in the optimized program, and

2. While the optimized program may execute extra instructions speculatively, these instructions must not have any unwanted side effects.

10.4.1 Primitive Code Motion

Let us first study the issues involved in moving operations around by way of a simple example.

Example 10.9: Suppose we have a machine that can execute any two operations in a single clock. Every operation executes with a delay of one clock, except for the load operation, which has a latency of two clocks. For simplicity, we assume that all memory accesses in the example are valid and will hit in the cache. Figure 10.12(a) shows a simple flow graph with three basic blocks. The code is expanded into machine operations in Figure 10.12(b). All the instructions in each basic block must execute serially because of data dependences; in fact, a no-op instruction has to be inserted in every basic block.

Assume that the addresses of variables a, b, c, d, and e are distinct and that those addresses are stored in registers R1 through R5, respectively. The computations from different basic blocks therefore share no data dependences. We observe that all the operations in block B3 are executed regardless of whether the branch is taken, and can therefore be executed in parallel with operations from block B1. We cannot move operations from B1 down to B3, because they are needed to determine the outcome of the branch. Operations in block B2 are control-dependent on the test in block B1. We can perform the load from B2 speculatively in block B1 for free and shave two clocks from the execution time whenever the branch is not taken.

Stores should not be performed speculatively because they overwrite the old value in a memory location. It is possible, however, to delay a store operation. We cannot simply place the store operation from block B2 in block B3, because it should only be executed if the flow of control passes through block B2. However, we can place the store operation in a duplicated copy of B3. Figure 10.12(c) shows such an optimized schedule. The optimized code executes in 4 clocks, which is the same as the time it takes to execute B3 alone. □

Example 10.9 shows that it is possible to move operations up and down an execution path. Every pair of basic blocks in this example has a different "dominance relation," and thus the considerations of when and how instructions can be moved between each pair are different. As discussed in Section 9.6.1, a block B is said to dominate block B' if every path from the entry of the control-flow graph to B' goes through B. Similarly, a block B postdominates block B' if every path from B' to the exit of the graph goes through B. When B dominates B' and B' postdominates B, we say that B and B' are control equivalent, meaning that one is executed when and only when the other is. For the example in Fig. 10.12, assuming B1 is the entry and B3 the exit,

1. B1 and B3 are control equivalent: B1 dominates B3 and B3 postdominates B1,

2. B1 dominates B2 but B2 does not postdominate B1, and

          if (a == 0) goto L
          c = b
    L:    e = d + d

              (a) Source program

    B1:       LD R6, 0(R1)
              nop
              BEQZ R6, L

    B2:       LD R7, 0(R2)
              nop
              ST 0(R3), R7

    B3: L:    LD R8, 0(R4)
              nop
              ADD R8, R8, R8
              ST 0(R5), R8

              (b) Locally scheduled machine code

    B1:       LD R6, 0(R1),  LD R8, 0(R4)
              LD R7, 0(R2),  ADD R8, R8, R8
              BEQZ R6, L

    B3':      ST 0(R5), R8,  ST 0(R3), R7

    B3: L:    ST 0(R5), R8

              (c) Globally scheduled machine code

Figure 10.12: Flow graphs before and after global scheduling in Example 10.9


3. B2 does not dominate B3 but B3 postdominates B2.

It is also possible for a pair of blocks along a path to share neither a dominance nor a postdominance relation.
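
These dominance tests can be computed with the standard iterative dataflow formulation referenced above (Section 9.6.1). The sketch below is illustrative only; the CFG representation and helper names are our own.

    def dominators(succs, entry):
        """Iterative dominator computation; succs maps block -> successor list."""
        nodes = set(succs) | {m for vs in succs.values() for m in vs}
        preds = {n: set() for n in nodes}
        for n, vs in succs.items():
            for m in vs:
                preds[m].add(n)
        dom = {n: set(nodes) for n in nodes}
        dom[entry] = {entry}
        changed = True
        while changed:                      # iterate to a fixed point
            changed = False
            for n in nodes - {entry}:
                new = {n} | (set.intersection(*(dom[p] for p in preds[n]))
                             if preds[n] else {n})
                if new != dom[n]:
                    dom[n], changed = new, True
        return dom

    def control_equivalent(b1, b2, dom, pdom):
        # b1 dominates b2 and b2 postdominates b1
        return b1 in dom[b2] and b2 in pdom[b1]

Postdominators are the same computation on the reversed CFG with the exit as the entry; for Fig. 10.12, B1 and B3 come out control equivalent.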

10.4.2 Upward Code Motion

We now examine carefully what it means to move an operation up a path. Suppose we wish to move an operation from block src up a control-flow path to block dst. We assume that such a move does not violate any data dependences and that it makes paths through dst and src run faster. If dst dominates src, and src postdominates dst, then the operation moved is executed once and only once, when it should.

If src does not postdominate dst

Then there exists a path that passes through dst that does not reach src. An extra operation would have been executed in this case. This code motion is illegal unless the operation moved has no unwanted side effects. If the moved operation executes "for free" (i.e., it uses only resources that otherwise would be idle), then this move has no cost. It is beneficial only if the control flow reaches src.

If dst does not dominate src

Then there exists a path that reaches src without first going through dst. We need to insert copies of the moved operation along such paths. We know how to achieve exactly that from our discussion of partial redundancy elimination in Section 9.5. We place copies of the operation along basic blocks that form a cut set separating the entry block from src. At each place where the operation is inserted, the following constraints must be satisfied:

1. The operands of the operation must hold the same values as in the original,

2. The result does not overwrite a value that is still needed, and

3. It itself is not subsequently overwritten before reaching src.

These copies render the original instruction in src fully redundant, and it thus can be eliminated.

We refer to the extra copies of the operation as compensation code. As discussed in Section 9.5, basic blocks can be inserted along critical edges to create places for holding such copies. The compensation code can potentially make some paths run slower. Thus, this code motion improves program execution only if the optimized paths are executed more frequently than the nonoptimized ones.


10.4.3 Downward Code Motion

Suppose we are interested in moving an operation from block src down a control-flow path to block dst. We can reason about such code motion in the same way as above.

If src does not dominate dst

Then there exists a path that reaches dst without first visiting src. Again, an extra operation will be executed in this case. Unfortunately, downward code motion is often applied to writes, which have the side effects of overwriting old values. We can get around this problem by replicating the basic blocks along the paths from src to dst, and placing the operation only in the new copy of dst. Another approach, if available, is to use predicated instructions. We guard the operation moved with the predicate that guards the src block. Note that the predicated instruction must be scheduled only in a block dominated by the computation of the predicate, because the predicate would not be available otherwise.

If dst does not postdominate src

As in the discussion above, compensation code needs to be inserted so that the operation moved is executed on all paths not visiting dst. This transformation is again analogous to partial redundancy elimination, except that the copies are placed below the src block in a cut set that separates src from the exit.

Summary of Upward and Downward Code Motion

From this discussion, we see that there is a range of possible global code motions which vary in terms of benefit, cost, and implementation complexity. Figure 10.13 shows a summary of these various code motions; the lines correspond to the following four cases:

         up: src postdom dst    dst dom src        speculation    compensation
         down: src dom dst      dst postdom src    code dup.      code
    1           yes                 yes                 no              no
    2           no                  yes                 yes             no
    3           yes                 no                  no              yes
    4           no                  no                  yes             yes

Figure 10.13: Summary of code motions

1. Moving instructions between control-equivalent blocks is simplest and most cost effective. No extra operations are ever executed and no compensation code is needed.


2. Extra operations may be executed if the source does not postdominate ( dominate) the destination in upward ( downward ) code motion. This code motion is beneficial if the extra operations can be executed for free, and the path passing through the source block is executed. 3. Compensation code is needed if the destination does not dominate ( post­ dominate ) the source in upward ( downward ) code motion. The paths with the compensation code may be slowed down, so it is important that the optimized paths are more frequently executed.

4. The last case combines the disadvantages of the second and third case: extra operations may be executed and compensation code is needed.

10.4.4 Updating Data Dependences

As illustrated by Example 10.10 below, code motion can change the data-dependence relations between operations. Thus data dependences must be updated after each code movement.

Example 10.10: For the flow graph shown in Fig. 10.14, either assignment to x can be moved up to the top block, since all the dependences in the original

program are preserved with this transformation. However, once we have moved one assignment up, we cannot move the other. More specifically, we see that variable x is not live on exit in the top block before the code motion, but it is live after the motion. If a variable is live at a program point, then we cannot move speculative definitions to the variable above that program point. 0

Figure 10.14: Example illustrating the change in data dependences due to code motion

10.4.5 Global Scheduling Algorithms

We saw in the last section that code motion can benefit some paths while hurting the performance of others. The good news is that instructions are not all created equal. In fact, it is well established that over 90% of a program's execution time is spent on less than 10% of the code. Thus, we should aim to


make the frequently executed paths run faster while possibly making the less frequent paths run slower.

There are a number of techniques a compiler can use to estimate execution frequencies. It is reasonable to assume that instructions in the innermost loops are executed more often than code in outer loops, and that branches that go backward are more likely to be taken than not taken. Also, branch statements found to guard program exits or exception-handling routines are unlikely to be taken. The best frequency estimates, however, come from dynamic profiling. In this technique, programs are instrumented to record the outcomes of conditional branches as they run. The programs are then run on representative inputs to determine how they are likely to behave in general. The results obtained from this technique have been found to be quite accurate. Such information can be fed back to the compiler to use in its optimizations.

Region-Based Scheduling

We now describe a straightforward global scheduler that supports the two easiest forms of code motion:

1. Moving operations up to control-equivalent basic blocks, and

2. Moving operations speculatively up one branch to a dominating predecessor.

Recall from Section 9.7.1 that a region is a subset of a control-flow graph that can be reached only through one entry block. We may represent any procedure as a hierarchy of regions. The entire procedure constitutes the top-level region; nested within it are subregions representing the natural loops in the function. We assume that the control-flow graph is reducible.

Algorithm 10.11: Region-based scheduling.

INPUT: A control-flow graph and a machine-resource description.

OUTPUT: A schedule S mapping each instruction to a basic block and a time slot.

METHOD: Execute the program in Fig. 10.15. Some shorthand terminology should be apparent: ControlEquiv(B) is the set of blocks that are control-equivalent to block B, and DominatedSucc applied to a set of blocks is the set of blocks that are successors of at least one block in the set and are dominated by all.

Code scheduling in Algorithm 10.11 proceeds from the innermost regions to the outermost. When scheduling a region, each nested subregion is treated as a black box; instructions are not allowed to move in or out of a subregion. They can, however, move around a subregion, provided their data and control dependences are satisfied.


    for (each region R in topological order, so that inner regions
            are processed before outer regions) {
        compute data dependences;
        for (each basic block B of R in prioritized topological order) {
            CandBlocks = ControlEquiv(B) ∪ DominatedSucc(ControlEquiv(B));
            CandInsts = ready instructions in CandBlocks;
            for (t = 0, 1, . . . until all instructions from B are scheduled) {
                for (each instruction n in CandInsts in priority order)
                    if (n has no resource conflicts at time t) {
                        S(n) = (B, t);
                        update resource commitments;
                        update data dependences;
                    }
                update CandInsts;
            }
        }
    }

Figure 10.15: A region-based global scheduling algorithm

All control and dependence edges flowing back to the header of the region are ignored, so the resulting control-flow and data-dependence graphs are acyclic. The basic blocks in each region are visited in topological order. This ordering guarantees that a basic block is not scheduled until all the instructions it depends on have been scheduled. Instructions to be scheduled in a basic block B are drawn from all the blocks that are control-equivalent to B (including B), as well as their immediate successors that are dominated by B.

A list-scheduling algorithm is used to create the schedule for each basic block. The algorithm keeps a list of candidate instructions, CandInsts, which contains all the instructions in the candidate blocks whose predecessors all have been scheduled. It creates the schedule clock-by-clock. For each clock, it checks each instruction from CandInsts in priority order and schedules it in that clock if resources permit. Algorithm 10.11 then updates CandInsts and repeats the process, until all instructions from B are scheduled.

The priority order of instructions in CandInsts uses a priority function similar to that discussed in Section 10.3. We make one important modification, however. We give instructions from blocks that are control equivalent to B higher priority than those from the successor blocks. The reason is that instructions in the latter category are only speculatively executed in block B. □


Loop Unrolling

In region-based scheduling, the boundary of a loop iteration is a barrier to code motion. Operations from one iteration cannot overlap with those from another. One simple but highly effective technique to mitigate this problem is to unroll the loop a small number of times before code scheduling. A for-loop such as

    for (i = 0; i < N; i++) { S(i); }

can be written as in Fig. 10.16(a). Similarly, a repeat-loop such as

    repeat S; until C;

can be written as in Fig. 10.16(b). Unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism.

    for (i = 0; i+4 < N; i+=4) {
        S(i); S(i+1); S(i+2); S(i+3);
    }
    for ( ; i < N; i++) {
        S(i);
    }

              (a) Unrolling a for-loop.

    repeat {
        S;
        if (C) break;
        S;
        if (C) break;
        S;
        if (C) break;
        S;
    } until C;

              (b) Unrolling a repeat-loop.

Figure 10.16: Unrolled loops


Neighborhood Compaction

Algorithm 10.11 only supports the first two forms of code motion described in Section 10.4.1. Code motions that require the introduction of compensation code can sometimes be useful. One way to support such code motions is to follow the region-based scheduling with a simple pass. In this pass, we can examine each pair of basic blocks that are executed one after the other, and check if any operation can be moved up or down between them to improve the execution time of those blocks. If such a pair is found, we check if the instruction to be moved needs to be duplicated along other paths. The code motion is made if it results in an expected net gain.

This simple extension can be quite effective in improving the performance of loops. For instance, it can move an operation at the beginning of one iteration to the end of the preceding iteration, while also moving the operation from the first iteration out of the loop. This optimization is particularly attractive for tight loops, which are loops that execute only a few instructions per iteration. However, the impact of this technique is limited by the fact that each code-motion decision is made locally and independently.

10.4.6 Advanced Code Motion Techniques

If our target machine is statically scheduled and has plenty of instruction-level parallelism, we may need a more aggressive algorithm. Here is a high-level description of further extensions:

1. To facilitate the extensions below, we can add new basic blocks along control-flow edges originating from blocks with more than one predecessor. These basic blocks will be eliminated at the end of code scheduling if they are empty. A useful heuristic is to move instructions out of a basic block that is nearly empty, so that the block can be eliminated completely.

2. In Algorithm 10.11, the code to be executed in each basic block is scheduled once and for all as each block is visited. This simple approach suffices because the algorithm can only move operations up to dominating blocks. To allow motions that require the addition of compensation code, we take a slightly different approach. When we visit block B, we only schedule instructions from B and all its control-equivalent blocks. We first try to place these instructions in predecessor blocks, which have already been visited and for which a partial schedule already exists. We try to find a destination block that would lead to an improvement on a frequently executed path and then place copies of the instruction on other paths to guarantee correctness. If the instructions cannot be moved up, they are scheduled in the current basic block as before.

3. Implementing downward code motion is harder in an algorithm that visits basic blocks in topological order, since the target blocks have yet to be


scheduled. However, there are relatively fewer opportunities for such code motion anyway. We move all operations that ( a) can be moved, and (b ) cannot be executed for free in their native block.

This simple strategy works well if the target machine is rich with many unused hardware resources.

10.4.7 Interaction with Dynamic Schedulers

A dynamic scheduler has the advantage that it can create new schedules according to the run-time conditions, without having to encode all these possible schedules ahead of time. If a target machine has a dynamic scheduler, the static scheduler's primary function is to ensure that instructions with high latency are fetched early so that the dynamic scheduler can issue them as early as possible.

Cache misses are a class of unpredictable events that can make a big difference to the performance of a program. If data-prefetch instructions are available, the static scheduler can help the dynamic scheduler significantly by placing these prefetch instructions early enough that the data will be in the cache by the time they are needed. If prefetch instructions are not available, it is useful for a compiler to estimate which operations are likely to miss and try to issue them early.

If dynamic scheduling is not available on the target machine, the static scheduler must be conservative and separate every data-dependent pair of operations by the minimum delay. If dynamic scheduling is available, however, the compiler only needs to place the data-dependent operations in the correct order to ensure program correctness. For best performance, the compiler should assign long delays to dependences that are likely to occur and short ones to those that are not likely.

Branch misprediction is an important cause of loss in performance. Because of the long misprediction penalty, instructions on rarely executed paths can still have a significant effect on the total execution time. Higher priority should be given to such instructions to reduce the cost of misprediction.

10.4.8 Exercises for Section 10.4

Exercise 10.4.1: Show how to unroll the generic while-loop

    while (C) S;

! Exercise 10.4.2: Consider the code fragment:

    if (x == 0)
        a = b;
    else
        a = c;
    d = a;


Assume a machine that uses the delay model of Example 10.6 (loads take two clocks, all other instructions take one clock). Also assume that the machine can execute any two instructions at once. Find a shortest possible execution of this fragment. Do not forget to consider which register is best used for each of the copy steps. Also, remember to exploit the information given by register descriptors, as was described in Section 8.6, to avoid unnecessary loads and stores.

10.5 Software Pipelining

As discussed in the introduction of this chapter, numerical applications tend to have much parallelism. In particular, they often have loops whose iterations are completely independent of one another. These loops, known as do-all loops, are particularly attractive from a parallelization perspective because their iterations can be executed in parallel to achieve a speed-up linear in the number of iterations in the loop. Do-all loops with many iterations have enough parallelism to saturate all the resources on a processor. It is up to the scheduler to take full advantage of the available parallelism. This section describes an algorithm, known as software pipelining, that schedules an entire loop at a time, taking full advantage of the parallelism across iterations.

10.5.1 Introduction

We shall use the do-all loop in Example 10.12 throughout this section to explain software pipelining. We first show that scheduling across iterations is of great importance, because there is relatively little parallelism among operations in a single iteration. Next, we show that loop unrolling improves performance by overlapping the computation of unrolled iterations. However, the boundary of the unrolled loop still poses a barrier to code motion, and unrolling still leaves a lot of performance "on the table." The technique of software pipelining, on the other hand, overlaps a number of consecutive iterations continually until it runs out of iterations. This technique allows software pipelining to produce highly efficient and compact code.

Example 10.12: Here is a typical do-all loop:

    for (i = 0; i < n; i++)
        D[i] = A[i] * B[i] + c;

Iterations in the above loop write to different memory locations, which are themselves distinct from any of the locations read. Therefore, there are no memory dependences between the iterations, and all iterations can proceed in parallel. We adopt the following model as our target machine throughout this section. In this model

• The machine can issue in a single clock: one load, one store, one arithmetic operation, and one branch operation.

• The machine has a loop-back operation of the form

      BL R, L

  which decrements register R and, unless the result is 0, branches to location L.

• Memory operations have an auto-increment addressing mode, denoted by ++ after the register. The register is automatically incremented to point to the next consecutive address after each access.

• The arithmetic operations are fully pipelined; they can be initiated every clock but their results are not available until 2 clocks later. All other instructions have a single-clock latency.

If iterations are scheduled one at a time, the best schedule we can get on our machine model is shown in Fig. 10.17. Some assumptions about the layout of the data are also indicated in that figure: registers R1, R2, and R3 hold the addresses of the beginnings of arrays A, B, and D, register R4 holds the constant c, and register R10 holds the value n - 1, which has been computed outside the loop. The computation is mostly serial, taking a total of 7 clocks; only the loop-back instruction is overlapped with the last operation in the iteration. □

          // R1, R2, R3 = &A, &B, &D
          // R4 = c
          // R10 = n - 1
    L:    LD  R5, 0(R1++)
          LD  R6, 0(R2++)
          MUL R7, R5, R6
          nop
          ADD R8, R7, R4
          nop
          ST  0(R3++), R8      BL R10, L

Figure 10.17: Locally scheduled code for Example 10.12

In general, we get better hardware utilization by unrolling several iterations of a loop. However, doing so also increases the code size, which in turn can have a negative impact on overall performance. Thus, we have to compromise, picking a number of times to unroll a loop that gets most of the performance improvement, yet doesn't expand the code too much. The next example illustrates the tradeoff.


Example 10.13: While hardly any parallelism can be found in each iteration of the loop in Example 10.12, there is plenty of parallelism across the iterations. Loop unrolling places several iterations of the loop in one large basic block, and a simple list-scheduling algorithm can be used to schedule the operations to execute in parallel. If we unroll the loop in our example four times and apply Algorithm 10.7 to the code, we can get the schedule shown in Fig. 10.18. (For simplicity, we ignore the details of register allocation for now.) The loop executes in 13 clocks, or one iteration every 3.25 clocks.

A loop unrolled k times takes at least 2k + 5 clocks, achieving a throughput of one iteration every 2 + 5/k clocks. Thus, the more iterations we unroll, the faster the loop runs. As n → ∞, a fully unrolled loop can execute on average an iteration every two clocks. However, the more iterations we unroll, the larger the code gets. We certainly cannot afford to unroll all the iterations in a loop. Unrolling the loop 4 times produces a schedule of 13 clocks, or 163% of the optimum; unrolling the loop 8 times produces a schedule of 21 clocks, or 131% of the optimum. Conversely, if we wish to operate at, say, only 110% of the optimum, we need to unroll the loop 25 times, which would result in a schedule of 55 clocks. □
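
The arithmetic behind these percentages is easy to check. The snippet below is only a sanity check of the 2k + 5 formula stated in the example; it is not part of any compiler algorithm.

    def clocks_per_iteration(k):
        # an unrolled body of k iterations takes 2k + 5 clocks on this machine
        return (2 * k + 5) / k

    for k in (4, 8, 25):
        print(k, round(clocks_per_iteration(k) / 2, 4))
    # prints 1.625, 1.3125, 1.1 -- the 163%, 131%, and 110% figures above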

10.5.2 Software Pipelining of Loops

Software pipelining provides a convenient way of getting optimal resource usage and compact code at the same time. Let us illustrate the idea with our running example. 1 0 . 1 4 : In Fig. 10.19 is the code from Example 10.12 unrolled five times. (Again we leave out the consideration of register usage.) Shown in row i are all the operations issued at clock i; shown in column j are all the operations from iteration j. Note that every iteration has the same schedule relative to its beginning, and also note that every iteration is initiated two clocks after the preceding one. It is easy to see that this schedule satisfies all the resource and data-dependence constraints. We observe that the operations executed at clocks 7 and 8 are the same as those executed at clocks 9 and 10. Clocks 7 and 8 execute operations from the first four iterations in the original program. Clocks 9 and 10 also execute operations from four iterations, this time from iterations 2 to 5. In fact, we can keep executing this same pair of multi-operation instructions to get the effect of retiring the oldest iteration and adding a new one, until we run out of iterations. Such dynamic behavior can be encoded succinctly with the code shown in Fig. 10.20, if we assume that the loop has at least 4 iterations. Each row in the figure corresponds to one machine instruction. Lines 7 and 8 form a 2-clock in the loop, which is executed n 3 times, where n is the number of iterations ' original loop. 0

Example

-


    Clock
      1   L:  LD
      2       LD
      3       MUL   LD
      4             LD
      5             MUL   LD
      6       ADD         LD
      7                   MUL   LD
      8       ST    ADD         LD
      9                         MUL
     10             ST    ADD
     11                         ADD
     12                   ST
     13                         ST    BL (L)

Figure 10.18: Unrolled code for Example 10.12

    Clock   j=1    j=2    j=3    j=4    j=5
      1     LD
      2     LD
      3     MUL    LD
      4            LD
      5            MUL    LD
      6     ADD           LD
      7                   MUL    LD
      8     ST     ADD           LD
      9                          MUL    LD
     10            ST     ADD           LD
     11                                 MUL
     12                   ST     ADD
     13
     14                          ST     ADD
     15
     16                                 ST

Figure 10.19: Five unrolled iterations of the code in Example 10.12

     1)       LD
     2)       LD
     3)       MUL   LD
     4)             LD
     5)       MUL   LD
     6)       ADD   LD
     7)  L:   MUL   LD
     8)       ST    ADD   LD    BL (L)
     9)       MUL
    10)       ST    ADD
    11)
    12)       ST    ADD
    13)
    14)       ST

Figure 10.20: Software-pipelined code for Example 10.12

The technique described above is called software pipelining, because it is the software analog of a technique used for scheduling hardware pipelines. We can think of the schedule executed by each iteration in this example as an 8-stage pipeline. A new iteration can be started on the pipeline every 2 clocks. At the beginning, there is only one iteration in the pipeline. As the first iteration proceeds to stage three, the second iteration starts to execute in the first pipeline stage. By clock 7, the pipeline is fully filled with the first four iterations. In the steady state, four consecutive iterations are executing at the same time. A new iteration is started as the oldest iteration in the pipeline retires. When we run out of iterations, the pipeline drains, and all the iterations in the pipeline run to completion. The sequence of instructions used to fill the pipeline, lines 1 through 6 in our example, is called the prolog; lines 7 and 8 are the steady state; and the sequence of instructions used to drain the pipeline, lines 9 through 14, is called the epilog.

For this example, we know that the loop cannot be run at a rate faster than 2 clocks per iteration, since the machine can only issue one read every clock, and there are two reads in each iteration. The software-pipelined loop above executes in 2n + 6 clocks, where n is the number of iterations in the original loop. As n → ∞, the throughput of the loop approaches the rate of one iteration every two clocks. Thus, software pipelining, unlike unrolling, can potentially encode the optimal schedule with a very compact code sequence.

Note that the schedule adopted for each individual iteration is not the shortest possible. Comparison with the locally optimized schedule shown in Fig. 10.17 shows that a delay is introduced before the ADD operation. The delay is placed strategically so that the schedule can be initiated every two clocks without resource conflicts. Had we stuck with the locally compacted schedule,


the initiation interval would have to be lengthened to 4 clocks to avoid resource conflicts, and the throughput rate would be halved. This example illustrates an important principle in pipeline scheduling: the schedule must be chosen carefully in order to optimize the throughput. A locally compacted schedule, while minimizing the time to complete an iteration, may result in suboptimal throughput when pipelined.

10.5.3 Register Allocation and Code Generation

Let us begin by discussing register allocation for the software-pipelined loop in Example 10.14.

Example 10.15: In Example 10.14, the result of the multiply operation in the first iteration is produced at clock 3 and used at clock 6. Between these clock cycles, a new result is generated by the multiply operation in the second iteration at clock 5; this value is used at clock 8. The results from these two iterations must be held in different registers to prevent them from interfering with each other. Since interference occurs only between adjacent pairs of iterations, it can be avoided with the use of two registers, one for the odd iterations and one for the even iterations. Since the code for odd iterations is different from that for the even iterations, the size of the steady-state loop is doubled. This code can be used to execute any loop that has an odd number of iterations greater than or equal to 5.

    if (N >= 5)
        N2 = 3 + 2 * floor((N-3)/2);
    else
        N2 = 0;
    for (i = 0; i < N2; i++)
        D[i] = A[i] * B[i] + c;
    for (i = N2; i < N; i++)
        D[i] = A[i] * B[i] + c;

Figure 10.21: Source-level unrolling of the loop from Example 10.12

To handle loops that have fewer than 5 iterations and loops with an even number of iterations, we generate the code whose source-level equivalent is shown in Fig. 10.21. The first loop is pipelined, as seen in the machine-level equivalent of Fig. 10.22. The second loop of Fig. 10.21 need not be optimized, since it can iterate at most four times. □

10.5.4 Do-Across Loops

Software pipelining can also be applied to loops whose iterations share data dependences. Such loops are known as do-across loops.


     1.        LD  R5, 0(R1++)
     2.        LD  R6, 0(R2++)
     3.        MUL R7, R5, R6      LD  R5, 0(R1++)
     4.                            LD  R6, 0(R2++)
     5.        MUL R9, R5, R6      LD  R5, 0(R1++)
     6.        ADD R8, R7, R4      LD  R6, 0(R2++)
     7.  L:    MUL R7, R5, R6      LD  R5, 0(R1++)
     8.        ST  0(R3++), R8     ADD R8, R9, R4     LD  R6, 0(R2++)
     9.        MUL R9, R5, R6      LD  R5, 0(R1++)
    10.        ST  0(R3++), R8     ADD R8, R7, R4     LD  R6, 0(R2++)     BL R10, L
    11.        MUL R7, R5, R6
    12.        ST  0(R3++), R8     ADD R8, R9, R4
    13.
    14.        ST  0(R3++), R8     ADD R8, R7, R4
    15.
    16.        ST  0(R3++), R8

Figure 10.22: Code after software pipelining and register allocation in Example 10.15

Example 10.16: The code

    for (i = 0; i < n; i++) {
        sum = sum + A[i];
        B[i] = A[i] * b;
    }

has a data dependence between consecutive iterations, because the previous value of sum is added to A[i] to create a new value of sum. It is possible to execute the summation in O(log n) time if the machine can deliver sufficient parallelism, but for the sake of this discussion, we simply assume that all the sequential dependences must be obeyed, and that the additions must be performed in the original sequential order. Because our assumed machine model takes two clocks to complete an ADD, the loop cannot execute faster than one iteration every two clocks. Giving the machine more adders or multipliers will not make this loop run any faster. The throughput of do-across loops like this one is limited by the chain of dependences across iterations.

The best locally compacted schedule for each iteration is shown in Fig. 10.23(a), and the software-pipelined code is in Fig. 10.23(b). This software-pipelined loop starts an iteration every two clocks, and thus operates at the optimal rate. □

          // R1, R2 = &A, &B
          // R3 = sum
          // R4 = b
          // R10 = n-1
    L:    LD  R5, 0(R1++)
          MUL R6, R5, R4
          ADD R3, R3, R5
          ST  0(R2++), R6
          BL  R10, L

              (a) The best locally compacted schedule.

          // R1, R2 = &A, &B
          // R3 = sum
          // R4 = b
          // R10 = n-2
          LD  R5, 0(R1++)
          MUL R6, R5, R4
    L:    LD  R5, 0(R1++)     ADD R3, R3, R5
          MUL R6, R5, R4      ST  0(R2++), R6     BL R10, L
          ADD R3, R3, R5
          ST  0(R2++), R6

              (b) The software-pipelined version.

Figure 10.23: Software-pipelining of a do-across loop

10.5.5 Goals and Constraints of Software Pipelining

The primary goal of software pipelining is to maximize the throughput of a long-running loop. A secondary goal is to keep the size of the code generated reasonably small. In other words, the software-pipelined loop should have a small steady state of the pipeline. We can achieve a small steady state by requiring that the relative schedule of each iteration be the same, and that the iterations be initiated at a constant interval. Since the throughput of the loop is simply the inverse of the initiation interval, the objective of software pipelining is to minimize this interval.

A software-pipeline schedule for a data-dependence graph G = (N, E) can be specified by

1. An initiation interval T and

2. A relative schedule S that specifies, for each operation, when that operation is executed relative to the start of the iteration to which it belongs.


Thus, an operation n in the ith iteration, counting from 0, is executed at clock i × T + S(n). Like all the other scheduling problems, software pipelining has two kinds of constraints: resources and data dependences. We discuss each kind in detail below.

Modular Resource Reservation

Let a machine's resources be represented by R = [r1, r2, . . .], where ri is the number of units of the ith kind of resource available. If an iteration of a loop requires ni units of resource i, then the average initiation interval of a pipelined loop is at least max_i (ni / ri) clock cycles. Software pipelining requires that the initiation intervals between any pair of iterations have a constant value. Thus, the initiation interval must be at least max_i ⌈ni / ri⌉ clocks. If max_i (ni / ri) is less than 1, it is useful to unroll the source code a small number of times.

Example 10.17: Let us return to our software-pipelined loop shown in Fig. 10.20. Recall that the target machine can issue one load, one arithmetic operation, one store, and one loop-back branch per clock. Since the loop has two loads, two arithmetic operations, and one store operation, the minimum initiation interval based on resource constraints is 2 clocks.

Figure 10.24: Resource requirements (Ld, Alu, St) of four consecutive iterations from the code in Example 10.13, plotted over time and culminating in the steady state

7"47

10.5. SOFTWARE PIPELINING

nating in maximum resource commitment in the steady state. Let RT be the resource-reservation table representing the commitment of one iteration, and let RTs represent the commitment of the steady state. RTs combines the commitment from four consecutive iterations started T clocks apart. The commitment of row 0 in the table RTs corresponds to the sum of the resources committed in RT[0], RT[2], RT[4], and RT[6]. Similarly, the commitment of row 1 in the table corresponds to the sum of the resources committed in RT[1], RT[3], RT[5], and RT[7]. That is, the resources committed in the ith row in the steady state are given by

    RTs[i]  =  Σ over { t | t mod 2 = i } of RT[t].
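
Concretely, the folding just described can be sketched in a few lines of Python (the helper names are ours; RT is the per-iteration table, R the machine's resource vector):

    def modular_table(RT, T, num_resources):
        """Fold the per-iteration table RT modulo the initiation interval T."""
        RTs = [[0] * num_resources for _ in range(T)]
        for t, row in enumerate(RT):
            for j, use in enumerate(row):
                RTs[t % T][j] += use
        return RTs

    def resource_feasible(RT, T, R):
        return all(use <= limit
                   for row in modular_table(RT, T, len(R))
                   for use, limit in zip(row, R))

For the loop of Fig. 10.20 with T = 2, both rows of the folded table stay within one load, one arithmetic unit, and one store, so the schedule is resource-feasible.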

We refer to the resource-reservation table representing the steady state as the modular resource-reservation table of the pipelined loop. To check if the software-pipeline schedule has any resource conflicts, we can simply check the commitment of the modular resource-reservation table. Surely, if the commitment in the steady state can be satisfied, so can the commitments in the prolog and epilog, the portions of code before and after the steady-state loop. □

In general, given an initiation interval T and a resource-reservation table of an iteration RT, the pipelined schedule has no resource conflicts on a machine with resource vector R if and only if RTs[i] ≤ R for all i = 0, 1, . . . , T - 1.

Data-Dependence Constraints

Data dependences in software pipelining are different from those we have encountered so far because they can form cycles. An operation may depend on the result of the same operation from a previous iteration. It is no longer adequate to label a dependence edge by just the delay; we also need to distinguish between instances of the same operation in different iterations. We label a dependence edge n1 → n2 with label ⟨δ, d⟩ if operation n2 in iteration i must be delayed by at least d clocks after the execution of operation n1 in iteration i - δ. Let S, a function from the nodes of the data-dependence graph to integers, be the software pipeline schedule, and let T be the initiation interval target. Then

    (δ × T) + S(n2) - S(n1) ≥ d.

The iteration difference, δ, must be nonnegative. Moreover, given a cycle of data-dependence edges, at least one of the edges has a positive iteration difference.
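
Checking a candidate schedule against these labeled edges is mechanical; here is a minimal sketch, with edges given as (n1, n2, delta, d) tuples (our own representation, not the book's):

    def respects_dependences(edges, S, T):
        """S maps operations to relative clocks; T is the initiation interval."""
        return all(delta * T + S[n2] - S[n1] >= d
                   for n1, n2, delta, d in edges)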

Example 10.18: Consider the following loop, and suppose we do not know the values of p and q:

    for (i = 0; i < n; i++)
        *(p++) = *(q++) + c;


We must assume that any pair of *(p++) and *(q++) accesses may refer to the same memory location. Thus, all the reads and writes must execute in the original sequential order. Assuming that the target machine has the same characteristics as that described in Example 10.13, the data-dependence edges for this code are as shown in Fig. 10.25. Note, however, that we ignore the loop-control instructions that would have to be present, either computing and testing i, or doing the test based on the value of R1 or R2. □

Figure 10.25: Data-dependence graph for Example 10.18 (R1 and R2 hold q and p, R3 holds c; edges carry labels such as ⟨1, 1⟩)

The iteration difference between related operations can be greater than one, as shown in the following example:

    for (i = 2; i < n; i++)
        A[i] = B[i] + A[i-2];

Here the value written in iteration i is used two iterations later. The dependence edge between the store of A[i] and the load of A[i-2] thus has a difference of 2 iterations.

The presence of data-dependence cycles in a loop imposes yet another limit on its execution throughput. For example, the data-dependence cycle in Fig. 10.25 imposes a delay of 4 clock ticks between load operations from consecutive iterations. That is, the loop cannot execute at a rate faster than one iteration every 4 clocks. The initiation interval of a pipelined loop is no smaller than

    max over cycles c in G of  ⌈ ( Σ over e in c of de ) / ( Σ over e in c of δe ) ⌉

clocks. In summary, the initiation interval of each software-pipelined loop is bounded by the resource usage in each iteration. Namely, the initiation interval must be no smaller than the ratio of units needed of each resource and the units


available on the machine. In addition, if the loops have data-dependence cycles, then the initiation interval is further constrained by the sum of the delays in the cycle divided by the sum of the iteration differences. The largest of these quantities defines a lower bound on the initiation interval.
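
The two bounds can be combined into a starting value for the initiation interval. In the sketch below, the per-iteration resource usage and the list of dependence cycles are assumed to be given; enumerating the cycles themselves is a separate problem, and the helper names are our own.

    from math import ceil

    def min_initiation_interval(uses, R, cycles):
        """uses[j]: units of resource j needed by one iteration; R[j]: units
        available; cycles: list of cycles, each a list of (delta, d) labels."""
        resource_bound = max(ceil(u / r) for u, r in zip(uses, R))
        cycle_bound = max((ceil(sum(d for _, d in c) /
                                sum(delta for delta, _ in c)) for c in cycles),
                          default=0)
        return max(resource_bound, cycle_bound)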

10.5.6 A Software-Pipelining Algorithm

The goal of software pipelining is to find a schedule with the smallest possible initiation interval. The problem is NP-complete, and can be formulated as an integer-linear-programming problem. We have shown that if we know what the minimum initiation interval is, the scheduling algorithm can avoid resource con­ flicts by using the modular resource-reservation table in placing each operation. But we do not know what the minimum initiation interval is until we can find a schedule. How do we resolve this circularity? We know that the initiation interval must be greater than the bound com­ puted from a loop's resource requirement and dependence cycles as discussed above. If we can find a schedule meeting this bound, we have found the opti­ mal schedule. If we fail to find such a schedule, we can try again with larger initiation intervals until a schedule is found. Note that if heuristics, rather than exhaustive search, are used, this process may not find the optimal schedule. Whether we can find a schedule near the lower bound depends on properties of the data-dependence graph and the architecture of the target machine. We can easily find the optimal schedule if the dependence graph is acyclic and if every machine instruction needs only one unit of one resource. It is also easy to find a schedule close to the lower bound if there are more hardware resources than can be used by graphs with dependence cycles. For such cases, it is advisable to start with the lower bound as the initial initiation-interval target, then keep increasing the target by just one clock with each scheduling attempt. Another possibility is to find the initiation interval using a binary search. We can use as an upper bound on the initiation interval the length of the schedule for one iteration produced by list scheduling.

10.5.7 Scheduling Acyclic Data-Dependence Graphs

For simplicity, we assume for now that the loop to be software pipelined contains only one basic block. This assumption will be relaxed in Section 10.5.11.

Algorithm 10.19: Software pipelining an acyclic dependence graph.

INPUT: A machine-resource vector R = [r1, r2, . . .], where ri is the number of units available of the ith kind of resource, and a data-dependence graph G = (N, E). Each operation n in N is labeled with its resource-reservation table RTn; each edge e = n1 → n2 in E is labeled with ⟨δe, de⟩ indicating that n2 must execute no earlier than de clocks after node n1 from the δe-th preceding iteration.

OUTPUT: A software-pipelined schedule S and an initiation interval T.


METHOD: Execute the program in Fig. 10.26.

    main() {
        T0 = max over resources j of ⌈ ( Σ over n, i of RTn(i, j) ) / rj ⌉ ;
        for (T = T0, T0 + 1, . . . , until all nodes in N are scheduled) {
            RT = an empty reservation table with T rows;
            for (each n in N in prioritized topological order) {
                s0 = max over edges e = p→n in E of (S(p) + de);
                for (s = s0, s0 + 1, . . . , s0 + T - 1)
                    if (NodeScheduled(RT, T, n, s)) break;
                if (n cannot be scheduled in RT) break;
            }
        }
    }

    NodeScheduled(RT, T, n, s) {
        RT' = RT;
        for (each row i in RTn)
            RT'[(s + i) mod T] = RT'[(s + i) mod T] + RTn[i];
        if (for all i, RT'(i) ≤ R) {
            RT = RT';
            S(n) = s;
            return true;
        }
        else return false;
    }

Figure 10.26: Software-pipelining algorithm for acyclic graphs

Algorithm 10.19 software pipelines acyclic data-dependence graphs. The algorithm first finds a bound on the initiation interval, T0, based on the resource requirements of the operations in the graph. It then attempts to find a software-pipelined schedule starting with T0 as the target initiation interval. The algorithm repeats with increasingly larger initiation intervals if it fails to find a schedule.

The algorithm uses a list-scheduling approach in each attempt. It uses a modular resource-reservation table RT to keep track of the resource commitment in the steady state. Operations are scheduled in topological order so that the data dependences can always be satisfied by delaying operations. To schedule an operation, it first finds a lower bound s0 according to the data-dependence constraints. It then invokes NodeScheduled to check for possible resource conflicts in the steady state. If there is a resource conflict, the algorithm tries to schedule the operation in the next clock. If the operation is found to conflict for


T consecutive clocks, because of the modular nature of resource-conflict detec­ tion, further attempts are guaranteed to be futile. At that point, the algorithm considers the attempt a failure, and another initiation interval is tried. The heuristics of scheduling operations as soon as possible tends to minimize the length of the schedule for an iteration. Scheduling an instruction as early as possible, however, can lengthen the lifetimes of some variables. For example, loads of data tend to be scheduled early, sometimes long before they are used. One simple heuristic is to schedule the dependence graph backwards because there are usually more loads than stores.
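
Putting the pieces together, here is a compact Python rendering of the acyclic case. It is a sketch under simplifying assumptions (our own data structures; every edge is treated as a same-iteration dependence), not a drop-in implementation of Algorithm 10.19.

    from math import ceil

    def node_scheduled(RT, T, rt_n, s, R):
        """Tentatively commit one operation at slot s in the modular table RT."""
        trial = [row[:] for row in RT]
        for i, need in enumerate(rt_n):
            r = (s + i) % T
            trial[r] = [trial[r][j] + need[j] for j in range(len(R))]
        if all(u <= limit for row in trial for u, limit in zip(row, R)):
            RT[:] = trial
            return True
        return False

    def pipeline_acyclic(ops, rt, preds, R):
        """ops: operations in prioritized topological order; rt[n]: reservation
        table (one resource vector per clock); preds[n]: (p, d) pairs."""
        uses = [sum(row[j] for n in ops for row in rt[n]) for j in range(len(R))]
        T = max(ceil(u / r) for u, r in zip(uses, R))      # resource bound T0
        while True:
            RT = [[0] * len(R) for _ in range(T)]           # modular table
            S = {}
            for n in ops:
                s0 = max((S[p] + d for p, d in preds[n]), default=0)
                for s in range(s0, s0 + T):   # after T tries, conflicts repeat
                    if node_scheduled(RT, T, rt[n], s, R):
                        S[n] = s
                        break
                else:
                    break                      # this T failed
            else:
                return S, T
            T += 1                             # retry with a larger interval

The outer while loop mirrors the retry structure of Fig. 10.26: if some operation cannot be placed within T consecutive slots, the whole attempt is abandoned and a larger initiation interval is tried.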

10.5.8 Scheduling Cyclic Dependence Graphs

Dependence cycles complicate software pipelining significantly. When scheduling operations in an acyclic graph in topological order, data dependences with scheduled operations can impose only a lower bound on the placement of each operation. As a result, it is always possible to satisfy the data-dependence constraints by delaying operations. The concept of "topological order" does not apply to cyclic graphs. In fact, given a pair of operations sharing a cycle, placing one operation will impose both a lower and upper bound on the placement of the second.

Let n1 and n2 be two operations in a dependence cycle, S be a software-pipeline schedule, and T be the initiation interval for the schedule. A dependence edge n1 → n2 with label ⟨δ1, d1⟩ imposes the following constraint on S(n1) and S(n2):

    (δ1 × T) + S(n2) - S(n1) ≥ d1.

Similarly, a dependence edge n2 → n1 with label ⟨δ2, d2⟩ imposes the constraint

    (δ2 × T) + S(n1) - S(n2) ≥ d2.

Thus,

    S(n1) + d1 - δ1 × T  ≤  S(n2)  ≤  S(n1) - d2 + δ2 × T.

A strongly connected component (SCC) in a graph is a set of nodes where every node in the component can be reached by every other node in the component. Scheduling one node in an SCC will bound the time of every other node in the component both from above and from below. Transitively, if there exists a path p leading from n1 to n2, then

    S(n2) - S(n1)  ≥  Σ over e in p of ( de - (δe × T) )        (10.1)

Observe that

• Around any cycle, the sum of the δ's must be positive. If it were 0 or negative, then it would say that an operation in the cycle either had to precede itself or be executed at the same clock for all iterations.



• The schedule of operations within an iteration is the same for all iterations; that requirement is essentially the meaning of a "software pipeline." As a result, the sum of the delays (second components of edge labels in a data-dependence graph) around a cycle is a lower bound on the initiation interval T.

When we combine these two points, we see that for any feasible initiation interval T, the value of the right side of Equation (10.1) must be negative or zero. As a result, the strongest constraints on the placement of nodes are obtained from the simple paths - those paths that contain no cycles. Thus, for each feasible T, computing the transitive effect of data dependences on each pair of nodes is equivalent to finding the length of the longest simple path from the first node to the second. Moreover, since cycles cannot increase the length of a path, we can use a simple dynamic-programming algorithm to find the longest paths without the "simple-path" requirement, and be sure that the resulting lengths will also be the lengths of the longest simple paths (see Exercise 10.5.7).
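
That dynamic-programming computation is exactly what the AllPairsLongestPath step of Algorithm 10.21 below must supply. Here is a Floyd-Warshall-style sketch, assuming edges are given as (n1, n2, delta, d) tuples (our own representation) and that T is feasible, so no cycle has positive weight:

    NEG_INF = float("-inf")

    def all_pairs_longest_path(nodes, edges, T):
        """Longest-path lengths under edge weight d - delta*T."""
        dist = {u: {v: NEG_INF for v in nodes} for u in nodes}
        for u in nodes:
            dist[u][u] = 0
        for n1, n2, delta, d in edges:
            dist[n1][n2] = max(dist[n1][n2], d - delta * T)
        for k in nodes:                  # Floyd-Warshall, maximizing
            for i in nodes:
                for j in nodes:
                    if dist[i][k] + dist[k][j] > dist[i][j]:
                        dist[i][j] = dist[i][k] + dist[k][j]
        return dist

On Example 10.20 with T = 3, this recovers entries such as the longest path from c to b being 2 - T = -1.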

Figure 10.27: Dependence graph and resource requirement in Example 10.20

Example 10.20: Figure 10.27 shows a data-dependence graph with four nodes a, b, c, d. Attached to each node is its resource-reservation table; attached to


Placing any of b, c, or d in a schedule constrains all the other nodes in the component. Let T be the initiation interval. Figure 10.28 shows the transitive dependences. Part (a) shows the delay and the iteration difference δ, for each edge. The delay is represented directly, but δ is represented by "adding" to the delay the value -δT.

Figure 10.28(b) shows the length of the longest simple path between two nodes, when such a path exists; its entries are the sums of the expressions given by Fig. 10.28(a), for each edge along the path. Then, in (c) and (d), we see the expressions of (b) with the two relevant values of T, that is, 3 and 4, substituted for T. The difference between the schedule of two nodes S(n2) - S(n1) must be no less than the value given in entry (n1, n2) in each of the tables (c) or (d), depending on the value of T chosen.

For instance, consider the entry in Fig. 10.28 for the longest (simple) path from c to b, which is 2 - T. The longest simple path from c to b is c → d → b. The total delay is 2 along this path, and the sum of the δ's is 1, representing the fact that the iteration number must increase by 1. Since T is the time by which each iteration follows the previous, the clock at which b must be scheduled is at least 2 - T clocks after the clock at which c is scheduled. Since T is at least 3, we are really saying that b may be scheduled T - 2 clocks before c, or later than that clock, but not earlier.

Notice that considering nonsimple paths from c to b does not produce a stronger constraint. We can add to the path c → d → b any number of iterations of the cycle involving d and b. If we add k such cycles, we get a path length of 2 - T + k(3 - T), since the total delay around the cycle is 3 and the sum of its δ's is 1. Since T ≥ 3, this length can never exceed 2 - T; i.e., the strongest lower bound on the clock of b relative to the clock of c is 2 - T, the bound we get by considering the longest simple path. For example, from entries (b, c) and (c, b), we see that

S(c) - S(b) ≥ 1
S(b) - S(c) ≥ 2 - T.

That is,

S(b) + 1 ≤ S(c) ≤ S(b) - 2 + T.

If T = 3,

S(b) + 1 ≤ S(c) ≤ S(b) + 1.

Put equivalently, c must be scheduled one clock after b. If T = 4, however,

S(b) + 1 ≤ S(c) ≤ S(b) + 2.

That is, c is scheduled one or two clocks after b. Given the all-points longest-path information, we can easily compute the range where it is legal to place a node due to data dependences. We see that there is no slack in the case when T = 3, and the slack increases as T increases. □
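To make the slack computation concrete, here is a small sketch (our own function; it assumes the longest-path entries of Fig. 10.28 are available as a dictionary) that derives the legal window for one node relative to an already placed node.

# Legal placement window for node n relative to an already scheduled node m,
# given the all-points longest-path lengths dist (None means no path exists).
def legal_window(dist, m, n, S_m):
    lower = S_m + dist[m][n] if dist[m][n] is not None else float("-inf")
    upper = S_m - dist[n][m] if dist[n][m] is not None else float("inf")
    return lower, upper

# Entries (b, c) = 1 and (c, b) = 2 - T from Fig. 10.28(b), with b placed at clock 0.
for T in (3, 4):
    dist = {"b": {"c": 1}, "c": {"b": 2 - T}}
    print(T, legal_window(dist, "b", "c", S_m=0))
# Prints (1, 1) for T = 3 and (1, 2) for T = 4: no slack, then one clock of slack.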

(a) Original edges. Each entry gives, for the edge from the row node to the column node, its delay, with -δT added to represent a nonzero iteration difference δ:

          b        c        d
    a     2
    b              1
    c                       1
    d     1 - T

(b) Longest simple paths:

          b        c        d
    a     2        3        4
    b              1        2
    c     2 - T             1
    d     1 - T    2 - T

(c) Longest simple paths (T = 3):

          b        c        d
    a     2        3        4
    b              1        2
    c     -1                1
    d     -2       -1

(d) Longest simple paths (T = 4):

          b        c        d
    a     2        3        4
    b              1        2
    c     -2                1
    d     -3       -2

Figure 10.28: Transitive dependences in Example 10.20

Algorithm 10.21: Software pipelining.

INPUT: A machine-resource vector R = [r1, r2, ...], where ri is the number of units available of the ith kind of resource, and a data-dependence graph G = (N, E). Each operation n in N is labeled with its resource-reservation table RTn; each edge e = n1 → n2 in E is labeled with (δe, de), indicating that n2 must execute no earlier than de clocks after node n1 from the δe-th preceding iteration.

OUTPUT:

A software-pipelined schedule S and an initiation interval T.

METHOD:

Execute the program in Fig. 10.29.

□

Algorithm 10.21 has a high-level structure similar to that of Algorithm 10.19, which handles only acyclic graphs. The minimum initiation interval in this case is bounded not just by resource requirements, but also by the data-dependence cycles in the graph. The graph is scheduled one strongly connected component at a time. By treating each strongly connected component as a unit, the edges between strongly connected components necessarily form an acyclic graph. While the top-level loop in Algorithm 10.19 schedules nodes in the graph in topological order, the top-level loop in Algorithm 10.21 schedules strongly connected components in topological order. As before, if the algorithm fails to schedule all the components, then a larger initiation interval is tried. Note that Algorithm 10.21 behaves exactly like Algorithm 10.19 if given an acyclic data-dependence graph. Algorithm 10.21 computes two more sets of edges: E' is the set of all edges whose iteration difference is 0, and E* is the set of all-points longest-path edges.
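A minimal sketch (our own helper names, reusing the edge representation from the earlier sketch; not the book's code) of how E' and the topological order over strongly connected components might be derived:

from collections import defaultdict

def zero_difference_edges(edges):
    # E': edges whose iteration difference is 0; they never cross an iteration
    # boundary and therefore cannot form cycles.
    return [(n1, n2) for (n1, n2, delta, delay) in edges if delta == 0]

def sccs_in_topological_order(nodes, edges):
    # Kosaraju's algorithm; processing nodes in reverse finishing order of the
    # first pass yields the components in topological order of the condensation.
    adj, radj = defaultdict(list), defaultdict(list)
    for n1, n2, _, _ in edges:
        adj[n1].append(n2)
        radj[n2].append(n1)
    order, seen = [], set()
    def dfs1(u):
        seen.add(u)
        for v in adj[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            dfs1(u)
    comp = {}
    def dfs2(u, label):
        comp[u] = label
        for v in radj[u]:
            if v not in comp:
                dfs2(v, label)
    labels = []
    for u in reversed(order):
        if u not in comp:
            labels.append(u)              # use the root node as the component label
            dfs2(u, u)
    return [[n for n in nodes if comp[n] == label] for label in labels]

edges = [("a", "b", 0, 2), ("b", "c", 0, 1), ("c", "d", 0, 1), ("d", "b", 1, 1)]
print(zero_difference_edges(edges))                           # [('a','b'), ('b','c'), ('c','d')]
print(sccs_in_topological_order(["a", "b", "c", "d"], edges)) # [['a'], ['b', 'c', 'd']]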


main() {
    E' = { e | e in E, δe = 0 };
    T0 = max( max_j ⌈ Σ_{n,i} RTn(i, j) / rj ⌉ ,  max_{c a cycle in G} ⌈ Σ_{e in c} de / Σ_{e in c} δe ⌉ );
    for (T = T0, T0 + 1, . . . , or until all SCC's in G are scheduled) {
        RT = an empty reservation table with T rows;
        E* = AllPairsLongestPath(G, T);
        for (each SCC C in G in prioritized topological order) {
            for (all n in C)
                s0(n) = max_{e = p→n in E*, p scheduled} (S(p) + de);
            first = some n such that s0(n) is a minimum;
            s0 = s0(first);
            for (s = s0; s < s0 + T; s = s + 1)
                if (SccScheduled(RT, T, C, first, s)) break;
            if (C cannot be scheduled in RT) break;
        }
    }
}

SccScheduled(RT, T, c, first, s) {
    RT' = RT;
    if (not NodeScheduled(RT', T, first, s)) return false;
    for (each remaining n in c in prioritized topological order of edges in E') {
        sl = max_{e = n'→n in E*, n' in c, n' scheduled} (S(n') + de - δe × T);
        su = min_{e = n→n' in E*, n' in c, n' scheduled} (S(n') - de + δe × T);
        for (s = sl; s ≤ min(su, sl + T - 1); s = s + 1)
            if (NodeScheduled(RT', T, n, s)) break;
        if (n cannot be scheduled in RT') return false;
    }
    RT = RT';
    return true;
}

Figure 10.29: A software-pipelining algorithm for cyclic dependence graphs


That is, for each pair of nodes (p, n), there is an edge e in E* whose associated distance de is the length of the longest simple path from p to n, provided that there is at least one path from p to n. E* is computed for each value of T, the initiation-interval target. It is also possible to perform this computation just once with a symbolic value of T and then substitute for T in each iteration, as we did in Example 10.20.

Algorithm 10.21 uses backtracking. If it fails to schedule an SCC, it tries to reschedule the entire SCC a clock later. These scheduling attempts continue for up to T clocks. Backtracking is important because, as shown in Example 10.20, the placement of the first node in an SCC can fully dictate the schedule of all other nodes. If the schedule happens not to fit with the schedule created thus far, the attempt fails.

To schedule an SCC, the algorithm determines the earliest time each node in the component can be scheduled satisfying the transitive data dependences in E*. It then picks the one with the earliest start time as the first node to schedule. The algorithm then invokes SccScheduled to try to schedule the component at that earliest start time. The algorithm makes at most T attempts with successively greater start times. If it fails, then the algorithm tries another initiation interval.

The SccScheduled algorithm resembles Algorithm 10.19, but has three major differences.

1. The goal of SccScheduled is to schedule the strongly connected component at the given time slot s. If the first node of the strongly connected component cannot be scheduled at s, SccScheduled returns false. The main function can invoke SccScheduled again with a later time slot if that is desired.

2. The nodes in the strongly connected component are scheduled in topological order, based on the edges in E'. Because the iteration differences on all the edges in E' are 0, these edges do not cross any iteration boundaries and cannot form cycles. (Edges that cross iteration boundaries are known as loop-carried.) Only loop-carried dependences place upper bounds on where operations can be scheduled. So this scheduling order, along with the strategy of scheduling each operation as early as possible, maximizes the ranges in which subsequent nodes can be scheduled.

3. For strongly connected components, dependences impose both a lower and an upper bound on the range in which a node can be scheduled. SccScheduled computes these ranges and uses them to further limit the scheduling attempts.

Example 10.22: Let us apply Algorithm 10.21 to the cyclic data-dependence graph in Example 10.20. The algorithm first computes that the bound on the initiation interval for this example is 3 clocks. We note that it is not possible to meet this lower bound. When the initiation interval T is 3, the transitive


dependences in Fig. 10.28 dictate that S(d) - S(b) = 2. Scheduling nodes b and d two clocks apart will produce a conflict in a modular resource-reservation table of length 3.
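The modular check itself is simple; the following sketch (our own names, not the book's NodeScheduled routine verbatim) shows how a node's reservation table is tested against, and committed to, a modular resource-reservation table of length T.

# Try to place node n at clock s in a modular reservation table of length T.
# rt is a T-by-(number of resources) table of committed units, node_rt is the
# node's own reservation table, and units[j] is the number of units of resource j.
def node_scheduled(rt, T, node_rt, units, s):
    for i, row in enumerate(node_rt):            # i = clock offset within the node
        for j, need in enumerate(row):
            if need and rt[(s + i) % T][j] + need > units[j]:
                return False                     # steady-state resource conflict
    for i, row in enumerate(node_rt):            # no conflict: commit the reservation
        for j, need in enumerate(row):
            rt[(s + i) % T][j] += need
    return True

# With T = 3 and one unit of each resource, two nodes whose uses of the same
# resource fall at clocks congruent modulo 3 cannot both be placed.
T = 3
rt = [[0, 0] for _ in range(T)]
node_using_resource_2 = [[0, 1]]                 # hypothetical single-row reservation table
print(node_scheduled(rt, T, node_using_resource_2, [1, 1], 2))   # True
print(node_scheduled(rt, T, node_using_resource_2, [1, 1], 5))   # False, since 5 ≡ 2 (mod 3)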

Attempt   Initiation Interval   Node   Range     Schedule
   1           T = 3             a     (0, ∞)       0
                                 b     (2, ∞)       2
                                 c     (3, 3)       -
   2           T = 3             a     (0, ∞)       0
                                 b     (2, ∞)       3
                                 c     (4, 4)       4
                                 d     (5, 5)       -
   3           T = 3             a     (0, ∞)       0
                                 b     (2, ∞)       4
                                 c     (5, 5)       5
                                 d     (6, 6)       -
   4           T = 4             a     (0, ∞)       0
                                 b     (2, ∞)       2
                                 c     (3, 4)       3
                                 d     (4, 5)       -
   5           T = 4             a     (0, ∞)       0
                                 b     (2, ∞)       3
                                 c     (4, 5)       5
                                 d     (5, 5)       -
   6           T = 4             a     (0, ∞)       0
                                 b     (2, ∞)       4
                                 c     (5, 6)       5
                                 d     (6, 7)       6

[The modular resource-reservation table accompanying each attempt in the original figure is not reproduced here.]

Figure 10.30: Behavior of Algorithm 10.21 on Example 10.20

Figure 10.30 shows how Algorithm 10.21 behaves with this example. It first tries to find a schedule with a 3-clock initiation interval. The attempt starts by scheduling nodes a and b as early as possible. However, once node b is placed in clock 2, node c can only be placed at clock 3, which conflicts with the resource usage of node a. That is, a and c both need the first resource at clocks that have a remainder of 0 modulo 3. The algorithm backtracks and tries to schedule the strongly connected component {b, c, d} a clock later. This time node b is scheduled at clock 3, and node c is scheduled successfully at clock 4. Node d, however, cannot be scheduled in


clock 5. That is, both b and d need the second resource at clocks that have a remainder of 0 modulo 3. Note that it is just a coincidence that the two conflicts discovered so far are at clocks with a remainder of 0 modulo 3; the conflict might have occurred at clocks with remainder 1 or 2 in another example. The algorithm repeats by delaying the start of the SCC {b, c, d} by one more clock. But, as discussed earlier, this SCC can never be scheduled with an initiation interval of 3 clocks, so the attempt is bound to fail. At this point, the algorithm gives up and tries to find a schedule with an initiation interval of 4 clocks. The algorithm eventually finds the optimal schedule on its sixth attempt. □

10.5.9 Improvements to the Pipelining Algorithms

Algorithm 10.21 is a rather simple algorithm, although it has been found to work well on actual machine targets. The important elements in this algorithm are:

1. The use of a modular resource-reservation table to check for resource conflicts in the steady state.

2. The need to compute the transitive dependence relations to find the legal range in which a node can be scheduled in the presence of dependence cycles.

3. Backtracking is useful, and nodes on critical cycles (cycles that place the highest lower bound on the initiation interval T) must be rescheduled together because there is no slack between them.

There are many ways to improve Algorithm 10.21. For instance, the algorithm takes a while to realize that a 3-clock initiation interval is infeasible for the simple Example 10.22. We can instead schedule the strongly connected components independently first to determine whether the initiation interval is feasible for each component, as in the sketch below. We can also modify the order in which the nodes are scheduled. The order used in Algorithm 10.21 has a few disadvantages. First, because nontrivial SCC's are harder to schedule, it is desirable to schedule them first. Second, some of the registers may have unnecessarily long lifetimes. It is desirable to pull the definitions closer to the uses. One possibility is to start by scheduling strongly connected components with critical cycles first, then extend the schedule on both ends.
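A sketch of that feasibility pre-check (our own helper; schedule_scc stands in for a routine like SccScheduled and is an assumption, not the book's code):

# Reject an initiation interval T quickly by trying to schedule each nontrivial
# SCC on its own, ignoring edges between components.
def interval_feasible_per_scc(sccs, T, schedule_scc, units):
    for component in sccs:
        if len(component) == 1:
            continue                                   # trivial SCCs always fit alone
        fits = False
        for s in range(T):                             # try each start slot modulo T
            rt = [[0] * len(units) for _ in range(T)]  # fresh modular table per attempt
            if schedule_scc(rt, T, component, s):
                fits = True
                break
        if not fits:
            return False                               # no start slot works for this SCC
    return True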

10.5.10 Modular Variable Expansion

A scalar variable is said to be privatizable in a loop if its live range falls within an iteration of the loop. In other words, a privatizable variable must not be live upon either entry or exit of any iteration.


Are There Alternatives to Heuristics?

We can formulate the problem of simultaneously finding an optimal software-pipeline schedule and register assignment as an integer-linear-programming problem. While many integer linear programs can be solved quickly, some of them can take an exorbitant amount of time. To use an integer-linear-programming solver in a compiler, we must be able to abort the procedure if it does not complete within some preset limit. Such an approach has been tried empirically on a target machine (the SGI R8000), and it was found that the solver could find the optimal solution for a large percentage of the programs in the experiment within a reasonable amount of time. It turned out that the schedules produced using a heuristic approach were also close to optimal. The results suggest that, at least for that machine, it does not make sense to use the integer-linear-programming approach, especially from a software engineering perspective. Because the integer-linear solver may not finish, it is still necessary to implement some kind of heuristic scheduler in the compiler. Once such a heuristic scheduler is in place, there is little incentive to implement a scheduler based on integer programming techniques as well.

These variables are so named because different processors executing different iterations in a loop can have their own private copies and thus not interfere with one another.

Variable expansion refers to the transformation of converting a privatizable scalar variable into an array and having the ith iteration of the loop read and write the ith element. This transformation eliminates the antidependence constraints between reads in one iteration and writes in the subsequent iterations, as well as output dependences between writes from different iterations. If all loop-carried dependences can be eliminated, all the iterations in the loop can be executed in parallel.

Eliminating loop-carried dependences, and thus eliminating cycles in the data-dependence graph, can greatly improve the effectiveness of software pipelining. As illustrated by Example 10.15, we need not expand a privatizable variable fully by the number of iterations in the loop. Only a small number of iterations can be executing at a time, and privatizable variables may simultaneously be live in an even smaller number of iterations. The same storage can thus be reused to hold variables with nonoverlapping lifetimes. More specifically, if the lifetime of a register is l clocks, and the initiation interval is T, then only q = ⌈l/T⌉ values can be live at any one point. We can allocate q registers to the variable, with the variable in the ith iteration using the (i mod q)th register. We refer to this transformation as modular variable expansion.
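A small sketch (our own illustration, not from the text) of how modular variable expansion assigns registers to a privatizable variable:

import math

# A privatizable variable whose value lives for l clocks, in a pipeline with
# initiation interval T, needs only q = ceil(l / T) rotating registers.
def modular_registers(l, T):
    return math.ceil(l / T)

def register_for_iteration(i, q):
    # Iteration i reads and writes the (i mod q)th register assigned to the variable.
    return i % q

q = modular_registers(l=5, T=2)     # hypothetical lifetime of 5 clocks, T = 2
print(q)                            # 3 registers suffice
print([register_for_iteration(i, q) for i in range(8)])   # [0, 1, 2, 0, 1, 2, 0, 1]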

Algorithm 10.23: Software pipelining with modular variable expansion.

INPUT: A data-dependence graph and a machine-resource description.


OUTPUT: Two loops, one software pipelined and one unpipelined.

METHOD:

1. Remove the loop-carried antidependences and output dependences associated with privatizable variables from the data-dependence graph.

2. Software-pipeline the resulting dependence graph using Algorithm 10.21. Let T be the initiation interval for which a schedule is found, and L be the length of the schedule for one iteration.

3. From the resulting schedule, compute qv, the minimum number of registers needed by each privatizable variable v. Let Q = maxv qv.

4. Generate two loops: a software-pipelined loop and an unpipelined loop. The software-pipelined loop has ⌈L/T⌉ - 1 + Q copies of the iterations, placed T clocks apart. It has a prolog with (⌈L/T⌉ - 1)T instructions, a steady state with QT instructions, and an epilog of L - T instructions. Insert a loop-back instruction that branches from the bottom of the steady state to the top of the steady state.

The number of registers assigned to privatizable variable v, call it q'v, is qv if Q mod qv = 0, and Q otherwise, so that the register-assignment pattern repeats exactly with each pass through the steady state. The variable v in iteration i uses the (i mod q'v)th register assigned. Let n be the variable representing the number of iterations in the source loop. The software-pipelined loop is executed if n
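A sketch (our own helper; the layout formulas are those of step 4, and the sample L, T, Q values are hypothetical) of the code-layout bookkeeping implied by the algorithm:

import math

# Sizes of the three regions of the generated software-pipelined loop, in clocks,
# for a one-iteration schedule of length L, initiation interval T, and Q
# steady-state copies produced by modular variable expansion.
def pipelined_layout(L, T, Q):
    stages = math.ceil(L / T)
    return {
        "iteration_copies": stages - 1 + Q,   # iterations started in prolog + steady state
        "prolog":  (stages - 1) * T,
        "steady":  Q * T,
        "epilog":  L - T,
    }

def registers_for_variable(q_v, Q):
    # q_v registers work only if the assignment pattern repeats once per steady
    # state, i.e., if q_v divides Q; otherwise fall back to Q (Q always divides Q).
    return q_v if Q % q_v == 0 else Q

print(pipelined_layout(L=7, T=2, Q=3))       # {'iteration_copies': 6, 'prolog': 6, 'steady': 6, 'epilog': 5}
print(registers_for_variable(q_v=2, Q=3))    # 3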
