THE DESIGN AND IMPLEMENTATION OF MASSIVELY PARALLEL KNOWLEDGE REPRESENTATION AND REASONING SYSTEMS: A CONNECTIONIST APPROACH

D. R. MANI

A DISSERTATION
in
Computer and Information Science

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy, 1995

Dissertation Supervisor: Lokendra Shastri

School of Engineering and Applied Science
University of Pennsylvania, Philadelphia, PA 19104

© Copyright D. R. Mani 1996

Acknowledgements

It has been a great pleasure to work with my dissertation supervisor, Lokendra Shastri. His support and guidance have been instrumental in both the formulation and execution of this dissertation. He has always gone out of his way to help me whenever necessary. This dissertation has greatly benefited from my association with him.

I would like to take this opportunity to thank Susan Davidson (University of Pennsylvania), Chris Overton (University of Pennsylvania), Sanguthevar Rajasekaran (University of Florida) and Dave Waltz (NEC Research Institute) for taking the time to serve on my dissertation committee. Their insightful comments and suggestions have contributed to the betterment of this work.

My ten-month visit to the International Computer Science Institute, Berkeley (arranged by Lokendra Shastri and Jerry Feldman) has been one of the most productive and academically stimulating experiences in my tenure as a graduate student. The discussions I have had with David Bailey, Jerry Feldman, Ben Gomes, Dan Jurafsky and Srini Narayanan have contributed significantly to this dissertation. ICSI also provided some of the computational resources needed for my dissertation.

My colleagues at Penn have influenced my dissertation either directly, through discussions and exchange of ideas, or indirectly, through peer support and friendship. In particular, thanks to Jeff Aaronson, Tom Fontaine, Jim Gee, Chris Geib, Libby Levison and Chuck Liang. The CIS Department graduate coordinator, Mike Felker, has been extremely helpful in taking care of the administrative procedures and requirements. The Penn research computing staff have provided efficient and prompt help with system maintenance and administration. In addition, the consulting staff at NCSA and the University of California at Berkeley have helped with CM-5 support.

This work has been supported by ARO grants DAA29-84-9-0027 and DAAL03-89-C-0031 to the Army Research Center at the University of Pennsylvania, ONR grant N00014-93-1-1149 and NSF resource grant CCR930001N to Lokendra Shastri, and National Science Foundation Infrastructure Grant CDA-8722788 to the University of California at Berkeley. All CM-5 and 32K, 16K and 8K CM-2 experiments were run on the Connection Machines at the National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign. All experiments on the 4K CM-2 were run at the University of Pennsylvania. The CM-5 workshop at NCSA (conceived by the MetaCenter Computational Science Institutes in Parallel Computing, sponsored by the NSF, and organized by the National Center for Supercomputing Applications and the San Diego Supercomputing Center) provided me with some computing resources and a great learning opportunity.

My wife Lakshmi has been a companion through almost the whole of my tenure as a graduate student. In addition to her constant support and encouragement, which have kept me afloat through difficult stretches, she has also helped me understand myself better. My son Deepak has been a source of joy, but for whose welcome intervention this dissertation would have been finished a year ago! Last but not least, thanks to my parents and Doctor Mama for their unusual support during the course of my education.


ABSTRACT

Efficient knowledge representation and reasoning is an important component of intelligent activity, and is a crucial aspect in the design of large-scale intelligent systems. This dissertation explores the design, analysis, and implementation of massively parallel knowledge representation and reasoning systems which can encode very large knowledge bases and respond to a class of queries in real-time, with reasoning episodes expected to span a fraction of a second. The dissertation attempts to design efficient, large-scale knowledge base systems by: (i) exploiting massive parallelism; and (ii) constraining representational and inferential capabilities to achieve tractability, while still retaining sufficient expressive power to capture a broad class of reasoning in intelligent systems. To this end, shruti, a connectionist reasoning system which models reflexive (i.e., effortless and spontaneous) reasoning, serves as the knowledge representation and reasoning framework.

Shruti-based massively parallel systems that can encode very large knowledge bases and perform a class of reasoning in real-time are developed. Shruti-cm5, the parallel reflexive reasoning system on the Connection Machine CM-5, encodes large artificial knowledge bases with over half a million (randomly generated) rules and facts, and responds to a range of queries requiring derivation depths of up to eight in well under a second. Shruti-cm5 running WordNet, a real-world lexical database, responds to queries in times ranging from a few to a few hundred milliseconds.

With a view toward exploiting machine characteristics and knowledge base structure, a quantitative analysis optimizes performance with respect to the probability of finding related knowledge elements on the same processor, thereby leading to an optimal mapping of the knowledge base onto the underlying parallel machine. The analysis makes some interesting predictions, validated by experimental data, about query response time and its relation to computation and communication costs. These results help engineer performance improvements and steer the course of future research. From a practical standpoint, in addition to providing some new insights into mapping structured connectionist networks onto massively parallel machines, this work develops viable technology for supporting large-scale knowledge base systems.


Contents

Acknowledgements
Abstract

1 Introduction
  1.1 The Problem
    1.1.1 The Need for Parallelism
    1.1.2 Tractable Reasoning
    1.1.3 Problem Definition
    1.1.4 Our Approach
  1.2 Contributions
  1.3 An Overview of Shruti-cm5
  1.4 Thesis Organization

2 Structured Connectionism and Knowledge Representation
  2.1 The Essence of Connectionism
  2.2 Structured Connectionism
  2.3 Shruti: A Structured Connectionist Reasoning System
    2.3.1 Reflexive Reasoning
    2.3.2 The Rule-Based Reasoning System: An Overview
    2.3.3 Terminology
    2.3.4 Knowledge Encoding for Backward Reasoning
    2.3.5 The Inference Process
    2.3.6 Rules with Constants and Repeated Variables in the Consequent
    2.3.7 Multiple Antecedent Rules
    2.3.8 The Type Hierarchy
    2.3.9 Multiple Dynamic Instantiation of Predicates
    2.3.10 Network Complexity
    2.3.11 Constraints on Rules and Inferences
    2.3.12 A Characterization of Reflexive Reasoning

3 Mapping SHRUTI onto Massively Parallel Machines
  3.1 Structured Connectionism and Massively Parallel Architectures
    3.1.1 Special Purpose Hardware
    3.1.2 General Purpose Parallel Architectures
    3.1.3 Mapping Structured Connectionist Models onto Massively Parallel Architectures
  3.2 Exploiting Constraints Imposed by Shruti
  3.3 Representing Synchrony
  3.4 Granularity
    3.4.1 Network-Level Mapping
    3.4.2 Knowledge-Level Mapping
    3.4.3 Optimal Mapping Granularity
  3.5 Processor Allocation
  3.6 Static and Incremental Encoding of Knowledge Bases
  3.7 Load Balancing and Communication
  3.8 Scalability and Resource Limitations
  3.9 Analysis and Experimentation

4 Related Work
  4.1 Massively Parallel Knowledge Representation Systems
    4.1.1 PARKA
    4.1.2 SNAP
    4.1.3 Other Systems
    4.1.4 Discussion
  4.2 Parallel Logic Programming
    4.2.1 Logic Programs
    4.2.2 Parallelism in Logic Programs
    4.2.3 Parallel Logic Programming Languages
    4.2.4 Logic Programming and Shruti
  4.3 Parallel Database Systems
  4.4 Parallel Knowledge Representation and Reasoning
    4.4.1 Massively Parallel Semantic Networks
    4.4.2 Structured Connectionist Reasoning Systems
    4.4.3 Other Architectures
  4.5 Special Purpose Hardware
    4.5.1 Digital Signal Processor Based Architectures
    4.5.2 Other Architectures
  4.6 Mapping Connectionist Systems onto Parallel Architectures
    4.6.1 General Considerations
    4.6.2 One-Dimensional SIMD Architectures
    4.6.3 Two-Dimensional SIMD Architectures
    4.6.4 Systolic Arrays
    4.6.5 Dataflow Architectures
    4.6.6 Multiprocessor Systems
    4.6.7 Discussion
  4.7 Performance Measurement and Testing
    4.7.1 Performance Measures
    4.7.2 Time and Space Complexity
    4.7.3 Formal Analysis
    4.7.4 Benchmarks
    4.7.5 Our Approach

5 SHRUTI on the CM-5: Design and Implementation
  5.1 The Connection Machine CM-5
  5.2 Design Considerations
    5.2.1 Granularity of Mapping
    5.2.2 Representing Synchrony
    5.2.3 Active Messages and Communication
  5.3 Design Features
    5.3.1 The Rule-Base
    5.3.2 The Type Hierarchy
    5.3.3 Tags: Node and Link Labeling
    5.3.4 Queries
    5.3.5 Processor Allocation
  5.4 Encoding the Knowledge Base
    5.4.1 Input Processing
    5.4.2 Serial Preprocessor
    5.4.3 Input Parser
    5.4.4 Server Processor
    5.4.5 Representing Predicates and Concepts
    5.4.6 Encoding Rules, Facts and is-a Relations
  5.5 Spreading Activation and Inference
    5.5.1 Posing Queries
    5.5.2 The Activation Propagation Loop
    5.5.3 Reasoning Episode Termination
    5.5.4 Interaction Between the Rule-Base and Type Hierarchy
    5.5.5 Activation Frontiers
    5.5.6 Interprocessor Communication
    5.5.7 Firing Rules
    5.5.8 Fact Matching
    5.5.9 Propagating Collector Activation
    5.5.10 Activation Propagation in the Type Hierarchy
    5.5.11 Multiple Instantiation
    5.5.12 Statistics Collection
  5.6 The Shruti-cm5 User Interface
  5.7 Shruti-cm5: Other Modules and Versions
  5.8 Practical, Large-Scale Knowledge Representation with Shruti-cm5
  5.9 Shruti on the CM-2

6 A Mathematical Analysis of SHRUTI-CM5
  6.1 Motivation
  6.2 Query Response Time
    6.2.1 Computation and Communication Costs
    6.2.2 Defining q
    6.2.3 Spreading Activation in the Rule-Base
    6.2.4 Spreading Activation in the Type Hierarchy
    6.2.5 Observations
  6.3 Assumptions
  6.4 Communication Cost, Active Messages and the CM-5
    6.4.1 Revised Equations for Response Time
  6.5 Simplifications and Approximations
    6.5.1 Simplified Equation for Response Time
  6.6 Determining Query Response Time
    6.6.1 Queries
    6.6.2 Cardinality of Frontier Sets
  6.7 Optimizing Query Response Time
    6.7.1 Problem Definition
    6.7.2 Minimization
    6.7.3 Example
  6.8 Properties of Trb and Tth
  6.9 Estimating Parameters
  6.10 Implications of the Analysis
    6.10.1 Some Caveats
    6.10.2 Performance: Relation to Computation and Communication Costs
    6.10.3 Random Processor Allocation
    6.10.4 Synchronous versus Asynchronous Operation
    6.10.5 Knowledge Base Structure
  6.11 Glossary of Terms and Symbols
    6.11.1 Subscripts
    6.11.2 Glossary

7 SHRUTI on the CM-5: Experiments with Large Knowledge Bases
  7.1 Generating Random Knowledge Bases
    7.1.1 Generating Queries
  7.2 Experiments with Random Knowledge Bases
    7.2.1 Experimental Results: Knowledge Base Size and Query Depth
    7.2.2 Experimental Results: Knowledge Base Structure
    7.2.3 Experimental Results: Effect of Parallelism
    7.2.4 Real-Time Reasoning
  7.3 Experiments with WordNet
    7.3.1 Mapping WordNet to Shruti
    7.3.2 Experimental Results
  7.4 Empirical Validation of the Analysis
    7.4.1 Analysis Validation: Random Knowledge Bases
    7.4.2 Analysis Validation: WordNet

8 Recapitulation and Future Work
  8.1 Contributions
  8.2 Future Work
    8.2.1 Extensions to Shruti-cm5
    8.2.2 Applications of Shruti-cm5
    8.2.3 Integrated User Interface
    8.2.4 Exploring the Impact of Communication and Computation Costs
    8.2.5 Exploring Reflexive Reasoning
  8.3 Conclusion

A Knowledge Base Input Syntax

B SHRUTI-CM Commands

C Glossary of WordNet Terms

D SHRUTI on the CM-2
  D.1 Encoding the Knowledge Base
    D.1.1 Input Processing
    D.1.2 Representing Knowledge Base Elements
    D.1.3 Encoding Rules and Facts
  D.2 Spreading Activation and Inference
    D.2.1 The Rule-Base
    D.2.2 The Type Hierarchy
    D.2.3 Multiple Instantiation
    D.2.4 Statistics Collection
  D.3 Characteristics of Shruti-cm2
  D.4 Shruti-cm2 on the CM-5

Bibliography

List of Tables

5.1 Active message structure for rule-base activation propagation.
5.2 Active message structure for type hierarchy activation propagation.
7.1 Average speedup for various knowledge base sizes.
7.2 The lexical matrix. Adapted from Miller (1990).
7.3 An example lexical matrix.
7.4 Relations in WordNet.
7.5 Queries used in WordNet experiments.
7.6 Average response time for WordNet queries.
7.7 Comparison of wn and WordNet on the CM-5.

List of Figures

1.1 Building large-scale AI systems. Adapted from (Kitano 1993).
1.2 Shruti-cm5 query response time for random knowledge bases.
1.3 Shruti-cm5 query response time for WordNet.
2.1 (a) An example encoding of rules and facts. (b) Activation trace for the query can-sell(Mary,Book1)?.
2.2 Behavior of ρ-btu, τ-and and τ-or nodes in the reasoning system.
2.3 Encoding rules with (a) existentially quantified variables and constants in the consequent; (b) repeated variables in the consequent.
2.4 Encoding rules with multiple predicates in the antecedent.
2.5 (a) The rule-based reasoner interfaced with a type hierarchy. (b) Type hierarchy encoding for the relation is-a(A,B).
2.6 An overview of the multiple instantiation system.
3.1 Trade-off between hardware flexibility and performance per unit cost. Adapted from (Hammerstrom, 1994).
4.1 Algorithm for executing a logic program. Adapted from (de Kergommeaux, 1994).
5.1 The q-based processor allocation algorithm used in shruti-cm5.
5.2 C structures used to represent predicates in shruti-cm5.
5.3 C structures used to represent entities in the shruti-cm5 type hierarchy.
5.4 C structures used to encode rules and facts in shruti-cm5.
5.5 C structure used to encode is-a relations in shruti-cm5.
5.6 The main propagation loop used in spreading activation during an episode of reasoning.
5.7 C structure used for frontiers.
5.8 Activation propagation in the rule-base.
5.9 Handler function for activation propagation in the rule-base.
5.10 Fact matching in the rule-base.
5.11 Collector activation propagation in the rule-base.
5.12 Handler function for collector activation propagation in the rule-base.
5.13 Activation propagation in the type hierarchy.
5.14 Handler function for activation propagation in the type hierarchy.
5.15 Shruti-cm5 global statistics summary.
5.16 Shruti-cm5 knowledge base distribution report.
5.17 Shruti-cm5 processor load and active element count display.
5.18 Shruti-cm5 execution timing display.
5.19 An example knowledge base in shruti-cm5 syntax.
5.20 User interface to Shruti-cm5.
5.21 A comparison of shruti-cm2 and shruti-cm5. The timing curve for the CM-5 has been multiplied by 100.
6.1 General activation spreading mechanism in Shruti-cm5.
6.2 Trb versus qrb.
6.3 Tth versus qth.
6.4 Trb versus kc and qrb.
6.5 Trb versus N and qrb.
6.6 Trb versus bf and qrb.
6.7 Trb versus korb and qrb (kohrb = korb).
6.8 Trb versus D and qrb.
6.9 Tth versus kc and qth.
6.10 Tth versus N and qth.
6.11 Tth versus Bth and qth.
6.12 Tth versus koth and qth (kohth = koth).
6.13 Tth versus D and qth.
6.14 Trb versus qrb when kc ≫ korb (kc = 100).
6.15 Tth versus qth when kc ≫ koth (kc = 100).
7.1 Parameters for generating structured random knowledge base kb1.
7.2 Parameters for generating structured random knowledge base kb2.
7.3 Parameters for generating structured random knowledge base kb3.
7.4 Query depth vs. response time. Kb1 on a 32 PE CM-5.
7.5 Query depth vs. number of rules fired in a reasoning episode. Kb1 on a 32 PE CM-5.
7.6 Query depth vs. number of active messages exchanged in a reasoning episode. Kb1 on a 32 PE CM-5.
7.7 Query depth vs. number of active predicates in a reasoning episode. Kb1 on a 32 PE CM-5.
7.8 Query depth vs. time needed to fire a rule. Kb1 on a 32 PE CM-5.
7.9 Query depth vs. time needed to process an active predicate. Kb1 on a 32 PE CM-5.
7.10 Visualization of growth of domain activation.
7.11 Query depth vs. response time. Kb2 on a 32 PE CM-5.
7.12 Query depth vs. number of rules fired in a reasoning episode. Kb2 on a 32 PE CM-5.
7.13 Query depth vs. number of active messages exchanged in a reasoning episode. Kb2 on a 32 PE CM-5.
7.14 Query depth vs. number of active predicates in a reasoning episode. Kb2 on a 32 PE CM-5.
7.15 Query depth vs. time needed to fire a rule. Kb2 on a 32 PE CM-5.
7.16 Query depth vs. time needed to process an active predicate. Kb2 on a 32 PE CM-5.
7.17 Query depth vs. response time. Kb3 on a 32 PE CM-5.
7.18 Query depth vs. number of rules fired in a reasoning episode. Kb3 on a 32 PE CM-5.
7.19 Query depth vs. number of active messages exchanged in a reasoning episode. Kb3 on a 32 PE CM-5.
7.20 Query depth vs. number of active predicates in a reasoning episode. Kb3 on a 32 PE CM-5.
7.21 Query depth vs. time needed to fire a rule. Kb3 on a 32 PE CM-5.
7.22 Query depth vs. time needed to process an active predicate. Kb3 on a 32 PE CM-5.
7.23 Query depth vs. response time. Kb2 on a 64 PE CM-5.
7.24 Query depth vs. number of rules fired in a reasoning episode. Kb2 on a 64 PE CM-5.
7.25 Query depth vs. time needed to fire a rule. Kb2 on a 64 PE CM-5.
7.26 Query depth vs. response time. Kb2 on a 128 PE CM-5.
7.27 Query depth vs. number of rules fired in a reasoning episode. Kb2 on a 128 PE CM-5.
7.28 Query depth vs. time needed to fire a rule. Kb2 on a 128 PE CM-5.
7.29 Speedup curves for 64 PE and 128 PE CM-5.
7.30 Translating WordNet to shruti.
7.31 Predicted response time for kb1.
7.32 Predicted response time for kb2.
7.33 Predicted response time for kb3.
7.34 Response time for WordNet queries as qth changes. Queries 01-08.
7.35 Response time for WordNet queries as qth changes. Queries 09-16.
7.36 Response time for WordNet queries as qth changes. Queries 17-24.
7.37 Response time for WordNet queries as qth changes. Queries 25-32.
7.38 Response time for WordNet queries as qth changes. Queries 33-34.
7.39 WordNet knowledge base distribution for different values of qth.
7.40 Response time for WordNet query is-a hypernym (Entity,Sparrow)? for different values of qth.
7.41 Communication load for WordNet query is-a hypernym (Entity,Sparrow)? for different values of qth.
7.42 Computation load for WordNet query is-a hypernym (Entity,Sparrow)? for different values of qth.
A.1 An example showing knowledge base input syntax.
A.2 Formal grammar specification for knowledge base input syntax.
D.1 C structures used to represent predicates in shruti-cm2.
D.2 C structures used to encode rules and facts in shruti-cm2.
D.3 C structures used to represent entities in shruti-cm2.
D.4 C structures used to represent is-a relations in shruti-cm2.
D.5 Encoding single- and multiple-antecedent rules.
D.6 Shruti-cm2 running on a CM-2 with 4K processors.
D.7 Shruti-cm2 running on a CM-2 with 32K processors.
D.8 Shruti-cm2 running on a CM-5 with 64 processors, in SIMD mode.

Chapter 1

Introduction

In the past decade, there has been increasing consensus in the Artificial Intelligence (AI) community that massive parallelism is an essential component in the design and development of intelligent agents (Kitano, 1993; Shastri, 1988; Feldman et al., 1988; Waltz, 1990b; Kanal et al., 1994; Kitano et al., 1994). The need for massive parallelism is all the more evident when modeling human cognition, with its constraints on real-time response and reaction times (Feldman and Ballard, 1982; Shastri, 1991b; Newell, 1992). The introduction of a large number of massively parallel architectures, combined with the widespread availability of general purpose massively parallel machines, has made it feasible for researchers to experiment with parallel approaches to AI. The resurgence of interest in inherently parallel neural and connectionist models (Rumelhart and McClelland, 1986), and their application to high-level (or symbolic) processing (Barnden and Pollack, 1991), has significantly influenced the role of parallelism in AI. In addition to developing large-scale systems, it is hoped that the application of massive parallelism to AI will help us rethink our approach to modeling intelligence and provide pointers to surmount the scalability problem.

Knowledge representation and efficient reasoning is an important component of intelligent activity and is therefore crucial in developing large-scale AI systems. In designing real-world AI systems and modeling realistic cognitive processing, the need for very large knowledge bases capable of real-time reasoning and retrieval is all the more acute. For example, in trying to compile common-sense knowledge, Guha and Lenat (1990) claim that the knowledge base may contain several million elements. Thus, knowledge representation and reasoning is one aspect of an intelligent system which can greatly benefit from the use of massive parallelism and the ensuing speedup.

1.1 The Problem

This thesis explores the design, analysis and implementation of massively parallel knowledge representation and reasoning systems which can encode very large knowledge bases and respond to a class of queries in real-time. Real-time reasoning is defined to be reasoning that is fast enough to support real-time cognitive activity like language understanding, where reasoning episodes are expected to span a fraction of a second.[1]

[1] For example, we can understand written language at the rate of about 150-400 words per minute; i.e., we can understand a typical sentence in a second or two (Carpenter and Just, 1977).

1.1.1 The Need for Parallelism

Traditional AI systems have been aimed at modeling simple, toy domains in considerable detail. Though such small domains have been modeled successfully, the approaches have had problems with scalability: attempts at applying these techniques to real-world problems with large domains have proved futile. This traditional approach involves building systems with complex capabilities in a small domain and then scaling up to handle real-world problems. In contrast, an increasingly popular belief (Kitano, 1993) is that it would be more fruitful to develop efficient but simple, large-scale systems which can deal with very large domains; these systems can then be adapted to more complex tasks. Figure 1.1 illustrates the scenario.

[Figure 1.1: Building large-scale AI systems. Adapted from (Kitano 1993). The figure plots domain size against task complexity: toy systems occupy the small-domain corner, the scalability problem separates them from full intelligent systems, and simple, large-scale systems ("systems that pay off") occupy the large-domain region.]

The use of massive parallelism is essential when developing realistic (and hence large-scale) models of cognitive processes which respond in human-like time scales (Feldman and Ballard, 1982; Shastri, 1991b; Newell, 1992). Even systems which are not aimed at modeling cognition would benefit from massive parallelism. The use of massive parallelism in large-scale intelligent systems will reduce run times and make it more feasible to experiment with the system and craft it to the needs of particular applications. More importantly, the use of massive parallelism could guide us toward a different approach to designing intelligent systems and provide insights which have escaped traditional approaches. The success of connectionist networks over traditional AI models in solving some problems is a case in point.

Early proponents of parallelism in AI, in the form of marker passing and spreading activation, include Quillian (1968), Fahlman (1979), Charniak (1983), Wilensky (1983), Hillis (1985), Waltz (1985), and Shastri (1988). In recent years, massive parallelism has been exploited in a variety of AI and related systems (Waltz, 1990a; Kitano et al., 1991; Kanal et al., 1994; Kitano et al., 1994) including connectionist networks (Shastri and Ajjanagadde, 1993; Mani and Shastri, 1993; Lange and Dyer, 1989; Sun, 1991; Barnden and Pollack, 1991), knowledge representation (Evett, 1994; Evett et al., 1993; Mani and Shastri, 1994; Geller, 1994; Moldovan, 1993; Moldovan et al., 1992; Moldovan, 1989; Wolfson and Ozeri, 1993), logic programming (de Kergommeaux and Codognet, 1994; Kurfess, 1991; Fronhofer and Wrightson, 1992; Conery, 1987; Tick, 1991) and database systems (Valduriez, 1993a; America, 1991; Mohan et al., 1994; Dewan et al., 1994; Zhang et al., 1995; Lee and Sheu, 1994). It should be noted that, with a few exceptions, most of these systems have not been implemented on parallel architectures. Chapter 4 provides a more detailed description of these systems and their relation to this thesis.

1.1.2 Tractable Reasoning

One of the goals of this thesis is to develop large-scale real-time knowledge representation and reasoning systems. There are a variety of knowledge representation formalisms, including first-order logic and Horn clause logic. It would be impossible to attain real-time performance, even with massive parallelism, if the knowledge representation and reasoning framework used is intractable or otherwise expensive in terms of resource usage. What we need is a limited knowledge representation and reasoning system where representational and inferential capabilities are constrained in order to achieve tractability. At the same time, we would like to retain sufficient expressive power to capture a broad class of reasoning in intelligent systems.

Logic programming (Kowalski, 1979) is one elegant formalism for knowledge representation and reasoning. Of the various logic programming languages, Prolog (Sterling and Shapiro, 1986) is the most common and is widely used in the AI community. Logic programming languages have good expressive power and well-defined semantics (Lloyd, 1987), and there has been extensive work on parallelizing logic programs (de Kergommeaux and Codognet, 1994; Kurfess, 1991; Fronhofer and Wrightson, 1992). In spite of all these advantages, inference in logic programs is expensive and inefficient with very large knowledge bases, and is therefore unsuitable for real-time reasoning systems. Parallel database systems (Valduriez, 1993a; America, 1991; Mohan et al., 1994) are also unsuitable, since they specialize in data retrieval and are not capable of reasoning. Deductive database systems like Datalog do support inferences on data, but are based on the principles of logic programming and therefore suffer from the same inefficiencies as logic programming systems. Again, these issues are discussed in Chapter 4.

1.1.3 Problem Definition

This thesis is concerned with the design and implementation of massively parallel real-time knowledge representation and reasoning systems. We map shruti, a structured connectionist knowledge representation and reasoning system, onto general purpose massively parallel architectures with the objective of designing and developing practical, real-time reasoning systems. We also mathematically analyze the resulting system to better understand its characteristics and optimize performance.

1.1.4 Our Approach

Shruti is a structured connectionist knowledge representation and reasoning system which attempts to model reflexive (i.e., effortless and spontaneous) reasoning (Shastri and Ajjanagadde, 1993; Mani and Shastri, 1993). The choice of shruti as our knowledge representation formalism is prompted by several considerations:

- Shruti is an efficient, limited inference system which imposes psychologically and biologically motivated constraints in order to make reasoning tractable.
- Unlike most work on tractable reasoning, which is generally theoretical (for example, Weigert and Tsai, 1994; McAllester, 1993), shruti is a tractable reasoning system amenable to efficient implementation.
- Shruti is a structured connectionist system; it is therefore inherently parallel and can represent structured knowledge.
- Shruti attempts to model common-sense reflexive reasoning and provides an opportunity to explore its cognitive aspects.

We develop shruti-based massively parallel systems that can encode very large knowledge bases and perform a class of reasoning in real-time. Such systems, which we term parallel reflexive reasoning systems, exploit both the parallelism of the underlying hardware and the biologically and psychologically motivated constraints imposed by shruti to achieve real-time performance. In developing parallel reflexive reasoning systems, we map shruti onto existing general purpose massively parallel machines. The use of general purpose parallel machines has several advantages, including widespread availability, accessibility, portability to different platforms, and a good price/performance ratio. For reasons outlined in Chapter 3, we refrain from developing special purpose hardware for shruti.

[Figure 1.2: Shruti-cm5 query response time for random knowledge bases. The plot shows query depth (0 to 10) versus response time (up to 0.6 sec) for knowledge base kb1 at five sizes, on a 32 PE CM-5 (SPMD version 07.5, statistics enabled). An inset table gives average response times for shallow queries:

    KB Size    Depth = 0 (Retrieval)    Depth = 1
    110036     0.8 msec                  1.4 msec
    219879     1.3 msec                  2.5 msec
    329871     1.9 msec                  4.5 msec
    440178     2.8 msec                  9.2 msec
    550041     3.4 msec                 15.7 msec]

[Figure 1.3: Shruti-cm5 query response time for WordNet, compared with the serial wn browser:

    Query                              Shruti-cm5    Serial WordNet query    Serial time    Speedup
    is-a hypernym (Bird,x)?            0.029 sec     wn bird -treen          13.27 sec      457
    is-a hyponym (Sparrow,x)?          0.007 sec     wn sparrow -hypen       0.046 sec      6.5
    is-a synonym (Bird,x)?             0.004 sec     wn bird -synsn          0.023 sec      5.75
    is-a hypernym (Entity,Sparrow)?    0.275 sec     --                      --             --]

1.2 Contributions

In designing real-time reasoning systems by mapping structured connectionist networks (shruti in particular) onto massively parallel machines, we develop parallel reflexive reasoning systems on the Connection Machines CM-2 and CM-5. Shruti-cm5, the SPMD asynchronous message passing CM-5 implementation, achieves very good performance by using active messages for efficient, low-latency interprocessor communication. Shruti-cm5 encodes large artificial knowledge bases with over half a million (randomly generated) rules and facts, and responds to a range of queries requiring derivation depths of up to eight in well under a second. Shruti-cm5 running WordNet, a real-world lexical database, responds to queries in times ranging from a few to a few hundred milliseconds. Figures 1.2 and 1.3 quantify the performance of shruti-cm5; experimental results are presented in Chapter 7.

From an analytic stance, the thesis discusses issues involved in the design and implementation of parallel reflexive reasoning systems, from both machine dependent and machine independent points of view. With a view toward exploiting machine characteristics and knowledge base structure, a quantitative analysis relates average query response time to machine and knowledge base parameters. The analysis optimizes performance with respect to the probability of finding related knowledge elements on the same processor, thereby leading to an optimal mapping of the knowledge base onto the underlying parallel machine. The analysis also makes interesting predictions about response time and its relation to computation and communication costs. One such prediction, confirmed by experimental data, indicates that when communication cost is either small or comparable to the computational cost of processing a knowledge base element, average response time is close to optimal when knowledge base elements are mapped to random processors; i.e., load balancing is more critical than locality. Such results help engineer performance improvements and steer the course of future research.

To sum up, the contributions of this thesis are two-fold:

- Contributions of practical significance. We design and implement a parallel reflexive reasoning system on the CM-5. This system is capable of real-time knowledge representation and reasoning with very large knowledge bases. We therefore develop viable technology for supporting large-scale knowledge base systems. From a cognitive standpoint, the parallel reflexive reasoning system facilitates exploration of common-sense reflexive reasoning.
- Analytical contributions. Our quantitative analysis, in addition to optimizing system performance, provides some new insights into mapping structured connectionist networks onto distributed memory massively parallel machines. Results from this analysis can be extended to the realm of mapping general graph structures onto parallel machines.

It is worth noting that the major contribution of this thesis is the design, analysis and implementation of practical, large-scale, real-time knowledge representation and reasoning systems. The fact that the resulting system can be used for cognitive exploration of reflexive reasoning is a byproduct of using shruti as our knowledge representation framework.

1.3 An Overview of Shruti-cm5

Shruti-cm5, the parallel reflexive reasoning system on the Connection Machine CM-5, supports the encoding of very large knowledge bases and their use for real-time inference and retrieval. The knowledge base, consisting of rules and facts, is read and encoded by the processors on the CM-5. The knowledge base is mapped onto the CM-5 at a relatively coarse level of granularity, whereby individual knowledge elements (like predicates, rules, facts, etc.) are treated as primitives. Queries can then be posed to the system, and inference proceeds via spreading activation. Efficient, low-latency interprocessor communication is effected using active messages.

Shruti-cm5 includes the following suite of programs and tools, which support knowledge representation and reasoning on the CM-5 along with pre- and post-processing on serial workstations (a schematic sketch of the core reasoning loop appears at the end of this section):

- A parser for accepting knowledge base items expressed in a human-readable input language. The language's syntax is similar to that of first-order logic (see Appendix A).
- A preprocessor for mapping a knowledge base onto the CM-5 processors. This involves mapping the knowledge base to an inferential dependency network whose structure is analogous to that of shruti, and partitioning this network among the processors of the parallel machine.
- A reasoning algorithm for answering queries. This runs on the CM-5 and efficiently performs shruti-like reasoning via spreading activation.
- Procedures for collecting a number of statistics about the knowledge base and the reasoning process. These include the distribution of knowledge base items among processors, the processor load and message traffic during query answering, and a count of knowledge base items of each type (rules, facts, concepts, etc.) activated during processing.
- Several tools for analyzing and visualizing the knowledge base and the statistics gathered during query answering.
- A utility for generating large pseudo-random knowledge bases given a specification of broad structural constraints. Examples of such constraints are: the number of knowledge base items of each type, any subdivision of the knowledge base into domains, the ratio of inter- and intra-domain rules, and the depth of the type hierarchy.

This collection of programs and tools facilitates automatic loading of large knowledge bases, incremental addition of items to an existing knowledge base, posing queries, recording answers, and off-line visualization and analysis of system behavior. It also allows a user to construct large artificial knowledge bases for experimentation. The system is interactive, allowing the user to load and browse knowledge bases and process queries by issuing commands at a prompt. It is also possible to process command files and use the system in unattended batch processing mode.
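To make the relationship between the reasoning algorithm and active messages concrete, the C fragment below sketches the shape of the per-processor activation propagation loop. It is a simplification for exposition, not the shruti-cm5 source: the frontier routines and the am_send/am_poll primitives are hypothetical stand-ins for the system's actual data structures and for the CM-5 active message layer, and termination detection is reduced to a single abstract test.

/* Hypothetical skeleton of the per-PE spreading activation loop. */
typedef struct { int pe; long id; } succ_ref_t;          /* (processor, element) */
typedef struct { int nsucc; succ_ref_t *succ; } elem_t;  /* rule, fact, etc.     */

extern int     MY_PE;
extern int     frontier_empty(void);
extern elem_t *frontier_pop(void);
extern void    frontier_push(elem_t *e);
extern elem_t *local_elem(long id);
extern void    am_send(int pe, void (*h)(long), long arg); /* active message    */
extern void    am_poll(void);               /* service incoming messages        */
extern int     all_pes_quiescent(void);     /* global termination test          */

/* Handler run on the receiving PE by an incoming active message:
 * schedule the named local knowledge element for processing. */
void handler_activate(long elem_id)
{
    frontier_push(local_elem(elem_id));
}

/* Drain the local frontier, activating successor elements; successors
 * that live on another PE are reached with a small active message. */
void propagation_loop(void)
{
    while (!all_pes_quiescent()) {
        while (!frontier_empty()) {
            elem_t *e = frontier_pop();
            for (int i = 0; i < e->nsucc; i++) {
                succ_ref_t s = e->succ[i];
                if (s.pe == MY_PE)
                    frontier_push(local_elem(s.id));
                else
                    am_send(s.pe, handler_activate, s.id);
            }
        }
        am_poll();
    }
}

The point of the active message style is that each remote activation carries only a handler and a small argument, so communication latency, rather than message volume, dominates; Chapter 5 describes the actual message structures and handlers.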

1.4 Thesis Organization

The remainder of the thesis is organized into seven chapters and four appendixes:

Chapter 2 introduces the notion of structured connectionism and describes shruti, our representative structured connectionist knowledge representation and reasoning system.

Chapter 3 considers the mapping of structured connectionist models onto massively parallel machines. This chapter also provides an abstract framework for mapping shruti onto massively parallel machines, and lays out some of the design choices.

Chapter 4 reviews related work, with emphasis on parallel real-time knowledge representation and mapping structured connectionist models onto parallel machines.

Chapter 5 is a detailed description of shruti-cm5. The design, characteristics and implementation of shruti-cm5 are discussed.

Chapter 6 mathematically analyzes the mapping of shruti onto parallel machines. The analysis yields optimal mapping criteria, the implications of which are explored.

Chapter 7 presents results from our experiments with very large knowledge bases on shruti-cm5. We consider randomly generated knowledge bases and WordNet, a real-world lexical database.

Chapter 8 summarizes the thesis and its contributions, and provides pointers for future work.

Appendix A provides a formal description of the knowledge base input syntax used by shruti-cm5.

Appendix B describes the shruti-cm5 commands for encoding knowledge bases, running reasoning episodes, and statistics collection.

Appendix C presents a brief glossary of WordNet terms.

Appendix D describes shruti on the CM-2. This investigation was not pursued further since results from the CM-5 were much more promising.


Chapter 2

Structured Connectionism and Knowledge Representation

This chapter starts out by explicating what we mean by a structured connectionist network. In doing so, we circumscribe the class of networks which constitute the domain of the mapping algorithms considered in Chapters 3 and 6. We then describe shruti, the connectionist knowledge representation and reasoning system we use for developing parallel reflexive reasoning systems.

2.1 The Essence of Connectionism

Connectionist models, also referred to as parallel distributed processing (PDP) models or artificial neural networks, are neurally motivated, massively parallel models of computation. These models consist of a large number of simple computing elements called nodes, profusely interconnected via weighted links. The nodes have a small amount of local memory and can compute simple functions of their inputs. The outputs of the nodes are communicated to other nodes via links. Learning is an attractive and important aspect of connectionist models; one of several learning algorithms may be used to train connectionist networks to recognize patterns in data. We recognize the following core features of connectionist models (Feldman and Ballard, 1982; Shastri, 1991a):

- Massive parallelism. Connectionist networks are made up of a very large number of nodes. Each node has a limited amount of local memory and can process input independent of other nodes, leading to embarrassingly parallel operation.

- Weighted links with a high degree of connectivity. Nodes interact and exchange information using weighted links. The degree of connectivity between nodes can vary from near complete connectivity to relatively sparse connectivity. For example, Hopfield networks and feedforward networks have a high degree of connectivity, while structured connectionist networks are sparsely connected (see Section 2.2).

- Scalar messages with no internal structure. The messages which nodes exchange are very simple. They are limited-precision scalars with no internal structure.

- Nodes which compute a simple function of the input. Nodes compute only simple functions of their input; they are not expected to have arbitrary computing power. Most often, nodes just compute weighted sums of their inputs (the sketch following this list makes the node model concrete).

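As an illustration of how computationally simple such a node is, the following C fragment sketches a typical node update: a weighted sum of the inputs passed through a threshold. This is a minimal sketch for exposition, not code from any of the systems described in this thesis; the structure layout and function names are hypothetical.

#include <stddef.h>

/* A hypothetical connectionist node: a little local state, incoming
 * weighted links, and a simple activation function. */
typedef struct node {
    double        output;     /* activation from the last update */
    double        threshold;  /* firing threshold                */
    size_t        fanin;      /* number of incoming links        */
    struct node **inputs;     /* nodes feeding this node         */
    double       *weights;    /* one weight per incoming link    */
} node_t;

/* One update step: compute a weighted sum of the inputs and fire
 * (output 1.0) if the sum reaches the threshold. */
double node_update(node_t *n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n->fanin; i++)
        sum += n->weights[i] * n->inputs[i]->output;
    n->output = (sum >= n->threshold) ? 1.0 : 0.0;
    return n->output;
}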
2.2 Structured Connectionism

Structured connectionist networks are connectionist models where the network structure is prespecified to suit a particular task (Feldman et al., 1988). Some examples of structured connectionist models are: connectionist knowledge representation systems like shruti (Shastri and Ajjanagadde, 1993; Mani and Shastri, 1993), csn (Shastri, 1988) and robin (Lange and Dyer, 1989); Terry Regier's (1992) system which learns the lexical semantics of spatial terms; Nigel Goddard's (1992) architecture for recognizing moving light displays; systems for visual motion understanding (Olson, 1989), visual attention, and visual search (Ahmad, 1991); and adjective-noun interpretation (Webber, 1989).

In comparison with traditional neural network models like multilayer back-propagation networks, Hopfield nets, Boltzmann machines, etc. (Rumelhart and McClelland, 1986; Anderson and Rosenfeld, 1989), structured connectionist networks are sparsely connected. Densely connected architectures result in networks with complete, or near-complete, connectivity among nodes in the entire network, or between adjacent layers in the case of a layered network. In such networks, a significant percentage of the nodes and links in the network are affected during every update cycle. Sparsely connected architectures, on the other hand, have connectivity patterns where each node is connected to a relatively small percentage of nodes in the network. If the network is very large, each node may be connected to a large number of other nodes, but these still constitute only a small percentage of the nodes in the entire network. As activation spreads through a sparse network, only a small fraction of the total number of nodes and links in the network participate in any update step.

One of the goals of this thesis is to develop parallel reflexive reasoning systems by mapping structured connectionist models onto parallel machines. The differences in the inherent characteristics of dense and sparsely connected networks necessitate the use of different techniques for efficient mapping onto parallel machines. In designing our mapping algorithms, we exploit the fact that only a small fraction of nodes in a structured connectionist network change state during each update cycle (the sketch below illustrates the resulting implementation strategy). We discuss this and other issues that form the basis of this work in Section 3.1.3.
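The sparseness property translates directly into an implementation strategy: rather than updating every node on every cycle, maintain an explicit frontier of currently active nodes and touch only their successors, so that per-cycle cost tracks the number of active nodes rather than the size of the network. The C fragment below is a minimal serial sketch of this idea under a hypothetical node representation (the node_update function is the weighted-sum update of the previous sketch); the actual shruti-cm5 machinery distributes such frontiers across processors (Chapter 5).

#include <stddef.h>

/* Hypothetical node with both incoming and outgoing links. */
typedef struct node {
    double        output, threshold;
    size_t        fanin,  fanout;
    struct node **inputs, **outputs;
    double       *weights;
} node_t;

extern double node_update(node_t *n);  /* weighted-sum update, as above */

/* One activation cycle over a sparse network: process only the
 * successors of currently active nodes; nodes whose state changes
 * form the next frontier. */
size_t propagate_one_cycle(node_t **frontier, size_t nfrontier,
                           node_t **next, size_t max_next)
{
    size_t nnext = 0;
    for (size_t i = 0; i < nfrontier; i++) {
        node_t *src = frontier[i];
        for (size_t j = 0; j < src->fanout; j++) {
            node_t *dst = src->outputs[j];
            double before = dst->output;
            if (node_update(dst) != before && nnext < max_next)
                next[nnext++] = dst;  /* changed state: expand next cycle */
        }
    }
    return nnext;  /* per-cycle work is O(active nodes), not O(network) */
}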

2.3 Shruti: A Structured Connectionist Reasoning System

We now describe the basic shruti knowledge representation and reasoning system (Shastri and Ajjanagadde, 1993) along with a brief look at the extensions proposed in (Mani and Shastri, 1993). Before we embark on a description of shruti, we shall characterize the notion of reflexive reasoning (Shastri, 1990).

2.3.1 Reflexive Reasoning

Humans can draw a restricted class of inferences, what we might call "common-sense reasoning", with remarkable efficiency. Such efficient, effortless and spontaneous reasoning which does not require conscious thought is what we refer to as reflexive reasoning. This is in contrast to reflective reasoning, which requires conscious deliberation on the part of the agent. Since perception (language comprehension, visual scene understanding, etc.) is reflexive and happens in a fraction of a second, we expect episodes of reflexive reasoning to span a few hundred milliseconds. Furthermore, since our ability to perform reflexive reasoning does not deteriorate as we learn more about the world, the time required to perform reflexive reasoning must be independent of the size of the knowledge base. We also assume that the size (or space complexity) of a reflexive reasoning system is at most linear in the size of the knowledge base. Since a general purpose reasoning system is computationally intractable, we expect a reflexive reasoning system to impose constraints and limitations on the contents of the knowledge base and on the nature of inferences that can be handled efficiently. Shruti is an attempt at building a reflexive reasoning system; Section 2.3.12 provides a formal definition of reflexive reasoning.

Figure 2.1: (a) An example encoding of rules and facts. (b) Activation trace for the query can-sell(Mary,Book1)?. (Only the caption is reproduced here; the original figure shows the network wiring over nodes such as e:give, c:own, F1, p-seller and cs-obj, and the firing of these nodes over ten periods of oscillation.)

2.3.2 The Rule-Based Reasoning System: An Overview

Shruti, a connectionist reasoning system that can represent systematic knowledge involving n-ary predicates and variables, has been proposed by Shastri and Ajjanagadde (Shastri and Ajjanagadde, 1993; Ajjanagadde and Shastri, 1991). Shruti can perform a broad class of reasoning with extreme efficiency. In principle, the time taken by the reasoning system to draw an inference is proportional only to the length of the chain of inference and is independent of the number of rules and facts encoded by the system. The reasoning system maintains and propagates variable bindings using temporally synchronous (i.e., in-phase) firing of appropriate nodes. This allows the system to maintain and propagate a large number of variable bindings simultaneously, as long as the number of distinct entities participating in the bindings during any given episode of reasoning remains bounded. Reasoning in the system is the transient but systematic flow of rhythmic patterns of activation, where each phase in the rhythmic pattern corresponds to a distinct entity involved in the reasoning process, and where variable bindings are represented as the synchronous firing of appropriate role and filler nodes. A fact behaves as a temporal pattern matcher that becomes `active' when it detects that the bindings corresponding to the fact are present in the system's pattern of activity. Rules are interconnection patterns that propagate and transform rhythmic patterns of activity. Shruti attempts to model reflexive reasoning over a large body of knowledge. Shruti has been extended in (Mani and Shastri, 1993) to effectively reason with a more flexible set of rules and facts, thereby enhancing the system's ability to model common-sense reflexive reasoning.
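The binding mechanism can be pictured discretely: if each period of oscillation is divided into a small number of phases, a variable binding amounts to a role node and a filler node carrying the same phase index. The C fragment below is our own illustrative rendering of that idea, not code from the thesis.

    /* Discretized temporal synchrony: a binding is simply a role node
       and a filler (entity) node carrying the same phase tag. */
    #define NO_PHASE (-1)

    typedef struct { const char *name; int phase; } TNode;

    int bound(const TNode *role, const TNode *filler)
    {
        return role->phase != NO_PHASE && role->phase == filler->phase;
    }

In Figure 2.1b, Mary and p-seller would share one phase tag and Book1 and cs-obj another, encoding the two bindings of the query.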

2.3.3 Terminology

We clarify some terminology before proceeding with a description of knowledge representation and reasoning in shruti. A predicate is a relation. For example, give(x,y,z) is a predicate which represents the three-place relation: x gives z to y. Here x, y and z constitute the arguments or roles of the give predicate. A fact is a partially or completely instantiated predicate, like give(John,Mary,Book1). Entities which are bound to predicate arguments are fillers. A rule specifies the systematic correspondence between predicate arguments. The rule ∀x,y,z [ give(x,y,z) ⇒ own(y,z) ] states that "if x gives z to y, then y owns z". A query is a fact whose truth we would like to ascertain. The term entity or concept is used to collectively refer to types (or categories) and instances (or individuals, sometimes referred to as constants). An is-a relation or is-a fact captures the superconcept-subconcept relation between types, and the instance-of relation between types and instances. Predicates, along with associated rules and facts, constitute the rule-base, while concepts and their associated is-a relations constitute the type hierarchy. Predicates, concepts, facts, rules and is-a relations together constitute knowledge base elements.
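To make the terminology concrete, here is one plausible set of C data structures for these knowledge-level elements. The field names are ours and are not taken from the shruti-cm5 implementation; for brevity, the rule type holds a single antecedent (see Section 2.3.7 for conjunctive antecedents).

    /* Knowledge-level elements of a shruti knowledge base. */
    #define MAX_ARITY 8

    typedef struct { const char *name; int arity; } Predicate;

    typedef struct {                /* e.g., give(John,Mary,Book1)      */
        const Predicate *pred;
        const char *fillers[MAX_ARITY];  /* NULL = unbound argument     */
    } Fact;

    typedef struct {                /* antecedent(...) => consequent(...) */
        const Predicate *antecedent, *consequent;
        int arg_map[MAX_ARITY];     /* consequent arg i -> antecedent arg */
    } Rule;

    typedef struct { const char *sub, *super; } IsA;  /* is-a(sub,super) */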

2.3.4 Knowledge Encoding for Backward Reasoning

We describe the reasoning system using an example. Figure 2.1a illustrates how long-term knowledge is encoded in the rule-based reasoning system. The network encodes the following rules and facts:

∀x,y,z [ give(x,y,z) ⇒ own(y,z) ], ∀x,y [ buy(x,y) ⇒ own(x,y) ], ∀x,y [ own(x,y) ⇒ can-sell(x,y) ],

give(John,Mary,Book1), buy(John,x), and own(Mary,Ball1).

The rule ∀x,y,z [ give(x,y,z) ⇒ own(y,z) ] states that `if x gives z to y, then y owns z'. The other two rules are interpreted similarly. The facts give(John,Mary,Book1) and own(Mary,Ball1) represent `John gave Mary Book1' and `Mary owns Ball1' respectively, while buy(John,x) states that `John bought something'. Rule and fact encoding makes use of several types of nodes (see Figure 2.2): ρ-btu nodes (depicted as circles), τ-and nodes (depicted as pentagons) and τ-or nodes (depicted as triangles). These nodes have the following idealized behavior: On receiving a spike train, a ρ-btu node produces a spike train that is synchronous (i.e., in-phase) with the driving input. We assume that ρ-btu nodes can respond in this manner as long as the inter-spike distance, π, lies in the interval [π_min, π_max]. Here π_min and π_max are the minimum and maximum inter-spike gaps (i.e., the minimum and maximum periods of oscillation) for which the system can sustain synchronous activity (Shastri and Ajjanagadde, 1993). A τ-and node behaves like a temporal and node, and becomes active on receiving an uninterrupted pulse train. On becoming active, a τ-and node produces an output pulse train comparable to the input pulse train. A τ-or node, on the other hand, becomes active on receiving any activation; its output is a pulse whose width and period equal π_max. Figure 2.2 summarizes node behavior. The maximum number of distinct entities that may participate in an episode of reasoning is bounded by ⌊π_max/ω⌋, where π_max is the maximum period of oscillation. We define ω to be the width of the window of synchronization: nodes firing with a lag or lead of less than ω/2 are considered to be in synchrony. In a reasoning episode, these distinct entities may occur in any number of argument bindings.

Figure 2.2: Behavior of ρ-btu, τ-and and τ-or nodes in the reasoning system. (A ρ-btu node can fire with any period π in the interval [π_min, π_max]; a τ-and node fires with a period matching its input; a τ-or node always fires with period π_max.)

The encoding also makes use of inhibitory modifiers, links that impinge upon and inhibit other links. A pulse propagating along an inhibitory modifier will block a pulse propagating along the link it impinges upon. In Figure 2.1a, inhibitory modifiers are shown as links ending in solid circles. Each entity in the domain is encoded by a ρ-btu node. An n-ary predicate P is encoded by a pair of τ-and nodes and n ρ-btu nodes, one for each of the n arguments. One of the τ-and nodes is referred to as the enabler, e:P, and the other as the collector, c:P. In Figure 2.1a, enablers point upward while collectors point downward. The enabler e:P becomes active whenever the system is being queried about P. On the other hand, the system activates the collector c:P of a predicate P whenever the system wants to assert that the current dynamic bindings of the arguments of P match, or follow from, the knowledge encoded in the system. A rule is encoded by connecting (i) the collector of the antecedent predicate to the collector of the consequent predicate, (ii) the enabler of the consequent predicate to the enabler of the antecedent predicate, and (iii) the argument nodes of the consequent predicate to the argument nodes of the antecedent predicate in accordance with the correspondence between these arguments specified in the rule. A fact is encoded using a τ-and node that receives an input from the enabler of the associated predicate. This input is modified by inhibitory modifiers from the argument nodes of the associated predicate. If an argument is bound to an entity in the fact, then the modifier from such an argument node is in turn modified by an inhibitory modifier from the appropriate entity node. The output of the τ-and node is connected to the collector of the associated predicate.
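A fact node thus fires only when its enabler is active and every argument it binds is firing in phase with the corresponding filler entity. The C sketch below is a procedural approximation of this temporal pattern matching, using the discrete phase tags introduced earlier; all names are our own.

    /* A fact node as a temporal pattern matcher, rendered procedurally. */
    #define NO_PHASE (-1)

    int fact_matches(const int *arg_phase,    /* phase of each argument node */
                     const int *filler,       /* entity bound by the fact at
                                                 each argument, -1 if open   */
                     const int *entity_phase, /* current phase of each entity */
                     int arity, int enabler_active)
    {
        if (!enabler_active) return 0;        /* predicate not under query   */
        for (int i = 0; i < arity; i++) {
            if (filler[i] < 0) continue;      /* fact leaves argument open   */
            if (arg_phase[i] != NO_PHASE &&
                arg_phase[i] != entity_phase[filler[i]])
                return 0;                     /* inhibitory modifier blocks  */
        }
        return 1;                             /* activation reaches c:P      */
    }

For the internally generated query own(Mary,Book1)?, the fact own(Mary,Ball1) fails this test, since Ball1 is not firing in the phase carried by o-obj.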

2.3.5 The Inference Process

Posing a query[1] to the system involves specifying the query predicate and its argument bindings. This is done by activating the relevant nodes in the manner described below. Let us choose an arbitrary point in time, say t0, as our point of reference for initiating the query. We assume that the system is in a quiescent state just prior to t0. The query predicate is specified by activating the enabler of the respective predicate with a pulse train of periodicity π starting at t0. The argument bindings specified in the query are communicated to the network as follows: Let

[1] We consider only yes-no queries. Shruti can also handle wh-queries. See (Shastri and Ajjanagadde, 1993) for a detailed discussion.


the argument bindings in the query involve n distinct entities: c1, ..., cn. With each ci, associate a delay δi such that no two delays are within ω of one another and the longest delay is less than π − ω. As mentioned earlier, ω is the width of the window of synchrony and π is the period of oscillation. Each of these delays may be viewed as a distinct phase within the period t0 to t0 + π. Now the argument bindings of an entity ci are indicated to the system by providing an oscillatory spike train of periodicity π, starting at t0 + δi, to ci and all arguments to which ci is bound. Thus, a distinct phase (i.e., a non-overlapping time interval within a period of oscillation) is associated with each distinct entity in the query, and argument bindings are represented by the synchronous activation of the appropriate entity and argument nodes. We illustrate the reasoning process with the help of an example. Consider the query can-sell(Mary,Book1)? (i.e., `Can Mary sell Book1?'). The query is posed by (i) activating the enabler e:can-sell; (ii) activating Mary and p-seller in the same phase (say, phase-1); and (iii) activating Book1 and cs-obj in some other phase (say, phase-2). As a result of these inputs, Mary and p-seller fire synchronously in phase-1 of every period of oscillation, while Book1 and cs-obj fire synchronously in phase-2. See Figure 2.1b. The node e:can-sell will also oscillate and generate a pulse train of periodicity π. The activations from the arguments p-seller and cs-obj reach the arguments owner and o-obj of the own predicate, and consequently, starting with the second period of oscillation, owner and o-obj become active in phase-1 and phase-2, respectively. At the same time, the activation from e:can-sell activates e:own. The system has essentially created dynamic bindings for the arguments of predicate own: Mary has been bound to the argument owner, and Book1 has been bound to the argument o-obj. These newly created bindings, in conjunction with the activation of e:own, encode the internally generated query own(Mary,Book1)? (i.e., `Does Mary own Book1?'). The fact node associated with own(Mary,Ball1) does not match the query and remains inactive. The activations from owner and o-obj reach the arguments recip and g-obj of give, and buyer and b-obj of buy, respectively. Thus, beginning with the third period of oscillation, arguments recip and buyer become active in phase-1, while arguments g-obj and b-obj become active in phase-2. In essence, the system has created new bindings for the predicates give and buy that can be thought of as encoding two new queries: give(x,Mary,Book1)? (i.e., `Did someone give Mary Book1?') and buy(Mary,Book1)?. The fact node F1 associated with the fact give(John,Mary,Book1) becomes active as a result of the uninterrupted activation from e:give; the inhibitory inputs from recip and g-obj are blocked by the in-phase inputs from Mary and Book1, respectively. The activation from F1 causes c:give to become active. The output from c:give in turn causes c:own to become active and transmit an output to c:can-sell. Consequently, c:can-sell, the collector of the query predicate can-sell, becomes active (refer to Figure 2.1b), resulting in an affirmative answer to the query can-sell(Mary,Book1)?. The system produces a yes answer in time t = 2π(l + 1), where π is the period of oscillation and l is the length of the shortest derivation of the query.
If the collector of the query predicate does not receive any activation within time 2π(d + 1), where d is the maximum possible derivation depth, then the query does not have a yes answer.[2]

[2] This can be viewed as a no answer if we subscribe to the closed-world assumption.
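As a rough concrete check (our own numbers, using the biologically plausible period from Section 2.3.11): assuming the shortest derivation of the example query has length l = 2 (one application of own ⇒ can-sell and one of give ⇒ own, terminating in the give fact), and taking π ≈ 35 ms,

    t = 2π(l + 1) = 2 × 35 ms × (2 + 1) = 210 ms,

which is consistent with the expectation that an episode of reflexive reasoning spans a few hundred milliseconds.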

2.3.6 Rules with Constants and Repeated Variables in the Consequent

Encoding rules containing constants and/or existentially quantified variables in the consequent requires extra consideration. Figure 2.3a shows the encoding of the rule ∀x1,x2,y [ P(x1,x2) ⇒ ∃z Q(x1,x2,y,z,a) ]. In this rule, a is a constant. In encoding this rule, we use a τ-or node (node g1 in Figure 2.3a) whose behavior is described in Figure 2.2. Node g1 projects inhibitory modifiers that can block the firing of the rule. This node ensures that the rule participates in an inference only if all the conditions implicit in the consequent of the rule are met.

Figure 2.3: Encoding rules with (a) existentially quantified variables and constants in the consequent; (b) repeated variables in the consequent.

The first condition is to prevent a rule from firing if an existentially quantified variable in the consequent of the rule gets bound in a reasoning episode. In the figure, this behavior is achieved by the link from the existentially quantified (fourth) argument of Q to g1, and the inhibitory modifiers emanating from g1. The second condition relates to constants in the consequent. A rule with a constant in its consequent cannot be used in an inference if the corresponding argument gets bound to any other constant. Thus, in Figure 2.3a, the g1 node, in conjunction with the link from the fifth argument of Q and the inhibitory modifier from a, ensures that if this argument is bound to any constant other than a, the firing of the rule will be blocked. Figure 2.3b shows the encoding of the rule ∀x [ P(x) ⇒ Q(x,x,a) ], with repeated variables in the consequent. If the same variable occurs in multiple argument positions in the consequent of a rule, then this variable must either remain unbound or get bound to the same entity in all the argument positions. In other words, unless all the arguments bound to the same variable are in synchrony, the firing of the rule should be inhibited. In the figure, this is achieved by using node g2, a variant of a τ-or node which becomes active if it receives input in more than one phase. Some examples of rules with existentially quantified and repeated variables in the consequent are:

∀x,y,z [ move(x,y,z) ⇒ ∃t present(x,z,t) ],
∀x,y,z [ move(x,y,z) ⇒ ∃t present(x,y,t) ], and
∀x [ commit-suicide(x) ⇒ kill(x,x) ].

The first two rules, which state that if x moves from y to z then there is a time when x is at y, and a time when x is at z, contain an existentially quantified variable t in the consequent. The last rule has repeated variables in the consequent and states that if someone commits suicide, then that person kills himself.
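Procedurally, the g2 check amounts to verifying that all consequent argument positions sharing a variable carry the same phase. A small C sketch of that test (our own formulation, using the phase tags introduced earlier):

    /* Allow a rule to fire only if all argument positions that share a
       repeated variable are in synchrony (same phase) or unbound. */
    #define NO_PHASE (-1)

    int repeated_var_ok(const int *arg_phase, const int *positions,
                        int n_positions)
    {
        int phase = NO_PHASE;
        for (int i = 0; i < n_positions; i++) {
            int p = arg_phase[positions[i]];
            if (p == NO_PHASE) continue;     /* unbound position is fine */
            if (phase == NO_PHASE) phase = p;
            else if (p != phase) return 0;   /* g2 fires: inhibit rule   */
        }
        return 1;
    }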

2.3.7 Multiple Antecedent Rules

A rule of the form P1(...) ∧ P2(...) ∧ ... ∧ Pm(...) ⇒ Q(...), with conjunctive predicates in the antecedent, is encoded using an additional τ-and node with a threshold of m. This node activates the collector of Q only if all the collectors of P1, ..., Pm are active. The interconnections between the argument nodes of the antecedent and consequent predicates remain unchanged. Figure 2.4 illustrates the encoding of the multiple antecedent rule ∀x,y [ P(x,y) ∧ Q(y,x) ⇒ R(x,y) ]; the g3 τ-and node has a threshold of 2. The definition of twins, ∀x,y [ sibling(x,y) ∧ born-together(x,y) ⇒ twins(x,y) ], is an example of a rule with multiple predicates in the antecedent.

Figure 2.4: Encoding rules with multiple predicates in the antecedent.

2.3.8 The Type Hierarchy

Integrating a type hierarchy with the reasoning system (Mani and Shastri, 1991a; Mani and Shastri, 1993) allows the use of types (categories) as well as instances in rules, facts, and queries. Consequently, the system can encode and retrieve facts which refer to types (for example, `cats prey on birds'), as well as facts which refer to nonspecific instances of types (`John bought a Porsche'). The reasoning system can also combine rule-based reasoning with type inheritance. For example, it can infer that `Tweety is scared of Sylvester', based on the generic fact `cats prey on birds', the rule `if x preys on y then y is scared of x', and the is-a relations `Sylvester is a cat' and `Tweety is a bird'. Finally, the integrated system can use category information to qualify rules by specifying restrictions on the type of argument fillers. An example of such a rule is ∀x:Animate, y:Solid-obj [ walk-into(x,y) ⇒ hurt(x) ], which specifies that the rule is applicable only if the two arguments of walk-into are of the types Animate and Solid-obj, respectively. Figure 2.5a shows how the type hierarchy interfaces with the rule-base. The network shown encodes the following rules and facts:

∀x,y [ preys-on(x,y) ⇒ scared-of(y,x) ], ∀x:Cat, y:Bird preys-on(x,y), and ∃x:Cat ∀y:Bird loves(x,y). The type hierarchy encodes the following is-a relations: is-a(Bird,Animal), is-a(Cat,Animal), is-a(Robin,Bird), is-a(Canary,Bird), is-a(Tweety,Canary), is-a(Chirpy,Robin), and is-a(Sylvester,Cat).

Rules are encoded as before. Facts involving typed variables are encoded in the following manner. A typed, universally quantified variable is treated as being equivalent to its type. Thus, ∀x:Cat, y:Bird preys-on(x,y) is encoded as preys-on(Cat,Bird). A typed, existentially quantified variable is encoded using a unique subconcept of the associated type. For example, in Figure 2.5a, ∃x:Cat ∀y:Bird loves(x,y) is encoded as loves(Cat-1,Bird), where Cat-1 is some unique instance of Cat.


Figure 2.5: (a) The rule-based reasoner interfaced with a type hierarchy. (b) Type hierarchy encoding for the relation is-a(A,B).

Figure 2.5b gives an overview of how the type hierarchy is encoded. Each entity is represented as a cluster of nodes organized into k1 banks. Any entity can accommodate up to k1 dynamic instantiations, k1 being the multiple instantiation constant for concepts. The type hierarchy encoding provides built-in mechanisms for automatically controlling the propagation of activation in the type hierarchy so as to deal correctly with queries containing existentially and universally quantified variables. The correct treatment of quantified variables, assuming all is-a links are indefeasible (i.e., without exceptions), requires that activation originating from a concept C that is either an entity or a type corresponding to a universally quantified variable in the query should propagate upwards to all the ancestors of C. This upward propagation checks if the relevant fact is universally true of some superconcept of C. Activation originating from a concept C that appears as an existentially quantified variable in the query should propagate to the ancestors of C, the descendants of C, as well as the ancestors of the descendants of C.[3] Since an entity can accommodate k1 instantiations, and it could receive activation from both superconcepts and subconcepts/instances, each entity is associated with two type-hierarchy switches, a top-down t-switch and a bottom-up t-switch, which regulate the flow of activation into the entity. The t-switches are shown as squares in Figure 2.5b. The t-switches bring about efficient and automatic dynamic allocation of concept banks to ensure that (i) any concept represents at most k1 instantiations, and (ii) a given instantiation is represented at most once; in other words, no two banks represent the same instantiation. We refrain from a detailed discussion of the architecture of the t-switch and entity cluster. These details can be found in (Mani and Shastri, 1993) and are not relevant to the design of shruti-cm5, where most of the functionality is procedurally implemented.

[3] The rationale for this is the following. A fact is true of some entity of type C if one or more of the following holds: (i) the fact is universally true of a superconcept of C, (ii) the fact is true of some subconcept/instance of C, or (iii) the fact is universally true of a superconcept of a subconcept/instance of C. The last condition is required if concepts in the type hierarchy can have multiple parents.
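These quantifier-dependent propagation rules can be stated procedurally. The following C sketch is our own simplification, ignoring banks and t-switches; the active[] marks double as visited flags, and the hierarchy is assumed acyclic.

    /* Activation propagation in the type hierarchy (cf. footnote 3). */
    #define MAX_C 1000

    int n_parents[MAX_C],  parent[MAX_C][4];
    int n_children[MAX_C], child[MAX_C][16];
    int active[MAX_C];

    void activate_up(int c)            /* c and all its ancestors      */
    {
        if (active[c]) return;         /* already reached              */
        active[c] = 1;
        for (int i = 0; i < n_parents[c]; i++)
            activate_up(parent[c][i]);
    }

    void activate_down(int c)          /* descendants of c, each of
                                          which also climbs upward     */
    {
        for (int i = 0; i < n_children[c]; i++) {
            activate_up(child[c][i]);
            activate_down(child[c][i]);
        }
    }

    typedef enum { UNIVERSAL, EXISTENTIAL } Quant;

    void propagate(int concept, Quant q)
    {
        activate_up(concept);          /* check superconcepts          */
        if (q == EXISTENTIAL)
            activate_down(concept);    /* subconcepts/instances and
                                          their ancestors as well      */
    }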



Figure 2.6: An overview of the multiple instantiation system.

2.3.9 Multiple Dynamic Instantiation of Predicates

Extending the reasoning system to incorporate multiple instantiation of predicates (Mani and Shastri, 1992a; Mani and Shastri, 1993) provides shruti with the ability to simultaneously represent multiple dynamic facts pertaining to a predicate. For example, the dynamic facts loves(John,Mary) and loves(Mary,Tom) can now be represented at the same time. As a result, we can represent and reason using a set of rules with more complex inferential dependencies, including circularities and bounded recursion. The system can encode rules like ∀x,y [ sibling(x,y) ⇒ sibling(y,x) ]. A forward reasoning system can use a rule such as[4] ∀x,y,z [ greater-than(x,y) ∧ greater-than(y,z) ⇒ greater-than(x,z) ] and infer `a is greater than c' on being told that `a is greater than b' and `b is greater than c'. Thus, the ability to accommodate multiple instantiation of predicates adds the capability to handle bounded symmetry, transitivity and recursion. The introduction of multiple dynamic instantiation of predicates relies on the assumption that, during an episode of reflexive reasoning, any given predicate need only be instantiated a bounded number of times. In (Shastri and Ajjanagadde, 1993), it is argued that a reasonable value for this bound is around three. We shall refer to this bound as the multiple instantiation constant for predicates, k2.[5]

[4] This rule cannot be represented in the backward reasoner since the antecedent has a repeated variable y that does not occur in the consequent. See Section 2.3.11.

[5] This is the factor that limits symmetry, transitivity and recursion, since each predicate can accommodate at most k2 dynamic instantiations.

Figure 2.6 is an overview of the reasoning system with multiple instantiation capabilities. The figure shows three predicates P, Q and R, with the interconnections suggesting the existence of two rules: P(...) ⇒ Q(...) and P(...) ⇒ R(...). Predicate representations are augmented so that each predicate can represent up to k2 dynamic instantiations. For a predicate P, the enabler of the ith bank will be active if that bank has been instantiated with a dynamic binding; the collector of the ith bank will be activated if the instantiation in that bank matches the knowledge encoded in the system. All inputs to a predicate are routed through a multiple instantiation switch (or m-switch) which mediates the flow of activation between the banks of the consequent and antecedent predicates. The switch brings about efficient dynamic allocation of predicate banks by (i) automatically channeling (at most k2) input instantiations into available banks of its associated predicate; and (ii) ensuring that each distinct instantiation occupies only one bank, irrespective of the number of predicates that may be communicating this instantiation to the switch. We refrain from presenting the architecture of the m-switch since shruti-cm5 simulates its behavior procedurally. With multiple instantiation of predicates, facts and rules are encoded by essentially replicating the network structure k2 times, once for each bank of the predicate (Mani and Shastri, 1993).
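Since shruti-cm5 realizes the m-switch procedurally, its essential behavior can be captured in a few lines of C. The sketch below uses our own names and is deliberately simpler than the actual implementation; it channels an incoming instantiation into a free bank unless an identical instantiation is already present.

    /* Procedural m-switch: at most K2 banks per predicate, and no two
       banks hold the same instantiation. */
    #include <string.h>
    #define K2 3
    #define MAX_ARITY 8

    typedef struct {
        int in_use;
        int phase[MAX_ARITY];    /* phase tag of each argument binding */
    } Bank;

    typedef struct { Bank bank[K2]; int arity; } PredInst;

    int receive_instantiation(PredInst *p, const int *phases)
    {
        size_t sz = (size_t)p->arity * sizeof(int);
        for (int i = 0; i < K2; i++)    /* duplicate? reuse its bank   */
            if (p->bank[i].in_use &&
                memcmp(p->bank[i].phase, phases, sz) == 0) return i;
        for (int i = 0; i < K2; i++)    /* otherwise take a free bank  */
            if (!p->bank[i].in_use) {
                p->bank[i].in_use = 1;
                memcpy(p->bank[i].phase, phases, sz);
                return i;
            }
        return -1;   /* all k2 banks full: instantiation dropped */
    }

The two loops correspond directly to properties (ii) and (i) of the m-switch described above.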

2.3.10 Network Complexity

The reasoning system, including the type hierarchy and multiple instantiation extensions, requires O(k1·C + k2·F + k2²·P) nodes, where C is the total number of entities in the system, F is the total number of long-term facts, and P is the sum of the arities of all predicates in the rule-base; k1 and k2 are the multiple instantiation constants for the type hierarchy and the rule-base, respectively. The network complexity is therefore at most linear in the size of the knowledge base. As for time complexity, the system can answer queries in time proportional to the length of the shortest derivation. Given a predicate P, the best-case propagation time for activation passing through its m-switch is proportional to n, the arity of P; in the worst case, propagation time is proportional to k2·n. If we assume that n_max is the maximum arity of any predicate in the reasoning system, then the constant of proportionality for the time complexity will be proportional to n_max (in the best case) or k2·n_max (in the worst case), irrespective of the predicate under consideration. The time taken for activation to traverse the type hierarchy is independent of k1, and depends only on the number of is-a links that need to be traversed in order to answer the query. The time taken to answer a query will be proportional to the maximum of the time taken for activation to spread in (i) the rule-base and (ii) the type hierarchy.
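For a feel for the constants, take a hypothetical knowledge base (the numbers are ours, purely illustrative) with C = 100,000 entities, F = 300,000 facts, total predicate arity P = 300,000, and k1 = k2 = 3. The node count is then on the order of

    k1·C + k2·F + k2²·P = 3(100,000) + 3(300,000) + 9(300,000)
                        = 3,900,000 nodes,

i.e., a small constant number of nodes per knowledge base element, in line with the linear space bound.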

2.3.11 Constraints on Rules and Inferences

Shruti is an efficient, limited inference system, and imposes several psychologically and biologically motivated constraints in order to make reasoning tractable (Shastri and Ajjanagadde, 1993; Dietz et al., 1993; Mani and Shastri, 1993). We briefly describe these constraints and the motivation behind positing them. The number of distinct entities that can participate in any given episode of reasoning is bounded, and is typically a small number in the 7-10 range. The maximum number of entities that can be simultaneously represented is limited by the ratio ⌊π_max/ω⌋, where π_max is the maximum period of oscillation (see Figure 2.2), and ω is the tolerance window for synchrony: any two nodes firing with a maximum lag or lead of ω are considered to be in synchrony. Based on biologically plausible values (Shastri and Ajjanagadde, 1993) of π_max ≈ 35 milliseconds and ω ≈ 5 milliseconds, we determine that about ⌊35/5⌋ = 7 entities can be simultaneously represented. Furthermore, as synchronous activity propagates in a noisy system like a group of neurons, it is reasonable to expect that synchrony will be gradually lost as activation propagates farther from the origin. In a reasoning system like shruti, this translates to a bound on the depth of reasoning, generally a small value of about five. In a backward reasoning system, activation flows from the consequent predicate to the antecedent predicate(s). For a rule to participate in reflexive reasoning in such a system, any repeated variable in the antecedent of the rule must also appear in the consequent of the rule and get bound during a given episode of reasoning. It has been shown in (Dietz et al., 1993) that this constraint is necessary in order to perform reasoning in time proportional to the depth of inference with a linear bound on space complexity. More generally, any predicate argument in the antecedent that requires some condition to be enforced must occur in the consequent and get bound during a given episode of reasoning. Thus, repeated variables, typed variables and entities which occur in the antecedent of a rule must also occur in the consequent and get bound during any episode of reasoning. As is evident from Section 2.3.10, maintaining multiple instantiations of predicates is expensive. For example, a backward reasoning system which supports k2 instantiations could require on the order of k2² nodes for encoding rules, and in the worst case, propagation time could increase by a factor of k2. Given this high cost in terms of space and time, it has been claimed that the number of instantiations is limited to a small value around three. Finally, when representing rules, facts, and queries which involve typed variables, only those situations where all the universally quantified typed variables are within the scope of the existential typed variables can be represented. This condition ensures that the number of entities involved in any given episode of reasoning remains limited. If existential variables are within the scope of universally quantified variables, a different instantiation of the existential variable(s) may be required for each instantiation of the universally quantified variables, leading to a potential explosion in the number of entities required for a successful derivation.

2.3.12 A Characterization of Reflexive Reasoning

Based on the characteristics of shruti, and the constraints set out in the previous section, we characterize reflexive reasoning as a tractable reasoning class which satisfies the following conditions (for a backward reasoning system):

- Rules are of the following form:

    ∃w1:W1, ..., wp:Wp ∀x1, ..., xr, y1:Y1, ..., ys:Ys [ P1(...) ∧ ... ∧ Pn(...) ⇒ ∃z1, ..., zt Q(...) ]

  The arguments of Pi (1 ≤ i ≤ n) can be (i) constants, (ii) any of the universally quantified variables, or (iii) elements of {w1, ..., wp}. The arguments of Q are constants or any of the variables. Y1, ..., Ys and W1, ..., Wp are types which specify restrictions on the bindings of y1, ..., ys and w1, ..., wp, respectively. Rules can be cyclic or recursive, as long as the number of dynamic instantiations of any predicate in a derivation remains bounded. In order to satisfy the constraints listed in Section 2.3.11, rules should satisfy the following restrictions: repeated variables, constants and typed variables (w1, ..., wp; y1, ..., ys) occurring in the antecedent of the rule (as arguments to one or more Pi) should also occur in the consequent (as an argument of Q) and get bound during an episode of reasoning. Note that all the universally quantified variables are within the scope of the typed existential variables.

- Facts are partially or completely instantiated predicates of the form

    ∃x1, ..., xr, y1:Y1, ..., ys:Ys ∀z1:Z1, ..., zt:Zt [ P(...) ]

  Again, Y1, ..., Ys and Z1, ..., Zt are types. The arguments of P are either constants or any of the variables, and are subject to the constraints set out in Section 2.3.11.

- Queries, which have the same form as facts, can be answered in time proportional to the depth of inference.

- The space complexity of the encoding is at most linear in the size of the knowledge base.

Chapter 3

Mapping SHRUTI onto Massively Parallel Machines

We begin with a discussion of massively parallel processors (MPPs) in light of mapping structured connectionist models. Subsequent sections consider issues that need to be addressed in order to obtain effective performance and to strike a compromise between resource usage and response time when mapping shruti onto any massively parallel machine. Later chapters bring out how these issues are resolved in actual implementations on the CM-5. Though we present the discussion in terms of mapping shruti onto parallel machines, similar considerations apply to mapping any network or graph which conforms to the assumptions laid out in Section 3.1.3.

3.1 Structured Connectionism and Massively Parallel Architectures

Connectionist models are one flavor of massively parallel architectures. If we assign one simple processor to each node in the connectionist network and treat each link as a communication channel between processors, we have a specialized massively parallel system. An alternative approach is to simulate connectionist models on existing general purpose parallel machines. In this section, we present a summary of custom-tailored and general purpose parallel architectures from the perspective of mapping structured connectionist models. We also note that connectionist models can be simulated on serial machines; but a connectionist network being inherently parallel, serial execution exacts a heavy performance penalty and is viable only for small networks. Achieving acceptable performance with large networks would be extremely difficult, if not impossible, due to sequential processing and the von Neumann bottleneck.

3.1.1 Special Purpose Hardware

With decreasing manufacturing costs for VLSI systems and with increasing packaging density of modern VLSI circuits, building special purpose hardware for connectionist systems is becoming increasingly feasible. Although special purpose hardware could provide the best way to implement connectionist networks, such hardware is not flexible and has to be tailor-made for specific connectionist architectures and networks. Any change in the connectionist network would require modifications, changes or a complete redesign of the hardware. Figure 3.1, adapted from (Hammerstrom, 1994), illustrates this trade-off between flexibility and performance per unit cost.

Figure 3.1: Trade-off between hardware flexibility and performance per unit cost. Adapted from (Hammerstrom, 1994). (The figure plots flexibility against performance per unit cost on a log scale, ordering traditional processors, flexible digital, specialized digital and analog implementations.)

Furthermore, in spite of recent technological advances, constructing special purpose hardware is still an expensive and time-consuming option. The inflexibility of special purpose hardware is a disadvantage when considering structured connectionist networks, since the resulting network representation or architecture can depend on the specific problem instance. For instance, in the case of shruti, the actual topography of the network depends entirely on the knowledge base being encoded. This can be problematic when dealing with inflexible hardware. There have been, and continue to be, attempts to build specialized, reconfigurable machines which support a wide variety of connectionist networks. Such "general purpose connectionist machines" provide hardware platforms for efficiently running connectionist networks with varied structures and architectures. Though such machines are much more flexible and can be easily reconfigured, they are not widely available as of now; but see (Asanovic et al., 1993) and Section 4.5. Unless such general purpose machines are built with structured connectionist networks in mind, the performance penalty (as compared to network architectures optimally suited for the machine) could be significant.[1] Furthermore, the limited availability of such one-of-a-kind machines restricts their usefulness.

[1] For example, Mueller and Gomes (1994) show that the performance degradation when simulating sparse nets on the CNS-1 (Asanovic et al., 1993) could be anywhere from 2× to 10× as compared to the performance for densely connected networks.

3.1.2 General Purpose Parallel Architectures

There are several massively parallel, general purpose architectures onto which we can map connectionist networks. Several of these architectures, and their suitability for simulating structured connectionist models, are discussed below. A detailed exposition of these architectures and their use in simulating connectionist models can be found in Section 4.6.

- Vector Supercomputers. These are machines with hardware support for vector and array processing. These architectures use efficient multi-stage pipelines with very few processors. Vector machines are well suited for number crunching operations but are not very promising platforms for massively parallel structured connectionist model simulations: the irregularity of these networks makes it difficult to exploit the processing power of these machines. The Cray Y-MP and the Convex C3 are examples of vector supercomputers.

- SIMD Architectures. SIMD (Single Instruction, Multiple Data) machines generally consist of a large number of simple processors, all of which execute the same instruction, but on different data values. The processors are linked together by a communication network which most commonly supports efficient nearest-neighbor communication as well as general interprocessor communication between any two processors in the system. Though the presence of a large number of simple processors makes it easy to map structured connectionist networks onto an SIMD machine, the synchronous execution of the same instruction on all processors could lead to inefficient use of computational resources. The Connection Machine CM-2 is a typical example of an SIMD machine.

- General Purpose MIMD Architectures. MIMD (Multiple Instruction, Multiple Data) machines offer advantages over SIMD architectures when the model can make effective use of relatively powerful processors which can function independently of each other. A structured connectionist network can be partitioned into several fragments and each fragment run asynchronously on a separate processor; interprocessor communication is used when crossing partition boundaries. Most modern MIMD machines have very efficient interprocessor communication. The Connection Machine CM-5 and the Cray T3D are examples of MIMD machines.

- Systolic Arrays. Systolic array architectures consist of several processors operating on several data streams and interacting in a regular and symmetric manner. They can be configured to mimic SIMD architectures by programming all the processors to perform the same operations on different data streams (data partitioning), or to operate like a MIMD machine by simulating a part of the network on each processor (network partitioning). Regularity in the data, operations and processor interactions is a prerequisite for optimum performance on systolic arrays.

- Dataflow Architectures. Dataflow architectures offer asynchronous, data-driven parallelism. Unlike other parallel machines, dataflow machines can automatically expose the inherent parallelism in a given computation. The ease of programming and flexibility of dataflow architectures make it easy to support a variety of connectionist models. The major limitations of this architecture are limited availability, lack of efficient hardware realizations, and lack of good compilers.

3.1.3 Mapping Structured Connectionist Models onto Massively Parallel Architectures

In summary, we exploit the following features of structured connectionist networks in mapping them onto general purpose MPP machines:

- Structured connectionist models are inherently parallel. But in order to get the maximum benefit from this parallelism, the mapping granularity must be tailored to the computational capabilities of the MPP processors.

- The simplicity of the messages exchanged between nodes supports the use of interprocessor communication schemes which can handle short message packets very efficiently. Complex messages and communication protocols are unnecessary.

- Structured connectionist networks are sparsely interconnected, so that spreading activation will touch only a small fraction of the nodes in the network.

- Since we will be dealing with shruti-like knowledge base systems, we make the following assumptions:


  - Since a knowledge representation system must be able to support any well-formed query as defined in Section 2.3.12, the source of initial activation and the depth to which it will propagate are unknown. This being the case, the system should provide good performance on average.

  - An episode of reasoning being very short, dynamic load balancing on the parallel machine is infeasible. We need to ensure that the static distribution of the knowledge base will guarantee good dynamic load balancing on average.

  - The networks we deal with are large, since the system has to reason with very large knowledge bases.

In terms of target architectures, we use existing general purpose distributed memory MPP machines. MPPs are becoming increasingly popular, and several of these machines have efficient interprocessor communication mechanisms.
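These observations shape the wire format: everything a receiving processor needs to know about an instantiation fits in a few words. One plausible C rendering of such a short activation message (the field names are ours) is:

    #include <stdint.h>

    /* A complete activation message fits comfortably in one short
       packet: which knowledge base element fired, which of its banks,
       and the phase tags of its (at most MAX_ARITY) arguments. */
    #define MAX_ARITY 8

    typedef struct {
        uint32_t element_id;        /* global id of predicate/concept */
        uint8_t  bank;              /* which of the k2 (or k1) banks  */
        uint8_t  arity;
        int8_t   phase[MAX_ARITY];  /* -1 = unbound argument          */
    } ActivationMsg;                /* about 14 bytes                 */

A message this small can plausibly be carried in a single network packet on typical MPPs, which is what makes low-overhead packet-based communication (Section 3.7) attractive.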

3.2 Exploiting Constraints Imposed by Shruti

As brought out in Chapter 2, shruti is a limited inference system, and imposes several psychologically and/or biologically motivated constraints in order to make reasoning tractable:

- The form of rules and facts that can be encoded is constrained. Shruti attains its tractability from this fundamental constraint (Shastri, 1993; Dietz et al., 1993), which implicitly influences the network encoding the knowledge base and makes it amenable to efficient inference by spreading activation.

- The number of distinct entities that can participate in an episode of reasoning is bounded. This restricts the number of active entities, and hence, the amount of information contained in an instantiation. In turn, this limits the amount of information that must be communicated between predicates.

- Entities and predicates can only represent a limited number of dynamic instantiations. Entities and predicates therefore have a bounded number of banks, which constrains both space and time requirements.

- The depth of inference is bounded. This constrains the spread of activation in the network and therefore directly affects response time and resource usage.

The motivation for these constraints and their impact are discussed in Section 2.3.11. In mapping shruti onto parallel machines, it is to our advantage to exploit these constraints to the fullest extent to achieve efficient resource usage and rapid response with large knowledge bases. Of course, if any of these constraints can be relaxed without paying a severe performance penalty, we would like to obtain a more powerful system by relaxing them.

3.3 Representing Synchrony

Temporal synchrony is a distinguishing feature of shruti. Representing dynamic bindings using synchronous firing of connectionist nodes, and relating this to the spiking of neurons, is crucial in deriving several of the constraints and in defining reflexive reasoning. When mapping shruti onto MPPs, we could explicitly represent synchrony by dividing the time scale into distinct intervals and forcing processors to synchronize after each interval. Though this would provide a direct representation of synchrony, it may not lead to an efficient implementation. On most MPP machines, especially those not designed for SIMD operation, repeated synchronization is expensive, and can significantly slow down the system. Instead, we opt to represent temporal synchrony by using tags or markers: integers which take on a small number of values representing the phases in a cycle. Nodes having the same value are considered to be temporally synchronous. Using this representation for synchrony allows the processors to run essentially independently, propagating instantiations by interprocessor communication. Although we do not explicitly use temporal synchrony, we still exploit the characteristics and constraints derived from shruti's temporal synchrony approach to reasoning. The use of tags or markers makes the resulting system flexible, so that it can be easily adapted to support other related marker-passing algorithms. We believe that using markers to represent temporal synchrony will not limit the knowledge representation and reasoning ability of the system in comparison with shruti. There are some situations, for example when modeling the neurophysiological aspects of synchrony, oscillation and spiking in relation to reflexive reasoning, where temporal synchrony is essential and cannot be dispensed with. But for practical knowledge representation applications, the use of markers should not introduce any new limitations or compromises. In fact, the use of markers simplifies some operations, like internally generating new phases.
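Concretely, the marker scheme reduces "being in phase" to integer equality, and generating a new phase to bumping a counter. A minimal C sketch follows; the names are ours and the scheme is deliberately simpler than shruti-cm5.

    /* Markers in place of spike timing: each distinct entity active in
       the current episode gets a small integer phase.  MAX_PHASES plays
       the role of the bound ⌊π_max/ω⌋, about 7 (Section 2.3.11). */
    #define MAX_PHASES 7

    static int next_phase = 0;

    int new_phase(void)               /* internally generate a phase */
    {
        return (next_phase < MAX_PHASES) ? next_phase++ : -1;  /* -1: none left */
    }

    int in_synchrony(int phase_a, int phase_b)
    {
        return phase_a >= 0 && phase_a == phase_b;
    }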

3.4 Granularity

For effective mapping, the shruti network encoding a knowledge base must be partitioned among the processors in the machine. The network partitioning can be specified at different levels of granularity. At the fine-grained network level, the partitioning would be at the level of the basic nodes and links constituting the network. A more coarse-grained knowledge-level mapping would partition the network at the level of knowledge elements like predicates, concepts, facts, rules and is-a relations. The appropriate level of granularity for a given situation depends on several factors, including the characteristics of the network, the processing power of individual processors on the machine, and the interprocessor communication mechanisms. We consider these factors below.

3.4.1 Network-Level Mapping

At this level of granularity, the network is viewed as a collection of nodes and links. Factors that need to be considered when using network-level partitioning include:

Processor Allocation. Nodes and links in the network should be assigned to processors on the target machine so as to minimize response time. Several options are possible: each node and link could be assigned to a random processor; groups of nodes and/or links could be assigned to a single processor; processors could be partitioned so that some handle only nodes and some handle only links; and so on.

Nodes. The network which shruti uses to encode a knowledge base consists of several different types of nodes. A given processor could handle only one type of node or could simulate an assorted combination of node types. The complexity of the node function should also be taken into consideration.

Links. Like nodes, the links can also be of several types, including weighted, unweighted and inhibitory. Placement of the links (on processors) relative to the placement of the nodes they connect is important, since this is a major factor determining the volume of interprocessor communication.

Communication and Computation. The partitioning scheme used to assign network components to processing elements should take into account the balance of computation and communication in the resulting system. Communication between network nodes, and hence interprocessor communication, is an essential aspect of connectionist network simulation. Trying to eliminate or unduly minimize interprocessor communication could lead to severe load imbalances whereby a few of the processing elements are overburdened with computation. Trying to evenly spread the computational load among the processing elements could result in increased communication and poor performance. A well designed system should strike a compromise between communication and computation so as to achieve effective performance.

3.4.2 Knowledge-Level Mapping

Knowledge-level mapping views the network at a relatively abstract level. At this granularity, knowledge base elements like predicates, concepts, facts, rules and is-a relations form the primitives. As is evident from Section 2.3, each primitive is constituted by a group of nodes and/or links. The behavior of these primitives is directly simulated without recourse to the underlying nodes and links constituting the primitive. Issues at this level include:

Predicates. Each predicate could be assigned to a separate processor, or a group of predicates could be assigned to a single processor. In the latter case, the predicates constituting a rule could all be placed on the same processor or could be scattered over different processors. Grouping related predicates on a processor reduces the number of messages required to spread activation, but makes load balancing more difficult. If the number of predicates is larger than the number of processors, grouping predicates is unavoidable.

Facts. Facts may be stored on the same processors to which the corresponding predicates have been assigned. An alternative approach is to have dedicated processors for encoding facts. Such processors receive inputs from both the predicate and the type hierarchy, and signal fact matches globally or by communicating with the processor containing the predicate under consideration. In either case, we may need some mechanism to handle processors running out of memory, since predicates could have a large number of associated facts.

Concepts. Concept clusters are used in the type hierarchy to represent types and instances. Apart from being linked up to form the type hierarchy, these clusters must also communicate with the rule-base. Careful choice of the mechanisms used to communicate entity activations to the rule-base will make the system more effective and reduce the number of messages exchanged in the system.

Rules. When encoding rules, effective placement of the predicates constituting the rule minimizes communication costs. The arbitration mechanism for accommodating multiple instantiations of a predicate must also be taken into account.

- When encoding rules, several processor assignment (or processor allocation) schemes are possible for placing the predicates constituting the rule:

  - We could allocate the predicates occurring in a rule to the same processor. This reduces interprocessor communication, since fewer messages are required when the rule fires, but the processor will have a higher computation load. A weaker form of this scheme is to allocate the predicates in a rule to nearby processors; this requires relatively more messages in order to fire a rule.

  - The other extreme is to scatter the predicates randomly. This requires more messages, and messages travel a longer average distance than in the previous schemes. There are, however, indications that random allocation may distribute messages uniformly over the entire machine instead of localizing them to "hot spots" where all the action happens, thereby reducing the incidence of message collisions (Leighton, 1992). Furthermore, this scheme provides better load balancing when answering a query.

  - In general, an arbitrary fraction of the antecedents of rules with a given consequent predicate could be placed on the same processor as the consequent predicate. Such schemes require global constraint satisfaction to achieve satisfactory processor allocation, and actual realization of such schemes can therefore be extremely difficult.

  - Making copies of a given predicate on more than one processor is also an option, especially when the predicate has a large number of rules and/or facts. In such a case, the rules and/or facts are partitioned among the copies of the predicate. Though this requires extra resources and complicates book-keeping, it might be worthwhile since it provides increased parallelism and improved load balancing.

- Identifying suitable performance measures, and attempting to optimize them, will aid in the objective placement of predicates when encoding rules. The performance measure should take into account factors like load balancing, cost of computation, and communication cost. It should be easy to compute the measure, or at least approximate it, using only local information.

- Predicate instance arbitration mechanisms ("switches") may need to be redesigned. When one or more predicates are assigned to each processor, switches may be unnecessary. Space ("banks") can be allocated for k2 instances of each predicate. Incoming activation can be received in a buffer and then allocated to an empty bank under program control. In essence, multiple instantiation can be handled procedurally.

Type Hierarchy. Most of the issues raised above also need to be reconsidered with respect to the location and interaction of concepts in the type hierarchy. We also need to streamline the interaction between the type hierarchy and the rule-base for enhanced efficiency and effectiveness. Extending the scheme mentioned above for dealing with multiple instantiation, we might be able to do away with the type hierarchy t-switch.

Most of the concerns addressed above are intertwined, in that choosing one aspect will affect the choice of other aspects of the mapping. On a global scale, our aim is to develop an efficient and effective mapping by ensuring load balancing, minimizing interprocessor communication, and efficiently using resources, including processors and memory, all with the objective of attaining real-time performance with very large knowledge bases.

3.4.3 Optimal Mapping Granularity

Most general purpose massively parallel machines, the target architectures for our mapping, have full-fledged, relatively powerful microprocessors as their processing elements. This being the case, network-level mapping is too fine a granularity. It is unnecessary, given the computing capability of the processing elements, to consider shruti networks at the level of individual connectionist nodes and links. Network-level mapping would force us to simulate the behavior of individual connectionist nodes and links. Knowledge-level mapping consolidates subnetworks representing knowledge-level primitives, and implements them using appropriate data structures and associated procedures without necessarily mimicking the detailed behavior of individual nodes and links in the subnetwork. The abstraction inherent in knowledge-level mapping, combined with the computing power of the processors, yields a system where low level detail is not directly simulated. Not only does this result in a faster and more efficient system, it also provides a conceptually clean mapping for the knowledge engineer or system designer to use. Though the processing elements may be capable of handling a granularity coarser and more abstract than knowledge-level mapping, it is not straightforward to devise a set of primitives for effective mapping at coarser granularities. Furthermore, when dealing with coarser granularities, it is not clear how the constraints imposed by shruti and the spreading activation paradigm could be exploited. With knowledge-level mapping, coarser granularities can be achieved by grouping or clustering knowledge elements: for example, facts can be grouped with the respective fact predicates, is-a relations can be grouped with either of the concepts involved, and so on. Based on these considerations, we believe that knowledge-level partitioning is the appropriate granularity for mapping shruti onto general purpose massively parallel machines. Note that with knowledge-level mapping, the resulting structure can still be viewed as a graph, except that the nodes and links are constituted by more abstract entities in the knowledge base. Predicates and concepts can be treated as nodes; rules are links between predicates; and is-a relations are links between concepts. Facts can be represented as nodes with links to the predicate node; but since facts are generally closely associated with their predicates, one could also consider facts as being implicitly represented by the predicate node.

3.5 Processor Allocation

Having chosen an appropriate granularity for encoding, the knowledge base can be viewed as a graph structure imposed on a collection of primitive knowledge base elements. In order to map the knowledge base onto an MPP machine, these primitive elements must be partitioned among the MPP processors. The processor allocation scheme directs the assignment of knowledge base elements to processors in the MPP. The choice of an appropriate processor allocation scheme is a crucial aspect of mapping shruti onto massively parallel machines. A variety of processor allocation schemes are possible, and different architectures warrant different schemes for optimal performance. The analysis in Chapter 6 brings out the effect of processor allocation on performance when mapping shruti onto distributed memory MPP machines.

3.6 Static and Incremental Encoding of Knowledge Bases

Typically, a knowledge base would start out with some rules and facts, and more rules and facts would be added in the course of time. In mapping shruti onto massively parallel machines, a static encoding scheme takes into consideration the structure of the entire knowledge base in determining the mapping and the layout of nodes. When large chunks of new knowledge need to be added to such a knowledge base, it becomes necessary to globally reorganize and remap the entire knowledge base. Global optimization of large knowledge bases is computationally very expensive, and this strategy may be acceptable only if new knowledge is added at infrequent intervals. Irrespective of the cost or acceptability of such a scheme, if an optimal mapping is desired, we have no option but to globally analyze and map the knowledge base onto the underlying MPP machine.

Incremental encoding of knowledge bases, on the other hand, does not take the entire global structure into account. Such mapping schemes locally map any given knowledge base element by connecting it to its neighboring knowledge base element(s) without a detailed analysis of its relation to the rest of the knowledge base. Such schemes are very easy to implement and computationally inexpensive, but they could result in non-optimal and skewed mappings.

A random processor allocation scheme, where knowledge base elements are assigned to random processing elements on the MPP, is an incremental encoding technique which, under certain circumstances, provides the best of both worlds. When the communication cost is comparable to the computational cost of processing a knowledge base element, it turns out that the random allocation scheme provides close to optimal performance. Chapter 6 analyzes this situation in detail.
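Random allocation is also simple to realize. The sketch below hashes an element identifier to a processor number, so an element's home processor can be recomputed anywhere without a global lookup table; the integer identifiers and the particular hash function are assumptions of this illustration.

    /* Simple multiplicative hash; any reasonable hash would do. */
    static unsigned hash_id(unsigned element_id)
    {
        return element_id * 2654435761u;   /* Knuth's multiplicative constant */
    }

    /* Assign a knowledge base element to one of nprocs processors,
       independently of the rest of the knowledge base. */
    int allocate_processor(unsigned element_id, int nprocs)
    {
        return (int)(hash_id(element_id) % (unsigned)nprocs);
    }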

3.7 Load Balancing and Communication

Striking a compromise between load balancing and locality is an important aspect of mapping shruti onto MPPs. In order to fully utilize the parallelism of the machine, computations must

be uniformly distributed over all the processors. At the same time, interprocessor communication should be minimized. Moreover, in a spreading activation system, activation originates at a small number of sources and spreads to some fraction of the network. When the origin of activation is unknown, i.e., the queries that will be posed are not known beforehand, one can only aim at attaining optimal performance on average (see Section 3.1.3). The processor allocation scheme is critical in achieving a balance between computation and communication. These issues are considered analytically in Chapter 6.

When it comes to choosing among possible interprocessor communication protocols, several options are available. A large volume of communication between two processors warrants a relatively complex protocol, like synchronous send-receive or the establishment of virtual channels, where both processors get ready to communicate and then quickly and efficiently exchange a large volume of information. Under such circumstances, packet based communication may introduce significant overhead in terms of identifying packets and reordering arriving packets. In a shruti-like system, where the information exchanged between nodes is small, a packet based communication scheme might well reduce communication costs. If the message can be tailored such that a single packet is all that is needed, and if a relatively low-level, low overhead packet based communication scheme can be used, we expect significant improvements in performance. In the absence of low cost packet based communication, processors can consolidate messages that need to be sent to a given destination; when enough messages have accumulated, they are sent in a single batch. Such a scheme may, however, delay activation spread and hinder the effective use of parallelism.

A reduction in costs will also ensue if communication is send-based rather than get-based. In a send-based system, nodes send out messages when they are ready, and the receiving nodes process the messages as and when they arrive. A get-based system requires nodes to ask other nodes for information when they are ready to process it. Unless the system is systolic or based on implicit interprocessor communication patterns, a get-based system requires twice the number of messages to be exchanged in comparison with a send-based system.

Overlapping computation and communication provides yet another opportunity to improve performance. If possible, it will be beneficial to use some form of an active message system (von Eicken et al., 1992). A processor initiates a send and then continues with its computation without waiting for the message to arrive at the receiver. The network transports the message packet to the receiver, where a message handler is invoked to process the arriving message. Thus, network latency is completely hidden and neither the sender nor the receiver needs to wait for the message.
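The following C sketch illustrates the send-based, active-message style of activation spread described above, in the spirit of (von Eicken et al., 1992). The am_send() primitive and the message format are stand-ins for the machine's actual active message interface (e.g., CMAML on the CM-5) and are assumptions of this illustration.

    typedef struct { int pred_id; int bank; int binding; } Activation;

    /* Invoked on the receiving node by the message layer when an
       activation packet arrives; the sender never waits for this. */
    void activation_handler(Activation *msg)
    {
        /* receive into a buffer, then allocate a bank under program
           control, as outlined in Section 3.4 */
        /* ... process msg->pred_id, msg->bank, msg->binding ... */
    }

    /* Assumed primitive: deliver msg to node 'dest' and invoke
       'handler' there on arrival. */
    extern void am_send(int dest, void (*handler)(Activation *),
                        const Activation *msg);

    void propagate(int dest, const Activation *a)
    {
        am_send(dest, activation_handler, a);
        /* continue computing immediately; network latency is hidden */
    }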

3.8 Scalability and Resource Limitations

One of the goals of mapping shruti onto MPPs is to develop a real-time reasoning system. In addition to speed, scalability is another crucial aspect of any knowledge representation system. In theory, shruti is a highly scalable system. Shruti's space or node requirements scale linearly with the size of the knowledge base; the time taken to answer queries or make predications remains unaffected by the size of the encoded knowledge base (Section 2.3). But when shruti is mapped onto a real machine with finite resources, the inherent scalability of shruti may be adversely affected. Given a machine with a fixed number of processors, a fixed amount of memory per processor, and a finite interprocessor communication bandwidth, only knowledge bases up to a certain size can be dealt with without undue performance degradation. Though careful design can help deal with larger knowledge bases, there will always be a point where machine resources limit performance. Our aim is to extend this limit as far as possible and to investigate the behavior of the system when it is operating within these resource limits.

Several techniques can be used to extend the limits imposed by resource limitations. For instance, the effect of bounded memory resources can be mitigated by loading fragments of the

network on an as-needed, demand-driven basis, supported by proper caching and indexing schemes. Another possible approach is to adapt techniques from database systems. Though such methods would eventually be needed in dealing with very large knowledge bases, we concentrate here on achieving an optimal mapping and real-time performance without recourse to special techniques that circumvent or extend resource limits. Future work will explore these aspects.
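As an illustration of the kind of demand-driven loading just mentioned, the following C sketch keeps a small cache of knowledge base fragments and evicts the least recently used one when space runs out. The cache size and the load_fragment_from_disk() routine are assumptions of this illustration.

    #include <stdlib.h>

    #define CACHE_SLOTS 64

    typedef struct {
        int   fragment_id;   /* which piece of the network this slot holds */
        long  last_used;     /* logical clock for LRU eviction */
        void *data;          /* the encoded fragment itself */
    } Slot;

    static Slot cache[CACHE_SLOTS];
    static long clock_now = 0;

    extern void *load_fragment_from_disk(int fragment_id);  /* assumed */

    /* Return the requested fragment, loading (and evicting) on demand. */
    void *get_fragment(int fragment_id)
    {
        int lru = 0;
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].data && cache[i].fragment_id == fragment_id) {
                cache[i].last_used = ++clock_now;   /* cache hit */
                return cache[i].data;
            }
            if (cache[i].last_used < cache[lru].last_used)
                lru = i;                            /* track LRU slot */
        }
        free(cache[lru].data);                      /* evict (no-op if empty) */
        cache[lru].fragment_id = fragment_id;
        cache[lru].last_used   = ++clock_now;
        cache[lru].data        = load_fragment_from_disk(fragment_id);
        return cache[lru].data;
    }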

3.9 Analysis and Experimentation

Having considered the various issues discussed above and having made choices based on various factors, we expect analysis and experimentation with very large artificial and real-world knowledge bases to yield evidence supporting our choices. In this thesis, we map shruti onto the Connection Machines CM-5 and CM-2. Since the performance of the CM-5 system, shruti-cm5, is vastly superior to that of the corresponding CM-2 implementation, shruti-cm2, we consider shruti-cm5 in detail in the ensuing chapters. A discussion of shruti-cm2 is relegated to the appendix.


Chapter 4

Related Work

In this chapter, we present a brief survey of previous and ongoing work related to the development of massively parallel knowledge representation systems. Since we propose to map shruti onto parallel architectures, we also consider the mapping of connectionist networks onto massively parallel machines. We begin with a description of knowledge representation systems which have been implemented on massively parallel machines. Parallel logic programming and database systems are considered next. We also present relevant structured connectionist models and special purpose hardware for connectionist networks. We conclude with a brief discussion of performance evaluation and testing.

4.1 Massively Parallel Knowledge Representation Systems

This section provides a brief summary of parallel knowledge representation systems that have actually been implemented on either special purpose or general purpose hardware. Section 4.4 considers systems which are inherently parallel, but have not been implemented.

4.1.1 PARKA

Parka (PARallel Knowledge representation and Association) is a frame-based knowledge representation system providing an efficient and massively parallel implementation of semantic networks (Evett and Hendler, 1992). It runs on the Connection Machine CM-2 and is written in *Lisp, a parallel version of Common Lisp. Parka uses processors to represent nodes, and one or more parallel variables or pvars to represent links. Bottom-up and top-down inferencing is handled by parallel activation wave propagation using send and get operations on the Connection Machine. The inference algorithms are described in (Evett, 1994). Parka takes steps to minimize Hillis bottlenecks [1], performance degradation resulting from an increased incidence of message collisions on the CM. The performance of parka is tested using pseudo-random networks (with up to 130,000 nodes) as well as subsets of CYC (Evett et al., 1993). The CYC subsets had about 26,000 units. Parka's run time for inheritance queries is O(d) and for recognition queries is O(d + p), where d is the depth of the is-a hierarchy and p is the number of property constraints. Actual run times range from a fraction of a second (for inheritance queries) to a little more than a second (for recognition queries with 15-20 conjuncts). Parka does not support rule-based reasoning; it can only handle frame-based knowledge, with some extensions to deal with memory-based reasoning.

[1] In contrast to von Neumann bottlenecks.


4.1.2 SNAP

Snap is a semantic network array processor developed at the University of Southern California (Moldovan et al., 1992). The conceptual design of snap is based on associative memory and marker passing, and the machine is a special purpose architecture optimized for representing and reasoning with semantic networks. Snap consists of a central controller which directs the operation of the snap array. The snap array is constituted by 256 processors, each with local content-addressable memory sufficient to encode 64 semantic network nodes and their associated links. The machine can therefore store a 16K node semantic network. The snap interconnection network is a bus-hypercube, with each dimension of a four-dimensional hypercube having several processors on a bus. The snap array executes instructions from the controller in a SIMD fashion. In addition, marker passing proceeds independently of instruction execution, endowing snap with limited MIMD capabilities. Snap provides a special instruction set for semantic network creation and maintenance, marker creation and propagation, logic operations and search/retrieval.

A snap prototype built with off-the-shelf components is used to implement a parallel, memory-based parser (Moldovan et al., 1992). The parser is capable of processing sentences in 1-10 seconds, depending on the sentence length and the size of the knowledge base used. The largest knowledge base used consists of about 2,000 nodes.

4.1.3 Other Systems

Geller (1994) presents a massively parallel implementation of tree structured is-a hierarchies on the Connection Machine CM-2. The hierarchy is flattened into a pointer-free linear tree representation using Schubert's numbering scheme (Schubert et al., 1987). The linear tree is then mapped onto the processors of the CM-2. Parallel algorithms for efficient update and inheritance operations are presented. These algorithms manipulate the linear representation without reconstructing the tree structure of the hierarchy. Experiments on the CM-2 indicate that, for a given machine size, downward inheritance, where the attributes of a concept are derived by searching through its superconcepts, can be performed in constant time, irrespective of the depth of the hierarchy. At present, this system can handle only tree structured hierarchies, and hence cannot deal with multiple inheritance.

Implementing semantic networks and marker passing on the CM-2 is explored in (Chung et al., 1989). The execution time of the system they describe is proportional to the length of the is-a chain, the number of markers (i.e., properties) being propagated, and the branching factor. On a SIMD machine like the CM-2, all processors have to execute the same instruction irrespective of whether they have data to process. This can lead to a deterioration in performance for large networks. Based on the results from experiments on the CM-2, the authors argue for special purpose hardware which overcomes the shortcomings of the CM-2 architecture.

Other frame-based knowledge representation systems implemented on the CM-2 are described in (Kolodner and Thau, 1988) and (Aronis, 1993). The paradyme system described in (Kolodner and Thau, 1988) is a restricted frame-based system designed expressly for use in case-based reasoning systems. The system developed in (Aronis, 1993) can encounter severe performance problems when dealing with large knowledge bases.

4.1.4 Discussion

In surveying the various massively parallel knowledge representation systems in this section, we note the following. Firstly, all the systems we have encountered are confined to implementing is-a hierarchies or frame-based systems. Though this is a very important aspect of any knowledge base system, the capability to reason with rules, combined with interaction between the rule-base and the is-a hierarchy, is essential for reasoning in intelligent systems. Shruti, and hence shruti-cm5, incorporates both a rule-base and a type hierarchy, leading to a more powerful system than any described in this section.

set current resolvent to initial query
while current resolvent is nonempty {
    select any goal G from the resolvent
    if there exists a clause whose head unifies with G {
        /* resolution step */
        select a clause of the program whose head unifies with G
        replace G by the body of this clause in the resolvent
        apply substitution resulting from unification to the resolvent
    } else {
        /* backtracking step */
        if there exists a previous resolvent with untried alternatives
            restore this resolvent
        else
            exit with failure
    }
}
if resolvent is empty then success

Figure 4.1: Algorithm for executing a logic program. Adapted from (de Kergommeaux, 1994).

Secondly, with the exception of snap, all the other systems described above are implemented on the CM-2. The CM-2 is an SIMD architecture in which all the processors are required to execute the same instruction in lock-step. For marker passing and spreading activation models, this SIMD operation can result in inefficient use of processing resources. The markers or activation originate at a small number of nodes in the network and then propagate through the network. Typically, when the network has a large number of nodes, activation will reach only a small fraction of these nodes. If each node is assigned to one processor on the CM-2, then a significant fraction of the processors will be idle during any reasoning episode. This problem can be alleviated by assigning several nodes to a processor, but it is still difficult to balance the load on all the processors. The biggest disadvantage of an SIMD architecture like the CM-2 is that processors which have no data to process during the current instruction must remain idle, even if they have other operations to execute. We have found this to be a major impediment in developing shruti-cm2 (Appendix D). This problem can be overcome to a large extent, and load balancing significantly improved, by using asynchronous MIMD or SPMD operation. As we shall see, shruti-cm5 is an asynchronous message passing system and has execution times significantly better than shruti-cm2.

4.2 Parallel Logic Programming

Shruti can reason using rules and facts. So can logic programming systems. There has been extensive research aimed at parallelizing logic programs, especially since they can be executed in serial or in parallel without changing their declarative semantics. In this section we summarize some of the parallel logic programming paradigms and systems. Surveys of parallel logic programming systems can be found in (de Kergommeaux and Codognet, 1994) and (Kurfess, 1991).

4.2.1 Logic Programs

A logic program consists of a set of Horn clauses of the form A : ? B1 ; : : :; Bn

and 31

A

and a query

: ? Q1; : : :; Qm : The Ai , Bi and Qi are atomic formulas consisting of predicates and their arguments. Not only can logic programs be given declarative, or logical, semantics (Lloyd, 1987) but they can also be viewed procedurally. All clauses with the same head predicate A can be considered as a de nition of the procedure A. Each goal Bi in the body of a clause is interpreted as a procedure call, with parameter passing e ected by uni cation of predicate arguments. A logic program can also be interpreted as a dynamic set of processes that communicate via messages (Shapiro, 1987; Conery, 1987). Processes are created to solve goals, and computation is therefore an and/or tree of processes. The execution of a logic program begins by taking the query as the initial resolvent2 and then proceeds using the algorithm shown in Figure 4.1.

4.2.2 Parallelism in Logic Programs

Based on the execution model of Figure 4.1, there are two sources of parallelism in logic programs:

1. Instead of selecting a single goal G in the resolvent on which to perform resolution, several goals can be considered simultaneously. All of these goals must be satisfied for resolution to succeed; this leads to what is called and-parallelism.

2. During the resolution step, several unifying clauses can be explored in parallel. Resolution succeeds if any one of these alternatives is satisfied; this leads to or-parallelism.

It is also possible to combine and- and or-parallelism to obtain a range of parallel execution models.

Or-parallelism results in the parallel development of several resolvents which would otherwise have been tried out by backtracking in serial execution. Such parallelism is exploited in the language interpreter (or compiler) and is transparent to the programmer. In implementing or-parallelism, the system must be able to manage the parallel resolvents independently and without crosstalk between the or branches.

And-parallelism is much more difficult to implement, owing to the difficulty of obtaining coherent bindings for variables shared by several goals executed in parallel. One way around this problem, termed independent and-parallelism, is to parallelize only those goals which have no common variables. In contrast, dependent and-parallelism allows synchronized parallel execution of goals with common variables so as to ensure conflict-free variable binding.

4.2.3 Parallel Logic Programming Languages

Most parallel logic programming languages are derivatives and variants of Prolog (Clocksin and Mellish, 1981; Sterling and Shapiro, 1986) and Concurrent Prolog (Shapiro, 1987). Dialects of Prolog include IC-Prolog (Clark and McCabe, 1981), Prolog-II (Colmerauer, 1986) and MU-Prolog (Naish, 1984). Concurrent logic programming languages include Concurrent Prolog (Shapiro, 1987), Parlog (Clark and Gregory, 1987), Flat Concurrent Prolog or FCP (Shapiro, 1989) and Guarded Horn Clauses or GHC (Ueda, 1987). Several techniques are used in these languages in order to effectively support parallelism:

- Evaluation of some goals can be explicitly delayed in dialects of Prolog. Such mechanisms are also provided in concurrent logic programming languages by declaring producer-consumer relationships between goals. This allows synchronization and coroutining.

[2] The set of goals that remain to be executed at any point in the computation is defined to be the current resolvent.

- Dependent and-parallel languages like Concurrent Prolog, Parlog and GHC use the concept of committed choice parallelism to ensure that only a single binding is generated for each shared variable. Committed choice is generally achieved by the use of a guard (Dijkstra, 1975) between the head and body of a clause. Committed choice languages are deterministic and never backtrack: one of the satisfiable guards is chosen and the computation is committed to the corresponding clause, neglecting alternative clauses and forsaking inherent search capability. [3]

Pointers to a large number of parallel logic programming systems implemented on parallel hardware can be found in (de Kergommeaux and Codognet, 1994). Most of these existing systems are implemented on shared-memory architectures.

4.2.4 Logic Programming and Shruti

Logic programming languages, as the name implies, are full-fledged programming languages and offer the entire range of features for general purpose programming. These languages therefore carry a lot more baggage than is needed for knowledge representation and reasoning. Though this results in greater flexibility, there is also a performance penalty. Shruti, on the other hand, is tailored explicitly for efficient knowledge representation. More importantly, logic programming languages are more expressive than shruti. Shruti constrains the class of reasoning it supports to ensure tractability and efficient parallel implementation. In contrast, it is much more difficult to efficiently implement logic programming languages. Though the expressive power of shruti is limited in comparison with logic programming languages, it is still capable of supporting a relatively rich class of computations. Therefore, shruti can be used to efficiently solve non-trivial knowledge representation and reasoning problems. For example, (Henderson, 1993) uses shruti in parsing natural language.

It is interesting to note that shruti (Section 2.3) supports both or-parallelism and a limited form of and-parallelism (in the logic programming sense). Furthermore, the constraint that repeated variables in the antecedent occur in the consequent and get bound during any episode of reasoning is exactly what is needed to circumvent the shared variable binding problem in dependent and-parallelism. In and-parallel logic programming, consistent bindings for shared variables are obtained using various techniques like guards and committed choice parallelism; in shruti, the query itself ensures consistent bindings for shared variables. This constraint therefore greatly improves the efficiency of shruti implementations.

The discussion above makes it clear that in designing massively parallel knowledge representation systems, we can obtain significant leverage towards attaining real-time performance by using shruti as our knowledge representation framework, as opposed to logic programming paradigms. This can be attributed to shruti's well-tailored, limited reasoning capability.

4.3 Parallel Database Systems

Database systems essentially provide a mechanism for storing and retrieving facts; they generally do not provide reasoning capabilities. [4] Parallel database systems exploit parallelism in order to develop high-performance and high-availability database servers. Early efforts in building special purpose database machines (Hsiao, 1983) were abandoned in favor of using recent multiprocessor architectures. Research issues (Valduriez, 1993a; Mohan et al., 1994; America, 1991) in parallel database systems include:

- Architectural Considerations. The most commonly considered parallel architectures for database systems are shared-memory, shared-disk and shared-nothing. There are database systems

[3] Search can be explicitly programmed using guards.
[4] Deductive database systems like Datalog are similar to logic programming systems.


which have been implemented on each of these architectures. Though a large fraction of recent parallel machines conform to the shared-nothing architecture, there is still no consensus on what is best for database systems. A comparative evaluation of the various architectures can be found in (Valduriez, 1993b).

- Data Placement. Data placement and processor allocation are crucial to proper load balancing. The database relations must be distributed over all the processors of the machine; this is generally achieved by declustering, whereby each relation is partitioned across several processors based on some hash function (see the sketch following this list). Skewed data distributions are a problem, and special techniques are needed to maintain load balancing. In addressing these problems, several data partitioning and load balancing techniques have been developed. Some examples and further pointers can be found in (Dewan et al., 1994; Zhang et al., 1995). Since the set of facts in a shruti knowledge base is essentially a database, these data partitioning and load balancing techniques can be exploited when mapping shruti onto parallel machines. We expect research in these areas to provide guidance in mapping facts onto the parallel machine, handling skewed fact distributions (where some predicates in the knowledge base have an unusually large number of facts associated with them), and effective use of secondary storage.

- Parallel Query Processing. Efficient and automatic parallelization of database queries and operations is also being explored. This aspect of a parallel database system is closely linked to data placement and load balancing. Issues of static versus dynamic parallelization and load balancing need to be considered. The design of database languages which can be efficiently parallelized is another avenue of research. Parallelization of the join operation, one of the most expensive database operations, especially for large relations, has been studied extensively (Dewitt and Gerber, 1985; Shasha and Wang, 1991; Valduriez and Gardarin, 1984).

Though database systems per se cannot be used as effective knowledge representation systems, database research, especially that related to data placement and load balancing, can benefit knowledge base design endeavors like the one we undertake in this thesis.
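The declustering idea mentioned under Data Placement above carries over directly to shruti facts. The C sketch below partitions the facts of a predicate across processors by hashing on the predicate and its argument bindings, so that a predicate with an unusually large number of facts does not overload a single processor; the names and the particular hash function are illustrative assumptions.

    unsigned fact_hash(int pred_id, const int *args, int num_args)
    {
        unsigned h = (unsigned)pred_id;
        for (int i = 0; i < num_args; i++)
            h = h * 31u + (unsigned)args[i];
        return h;
    }

    /* Processor that stores (and matches against) this fact. */
    int fact_home(int pred_id, const int *args, int num_args, int nprocs)
    {
        return (int)(fact_hash(pred_id, args, num_args) % (unsigned)nprocs);
    }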

4.4 Parallel Knowledge Representation and Reasoning

This section summarizes knowledge representation and reasoning systems which are inherently parallel but have not been implemented on parallel machines.

4.4.1 Massively Parallel Semantic Networks

NETL: A Parallel Marker Propagation System

One of the early attempts at using massive parallelism to make knowledge representation more effective is described in (Fahlman, 1979). Netl, the system developed there, is a parallel marker propagation scheme which uses simple processing units to represent knowledge. The units constituting the parallel network are either nodes, which can store a small number of markers, or links, which propagate markers from one node to another. Marker propagation proceeds under the strict control of an external serial network controller. The propagation and interaction of markers can be effectively used to search the knowledge base and to handle inheritance and classification. Fahlman (1979) also outlines how the processing elements can be realized in hardware. Though netl was never implemented in hardware, it inspired the design of the Connection Machine (Hillis, 1985).

CSN: A Connectionist Semantic Memory

Shastri (1989) describes a connectionist semantic memory based on an evidential formalization. The system includes mechanisms to deal with exceptions, conflicting information and multiple inheritance. Using evidential information, the semantic network finds the most likely answers to inheritance and recognition queries. A connectionist implementation of the system which uses a distributed control mechanism (without a central controller) is presented in (Shastri, 1988). The connectionist semantic memory solves a class of inheritance and classification problems in time proportional to the depth of the network. The time taken is independent of the size of the knowledge base, in spite of the fact that the system uses an evidential formulation that handles exceptions and multiple inheritance by combining information from all the relevant sources. CSN identifies specific constraints on the partial ordering of concepts (induced by the is-a hierarchy) that guarantee efficient computation of inheritance and recognition queries.

4.4.2 Structured Connectionist Reasoning Systems

Several structured connectionist reasoning systems have been proposed in the recent past, shruti being one of them. Each of these systems has been developed in a different context, with the intention of achieving different goals and objectives. This section briefly summarizes a representative sample.

CONPOSIT

Conposit, a system introduced by Barnden and Srinivas (1991), uses relative position encoding and pattern-similarity association to develop a reasoning system. An interpreter reads a scratch pad of registers, updates them and propagates argument bindings. The use of an interpreter restricts parallelism and allows only one rule to fire at a time. Conposit can handle a relatively complex class of rules and seems to be more suited for reflective reasoning than for reflexive reasoning.

CONSYDERR

Consyderr is another connectionist rule-based reasoning system which uses a two-level architecture (Sun, 1991). One level uses a distributed representation while the other uses a localist representation. The two levels interact to provide a robust system that can handle partial, fuzzy and uncertain information.

ROBIN

The system most similar to shruti is robin [5] (Lange and Dyer, 1989). Robin addresses the problem of ambiguity resolution in natural language understanding and uses signatures to represent dynamic variable bindings. Each constant in the domain under consideration is assigned a unique numerical signature. Argument-filler bindings are represented by propagating the signature of the appropriate constant to the argument to which it is bound. Like shruti, robin represents rules and facts using a network of nodes and links and can fire rules in parallel. Activation levels of the various concepts associated with an input word indicate which of the multiple meanings of that word are plausible in the current context. Dynamic inhibitory mechanisms ensure that the most probable meaning of a word (in the current context) has the maximum activation level.

4.4.3 Other Architectures

A structured connectionist natural language interpretation system has been proposed by Waltz and Pollack (1985). For each input sentence, an activation network which takes into account recent context and long-term knowledge is created. Spreading activation and lateral inhibition are used to implement parallel relaxation. The final steady state into which the network settles offers a


single interpretation for a possibly ambiguous input sentence. The model illustrates and explains the influence of context on meaning, concept priming, garden-path sentences, etc.

Another model for word sense disambiguation and sentence comprehension is described by Cottrell (1989). A modular connectionist system organized into three layers and four components (lexical, word-sense, syntax and case) is developed. Syntactic and semantic knowledge is encoded in these components, and the system operates using spreading activation. The interplay of excitation and inhibition results in the formation of stable, connected coalitions in the network which specify the correct interpretation of the input.

[5] A simulator for robin has been developed on the CM-2 (Lange, 1990). That work, however, does not focus on the design and implementation of efficient knowledge representation systems.

4.5 Special Purpose Hardware

In this section we briefly discuss special purpose hardware systems arising from attempts to develop efficient implementations of connectionist networks. General characteristics of connectionist networks and/or the special capabilities of one or more processors are exploited to make implementations faster, larger and cost effective.

4.5.1 Digital Signal Processor Based Architectures

Digital signal processors (DSPs) are special purpose chips tailored for signal processing applications. These processors are optimized for excellent multiply-and-add performance. Most of the computational resource requirements for neural network simulation, especially for back-propagation in multilayer feed-forward networks, can be attributed to multiply-and-add operations. As such, using several digital signal processors in parallel can provide very good performance.

The Ring Array Processor (rap) was envisioned as a hardware system for supporting continuous speech recognition algorithms based on artificial neural networks (Beck, 1990). This parallel computation system uses several digital signal processors along with high bandwidth interprocessor communication to achieve the required performance for the algorithms of interest. The rap system is accessed through a front-end processor. Each rap node consists of a digital signal processor, local memory, communication links and other control logic. The rap array, which can contain up to 64 nodes, is connected in a ring via a synchronous, unidirectional communication channel. An object-oriented rap interface and command interpreter (Bilmes and Kohn, 1991) are provided for programming the rap array.

Other digital signal processor based architectures include topsi, which uses a DSP chip and a general purpose microprocessor for each of its processor elements (Frazier, 1990), and planns (Planar Lattice Architecture for Neural Network Simulations), which uses a DSP array (Fujimoto and Fukuda, 1989).

4.5.2 Other Architectures

SPERT: Synthetic Perceptron Testbed

Spert is a single chip microprocessor designed for the efficient execution of neural network algorithms (Asanovic et al., 1992). Compared to a rap processor of similar performance, spert represents almost an order of magnitude reduction in cost. Spert achieves this cost/performance improvement through the use of reduced precision fixed point arithmetic, wide data paths, high operand and instruction bandwidth, and a highly parallel architecture which includes an on-chip SIMD array. Spert is intended to function as an attached processor for a conventional host, and is the first step towards cns-1, a full blown connectionist multiprocessor machine (Asanovic et al., 1993).

CNS-1: The Connectionist Network Supercomputer

The cns-1, as the name implies, is a connectionist network supercomputer (Asanovic et al., 1993) being developed at ICSI, Berkeley, with the objective of providing a platform for fast and efficient simulation of very large, sparsely connected connectionist networks. The cns-1 is a multiprocessor system designed for moderate precision integer arithmetic. Nodes, consisting of a processor, memory and control, are connected in a high bandwidth cylindrical mesh topology to establish a MIMD array with up to 1024 processors. The processors use wide data and instruction paths combined with an on-chip SIMD array operating at high clock speeds. Communication throughput is maximized by using on-chip routers and bidirectional links between neighboring processors. The cns-1 will operate as an attached co-processor to a front-end host. The design philosophy behind this machine is to keep all components simple, fast and efficient; to provide a high bandwidth, low latency communication network; and to exploit the fact that connectionist network simulations almost never need high precision floating point operations.

CNAPS: Connected Network of Adaptive Processors

The cnaps neurocomputer, manufactured by Adaptive Solutions, is a one-dimensional SIMD array of processors, controlled by an external sequencer. Interprocessor communication is achieved by means of a broadcast scheme. The processors are equipped with multiplier and adder units which can use adjustable precision for weights and activation values. The architecture also supports efficient representation of sparse matrices in the local processor memories. Both back-propagation networks (McCartor, 1990) and self-organizing maps (Hammerstrom and Nguyen, 1991) have been implemented on this machine. This architecture has also been used for speech and image processing.

REMAP3: A Reconfigurable Massively Parallel Computer

The remap3 is a massively parallel, modular architecture that allows the modules and module interconnections to be adjusted for a specific application. A multi-module system is supposedly well suited for the implementation of artificial neural networks. A small prototype of this system has been built (Nordstrom and Svensson, 1992). Several of the common neural net models, including back-propagation, Hopfield networks, self-organizing maps, etc., have been mapped onto this architecture.

A very large number of other VLSI implementations for artificial neural network simulations have been proposed. A representative sample can be found in (Przytula and Prasanna, 1993; Hammerstrom, 1994).

4.6 Mapping Connectionist Systems onto Parallel Architectures

Most of the past work on mapping connectionist systems onto massively parallel architectures has focused on multilayer feed-forward networks and back-propagation, as is evident from the large body of literature in this area. In the presentation below, related work is classified by the parallel architecture used in the implementation. We do not attempt to provide an exhaustive summary of parallel architectures.

4.6.1 General Considerations

A general introduction to parallel architectures, with emphasis on their application to neural network simulation, is described in (Nelson et al., 1989; Nordstrom and Svensson, 1992). Apart from an introduction to parallel computers, memory organization in parallel computers, connection topology, communication and overall machine performance, (Nelson et al., 1989) also consider issues like problem decomposition, load balancing, communication overhead and synchronization from the standpoint of neural network simulation. The paper also describes quantitative estimates for

these parameters, tradeoffs and optimization. The methods described are illustrated by applying them to actual simulations. While Nelson et al. (1989) are concerned with modeling the behavior of biological neurons, Nordstrom and Svensson (1992) survey various aspects of simulating artificial neural networks (ANNs) on massively parallel architectures. After summarizing ANN architectures and learning algorithms, (Nordstrom and Svensson, 1992) go on to consider computation and communication requirements for ANN simulation, types of parallelism in ANN computations, and performance measurements. The authors then take stock of how feed-forward and recurrent networks with several popular learning algorithms have been mapped onto various general purpose and special purpose massively parallel machines. Structured connectionist networks (see Section 4.4) are not considered; the paper is mostly limited to multilayer networks with back-propagation learning (Rumelhart et al., 1986).

4.6.2 One-Dimensional SIMD Architectures

One-dimensional SIMD architectures usually consist of a linear array of processors. Communication is accomplished by means of a ring or by using a bus. Such architectures have a small number of processors and limited communication facilities.

Neural network simulation on shared-memory vector multiprocessors is described in (Wang et al., 1989). Vectorized algorithms for the simulation of feed-forward networks, recurrent networks and Hopfield networks are presented. Simulations were run on the Alliant FX/80. Wilson (1989) describes how a one-dimensional SIMD array can handle large, densely connected networks in near real-time by efficiently organizing the memory storage of weights, inputs and outputs.

4.6.3 Two-Dimensional SIMD Architectures

A wide variety of two-dimensional SIMD machines are available, and this mesh-connected architecture has been one of the most popular for connectionist network simulation. In addition to nearest neighbor communication, some machines also provide general communication capabilities enabling any processor to communicate with any other processor. Nearest neighbor communication is usually more efficient.

An algorithmic approach to mapping sparsely connected neural network models onto parallel mesh-connected SIMD machines is described in (Lin et al., 1991). Network simulation is performed using the following basic operations: send (communication), redistribute (local broadcast), reshuffle (permutation), collect (accumulation) and send-back (communication). Analysis shows that the simulation execution time complexity is O((N/P) log_K N), where N = n + e, n is the number of nodes in the network, e is the number of links, P x P is the size of the processor array, and each processor has local memory of size O(K). In spite of being tailored for sparse networks, the algorithms are unsuitable for the simulation of structured connectionist networks since: (i) an SIMD machine with mesh interconnection is required; (ii) given a parallel machine with a fixed number of processors, this mapping technique is inefficient for very large networks.

Connection Machine (CM-2)

The Connection Machine (Hillis, 1985) model CM-2 has been one of the most popular machines for connectionist network simulation. The CM-2, which can support up to 64K simple processors, provides both general interprocessor communication and efficient nearest neighbor communication.

An implementation of back-propagation on the Connection Machine is described in (Zhang et al., 1990). The network is partitioned into "columns" (input, hidden and output units along with the links impinging on these nodes); multiply-accumulate-rotate algorithms based on nearest neighbor communication are used for the forward and backward passes. The approach of (Zhang et al., 1990) has been applied to much larger data sets by (Diegert, 1990). The training data sets are stored in the DataVault, a parallel array of disk drives (TMC, 1991a), and accessed as and when required. A mapping of feed-forward networks onto the CM-2 in which each node is mapped onto a processor and each link is mapped onto two processors, one processor at each end of the link, is reported in (Blelloch and Rosenberg, 1987). A survey of several implementations of back-propagation on the CM-2 can be found in (Singer, 1990). In addition to considering the advantages and disadvantages of the various approaches, (Singer, 1990) also compares simulation performance. Implementations of recurrent back-propagation on the CM-2 are described in (Deprit, 1989) and (Fontaine, 1992). Obermayer et al. (1990) have implemented Kohonen's self-organizing map (Kohonen, 1990) on the CM-2. Kanerva's sparse distributed memory (Kanerva, 1988) implementations on the CM-2 have been explored by Rogers (1988).

Distributed Array Processor (DAP)

The DAP is a 2d SIMD architecture with anywhere from 1024 to 4096 simple bit-serial processors, each with 4 to 16 kilobits of local memory. Apart from nearest neighbor connections, row and column "highways" are provided for faster communication over longer distances.

Forrest et al. (1987) have used the DAP to implement a variety of neural network algorithms, especially applications of Hopfield networks. The Hopfield-Tank method for solving the traveling salesman problem has been simulated on the DAP using a parallel sparse matrix algorithm (Gupta et al., 1990). A mapping of back-propagation and mean field theory learning algorithms onto generic two-dimensional SIMD machines is described in (Nuñez and Fortes, 1990). The efficacy of the mapping is tested using the DAP.

Other Architectures

Some of the other 2d SIMD architectures that have been used in the simulation of connectionist networks are the following. The MasPar MP-1 is a SIMD machine with both grid and global communication capabilities, having up to 16K processors, each with about 1K bits of register memory and 16K bytes of general storage. The floating-point support and other architectural features provided by this machine have been exploited by (Chinn et al., 1990) and (Grajski et al., 1990) to implement back-propagation and self-organizing maps.

The AAP-2 is a massively parallel cellular array processor with 64K bit-serial processors. The AAP-2 provides fast nearest neighbor communication, by means of shift operations and ripple-through data transfers, as well as general communication capabilities. Back-propagation has been implemented on the AAP-2 by mapping each link onto a processor (Watanabe et al., 1989) and using a table lookup technique to efficiently evaluate the node transfer function.

Neural network simulation on the MPP, Goddard Space Flight Center's Massively Parallel Processor, is discussed in (Hastings and Waner, 1987) and (Murali et al., 1990). The MPP is a mesh connected SIMD machine with 16K processors organized as an array of size 128 × 128. Each processor has a small local memory of 1024 bits. Due to the limited amount of memory available, large networks cannot be easily simulated on the MPP. One way to partially circumvent this problem is sketched in (Murali et al., 1990).

4.6.4 Systolic Arrays

The systolic array, introduced by Kung and associates at Carnegie-Mellon (Kung, 1982), consists of a set of interconnected cells or processors, each capable of performing some simple operation. The cells can communicate in a simple, regular and synchronous manner, resulting in pipeline-like operation. The array size and organization are dependent on the problem at hand. The systolic array has been proposed as a general and flexible parallel architecture for the simulation of artificial neural networks (Kung and Hwang, 1988; Kung and Hwang, 1989). The authors express neural processing in terms of (recursive or iterative) matrix operations, convert the matrix operations into dependence graphs, and finally map these dependence graphs onto systolic arrays.

The CMU Warp processor is a programmable multiprocessor systolic array. Pomerleau et al. (1988) describe two approaches for running back-propagation on the Warp: a network partitioning scheme, where each processor on the Warp is in charge of simulating a portion of the back-propagation network; and a data partitioning scheme, where each cell simulates the entire network on a different set of training patterns.

The GAPP (Geometric Arithmetic Parallel Processor) is a mesh-connected systolic array with very simple processing elements. Feed-forward networks with back-propagation have been implemented on the GAPP by (Brown et al., 1988) and (Baresh and Eshera, 1989). Since the GAPP does not provide any floating-point support, (Brown et al., 1988) use fixed-point numbers for activation values and weights; the transfer function is approximated using a step-wise linear function. Baresh and Eshera (1989) use communication by circulation to obtain good performance.

The Hughes Systolic/Cellular Architecture is a 16 × 16 SIMD processor array. The processors are fairly complex, with a small local memory, and are mesh connected to form a torus. This Systolic/Cellular co-processor has been used by (Shams and Przytula, 1990) to implement back-propagation. Each input pattern is mapped onto a row of processors, so that each row simulates the complete network for one input pattern, thereby exploiting both network and training example parallelism.

4.6.5 Dataflow Architectures

Dataflow architectures can be used as flexible neural network simulation systems (Smotroff, 1990), especially if they can be optimized for this purpose. Dataflow architectures (Arvind and Culler, 1986) are general computation engines where operations are executed in an asynchronous, data-driven fashion. Dataflow architectures expose inherent parallelism without any extra effort on the part of the programmer. Furthermore, there is a natural mapping between neural networks and dataflow graphs. Unfortunately, dataflow machines are not widely available, let alone customized for neural network simulation. Simulation results indicating how dataflow architectures can maximize inherent parallelism are described in (Smotroff, 1990).

4.6.6 Multiprocessor Systems

Multiprocessor systems could be either full-fledged MIMD machines or a more restricted architecture constituted by several processors running the same program. The latter architectures are sometimes termed SCMD (Same Code Multiple Data) or SPMD (Single Program Multiple Data) machines (Nordstrom and Svensson, 1992). Multiprocessor systems are generally characterized by an architecture with multiple (a few to a few thousand) relatively complex and independent processors linked together by a communication network. Processors operate asynchronously and are usually controlled by a host machine.

A formal study of mapping multilayer networks and learning rules onto message-passing multiprocessor machines is the subject of (Chu and Wah, 1992). Formal mathematical models for both artificial neural networks and multicomputer systems are proposed. The neural network model is then mapped onto the multicomputer model using optimization techniques to minimize training time. James and Hoang (1992) analyze several software and hardware strategies for low cost simulation of large neural networks in real-time. The authors conclude that a MIMD system is required for efficient and flexible simulation of neural network architectures. They propose a multicomputer design for efficient simulation of completely connected as well as sparsely connected networks. In (Männer et al., 1989), a multiprocessor simulation system called nerv is presented. The hardware

architecture of nerv is derived by analyzing the computational requirements for simulating neural networks.

The partitioning and mapping of production systems (or rule-based systems) onto multiprocessors is the topic of (Moldovan, 1989). A performance index, obtained by analyzing rule interdependencies, is optimized so as to maximize inherent parallelism and minimize interprocessor communication. A message-passing multiprocessor architecture (rubic, for RUle-Based Inference Computer) for the parallel execution of production systems is also described.

Several neural network simulations have been performed on Transputer-based multiprocessor systems. The Transputer is a single-chip microprocessor with hardware support for concurrent processing, including on-chip RAM and communication links. Several hundred to a few thousand of these microprocessors can be linked in a variety of topologies to build highly parallel machines. Researchers have used Transputer-based machines to implement back-propagation (Kung and Hwang, 1988), Hopfield-Tank and elastic net models (Forrest et al., 1987) and self-organizing maps (Siemon and Ultsch, 1990). The implementations are effective if coarse-grained parallelism is used. Flexible artificial neural network programming and simulation environments with Transputers as one of the target architectures are described in (Koikkalainen and Oja, 1988; Treleaven, 1991). Koikkalainen (Koikkalainen and Oja, 1988; Koikkalainen, 1991) views neural network models as a general class of communicating sequential processes (Hoare, 1978) which work in parallel and communicate through channels. Such a system is implemented on a network of Transputers using the Occam programming language. A C/C++ based programming environment, along with an X Window graphics interface, is described in (Treleaven, 1991).

Simulation of back-propagation on message-passing multiprocessors is discussed in (Yoon et al., 1990). Multilayered networks are vertically partitioned and each partition is assigned to a processor. All-to-all broadcasting, where every processor broadcasts to every other processor, is used to maintain the consistency of weights on different processors.

4.6.7 Discussion

Most of the approaches described above either concern themselves with layered, feed-forward networks or result in inefficient mappings for the simulation of very large structured connectionist networks like shruti. Thus, an efficient and effective mapping of structured connectionist networks onto massively parallel architectures has not been achieved in the past.

4.7 Performance Measurement and Testing

Once we have devised a scheme to map a connectionist architecture onto some other massively parallel architecture, we will need techniques to evaluate the mapping scheme. In this section, we review several criteria that have been suggested for carrying out such an evaluation.

4.7.1 Performance Measures

Defining, estimating and optimizing performance measures is essential for an effective and efficient mapping of connectionist networks onto parallel machines. The performance measure may take into account several factors, such as time and space requirements, communication cost, response time under specified conditions, efficiency of resource utilization, etc. Some of the more commonly used performance measures are enumerated below:

- Operations Per Second. The number of integer or floating-point operations performed per unit time in the course of the simulation, compared to the maximum capability of the machine, indicates how well the simulation utilizes the resources provided by the machine. Connectionist simulations usually consider multiply and/or add operations.

- Connections Per Second. This measure is proportional to the number of operations per second. In most of the popular connectionist models, especially back-propagation networks, computing each connection requires a multiply operation, after which all the input values are summed using addition operations. Thus, comparing the number of connections per second with the maximum multiply-and-add capability of the machine provides a rough estimate of how well the specific mapping suits the architecture. [6]

- Connection Updates Per Second. While the number of connections per second measures how fast a system can compute the output given some inputs, the number of connection updates per second also takes into account the training or learning performance. This measure is generally applied to back-propagation networks, in which case it provides a combined performance index for both the forward and backward passes. In some cases, epochs per second (an epoch being a single presentation of all the patterns in the training set) is used as an alternative to indicate the speed of learning.

- Synaptic Processing Rate. The synaptic processing rate (SPR), defined as the number of connections per second per weight, is an indication of the balance between processing power and network size. Biological neurons fire at approximately 100 Hz, which implies that synapses process input at a rate of about 100 per second, an SPR of about 100. Systems with SPR << 100 are computationally underbalanced, while systems with SPR >> 100 have too much processing power per weight. (A small worked example follows this list.)

Optimizing performance with respect to communication overhead and load balancing is discussed in (Nelson et al., 1989).
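To make these definitions concrete, the following C fragment computes connections per second and the synaptic processing rate for an invented network; only the definitions, not the figures, come from the discussion above.

    #include <stdio.h>

    int main(void)
    {
        double connections   = 1.0e6;   /* connections computed per pass */
        double passes_per_sec = 50.0;   /* forward passes per second */
        double weights       = 1.0e6;   /* total weights in the network */

        double cps = connections * passes_per_sec;  /* connections/second */
        double spr = cps / weights;                 /* connections/s/weight */

        printf("CPS = %.2e\n", cps);   /* 5.00e+07 */
        printf("SPR = %.2e\n", spr);   /* 5.00e+01: underbalanced (< 100) */
        return 0;
    }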

4.7.2 Time and Space Complexity

Another way to evaluate a mapping is to compare the time and/or space complexity of the simulation with the time and/or space complexity of the ideal connectionist system. Lin et al. (1991) use the simulation execution time complexity to evaluate their mapping scheme. In (Evett and Hendler, 1992), the Connection Machine implementation of parka is evaluated by comparing its performance with that of the ideal connectionist system. Any mapping can be considered satisfactory if the resource requirements of the actual implementation and the ideal system are similar. The ease or difficulty of attaining this goal depends on the system being modeled, the mapping that is being used, and what we set out to achieve. For example, the Connection Machine implementation of parka performs as well as the ideal system (Evett and Hendler, 1992) as long as the number of nodes in the network is less than the number of processors on the Connection Machine. When the network is significantly larger, it is much more difficult to obtain a level of performance comparable to the ideal system.

Using any analytical complexity measure for evaluating the performance of a system will only yield a gross, order-of-magnitude estimate. It is usually extremely difficult to formally determine average time or space complexity. Most often, only upper (and sometimes lower) bounds for the complexity can be established. An alternative to formal complexity analysis is to experimentally determine the time and space requirements as functions of some input parameter. For complex simulation systems, the empirical approach may be the only way to go.

4.7.3 Formal Analysis

In this approach, the mapping scheme is formally analyzed using some mathematical technique. The mapping can then be optimized using some "goodness of fit" parameter. One such approach is used in (Moldovan, 1989), where the mapping of production systems onto multiprocessors is analyzed

6 Other connectionist architectures may use operations other than multiply and/or add to compute a connection. In any case, we consider only those operations that are relevant to the model at hand.


and optimized using a parameter computed by analyzing compile time rule interdependencies. Optimizing this parameter maximizes parallelism and minimizes interprocessor communication. Chu and Wah (1992) also outline a formal approach to mapping neural networks onto parallel message-passing multiprocessors. The neural network model is mapped onto the multicomputer model using optimization techniques to minimize training time. If the system under consideration is complex, then the analysis, the optimization, or both could become intractable and very difficult to compute. The optimization is intractable in (Moldovan, 1989), as it is in (Chu and Wah, 1992); the latter authors make simplifying assumptions in an attempt to make optimization tractable. Manner et al. (1989) begin by analyzing the computational requirements for simulating neural networks and build a system to meet these requirements.

4.7.4 Benchmarks

Simulating one or more benchmark problems or networks is another way to compare and evaluate mapping schemes. The benchmarks could be standardized networks, networks tailored to estimate some specific performance characteristic, or networks arising from hard-to-solve, popular or "classic" problems. An extreme form of this testing mechanism is to use random or semi-random networks. Almost all the testing in parka (Evett et al., 1990) is done using automatically generated semi-random networks. When using random networks, very large test networks can be easily generated. An opposite extreme is to use networks which closely mirror the typical networks that would eventually be simulated on the system. When testing systems which claim to model common-sense reflexive reasoning, possible candidates are large, realistic networks which are either limited to a few domains (like expert-system knowledge bases) (Doorenbos et al., 1992) or which span several domains (like knowledge bases for common-sense reasoning) (Guha and Lenat, 1991). The latter approach is preferable, though it is harder to find large knowledge bases spanning several domains.

4.7.5 Our Approach

In evaluating performance and testing the parallel reflexive reasoning systems we develop, we employ both empirical and formal approaches, each approach complementing the other. The empirical approach involves extensive experimentation with very large random and real-world knowledge bases. In addition, we also analyze the system formally. An accurate formal analysis is intractable; we therefore use several simplifying assumptions to make the analysis tractable. In spite of these approximations, the analysis remains fruitful. We also validate the results of the analysis experimentally.


Chapter 5

SHRUTI on the CM-5: Design and Implementation

This chapter describes the design and implementation of the SPMD asynchronous message passing parallel reflexive reasoning system, shruti-cm5, that has been developed on the Connection Machine CM-5.

5.1 The Connection Machine CM-5

The Connection Machine model CM-5 (TMC, 1991b) is an MIMD machine consisting of anywhere from 32 to 1024 powerful processors.1 Each processing node is a general-purpose computer which can execute instructions autonomously and perform interprocessor communication. Each processor can have up to 32 megabytes of local memory2 and optional vector processing hardware. The processors constitute the leaves of a fat tree interconnection network, where the bandwidth increases as one approaches the root of the tree. Every CM-5 system has one or more control processors which are similar to the processing nodes but are specialized to perform managerial and diagnostic functions. A low-latency control network provides tightly coupled communications including synchronization, broadcasting, global reduction and scan operations. A high bandwidth data network provides loosely coupled interprocessor communication. A standard network interface connects nodes and I/O units to the control and data networks.

The virtual machine emerging from a combination of the hardware and operating system consists of a control processor acting as a partition manager, a set of processing nodes, facilities for interprocessor communication, and a Unix-like programming interface. A typical user task consists of a process running on the partition manager and a process running on each of the processing nodes. Though the basic architecture of the CM-5 supports MIMD style programming, operating system and other software constraints restrict users to SPMD (Single Program Multiple Data) style programs (TMC, 1994). In SPMD operation, a single program runs on all the processors, each acting on its share of data items. Both data parallel (SIMD) and message-passing programming on the CM-5 use the SPMD model. If the user program takes a primarily global view of the system, with a global address space and a single thread of control, and processors run in synchrony, the operation is data parallel; if the program enforces a local, node-level view of the system and processors function asynchronously, the machine is used in a more MIMD fashion. We shall consistently use "SPMD" to be synonymous with the latter mode of operation. In this mode, all communication, synchronization and data layout are under the program's explicit control.

1 In principle, the CM-5 architecture can support up to 16K processors.
2 The amount of local memory is based on 4-Mbit DRAM technology and will increase as DRAM densities increase.
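To fix ideas, the skeleton below shows the node-level SPMD pattern in which a single program diverges only through each node's rank and share of the data. It is a sketch only: my_node() and partition_size() are hypothetical stand-ins for the node-identification queries provided by the CM-5 system software, stubbed here so the fragment is self-contained.

#include <stdio.h>

/* Hypothetical stand-ins for runtime queries; a real CM-5 node program
 * would obtain these values from the system software. */
static int my_node(void)        { return 0;  }   /* this node's rank       */
static int partition_size(void) { return 32; }   /* nodes in the partition */

int main(void) {
    int self   = my_node();
    int nodes  = partition_size();
    int nelems = 1000;                    /* total data items (illustrative) */

    /* Block-partition the data: the same program runs on every node,
     * but each node touches only its own slice. */
    int chunk = (nelems + nodes - 1) / nodes;
    int lo = self * chunk;
    int hi = (lo + chunk > nelems) ? nelems : lo + chunk;

    for (int i = lo; i < hi; i++) {
        /* ... process item i locally; communicate asynchronously ... */
    }
    printf("node %d of %d handled items [%d, %d)\n", self, nodes, lo, hi);
    return 0;
}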


5.2 Design Considerations

We now discuss how some of the issues raised in Chapter 3 are resolved when mapping shruti onto the CM-5. A detailed discussion of processor allocation and knowledge base processing appears in Section 5.3.5. Analysis and experimentation are dealt with in the next two chapters.

5.2.1 Granularity of Mapping

The individual processing elements on the CM-5 are full-fledged SPARC processors. A subnetwork in the connectionist model can therefore be implemented on a processor using appropriate data structures and associated procedures without necessarily mimicking the detailed behavior of individual nodes and links in the subnetwork. As discussed in Section 3.4, we use knowledge-level partitioning in mapping shruti onto the CM-5. We also treat predicates and concepts as "nodes" in the system, and rules and is-a relations as "links".

5.2.2 Representing Synchrony

Shruti-cm5 represents temporal synchrony by using "tags" or "markers": integers with values ranging from 1 to the maximum number of phases. Though temporal synchrony can be simulated on the CM-5 by using repeated processor synchronization, we have opted against this approach since unnecessary processor synchronization can slow down the system (see Section 3.3).

5.2.3 Active Messages and Communication

Shruti-cm5 uses CMMD library functions (TMC, 1993) for broadcasting and synchronization, while almost all interprocessor communication is achieved using CMAML (CM Active Message Library) routines. CMAML provides efficient, low-latency interprocessor communication for short messages (TMC, 1993; von Eicken et al., 1992). Active messages are asynchronous (non-blocking) and have very low communication overhead. A processor can send off an active message and continue processing without having to wait for the message to be delivered to its destination. When the message arrives at the destination, a handler procedure is automatically invoked to process the message. The use of active messages improves communication performance by about an order of magnitude compared with the usual send/receive protocol. The main restriction on such messages is their size: they can only carry 16 bytes of information. However, given the constraint on the number of entities involved in dynamic bindings (about 10), there is an excellent match between the size of an active message and the amount of variable binding information that needs to be communicated between predicate instances during reasoning as specified by shruti. Shruti-cm5 exploits this match to the fullest extent.
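The match between a 16-byte active message and a predicate instantiation can be sketched as follows. The am_send() routine below is a hypothetical stand-in for the CMAML send routines (which similarly deliver a handler plus a small fixed payload); the packing of the payload into a predicate id, a bank number, and two words of phase tags is likewise an illustrative assumption. For demonstration the "send" simply invokes the handler locally to show what would run on the destination processor.

#include <stdio.h>

typedef void (*handler_t)(int, int, int, int);

/* Hypothetical stand-in for an active-message send: a handler plus four
 * 32-bit words (16 bytes). A real send would inject the message into the
 * network and return immediately, without waiting for delivery. */
static void am_send(int dest, handler_t h, int a0, int a1, int a2, int a3) {
    (void)dest;
    h(a0, a1, a2, a3);   /* the handler runs on the destination on arrival */
}

/* Handler invoked on arrival: unpack a predicate instantiation
 * (predicate id, bank, and packed phase tags for the argument bindings). */
static void recv_instantiation(int pred, int bank, int tags_lo, int tags_hi) {
    printf("predicate %d, bank %d, tags 0x%08x%08x\n",
           pred, bank, tags_hi, tags_lo);
}

int main(void) {
    /* Fire a rule: ship the bindings to the antecedent predicate's
     * processor and continue without waiting for delivery. */
    am_send(3, recv_instantiation, 42, 0, 0x00214365, 0);
    return 0;
}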

Interrupts versus Polling

Sending active messages is always done by invoking appropriate CMAML routines. Receiving active messages, on the other hand, can be done in one of two ways: (i) using interrupts, in which case an interrupt is generated upon arrival of an active message, and the interrupt routine invokes the handler function with appropriate arguments; or (ii) by polling, where the user program explicitly checks for, and receives, incoming active messages. Both interrupts and polling have their advantages and disadvantages. In designing shruti-cm5, we have opted to use interrupts for receiving active messages.

Polling is most efficient when the volume of communication between processors is relatively constant and the flow of the program has convenient and sufficiently frequent polling points. In order to avoid saturating the network with pending messages, processors should poll for messages frequently. Excessive polling, on the other hand, can result in unnecessary overhead. The polling frequency should therefore be altered to suit the volume of incoming messages. In a system like shruti-cm5, the number of incoming messages could vary from processor to processor due to variations in processor load; even for a given processor, message traffic could change from reasoning episode to reasoning episode. This makes it difficult to reap the full performance benefits of polling. The use of interrupts for receiving active messages offers flexibility and simplicity. It is transparent to the programmer and there is no chance of saturating the communication network. Compared to optimal frequency polling, however, the use of interrupts is slightly more expensive.
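The trade-off can be summarized schematically. In the sketch below, poll_network(), process_frontier_element(), and the surrounding loop are hypothetical placeholders, not routines of any real library; the point is only where the explicit receive checks would have to go under polling, and that they disappear entirely under interrupts.

#include <stdbool.h>

static bool use_interrupts = true;   /* shruti-cm5's default choice */

static void poll_network(void) { /* drain pending messages, run handlers */ }
static void process_frontier_element(int e) { (void)e; /* fire rules, etc. */ }

static void propagate(int frontier_size) {
    for (int e = 0; e < frontier_size; e++) {
        process_frontier_element(e);
        if (!use_interrupts)
            poll_network();  /* polling: must happen often enough to keep
                                the network drained, yet not so often that
                                the checks themselves dominate */
        /* with interrupts enabled, arriving messages are handled
         * transparently here, at a small fixed cost per message */
    }
}

int main(void) { propagate(8); return 0; }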

5.3 Design Features

Given the basic design choices, we now lay out the functional aspects of shruti-cm5. This is essentially an abstract description of shruti-cm5's capabilities. Actual implementation details are provided in subsequent sections of this chapter.

5.3.1 The Rule-Base

Shruti-cm5 is a mapping of shruti onto the CM-5. Thus, shruti-cm5 supports a rule-base, which can consist of rules and facts. In their most general form, rules are of the form:

\exists w_1{:}W_1, \ldots, w_p{:}W_p \; \forall x_1, \ldots, x_r, y_1{:}Y_1, \ldots, y_s{:}Y_s \; [\, P_1(\ldots) \wedge \cdots \wedge P_n(\ldots) \Rightarrow \exists z_1, \ldots, z_t \; Q(\ldots) \,]

W_1, ..., W_p and Y_1, ..., Y_s are types, and hence represent concepts in the type hierarchy. The arguments of P_i (1 <= i <= n) can be constants or any of the variables except z_1, ..., z_t. The arguments of Q are constants or any of the variables. In addition, the following constraint should hold for rules used in backward reasoning: any constants, repeated variables, or typed variables occurring in the antecedent must also occur in the consequent. Rules satisfying these conditions are said to be balanced (for backward reasoning) (Shastri, 1993). Shruti-cm5 only admits balanced rules. Constants, repeated variables, or typed variables in the antecedent are called special variables. Rules containing them, termed "rules with special variables" or special rules, fire only if their special variables get bound in an episode of reasoning.3 Facts in shruti-cm5 are of the form:

\exists x_1, \ldots, x_r, y_1{:}Y_1, \ldots, y_s{:}Y_s \; \forall z_1{:}Z_1, \ldots, z_t{:}Z_t \; [\, P(\ldots) \,]

The arguments of P are either constants or any of the variables, with the constraint that a given x_i (1 <= i <= r) cannot occur in multiple argument positions. In interpreting facts, arguments of P which are universally quantified typed variables are replaced by their types, and arguments which are existentially quantified typed variables are replaced by a unique subconcept of the type. Any untyped variable appearing as an argument of P must be existentially quantified, i.e., untyped variables must be one of x_1, ..., x_r. Note that all universally quantified variables are within the scope of the existential variables. Details about the interpretation and encoding of facts can be found in (Mani and Shastri, 1993).

5.3.2 The Type Hierarchy

The type hierarchy is specified by using is-a relations. Shruti-cm5 supports two kinds of is-a relations: default is-a relations and labeled is-a relations. Default is-a relations are of the form is-a(A, B), and explicate the fact that A is a subconcept of B.4 A default is-a relation is encoded so as to support the activation propagation pattern described in (Mani and Shastri, 1993):

• Upward propagating activation (which arises when the query has a universally quantified typed variable, see Sections 5.3.4 and 2.3.8) continues to propagate upward to reach superconcepts; and

• Downward activation propagates downward and also sets off an upward trail to reach subconcepts, superconcepts and superconcepts of subconcepts.

3 For constants, "getting bound" would mean becoming active. We will confine all discussion to backward reasoning, especially since shruti-cm5 does not yet support forward reasoning. The conditions for forward reasoning are the dual of the conditions for backward reasoning.
4 Or that A is an instance of B. We do not distinguish between the superconcept-subconcept relation and the instance-of relation.

Labeled is-a relations have the form is-a_R(A, B), where R is a relation. Such relations are interpreted to mean that A is related to B by R, i.e., A →_R B, and are implemented using tags for link labels (see Section 5.3.3). Labeled is-a relations differ from default is-a relations in not allowing both upward and downward activation propagation. Labeled is-a relations support only unidirectional activation propagation: if A becomes active, then activation propagates to B. Nothing happens if B becomes active. Moreover, not all activation can traverse the link from A to B: only activation satisfying certain user-defined conditions can traverse a link of type R. Details can be found in Sections 5.3.3 and 5.5.

5.3.3 Tags: Node and Link Labeling

Shruti-cm5 supports both node and link tags. These are integers in a specified range. When a predicate or concept receives a new instantiation, the tag associated with the incoming instantiation is noted. Before firing any rule or is-a relation based on that instantiation, the system checks whether the node and link tags satisfy user-defined conditions. Activation is propagated only if these conditions are satisfied. Since node and link tags are checked against user-defined conditions, they can be harnessed for various purposes depending on the application at hand. Details on how to define and use these conditions can be found in (Mani, 1995). The default behavior of node and link tags in shruti-cm5 is as follows.

Node Tags: Predicate and Concept Grouping

Predicates and concepts in shruti-cm5 can be partitioned or classified into groups by using node tags. A node tag declaration either assigns a tag p-tag to a predicate P, or assigns a tag c-tag to a concept C. Node tags like p-tag and c-tag are directly specified as numbers. On assigning such tags to predicates and concepts in the system, the user can determine the number of active elements in each partition induced by the tags. We have used this mechanism to group predicates into domains and to monitor the activation propagation in these domains (see Section 7.2). Whenever a predicate or concept has no node tag assigned, a special DEFAULT TAG is used.

Link Tags: Labeled Relations

Both rules and is-a relations can be labeled. Labeled rules have a rule type specified after the antecedent: P1(...) ∧ ... ∧ Pn(...) rule-type ⇒ Q(...). Labeled is-a relations are as described above: is-a_relation(A, B). Labels are specified as mnemonic names, which are internally converted to integers for the link tags. Link tags are used to restrict activation propagation to only those instantiations which arrive with tags matching the link tags. This feature is not of much use in the rule-base, but labeled relations in the type hierarchy are crucial for encoding multiple transitive relations (refer to the implementation of WordNet on shruti-cm5, in Section 7.3). Shruti-cm5 provides three special link tags. A DEFAULT TAG is used whenever no label is needed; this is also the tag used to identify default is-a relations. An ALL TAG matches any tag and is used when activation should be propagated irrespective of the tag associated with incoming activation. A STOP TAG stops further activation propagation, again irrespective of the tag associated with incoming activation.
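This default behavior reduces to a small comparison on every link traversal. The sketch below captures it; the integer encodings and the C identifiers (DEFAULT_TAG, ALL_TAG, STOP_TAG, tag_allows) are illustrative assumptions, not the actual shruti-cm5 definitions.

/* Illustrative encodings of the three special link tags. */
enum { DEFAULT_TAG = 0, ALL_TAG = -1, STOP_TAG = -2 };

/* Default link-tag semantics: may activation carrying in_tag traverse
 * a link labeled link_tag? */
static int tag_allows(int link_tag, int in_tag) {
    if (link_tag == STOP_TAG) return 0;  /* block all incoming activation */
    if (link_tag == ALL_TAG)  return 1;  /* pass any incoming activation  */
    return link_tag == in_tag;           /* otherwise, tags must match    */
}

User-defined tag handling functions would replace this default check when the application calls for other behavior.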

5.3.4 Queries

Queries can be posed both to the rule-base and to the type hierarchy. Rule-based queries have the form

\exists x_1, \ldots, x_r, y_1{:}Y_1, \ldots, y_s{:}Y_s \; \forall z_1{:}Z_1, \ldots, z_t{:}Z_t \; [\, P(\ldots) \,] \; rule\text{-}type?

and result in activating the given query predicate and the entities which function as role-fillers for the predicate. If no rule-type is specified, the default rule label is used; otherwise, the given rule tag is used for the initial instantiation (and then propagated as per the user-defined tag handling functions). Type hierarchy queries take the form is-a_relation(A, B)?. If B is a variable instead of a concept, i.e., an enumeration query,5 A is activated and activation propagated for a fixed number of steps. The active concepts in the system can then be enumerated. If the relation is not specified, the query is assumed to be based on the default is-a relation; otherwise, the initial instantiation is tagged with relation.

Activation in the Type Hierarchy

Entities activated by a rule-based query are tagged to traverse the default is-a relations in the type hierarchy. Universally quantified typed variables are activated to set off upward activation, and existentially quantified typed variables in the query set off both downward and upward activation. Thus, only default is-a relations play a role in answering rule-based queries. A type hierarchy query is-a_relation(A, B)? activates A in some phase and waits for B to become active in the same phase. The activation of A is tagged with relation.

5.3.5 Processor Allocation

Shruti-cm5 supports two major processor allocation schemes: random processor allocation and q-based processor allocation.

Random Processor Allocation

Random processor allocation involves allocating knowledge elements to random processors. Every predicate and concept in the knowledge base is allocated to a random processor. When a predicate P is allocated to a processor p, all rules with P as the consequent, and all facts pertaining to P, are assigned to processor p. Given a rule R of the form Q ⇒ P, activation starts at P and propagates to Q when the rule R fires. In order to use send-based communication in the system (see Section 3.7), a rule is encoded on the processor housing its consequent: when R fires, processor p just needs to assemble the right instantiation based on the activation of P and then send a message to the processor housing Q; once the message is sent, p can continue with its processing independently of other events relating to Q. Facts pertaining to P are also encoded on p to reduce unnecessary communication. In order to match facts, we need to look at the predicate instantiation. If the facts are on processors other than p, messages will be needed in order to convey the current instantiations of P to the other processors; furthermore, to avoid the need for repeated messages, copies of P will need to be stored on all processors containing facts pertaining to P. Thus, distributing the facts for P on processors other than p results in increased communication and space requirements.6

5 As noted in Section 5.5.3, enumeration queries are always run synchronously.
6 When a small number of predicates in the knowledge base have a disproportionately large number of facts, spreading these facts out on different processors will result in better distribution of the knowledge base. We base our design on the assumption that on average the distribution of facts will be reasonably uniform. Also see Section 3.8. In working out solutions to this problem, the large body of literature on handling data skew in database systems should provide helpful pointers.
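A sketch of this scheme under knowledge-level partitioning appears below: predicates are seeded onto random processors, and each rule is then co-located with its consequent predicate (facts would follow the same placement). The structures and names are illustrative only.

#include <stdlib.h>

#define NPROCS 32                           /* illustrative partition size */

typedef struct { int consequent; } rule_t;  /* predicate in the consequent */

/* Random seeding of predicates, with rules co-located with their
 * consequents so that a firing rule needs at most one message, sent
 * to the processor housing the antecedent predicate. */
static void allocate_random(int npreds, int pred_proc[],
                            int nrules, const rule_t rule[], int rule_proc[])
{
    for (int p = 0; p < npreds; p++)
        pred_proc[p] = rand() % NPROCS;
    for (int r = 0; r < nrules; r++)
        rule_proc[r] = pred_proc[rule[r].consequent];
}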


Both the cumulative iteration timer and the reasoning episode timer report the respective total times even if the reasoning episode is run incrementally.

/* Rules */
Forall x,y,z [ give(x,y,z) => own(y,z) ];
Forall x,y [ own(x,y) => can_sell(x,y) ];
Forall x,y [ preys_on(x,y) => scared_of(y,x) ];
Forall x,y,z [ move(x,y,z) => Exists t present(x,z,t) ];
Forall x,y,z [ move(x,y,z) => Exists t present(x,y,t) ];
Forall x,y [ sibling(x,y) & born_together(x,y) => twins(x,y) ];
Forall x,y [ sibling(x,y) => sibling(y,x) ];

/* Facts */
give (John,Mary,Book1);
move (John,Nyc,Boston);
sibling (John,x);
Forall x:Cat, y:Bird [ preys_on(x,y) ];
Exists x:Robin [ own(Mary,x) ];

/* Type hierarchy */
is-a (Bird,Animal);
is-a (Cat,Animal);
is-a (Canary,Bird);
is-a (Tweety,Canary);
is-a (Sylvester,Cat).

Figure 5.19: An example knowledge base in shruti-cm5 syntax.

5.6 The Shruti-cm5 User Interface

The following example illustrates the existing user interface to shruti-cm5 and supporting utilities.

1. Knowledge base generation. The user must begin with a knowledge base in a syntax recognized by shruti-cm5. Knowledge bases in other formats should be translated into a form accepted by the system.14 Figure 5.19 is an example knowledge base in shruti-cm5 syntax. It is also possible to create a pseudo-random knowledge base using the knowledge base generator (Section 7.1). The output of the generator is also in the above syntax.

2. Preprocessing and loading. The preprocessor reads the input knowledge base, assigns knowledge base items to CM-5 processors (using one of several available processor assignment schemes) and writes out a set of files. These files are read and encoded on the CM-5.

3. Parallel knowledge processing. Once the KB has been loaded on the CM-5, one can pose queries, obtain answers, and gather performance and timing data. The dialog in Figure 5.20 illustrates how the user interacts with the system. The system prompt is >>. User input is in typewriter font while system output is shown in slanted font. The input command i is used to input the knowledge base and to pose queries.15 The run command r runs a reasoning episode. It reports elapsed time if the query is answered (as in the case of can_sell(Mary,Book1)? in Figure 5.20); if the query is not answered, no timing is displayed (as in the case of the query contained in the file query).

14 For example, see Section 7.3.
15 The m -g command puts the system in direct input mode. The system always starts up in parallel input mode, and hence, the first i command reads input in parallel. In order to pose the query directly using the second i command, the input mode is changed using the m command. See Appendix B.

>> i input-kb.pp
Processing file input-kb.pp .... done
>> m -g
>> i
Enter Rules/Facts or Query: can_sell(Mary,Book1)?
>> r
Simulating ... done
Query answered affirmatively in 0.001638 seconds
>> z
Resetting network ... done.
>> i query
Processing file query .... done
>> r
Simulating ... done
Query not answered
>>

Figure 5.20: User interface to shruti-cm5.

Further commands can be used to view knowledge base distribution on the processors, processor load, individual processor timing, number of rules fired, active predicates and concepts, number of messages sent, and so on (see Appendix B). The system also provides the capability to process command files in order to facilitate unattended batch processing.

4. Data processing and visualization. The data obtained from reasoning episodes can be analyzed and plotted as graphs; dynamic processor load, timing, etc. can be visualized; knowledge base distribution can be analyzed and visualized; and the actual connectivity of the knowledge base can be graphically displayed. Chapter 7 shows graphs and visualization examples. All data analysis and visualization are done off-line on a serial workstation.

5.7 Shruti-cm5: Other Modules and Versions

We have described the design and implementation of the core of shruti-cm5, the parallel knowledge representation and reasoning system. We have also described some of the supporting modules, like the serial preprocessor. In this section we describe some of the other supporting modules and tools which help in experimentation, processing statistical data, and visualization. More details about using these tools and programs can be found in (Mani, 1995).

Shruti-cm5 and Shruti-cm5-Lite

The shruti-cm5 system described in this chapter is Version 08.3. This version has all the features presented in the previous sections. It can support node and link tags, and hence can handle both random knowledge bases and WordNet. Version 07.5 of shruti-cm5 is also available as shruti-cm5-lite. Version 07.5 does not support link tags, and does not have some of the refinements of Version 08.3. But Version 07.5 can provide full support for experimenting with random knowledge bases, and is a little more efficient than Version 08.3 when dealing with random knowledge bases. This is due to the fact that it avoids the overhead needed for handling tags and other features which are not required for encoding all knowledge bases. All experiments with random knowledge bases reported in Section 7.2 have been run using Version 07.5, while all the WordNet experiments in Section 7.3 have been run using Version 08.3.

Serial Shruti-cm5

By using appropriate compilation flags, shruti-cm5 can be compiled to run on a serial workstation. The internal mechanics of the implementation are identical to the parallel version except that the machine is treated as a one-processor partition. The single processor also functions as the server. Since all knowledge elements are on the same processor, no message passing is needed. The CM-5 timing routines are substituted by Unix timers which provide the same timing information, though the precision is lower than on the CM-5. The serial version is useful in determining the speedup obtained with shruti-cm5 on parallel partitions. We also expect the serial version to be useful in workstation based applications.

Inference Dependency Graph Processor

The inference dependency graph processor is yet another module that can be generated from the shruti-cm5 source code. It runs on a serial workstation and can read the server hash tables and inference dependency graphs generated by the shruti-cm5 preprocessor. It can be used to output graph files of knowledge base fragments; these files can be loaded into graph visualization programs in order to graphically study the structure of the knowledge base. Two graph visualization programs are currently supported: xgrab (Barnes, 1990) and dotty (Koutsofios and North, 1993). The inference dependency graph processor is also useful in generating random queries for experimenting with random knowledge bases.

The Visualizer

The visualizer is a stand-alone program written using tcl (Ousterhout, 1994). It can read in data columns and graphically display the values using squares whose area is proportional to the data value. This is very useful in visualizing processor load distribution and other aspects of a reasoning episode. The visualizer can also animate several data sets in a file to show, for example, the progress of a reasoning episode. Chapter 7 shows some examples of using the visualizer.

Data Processing

Several shell and awk scripts are provided for processing statistical data collected during reasoning episodes. The scripts support automatic generation of mean values, confidence intervals, etc. using Mathematica (Wolfram, 1991), graph generation using gnuplot, and other useful operations when experimenting with large knowledge bases.

5.8 Practical, Large-Scale Knowledge Representation with Shruti-cm5

The parallel shruti-cm5 core is a major and significant piece in the design and development of practical, large-scale real-time knowledge representation and reasoning systems. Not only is this parallel core very efficient but, being based on first-order logic, it is more powerful than frame-based systems.

The shruti-cm5 core is designed to efficiently handle very large knowledge bases. The response time for a given knowledge base depends on several factors, including the target hardware platform, the structure of the knowledge base, and the number of active elements in a reasoning episode. For example, response times will be smaller when the knowledge base has a large proportion of facts as opposed to a large proportion of rules. Similarly, performance will improve, even with much larger knowledge bases, if the total number of active elements is smaller. See Chapter 7.

The parallel shruti-cm5 core can be embedded in a front-end. The resulting system can provide a range of features including:

• Graphical User Interface. A user-friendly, graphical interface to the parallel core simplifies the task of loading a knowledge base, posing queries, retrieving answers, and inspecting the knowledge base.

• Multiple Queries. The front-end can read several queries simultaneously and present them sequentially to the parallel core. As the parallel core responds to the queries one by one, the front-end collects the results and displays them to the user.

• Conjunctive and Disjunctive Queries. The system can handle conjunctive and/or disjunctive queries. The front-end reads the queries, poses them to the parallel core sequentially, and then logically combines the individual responses to answer such queries.

• Answer Extraction. It should be trivial to provide answer extraction capabilities, whereby the system not only answers a query like own(John,x)? ("Does John own something?"), but also provides a list of things which John owns. The parallel core can perform answer extraction in the same time it takes to answer yes/no queries, and the front-end displays the list (or table, if there are several wildcards in the query) to the user.

• Relational Operations. The front-end can apply relational operations like joins, selections and projections to the result of a query.

Thus, a front-end to the parallel core can greatly enhance the perceived value of the system. Though some form of a front-end will be necessary in most practical applications, it is easy, if not trivial, to design and implement. Bells and whistles aside, the ultimate strength of a large-scale knowledge representation system lies in the parallel core. We therefore devote this thesis to the design and implementation of an efficient parallel core.16

16 In its current incarnation, shruti-cm5 does not include a front-end. But as noted above, it should be easy to provide one when needed.

5.9 Shruti on the CM-2

In the early stages of this research, we developed shruti-cm2, a data parallel implementation of shruti on the Connection Machine CM-2 (TMC, 1991a). A detailed description of shruti-cm2, including design, knowledge encoding, spreading activation and performance characteristics, can be found in Appendix D. However, due to the overwhelmingly superior performance of shruti-cm5 (the SPMD implementation on the CM-5) the shruti-cm2 project was abandoned.

Figure 5.21 compares the performance of shruti-cm2 and shruti-cm5. The graph shows timing data for shruti-cm2 running on 32K, 16K and 8K processor CM-2 machines, and shruti-cm5 running on a 32 PE CM-5. The same structured, random knowledge base, which included special rules (Section 5.3.1) and a type hierarchy, was used on all the machines (see Section 7.1). Note that the timing curve for the CM-5 has been multiplied by 100. Queries used in the experiments were generated as follows. In order to obtain data for a depth D, a special artificial chain of rules was embedded in the random knowledge base. A fact F was attached to the predicate at depth D from the query predicate. The query Q was tailored so that F would be the only fact that could affirmatively answer Q. The predicates in this chain were unique and not present in the random knowledge base. The chain was then "interfaced" to the random knowledge base by adding


extra rules that link each predicate in the chain to several (about five) predicates in the random knowledge base. This way, a significant number of predicates in the knowledge base were activated during an episode of reasoning.

Figure 5.21: A comparison of shruti-cm2 and shruti-cm5. The graph plots query depth (0 to 14) against response time (0 to 10 seconds) for the SIMD version 05.1 on 32K, 16K and 8K processor CM-2 machines and the SPMD version 07.3 on a 32 PE CM-5, using a knowledge base of size 55,000 with special rules and a type hierarchy. The timing curve for the CM-5 has been multiplied by 100.

Shruti-cm2 can also be run on the CM-5 in data-parallel mode. Results of these experiments are described in Section D.4.
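The embedding just described is mechanical enough to sketch. The generator below emits a depth-D chain in the knowledge base syntax of Figure 5.19; the predicate and constant names are invented for illustration, and the extra rules "interfacing" the chain to the random knowledge base are omitted.

#include <stdio.h>

/* Emit a unique chain of rules chain_1 => chain_0, ..., chain_D => chain_{D-1},
 * attach the single answering fact at depth D, and pose the query at depth 0.
 * Backward reasoning from the query then needs exactly D hops to reach the fact. */
static void emit_chain(int depth) {
    for (int i = 0; i < depth; i++)
        printf("Forall x [ chain_%d(x) => chain_%d(x) ];\n", i + 1, i);
    printf("chain_%d(C1);\n", depth);   /* fact F at depth D */
    printf("chain_0(C1)?\n");           /* query Q           */
}

int main(void) { emit_chain(8); return 0; }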


Chapter 6

A Mathematical Analysis of SHRUTI-CM5

6.1 Motivation

We mathematically analyze the mapping of shruti onto parallel machines in an attempt to derive general guidelines for effective implementation on massively parallel machines. A comparison of parallel reflexive reasoning systems on the CM-2, CM-5, and serial workstations indicates that SPMD message passing implementations on the CM-5 are by far the most promising, and outperform SIMD CM-2 implementations and serial implementations. We therefore base our analysis on distributed memory massively parallel processors (MPPs) where processors operate autonomously and interprocessor communication is achieved by message passing. We expect the results of the analysis to provide pointers to efficient encoding of large knowledge bases on general purpose massively parallel machines. We apply the results to validate the design of shruti-cm5.

The analysis is based on the assumption that the crucial factor in effectively mapping shruti onto MPPs is to appropriately control the distribution of primitive elements among the processors of the MPP. The primitive elements are determined by the mapping granularity; when using knowledge-level partitioning, knowledge base elements constitute the primitives. Other than knowledge base distribution, most of the other factors that affect implementation efficiency (including the message passing overhead and knowledge base characteristics like number of predicates, branching factor, etc.) are beyond our control. Thus, given a particular machine and a knowledge base to be encoded on that machine, the assignment of knowledge base elements to the processors in the machine is the only major factor that is entirely under the control of the system designer. The analysis presented here provides guidelines on optimum placement of knowledge base elements taking machine and knowledge base parameters into consideration. The analysis is based on the average case and yields optimum performance on average.

Processor assignment should take into account two factors: load balancing and locality. Load balancing attempts to maximally utilize available parallelism by providing a reasonably balanced load for all the processors in the machine. Locality, on the other hand, is aimed at minimizing communication overhead by clustering related elements on the same or nearby processors. In the analysis presented in this chapter, we introduce a single metric, the probability of finding related elements on the same processor, which is a measure of locality and implicitly determines load balance. We optimize response time with respect to this metric and study the results with a view toward designing large-scale knowledge base systems. We shall consider only backward reasoning.

The analysis uses a relatively large number of symbols to denote various aspects of mapping shruti onto parallel machines and running reasoning episodes. Section 6.11 provides a glossary of terms and symbols for easy reference.

6.2 Query Response Time

Let KB be the given knowledge base. Let P, F, and C be the sets of predicates, facts, and entities, respectively, in KB. Suppose a query Q is posed. Let d_Q be the inference depth required to answer Q. In view of the constraints imposed by shruti, d_Q is bounded (Section 2.3.11). If d_Q^max is the upper bound on d_Q, then responding to Q requires D = min(2 d_Q + 1, 2 d_Q^max + 1) iterations. If Q has an affirmative answer (and assuming d_Q <= d_Q^max), d_Q hops are needed to spread activation from the query predicate to the predicate with the relevant fact, one iteration is needed for matching the relevant fact, and the equivalent of another d_Q hops is needed for collector activation to reverse propagate to the query predicate, for a total of 2 d_Q + 1 iterations. If d_Q > d_Q^max, or if Q does not have an answer, the reasoning episode terminates after 2 d_Q^max + 1 iterations. For type hierarchy queries, D is always equal to d_Q, again assuming that d_Q <= d_Q^max.

We use the term hop to denote activation traversing a link; a link could represent either a rule or an is-a relation.1 Hops can be numbered based on how far activation has spread from its origin. For example, in the rule-base, the first hop will result in activating all predicates in the antecedent of rules with the query predicate as the consequent. The computation needed to process all KB elements receiving activation which has traveled i hops constitutes iteration i.2 In this chapter, we use "iteration" only in this special sense. One iteration could span several passes through the body of the activation propagation loop (Figure 5.6), and each pass could partially advance iterations for several values of i.3

Let the MPP machine have N processors, numbered from 0 to N-1. Let T(i, x) be the time needed for iteration i on processor x. By definition, T(i, x) represents the total time needed by processor x in order to process all activation for hop number i. Assuming that spreading activation in the rule-base and type hierarchy are performed separately (see Section 6.3), the time T(i, x) can be split into

T(i,x) = T_{rb}(i,x) + T_{th}(i,x)    (6.1)

where T_rb(i, x) is the time taken to spread activation in the rule-base and T_th(i, x) is the time required for spreading activation in the type hierarchy. Depending on when processors synchronize during an episode of reasoning, the total time T_Q for answering Q can be represented in several ways:

1. When processors run asynchronously in the course of a reasoning episode, synchronizing only at the start and finish, we have

   T_Q = \max_{x=0}^{N-1} \sum_{i=0}^{D} T(i,x)    (6.2)

2. Synchronization after every iteration results in

   T_Q = \sum_{i=0}^{D} \max_{x=0}^{N-1} T(i,x)    (6.3)

1 By this definition, reverse propagation of collector activation is not considered to be a hop. In a reasoning episode, collector activation must traverse d_Q links to reach the query predicate. In the time needed to do this, activation can spread another d_Q hops. Thus, reverse propagating collector activation from the relevant fact predicate to the query predicate is equivalent to d_Q hops, though, technically, this activation propagation does not constitute hops.
2 If a predicate P is instantiated by activation which has traveled i hops, then the computation needed to process collector activation for P is included in the cost of iteration i. Of course, the actual propagation of collector activation for P may take place at a later time, during a different pass through the activation propagation loop.
3 In Chapter 5 we use "iteration" to mean one pass through the activation propagation loop. In this chapter, we use iteration in the technical sense defined above.

1. for every element E in the respective frontier
2.     for every element related to E by a rule / is-a relation
3.         perform necessary computation
4.         if related element is on the same processor as E
5.             directly call respective procedure to receive activation
6.         else
7.             send message to invoke respective receive procedure

Figure 6.1: General activation spreading mechanism in shruti-cm5.

3. Processors could also synchronize more than once every iteration. For example, synchronization after each component of Equation 6.1 would lead to

   T_Q = \sum_{i=0}^{D} \left[ \max_{x=0}^{N-1} T_{rb}(i,x) + \max_{x=0}^{N-1} T_{th}(i,x) \right]    (6.4)

Note that the term to which the max operator is applied depends on the synchronization pattern. In order to synchronize, all processors must wait for the slowest processor to finish; hence, the time between two synchronization points is dictated by the processor which takes the longest. We use Equation 6.2 as the expression for T_Q since shruti-cm5 runs asynchronously by default. We also briefly consider synchronization after every iteration (Equation 6.3), since shruti-cm5 can optionally be run in this mode with command options, and uses this mode for type hierarchy enumeration queries. Section 6.10.4 compares and contrasts synchronous and asynchronous operation in light of the analysis. Given an analysis based on any of the above three equations, it is quite straightforward to modify it for other synchronization patterns.
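For reference, the three expressions are ordered: a maximum of sums never exceeds a sum of maxima, and splitting the maximum over the two components of Equation 6.1 can only increase the bound. In the notation above,

\max_{x=0}^{N-1} \sum_{i=0}^{D} T(i,x) \;\le\; \sum_{i=0}^{D} \max_{x=0}^{N-1} T(i,x) \;\le\; \sum_{i=0}^{D} \left[ \max_{x=0}^{N-1} T_{rb}(i,x) + \max_{x=0}^{N-1} T_{th}(i,x) \right]

so, under this cost model, asynchronous operation (Equation 6.2) is never slower than either synchronous mode.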

6.2.1 Computation and Communication Costs

We use Figure 6.1 to describe how the analysis takes communication and computation costs into account. The figure shows, at an abstract level, the steps involved in spreading activation in the system. The frontier under consideration is scanned and activation is spread for each element in the frontier. This is done for every frontier in every iteration. In any iteration i, the computation cost for a given frontier L includes: (i) the cost of receiving (local or remote) activation which results in building the frontier L; and (ii) the cost of processing the elements in L, i.e., executing the loop in Figure 6.1 for frontier L. For iteration i, the cost of line 5 or 7 in Figure 6.1 is confined to calling the respective procedure. The cost of actually executing the body of this procedure will be included in the computation cost for iteration i+1, since this is part of the cost of building the frontier for iteration i+1.

We use t_s, the time needed to call an empty local procedure, as the unit for expressing computation and communication costs. We have chosen this relatively non-standard unit for two reasons. Firstly, when spreading activation using active messages on the CM-5, the fundamental operation performed by an active message is to invoke a (handler) procedure on a remote processor. As described in Section 5.5, activation propagation can be expressed in terms of a computation phase, which consists of assembling the new instantiation, and a communication phase, where activation is communicated to the relevant predicate or concept by calling a handler procedure (either locally or remotely). This makes it natural to talk in terms of procedure calls (see lines 5 and 7 in Figure 6.1). Secondly, the outcome of the analysis (see Sections 6.8 and 6.10) depends only on the relative magnitudes of the communication cost k_c = t_c / t_s and the computation cost k_o = t_o / t_s (see below for definitions of the various parameters mentioned here), i.e., on the ratio k_c / k_o = t_c / t_o. This ratio is independent of t_s; in other words, t_s is essentially a proportionality constant used in converting cost to time.

Thus, t_s could represent the time for some convenient local operation like a memory write, empty procedure call, etc. This ensures that the analysis is valid for any MPP machine. Let t_c be the time needed to call an empty procedure on a remote processor. This includes both the communication overhead and network latency. The communication cost, k_c, is therefore t_c / t_s. On a distributed memory message passing machine, t_c > t_s. When a processor p1 invokes a procedure on a remote processor p2 by sending a message, p1 will incur a cost proportional to k_c. Processor p2 will need to receive this message before it can actually invoke the procedure, and will therefore incur a cost. This cost of receiving a message is represented by k_c^h = t_c^h / t_s, where t_c^h is the time for receiving a message.4 The magnitude of t_c^h with respect to t_c depends on the communication protocol in use. With these definitions, the time for executing line 5 in Figure 6.1 is t_s and that for line 7 is t_c. When line 7 is executed, the remote processor spends time t_c^h to receive the relevant message. t_s and t_c denote the time needed to invoke the procedure, locally or remotely, to process the new activation appropriately. When using active messages, the procedure invoked is generally the handler. When invoked remotely, the remote processor expends time t_c^h to receive the invocation request and arguments necessary for invoking the appropriate procedure. The time for actually executing the body of the receive activation procedure is represented by t_o^h. This contributes to the cost of building the frontier. Computation that is independent of the location of the elements, including the cost of executing line 3 in Figure 6.1, is accounted for by the computation time t_o, which gives rise to the computation cost k_o = t_o / t_s.5

Of the parameters introduced above, t_s, t_c and t_c^h are machine dependent and are therefore invariant for any given machine. t_o and t_o^h, on the other hand, could be different for different frontiers. When dealing with a specific frontier, extra subscripts will be added to distinguish parameters for different frontiers (see Section 6.11).

4 The superscript h is mnemonic for handler.
5 We assume that the loop overhead and cost of executing conditionals is also included in t_o.

6.2.2 Defining q

In the process of spreading activation, predicates in the rule-base and entities in the type hierarchy need to communicate with other predicates and entities. These other predicates and entities could be located on the same processor or on another processor. Let prob(s) be the probability of finding related elements on the same processor. Thus, given a rule P(...) ⇒ Q(...) [or an is-a relation is-a(A, B)], prob(s) is the probability that both P and Q [A and B] are on the same processor. Let prob(c) = 1 - prob(s) be the probability that related elements are on different processors. We use q to denote prob(s), the probability of finding related elements on the same processor. Then, prob(c) = 1 - q. Let E1 and E2 be two related elements. Let E1 be assigned to processor p. Then, E2 is also assigned to p with probability q. With probability 1 - q, E2 is assigned to a processor other than p. When this happens, E2 has a uniform probability of being assigned to any processor other than p. Thus, q is a measure of locality; the uniform assignment probability when two elements are on different processors ensures load balancing, subject to the locality requirements of q. Since the rule-base and type hierarchy could have different connectivity characteristics, q_th and q_rb are used to represent q for the type hierarchy and rule-base, respectively.
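The definition suggests a direct assignment procedure, sketched below with illustrative names (place_related and the use of rand() are assumptions made for the sketch): given the processor housing one element, a related element is placed on the same processor with probability q and otherwise uniformly on one of the remaining N-1 processors.

#include <stdlib.h>

/* Place an element related to one already housed on processor 'home':
 * with probability q on the same processor (locality), otherwise
 * uniformly at random on one of the other nprocs-1 processors
 * (load balance). */
static int place_related(int home, int nprocs, double q) {
    if ((double)rand() / RAND_MAX < q)
        return home;
    int other = rand() % (nprocs - 1);     /* uniform over the rest */
    return (other >= home) ? other + 1 : other;
}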

6.2.3 Spreading Activation in the Rule-Base

In order to compute T(i, x), we begin with T_rb(i, x), the time taken to spread activation in the rule-base. T_rb(i, x) can be further factored into (see the body of the activation propagation loop in Figure 5.6):

T_{rb}(i,x) = T_r(i,x) + T_f(i,x)    (6.5)

T_r(i,x) = T_{rf}(i,x) + T_{rr}(i,x)    (6.6)

where T_r(i, x) = time for processing rules; T_f(i, x) = time for matching facts; T_rf(i, x) = time for firing rules (spreading rule activation forward); and T_rr(i, x) = time for propagating collector activation (spreading reverse activation). The expression for T_rf(i, x) can be written as

T_{rf}(i,x) = time for building rule frontier + time for processing rule frontier
            = \sum_{p \in Pa_{rf}^{i,x}} [\, t_c^h (1 - q_{rb}) + t_{orf}^h \,] + \sum_{p \in Pa_{rf}^{i,x}} b_f(p) [\, t_s q_{rb} + t_c (1 - q_{rb}) + t_{orf} \,]    (6.7)

In the above equation, the summations extend over all elements of the rule frontier. The terms in the first summation constitute the time for building the rule frontier. Each element in the frontier is added when executing the handler and therefore takes up time t_{orf}^h. Of the elements in the current rule frontier, a fraction q_rb arises from elements active on the same processor during the previous iteration; these elements do not incur any extra cost in invoking the handler. The remaining 1 - q_rb fraction of elements require an extra time of t_c^h to receive the incoming messages before invoking the handler. The second summation gives the total time needed to process all the elements in the rule frontier: t_s q_rb + t_c (1 - q_rb) is the time to execute the if statement (lines 4-7) in Figure 6.1; t_orf accounts for the remaining computation time (which corresponds to line 3 in Figure 6.1, loop control and other loop overhead); b_f(p) is the forward branching factor for predicate p. Using a similar line of reasoning,

T_{rr}(i,x) = \sum_{p \in Pa_{rr}^{i,x}} [\, t_c^h (1 - q_{rb}) + t_{orr}^h \,] + \sum_{p \in Pa_{rr}^{i,x}} b_r(p) [\, t_s q_{rb} + t_c (1 - q_{rb}) + t_{orr} \,]    (6.8)

T_f(i,x) = \sum_{p \in Pa_f^{i,x}} \sum_{f \in F_p} argc(f) \, t_{of}    (6.9)

Note that we use t_of to denote computation associated with fact matching. In the above equations, in addition to the parameters described earlier, we have:

Pa^{i,x} is the set of all active predicates on processor x during iteration i;
Pa_{rf}^{i,x} is the rule frontier on x during iteration i; Pa_{rf}^{i,x} ⊆ Pa^{i,x}, and if p ∈ Pa_{rf}^{i,x}, then there is a rule R with p in the consequent and R fires in iteration i;
Pa_{rr}^{i,x} is the reverse-propagation frontier on x during iteration i; Pa_{rr}^{i,x} ⊆ Pa^{i,x}, and if p ∈ Pa_{rr}^{i,x}, then the collector of p is active;
Pa_f^{i,x} is the fact frontier on x during iteration i; Pa_f^{i,x} ⊆ Pa^{i,x}, and if p ∈ Pa_f^{i,x}, then p has facts that need to be matched;
F_p is the set of facts that need to be matched for predicate p;
b_f(p) is a function which specifies the forward branching factor for p ∈ P (i.e., the number of rules with p as the consequent);
b_r(p) is a function which specifies the reverse branching factor for p ∈ P (i.e., the number of rules with p in the antecedent);
argc(f) is the number of arguments of fact f that have been bound to entities (i.e., arguments which are constants or typed variables).

6.2.4 Spreading Activation in the Type Hierarchy

In order to compute T_th(i, x), we factor it into

T_{th}(i,x) = T_u(i,x) + T_d(i,x)    (6.10)

with T_u(i, x) and T_d(i, x) representing the time required to propagate bottom-up and top-down activation in the type hierarchy, respectively. In a manner analogous to Equation 6.7, these two components of T_th(i, x) can be expressed as

T_u(i,x) = \sum_{c \in Ca_u^{i,x}} [\, t_c^h (1 - q_{th}) + t_{ou}^h \,] + \sum_{c \in Ca_u^{i,x}} b_u(c) [\, t_s q_{th} + t_c (1 - q_{th}) + t_{ou} \,]    (6.11)

T_d(i,x) = \sum_{c \in Ca_d^{i,x}} [\, t_c^h (1 - q_{th}) + t_{od}^h \,] + \sum_{c \in Ca_d^{i,x}} b_d(c) [\, t_s q_{th} + t_c (1 - q_{th}) + t_{od} \,]    (6.12)

where

Ca^{i,x} is the set of active entities on processor x during iteration i;
Ca_u^{i,x} is the type hierarchy frontier for bottom-up activation on processor x during iteration i; Ca_u^{i,x} ⊆ Ca^{i,x}, and if c ∈ Ca_u^{i,x}, then there is an is-a relation with c as the first argument, and this is-a relation "fires" in iteration i;
Ca_d^{i,x} is the type hierarchy frontier for top-down activation on processor x during iteration i; Ca_d^{i,x} ⊆ Ca^{i,x}, and if c ∈ Ca_d^{i,x}, then there is a default is-a relation with c as the superconcept, and this is-a relation "fires" in iteration i;
b_u(c) is a function which specifies the upward branching factor for c ∈ C (i.e., the number of is-a relations with c as the first argument);
b_d(c) is a function which specifies the downward branching factor for c ∈ C (i.e., the number of default is-a relations with c as the superconcept).

6.2.5 Observations

We observe that several terms in the expressions above are dependent on the characteristics of the knowledge base. The communication parameters, t_s, t_c and t_c^h, are dependent on the underlying machine. The computation overheads depend on the actual implementation of the system. Only the distribution of the active elements, represented by the various Pa^{i,x} and Ca^{i,x} sets, and the probabilities, q_rb and q_th, are under the control of the system designer. Our aim is to use a processor assignment scheme which tailors these parameters for maximum efficiency. The analysis presented above is general and does not make unwarranted assumptions about the nature of the machine or about the processor assignment and mapping scheme. The assumptions that have been made are listed in Section 6.3 along with justifications. The expressions arising from the above general analysis are complex, and it is difficult to apply standard mathematical tools in order to extract useful information and make predictions based on these equations. In order to make the expressions more manageable, we use a set of simplifying assumptions (Section 6.5). The simplified analysis makes several predictions, some of which are experimentally validated in Chapter 7.

6.3 Assumptions

The analysis presented above makes the following assumptions:

1. Each of the operations needed to spread activation in the rule-base and in the type hierarchy (like firing rules, matching facts, reverse-propagating collector activation, spreading bottom-up activation in the type hierarchy, and spreading top-down activation in the type hierarchy) is performed serially on each processor (in keeping with the activation propagation loop in Figure 5.6). This is essential if the network has to be partitioned at a relatively coarse level of granularity. We argue in Chapter 3 that knowledge-level partitioning provides conceptual clarity and facilitates efficient mapping when the MPP has relatively powerful processors. The individual processors run in parallel.

2. The synchronization pattern of the processors in the MPP machine decides which of Equations 6.2-6.4 is used to compute response time. By default, reasoning episodes in shruti-cm5 are run asynchronously (Equation 6.2) except for enumeration queries; episodes can be forced to run synchronously (Equation 6.3).

3. The predicates and entities in the KB are distributed over the same processors. Thus, any processor in the machine has both predicates and entities assigned to it. As a result, spreading activation in the rule-base and in the type hierarchy are two separate operations which are performed separately on each processor. Another possible option is to assign part of the processors in the machine to handle the type hierarchy while the remaining processors encode the rule-base. If the machine is partitioned in this manner, the analysis can be easily modified to reflect this distribution. Distributing both predicates and entities over all the processors has the advantage of improved load balancing even if the balance of computation between the rule-base and type hierarchy is uneven. For example, the random knowledge bases used in Chapter 7 have very little activation propagating in the type hierarchy, while WordNet does not even have a rule-base. Under these situations, independent processor partitions for the rule-base and type hierarchy would result in ineffective use of parallelism.

4. All processors in the MPP have local copies of the activation values of active entities. Entity activations are broadcast whenever an entity becomes active. The cost of this overhead is included in t_ou and t_od. During fact matching and special variable checking for rule firing, processors retrieve entity activations from the local cache, eliminating communication. This is in keeping with the shruti-cm5 implementation. Fact matching and rule firing would incur extra costs if this were not true.

5. The multiple instantiation constants k1 and k2 have not been explicitly included. We assume that the multiple instantiation constants will be indirectly taken into account while considering the set of active predicates or entities. Even if this is not done, the analysis will at worst be affected by a constant factor.

6. Activation that has already been propagated is not re-propagated in every iteration. Predicates and entities are placed on frontiers only when they receive new activation, and even then, only the new activation is processed. For example, if a predicate has two active banks and receives new activation in its third bank, only the activation in the third bank is processed. Thus, not all active predicates and entities are placed on frontiers.

7. The communication cost is independent of knowledge base size. When the size of the knowledge base is very large, it is possible that a large number of elements are active, resulting in a lot of messages being injected into the network. This could potentially increase network latency. For the knowledge base sizes we are interested in, we assume that the network latency remains constant. In the following section, we argue that this assumption is justifiable if active messages are used on the CM-5. If this assumption is invalid, then the communication cost should be portrayed as a function of the number of active elements in KB.

8. The communication cost is the same for interprocessor communication between any two processors in the MPP. If this is not true, the communication cost should be a function of the source and destination of the message. This fixed communication cost assumption holds when the topology of the communication network is such that the distance between any two processors is constant. For example, a hypercube is such a network. On machines where the distance between nodes varies by a small amount (for example, the CM-5 with its fat tree interconnection network) we can use the average or worst case cost as the fixed communication cost between any two processors.
6.4 Communication Cost, Active Messages and the CM-5

Based on the fact that shruti-cm5 uses active messages for interprocessor communication, the following assumptions are justified:

- When a processor sends an active message, it can continue with its computation as soon as the message has been injected into the network. The processor need not wait for the message to arrive at its destination. This overlapping of communication and computation means that $t_c$ need not include network latency. The effective cost of invoking a procedure on a remote processor is just the communication overhead, i.e., the cost of injecting the request into the network.

- The time needed to receive a message, $t_c^h$, is the time needed to retrieve the message from the network. Since this operation is essentially similar to injecting a message into the network, we can safely assume that $t_c^h = t_c$. In other words, the cost of sending a message and the cost of receiving a message are comparable.

- With network latency excluded from communication cost, it is reasonable to assume that the communication cost is independent of the source and destination processors.

On the CM-5, experiments indicate that calling an empty procedure 100,000 times requires about 0.5 sec. The value of $t_s$ is therefore approximately 5 $\mu$sec. Further, sending out 100,000 active messages takes around 2.5 sec, indicating that $t_c \approx 25$ $\mu$sec. The communication cost $k_c$ is therefore about 5. By virtue of our assumption that $t_c^h = t_c$, we have $t_c^h \approx 25$ $\mu$sec and $k_c^h \approx 5$. Depending on the type of frontier under consideration, we expect $k_o$ to range between 10-100. The estimate for $k_o$ values has been confirmed experimentally (for example, see Section 7.4). Moreover, experiments indicate that the average time per message when dealing with very large knowledge bases is about 8 $\mu$sec on a 32 PE CM-5. Each processor, therefore, generates a message in about $32 \times 8$ $\mu$sec $\approx 250$ $\mu$sec on average, or about 3000-4000 messages per second. Assuming uniform distribution of messages, each processor receives, on average, about 3000-4000 messages per second. Each active message is 20 bytes long, leading to a traffic of around 0.1 MB/sec between any two processors. The CM-5 can sustain a minimum interprocessor bandwidth of 5 MB/sec, which is well above the average shruti-cm5 message traffic, even if message distribution is highly skewed. In addition to the fact that network latency is excluded from communication cost (when using active messages), the above argument offers further justification for the assumption that network latency on the CM-5 is independent of the size of KB, for the range of KB sizes we have experimented with. The range of KB sizes over which the assumption continues to hold is dependent on the granularity and locality of the mapping.
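These estimates follow directly from the benchmark figures. The following sketch (purely illustrative; the constants are the measurements quoted above, not output of shruti-cm5) reproduces the arithmetic:

# Recovering the CM-5 machine parameters from the micro-benchmark
# timings reported above (0.5 sec / 2.5 sec per 100,000 calls/messages).

calls = 100_000
t_s = 0.5 / calls            # empty local procedure call: ~5e-6 sec
t_c = 2.5 / calls            # injecting an active message: ~25e-6 sec
k_c = t_c / t_s              # communication cost relative to t_s: ~5

n_pe = 32
per_message = 8e-6                    # avg time per message, whole machine
per_pe_interval = n_pe * per_message  # one message per PE every ~250 usec
msgs_per_sec = 1.0 / per_pe_interval  # ~3900 messages/sec per PE

bytes_per_msg = 20
traffic_mb = msgs_per_sec * bytes_per_msg / 1e6   # ~0.08 MB/sec, i.e. ~0.1 MB/sec

print(t_s, t_c, k_c, msgs_per_sec, traffic_mb)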

6.4.1 Revised Equations for Response Time

Substituting $t_c^h = t_c$, $t_c = k_c t_s$ and $t_o = k_o t_s$ in Equations 6.7-6.9, 6.11 and 6.12, the equations for the components of response time can be written as:

\[ T_{rf}(i,x) = \sum_{p \in Pa_{rf}^{i,x}} t_s \, \{\, k_c (1 - q_{rb}) + k^h_{or_f} + b_f(p) \, [\, q_{rb} + k_c (1 - q_{rb}) + k_{or_f} \,] \,\} \tag{6.13} \]

\[ T_{rr}(i,x) = \sum_{p \in Pa_{rr}^{i,x}} t_s \, \{\, k_c (1 - q_{rb}) + k^h_{or_r} + b_r(p) \, [\, q_{rb} + k_c (1 - q_{rb}) + k_{or_r} \,] \,\} \tag{6.14} \]

\[ T_f(i,x) = \sum_{p \in Pa_f^{i,x}} \sum_{f \in F_p} arg_c(f) \, t_s \, k_{of} \tag{6.15} \]

\[ T_u(i,x) = \sum_{c \in Ca_u^{i,x}} t_s \, \{\, k_c (1 - q_{th}) + k^h_{ou} + b_u(c) \, [\, q_{th} + k_c (1 - q_{th}) + k_{ou} \,] \,\} \tag{6.16} \]

\[ T_d(i,x) = \sum_{c \in Ca_d^{i,x}} t_s \, \{\, k_c (1 - q_{th}) + k^h_{od} + b_d(c) \, [\, q_{th} + k_c (1 - q_{th}) + k_{od} \,] \,\} \tag{6.17} \]

6.5 Simplifications and Approximations

As noted earlier, we attempt to simplify the expressions for the components of query response time by making several assumptions and approximations about the various parameters appearing in the equations:

1. To simplify $T_f(i,x)$, we assume that for a given fact $f$ pertaining to predicate $p$, the number of argument positions bound to entities is identical to the number of arguments of $p$, i.e., $arg_c(f) = arg(p)$. In other words, all the arguments of $f$ are bound to entities. If this is not the case, the actual timing will be better than that predicted by the analysis. With this assumption,

\[ T_f(i,x) = \sum_{p \in Pa_f^{i,x}} f(p) \, t_s \, k_{of} \tag{6.18} \]

where $f(p) = |F_p| \, arg(p)$ is the product of the number of facts for predicate $p$ and the number of arguments of $p$.

2. In general, the probabilities $q_{rb}$ and $q_{th}$ could depend on the processor $x$ and the iteration $i$. We assume that the probabilities are independent of $x$ and $i$. The implications of this assumption are:

- The processor allocation scheme treats all processors identically when assigning knowledge base elements to processors. There are no "distinguished" processors. Thus, if we have a rule $P(\ldots) \Rightarrow R(\ldots)$ and predicate $R$ is assigned to processor $p$, then $P$ is assigned to $p$ with probability $q_{rb}$ and to any of the other $N-1$ processors with uniform probability $\frac{1 - q_{rb}}{N-1}$. A similar statement can be made about the distribution of entities in the type hierarchy.

- The probabilities do not change during an episode of reasoning. The processor assignment is therefore static and is determined when the knowledge base is encoded.^6

3. The "branching factor" functions $b_f(p)$, $b_r(p)$, $f(p)$, $b_u(c)$ and $b_d(c)$ are replaced by average values ($b_f$, $b_r$, $f$, $b_u$ and $b_d$ respectively) for the knowledge base.

[Footnote 6: Changing processor allocation dynamically during a reasoning episode can be extremely expensive, given that any reasoning episode usually involves only a small fraction of the knowledge base and is confined to a short time span. See Section 3.1.3.]

4. The operations involved in spreading forward and reverse activation in the rule-base are reasonably similar. It is therefore justifiable to assume that $k_{or_f} \approx k_{or_r}$ and $k^h_{or_f} \approx k^h_{or_r}$. Similarly, we can assume that $k_{ou} \approx k_{od}$ and $k^h_{ou} \approx k^h_{od}$. We therefore have:

\[ k_{or_f} = k_{or_r} = k_{orb} \tag{6.19} \]
\[ k_{ou} = k_{od} = k_{oth} \tag{6.20} \]
\[ k^h_{or_f} = k^h_{or_r} = k^h_{orb} \tag{6.21} \]
\[ k^h_{ou} = k^h_{od} = k^h_{oth} \tag{6.22} \]

5. Finally, we assume that, on average,

\[ Pa_{rf}^{i,x} = Pa_{rr}^{i,x} = Pa_f^{i,x} = (Pa^{i,x} - Pa^{i-1,x}) = \Phi_P^{i,x} \tag{6.23} \]
\[ Ca_u^{i,x} = Ca_d^{i,x} = (Ca^{i,x} - Ca^{i-1,x}) = \Phi_C^{i,x} \tag{6.24} \]

The frontier sets $\Phi_P^{i,x}$ and $\Phi_C^{i,x}$ represent the elements in the rule-base and type hierarchy frontiers. As noted earlier, in an asynchronous system the creation of these sets could span several passes through the activation propagation loop.

The assumption $Pa_{rf}^{i,x} = Pa_{rr}^{i,x}$, i.e., that the forward and reverse activation frontiers in the rule-base are identical, is equivalent to stating that every active predicate will eventually have its collector activated. This may not always be the case, and biases the analysis towards the worst case. We believe that the resulting simplification is significant enough to warrant the possible error introduced by this assumption. A similar argument can be made for the type hierarchy.^7

6.5.1 Simplified Equation for Response Time

The equation for response time, in conjunction with the approximations stated above, provides an estimate of average response time. Due to some of the assumptions and approximations (especially Assumptions 1 and 5 in Section 6.5, and the simplifying assumptions in Section 6.6.2), the response time has a bias towards the worst case. Combining Equations 6.13-6.17, the expressions for response time can now be rewritten as:

\[
\begin{aligned}
T_{rb}(i,x) &= T_{rf}(i,x) + T_{rr}(i,x) + T_f(i,x) \\
&= |\Phi_P^{i,x}| \, t_s \, \{ b_f q_{rb} + (b_f + 1) k_c (1-q_{rb}) + k^h_{orb} + b_f k_{orb} \} \\
&\quad + |\Phi_P^{i,x}| \, t_s \, \{ b_r q_{rb} + (b_r + 1) k_c (1-q_{rb}) + k^h_{orb} + b_r k_{orb} \} + |\Phi_P^{i,x}| \, t_s \, f k_{of} \\
&= |\Phi_P^{i,x}| \, t_s \, \{ (b_f + b_r) q_{rb} + (b_f + b_r + 2) k_c (1-q_{rb}) + [\,(b_f + b_r) k_{orb} + 2k^h_{orb} + f k_{of}\,] \} \\
&= |\Phi_P^{i,x}| \, t_s \, \{ B_{rb} q_{rb} + (B_{rb} + 2) k_c (1-q_{rb}) + K_{orb} \}
\end{aligned} \tag{6.25}
\]

and

\[
\begin{aligned}
T_{th}(i,x) &= T_u(i,x) + T_d(i,x) \\
&= |\Phi_C^{i,x}| \, t_s \, \{ b_u q_{th} + (b_u + 1) k_c (1-q_{th}) + k^h_{oth} + b_u k_{oth} \} \\
&\quad + |\Phi_C^{i,x}| \, t_s \, \{ b_d q_{th} + (b_d + 1) k_c (1-q_{th}) + k^h_{oth} + b_d k_{oth} \} \\
&= |\Phi_C^{i,x}| \, t_s \, \{ (b_u + b_d) q_{th} + (b_u + b_d + 2) k_c (1-q_{th}) + [\,(b_u + b_d) k_{oth} + 2k^h_{oth}\,] \} \\
&= |\Phi_C^{i,x}| \, t_s \, \{ B_{th} q_{th} + (B_{th} + 2) k_c (1-q_{th}) + K_{oth} \}
\end{aligned} \tag{6.26}
\]

[Footnote 7: If it is known that the reverse activation frontier in the rule-base and/or the top-down activation frontier in the type hierarchy will be empty, then the corresponding terms can be eliminated from the expression for response time. For example, with WordNet, $Ca_d^{i,x}$ is always empty; Section 7.4 considers this case.]

$|\Phi_P^{i,x}|$ and $|\Phi_C^{i,x}|$ represent the cardinality of the respective frontier sets. We have also defined the constants

\[ B_{rb} = b_f + b_r \tag{6.27} \]
\[ B_{th} = b_u + b_d \tag{6.28} \]
\[ K_{orb} = (b_f + b_r) k_{orb} + 2k^h_{orb} + f k_{of} \tag{6.29} \]
\[ K_{oth} = (b_u + b_d) k_{oth} + 2k^h_{oth} \tag{6.30} \]

6.6 Determining Query Response Time

To determine query response time using Equations 6.25 and 6.26, we need to compute the cardinality of the frontier sets. Substituting the frontier set cardinality in these equations results in expressions for query response time in terms of machine and knowledge base parameters.

6.6.1 Queries

Any reasoning episode starts off by posing a query. If the query is posed in the rule-base, it results in activating the query predicate and the entities appearing in the query. If the query is $p_Q(c_1, \ldots, c_n)$, then $\Phi_P^0 = \bigcup_{x=0}^{N-1} \Phi_P^{0,x} = \{p_Q\}$ and $\Phi_C^0 = \bigcup_{x=0}^{N-1} \Phi_C^{0,x} = \{c_1, \ldots, c_n\}$.^8

If a type hierarchy query is posed, then $\Phi_P^0$ is empty since no predicates are activated. $\Phi_C^0$ has one element for both normal and enumeration queries. Moreover, in the case of a type hierarchy query, the reasoning episode only spans $D = \min(d_Q, d_{Q_{max}})$ iterations, and no activation is propagated in the rule-base. Thus, $T_{rb}(i,x) = 0$ and we need to consider only $T_{th}(i,x)$. In the ensuing analysis, we assume that we are dealing with a query posed to the rule-base.

6.6.2 Cardinality of Frontier Sets

As the reasoning episode progresses, rules and is-a relations fire, spreading activation in the system. After $i$ iterations, it is evident that, on average, the total number of frontier elements on all processors is:^9

\[ |\Phi_P^i| = \sum_{x=0}^{N-1} |\Phi_P^{i,x}| = |\Phi_P^0| \, (b_f)^i = (b_f)^i \tag{6.31} \]

\[ |\Phi_C^i| = \sum_{x=0}^{N-1} |\Phi_C^{i,x}| = |\Phi_C^0| \, (B_{th})^i = k_n (B_{th})^i \tag{6.32} \]

where $0 \le k_n \le n$ is the number of entities appearing in the query.

Consider any processor $p$ on the MPP machine. The query predicate $p_Q$ could be located either on $p$, or on another processor. In the worst case, we can assume that $p_Q$ is located on $p$ so that

\[ |\Phi_P^{0,p}| = |\Phi_P^0| = 1 \tag{6.33} \]

We can now recursively define $|\Phi_P^{i,p}|$ for $i \ge 1$ as

\[ |\Phi_P^{i,p}| = |\Phi_P^{i-1,p}| \, b_f \, q_{rb} + \sum_{\substack{x=0 \\ x \ne p}}^{N-1} |\Phi_P^{i-1,x}| \left[ \frac{b_f (1 - q_{rb})}{N-1} \right] \tag{6.34} \]

[Footnote 8: Some of the $c_j$, $j = 1, \ldots, n$ may be variables. If $c_j$ is a variable, then it is not included in $\Phi_C^0$.]
[Footnote 9: We are assuming that the inference dependency graph is acyclic.]

During iteration $i$, processor $p$ has $|\Phi_P^{i-1,p}|$ active predicates on its frontier. Since $b_f$ is the average branching factor, the frontier predicates on $p$ activate $|\Phi_P^{i-1,p}| \, b_f$ predicates, of which $|\Phi_P^{i-1,p}| \, b_f \, q_{rb}$ are located on $p$. The first term on the rhs of Equation 6.34 represents these predicates. The second term on the rhs of Equation 6.34 accounts for predicates on $p$ which are activated by predicates housed on all the other processors.

The above reasoning is based on the following assumption (see Section 6.3). Suppose we have a rule $P(x,y) \Rightarrow R(x,y)$ with predicate $R$ assigned to processor $p'$. Then, $P$ is assigned to $p'$ with probability $q_{rb}$ and to any of the other $N-1$ processors with uniform probability $\frac{1 - q_{rb}}{N-1}$.

The second term on the rhs of Equation 6.34 can be simplified as:

\[
\begin{aligned}
\sum_{\substack{x=0 \\ x \ne p}}^{N-1} |\Phi_P^{i-1,x}| \left[ \frac{b_f (1-q_{rb})}{N-1} \right]
&= \frac{b_f (1-q_{rb})}{N-1} \sum_{\substack{x=0 \\ x \ne p}}^{N-1} |\Phi_P^{i-1,x}| \\
&= \frac{b_f (1-q_{rb})}{N-1} \left\{ |\Phi_P^{i-1}| - |\Phi_P^{i-1,p}| \right\} \\
&= \frac{b_f (1-q_{rb})}{N-1} \left\{ b_f^{i-1} - |\Phi_P^{i-1,p}| \right\} \\
&= \frac{b_f^i (1-q_{rb})}{N-1} - \frac{b_f (1-q_{rb})}{N-1} |\Phi_P^{i-1,p}|
\end{aligned} \tag{6.35}
\]

Substituting this expression in Equation 6.34, we have:

\[ |\Phi_P^{i,p}| = |\Phi_P^{i-1,p}| \, b_f q_{rb} + \frac{b_f^i (1-q_{rb})}{N-1} - \frac{b_f (1-q_{rb})}{N-1} |\Phi_P^{i-1,p}| \tag{6.36} \]

which simplifies to

\[ |\Phi_P^{i,p}| = |\Phi_P^{i-1,p}| \, b_f \left( \frac{N q_{rb} - 1}{N-1} \right) + \frac{b_f^i (1-q_{rb})}{N-1} \tag{6.37} \]

By recursively substituting for the cardinality in the rhs of the above equation, we can see that

\[ |\Phi_P^{i,p}| = |\Phi_P^{i-k,p}| \, b_f^k \left( \frac{N q_{rb} - 1}{N-1} \right)^k + \sum_{j=0}^{k-1} \frac{b_f^i (1-q_{rb})}{N-1} \left( \frac{N q_{rb} - 1}{N-1} \right)^j \tag{6.38} \]

When $k = i$ in the above equation, using $|\Phi_P^{0,p}| = 1$, we have

\[ |\Phi_P^{i,p}| = b_f^i \left( \frac{N q_{rb} - 1}{N-1} \right)^i + \sum_{j=0}^{i-1} \frac{b_f^i (1-q_{rb})}{N-1} \left( \frac{N q_{rb} - 1}{N-1} \right)^j \tag{6.39} \]

If the number of processors on the machine is reasonably large, we can assume that $N \gg 1$. This leads to the approximation $\frac{N q_{rb} - 1}{N-1} \approx q_{rb}$. When using random processor allocation on a 32 node CM-5, we have $q_{rb} = \frac{1}{N} = \frac{1}{32}$ (see Section 6.10.3), and hence $\frac{N q_{rb} - 1}{N-1} = \frac{N \cdot \frac{1}{N} - 1}{N-1} = 0$. The above approximation results in an error of about 0.03 in the value of $\frac{N q_{rb} - 1}{N-1}$. Note that since $0 \le q \le 1$, the value of $q_{rb}$ is always greater than $\frac{N q_{rb} - 1}{N-1}$, and therefore biases the analysis towards the worst case. Using this approximation in the previous equation, and replacing $p$ by $x$, we have

\[
\begin{aligned}
|\Phi_P^{i,x}| &= b_f^i q_{rb}^i + \sum_{j=0}^{i-1} \frac{b_f^i (1-q_{rb})}{N} \, q_{rb}^j \\
&= b_f^i q_{rb}^i + \frac{b_f^i (1-q_{rb})}{N} \sum_{j=0}^{i-1} q_{rb}^j \\
&= b_f^i q_{rb}^i + \frac{b_f^i (1-q_{rb})}{N} \left( \frac{1 - q_{rb}^i}{1 - q_{rb}} \right) \\
&= b_f^i q_{rb}^i + \frac{b_f^i}{N} (1 - q_{rb}^i) \\
&= b_f^i q_{rb}^i \left( 1 - \frac{1}{N} \right) + \frac{b_f^i}{N} \\
&\approx b_f^i q_{rb}^i + \frac{b_f^i}{N}
\end{aligned} \tag{6.40}
\]

The last step follows from the assumption that $N \gg 1$. In a similar fashion, one can show that

\[ |\Phi_C^{i,x}| \approx k_n B_{th}^i q_{th}^i + \frac{k_n B_{th}^i}{N} \tag{6.41} \]

In simplifying these equations, we have used the expression for the sum of a geometric series: $\sum_{i=0}^{k-1} a^i = \frac{1-a^k}{1-a}$.

Combining Equations 6.25, 6.26, 6.31, 6.32, 6.40 and 6.41, we have

For $i = 0$:

\[ T_{rb}(0,x) = t_s \{ B_{rb} q_{rb} + (B_{rb} + 2) k_c (1 - q_{rb}) + K_{orb} \} \tag{6.42} \]
\[ T_{th}(0,x) = k_n t_s \{ B_{th} q_{th} + (B_{th} + 2) k_c (1 - q_{th}) + K_{oth} \} \tag{6.43} \]

For $i \ge 1$:

\[ T_{rb}(i,x) = \left( b_f^i q_{rb}^i + \frac{b_f^i}{N} \right) T_{rb}(0,x) \tag{6.44} \]
\[ T_{th}(i,x) = \left( B_{th}^i q_{th}^i + \frac{B_{th}^i}{N} \right) T_{th}(0,x) \tag{6.45} \]

6.7 Optimizing Query Response Time

6.7.1 Problem Definition

From Section 6.2, we have Equation 6.2: $T_Q = \max_{x=0}^{N-1} \sum_{i=0}^{D} T(i,x)$. Since $T(i,x) = T_{rb}(i,x) + T_{th}(i,x)$, we can rewrite Equation 6.2 as

\[
T_Q = \max_{x=0}^{N-1} \sum_{i=0}^{D} \left[ \, T_{rb}(i,x) + T_{th}(i,x) \, \right]
    = \max_{x=0}^{N-1} \left\{ \sum_{i=0}^{D} T_{rb}(i,x) + \sum_{i=0}^{D} T_{th}(i,x) \right\} \tag{6.46}
\]

For an efficient mapping, we need $T_Q$ to be as small as possible. For $T_Q$ to be minimum, we need to minimize $T = \sum_{i=0}^{D} T(i,x)$. As mentioned earlier, for an implementation on a given machine and for a given knowledge base, all parameters except $q_{rb}$ and $q_{th}$ are fixed in the expression for $T$. Furthermore, $T_{rb} = \sum_{i=0}^{D} T_{rb}(i,x)$ and $T_{th} = \sum_{i=0}^{D} T_{th}(i,x)$ are processing times and are always positive. Thus, in order to minimize $T$, it suffices to independently minimize $T_{rb}$ and $T_{th}$. The minimization problem reduces from:

Minimize $T$ with respect to $q_{rb}$ and $q_{th}$, $0 \le q_{rb}, q_{th} \le 1$

to the simpler problem:

Minimize $T_{rb}$ with respect to $q_{rb}$, $0 \le q_{rb} \le 1$, and,
Minimize $T_{th}$ with respect to $q_{th}$, $0 \le q_{th} \le 1$.

The optimal value of $q_{rb}$ obtained is used in encoding the rule-base while the value of $q_{th}$ is used in encoding the type hierarchy.^10 This scheme independently exploits the structure and characteristics of the rule-base and type hierarchy.

6.7.2 Minimization

Using the definition of $T_{rb}$, and treating $\frac{1}{N}$ as a negligible quantity, we have

\[
\begin{aligned}
T_{rb} &= \sum_{i=0}^{D} T_{rb}(i,x) \\
&= T_{rb}(0,x) + \sum_{i=1}^{D} T_{rb}(i,x) \\
&= T_{rb}(0,x) + T_{rb}(0,x) \sum_{i=1}^{D} \left( b_f^i q_{rb}^i + \frac{b_f^i}{N} \right) \\
&\approx T_{rb}(0,x) \left\{ 1 + \sum_{i=1}^{D} b_f^i q_{rb}^i + \frac{1}{N} + \sum_{i=1}^{D} \frac{b_f^i}{N} \right\} \\
&= T_{rb}(0,x) \left\{ \sum_{i=0}^{D} b_f^i q_{rb}^i + \sum_{i=0}^{D} \frac{b_f^i}{N} \right\} \\
&= T_{rb}(0,x) \left\{ \frac{b_f^{D+1} q_{rb}^{D+1} - 1}{b_f q_{rb} - 1} + \frac{b_f^{D+1} - 1}{N (b_f - 1)} \right\}
\end{aligned}
\]

so that

\[ T_{rb} = \left\{ \frac{b_f^{D+1} q_{rb}^{D+1} - 1}{b_f q_{rb} - 1} + \frac{b_f^{D+1} - 1}{N (b_f - 1)} \right\} t_s \{ B_{rb} q_{rb} + (B_{rb} + 2) k_c (1 - q_{rb}) + K_{orb} \} \tag{6.47} \]

In simplifying the summations, we have used the result $\sum_{i=0}^{k} a^i = \frac{a^{k+1} - 1}{a - 1}$. Similarly, we can show that

\[ T_{th} = \left\{ \frac{B_{th}^{D+1} q_{th}^{D+1} - 1}{B_{th} q_{th} - 1} + \frac{B_{th}^{D+1} - 1}{N (B_{th} - 1)} \right\} k_n t_s \{ B_{th} q_{th} + (B_{th} + 2) k_c (1 - q_{th}) + K_{oth} \} \tag{6.48} \]

Equating the derivative of Equation 6.47 to zero, we can solve for $q_{rb}$ to obtain those values of $q_{rb}$ at which $T_{rb}$ is stationary. $T_{rb}$, and hence $\frac{dT_{rb}}{dq_{rb}}$, could be a high degree polynomial in $q_{rb}$, depending on the value of $D$, and may not have analytical solutions. We would therefore need to solve $\frac{dT_{rb}}{dq_{rb}} = 0$ numerically to determine its roots. Since $0 \le q_{rb} \le 1$, we need to concern ourselves with only those values of $q_{rb}$ in the interval [0,1]. $T_{rb}$ has a minimum for $q_{rb} = q$, subject to the constraint $q \in [0,1]$, if $\left. \frac{d^2 T_{rb}}{dq_{rb}^2} \right|_{q_{rb}=q} > 0$. Proceeding in an identical manner, we can also numerically estimate the minimum for $T_{th}$: $T_{th}$ has a minimum for $q_{th} = q$, subject to the constraint $q \in [0,1]$, if $\left. \frac{d^2 T_{th}}{dq_{th}^2} \right|_{q_{th}=q} > 0$.

[Footnote 10: The processor allocation considers all processors in the underlying parallel machine when mapping both the type hierarchy and the rule-base. See Assumption 3 in Section 6.3.]

Figure 6.2: $T_{rb}$ versus $q_{rb}$.

6.7.3 Example

Suppose the parameters in Equations 6.47 and 6.48 are set to: $B_{th} = b_f = f = 5$; $B_{rb} = 8$; $k_{oth} = k_{orb} = k^h_{oth} = k^h_{orb} = k_{of} = 25$; $K_{orb} = 375$; $K_{oth} = 175$; $k_c = 10$; $k_n = 1$; $N = 32$; $D = 10$ and $t_s = 10^{-6}$. We have

\[
\begin{aligned}
\frac{dT_{rb}}{dq_{rb}} = {} & -35.0929 + 0.02283\, q_{rb} + 0.171225\, q_{rb}^2 + 1.1415\, q_{rb}^3 + 7.13437\, q_{rb}^4 + 42.8062\, q_{rb}^5 \\
& + 249.703\, q_{rb}^6 + 1426.88\, q_{rb}^7 + 8026.17\, q_{rb}^8 + 44589.8\, q_{rb}^9 - 9882.81\, q_{rb}^{10}
\end{aligned}
\]

\[
\begin{aligned}
\frac{dT_{th}}{dq_{th}} = {} & -24.7944 + 0.0116\, q_{th} + 0.087\, q_{th}^2 + 0.58\, q_{th}^3 + 3.625\, q_{th}^4 + 21.75\, q_{th}^5 \\
& + 126.875\, q_{th}^6 + 725\, q_{th}^7 + 4078.13\, q_{th}^8 + 22656.3\, q_{th}^9 - 6982.42\, q_{th}^{10}
\end{aligned}
\]

Numerically solving $\frac{dT_{rb}}{dq_{rb}} = 0$ gives $q_{rb} \approx 0.43$. The equation has ten roots of which eight are imaginary. The two real roots are $q_{rb} = 0.428494$ and $q_{rb} = 4.69177$. Thus, $q_{rb} = 0.43$ is the only root in the interval [0,1] and is a minimum since $\left. \frac{d^2 T_{rb}}{dq_{rb}^2} \right|_{q_{rb}=0.43} = 689.473$ is positive. Similarly, we find the minimum of $T_{th}$ to be at $q_{th} = 0.45$. Thus, for the parameter values listed above, the optimal values of $q_{rb}$ and $q_{th}$ are given by

\[ q_{rb} = 0.43 \qquad q_{th} = 0.45 \]

Figures 6.2 and 6.3 plot the variation of $T_{rb}$ and $T_{th}$ as $q_{rb}$ and $q_{th}$ vary in the interval [0,1] respectively. The parameter values from the above example have been used in generating these plots. In both figures, the plot on the left shows response time when $D = 10$; the plot on the right shows response time when $D = 5$ (top curve) and $D = 3$ (bottom curve).
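The roots above were obtained by solving the derivative polynomials numerically. A simpler, equally serviceable approach is a direct grid search over Equations 6.47 and 6.48; the sketch below (using the example parameters above; an illustration, not the shruti-cm5 tooling) reproduces the optima:

# Locating the optimal q by grid search over Equations 6.47 and 6.48.
# Parameter values are those of the example in Section 6.7.3.

def geom_sum(a, D):
    # sum_{i=0}^{D} a**i, handling the a == 1 limit explicitly
    if abs(a - 1.0) < 1e-12:
        return D + 1.0
    return (a ** (D + 1) - 1.0) / (a - 1.0)

def T_rb(q, b_f=5.0, B_rb=8.0, K_orb=375.0, k_c=10.0, N=32, D=10, t_s=1e-6):
    factor = geom_sum(b_f * q, D) + geom_sum(b_f, D) / N
    return factor * t_s * (B_rb * q + (B_rb + 2) * k_c * (1 - q) + K_orb)

def T_th(q, B_th=5.0, K_oth=175.0, k_c=10.0, k_n=1.0, N=32, D=10, t_s=1e-6):
    factor = geom_sum(B_th * q, D) + geom_sum(B_th, D) / N
    return factor * k_n * t_s * (B_th * q + (B_th + 2) * k_c * (1 - q) + K_oth)

grid = [i / 1000.0 for i in range(1001)]
print(min(grid, key=T_rb))   # approx 0.43
print(min(grid, key=T_th))   # approx 0.45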

6.8 Properties of Trb and Tth

In this section, we graphically explore the properties of $T_{rb}$ and $T_{th}$ in order to better understand the behavior of these functions as parameters change. We also derive the implications of these equations from the standpoint of system design.

We select the same parameter values as used in the example above: $B_{th} = b_f = f = 5$; $B_{rb} = 8$; $k_{oth} = k_{orb} = k^h_{oth} = k^h_{orb} = k_{of} = 25$; $k_c = 10$; $k_n = 1$; $N = 32$; $D = 10$ and $t_s = 10^{-6}$. The values of $K_{orb}$ and $K_{oth}$ are computed as needed.

Figure 6.3: $T_{th}$ versus $q_{th}$.

Figure 6.4: $T_{rb}$ versus $k_c$ and $q_{rb}$.

Figures 6.4-6.13 indicate the behavior of $T_{rb}$ and $T_{th}$ when one of the parameters is changed as $q$ varies between 0 and 1. Parameters which do not vary in a given plot are assigned values from the above list. In generating these graphs and in understanding their implications, we have generally assumed that $k_o = k_o^h$. On inspecting these figures, we note the following:

- Response time $T = T_{rb} + T_{th}$ is quite close to optimal as long as $q_{rb}$ and $q_{th}$ have values below some "threshold". Response time increases rapidly when the $q$ values increase beyond this threshold. This is more clearly brought out by Figures 6.2 and 6.3, and is applicable as long as $k_c$ is smaller than, or comparable to, the computation cost ($k_{orb}$ and $k_{oth}$) of processing a knowledge base element.

- When the cost of communication $k_c$ becomes significantly large in comparison to the computation cost, system performance begins to deteriorate for small values of $q$, as can be seen in Figures 6.4 and 6.9. Under these conditions, deviating from the optimal values of $q_{rb}$ and $q_{th}$ imposes significant performance penalties. Figures 6.14 and 6.15 plot the variation of $T_{rb}$ and $T_{th}$ when $k_c$ is significantly large.

- If the communication cost is much smaller than the computational cost of a knowledge base element, then $T_{rb}$ and $T_{th}$ are monotonically increasing, with a minimum value at $q_{rb} = q_{th} = 0$. In such a situation, the smaller the value of $q$, the better the response time. In other words, load balancing takes precedence over locality. Due to the small communication cost incurred when using active messages on the CM-5, this scenario is applicable and suggests that the performance of shruti-cm5 can be improved by improving load balancing.

Figure 6.5: $T_{rb}$ versus $N$ and $q_{rb}$.

Figure 6.6: $T_{rb}$ versus $b_f$ and $q_{rb}$.

Figure 6.7: $T_{rb}$ versus $k_{orb}$ and $q_{rb}$. $k^h_{orb} = k_{orb}$.

Figure 6.8: $T_{rb}$ versus $D$ and $q_{rb}$.

Figure 6.9: $T_{th}$ versus $k_c$ and $q_{th}$.

Figure 6.10: $T_{th}$ versus $N$ and $q_{th}$.

Figure 6.11: $T_{th}$ versus $B_{th}$ and $q_{th}$.

Figure 6.12: $T_{th}$ versus $k_{oth}$ and $q_{th}$. $k^h_{oth} = k_{oth}$.

Figure 6.13: $T_{th}$ versus $D$ and $q_{th}$.

Figure 6.14: $T_{rb}$ versus $q_{rb}$ when $k_c \gg k_{orb}$. $k_c = 100$.

Figure 6.15: $T_{th}$ versus $q_{th}$ when $k_c \gg k_{oth}$. $k_c = 100$.

- Changing most of the other parameters results in shifting the "threshold", but does not significantly change the shape of the $T_{rb}$ and $T_{th}$ curves.

- As expected, reducing the computational costs $k_o$ reduces the response time. Note that the analysis always considers average response time.

6.9 Estimating Parameters

The various parameters appearing in Equations 6.47 and 6.48 depend on the MPP machine, the knowledge base and the characteristics of the parallel reflexive reasoning system implementation. The machine parameters $t_s$ and $k_c$ can be determined by running relevant tests (see Section 6.4). The number of processors on the machine fixes $N$. The computational costs $k_{orb}$, $k_{oth}$, $k^h_{orb}$, $k^h_{oth}$ and $k_{of}$ depend on the characteristics of the parallel reflexive reasoning system, and can be estimated by running tests using the implementation in question. The branching factors $B_{rb}$ and $B_{th}$ are dependent on the knowledge base. These can be estimated by scanning the knowledge base and determining the average branching factors for the rule-base and type hierarchy; the shruti-cm5 preprocessor can compute these average branching factors. The query being posed determines $k_n$.

For shruti-cm5, estimates of these parameters are: $t_s \approx 5$ $\mu$sec and $k_c = k_c^h \approx 5$ as determined in Section 6.4; based on experiments, $k_o$ values range from 10-100 (Section 7.4 shows some examples). Chapter 7 provides ample examples of values for knowledge base parameters; the branching factors for the various knowledge bases used range from 2-10.
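As an illustration of the kind of scan involved (a minimal sketch with simplified, hypothetical data structures, not the actual shruti-cm5 preprocessor):

# Estimating the average branching factors B_rb and B_th by scanning
# the knowledge base (cf. Equations 6.27 and 6.28).

def branching_factors(rules, isas, n_predicates, n_concepts):
    # rules: list of (antecedent_predicate_list, consequent_predicate)
    # isas:  list of (subconcept, superconcept) pairs
    b_f = len(rules) / n_predicates                      # avg b_f(p): one consequent per rule
    b_r = sum(len(a) for a, _ in rules) / n_predicates   # avg b_r(p): antecedent occurrences
    b_u = len(isas) / n_concepts                         # avg b_u(c): each link has one subconcept
    b_d = len(isas) / n_concepts                         # avg b_d(c): ... and one superconcept
    return b_f + b_r, b_u + b_d                          # B_rb, B_th

For kb1, for instance, 250,000 rules over 50,000 predicates yields an average $b_f$ of about 5, consistent with the example values used in Section 6.7.3.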

6.10 Implications of the Analysis

6.10.1 Some Caveats

The actual implementation of shruti-cm5 has some differences compared with the assumptions made in the analysis presented in this chapter. Moreover, the analysis considers average performance, taking into account several simplifications and approximations. The result of the analysis is therefore expected to be approximate. The aim of the analysis is to direct some of the design decisions in addition to providing pointers for improvements and ball-park estimates for response time. The key differences between the analysis assumptions and the actual implementation are listed below:

- Shruti-cm5 stops spreading activation which has traveled $d_{Q_{max}}$ hops from its origin, even though the reasoning episode terminates only after $D = \min(2d_Q + 1, 2d_{Q_{max}} + 1)$ iterations. After $d_{Q_{max}}$ iterations, only reverse propagation of collector activation is allowed. This scheme ensures that unnecessary propagation of activation is avoided, thereby improving response time. If $D > d_{Q_{max}}$, the actual response time would be less than that predicted by the analysis, since the analysis assumes that all activation is propagated in all the $D$ iterations.

- Whenever a special rule does not fire due to any of the required conditions not being satisfied, or whenever a fact match is unsuccessful, the operation is retried. The operation could be successful later due to activation propagation in the type hierarchy. This situation is not accounted for in the analysis and could result in actual frontiers having more elements than predicted by the analysis.

- The analysis assumes that all facts pertaining to a predicate are matched. In the shruti-cm5 implementation, fact matching is aborted as soon as the collector of the relevant predicate bank becomes active.

- Synchronous and asynchronous operation can be intermixed when running reasoning episodes in shruti-cm5. The analysis handles either synchronous or asynchronous operation, but not a combination of both. See Section 6.10.4 for a detailed discussion.

6.10.2 Performance: Relation to Computation and Communication Costs

As brought out in Section 6.8, response time is close to optimal for small values of $q$ when the communication cost is small or comparable to the computation costs, i.e., load balancing is more important than locality. Thus, on machines like the CM-5, where communication is very efficient, especially when using low-overhead communication mechanisms like active messages, it is advantageous to aim for small $q$ values. Fortunately, it is easy to obtain small $q$ values using random processor allocation.

The analysis also provides guidance for distributing knowledge bases when using more expensive communication schemes. For example, when mapping shruti onto a network of workstations, we can use the results of the analysis to determine optimal distributions of a knowledge base using optimal $q$ values.

The analysis can also be extended to study the partitioning and mapping of general graph structures onto parallel machines. The nodes in the graph represent primitive units of computation while the links represent communication requirements. The results of the analysis can be interpreted to mean that random allocation of nodes is efficient (which, under the assumptions of this analysis, translates to load balancing taking priority over locality) when the granularity of problem decomposition is such that the communication cost is comparable to the computational cost of a primitive element. When the granularity of problem decomposition is such that the communication cost greatly outweighs the computational cost of processing a primitive element, processor allocation becomes crucial and a compromise between load balancing and locality should be enforced for optimal performance. On the other hand, when the computational cost of a primitive element is much larger than the communication cost, it is best to make maximum utilization of available parallelism by concentrating on effective load balancing. This should help determine the granularity of problem decomposition for effective mapping onto the underlying parallel machine.

6.10.3 Random Processor Allocation

When random processor allocation is used, predicates and entities are uniformly distributed over the processors of the MPP. We therefore have

\[ q_{rb} = q_{th} = \frac{1}{N} \tag{6.49} \]

When $N \gg 1$, i.e., the machine has a reasonably large number of processors, the above equation yields relatively small values for $q_{rb}$ and $q_{th}$. For example, on a 32 node CM-5, we would have $q_{rb} = q_{th} = 0.03$. When the communication cost $k_c$ is small compared to the computational cost of a primitive element, the analysis indicates that system performance, on average, is very close to optimal.

Furthermore, random processor allocation has the significant advantage of allowing incremental knowledge base encoding and processing. Encoding a large chunk of knowledge could alter both knowledge base parameters and $q$ values. Thus, large increments of the knowledge base would require processing of the entire knowledge base in order to recompute optimal $q$ values, and the reallocation of knowledge base elements to realize the required $q$ values. This expensive overhead can be avoided by using random processor allocation, whereby new chunks of knowledge can be added dynamically and quickly, without significant computational effort. Expensive static analysis of the entire knowledge base is unnecessary.

When $k_c$ is small or comparable to the computation cost of primitive elements, the performance penalty paid for using random processor allocation is small or negligible. For example, when using active messages on the CM-5, $k_c \approx 5$. With the emphasis on efficient interprocessor communication, it is reasonable to expect that most modern MPP machines will have $k_c$ values in the same ballpark. Using $k_c = 5$, and other parameter values from Section 6.7.3, random processor allocation results in a response time degradation of 3.62% in comparison with the optimal value.
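This last figure can be checked numerically by reusing the $T_{rb}$ and $T_{th}$ functions from the sketch in Section 6.7.3 (again an illustrative calculation with the example parameters, with $k_c$ lowered to 5 as stated above):

# Comparing random allocation (q = 1/N) against the grid-search optimum,
# with k_c = 5 and the remaining parameters from Section 6.7.3.

def T(q_rb, q_th, k_c=5.0):
    return T_rb(q_rb, k_c=k_c) + T_th(q_th, k_c=k_c)   # from the earlier sketch

grid = [i / 1000.0 for i in range(1001)]
t_opt = min(T_rb(q, k_c=5.0) for q in grid) + min(T_th(q, k_c=5.0) for q in grid)
t_rand = T(1 / 32, 1 / 32)
print(100.0 * (t_rand - t_opt) / t_opt)   # a few percent, in line with the 3.62% above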

6.10.4 Synchronous versus Asynchronous Operation

The expressions for $T_{rb}(i,x)$ and $T_{th}(i,x)$ (Equations 6.47 and 6.48) are independent of $x$, the processor under consideration. This is due to the fact that these equations represent average values of the respective response time components. Under these circumstances, the response time for asynchronous operation (Equation 6.2) and the response time for synchronous operation (Equation 6.3) would be the same, since the max operator will have no effect. Thus, on average, the response time for answering a query, $T_Q = T$, is identical irrespective of whether the reasoning episode is run synchronously or asynchronously. With asynchronous operation, however, there is greater variability in response time from run to run, due to the fact that either all $D$ iterations may not be completed, or more than $D$ iterations may be processed, before the reasoning episode terminates.

6.10.5 Knowledge Base Structure

In the analysis presented in this chapter, the structure of the knowledge base is essentially captured in the branching factors. Under the assumptions made in Sections 6.3 and 6.5, the branching factors combined with the $q$ values are sufficient to compute the average response time of queries. During processor allocation, however, where our goal is to achieve a certain value of $q$ for the given knowledge base, it is necessary to have more information about the structure of the knowledge base. Without explicitly using the structure of the knowledge base, it may not be possible to achieve the target value of $q$. For example, if a knowledge base has $N$ disjoint domains and the underlying machine has $N$ processors, it is straightforward to attain $q = 1$ by allocating each domain to a processor. It would be much more difficult to achieve $q = 1$ without utilizing this knowledge. The $q$-based processor allocation scheme (Section 5.3.5) does not explicitly utilize information about the knowledge base structure; this accounts for the fact that the procedure cannot always come up with an allocation which provides the target $q$ value. Making good use of knowledge base structure in processor allocation requires complex heuristics and is beyond the scope of this thesis.

Domains and Multiple Queries

Structured knowledge bases consisting of domains (see Section 7.1) warrant special consideration. In these knowledge bases, domains are constituted by clusters of predicates. Predicates within each domain are richly interconnected by rules. Rules that interconnect predicates in two different domains are either sparse or non-existent. When encoding such knowledge bases, each domain can be mapped onto the underlying MPP machine independently, with an optimal $q$ computed for each domain. The mapping of the knowledge base is a superposition of the individual domain mappings. Such a mapping begins to lose its effectiveness when the number of rules linking predicates in different domains increases. When there are a large number of such rules, the entire knowledge base should be considered as a whole instead of confining oneself to individual domains.

Consider a knowledge base with disjoint (or nearly disjoint) domains, mapped to a parallel machine using a superposition of individual domain mappings. Suppose a query $Q$ is posed to the knowledge base. If the query predicate belongs to domain $d$, there is a very high probability that activation for that reasoning episode will be confined to the predicates in domain $d$. The optimal distribution of these predicates on the processors of the machine ensures optimal response time.

The above mapping technique works well when only one query is posed to the system at any given time. When multiple queries are posed, it is possible that superposition mapping could lead to pockets of concentrated activity. Processors harboring these hot spots would become bottlenecks and slow down the reasoning episode. A better mapping is possible if we know the distribution of queries: suppose the queries posed at any given time are such that a fraction $q_1$ pertains to domain $d_1$, a fraction $q_2$ pertains to domain $d_2$, ..., and a fraction $q_m$ pertains to domain $d_m$, such that $q_1 + q_2 + \ldots + q_m = 1$. Then, partitioning the $N$ processors in accordance with these fractions will partition resources appropriately. Thus, domain $d_i$ will be optimally mapped onto a partition consisting of $q_i N$ processors. If $q_i$ is large, then a relatively large fraction of queries pertain to $d_i$; accordingly, a relatively large partition of processors will be available to efficiently process these queries. If the query distribution is initially unknown, the system could determine the distribution based on the queries that are posed, and the predicates could gradually migrate to eventually achieve the appropriate processor assignment. A sketch of this proportional partitioning appears below.
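A minimal sketch of the proportional partitioning just described (the apportionment rule used here, largest remainders, is one plausible choice and is not prescribed by the text):

# Apportioning N processors among domains according to query fractions,
# using a largest-remainder rule so the partition sizes sum to N.

def partition_processors(fractions, N):
    # fractions: per-domain query fractions q_1..q_m summing to 1.0
    shares = [f * N for f in fractions]
    sizes = [int(s) for s in shares]
    # hand out the remaining processors to the largest fractional parts
    leftover = N - sum(sizes)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - sizes[i],
                   reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

print(partition_processors([0.5, 0.3, 0.2], 32))   # e.g. [16, 10, 6]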

6.11 Glossary of Terms and Symbols

The following is a glossary of terms and symbols used in this chapter. This is essentially a collection of definitions from various sections in the chapter for quick reference.

6.11.1 Subscripts

Extra subscripts are used on several of the symbols to indicate specific components of the quantity represented by the symbol. These subscripts are summarized here. The glossary in the next section specifies which symbols take these subscripts.

rf  Spreading forward rule activation in the rule-base.
rr  Spreading reverse (collector) activation in the rule-base.
f   Fact matching in the rule-base.
u   Spreading bottom-up activation in the type hierarchy.
d   Spreading top-down activation in the type hierarchy.

6.11.2 Glossary

$\Phi_C^{i,x}$  Frontier set for the type hierarchy. $\Phi_C^{i,x} = Ca^{i,x} - Ca^{i-1,x}$.

$\Phi_P^{i,x}$  Frontier set for the rule-base. $\Phi_P^{i,x} = Pa^{i,x} - Pa^{i-1,x}$.

$arg_c(f)$  Number of arguments of fact $f$ that have been bound to entities.

$B_{rb}$  $B_{rb} = b_f + b_r$.

$B_{th}$  $B_{th} = b_u + b_d$.

$b_d$  Average value of $b_d(c)$ in KB.

$b_d(c)$  Downward branching factor for entity $c$, i.e., the number of default is-a relations with $c$ as the superconcept.

$b_f$  Average value of $b_f(p)$ in KB.

$b_f(p)$  Forward branching factor for predicate $p$, i.e., the number of rules with $p$ as the consequent.

$b_r$  Average value of $b_r(p)$ in KB.

$b_r(p)$  Reverse branching factor for predicate $p$, i.e., the number of rules with $p$ in the antecedent.

$b_u$  Average value of $b_u(c)$ in KB.

$b_u(c)$  Upward branching factor for entity $c$, i.e., the number of is-a relations with $c$ as the first argument.

$C$  Set of entities in the knowledge base KB.

$Ca^{i,x}$  Set of active entities on processor $x$ during iteration $i$. Subscripts $u$ and $d$ from Section 6.11.1 are allowed. See Section 6.2.4 for detailed definitions.

$D$  Total number of iterations needed to answer a query $Q$. $D = \min(2d_Q + 1, 2d_{Q_{max}} + 1)$ for a rule-base query; $D = \min(d_Q, d_{Q_{max}})$ for a type hierarchy query.

$d_Q$  Depth of inference chain, in number of hops, from query predicate $Q$ to the predicate with the relevant fact, or the concept under consideration in a type hierarchy query.

$d_{Q_{max}}$  Maximum query depth allowed; constitutes an upper bound on $d_Q$.

$F$  Set of facts in the knowledge base KB.

$F_p$  Set of facts that need to be matched for predicate $p$.

$f$  Average number of facts per predicate times average arity of predicates in KB.

Hop  Denotes activation traversing a rule or is-a link. Reverse propagation of collector activation is not considered to be a hop.

Iteration  The computation needed to process all knowledge base elements receiving activation which has traveled $i$ hops constitutes iteration $i$. This includes the computation needed to process collector activation. One iteration could span several passes through the body of the activation propagation loop (Figure 5.6), and each pass could partially advance iterations for several values of $i$.

KB  The knowledge base under consideration.

$K_{orb}$  $K_{orb} = (b_f + b_r) k_{orb} + 2k^h_{orb} + f k_{of}$.

$K_{oth}$  $K_{oth} = (b_u + b_d) k_{oth} + 2k^h_{oth}$.

$k_c$  Communication cost. $k_c = \frac{t_c}{t_s}$.

$k_c^h$  Cost of receiving a message. $k_c^h = \frac{t_c^h}{t_s}$.

$k_o$  Computation cost. $k_o = \frac{t_o}{t_s}$. Possible subscripts for $k_o$ are shown in Section 6.11.1.

$k_o^h$  Cost of executing the body of the receive activation procedure; includes cost of building the frontier. $k_o^h = \frac{t_o^h}{t_s}$. Possible subscripts for $k_o^h$ are shown in Section 6.11.1.

$k_{orb}$  Computational cost of spreading forward or reverse activation in the rule-base. $k_{orb} = k_{or_f} = k_{or_r}$.

$k^h_{orb}$  $k^h_{orb} = k^h_{or_f} = k^h_{or_r}$.

$k_{oth}$  Computational cost of spreading bottom-up or top-down activation in the type hierarchy. $k_{oth} = k_{ou} = k_{od}$.

$k^h_{oth}$  $k^h_{oth} = k^h_{ou} = k^h_{od}$.

$N$  Number of processors in the MPP machine, numbered from 0 to $N-1$.

$P$  Set of predicates in the knowledge base KB.

$Pa^{i,x}$  Set of all active predicates on processor $x$ during iteration $i$. Of the subscripts shown in Section 6.11.1, $rf$, $rr$ and $f$ are allowed. See Section 6.2.3 for detailed definitions.

$q$  Probability of finding related elements on the same processor.

$q_{rb}$  Value of $q$ in the rule-base.

$q_{th}$  Value of $q$ in the type hierarchy.

$T$  $T = \sum_{i=0}^{D} T(i,x)$.

$T_{rb}$  $T_{rb} = \sum_{i=0}^{D} T_{rb}(i,x)$.

$T_{th}$  $T_{th} = \sum_{i=0}^{D} T_{th}(i,x)$.

$T_Q$  Total time for answering query $Q$.

$T(i,x)$  Time for iteration $i$ on processor $x$. Section 6.11.1 shows possible subscripts for $T(i,x)$. $T(i,x) = T_{rb}(i,x) + T_{th}(i,x)$.

$T_{rb}(i,x)$  Time taken by processor $x$ to spread activation in the rule-base during iteration $i$. $T_{rb}(i,x) = T_{rf}(i,x) + T_{rr}(i,x) + T_f(i,x)$.

$T_{th}(i,x)$  Time taken by processor $x$ to spread activation in the type hierarchy during iteration $i$. $T_{th}(i,x) = T_u(i,x) + T_d(i,x)$.

$t_s$  Time needed to call an empty local procedure on any processor.

$t_c$  Time needed to call an empty procedure on a remote processor.

$t_c^h$  Time for receiving a message.

$t_o$  Time for computation that is independent of the location of knowledge base elements. Section 6.11.1 shows possible subscripts.

$t_o^h$  Time for executing the body of the receive activation procedure; includes time for building the frontier. Subscript usage is shown in Section 6.11.1.

Chapter 7

SHRUTI on the CM-5: Experiments with Large Knowledge Bases

Shruti-cm5 has been tested using (i) artificial knowledge bases containing over half a million rules and facts, and (ii) WordNet, a real-world lexical database (Miller et al., 1990). In this chapter we present these experimental results, which bring out the effectiveness of shruti-cm5 as a real-time reasoning system. We also experimentally validate some of the predictions made by the analysis in Chapter 6.

7.1 Generating Random Knowledge Bases

Part of the experimentation with shruti-cm5 has been carried out using randomly generated structured knowledge bases. Though the individual knowledge base elements are generated at random, these elements are organized into domains, thereby imposing structure on the knowledge base. Each domain is a cluster of predicates along with their associated rules and facts. Domains can be of two types: target domains, which correspond to "expert" knowledge about various real-world domains; and special domains, which represent basic cognitive and perceptual knowledge about the world. A typical structured knowledge base would consist of several target domains and a small number of special domains. The predicates within each domain are richly interconnected by rules. Predicates in each target domain are also richly connected by rules to predicates in the special domains. Predicates across different target domains, however, are sparsely connected. Predicates in two different special domains are left unconnected. The structure imposed on the knowledge base is a gross attempt to mimic a plausible structuring of real-world knowledge bases. This is motivated by the notion that knowledge about complex domains is learned and grounded in metaphorical mappings from certain basic, perceptually grounded domains (Lakoff and Johnson, 1980).

The concepts in the random knowledge bases are partitioned into several tree-structured type hierarchies. The branching factor of the type hierarchies is decided by the maximum depth of the type hierarchy. A random subset of all the possible links in each type hierarchy is used as is-a relations. The type hierarchies can share a fraction of the leaves.

One of the supporting modules for shruti-cm5 is a structured random knowledge base generator. The knowledge base generator takes several parameters as input which determine the number of predicates, entities, rules, facts and is-a relations to be generated, the fraction of special rules and facts, the number of domains, the distribution of the knowledge base among the domains and the fraction of inter- and intra-domain rules. The number and maximum depth of the type hierarchies generated can also be controlled by setting appropriate parameters. The parameters supplied to generate one of the knowledge bases used in the CM-5 experiments (identified in the figures as kb1) are shown in Figure 7.1. A sketch of the rule-generation step appears after Figure 7.1.

------------ Knowledge Base Parameters ------------
Number of rules: 250000
Number of facts: 250000
Number of predicates: 50000
Number of concepts: 50000
Multiple antecedent rule fraction: 0.10
Multiple instantiation rule fraction: 0.10
Special rule fraction: 0.40
Fraction of is-a facts: 0.25
Fraction of facts with E vars: 0.10
------------ Domain Parameters --------------------
Number of special domains: 3
Number of target domains: 150
Spl-Tgt knowledge base split: 0.02
Fraction of intra-special-domain rules: 1.00
Fraction of inter-special-domain rules: 0.00
Fraction of intra-target-domain rules: 0.96
Fraction of inter-target-domain rules: 0.01
Number of type hierarchies: 10
Maximum depth of type hierarchies: 5
Fraction of shared leaves in type hiers.: 0.05

Figure 7.1: Parameters for generating structured random knowledge base kb1.
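To make the domain structure concrete, here is a minimal sketch of how rules might be scattered within and across target domains according to such fractions (the representation and parameter names are hypothetical simplifications of the actual generator):

import random

# Generating intra- and inter-domain rules for target domains, given a
# rule budget and the intra-domain fraction (cf. Figure 7.1).

def generate_target_rules(domains, n_rules, intra_fraction=0.96):
    # domains: list of predicate lists, one per target domain
    rules = []
    for _ in range(n_rules):
        d = random.randrange(len(domains))
        antecedent = random.choice(domains[d])
        if random.random() < intra_fraction:
            consequent = random.choice(domains[d])     # intra-domain rule
        else:
            d2 = random.randrange(len(domains))        # inter-domain rule
            consequent = random.choice(domains[d2])    # (d2 may coincide with d;
        rules.append((antecedent, consequent))         #  special-domain links omitted)
    return rules

domains = [[f"P{d}_{i}" for i in range(325)] for d in range(150)]
rules = generate_target_rules(domains, n_rules=240_000)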

7.1.1 Generating Queries

Queries for experimenting with random knowledge bases are generated using the inference dependency graph processor (Section 5.7). To generate queries, the inference dependency graph processor starts by generating a random predicate $P$. Starting at this predicate, the inference dependency graph is traversed to reach predicates related to $P$. Let $P_{1_1}, \ldots, P_{1_{r_1}}$ be predicates directly linked to $P$, i.e., the knowledge base has the rules $[\,P_{1_1}(\ldots) \Rightarrow P(\ldots)\,], \ldots, [\,P_{1_{r_1}}(\ldots) \Rightarrow P(\ldots)\,]$. One of $P_{1_1}, \ldots, P_{1_{r_1}}$ is arbitrarily selected. Let $P_{1_i}$ be the selected predicate. A fact $P_{1_i}(C_1, \ldots, C_n)$ is chosen and is used to generate a query $P(C_1, \ldots, C_n)?$. When this query is posed, it will be answered with a depth of at most 1 if the rule $P_{1_i}(\ldots) \Rightarrow P(\ldots)$ fires. If the rule does not fire, say, because some special variable did not satisfy a required condition, then the query $P(C_1, \ldots, C_n)?$ may not be answered.

Traversing the inference dependency graph further, a query of depth 2 can be generated by considering predicates $P_{2_1}, \ldots, P_{2_{r_2}}$ which are directly related to $P_{1_1}, \ldots, P_{1_{r_1}}$. The inference dependency graph can be traversed to any required depth and random queries generated for predicate $P$. The process can be repeated for several different starting predicates. Queries of depth 0 (fact retrieval) can also be generated by considering the facts for the starting predicate $P$.

Note that since the type of a rule and the conditions that must be satisfied in order to fire that rule are not taken into account when generating queries, it is possible that some of the queries generated may not have an answer. Therefore, we generate a large number of queries for a given random knowledge base and use only those that have an answer.
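In outline, this amounts to a random backward walk of the requested depth through the inference dependency graph, followed by lifting the argument bindings of a fact found at the end of the walk. A minimal sketch (with hypothetical data structures; the actual processor is described in Section 5.7):

import random

# Random backward walk of the given depth through the inference
# dependency graph, then lift a fact's arguments into a query.

def generate_query(antecedents_of, facts_of, start, depth):
    # antecedents_of[p]: predicates q with a rule q(...) => p(...)
    # facts_of[q]: list of argument tuples asserted for q
    p = start
    for _ in range(depth):
        if not antecedents_of.get(p):
            return None              # walk dead-ends before reaching depth
        p = random.choice(antecedents_of[p])
    if not facts_of.get(p):
        return None
    args = random.choice(facts_of[p])
    return (start, args)             # pose start(args)?; answer depth <= depth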

------------ Knowledge Base Parameters ------------
Number of rules: 250000
Number of facts: 250000
Number of predicates: 32000
Number of concepts: 50000
Multiple antecedent rule fraction: 0.10
Multiple instantiation rule fraction: 0.10
Special rule fraction: 0.40
Fraction of is-a facts: 0.25
Fraction of facts with E vars: 0.10
------------ Domain Parameters --------------------
Number of special domains: 3
Number of target domains: 100
Spl-Tgt knowledge base split: 0.03
Fraction of intra-special-domain rules: 1.00
Fraction of inter-special-domain rules: 0.00
Fraction of intra-target-domain rules: 0.96
Fraction of inter-target-domain rules: 0.01
Number of type hierarchies: 10
Maximum depth of type hierarchies: 5
Fraction of shared leaves in type hiers.: 0.05

Figure 7.2: Parameters for generating structured random knowledge base kb2.

7.2 Experiments with Random Knowledge Bases

In experimenting with shruti-cm5 using random knowledge bases, we have used three artificial knowledge bases, each with over half a million elements. All the random knowledge bases have been generated using the random knowledge base generator described in Section 7.1. These knowledge bases, which we identify as kb1, kb2 and kb3, have generation parameters as shown in Figures 7.1, 7.2 and 7.3 respectively.^1 Each random knowledge base, containing about 500,000 elements, is split into five fragments of about 100,000 elements. These fragments are loaded successively in investigating the effect of knowledge base size. For kb3, in spite of having generated 500,000 elements, we use only 300,000 elements in our experimentation (see Section 7.2.2).

Queries for experimenting with random knowledge bases were generated using the inference dependency graph processor. For each of kb1, kb2 and kb3, about 500-600 random queries were generated, of which some 300 were answered and had depths ranging from 0 (fact retrieval) to 8. These queries were repeatedly run 5 times and the resulting data used to investigate the performance of shruti-cm5. In the experimental results plotted below, points represent average values, point ranges shown are 95% confidence intervals, and the curves shown are piece-wise best-fits. The curves are mostly quadratic best-fits, while some are exponential or linear best-fits. The timing reported is the elapsed time (Section 5.5.12) needed to process the queries.

The random, structured knowledge bases used in these experiments exploited the full functionality of the reasoning system and had a mix of regular rules and facts, rules with special variables, quantified facts and is-a relations. Rules with special variables included: rules with repeated variables, typed variables, existential variables and entities; rules with multiple predicates in the antecedent; and rules which lead to multiple instantiation of predicates.

[Footnote 1: We have experimented with a total of eight random knowledge bases with a wide range of generation parameters. We report results for only three of these knowledge bases: kb1, kb2 and kb3. Even though kb1 and kb2 are similar in comparison with kb3, these three knowledge bases illustrate representative aspects of shruti-cm5 performance. Experiments with the other five random knowledge bases produce results which are very similar to one of these three, and are therefore not reported in this thesis.]

------------ Knowledge Base Parameters ------------
Number of rules: 250000
Number of facts: 250000
Number of predicates: 31720
Number of concepts: 31720
Multiple antecedent rule fraction: 0.10
Multiple instantiation rule fraction: 0.10
Special rule fraction: 0.40
Fraction of is-a facts: 0.25
Fraction of facts with E vars: 0.10
------------ Domain Parameters --------------------
Number of special domains: 0
Number of target domains: 1
Spl-Tgt knowledge base split: 0.00
Fraction of intra-special-domain rules: 0.00
Fraction of inter-special-domain rules: 0.00
Fraction of intra-target-domain rules: 1.00
Fraction of inter-target-domain rules: 0.00
Number of type hierarchies: 10
Maximum depth of type hierarchies: 5
Fraction of shared leaves in type hiers.: 0.05

Figure 7.3: Parameters for generating structured random knowledge base kb3.

Most of the experimentation has been carried out on a 32 node machine, while some have been run on 64 and 128 node CM-5 machines. All experiments with random knowledge bases use random processor allocation. Furthermore, the reasoning episodes elicit little activity in the type hierarchy. We therefore concentrate on activity in the rule-base. WordNet (Section 7.3) exercises the type hierarchy of shruti-cm5.

7.2.1 Experimental Results: Knowledge Base Size and Query Depth

Figures 7.4-7.9 show the performance of shruti-cm5 with random knowledge base kb1. Figure 7.4 plots response time for varying query depths and knowledge base sizes. When the knowledge base size is about 200,000 or smaller, the response time for different query depths is essentially linear. Given the random nature of rule generation, long chains of rules are less likely for knowledge bases of this size. Hence, very few queries of depth greater than 5 get answered. At these sizes, the average branching factor for the knowledge base is around 2-3 and queries of depth 5 activate about 200-300 predicates. With 50,000 predicates partitioned among a total of 153 domains (see Figure 7.1), each domain has about 325 predicates. Thus, activation is, for the most part, confined to the query domain (the domain in which the query was posed) and source domains.

As the size of the knowledge base increases, with a proportional increase in average branching factor, the curve for each knowledge base size can be partitioned into two parts: for depths up to about 3 the response time increases steeply; beyond that, the rate of response time increase is lower. This behavior can be explained by the structure of the knowledge base. As query depth increases from 0 up to about 2-3, all the predicates in the query target domain and source domains become completely active. Beyond that, the rate at which response time increases depends on the number of active predicates in other target domains, and hence, on the number of rules that link predicates in different target domains. As the knowledge base size increases, one would expect the number of inter-domain rules to increase, and hence, the response time increases at higher rates, as brought out by the top three curves in the figure.

Figure 7.4: Query depth vs. response time. Kb1 on a 32 PE CM-5.

Figure 7.5: Query depth vs. number of rules fired in a reasoning episode. Kb1 on a 32 PE CM-5.

Figure 7.6: Query depth vs. number of active messages exchanged in a reasoning episode. Kb1 on a 32 PE CM-5.

Figure 7.7: Query depth vs. number of active predicates in a reasoning episode. Kb1 on a 32 PE CM-5.

Figure 7.8: Query depth vs. time needed to fire a rule. Kb1 on a 32 PE CM-5.

Figure 7.9: Query depth vs. time needed to process an active predicate. Kb1 on a 32 PE CM-5.

Figure 7.10: Visualization of the growth of domain activation.

Since firing a rule requires an active message and could potentially activate a new predicate, the number of rules fired, the number of active messages exchanged and the number of active predicates in the knowledge base are closely related. Figures 7.5-7.7 show that this is indeed the case for knowledge base kb1. Moreover, the number of rules fired (and hence, the number of active messages and the number of active predicates) also influences response time, as is evident from the similarity between these graphs and the response time in Figure 7.4. Note that response time would be proportional to the number of rules fired only if the processor load is relatively balanced; random processor allocation ensures that this is indeed the case. A rule firing will activate a new predicate (as opposed to instantiating a new bank of an already active predicate) with relatively high probability only when the number of active predicates is small compared to the total number of predicates in the knowledge base.

Figure 7.8 shows the time needed to fire a rule as a function of knowledge base size and query depth. If $r$ rules fire in a reasoning episode which takes $t$ sec, we compute the time per rule as $t/r$ sec. In plotting this graph, we are assuming that the computational cost of processing rules constitutes a major fraction, if not all, of the response time. In shruti-cm5 with random knowledge bases, this assumption is valid when a reasonably large number of rules fire in a given reasoning episode. Under these conditions, the time needed per rule firing settles to a small, relatively constant value of about 8 $\mu$sec. The "time-per-rule" metric represents the average time needed to process a rule. We therefore expect this value to be constant for a given machine, irrespective of the knowledge base under consideration, provided the above assumption holds. That this is indeed the case will become evident in Section 7.2.2, where we present results from kb2 and kb3. Section 7.2.3 discusses the effect of the CM-5 partition size on this metric.

Figure 7.9 is closely related to Figure 7.8, again due to the relation between rule firings and active predicates. At larger depths, more predicates are active and the fraction of rules which activate new predicates is smaller. This accounts for the increase in processing time per active predicate at the larger depths.

Figure 7.10 is a visualization of the growth of activation in the knowledge base for a random query of depth 5. The figure shows each domain in the knowledge base as an element in a two-dimensional array; the first row represents domains 1-13, the second row shows domains 14-26, and so on. The query, posed to a knowledge base of type kb1 which had about half a million elements, is answered synchronously in about 0.6 sec. Domains 1-3 are source domains while the remaining 150 domains are target domains. The query is posed in domain 90. The figure is the result of an animation, using viz, the shruti-cm5 visualizer, where, in each step of the reasoning episode, a square whose size is proportional to the number of active predicates in that domain is drawn. Squares from previous steps are not erased as the reasoning episode proceeds. This results in showing the rate of growth, since the number of active elements in any domain is monotonically increasing. For example, domain 3 grows slowly at first, but increases its growth rate towards the end; domain 1, on the other hand, starts out by increasing the number of active elements quickly at first and then saturates as the entire domain becomes active.

7.2.2 Experimental Results -- Knowledge Base Structure

Experimental results obtained with random knowledge bases kb2 and kb3 are shown in Figures 7.11-7.16 and Figures 7.17-7.22 respectively. Kb2 has fewer domains and fewer predicates compared to kb1, but the same number of rules. Hence, kb2 has a higher density of rules per domain and also a higher branching factor. As a result, reasoning episodes activate a much larger number of predicates, fire more rules and exchange more messages--thereby taking more time to respond to queries on average. The number of rules fired, the number of active predicates and the number of active messages exchanged for depth greater than 3 increase at a rate less than that for kb1; in fact, the curves even show some signs of saturation. Again, time-per-rule is constant when a large number of rules are fired, and has essentially the same value (about 8 μsec) as for kb1.

Kb3 has a single domain--in other words, the knowledge base is completely unstructured. Activation from any predicate can spread to any other predicate relatively easily. From Figure 7.20 we can see that for queries of depth about 5 or more, almost all predicates in the knowledge base become active, even for knowledge bases with only about 300,000 elements. The effect of this saturation is evident in Figures 7.18-7.20 and 7.22. Response time also increases significantly. Note that in plotting the graph showing the number of rules fired, a predicate P and a rule R having P as the consequent can contribute up to K2 rule firings--one possible rule firing for each bank of the predicate. Since the entire knowledge base becomes active even when there are only 300,000 elements (of which 150,000 are rules), we have not loaded larger knowledge bases of this type. Observe that even in the case of an unstructured knowledge base such as kb3, the time-per-rule value is about 8 μsec on a 32 node CM-5.
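This per-rule figure squares with the aggregate throughput reported in Chapter 8--about 125,000 rule firings per second on a 32 node CM-5:

\[
\frac{t}{r} = \frac{1\ \text{sec}}{125{,}000\ \text{rule firings}} = 8\ \mu\text{sec per rule firing.}
\]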

7.2.3 Experimental Results -- Effect of Parallelism

We have experimented with knowledge base kb2 running on 64 PE and 128 PE CM-5 partitions. In this section, we compare the performance of shruti-cm5 on these larger partitions with that on the 32 node CM-5 reported in Section 7.2.2. Subsequently, in Section 7.3.2, we will compare the performance of shruti-cm5 on a 32 node CM-5 with that on a serial workstation.

Figures 7.23-7.25 and 7.26-7.28 show experimental results on 64 node and 128 node CM-5 machines respectively. We again note the similarity between the graphs for response time and the number of rules fired. We also note that the time-per-rule metric has a value of about 4 μsec on the 64 PE CM-5 and about 2 μsec on the 128 PE CM-5. We recall that the value was about 8 μsec for a 32 PE CM-5. These results can be related as follows. If r rules fire in a reasoning episode which takes t sec, we compute the time per rule as t/r sec. If the computational load is reasonably balanced--as is the case with random processor allocation--each of the N processors in the machine handles r/N rules in t sec. The time per rule per processor would therefore be

\[
T = \frac{t}{r/N} = \frac{Nt}{r}.
\]

T is the time taken, on average, by any given processor to fire a single rule (subject to the assumptions stated in Section 7.2.1). Thus, T should remain constant irrespective of the number of processors on the machine and the structure of the knowledge base under consideration.

Figure 7.11: Query depth vs. response time. Kb2 on a 32 PE CM-5.

Figure 7.12: Query depth vs. number of rules fired in a reasoning episode. Kb2 on a 32 PE CM-5.

Figure 7.13: Query depth vs. number of active messages exchanged in a reasoning episode. Kb2 on a 32 PE CM-5.

Figure 7.14: Query depth vs. number of active predicates in a reasoning episode. Kb2 on a 32 PE CM-5.

Figure 7.15: Query depth vs. time needed to fire a rule. Kb2 on a 32 PE CM-5.

Figure 7.16: Query depth vs. time needed to process an active predicate. Kb2 on a 32 PE CM-5.

Figure 7.17: Query depth vs. response time. Kb3 on a 32 PE CM-5.

Figure 7.18: Query depth vs. number of rules fired in a reasoning episode. Kb3 on a 32 PE CM-5.

Figure 7.19: Query depth vs. number of active messages exchanged in a reasoning episode. Kb3 on a 32 PE CM-5.

Figure 7.20: Query depth vs. number of active predicates in a reasoning episode. Kb3 on a 32 PE CM-5.

Figure 7.21: Query depth vs. time needed to fire a rule. Kb3 on a 32 PE CM-5.

Figure 7.22: Query depth vs. time needed to process an active predicate. Kb3 on a 32 PE CM-5.

Figure 7.23: Query depth vs. response time. Kb2 on a 64 PE CM-5.

Figure 7.24: Query depth vs. number of rules fired in a reasoning episode. Kb2 on a 64 PE CM-5.

Figure 7.25: Query depth vs. time needed to fire a rule. Kb2 on a 64 PE CM-5.

Figure 7.26: Query depth vs. response time. Kb2 on a 128 PE CM-5.

Figure 7.27: Query depth vs. number of rules fired in a reasoning episode. Kb2 on a 128 PE CM-5.

Figure 7.28: Query depth vs. time needed to fire a rule. Kb2 on a 128 PE CM-5.

Figure 7.29: Speedup curves for 64 PE and 128 PE CM-5 (speedup with respect to the 32 PE CM-5).

KB Size    Average Speedup
           64 PE CM-5   128 PE CM-5
329871     1.82         2.96
440201     1.67         3.10
550049     1.76         3.39

Table 7.1: Average speedup for various knowledge base sizes.

We note that this is indeed the case:

\[
32 \times 8\ \mu\text{sec} = 64 \times 4\ \mu\text{sec} = 128 \times 2\ \mu\text{sec} = 256\ \mu\text{sec}.
\]

The speedup obtained for various query depths when running shruti-cm5 on 64 and 128 node machines, as compared to its performance on a 32 PE CM-5, is shown in Figure 7.29. The speedup is computed by taking the ratio of average response times for each knowledge base size and query depth. Table 7.1 shows the speedup obtained by averaging over all query depths for a given knowledge base size. The maximum speedup possible for 64 and 128 node machines (over a 32 node machine) is 2 and 4 respectively. But the graphs in Figure 7.29 have values which are larger in some cases. This can be explained by the fact that the reasoning episodes run asynchronously; the number of rules fired and the number of active predicates can vary randomly. In fact, we observe that the number of rules fired is smaller in some cases and results in a faster termination of the reasoning episode.

7.2.4 Real-Time Reasoning

As shown in the previous section, the time per rule per processor is approximately T = 256 μsec; in other words, a processor takes 256 μsec to process a single rule. If |KB| is the size of the knowledge base, and a fraction r of this knowledge base becomes active in an average reasoning episode,[2] the average number of rules fired is |KB| × r, and the average response time of shruti-cm5 on an N processor CM-5 is

\[
\frac{|KB| \, r \, T}{N}.
\]

In a real-time reasoning system, let T_max be the maximum response time that the application can tolerate. Then, we require

\[
T_{max} \;\geq\; k \, \frac{|KB| \, r \, T}{N}
\]

where k is a factor of safety whose value depends on the severity of the penalty for response time exceeding T_max. Based on this equation, we can estimate the largest knowledge base that will provide real-time responses on a given CM-5 partition. Conversely, given a knowledge base, we can determine the size of the CM-5 partition needed in order to obtain real-time performance.

[2] We are still assuming that the computation cost is predominantly from the rule-base, and that the computation is uniformly distributed over all the processors of the machine.
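As an illustration of how this bound might be used--the budget T_max, safety factor k and active fraction r below are assumed values chosen for the example, not measurements--take T = 256 μsec, N = 32, T_max = 1 sec, k = 2 and r = 0.1. Then

\[
|KB| \;\le\; \frac{N \, T_{max}}{k \, r \, T} \;=\; \frac{32 \times 1}{2 \times 0.1 \times 256 \times 10^{-6}} \;\approx\; 625{,}000
\]

knowledge elements can be supported within the response-time budget on this partition.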

7.3 Experiments with WordNet

WordNet is an on-line lexical database which attempts to organize information in terms of word meanings and their inter-relationships (Miller et al., 1990). In WordNet, the concept of a lexical matrix is used to organize and group words. Table 7.2 illustrates a lexical matrix. The matrix relates "word forms"--physical representations of words--and "word meanings"--concepts expressed by words. An entry E_ij in the lexical matrix[3] indicates that the word form F_j can be used to express the meaning specified by M_i. A word form, like F2, is polysemous and represents multiple meanings if its column has multiple entries. Each entry represents one "word sense" for the word form. On the other hand, word forms having entries in the same row are synonymous and can be used interchangeably to represent the respective concept (in an appropriate context).

                    Word Forms
Word Meanings   F1      F2      F3     ...    Fn
M1              E_1,1   E_1,2
M2                      E_2,2
M3                              E_3,3
...                                    ...
Mm                                            E_m,n

Table 7.2: The lexical matrix. Adapted from Miller (1990).

WordNet simulates a lexical matrix by using groups of synonymous words, termed synsets, to represent meanings.[4] For example, in Table 7.2, meaning M1 can be represented by the synset {F1, F2, ...} and M2 can be represented by {F2, ...}. Note that the F2's appearing in the synsets for M1 and M2 represent different word senses. This can be explicated by using a superscript following the word form to represent word senses. Then the synsets for M1 and M2 would be {F1^1, F2^1, ...} and {F2^2, ...}. Note that, by definition, a word sense F_j^k of word form F_j can appear in the synset for exactly one concept. In other words, the word senses which constitute a synset are unique.

An example lexical matrix extracted from the WordNet database is shown in Table 7.3.[5] Here, each entry E_ij is indicated as a bullet. A superscript number at an entry indicates the word sense of the respective word form. Word forms bird, dame, chick and lady are polysemous. Word forms with entries in a given row are synonymous: for example, bird, dame and chick are synonyms. The meaning "young woman or girl" can be represented by the synset {bird^1, dame^1, chick^1}; similarly, "bird hunted as game" is represented by the synset {bird^4, fowl^1}, and so on.

[3] Note that the lexical matrix in Table 7.2 is partial in that not all E_ij entries are explicitly shown.
[4] In many cases, when the synset by itself cannot unambiguously convey the concept being referred to, a textual gloss or definition is used to clarify the meaning of the synset.
[5] In order to keep the size manageable, the table does not include all the information WordNet has for the words; only a small sample is chosen.

[Table 7.3 relates the word forms bird, dame, chick, fowl, lady, biddy, madam, shuttlecock and peeress to the meanings "polite name for any woman", "young woman or girl", "woman of aristocratic family", "woman of refinement", "shuttle used in badminton", "egg-laying vertebrate with wings", "bird hunted as game" and "young bird esp. of domestic fowl"; the individual sense-numbered entries are not reproduced here.]

Table 7.3: An example lexical matrix.

Nouns                             Verbs
Symbol   Relation                 Symbol   Relation
!        Antonym                  !        Antonym
@        Hypernym                 @        Hypernym
~        Hyponym                  ~        Hyponym
#m       Member meronym           *        Entailment
#s       Substance meronym        >        Cause
#p       Part meronym             ^        Also see
%m       Member holonym
%s       Substance holonym        Adverbs
%p       Part holonym             Symbol   Relation
=        Attribute                !        Antonym
                                  \        Derived from adj
Adjectives
Symbol   Relation
!        Antonym
&        Similar to
\        Pertainym
=        Attribute
^        Also see

Table 7.4: Relations in WordNet.

Synsets in WordNet are classified into nouns, verbs, adjectives and adverbs, based on the part of speech which the words fall into. Synsets are related to each other by a number of relations; relations can also exist between the individual word senses constituting synsets. The actual relations that can be used depend on the part of speech. Relations fall into two categories: lexical relations between individual word senses, and semantic relations between synsets. Relations like subconcept (hyponym), superconcept (hypernym), member of, part of, etc. are semantic relations between synsets. For example, the Animal concept is a hypernym of the Bird concept, and relates the synset for Animal to the synset for Bird. Antonymy is a widely used lexical relation between word senses. For example, the word form brother has two senses as a noun in the WordNet database: (i) a fellow member, usually of some religious group, and (ii) a male with the same parents as someone else. Of these two senses of brother, the second sense--represented by the synset {brother, blood brother}[6]--has the word sense of sister meaning female sibling as its antonym. The first word sense of brother does not have an antonym.

Table 7.4 gives a list of relations permitted in WordNet. Relations in normal font are semantic while those in italics are lexical. Some of the semantic relations are not transitive, and these are shown in sans serif font. Note that all lexical relations are non-transitive by definition. The table also shows the symbol used to denote a given relation in WordNet. While some of the relations are self-explanatory, others are non-obvious. A definition of the relations used in WordNet appears in Appendix C.

WordNet comes with both a text-based and a graphical interface which can be used to access and explore the WordNet database on Unix workstations, IBM and Apple personal computers, and other machines. The text-based interface is provided by a serial program called wn. WordNet (version 1.4) has about 95,000 word forms, grouped into 75,000 synsets. For a discussion of the organization of synsets in the different parts of speech, and other details concerning the text and graphical WordNet interfaces, the reader is referred to (Miller et al., 1990).

[6] Word collocations, or phrases like blood brother, are treated as words in WordNet.

Figure 7.30: Translating WordNet to shruti.

7.3.1 Mapping WordNet to Shruti

One of the main reasons we have encoded WordNet on shruti-cm5 is to evaluate its performance with large, non-random knowledge bases. In addition, the advantages of translating WordNet to shruti and running it on shruti-cm5 are:

- The parallelism inherent in shruti-cm5 provides rapid responses when using WordNet.
- Using the rule-based reasoning capability of shruti-cm5, WordNet can be integrated with other rule-like knowledge.
- Shruti-cm5 provides incremental knowledge base processing, which can be used to easily explore the effect of new knowledge.

In summary, WordNet on shruti-cm5 can be used in a variety of rapid reasoning applications and explorations.

A straightforward way to translate WordNet to shruti is to represent synsets as nodes and the word senses constituting the synset as satellite nodes of the synset node. The synset to word sense links are bi-directional and carry activation tagged with any relation--i.e., bi-directional ALL TAG links (Section 5.3.3). Thus, (i) whenever a word sense node is activated, its associated synset node will become active; and (ii) when a synset node becomes active, all the word senses constituting the synset will be activated. Semantic relations are encoded as labeled links connecting the two synset nodes, and lexical relations are encoded as labeled links between word sense nodes. Labeled links representing relations transmit only activation tagged with the appropriate relational label.

This simple encoding, however, has the following problem. Whenever any word sense node is activated, all the word sense nodes for the respective synset also get activated. This is unacceptable, since it amounts to asserting X R Y for any relation R given any two word senses, X and Y, in the same synset. To circumvent this problem, the WordNet database has been mapped to shruti by using the following mapping (Figure 7.30):

- Synsets are represented by a pair of nodes: in and out. The in node receives all incoming links from other synsets, and links going to other synsets originate from the out node. A synonym link connects the synset out node to its in node. This ensures separation of the bi-directional activation flow between word sense and synset nodes.

- Word senses are, as before, represented using single nodes. The word sense nodes interact with the respective synset nodes in the following manner. Activating the in node causes all the word senses in the synset to become active. Activating any of the word sense nodes activates the out node. In other words, word sense nodes communicate with the out node via unidirectional ALL TAG links; the in node is also linked to the word sense nodes using unidirectional ALL TAG links.

- Each word form is also represented by a node. The word form node is linked to all the word sense nodes for that word form via (unidirectional) ALL TAG links. The links are such that activating the word form node activates all the word sense nodes; in the other direction, STOP TAG links are used so that activation from a word sense node reaches the word form node and then stops propagating. Thus, if the word form node is activated via one of its word sense nodes, this activation will not spread to all the word sense nodes. But if the word form node is directly activated, all the word sense nodes will become active.

- Relations are encoded using labeled is-a relations (Section 5.3.3). A semantic relation between two synsets is represented by a labeled link from the out node of the first (source) synset to the in node of the second (destination) synset. If the relation is lexical, the labeled link connects the respective word sense nodes.

Figure 7.30 shows the encoding of two synsets S1 = {X1, X2, X3, X4} and S2 = {Y1, Y2, Y3, Y4}. The figure represents two semantic synset-to-synset relations, S2 R1 S1 and S1 R2 S2. The lexical relation R3 links Y1 to X1: Y1 R3 X1. The links connecting the synset nodes and the word sense nodes transmit all activation irrespective of its associated tag. The labeled relation links only transmit activation which has the appropriate relation as its associated tag. The synonym links transmit only activation tagged with synonym. Note that when any word sense node in a synset is activated with synonym as tag, then all the word senses in the synset--i.e., all the synonyms--become active. Furthermore, whenever the in node of a synset becomes active by virtue of activation arriving along a relation link representing a transitive relation, the out node also becomes active through the word sense nodes. Thus, activation for transitive relations continues to propagate, while the propagation of activation for non-transitive relations is inhibited.
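To make the mapping concrete, here is a minimal sketch of this encoding in C. The type names, tag constants and helper functions (Node, Link, connect, encode_synset and so on) are illustrative stand-ins of our own--not the actual shruti-cm5 data structures, which distribute nodes and links across the CM-5 processors. Nodes are assumed to be zero-initialized; the driver reproduces the two-synset example of Figure 7.30.

```c
#include <stdlib.h>

/* Activation tags on links: ALL_TAG links pass any activation, STOP_TAG
 * links pass activation but stop further propagation at the destination,
 * and LABELED links pass only activation tagged with their relation. */
typedef enum { ALL_TAG, STOP_TAG, LABELED } LinkKind;

typedef struct Node Node;

typedef struct Link {
    Node        *dest;
    LinkKind     kind;
    int          relation;   /* meaningful only for LABELED links */
    struct Link *next;
} Link;

struct Node {
    Link *out;               /* outgoing links; NULL when zero-initialized */
};

enum { REL_SYNONYM = 0 };    /* illustrative relation codes */

/* Add a directed link from src to dest. */
static void connect(Node *src, Node *dest, LinkKind kind, int relation)
{
    Link *l = malloc(sizeof *l);
    l->dest = dest;
    l->kind = kind;
    l->relation = relation;
    l->next = src->out;
    src->out = l;
}

/* A synset is an in/out node pair; word senses hang off both. */
typedef struct { Node in, out; } Synset;

static void encode_synset(Synset *s, Node *senses[], int nsenses)
{
    /* The out -> in synonym link separates the bi-directional flow. */
    connect(&s->out, &s->in, LABELED, REL_SYNONYM);
    for (int i = 0; i < nsenses; i++) {
        connect(&s->in, senses[i], ALL_TAG, -1);   /* in activates senses */
        connect(senses[i], &s->out, ALL_TAG, -1);  /* any sense activates out */
    }
}

/* Word form node: form -> senses via ALL_TAG; senses -> form via STOP_TAG,
 * so activation arriving from one sense does not spread to the others. */
static void encode_word_form(Node *form, Node *senses[], int nsenses)
{
    for (int i = 0; i < nsenses; i++) {
        connect(form, senses[i], ALL_TAG, -1);
        connect(senses[i], form, STOP_TAG, -1);
    }
}

/* Semantic relation between synsets: out(src) -> in(dst). */
static void encode_semantic_relation(Synset *src, Synset *dst, int rel)
{
    connect(&src->out, &dst->in, LABELED, rel);
}

/* Lexical relation between two word senses. */
static void encode_lexical_relation(Node *src, Node *dst, int rel)
{
    connect(src, dst, LABELED, rel);
}

int main(void)
{
    static Node x[4], y[4], form;              /* zero-initialized */
    Node *s1_senses[] = { &x[0], &x[1], &x[2], &x[3] };
    Node *s2_senses[] = { &y[0], &y[1], &y[2], &y[3] };
    static Synset s1, s2;

    encode_synset(&s1, s1_senses, 4);
    encode_synset(&s2, s2_senses, 4);
    encode_word_form(&form, s1_senses, 4);
    encode_semantic_relation(&s2, &s1, 1);      /* S2 R1 S1 */
    encode_semantic_relation(&s1, &s2, 2);      /* S1 R2 S2 */
    encode_lexical_relation(&y[0], &x[0], 3);   /* Y1 R3 X1 */
    return 0;
}
```

The out-to-in synonym link is the detail that prevents sense-to-sense leakage from asserting spurious X R Y relations, mirroring the discussion above.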

7.3.2 Experimental Results

Using the above translation scheme, the entire WordNet database of about 75,000 synsets translates into a shruti-cm5 knowledge base of about 880,000 is-a relations. As described in Section 5.3.3, shruti-cm5 has been extended to support labeled is-a relations of the form is-a R (A,B), which represents A →R B. For example, is-a hypernym (Bird,Sparrow) asserts that Bird is a hypernym (or superconcept) of Sparrow. Furthermore, the system supports both specific queries like is-a hypernym (Bird,Sparrow)? ("Is Bird a hypernym of Sparrow?") and enumeration queries like is-a hypernym (Bird,x)? ("Enumerate entities which have Bird as a hypernym"). Note that even though WordNet exercises only the type hierarchy of shruti-cm5, the similarity of activation propagation in the type hierarchy and the rule-base ensures that WordNet helps evaluate the effectiveness of shruti-cm5 in terms of its ability to handle large knowledge bases in real-time.

Using shruti-cm5, the translated WordNet knowledge base was loaded onto the CM-5. Experiments involved running several regular and enumeration queries (Section 5.3.4). Table 7.5 lists the queries used. Queries where the second argument is a variable are enumeration queries. As mentioned in Section 5.5.3, enumeration queries are always run synchronously. In order to be able to compare regular and enumeration queries, and to limit random variations (especially for analysis validation in Section 7.4.2), all queries are run synchronously.

Response times for the queries in Table 7.5 are shown in Table 7.6. Column 4 in Table 7.6 shows response time on a 32 node CM-5 using random processor allocation. Queries were repeatedly run ten times and the times reported are average values.

Query No.   Shruti-cm5 Query
01          is-a hypernym (Bird,x)?
02          is-a hypernym (Bird,Sparrow)?
03          is-a hypernym (Animal,Sparrow)?
04          is-a hypernym (Entity,Sparrow)?
05          is-a hyponym (Sparrow,x)?
06          is-a hyponym (Sparrow,Bird)?
07          is-a hyponym (Sparrow,Animal)?
08          is-a hyponym (Sparrow,Entity)?
09          is-a synonym (Bird,Dame)?
10          is-a synonym (Bird,x)?
11          is-a antonym (Heavy,Light)?
12          is-a antonym (Heavy,x)?
13          is-a synonym (Heavy,Weighty)?
14          is-a synonym (Heavy,x)?
15          is-a similar (Heavy,Massive)?
16          is-a similar (Heavy,x)?
17          is-a pertainym (Stellar,Star)?
18          is-a pertainym (Stellar,x)?
19          is-a attribute (Deep,Depth)?
20          is-a attribute (Deep,x)?
21          is-a haspart (Bird,Web)?
22          is-a haspart (Bird,x)?
23          is-a hassubst (Organism,Tissue)?
24          is-a hassubst (Organism,x)?
25          is-a haspart (Cell,Gene)?
26          is-a haspart (Cell,x)?
27          is-a hasmember (Flock,Bird)?
28          is-a hasmember (Flock,x)?
29          is-a ispartof (Rudder,Airplane)?
30          is-a ispartof (Rudder,x)?
31          is-a cause (Remind,Recall)?
32          is-a cause (Remind,x)?
33          is-a entail (Snore,Sleep)?
34          is-a entail (Snore,x)?

Table 7.5: Queries used in WordNet experiments.

Query   Hops   Active   Response Time (sec)       Speedup
No.            Nodes    32 PE CM-5    Serial      on CM-5
04      23     85312    0.274616      2.9662      10.801264
03      17      3566    0.0191858     0.121324     6.323635
01      ?       5150    0.02962       0.178118     6.013437
02      8        603    0.0049589     0.0264439    5.332614
15      5        181    0.0024312     0.0081617    3.357066
16      ?        706    0.0092081     0.0303078    3.291428
13      5        106    0.0026706     0.0070002    2.621209
22      ?        120    0.0049411     0.0106783    2.161118
14      ?        106    0.0049702     0.0074997    1.508933
21      8        120    0.0032757     0.004806     1.467167
11      3         44    0.0015436     0.0021153    1.370368
09      5         40    0.0023461     0.0028757    1.225736
30      ?         37    0.0055888     0.0067644    1.210349
10      ?         40    0.004249      0.0048093    1.131866
25      11        67    0.0033865     0.003831     1.131256
12      ?         44    0.0036676     0.0040621    1.107564
26      ?         67    0.0048346     0.0046988    0.971911
20      ?         25    0.0035902     0.0033718    0.939168
19      5         25    0.0016947     0.0015506    0.914970
28      ?         21    0.0038177     0.0032523    0.851900
31      5         19    0.0018847     0.001563     0.829310
24      ?          7    0.0037044     0.0030453    0.822076
18      ?         16    0.0037691     0.0030729    0.815287
34      ?         21    0.0038868     0.0030558    0.786199
27      5         21    0.001893      0.001467     0.774960
06      8         25    0.0024775     0.001897     0.765691
32      ?         19    0.0039006     0.0028899    0.740886
07      17        49    0.0048445     0.0034989    0.722242
33      5         21    0.0019587     0.0013777    0.703375
17      5         16    0.0018265     0.0012205    0.668218
08      23        63    0.0065298     0.0043587    0.667509
05      ?         63    0.0072454     0.0047586    0.656775
29      14        37    0.004347      0.0025781    0.593076
23      5          7    0.0017754     0.000975     0.549172

Table 7.6: Average response time for WordNet queries.

Query   WordNet (wn)            Response   Speedup
No.     Query                   Time       on CM-5
01      t-wn bird -treen        12.99      438.55
12      t-wn heavy -antsa        0.09       24.54
14      t-wn heavy -synsa        0.07       14.08
20      t-wn deep -attra         0.03        8.35
26      t-wn cell -hmern         0.03        6.21
22      t-wn bird -partn         0.03        6.07
05      t-wn sparrow -hypen      0.04        5.53
28      t-wn flock -membn        0.02        5.24
10      t-wn bird -synsn         0.02        4.71
30      t-wn rudder -hholn       0.02        3.58
24      t-wn organism -subsn     0.01        2.70
18      t-wn stellar -perta      0.01        2.65
34      t-wn snore -entav        0.01        2.57
32      t-wn remind -causv       0.01        2.56

Table 7.7: Comparison of wn and WordNet on the CM-5.

In the hops column, the table shows the number of hops activation had to propagate before the query was answered. Due to the manner in which WordNet relations are encoded in shruti-cm5 (Section 7.3.1), the actual depth of the query is (H - 2)/3, where H is the number of hops given in column 2 of the table. Enumeration queries never get answered affirmatively, and hence, their "depth" is the maximum allowed. Such queries have a '?' in the hops column. The number of nodes that are activated in the course of answering the query is shown in column 3, titled "active nodes". Column 5 of the table shows response time on a serial workstation for the WordNet queries. These times were obtained by running the serial version of shruti-cm5 (Section 5.7) on a sparc server. The last column shows the speedup obtained on the CM-5 with respect to the serial workstation. The speedups are only approximate, since the processor on the sparc server and the CM-5 processing nodes are different, the CM-5 nodes being the slower of the two. Also, Table 7.6 has been sorted in order of decreasing speedup. The actual queries can be determined by looking up the query number in Table 7.5.

From Table 7.6, we observe that speedup varies from about 11 for queries which activate a large number of nodes to less than 1 for queries which activate only a small number of nodes. With a large number of active nodes, the system makes effective use of parallelism to achieve significant speedups. When activation spreads only to a small number of nodes, the communication and system overhead on the CM-5 swamps the little (if any) performance gained by the use of parallelism, rendering the parallel version slower than the serial version. Between these two extremes, performance on the CM-5 improves with an increase in the number of active nodes.

We have modified the serial WordNet interface wn to report processing time using Unix timers. This modified version, t-wn, reports the time needed to retrieve the required information from the WordNet database files. It does not include the time needed to format and display the retrieved information. t-wn can report both elapsed (wall-clock) time and busy time. Elapsed time is the total time from the start of the process to its termination, and includes both busy and waiting times. Busy time is the sum of cpu time for the user process and any system processes invoked on behalf of the user process.

Table 7.7 compares the performance of WordNet on the CM-5 to the serial WordNet interface t-wn. The response time column indicates busy time for answering the respective query, and is the minimum obtained in six consecutive runs of the same query on a sparc Station 10; this avoids counting the time needed to process cache misses. Timing on the CM-5 also excludes output time, as does the time reported by t-wn. Due to the manner in which the serial WordNet interface is designed, only enumeration queries can be posed. Here again, the comparison is approximate because of the different processors used on the two machines. The results are sorted by speedup.

By inspecting the results above, we note that with respect to both serial shruti-cm5 and t-wn, speedup on a 32 node CM-5 is better when the number of active elements is larger. This is in keeping with our expectations, since effective use of parallelism will result when a large number of nodes are active--thereby providing better speedup.
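The distinction between elapsed and busy time can be made concrete with a small timing harness. The sketch below is illustrative--the text does not reproduce t-wn's actual instrumentation--and relies only on the standard Unix gettimeofday() and getrusage() calls; adding RUSAGE_CHILDREN would also capture the cpu time of child processes run on the user's behalf.

```c
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Elapsed (wall-clock) time via gettimeofday(); busy time as the sum of
 * user and system CPU time via getrusage(). */
static double tv_to_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct timeval t0, t1;
    struct rusage ru;

    gettimeofday(&t0, NULL);

    /* ... run the query / database retrieval being measured ... */

    gettimeofday(&t1, NULL);
    getrusage(RUSAGE_SELF, &ru);

    double elapsed = tv_to_sec(t1) - tv_to_sec(t0);
    double busy = tv_to_sec(ru.ru_utime) + tv_to_sec(ru.ru_stime);

    printf("elapsed: %.6f sec, busy: %.6f sec\n", elapsed, busy);
    return 0;
}
```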

7.4 Empirical Validation of the Analysis

In this section, we empirically validate the analysis presented in Chapter 6 using results from the random knowledge base and WordNet experiments run on shruti-cm5. With random knowledge bases, we compare actual and predicted response time as a function of knowledge base size and query depth. WordNet experiments explore the relation between response time and the value of q. The following sections show that actual response times, for both random knowledge bases and WordNet, are very close to the values predicted by the analysis. The predicted and actual values agree closely as long as the assumptions made in the analysis are valid--which is the case when a large number of knowledge elements are active. The fact that the analysis accurately predicts performance with random knowledge bases and specialized real-world knowledge bases like WordNet provides strong empirical validation of the theoretical analysis.

7.4.1 Analysis Validation -- Random Knowledge Bases

From Chapter 6, the average response time for answering a query Q is T_Q = T_rb + T_th, where the expressions for T_rb and T_th are given by Equations 6.47 and 6.48, respectively. In dealing with random knowledge bases, T_th ≈ 0 (see below) and

\[
T_{rb} = \left\{ \frac{b_f^{D+1}\, q_{rb}^{D+1} - 1}{b_f\, q_{rb} - 1} + \frac{b_f^{D+1} - 1}{N (b_f - 1)} \right\} t_s \left\{ B_{rb}\, q_{rb} + (B_{rb} + 2)\, k_c\, (1 - q_{rb}) + K_{orb} \right\}
\]

where B_rb = b_f + b_r and K_orb = (b_f + b_r) k_orb + 2 k_ohrb + f k_of. This is valid for both synchronous and asynchronous operation (Section 6.10.4).

Figure 7.31: Predicted response time for kb1.

Figure 7.32: Predicted response time for kb2. (Three panels: 32 PE, 64 PE and 128 PE CM-5.)

Figure 7.33: Predicted response time for kb3.

In experimenting with random knowledge bases, we note the following empirical facts which are of relevance to validating the theoretical analysis:

- The number of active elements in the type hierarchy is negligibly small compared to the number of active predicates in the rule-base. In other words, T_th ≈ 0, and T_Q = T_rb.
- The time needed to check fact matches is small compared to the time for processing rules, i.e., f ≈ 0.
- The computational cost of processing a rule is split about equally among the processors housing the consequent and antecedent predicates. This implies that k_orb ≈ k_ohrb. Section 7.2.3 shows that the time needed to process a rule is about 250 μsec.[7] Thus, t_orb + t_ohrb ≈ 250 μsec; using t_s = 5 μsec from Section 6.4, we have k_orb = k_ohrb = 25.
- The reverse branching factor in the rule-base is about one quarter of the forward branching factor, i.e., b_r = (1/4) b_f. Hence, B_rb = b_f + b_r = (5/4) b_f.
- Random processor allocation is used, resulting in q_rb = 1/N on an N processor machine.
- Reasoning episodes are run asynchronously. This will result in a greater variation in the response time values, especially for queries with larger depth.

Incorporating these parameter values into the equation for response time, we have

\[
T_{rb} = \left\{ \frac{b_f^{D+1}\, q_{rb}^{D+1} - 1}{b_f\, q_{rb} - 1} + \frac{b_f^{D+1} - 1}{N (b_f - 1)} \right\} \times 5 \times 10^{-6} \times \left\{ \tfrac{5}{4} b_f\, q_{rb} + \left(\tfrac{5}{4} b_f + 2\right) 5\, (1 - q_{rb}) + 25 \left(\tfrac{5}{4} b_f + 2\right) \right\}.
\]

For each query, the branching factor b_f is estimated by taking into account the depth D of the reasoning episode and the number of rules fired,[8] r, in the reasoning episode: r = b_f^{D+1}. This provides a good estimate of b_f = e^{ln(r)/(D+1)} for that reasoning episode. The value of N is decided by the number of processors on the machine running the experiments.

Figures 7.31, 7.32 and 7.33 present the response time predictions of the analysis for varying knowledge base sizes and query depths, with random knowledge bases kb1, kb2 and kb3 respectively. The actual experimental response times and 95% confidence intervals (from Section 7.2) are superimposed for comparison. Since the assumptions of the analysis, including the items listed above, are valid only when a large number of rules fire, the graphs compare experimental and predicted response times only for the larger knowledge bases. From the figures, we note that actual and predicted response times match quite closely in most of the cases, especially for query depths less than about 6. For larger query depths, particularly with larger knowledge bases, the occasional variation between predicted and actual values can be attributed to: (i) the increased variation in actual response time due to asynchronous operation; and (ii) the relatively limited number of data points for estimation of confidence intervals.

[7] Subject to the assumptions stated in Section 7.2.1. These assumptions are valid when experimenting with random knowledge bases kb1, kb2 and kb3.
[8] Note that we do not use the number of active predicates, since shruti-cm5 ignores multiple instantiations of predicates when reporting the number of active predicates. Moreover, when the predicate banks fill up, new instantiations will be discarded. The active predicate count will therefore not provide an accurate estimate of the total computation and communication costs.
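As an illustration of this estimate with hypothetical numbers: if r = 10^5 rules fire in an episode of depth D = 5, then

\[
b_f = e^{\ln(10^5)/6} \approx e^{1.92} \approx 6.8.
\]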

7.4.2 Analysis Validation -- WordNet

Since WordNet uses only the type hierarchy, the average query response time is T_Q = T_th. From Equation 6.48, we have

\[
T_{th} = \left\{ \frac{B_{th}^{D+1}\, q_{th}^{D+1} - 1}{B_{th}\, q_{th} - 1} + \frac{B_{th}^{D+1} - 1}{N (B_{th} - 1)} \right\} k_n t_s \left\{ B_{th}\, q_{th} + (B_{th} + 2)\, k_c\, (1 - q_{th}) + K_{oth} \right\}
\]

with B_th = b_u + b_d and K_oth = (b_u + b_d) k_oth + 2 k_ohth. This equation holds irrespective of whether reasoning episodes are run synchronously or asynchronously (Section 6.10.4).

WordNet does not have default is-a relations; thus any is-a relation will only contribute to b_u, giving B_th = b_u. For each query, we estimate this branching factor by taking into account the depth D of the reasoning episode and the number n of active nodes: n = b_u^{D+1}. This ensures that we obtain a reasonable estimate of the branching factor for that episode of reasoning. The experiments were run on a 32 node CM-5, so that N = 32. An is-a query always activates a single entity, and hence, k_n = 1. From Section 6.4 we have t_s = 5 μsec and k_c = 5. Experiments also indicate that t_oth and t_ohth are approximately 25 μsec, leading to k_oth = k_ohth = 5. Thus, the response time when running WordNet on shruti-cm5 is given by

\[
T_{th} = \left\{ \frac{B_{th}^{D+1}\, q_{th}^{D+1} - 1}{B_{th}\, q_{th} - 1} + \frac{B_{th}^{D+1} - 1}{32 (B_{th} - 1)} \right\} \times 5 \times 10^{-6} \times \left\{ B_{th}\, q_{th} + (B_{th} + 2)\, 5\, (1 - q_{th}) + 5 B_{th} + 10 \right\} \tag{7.1}
\]

where B_th = b_u = e^{ln(n)/(D+1)}, D is the query depth, n is the number of active nodes for the reasoning episode and q_th is the q value for the type hierarchy.

We use the q-based processor allocation capability of shruti-cm5 (Section 5.3.5) to obtain WordNet mappings with several values of q_th in the range [0,1]. We then determine the response time for the queries in Table 7.5. Figures 7.34-7.38 show experimental results superimposed with the predictions made by the analysis. The figures show data for eight different values of q_th. Each query was run 10 times, and the response times shown are average values with 95% confidence intervals. In each graph, the heavier dotted line is the best fit for the response time obtained on a 32 node CM-5. Noting that Equation 7.1 is a (D + 1)-th degree polynomial in q_th, the best fit curves for the experimental data are of the form t = c_1 + c_2 q_th^{D+1}, where t is the response time. The thinner line is a plot of Equation 7.1 with q = q_th as the independent variable. The figures also show the query number, the actual query, the number of hops needed to answer the query and the number of active nodes.

The curves in Figures 7.34-7.38 show that response time remains relatively constant when the value of q is less than about 0.7, and increases rapidly when q tends to 1.
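Equation 7.1 is easy to explore with a small evaluator. The sketch below is our own (the function name and driver are illustrative); it estimates B_th from the number of active nodes exactly as described above, and the driver uses the depth and active-node count reported for query 04 in Table 7.6 (23 hops corresponds to a depth of (23 - 2)/3 = 7).

```c
#include <math.h>
#include <stdio.h>

/* Evaluate Equation 7.1: predicted WordNet response time on a 32 node
 * CM-5 as a function of q, given the observed number of active nodes n
 * and the query depth D. Constants follow the text: t_s = 5e-6 sec,
 * k_c = 5, k_oth = k_ohth = 5, k_n = 1, N = 32. */
static double predicted_tth(double q, double n, int D)
{
    double B = exp(log(n) / (D + 1));          /* B_th = b_u estimate */
    /* Note: at q = 1/B the first series term has the limit D + 1; the
     * sample points below avoid that value. */
    double series = (pow(B * q, D + 1) - 1.0) / (B * q - 1.0)
                  + (pow(B, D + 1) - 1.0) / (32.0 * (B - 1.0));
    double per_hop = B * q + (B + 2.0) * 5.0 * (1.0 - q)
                   + 5.0 * B + 10.0;
    return series * 5e-6 * per_hop;
}

int main(void)
{
    /* Query 04, is-a hypernym (Entity,Sparrow)?: depth 7, 85312 active
     * nodes (from Table 7.6). */
    for (double q = 0.1; q < 1.0; q += 0.2)
        printf("q = %.1f  T_th = %.4f sec\n",
               q, predicted_tth(q, 85312.0, 7));
    return 0;
}
```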

Figure 7.34: Response time for WordNet queries as q_th changes. Queries 01-08. (Each panel plots response time against q, with best fit and predicted curves.)

Figure 7.35: Response time for WordNet queries as q_th changes. Queries 09-16.

Figure 7.36: Response time for WordNet queries as q_th changes. Queries 17-24.

Figure 7.37: Response time for WordNet queries as q_th changes. Queries 25-32.

Figure 7.38: Response time for WordNet queries as q_th changes. Queries 33-34.

When the value of q approaches 1, the knowledge base distribution is such that a larger and larger fraction of related predicates and concepts are located on the same processor. The ineffective use of parallelism stemming from this imbalance in load distribution results in a deterioration of response time. This behavior is in keeping with the predictions made by the analysis (Section 6.8).

Studying the graphs, we notice that whenever the number of active nodes is large, say about 100 or more, the best fit and predicted curves are a reasonably good match--thereby validating the analysis. Of course, the curves are not exactly identical, due to the approximations used in the analysis and the differences between the assumptions made during the analysis and the actual implementation (see Sections 6.3, 6.5 and 6.10). Furthermore, the analysis does not take into account fixed one-time overhead costs like the time needed for synchronization, initialization, and so on. In other words, the analysis assumes that the number of active elements in an episode of reasoning is large, and hence, that fixed one-time overhead costs are insignificant compared to the cost of spreading activation. When this assumption is not satisfied--as is the case when very few nodes become active in a reasoning episode--the predicted curves differ from the actual curves. This accounts for the especially large discrepancy between the predicted and best fit curves for some of the queries.

Figures 7.39-7.42 visualize various performance parameters using viz, the shruti-cm5 visualizer. Each figure shows the variation of one parameter for different q_th values; the bottom of each window indicates the respective q_th value. In all these figures, each square represents the parameter value on one CM-5 processor. The processors are numbered from 0-31; the first row represents processors 0-5, the second row 6-11, and so on. The last row shows only processor 30. Processor 31 is the server and has no knowledge base elements allocated; this processor is therefore excluded from the plot. The area of the square for each processor is proportional to the relative value of the respective parameter.

Figure 7.39 shows the distribution of the WordNet knowledge base on the CM-5 processors. For smaller values of q_th, the knowledge base is relatively uniformly distributed; the distribution gets skewed for q_th values close to 1. Note that the distribution of knowledge base elements on the processors is decided by the processor allocation scheme.

Figures 7.40-7.42 show the response time, communication load and computation load for the query is-a hypernym (Entity,Sparrow)?. The communication load for a processor is the total number of messages sent and received by that processor. The computation load is the total number of active entities on that processor. Given the average branching factor for the knowledge base (about 2.25 for WordNet on shruti-cm5), the total number of active elements on a processor determines the number of is-a relations fired. These figures show a snapshot of the respective parameters at the end of the reasoning episode. We again observe that parameter values across processors are relatively uniform for smaller values of q_th, but get increasingly skewed as q_th approaches 1.

Figure 7.39: WordNet knowledge base distribution for different values of q_th.

Figure 7.40: Response time for WordNet query is-a hypernym (Entity,Sparrow)? for different values of q_th.

Figure 7.41: Communication load for WordNet query is-a hypernym (Entity,Sparrow)? for different values of q_th.

Figure 7.42: Computation load for WordNet query is-a hypernym (Entity,Sparrow)? for different values of q_th.

Chapter 8

Recapitulation and Future Work

Efficient knowledge representation and reasoning is an important component of intelligent activity, and is a crucial aspect of the design of large-scale intelligent systems. In this thesis, we set out to design efficient knowledge representation and reasoning systems which can encode very large knowledge bases and respond to queries in real-time. In order to design large-scale real-time reasoning systems, we propose (i) exploiting massive parallelism; and (ii) constraining representational and inferential capabilities to achieve tractability, while still retaining sufficient expressive power to capture a broad class of reasoning in intelligent systems. To this end, we use shruti, a connectionist reasoning system, as our framework for knowledge representation. Being a connectionist model, shruti is inherently parallel; shruti also circumscribes a tractable reasoning class by imposing cognitively motivated constraints on the representational and inferential capabilities supported.

Based on considerations like performance per unit cost, flexibility, portability, availability and cost, we opt to use general purpose massively parallel machines as our target parallel architectures. We map shruti onto such general purpose distributed memory parallel machines, especially the Connection Machine CM-5. We describe the design and implementation of shruti-cm5, the parallel reflexive reasoning system on the CM-5. Shruti-cm5 uses knowledge-level mapping to map the knowledge base onto the CM-5 processors. The design of shruti-cm5 is based on SPMD asynchronous message passing operation, using active messages for interprocessor communication. The system exploits the nonblocking, low-overhead and short packet-size characteristics of active messages to achieve real-time performance with very large knowledge bases.

Experiments show that shruti-cm5 can handle very large artificial and real-world knowledge bases in real-time. Shruti-cm5 encodes artificial knowledge bases with up to half a million randomly generated rules and facts and responds to a class of queries, requiring derivation depths of up to eight, in well under a second. Empirical data indicates that shruti-cm5 running on a 32 node CM-5 can process, on average, about 125,000 rule firings per second; in other words, each processor on the CM-5 can process approximately 4,000 rule firings per second. Thus, for a range of knowledge base structures, depths of reasoning chains, and numbers of processors, shruti-cm5 can handle knowledge bases with several hundred thousand to a few million rules and facts in real-time. Experiments with WordNet also show comparable performance, with query response times ranging from a few to a few hundred milliseconds, and significant speedups compared to serial versions.

We also mathematically analyze the mapping of shruti onto parallel machines from both machine dependent and machine independent points of view. With a view toward exploiting machine characteristics and knowledge base structure, the analysis relates average query response time to machine and knowledge base parameters. The analysis optimizes performance with respect to the probability of finding related knowledge elements on the same processor, thereby leading to an optimal mapping of the knowledge base onto the underlying parallel machine. The analysis makes some interesting predictions about response time and its relation to computation and communication costs. One such prediction, validated by experimental data, indicates that average response time is close to optimal when knowledge base elements are mapped to random processors--i.e., load balancing is more critical than locality--when communication cost is either small or comparable to the computational cost of processing a knowledge base element. Such results help engineer performance improvements and steer the course of future research.

8.1 Contributions

This thesis has made the following contributions:

- Design and development of shruti-cm5, a practical, large-scale, massively parallel reflexive reasoning system which can encode very large knowledge bases and respond to queries in real-time. Shruti-cm5 is based on shruti, and hence, as a byproduct, models reflexive reasoning. By encoding appropriate common-sense knowledge bases, for example CYC (Guha and Lenat, 1990), shruti-cm5 can be used as a platform to explore the characteristics of reflexive reasoning. Such experimentation will provide better insight into the limitations of reflexive reasoning and help compare it with human common-sense reasoning.

- The mathematical analysis of mapping shruti onto MPPs relates average response time to machine and knowledge base parameters, and provides pointers toward optimizing system performance. The analysis also provides some new insights into mapping structured connectionist networks onto distributed memory parallel machines. Results from the analysis can also be extended to the more general case of mapping graph structures onto parallel machines. We expect such results to have potential applications in areas like program parallelization, data-flow computation and simulation.

- By encoding WordNet on shruti-cm5, we have developed a massively parallel implementation of WordNet. This system not only allows experimentation with WordNet, but can also be used in applications of WordNet. Applications which require extensive searching of the WordNet database would stand to benefit from the parallelism.

8.2 Future Work

Future work involves extending and diversifying the parallel reflexive reasoning system, and applying the resulting technology to a number of applications. The analysis also suggests several research possibilities. Following are sketches of the lines of research we plan to pursue.

8.2.1 Extensions to Shruti-cm5

Shruti-cm5 can be extended in several ways in order to make it more versatile, and hence applicable in a wider range of applications:

 Currently, shruti-cm5 supports only backward reasoning. We plan to add a forward reasoning module to complement the backward reasoner. Shruti already supports forward reasoning; we just need to incorporate this into shruti-cm5. The forward reasoner would endow shruti-cm5 with the capability to perform not only deductive inferences (backward chaining), but also predictive inferences (forward chaining).


The design of a forward reasoning system for shruti-cm5 would be the dual of the backward reasoner discussed in this thesis and therefore entails considerations similar to those presented here.

 Designing systematic techniques for combining the results of forward and backward reasoning will lay the groundwork for abductive reasoning, generating explanations, and other applications which need a combination of forward and backward chaining. Some work in this direction has been reported in (Ajjanagadde, 1991).

 We plan to introduce the capability to handle evidential or soft rules. A straightforward and easily implementable way to do this would be to specify numerical strengths for rules, with some principled way of dealing with multiple antecedent rules (a simple illustration appears at the end of this section). The current versions of shruti-cm5 support multiple activation levels for predicate enablers and collectors, which can be used to represent rule strength. But in order for the numbers to be meaningful and useful in applications, we might have to commit to some formal theory of evidential or uncertain reasoning. In this vein, it would be interesting to explore the relation between shruti and Bayesian networks (Pearl, 1988). Bayesian networks, or probabilistic networks, provide an elegant and well-studied formalism for uncertain reasoning and evidence combination. The one major impediment to using Bayesian networks for large-scale real-world problems is the computational complexity of evaluating these networks. The resource and computational constraints imposed by shruti, combined with the formalism of Bayesian networks, may lead to a tractable formulation of probabilistic networks. Such a system would have wide-ranging applications in AI.

 Shruti-cm5 currently maintains all knowledge base structures in main memory. This has not been a problem for any of the knowledge bases we have used, but we will invariably encounter applications where the memory available on the CM-5 is insufficient. As mentioned in Section 3.8, we need to add the capability to use secondary storage efficiently. This is especially needed since the CM-5 does not support virtual memory. The extensive research into the efficient use of secondary storage for large database systems should provide a good starting point for this endeavor.

 The ability to retract long-term rules and facts will enhance the practical applicability of shruti-cm5. This requires modification of the rule and fact encoding scheme in shruti-cm5 to maintain information for easy identification of structures representing rules and facts. In the current versions, once a rule or fact has been encoded, an exhaustive search through the rule/fact structures of the respective predicate is needed to locate the structures representing a specific rule or fact. With the current encoding, this operation can get expensive for large knowledge bases.

 Principled support for a limited form of negation is possible by enhancing the representation of predicates (Shastri, 1994). Each predicate P in the system will have "positive banks" to represent P(...), and "negative banks" to represent ¬P(...). Given this modified representation of predicates, both P and ¬P can be used in rules, facts and queries. The positive and negative banks interact to either flag an inconsistency whenever both banks have the same instantiation, or enforce consistency by disallowing the same instantiation in both banks.
These extensions will render shruti-cm5 better able to support abductive, evidential, and uncertain reasoning in practical applications.
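As an illustration of the straightforward scheme for soft rules mentioned above, the sketch below treats a rule's strength as a multiplicative factor and combines multiple antecedents by taking their minimum. This is an expository assumption, not shruti's actual mechanism; min and product are just one of several defensible combination rules, and activations are assumed to lie in [0, 1].

    #include <stdio.h>

    /* Combine antecedent activations with a rule strength: take the weakest
       antecedent (a conservative reading of conjunction) and scale it by the
       strength of the rule. Activations are assumed to lie in [0, 1]. */
    double fire_soft_rule(double strength, const double *ant, int n)
    {
        double a = 1.0;
        for (int i = 0; i < n; i++)
            if (ant[i] < a)
                a = ant[i];          /* min over all antecedents */
        return strength * a;         /* activation passed to the consequent */
    }

    int main(void)
    {
        double ants[2] = {0.9, 0.7};
        printf("%.2f\n", fire_soft_rule(0.8, ants, 2));   /* prints 0.56 */
        return 0;
    }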

8.2.2 Applications of Shruti-cm5

Any knowledge base or database that can be translated to a form acceptable to shruti-cm5 can be encoded on the system, with parallel reasoning episodes being run on the CM-5.

We have described the encoding of WordNet on shruti-cm5. We have also obtained a large knowledge base consisting of over 14,000 frames and 170,000 attribute-value pairs about plant anatomy and physiology from Bruce Porter of the University of Texas at Austin (Porter et al., 1988). The mapping of this knowledge base to our system is very similar to that of WordNet. We are also attempting to acquire and encode a subset of the CYC knowledge base (Lenat et al., 1990).

Once a knowledge base or database has been encoded on shruti-cm5, it can be embedded in applications which need to access this information efficiently. Applications include efficient large-scale expert systems, natural language processing, and real-time speech understanding. A planned application of shruti-cm5 is to couple it to the Berkeley Restaurant Project (BeRP) speech understanding system being developed at the International Computer Science Institute (Jurafsky et al., 1994a; Jurafsky et al., 1994b). BeRP functions as a knowledge consultant whose domain is restaurants in the city of Berkeley, California. Users ask spoken-language questions of BeRP, which then queries a database of restaurants and gives advice based on cost, type of food, location, and transportation. The current BeRP system cannot perform inferences; any possible inferences are either hard-wired into the grammar or added to the restaurant database. Our knowledge base system will allow BeRP to make inheritance-like inferences (a Chinese restaurant is an Asian restaurant) as well as more complex inferences (if the user has a car he can get to more distant restaurants). The rapid response of our knowledge base system will be particularly useful for an on-line speech understanding system like BeRP.

We are also exploring other potential applications of shruti-cm5, including the use of shruti-cm5 for language understanding and abductive reasoning based on (Hobbs et al., 1993).

8.2.3 Integrated User Interface

In the existing shruti-cm5 system, all tools and utilities are separate programs. The parallel reflexive reasoning system forms the core of the shruti-cm5 system around which all the other programs and tools are organized. Data processing, analysis and visualization tools are a combination of scripts, existing tools and custom-written programs. Except for the parallel core, all the other tools are off-line and usable on a workstation. The user must manually invoke the required program or script in order to execute any kind of processing, data analysis or visualization.

Future versions of shruti-cm5 will provide an easy-to-use graphical user interface which integrates the entire suite of programs and tools. We will also provide for automated remote access to the CM-5 so that all off-line tools and processing can be confined to the local workstation. The parallel reasoning episodes will be run on the remote CM-5 and the results and output transferred back to the local workstation for further processing.

8.2.4 Exploring the Impact of Communication and Computation Costs

It would be worthwhile to study the mapping of shruti on a variety of parallel architectures, using a range of interprocessor communication protocols. Such experimentation will yield data on the effect of communication cost on query response time which can be used to further validate and refine the analysis. Results from such experiments, combined with the analysis, will provide insights into efficient mapping of shruti onto small-scale multiprocessors and networks of workstations. Such a system can be exploited in practical, real-world applications like fast data retrieval in very large database systems. We also expect this endeavor to provide deeper insights into mapping structured connectionist networks and graphs onto parallel machines. Any algorithm which can be viewed as a graph, with nodes representing computation and links representing communication, can then be efficiently mapped onto parallel machines, taking into account the computational cost of the nodes and the communication cost of the links. Applications of such studies include program parallelization and efficient simulation of dataflow computation.

8.2.5 Exploring Reflexive Reasoning

Shruti-cm5 is based on shruti, a cognitively motivated reasoning system which attempts to model common-sense reflexive reasoning. By encoding large common-sense knowledge bases, like CYC, we can explore reflexive reasoning and compare it with aspects of human common-sense reasoning. This would help circumscribe the characteristics and limitations of reflexive reasoning in terms of actual reasoning patterns in human cognitive processes. In addition, we can also experimentally evaluate the effect of the constraints imposed by shruti, both intrinsically and in comparison with common-sense reasoning, and hopefully provide answers to such questions as:

 What are the strong points and shortcomings of shruti in terms of its capability to model common-sense reasoning?

 Do the constraints on the form of rules in shruti exclude any significant class of common-sense reasoning?

 What is the effect of the multiple instantiation constants on the class of inferences supported?

 How restrictive is the bound on the number of distinct entities that can participate in a reasoning episode? How about the bound on the depth of reasoning?

Though we have intuitive expectations of what the answers to these questions might be, large-scale experimentation with common-sense knowledge bases is one way of validating shruti as a common-sense reflexive reasoning system. Such empirical evaluation will lead to a deeper understanding of common-sense reasoning per se, and help in refining and extending shruti as a model of such reasoning.

8.3 Conclusion

From a practical standpoint, this work develops viable technology for supporting large-scale real-time knowledge base systems. It also provides new insights into mapping structured connectionist networks onto massively parallel machines. From a cognitive standpoint, use of shruti as the knowledge representation framework facilitates exploration of common-sense reflexive reasoning. We hope that pursuing this work further will result in contributions to the fields of parallel processing, parallel knowledge representation and reasoning, parallel database systems, and cognitive science.


Appendix A

Knowledge Base Input Syntax

In this appendix, we describe the knowledge base input syntax for shruti-cm2 and shruti-cm5.1 Some of these features are not supported by shruti-cm2; these are indicated by a dagger (†). Features not supported by shruti-cm2 and shruti-cm5 Version 07.5 (shruti-cm5-Lite) are identified by a double dagger (‡). To illustrate the input syntax for rules, facts and is-a relations, we begin with an extension of the example in Section 5.6, shown in Figure A.1. Any text included between /'s is a comment. The comments in the figure are enclosed between /* ... */ so that they look identical to comments in C code. Some aspects of the input syntax are elaborated below:

 A rule meant for the backward reasoner is said to be balanced if the following conditions are satisfied:

- Repeated variables in the antecedent are also present in the consequent.
- Typed variables and entities present in the antecedent are also present in the consequent.

Only balanced rules are accepted by the system. Rules which do not satisfy the above conditions are rejected, and a warning message to this effect is printed. For example, forall x,y [preys_on(x,x) => scared_of(x,y)] is balanced because the repeated antecedent variable x appears in the consequent, whereas forall x,y [preys_on(x,x) => scared_of(y,y)] is not, since x does not.

 Any variable (used in a rule) which is not listed in either the list of universally quantified variables or the list of existentially quantified variables is assumed to be existentially quantified.

 Any name beginning with an uppercase alphabetic character is assumed to be an entity. All names beginning with lowercase characters are variable names. Names of predicates can begin with either uppercase or lowercase letters. In shruti-cm2 and shruti-cm5-Lite, capitalization of names is significant; for example, NAME1 and Name1 are different. In ‡shruti-cm5 Version 08.3, all names are internally converted to lower case, and hence NAME1 and Name1 would be the same.

 A semicolon (;) indicates that a rule, fact or is-a relation has been entered; it also indicates that more input is to follow. The occurrence of a period (.) in the input indicates the end of a rule, fact or is-a relation and also terminates the input. A (quantified or unquantified) predicate terminated by a ? is interpreted as a query; for example, can_sell(Mary,Book1)? poses a query.

1 The syntax described here is that accepted by shruti-cm5 Version 08.3.


/* RULES */
forall x,y,z [give(x,y,z) => own(y,z)];
forall x,y [buy(x,y) => own(x,y)];
forall x,y [own(x,y) => can_sell(x,y)];
forall x,y [sibling(x,y) & born_together(x,y) => twins(x,y)];
forall x,y [preys_on(x,y) => scared_of(y,x)];
forall x,y,z [move(x,y,z) => present(x,z,t)];
forall x,y,z [move(x,y,z) => present(x,y,t)];
forall x,y exists t [born(x,y) => present(x,y,t)];
forall x:Animate, y:Solid_obj [walk_into(x,y) => hurt(x)];

/* FACTS */
give (John, Mary, Book1);
give (x, Susan, Ball2);
forall x:Cat, y:Bird preys_on (x,y);
exists x:Robin [own(Mary,x)];

/* IS-A FACTS */
is-a (Bird,Animal);
is-a (Cat,Animal);
is-a (Robin,Bird);
is-a (Canary,Bird);
is-a (Tweety,Canary);
is-a (Sylvester,Cat).

Figure A.1: An example showing knowledge base input syntax.

 The lexical analyzer removes all whitespace; the input is therefore unaffected by the addition of extra blanks, tabs or newlines. Furthermore, spaces can be omitted wherever they are not essential2.

 The lexical analyzer also removes all comments. Any text enclosed between /'s (/ ... /) is a comment. The text of a comment can contain any character or symbol except /. A comment can start and end at any point in the input. In particular, a comment may span several lines or may be limited to part of a single input line.

 †Node Tags. Predicates and entities can be tagged (with a non-zero, positive integer) by using the < > construct: <predicate, tag> or <entity, tag>. Tags can be used to group "similar" predicates and entities together.

 ‡Link Tags. Rules and is-a relations can also be assigned tags in shruti-cm5 Version 08.3, using a similar angle-bracket construct; a tag definition using th defines a relation (for example, ispartof) in the type hierarchy. Similarly, by using rb in place of th, tags can be associated with rules. Link tags are generally specified using mnemonic names which are internally converted to positive integers. Link tags can also have aliases.

 Error Handling. When syntax errors are detected in the input, the action taken depends on the mode of input:

2 To distinguish between the variable `forallx' and `forall x', a space is essential. But a space is not required after the `,' in `own(x,y)'. In general, spaces are not essential before and after punctuation symbols.


- If input is being read from the terminal (stdin), an error message is issued, and the last rule or fact should be re-entered after typing one or more semicolons.
- If input is being read from a file, the parser prints the line number containing the syntax error and continues reading the file, so that all syntax errors in the file are listed. Rules or facts in the input that were correctly recognized (i.e., had no syntax errors) will be encoded; the others will be ignored.

The formal grammar for the input language (for rules, facts, is-a relations and queries), specifying the exact form of each input structure, is shown in Figure A.2. The grammar is accurate for shruti-cm5 Version 08.3. Shruti-cm5-Lite and shruti-cm2 have minor differences, some of which are indicated above. Shruti-cm5-Lite supports only node tags, whereas shruti-cm2 does not support any tags. In addition, the grammar for shruti-cm5 Version 08.3 has been rewritten to use left recursion for productions like input which get reduced repeatedly; shruti-cm5-Lite and shruti-cm2 use grammars with right recursion. Though this does not affect the input language, yacc generates a more efficient parser with left-recursive grammars.

In the grammar specification, CONST represents entities (any token starting with an uppercase letter); VAR are variables (quantified or unquantified) and are tokens beginning with lowercase letters. In addition to alphanumeric characters, variable and entity names can also contain some special characters. Any integer is recognized as a NUM. The tokens FORALL and EXISTS are recognized when the input contains these words, spelled with any combination of uppercase and lowercase letters (i.e., arbitrarily capitalized).


input -> /* empty string */
      | input .                  /* stop: no more input */
      | input ;                  /* continue: more input */
      | input input-item

input-item -> query              /* query */
      | fact                     /* fact */
      | rule                     /* rule */
      | isa-relation             /* is-a relation */
      | tag-def                  /* node tag definition */
      | relation                 /* link tag definition */
      | tag-alias                /* tag alias definition */

rule -> q-prefix [ pred-list tag => predicate ]

fact -> predicate
      | q-pred

isa-relation -> IS-A tag ( arg-list )

query -> isa-relation ?
      | query-pred ?

tag-def -> < predicate , NUM >
      | < constant , NUM >

tag-alias -> name = name

relation -> >

query-pred -> q-pred : tag
      | predicate : tag
      | q-pred
      | predicate

tag -> NUM
      | name
      | /* empty */

q-pred -> q-prefix [ predicate ]

q-prefix -> FORALL type-list
      | EXISTS type-list
      | FORALL type-list EXISTS type-list
      | EXISTS type-list FORALL type-list
      | /* empty */

type-list -> variable
      | variable : constant
      | variable , type-list
      | variable : constant , type-list

pred-list -> pred-list & predicate
      | predicate

predicate -> name ( arg-list )

arg-list -> arg-list-aux
      | /* empty */

arg-list-aux -> name
      | arg-list-aux , name

name -> constant
      | variable

constant -> CONST

variable -> VAR

Figure A.2: Formal grammar specification for knowledge base input syntax.


Appendix B

SHRUTI-CM Commands

Commands recognized by shruti-cm2 and shruti-cm5 are listed below. Some of the commands and descriptions are not applicable to shruti-cm2 and/or shruti-cm5-Lite. Descriptions not applicable to shruti-cm2 are marked with a dagger (†); those applicable only to shruti-cm5-Lite are prefixed by †7 and material applicable only to shruti-cm5 Version 08.3 is indicated by †8. The shruti-cm5 preprocessor only supports the commands i, w and q. The i and w commands in the preprocessor support different options compared to the i and w commands in shruti-cm5. These differences are explicated when describing the respective commands. In addition, the shruti-cm5 Version 08.3 preprocessor supports the b command.

Each command is invoked by using a single character. The first non-blank character typed at the input prompt is taken to be the command. Any non-blank text following the first character forms the argument(s) for the command. The list below indicates the purpose of each command, the command syntax and a brief description of the command.

Quit
Syntax: q
Terminates the shruti-cm program.

Help
Syntax: ?
Prints out a list of available commands and the command-line options and/or arguments accepted by the commands.

Read Input
Syntax: i [ -f | -b ] [input-file]
Reads input from the terminal (when input-file is not specified) or a file (when input-file is specified). The -b option is used to build a backward reasoning system (default), while the -f option builds a forward reasoning system (currently unsupported).
†In shruti-cm5 the behavior of this command is dictated by the current input mode. The system always starts up in parallel asynchronous mode; the mode can be changed using the m command. In parallel asynchronous mode, each processor in the partition processes a different input file input-file.pid, where pid is a three digit processor index (prefixed by zeros if necessary). In this mode, input cannot be read from the terminal. If input-file is not specified, kb.pp is used as the default input-file. In global synchronous mode, all processors cooperatively process the same input file input-file.
†7Preprocessor Syntax: i [-h hash-table-file] [ -f | -b ] [input-file]
The -h option for read input is supported by the shruti-cm5 preprocessor and can be used to update the internal server hash tables which store processor assignment and other details for predicates and concepts. This feature is useful for incremental preprocessing of large knowledge bases.
†8Preprocessor Syntax: i [ -1 | -2 ] [-h htab [-i idg]] [ -f | -b ] [input-file]
In shruti-cm5 Version 08.3, the preprocessor has options for supporting q-based processor allocation. The -1 option performs the first pass of the q-based processor allocation scheme, where all input files are read and the hash tables and inference dependency graphs are constructed. After all input files have been read, the second pass is invoked using -2 to perform global processor allocation.1 In addition to reading hash table files using the -h option, inference dependency graphs can be input using the -i option.

†Change Input Mode
Syntax: m [ -p | -g ]
Changes input mode to parallel asynchronous (with the -p option) or to serial, global synchronous (with the -g option). Without any option, this command prints out the current input mode. The current input mode dictates the behavior of the i command.

†Write Out Hash Table
Syntax: w [-o output-file-prefix]
Writes out the current server hash tables to the specified file (with a .hashtables extension). If no output file prefix is given, kb.pp is used as the default. The hash tables written out can be read by the preprocessor (using the i command with the -h option), supporting incremental preprocessing of large knowledge bases.
†Preprocessor Syntax: w [ -g ] [-o output-file-prefix]
This command, when used on the shruti-cm5 preprocessor, writes out the preprocessed knowledge base. The output file names are suffixed with the processor number. If the output file prefix is not specified, kb.pp is used as the default. If the -g option is absent, the inference dependency graph for the knowledge base is also written out (with file extension .idg).2

Run Reasoning Episode
Syntax: r [[-f] #steps]
Runs the reasoning episode after a query has been posed. It is an error to invoke this command when a query has not been posed. Without any options or arguments, r runs the reasoning episode to completion, i.e., until the query is answered or the reasoning episode has proceeded long enough to conclude that there will be no answer. When #steps is specified with the -f option, the reasoning episode is forced to run for #steps propagation steps (irrespective of whether the query has been answered or not). If the -f option is not specified, the reasoning episode terminates either after #steps cycles or after the query has been answered, whichever happens first.
†7Since shruti-cm5 runs reasoning episodes asynchronously, this command does not support the -f and/or #steps arguments.
†8Syntax: r [-s #steps] [-d #depth]
Shruti-cm5 Version 08.3 provides support for both synchronous and asynchronous operation. When the -s option is not provided, reasoning episodes are run asynchronously. When -s is specified, it directs shruti-cm5 to run the reasoning episode synchronously for #steps iterations. The -d option can be used to fix the maximum depth of activation propagation in the network. When the reasoning episode has terminated, calling r will just result in a "Query processing complete" message.

Reset Network
Syntax: z [ -q | -v ]
Resets the network and removes all activation, including the query. With the -v option, a message is printed out indicating that the network has been reset (default). The message can be suppressed by using the -q option.

Set Phases
Syntax: p [#phases]
Sets the number of phases per clock cycle to #phases. The current number of phases is printed out if the command is invoked without an argument.

1 Calling i with -1 after using -2 can result in unexpected behavior.
2 In shruti-cm5 Version 08.3, this command should be called only after the second pass in q-based processor allocation mode.


Display
Syntax: d { -p | -c } name
Displays the current instantiations of the predicate (with the -p option) or concept (with the -c option) specified by name. An error message is printed if the named predicate or concept is not present in the system.
†Syntax: d { -p name | -c name }*
Shruti-cm5 supports multiple -p and/or -c options.

Statistics
Syntax: s [ -a | -k | -q | -c | -s ]
Prints out knowledge base and reasoning episode statistics. When the system is configured for detailed statistics collection, this command will print out more information. The -a option prints out all the accumulated data (default). The -k option prints out information about the knowledge base. All details about the current reasoning episode are printed out by the -q option. The -c and -s options print out cumulative data and data from the last propagation step respectively, for the current query.
†Due to the asynchronous nature of the shruti-cm5 system, a global propagation step is not well defined. Hence, shruti-cm5 does not support the -c and -s options. Even when running synchronously in Version 08.3, only cumulative statistics for the last invocation of the r command are reported.

†Display Tagged Activation
Syntax: a -f first-tag [-l last-tag]
Displays the number of active predicates and entities with tag values in the specified range. If the -l option is not specified, active predicates and entities with tag value equal to first-tag are printed.

†Display Processor Load
Syntax: l [ -a | -k | -q | -t ] [-n processor]
Prints out the processor load for the current reasoning episode. When the system is configured for detailed statistics collection, this command will print out more information. The -a option prints out all information (default). The -k option prints out the distribution of the knowledge base on the processing elements. The distribution of active elements for the current reasoning episode is printed out by the -q option. The timing for individual processors (for the current reasoning episode) is displayed by the -t option. If the -n option is given, the requested information is displayed for the specified processor. If the -n option is not used, data is displayed for all processors in the partition.

†8Enumerate Active Elements
Syntax: e [ -c | -p ] [-n #items]
Enumerates the active entities (with the -c option) or active predicates (with the -p option) in the system. The -n option is used to set a bound on the number of elements enumerated. If -n is not specified, a default number of elements are enumerated. The hash tables are scanned from the beginning and as many active elements as needed are printed out.

†8Branching Factor
Preprocessor Syntax: b
Prints out the average branching factor for the currently loaded knowledge base. Can only be invoked from the preprocessor.
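To illustrate how these commands fit together, a session might proceed roughly as follows. This is a hypothetical transcript: only the commands and options themselves are taken from the list above, and the knowledge base file name is made up. (On shruti-cm5, terminal input requires global synchronous mode, set with m -g.)

    i kb.input            encode the knowledge base from a file
    i                     read a query from the terminal
    can_sell(Mary,Book1)?
    r                     run the reasoning episode to completion
    d -p can_sell         display the current instantiations of can_sell
    s -q                  print statistics for the reasoning episode
    z                     reset the network
    q                     quit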


Appendix C

Glossary of WordNet Terms

We present a glossary of WordNet terms, especially the various relations listed in Table 7.4. The definitions here have been adapted from the WordNet documentation.

Synset A synonym set; a set of words that are interchangeable in some context.

Hypernym The generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a (kind of) Y.

Hyponym The specific term used to designate a member of a class. X is a hyponym of Y if X is a (kind of) Y.

Meronym The name of a constituent part of, the substance of, or a member of something. X is a meronym of Y if X is a part of Y.

Holonym The name of the whole of which the meronym names a part. Y is a holonym of X if X is a part of Y.

Entailment A verb X entails Y if X cannot be done unless Y is, or has been, done.

Troponym A verb expressing a specific manner elaboration of another verb. X is a troponym of Y if to X is to Y in some manner.

Pertainym A relational adjective. Adjectives that are pertainyms are usually defined by such phrases as "of or pertaining to" and do not have antonyms. A pertainym can point to a noun or another pertainym.

Attribute A noun for which adjectives express values. The noun weight is an attribute, for which the adjectives light and heavy express values.


Appendix D

SHRUTI on the CM-2

The Connection Machine CM-2 (TMC, 1991a) is a SIMD data parallel computing machine which can be configured with up to 64K processing elements. Each processor has several kilobits of local memory and can execute arithmetic and logical instructions, calculate memory addresses, read and store information in memory and perform interprocessor communication. The processors are organized as an n-dimensional hypercube. The CM-2 is controlled by a standard serial front end processor (usually a VAX or Sun machine). A sequencer decodes commands from the front end and broadcasts them to the data processors, all of which execute the same instruction simultaneously and synchronously. A NEWS grid provides fast communication between adjacent processors and a router network provides general interprocessor communication between any two processors.

The design and implementation of the SIMD parallel reflexive reasoning system on the CM-2, shruti-cm2, is based on knowledge-level partitioning (Section 3.4) of the underlying network generated by a knowledge base. We describe techniques used to encode the knowledge base and implement spreading activation when answering queries. We then explore the characteristics of the system by running a battery of tests. All discussion pertains only to backward reasoning.

D.1 Encoding the Knowledge Base

The knowledge base is encoded by presenting rules, facts and is-a relations to the shruti-cm2 system. The input syntax for rules, facts, is-a relations and queries is specified in Appendix A. Appendix B gives a listing of commands recognized by shruti-cm2.

D.1.1 Input Processing

A lexical analyzer and parser read the input, parse it and build internal data structures which represent the rules and/or facts presented to the system. All input processing is performed sequentially on the front end. As predicates and entities are recognized in the input, the parser builds hash tables which keep track of processor assignments. The hash tables can be used to efficiently access these predicates and entities while encoding rules and facts, posing queries and inspecting their state. Once a rule, fact or is-a relation has been recognized and processed, the resulting internal data structures are used to encode the knowledge base element on the Connection Machine processors. In the case of a query, the data structures are used to pose the query to the system.
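A minimal sketch of the kind of bookkeeping involved is shown below, assuming a simple string hash for random processor assignment. The hash function and constants are illustrative stand-ins, not the actual shruti-cm2 data structures.

    #define NPROCS 4096   /* e.g., a 4K CM-2 */

    /* Map a predicate or entity name to a (virtual) processor index.
       Any reasonable string hash yields an effectively random assignment. */
    unsigned int processor_of(const char *name)
    {
        unsigned int h = 0;
        while (*name)
            h = h * 31u + (unsigned char)*name++;
        return h % NPROCS;
    }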

D.1.2 Representing Knowledge Base Elements

Knowledge base elements are represented on the processors using parallel structures. A parallel structure allocates space for the specified structure on every processor. Figures D.1 and D.2 indicate the structures used to encode predicates, rules and facts in the rule-base.

typedef struct cm_pred                /* predicate on the CM */
{
    bool used;                        /* flag */
    byte noOfArgs;
    byte nextFree;                    /* index of next free bank */
    struct cm_predbank bank[K2];      /* predicate banks */
} CM_Pred;

typedef struct cm_predbank            /* predicate bank on the CM */
{   /* no fields used to encode KB */
    bool cChange;                     /* collector value changed */
    bool eChange;                     /* enabler value changed */
    byte collector;
    byte enabler;
    byte args[MAX_ARGS];              /* arg activation phase */
} CM_PredBank;

Figure D.1: C structures used to represent predicates in shruti-cm2.

In these figures, MAX_ARGS is the maximum number of arguments a predicate can have, and K2 is the multiple instantiation constant for predicates. Flags have type bool while processor indices have type index. The top part of each typedef contains fields used to encode the knowledge base while the bottom part has fields used in a given episode of reasoning. The structures used to encode concepts and is-a relations in the type hierarchy are similar, though simpler, as shown in Figures D.3 and D.4.

Note that a parallel structure is allocated for each knowledge base element: predicate, fact, rule, entity and is-a link. When the knowledge base grows and more space is needed, the size of the parallel structure is doubled. The virtual processor capability of the CM-2 then ensures that each (physical) processor houses two structures. This is transparent to the programmer, and one can still assume that each processor houses one structure, with double the number of (virtual) processors in the machine. Using this scheme, the representation automatically scales with the size of the knowledge base. As the number of virtual processors increases, the system will run proportionately slower. The virtual processor mechanism therefore provides a simple, scalable and transparent way of trading off time for space.

D.1.3 Encoding Rules and Facts

Depending on the processor allocation scheme used (Section 3.5), every predicate and entity appearing in the knowledge base will be assigned to a (virtual) processing element on the CM-2. Furthermore, a rule, fact or is-a relation that is being encoded will also be assigned to a (virtual) processor. These two processor allocations, one for the relevant predicates/entities and the other for the rule/fact/is-a relation under consideration, may or may not be independent. The actual details of the processor allocation are dictated by the processor assignment scheme being used. The current and more recent versions of shruti-cm2 use random processor assignment schemes for all knowledge base elements. Earlier versions used random allocation for predicates and concepts; however, facts and is-a relations were encoded on the processors containing the relevant predicate or concept, and rules were encoded on the processor containing the consequent predicate.

typedef struct cm_rule                /* rule slot on the CM */
{
    /* knowledge base encoding */
    bool used;                        /* flag */
    bool dummy;                       /* rule slot is dummy if flag set */
    index antecedent;                 /* invalid for head rule slots */
    index consequent;                 /* points to head slot in a dummy */
    byte noOfAnts;                    /* > 1 in a head rule slot */
    int weight;
    byte antNoOfArgs;                 /* invalid for head rule slots */
    byte argMap[MAX_ARGS];            /* arg mapping; invalid on head slot */
    byte splCond[MAX_ARGS];           /* not used in dummy slots */
    int splIndex[MAX_ARGS];           /* not used in dummy slots */

    /* reasoning episode */
    byte dummyCollector[K2];          /* used only in dummy slots */
    bool fire;                        /* rule can fire if set */
    bool selected;                    /* instantiation selected if set */
    byte nextBank;                    /* next conseq pred bank to consider */
    byte bankSelected[K2];            /* rule back pointer */
    /* NOTE: bankSelected[i] == j if bank i in the ant pred has
       instantiation from bank j in the conseq pred; valid only on
       non-head rule slots; in a head rule slot bankSelected[i] == i */
} CM_Rule;

typedef struct cm_fact                /* fact on the CM */
{
    bool used;                        /* flag */
    index factPred;                   /* fact predicate index */
    byte noOfArgs;
    index constant[MAX_ARGS];         /* fact arguments */
    bool active;                      /* fact active if set */
} CM_Fact;

Figure D.2: C structures used to encode rules and facts in shruti-cm2.


typedef struct cm_isabank             /* entity bank on the CM */
{   /* no fields used to encode KB */
    bool buRelay;                     /* bottom-up relay */
    bool tdRelay;                     /* top-down relay */
    byte activation;                  /* entity activation phase */
} CM_isABank;

typedef struct cm_entity              /* entity on the CM */
{
    bool used;                        /* flag */
    byte nextFree;                    /* index of next free bank */
    struct cm_isabank bank[K1];       /* entity banks */
} CM_Entity;

Figure D.3: C structures used to represent entities in shruti-cm2.

typedef struct cm_isalink             /* is-a links on the CM */
{
    bool used;                        /* flag */
    index source;                     /* source of link */
    index destination;                /* destination of link */
    bool selected;                    /* activation selected if set */
    byte nextBank;                    /* next entity bank to consider */
} CM_isALink;

Figure D.4: C structures used to represent is-a relations in shruti-cm2.

Once the knowledge base elements are assigned to processing elements on the CM-2, all that remains to be done in order to encode a rule, fact or is-a relation is to correctly fill out the various fields in the relevant structures. Encoding a fact involves the corresponding predicate and the entities filling the arguments of the predicate. Encoding a rule (is-a relation) involves two predicates (concepts) and a rule-slot (is-a link). If a rule has multiple predicates in the antecedent, the encoding is slightly more complex, as pictured in Figure D.5. The figure on the left indicates the encoding of single-antecedent rules while the figure on the right depicts the encoding of multiple-antecedent rules. Arrows indicate links which are implemented using interprocessor communication.
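For concreteness, the following serial sketch fills out a fact structure along the lines of Figure D.2. The type definitions are simplified stand-ins, and encode_fact is a hypothetical helper, not part of shruti-cm2.

    #define MAX_ARGS 4
    typedef int  index;
    typedef int  bool;
    typedef unsigned char byte;
    #define TRUE  1
    #define FALSE 0

    typedef struct cm_fact            /* as in Figure D.2 */
    {
        bool used;
        index factPred;
        byte noOfArgs;
        index constant[MAX_ARGS];
        bool active;
    } CM_Fact;

    /* Fill out a fact slot: pred and args hold the processor indices of
       the fact predicate and the entities filling its arguments. */
    void encode_fact(CM_Fact *slot, index pred, int nargs, const index *args)
    {
        slot->used = TRUE;
        slot->factPred = pred;
        slot->noOfArgs = (byte)nargs;
        for (int i = 0; i < nargs; i++)
            slot->constant[i] = args[i];
        slot->active = FALSE;         /* set during a reasoning episode */
    }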

D.2 Spreading Activation and Inference

Queries can be posed after the knowledge base has been encoded. Again, queries have a specific syntax (as described in Appendix A) and result in activating the relevant predicate and concepts in keeping with the description in (Shastri and Ajjanagadde, 1993) and (Mani and Shastri, 1993). The reasoning episode can then be run, either step-wise or to completion. The gross structure of the activation propagation loop is indicated in Figure 5.6. Phases in shruti are represented as "markers": integers with values ranging from 1 to the maximum number of phases.

[Figure D.5 depicts the encoding of single-antecedent rules (left) and multiple-antecedent rules (right), linking antecedent predicates, rule slots and consequent predicates; arrows indicate links implemented using interprocessor communication. In the single-antecedent case, the rule slot handles activation transformation and special variable checking. In the multiple-antecedent case, dummy rule slots handle activation transformation while head rule slots check for special variables.]

Figure D.5: Encoding single- and multiple-antecedent rules.

D.2.1 The Rule-Base

As shown in Figure 5.6, spreading activation in the rule-base consists of three steps:

 Propagating rule activation. Spreading activation in the rule-base by rule firing is achieved by executing the following. The various rule-slots mentioned here are shown in Figure D.5.

1. Every non-dummy rule-slot gets the instantiation in the consequent predicate bank under consideration.
2. Non-dummy rule-slots check if all special variables in the rule satisfy the required conditions.
3. If all special variable conditions are satisfied, the dummy rule-slots get the respective instantiations from the corresponding head rule-slot.
4. All non-head rule-slots transform the activation and send it to the respective antecedent predicates.

In the process of firing a rule, the system maintains sufficient book-keeping information to reverse-propagate collector activation to the consequent of a rule. Once a rule fires, it will not fire again unless a new bank of the consequent predicate becomes active. This ensures that the same rule does not fire repeatedly, thereby minimizing unnecessary interprocessor communication. Note also that the processor housing the rule-slot will need to communicate with other processors in order to get predicate bank instantiations, get information from the head rule-slot, send information to dummy rule-slots and send the transformed activation to the antecedent predicate.

 Checking fact matches for active predicates. Facts for predicates with active collectors are matched in parallel. Processors encoding the facts communicate with the processors housing the relevant predicates and concepts in order to check if the firing "phases" match. If a fact "fires", the collector of the corresponding predicate is activated.

 Reverse-propagating collector activation. Propagating collector activation to the predicate banks which originated the activation involves the following:

1. Non-head rule-slots get the state of the predicate collector.
2. Dummy rule-slots send the collector value to the head rule-slot, which accumulates all the incoming values.
3. Non-dummy rule-slots send the activation to the respective consequent predicates provided the collector activation exceeds a prespecified threshold. The threshold could depend on the number of antecedent predicates for the rule, the level of activation of the antecedent predicate(s), and/or other factors.

Rule-slots that have already reverse-propagated collector activation will not participate in this step. Again, this is done in order to minimize unnecessary interprocessor communication.
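Schematically, one iteration of this rule-base loop can be summarized as below. These are serial C stubs standing in for the data-parallel CM-2 operations; the function names are ours, not the system's.

    /* Stubs for the three data-parallel steps described above. */
    static void propagate_rule_activation(void)    { /* rule firing, steps 1-4 */ }
    static void match_facts(void)                  { /* parallel fact matching */ }
    static void reverse_propagate_collectors(void) { /* collector activation, steps 1-3 */ }

    /* One iteration of the rule-base portion of the loop of Figure 5.6. */
    void rule_base_iteration(void)
    {
        propagate_rule_activation();      /* consequent-to-antecedent firing */
        match_facts();                    /* activate collectors on matches */
        reverse_propagate_collectors();   /* antecedent-to-consequent answers */
    }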

D.2.2 The Type Hierarchy

Propagating activation in the type hierarchy is similar to spreading activation in the rule-base, except that it is simpler. Spreading bottom-up activation and top-down activation are handled separately in the type hierarchy. When spreading bottom-up (top-down) activation, all is-a links which have an active bank in the subconcept (superconcept) "fire" and spread activation to the respective superconcept (subconcept). The is-a link gets activation from the subconcept (superconcept) and sends it to the superconcept (subconcept). Again, in order to minimize communication, we ensure that any new activation traverses corresponding is-a links exactly once. Shruti-cm2 supports only default is-a relations and cannot handle labeled is-a relations (Section 5.3.2).

D.2.3 Multiple Instantiation

Multiple instantiation in shruti-cm2 is handled without the use of switches (Mani and Shastri, 1993). Predicates and concepts can accommodate K2 and K1 instances respectively. When spreading activation in the network, predicate and concept banks are considered one at a time. In other words, in any iteration of the propagation loop (Figure 5.6) only one active bank of a predicate or concept will be considered. Whenever a predicate or concept receives activation, it is compared with existing activation in the banks. If the incoming activation is not already represented, it is deposited into the next available bank. The rule- or link-slot that sent in the activation is notified that the instantiation it sent has been selected. In the rule-base, the rule-slot also receives the bank number accommodating the new instantiation. This information is needed when reverse-propagating collector activation. If the incoming activation is already represented in the predicate or concept, or if all banks are already in use, the incoming activation is discarded. Even in this case, rule-slots are notified so that they can proceed to the next bank of the consequent predicate. A rule-slot retries sending the same instantiation if it does not receive notification that the activation was either selected or discarded. This protocol simulates the function of the multiple instantiation switches, and brings about efficient dynamic allocation of predicate and concept banks.
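The bank-allocation protocol can be sketched as follows. This is a serial simplification: the equality test stands in for the phase-level comparison the system actually performs, and the return codes model the notifications sent back to the rule- or link-slot.

    #include <string.h>

    #define K2 3                      /* multiple instantiation constant */
    #define MAX_ARGS 4
    typedef unsigned char byte;

    typedef struct { byte args[MAX_ARGS]; } Inst;       /* one instantiation */
    typedef struct { int nextFree; Inst bank[K2]; } PredBanks;

    enum { SELECTED, DUPLICATE, FULL };

    /* Deposit an incoming instantiation into the next free bank, if any.
       On SELECTED, *bank_no tells the sender which bank accepted the
       instantiation, to be recorded for reverse propagation. */
    int deposit(PredBanks *p, const Inst *in, int *bank_no)
    {
        for (int i = 0; i < p->nextFree; i++)
            if (memcmp(&p->bank[i], in, sizeof *in) == 0)
                return DUPLICATE;     /* already represented: discard */
        if (p->nextFree == K2)
            return FULL;              /* all banks in use: discard */
        p->bank[p->nextFree] = *in;
        *bank_no = p->nextFree++;
        return SELECTED;
    }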

D.2.4 Statistics Collection

Apart from timing the reasoning episodes, shruti-cm2 can also be configured to gather data about several other aspects, including knowledge base parameters (number of predicates, concepts, rules, facts and is-a relations) and communication data (number of messages, sends and gets). Enabling full-fledged data collection can slow down the system due to the extra time needed to accumulate the required data.

[Plot: size of knowledge base (rules+facts, 0 to 200,000) vs. response time (0 to 14 seconds) for inference depths zero through ten; SIMD version 05.1 on a 4K PE CM-2, simple knowledge base with no special rules or type hierarchy.]

Figure D.6: Shruti-cm2 running on a CM-2 with 4K processors.

D.3 Characteristics of Shruti-cm2

Shruti-cm2 has been run on a 4K CM-2 and on a 32K CM-2. Both machines had 256 kilobits of memory on each processor. Figures D.6 and D.7 summarize the results of experiments run on these machines. The graphs show the effect of the size of the knowledge base on response time for queries which require inference depths ranging from 0 to 10. In these figures, the response time shown is the actual CM time used. The timing routines available on the CM-2 also report elapsed time for the reasoning episode. Elapsed time is affected by other processes running on the front end and is therefore unreliable. The experiments use random, unstructured knowledge bases with no is-a relations or rules with special variables. Queries were generated based on a special chain of rules and facts embedded in the random knowledge base, as described in Section 5.9. The inference path for a given query was tailored to ensure a reasonable branching factor.

Based on these and other experiments, and on the design of shruti-cm2, we can summarize the characteristics of the system:

 The response time is approximately linear with respect to the size of the knowledge base, for knowledge bases with up to 160,000 elements. Thus, as the size of the knowledge base increases, query response time increases proportionately. This is to be expected, since more predicates would be active on average, entailing proportionately more processing and interprocessor communication as the size of the knowledge base increases. Beyond a certain limit, we expect response time to increase steeply with the size of the knowledge base. However, effort was not expended in locating this limit or studying the characteristics of the system near this threshold since our focus shifted to the CM-5. As a result, all timing results stated here apply only to knowledge bases with up to 160,000 rules and facts.

 Time taken to answer a query increases as the average branching factor of the knowledge base increases. This again is caused by increased processing and interprocessor communication.

of memory on each processor. Figures D.6 and D.7 summarize the results of experiments run on these machines. The graph shows the e ect of the size of the knowledge base on response time for queries which require inference depths ranging from 0 to 10. In these gures, the response time shown is the actual CM time used. The timing routines available on the CM-2 also report elapsed time for the reasoning episode. Elapsed time is a ected by other processes running on the front end and is therefore unreliable. The experiments use random, unstructured knowledge bases with no is-a relations or rules with special variables. Queries were generated based on a special chain of rules and facts embedded in the random knowledge base, as described in Section 5.9. The inference path for a given query was tailored to ensure a reasonable branching factor. Based on these and other experiments, and on the design of shruti-cm2, we can summarize the characteristics of the system:  The response time is approximately linear with respect to the size of the knowledge base, for knowledge bases with up to 160,000 elements. Thus, as the size of the knowledge base increases, query response time increases proportionately. This is to be expected since more predicates would be active on average and would entail proportionately more processing and interprocessor communication as the size of the knowledge base increases. Beyond a certain limit, we expect response time to increase steeply with the size of the knowledge base. However, e ort was not expended in locating this limit or studying the characteristics of the system near this threshold since our focus shifted to the CM-5. As a result, all timing results stated here apply only to knowledge bases with up to 160,000 rules and facts.  Time taken to answer a query increases as the average branching factor of the knowledge base 154

[Plot: size of knowledge base (rules+facts, 0 to 200,000) vs. response time (0 to 5 seconds) for inference depths zero through ten; SIMD version 05.1 on a 32K PE CM-2, simple knowledge base with no special rules or type hierarchy.]

Figure D.7: Shruti-cm2 running on a CM-2 with 32K processors.


 Increasing the inference depth needed to answer a query proportionately increases response time. Every extra inference step requires an extra activation propagation step (i.e., an extra iteration of the loop in Figure 5.6).

 Response time is approximately inversely proportional to the number of (physical) processing elements on the machine. This can be attributed to the increased computing power and the lower "density" (with fewer knowledge base elements per processor), which results in enhanced parallelism.

 The time taken to answer a query ranges from a fraction of a second to a few tens of seconds.

 An inherent problem with the use of parallel variables on the CM-2 is inefficient memory usage. Since the number of virtual processors must always be a power of two, this could potentially lead to significant waste of memory. There appears to be no simple solution to this problem without breaking out of SIMD operation. SPMD implementations on the CM-5 avoid this problem.

 The maximum size of the knowledge base that can be encoded on a machine depends on the total amount of memory available on the machine. In addition, with increasingly large knowledge bases, the communication bottleneck would also significantly slow down the system.

D.4 Shruti-cm2 on the CM-5

In this section, we briefly evaluate shruti-cm2 running on the CM-5. Since shruti-cm2 is written in C*, and a C* compiler is available for the CM-5, shruti-cm2 can be recompiled to run on the CM-5. Shruti-cm2 running on the CM-5 uses the CM-5 in data-parallel (SIMD) mode. Figure D.8 summarizes the results.

[Plot: size of knowledge base (rules+facts, 0 to 200,000) vs. response time (0 to 10 seconds) for inference depths zero through ten; SIMD version 05.1 on a 64 PE CM-5, simple knowledge base with no special rules or type hierarchy.]

Figure D.8: Shruti-cm2 running on a CM-5 with 64 processors, in SIMD mode.

The graph shows the effect of the size of the knowledge base on response time for queries which require inference depths ranging from 0 to 10. Comparing Figure D.8 with Figures D.6 and D.7, we observe that the performance of shruti-cm2 on the CM-5 is comparable to that on the CM-2,1 though message passing on the CM-5 appears to be more robust.

1 The rule of thumb seems to be that a 32 node CM-5 is approximately equivalent to a CM-2 with 8K processing elements.


Bibliography

Ahmad, S. (1991). VISIT: An efficient computational model of human visual attention. Technical Report TR-91-049, International Computer Science Institute.

Ajjanagadde, V. and Shastri, L. (1991). Rules and variables in neural nets. Neural Computation, 3:121-134.

Ajjanagadde, V. G. (1991). Abductive reasoning in connectionist networks: Incorporating variables, background knowledge, and structured explananda. Technical Report WSI-91-7, Wilhelm-Schickard Institute, University of Tübingen.

America, P., editor (1991). Parallel Database Systems. Springer-Verlag, Berlin.

Anderson, J. A. and Rosenfeld, E. (1989). Neurocomputing: Foundations of Research. The MIT Press, Cambridge, MA.

Aronis, J. (1993). Implementing Inheritance with Roles on the Connection Machine. PhD thesis, University of Pittsburgh.

Arvind and Culler, D. E. (1986). Dataflow architectures. Technical Report MIT/LCS/TM-294, Massachusetts Institute of Technology.

Asanovic, K., Beck, J., Callahan, T., et al. (1993). CNS-1 architecture specification. Technical Report TR-93-021, International Computer Science Institute, Berkeley, CA.

Asanovic, K., Beck, J., et al. (1992). SPERT: A VLIW/SIMD microprocessor for artificial neural network computations. Technical Report TR-91-072, International Computer Science Institute, Berkeley, CA.

Baresh, S. C. and Eshera, M. A. (1989). The systolic array neurocomputer: Fine-grained parallelism at the synaptic level. In International Joint Conference on Neural Networks, volume 2, page 613, Washington, DC. Abstract only.

Barnden, J. A. and Pollack, J. B., editors (1991). Advances in Connectionist and Neural Computation Theory, Volume 1. Ablex Publishing Corporation, Norwood, NJ.

Barnden, J. A. and Srinivas, K. (1991). Encoding techniques for complex information structures in connectionist systems. Connection Science, 3(3):269-315.

Barnes, G. (1990). How to use Xgrab.

Beck, J. (1990). The ring array processor (RAP): Hardware. Technical Report TR-90-048, International Computer Science Institute, Berkeley, CA.

Belloch, G. and Rosenberg, C. R. (1987). Network learning on the connection machine. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, Milan, Italy.

Bilmes, J. and Kohn, P. (1991). The ring array processor (RAP): Software architecture. Technical Report TR-90-050, International Computer Science Institute, Berkeley, CA.

Brown, J. R., Garber, M. M., and Vanable, S. F. (1988). Artificial neural network on a SIMD architecture. In Proceedings of Frontiers '88, The Second Symposium on the Frontiers of Massively Parallel Processing, pages 43-47, Fairfax, VA.

Carpenter, P. A. and Just, M. A. (1977). Reading comprehension as eyes see it. In Just, M. A. and Carpenter, P. A., editors, Cognitive Processes in Comprehension. Erlbaum.

Charniak, E. (1983). Passing markers: A theory of contextual inference in language comprehension. Cognitive Science, 7(3):171-190.

Chinn, G., Grajski, K. A., Chen, C., Kuszmaul, C., and Tomboulian, S. (1990). Systolic array implementations of neural nets on the MasPar MP-1 massively parallel processor. In International Joint Conference on Neural Networks, pages 169-173, San Diego, CA.

Chu, L.-C. and Wah, B. W. (1992). Optimal mapping of neural-network learning on message-passing multicomputers. Journal of Parallel and Distributed Computing, 14(3):319-339.

Chung, S.-H., Moldovan, D., and Tung, Y.-W. (1989). Modeling semantic networks on the connection machine. Technical Report CENG 89-13, University of Southern California.

Clark, K. and Gregory, S. (1987). PARLOG: parallel programming in logic. In Shapiro, E., editor, Concurrent Prolog: Collected Papers, volume 1, pages 84-139. MIT Press, Cambridge, MA.

Clark, K. and McCabe, F. (1981). The control facilities of IC-Prolog. In Mitchie, D., editor, Expert Systems in the Micro Electronic Age. Edinburgh University Press, Edinburgh.

Clocksin, W. F. and Mellish, C. S. (1981). Programming in Prolog. Springer-Verlag, Berlin.

Colmerauer, A. (1986). Theoretical model of Prolog II. In van Caneghem, M. and Warren, D., editors, Logic Programming and Its Application, pages 3-31. Ablex, Norwood, NJ.

Conery, J. S. (1987). Parallel Execution of Logic Programs. Kluwer Academic Publishers, Boston, MA.

Cottrell, G. W. (1989). A Connectionist Approach to Word Sense Disambiguation. Morgan Kaufmann, San Mateo, CA.

de Kergommeaux, J. C. and Codognet, P. (1994). Parallel logic programming systems. ACM Computing Surveys, 26(3):295-336.

Deprit, E. (1989). Implementing recurrent back-propagation on the connection machine. Neural Networks, 2(4):295-314.

Dewan, H. M., Stolfo, S. J., and Woodbury, L. (1994). Scalable parallel and distributed expert database systems with predictive load balancing. Journal of Parallel and Distributed Computing, 22(3):506-522.

Dewitt, D. J. and Gerber, R. (1985). Multiprocessor join algorithms. In Proceedings of the International Conference on Very Large Databases.

Diegert, C. (1990). Out-of-core backpropagation. In International Joint Conference on Neural Networks, volume 2, pages 97-103, San Diego, CA.

Dietz, P., Krizanc, D., Rajasekaran, S., and Shastri, L. (1993). A lower bound result for the common element problem and its implication for reflexive reasoning. Technical Report MS-CIS-93-73, Department of Computer and Information Science, University of Pennsylvania.

Dijkstra, E. W. (1975). Guarded commands, non-determinacy and formal derivation of programs. Communications of the ACM, 18(8).

Doorenbos, B., Tambe, M., and Newell, A. (1992). Learning 10,000 chunks: What's it like out there? In AAAI-92: Proceedings of the Fourteenth National Conference of the American Association of Artificial Intelligence, pages 830-836.

Evett, M. P. (1994). PARKA: A System for Massively Parallel Knowledge Representation. PhD thesis, University of Maryland at College Park.

Evett, M. P., Andersen, W. A., and Hendler, J. A. (1993). Massively parallel support for efficient knowledge representation. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1325-1330.

Evett, M. P. and Hendler, J. A. (1992). An update of PARKA, a massively parallel knowledge representation system. Technical Report CS-TR-2850, University of Maryland at College Park.

Evett, M. P., Hendler, J. A., and Spector, L. (1990). PARKA: Parallel knowledge representation on the connection machine. Technical Report CS-TR-2409, University of Maryland at College Park.

Fahlman, S. E. (1979). NETL: A System for Representing and Using Real World Knowledge. MIT Press, Cambridge, MA.

Feldman, J. A. and Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6(3):205-254.

Feldman, J. A., Fanty, M. A., and Goddard, N. H. (1988). Computing with structured neural networks. IEEE Computer, pages 91-103.

Fontaine, T. (1992). Data-parallel training of spatiotemporal connectionist networks on the connection machine. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 555-559.

Forrest, B. M., Roweth, D., Stroud, N., Wallace, D. J., and Wilson, G. V. (1987). Implementing neural network models on parallel computers. Computer Journal, 30(5):413-419.

Frazier, G. (1990). TeraOPs and TeraBytes for neural networks research. Texas Instruments Technical Journal, 7(6):22-33.

Fronhofer, B. and Wrightson, G., editors (1992). Parallelization in Inference Systems. Springer-Verlag, Berlin.

Fujimoto, Y. and Fakuda, N. (1989). An enhanced parallel toroidal lattice architecture for large scale neural network simulations. In International Joint Conference on Neural Networks, volume 2, page 614, Washington, DC. Abstract only.

Geller, J. (1994). Inheritance operations in massively parallel knowledge representation. In Kanal, L. N., Kumar, V., Kitano, H., and Suttner, C. B., editors, Parallel Processing for Artificial Intelligence 1, pages 95-113. North-Holland, New York, NY.

Goddard, N. H. (1992). The Perception of Articulated Motion: Recognizing Moving Light Displays. PhD thesis, University of Rochester.

Grajski, K. A., Chinn, G., Chen, C., Kuszmaul, C., and Tomboulian, S. (1990). Neural network simulation on the MasPar MP-1 massively parallel processor. In International Neural Network Conference, volume 2, page 673, Paris, France. Abstract only.

Guha, R. V. and Lenat, D. B. (1990). CYC: A mid-term report. AI Magazine, 11(3):32-59.

Guha, R. V. and Lenat, D. B. (1991). CYC: A mid-term report. Applied AI, 5(1):45-86.

Gupta, S. N., Zubair, M., and Grosch, C. E. (1990). Simulation of neural networks on a massively parallel computer (DAP-510) using sparse matrix techniques. In Proceedings of the 3rd Symposium on the Frontiers of Massively Parallel Computation, pages 376-379, College Park, MD.

Hammerstrom, D. (1994). Neural network, image processing and pattern recognition hardware. Neural Information Processing Systems Tutorial Notes.

Hammerstrom, D. and Nguyen, N. (1991). An implementation of Kohonen's self-organizing map on the Adaptive Solutions neurocomputer. In International Conference on Artificial Neural Networks, volume 1, pages 715-720, Espoo, Finland.

Hastings, H. M. and Waner, S. (1987). Neural nets on the MPP. In Fisher, J. R., editor, Frontiers of Massively Parallel Scientific Computation. NASA.

Henderson, J. (1993). Description Based Parsing in a Connectionist Network. PhD thesis, University of Pennsylvania.

Hillis, W. D. (1985). The Connection Machine. MIT Press, Cambridge, MA.

Hoare, C. A. R. (1978). Communicating sequential processes. Communications of the ACM, 21(8):666-676.

Hobbs, J. R., Stickel, M. E., Appelt, D. E., and Martin, P. (1993). Interpretation as abduction. Artificial Intelligence, 63(1-2):69-142.

Hsiao, D. (1983). Advanced Database Machine Architectures. Prentice Hall, Englewood Cliffs, NJ.

James, M. and Hoang, D. (1992). Design of low-cost, real-time simulation systems for large neural networks. Journal of Parallel and Distributed Computing, 14(3):221-235.

Jurafsky, D., Wooters, C., Tajchman, G., Segal, J., Stolcke, A., Fosler, E., and Morgan, N. (1994a). The Berkeley restaurant project. In Proceedings of the International Conference on Speech and Language Processing, Yokohama, Japan. To appear.

Jurafsky, D., Wooters, C., Tajchman, G., Segal, J., Stolcke, A., and Morgan, N. (1994b). Integrating advanced models of syntax, phonology, and accent/dialect with a speech recognizer. In AAAI Workshop on Integrating Speech and Natural Language Processing, Seattle. To appear.

Kanal, L. N., Kumar, V., Kitano, H., and Suttner, C. B., editors (1994). Parallel Processing for Artificial Intelligence 1. North-Holland, New York, NY.

Kanerva, P. (1988). Sparse Distributed Memory. MIT Press, Cambridge, MA.

Kitano, H. (1993). Challenges of massive parallelism. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 813-834.

Kitano, H., Hendler, H., Higuchi, T., Moldovan, D., and Waltz, D. (1991). Massively parallel artificial intelligence. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 557-562.

Kitano, H., Kumar, V., and Suttner, C. B., editors (1994). Parallel Processing for Artificial Intelligence 2. North-Holland, New York, NY.

Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480.

Koikkalainen, P. (1991). MIND: A specification formalism for neural networks. In International Conference on Artificial Neural Networks, volume 1, pages 579–584, Espoo, Finland.
Koikkalainen, P. and Oja, E. (1988). Specification and implementation environment for neural networks using communicating sequential processes. In International Conference on Neural Networks, volume 1, pages 533–540, San Diego, CA.
Kolodner, J. L. and Thau, R. (1988). Design and implementation of a case memory. Technical Report RL88-1, Thinking Machines Corporation.
Koutsofios, E. and North, S. C. (1993). Drawing graphs with dot. dot User's Manual.
Kowalski, R. A. (1979). Logic for Problem Solving. Elsevier Science, New York.
Kung, H. T. (1982). Why systolic architectures? IEEE Computer, pages 37–46.
Kung, S. Y. and Hwang, J. N. (1988). Parallel architectures for artificial neural nets. In International Conference on Neural Networks, volume 2, pages 165–172, San Diego, CA.
Kung, S. Y. and Hwang, J. N. (1989). A unified systolic architecture for artificial neural networks. Journal of Parallel and Distributed Computing, 6:358–387.
Kurfess, F. (1991). Parallelism in Logic. Verlag Vieweg, Germany.
Lakoff, G. and Johnson, M. (1980). Metaphors We Live By. University of Chicago Press, Chicago.
Lange, T. E. (1990). Simulation of heterogeneous neural networks on serial and parallel machines. Parallel Computing, 14(3):287–303.
Lange, T. E. and Dyer, M. G. (1989). High-level inferencing in a connectionist network. Connection Science, 1(2):181–217.
Lee, W. S. and Sheu, P. C.-Y. (1994). An object-oriented query evaluation scheme for logical databases in a massively parallel environment. IEEE Transactions on Knowledge and Data Engineering, 6(1):181–187.
Leighton, F. T. (1992). Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, CA.
Lenat, D. B., Guha, R. V., et al. (1990). CYC: Towards programs with common sense. Communications of the ACM, 33(8):30–49.
Lin, W.-M., Prasanna, V. K., and Przytula, K. W. (1991). Algorithmic mapping of neural network models onto parallel SIMD machines. IEEE Transactions on Computers, 40(12):1390–1401.
Lloyd, J. W. (1987). Foundations of Logic Programming. Springer-Verlag, Berlin.
Mani, D. R. (1995). Shruti-cm5: A user and programmer's manual. Forthcoming Report.
Mani, D. R. and Shastri, L. (1991a). Combining a connectionist type hierarchy with a connectionist rule-based reasoner. In Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society. Cognitive Science Society, Lawrence Erlbaum. Condensed version of (Mani and Shastri, 1991b).
Mani, D. R. and Shastri, L. (1991b). Combining a connectionist type hierarchy with a connectionist rule-based reasoner. Technical Report MS-CIS-91-33, University of Pennsylvania.

Mani, D. R. and Shastri, L. (1992a). A connectionist solution to the multiple instantiation problem using temporal synchrony. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society. Cognitive Science Society, Lawrence Erlbaum. Condensed version of (Mani and Shastri, 1992b).
Mani, D. R. and Shastri, L. (1992b). Multiple instantiation of predicates in a connectionist rule-based reasoner. Technical Report MS-CIS-92-05, University of Pennsylvania.
Mani, D. R. and Shastri, L. (1993). Reflexive reasoning with multiple instantiation in a connectionist reasoning system with a type hierarchy. Connection Science, 5(3 & 4):205–242.
Mani, D. R. and Shastri, L. (1994). Massively parallel real-time reasoning with very large knowledge bases: An interim report. Technical Report TR-94-031, International Computer Science Institute.
Männer, R., Horner, R., Hanser, R., and Genthner, A. (1989). Multiprocessor simulation of neural networks with NERV. In Proceedings of the Supercomputing Conference, pages 457–465, Reno, NV.
McAllester, D. A. (1993). Automatic recognition of tractability in inference relations. Journal of the Association for Computing Machinery, 40(2):284–303.
McCartor, H. (1990). Back-propagation implementation on the Adaptive Solutions CNAPS neurocomputer chip. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 1028–1031. Morgan Kaufmann, San Mateo, CA.
Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K., and Tengi, R. (1990). Five papers on WordNet. Technical Report CSL-43, Princeton University. Revised March 1993.
Mohan, C., Pirahesh, H., Tang, W. G., and Wang, Y. (1994). Parallelism in relational database management systems. IBM Systems Journal, 33(2):349–371.
Moldovan, D. I. (1989). RUBIC: A multiprocessor for rule-based systems. IEEE Transactions on Systems, Man, and Cybernetics, 19(4):699–706.
Moldovan, D. I. (1993). Parallel Processing: From Applications to Systems. Morgan Kaufmann, San Mateo, CA.
Moldovan, D. I., Lee, W., Lin, C., and Chung, M. (1992). SNAP: Parallel processing applied to AI. Computer, 25(5):39–50.
Mueller, S. and Gomes, B. (1994). Efficient mapping of randomly sparse neural networks on parallel vector supercomputers. In Sixth IEEE Symposium on Parallel and Distributed Processing.
Murali, P., Wechsler, H., and Manohar, M. (1990). Fault-tolerance and learning performance of the back-propagation algorithm using massively parallel implementation. In Proceedings of Frontiers '90, The Third Symposium on the Frontiers of Massively Parallel Processing, pages 364–367, College Park, MD.
Naish, L. (1984). MU-Prolog 3.1db reference manual. Melbourne University, Australia.
Nelson, M. E., Furmanski, W., and Bower, J. M. (1989). Simulating neurons and networks on parallel computers. In Koch, C. and Segev, I., editors, Methods in Neuronal Modeling: From Synapses to Networks, pages 397–437. MIT Press, Cambridge, MA.
Newell, A. (1992). Unified theories of cognition and the role of Soar. In Michon, J. A. and Akyurek, A., editors, Soar: A Cognitive Architecture in Perspective, pages 25–79. Kluwer Academic, Netherlands.

Nordstrom, T. and Svensson, B. (1992). Using and designing massively parallel computers for artificial neural networks. Journal of Parallel and Distributed Computing, 14(3):260–285.
Núñez, F. J. and Fortes, J. A. (1990). Performance of connectionist learning algorithms on 2-D SIMD processor arrays. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 810–817. Morgan Kaufmann, San Mateo, CA.
Obermayer, K., Ritter, H., and Schulten, K. (1990). Large-scale simulations of self-organizing neural networks on parallel computers: Application to biological modeling. Parallel Computing, 14(3):381–404.
Olson, T. J. (1989). An architectural model for visual motion understanding. Technical Report TR 305, University of Rochester.
Ousterhout, J. K. (1994). Tcl and the Tk Toolkit. Addison-Wesley, Reading, MA.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pomerleau, D. A., Gusciora, G. L., Touretzky, D. S., and Kung, H. T. (1988). Neural network simulation at warp speed: How we got 17 million connections per second. In Proceedings of the International Conference on Neural Networks, San Diego, CA.
Porter, B., Lester, J., Murray, K., Pittman, K., Souther, A., Acker, L., and Jones, T. (1988). AI research in the context of a multifunctional knowledge base: The botany knowledge base project. Technical Report AI88-88, University of Texas.
Przytula, K. W. and Prasanna, V. K., editors (1993). Parallel Digital Implementations of Neural Networks. Prentice Hall, Englewood Cliffs, NJ.
Quillian, M. R. (1968). Semantic memory. In Minsky, M., editor, Semantic Information Processing, pages 227–270. MIT Press, Cambridge, MA.
Regier, T. (1992). The acquisition of lexical semantics for spatial terms: A connectionist model of perceptual categorization. Technical Report TR-92-062, International Computer Science Institute.
Rogers, D. (1988). Kanerva's sparse distributed memory: An associative memory algorithm well-suited to the Connection Machine. In Proceedings of the Conference on Scientific Applications of the Connection Machine, volume 1, pages 282–298, Moffett Field, CA.
Rumelhart, D. E., Hinton, G., and Williams, R. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.
Rumelhart, D. E. and McClelland, J. L. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.
Schubert, L., Papalaskaris, M., and Taugher, J. (1987). Accelerating deductive inference: Special methods for taxonomies, colors and times. In Cercone, N. and McCalla, G., editors, The Knowledge Frontier, pages 187–220. Springer-Verlag, New York, NY.
Shams, S. and Przytula, K. W. (1990). Mapping of neural networks onto programmable parallel machines. In Proceedings of the IEEE International Symposium on Circuits and Systems, volume 4, pages 2613–2617, New Orleans, LA.
Shapiro, E., editor (1987). Concurrent Prolog: Collected Papers. MIT Press, Cambridge, MA.

Shapiro, E. (1989). The family of concurrent logic programming languages. ACM Computing Surveys, 21(3).
Shasha, D. and Wang, T. L. (1991). Optimizing equijoin queries in distributed databases where relations are hash partitioned. ACM Transactions on Database Systems, 16(2).
Shastri, L. (1988). Semantic Networks: An Evidential Formulation and its Connectionist Realization. Morgan Kaufmann, San Mateo, CA.
Shastri, L. (1989). Default reasoning in semantic networks: A formalization of recognition and inheritance. Artificial Intelligence, 39(3):283–355.
Shastri, L. (1990). Connectionism and the computational effectiveness of reasoning. Theoretical Linguistics, 16(1):65–87.
Shastri, L. (1991a). The relevance of connectionism to AI: A representation and reasoning perspective. In Barnden, J. A. and Pollack, J. B., editors, Advances in Connectionist and Neural Computation Theory, Volume 1. Ablex Publishing Corporation, Norwood, NJ.
Shastri, L. (1991b). Why semantic networks? In Sowa, J. F., editor, Principles of Semantic Networks: Explorations in the Representation of Knowledge. Morgan Kaufmann, San Mateo, CA.
Shastri, L. (1993). A computational model of tractable reasoning: Taking inspiration from cognition. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence.
Shastri, L. (1994). Negation in shruti. Personal Communication.
Shastri, L. and Ajjanagadde, V. (1993). From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16(3):417–494.
Siemon, H. P. and Ultsch, A. (1990). Kohonen networks on transputers: Implementation and animation. In International Neural Network Conference, volume 2, pages 643–646, Paris, France.
Singer, A. (1990). Implementations of artificial neural networks on the Connection Machine. Parallel Computing, 14(3):305–315.
Smotroff, I. G. (1990). Dataflow architectures: Flexible platforms for neural network simulation. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 818–825. Morgan Kaufmann, San Mateo, CA.
Sterling, L. and Shapiro, E. (1986). The Art of Prolog. MIT Press, Cambridge, MA.
Sun, R. (1991). Connectionist models of rule-based reasoning. In Proceedings of the Thirteenth Cognitive Science Conference, Chicago, IL.
Tick, E. (1991). Parallel Logic Programming. MIT Press, Cambridge, MA.
TMC (1991a). Connection Machine CM-200 technical summary. Technical Report CMD-TS200, Thinking Machines Corporation, Cambridge, MA.
TMC (1991b). Connection Machine CM-5 technical summary. Technical Report CMD-TS5, Thinking Machines Corporation, Cambridge, MA.
TMC (1993). CMMD Reference Manual. Version 3.0. Thinking Machines Corporation, Cambridge, MA.

TMC (1994). CM-5 User's Guide. CMost Version 7.3. Thinking Machines Corporation, Cambridge, MA.
Treleaven, P. C. (1991). PYGMALION neural network programming environment. In International Conference on Artificial Neural Networks, volume 1, pages 569–578, Espoo, Finland.
Ueda, K. (1987). Guarded Horn clauses. In Shapiro, E., editor, Concurrent Prolog: Collected Papers, volume 1, pages 140–156. MIT Press, Cambridge, MA.
Valduriez, P. (1993a). Parallel database systems: Open problems and new issues. Distributed and Parallel Databases, 1(2):137–165.
Valduriez, P. (1993b). Parallel database systems: The case for shared-something. In Proceedings of the IEEE International Conference on Data Engineering, pages 460–465.
Valduriez, P. and Gardarin, G. (1984). Join and semi-join algorithms for a multiprocessor database machine. ACM Transactions on Database Systems, 9(1).
von Eicken, T., Culler, D. E., Goldstein, S. C., and Schauser, K. E. (1992). Active messages: A mechanism for integrated communication and computation. In Proceedings of the Nineteenth International Symposium on Computer Architecture. ACM Press.
Waltz, D. L. (1990a). Massively parallel AI. In Proceedings of the National Conference on Artificial Intelligence.
Waltz, D. L. (1990b). Memory-based reasoning. In Arbib, M. A. and Robinson, J. A., editors, Natural and Artificial Parallel Computation, pages 251–276. MIT Press, Cambridge, MA.
Waltz, D. L. and Pollack, J. B. (1985). Massively parallel parsing: A strongly interactive model of natural language interpretation. Cognitive Science, 9(1):51–74.
Wang, C.-J., Wu, C.-H., and Sivasundaram, S. (1989). Neural network simulation on shared-memory vector multiprocessors. In Proceedings of the Supercomputing Conference, pages 197–204, Reno, NV.
Watanabe, T., Sugiyama, Y., Kondo, T., and Kitamura, Y. (1989). Neural network simulation on a massively parallel cellular array processor: AAP-2. In International Joint Conference on Neural Networks, volume 2, pages 155–161, Washington, DC.
Webber, S. H. (1989). Figurative adjective-noun interpretation in a structured connectionist network. In Proceedings of the Eleventh Annual Conference of the Cognitive Science Society. Cognitive Science Society, Lawrence Erlbaum.
Weigert, T. J. and Tsai, J. J. P. (1994). A computationally tractable nonmonotonic logic. IEEE Transactions on Knowledge and Data Engineering, 6(1):57–63.
Wilensky, R. (1983). Planning and Understanding: A Computational Approach to Human Reasoning. Addison-Wesley.
Wilson, S. S. (1989). Neural computing in a one-dimensional SIMD array. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 206–211.
Wolfram, S. (1991). Mathematica: A System for Doing Mathematics by Computer. Addison-Wesley, Redwood City, CA.
Wolfson, O. and Ozeri, A. (1993). Parallel and distributed processing of rules by data-reduction. IEEE Transactions on Knowledge and Data Engineering, 5(3):523–530.

Yoon, H., Nang, H. H., and Maeng, S. R. (1990). A distributed backpropagation algorithm of neural networks on distributed-memory multiprocessors. In Proceedings of Frontiers '90, The Third Symposium on the Frontiers of Massively Parallel Processing, pages 358–363, College Park, MD.
Zhang, W., Wang, K., and Chau, S.-C. (1995). Data partition and parallel evaluation of Datalog programs. IEEE Transactions on Knowledge and Data Engineering, 7(1):163–176.
Zhang, X., McKenna, M., Mesirov, J. P., and Waltz, D. L. (1990). An efficient implementation of the back-propagation algorithm on the Connection Machine CM-2. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 801–809. Morgan Kaufmann, San Mateo, CA.
