
Lessons Learned in Data Reverse Engineering

Kathi Hogshead Davis
Department of Computer Science
Northern Illinois University
DeKalb, IL 60115
[email protected]

Abstract

Reverse engineering of data has been performed in one form or another for over twenty-five years. In this paper we describe the lessons learned in data reverse engineering (DRE) as contributed in a survey of data reverse engineers. It is interesting that some of the lessons learned tell us how we are doing in the process of initial database design as well as how difficult the DRE process really is. We hope that these lessons can help suggest the next steps that are needed in the DRE area and promote discussion among the DRE community.

1. Introduction

In 2000, at the Working Conference on Reverse Engineering, we presented a historical survey of the papers that had been published in the area of data reverse engineering [DAVI00b]. The one major question that arose from the discussion of the historical survey was “What have we learned in the last twenty-five years?” From this question follows another: “And where do we go from here using the lessons learned?”

In order to start to answer the first question, we conducted an informal survey of about a dozen people whom we knew to have performed some type of data reverse engineering in the past. We asked the data reverse engineers to send us via email the top three lessons that they learned in their DRE experience. Eight people responded with their top three lessons, which we consider an acceptable response rate for such an informal survey.

This paper is not intended to be the definitive description of all the lessons that have been learned in DRE over the last twenty-five years. Rather, we hope that this paper generates thought and discussion about what we have learned. From here, we hope that we can more accurately continue the research into data reverse engineering by building upon these lessons. To that end we surveyed researchers, real world database designers, and toolmakers. We hope the reader finds the lessons described here as interesting, enlightening, and thought provoking as we did.

2. What is Data Reverse Engineering?

For readers who do not yet know what data reverse engineering is, we offer this section as a quick definition. Data reverse engineering (DRE) is the process of recovering information about an application from the data and its schema within an existing software system. DRE involves analyzing a legacy data environment to extract the existing data structures, from which a logical schema is derived. The logical schema can then be abstracted to produce a conceptual schema. The results of DRE, be they the logical schema, the conceptual schema, or any other documentation of the legacy system, have many diverse uses, from just understanding the current system to assisting in the migration to a new system.

The term “data reverse engineering” evolved from the more generic term reverse engineering. Data reverse engineering techniques form a more restricted subset of those used in reverse engineering. As Elliot Chikofsky stated in his preface to Peter H. Aiken’s book Data Reverse Engineering: Slaying the Legacy Dragon [CHIK96]: “Reverse engineering is a process to achieve understanding of the structure and interrelationships of a subject system. It is the goal of reverse engineering to create representations that document the subject and facilitate our understanding it – what it is, how it works, and how it does not work. As a process, reverse engineering can be applied to each of the three principal aspects of a system: data, process, and control. Data reverse engineering concentrates on the data aspect of the system that is the organization. It is a collection of methods and tools to help an organization determine the structure, function, and meaning of its data.”

Proceedings of the Eighth Working Conference On Reverse Engineering (WCRE’01) 0-7695-1303-4/02 $17.00 © 2002 IEEE

It is the restriction of data reverse engineering to the data portion of a software system that makes it both a complex and a most interesting activity. With new technology being introduced faster and faster, data reverse engineering is becoming more common and increasingly a necessity. In order to work on solutions to the Y2K date problem, data engineers performed data reverse engineering. They used techniques to analyze their data and understand the extent to which the millennium date change would affect their software systems, some without even realizing that they were actually performing data reverse engineering.

Data reverse engineering is regularly performed on one’s own software systems for a variety of reasons, ranging from obtaining a basic understanding of the current data for system maintenance, to allowing system integration, to converting the existing data system to a new database management system. The process of DRE is even being extended further into the area of software engineering by being used to extract the (actual and possibly hidden) business rules of an organization [FU 00].

Acquiring the knowledge necessary for the development of a software system is another use of DRE. In system analysis and design, we are taught to study the current system prior to designing the new system. DRE assists in the knowledge acquisition process, from which a new system can be designed, new functions can be added to the existing system, or the current system can be migrated to use new technologies. In addition, DRE is used to assess the quality of products of outside vendors.
Michael Blaha, an independent consultant and trainer in the area of modeling, database design, and reverse engineering, argues in “How to Recognize Database Winners and Losers” [BLAH99a] that “software practices are all over the map, which means you must look closely at whatever you plan to buy. Reverse engineering will unmask the errors that can cripple an application – before you shell out money to the vendor.”

Jean-Luc Hainaut, in “The Nature of Data Reverse Engineering” [HAIN00], summarized the process: “Data reverse engineering is not the most exciting engineering activity… Basically, DRE seldom is goal per se, but most often is the first step in a broader engineering project. It is generally intended to redocument, convert, restructure, maintain, or extend legacy applications.”

DRE is used to recover the complete technical and functional specifications of the data within an organization’s applications. The principal objective of data reverse engineering is to create an understanding of the current data environment, whether that consists of flat files or a database management system. The data reverse engineering process consists of techniques that produce documentation of the current data environment. The output documentation usually consists of some form of diagram (such as the entity-relationship diagram or an extension thereof) that represents the current objects and relationships within the data. The output of the DRE process may also include a data dictionary describing the current data. The resulting model is intended to assist the user in understanding the current environment, no matter what the user wishes to do with the information.
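
As a minimal sketch of the first step described in this section, extracting the existing data structures and producing a data dictionary, the following Python fragment reads tables, columns, and declared foreign keys from the catalog of a SQLite database. It is an illustration only and not a tool or method reported by any survey respondent: the choice of SQLite, the function names extract_data_dictionary and build_demo_db, and the customer/orders tables are all assumptions made for the example. It recovers only what the schema declares, so relationships that the designers never declared will not appear.

```python
import sqlite3


def extract_data_dictionary(conn):
    """Return {table: {"columns": [...], "foreign_keys": [...]}} read from the catalog."""
    dictionary = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {
                "name": row[1],
                "type": row[2],
                "not_null": bool(row[3]),
                "primary_key": row[5] > 0,
            }
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        foreign_keys = [
            {"column": row[3], "ref_table": row[2], "ref_column": row[4]}
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        ]
        dictionary[table] = {"columns": columns, "foreign_keys": foreign_keys}
    return dictionary


def build_demo_db():
    """Create a small in-memory database (hypothetical tables, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            placed_on TEXT
        );
        """
    )
    return conn


if __name__ == "__main__":
    for table, info in extract_data_dictionary(build_demo_db()).items():
        print(table, info)
```

The resulting dictionary is a starting point for a logical schema; abstracting it into a conceptual schema still requires the human judgment discussed in the lessons.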

3. The Lessons

The lessons learned by the respondents to our survey can be divided into several categories.

• Lessons about the data and its design
• Lessons about the process of data reverse engineering
• Lessons about data reverse engineering research versus real world applications
• Lessons about the perception of data reverse engineering

Some of the respondents learned similar lessons and others learned different lessons, but there were absolutely no contradictory lessons described by the respondents.

3.1 Lessons about the data and its design

One thing that surprises Michael Blaha is that database development practice is still very poor. He says, “the state-of-the-practice is much less than the state-of-the-art” [BLAH99b]. The lesson discovered here during DRE is that we are not doing good initial database design even though we know what to do. The questions arise: Why not? Why aren’t practitioners producing better quality database designs? Along these same lines, Blaha has discovered that practice is highly variable. He is amazed by the variety and could not have imagined all the odd ways of designing databases that he sees in practice. Why are standards not being established?

Addressing the question of what the existing data looks like and the standards in place, Howard Duncan says that one of the lessons he learned is that an organization having strong standards in place for its database design process is no guarantee that the people doing the design actually follow them. He is a professor at Dublin City University who has worked on porting data from one database to another [DUNC00]. If we are establishing standards, why are we not using them?

Roger Chiang, who teaches at the College of Business Administration at the University of Cincinnati, discovered that the instances of the data could be analyzed to provide additional information for DRE [CHIA94]. However, the quality of the data is an important factor that must be analyzed to determine the usefulness of the data itself. How do we know the quality of the data? Is asking the user the only way to determine the quality of data?

One lesson that has been learned during DRE pertains to the data model underlying the database. Network or “navigational databases contain more semantic richness than relational databases,” says Isabelle Comyn-Wattiau, a professor at the University of Cergy-Pontoise [AKOK00]. “However we cannot be sure that the semantic richness of the navigational models was correctly used by designers.” Again we see the recurring problem of whether or not database designers actually do the design accurately.

We seem to see a theme running through all the lessons learned about the data and the database design: the data is only as good as the design and standards used, and in practice good design techniques are not being used as they should be. We all knew that most legacy systems have poor data, but the DRE process is teaching us that even the databases designed more recently are not of good quality. Why is this? Who is to blame? How do we fix it? Can we use DRE to assist in fixing the problem?
Can we use DRE to assist in educating users as to why good quality design is essential? How do we convince real world database designers to do quality work?

3.2 Lessons about the process of data reverse engineering

One of the most interesting things discovered by the author and others about the DRE process is that we can gain information about the application while performing DRE. The information can be domain knowledge of the application, as the author learned and described in [DAVI00a], or it can be about the application itself. Comyn-Wattiau, who discovered a similar lesson, says that application semantics that are lost during the initial data design process can be partially recovered from within the data. Blaha says, “one can readily learn so much about an application from reverse engineering… I have learned that reverse engineering lets me get inside the developers’ heads and even understand their way of thinking.”

If domain knowledge is available, Ghannouchi et al. [GHAN98] showed that its use in DRE minimizes the user’s intervention in the process. The question then arises: How much domain knowledge is enough to proceed with a DRE? Is there a point where a person does not have enough domain knowledge to even begin a DRE process on an application?

Another lesson learned is that the DRE process can enhance the analysis of programs. Suzanne Embury of The University of Manchester notes, “The additional semantics provided by the data model, schema, and well-defined interface to the DBMS gives us a huge helping hand in trying to analyze programs that work with them. In particular, it provides us with more possibilities for extracting knowledge in a declarative form from source code.” How useful is this in practice?

In another lesson, Jacky Akoka of Conservatoire National des Arts et Métiers learned that efficient reverse engineering cannot be conducted by examining schema information alone; scanning the data is also necessary. How much more information can be gathered by looking at the data? What are the cost/benefit differences between using just the schema in the DRE process and also looking at the data?

One common lesson we learned about the DRE process can be summed up in one statement that was suggested by almost all of the respondents: DRE is difficult. This lesson might seem to be an obvious one to the reader. However, it is a lesson that we keep learning over and over again. One person even suggested that it might not be feasible to perform DRE on legacy relational databases. Then the question becomes: at what level of data quality does the DRE process become too costly when compared to the benefits?

To aid with the difficulty of DRE, many have learned that human input is a must. To assist the human input, Jean Henrard of Institut d’Informatique - University of Namur states that “tools are essential for real size projects.” He also says, “tools need [to be] customized to each project.” Each project differs in its programming language, DBMS, programming style, and the quality of the data and documentation. Where do we stand today on tools? Exactly what is available that is worth using?
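
To make concrete the lesson above that scanning the data, and not only the schema, can yield additional information, the following Python fragment sketches one common data-scanning check: testing whether every value stored in one column also appears in a candidate key column (an inclusion dependency), which suggests a foreign key the schema never declared. This is our own illustration under stated assumptions and is not presented as the method of Akoka, Chiang, or any other respondent: the SQLite database, the function names column_values and candidate_foreign_keys, and the dept/emp tables are all hypothetical. An inclusion that holds in the data is evidence, not proof, and, as noted in Section 3.1, poor data quality can hide or fake such a relationship.

```python
import sqlite3


def column_values(conn, table, column):
    """Return the set of distinct non-NULL values stored in table.column."""
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL'
    )
    return {row[0] for row in rows}


def candidate_foreign_keys(conn, key_columns, other_columns):
    """Return (child, parent) pairs where every child value also appears in parent.

    key_columns and other_columns are lists of (table, column) pairs chosen by
    the analyst; choosing them is part of the human input to the process.
    """
    found = []
    for child in other_columns:
        child_values = column_values(conn, *child)
        if not child_values:
            continue  # an empty column gives no evidence either way
        for parent in key_columns:
            if parent != child and child_values <= column_values(conn, *parent):
                found.append((child, parent))
    return found


def build_legacy_db():
    """Create a small in-memory database with no declared keys (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dept (deptno INTEGER, dname TEXT);
        CREATE TABLE emp (empno INTEGER, ename TEXT, dno INTEGER);
        INSERT INTO dept VALUES (10, 'Sales'), (20, 'Research'), (30, 'Audit');
        INSERT INTO emp VALUES (1, 'Ada', 10), (2, 'Bob', 20), (3, 'Cy', 10), (4, 'Di', NULL);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_legacy_db()
    pairs = candidate_foreign_keys(
        conn, [("dept", "deptno")], [("emp", "dno"), ("emp", "empno")]
    )
    for child, parent in pairs:
        print(f"{child[0]}.{child[1]} may reference {parent[0]}.{parent[1]}")
```

Comparing every column against every candidate key grows quickly with schema size, which is one reason the cost/benefit question raised above matters for real size projects.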


Duncan says that human input in the DRE process needs to be limited to the process itself and must not extend to modifying the output. If an automated DRE process is used, we must not “touch-up” the output. Doing so invalidates the automated DRE process, so that a manual process, which is prone to inaccuracies, might as well have been used. Also, we must be sure that the DRE process is more complete than just a mapping from one database to another. The mapping of one database management system to another does not a data reverse engineering make. So exactly what constitutes a DRE? Translation of data models? Reverse engineering of the system? All of the above?

3.3 Lessons about data reverse engineering research versus real world applications

The main lesson learned about data reverse engineering research versus real world applications has to do with size. Embury discovered that “Lots of papers on data reverse engineering deal with tiny examples, and only a small number of cases. But the real problems being tackled by people in industry are on a whole different planet, both in terms of size and complexity.” Akoka, who shares this opinion, says, “proposing an efficient reverse engineering method requires a real life example.” There is a large gap between theory and reality that is not being addressed by the research community. Why not? How can we address the theory-to-reality needs of industry?

3.4 Lessons about the perception of data reverse engineering

The lesson learned about the perception of data reverse engineering is summed up in the statement by Akoka: “DRE is a subject addressed by a very small research community.” This poses problems for the organizations that need DRE in terms of understanding exactly what DRE is and why they should do it. Henrard has found that “it is very important to explain to the customer what is DRE and why they need DRE and the expected results.” DRE can cost a lot of money for what the customer sees as nothing more than an understanding of the existing system. It is very difficult for the customer to evaluate the result; it is easier for them to evaluate the process. This is another reason the customer needs to be involved in the DRE project. Do we need “evangelism” about DRE to educate users, as Henrard points out in his response to the survey?

4. Where Do We Go from Here?

This is the question all DRE researchers are asking themselves. What is needed next? From the informal survey, we can summarize how the respondents answered this question in several points.

First, we need to “get the word out” that DRE is important to anyone with data. As Blaha asked, how do we inform the broader software community about the extensive knowledge that an organization can gain from DRE? Henrard goes so far as to say we need to be evangelistic about DRE. What are the ways we can get the word to more than just the few researchers? One thing we can suggest is a special issue of an industrial magazine dedicated to DRE. We feel that researchers generally know the value of DRE. It is industry that needs to be reminded of the benefits that DRE can provide.

Secondly, we think that researchers now need to look more closely at performing their DRE research on industrial sized data. (Maybe this can assist with point one.) But how do we go about convincing industry that we are capable enough in doing DRE to handle their data? How do we go about convincing industry that we need to use data of their volume in our research so that we can further assist them in their DRE? We feel that these are tough questions to answer. We think that we must publicize the successes that exist (maybe in the industrial magazine issue mentioned previously). We also need to be realistic when it comes to the failures that have occurred. Showing organizations that we are learning from our mistakes and improving the DRE process will go a long way toward gaining the respect and confidence needed to allow us to continue to work with real industrial sized data.

Thirdly, many more tools are needed. DRE is a repetitive process that can be greatly assisted by the use of tools. Creating a good, usable tool is a long process. Only a few tools are currently available, and none is really used regularly in an industrial setting. Exactly where are we in terms of the tools available? Where do we go from here? To meet this need, one of the things that needs to be completed is a survey of the currently existing tools. A description of each tool, including exactly what it can do, along with an evaluation of the tools in action is needed. The survey would not be an easy task, but it would be extremely beneficial to both industrial and research data reverse engineers.


5. Conclusion

In conclusion, this paper has presented a summary of where DRE is today, at least from the standpoint of a few knowledgeable people. We have raised some questions about where and how DRE can move on from here. We also hope that the information contained here serves as a basis for discussion in the DRE community.

6. References

The majority of papers written on data reverse engineering have been published in either of the following two conferences: the Working Conference on Reverse Engineering or the International Entity-Relationship Conference (now called the International Conference on Conceptual Modeling).

[AKOK00] – Akoka, Jacky and Isabelle Comyn-Wattiau, “Migration Schema Mapping vs. Reverse Engineering of Network Databases”, Data Reverse Engineering Workshop, EuroRef, Seventh Reengineering Forum, Reengineering Week 2000, Zurich, Switzerland, March 2000.
[BLAH97] – Blaha, Michael, “Dimensions of Database Reverse Engineering”, Proceedings of the Fourth Working Conference on Reverse Engineering, October 1997, pages 176-183.
[BLAH99a] – Blaha, Michael, “The Case for Reverse Engineering”, IT Pro, March/April 1999, pages 35-41.
[BLAH99b] – Blaha, Michael, “How to Recognize Database Winners and Losers”, IT Pro, May/June 1999, pages 20-25.
[BLAH99c] – Blaha, Michael, “Reverse Engineering for Software Quality”, Software Technology and Engineering Practice (STEP 99), Pittsburgh, August 1999.
[CHIA94] – Chiang, R.H.L., T.M. Barron, and V.C. Storey, “Reverse Engineering of Relational Databases: Extraction of an EER Model from a Relational Database”, Data & Knowledge Engineering, 12, 1994.
[CHIK96] – Chikofsky, Elliot, “The Necessity of Data Reverse Engineering”, Preface to Data Reverse Engineering: Slaying the Legacy Dragon, McGraw-Hill, 1996, pages xiii-xvi.
[DAVI00a] – Davis, Kathi Hogshead, “Gaining Domain Knowledge through Data Reverse Engineering: An Experience Report”, Data Reverse Engineering Workshop, EuroRef, Seventh Reengineering Forum, Reengineering Week 2000, Zurich, Switzerland, March 2000.
[DAVI00b] – Davis, Kathi Hogshead, “Data Reverse Engineering: A Historical Survey”, Proceedings of the Seventh Working Conference on Reverse Engineering, Brisbane, Australia, November 2000, IEEE Computer Society, pages 70-78.
[DUNC00] – Duncan, Howard and Alan Cooke, “Some Experience in Porting a Relational Database from Concurrent OS32 to IBM DB2”, Data Reverse Engineering Workshop, EuroRef, Seventh Reengineering Forum, Reengineering Week 2000, Zurich, Switzerland, March 2000.
[FU 00] – Fu, G., X. Liu, J. Shao, S.M. Embury, and W.A. Gray, “Extracting Business Rules from Legacy Information Systems”, Data Reverse Engineering Workshop, EuroRef, Seventh Reengineering Forum, Reengineering Week 2000, Zurich, Switzerland, March 2000.
[GHAN98] – Ghannouchi, Sonia, H. Hadjami, B. Ghezela, and F. Kamoun, “A Generic Approach for Data Reverse Engineering Taking into Account Application Domain Knowledge”, Proceedings of the 2nd Euromicro Conference on Software Maintenance and Reengineering (CSMR’98), 1998.
[HAIN95] – Hainaut, J-L., “DB-MAIN: A Programmable CASE Tool for Database Applications Engineering”, Tutorial on Database Reverse Engineering, presented at both IFORSID’95 and CAiSE95 Conferences, 1995.
[HAIN00] – Hainaut, J-L., “The Nature of Data Reverse Engineering”, Data Reverse Engineering Workshop, EuroRef, Seventh Reengineering Forum, Reengineering Week 2000, Zurich, Switzerland, March 2000.
[HENR00] – Henrard, J., J-L. Hainaut, J-M. Hick, D. Roland and V. Englebert, “From micro-analytical Techniques to Mass Processing in Data Reverse Engineering – The Economic Challenge”, Data Reverse Engineering Workshop, EuroRef, Seventh Reengineering Forum, Reengineering Week 2000, Zurich, Switzerland, March 2000.