Master Thesis

New generation of document acquisition in a digital repository: the Invenio example

by

Raja Sripada

CERN-THESIS-2011-045
08/04/2011

Supervisors:

Mr. Samuele Kaplun, IT-UDS-CDS, CERN
Dr. Martin Rajman, I&C-LIA, EPFL

March 18, 2011

Abstract

The Invenio software package, originally developed at CERN, is free and open-source integrated digital-library software for running a document repository on the web. The document acquisition module of Invenio, ’WebSubmit’, provides a framework for the repository manager to build interactive forms that gather documents and all the related meta-data from users, and to set up workflows to subsequently process them. The current implementation of WebSubmit, due to its legacy design, has several drawbacks: the repository manager must always type HTML, and often Python code, directly into web forms, and must redundantly repeat several steps. The current architecture also relies heavily on unnecessary disk I/O and database operations, and the configuration process is error-prone. In this master project, a new design of WebSubmit is proposed and a prototype is implemented. The new WebSubmit tackles the above issues by requiring a single, simpler text-based configuration file per submission, by allowing re-use of configuration files, HTML and Python code across multiple submissions, and by minimizing I/O and database operations. It makes the most common and frequent configurations easier while retaining and enhancing, in a variant form, the flexibility and power of the ’workflow’ paradigm of the old WebSubmit.


Acknowledgments

I would like to express my gratitude to Samuele Kaplun, who supervised my Master's project work at CERN, for his patient guidance and for the valuable time he spent on me. I thank my supervisor at EPFL, Dr. Martin Rajman, for directing this project work. I am also thankful to Jean-Yves Le Meur for giving me the wonderful opportunity to be a part of his team. I convey my appreciation to all the CDS team members at CERN for their technical support and for the many enriching experiences during the 20 weeks that I worked with them. Finally, I thank Marisa Marciano Wynn, Internships Coordinator, I&C-EPFL, for her support during the application process to CERN, and Antonella Martin-Veltro, Secretary, SIN-EPFL, for assisting me with the administrative issues.


Contents

1 Introduction
    1.1 CERN
    1.2 Invenio software
    1.3 WebSubmit module
    1.4 A Case Study - Council Document Submission
2 The Project
    2.1 Motivation
    2.2 Objectives
    2.3 Technologies
3 Resources
    3.1 INI Configuration files
    3.2 Workflow Engine
    3.3 Plugins from Invenio software
4 New WebSubmit
    4.1 Configuration files
    4.2 Workflow Engine changes
    4.3 Class design and Implementation
        4.3.1 Submission
        4.3.2 Interface
        4.3.3 Interface Element
        4.3.4 Session
        4.3.5 WebInterface
    4.4 Example configuration
5 Conclusion
6 Future Work
Appendices
    A Template file
Bibliography

Chapter 1

Introduction

1.1 CERN

CERN, the European Organization for Nuclear Research[1], located in Geneva, Switzerland, is the world’s largest particle physics laboratory. It was established in 1954 by twelve European member states and currently has twenty. Since its inception, the organization has gained unprecedented insight into the universe of particles by building and operating powerful particle accelerators and colliders, and it has also contributed to the advancement of science and technology in general. Of particular interest to computer scientists and engineers is the fact that it is the birthplace of the World Wide Web, invented by Sir Tim Berners-Lee in 1989.[2] Other notable contributions from CERN include those in the area of Grid Computing, since computing is an important challenge given the enormous amounts of data generated during the course of the experiments.[3] The Swiss Federal Institute of Technology at Lausanne (EPFL)[4] collaborates with some of the teams at CERN, one of which is the team responsible for the development of the document repository software, Invenio[5].

1.2 Invenio software

Invenio is the digital library software conceived and developed by the CDS (CERN Document Server)¹ team at CERN. It is free software which can be downloaded, installed and configured² to run one's own digital library on the web. The software has been developed at CERN to meet the requirement of handling potentially millions of documents, including multimedia. It is a complete package comprising multiple modules that cover all aspects of a document repository, from document acquisition to search[7].

¹ A specific instance of the Invenio Software at CERN[6].
² Once the software is downloaded, the INSTALL guide can be found in the source tree.

1.3 WebSubmit module

The ’WebSubmit’ module[8] of Invenio is the document acquisition module. It is responsible for rendering the submission interfaces, validating form data, storing the submitted meta-data and the full text(s) on disk, and so on. The stored meta-data is used by the ’bibconvert’ module[9] to fill in the template files provided by the admin (repository manager) and generate the formatted bibliographic records. These records are then uploaded into the database by the ’bibupload’ utility[10]. All these tasks are done by running workflows pre-configured by the admin. The details of the current WebSubmit are elaborated in the following section, where a specific submission for ’Council Documents’ is discussed.

1.4 A Case Study - Council Document Submission

The CERN Council, which comprises representatives of the member states, meets regularly to address various activities of the organization. The documents presented at these meetings have to be uploaded into the CDS (CERN Document Server). A submission interface has been designed and a submission has been configured for these Council documents with two available ’actions’, namely ’Submit’ a document and ’Modify’ the bibliographic data of an existing document, as shown in Figure 1.1.

Figure 1.1: Council Documents Submission Interface showing available actions

The Submit action’s interface is shown in Figure 1.2. The mandatory fields are marked with ’*’. In this submission, the constituent form fields are simple HTML elements and an HTML table.

Figure 1.2: Submission Interface


On submission of this form, the data is validated, first by JavaScript checks and then by server-side checks provided by the admin as Python functions. After this, the form data is stored on disk in multiple files, one file for every field. The form data stored in these files is used by the ’bibconvert’ module of Invenio to create bibliographic ’records’. A bibliographic record consists of ’MARC’ tags[14] and their corresponding values. These records are uploaded into the database by the ’bibupload’ module so that they can later be searched, formatted for presentation in the search results, and so on.

The record creation is done using the template files provided by the admin when a submission is configured. The template file used for the creation of the record after the ’submit’ action is shown in Appendix A. One can notice the complexity involved in creating a template file. One template file has to be provided for every action. Therefore, another template file must be provided for the ’modify’ action, even though the two are quite similar. This is one redundancy in the current WebSubmit. Moreover, template files cannot be re-used across submissions. If a form field value has to be broken down into multiple values associated with multiple MARC tags, then the corresponding file has to be read from disk and more files have to be created on disk with appropriate names and values. By appropriate names, we mean that the file names have to match the names in the templates provided, so that the templates can be filled with the values read from the files to create the record.

On choosing the ’Modify’ action in Figure 1.1, the interface rendered for entering the number of the record to be modified is shown in Figure 1.3.

Figure 1.3: Modification Interface

If the entered record number is found, the user is taken again through the interface shown in Figure 1.2 in order to modify the existing data and resubmit it. Even though the interfaces look relatively simple on the user side, in order to render these forms in the current WebSubmit, the admin has to write not only the HTML code but also Python code (for the creation of the HTML table with its constituent fields shown in Figure 1.2). The ’submit’ action’s configuration page for the admin is shown in Figure 1.4. This entire information is stored in a database table. Also, the created form fields cannot be reused in any other form within this submission or in other submissions.

Figure 1.4: Configuration page for ’Submit’ action

For every action (Submit, Modify etc.), there is an associated ’workflow’. A workflow does everything from rendering an interface, to creating and uploading the record, to sending notification emails. Workflows consist of Python functions. The workflow that is run for the submit action is shown in Figure 1.5. There is, similarly, a different workflow for the modify action. A configured workflow cannot be reused in another submission. This ’workflow’ paradigm is a very powerful feature of the current WebSubmit and will be used in a variant form in the new WebSubmit too.


Figure 1.5: Workflow for the ’Submit’ action


Chapter 2

The Project

2.1 Motivation

The following observations have been made with regard to the existing WebSubmit.

1. There are multiple stages that the admin must faithfully follow in the configuration of any submission.
2. There is redundant work involved in the creation of multiple submissions, with no re-use of created form elements, record creation mechanisms and workflows across submissions or even within a particular submission. Re-usability is an important issue since any installation can have multiple submissions and each submission can have many associated actions. For example, in CDS there are over 100 submissions, and each submission has, on average, 2 to 3 actions.
3. Template files have to be provided by the admin for every submission in order to create records. Writing the template files is a complex and error-prone task.
4. Creating a submission form, however simple, involves writing HTML code.
5. Creating relatively complex HTML elements like tables involves writing Python code with file I/O operations.
6. These file read/write operations are numerous and hence not only expensive but also difficult to handle.
7. The created forms’ data, including the HTML, is stored in database tables after configuration.
8. Simple configuration scenarios are encountered quite often, but the work of the admin is not simplified enough. This applies to the various fields in a form, and hence to the data checks required and, finally, to the record created. For example, Title and Author(s) appear in most of the submissions, and the data check corresponding to Author(s) is almost the same for every submission. This fact should be used to make configuration easier for the admin.
9. The workflow specification mechanism for the various actions, using Python functions, is powerful. Hence, it should be retained in some variant form in future versions of WebSubmit.
10. On the user side, the rendered submission forms are not adequately interactive.

These drawbacks call for a new WebSubmit.

2.2 Objectives

The objective of the new WebSubmit is to address each of the issues mentioned in the above section. It should be designed to tackle them by requiring a single and simple configuration file per submission, allowing re-usability of the configuration files, html and python code and minimizing I/O and database operations. The new WebSubmit should make configuration as simple and as convenient as possible to the admin while at the same time not restricting what the admin can potentially do. Thus, the most common scenarios should be accounted for while the tools used should be powerful enough.

2.3 Technologies

The following are the main technologies involved in Invenio Software. The development environment is GNU/Linux. Python is the programming language used for Invenio. Python is notable for its simplicity and dynamism. It has many redefinition capabilities and is suitable for test-driven development[11]. The coding style[12] follows the recommendations of PEP 8[13]. The internal record format is the library standard MARC in its MARCXML form[14]. It is flexible enough to cope with the meta-data representation requirements that may arise in the future. MySQL is used for the database, and the Git version control system is used for the development of the framework.[15]

Chapter 3

Resources

The following sections describe the available resources which will be used in the new WebSubmit.

3.1 INI Configuration files

’ini’ files, short for initialization files, are used for the configuration of submissions. They have a simple structure consisting of name-value pairs that can be grouped into sections. Thus, the structure is:

    ; comments
    [section1]
    name1=value1
    [section2]
    name1=value1
    ; comments
    name2=value2
    ...

’ConfigObj’[16] is an available Python module for reading and writing ini files as Python dictionaries. It also supports features that extend the .ini paradigm, such as nested sections, list values and multi-line values, among others.
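The idea of reading an ini file into Python dictionaries can be sketched as follows. This is only an illustration: it uses the standard library's configparser as a stand-in for ConfigObj, so it does not cover ConfigObj's extensions such as nested sections or list values, and the section and key names are simply those of the structure shown above.

```python
import configparser

# Sample ini text mirroring the structure described in this section.
SAMPLE = """\
; comments
[section1]
name1 = value1

[section2]
name1 = value1
; comments
name2 = value2
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Convert the parsed sections into plain nested dictionaries.
config = {section: dict(parser[section]) for section in parser.sections()}
print(config)
```

Each section becomes a dictionary of its name-value pairs, and the comment lines are discarded by the parser.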

3.2 Workflow Engine

As mentioned previously in section 1.4, we retain the powerful notion of workflows in the new WebSubmit. However, the tool is different. The workflow engine that we use to run the pre-configured workflows is more intuitive, simpler to configure and much more powerful. It was developed internally at CERN by Roman Chyla[17]. Thus, in the new WebSubmit too, for every action (Submit, Modify etc.) there is an associated pre-configured workflow. However, with the new workflow ’engine’, a workflow can be specified as a Python list of lists whose elements are Python functions that return callables. A typical workflow can be seen in figure 4.2.

It is possible to jump back and forth among the callables of the workflow with simple commands like ’jumpCallForward’ and ’jumpCallBack’, which can be issued from within the callables. Thus, the callables are executed by the engine according to the control flow determined by the callables themselves. This was not possible with the workflow concept of the existing WebSubmit. The control flow is thus in the hands of the admin who configures the workflow.
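To make the notion of callables steering the control flow concrete, here is a minimal toy engine. It is not the CERN workflow engine: the class MiniEngine, its jump_back method and the helper functions are hypothetical names chosen for this sketch, and the workflow is a flat list rather than a list of lists.

```python
class MiniEngine:
    """Toy engine: runs callables in order and lets them redirect the flow."""

    def __init__(self, workflow):
        self.workflow = workflow  # flat list of callables
        self.index = 0            # position of the callable being run
        self._next = 0            # position to run afterwards
        self.log = []

    def jump_back(self, steps=1):
        # Hypothetical analogue of a "jump back" command.
        self._next = max(self.index - steps, 0)

    def run(self, data):
        while self.index < len(self.workflow):
            self._next = self.index + 1
            self.workflow[self.index](data, self)
            self.index = self._next


def ask(values):
    """Return a callable that reads the next scripted input value."""
    inputs = iter(values)

    def _ask(data, engine):
        data["title"] = next(inputs)
        engine.log.append("ask")
    return _ask


def check_not_empty(field):
    """Return a callable that steps back when the field is empty."""
    def _check(data, engine):
        if not data[field]:
            engine.log.append("retry")
            engine.jump_back(1)
    return _check


def finish():
    def _finish(data, engine):
        engine.log.append("done")
    return _finish


workflow = [ask(["", "Higgs results"]), check_not_empty("title"), finish()]
engine = MiniEngine(workflow)
data = {}
engine.run(data)
print(engine.log)
```

The first, empty input makes the check step jump back to the asking step; the second input passes the check and the workflow runs to completion.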

3.3 Plugins from Invenio software

Some of the available utilities from Invenio Software which are or will be used in the new WebSubmit are as follows.

• webinterface_handler and urlutils for Apache request handling.
• webpage, a generic utility for displaying Invenio pages.
• webuser for working with different users’ login and authentication credentials and dealing with sessions in Invenio.
• htmlutils for generating the HTML code corresponding to various HTML form elements by providing the relevant attributes. For example, invenio.htmlutils.H.input(name="xyz", value="abc", type="textarea") returns the HTML markup for that element.
• mailutils for sending emails. send_email() is the main API function.
• pluginutils, a module which has a generic plugin container class with a dictionary interface.

Chapter 4

New WebSubmit

The new WebSubmit has a single ini configuration file per submission. This ini file is parsed by an enhanced version of the Python module configobj described in section 3.1. For the specification of the workflows, it uses an enhanced version of the workflow engine described in section 3.2. A hierarchical class structure is followed for the classes corresponding to the elements/fields that make up the HTML forms in the interfaces. A session is created at the start of a workflow and is valid until its completion. Sections 4.1 and 4.2 describe the changes made to configobj and to the workflow engine respectively, while in section 4.3 the class diagram of the new WebSubmit is presented and explained. Section 4.4 gives an example configuration.

4.1 Configuration files

The configuration files used are ’ini’ files parsed by ’configobj’, mentioned in Section 3.1. configobj has been extended with the following extra functionality.

Include other ini files in an ini file: The configuration files are parsed as Python dictionaries by configobj. Hence, inclusion is done by merging the dictionaries corresponding to the ini files in the right order.

Substitute a section/subsection with the contents of another: If a (sub)section’s name is preceded by $, then the dictionary is searched recursively from the beginning for the (sub)section with that name. If found, the first (sub)section’s contents are substituted with the found (sub)section’s contents. If no (sub)section is found, an error is raised.

Override a section/subsection with the contents of another: If a (sub)section’s name is preceded by ^ and the dictionary has a (sub)section with the same name at the same level, then the contents of the first (sub)section are overridden with those of the second. If no (sub)section is found, nothing is done.

The following conventions must be followed when naming the sections.

• All interface section names must start with the word ’interface’ and the workflow section names with ’workflow’.
• Within an ’interface*’ section, a similar naming convention holds for the subsections corresponding to the interface’s page description, the interface’s form fields and its associated checks. They must start with ’description’, ’fields’ and ’checks’ respectively.
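The three extensions can be illustrated on plain nested dictionaries. The sketch below is not the modified configobj code: the function names merge, find_section and resolve are hypothetical, the section and key names are made up, and the override rule follows one possible reading of the description, namely that a ^-marked section replaces a same-named sibling wholesale and is kept under its plain name when no sibling exists.

```python
import copy


def merge(base, extra):
    """Inclusion: recursively merge `extra` into a copy of `base`."""
    result = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def find_section(tree, name):
    """Depth-first search for the first (sub)section called `name`."""
    for key, value in tree.items():
        if isinstance(value, dict):
            if key == name:
                return value
            found = find_section(value, name)
            if found is not None:
                return found
    return None


def resolve(tree, root=None):
    """Apply the $ (substitute) and ^ (override) markers."""
    root = tree if root is None else root
    resolved, overrides = {}, {}
    for key, value in tree.items():
        if not isinstance(value, dict):
            resolved[key] = value
        elif key.startswith("$"):
            target = find_section(root, key[1:])
            if target is None:
                raise KeyError("no section named %r to substitute" % key[1:])
            resolved[key[1:]] = resolve(target, root)
        elif key.startswith("^"):
            overrides[key[1:]] = resolve(value, root)
        else:
            resolved[key] = resolve(value, root)
    resolved.update(overrides)  # overrides win over same-named siblings
    return resolved


# Dictionaries as they might come out of parsing two ini files.
included = {"fields-approval": {"decision": "radio"}}
main = {
    "interface-approval": {"$fields-approval": {}},
    "^workflow1": {"steps": "b"},
}
config = resolve(merge(included, main))
print(config)
```

Here the empty ’$fields-approval’ subsection receives the contents of the included ’fields-approval’ section, and the marker on ’^workflow1’ is dropped because no sibling called ’workflow1’ exists.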

4.2 Workflow Engine changes

Some additions and modifications were made to the workflow engine in collaboration with the original author. The workflow engine had no functionality to be halted and then resumed at a later time. This was required for our WebSubmit, since any workflow would have to wait for user input or for some other process before resuming. With the changes, any callable returned by the functions comprising the workflow can raise the exception ’HaltProcessing’ to halt the engine. The exception is then handled by dumping the workflow engine together with its state, so that it can be loaded and resumed later.

The callables handle everything from rendering the interfaces for the submission to the procedures that are triggered after the submission. Every callable takes two objects as parameters: session and engine. The session object is used for accessing the session data, which includes the data submitted by the user(s) via the interfaces. The engine object is the communication medium between the callables, as well as between the callables and the webinterface handler. We use the ’store’ attribute (a dictionary) of the ’engine’ object for this communication. The methods setVar(variable_name, variable_value), getVar(variable_name), hasVar(variable_name) and delVar(variable_name) can be called on this object for accessing and modifying this attribute. There are two standard variables in ’store’:

’interface_id’: The interface (as specified in the configuration file) that should be displayed upon halting the engine. Set this variable before halting the engine. Delete it as soon as the workflow resumes.

’next_user’: The list of user ids corresponding to the users who are allowed to access some part of the workflow. It can be set in any callable to restrict access to the workflow from that point on. Deleting the variable implies universal access. The default is universal access.

Another noteworthy variable is ’status’, which holds the status of the workflow and is displayed when the session is accessed via its id at any point in time.

Here is a function that returns a callable which sets the variable ’interface_id’, signaling to the webinterface that it should render an interface. This function is called in the workflow shown in figure 4.2.

    def buildInterface(interface):
        # HaltProcessing is the exception provided by the workflow engine.
        def _buildInterface(session, engine):
            # Tell the webinterface which interface to render on halting.
            engine.setVar('interface_id', interface)
            engine.setVar('status', 'Halted')
            raise HaltProcessing
        return _buildInterface
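The halt, dump, load and resume cycle can be sketched with a toy engine. This is an illustration under stated assumptions, not the actual engine: ToyEngine, its position attribute and the use of pickle for dumping are choices made for this example, the HaltProcessing class is a local stand-in for the engine's exception, and only the setVar/getVar/hasVar/delVar names are taken from the text above.

```python
import pickle


class HaltProcessing(Exception):
    """Local stand-in for the engine's halting exception."""


class ToyEngine:
    def __init__(self, workflow):
        self.workflow = workflow  # flat list of callables
        self.position = 0         # index of the next callable to run
        self.store = {}           # shared variables

    def setVar(self, name, value):
        self.store[name] = value

    def getVar(self, name):
        return self.store[name]

    def hasVar(self, name):
        return name in self.store

    def delVar(self, name):
        del self.store[name]

    def run(self, session):
        """Run until completion (True) or until a callable halts (False)."""
        while self.position < len(self.workflow):
            step = self.workflow[self.position]
            self.position += 1  # a resumed engine continues after this step
            try:
                step(session, self)
            except HaltProcessing:
                return False
        return True

    def dump(self):
        # Only the state is serialized; the callables are rebuilt on load.
        return pickle.dumps({"position": self.position, "store": self.store})

    def load(self, blob):
        state = pickle.loads(blob)
        self.position, self.store = state["position"], state["store"]


def build_interface(interface):
    def _build(session, engine):
        engine.setVar("interface_id", interface)
        engine.setVar("status", "Halted")
        raise HaltProcessing
    return _build


def save_data(session, engine):
    engine.delVar("interface_id")  # no longer needed once resumed
    engine.setVar("status", "Saved " + session["title"])


workflow = [build_interface("interface1"), save_data]
session = {}

engine = ToyEngine(workflow)
finished = engine.run(session)      # halts while waiting for user input
blob = engine.dump()

session["title"] = "My paper"       # the user submits the form
resumed = ToyEngine(workflow)
resumed.load(blob)
finished_after = resumed.run(session)
print(finished, finished_after, resumed.getVar("status"))
```

Serializing only the position and the store avoids having to pickle the closures themselves; how the real engine persists its state is not specified here.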

4.3 Class design and Implementation

The Class diagram for the new WebSubmit is shown in Figure 4.1. A brief description of each of the main classes is given in the following sub sections.

4.3.1 Submission

A Submission object is the result of a parsed configuration file. It uses the modified configobj module which parses the configuration file and returns a dictionary. This dictionary is searched for workflows and interfaces which are then stored in this object.
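A minimal sketch of such an object, assuming the parsed configuration is a nested dictionary and relying on the section naming conventions of section 4.1, could look like this. The attribute names and the sample configuration are hypothetical.

```python
class Submission:
    """Toy stand-in: sorts parsed sections into interfaces and workflows."""

    def __init__(self, config):
        self.interfaces = {}
        self.workflows = {}
        for name, section in config.items():
            if not isinstance(section, dict):
                continue  # plain name-value pair, not a section
            if name.startswith("interface"):
                self.interfaces[name] = section
            elif name.startswith("workflow"):
                self.workflows[name] = section


# Hypothetical parsed configuration for illustration only.
config = {
    "title": "Journals",
    "interface1": {
        "description": {"heading": "Submit a journal article"},
        "fields": {"title": "TextBox"},
        "checks": {"title": "not_empty"},
    },
    "workflow-submit": {"steps": "buildInterface, checkValues"},
}
submission = Submission(config)
print(sorted(submission.interfaces), sorted(submission.workflows))
```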

4.3.2 Interface

The Interface object, a subclass of the Python list type, contains a list of HTML elements which make up an HTML form. It contains ’aggregating’ methods like get_html, which collects the HTML code returned by each of its constituent elements and returns the HTML corresponding to the interface.

4.3.3 Interface Element

This class corresponds to a generic HTML element. It contains attributes and methods which are generic to any HTML element. Those that are specific to a particular HTML element are added to the element classes derived from it (like TextBox, CheckBox etc.). Such a class may be further extended into classes corresponding to the fields of a form (like Title, Author(s) etc.).
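The hierarchy described in sections 4.3.2 and 4.3.3 can be sketched as follows. The class names Interface, InterfaceElement, TextBox and Title and the method get_html come from the text; the constructor signatures, the check method and the generated markup are assumptions made for this example.

```python
from html import escape


class InterfaceElement:
    """Generic HTML element with attributes common to all elements."""
    tag = "input"

    def __init__(self, name, **attributes):
        self.attributes = dict(attributes, name=name)

    def get_html(self):
        attrs = " ".join(
            '%s="%s"' % (key, escape(str(value), quote=True))
            for key, value in sorted(self.attributes.items())
        )
        return "<%s %s />" % (self.tag, attrs)


class TextBox(InterfaceElement):
    """Element-specific subclass."""

    def __init__(self, name, **attributes):
        super().__init__(name, type="text", **attributes)


class Title(TextBox):
    """Field-specific subclass with its own data check."""

    def __init__(self, **attributes):
        super().__init__("title", **attributes)

    def check(self, value):
        return bool(value.strip())


class Interface(list):
    """A form: a list of elements with an aggregating get_html method."""

    def get_html(self):
        body = "\n".join(element.get_html() for element in self)
        return '<form method="post">\n%s\n</form>' % body


form = Interface([Title(size=40), TextBox("author")])
html = form.get_html()
print(html)
```

Because a field class such as Title bundles its markup with its check, it can be reused unchanged in any submission that needs a title.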

4.3.4 Session

A Session object is unique to a single run of a workflow corresponding to some action. It contains the session id and the path on disk where the session information is stored. It provides the methods load and dump. The load method loads the workflow engine and updates the session object, while the dump method just dumps the session.
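A simplified sketch of such a session object is given below. It is an assumption-laden illustration: it persists only the session data as JSON, whereas the load method described above also loads the workflow engine, and the file layout and the data attribute are invented for this example.

```python
import json
import os
import tempfile
import uuid


class Session:
    """Toy session: an id, a path on disk, and load/dump methods."""

    def __init__(self, directory, session_id=None):
        self.session_id = session_id or uuid.uuid4().hex
        self.path = os.path.join(directory, self.session_id + ".json")
        self.data = {}

    def dump(self):
        with open(self.path, "w") as handle:
            json.dump(self.data, handle)

    def load(self):
        with open(self.path) as handle:
            self.data = json.load(handle)


with tempfile.TemporaryDirectory() as directory:
    first = Session(directory)
    first.data["title"] = "My paper"
    first.dump()

    # Later request: the session id (e.g. taken from the url) restores the state.
    second = Session(directory, first.session_id)
    second.load()
    restored = second.data

print(restored)
```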

Figure 4.1: Class diagram of the new WebSubmit


4.3.5 WebInterface

This is the webinterface handler class. It is responsible for parsing the urls and rendering the appropriate pages in accordance with the access rights of the requesting user. As mentioned earlier, the interaction between the webinterface handler and the callables that make up a workflow takes place through the ’store’ attribute of the ’engine’ object. The session id, the submission type and the action are contained in the url. The submission object is created according to the url. The session and engine objects are read from disk if the engine corresponding to the session id in the url has been halted. Otherwise, they are newly created.

4.4 Example configuration

A sample configuration is explained in this section. Consider the configuration file named ’Journals.ini’ and its included file, shown in figure 4.2 and figure 4.3 respectively. The contents of the subsection ’fields-approval’ in the ’ini’ file in figure 4.2 are substituted by those of the section with the same name in figure 4.3. Also, since no section with the name ’workflow1’ is found at the same level (remember, these ini files are parsed as dictionaries), either in this file or in the included file, the override symbol in the section name ’^workflow1’ is simply ignored.

Figure 4.2: An example configuration file, Journals.ini

The list of available submissions (doctypes) is displayed at /submitng. When any doctype is chosen, the available actions (workflows) for that doctype are displayed at /submitng/. When an action is chosen, a session is created and the workflow is triggered. The workflow in figure 4.2 is triggered upon reaching the url corresponding to the doctype (Journals) of the configuration and the action (submit), and thus starts with the execution of the callable returned by the ’buildInterface’ function, which displays the submission interface. The rendered submission interface is shown in figure 4.4. Notice that the url structure is /submitng////

Figure 4.3: Included configuration file, included-file.ini

Upon rendering the form, the engine’s contents are dumped and it is halted. On submission of the form, the engine’s contents are loaded and it is restarted. The values are checked by the callable returned by ’checkValues’. If the checks fail, the engine jumps a step back and executes the first callable again, effectively re-rendering the interface. This stepping back is the result of the ’jumpCallBack’ command executed by the callable that is returned by the function ’checkValues’. Otherwise, we proceed, saving the form data on disk and sending emails regarding the submission to the email addresses provided. At this point, the engine is halted again. Only the specified users can access the link to the approval interface. When such a user accesses the approval interface, it is rendered as shown in figure 4.5. After a choice is made (approve/reject), the relevant callable is executed, as is evident from the if-else block. At this point, the workflow execution is complete.

Figure 4.4: Submission Interface

Figure 4.5: Approval Interface


Chapter 5

Conclusion

A new version of the WebSubmit module has been designed and its prototype has been implemented as part of this Master's project. It addresses the configuration issues and the limitations of the current WebSubmit and meets the objectives discussed in Chapter 2. The resources chosen and modified for the new WebSubmit, namely the ini configuration files and the workflow engine, are simple to use yet powerful. The code of the chosen workflow engine is used both in external projects and in other internal Invenio modules, and will become part of the core Invenio toolset. It is thus actively maintained. Also, the new WebSubmit’s class hierarchy has been designed for extensibility and is hence ready for future enhancements. The implemented prototype allows for interactive testing of additional features and is thus convenient for further development. The design is sustainable, and with further additions to the implementation, this new WebSubmit can replace its existing counterpart in Invenio.

Chapter 6

Future Work

The work carried out has resulted in a first prototype of the new WebSubmit, together with an example of a submission workflow. The current code base needs to be refactored into a plugin-based framework (for form elements and workflow callables). Advanced support for file management should be added, and the framework should be integrated with the existing Invenio authentication and authorization framework. Real-time, Ajax-based server-side form validation is a planned feature. Powerful plug-ins for smart form elements should be added (e.g. for entering authors, auto-suggested keywords etc.). Additional callables to perform the most important tasks should also be added (e.g. to automatically extract meta-data from a full text). Currently, the admin is expected to know the existing plug-ins and their parameters. A set of tools to aid the admin in discovering the available plug-ins and their documentation, and in checking the correctness of their usage, should also be introduced. The framework needs to be polished through real-case usage. In CDS, existing submission workflows based on the current WebSubmit will have to be re-engineered to exploit all the features of the new WebSubmit.

Appendix A

Template file

A typical template file:

Bibliography

[1] CERN, http://cern.ch, March 2011.
[2] Sir Tim Berners-Lee’s proposal for an information management system, http://info.cern.ch/Proposal.html, March 1989.
[3] Worldwide Computing Grid @ CERN, http://lcg.web.cern.ch
[4] EPFL, Lausanne, http://epfl.ch
[5] *Invenio Software, http://invenio-software.org
[6] CERN Document Server, http://cdsweb.cern.ch
[7] *Invenio Modules’ Overview, http://invenio-demo.cern.ch/help/hacking/modules-overview
[8] Admin guide for Invenio’s ’WebSubmit’ Module, http://invenio-demo.cern.ch/help/admin/WebSubmit-admin-guide
[9] Admin guide for Invenio’s ’bibconvert’ Module, http://invenio-demo.cern.ch/help/admin/bibconvert-admin-guide
[10] Admin guide for Invenio’s ’bibupload’ Module, http://invenio-demo.cern.ch/help/admin/bibupload-admin-guide
[11] Python Advocacy, https://twiki.cern.ch/twiki/bin/view/CDS/PythonAdvocacy
[12] *Invenio Coding Style, http://invenio-demo.cern.ch/help/hacking/coding-style
[13] *Style Guide for Python Code, http://www.python.org/dev/peps/pep-0008
[14] *MARC Meta-data Standard, http://invenio-demo.cern.ch/help/admin/howto-marc
[15] Git Twiki, https://twiki.cern.ch/twiki/bin/view/CDS/GitGettingStarted
[16] *ConfigObj module Homepage, http://www.voidspace.org.uk/python/configobj.html
[17] *Workflow Engine documentation, https://svnweb.cern.ch/trac/rcarepo/wiki/InspireWorkflowEngine#Details