distributed document searching in public administration domain

A Distributed Peer System for Public Administration Content Retrieval

Flavio Corradini, Andrea Lazzari, Fausto Marcantoni, Alberto Polzonetti, Marco Trivelli
University of Camerino, Dipartimento di Matematica ed Informatica, Via Madonna delle Carceri 9, 62032 Camerino (MC), ITALY
{flavio.corradini, andrea.lazzari, fausto.marcantoni, alberto.polzonetti, marco.trivelli}@unicam.it

Abstract. In this article we propose a peer-to-peer system that supports collaborative information exchange between Public Administrations. The system acts as a peer-to-peer network for document retrieval, and every peer in the network runs a search engine that indexes documents and shares index information with the other members.

Keywords. P2P, PSE, search engine, collaborative tool, share content system, eGovernment

1. Introduction

The concept of “Good Governance” is gaining ground in domains such as politics, public administration and development management. In the last decade the term has been associated with a global renovation of the public government sector, aimed at strengthening collaboration between government institutions and enabling closer interaction between them. Good Governance focuses on creating new inter- and intra-governmental interfaces and on a new approach to sharing trusted information. However, realizing it collaboratively raises some problems:
- capturing information only once and sharing it;
- equitable access to information;
- information sharing between and within governments.
The main goal is to solve all of them and to create a new approach to information. From our point of view information is a service, needed to improve the tasks involved in information sharing. Some recent articles strongly endorse the view that sharing information between law-based organizations will enhance their ability to reach their goals [1]. As pointed out in [1], much of the information that government organizations share is at least somewhat unstructured.

The primary contribution of this work is to propose an architecture, called PSE (Peer Search Engine), that permits simple information sharing between government agencies. It also builds a structure from the information gathered about documents, so as to enable information sharing between heterogeneous government organizations. Two aspects seem more important than the others: the ability to provide different degrees of information to be shared, in the form of XML documents, and the inclusion of the various groups that determine what kind of information should be shared. Once the structure of the information has been determined, our architecture uses it to reach all the agencies in the network and so realize “accessible information”.

2. PSE - Peer search engine

The main goal of our system is to enable Public Administrations (PAs) to create a distributed document sharing system that simplifies the bureaucratic procedures for document and information exchange that PA employees must perform every day. Before introducing the application's functional modules, it is important to briefly present the three formal classes of P2P architecture, focusing on node functionality.

2.1. Centralized networks

This kind of architecture is usually composed of a central unit that controls the flows between all the connected peers. The centralized server maintains directories of the distributed content, built from the information provided by every connected peer [2].

Nevertheless, the registration procedure often consists only of communicating information about the shared data [3]. After registration, the server shares this information among the peers and allows those peers that need to share data to connect directly. The server does not hold the shared data, only the small set of information necessary for peer communication. However, some applications store other information essential for data computation [4] or grid file system sharing [5]. Some subclasses of this architecture create a partially centralized peer network, in which a peer may take on more responsibility than the others and act like a local server, supporting behaviours such as super node interconnection or peer-to-peer connections.

2.2. Decentralized networks

Another way to classify peer networks is to consider a totally decentralized structure. All nodes have the same responsibility, regardless of their capabilities, positions and resources [6]. Each node in this network is called a servent. In this type of architecture no management is applied to shared files, so every servent keeps its autonomy and independently chooses which and how many resources to share [7]. The total absence of single points of failure increases network availability and information persistence, because there are no dependencies between servents and every servent can perform the same functions.

2.3. Hybrid approach

The hybrid approach merges the centralized and decentralized network models and combines the features of both. In a hybrid structure some nodes are not ordinary peers: they have a double function. Besides being peers, they can act as a centralized server for a small set of controlled peers, satisfying, for example, the search requests of subordinate peers and every routing request, both intra-layer and extra-layer [8]. In a centralized system, super nodes are chosen dynamically and are identified as peers. In fact, peer cooperation is a part-time activity. Hybrid network performance is improved because super nodes only handle the communications of their subordinate peers, raising the number of simultaneous connections / communications. As explained above, we can classify our system as a hybrid peer-to-peer network, because its functionalities are spread across the single units of the system.

3. PA Local Architecture

3.1. Distributed data sharing

A PA employee needs to search and examine many documents every day to carry out his work. Let us take administrative acts as an example. Generally, before an act is written, a lot of searching is done to find correlations between previously issued acts and the new one. These tasks are usually performed manually, or using the full-text search provided by the back-office software.

Figure 1. PALs PA Local System

A tool is needed that performs a contextualized search across a multitude of file formats, legacy or not, and across many locations: not only on the local hard drive but in every PA system. To be effective, this tool must be easy to use and close to the employees' normal way of working, for example by offering a web-based interface instead of legacy terminal applications.

3.2. Document factory

In our functional schema the Document Factory virtually represents every document produced by the PA. We use “Factory” ironically, to evoke the industrial production of documents issued by all Italian PAs. All documents are grouped in this logical unit because, from our point of view, PAs do not need an internal division between their organizational units, nor among their documents. Internal procedures probably share the same tasks, and these can be improved by unifying them, for instance using information retrieval techniques. Grouping the information must be completely straightforward and transparent for the final user, so as not to burden everyday work. The final user can carry on his activity without thinking about the new layer on top of the information, avoiding any problem related to new work procedures.

3.3. AhI - Ad Hoc interface

The Document Factory contains a very large number of documents, probably in a large number of formats. To index and search this variety of information types we need a specialized layer able to read information of any type. The Ad Hoc Interfaces do this by creating one instance of a reading daemon for each document file type in the PA's Document Factory. Moreover, in the PA domain there is a universe of different legacy systems supporting the back office. The AhI is in effect an interface to data structure formats: it takes the knowledge, metadata and other kinds of information built into a file type and puts them in a format useful for the indexing step. Software houses can write an AhI using a set of APIs to retrieve information from the file system, a database or other storage systems.
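The idea of one reader per file type can be illustrated with a minimal Python sketch. It is not the PSE implementation: the names `AdHocInterface`, `NormalizedDocument`, `read_plain_text` and `read_csv_export`, as well as the record layout, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, Optional


@dataclass
class NormalizedDocument:
    """Uniform record handed to the indexing step (hypothetical layout)."""
    source: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A reader turns the raw bytes of one file type into a NormalizedDocument.
Reader = Callable[[str, bytes], NormalizedDocument]


def read_plain_text(source: str, raw: bytes) -> NormalizedDocument:
    text = raw.decode("utf-8", errors="replace")
    return NormalizedDocument(source, text, {"format": "text"})


def read_csv_export(source: str, raw: bytes) -> NormalizedDocument:
    # Stand-in for a legacy back-office export: flatten the rows into text.
    rows = raw.decode("utf-8", errors="replace").splitlines()
    text = " ".join(cell.strip() for row in rows for cell in row.split(";"))
    return NormalizedDocument(source, text, {"format": "csv", "rows": str(len(rows))})


class AdHocInterface:
    """Dispatches each document to the reader registered for its file type."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {}

    def register(self, extension: str, reader: Reader) -> None:
        self._readers[extension.lower()] = reader

    def read(self, source: str, raw: bytes) -> Optional[NormalizedDocument]:
        reader = self._readers.get(Path(source).suffix.lower())
        # Unknown file types are skipped rather than guessed at.
        return reader(source, raw) if reader else None


ahi = AdHocInterface()
ahi.register(".txt", read_plain_text)
ahi.register(".csv", read_csv_export)
doc = ahi.read("acts/act_12.csv", b"Act 12;approved\nAct 13;pending")
```

Supporting a new legacy format then only requires registering one more reader, which matches the role the text assigns to software houses.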

3.4. Crawling and indexing

The crawling layer probes the Document Factory looking for changes and for new files ready to be indexed. All this work is done using a structured data schema [9] filled in through the AhI. In particular, the indexer is able to read only a single kind of data representation. In general this depends on the indexer implementation, but the crawler, working with the AhI, is able to create an XML document [10] validated against the indexer data schema. After that, the indexing step can be performed. PAs are thus not forced to change their normal work procedures or to impose a particular software solution on their employees. After creating the indexer XML document, the crawler adds a uniquePA identifier, which identifies the owner of the searchable document. As for the downloading and re-sharing activities usually found in this kind of application, a downloaded document will be re-indexed by the crawler using the same criteria. It will therefore be searchable with the same keywords, and the uniquePA identifier specifies the current document owner or owners. Tracking the owner in the index adds some interesting features to our system: like a digital signature, the uniquePA identifier guarantees that the owner is recognized at each document download.

Another task of the indexer unit is to take the crawled XML documents and parse them, creating a document summary, usually in text format, and a list of keywords related to the content. Furthermore, the indexer removes from the document all the information that is unnecessary for searching, such as images, conjunctions and common words. The newly created document is a summary and is stored in the Digest Repository.

4. Central architecture

Public Administrations need to be identified in our system in order to create a unique trusted network. To do that we need an authentication system that relies on some pre-existing identification system. The authentication procedure is composed of two phases:
- checking the PA's registrant address with the authority;
- registering the PA in the system.

Figure 2. PACs PA Central system
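The crawling and indexing steps of Section 3.4 can be illustrated with a minimal Python sketch, under stated assumptions: the XML element names, the `uniquePA` attribute encoding, the tiny stop-word list and the functions `crawl_to_xml` and `build_digest` are all hypothetical, and keyword selection by simple frequency is a stand-in for whatever criteria a real indexer applies.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny illustrative stop-word list; a real indexer would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}


def crawl_to_xml(title: str, body: str, unique_pa: str) -> str:
    """Wrap the text of a document in an XML record tagged with its owner."""
    root = ET.Element("document", {"uniquePA": unique_pa})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "body").text = body
    return ET.tostring(root, encoding="unicode")


def build_digest(xml_record: str, max_keywords: int = 5) -> dict:
    """Parse a crawled record and produce a text summary plus keywords."""
    root = ET.fromstring(xml_record)
    body = root.findtext("body", default="")
    # Drop common words, keeping only terms that are useful for searching.
    words = [w for w in re.findall(r"\w+", body.lower()) if w not in STOP_WORDS]
    keywords = [w for w, _ in Counter(words).most_common(max_keywords)]
    return {
        "uniquePA": root.get("uniquePA"),
        "title": root.findtext("title", default=""),
        "summary": " ".join(words),
        "keywords": keywords,
    }


record = crawl_to_xml(
    "Act 12",
    "The council approves the budget of the council.",
    "it.macerata.comune-camerino",  # hypothetical uniquePA value
)
digest = build_digest(record)
```

Because the digest carries the `uniquePA` value, a re-indexed copy of the same text yields the same keywords while still naming the PA that currently holds it.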

4.1. PAuth

After the crawling and indexing steps, a PA wishing to share information needs to send a request to join the network. In this first step the PA sends the URL of its servent [6]. PSE checks the URL domain with the CNR [11] registration authority, testing the domain / IP association [12] and validating the internet address of the requestor. Alternatively, a single sign-on system may be used to grant authorization and authentication to the system.
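A minimal Python sketch of the join check follows, under stated assumptions. The lookup at the registration authority is replaced by a local set of registered domains, and the function name `validate_join_request` and its parameters are hypothetical. The resolver is injectable so that the check can be exercised without network access.

```python
import socket
from typing import Callable, Set
from urllib.parse import urlsplit

# Maps a host name to an IPv4 address; socket.gethostbyname by default.
Resolver = Callable[[str], str]


def validate_join_request(
    servent_url: str,
    requester_ip: str,
    registered_domains: Set[str],
    resolve: Resolver = socket.gethostbyname,
) -> bool:
    """Accept a join request only if the servent URL belongs to a registered
    domain and that host resolves to the address the request came from."""
    host = urlsplit(servent_url).hostname
    if host is None:
        return False
    # Step 1: the domain must be known to the (here simulated) authority.
    if not any(host == d or host.endswith("." + d) for d in registered_domains):
        return False
    # Step 2: the domain / IP association must match the requestor.
    try:
        return resolve(host) == requester_ip
    except OSError:  # covers DNS failures such as socket.gaierror
        return False


# Usage with a fake resolver and a documentation-range IP address.
fake_dns = {"www.regione.marche.it": "192.0.2.10"}
accepted = validate_join_request(
    "http://www.regione.marche.it/servent",
    "192.0.2.10",
    {"regione.marche.it"},
    resolve=lambda host: fake_dns[host],
)
```

A single DNS lookup is only one way to test the domain / IP association. A deployment behind proxies or with several addresses per host would need a more tolerant comparison.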

4.2. Peers URI table (PUT)

After the requestor has been recognized, PSE writes the PA domain name into a URI [13] table called PUT, which contains the URI and servent path of the requesting PA. The structure of this table depends on the URI address, preferably composed of:
- nation domain, e.g. it, eu, uk;
- region name, e.g. Marche, Toscana, Umbria, or province, e.g. Macerata;
- public organization name, e.g. comuneCamerino, regione and many more
to create URLs like:
- www.regione.marche.it
- www.comune-camerino.macerata.it

Nation   Region / Province name   PAs type   Servent Path
It       marche                   regione    Servent 1
It       marche                   regione    Servent 2
It       macerata                 comune     Servent

Table 1. PUT Uri structure of peers registration system

This solution creates a centralized structure in which every PA joins PSE by obtaining a registration. Only authenticated peers will be able to search and download files from the network, which ensures control over data distribution. Moreover, the PUT makes it possible to run partial queries on the system.

4.3. Query engine

When a user performs a query on the network through the centralized system, the Query Engine is activated. The Query Engine compiles the user query, where compiling means making the network search smaller and faster. Using the PUT and the keywords of the search, the Query Engine can probe only the specified PAs, creating a subset of the servers (Query Filtering) involved in the flooding search [14]. The results of the search procedure are the Document Digests, i.e. the text format of each document and the set of keywords associated with it. When the user does not know which PA shares the documents, however, the whole PUT is queried and flooded. In any case, this optimization can reduce bandwidth usage, the number of flooded systems and server-side resource allocation, giving faster response times [15].

4.4. Information flow

We now focus on the ability, provided by the system, to re-share downloaded documents as in other content distribution networks. Every document located in the Document Factory becomes part of the knowledge shared by a PA. When a user requests a particular document, the results give not only the location of its original owner but also the new locations to which that document was downloaded.

Figure 3. System architecture schema

When the document's owner is not connected to the network, it is possible to obtain a shadow copy of any document previously downloaded by another peer, thus creating a distributed backup of the PA Document Factory. This allows a PA employee to do his work even when the document is not directly available.
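The registration table, the Query Filtering step and the re-sharing behaviour described in Sections 4.2 to 4.4 can be sketched together in Python. This is a minimal illustration under stated assumptions, not the PSE implementation: the classes `PutEntry`, `PeersUriTable` and `LocationIndex`, the servent path values and the document identifier are hypothetical, and the sketch assumes that the central system records both the owner and every download.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class PutEntry:
    """One PUT row: nation, region or province, PA type and servent path."""
    nation: str
    region: str
    pa_type: str
    servent_path: str


class PeersUriTable:
    def __init__(self) -> None:
        self._entries: List[PutEntry] = []

    def register(self, entry: PutEntry) -> None:
        self._entries.append(entry)

    def select(
        self,
        nation: Optional[str] = None,
        region: Optional[str] = None,
        pa_type: Optional[str] = None,
    ) -> List[PutEntry]:
        """Query Filtering: keep only the servents matching the given fields.
        A field left as None matches everything, so select() with no
        arguments returns the whole table (the full flooding case)."""
        wanted = {"nation": nation, "region": region, "pa_type": pa_type}
        return [
            e for e in self._entries
            if all(v is None or getattr(e, k) == v for k, v in wanted.items())
        ]


class LocationIndex:
    """Tracks, per document, its owner and the peers that downloaded it."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}
        self._copies: Dict[str, Set[str]] = {}

    def publish(self, doc_id: str, owner: str) -> None:
        self._owner[doc_id] = owner

    def record_download(self, doc_id: str, peer: str) -> None:
        self._copies.setdefault(doc_id, set()).add(peer)

    def locations(self, doc_id: str, online: Set[str]) -> List[str]:
        """Owner first when reachable, then online peers with a shadow copy."""
        owner = self._owner.get(doc_id)
        result = [owner] if owner in online else []
        copies = self._copies.get(doc_id, set())
        result += sorted(p for p in copies if p in online and p != owner)
        return result


put = PeersUriTable()
put.register(PutEntry("it", "marche", "regione", "servent-1"))
put.register(PutEntry("it", "marche", "regione", "servent-2"))
put.register(PutEntry("it", "macerata", "comune", "servent-3"))
targets = [e.servent_path for e in put.select(region="marche")]

index = LocationIndex()
index.publish("act-12", "servent-1")
index.record_download("act-12", "servent-3")
# The owner is offline, so only the shadow copy is offered.
available = index.locations("act-12", online={"servent-2", "servent-3"})
```

Filtering on the PUT fields narrows the set of servents that receive the flooded query, while the location index shows how a downloaded copy can stand in for an owner that is not connected.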

5. Conclusion and Future works

In this work we have proposed a content distribution system that shares documents among PAs or other government agencies in a peer-to-peer fashion. Information in our domain tends to be unstructured, and access to the managed data is not always permitted. In fact, inter- or intra-governmental legacy back-office environments are not always connected to the procedures used by employees. Given these problems, we need a system that lets this structure keep its information in order. We have presented all the steps needed to create a system for information definition, storage, access and search. Last but not least, we have identified several aspects that need further investigation and that are now our first targets. This system may represent a new kind of collaboration among PAs, but we must point out the following:
- the re-sharing capability does not currently allow the original owner to be retrieved, because a PA's indexer marks the document as if it were the original;
- documents named differently but indexed with exactly the same keywords may cause problems for the Query Engine, which tends to group Document Digests by keywords, making it hard for the final user to tell documents apart;
- the registration sequence requires a more efficient and trustworthy procedure;
- finally, a prototype must be created that allows us to test and verify the search architecture.

7. References

[1] Dizard W. P., White House Promotes Data Sharing, Government Computer News, 2002.
[2] Bungale P., Goodell G., Roussopoulos M., Conservation vs. Consensus in peer-to-peer preservation systems, Proceedings of the 4th International Workshop on Peer-To-Peer Systems (IPTPS ’05), 2005.
[3] Yoshikawa C., Chun B., Vahdat A., Distributed Hash Queues: Architecture & Design, Proceedings of the 3rd International Workshop on Agents and Peer-to-Peer Computing, July 2004.
[4] Clark I., Sandberg O., Wiley B., Hong T.W., Freenet: A distributed anonymous information storage and retrieval system, International Workshop on Design Issues in Anonymity and Unobservability, LNCS 2009, Springer: New York, 2001.
[5] Clements A.T., Ports D.R.K., Karger D.R., Arpeggio: Metadata Searching and Content Sharing with Chord, MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar St., Cambridge MA 02139.
[6] Cooper B.F., Quickly routing searches without having to move Content, Proceedings of the 4th International Workshop on Peer-To-Peer Systems (IPTPS ’05), 2005.
[7] Coulouris G., Dollimore J., Kindberg T., Distributed Systems: Concepts and Design, Addison-Wesley, Harlow, England, 1989.
[8] Fletcher G.H.L., Sheth H.A., Börner K., Unstructured Peer-to-Peer Networks: topological properties and search performance, Agents and Peer-to-Peer Computing: Third International Workshop, AP2PC 2004, New York, NY, USA, July 19, 2004.
[9] Foster I., Internet Computing and the Emerging Grid, Nature Web Matters, 2000.
[10] Lee J., An end-user perspective on file sharing systems, Communications of the ACM, 2003.
[11] Kubiatowicz J., Extracting guarantees from chaos, Communications of the ACM, 2003.
[12] Lu J., Callan J., Content Based Retrieval in Hybrid P2P network, Proceedings of the Twelfth International Conference on Information and Knowledge Management, pp. 199-206, ACM Press, 2003.
[13] Lv Q., Cao P., Cohen E., Li K., Shenker S., Search and replication in unstructured peer-to-peer networks, Proceedings of the 16th ACM International Conference on Supercomputing (ICS’02), New York, NY.
[14] Ratnasamy S., Handley P.F.M., Karp R., Shenker S., A Scalable Content-Addressable Network, Proceedings of the ACM SIGCOMM 2001 Technical Conference.
[15] Shamir A., How to share a secret, Communications of the ACM, vol. 22, n. 11, pp. 612-613, Nov. 1979.