Relational Databases

IEEE International Conference on Smart Computing and Electronic Enterprise (ICSCEE2018) ©2018

Managing Textual Data Semantically in Relational Databases

SK Ahammad Fahad
Faculty of Computer Information Technology, Al-Madinah International University, Shah Alam, Malaysia

Wael M.S. Yafooz
Faculty of Computer Information Technology, Al-Madinah International University, Shah Alam, Malaysia
Faculty of Computer and Mathematical Sciences, UiTM, Shah Alam, Malaysia
[email protected]

Abstract— The massive volume of data in databases, web pages, and document files usually causes information to be disorganized and unclear for the user. Information in such an environment can be classified into three forms: structured, semi-structured, or unstructured. Structured information is the best form of information because it facilitates the acquisition and comprehension of knowledge. A Relational Database Management System (RDBMS) has a robust structure that manages, organizes and retrieves data. Nonetheless, RDBMSs contain a massive amount of unstructured data such as textual data. Many attempts have been made to deal with such data. These attempts can be categorized into three groups: within a database schema, by a developed data model within the database, or by query-based techniques in database applications. This paper proposes the Textual Virtual Schema Model (TVSM). TVSM performs semantic textual data linking and clustering and is embedded in the relational database structure (schema). In addition, it links and converts unstructured information to structured data, improves the quality of textual data clusters, and achieves high query processing efficiency in retrieving data clusters. TVSM was initially developed to assist researchers, developers, and database administrators who are concerned with unstructured information management, information extraction, multi-document clustering, information retrieval, query processing efficiency, personal information management, question answering, information integration, news tracking, and news summarization.

Keywords— database, textual documents, clustering, unstructured data

I. INTRODUCTION

The massive volume of data in databases, web pages, and document files usually causes information to be disorganized and unclear for the user. Information in such an environment can be classified into three forms: structured, semi-structured, or unstructured. Structured information is the best form of information because it facilitates the acquisition and comprehension of knowledge. Many forms of unstructured or semi-structured information, such as news portals, articles, and discussion forums, can be found in database systems or on the World Wide Web. A Relational Database Management System (RDBMS) is known as a data repository [1] and has a robust structure that manages, organizes and retrieves data. Nonetheless, RDBMSs contain a massive amount of unstructured data such as textual data. Therefore, many attempts have been made to deal with such data. These attempts can be categorized into three groups: within the database schema, by a developed data model within the database, or by query-based techniques in database applications. Since unstructured data comes in an imprecise structure from natural language or computer, a solution must be provided to overcome the difficulty of bringing such data into a normal form of data structure [3].

There are three ways of dealing with unstructured textual data: management of the database schema, a data model within the database, and query-based techniques in database applications. First, the database schema contains a description of the entities within a database; these entities are referred to as database tables and contain the actual data. Second, the data-model approach deals with data by developing new data models, which requires a new data structure to represent the relations between entities in a database. However, these techniques focus more on unstructured multimedia files such as medical images. Third, query-based techniques in database applications, also known as front-end techniques, facilitate user access to useful information. Query-based techniques can take the form of extract-then-query [4, 5], query-then-extract [6, 7], or keyword searches [8]. However, these methods are regarded as temporary methods of extracting or managing data and executing a specific task. Consequently, the problem of unstructured data in relational databases remains unaddressed. Most systems for managing unstructured information use information extraction methods as their front-end applications.

This study presents the results of semantic textual data linking and organization in relational databases using a model called the Textual Virtual Schema Model (TVSM) [9-12]. TVSM is used for automatic semantic linking and clustering of textual data in a relational database. In this way, data clustering is carried out as an embedded process at the database level, as opposed to the current approach. In addition, TVSM provides automatic management of a semantic textual data repository, links and converts unstructured information to structured data, improves the quality of textual data clusters, and achieves high query processing efficiency in retrieving data clusters.

The rest of this paper is organized as follows: Section II presents TVSM, Section III presents the results and discussion, and the final section gives the concluding remarks.




II. TEXTUAL VIRTUAL SCHEMA MODEL

This study presents the results of semantic textual data linking and organization in relational databases using a model called the Textual Virtual Schema Model (TVSM). TVSM performs semantic textual data linking and clustering and is embedded in the relational database structure (schema). Given the massive amount of unstructured information in relational databases, this information will be meaningless and useless if it is not managed or reorganized into an understandable form. Thus, TVSM was developed to assist researchers, developers, and administrators in handling unstructured information. This method is also helpful for users who are tasked with managing unstructured information, textual document clustering, news tracking, and news summarization. The proposed model can be applied to any textual data, but it has more advantages in areas that handle news articles because such areas have a large amount of textual data. In addition, textual data are a source of knowledge for many domains. Therefore, the model can sufficiently manage, integrate, and obtain useful information from the textual data.

The first outcome of TVSM is the discovery of semantic links between textual data. TVSM can automatically discover the semantic relation between textual documents in relational databases based on the extracted terms (frequent terms). This relation is based not only on the mapping of individual words onto other individual words but also on the semantics of words. TVSM uses the WordNet database to obtain the semantics of words that frequently appear in textual data. The linking process is based on the most frequent terms and named entities in the textual documents. Thus, unstructured information is converted into structured data by extracting the terms and storing them in another column via table mapping. These terms represent the textual documents; thus, a user who examines these words can understand the topic of the textual document. Such terms help users or applications manage such data for further processing or obtain comprehensive knowledge.
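The step of extracting representative terms and storing them through a table mapping can be sketched as follows. This is only a minimal illustration under stated assumptions, not the TVSM implementation: it uses an in-memory SQLite database rather than Oracle, it keeps only frequent terms (named-entity extraction is omitted), and the stop-word list, the table names `documents` and `doc_terms`, and the helper names are all hypothetical.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption of this sketch).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "for", "with"}


def frequent_terms(text, k=10):
    """Return the k most frequent non-stop-word terms of a document."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(k)]


def build_term_table(conn, documents, k=10):
    """Store each document and its representative terms in two mapped tables."""
    conn.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute(
        "CREATE TABLE doc_terms ("
        "doc_id INTEGER REFERENCES documents(doc_id), term TEXT)"
    )
    for doc_id, body in documents.items():
        conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, body))
        conn.executemany(
            "INSERT INTO doc_terms VALUES (?, ?)",
            [(doc_id, term) for term in frequent_terms(body, k)],
        )
    conn.commit()


# Two toy documents; the text is invented for illustration only.
docs = {
    1: "The election results: the party won the election in the capital.",
    2: "The team won the match and the team celebrated.",
}
conn = sqlite3.connect(":memory:")
build_term_table(conn, docs, k=3)

# The structured terms can now be queried like any other relational data.
rows = conn.execute(
    "SELECT term FROM doc_terms WHERE doc_id = 1 ORDER BY rowid"
).fetchall()
print(rows)
```

Keeping the terms in a separate mapped table is one possible reading of "another column via table mapping"; it lets ordinary SQL joins find documents that share terms.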

III. RESULTS AND DISCUSSION

Textual data structuring, linking and organizing processes link structured and unstructured textual data in relational databases using TVSM. A user can find relationships between unstructured textual data by using such linking processes. Linking is achieved after structuring and organizing the unstructured textual data. There are two types of linking: non-semantic and semantic. The way of structuring and organizing the unstructured textual data and the types of linking are discussed in this section.

A. Non-semantic Linking

The first experiments were conducted to evaluate textual document structuring and linking inside the relational database attribute. The manner of structuring such data with the TVSM model was evaluated by comparing it with the structuring approach of the information extraction (IE) technique. This evaluation was performed based on non-semantic linking, which is used to determine the relationship between textual documents. In this experiment, the information extraction methods, the most common way of managing unstructured textual data, are used as the baseline against which the results of structuring unstructured textual data with the TVSM model are compared.
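The difference between the two linking types can be illustrated with a small sketch. It assumes that two documents are linked when they share at least a minimum number of representative terms (the minimum support used in the experiments), and that semantic linking additionally counts a match when one document contains a synonym of a term in the other. The synonym dictionary below is a toy stand-in for WordNet lookups, and all names are hypothetical.

```python
from itertools import combinations

# Toy synonym groups standing in for WordNet lookups (hypothetical data).
SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
    "physician": {"doctor"},
    "doctor": {"physician"},
}


def expand(terms):
    """Return the terms together with their synonyms."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def link_documents(doc_terms, min_support=2, semantic=False):
    """Link every pair of documents that share at least min_support terms.

    With semantic=True, the terms of the first document are expanded with
    synonyms before matching, so "car" can match "automobile".
    """
    links = []
    for (a, terms_a), (b, terms_b) in combinations(sorted(doc_terms.items()), 2):
        left = expand(terms_a) if semantic else set(terms_a)
        if len(left & set(terms_b)) >= min_support:
            links.append((a, b))
    return links


# Representative terms per document id (invented for illustration).
doc_terms = {
    1: {"car", "engine", "race"},
    2: {"automobile", "engine", "driver"},
    3: {"doctor", "hospital", "patient"},
}

non_semantic = link_documents(doc_terms, min_support=2)
semantic = link_documents(doc_terms, min_support=2, semantic=True)
print(non_semantic, semantic)
```

In this toy data, documents 1 and 2 share only one literal term, so they are linked only when synonyms are taken into account.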


Previous research concerned with managing unstructured textual data in relational databases, such as [13, 14], employed only information extraction techniques, unlike the TVSM model. The authors of [14] aimed to extract named entities such as author names or addresses; their methods focused only on the metadata of articles. The authors of [13] used information extraction techniques to extract structured information from unstructured textual information and then used this structured information for further processing. However, all of these proposals focused only on information extraction techniques. Thus, such methods did not represent the textual document based on its actual content. Furthermore, they did not consider the Frequent-Term, which can provide accurate results for further processing such as information retrieval and integration.

The experiments performed in this research are based on three datasets, namely Classic, WAP and Reuters. The Oracle database and Stanford NER are employed to execute the experiments. In the IE techniques, structured information is extracted from the unstructured textual data (datasets); thus, a textual document is represented by these sets of structured information. In TVSM, the proposed similarity measure extracts the Named-Entity and mines the Frequent-Term, and these terms represent the textual documents. In both methods, the number of selected terms is set to 10, so 10 terms represent a single textual document. The words extracted from the datasets show that at most 10 extracted words are frequent; when more than 10 words are extracted, the additional words have a frequency count of 1. In addition, these terms serve as a general description of the textual document. Using such terms, a textual document can be retrieved, or the terms can be used for further processing. The information retrieval results are measured during the clustering process (cluster assignment). Cluster assignment is executed in both IE and TVSM based on the minimum support of words, which can be one, two, three, or four words. The data clusters created by both methods are evaluated using the F-measure, for which recall and precision are calculated. All F-measures are calculated for the IE methods and TVSM (all data clusters are presented in Appendix D). The maximum F-measure for each cluster and class is calculated as shown in Table 1.
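The evaluation described above can be sketched in a few lines. The paper does not spell out its formulas, so this sketch assumes the standard definition of the F-measure as the harmonic mean of precision and recall, takes the maximum over all clusters for a given class, and assumes that an improvement is reported relative to the baseline value. The function names and the toy clusters are hypothetical.

```python
def f_measure(cluster, klass):
    """F-measure of one cluster against one class (both are sets of doc ids).

    precision = |cluster ∩ class| / |cluster|
    recall    = |cluster ∩ class| / |class|
    F         = 2 * precision * recall / (precision + recall)
    """
    overlap = len(cluster & klass)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(klass)
    return 2 * precision * recall / (precision + recall)


def max_f_measure(clusters, klass):
    """Best F-measure achieved by any cluster for the given class."""
    return max(f_measure(cluster, klass) for cluster in clusters)


def relative_improvement(baseline, new):
    """Improvement of `new` over `baseline`, in percent of the baseline."""
    return (new - baseline) / baseline * 100


# Toy example: two clusters evaluated against one class of four documents.
clusters = [{1, 2, 3, 4}, {5, 6}]
klass = {1, 2, 3, 5}
best = max_f_measure(clusters, klass)

# Under the relative definition assumed here, moving from a maximum
# F-measure of 0.07 to 0.42 is an improvement of roughly 500%.
gain = relative_improvement(0.07, 0.42)
print(best, round(gain))
```

Under this assumed definition, a move from 0.07 to 0.42 corresponds to the 500% improvement reported later for the POL class with two words as minimum support.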

Figure 1 shows the maximum F-measures of the IE methods for the four minimum-support settings, while Figure 2 shows the maximum F-measures of TVSM for the same settings.

Figure 1: Comparison of IE (non-semantic linking) on the Classic dataset.

Figure 2: Comparison of TVSM (non-semantic linking) on the Classic dataset.

Figures 1 and 2 show the maximum F-measure of the data clusters created using IE and TVSM, respectively. For both methods, the best maximum F-measure is obtained when the minimum support is two or three words. The F-measure of a data cluster created using TVSM is better than that of a cluster created using IE. The improvement in the F-measures of TVSM over the IE methods is shown in Figure 3.

Figure 3: Improvement of F-measure of TVSM.

Figure 3 shows the improvement in the maximum F-measure of the data clusters created using the TVSM model compared with IE. The improvement arises because the latter is based on structured information (Named-Entity and numeric data), whereas the former is based on Named-Entity and Frequent-Terms. When the minimum support increases to more than four words, the F-measure is generally not as good as when the minimum support is low. The best improvement is achieved when the minimum support is two or three words. When the minimum support increases to more than four words, the number of data clusters increases. Thus, when a textual document does not belong to a closely related data cluster, the information retrieval is unsatisfactory.

The average improvement of TVSM over the IE methods is approximately 17% on the Classic dataset. With this improvement in F-measure, TVSM noticeably outperforms the IE methods. Thus, the textual documents in relational databases are more representative and closer to each other, and the integration and linking of textual data is better in TVSM than in the IE methods.

In the Reuters dataset, textual document structuring and organization are performed using the same settings as in the previous experiments on the Classic dataset. For the six classes of this dataset, the maximum F-measure is calculated for the data clusters created using both the IE and TVSM methods. All maximum F-measures are shown in Table 1.

Table 1: Comparison of F-measures between the IE method and TVSM (Reuters)

Method  Min. support  CCAT   MCAT   ECAT   POL    SPO    GRIM
IE      1             0.660  0.430  0.230  0.071  0.050  0.037
IE      2             0.660  0.430  0.230  0.070  0.050  0.030
IE      3             0.510  0.400  0.270  0.070  0.080  0.410
IE      4             0.420  0.290  0.260  0.110  0.060  0.060
TVSM    1             0.660  0.410  0.260  0.160  0.060  0.260
TVSM    2             0.620  0.460  0.240  0.420  0.420  0.310
TVSM    3             0.530  0.410  0.290  0.200  0.220  0.270
TVSM    4             0.440  0.340  0.340  0.150  0.290  0.320

The improvement is evident when the minimum support is three words, whereas no improvement was noted with one word. With two words, the IE methods achieved a slightly better maximum F-measure than the TVSM model, by 6%. Meanwhile, when the minimum support is three or four words, TVSM achieves a slightly better result than IE, by 3% and 4%, respectively, as shown in Figure 4. The average improvement is approximately 0.7%.

Figure 4: Improvement of using TVSM on Reuters dataset.

For the MCAT class, the improvement in the maximum F-measure is more pronounced than in the previous class. The improvement is 13%, 4%, 7% and 31% for one, two, three and four words as minimum support, respectively. The average improvement is approximately 13%.


In the POL class, the maximum F-measure ratios for the IE methods are 7%, 7%, 7% and 11%, whereas the values for the TVSM model are 16%, 42%, 20% and 15%. The improvements of 97%, 500%, 185% and 36% for one, two, three and four words as minimum support can be attributed to the fact that this class concerns political news, which contains frequent terms and named entities. In addition, this class does not have a large amount of numerical data. The average improvement is 204%.

In SPO, the maximum F-measure ratios for the IE methods are 5%, 5%, 8% and 6%, whereas the values for TVSM are 6%, 42%, 22% and 29%. The improvement in the maximum F-measure is more pronounced in this class than in the previous class. The improvement is 2%, -2%, 175% and 380% for one, two, three and four words as minimum support, respectively. The average improvement is 1.3 times.

In the GRIM class, the maximum F-measures for the IE methods are 3.7%, 3%, 41% and 6%, whereas the values for TVSM are 26%, 31%, 27% and 32%. The improvements are 602%, 933%, 588% and 433% for one, two, three and four words as minimum support, respectively. The average improvement is 630%.

The maximum F-measure ratios obtained by using the TVSM model generally outperform those obtained using the IE methods. The average improvement achieved when using TVSM is approximately 3.5 times, which indicates that textual documents can be better structured by using Named-Entity and Frequent-Terms than by using structured information only. In the WAP dataset, the F-measure of TVSM is more noticeable because WAP has many Frequent-Terms.

Figure 5: Improvement of using TVSM on Reuters dataset.

All the above-mentioned evaluations of TVSM and the IE methods are based on the non-semantic linking process. The other linking method is semantic linking, which uses the synonyms of the Frequent-Term.

B. Semantic Linking

Semantic linking is a means of semantically finding links between textual documents. To achieve semantic linking, the WordNet database is employed. Synonyms of Frequent-Terms are obtained from the WordNet database. Such synonyms are used in the matching process to find the relationship between unstructured documents. The matching process, which is performed on the clustered textual documents, relates to the synonyms of Frequent-Terms. In this way, the relationship between unstructured textual documents becomes noticeable. Accurate information is retrieved, which ensures that linking and integration can be achieved.

In the semantic link experiments, the Classic and Reuters datasets are selected because these datasets contain actual news data. The maximum F-measures are calculated to assess the accuracy of relatedness amongst textual data during the information retrieval process. The results of the structuring and linking experiments enable the best minimum support words (i.e., two and three words) to be selected. The maximum F-measure for TVSM with and without semantic linking is first evaluated on the Classic dataset.

Table 2: F-measures with and without semantic linking in TVSM (Classic)

Class  No-Semantic (2)  Semantic (2)  Improvement for 2  No-Semantic (3)  Semantic (3)  Improvement for 3
CACM   0.48             0.50          0.05               0.27             0.31          0.13
CISI   0.24             0.36          0.34               0.32             0.36          0.13
CRAN   0.18             0.39          1.10               0.37             0.41          0.11
MED    0.50             0.52          0.02               0.21             0.24          0.13

Table 2 shows a comparison between TVSM with and without semantic linking on the Classic dataset for the four classes. The maximum F-measure increases in all classes for both minimum support settings, namely two and three words. The noticeable improvements for two and three words as minimum support with and without semantic linking are shown in Figure 6 and Figure 7, respectively. For the CACM class, the maximum F-measure is 48% without semantic linking. The value increases when using semantic linking, reaching 50%, a 2% improvement. For the CISI class, 24% is achieved without semantic linking. This value increased with semantic linking to 33%, an improvement of 34%. For the CRAN class, the maximum F-measure is 18% without semantic linking, whereas the value increased to 39%, an improvement of 110%, with semantic linking. For the MED class, the maximum F-measure is 50%. This value increased with semantic linking to 52%, an improvement of 2%. The average improvement for the case of two words as minimum support is approximately 16%, as shown in Figure 6.

Figure 6: F-measures with and without semantic linking in TVSM, two words.

With three words as minimum support for the same dataset, the maximum F-measure also increased in all classes, reaching 31%, 36%, 41% and 24% with semantic linking from 27%, 32%, 37% and 21% without semantic linking, with improvements of 13%, 13%, 11% and 13% for the CACM, CISI, CRAN and MED classes, respectively, as shown in Figure 7.

Figure 7: F-measures with and without semantic linking in TVSM, three words.

In the Reuters dataset, semantic linking experiments were performed to evaluate the TVSM model with semantic linking. Based on the same settings as those for the semantic experiments conducted on the Classic dataset, the Reuters dataset was used for all six classes with minimum supports of two and three words. Figures 8 and 9 demonstrate the maximum F-measure for the structuring of the original textual documents and the retrieval accuracy of the relatedness of textual documents.

Figure 8: F-measures with and without semantic linking in TVSM, two words.

Generally, the TVSM model with semantic linking outperforms TVSM without semantic linking. The average improvement on the Classic dataset is approximately 14%, whereas that on the Reuters dataset is approximately 19%. Such linking will improve the quality of data clusters. This reflects the evaluation of the quality of the data clusters created using TVSM.

CONCLUSION

The massive amount of unstructured data makes the manipulation of such data difficult and time consuming. Such difficulties include, for example, further processing, which is encountered in the performance of processes. Similarly, searching, finding useful information and integrating information have also been highlighted as problematic difficulties. In addition, unstructured data will be meaningless and not useful to the user if they are not organized properly. Therefore, a model for automatic textual data linking and clustering in the relational database structure has been developed. Three different experiments have been conducted on different datasets to show the links between structured and unstructured information.

REFERENCES

[1] Coronel, C., & Morris, S. (2016). Database systems: design, implementation, & management. Cengage Learning.
[2] Jung, M. G., Youn, S. A., Bae, J., & Choi, Y. L. (2015, November). A study on data input and output performance comparison of MongoDB and PostgreSQL in the big data environment. In Database Theory and Application (DTA), 2015 8th International Conference on (pp. 14-17). IEEE.
[3] Agrawal, R., Ailamaki, A., Bernstein, P. A., Brewer, E. A., Carey, M. J., Chaudhuri, S., ... & Gehrke, J. (2008). The Claremont report on database research. ACM SIGMOD Record, 37(3), 9-19.
[4] Agichtein, E., & Gravano, L. (2003, March). Querying text databases for efficient information extraction. In Data Engineering, 2003. Proceedings. 19th International Conference on (pp. 113-124). IEEE.
[5] Cafarella, M. J. (2009, January). Extracting and Querying a Comprehensive Web Database. In CIDR.
[6] Jain, A., Doan, A., & Gravano, L. (2008). Optimizing SQL queries over text databases.
[7] Jain, A., Ipeirotis, P., & Gravano, L. (2009). Building query optimizers for information extraction: the sqout project. ACM SIGMOD Record, 37(4), 28-34.
[8] Luo, Y., Wang, W., & Lin, X. (2008, April). SPARK: A keyword search engine on relational databases. In Data Engineering, ICDE 2008. IEEE 24th International Conference on (pp. 1552-1555). IEEE.
[9] Wael M.S. Yafooz, Abidin, S. Z., Omar, N., & Halim, R. A. (2013, August). Dynamic semantic textual document clustering using frequent terms and named entity. In System Engineering and Technology (ICSET), 2013 IEEE 3rd International Conference on (pp. 336-340). IEEE.
[10] Wael M.S. Yafooz, Abidin, S. Z., & Omar, N. (2011, November). Towards automatic column-based data object clustering for multilingual databases. In Control System, Computing and Engineering (ICCSCE), 2011 IEEE International Conference on (pp. 415-420). IEEE.
[11] Wael M.S. Yafooz, Abidin, S. Z., Omar, N., & Idrus, Z. (2013, December). Managing unstructured data in relational databases. In Systems, Process & Control (ICSPC), 2013 IEEE Conference on (pp. 198-203). IEEE.
[12] Wael M.S. Yafooz, Abidin, S. Z., & Omar, N. (2011, November). Towards automatic column-based data object clustering for multilingual databases. In Control System, Computing and Engineering (ICCSCE), 2011 IEEE International Conference on (pp. 415-420). IEEE.
[13] Chu, E., Baid, A., Chen, T., Doan, A., & Naughton, J. (2007, September). A relational approach to incrementally extracting and querying structure in unstructured data. In Proceedings of the 33rd International Conference on Very Large Data Bases (pp. 1045-1056). VLDB Endowment.
[14] Mansuri, I. R., & Sarawagi, S. (2006, April). Integrating unstructured data into relational databases. In Data Engineering, ICDE'06. Proceedings of the 22nd International Conference on (pp. 29-29). IEEE.