
Master ECD Report

AUTOMATICALLY ASSIGNING LABELS TO BOOKS

Student: Doan Mau Hien
Supervisor: Assoc. Prof. Dr. Do Thanh Nghi

Research report content

1. Introduction
   • Rationale of the study
   • Related concepts
2. Related works
3. Methodology
   • Support vector machines
   • Latent local SVM
4. Experiments and results
5. Conclusion and future work

Introduction

Rationale of the study
• In Vietnam, there has been no research on automatic book classification using machine learning methods.
• Automatic book classification has high practical value for many disciplines and interdisciplinary fields, such as information technology, information management, library science, and records and archives management.

Introduction

Related concept: text classification (TC)
• A task in natural language processing (NLP)
• Text categorization is the assignment of natural language texts to one or more predefined categories based on their content.
• It is used to automatically assign labels to books and news, to classify and filter emails, etc.

Introduction

Text classification (cont.)
[Figure: Process of text classification]

Introduction

Text classification (cont.)
[Figure: A graphical view of text classification]

Related works

• Chen et al. (2009): Automatic book classification method combined with support vector machine and metadata. Method: text categorization applied to automatic book classification. To distinguish books from other general documents, data about books are divided into description data (book title, introductions to the book and author). Classification accuracy: 95%.
• Han, Hui, et al. "Automatic document metadata extraction using support vector machines." Digital Libraries, 2003. Proceedings. 2003 Joint Conference on. IEEE, 2003. Method: SVM; text classification; dataset: document metadata; classification accuracy: 92.9%.
• Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on Information and knowledge management, pages 148–155. ACM, 1998. Method: SVM; text classification; dataset: Reuters; classification accuracy: 87%.

Related works (cont.)

• Thanh Nghi Do and François Poulet. Classifying very high-dimensional and large-scale multi-class image datasets with Latent-lSVM. In IEEE Cloud and Big Data Computing, Toulouse, France, 2016. Method: proposes a new learning algorithm that uses support vector machines (SVM) to classify very-high-dimensional and large-scale multi-class datasets; however, the approach is applied to images.

Methodology

• Text preprocessing and bag-of-words model
• Machine learning: support vector machine
• Latent local support vector machine

Methodology (cont.)

Text preprocessing and bag-of-words (BoW) model
Data preparation, text preprocessing and feature engineering:
• Analyzing vocabularies
• Separating words
• Extracting word features
• Representing the documents in table form so that algorithms can learn to classify them

Ref. BoW (Salton et al., 1975); JVnTextPro (Nguyen et al., 2006)

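Methodology (cont.)

Text preprocessing and bag-of-words model
Example of a dataset of documents

No. 1
  Title: Cửa sổ âm nhạc (Music window)
  Abstract: Tìm hiểu về các nhạc sĩ và hoàn cảnh ra đời của những ca khúc đã đồng hành với nhiều thế hệ người yêu nhạc (Learn about the composers and the circumstances in which songs that have accompanied many generations of music lovers were written)
  Keywords: Thư viện, Lịch sử âm nhạc Việt Nam (Library, History of Vietnamese music)
  Subject: Âm nhạc (Music)

No. 2
  Title: Bách khoa toàn thư về Thế giới (Encyclopedia of the world)
  Abstract: Sách là quyển Bách khoa giới thiệu cho độc giả những điều chưa được biết trên Thế giới. (The book is an encyclopedia that introduces readers to things not yet known about the world.)
  Keywords: Bách khoa toàn thư (Encyclopedia)
  Subject: Bách khoa toàn thư (Encyclopedia)

…

No. m
  Title: The government of Japan
  Abstract: This book describes the government of Japan, with emphasis on the period of readjustment since the peace treaty of 1952.
  Keywords: Khoa học chính trị Nhật Bản (Political science, Japan)
  Subject: Khoa học chính trị (Political science)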

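Methodology (cont.)

Text preprocessing and bag-of-words model
Example of BoW model

No.   1 (nhạc)   2 (điều)   …   n (khoa)   Subjects
1     1          0          …   0          Âm nhạc
2     0          1          …   1          Bách khoa toàn thư
…     …          …          …   …          …
m     0          0          …   1          Khoa học chính trị

(Word columns: nhạc = music, điều = thing, khoa = field of study. Subjects: Âm nhạc = Music, Bách khoa toàn thư = Encyclopedia, Khoa học chính trị = Political science.)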

Ref. BoW (Salton et al., 1975)
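A document-term table of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokenizer, the function names and the English toy titles are hypothetical, and real Vietnamese text would first need proper word segmentation (the step the slides delegate to a dedicated tool).

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): split on whitespace, strip punctuation, lowercase."""
    stripped = (tok.strip(".,:;!?()").lower() for tok in text.split())
    return [tok for tok in stripped if tok]


def build_vocabulary(documents):
    """Assign each distinct word a 1-based column index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def bag_of_words(text, vocab):
    """Return {column index: word count} for one document; unknown words are ignored."""
    counts = Counter(vocab[tok] for tok in tokenize(text) if tok in vocab)
    return dict(sorted(counts.items()))


# Toy corpus (hypothetical English titles, not the library data)
docs = ["Music window", "Encyclopedia of the world", "The government of Japan"]
vocab = build_vocabulary(docs)
row = bag_of_words("The government of the world", vocab)
```

Each document becomes one sparse row of the table: only the columns whose word occurs in the document are stored, which matters when the vocabulary has tens of thousands of words.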

Methodology (cont.)

Text preprocessing and bag-of-words model
Bag-of-words model of books taken from the library, stored one book per line in sparse form:

<label> <index1>:<value1> <index2>:<value2> ...
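The sparse text format read by LibSVM and Liblinear puts one example per line: a class label followed by `index:value` pairs in ascending index order, with zero entries left out. A minimal sketch of writing and reading such a line follows; the function names are hypothetical, and integer class labels are assumed.

```python
def to_libsvm_line(label, features):
    """Serialize one example; `features` maps 1-based index -> value."""
    parts = [str(label)]
    for index in sorted(features):          # indices must be ascending
        value = features[index]
        if value != 0:                      # zeros are implicit in the sparse format
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def parse_libsvm_line(line):
    """Parse one line back into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features


line = to_libsvm_line(3, {7: 1, 4: 1, 5: 2, 9: 0})
label, features = parse_libsvm_line(line)
```

Here the label `3` stands for one of the subject classes after the subjects have been mapped to integers; that mapping is not shown.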

Methodology (cont.)

Machine learning: support vector machine

[Figure: Support vector machines for binary classification problems]
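For reference, the standard soft-margin formulation of a linear binary SVM can be written as below. The notation is the usual textbook one, not taken from the slides: m training pairs (x_i, y_i) with y_i in {-1, +1}, weight vector w, bias b, slack variables \xi_i, and a cost constant C (the quantity tuned as "cost C" in the experiments).

```latex
\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i \\
\text{subject to}\quad & y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,m
\end{aligned}
\]
\[
\hat{y}(x) = \operatorname{sign}(w\cdot x + b)
\]
```

A larger C penalizes margin violations more heavily; a smaller C allows a wider margin at the cost of more training errors.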

Methodology (cont.)

Machine learning: support vector machine
Support vector machines for multi-class problems
• Multi-class SVM (one-versus-all)
• Multi-class SVM (one-versus-one)
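The two multi-class strategies differ in how many binary classifiers they train and in how they combine them. The sketch below illustrates both under simple assumptions: the binary classifiers are passed in as plain functions, and the one-dimensional toy scorers are hypothetical stand-ins for trained SVMs.

```python
from collections import Counter
from itertools import combinations


def n_binary_classifiers(k):
    """Number of binary models needed for k classes under each strategy."""
    return {"one_vs_all": k, "one_vs_one": k * (k - 1) // 2}


def predict_one_vs_all(x, scorers):
    """scorers: {class: decision function}; pick the class with the largest score."""
    return max(scorers, key=lambda c: scorers[c](x))


def predict_one_vs_one(x, pairwise):
    """pairwise: {(a, b): function > 0 for a, else b}; pick the class with most votes."""
    votes = Counter(a if f(x) > 0 else b for (a, b), f in pairwise.items())
    return votes.most_common(1)[0][0]


# Toy stand-ins for trained classifiers: each class scores highest near its center
centers = {"A": 0.0, "B": 2.0, "C": 4.0}
scorers = {c: (lambda x, m=m: -(x - m) ** 2) for c, m in centers.items()}
pairwise = {
    (a, b): (lambda x, a=a, b=b: scorers[a](x) - scorers[b](x))
    for a, b in combinations(centers, 2)
}

counts_for_books = n_binary_classifiers(661)
```

With the 661 subject classes of the book dataset, one-versus-all needs 661 binary models, whereas one-versus-one needs 661 × 660 / 2 = 218,130, each trained on only two classes' worth of data.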

Latent local support vector machine

• We propose to use a new SVM learning algorithm, called Latent-lSVM.
• It classifies very-high-dimensional input spaces and large-scale multi-class book datasets.
• Latent-lSVM partitions the full dataset into joint clusters; it is then easier to learn a non-linear SVM in each cluster to classify the data locally.

Global SVM model

Local SVM models

Training algorithm of latent local SVM models

Prediction of x with latent local SVM models
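The partition-then-train-locally structure can be sketched as follows. This is a structural illustration only, not the Latent-lSVM implementation: the cluster assignment and the local learner are passed in as functions, the toy `dominant_feature` stands in for the topic-based clustering, `fit_majority` stands in for a non-linear SVM, and each point is assigned to a single cluster for simplicity.

```python
from collections import Counter, defaultdict


def train_local_models(X, y, cluster_of, fit):
    """Group the training data by cluster and fit one local model per cluster."""
    groups = defaultdict(lambda: ([], []))
    for xi, yi in zip(X, y):
        Xc, yc = groups[cluster_of(xi)]
        Xc.append(xi)
        yc.append(yi)
    return {c: fit(Xc, yc) for c, (Xc, yc) in groups.items()}


def predict_local(x, models, cluster_of, default=None):
    """Route x to its cluster and let that cluster's local model classify it."""
    model = models.get(cluster_of(x))
    return default if model is None else model(x)


def dominant_feature(x):
    """Hypothetical stand-in for clustering: index of the largest feature."""
    return max(range(len(x)), key=lambda j: x[j])


def fit_majority(Xc, yc):
    """Hypothetical stand-in for a local SVM: always predict the majority class."""
    label = Counter(yc).most_common(1)[0][0]
    return lambda x: label


X = [[3, 0, 1], [2, 1, 0], [0, 4, 1], [1, 5, 0]]
y = ["music", "music", "politics", "politics"]
models = train_local_models(X, y, dominant_feature, fit_majority)
prediction = predict_local([0, 2, 1], models, dominant_feature)
```

Because each local model sees only the points and classes of its own cluster, the local training problems are smaller than the global one and can be solved independently of each other.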

Experiments

Software
• JVnTextPro (Java)
• LibSVM (C/C++)
• Liblinear (C/C++)
• Liblinear one-vs-one (C/C++)
• Latent-lSVM (C/C++)

All experiments were run on a PC with Ubuntu Linux 16, an Intel(R) Core i5-4590 CPU at 3.3 GHz (4 CPUs) and 8 GB of main memory.

Experiments

Book dataset
Books: English and Vietnamese
• No. of books: 114,998
• Input attributes: 3
• Output classes: 661

Experiments

Result of word separation on the book dataset (preprocessing)
• 64,073 words
• 661 classes

Experiments

Tuning parameters
• Latent-lSVM requires tuning the Dirichlet hyperparameters of latent Dirichlet allocation (LDA); good model quality has been reported for β = 0.01 and α = 50/T, where T is the number of topics (clusters).
• LibSVM: cost C = 4, kernel type t = 0 (linear kernel), epsilon e = 0.1, 10-fold cross-validation (V = 10).
• Liblinear: cost C = 4, epsilon e = 0.1, 10-fold cross-validation (V = 10).
• Liblinear one-vs-one: cost C = 4, epsilon e = 0.1, V = 10.
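The settings above can be written down concretely as follows. This sketch assumes that C, t, e and V on the slide correspond to the `-c`, `-t`, `-e` and `-v` options of LibSVM's `svm-train` and Liblinear's `train` programs; the training file name and the example topic count are hypothetical, and the commands are only assembled here, not executed.

```python
def lda_priors(n_topics, beta=0.01):
    """Dirichlet hyperparameters from the slide: alpha = 50 / T, beta = 0.01."""
    return {"alpha": 50.0 / n_topics, "beta": beta}


TRAIN_FILE = "books.libsvm"  # hypothetical file in the sparse index:value format

# LibSVM: linear kernel (-t 0), cost 4, tolerance 0.1, 10-fold cross-validation
libsvm_cmd = ["svm-train", "-t", "0", "-c", "4", "-e", "0.1", "-v", "10", TRAIN_FILE]

# Liblinear: cost 4, tolerance 0.1, 10-fold cross-validation
liblinear_cmd = ["train", "-c", "4", "-e", "0.1", "-v", "10", TRAIN_FILE]

priors = lda_priors(100)  # T = 100 is an illustrative value, not from the slides
```

Either list could be handed to `subprocess.run` on a machine where the corresponding tool is installed; with `-v` both programs report cross-validation accuracy instead of writing a model file.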

Results

Classification results

Results

Training time of algorithms

Conclusion and future work

Conclusion
• Latent-lSVM achieves an accuracy of 70.14% in classifying the book dataset, which has 64,073 dimensions and 661 classes.
• Latent-lSVM outperforms the other algorithms tested.

Future work
• A distributed implementation needs to be developed for large-scale processing on an in-memory cluster-computing platform, namely Apache Spark.
• Apache Spark is a fast and general engine for large-scale data processing.
• Spark is used at a wide range of organizations to process large datasets.

Link: http://spark.apache.org/

Demo videos for research results:
• Liblinear: https://youtu.be/hkyl7j5faF0
• Liblinear one-vs-one: https://youtu.be/bGlNzWNKvvE
• LibSVM: https://youtu.be/lnSHgYT7XEw
• Latent local SVM: https://youtu.be/diWVY7rdq6g

Thank you!

References

[1] Tom Betts, Maria Milosavljevic, and Jon Oberlander. The utility of information extraction in the classification of books. In European Conference on Information Retrieval, pages 295–306. Springer, 2007.
[2] S.Y. Chen, J.Y. Yeh, M.J. Hwang, X.J. Lin, H.R. Ke, and W.P. Yang. Automatic book classification method combined with support vector machine and metadata. International Journal of Advanced Information Technologies (IJAIT), 3(1):2–21, 2009.
[3] Thanh Nghi Do and François Poulet. Classifying very high-dimensional and large-scale multi-class image datasets with Latent-lSVM. In IEEE Cloud and Big Data Computing, Toulouse, France, July 2016.
[4] Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on Information and knowledge management, pages 148–155. ACM, 1998.
[5] Sharon Givon and Maria Milosavljevic. Extracting useful information from the full text of fiction. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pages 633–638. LE CENTRE DE HAUTES ETUDES INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE, 2007.
[7] J. Huang. A study of book title feature extraction based on the automatic classification. Unpublished Master Thesis, Department of Library and Information Science of Fu-Jen Catholic University, Taipei.
[8] Hyunsoo Kim, Peg Howland, and Haesun Park. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 6(Jan):37–53, 2005.
[9] Walid Magdy and Kareem Darwish. Book search: indexing the valuable parts. In Proceedings of the 2008 ACM workshop on Research advances in large digital book repositories, pages 53–56. ACM, 2008.
[10] Cam-Tu Nguyen, Xuan-Hieu Phan, and Thu-Trang Nguyen. JVnTextPro: A Java-based Vietnamese text processing tool. http://jvntextpro.sourceforge.net/, 2010.
[11] François Poulet and Thanh-Nghi Do. Mining very large datasets with support vector machine algorithms. In Enterprise Information Systems V, pages 177–184. Springer, 2004.
[12] Asim Roy. A classification algorithm for high-dimensional data. Procedia Computer Science, 53:345–355, 2015.
[13] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.