Available online at www.sciencedirect.com

ScienceDirect
Procedia Computer Science 112 (2017) 1755–1763
www.elsevier.com/locate/procedia

21st International Conference on Knowledge Based and Intelligent Information and Engineering Systems

HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages

Marius Popescu a, Cristian Grozea b, Radu Tudor Ionescu a,∗

a Faculty of Mathematics and Computer Science, University of Bucharest, 14 Academiei, Bucharest, Romania
b VISCOM, Fraunhofer FOKUS, Kaiserin-Augusta-Allee 31, 10589 Berlin, Germany

Abstract

String kernels have successfully been used for various NLP tasks, ranging from text categorization by topic to native language identification. In this paper, we present a simple and efficient algorithm for computing various spectrum string kernels. When comparing two strings, we store the p-grams of the first string in a hash table, and then we apply a hash table lookup for the p-grams that occur in the second string. In terms of time, we show that our algorithm can outperform a state-of-the-art tool for computing string similarity. In terms of accuracy, we show that our approach can reach state-of-the-art performance for polarity classification in various languages. Our efficient implementation is provided online for free at http://string-kernels.herokuapp.com.

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of KES International.

Keywords: string kernels, blended spectrum kernel, intersection kernel, kernel methods, similarity-based learning, polarity classification, opinion mining, sentiment analysis, string kernels tool, open-source code.

1. Introduction

In many NLP tasks, the most common approach is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. However, methods that work at the character level have also demonstrated impressive results in various text analysis tasks such as text categorization by topic 1, semantic parsing 2, authorship identification 3,4,5,6, plagiarism detection 7, translationese detection 8, native language identification 9,10,11 and dialect identification 12,13.

By using character p-grams as features, the feature space quickly explodes to a high dimension when p is greater than 5. A common way of using information at the character level while avoiding this overhead is to use string kernels 14. String kernels implicitly embed the texts in a very large feature space by specifying the pairwise similarity between each pair of strings instead of building their representation explicitly. This is known as the kernel trick 14. While the kernel trick can significantly reduce the memory footprint, an efficient algorithm for computing the similarity between strings is still required to solve the problem of time.
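To make the kernel trick mentioned above concrete: for the p-spectrum embedding, which maps a string to the vector of its p-gram counts, the kernel computes an inner product without ever materializing the embedding. This is a standard formulation that the paper leaves implicit; the notation matches Section 3:

\[
k_p(x, y) = \langle \phi_p(x), \phi_p(y) \rangle, \qquad
\phi_p(x) = \left( |x|_v \right)_{v \in \Sigma^p} \in \mathbb{R}^{|\Sigma|^p},
\]

so only the pairwise similarity values are stored, never the $|\Sigma|^p$-dimensional vectors themselves.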

∗ Corresponding author.
E-mail address: [email protected]

1877-0509 © 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of KES International.
10.1016/j.procs.2017.08.207


Researchers have proposed efficient algorithms 15,16 and tools 17 for computing string kernels. Our first contribution is a simple and more efficient algorithm (HASKER) for string kernels based on hash maps: given two strings, we store the p-grams of the first string in a hash table, and then we apply a hash table lookup for the p-grams that occur in the second string. We report running times about four times better than those of the state-of-the-art tool of Rieck et al. 17. Our tool is freely available at http://string-kernels.herokuapp.com.

Our second contribution is to apply the proposed algorithm to polarity classification in various languages. We propose a system based on various string kernels and Kernel Ridge Regression 14 that is suitable for any target language. All string kernels used here are based on the character p-grams that two strings have in common. Using character-based string kernels makes the corresponding learning method completely language independent, because the texts are treated as sequences of symbols (strings).

Methods working at the word level or above very often restrict their feature space according to theoretical or empirical principles. For instance, they select only features that reflect various types of appraisal groups, or only certain types of words, such as opinion oriented words. These features prove to be very effective for specific tasks, but other good features may exist outside the considered set. String kernels simply embed the texts into a very large feature space, given by all the substrings of length p, and leave it to the learning algorithm to select the features that are important for the specific task, by assigning different weights to these features. Since it does not restrict the feature space according to any linguistic theory, the string kernel approach is linguistic theory neutral. Furthermore, the method does not explicitly consider any features of natural language such as words, phrases, or meaning, contrary to the usual NLP approach. However, an interesting remark is that such features can implicitly be discovered within the extracted p-grams.

On the other hand, explicitly extracting such features can become very difficult for less-studied or low-resource languages, and methods that rely on linguistic features become inherently language dependent. Even a method that considers words as features cannot be completely language independent, since the definition of a word is necessarily language specific. A method that uses only opinion oriented words as features is not language independent either, because it needs a list of opinion oriented words, which is specific to each language. When features such as part-of-speech tags are used, the method relies on a part-of-speech tagger, which might not be available for some languages. By contrast, the only requirement for our method is a set of positive and negative examples, which can easily be acquired automatically from product or movie reviews, likely available online for any studied language.

To demonstrate that the proposed framework is language independent, we test it on several very different languages. Indeed, the corpora used in the experiments contain documents written in English, Arabic and Chinese. The empirical results indicate that string kernels are comparable to and often better than state-of-the-art approaches, yet they do not require any linguistic knowledge.

The paper is organized as follows. Related work is presented in Section 2. The HASKER algorithm is described in Section 3.
The polarity classification experiments are detailed in Section 4. We also present results for sentence polarity classification in Section 5. Finally, we draw our conclusions in Section 6.

2. Related work

Two strands of work are related to the approach described in this paper: works in which string kernels are used for various text classification tasks, and works in which language independent methods are applied to polarity classification.

String kernels are a common way of using information at the character level. They are a particular case of the more general convolution kernels 18. Interestingly, the first application of string kernel ideas came in the field of text classification, with the paper of Lodhi et al. 1 that deals with a classification task of semantic nature, namely text categorization by topic. Lodhi et al. 1 used string kernels for document categorization with very good results. String kernels were later successfully used in other text classification tasks that are not especially semantic in nature, such as authorship identification 3,4,6, native language identification 9,10,11 and dialect identification 12,13. Some of these methods achieved state-of-the-art results 6,11,12,13.

Since the early 2000s, sentiment analysis has grown into one of the most active research areas in natural language processing 19. Polarity classification is a basic task in sentiment analysis that aims to automatically establish the polarity of a text, i.e. whether the opinion expressed in the respective text is positive or negative. Most of the research and evaluation in this domain has concentrated on English. Nevertheless, there are some research studies targeting other languages 20,21,22,23, as well as language independent approaches 24,25.




Algorithm 1: Algorithm for Spectrum Kernel

1  Input:
2  x, y – the input strings;
3  p – the length of the p-grams;
4  Notations:
5  h – the hash table;
6  Computation:
7  for i ∈ {1, ..., |x| − p + 1} do
8    if h(x[i : i + p]) ≠ ∅ then
9      h(x[i : i + p]) ← h(x[i : i + p]) + 1;
10   else
11     h(x[i : i + p]) ← 1;
12 k ← 0;
13 for i ∈ {1, ..., |y| − p + 1} do
14   if h(y[i : i + p]) ≠ ∅ then
15     k ← k + h(y[i : i + p]);
16 Output:
17 k – the p-spectrum kernel between x and y.
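For readers who prefer runnable code next to the pseudocode, below is a minimal Python sketch of Algorithm 1, together with the intersection kernel variant discussed in Section 3. The function names are our own illustration choices; this is a sketch, not the actual implementation distributed at http://string-kernels.herokuapp.com.

```python
from collections import Counter

def spectrum_kernel(x: str, y: str, p: int) -> int:
    """p-spectrum kernel, mirroring Algorithm 1: hash the p-grams of x
    (steps 7-11), then look up the p-grams of y and accumulate (steps 12-15)."""
    # Phase 1: store the occurrence count of every p-gram of x in a hash table.
    h = Counter(x[i:i + p] for i in range(len(x) - p + 1))
    # Phase 2: each p-gram of y contributes |x|_v once per occurrence in y,
    # so the total is the sum over shared v of |x|_v * |y|_v.
    return sum(h[y[i:i + p]] for i in range(len(y) - p + 1))

def intersection_kernel(x: str, y: str, p: int) -> int:
    """Intersection string kernel: sum over p-grams v of min(|x|_v, |y|_v)."""
    hx = Counter(x[i:i + p] for i in range(len(x) - p + 1))
    hy = Counter(y[i:i + p] for i in range(len(y) - p + 1))
    return sum((hx & hy).values())  # '&' on Counters keeps the minimum count

if __name__ == "__main__":
    # k_2("abab", "abc") = 2: the shared 2-gram "ab" occurs twice in x, once in y.
    print(spectrum_kernel("abab", "abc", p=2))      # 2
    print(intersection_kernel("abab", "abc", p=2))  # 1 = min(2, 1)
```

Assuming O(1) hash table operations, both functions run in time linear in |x| + |y|, matching the complexity claim in Section 3.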

Language independent polarity classification methods were developed to respond to the need of building opinion mining systems in languages other than English. Many of these methods are reviewed by Korayem 25. Unfortunately, most of these methods are not completely language independent, since they work at the word level (see the discussion in Section 1). These methods try to use the information conveyed by words in a language independent manner. Abbasi et al. 24 use a language independent feature selection method in order to choose the words that are good predictors for polarity classification. Raychev et al. 26 propose to weight the words by taking into account positional information and estimated subjectivity. Even systems that explicitly use some linguistic knowledge, such as seed words (a few common words such as "very", "bad" and "good" in English), are considered language independent by their authors 27.

String kernels have been used by Zhang et al. 28 for Chinese polarity classification. Zhang et al. 28 applied the same variant of string kernels as Lodhi et al. 1, the subsequence kernel, to classify AmazonCN reviews and reported promising results. The method described by Zhai et al. 22,29 is the closest to the approach proposed in this article. It works at the character level (using suffix tree substring-group features) and was evaluated on different languages: Chinese, English and Spanish. In Section 4, the results of our approach (for English and Chinese) are compared with those reported by Zhai et al. 22,29.

3. Efficient algorithm for string kernels

There are many kernel functions for strings, with various applications in computational biology and computational linguistics 14. Perhaps one of the most natural ways to measure the similarity of two strings is to count how many substrings of length p the two strings have in common. This gives rise to the p-spectrum kernel 15. Formally, for two strings over an alphabet Σ, x, y ∈ Σ∗, the p-spectrum kernel is defined as:

\[
k_p(x, y) = \sum_{v \in \Sigma^p} |x|_v \cdot |y|_v,
\]

where $|x|_v$ is the number of occurrences of string v as a substring in x. For example, for x = "abab" and y = "abc" with p = 2, the only shared 2-gram is "ab", which occurs twice in x and once in y, so k_2(x, y) = 2. We present a simple and efficient algorithm for computing the p-spectrum kernel. HASKER (HAsh map algorithm for String KERnels) is formally described in Algorithm 1. We use the following notations. For a string x over an alphabet Σ, the length of x is denoted by |x|. Strings are considered to be indexed starting from position 1, that is x = x[1]x[2] · · · x[|x|]. Moreover, x[i : j] denotes the substring x[i]x[i + 1] · · · x[j − 1].


Table 1: Running times (in seconds) of the Harry implementation 17 of the spectrum string kernel versus our efficient algorithm. The times are compared for building kernel matrices for 1000 and 5000 strings, using 3-grams and 5-grams, respectively. The reported times are measured on a computer with an Intel Core i7 2.3 GHz processor and 8 GB of RAM, using a single core.

Method          #strings   p-gram   Time (s)
Harry           1000       3        132.6
HASKER (ours)   1000       3        35.5
Harry           1000       5        141.8
HASKER (ours)   1000       5        38.8
Harry           5000       3        3109.4
HASKER (ours)   5000       3        867.1
Harry           5000       5        3426.9
HASKER (ours)   5000       5        969.0
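As an illustration of how such a comparison can be set up, here is a minimal timing harness in the spirit of Table 1. It reuses the spectrum_kernel sketch given after Algorithm 1; the synthetic documents stand in for the IMDB reviews, which we do not reproduce here, and absolute timings will of course differ across machines.

```python
import time

def build_kernel_matrix(docs, kernel, p):
    """Build the full kernel matrix; the kernel is symmetric, so only the
    upper triangle is computed and then mirrored."""
    n = len(docs)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = kernel(docs[i], docs[j], p)
    return K

# Synthetic stand-in corpus; replace with real documents to reproduce Table 1.
docs = ["the quick brown fox jumps over the lazy dog " * 20] * 1000

start = time.perf_counter()
build_kernel_matrix(docs, spectrum_kernel, p=3)
print(f"kernel matrix for {len(docs)} strings: {time.perf_counter() - start:.1f}s")
```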

Our algorithm is based on two major phases. Given two strings x and y as input, we first build a hash map h that retains the occurrence counts of each p-gram in x (steps 7-11). Then, for each p-gram in y that appears in h, we add the occurrence counts stored in h to the similarity value k (steps 12-15). If we consider that the hash table lookup is O(1), our algorithm works in linear time with respect to the length of the strings. Our algorithm can easily be adapted for the presence bits string kernel or the intersection string kernel 10. Indeed, our tool provides implementations for these related kernels as well. The reader is referred to Ionescu et al. 30 for a detailed presentation of all these kernels.

4. Experiments

The purpose of the experiments is to demonstrate that HASKER yields state-of-the-art performance in terms of accuracy and time. In the polarity classification experiments, the kernels are normalized and Kernel Ridge Regression (KRR) 14 is employed for the learning task.

4.1. Time evaluation

For the time evaluation, we use the first 1000 and the first 5000 documents from the training set of the IMDB Review data set 31. Our efficient algorithm is compared with the suffix tree implementation of the spectrum kernel 15 provided in Harry 17, in various realistic settings. Harry is a recent state-of-the-art tool that enables the efficient computation of a wide range of similarity measures on string data. Table 1 shows the running times of the two approaches (Harry versus HASKER) when using 3-grams and 5-grams, respectively. In all settings, our algorithm is about four times faster than the algorithm implemented in the Harry tool. It is important to mention that both algorithms were evaluated on the same computer, each using a single core. The computer is equipped with an Intel Core i7 2.3 GHz processor and 8 GB of RAM. Parallel processing could improve the speed of both algorithms, but we consider an evaluation in a parallel environment to be beyond the scope of this work.

4.2. English language polarity classification

Corpus. The string kernels are first evaluated on the IMDB Review data set 31. It contains 50,000 reviews from IMDB with an even number of positive and negative reviews. The data set is evenly divided into 25,000 labeled reviews for training and 25,000 labeled samples for testing. The IMDB Review corpus is one of the biggest corpora for sentiment analysis available to the public.

Results. The accuracy rates of the presence bits kernel, the p-spectrum kernel and the intersection kernel obtained on the IMDB Review test set are illustrated in Figure 1. Results are reported for p-grams in the range 5-10.





Fig. 1: Accuracy rates obtained by three types of string kernels on the IMDB Review official test set. Results are reported for p-grams in the range 5-10.

On the IMDB Review corpus, the best performing kernel for every p-gram length is the intersection kernel, followed by the presence bits kernel. All kernels reach their peak accuracy for p-grams of length 9. Furthermore, Figure 1 shows that the accuracy profiles generated by the three kernels over the range of 5-10 p-grams have roughly the same shape. The best accuracy rate (90.6%) is attained by the intersection kernel based on 9-grams, which is almost 0.1% better than the accuracy rate of the presence bits kernel based on 9-grams. By combining kernels as a sum, we usually gain small performance improvements 11. Hence, we tried out various kernel combinations and obtained a slightly better performance (90.7%) with the blended intersection kernel based on p-grams in the range 8-10.

4.3. Arabic language polarity classification

Corpus. We evaluate string kernels on the Large-scale Arabic Book Review (LABR) data set introduced by Nabil et al. 23. We consider the balanced polarity classification task, for which there are 13,160 reviews in the training set and another 3,288 in the test set. We choose the balanced polarity classification task over the unbalanced task, since we do not aim to design or use classifiers suitable specifically for unbalanced data, although this could be the subject of future work beyond the scope of this paper.

Results. The presence bits kernel, the p-spectrum kernel and the intersection kernel are evaluated on the LABR test set and the results are shown in Figure 2. The string kernels are based on p-grams in the range 2-6. Longer p-grams have not been tried out, since the accuracy rates of all three kernels already drop from 5-grams to 6-grams. Moreover, the three kernels reach their peak accuracy rate when 3-grams are being used. This is a different outcome compared to what we obtained on the English corpus with respect to the p-gram length. Indeed, on the IMDB Review corpus, the three kernels reach their peak accuracy when p-grams are around 9 characters long. Given that the Arabic writing system is very different from the English one (for example, some vowels are omitted in Arabic writing), the different outcomes are not very surprising. On the contrary, it seems natural for a different p-gram length to suit each language; in fact, this is the only parameter of a string kernel that needs to be tuned on each corpus and language. Among the evaluated kernels, the intersection kernel attains the best performance. It reaches an accuracy of 86.4% for p-grams of length 3 on the LABR test set. Various kernel combinations using the best kernels and p-gram choices have also been evaluated. The best performance (86.5%) is obtained by the blended intersection kernel based on p-grams in the range 3-5.
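As a concrete sketch of the kernel combination and learning pipeline described above, the snippet below sums the intersection kernel over a range of p-gram lengths (the blended kernel), applies the usual normalization k̂(x, y) = k(x, y)/√(k(x, x)·k(y, y)), and trains scikit-learn's KernelRidge on the precomputed matrix. The helper names, the toy data and the use of scikit-learn are our own illustration choices; the actual experiments use our tool.

```python
import numpy as np
from collections import Counter
from sklearn.kernel_ridge import KernelRidge

def intersection_kernel(x, y, p):
    # Same definition as in the sketch after Algorithm 1.
    hx = Counter(x[i:i + p] for i in range(len(x) - p + 1))
    hy = Counter(y[i:i + p] for i in range(len(y) - p + 1))
    return sum((hx & hy).values())

def blended_kernel(x, y, p_range):
    # Blended kernel: the sum of the kernels for several p-gram lengths.
    return sum(intersection_kernel(x, y, p) for p in p_range)

def gram(rows, cols, p_range):
    return np.array([[blended_kernel(x, y, p_range) for y in cols] for x in rows])

# Toy corpus; the real experiments use e.g. the IMDB reviews with p in 8-10.
train_docs = ["great movie, loved every minute", "awful plot and terrible acting"]
train_labels = np.array([1.0, -1.0])
test_docs = ["loved the great acting"]
p_range = range(2, 4)

K = gram(train_docs, train_docs, p_range)
d = np.diag(K)
K_norm = K / np.sqrt(np.outer(d, d))          # normalized kernel matrix
krr = KernelRidge(kernel="precomputed").fit(K_norm, train_labels)

# Prediction: rows are test documents, columns are training documents; the
# self-similarities k(x, x) of the test documents are computed separately.
K_test = gram(test_docs, train_docs, p_range)
d_test = np.array([blended_kernel(x, x, p_range) for x in test_docs])
K_test_norm = K_test / np.sqrt(np.outer(d_test, d))
print(np.sign(krr.predict(K_test_norm)))      # predicted polarity in {-1, +1}
```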



Fig. 2: Accuracy rates obtained by three types of string kernels on the LABR test set. The kernels are evaluated on the balanced polarity classification task. Results are reported for p-grams in the range 2-6.


Fig. 3: Accuracy rates obtained by three types of string kernels on the PKU data set using a 10-fold cross-validation procedure. Results are reported for p-grams in the range 1-4.

4.4. Chinese language polarity classification

Corpus. We use the PKU data set 20 for the Chinese polarity classification experiments. It contains 886 product reviews collected from a popular Chinese IT product website. There are 451 positive reviews and 435 negative reviews in this data set.

Results. The same three kernels are evaluated on the PKU data set using a 10-fold cross-validation procedure. As shown in Figure 3, the best accuracy rates are obtained when unigrams are being used. Thus, results are reported only for p-grams in the range 1-4 (longer p-grams have not been considered). Except for the presence bits kernel, which gives a slightly lower accuracy rate for unigrams, the accuracy profiles of all three kernels are nearly identical. The best accuracy rate (93.8%) is obtained by the p-spectrum kernel based on unigrams, followed closely by the intersection kernel based on unigrams (93.7%).



Table 2: Results summary on all three corpora. The best string kernel approach on each corpus is compared both with the best performing approach used when the respective corpus was introduced (release best) and with the state-of-the-art approach on each corpus. References are given for completeness. The best accuracy rate on each corpus is marked with an asterisk.

Corpus Name    Language   Release best method   Accuracy   State-of-the-art method   Accuracy   String kernel   Accuracy
IMDB Review    English    Maas et al. 31        88.9%      Le et al. 32              92.6%*     k̂∩ 8+9+10       90.7%
LABR           Arabic     Nabil et al. 23       82.7%      Nabil et al. 23           82.7%      k̂∩ 3+4+5        86.5%*
PKU            Chinese    Wan 20                86.1%      Zhai et al. 22            94.1%      k̂∩ 1+2          94.2%*

As in the other experiments presented so far, several kernel combinations based on sums of kernels have been evaluated. The best performing kernel combination is the blended intersection kernel based on p-grams in the range 1-2. It achieves an accuracy rate of 94.2% on the PKU data set. It is interesting to mention that words in Chinese are usually represented by one or two characters. Thus, using character unigrams or bigrams is almost equivalent to using words as features, although the string kernel approach does not have to split the text into tokens. This fact could explain why the peak accuracy rates are obtained for shorter p-grams on the Chinese corpus compared to English or Arabic.

4.5. Results summary

The results of the best performing string kernel approach on each of the three corpora are presented in Table 2. The string kernels are compared with both the best performing method used in the release paper of each corpus and the state-of-the-art approach on the respective corpus. On the English corpus, the string kernels are less than 2% below the state-of-the-art approach of Le et al. 32. This is not a bad result, as the string kernel approach is a lot less sophisticated. On the Arabic corpus, the string kernels yield an improvement of almost 4% over the state-of-the-art approach of Nabil et al. 23. Finally, the string kernels attain only slightly better performance than the state-of-the-art method on the Chinese corpus. Given the simplicity of the string kernel approach, the overall results on corpora comprised of documents written in three different languages are noteworthy. Since it does not require any linguistic tools or features, the string kernel approach is suitable for newly-studied languages as a highly accurate and readily available intelligent baseline.

5. Beyond documents: sentence polarity classification

In the previous section, string kernels were evaluated on document polarity classification. Another polarity classification task is sentence polarity classification. This task is considered harder, since a typical sentence contains much less information than a typical document because of the length difference 19. This section describes some experiments on sentence polarity classification.

Corpus. The string kernels are evaluated on the Stanford Sentiment Treebank 33. There are two sub-tasks associated with this data set: binary classification of sentences, and fine-grained classification over five classes: very negative, negative, neutral, positive, and very positive. For the binary classification sub-task, the data set consists of 6920 sentences for training, 872 sentences for validation and 1821 sentences for testing. The data set comes with detailed labels for sentences and sub-phrases: constituency parse trees are provided for each sentence, and each node in these trees is annotated with a sentiment label.

Results. The presence bits kernel, the p-spectrum kernel and the intersection kernel are evaluated on the binary (polarity) classification sub-task.
The same procedure as in the previous section is followed here: the validation set is used to select the best kernel type and the best range of p-grams. The best performance on the validation set is obtained by the intersection kernel based on p-grams in the range 4-6. On the test set, the accuracy of this kernel is 81.8%. It is worth comparing this result with the result reported in the supplementary material of Socher et al. 33, where


an 80.0% accuracy is obtained when only sentence-level annotations (label data) were used for training. Much better results (85.4%) were reported by Socher et al. 33 when the whole set of labeled sub-phrases was used for training. To make use of the available labeled sub-phrases in the case of string kernels, each sub-phrase is treated as an independent sentence, thus increasing the number of training examples. In this setting, our best accuracy (84.0%) is again obtained by the intersection kernel based on p-grams in the range 4-6. Overall, it seems that the accuracy level reached by the string kernel approach is on par with the state-of-the-art results 33.

6. Conclusion and future work

We have presented an efficient algorithm for the spectrum kernel which is nearly four times faster than the suffix tree algorithm implemented in the Harry tool 17. We also report good polarity classification results for three different languages. Thanks to the modeling power of the p-gram based string kernels and to the robustness of the employed machine learning method with respect to the danger of overfitting, we have been able to almost match and frequently outperform the best known results in polarity classification on corpora with texts in English, Chinese and Arabic. The usage of string kernels allowed automatic and implicit extraction of the linguistic knowledge needed to solve this semantic task, reducing the requirements of the method to the bare minimum: a set of labeled text documents (for example, positive and negative reviews). In future work, we aim to extend our tool by including more similarity functions for strings, e.g. the subsequence kernel, and to evaluate it on even more NLP tasks.

References

1. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.J.C.H. Text classification using string kernels. Journal of Machine Learning Research 2002, 2, 419–444.
2. Kate, R.J., Mooney, R.J. Using String-kernels for Learning Semantic Parsers. Proceedings of ACL 2006, 913–920.
3. Sanderson, C., Guenter, S. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. Proceedings of EMNLP 2006, 482–491.
4. Popescu, M., Dinu, L.P. Kernel methods and string kernels for authorship identification: The Federalist Papers case. Proceedings of RANLP 2007.
5. Escalante, H.J., Solorio, T., Montes-y-Gómez, M. Local histograms of character n-grams for authorship attribution. Proceedings of ACL: HLT 2011, 1, 288–298.
6. Popescu, M., Grozea, C. Kernel methods and string kernels for authorship analysis. CLEF (Online Working Notes/Labs/Workshop) 2012.
7. Grozea, C., Gehl, C., Popescu, M. ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection. 3rd PAN Workshop: Uncovering Plagiarism, Authorship, and Social Software Misuse 2009, 10.
8. Popescu, M. Studying translationese at the character level. Proceedings of RANLP 2011, 634–639.
9. Popescu, M., Ionescu, R.T. The Story of the Characters, the DNA and the Native Language. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications 2013, 270–278.
10. Ionescu, R.T., Popescu, M., Cahill, A. Can characters reveal your native language? A language-independent approach to native language identification. Proceedings of EMNLP 2014, 1363–1373.
11. Ionescu, R.T., Popescu, M., Cahill, A. String kernels for native language identification: Insights from behind the curtains. Computational Linguistics 2016, 42(3), 491–525.
12. Ionescu, R.T., Popescu, M. UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels. Proceedings of the VarDial Workshop of COLING 2016, 135–144.
13. Ionescu, R.T., Butnaru, A. Learning to Identify Arabic and German Dialects using Multiple Kernels. Proceedings of the VarDial Workshop of EACL 2017, 200–209.
14. Shawe-Taylor, J., Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press; 2004. ISBN 978-0-521-81397-6.
15. Leslie, C.S., Eskin, E., Noble, W.S. The spectrum kernel: A string kernel for SVM protein classification. Proceedings of the Pacific Symposium on Biocomputing 2002, 566–575.
16. Vishwanathan, S.V.N., Smola, A.J. Fast Kernels for String and Tree Matching. Proceedings of NIPS 2002, 569–576.
17. Rieck, K., Wressnegger, C. Harry: A tool for measuring string similarity. Journal of Machine Learning Research 2016, 17(9), 1–5.
18. Haussler, D. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10; University of California at Santa Cruz; Santa Cruz, CA, USA; 1999.
19. Liu, B. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press; 2015.
20. Wan, X. Using bilingual knowledge and ensemble techniques for unsupervised Chinese sentiment analysis. Proceedings of EMNLP 2008, 553–561.
21. Zagibalov, T., Carroll, J. Unsupervised classification of sentiment and objectivity in Chinese text. Proceedings of IJCNLP 2008, 304–311.
22. Zhai, Z., Xu, H., Kang, B., Jia, P. Exploiting effective features for Chinese sentiment classification. Expert Systems with Applications 2011, 38(8), 9139–9146.
23. Nabil, M., Aly, M.A., Atiya, A.F. LABR: A Large Scale Arabic Book Reviews Dataset. Proceedings of ACL 2013, 494–498.




24. Abbasi, A., Chen, H., Salem, A. Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums. ACM Transactions on Information Systems 2008, 26(3), 12:1–12:34.
25. Korayem, M. Sentiment/Subjectivity Analysis Survey for Languages other than English. CoRR 2016, abs/1601.00087.
26. Raychev, V., Nakov, P. Language-Independent Sentiment Analysis Using Subjectivity and Positional Information. Proceedings of RANLP 2009, 360–364.
27. Lin, Z., Tan, S., Cheng, X. Language-independent Sentiment Classification Using Three Common Words. Proceedings of CIKM 2011, 1041–1046.
28. Zhang, C., Zuo, W., Peng, T., He, F. Sentiment Classification for Chinese Reviews Using Machine Learning Methods Based on String Kernel. Proceedings of the Third International Conference on Convergence and Hybrid Information Technology (ICCIT) 2008, 2, 909–914.
29. Zhai, Z., Xu, H., Li, J., Jia, P. Feature Subsumption for Sentiment Classification in Multiple Languages. Proceedings of PAKDD 2010, 261–271.
30. Ionescu, R.T., Popescu, M. Knowledge Transfer between Computer Vision and Text Mining. Advances in Computer Vision and Pattern Recognition. Springer International Publishing; 2016.
31. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C. Learning Word Vectors for Sentiment Analysis. Proceedings of ACL 2011, 142–150.
32. Le, Q.V., Mikolov, T. Distributed Representations of Sentences and Documents. Proceedings of ICML 2014, 32, 1188–1196.
33. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Proceedings of EMNLP 2013, 1631–1642.