Detection of the Offensive Language in Multilingual Communication

Azalden Alakrot, Nikola S. Nikolov
Department of Computer Science and Information Systems, University of Limerick
E-mail: [email protected], [email protected]

Introduction

Despite the high number of studies that deal with abuse on social media platforms, the vast majority focus on the English language. With the increasing number of Arabs actively using social media, there is a paucity of studies addressing the problems specific to the Arabic language.

Research Goals

The goal of this study is to develop techniques based on Natural Language Processing (NLP) for the detection of offensive language on a social media platform. The work aims to identify the characteristics of the language generated on social media by Arabs from multiple countries in the Arab region, and to find appropriate solutions based on machine learning techniques. A secondary goal is to build a dictionary of offensive words that can be used both in this and in future studies of offensive language in online communication.

Data Collection and Labelling

We collected and statistically analysed a large dataset of close to 16,000 YouTube comments in Arabic.

We observed a diversity of Arabic dialects in the comments; therefore, the annotators were chosen from three different countries.

In our dataset, non-Arabic comments account for about 2%, and about 60% of those are in English. Interestingly, a further 1% of the comments are Arabic written in a non-Arabic alphabet.
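The language-composition figures above can be estimated with a simple character-range check over the comments. The following is an illustrative sketch only; the function names and the sample comments are hypothetical and not part of the study's pipeline.

```python
# Illustrative sketch (not the authors' code): estimating the share of
# comments that contain no Arabic-script characters.
import re

# Unicode range of the basic Arabic block; a deliberately simplified check.
ARABIC_CHARS = re.compile(r'[\u0600-\u06FF]')

def is_arabic_script(comment: str) -> bool:
    """Return True if the comment contains at least one Arabic-script character."""
    return bool(ARABIC_CHARS.search(comment))

def non_arabic_share(comments) -> float:
    """Return the fraction of comments written without any Arabic script."""
    non_arabic = sum(1 for c in comments if not is_arabic_script(c))
    return non_arabic / len(comments) if comments else 0.0

if __name__ == "__main__":
    sample = ["هذا تعليق عربي", "this is an English comment", "7aga 3arabi bil 7oruf"]
    print(f"Share of non-Arabic-script comments: {non_arabic_share(sample):.0%}")
```

Comments written in Latin script, including Arabic transliterated into Latin characters, are flagged by such a check and can then be inspected separately.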

Subsequently, our dataset was labelled by three annotators, which is the standard methodology used in similar studies with English text corpora; see Warner and Hirschberg (2012), Huang et al. (2014) and Al-garadi et al. (2016). The three annotators were asked to label each YouTube comment in the dataset as either offensive or non-offensive. We consider two labelling scenarios:

Scenario 1: a comment is labelled offensive if all three annotators labelled it offensive.
Scenario 2: a comment is labelled offensive if at least two annotators labelled it offensive.

Table 1. Labelling Results

Comments on which all annotators agree: 10715 comments (percentage of matches: 71%)
Comments unlabelled by at least one annotator: 848 comments
Comments on which at least one annotator disagrees: 4335 comments
Scenario 1 (comments labelled offensive by all three annotators): 3797 comments (percentage of offensive comments: 25%)
Scenario 2 (comments labelled offensive by at least two annotators): 5817 comments (percentage of offensive comments: 39%)
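A minimal sketch of how the two scenario labels can be derived from the three annotations per comment is given below; the data layout and the function name are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (assumed data layout): deriving the Scenario 1 (unanimous)
# and Scenario 2 (majority) labels from the three annotations of one comment.
from typing import Optional

def scenario_label(annotations: list, min_votes: int) -> Optional[str]:
    """annotations: labels from the three annotators, e.g. ['offensive', 'non-offensive', None].
    Returns 'offensive' if at least min_votes annotators marked the comment offensive,
    None if any annotation is missing (comment left unlabelled), else 'non-offensive'."""
    if any(a is None for a in annotations):
        return None  # unlabelled by at least one annotator
    votes = sum(1 for a in annotations if a == "offensive")
    return "offensive" if votes >= min_votes else "non-offensive"

# Scenario 1 requires all three annotators; Scenario 2 requires at least two.
example = ["offensive", "offensive", "non-offensive"]
print(scenario_label(example, min_votes=3))  # -> non-offensive
print(scenario_label(example, min_votes=2))  # -> offensive
```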

Methodology

• Data collection and labelling.
• Data pre-processing: data cleaning, feature selection and extraction, transformation of the text data into forms ready for text mining, data normalisation, tokenization, filtering and stemming (a minimal sketch follows this list).
• Feature selection and vector space modelling.
• Employment of supervised machine learning techniques for building predictive models.
• Statistical analysis of the accuracy of the models.
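As referenced in the pre-processing bullet above, the following is a minimal sketch of the normalisation, tokenization, filtering and stemming steps, assuming NLTK's ISRI Arabic stemmer and a toy stop-word list; the actual rules and tools used in the study may differ.

```python
# A minimal sketch of the pre-processing steps listed above, assuming NLTK.
import re
from nltk.tokenize import wordpunct_tokenize
from nltk.stem.isri import ISRIStemmer

DIACRITICS = re.compile(r'[\u064B-\u0652\u0640]')  # tashkeel marks and tatweel

def normalise(text: str) -> str:
    """Light Arabic normalisation: strip diacritics/tatweel and unify letter variants."""
    text = DIACRITICS.sub('', text)
    text = re.sub('[إأآ]', 'ا', text)   # alef variants -> bare alef
    text = text.replace('ى', 'ي')       # alef maqsura -> yaa
    text = text.replace('ة', 'ه')       # taa marbuta -> haa
    return text

# A tiny illustrative stop-word list; a real study would use a much fuller one.
STOPWORDS = {'في', 'من', 'على', 'و', 'يا', 'هذا', 'ان'}

def preprocess(comment: str) -> list:
    """Normalise, tokenize, filter and stem a single comment."""
    stemmer = ISRIStemmer()
    tokens = wordpunct_tokenize(normalise(comment))
    tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]  # filtering
    return [stemmer.stem(t) for t in tokens]                            # stemming

print(preprocess("هذا التعليق فيه كلمات مسيئة!"))
```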

Predictive Model

Figure 1 illustrates the supervised learning process we follow and plan to complete.

Figure 1. Supervised learning process (diagram: training and test dataset collection → dataset labelling → vector space model → machine learning algorithms → predictive model; unknown text data → vector space model → predictive model → predicted label).
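A compact sketch of the pipeline in Figure 1 is given below, assuming a TF-IDF vector space model and a linear SVM from scikit-learn as one possible choice of classifier; the poster does not prescribe specific algorithms or libraries, and the tiny in-line dataset is purely illustrative.

```python
# Sketch of the supervised learning process: vector space model + classifier,
# cross-validated accuracy, then prediction on unknown text data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Hypothetical labelled data: pre-processed comments and their scenario labels.
comments = ["comment one ...", "comment two ...", "comment three ...", "comment four ..."]
labels = ["offensive", "non-offensive", "offensive", "non-offensive"]

# Vector space model (TF-IDF over word unigrams and bigrams) feeding a classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())

# Statistical analysis of accuracy via cross-validation on the labelled dataset.
scores = cross_val_score(model, comments, labels, cv=2, scoring="accuracy")
print("Mean accuracy:", scores.mean())

# Once trained on the full dataset, the model can label unknown text data.
model.fit(comments, labels)
print(model.predict(["some unseen comment ..."]))
```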

Conclusion and Future Work

As a next step, we plan to analyse this dataset with a range of machine learning algorithms and a variety of pre-processing methods.

References

• Al-garadi, M.A., Varathan, K.D. and Ravana, S.D., 2016. Cybercrime detection in online communications: The experimental case of cyberbullying detection in the Twitter network. Computers in Human Behavior, 63, pp. 433-443.
• Chen, Y., Zhou, Y., Zhu, S. and Xu, H., 2012, September. Detecting offensive language in social media to protect adolescent online safety. In 2012 International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2012 International Conference on Social Computing (SocialCom) (pp. 71-80). IEEE.
• Huang, Q., Singh, V.K. and Atrey, P.K., 2014, November. Cyber bullying detection using social and textual analysis. In Proceedings of the 3rd International Workshop on Socially-Aware Multimedia (pp. 3-6). ACM.
• Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y. and Chang, Y., 2016, April. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web (pp. 145-153). International World Wide Web Conferences Steering Committee.
• Warner, W. and Hirschberg, J., 2012, June. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media (pp. 19-26). Association for Computational Linguistics.