Effect of Different Feature Types on Age Based Classification of Short Texts

Avar Pentel
Institute of Informatics, Tallinn University, Tallinn, Estonia
[email protected]

Abstract— The aim of the current study is to compare the effect of three different feature types on age-based categorization of short texts averaging 85 words per author. Besides the widely used word and character n-grams, text readability features are proposed as an alternative. By readability features we mean relative ratios of text elements, such as characters per word, words per sentence, etc. Support Vector Machines, Logistic Regression, and Bayesian algorithms were used to build models. The most effective features were readability features and character n-grams. A model generated by a Support Vector Machine with a combined feature set yielded an f-score of 0.968. An age prediction application was built using a model with readability features.

Keywords— age detection; readability features; n-grams; logistic regression; support vector machines; naïve bayes

I. INTRODUCTION

Automatic user age detection is a task of growing importance for cyber-safety and criminal investigations. In order to prevent sexual abuse of minors, automatic sexual predator detection models [1] are trained on previously identified predators' texts. However, this approach does not help in situations where underaged users pretend to be older and visit portals that are not appropriate for their age. Automatic age detection has the potential to solve both of these problems. It helps to detect an adult who pretends to be a youngster in order to gain his/her victims' trust, and it also helps to detect a child who pretends to be an adult in order to gain access to an adult portal or community.

Many previous works on age detection [2]-[9] tried to separate several different age groups. While such profiling certainly has many applications, these models might not be the most effective if the goal is only binary classification between a minor group and an adult group. In the current study we investigate the features that are appropriate for binary classification of short texts written by children or teenagers versus those written by adults. Besides model accuracy, we also consider the applicability of feature types in real-life, online environments, i.e. how realistic it is to extract those features from short texts instantly in online systems. Using the most appropriate feature types, we build a prototype age detector application.

II. RELATED WORK

The most frequently used features in authorship profiling are word n-grams and character n-grams (sometimes called n-graphs). Additionally, some databases are used to categorize the words; for instance, part-of-speech tagging is used to extract the roles of the words in texts. More or less all these features are related to the content, their occurrence is related to context, and they are all heavily language dependent.

The best-known age-based classification results are reported by Jenny Tam and Craig H. Martell [2]. They used age groups 13-19, 20-29, 30-39, 40-49 and 50-59, all of different sizes. Word and character n-grams were used as features. Additionally, they used emoticons, the number of capital letters and the number of tokens per post as features. An SVM model trained on the youngest age group against all others yielded an f-score of 0.996. This result seems remarkable, as no age gap between the two classes was used. However, we have to address some limitations of their work that might explain the high f-scores. Namely, they used an unbalanced data set (465 versus 1263 in the training set and 116 versus 316 in the test set). Unfortunately, their report gave only one f-score value, but no confusion matrices, ROC or Kappa statistics. We argue that with unbalanced data sets a single f-score value is not sufficient to characterize a model's accuracy. On such a test set - 116 teenagers versus 316 adults - an f-score of 0.85 (or 0.42, depending on what is considered the positive class) can be achieved by a model that simply classifies all cases as adults. Also, it is not clear whether the reported f-score was averaged.

Argamon and Koppel [3] tested three age groups: 10s (13-17), 20s (23-27) and 30s (33-42). As features they used style-based and content-based features. Style-based features were part of speech (POS), function words, hyperlinks, and neologisms. Content-related features were simple content words, as well as special classes of words taken from the handcrafted Linguistic Inquiry and Word Count (LIWC) categories. They used the Bayesian Multinomial Regression algorithm to build a model, which gave 77.7% classification accuracy.

Dong Nguyen and Carolyn P. Rose [4] used linear regression to predict author age as a continuous value. They used a large dataset of 17947 authors with an average text length of 11101 words. As features they used word unigrams and POS unigrams and bigrams. The texts were tagged using the Stanford POS tagger. Additionally, they used the LIWC tool to extract features. The best regression model had an r2 of 0.551 and a mean absolute error of 6.7.

Peersman et al. [5] used a large sample, 10,000 texts per class, and extracted up to 50,000 features based on word and character n-grams. They tested three datasets with different age groups - 11-15 versus 16+, 11-15 versus 18+ and 11-15 versus 25+. Experiments were also carried out with different numbers of features and training set sizes. The best SVM model, with the largest age gap, the largest dataset and the largest number of features, yielded an f-score of 0.88.

Weren et al. [6] used word n-gram features and different similarity metrics, but they also used some readability features. Using the C4.5 algorithm, they trained a model that had 77% accuracy on Spanish texts. Unfortunately, they did not report the impact of the different feature types.

Santosh et al. [7], [8] used word n-grams as content-based features and POS n-grams as style-based features. They tested three age groups: 13-17, 23-27, and 33-47. Using SVM and kNN models, their best classifiers achieved 66% accuracy.

Marquardt [9] tested five age groups: 18-24, 25-34, 35-49, 50-64, and 65+. The dataset used was unbalanced and not stratified. He also used some of the same text readability features as we do in the current study. Besides readability features, he used word n-grams, HTML tags, and emoticons. Additionally, he used different tools for feature extraction, like a psycholinguistic database, a sentiment strength tool, the LIWC tool, and a spelling and grammatical error checker. Combining all these features, his model yielded only a modest accuracy of 48.3%.

Summarizing previous work in Table I, we do not list all used features; for example, features generated using POS tagging or some word database are all listed here as word n-grams. Most of the papers reported many different results, and we list in this summary table only the best result.

TABLE I.  SUMMARY OF PREVIOUS WORK

Authors           readability  char n-grams  word n-grams  emoticons  words per author  separation gap  f-score  accuracy
Tam (2009)                     x             x             x          343               0               0.996    N/a
Argamon (2009)                               x                        7250              5               N/a      77.7%
Nguyen (2011)                                x                        11101             0               N/a      N/a(a)
Peersman (2011)                x             x                        N/a               9               0.917    N/a
Weren (2013)      x                          x                        335               5               0.77     77%
Santosh (2014)                               x                        335               5               N/a      66%
Marquardt (2014)  x                          x             x          N/a               0               N/a      47.3%
This Study        x            x             x                        85                4               0.968    96.5%

a. Nguyen used a linear regression model, which yielded r2 = 0.55 and MAE 6.7.

As we can see, most previous studies use similar features: word and character n-grams. Additionally, special techniques were used, like POS tagging, spell checkers, a psycholinguistic database, a sentiment strength tool, and the LIWC tool, to categorize words. While the text features extracted with this equipment are important, they are costly to implement in real-life online systems. Similarly, large feature sets of up to 50,000 features, most of which are word n-grams, mean megabytes of data for pattern matching. Ideally, this kind of age detector could work using client browser resources (JavaScript), and all feature extraction routines and models would have to be as small as possible. Therefore, ideal features would be self-sufficient and computable on site. In some previously mentioned works [6], [9], one class of the used features possesses such characteristics, namely readability features.

Readability features are simple statistical measures of a text, such as text length, word length, number of words in a sentence, number of syllables in a word, etc. These attributes might look worthless at first glance; however, they had a long history in text categorization already before computerized text processing. For example, the Gunning Fog index [10] takes into account complex (or difficult) words, those containing 3 or more syllables, and the average number of words per sentence. If sentences are too long and there are many difficult words, the text is considered not easy to read, and more education is needed to understand this kind of text. The Gunning Fog index is calculated with formula (1) below:

  GFI = 0.4 × [(words / sentences) + 100 × (complexwords / words)]   (1)
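As a concrete illustration, formula (1) can be sketched in JavaScript (the language later used for our prototype). The tokenization and the vowel-group syllable counter below are simplified assumptions, and the 3-or-more-syllable complex-word rule is the original English one:

```javascript
// Sketch of the Gunning Fog Index, formula (1):
// GFI = 0.4 * (words/sentences + 100 * complexWords/words).
// Naive tokenization and syllable counting; for illustration only.
function countVowelGroups(word) {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return groups ? groups.length : 1;
}

function gunningFog(text) {
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const words = text.match(/\p{L}+/gu) || [];
  // "Complex" words have 3 or more syllables in the original index.
  const complex = words.filter(w => countVowelGroups(w) >= 3);
  return 0.4 * (words.length / sentences.length +
                100 * complex.length / words.length);
}
```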

As we can see, readability features are not popular in age detection. We suppose that an author's reading and writing skills are correlated, and that by analyzing the readability of an author's text we can draw conclusions about his/her education level, which at least up to a particular age is correlated with the actual age of the author. As readability indexes work reliably on texts of about 100 words, these are good candidates for our task. Moreover, these kinds of features are very favorable in real-life applications because they are self-sufficient: they can be calculated from the text itself, without any outside databases. We do not use the actual Gunning Fog Index or any other readability index, but we use the same variables as features in machine learning. As a baseline, n-gram based feature sets are tested and the results are compared. A prototype age detector application is built using the best fitting model.

III. METHODOLOGY

A. Corpus and training datasets

We collected short written texts from different social media sources, like Facebook, blog comments, and Internet forums. The average number of words per author was 85. All authors were identified, and their ages fell between 7 and 48 years. All texts in the collections were written in the Estonian language. We separated age groups by setting a 2-year or 4-year gap between groups, starting from 13 years and ending with 19 years (Table II). As a result we got 14 different training data sets. Each dataset was balanced, and there were 200 documents per class.

TABLE II.  TESTED AGE GROUPS

2-year separation gap      4-year separation gap
younger    older           younger    older
7-12       15-48           7-12       17-48
7-13       16-48           7-13       18-48
7-14       17-48           7-14       19-48
7-15       18-48           7-15       20-48
7-16       19-48           7-16       21-48
7-17       20-48           7-17       22-48
7-18       21-48           7-18       23-48
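The group construction in Table II can be sketched as follows; the author-object shape used here is a hypothetical illustration, not the actual corpus format:

```javascript
// Split authors into a younger and an older class, leaving a
// separation gap between them: e.g. youngMax = 15 with a 4-year gap
// gives the 7-15 versus 20-48 grouping of Table II. Authors whose
// age falls inside the gap are left out.
function splitByAgeGap(authors, youngMax, gapYears) {
  return {
    younger: authors.filter(a => a.age <= youngMax),
    older: authors.filter(a => a.age > youngMax + gapYears)
  };
}
```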

B. Features

In the current study we used two types of features in our training datasets: readability features and n-grams. We used two types of n-grams - character and word n-grams, which are sequences of n consecutive characters and n consecutive words in a text, respectively. As character n-grams we used bigrams and trigrams, and as word n-grams we used unigrams, bigrams and trigrams. We extracted all distinct n-grams from the corpus and, for feature selection, used the Weka [11] Java implementations of chi-squared attribute evaluation, SVM attribute evaluation and information gain attribute evaluation. All features were numeric; patterns were matched against each text and all occurrences were counted. The counts were normalized by text length. Both types of n-gram features are presented in Table III.

TABLE III.  N-GRAM FEATURES

                       Number of unique n-grams in corpus
n-gram feature type    Young group    Adult group           Number of selected features
Character bigrams      851            1173                  119
Character trigrams     4683           6235                  576
Word unigrams          4481           6833                  100
Word bigrams           10828          13434                 30
Word trigrams          11984          14718                 6
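Character n-gram counting with length normalization, as applied to the features in Table III, can be sketched as below; the subsequent feature selection was done with Weka's attribute evaluators and is not shown here:

```javascript
// Count all character n-grams of length n in a text and normalize
// each count by the text length, as done for the n-gram features.
function charNgramFeatures(text, n) {
  const counts = {};
  for (let i = 0; i + n <= text.length; i++) {
    const gram = text.slice(i, i + n);
    counts[gram] = (counts[gram] || 0) + 1;
  }
  for (const gram in counts) counts[gram] /= text.length;
  return counts;
}
```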

We do not present the selected feature rankings here; this is done in section IV-C on the basis of the standardized models. The other set of features we used is readability features. By readability features we mean quantitative data about a text, for instance the average number of characters in a word, syllables in a word, words in a sentence, or commas in a sentence, and the relative frequency of words with n syllables. Altogether, 14 different features were extracted from each text (Table IV). We used only numeric values and normalized them using other quantitative characteristics of the text.

TABLE IV.  READABILITY FEATURES

     feature  calculation                explanation
 1.  CPW      Characters / Words         Average number of characters per word
 2.  WPS      Words / Sentences          Average number of words per sentence
 3.  CPS      Commas / Sentences         Average number of commas per sentence
 4.  SPW      SyllablesInText / Words    Average number of syllables per word
 5.  S1TW     1sylWords / Words          Words with 1 syllable to all words ratio
 6.  S2TW     2sylWords / Words          Words with 2 syllables to all words ratio
 7.  S3TW     3sylWords / Words          Words with 3 syllables to all words ratio
 8.  S4TW     4sylWords / Words          Words with 4 syllables to all words ratio
 9.  S5TW     5sylWords / Words          Words with 5 syllables to all words ratio
10.  S6TW     6sylWords / Words          Words with 6 syllables to all words ratio
11.  S7TW     7sylWords / Words          Words with 7 syllables to all words ratio
12.  S8TW     8+sylWords / Words         Words with 8 or more syllables to all words ratio
13.  CWPS     ComplexWords / Sentences   Average number of complex words per sentence
14.  CWTW     ComplexWords / Words       Complex words to all words ratio
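A few of the Table IV features can be sketched in JavaScript as follows. The tokenization is simplified, and countSyllables is a naive vowel-group stand-in, not the actual syllable counter described later with Fig. 3:

```javascript
// Sketch of four of the Table IV features (CPW, WPS, CPS, SPW).
// countSyllables is a naive vowel-group stand-in for illustration.
function countSyllables(word) {
  const groups = word.toLowerCase().match(/[aeiouyõäöü]+/g);
  return groups ? groups.length : 1;
}

function readabilityFeatures(text) {
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const words = text.match(/\p{L}+/gu) || [];
  const commas = (text.match(/,/g) || []).length;
  const chars = words.join('').length;
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  return {
    CPW: chars / words.length,            // characters per word
    WPS: words.length / sentences.length, // words per sentence
    CPS: commas / sentences.length,       // commas per sentence
    SPW: syllables / words.length         // syllables per word
  };
}
```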

The notion of a complex word in our feature set is borrowed from the Gunning Fog Index [10], where it means a word with 3 or more syllables. As the Gunning Fog Index was developed for the English language, and in the Estonian language the average number of syllables per word is higher, we raised the threshold accordingly to 5 syllables. Additionally, we count a word as complex if it has 13 or more characters.

C. Machine Learning Algorithms

For classification we tested three popular machine-learning algorithms: Support Vector Machines, Logistic Regression and Naïve Bayes. The motivation for choosing these three algorithms is based on the related works [2]-[9], but also on more general literature about classification algorithms [12], [13]. We used the Java implementations of the listed algorithms that are available in the freeware data analysis package Weka [11].

D. Evaluation

For evaluation, we used 10-fold cross validation on all models. This means that we partitioned our data into 10 even-sized random parts, then used one part for validation and the other 9 as the training dataset. We did so 10 times and averaged the validation results.

We present f-scores as weighted average f-scores over both classes. The f-score is the harmonic mean of precision and recall. Here we give an example of how it is calculated. Suppose we

have a dataset representing classes A and B, and our model classifies the results as in the following Table V:

TABLE V.  CONFUSION MATRIX

Classed as →     A     B
A                TP    FN
B                FP    TN

If we interpret these results from the point of view of class A, then TP means true positive, i.e. an item belonging to class A that is correctly classed. Items in class B that are falsely classed as belonging to group A are called false positives (FP). False negatives (FN) are all items from class A that are classed as B. Positive predictive value, or precision, for class A is calculated by formula (2):

  precision = TP / (TP + FP)   (2)

Recall, or sensitivity, is the ratio of correctly classified instances (true positives) to all actual instances of the predicted class. The calculation of recall is given by formula (3):

  recall = TP / (TP + FN)   (3)

The f-score is the harmonic mean of precision and recall and is calculated by formula (4):

  f-score = 2 × (precision × recall) / (precision + recall) = 2TP / (2TP + FP + FN)   (4)

The f-scores for class A and class B can be different. In presenting our results, we use a single f-score value, which is the average of both classes' f-score values.

IV. RESULTS

A. Effect of position and size of separation gap

The best results were achieved with a 4-year separation gap between age groups 7-15 and 20-48. With a 2-year separation gap, the results were more evenly distributed, not depending as much on the position of the gap, as shown in Fig. 1.

[Figure: line chart of f-score (0.80-0.96) against the starting year of the separation gap (13-19), with one line for the 4-year gap and one for the 2-year gap.]

Fig. 1. Effect of separation gap's size and position.

In the following examples we focus on the most effective grouping, with a 4-year separation gap. This age gap is justifiable in many ways. First, it is reported that most victims of online sexual abuse are aged 13-15 [14]. Secondly, a 4-year gap separates minors who are 15 or younger from adults 20+, and is related to the age of consent. While by Estonian law the age of consent is set at 14 years, it is a subject of political discussion, and the most probable change is upward by 2 years. Thirdly, in cases like child sexual abuse or pedophilia, it is better to have more false positives than false negatives.

B. Classification of age groups 7-15 and 20-48

With a 4-year gap between age groups 7-15 and 20-48, readability features yielded slightly better results with the SVM and Logistic Regression classifiers (Table VI). The Naïve Bayes classifier was more effective with n-gram features.

TABLE VI.  7-15 VERSUS 20-48 YEAR OLD AUTHORS, F-SCORES

                                    n-grams
Classifier            Readability   char      word     All combined
SVM                   0.953         0.952     0.85     0.95
Logistic Regression   0.95          0.929     0.775    0.92
Naïve Bayes           0.811         0.946     0.901    0.882
SVM standardized      0.94

A combined feature set with readability features and all n-grams did not improve the results. However, when we left out word n-grams and combined only readability features with character n-grams, our SVM model with standardized data yielded the best f-score, 0.968.

C. Most distinctive features

We evaluated the impact of features on the basis of the SVM model with standardized data. We generated three models separately: for character n-grams, for word n-grams, and for readability features. The effects of the most distinctive features are presented in Tables VII-IX, where the two columns in the middle list features for the adult and under-aged groups, and the first and last columns give the SVM attribute weights.

The model trained with standardized character n-gram features (character bigrams and trigrams) shows that character trigrams had a higher impact on classification (Table VII).

TABLE VII.  MOST DISTINCTIVE CHARACTER N-GRAMS

Weight    adults    under aged    Weight
0.0868    raa       ta            -0.084
0.0834    gem       adn           -0.083
0.0811    hen       mbr           -0.079
0.0739    epa       ngu           -0.069
0.0731    ila       uk            -0.058
0.068     hoo       nte           -0.056
0.0661    eja       is            -0.055
0.0636    mo        idu           -0.055
0.062     ikk       poi           -0.055
0.0604    lej       ern           -0.052

While this model yielded high classification accuracy, the impact of individual features is lower here than in the other models. This can be explained by the number of features. At this stage of our study it is hard to give any interpretation to particular character n-grams; further study is probably needed on this topic.

Another SVM model was generated using standardized word n-gram features: unigrams, bigrams, and trigrams. The effect of the 20 most distinctive features is given in Table VIII.

TABLE VIII.  MOST DISTINCTIVE WORD N-GRAMS

Weight   adults                     under aged                       Weight
0.660    te (you)                   arvutis (in computer)            -0.793
0.592    oleks (could be)           pilte (pictures)                 -0.589
0.562    aasta (year)               oli (was)                        -0.523
0.561    natuke (a little)          ju (after all)                   -0.516
0.491    jälgida (to observe)       ta oli (he/she was)              -0.504
0.448    võinud (could)             ja ta (and he/she)               -0.486
0.415    linna (to the city)        seal (there)                     -0.463
0.414    kahjuks (unfortunately)    eks (isn't it/right)             -0.440
0.366    ka (also)                  aeg (time)                       -0.418
0.355    ümber (around)             ta ei (he/she (was) not)         -0.398
0.349    on nii (it is so)          ühel (one)                       -0.393
0.328    kui (if)                   konto (account)                  -0.333
0.308    läheb (goes)               tüdruk (girl)                    -0.326
0.302    samas (at the same time)   sel päeval (this day)            -0.283
0.296    kindel (sure)              endale (for myself/him/herself)  -0.269
0.294    et seda (that this)        ta (he/she)                      -0.268
0.269    talle (him/her)            poiss (boy)                      -0.244
0.267    siis (then)                ühe (one)                        -0.216
N/a      varem (sooner)             Internetis (on the Internet)     -0.251
N/a      öelda, et (to say that)    talle (him/her)                  -0.2

As we can see, the features with the highest impact here are word unigrams and some bigrams. It is noticeable that the top of the high-impact feature list contains many function words, which can be explained by the overall high frequency of function words. We suppose that the impact of content words and also of word bigrams and trigrams will rise when using longer texts per author and a bigger corpus.

Among the readability features, the strongest indicator of age was the average number of words in a sentence: older people tend to write longer sentences. They also use longer words; the average number of characters per word is in second place in the feature ranking. The best predictors of the younger age group are the use of short words with one or two syllables. In the following Table IX, coefficients of the standardized SVM model are presented.

TABLE IX.  MOST DISTINCTIVE READABILITY FEATURES

Weight   adults                                 under aged                              Weight
1.364    Average number of words in sentence    Ratio of words with 4 syllables         -0.762
0.839    Average number of characters in word   Average number of commas per sentence   -0.745
0.258    Complex words in sentence              Ratio of words with 2 syllables         -0.271
N/a      N/a                                    Ratio of words with 1 syllable          -0.389

While this model yielded high classification accuracy, the impact of individual features is here lower than in other models.

D. Prototype Application

We suppose that readability features are less context dependent when dealing with short texts. Readability features are also simple to calculate on site; therefore we decided to implement in our prototype application the Logistic Regression model that was trained only on readability features.¹ Our prototype application takes written natural-language text as input, extracts all features in exactly the same way as we extracted features for our training dataset, and predicts the author's age class (Fig. 2).

Test text in → Tokenizer → Feature extractor → Model → Age prediction out

Tokenizer: cleans all excess whitespace and divides a text into an array of words and into sentences.
Feature extractor: calculates 14 numeric values and returns them as an array.
Model: uses a model generated outside and the array of features as input.

Fig. 2. General design of the age detector application

¹ http://www.tlu.ee/~pentel/age/

We implemented the tokenization, feature extraction routines and classification function in client-side JavaScript. Our feature extraction procedure consists of three stages:
1. Some simple features are extracted as a byproduct of tokenization. The number of characters, number of words, number of commas, and number of sentences are calculated at this stage.
2. In the second stage, the syllables of each word are counted. At this stage it is estimated how many words in the text have n syllables (n = 1 - 8+).
3. All calculated characteristics are normalized using other characteristics of the same text, for example the number of characters in the text divided by the number of words, or the number of words divided by the number of sentences.

We created a new and simpler algorithm for syllable counting (Fig. 3). Other analogous algorithms for the Estonian language are intended for exact division of a word into syllables, but in our case we are only interested in the exact number of syllables. As it turns out, counting syllables is possible without knowing exactly where one syllable begins or ends.

In order to illustrate our new syllable counting algorithm, we give some examples of syllables and the related rules in the Estonian language. For instance, the word rebane (fox) has 3 syllables: re-ba-ne. In cases like this we can apply one general rule: when a single consonant is between vowels, a new syllable begins with that consonant. When two or more consecutive consonants occur in the middle of a word, the next syllable usually begins with the last of those consonants. For instance, the word kärbes (fly) is split as kär-bes, and kärbsed (flies) is split as kärb-sed. The problem is that these rules do not apply to compound words; for example, the word demokraatia (democracy) is split before two consecutive consonants, as de-mo-kraa-tia. Our syllable counting algorithm deals with this problem by ignoring all consecutive consonants. We set a syllable counter to zero and compare consecutive characters in the word: first and second character, then second and third, and so on. The general rule is that we count a new syllable when the tested pair of characters is a vowel followed by a consonant. The exception to this rule is the last character: when the last character is a vowel, one more syllable is counted. The described algorithm, as shown in Fig. 3, is not 100% accurate; there are exceptions to this rule, but most of them are easy to program.

1.  Function: count number of syllables in word
2.    Set counter to 0
3.    Input word w[] as an array of characters
4.    For i = 0 to w.length - 2
5.      If w[i] is vowel and w[i+1] is not vowel
6.        Add 1 to counter
7.      End If
8.    End For
9.    If w[w.length - 1] is vowel
10.     Add 1 to counter
11.   End If
12.   Return counter
13. End

Fig. 3. Pseudocode of syllable counter

The implemented syllable counting algorithm, as well as the other automatic feature extraction procedures, can be seen in the source code of the prototype application.² Finally, we created a simple web interface where everybody can test predictions with free input or by copy-paste. As our classifier was trained on the Estonian language, sample Estonian texts are provided on the website for both age groups, as shown in Fig. 4.
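The pseudocode of Fig. 3 translates directly to JavaScript; the Estonian vowel set (a e i o u õ ä ö ü) is assumed here:

```javascript
// JavaScript translation of the Fig. 3 syllable counter: a new
// syllable is counted at every vowel followed by a non-vowel, plus
// one more if the word ends with a vowel.
const VOWELS = 'aeiouõäöü';
const isVowel = ch => VOWELS.includes(ch.toLowerCase());

function countSyllables(word) {
  let counter = 0;
  for (let i = 0; i < word.length - 1; i++) {
    if (isVowel(word[i]) && !isVowel(word[i + 1])) counter++;
  }
  if (word.length > 0 && isVowel(word[word.length - 1])) counter++;
  return counter;
}
```

On the examples from the text this gives rebane = 3, kärbes = 2 and demokraatia = 4, without locating any syllable boundaries.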

Fig. 4. Interface of the prototype application at http://www.tlu.ee/~pentel/age/, with buttons that load sample random Estonian texts for both age groups and a free input form.
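For the classification step itself, applying a logistic-regression model in client-side JavaScript reduces to a weighted sum followed by the logistic function. This is a hedged sketch: the weights, bias and feature names below are illustrative placeholders, not the trained model's actual coefficients:

```javascript
// Score a readability-feature vector with a logistic-regression
// model exported from the training environment. Weights and bias
// here are illustrative placeholders.
function predictAgeClass(features, weights, bias) {
  let z = bias;
  for (const name in weights) {
    z += weights[name] * (features[name] || 0);
  }
  const pAdult = 1 / (1 + Math.exp(-z)); // logistic (sigmoid) function
  return { pAdult, label: pAdult >= 0.5 ? 'adult' : 'under-aged' };
}
```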

V. DISCUSSION AND CONCLUSION

In this paper, we compared the effect of three different feature types on age-based categorization of short texts. We tested character n-grams, word n-grams and readability features. While there are some previous works where readability features were used, their results were rather modest. In the current study, readability features proved to be a competitive alternative to the other feature types; in many cases they yielded the same or even better classification results. An SVM model trained with a combined feature set including readability and character n-gram features was the most successful, yielding a classification accuracy of 96% and an f-score of 0.968.

Text readability features are favorable in many ways. They are self-sufficient, i.e. they can be computed on site, which makes them easy to apply in web-based applications. Using readability features, there is no need to include large sets of text fragments for pattern matching. Another reason to favor readability features is the context dependency of n-gram based features, especially when dealing with short texts. Therefore, we used only readability features in our prototype application.

While this study has generated encouraging results, it has some limitations. As readability indexes measure how many years of education are needed to understand a text, we cannot assume that people's reading, or in our case writing, skills will continuously improve during their whole life. For most people, the writing skill level developed in high school will not improve further, and therefore it might be impossible to discriminate between adult age groups using readability features only. However, these features might still be useful in discriminating between younger age groups.

Another limiting factor of our study is the language. The Estonian language is very different from, for instance, the Indo-European languages. Therefore, we cannot make any generalizations concerning other languages on the basis of the current results. For different languages the effect of readability features may be different. But it is clear that n-gram based features are at least as language dependent.

In order to increase the reliability of the results, future studies should compare different languages and include a larger sample of texts. The value of our work is to present the suitability of a simple, yet effective feature set for age-based categorization of short texts. We anticipate a more systematic and in-depth study in the near future.

² http://www.tlu.ee/~pentel/age/source_code.txt

REFERENCES

[1] G. Inches, F. Crestani, "Overview of the international sexual predator identification competition at PAN-2012," in P. Forner, J. Karlgren, C. Womser-Hacker (Eds.), CLEF 2012 Evaluation Labs and Workshop, Online Working Notes, Rome, Italy, September 17-20, 2012.
[2] J. Tam, C.H. Martell, "Age detection in chat," International Conference on Semantic Computing, 2009.
[3] S. Argamon, M. Koppel, J.W. Pennebaker, J. Schler, "Automatically profiling the author of an anonymous text," Communications of the ACM, vol. 52, no. 2, Feb. 2009, pp. 119-123.
[4] D. Nguyen, C.P. Rose, "Age prediction from text using linear regression," LaTeCH '11: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 2011, pp. 115-123.
[5] C. Peersman, W. Daelemans, L. Vaerenbergh, "Predicting age and gender in online social networks," SMUC '11: Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, 2011, pp. 37-44.
[6] E.R.D. Weren, V.P. Moreira, J.P.M. Oliveira, "Using simple content features for the author profiling task," Notebook for PAN at CLEF, 2013.
[7] K. Santosh, R. Bansal, M. Shekhar, V. Varma, "Author profiling: predicting age and gender from blogs," CEUR Workshop Proceedings, 2013, vol. 1179.
[8] K. Santosh, A. Joshi, M. Gupta, V. Varma, "Exploiting Wikipedia categorization for predicting age and gender of blog authors," UMAP Workshops, 2014.
[9] J. Marquardt, et al., "Age and gender identification in social media," CEUR Workshop Proceedings, 2014, vol. 1180.
[10] R. Gunning, The Technique of Clear Writing. New York: McGraw-Hill, 1952.
[11] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations, 2009, vol. 11, no. 1.
[12] X. Wu, et al., "Top 10 algorithms in data mining," Knowledge and Information Systems, Springer, 2008, vol. 14, pp. 1-37.
[13] M.C. Mihaescu, Applied Intelligent Data Analysis: Algorithms for Information Retrieval and Educational Data Mining, Columbus, Ohio: Zip Publishing, 2013, pp. 64-111.
[14] J. Wolak, D. Finkelhor, K. Mitchell, "Internet-initiated sex crimes against minors: implications for prevention based on findings from a national study," Journal of Adolescent Health, Elsevier, 2004, vol. 35, no. 5.