Which Words Do You Remember? Temporal Properties of Language Use in Digital Archives*

Nina Tahmasebi, Gerhard Gossen, and Thomas Risse

L3S Research Center, Appelstr. 9a, Hannover, Germany
{tahmasebi,gossen,risse}@L3S.de

* This work is partly funded by the European Commission under ARCOMEM (ICT 270239).

Abstract. Knowing the behavior of terms in written texts can help us tailor models, algorithms, and resources to improve access to digital libraries and to answer information needs in long-spanning archives. In this paper we investigate the behavior of English written text in blogs in comparison to traditional texts from the New York Times, The Times Archive, and the British National Corpus. We show that user generated content, similar to spoken content, differs in its characteristics from 'professionally' written text and exhibits more dynamic behavior.

1 Introduction

The rise of the Web has allowed more people to publish texts by removing barriers that are technical but also social, such as the editorial controls that exist in traditional media. The resulting language tends to be more like spoken language because people adapt their language use to the medium [10]. We can see evidence of this adaptation in the vocabulary: Authors on the Web use more new or non-traditional terms. This results in a very dynamic language. However, the terms used to refer to a concept can change between the contexts of the document author and the user of an archive. This makes finding relevant documents in an archive harder. Knowing the change rate of modern language, we can target algorithms toward capturing these changes in long term archives.

In this paper we complement our earlier work on term evolution by analyzing the dynamics of language itself. We compare traditionally written texts (news corpora, the written part of the British National Corpus (BNC)) with spoken language and the language of the Web (the spoken part of the BNC, two TREC blog crawls). Using the BNC as a ground truth, we investigate whether language in user generated text behaves like spoken language and is thus more dynamic than language used in traditional text.

2 Related Work

Language evolution has been a research topic for a long time and has gained increasing interest [9]. A good overview of the field can be found in [4]. This research focuses on the origins and development of languages and involves a wide range of scientific areas. Some work has been done in the area of information retrieval for the access of content. Abecker et al. [1] showed how medical vocabulary evolved. Kanhabua et al. [6] proposed an algorithm for detecting time-based synonyms and showed that using these synonyms for query expansion greatly improves retrieval effectiveness in a digital archive. A special case of evolution, outdated spellings of the same term, is addressed in [5], where a rule-based method is used to derive spelling variants that are later used for information retrieval. Bamman et al. [2] propose a method for automatically identifying word sense variation in a collection of historical books. They automatically classify the word senses in the corpus and track the rise and fall of those senses over a span of two thousand years. The work in this paper contributes to the state of the art by providing further statistical insights into the dynamics of Web language.

3 Hypothesis

Our hypothesis is that the language used in user generated text from the Web (called Web language) behaves like spoken language and is thus much more dynamic than the language used in traditionally written texts (called edited text). To support our hypothesis we use the three measures described below.

The first property is the deviation from the standard vocabulary, which we measure using the dictionary recognition rate (DRR). We define the DRR as the proportion of terms recognized by current dictionaries (WordNet [1] [8] and Aspell [2]; see also [11]). Unrecognized terms are, for example, proper names, domain specific terms, spelling errors, and new terms. We use the DRR of the edited datasets as an estimate for the number of domain specific and new terms.

The second property is the change rate of the vocabulary from one year to the next. We measure this using the vocabulary overlap between the dataset dictionaries of two years. While the DRR captures general language properties, this measure describes the language dynamics within a specific dataset. As we expect different behavior for high frequency compared to infrequent terms, we split the dataset dictionaries by frequency into three parts using a 25%/50%/25% split [3]. For each pair of adjacent years we measure the overlap of each part as well as the total overlap between all terms. We expect the overlap to decrease from the most frequent to the least frequent terms, because the frequent terms form the stable core of the language whereas the most infrequent ones are typically misspellings.

The third and final property is the change in frequency of individual terms in adjacent years. We define Kdiff terms as the set of terms whose frequency changes by more than 50%, measured as (for term t and year i):

    Kdiff(t) = | fr_i(t) − fr_{i+1}(t) | / max(fr_i(t), fr_{i+1}(t))

[1] We use the stemmer from the MIT Java WordNet interface JWI (http://projects.csail.mit.edu/jwi/) as well as the WordNet "exception entries" for irregular words.
[2] http://aspell.net/
[3] The 25%/50%/25% split was empirically determined.
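To make the DRR measure concrete, the following is a minimal sketch of how it can be computed over a dataset dictionary. It is not the authors' implementation: NLTK's WordNet corpus and the `aspell` command line tool stand in for the JWI WordNet interface and Aspell library used in the paper, and the sketch assumes single-word terms.

```python
# Sketch: dictionary recognition rate (DRR) for a list of terms.
# NLTK's WordNet and the `aspell` CLI are stand-ins for the JWI
# interface and Aspell library used in the paper (an assumption).
import subprocess
from nltk.corpus import wordnet as wn

def wordnet_drr(terms):
    """Fraction of terms with at least one WordNet synset.
    WordNet stores multiword entries with underscores."""
    recognized = sum(1 for t in terms if wn.synsets(t.replace(" ", "_")))
    return recognized / len(terms)

def aspell_drr(terms):
    """Fraction of terms accepted by aspell. `aspell list` echoes
    only the misspelled words it reads from stdin, so everything
    not echoed counts as recognized. Multiword phrases would be
    checked word by word, so pass single-word terms here."""
    out = subprocess.run(["aspell", "list"], input="\n".join(terms),
                         capture_output=True, text=True)
    misspelled = set(out.stdout.split())
    return 1 - len(misspelled) / len(terms)
```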

Table 1. Used datasets.

              type           timespan    docs
    NYTimes   newspaper      1987–2007   1.8 mio
    Times     newspaper      1960–1985   2.0 mio
    BNCwr     misc. written  1986–1993   2628
    BNCsp     misc. spoken   1991–1994   613
    Blogs     blogs          2005–2008   14.9 mio

Table 2. Dictionary recognition rates for WordNet and Aspell.

              WN DRR   Aspell DRR
    NYTimes   94.6%    96.5%
    Times     92.0%    94.0%
    BNCwr     95.6%    97.1%
    BNCsp     92.6%    98.4%
    Blogs     87.7%    89.5%

where fr_i(t) is the normalized frequency of term t in the dataset dictionary corresponding to year i. A high number of Kdiff terms indicates a highly dynamic language. By considering the dataset dictionary overlap together with the Kdiff terms we can measure how many terms persist between two adjacent years and, among those, how many change substantially in frequency.
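A minimal sketch of this computation, following the definition above, could look as follows. The dictionaries are assumed to be plain term-to-frequency mappings; restricting Kdiff to terms present in both years mirrors the "among those" reading above.

```python
# Sketch: Kdiff terms between two adjacent yearly dataset dictionaries.
# freqs_i and freqs_j map term -> normalized frequency for years i, i+1.

def kdiff(freqs_i, freqs_j, term):
    """Relative frequency change of a term between two years."""
    fi, fj = freqs_i[term], freqs_j[term]
    return abs(fi - fj) / max(fi, fj)

def kdiff_terms(freqs_i, freqs_j, threshold=0.5):
    """Terms present in both years whose frequency changed by at
    least the threshold (0.5 = a change of 50% or more)."""
    shared = set(freqs_i) & set(freqs_j)
    return {t for t in shared if kdiff(freqs_i, freqs_j, t) >= threshold}
```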

4 Experiments

We use four datasets (see Table 1), which we divide into two classes: traditionally written text (New York Times, Times, and the written part of the BNC (BNCwr) [3]) and Web language and spoken language (TREC-BLOG and the spoken part of the BNC (BNCsp), from here on described as user generated content). All datasets are pre-processed to filter out noise. We process the texts using TreeTagger [4] to lemmatize terms and extract nouns, and Lingua::EN::Tagger [5] to extract noun phrases. We process each dataset in yearly collections and create a dataset dictionary from the identified nouns and noun phrases. The Blogs dataset is a combination of the two TREC-Blog datasets Blogs06 [7] and a subset of Blogs08 [12]. We filter out non-English documents using a simple heuristic.
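As a rough illustration of this preprocessing, the following sketch builds yearly dataset dictionaries of normalized noun frequencies. NLTK's tagger and lemmatizer stand in for the TreeTagger and Lingua::EN::Tagger pipeline described above, so this is an approximation of the setup, not the authors' code.

```python
# Sketch: yearly dataset dictionaries of normalized noun frequencies.
# NLTK stands in for TreeTagger + Lingua::EN::Tagger (an assumption).
from collections import Counter, defaultdict
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def extract_nouns(text):
    """Lemmatized nouns from a text (Penn tags starting with NN)."""
    tokens = word_tokenize(text)
    return [lemmatizer.lemmatize(w.lower())
            for w, tag in pos_tag(tokens) if tag.startswith("NN")]

def yearly_dictionaries(documents):
    """documents: iterable of (year, text) pairs.
    Returns {year: {term: normalized frequency}}."""
    counts = defaultdict(Counter)
    for year, text in documents:
        counts[year].update(extract_nouns(text))
    freqs = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        freqs[year] = {t: c / total for t, c in counter.items()}
    return freqs
```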

4.1 Analysis Results

We split the datasets into yearly collections and show the yearly average for each experiment. However, in the experiments regarding the DRR we process BNCwr and BNCsp as a whole because of the limited size of these datasets.

Dictionary Recognition Rate: Table 2 shows the DRR of all datasets. For NYTimes and BNCwr the DRR is 94.6%–97.1%. The unrecognized terms are most likely proper names, because the Aspell DRR is consistently higher than that of WordNet (Aspell contains many person names, WordNet almost none). Among the user generated datasets the DRR is consistently lower than for the written datasets. An exception is the Aspell DRR for BNCsp (98.4%), which is high because the person names used in conversations are recognized by Aspell. The DRR of Blogs is lower than the DRR of NYTimes. To some extent this is due to non-English blog entries being included in the dataset.

[4] http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
[5] http://search.cpan.org/dist/Lingua-EN-Tagger/

Table 3. Dictionary overlaps: dictionaries divided into top 25%, middle 50%, and bottom 25% and compared year by year. The average overlap over all years is shown.

              Top     Middle   Bottom   Total
    NYTimes   76.8%   41.3%    9.3%     60.8%
    Times     75.1%   42.3%    9.6%     60.5%
    BNCwr     70.4%   30.4%    7.25%    55.2%
    BNCsp     50.0%   16.6%    4.2%     43.6%
    Blogs     54.8%   14.9%    6.0%     38.9%

Table 4. Kdiff measure for all datasets. A value of 0.5 means an increase or decrease in frequency of ≥ 50% between two consecutive years. The values shown are an average over all years in each dataset.

              0.5     0.55    0.6     0.65    0.7     0.75
    NYTimes   13.0%   9.8%    7.3%    5.1%    3.3%    2.0%
    Times     14.0%   10.8%   8.1%    5.8%    3.9%    2.5%
    BNCwr     11.2%   10.1%   8.2%    6.7%    5.4%    3.6%
    BNCsp     19.6%   18.8%   16.7%   14.3%   13.1%   10.9%
    Blogs     17.7%   15.9%   13.8%   11.7%   9.5%    7.4%

However, a subsequent test on a sample of the dataset showed that approximately 96.2% of all documents were English. If we correct for this error by adding 3.8% to the DRR of Blogs and compare it to NYTimes, we find that the former is significantly lower at the 99% significance level. As NYTimes contains edited texts with virtually no errors, we can use it as an upper bound on the possible DRR of a dataset with a large number of person names. The significantly lower DRR of Blogs shows that Web language deviates more from a standard dictionary than the language used in the NYTimes.

Dataset Dictionary Overlap: Table 3 shows the overlaps of nouns and noun phrases in yearly collections (dataset dictionaries) of consecutive years for all datasets. We use this measure to indicate the temporal dynamics of each dataset. For the written datasets the average top overlap is 70%–77%, which is significantly higher than all other values. Intuitively this makes sense, as the most common nouns and noun phrases are unlikely to disappear from one year to the next. For example, health, service, and government will remain frequent topics of discussion, so we consider them temporally stable. Other terms are locally frequent or temporally unstable: They appear in the top part only for a single year and disappear after that. These terms are highly affected by events or technical innovations, e.g., iranian movie, olympic gold, groovebox, payroll disparity.

For the user generated datasets we see a much lower top overlap (50%–55%). The rate at which terms are exchanged in the vocabulary is much higher in user generated datasets. This is in particular true for the Blogs dataset, as terms have to be used by many different authors to be considered frequent. For all datasets the bottom overlap is very low (less than 10%). This also makes intuitive sense, as these sets contain many person names and misspellings that are less stable.

The total overlap also differs between the two types of datasets. Among the written datasets around 55%–60% of all terms also appear in the next year, whereas among the user generated datasets this overlap is between 39% and 44%. In our news datasets the total overlap is around 60% even though the very aim of a newspaper is to cover many different events. Thus a 60% overlap indicates a stable language despite varying events. For Blogs the total overlap is 39%. The big difference is most likely a product of topic variation, temporally unstable terms, and misspellings. However, misspellings of common terms are likely to follow a similar pattern across all datasets and thus play a smaller role in the lower overlap. Hence the low overlap of the user generated datasets is a sign of high dynamics.

Kdiff Terms: As a final measure we consider the Kdiff of terms, the relative frequency change of terms between consecutive years. It is thus also a measure of the popularity of a term. For our experiments we only consider differences in frequency larger than 50% (Kdiff ≥ 0.5). Table 4 shows the average proportion of terms in each dataset that have a Kdiff of at least 0.5. A high number of Kdiff terms indicates a more dynamic language, as more terms increase or decrease in popularity. Table 4 shows an average of 11%–14% Kdiff terms for the traditionally written datasets compared to 18%–20% for the user generated datasets. The difference between the groups is statistically significant at the 95% level. We also see that as the Kdiff threshold increases, the difference between the two types of datasets grows: there are 2%–4% terms with Kdiff ≥ 0.75 for the traditionally written datasets in comparison to 7%–11% for the user generated datasets.
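For completeness, a minimal sketch of the 25%/50%/25% split and the band overlap from Section 3 could look as follows. The normalization (intersection relative to the earlier year's band) is an assumption on our part; the paper does not spell out the exact overlap formula.

```python
# Sketch: 25%/50%/25% frequency split and year-to-year band overlap.
# Normalizing by the earlier year's band size is an assumed choice.

def split_bands(freqs):
    """Split a {term: frequency} dictionary into top 25%, middle 50%,
    and bottom 25% of terms by frequency rank."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    n = len(ranked)
    top = set(ranked[: n // 4])
    middle = set(ranked[n // 4 : n - n // 4])
    bottom = set(ranked[n - n // 4 :])
    return top, middle, bottom

def band_overlaps(freqs_i, freqs_j):
    """Overlap of each frequency band and of the full vocabularies
    between two adjacent years i and i+1."""
    bands_i, bands_j = split_bands(freqs_i), split_bands(freqs_j)
    per_band = [len(a & b) / len(a) for a, b in zip(bands_i, bands_j)]
    total = len(set(freqs_i) & set(freqs_j)) / len(freqs_i)
    return per_band, total  # ([top, middle, bottom], total)
```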

5 Discussion

Overall our experiments show that language from user generated text (speech or blogs) is more dynamic than traditionally written language like that in newspapers. Due to the relatively small size of the BNC, we choose to compare the written and spoken parts of the BNC to each other and to the remaining datasets separately. Because of the high quality of the BNC, we use the relationship between its parts as a ground truth to compare the other datasets against.

The spoken part of the BNC is more dynamic than the written part: The DRR is lower for BNCsp, indicating the use of more non-standard terms and spellings, the dictionary overlap for nouns and noun phrases (Table 3) is consistently lower for BNCsp, and the number of Kdiff terms (Table 4) is higher by 7%–9%.

The relation between spoken and written language in the BNC also holds for Web language and written language. In general, the news datasets behave similarly while the Blogs dataset is markedly different. This holds for the DRR, where Blogs has consistently lower values for both dictionaries, for the dictionary overlap (20% lower for Blogs), and for the Kdiff terms (5%–6% larger for Blogs). The difference between the news datasets is quite small and very likely a consequence of the presence of OCR errors in Times but not in NYTimes. In general, we find that the relationship between spoken and written language is mirrored by that between Web language and written language. In conclusion, we find that Web language behaves more dynamically, with a higher change rate.

6 Conclusions and Future Work

Our experiments showed that language from user generated content such as conversations or blog posts behaves more dynamically than language used in traditionally written texts. To measure dynamics we used the number of terms that appear in or disappear from the vocabulary as well as the number of terms that undergo radical changes in frequency. As ground truth we used the written and spoken parts of the British National Corpus; as representatives of real life datasets we used the New York Times and The Times Archive (traditionally written texts) as well as the TREC-BLOG dataset (user generated texts). The relationship between written and spoken texts in the BNC is mimicked by the relationship between traditionally written text and user generated text.

However, in our experiments we have seen results that we partly attribute to a high number of person names and proper nouns. It remains future work to investigate to what extent these play a role and to determine their role properly. It would also be interesting to investigate how variations spread and to see which types of variations reach a wider audience and become established. The TREC-BLOG dataset we used in our experiments has some issues: The topic distribution is unknown and the number of entries per year varies heavily. We propose to create a dataset that follows a set of topics over a longer time to encourage and simplify research on temporal language evolution.

Acknowledgments

We would like to thank Times Newspapers Limited for providing the archive of The Times and our colleague Nattiya Kanhabua for constructive discussions.

References

1. Abecker, A., Stojanovic, L.: Ontology evolution: Medline case study. In: Wirtschaftsinformatik: eEconomy, eGovernment, eSociety. pp. 1291–1308 (2005)
2. Bamman, D., Crane, G.: Measuring historical word sense variation. In: JCDL. pp. 1–10 (2011)
3. The British National Corpus, version 3 (2007), BNC Consortium
4. Christiansen, M., Kirby, S.: Language Evolution. Studies in the Evolution of Language, Oxford University Press (2003)
5. Ernst-Gerlach, A., Fuhr, N.: Retrieval in text collections with historic spelling using linguistic and spelling variants. In: JCDL. pp. 333–341 (2007)
6. Kanhabua, N., Nørvåg, K.: Exploiting time-based synonyms in searching document archives. In: JCDL. pp. 79–88 (2010)
7. Macdonald, C., Ounis, I.: The TREC Blogs06 Collection: Creating and Analysing a Blog Test Collection. DCS Technical Report Series (2006)
8. Miller, G.A.: WordNet: A Lexical Database for English. Communications of the ACM 38, 39–41 (1995)
9. Pinker, S., Bloom, P.: Natural selection and natural language. Behavioral and Brain Sciences 13(4), 707–784 (1990)
10. Segerstad, Y.: Use and adaptation of written language to the conditions of computer-mediated communication. Ph.D. thesis, Göteborg University (2002)
11. Tahmasebi, N., Niklas, K., Theuerkauf, T., Risse, T.: Using Word Sense Discrimination on Historic Document Collections. In: JCDL. pp. 89–98 (2010)
12. TREC-BLOG, http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG (2012)