Exploring the Incorporation of Acoustic Information into Term Weights for Spoken Document Retrieval

Gareth J. F. Jones
Department of Computer Science, University of Exeter
Exeter EX4 4PT, U. K.
email: [email protected]

Abstract

Standard term weighting methods derived from experience with text collections have been used successfully in various spoken document retrieval evaluations. However, the speech recognition techniques used to index the contents of spoken documents are errorful, and these mistakes are propagated into the document index file, degrading retrieval performance. It has been suggested that, because of the uncertainty of correct recognition, term weights in spoken document retrieval might be improved by incorporating the acoustic likelihood information associated with each term by the speech recogniser. This paper examines possible techniques for incorporating acoustic likelihoods, starting with a theoretical analysis and an initial experimental investigation using the VMR1b collection.

1 Introduction

The increasing availability of archives of spoken data stored in digital form is creating new opportunities for very rapid access to large amounts of original source material suitable for many applications. For this information to be easily available, suitable automated retrieval tools are needed. The research area of spoken document retrieval (SDR) was initially developed within a small number of early studies, for example [1], [2], [3] and [4]. However, interest in SDR has grown very rapidly in recent years, as illustrated by the wide participation in the SDR tracks at the recent TREC workshops [5] [6] [7]. From the retrieval perspective the key difference between standard text retrieval and SDR is the need to index the contents of spoken documents using some form of speech recognition prior to retrieval. Various methods of speech recognition have been explored for SDR, including keyword spotting [1], subword feature indexing [2] [8] [9], phone lattice spotting [3] [10] and, most commonly for the TREC evaluations, document transcriptions generated using large vocabulary speech recognition systems. A common feature of all these approaches is that they will make errors in recognition. Thus all terms in a retrieval index file are only hypotheses of term occurrences, which may be correct or incorrect. At the output stage of the speech recognition system each word hypothesis has a statistical score associated with it that is related to the likelihood that it is correct. This information is usually discarded in SDR systems, and for retrieval purposes all output hypotheses are generally treated as correct¹. However, bearing in mind the uncertainty of the recognition process, it is intuitively attractive to suggest that this acoustic information should be preserved and might be usefully incorporated into the retrieval term weights. Although often suggested as a “future work” item by SDR researchers, little work has appeared focusing on this issue.
The small amount of work that has been published in this area to date has generally been inconclusive [11] [12]. This paper presents a theoretical analysis of methods for integrating acoustic scores into term weights. This analysis suggests reasons why such techniques are likely to have a very small impact on retrieval performance, and presents a simple extension to the standard binary independence model which enables acoustic likelihoods to be incorporated naturally. This theoretical analysis is then explored experimentally using the VMR1b collection [4].

¹ In word spotting and phone lattice spotting a threshold is usually applied to the recognition scores, removing less likely hypotheses; only hypotheses exceeding this threshold are then entered into the index file.


The remainder of this paper is organised as follows. Section 2 reviews term weighting techniques originally derived for use in text retrieval, Section 3 explores methods by which acoustic information might be incorporated into these weights for SDR and examines their likely impact on retrieval behaviour, Section 4 gives an overview of the VMR1b collection, and Section 5 gives a brief description of the speech recognition technique used for indexing the data. Section 6 describes a preliminary experimental analysis of the preceding arguments using VMR1b, and finally Section 7 summarises conclusions from the study.

2 Term Weighting

It is generally acknowledged that effective weighting of search terms can significantly improve performance for text retrieval systems. Standard term weighting methods typically combine three components: collection frequency weighting (cfw), document term frequency (tf), and document length normalisation. There are several established variations on term weighting using these components; this paper concentrates on two of the best known: the vector space model [13] [14] and the Okapi BM25 probabilistic model [15] [16] [17]. Although derived using different approaches, both models essentially utilise the same basic form of combined weight,

cw(i,j) = cfw(i) \times f(tf(i,j))

where cw(i,j) is the combined weight of term i in document j, cfw(i) is the collection frequency weight (often referred to as the idf or inverse document frequency weight) of term i, tf(i,j) is the term frequency of term i in document j, and f(tf(i,j)) is a function of tf(i,j) that depends on the retrieval model being used. The cfw(i) is in practice another way of stating the standard idf weight [18]. Length normalisation can be provided in various ways: as a component of the f(tf(i,j)) [19], via the use of an additional factor [14], or, as in the original vector space model, implicitly [13]. A matching score ms(j) between a query q and an individual document j is given by the sum,

ms(j) = \sum_{i \in q} cw(i,j)

the sum of cw(i,j) over all query terms i. The documents are then arranged in decreasing order of matching score and returned to the user for investigation. The following sections describe some popular functions used to implement the components of a combined term weight.
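As a concrete illustration, the combined weight and matching score above can be sketched in a few lines (an illustrative sketch, not code from the paper; the cfw values and the identity choice of f are placeholder assumptions):

```python
from collections import Counter

def matching_score(query_terms, doc_terms, cfw, f):
    """ms(j): sum of combined weights cw(i,j) = cfw(i) * f(tf(i,j)) over query terms."""
    tf = Counter(doc_terms)  # tf(i,j): term frequencies for this document
    return sum(cfw[t] * f(tf[t]) for t in query_terms if t in tf)

# Toy example with made-up cfw values and the (unrealistically simple) identity f(tf) = tf.
cfw = {"speech": 2.0, "retrieval": 1.0}
doc = ["speech", "retrieval", "speech"]
score = matching_score(["speech", "retrieval"], doc, cfw, lambda tf: tf)
# score = 2.0 * 2 + 1.0 * 1 = 5.0
```

In a real system the documents would then be ranked by this score in decreasing order.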

2.1 Collection Frequency Weight

The primary motivation for the collection frequency weight is the observation that search terms which occur in fewer documents within a retrieval collection are likely to be more valuable for discriminating relevant from non-relevant documents in retrieval. The cfw(i) function utilised almost exclusively in term weighted retrieval systems is,

cfw(i) = \log \frac{N}{n(i)}

where N is the total number of documents in the collection and n(i) is the number of documents containing the term i. First introduced in [20], this weight has been theoretically justified in various studies including [21] and [22].
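A minimal sketch of this weight (illustrative document counts, not figures from any collection used in the paper):

```python
import math

def cfw(N, n_i):
    """Collection frequency weight cfw(i) = log(N / n(i))."""
    return math.log(N / n_i)

# A term found in 10 of 1000 documents is weighted far more highly
# than a term found in 900 of them.
rare = cfw(1000, 10)     # log(100) ~= 4.61
common = cfw(1000, 900)  # log(10/9) ~= 0.105
```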

2.2 Term Frequency Weight

The incorporation of a term frequency component into a combined weight is motivated by the hypothesis that the number of occurrences of a term within a document is in some way indicative of the importance of the term to the document. Thus f(tf(i,j)) should increase as the number of observations of the term within the document increases. A number of different approaches have been taken to the development of suitable functions relating to the term frequency tf(i,j). A simple linear function,

f(tf(i,j)) = tf(i,j)

has been found to be unsuitable, and all functions used in practice increase more slowly. Some well known f(tf(i,j)) functions are:

Maximum tf Function

A popular f(tf(i,j)), introduced in [23], is

f(tf(i,j)) = K + (1 - K) \frac{tf(i,j)}{max\_tf(j)}   (1)

where 0 \le K \le 1.0. K is a collection dependent scalar tuning factor and max_tf(j) is the maximum term frequency observed in document j. Although somewhat damping the effect of term frequency, this function is still linear in tf(i,j). The best value of K found in various studies has been in the range 0.3 - 0.5. Note that the introduction of the max_tf(j) factor also introduces a document dependent component of document length normalisation, where a single stray high frequency term can dramatically affect the f(tf(i,j)) values for an individual document. By restricting the f(tf(i,j)) values, this technique compensates for the higher term frequencies associated with longer documents [14].

Logarithmic Functions

A logarithmic function can be used to dampen the increase in f(tf(i,j)) with increasing tf(i,j). One logarithmic function introduced in [14] was of the form,

f(tf(i,j)) = 1 + \log(tf(i,j))   (2)
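The damping behaviour of the maximum-tf function (1) and the logarithmic function (2) can be sketched as follows (illustrative code; K = 0.4 is simply one value inside the 0.3 - 0.5 range reported above):

```python
import math

def f_max_tf(tf, max_tf_j, K=0.4):
    """Eq. (1): K + (1 - K) * tf / max_tf(j), with 0 <= K <= 1."""
    return K + (1 - K) * tf / max_tf_j

def f_log(tf):
    """Eq. (2): 1 + log(tf)."""
    return 1 + math.log(tf)

# Raw tf grows from 1 to 10 (a 10x increase), while the damped values
# grow much more slowly:
#   f_max_tf: 0.46 -> 1.0  (with max_tf(j) = 10)
#   f_log:    1.0  -> 3.30
```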

However, it was found that a simple logarithmic f(tf(i,j)) function does not adequately reduce the effect of large values of tf(i,j). This observation led to the proposal of a double logarithmic f(tf(i,j)) [24] as follows,

f(tf(i,j)) = 1 + \log(1 + \log(tf(i,j)))   (3)

Length normalisation when using these functions is provided by a pivoted length normalisation factor [14]. This is not discussed further here since it is not of interest in the current study.

2-Poisson Approximate Model

Based on the 2-Poisson model distributional assumption, a simple within-document frequency function was introduced in [16]. The basic form of this f(tf(i,j)) is,

f(tf(i,j)) = \frac{tf(i,j)}{k_1 + tf(i,j)}   (4)

where k_1 is an empirically tuned scalar constant. This function again increases more slowly than linearly with tf(i,j). The value of k_1 controls the rate at which the function approaches its asymptotic value of 1.0. Values of k_1 between 1.0 and 2.0 have been found to be useful in various experimental studies. This simple form of f(tf(i,j)) has been further elaborated in the BM25 combined weight (cw) [25] to provide document length compensation and to allow for verbosity as follows,

cw(i,j) = \frac{cfw(i) \times tf(i,j) \times (k_1 + 1)}{k_1 \times ((1 - b) + b \times ndl(j)) + tf(i,j)}

where ndl(j) = dl(j) / \overline{dl} is the length of document j normalised by the average document length, and b is a tuning constant controlling the degree of length normalisation.