Crowdsourced Nonparametric Density Estimation Using Relative Distances

Antti Ukkonen
Finnish Institute of Occupational Health, Helsinki, Finland

Behrouz Derakhshan
Rovio Entertainment, Espoo, Finland

Hannes Heikinheimo
Reaktor, Helsinki, Finland

Abstract

In this paper we address the following density estimation problem: given a number of relative similarity judgements over a set of items D, assign a density value p(x) to each item x ∈ D. Our work is motivated by human computing applications where density can be interpreted e.g. as a measure of the rarity of an item. While humans are excellent at solving a range of different visual tasks, assessing the absolute similarity (or distance) of two items (e.g. photographs) is difficult. Relative judgements of similarity, such as “A is more similar to B than to C”, on the other hand, are substantially easier to elicit from people. We provide two novel methods for density estimation that only use relative expressions of similarity. We give both theoretical justifications, as well as empirical evidence that the proposed methods produce good estimates.

Introduction

A common application of crowdsourcing is to collect training labels for machine learning algorithms. However, human computation can also be employed to solve computational problems directly (Amsterdamer et al. 2013; Chilton et al. 2013; Trushkowsky et al. 2013; Parameswaran et al. 2011; Bragg, Mausam, and Weld 2013). Some of this work is concerned with solving machine learning problems with the crowd (Gomes et al. 2011; Tamuz et al. 2011; van der Maaten and Weinberger 2012; Heikinheimo and Ukkonen 2013). In this paper we focus on the problem of nonparametric density estimation given a finite sample from an underlying distribution. This is a fundamental problem in statistics and machine learning that has many applications, such as classification, regression, clustering, and outlier detection.

Density can be understood simply as “the number of data points that are close to a given data point”. Any item in a high density region should thus be very similar to a fairly large number of other items. We argue that in the context of crowdsourcing, density can be viewed for instance as a measure of “commonality” of the items being studied. That is, all items in a high density region can be thought of as

having a large number of common aspects or features. Likewise, items from low density regions are outliers or in some sense unusual.

To give an example, consider a collection of photographs of galaxies (see e.g. (Lintott et al. 2008)). It is reasonable to assume that some types of galaxies are fairly common, while others are relatively rare. Suppose our task is to find the rare galaxies. However, in the absence of prior knowledge this can be tricky. One approach is to let workers label each galaxy as either common or rare according to the workers’ expertise. We argue that this has two drawbacks. First, evaluating commonality in absolute terms in a consistent manner may be difficult. Second, the background knowledge of the workers may be inconsistent with the data distribution: perhaps a galaxy that would be considered rare under general circumstances is extremely common in the given data. We argue that in such circumstances density estimation may result in a more reliable method for identifying the rare galaxies. A galaxy should be considered rare if there are very few (or no) other galaxies that are similar to it in the studied data.

The textbook approach to nonparametric density estimation is the kernel density estimator (Hastie, Tibshirani, and Friedman 2009, p. 208ff). These methods usually consider the absolute distances between data points. That is, the distance between items A and B is given on some (possibly arbitrary) scale. Absolute distances between data points are used also in other elementary machine learning techniques, such as hierarchical clustering or dimensionality reduction. However, in the context of human computation, it is considerably easier to obtain information about relative distances. For example, statements of the form “the distance between items A and B is shorter than the distance between items C and D” are substantially easier to elicit from people than absolute distances. Moreover, such statements can be collected in an efficient manner via crowdsourcing using appropriately formulated HITs (Wilber, Kwak, and Belongie 2014). It is thus interesting to study the expressive power of relative distances: are absolute distances even needed to solve some problems?

A common application of relative distances is in algorithms for computing low-dimensional embeddings (representations of the data in $\mathbb{R}^m$) either directly (van der Maaten and Weinberger 2012) or via semi-supervised

metric-learning (Schultz and Joachims 2003; Davis et al. 2007; Liu et al. 2012) if features are available. While embeddings can be used for various purposes, including also density estimation, we are more interested in understanding what can be done using relative distances directly, without first computing a representation of the items in some (arbitrary) feature space. This is because despite embeddings being powerful, they can be tricky to use in practical situations, e.g. if the true dimensionality of the data is unknown. Moreover, an embedding may have to be re-computed when new data is added to the model. We hope to avoid this by operating directly with relative distance judgements.

Figure 1: Left: Example of the TDE1 estimator we propose. Right: Example of the TDE2 estimator that we propose.

Our contributions

We consider two approaches for density estimation using relative distances. In particular, we provide the following contributions:
1. We describe how a standard kernel density estimator can be expressed using relative distance judgements, and describe how embeddings to $\mathbb{R}^m$ can be used to construct this estimator.
2. We propose a second estimator that is explicitly designed for human computation. The main upside of this estimator is that it is easier to learn and use. While the estimator is biased, the amount of bias can be reduced at the cost of using more work to build the estimator.
3. We conduct experiments on simulated data to compare the two methods, and show that the second estimator is in general easier to learn than the first one, and that it produces better estimates.

High-level overview of the proposed methods

We propose two methods for density estimation using relative distances. The first is a simple variant of a standard kernel density estimator, while the other is a novel method that has some practical advantages over the first one. We call our estimators “triplet density estimators”, because the relative distance judgements they use are defined on sets of three items. Our input is a dataset D, assumed to have been sampled iid. from some unknown density p, and we want to estimate the density p(x) at an arbitrary point x.

Both of the proposed methods are based on the observation that the density at a given point x is proportional to the number of close neighbours of x that belong to D. If there are many other points in D that are in close proximity to x, we say that x is from a high-density region. The question is thus how to identify and count the points that are close to x without measuring absolute distances.

The basic idea of the first method, called TDE1, is illustrated in the left panel of Figure 1. Suppose we have placed a circle with a short radius at every observed data point u_i ∈ D. Given x, we can count how many of these circles cover x, i.e., the number of u_i that are close to x. Observe that to do this we only need to know whether the distance between x and some u_i is shorter or longer than the radius of the circle around u_i. This is a problem of comparing two distances, and does thus not require knowing the absolute distances. We can thus pair every u_i ∈ D with a reference point $u_i' \in D$ so that the distance between u_i and $u_i'$ equals the radius of the circle. However, as the circles at every u_i must have at least roughly the same radius, finding an appropriate reference point for every u_i can be difficult. We also need a means to find the reference points without access to absolute distances. This problem can be solved to some degree using a low-dimensional embedding, but it adds to the complexity of the method.

The second method we propose, called TDE2, is simpler in this regard. Rather than having to find a reference point for every u_i ∈ D, it only uses a random set of pairs of points from D. The only constraint is that the pairs should be “short”. We say a pair is short if it is short in comparison to all possible pairs of items from D. The intuition is that short pairs occur more frequently in high-density regions of the density function p. The second method thus counts the number of short pairs that are close to x. Here the notion of “closeness” also depends on the length of the pair. This is illustrated in the right panel of Figure 1: the pairs are shown in red, with gray circles around their endpoints. The radius of the circles is equal to the length of the respective pair. The pair is considered to be close to x if x is covered by at least one of the circles that are adjacent to the pair. Very short pairs must be very close to x in order to cover it, while longer pairs also cover points that are further away.

It is easy to see that TDE1 is essentially a standard kernel density estimator that uses the uniform kernel. The TDE2 estimator is designed for relative distances from the ground up and thus has some advantages over TDE1. In particular, TDE2 is simpler to construct.

TDE1: A simple triplet density estimator

Our first method can be viewed as a standard kernel density estimator that does not use absolute distances. First, we recall the basic definition of a kernel density estimator as originally proposed (Rosenblatt 1956; Parzen 1962), and as usually given in textbooks, e.g. (Hastie, Tibshirani, and Friedman 2009, p. 208ff). Then, we describe how a simple instance of this can be implemented using only relative distances.

Basics of kernel density estimation

Let D = {u_1, ..., u_n} denote a data set of size n drawn from a distribution having the density function p. For the moment, assume that D consists of points in $\mathbb{R}^m$. Given D, the kernel density estimate (KDE) of p at any point x is given by
$$\mathrm{KDE}_n(x) = \frac{1}{n h^m} \sum_{i=1}^{n} K\Big(\frac{1}{h}(x - u_i)\Big), \qquad (1)$$

where K is the kernel function and h is the bandwidth parameter. Fairly simple conditions on the kernel function K and the bandwidth h are sufficient to guarantee that the kernel density estimate is an unbiased and consistent estimate of p. That is, as the size of D increases, we have
$$\lim_{n \to \infty} E_{D \sim p}\big[\mathrm{KDE}_n(x)\big] = p(x), \qquad (2)$$
where the expectation is taken over random samples from p of size n. Most importantly, the kernel function K must satisfy
$$K(y) \ge 0 \ \ \forall y \quad \text{and} \quad \int_{\mathbb{R}^m} K(y)\, dy = 1. \qquad (3)$$

In this paper we consider radially symmetric kernels. In addition to Equation 3, these satisfy $K(y) = K(y') \Leftrightarrow \|y\| = \|y'\|$. That is, the kernel is a function of only the length of its input vector. Also, we want K(y) to be non-increasing in $\|y\|$. Now, assume that there exists a distance function d between the items in D. For the moment, let $d(x, y) = \|x - y\|$. We can rewrite Equation 1 as follows:
$$\mathrm{KDE}_n(x) = \frac{1}{n h^m} \sum_{i=1}^{n} K\Big(\frac{d(x, u_i)}{h}\Big). \qquad (4)$$
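To make Equation 4 concrete, here is a minimal numeric sketch for one-dimensional data (m = 1). The Gaussian kernel and the toy sample are choices made only for this illustration; any kernel satisfying Equation 3 would do.

```python
import math

def gaussian_kernel(t):
    """A valid 1-D kernel: non-negative and integrates to 1."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, dist=lambda a, b: abs(a - b)):
    """Equation 4 with m = 1: KDE_n(x) = (1 / (n h)) * sum_i K(d(x, u_i) / h)."""
    return sum(gaussian_kernel(dist(x, u) / h) for u in data) / (len(data) * h)

# Toy usage: density of a small 1-D sample at x = 0.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 2.0]
print(kde(0.0, sample, h=0.5))
```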

The bandwidth h is a parameter that must be specified by the user. There are a number of methods for choosing h in an appropriate manner. In practice the bandwidth h should decrease as n increases; in fact, we must have $\lim_{n \to \infty} h(n) = 0$ for Equation 2 to hold. In the experiments we use cross validation to select an appropriate value of h when needed.

Normalised densities and relative distances

Before continuing, we point out a property of density functions and relative distances. Recall that the integral of a density function over its entire domain must by definition be equal to 1. This means that the value p(x) is not scale invariant: the precise value of p(x) depends on the “unit” we use to measure distance. If we measure distance in different units (say, feet instead of meters), the absolute value of p(x) must change accordingly as well. This is reflected in Equation 1 through the bandwidth parameter h, which must have the same unit as the distance d(u, v). However, when dealing with relative distances there is no distance unit at all. We only know that some distance is shorter (or longer) than some other distance; the unit in which the distances are measured is in fact irrelevant.

An important consequence of this is that it is not possible to devise a normalised density estimator based on relative distances. In other words, Equation 2 cannot hold with equality. The best we can hope for is to find an estimator that is proportional to p(x). Later we show how to derive a normalised estimator for one-dimensional data, but this makes use of absolute distances in the normalisation constant, much in the same way that h appears in the normalisation constant of Equation 1. In practice we must thus consider non-normalised estimates of p(x) when using relative distances.

Uniform kernel with relative distances

The simplest kernel function K that satisfies the requirements given above is the uniform kernel. It takes some constant positive value if x is at most at distance h from the data point u_i ∈ D at which the kernel is located, and is otherwise equal to zero. Formally, the m-dimensional uniform kernel is defined as
$$K^{\mathrm{unif}}(y) = \begin{cases} V(m)^{-1} & : \|y\| \le 1, \\ 0 & : \|y\| > 1, \end{cases} \qquad (5)$$
where V(m) is the volume of the m-dimensional unit sphere, given by $V(m) = \pi^{m/2} / \Gamma(\tfrac{m}{2} + 1)$, where Γ is the Gamma function. In this case the density estimate at x is simply proportional to the number of data points that are at most at a distance of h from x. We say the point u covers x with bandwidth h whenever $d(x, u)/h \le 1$. This can be seen as a straightforward generalisation of a histogram that has unspecified bin boundaries. In Figure 1 (left) we can say that the density at x is proportional to the number of “discs” that cover x. It is easy to see that as n increases and h goes to zero (i.e., there are more points covered by smaller and smaller discs), this quantity will indeed approach the underlying density p.

We continue by designing a triplet density estimator that uses the uniform kernel $K^{\mathrm{unif}}$. To compute $K^{\mathrm{unif}}$, we must only determine whether a data point u_i is near enough to x. However, we cannot evaluate d(u, v) explicitly for any u, v ∈ D. Instead, we only have access to an “oracle” that can give statements about the relative distance between items. That is, we can issue tasks such as “Is the distance d(u, x) shorter than the distance d(u, v)?” Or, in more natural terms: “Of the items x and v, which is more similar to u?”

We can use this task to compute the uniform kernel as follows. Suppose that for every u_i ∈ D, we have found some other $u_i' \in D$, so that $d(u_i, u_i')$ is equal to h. Denote this pairing by P(h). Note that several u ∈ D can be paired with the same u'; P(h) is not required to be a proper matching. Given P(h), by the definition of the uniform kernel in Equation 5, we have
$$K^{\mathrm{unif}}\Big(\frac{d(x, u_i)}{h}\Big) = V(m)^{-1}\, I\{d(u_i, x) \le d(u_i, u_i')\}. \qquad (6)$$

This, together with Equation 4, gives the triplet density estimator
$$\mathrm{TDE1}_n(x) = \frac{1}{V(m)\, n\, h^m} \sum_{(u, u') \in P(h)} I\{d(u, x) < d(u, u')\}. \qquad (7)$$
This simply counts the number of u ∈ D that cover x with the specified bandwidth h. As we only need an oracle that can provide relative distance judgements to evaluate the indicator function above, Equation 7 provides a kernel density estimator that is suitable for human computation. Moreover, all known results that apply to Equation 1 (e.g. the one of Equation 2) directly carry over to Equation 7.

There are some issues with this result, however. As discussed above, in practice we cannot evaluate the normalisation constant: both m and the exact value of h are unknown if our input only consists of relative distance comparisons. More importantly, it is unlikely that for a given D we would find a pairing P(h) where the distances between all pairs are exactly equal to h. It is more realistic to assume that we can find a pairing $\tilde{P}(h)$ where this holds approximately. That is, for (almost) every $(u, u') \in \tilde{P}(h)$, we have $d(u, u') \approx h$. This implies that results about unbiasedness and convergence that hold for Equation 1 are no longer obviously true. While variable bandwidth density estimators have been proposed in the literature, our problem differs from those in the sense that we do not try to select an optimal local bandwidth around each u ∈ D; rather, the bandwidth at u randomly varies around h. In general it is nontrivial to even quantify this variance, or to show how it affects the resulting estimator. This raises one important question, however: if we are not able to evaluate the absolute value of d(u, v), how can we find pairs that are (at least approximately) of length h?
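Before turning to that question, the following sketch illustrates the count behind Equation 7 (without the normalisation constant, which, as noted above, cannot be evaluated from relative judgements alone). The `closer_oracle` stands in for the relative-distance HIT; the simulated 1-D oracle and the example pairing are assumptions made only for this illustration.

```python
def tde1_score(x, pairs, closer_oracle):
    """Unnormalised count of Equation 7: the number of u that cover x.

    `pairs` plays the role of the pairing P(h): a list of (u, u') with d(u, u') ~ h.
    `closer_oracle(u, x, v)` answers the relative-distance HIT: True iff
    d(u, x) < d(u, v). No absolute distance is evaluated in this function.
    """
    return sum(1 for (u, u_prime) in pairs if closer_oracle(u, x, u_prime))

# Toy usage with a simulated 1-D oracle (a human worker in a real deployment).
oracle = lambda u, x, v: abs(u - x) < abs(u - v)
pairs = [(0.0, 0.5), (0.4, -0.1), (2.5, 3.0)]
print(tde1_score(0.2, pairs, oracle))  # number of covering points; larger means denser
```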

Using an embedding to find $\tilde{P}(h)$

As mentioned in the Introduction, by embedding the items into a low-dimensional Euclidean space, we can trivially estimate density using existing techniques. Such embeddings can be computed from relative distances, e.g. using the algorithm described in (van der Maaten and Weinberger 2012). This has the downside that to estimate density at a new point x, we in general must recompute the embedding to see where x is located. Using the technique described above, however, it is possible to estimate density directly, provided we have the set of pairs P(h). Therefore, we use the embedding only for constructing the set $\tilde{P}(h)$, as follows:

1. Embed all items in D to $\mathbb{R}^m$ using some suitable algorithm. We use the method proposed in (van der Maaten and Weinberger 2012). This results in a set of points X. Note that the dimensionality m of the embedding is a parameter that must be specified by the user.
2. Use any existing technique for bandwidth selection to find a good h given the points $X \subset \mathbb{R}^m$. We use cross validation in the experiments later.
3. For every item u ∈ X, find the item v ∈ X so that the distance between u and v is as close to h as possible.

This yields the set $\tilde{P}(h)$. The pairs (u, v) in $\tilde{P}(h)$ all have approximately length h in the embedding. Of course this length is not the same as the unknown distance d(u, v), but we can assume that d(u, v) is approximately the same for every $(u, v) \in \tilde{P}(h)$.
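A minimal sketch of step 3, assuming the embedded points X (step 1) and the bandwidth h (step 2) are already available; NumPy is used for brevity and the toy embedding below is made up for illustration.

```python
import numpy as np

def pairing_from_embedding(X, h):
    """Pair each embedded item with the item whose embedded distance is
    closest to the bandwidth h. Returns index pairs (i, j)."""
    X = np.asarray(X)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)           # pairwise distances in the embedding
    np.fill_diagonal(dists, np.inf)                   # never pair an item with itself
    partners = np.argmin(np.abs(dists - h), axis=1)   # closest-to-h partner for every item
    return [(i, int(j)) for i, j in enumerate(partners)]

# Toy usage: a 2-D embedding of five items and a bandwidth of 1.0.
X = np.array([[0.0, 0.0], [0.9, 0.1], [2.0, 0.0], [0.1, 1.1], [3.0, 3.0]])
print(pairing_from_embedding(X, h=1.0))
```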

TDE2: A second triplet density estimator

The density estimator proposed above (Equation 7) must construct a low-dimensional embedding of the items to find the pairs $\tilde{P}(h)$. While there are a number of techniques for finding such embeddings, they are not necessarily optimal for density estimation. Their objective may be to create visually pleasing scatterplots, which requires introducing additional constraints that favor embeddings with certain properties. For instance, the methods may aim to focus on short distances so that the local neighbourhood of each data point is preserved better. This emphasises clusters the data may have. As a consequence the resulting distances may become disproportionately skewed towards short distances. However, for the bandwidth selection process to produce good results, we would prefer the resulting distances to be as truthful as possible, possibly at the cost of the embedding resulting in a less clear visualisation. Moreover, the embedding algorithms tend to be sensitive to initialisation.

Motivated by this, we propose a second triplet density estimator that does not need an embedding. In particular, we show that it is possible to produce good density estimates, provided we can find a sufficient number of short pairs. Unlike in the method proposed above, they do not all have to be approximately of the same length h. Instead, it is sufficient to find pairs that are all at most of a certain length that we will denote by δmax. We first describe the estimator and prove some of its properties, and continue by discussing the problem of finding short pairs without an embedding.

Basic idea

This variant of the density estimator uses a slightly different human intelligence task: “Which of the three items x, u, and v is an outlier?” The worker must identify one of the three items as the one that is most dissimilar to the other two. In (Heikinheimo and Ukkonen 2013) this task was used to compute the centroid of a set of items. We model this HIT by a function Ω that takes as input the item x and the pair (u, v), and returns 1 if the distance between x and u or v is shorter than the distance between u and v, and 0 otherwise. That is, we have Ω(x, (u, v)) = 0 if x is chosen as the outlier, and Ω(x, (u, v)) = 1 if either u or v is chosen as the outlier. More formally, we have
$$\Omega(x, (u, v)) = I\{\min(d(x, u), d(x, v)) \le d(u, v)\},$$
and we say that the pair (u, v) covers the item x whenever Ω(x, (u, v)) = 1.

Our method is based on the following observation. Let $P = \{(u_i, v_i)\}_{i=1}^{l}$ denote a set of pairs, where every u_i and v_i is drawn from D uniformly at random. Moreover, let $P^{\delta_{\max}} \subset P$ denote the set of “short pairs”, meaning those for which we know the distance d(u, v) to be below some threshold δmax. Now, fix some x ∈ D, and consider the number of short pairs $(u, v) \in P^{\delta_{\max}}$ that cover x. If x resides in a high density region of p, we would expect there to be many short pairs that cover x, while there are fewer such pairs if x is from a low density region. This is illustrated in the right panel of Fig. 1.

The estimator that we propose, called TDE2, is defined as described above. We first draw an iid. sample of pairs P from p, and then remove from P all pairs that are longer than δmax to obtain $P^{\delta_{\max}}$. As we discuss in the remainder of this section, we can estimate the density p(x) at any given x by computing the number of pairs in $P^{\delta_{\max}}$ that cover x and taking a simple transformation of the resulting number.
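A minimal sketch of the HIT model Ω defined above. The distance function is evaluated directly here only to simulate a worker's answer; in a real deployment the value comes from the outlier task itself.

```python
def omega(x, u, v, dist):
    """Omega(x, (u, v)) = 1 iff min(d(x, u), d(x, v)) <= d(u, v),
    i.e. the pair (u, v) covers the item x."""
    return 1 if min(dist(x, u), dist(x, v)) <= dist(u, v) else 0

# Toy usage with 1-D items; a crowd worker would simply pick the odd one out of {x, u, v}.
d = lambda a, b: abs(a - b)
print(omega(0.2, 0.0, 0.5, d), omega(5.0, 0.0, 0.5, d))  # 1 (covered), 0 (not covered)
```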

Preliminary definitions and some simple results

As discussed above, the estimate that TDE2 gives for p(x) is based on the number of short pairs that cover x when the pairs are drawn independently from p. Suppose x is fixed, and we obtain the pair (u, v) by first drawing u according to p, and then drawing v independently of u, also from p. Let δmax denote some distance. Consider the events

A = “the item x is covered by the pair (u, v)”, and
B = “the distance d(u, v) is at most δmax”.

Let q(x, δmax) denote the probability Pr[A ∩ B]. The TDE2 estimator actually estimates q(x, δmax). Observe that we have
$$q(x, \delta_{\max}) = \Pr[A \cap B] = \Pr[A \mid B]\,\Pr[B]. \qquad (8)$$
Now, let P be a set of pairs sampled iid. from p, and denote by $P^{\delta_{\max}}$ the subset of P that only contains pairs up to length δmax, i.e., $P^{\delta_{\max}} = \{(u, v) \in P \mid d(u, v) \le \delta_{\max}\}$. Observe that $|P^{\delta_{\max}}| / |P|$ is an unbiased and consistent estimate of Pr[B], i.e., we have
$$\lim_{|P| \to \infty} \frac{|P^{\delta_{\max}}|}{|P|} = \Pr[B]. \qquad (9)$$
To estimate Pr[A | B], we define the triplet score of the item x given the set of pairs $P^{\delta_{\max}}$ as
$$S(x, P^{\delta_{\max}}) = \sum_{(u, v) \in P^{\delta_{\max}}} \Omega(x, (u, v)), \qquad (10)$$
and find that $S(x, P^{\delta_{\max}}) / |P^{\delta_{\max}}|$ is an estimate of the probability Pr[A | B], i.e.,
$$\lim_{|P| \to \infty} \frac{S(x, P^{\delta_{\max}})}{|P^{\delta_{\max}}|} = \Pr[A \mid B]. \qquad (11)$$
By combining Equations 8, 9 and 11, we obtain the following:

Proposition 1. Given the set P of pairs sampled independently from p, we have
$$\lim_{|P| \to \infty} \frac{S(x, P^{\delta_{\max}})}{|P|} = q(x, \delta_{\max}).$$

This shows that TDE2 is an unbiased and consistent estimate of q(x, δmax ).

Results for 1-dimensional data

Next, we discuss the connection between q(x, δmax) and p(x). Our objective is to show that by taking a suitable transformation of our estimate for q(x, δmax) we can obtain a reasonable estimate for p(x). For one-dimensional data it is fairly easy to see when events A and B happen simultaneously, and to derive an expression for q(x, δmax). Consider all pairs of length δ, starting from δ = 0 up to δ = δmax. For every such pair (u, v) of length δ, item x is covered by (u, v) exactly when u is contained in the interval [x − 2δ, x + δ] and v is at u + δ. (Thus v must be in the interval [x − δ, x + 2δ].) If this happens we have min{d(x, u), d(x, v)} ≤ d(u, v) and Ω(x, (u, v)) = 1. Next, suppose the midpoint of the pair (u, v) is at j. Clearly we must have u = j − δ/2 and v = j + δ/2. Using these we can write q(x, δmax) as
$$q(x, \delta_{\max}) = \int_{\delta=0}^{\delta_{\max}} \int_{j=x-\frac{3}{2}\delta}^{x+\frac{3}{2}\delta} 2\, p\Big(j - \frac{\delta}{2}\Big)\, p\Big(j + \frac{\delta}{2}\Big)\, dj\, d\delta. \qquad (12)$$
The constant 2 is needed because we can obtain (u, v) by first drawing u and then v, or vice versa. The following proposition shows how q(x, δmax) is connected to p(x).

Proposition 2. Let the function q(x, δmax) be defined as in Equation 12. We have
$$\lim_{\delta_{\max} \to 0} \sqrt{\frac{q(x, \delta_{\max})}{3\delta_{\max}^2}} = p(x). \qquad (13)$$

Proof. See Appendix.

This shows that by choosing a sufficiently small δmax, we can obtain approximate estimates for p(x) by taking a simple transformation of the estimate for q(x, δmax). We conclude this section by combining this observation with Propositions 1 and 2 into the following result.

Corollary 1. Let p be a univariate density function, and let P be a set of pairs sampled independently from p. Let $P^{\delta_{\max}} \subseteq P$ contain those pairs in P having length at most δmax. For a sufficiently small δmax we have
$$\hat{p}(x, \delta_{\max}) = \sqrt{\frac{S(x, P^{\delta_{\max}})}{3\delta_{\max}^2\, |P|}} \approx p(x).$$

Observe that p̂(x, δmax) as defined above yields estimates that are appropriately normalised. The estimates are guaranteed to converge to $\sqrt{q(x, \delta_{\max})/(3\delta_{\max}^2)}$, and have a bias with respect to p(x) that depends on δmax. By “sufficiently small” we mean that δmax should be as small as possible without making the set $P^{\delta_{\max}}$ too small. We continue with a simple study of the effect of δmax on the bias and variance of p̂(x, δmax).
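A minimal sketch of the estimator in Corollary 1 for one-dimensional data. Absolute distances appear here only to simulate the short-pair filtering and the cover test; with a crowd, these come from the outlier HIT and from the percentile-based pair selection described later. The sample sizes and seeds are arbitrary.

```python
import numpy as np

def tde2_estimate(x, data, delta_max, n_pairs=5000, seed=0):
    """Corollary 1: p_hat(x, delta_max) = sqrt(S(x, P^delta_max) / (3 delta_max^2 |P|))."""
    rng = np.random.default_rng(seed)
    u = rng.choice(data, size=n_pairs)                      # draw the pair set P
    v = rng.choice(data, size=n_pairs)
    short = np.abs(u - v) <= delta_max                      # keep only P^delta_max
    covers = np.minimum(np.abs(x - u), np.abs(x - v)) <= np.abs(u - v)
    s = np.sum(short & covers)                              # triplet score S(x, P^delta_max)
    return np.sqrt(s / (3.0 * delta_max ** 2 * n_pairs))

# Toy usage: for a standard normal sample the estimate at 0 should be close to
# p(0) = 1 / sqrt(2 * pi) ~ 0.40, up to the bias discussed in the next subsection.
rng = np.random.default_rng(1)
sample = rng.standard_normal(500)
print(tde2_estimate(0.0, sample, delta_max=0.5))
```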

Bias and variance of the 1-dimensional estimator, an empirical study

The estimator p̂(x, δmax) (Corollary 1) is biased, with the bias increasing as δmax increases. We studied the effect of the bias empirically. The top panel in Fig. 2 shows the shape of $\sqrt{q(x, \delta_{\max}) / (3\delta_{\max}^2)}$ for different values of δmax when p is the standard normal distribution. These are what p̂(x, δmax) will converge to as the size of P increases. On the other hand, for “small” δmax the estimate for q(x, δmax) will have a high variance, as there are only a few pairs of length at most δmax.

The top panel of Fig. 3 shows how the errors behave due to bias and variance as a function of δmax. The estimate is computed from 200 points sampled from a standard normal distribution. For small δmax the total error consists mainly of variance due to the small size of $P^{\delta_{\max}}$. As δmax increases the variance component diminishes, and most of the remaining error is due to bias.

In practice it is possible to set δmax so that the estimator will produce useful results. The bottom panel of Fig. 2 shows an example of a mixture of two normal distributions (black curve), the limit to which p̂(x, 1.0) would converge (blue curve), as well as two estimates computed with 500 and 5000 pairs in P, respectively (the red dashed curves). We can observe that even 500 pairs are enough to obtain a reasonable estimate. Finally, the bottom panel of Fig. 3 shows how the estimation accuracy (correlation between the estimate and the true density) behaves as the number of pairs in P is increased from 250 to 2500.

Figure 2: Top: Examples of the estimator p̂(x, δmax) for different values of δmax when the underlying density is a standard normal distribution. Bottom: Examples of estimating a bimodal distribution from a sample of 500 data points with 500 and 5000 pairs (dashed lines) when δmax = 1.0.

Figure 3: Top: Estimation error (MSE, wrt. the true density) as a function of δmax when the estimate is computed using 1000 pairs drawn from a dataset of 200 points sampled from a standard normal distribution. Bottom: Correlation of the estimate (δmax = 0.5) with the true density (the bimodal distribution shown on the left) as a function of the number of pairs in P. The inset shows the number of short pairs that remain in $P^{\delta_{\max}}$ after removing those longer than δmax.
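For reference, limiting values of the kind shown in the top panel of Fig. 2 (what p̂(x, δmax) converges to) can be approximated by numerically integrating Equation 12; a rough sketch, with arbitrary grid sizes, is given below.

```python
import numpy as np

def limit_curve(x, delta_max, p, n_grid=400):
    """Numerically evaluate sqrt(q(x, delta_max) / (3 delta_max^2)) via Equation 12;
    this is the value p_hat(x, delta_max) converges to as |P| grows."""
    deltas = np.linspace(delta_max / n_grid, delta_max, n_grid)
    q = 0.0
    for d in deltas:
        js = np.linspace(x - 1.5 * d, x + 1.5 * d, n_grid)
        inner = np.sum(2.0 * p(js - d / 2) * p(js + d / 2)) * (3.0 * d / n_grid)
        q += inner * (delta_max / n_grid)
    return np.sqrt(q / (3.0 * delta_max ** 2))

# For a standard normal density and a small delta_max, the value at x = 0
# should be close to p(0) = 1 / sqrt(2 * pi) ~ 0.399.
p = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)
print(limit_curve(0.0, 0.25, p))
```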

Generalisation to higher dimensions

Above we showed that for one-dimensional data the square root of q(x, δmax) is proportional to p(x) as δmax approaches zero. We can show a similar result also for inputs of arbitrary dimension.

Proposition 3. Let q(x, δmax) be defined as in Equation 8, and let C(δmax) be some function of δmax. We have
$$\lim_{\delta_{\max} \to 0} \sqrt{\frac{q(x, \delta_{\max})}{C(\delta_{\max})}} = p(x). \qquad (14)$$

Proof. See Appendix.

This result is important because in general we cannot know the true, underlying dimensionality of the data. Hence, our estimator should be independent of the dimensionality, and Proposition 3 shows this to be the case. Finally, since q(x, δmax) is in practice proportional to the triplet score $S(x, P^{\delta_{\max}})$ (Eq. 10), our estimator can be defined simply as $\sqrt{S(x, P^{\delta_{\max}})}$, because both C(δmax) as well as |P| are constants that do not depend on x.

Finding short pairs without embeddings

Above we showed how the length of the pairs that are used to compute p̂(x, δmax) affects the quality of the estimate. In particular, we should only use pairs that are “short” in comparison to the average length of the pairs. In practice we do not have the absolute lengths of the pairs, however.

We can only obtain relative distance information, and cannot thus simply filter the pairs based on their absolute length. Instead, we must find short pairs by some other means. Next we discuss a simple method for doing this. Let $P^{\delta_{\max}}$ denote the set of pairs that are at most of length δmax. Since all pairs in P are drawn uniformly at random from p, we know that $|P^{\delta_{\max}}| / |P|$ is an unbiased estimate of the probability Pr[d(u, v) ≤ δmax], as was discussed above. Inversely, by fixing some value α ∈ [0, 100], we can consider all pairs in P for which the length falls below the α:th percentile of the length distribution of all pairs in P. By setting α to some small value, e.g. α = 10, we obtain pairs that correspond to some unknown “small” value of δmax, so that α/100 = Pr[d(u, v) ≤ δmax]. Let $P^{\alpha}$ denote the set of pairs selected for a given α. This is simply an alternative formulation of “short” pairs that does not consider absolute distances.

Finding the α:th percentile of a set of values can be done efficiently using a selection algorithm. In general, these find the i:th largest value in a given list of values without sorting the values. They commonly run in linear time in the size of their input, which is |P| in this case. Note that e.g. the Quickselect algorithm (Hoare 1961) (and its variants, e.g. the Floyd-Rivest algorithm (Floyd and Rivest 1975)) also find all pairs that are below the percentile as a simple side-effect. No separate filtering step is thus needed. Most selection algorithms are based on pairwise comparisons, and can thus be implemented without explicitly evaluating d(u, v) for any (u, v) ∈ P. We only need an oracle that can evaluate the indicator function I{d(u, v) < d(u', v')}. This is easily expressed as the following human intelligence task: “Which of the two pairs, (u, v) or (u', v'), is more similar?”

Moreover, by definition of the percentile, we will select (α/100)·|P| pairs. This can be used to control the size of the resulting estimator (the set $P^{\alpha}$). The TDE1 estimator proposed above uses by definition n = |D| pairs. To make the TDE2 estimator comparable in complexity, we should use (100/α)·n pairs in P to have a $P^{\alpha}$ of the desired size. Note that both P and $P^{\alpha}$ are in this case linear in n. This is a desirable property, as the number of pairs directly translates to the number of HITs needed to construct, as well as to evaluate, the estimator. For this it is crucial that we do not need all pairs that are shorter than δmax; a uniform sample of these is sufficient.

Using a standard selection algorithm for finding $P^{\alpha}$ has the downside that the necessary HITs cannot be solved independently. E.g. Quickselect operates in the same way as Quicksort, by placing items to either side of a pivot item. This creates dependencies among the tasks, where one task must be solved before others can even be submitted to the workers. Developing a selection algorithm without such dependencies is an interesting open question.
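A minimal sketch of such a selection step, assuming an oracle that answers the pair-comparison HIT. This sequential version exhibits exactly the pivot dependencies discussed above; it is meant to illustrate the idea, not to resolve them.

```python
import random

def shortest_pairs(pairs, k, shorter_oracle, seed=0):
    """Collect the k shortest pairs using only the HIT 'which of the two pairs
    is more similar?', modelled as shorter_oracle(a, b) == True iff pair a is
    shorter than pair b. Quickselect-style partitioning around random pivots;
    no absolute pair length is ever computed."""
    rng = random.Random(seed)
    pool, selected = list(pairs), []
    while pool and len(selected) < k:
        pivot = rng.choice(pool)
        shorter, longer = [], []
        for p in pool:
            if p is pivot:
                continue
            (shorter if shorter_oracle(p, pivot) else longer).append(p)
        if len(shorter) + 1 <= k - len(selected):
            selected += shorter + [pivot]   # the whole short side certainly belongs to the answer
            pool = longer
        else:
            pool = shorter                  # the answer lies entirely on the short side
    return selected

# Toy usage with 1-D items and a simulated oracle (a worker would answer this HIT).
oracle = lambda a, b: abs(a[0] - a[1]) < abs(b[0] - b[1])
ps = [(0, 3), (1, 1.2), (5, 5.1), (2, 4), (0, 0.5)]
print(shortest_pairs(ps, 2, oracle))
```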

Experiments

In this section we compare the proposed density estimators, TDE1 and TDE2, using simulations.

Experimental setup

We measure cost as the number of HITs that are needed to construct the estimator. The methods are implemented as follows:

• TDE1: The estimator is constructed by first embedding a set of items into $\mathbb{R}^2$ with the method of (van der Maaten and Weinberger 2012). (In practice we have to make some guess about the input data dimensionality when using this method. In these experiments we assume a fixed dimensionality of 2, independently of the true data dimensionality.) Given the embedding, we find an optimal bandwidth h using cross validation, and select the set P(h) by pairing every u ∈ D with the v ∈ D so that the distance between u and v in the embedding is as close to h as possible (see the section on using an embedding to find $\tilde{P}(h)$). The estimates are computed with Equation 7 (omitting the normalisation).

• TDE2: We use the selection-algorithm-based method of the previous section to compute the set $P^{\alpha}$. To make the cost of computing an estimate for a single x equal to the cost of TDE1, we let $|P| = \frac{100}{\alpha}|D|$. This results in $|P^{\alpha}| = |D|$. We used α ∈ {0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35}. The estimates are computed as $\sqrt{S(x, \alpha)}$, where S(x, α) is defined analogously to $S(x, P^{\delta_{\max}})$.

We use a separate training data $D^{\mathrm{train}}$ to learn the estimator, and evaluate it on the test data $D^{\mathrm{test}}$ that is drawn from the same distribution as $D^{\mathrm{train}}$. In all cases we let $|D^{\mathrm{train}}| = |D^{\mathrm{test}}| = 200$ items, sampled iid. from the density in question. We use the following synthetic datasets, which are mixtures of multivariate normal distributions:

Gaussian-1D: A single standard normal distribution in a 1-dimensional space.
Gaussian-5D: A single standard multivariate normal distribution in a 5-dimensional space.
2Gaussians-3D: Two standard multivariate normal distributions in a 3-dimensional space, located at (0, 0, 0) and (3, 3, 3).
2Gaussians-5D: As 2Gaussians-3D but in a 5-dimensional space.
4Gaussians-2D: Four standard multivariate normal distributions in a 2-dimensional space, located at (0, 0), (0, 5), (5, 0) and (5, 5).
4Gaussians-3D: Four standard multivariate normal distributions in a 3-dimensional space, located at (0, 0, 0), (5, 0, 0), (0, 5, 0) and (0, 0, 5).

We only consider fairly low-dimensional data, with dimensionality k ≤ 5. This is because density estimation becomes less meaningful for high-dimensional data due to the curse of dimensionality. We think this is a sufficient study of the problem also from a practical point of view: in crowdsourcing applications we do not expect the workers to use more than a few dimensions when giving similarity judgements.

Also, we assume that computation is noise-free. That is, we obtain consistent and correct answers to each HIT. In practice some quality control mechanism such as (Dawid and Skene 1979; Raykar and Yu 2012) can be used to reduce the effect of erroneous solutions to tasks. We assume this can be taken care of by the crowdsourcing platform, and may increase the costs by a small constant factor.

Figure 4: Experimental results. Estimation error (MAE) as a function of the number of HITs used to construct the estimator (reported per data point, #HITs/|D|). TDE1 is shown in red, TDE2 is shown in black. Panels: Gaussian-1D, Gaussian-5D, 2Gaussians-3D, 2Gaussians-5D, 4Gaussians-2D, 4Gaussians-3D.

Both methods use |D| HITs to compute a single estimate. We first run TDE2 100 times for every value of α. Then, we compute the average number of HITs (pairwise comparisons of two item pairs) that were needed to find $P^{\alpha}$. Then, we allow TDE1 the same number of HITs to compute the embedding. This way we can compare the methods given the same amount of work. We report the amount of work as the number of HITs per data point in $D^{\mathrm{train}}$.

We measure the quality of the resulting estimates using mean absolute error, defined as
$$\mathrm{MAE} = \frac{\sum_{x \in D^{\mathrm{test}}} |\,\mathrm{estimate}(x) - p(x)\,|}{|D^{\mathrm{test}}|\, \max_x \{p(x)\}},$$
where p(x) is the true density at x. The estimate and p are both normalised so that $\sum_{x \in D^{\mathrm{test}}} \mathrm{estimate}(x) = \sum_{x \in D^{\mathrm{test}}} p(x) = 1$. We also normalise the error with $\max_x\{p(x)\}$ so that it can be understood as a fraction of the largest density value in $D^{\mathrm{test}}$.
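A minimal sketch of this error measure, assuming the estimates and the true densities are available as vectors over $D^{\mathrm{test}}$.

```python
import numpy as np

def normalised_mae(estimates, true_density):
    """Normalise both vectors to sum to 1 over the test set, then take the mean
    absolute error divided by the largest (normalised) true density value."""
    est = np.asarray(estimates, dtype=float)
    p = np.asarray(true_density, dtype=float)
    est, p = est / est.sum(), p / p.sum()
    return np.mean(np.abs(est - p)) / p.max()

# Toy usage.
print(normalised_mae([0.2, 0.5, 0.3], [0.25, 0.5, 0.25]))
```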

Results

Results are shown in Figure 4. We show how the error behaves as a function of the number of HITs required per data item to build the estimator. (Here the largest amount of work corresponds to using α = 0.05 with TDE2.) In this plot we show TDE1 in red and TDE2 in black. We observe that both methods perform better when they are allowed to use more work. Increasing the amount of work starts to have diminishing effects after some point. TDE1 performs slightly better than TDE2 when only a very small amount of work is used. However, as the number of HITs is increased, TDE2 tends to perform better than TDE1. This is mainly because TDE2 makes no assumptions about the dimensionality of the input, unlike TDE1. Notice that with 4Gaussians-2D the methods perform almost equally well. Of course in practice the dimensionality of the input data is unknown, and making an educated guess about this can be difficult.

Conclusions

In this work we presented two methods, TDE1 and TDE2, for density estimation using judgements of relative similarity. TDE1 is an implementation of a standard kernel density estimator with relative distances. TDE2 is a novel method that has some practical advantages over TDE1. In particular, our experiments suggest that TDE2 is more robust when the true dimensionality of the input is unknown.

There are multiple avenues for further research. Both TDE1 and TDE2 currently use |D| HITs to estimate density at a single point x. It seems likely that this could be improved by optimising the pairs against which x is compared. Another interesting question is whether we could learn estimators like TDE1 without embeddings: can we find pairs of approximately the same length directly, using e.g. the kind of pairwise comparisons employed for TDE2? One possible solution to this is partial orders with ties (Ukkonen et al. 2009). Also the theoretical results presented in this paper could be improved. In particular, proving bounds on the estimation error as a function of α would be interesting.

References

Amsterdamer, Y.; Grossman, Y.; Milo, T.; and Senellart, P. 2013. Crowd mining. In SIGMOD Conference, 241–252.
Bragg, J.; Mausam; and Weld, D. S. 2013. Crowdsourcing multi-label classification for taxonomy creation. In HCOMP 2013.
Chilton, L. B.; Little, G.; Edge, D.; Weld, D. S.; and Landay, J. A. 2013. Cascade: crowdsourcing taxonomy creation. In CHI 2013, 1999–2008.
Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S. 2007. Information-theoretic metric learning. In ICML 2007.
Dawid, A. P., and Skene, A. M. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 20–28.
Floyd, R. W., and Rivest, R. L. 1975. Algorithm 489: the algorithm SELECT for finding the i:th smallest of n elements. Communications of the ACM 18(3):173.
Gomes, R. G.; Welinder, P.; Krause, A.; and Perona, P. 2011. Crowdclustering. In Advances in Neural Information Processing Systems, 558–566.
Hastie, T.; Tibshirani, R.; and Friedman, J. 2009. The Elements of Statistical Learning, volume 2. Springer.
Heikinheimo, H., and Ukkonen, A. 2013. The crowd-median algorithm. In First AAAI Conference on Human Computation and Crowdsourcing.
Hoare, C. A. R. 1961. Algorithm 65: Find. Communications of the ACM 4(7):321–322.
Lintott, C. J.; Schawinski, K.; Slosar, A.; Land, K.; Bamford, S.; Thomas, D.; Raddick, M. J.; Nichol, R. C.; Szalay, A.; Andreescu, D.; et al. 2008. Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society 389(3):1179–1189.
Liu, E. Y.; Guo, Z.; Zhang, X.; Jojic, V.; and Wang, W. 2012. Metric learning from relative comparisons by minimizing squared residual. In ICDM 2012.
Parameswaran, A. G.; Sarma, A. D.; Garcia-Molina, H.; Polyzotis, N.; and Widom, J. 2011. Human-assisted graph search: it's okay to ask questions. PVLDB 4(5):267–278.
Parzen, E. 1962. On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3):1065–1076.
Raykar, V. C., and Yu, S. 2012. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. Journal of Machine Learning Research 13:491–518.
Rosenblatt, M. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics 27(3):832–837.
Schultz, M., and Joachims, T. 2003. Learning a distance metric from relative comparisons. In NIPS 2003.
Tamuz, O.; Liu, C.; Belongie, S.; Shamir, O.; and Kalai, A. 2011. Adaptively learning the crowd kernel. In ICML 2011, 673–680.
Trushkowsky, B.; Kraska, T.; Franklin, M. J.; and Sarkar, P. 2013. Crowdsourced enumeration queries. In ICDE 2013, 673–684.
Ukkonen, A.; Puolamäki, K.; Gionis, A.; and Mannila, H. 2009. A randomized approximation algorithm for computing bucket orders. Inf. Process. Lett. 109(7):356–359.
van der Maaten, L., and Weinberger, K. 2012. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, 1–6. IEEE.
Wilber, M. J.; Kwak, I. S.; and Belongie, S. J. 2014. Cost-effective HITs for relative similarity comparisons. In Second AAAI Conference on Human Computation and Crowdsourcing.

Appendix

Proof of Proposition 2

Proof. We first upper bound q(x, δmax) by letting
$$q(x, \delta_{\max}) \le q'(x, \delta_{\max}) = \int_{\delta=0}^{\delta_{\max}} \int_{j=x-\frac{3}{2}\delta}^{x+\frac{3}{2}\delta} 2\, p(u^*)\, p(u^* + \delta^*)\, dj\, d\delta,$$
where $u^*$ and $\delta^*$ are the solution to $\arg\max_{u, \delta} p(u)p(u + \delta)$ s.t. $u \in [x - 2\delta_{\max}, x + \delta_{\max}]$ and $\delta \le \delta_{\max}$. Since $u^*$ and $\delta^*$ only depend on δmax and not on δ or j, $p(u^*)p(u^* + \delta^*)$ can be taken out of the integrals, and we have
$$q'(x, \delta_{\max}) = 2\, p(u^*)\, p(u^* + \delta^*) \int_{\delta=0}^{\delta_{\max}} \int_{j=x-\frac{3}{2}\delta}^{x+\frac{3}{2}\delta} 1\, dj\, d\delta.$$
A straightforward calculation reveals that the integrals simplify to $\frac{3}{2}\delta_{\max}^2$, and we obtain $q'(x, \delta_{\max}) = 3\delta_{\max}^2\, p(u^*)\, p(u^* + \delta^*)$. Now, as δmax → 0, the interval x ± 2δmax becomes tighter and tighter around x. It follows that $u^* \to x$ and $\delta^* \to 0$, and thus $p(u^*)p(u^* + \delta^*) \to p(x)^2$. By combining these observations we find that
$$\lim_{\delta_{\max} \to 0} \frac{q'(x, \delta_{\max})}{3\delta_{\max}^2} = \lim_{\delta_{\max} \to 0} \frac{3\delta_{\max}^2\, p(u^*)\, p(u^* + \delta^*)}{3\delta_{\max}^2} = \lim_{\delta_{\max} \to 0} p(u^*)\, p(u^* + \delta^*) = p(x)^2.$$
An upper bound of q(x, δmax) thus approaches p(x)² from above as δmax → 0. We can make a similar reasoning in terms of the pair $(u_*, u_* + \delta_*)$ that has the lowest probability in the interval x ± 2δmax. This will show that a lower bound to q(x, δmax) approaches p(x)² from below. By combining these we obtain that $q(x, \delta_{\max}) / (3\delta_{\max}^2)$ must approach p(x)² as δmax → 0.

Proof of Proposition 3

Proof. We present the proof for 2-dimensional data, but the argument generalises to k dimensions in a fairly straightforward manner. Fix some point x with density p(x), and consider a random pair (u, v) drawn from p given that d(u, v) is fixed to some δ ≤ δmax. We sample the pair by first drawing u from p, and then drawing v given u so that d(u, v) = δ. The point x is covered by the pair (u, v) when one of the following events happens:
$$U_\delta:\ d(x, u) \le \delta, \qquad V_\delta:\ \delta < d(x, u) < 2\delta \ \text{ and } \ d(x, v) \le \delta.$$
Now we can write q(x, δmax) in terms of $U_\delta$ and $V_\delta$ by integrating over all values of δ up to δmax:
$$q(x, \delta_{\max}) = \int_{0}^{\delta_{\max}} \big(\Pr[U_\delta] + \Pr[V_\delta]\big)\, d\delta.$$
The proof relies on a similar argument as the proof of Proposition 2. We show that q(x, δmax) can be both upper and lower bounded so that these upper and lower bounds both go to $C(\delta_{\max})\,p(x)^2$ as δmax → 0 for some C(δmax). We proceed to derive upper bounds for $\Pr[U_\delta]$ and $\Pr[V_\delta]$.

Let V(y, δ) denote the area of a disc centered at point y with radius δ. Also, let E(y, δ) denote the 1-dimensional circular subspace centered at point y with radius δ. That is, the set E(y, δ) is the “edge” of the set V(y, δ). Next, let $p^{\delta_{\max}}(x)$ denote the maximum density inside the set V(x, 2δmax), that is,
$$p^{\delta_{\max}}(x) = \max_{y \in V(x, 2\delta_{\max})} p(y). \qquad (15)$$

Observe that the probability of event $U_\delta$ is the total density inside V(x, δ) (so that the point u is close enough to x) times the probability that v is drawn from the circle around u with radius δ, i.e., the set E(u, δ) (so that we have d(u, v) = δ as required). That is,
$$\Pr[U_\delta] = \int_{V(x,\delta)} p(u) \left( \int_{E(u,\delta)} p(v)\, dv \right) du.$$
Also see the left side of Figure 5. Since δ ≤ δmax and d(x, u) ≤ δ, we have d(x, v) ≤ 2δ, and we can thus upper bound $\Pr[U_\delta]$ by replacing both p(u) and p(v) with $p^{\delta_{\max}}(x)$. This yields
$$\Pr[U_\delta] \le p^{\delta_{\max}}(x)^2 \int_{V(x,\delta)} \left( \int_{E(u,\delta)} 1\, dv \right) du.$$
The value of the integrals no longer depends on the density p, only on the length δ of the pair (u, v). We get the upper bound $\Pr[U_\delta] \le p^{\delta_{\max}}(x)^2\, C_1(\delta)$.

The probability of event $V_\delta$ is computed similarly: here u must fall in the ring R(x, δ) of points whose distance to x is between δ and 2δ, and v must be drawn from the intersection of the circle E(u, δ) with V(x, δ). We have thus
$$\Pr[V_\delta] = \int_{R(x,\delta)} p(u) \left( \int_{E(u,\delta) \cap V(x,\delta)} p(v)\, dv \right) du.$$
Also see the right side of Figure 5. Like above with $U_\delta$, we upper bound $\Pr[V_\delta]$ by replacing p(u) and p(v) both with $p^{\delta_{\max}}(x)$. This is allowed as d(x, u) ≤ 2δ and d(x, v) ≤ δ. We obtain
$$\Pr[V_\delta] \le p^{\delta_{\max}}(x)^2 \int_{R(x,\delta)} \left( \int_{E(u,\delta) \cap V(x,\delta)} 1\, dv \right) du.$$
As above with $U_\delta$, the value of the integrals only depends on δ and not p. We get the upper bound $\Pr[V_\delta] \le p^{\delta_{\max}}(x)^2\, C_2(\delta)$.

Next we combine these results to obtain a simple upper 13 bound for q(x, δmax ): Observe that few thepairs probability of most event Uδ. The is the total density 13 aation high variance as there Density are of length Z δmax willonly have a high variance as at there aremax only few pairs of length at 13 most max . The Crowdsourced Estimation inside V (x, δ) (so that the point u is close enough to x) panel of Fig. 4 shows how the errors behave due to bias and variance top rightDensity panel Estimation of Fig. 4 shows how the errors behave duetimes to bias and variance Density Estimation Crowdsourced 13 13 q(x, δ 13 ) = (Pr[Uδ ] + Pr[Vδ ]) dδ as there are only pairs of length at that most The Crowdsourced Density Estimation ion of max . Thefew estimate computed from 200 points sampled from around the probability vmax is. are drawn from the circle u with as is avariance function of . The is of computed points sampled frommax max will have behave a high there onlyestimate few pairs length atfrom most200 shows how the errors due to biasasand variance Crowdsourced Density Estimation 13 max . The 0 dhe normal distribution. For small the total error consists mainly of max a standard normal distribution. For small the total error consists mainly of Crowdsourced Density Estimation 13 max radius δ, i.e., the set E(u, δ) (so that we have d(u, v) = δ as Z ighestimate variance as right therewill are have only few highpairs length asa high there at errors most are only The pairs ofbias length atvariance most The max . The istop computed from 200 points sampled from max panel of aFig. 4 variance shows how the behave due δmax willofhave variance as .few there are to only fewand pairs of lengthmax at .most  ue to size of Ptotal .error As increases the variance variance due to the ofand P ofcomponent . As 200 increases the variance component maxpoints bution. For small the consists mainly of max nel of the Fig.small 4will shows top how right the errors panel behave ofthere Fig. due 4small shows to bias how errors behave due bias andtovariance as a have function ofvariance . max The estimate is size computed from sampled from required). That is, a high as are only few pairs lengthhow at most . to The top right panel of Fig.the 4 variance shows the errors behave due bias and variance max max Crowdsourced Density Estimation 13 Crowdsourced Density Estimation 13 < pδmax (x)2 C1 (δ) + pδmax (x)2 C2 (δ) dδ s, and most of the remaining error is due to bias. diminishes, and ofsmall the remaining error is.from due to bias. high variance asof there only few pairs ofoflength at most The ofmax Phave As increases the variance max max nsize ofwill . The estimate as a is function computed ofshows from . function The 200 points estimate sampled is computed from 200 points sampled from right panel Fig. 4are how the errors behave due to bias and variance ascomponent amost . The estimate is computed from 200 points sampled from max a.atop standard normal distribution. For the total error consists mainly of max max !estimator 0 ctice is possible that estimator will produce useful topit right panel ofto Fig. shows how the errors due to bias and variance he remaining error is due to4small max Inso practice it possible to distribution. so that the will produce Zthe Zset max as a function ofbias. .normal The estimate isbehave computed from 200 points sampled from a distribution. standard normal For small theconsists total error consists mainly of max ormal distribution. 
For aset standard the total error consists For small mainly of the total error ofuseful max max max due to the small size ofaispairs P . length As increases variance Z δmax max will a variance high variance as there are only few of at there mostsampled .the The will have high variance as are few pairs ofcomponent length at mainly most max . The maxonly as have a function of .normal The estimate isproduce computed from 200 points from ble to set so that the estimator will useful max he bottom left panel of Fig. 4 shows an example of a mixture of two max results. The bottom left panel of Fig. 4 shows an example of a mixture of two a standard distribution. For small the total error consists mainly of variance due to the small size of P . As increases the variance component max max to the small size P variance . As due to increases the small the size of Perror component . Asand increases the component max diminishes, most of the remaining error due to bias. top right panel of of Fig. shows how errors behave due bias variance top right panel ofvariance Fig. 4toisshows how the errors behave due to bias and variance Pr[U ]distributions = p(v) dv du.pvariance a standard distribution. For small total mainly ofwhich panel of Fig. 4normal shows an4and example of athe mixture δmax stributions (black curve), thesmall limit to p(black ˆp(u) (x, 1.0) would converge variance due tonormal the size ofwhich Pmaxof the .two As increases the variance component diminishes, and most ofconsists the remaining is due to 1.0) bias. would converge = pδmax (x)2 (C (δ) + C (δ)) dδ curve), the limit toerror ˆ(x, max

ed Density Estimation

y∈V (x,2δmax ) Crowdsourced Density Estimation

max

13

max

max

max

max

max

max

max

1 2 and ofthe the diminishes, error and isas due to bias. of remaining error iscomponent due to will bias.produce as variance amost function of .small The estimate is computed from 200 points sampled from max amost function ofmax The estimate is computed from 200 useful points sampled fromet al. Inremaining practice it setthe so. the that the estimator max max due to the size ofpossible P . remaining As increases variance Vto (x,δ) E(u,δ) ck curve), limit to which pˆis (x, 1.0) would converge 14 Antti Ukkonen max diminishes, and most of the error is is due to pairs bias.tocomputed In practice it possible set P, that 500 the estimator will produce useful e), asis well as two estimates computed with 500 and 5000 in max so 0 (blue curve), as well as two estimates with and 5000 pairs in P, a it standard normal For small total error consists mainly ofestimator a 5000 standard normal small total error consists mainly of ce possible to distribution. set In practice so that it the is estimator possible to will set produce so For useful that the will useful maxis maxa the max results. The bottom left panel ofthe Fig. 4distribution. shows an example of mixture ofproduce two diminishes, and ofmax the remaining error due tobottom bias. wo estimates computed with 500 and pairs in P, In most practice it is possible to set so that the estimator will produce useful Z results. The left panel of Fig. 4 shows an example of a mixture of two max max lybottom (the red dashed curves). We can observe that even 500 pairs are max δmax respectively (the red dashed curves). We observe that even 500ofcomponent pairs are variance duenormal to the small size Pseteven . an As500 increases the variance variance due to the small size Pcomponent . can As increases variance max max left panel of results. Fig. 4of The shows bottom example left panel of aof mixture Fig. 4ofwhich shows of two an example athe mixture two converge In practice it is possible to so the estimator will produce hed curves). We can observe that pairs are distributions (black curve), limit to (x, 1.0) would converge max results. The bottom left panel ofthat Fig. 4the shows anpanel example of auseful mixture ofwith two normal distributions (black curve), limit toof which pˆ(x, 1.0) would Also see left side of Figure 5. Since δlimit ≤ δtopˆmax and d(x, u) ≤ 1-dimensional circular subspace centered at point ybottom radius . That the4 set δmax 2 obtain a (black reasonable estimate. Finally, the bottom right of Fig. 4the diminishes, and most of the remaining error is(black tomost bias. diminishes, and of the remaining error isthe due to1.0) bias. enough to adue reasonable estimate. Finally, right panel of is, Fig. results. The bottom left panel of Fig. 4(blue shows an example of atwo mixture of two nable estimate. Finally, the bottom right panel of Fig. 
4curve), = p (x) (C1 (δ) + C2 (δ)) dδ ibutions curve), normal the distributions limit toobtain which pˆ (x, 1.0) would the converge which pconverge ˆ(x, would converge (blue curve), as well as two estimates computed with 500 and 5000 pairs in P, normal distributions (black curve), the limit to which p ˆ (x, 1.0) would curve), as well as estimates computed with 500 and 5000 pairs in P, max ) is the “edge” of the set V (y, ).true Next, let (x) denote thethe maximum In practice itaccuracy isδ,possible toE(y, setestimate so that the estimator will produce useful In practice it is possible to set so that thep estimator will produce w, accuracy the estimation (correlation between the estimate and the max max shows how the estimation accuracy (correlation between the estimate and true we have d(x, v) ≤ 2δ, and we can thus upper bound (black curve), the limit to which pˆ(x, 1.0) would converge (correlation between the and the true asnormal well asdistributions two (blue curve), computed as well with as 500 two and estimates 5000 pairs computed incurves). P, with 500 and 5000 pairs in 500 P,useful 0 (blueestimates curve), as well as two estimates computed with 500 and 5000 pairs in P, respectively (the red dashed curves). We can observe that even 500 pairs are respectively (the red dashed We can observe that even pairs are results. bottom panel ofinFig. shows an example a mixture of density inside the set V (x, 2ofand ), is, results. The bottom left panel ofthat Fig. 4 two shows an example of a mixture of two max δmax ehaves therespectively number of(the pairs P is4 to increased from 250 to 2500. (blue curve), as well as two estimates computed with 500 5000 pairs in umber ofasThe pairs in PPr[U is left increased from 250 2500. density) behaves as the number of pairs in P is increased from 250 to 2500. red dashed curves). can observe even 500 pairs are both p(u) p(v) with pP, (x). This enough to We obtain aand reasonable estimate. Finally, the bottom right panel (the red dashed curves). We can (the observe red dashed even curves). 500 We pairs can are observe that even 500 pairs are of Fig. 4 enough torespectively obtain areplacing reasonable estimate. thethat bottom right panel δ ] by normal distributions (black curve), the limit tothat which pˆFinally, (x, 1.0) would normal distributions (black curve), the limit pof ˆ(x,Fig. 1.0) 4would converge = pδmax (x)2 C(δmax ). respectively (the red dashed curves). We can observe that even 500converge pairs aretoofwhich max enough to obtain aFinally, reasonable estimate. Finally, the bottom right panel Fig. 4 thepanel shows how the estimation accuracy (correlation between the estimate and true p (x) = max p(y). btain a reasonable estimate. enough to obtain the a reasonable bottom right estimate. panel of Finally, Fig. 4 the bottom right of Fig. shows how the estimation accuracy (correlation between the estimate and true (blue curve), as well as two estimates computed with 500asand 5000 pairs in P, 4 ) with 500 and 5000 pairs4in the yields (blue curve), as well two estimates computed P,(13) enough to shows obtain a reasonable estimate. Finally, the bottom right panel of Fig. y2V (x,2 max how the estimation accuracy (correlation between thetrue estimate and true behaves asP the number ofbetween pairs in the Pthe isestimate increased fromthe 250 to 2500. he estimation accuracy shows (correlation how estimation between the accuracy estimate (correlation and the and respectively (the redbehaves dashed curves). 
Proof We present the proof for 2-dimensional data, but the argument generalises to k dimensions in a fairly straightforward manner. Fix some point x with density p(x), and consider a random pair (u, v) drawn from p given that d(u, v) is fixed to some δ ≤ δmax. We sample the pair by first drawing u from p, and then draw v given u so that d(u, v) = δ. The point x is covered by the pair (u, v) when one of the following events happens:

U : d(x, u) ≤ δ, i.e., u falls inside the set V(x, δ), or
V : δ < d(x, u) ≤ 2δ and d(x, v) ≤ δ, i.e., u falls outside V(x, δ) but v falls inside it.
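Note that whenever x is covered by a pair of length δ ≤ δmax, both u and v lie inside the set V(x, 2δmax). Under U this is a direct consequence of the triangle inequality,

  d(x, v) \le d(x, u) + d(u, v) \le \delta + \delta = 2\delta \le 2\delta_{\max},

and under V it is immediate from the definition of the event. This is the fact used below to bound the densities p(u) and p(v).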
Let E(y, δ) denote the 1-dimensional circular subspace centered at point y with radius δ; that is, the set E(y, δ) is the "edge" of the set V(y, δ). Next, let p̄δmax(x) denote the maximum density inside the set V(x, 2δmax), that is,

  \overline{p}_{\delta_{\max}}(x) = \max_{y \in V(x, 2\delta_{\max})} p(y) .   (13)

Observe that the probability of the event U ∧ δ is the total density inside V(x, δ) (so that the point u is close enough to x) times the probability that v is drawn from the circle around u with radius δ, i.e., the set E(u, δ) (so that we have d(u, v) = δ as required). That is,

  \Pr[U \wedge \delta] = \int_{V(x,\delta)} \int_{E(u,\delta)} p(u)\, p(v) \; dv \, du .   (14)

Also see the left side of Figure 5. Since δ ≤ δmax and d(x, u) ≤ δ, we have d(x, v) ≤ 2δmax, and we can thus upper bound Pr[U ∧ δ] by replacing both p(u) and p(v) with p̄δmax(x). This yields

  \Pr[U \wedge \delta] \le \overline{p}_{\delta_{\max}}(x)^2 \int_{V(x,\delta)} \int_{E(u,\delta)} 1 \; dv \, du .

The value of the integrals no longer depends on the density p, only on the length δ of the pair (u, v). We can thus replace them with a function C1(δ) and obtain

  \Pr[U \wedge \delta] \le \overline{p}_{\delta_{\max}}(x)^2 \, C_1(\delta) .
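For illustration, take d to be the Euclidean distance in the plane, which is the setting of the 2-dimensional argument. Then V(x, δ) is a disk of area πδ² and E(u, δ) is a circle of length 2πδ, so the integrals evaluate to

  C_1(\delta) = \int_{V(x,\delta)} \int_{E(u,\delta)} 1 \; dv \, du = \pi\delta^2 \cdot 2\pi\delta = 2\pi^2 \delta^3 .

In other dimensions only the constant and the power of δ change; the essential point is that C1(δ) does not depend on the density p.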
Next, consider the event V. It happens when u falls in the "ring" around x, defined as the set R(x, δ) = V(x, 2δ) \ V(x, δ), and v falls on the segment of E(u, δ) that intersects V(x, δ). By the same argument as for U, the probability Pr[V ∧ δ] can be upper bounded by p̄δmax(x)² C2(δ), where C2(δ) again depends only on δ. Integrating over all pair lengths δ up to δmax, we thus obtain

  q(x, \delta_{\max}) \;\le\; \int_0^{\delta_{\max}} \overline{p}_{\delta_{\max}}(x)^2 \left( C_1(\delta) + C_2(\delta) \right) d\delta \;=\; \overline{p}_{\delta_{\max}}(x)^2 \, C(\delta_{\max}) .

The C(δmax) that appears above is the one we use in the claim (Eq. 12) of the proposition.
Now, we can replace q(x, δmax) in Equation 12 with this upper bound, and observe that

  \lim_{\delta_{\max} \to 0} \frac{\overline{p}_{\delta_{\max}}(x)^2 \, C(\delta_{\max})}{C(\delta_{\max})} \;=\; \lim_{\delta_{\max} \to 0} \overline{p}_{\delta_{\max}}(x)^2 \;=\; p(x)^2 .

Here the second equality simply follows from the definition of p̄δmax(x) as given in Equation 13 above. As δmax goes to zero, the area from which we take p̄δmax(x) shrinks until only x is left. The upper bound of q(x, δmax)/C(δmax) thus approaches p(x)² from above as δmax → 0. We can do the same reasoning in terms of a lower bound of q(x, δmax), by replacing the max with a min in Equation 13, i.e.,
  \underline{p}_{\delta_{\max}}(x) = \min_{y \in V(x, 2\delta_{\max})} p(y) .
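Sketching the remaining step: the same replacement argument, now with p̲δmax(x) in place of p̄δmax(x), gives a matching lower bound, so that

  \underline{p}_{\delta_{\max}}(x)^2 \;\le\; \frac{q(x, \delta_{\max})}{C(\delta_{\max})} \;\le\; \overline{p}_{\delta_{\max}}(x)^2 .

Since the sets over which the minimum and the maximum are taken shrink to the single point x as δmax → 0, both bounds converge to p(x)², and the ratio q(x, δmax)/C(δmax) is squeezed to the value claimed in Equation 12.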