Distance between Populations

0 downloads 0 Views 239KB Size Report
effect that the population has on the selection operation can easily be seen in the fol- ..... Schaum's Outline of Theory and Problems of General Topology.
Distance between Populations 1

Mark Wineberg and Franz Oppacher

2

1

Computing and Information Science University of Guelph Guelph, Canada [email protected] 2 School of Computer Science Carleton University Ottawa, Canada [email protected]

Abstract. Gene space, as it is currently formulated, cannot provide a solid basis for investigating the behavior of the GA. We instead propose an approach that takes population effects into account. Starting from a discussion of diversity, we develop a distance measure between populations and thereby a population metric space. We finally argue that one specific parameterization of this measure is particularly appropriate for use with GAs.

1 Introduction: The Need for a Population Metric All previous attempts to characterize gene space have focused exclusively on the Hamming distance and the hypercube. However, this 'chromosome space' cannot fully account for the behavior of the GA. An analysis of the GA using chromosome space implicitly assumes that the fitness function alone determines where the GA will search next. This is not correct. The effect that the population has on the selection operation can easily be seen in the following (obvious) examples: In fitness proportional selection (fps) the fitness values associated with a chromosome cannot be derived from the fitness function acting on the chromosome alone, but also takes into account the fitness of all other members in the population. This is because the probability of selection in fps is based on the ratio of the ‘fitness’ of the individual to that of the total population. This dependence on population for the probability of selection is true not just for fitness proportional selection, but also for rank selection as the ranking structure depends on which chromosomes are in the population, and tournament selection since that can be reduced to a subset of all polynomial rank selections. Finally, and most glaringly, the probability of selecting a chromosome that is not in the population is zero; this is true no matter the fitness of the chromosome! Consequently the fitness value associated with the chromosome is meaningless when taken independently of the population. As the above examples demonstrate, any metric that is used to analyze the behavior of the GA must include population effects. These effects are not made evident if only the chromosome space is examined. Therefore the metric used must include more information than just the distance between chromosomes; we must look to the populaE. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1481–1492, 2003. © Springer-Verlag Berlin Heidelberg 2003

1482

M. Wineberg and F. Oppacher

tion as a whole for our unit of measure. In other words, we need a distance between populations. There are four sections in this paper. The first section examines the well-known population measure ‘diversity’ since the definitions and methodologies developed for it will form the basis of the distance measures. In the two sections after, two different approaches are introduced that attempt to determine the distance between populations. The first approach, the all-possible-pairs approach, is a natural extension of the traditional diversity definition. The second approach describes the mutation-change distance between populations. In the final section, a synthesis of these two distance concepts is developed eventually leading to a single definition of the distance between populations

2 Diversity Before attempting to find a relevant distance between populations, it will be instructive to first discuss the related concept of ‘diversity’. There are three reasons for this. First, diversity is a known measure of the population that is independent of the fitness function. Since the distance between populations should likewise be independent of the fitness, useful insights may be derived from a study of diversity. Second, several techniques shall be introduced in this section that will become important later when discussing the distance between populations. Finally, the concept of diversity itself will be used in the analysis of the distance between populations.

3 All-Possible-Pairs Diversity The simplest definition of diversity comes from the answer to the question “how different is everybody from everybody else?” If every chromosome is identical, there is no difference between any two chromosomes and hence there is no diversity in the population. If each chromosome is completely different from any other, then those differences add, and the population should be maximally diverse. So the diversity of a population can be seen as the difference between all possible pairs of chromosomes within that population. While the above definition makes intuitive sense, there is one aspect not covered: what do we mean by different? If a pair of chromosomes is only different by one locus, it only seems reasonable that this pair should not add as much to the diversity of the population as a pair of chromosomes with every locus different. Consequently the difference between chromosomes can be seen as the Hamming distance or chromosome distance, and hence the population diversity becomes the sum of the Hamming distances between all possible pairs of chromosomes [1]. In cluster theory this is called the statistic scatter [2]. Now, since the Hamming distance is symmetric, and is equal to 0 if the strings are the same, only the lower (or, by symmetry, only the upper) triangle in a chromosomepairs-table need be considered when computing the diversity. Consequently the allpossible-pairs diversity can be formalized as

Distance between Populations

Div( P) =

n

i −1

i =1

j =1

Σ Σ hd(c ,c ) i

j

1483

(1)

where P is the population, n is the population size, chromosome c i ∈P , l is the length of a chromosome and hd(ci,cj ) is the Hamming distance between chromosomes. The Reformulation of the All-Possible-Pairs Diversity: A Linear Time Algorithm

A problem with formula (1) is its time complexity. Since the Hamming distance be2 tween any two pairs takes O(l ) time and there are n possible pairs (actually 1 n(n − 1) pairs when symmetry is taken into account), then the time complexity 2

( )

when using (1) is O l n . Since the time complexity of the GA is O(l ⋅ n) calculating the diversity every generation would be expensive. 2

Surprisingly, a reformulation of definition (1) can be converted into an O (l ⋅ n ) algorithm to compute the all-possible-pairs diversity. Gene Counts and the Gene Frequencies We will now introduce two terms that not only will be used to reformulate the definition of the all-possible-pairs diversity, but also will become ubiquitous throughout this paper. They are the gene count across a population, and the gene frequency of a population. The gene count c k (α ) is the count across the population of all genes at locus k that equal the symbol α. This means that n

c k (α ) =

Σδ

i, k

(α )

(2)

i =1

where δ i , k (α ) is a Kronecker δ that becomes 1 when the gene at locus k in chromosome i equals the symbol α, and otherwise is 0. Later in the paper we will frequently write c k (α ) as cα , k , or just as cα if the locus k is understood in the context. The array of the gene counts of each locus will be called the gene count matrix. The gene frequency fk (α ) is the ratio of the gene count to the size of the population. In other words, fk (α ) =

c k (α ) n

(3)

Again, later in the paper we will frequently write fk (α ) as fα |k , or just as fα if the locus k is understood in the context.

1484

M. Wineberg and F. Oppacher

The array of the gene frequencies of each locus will be called the gene frequency matrix. The Reformulation With the notation in place we can present the alternate form of writing the allpossible-pairs diversity:

The all-possible-pairs diversity can be rewritten as

Theorem 1:

Div( P) =

n

2

l

Σ Σ f (α ) (1 − f (α )) 2l k

(4)

k

k = 1 ∀α ∈A

Proof: Let us first examine a chromosome ci that at locus k has gene α. When computing all of the possible comparison pairs, 0 is obtained when compared to all of the other chromosomes that also have gene α at locus k. There are n fk (α ) of those. Consequently there are n − n fk (α ) comparisons that will return the value 1. So the component of the distance attributable to ci is n − n fk (α ) . Since there are n f(α ) chromosomes that have the same distance component, the total distance contributed by chromosomes with gene α at locus k is n fk (α )(n − n fk (α )) , which simplifies to 2

n fk (α )(1 − fk (α )) . Summing over all α will give us double the comparison count (since we are adding to the count both hdk (c i ,c j ) and hdk (c j ,c i ) ). So the true comparison count is

n2 2

Σ f (α )(1 − f (α )) . Averaging over all loci gives us the result we

∀ α ∈A

k

k

want. • Normalizing (4) assuming that the alphabet size a < n and that a divides into n evenly, we get Div(P) =

a

l

l ( a − 1)

ΣΣ

fk (α ) (1 − fk (α )) .

(5)

k =1 ∀ α ∈A

Since in the majority of GA implementations a binary alphabet is used with an even population size (because crossover children fill the new population in pairs), the above equation becomes Div(P) =

2

l

l

Σ Σ f (α ) (1 − f (α )) k

k

.

(6)

k =1 ∀α ∈A

The gene frequencies can be pre-computed for a population in O (l ⋅ n ) time. Consequently, the formula above can be computed in O (a ⋅ l ⋅ n ) , which reduces to O (l ⋅ n ) since a is a constant of the system (usually equal to 2). Thus we show that the all-possible-pairs diversity can be computed in O (l ⋅ n ) time, which is much faster 2

than the O (l ⋅ n ) time that the original naïve all-possible-pairs algorithm would take.

Distance between Populations

4

1485

An All-Possible-Pairs “Distance” between Populations

The obvious extension of the all-possible-pairs diversity of a single population would be an all-possible-pairs distance between populations. Here we would take the Cartesian product between the two populations producing all possible pairs of chromosomes, take the Hamming distance between each of those pairs of chromosomes, and sum the results. Since there are O (n ⋅ m ) such pairs (where n and m are the two popu2

lation sizes) then assuming m ∝ n , there would be O (n ) distances being combined. Consequently the resulting summation, if it turns out to be a distance, would be a squared distance. So formally we have: n1

n2

i =1

j =1

Σ Σ hd(chr1 ,chr2 )

Dist ′ (P1 , P2 ) =

i

j

(7)

where P1 and P2 are populations with population sizes of n1 and n2 respectively, chr1i ∈ P1 and chr2 j ∈ P2 , and i and j are indices into their respective population. The

reason we are using the function name Dis t ′ instead of Dist shall be explained in the next subsection. This ‘distance’ between populations is used in some pattern recognition algorithms and is called the average proximity function1. Following the same argument as with diversity presented when reformulating the diversity to become a linear algorithm, a frequency-based version of the same formula can be produced:

Dist ′ (P1 ,P2 ) =

nm l

l

ΣΣ f

1, k

(α ) (1 − f2 , k (α ))

(8 )

k =1 ∀ α ∈A

where f1, k (α ) is the gene frequency of the gene α at locus k across population P1, and f2, k (α ) is the corresponding gene frequency for population P2. Problems

While initially attractive for its simple intuitiveness, the all-possible-pairs “distance” is unfortunately not a distance. While it is symmetric and non-negative, thus obeying distance properties M2 and M3, it fails on properties M1 and M42. The failure of property M1 is readily seen. M1 states that the distance must be 0 iff the populations are identical; consequently the all-possible-pairs “distance” of a population to itself should be equal to 0. Instead it is actually the all-possible-pairs diversity measure, which is typically greater than 0. In fact, the diversity only equals 0 when all of the chromosomes in the population are identical! Furthermore the all-possible-pairs “distance” also fails to satisfy M4, the triangle inequality. This can be seen from the following example. Let A be a binary alphabet {0, 1} from which the chromosomes in all three populations that form the triangle 1 2

[3] pg. 378. See Appendix A.

1486

M. Wineberg and F. Oppacher

will be drawn. Let populations P1 and P3 both have a population size of 2 and P2 have in it only a single chromosome. To make the situation even simpler, in all populations let each chromosome consist of only 1 locus. Now look at an example where the population make-up is as follows: P1 = { < chr1, 0 ,0 >, < chr1, 1 , 0 >} , P2 = {< chr2, 0 ,0 >} P3 = {< chr3, 0 ,1 >,< chr3, 1 ,1 >} . The corresponding gene frequencies are f1 (0) = 1 , f1 (1) = 0 , f2 (0) = 1 , f2 (1) = 0 , f3 (0) = 0 and f3 (1) = 1 . Using the all-possible-pairs “distance” definition (8) we can calculate

that

Dist(P1 , P2 ) + Dist(P2 , P3 ) =

0+

2=

2,

and

that

Dist(P1 , P3 ) = 4 = 2 . Consequently Dist(P1 , P2 ) + Dist(P2 , P3 ) < Dist(P1 , P3 ) and so the triangle inequality does not hold. Thus the all-possible-pairs “distance” cannot be considered a metric3. It is for this reason that we put the prime after the ‘distance function’ that has been developed so far. Correcting the All-Possible-Pairs Population Distance

We will now modify the formula to turn it into a true distance. We shall first deal with the failure to meet the triangle inequality. Definition (8) was written to be as general as possible. Consequently, it allows for the comparison of two populations of unequal size. In the counter-example showing the inapplicability of the triangle inequality, unequal sized populations were used. When populations of equal size are examined no counter-example presents itself. This holds even when the largest distance between P1 and P3 is constructed and with a P2 specially chosen to produce the smallest distance to both P1 and P3 . Generalizing this, we could redefine the definition (8) such that small populations are inflated in size while still keeping the equivalent population make-up. The same effect can be produced by dividing definition (8) by the population sizes, or in other words through normalization. Now let us address the problem of non-zero self-distances. As noted in the previous subsection, this property fails because the self-distance, when comparing all possible pairs, is the all-possible-pairs diversity, which need not be zero. To rectify the situation we could simply subtract out the self-distances of the two populations from the all-possible-pairs distance equation4. Again we are removing the problems through normalization. To summarize the above, looking first only at a single locus and normalizing the squared distance (which simplifies the calculation) we get: 3

It is not even a measure. See [3] pg. 378 under the properties of the average proximity function.

4

The 2a term in front of the two normalized diversities in the resulting distance equation is a re-normalization factor. It is needed to ensure that the resulting distance cannot go below zero, i.e. the distance stays normalized as required.

a −1

Distance between Populations

(

)

2 Dist k ( P1 , P2 ) = Dist′k (P1 , P2 )

2



a −1

2a

Divk (P1 ) −

a −1

2a

1487

Div k (P2 )

(9)

Now, let us substitute (8), the definition of Dist ′ (P1 , P2 ) , into the above equation. a Also let Divk (P ) = fk (α ) ⋅ (1 − fk (α )) , the normalized diversity from the (a − 1) ∀α ∈A diversity reformulation section modified for a single locus. Then (9) becomes Dist L

2

,k ( P1 , P2 )

=

1

(f 2Σ

(α ) − f2 , k (α ))

2

1, k

∀α ∈A

(10)

(the use of the L2 subscript will become apparent in the next section). Notice that the above distance is properly normalized5. Furthermore, this process has actually produced a distance (or rather a pseudo-distance): Theorem 2:

The function Dist L

2

,k

( P1 , P2 ) is a pseudo-distance at a locus k.

Proof: First notice that f1, k (α ) − f2 , k (α ) forms a set of vector spaces (with k being

the index of the set). Now

Σ (f

(α ) − f2 , k (α )) is the L2-norm on those vector 2

1, k

∀α ∈A

spaces. As noted in Appendix B, we know that the norm of a difference between two vectors v − w obeys all distance properties. Consequently, the equation

Σ (f

1, k

∀α ∈A

(α ) − f2 , k (α )) is a distance. Any distance multiplied by a constant (in this 2

1

case 2 ) remains a distance. However, Dist l ,k ( P1 , P2 ) is a distance between gene frequency matrices, and of course there is a many-to-one relationship between populations and a gene frequency matrix. For example, you can crossover members of a population thus producing a new population with different members in it but with the same gene frequency matrix. Hence you can have two distinct populations with a distance of 0 between them. Consequently, Dist L ,k (P1 , P2 ) is a distance for gene fre2

2

quency matrices, but only a pseudo-distance for populations. ! Using the L2-norm, we can combine the distances for the various loci into a single pseudo-distance: Dist L ( P1 , P2 ) = 2

1 2l

l

Σ Σ (f

1, k

k =1 ∀ α ∈ A

(α ) − f2, k (α ))

2

.

(11)

While it would be nice to have an actual distance instead of a pseudo-distance between populations, most properties of metrics are true of pseudo-metrics as well. Furthermore, since the distances between gene frequency matrices are actual dis-

5

The maximum occurs when f1| k (α 1 ) = f2 | k (α 2 ) = 1 and f1| k (α ≠ α 1 ) = f2| k (α ≠ α 2 ) = 0 .

1488

M. Wineberg and F. Oppacher

tances, their use connotes a positioning in gene space, albeit with some loss of information.

5 The Mutational-Change Distance between Populations While, in the section above, we were able to derive a population distance using an allpossible-pairs approach, it is a bit disappointing that to do so we needed to perform ad-hoc modifications. In this section we will approach the matter from a different perspective. We will define the distance between populations as the minimal number of mutations it would take to transform one population into the other. The above definition of population distance is the generalization of the Hamming distance between chromosomes. With the distance between chromosomes we are looking at the number of mutations it takes to transform one chromosome into the other; with the distance between populations we directly substitute into the entire population each mutational change to create an entirely new population. There are, of course, many different ways to change one population into another. We could change the first chromosome of the first population into the first chromosome of the other population; or we could change it into the other population’s fifth chromosome. However, if we just examine one locus, it must be true that the gene counts of the first population must be transformed into those of the second by the end of the process. The number of mutations that must have occurred is just the absolute difference in the gene counts (divided by 2 to remove double counting). There is one slight problem with the above definition. It only makes sense if the two populations are the same size. If they are of different size, no amount of mutations will transform one into the other. To correct for that, we transform the size of one population to equal that of the other. To give the intuition behind the process that will be used, imagine two populations, one double the size of the other. If we want to enlarge the second population to the size of the first population, the most obvious approach is to duplicate each chromosome. The effect that this has is the matching of the size of the second population to the first while still maintaining all of its original gene frequencies. Since a population will not always be a multiple of the other, we duplicate each population n times, where n is the other population's size. Now both populations will have the same population size. So the duplication factor in front of the first population is n2 , the duplication factor in front of the second population is n1 , and the common population size is n1 n2 . So we can now define the mutational-change distance between two populations at a locus as Dist L , k ( P1 , P2 ) = 1

Σ

n 2 c 1, k (α ) − n1 c 2, k (α )

∀ α ,α ∈A

= n1 n 2

Σ

∀α , α ∈ A

= n1 n 2

Σ

∀α , α ∈ A

c1 , k (α ) n1



c 2, k (α ) n2

f1 , k (α ) − f2, k (α )

Distance between Populations

1489

which, when normalized, becomes Dist L ,k ( P1 , P2 ) = 1

1

Σ

2

f1 , k (α ) − f2, k (α )

(12)

∀α , α ∈ A

Notice the similarity between the above and the all-possible-pairs distance at a locus (10). We basically have the same structure except that the L2-norm is replaced by the L1-norm (hence the use of the L1 and L2 subscripts). Therefore, the argument that was used to prove Theorem 2 applies here as well. Consequently the mutational-change distance between populations at a locus is also a pseudo-distance. Finally, averaging across the loci produces the mutational-change pseudo-distance between populations: Dist L (P1 , P2 ) = 1

1

l

ΣΣ 2l

f1, k (α ) − f2 , k (α )

(13)

k = 1 ∀ α ,α ∈A

6 The Lk-Norms and the Distance between Populations In the previous two sections we have seen two different distances (actually pseudodistances) between populations derived through two very different approaches. Yet there seems to be the same underlying structure in each: the norm of the differences between gene frequencies. In one case the norm was the L1-norm, in the other the L2norm, otherwise the two results were identical. Generalizing this, we can define an Lkdistance on the population: Dist L (Pa , Pb ) = k k

1

l

ΣΣ 2l

fa , i (α ) − fb , i (α )

k

(14)

i = 1 ∀α , α ∈ A

and Dist L (Pa , Pb ) = max ∞

∀α , α ∈A ∀i , i ∈[1, l ]

(f

a, i

(α ) − fb, i (α )

).

(15)

Interestingly, the L ∞ - distance restricted to a single locus can be recognized as the Kolmogorov-Smirnov test. The K-S test is the standard non-parametric test to determine whether there is a difference between two probability distributions. Realizing that there are an infinite number of possible distance measures between populations, the question naturally arises: is one of the distance measures preferable or will any one do? Of course, to a great degree the choice of distance measure depends on matching its properties to the purpose behind creating that distance measure in the first place; i.e. different distances may or may not be applicable in different situations. That being said, there is a property possessed by the distance based on the L1-norm which none of the others possess, making it the preferable distance. This property becomes evident in the following example. Let us examine 4 populations; the chro-

1490

M. Wineberg and F. Oppacher

mosomes in each population are composed of a single gene drawn from the quaternary alphabet {a, t, c, g}. The 4 populations are: P1a = {< chr1 , a >, < chr2 ,a >, < chr3 ,a >, < chr4 ,a >} P1b = {< chr1 , c >, < chr2 ,c >, < chr3 ,c >, < chr4 ,c >} P2a = { < chr1 ,a >, < chr2 , a >, < chr3 , t >, < chr4 , t >} P2b = { < chr1 ,c >, < chr2 , c >, < chr3 , g >, < chr4 ,g >} and so f1a (a) = 1,

f1a (t ) = 0,

f1a (c) = 0,

f1a (g) = 0,

f1b (a) = 0,

f1b (t ) = 0,

f1b (c) = 1,

f1b (g) = 0,

1

f2a (a) = 2 ,

1

f2a (t) = 0, 1

f2b (a) = 0,

f2b (t) = 2 ,

f2a (c) = 2 ,

f2a (g) = 0,

f2b (c) = 0,

f2b (g) = 12 .

Now, lets look at the two distances DistL (P1a , P1b ) and Dist L (P2a , P2b ) . In both cases the populations have no genes in common. We should therefore expect the distance between both pairs of populations to be the maximum distance that can be produced. It is true that the diversity within each of the first two populations is 0, while the diversity within each of the second two is greater than 0; however that should have nothing to do with the distances between the populations. One expects both distances to be equally maximal. Working out the distances from the equation k

DistL (Pa , Pb ) =

1

k

Σ

2 ∀α , α ∈A

k

k

fa (α ) − fb (α )

k

we get

Dist L ( P1a , P1b ) = K

1 2

k

⋅ 2 ⋅ (1) +

1 2

1

⋅ 2 ⋅ (0)

k

k

=1

and

1

Dist L (P2 a , P2b ) = K

1 2

⋅4 ⋅

1 2

k

k

k −1

=2

k

The only value of k for which the two distances will be equal (and since 1 is the maximum, they will be both maximal) is when k = 1 . For the L ∞ - norm , 1 Dist L (P1 a ,P1 b ) = 1 and DistL (P2 a , P2 b ) = 2 , so it is only under the L1-norm that the two distances are equal and maximal. The above property of the L1-norm holds for any alphabet and population sizes. ∞



Distance between Populations

7

1491

Conclusion

The purpose of this paper is to develop a distance measure between populations. To do so we first investigated population diversity. Using our analysis of diversity as a template, we defined two notions of population distance, which we then generalized into the Lk- distance set. We picked the L1-distance as the most appropriate measure for GAs because it is the only measure that consistently gives maximum distance for populations without shared chromosomes. This distance forms a metric space on populations that supersedes the chromosome-based gene space. We feel that this enhancement to the formulation of gene space is important for the further understanding of the GA.

References 1. Louis, S. J. and Rawlins, G. J. E.: Syntactic Analysis of Convergence in Genetic Algorithms. In: Whitley, L. D (ed.): Foundations of Genetic Algorithms 2. Morgan Kaufmann, San Mateo, California, (1993) 141–151 2. Duran, B. S. and Odell, P. L.: Cluster Analysis: A Survey. Springer-Verlag, Berlin (1974) 3. Theodoridis, S. and Koutroumbas, K.: Pattern Recognition, Academic Press, San Diego (1999) 4. Lipschutz, S. (1965). Schaum's Outline of Theory and Problems of General Topology. McGraw-Hill, Inc., New York (1965)

Appendix A: Distances and Metrics While the concept of ‘distance’ and ‘metric space’ is very well known, there are many equivalent but differing definitions found in textbooks. A metric space is a set of points with an associated “distance function” or “metric” on the set. A distance function d acting on a set of points X is such that d: X × X → R , and that for any pair of points x, y ∈ X , the following properties hold:

d(x, y ) = 0 iff x = y M2 d(x, y ) = d (y, x ) (Symmetry) M3 d(x, y ) ≥ 0 as well as a fourth property: for any 3 points x, y,z ∈ X , M4 d(x, y ) + d(y,z) ≥ d(x,z) (Triangle Inequality) If for x ≠ y , d(x, y ) = 0 , which is a violation of M1, then d is called a pseudoM1

distance or pseudo-metric. If M2 does not hold, i.e. the ‘distance’ is not symmetric, than d is called a quasi-metric. If M4 (the triangle inequality) does not hold, d is called a semi-metric. Finally note that if d is a proper metric then M3 is redundant, since it can be derived from the three other properties when z is set equal to x in M4.

1492

M. Wineberg and F. Oppacher

Appendix B: Norms Norms are also a commonly known set of functions. Since we make use of norms so extensively, we felt that a brief summary of the various properties of norms would be helpful. A norm is a function applied to a vector in a vector space that has specific properties. From the Schaum’s Outline on Topology6 the following definition is given: “Let V be a real linear vector space …[then a] function which assigns to each vector v ∈ V the real number v is a norm on V iff it satisfies, for all v,w ∈ V and k ∈ R , the following axioms: [N1] v ≥ 0 and v = 0 iff v = 0. [N2]

v +w ≤ v + w .

[N3] kv = k v . The norm properties hold for each of the following well-known (indexed) functions: k

Lk − norm = < a1,-,am > = k Σ ai . Taking the limit as k → ∞ of the Lk-norm produces the L ∞ - norm :

(

)

L∞ - norm = max a1 , a 2 ,..., a m . The norm combines the values from the various dimensions of the vector into a single number, which can be thought of as the magnitude of the vector. This value is closely related to the distance measure. In fact, it is well known that the norm of the difference between any two vectors is a metric.

6

[4] pg. 118.