
www.rsc.org/analyst | The Analyst

A fuzzy distance metric for measuring the dissimilarity of planar chromatographic profiles with application to denaturing gradient gel electrophoresis data from human skin microbes: demonstration of an individual and gender-based fingerprint

Yun Xu,a Richard G. Brereton,*a Karlheinz Trebesius,b Ingrid Bergmaier,b Elisabeth Oberzaucher,c Karl Grammerc and Dustin J. Pennd

Received 15th February 2007, Accepted 27th April 2007
First published as an Advance Article on the web 21st May 2007
DOI: 10.1039/b702410j

A newly devised fuzzy metric for measuring the dissimilarity between two planar chromatographic profiles is proposed in this paper. It does not require an accurately assigned sample-feature matrix and can cope with slight imprecision of the positional information. This makes it very suitable for 1-D techniques which do not have a second spectroscopic dimension to aid variable assignment. The usefulness of this metric has been demonstrated on a large data set consisting of nearly 400 samples from Denaturing Gradient Gel Electrophoresis (DGGE) analysis of microbes on human skin. The pattern revealed by this dissimilarity metric was compared with the one represented by a sample-feature matrix and highly consistent results were obtained. Several pattern recognition techniques have been applied on the dissimilarity matrix based on this dissimilarity metric. According to rank analysis, within-individual variation is significantly less than between-individual variation, suggesting a unique individual microbial fingerprint. Principal Coordinates Analysis (PCO) suggests that there is a considerable separation between genders. These results suggest that there are specific microbial colonies characteristic of individuals.

a Centre for Chemometrics, School of Chemistry, University of Bristol, Cantocks Close, Bristol, UK BS8 1TS. E-mail: [email protected]; Fax: +44-117-9251295; Tel: +44-117-9287658
b Vermicon AG, Emmy-Noether-Str. 2, 80992 Munich, Germany
c Ludwig Boltzmann Institute for Urban Ethology, Department for Anthropology, Althanstraße 14, A-1090 Vienna, Austria
d Konrad Lorenz Institute for Ethology, Austrian Academy of Sciences, Savoyenstraße 1a, A-1160 Vienna, Austria

Analyst, 2007, 132, 638–646
This journal is © The Royal Society of Chemistry 2007

1 Introduction

Denaturing Gradient Gel Electrophoresis (DGGE) is a commonly used technique for DNA fingerprinting of complex microbial communities.1–5 16S rRNA-gene fragments of all bacteria present in a particular sample are amplified by the polymerase chain reaction (PCR). DGGE is then applied to separate this mixture of DNA fragments according to their particular melting point. Thereby an ecosystem fingerprint is produced consisting of different bands, each band representing a particular bacterial species.

In this paper, a large-scale study on human skin microbial profiles was conducted.6–8 Nearly 200 subjects participated in this study and nearly 400 samples were collected. The samples were analysed on 28 different gels. The main difficulty of this study is band alignment between different profiles (or samples/lanes) and gels. Since DGGE is a 1-D technique, the only information available to identify each band is its position. However, owing to the nature of the gradient gels, the separation behavior is not very reproducible from gel to gel. Although there were standard lanes on each gel with a series of known standard bands, they can only cover a small proportion of the position space. There can still be many ambiguities in the assignment of the bands, i.e. when two bands with a slight positional difference from two different lanes are compared, it is sometimes difficult to decide whether they originate from the same microbe or from different ones.

It is possible for experts with adequate background knowledge to compare each gel and each profile visually and decide whether two bands from different profiles come from similar origins. However, if there are many gels and samples, as in this study, visual inspection would be very time consuming: for 390 profiles there are 75 855 possible pairwise comparisons which, at a rate of one comparison every 10 min, would involve over 12 000 hours' work or 5.62 years at 50 h a week and 45 weeks a year. In addition, there is no guarantee that every alignment would be correct: this is especially critical in cases where there are a large number of bands, many of which are unique to a few individuals, rather than a small number of bands common to most samples. Also, the more samples we have, the higher the chance that there will be ambiguities. However, pattern recognition techniques based on sample-feature matrices, such as Principal Component Analysis (PCA),9,10 require an unambiguous assignment of each variable, i.e. each column must accurately correspond to a single variable.

In this paper, we tackle this problem in another way. The first step is still to detect the bands on each DGGE profile and measure the intensity and the position of each band, which is similar to most conventional methods for processing planar chromatographic data and may require some manual intervention and decisions as to which features are considered as bands. The second step, matching the profiles, is performed automatically. In the method reported in this paper, we measure the pair-wise (dis)similarity between every pair of profiles. Because only two profiles are compared at one time, the number of bands that need to be matched is much smaller than would be required for global alignment of all the profiles, so the influence of a mismatch or ambiguity is much smaller: any error or uncertainty is restricted to a pair of samples. We therefore devised a dissimilarity measure, weighted by a fuzzy membership function which models the uncertainty of the positional information, to measure the dissimilarity between two different samples (i.e. two different lanes on the gel). By using it, global assignment of bands can be avoided and a pair-wise dissimilarity matrix can be constructed to represent the pattern of the data instead of a sample-feature matrix. We can then apply various pattern recognition techniques on this dissimilarity matrix to reveal the pattern of the microbial profiles.
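The manual workload quoted in this section follows from simple counting, and a short sketch makes the arithmetic explicit. The function names below are illustrative only and do not come from the paper; the rates (10 min per comparison, 50 h a week, 45 weeks a year) are the ones stated in the text.

```python
def pairwise_comparisons(n_profiles):
    """Number of distinct unordered pairs among n_profiles profiles."""
    return n_profiles * (n_profiles - 1) // 2


def manual_workload(n_profiles, minutes_per_comparison=10.0,
                    hours_per_week=50.0, weeks_per_year=45.0):
    """Return (pairs, hours, years) for visual pairwise inspection."""
    pairs = pairwise_comparisons(n_profiles)
    hours = pairs * minutes_per_comparison / 60.0
    years = hours / (hours_per_week * weeks_per_year)
    return pairs, hours, years


pairs, hours, years = manual_workload(390)
print(pairs, hours, round(years, 2))
```

The quadratic growth of the pair count is the reason an automated pair-wise measure becomes attractive as the number of profiles increases.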

2 Experimental

2.1 Fixation of samples

Microbial samples were taken in Greifenburg (Carinthia, Austria) from the armpit of different subjects. The axillary microflora was sampled by the washing-scrub method of Williamson and Kligman.11 A plastic cylinder open at both ends was placed on the armpit and filled with 1.5 ml of detergent solution. A glass stirrer was moved with constant pressure over the skin to detach the micro-organisms. The solution was removed and transferred to a sterilised reaction tube and the procedure was repeated. The solutions obtained were fixed with ethanol at a ratio of 1 : 1 (ethanol–detergent solution). The total sample volume of 6 ml consisted of 3 ml of detergent solution containing the microbes (sample) and 3 ml of 96% ethanol.

2.2 DNA extraction

A 1.3 ml aliquot of the sample was centrifuged (10 min, 14 000 rpm) and the supernatant was discarded. This step was repeated once, by adding 1.3 ml of sample to the pellet followed by an additional centrifugation step (10 min, 14 000 rpm). The pellet obtained was washed in 200 µl of 1× PBS (Phosphate Buffered Saline). After centrifugation (10 min, 14 000 rpm) the pellet was resuspended in 100 µl of 6% Chelex® 100 solution (BioRad, Munich, Germany) according to Rodríguez-Lázaro et al.,12 and incubated at 56 °C for 20 min. The sample was then thoroughly mixed and incubated further at 100 °C for 8 min. Subsequently, the sample was mixed and cooled for 5 min on ice. Following a centrifugation step (10 min, 14 000 rpm) the supernatant containing the DNA was removed.

2.3 PCR amplification of target DNA for DGGE

The extracted genomic DNA was amplified using the forward primer 341F-GC 13 with a GC clamp 5′-CGC CCG CCG CGC GCG GCG GGC GGG GCG GGG GCA CGG GGG GCC TAC GGG AGG CAG CAG-3′ and the reverse primer 518R 5′-ATT ACC GCG GCT GCT GG-3′. The final 50 µl reaction mixture contained the following: 2 µl template DNA, 25 pmol of each primer, 2.5 U of Taq DNA polymerase (Promega, Mannheim, Germany), 1-fold PCR buffer (Promega), 75 mM MgCl2 (Promega) and 10 mM dNTPs (Promega). The PCR protocol included a 5 min initial denaturation at 94 °C, 30 cycles of 94 °C for 0.5 min, 44 °C for 1 min and 72 °C for 1.5 min, followed by 10 min at 72 °C for final extension in a Primus 96 thermocycler (MWG, Ebersberg, Germany). PCR products were stored at −20 °C until further use.

2.4 DGGE analysis

PCR products were examined by standard agarose gel electrophoresis (1.7% agarose, 1× TAE) followed by ethidium bromide staining (4 mg l⁻¹) for 45 min. The result was visualised on a UV transilluminator and photographed. DGGE analysis was performed on a Dcode™-System (BioRad). Samples were loaded onto an 8% (w/v) acrylamide gel (37.5 : 1 acrylamide–bisacrylamide) in 1× TAE buffer with a denaturant gradient ranging from 20 to 60%, prepared in accordance with Muyzer et al.14 (100% denaturant contains 7 M urea and a volume ratio of 40% formamide). To standardise the DGGE gels, reference standards were applied to each gel. The reference standard consisted of a mixture of PCR products of 11 different bacterial species which are commonly found in human skin samples.15 The banding pattern resulted from PCR products obtained with the same primer pair as described above. Electrophoresis was performed at 60 °C, initially at 25 V for 15 min and then at 130 V for 4 h.

The gel was silver-stained based on the method of Sanguinetti et al.16 by the following procedure. A 150 ml portion of fixing solution (10% ethanol, 0.5% acetic acid) was applied to the gel, which was then shaken gently for 3 min. Subsequently, the gel was incubated for 10 min at room temperature in a silver nitrate solution (0.2% AgNO3, 10% ethanol, 0.5% acetic acid). After the silver nitrate solution had been discarded, the gel was washed in distilled H2O for 2 min. Thereafter, 150 ml of 'developer solution' (3% NaOH containing 300 µl of 37% formaldehyde) was applied to the gel for 5 min, while shaking gently. The staining procedure was stopped by incubating the gel for 5 min in the 10% ethanol, 0.5% acetic acid solution. The stained gel was transferred onto an overhead transparency sheet and documented on a SnapScan 1236 scanner (Agfa, Ridgefield Park, NJ). The scanning mode was transparent, 300 dpi and 24-bit color. The resultant pictures were converted to TIFF files for processing. Results in this paper come from these files.

2.5 Data set and software

A total of 196 subjects participated in the study during July and August, 2005. The majority of the subjects were sampled twice, once per fortnight. These subjects were grouped into 17 different families, denoted as A, B, C, D, E, G, H, J, L, M, N, O, P, Q, R, S and U. We have 390 samples in total (five subjects failed to provide an adequate second sample and three subjects were sampled three times). F = 191 individuals were analysed in this study. The samples were analysed on 28 different gels, each of which was intended to hold 18 samples (i.e. 18 lanes). However, for experimental reasons (e.g. no band could be detected), some of the lanes were excluded from the analysis. The number of valid lanes on each gel varied from 6 to 18. Each gel contains three standard lanes, located at lanes number 5, 10 and 15, each of which consists of 11 reference bands as described above. An example of a standard lane is presented in Fig. 1. All the calculations were performed using MATLAB version 7 (Mathworks, Natick, MA).
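The sample counts in Section 2.5 can be cross-checked with a few lines of bookkeeping. This is only a sketch in Python (the study itself used MATLAB), and it rests on two assumptions that the text implies but does not state outright: every subject other than the five single-sample and three triple-sample cases was sampled exactly twice, and F counts the subjects with at least two samples. The variable names are illustrative.

```python
subjects = 196        # subjects who participated
single_sample = 5     # failed to provide an adequate second sample
triple_sample = 3     # sampled three times

# Assumption: all remaining subjects were sampled exactly twice.
double_sample = subjects - single_sample - triple_sample

n_samples = single_sample * 1 + double_sample * 2 + triple_sample * 3

# Assumption: F is the number of subjects with repeat samples.
n_individuals = subjects - single_sample

print(double_sample, n_samples, n_individuals)
```

Under these assumptions the counts reproduce the 390 samples and F = 191 individuals reported above.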

3 Methods

3.1 Band detection

The first step of our method is to detect the bands on each profile or lane, which requires some manual intervention. The raw data from the gels were processed as TIFF format images. Semi-automated in-house software was written for band detection as follows:

(1) The color picture is transformed to a greyscale picture using the NTSC-Y equation.17 For each pixel in the color picture,

$$I = 0.2989R + 0.5870G + 0.1140B$$

where I is the greyscale intensity of the pixel, and R, G and B are the intensities of the red, green and blue channels of the pixel (8 bits per channel, giving a value between 0 and 255). The intensities are then rescaled into the range from 0 to 65 535 to produce a 16-bit greyscale picture.
(2) Each lane is defined by clicking a few points on both sides of the lane.
(3) For each lane, the 2-D image is mapped into a 1-D profile by summing the intensities of each row of the lane.
(4) The bands are detected by applying a Savitzky–Golay first-derivative filter10,18 to the 1-D profile. The window size and the order of the polynomial are tuned interactively. In most cases a five-point window and a quadratic polynomial filter yield a satisfactory result. The method is based on approaches described elsewhere.7,19,20

Fig. 1 Band detection.


(5) The band detection results were finally checked and corrected by visual inspection, e.g. by deleting false peaks and inserting peaks missed by the peak detection algorithm.
(6) The intensity of each band is computed by integrating the area under each peak.

An example of band detection is illustrated in Fig. 1.

3.2 Band position correction

Although all protocols for the production of denaturing gradient gels have been standardised thoroughly, the composition of each gel shows slight variation. Furthermore, some inhomogeneities remain even between different lanes on one gel. As a consequence, the separation behavior can differ from gel to gel, so the absolute position of the bands is not very useful. In order to make bands on different gels comparable, the position of each band was corrected relative to the positions of the standard bands as follows. The positions of the 11 bands in the lanes containing standards were set to index values from 1 to 11. For each band in the lanes from the human samples a corrected position is calculated by linear interpolation21 between the positions of the standard bands, as described by the following equation:

$$P_{new} = index_{previous} + \frac{P_{raw} - P_{std}^{previous}}{P_{std}^{after} - P_{std}^{previous}}$$

where $P_{new}$ is the corrected position, $P_{raw}$ is the absolute position of the band in the picture, $P_{std}^{previous}$ is the absolute position of the nearest standard band running faster than the band under consideration and $P_{std}^{after}$ is the absolute position of the nearest standard band running slower than the band under consideration (all in units of pixels). For bands that run faster than the first standard or slower than the last standard, $P_{std}^{previous}$ or $P_{std}^{after}$ is set to 0 or the largest pixel number accordingly. For each lane, a set of bands with corrected positions was computed.

3.3 Fuzzy membership function and pair-wise similarities

Because only one piece of information is available for identifying each band (its position relative to the nearest standards), it is sometimes difficult to align lanes unambiguously, especially between gels. In addition, because the separation behavior across a gel may vary non-linearly, interpolation of band positions can introduce errors due to uncertainties of differing separation behavior on each gel. Furthermore, it can be hard to align bands across different gels, as there can often be ambiguous choices. Because of this, methods such as PCA on global matrices can be unreliable, as they depend on the accurate assignment of the variables across lanes in individual gels and between different gels. In hyphenated chromatography such as GC–MS (gas chromatography–mass spectrometry) or LC–DAD (liquid chromatography–diode array detection) this problem is usually overcome by looking at the similarity between spectra in a second dimension.7 Finally, because significantly different separation from gel to gel can occur, and because the samples are sometimes quite diverse in composition and a high baseline
signal is very common, profile alignment methods such as COW (correlation optimised warping) and DTW (dynamic time warping)22–24 cannot be directly applied to these DGGE raw profiles either.

An alternative approach is to represent the profiles by the bands associated with their corrected positions and to use pairwise similarity measures, which look at how similar the profiles are for each possible pair of samples, but which also tolerate slightly imprecise positional information. In this paper we use a dissimilarity measure to determine the dissimilarity in the band profiles between two lanes i and j according to the following:

(1) Suppose that lane i has $N_i$ bands and lane j has $N_j$ bands, with $N_i \le N_j$. An $N_i \times N_j$ position difference matrix is constructed: each row represents the difference in corrected position between one band $n_i$ in lane i and all the bands in lane j, i.e. $|n_i - n_j|$. The minimum of each row corresponds to the band in lane j that is closest to band $n_i$. However, sometimes two or more bands in lane i share the same nearest neighbour in lane j. In such a case, we only consider the closest pair and the others are discarded (i.e. they are considered as unique bands in lane i). We denote the number of pairs of bands between the two lanes that have been matched as p ($p \le N_i$). It is important to recognise at this point that a match simply looks for the nearest band in lane j, but does not necessarily indicate that the two bands are from the same source, which depends on the distance as described below.

(2) Two dissimilarity metrics are proposed:

(i) The first takes into account band intensities as well as uncertainties in position. The dissimilarity metric is calculated by using the following equation:

$$d(i,j) = 1 - \frac{\sum_{k=1}^{p} x_{ik} \, x_{jk} \, w_k}{\|x_i\| \, \|x_j\|}$$

where p is the number of pairs being considered (see step 1 above), each matched pair being denoted by k, $x_{ik}$ and $x_{jk}$ are the integrated intensities of the band pair k in lanes i and j, and $x_i$ and $x_j$ are the corresponding vectors of intensities of all bands detected in the two lanes, each $\|\cdot\|$ term denoting the Euclidean (second) norm of the vector. The key to understanding this is the fuzzy weight function $w_k = H(\Delta d_k)$, which is determined by the absolute difference in corrected position ($\Delta d_k$) of the band pair and is defined by

$$w_k = H(\Delta d_k) = \frac{1}{2} \, \mathrm{erfc}(A \cdot \Delta d_k - S)$$

where erfc(x) is the complementary error function,25 defined by

$$\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2} \, dt$$

where t is a dummy integration variable and S and A are two tuneable parameters as described below. The resulting dissimilarity is a fuzzy weighted cosine metric that works on the raw bands detected, tolerating slight imprecision of the positional information. Sensitivity to the position difference is controlled by the two
tuneable parameters. With a suitable setting of S and A, the contribution of each matched pair to the distance metric is as follows.

(a) When the difference is small enough, the weight is 1 or very close to 1, reflecting two bands whose positions are very close after correction and so are likely to originate from the same source.
(b) When the difference is moderate, the weight is between 0 and 1, decaying rapidly as Δd increases, reflecting increasing uncertainty that the two peaks are from the same source.
(c) When the difference is large, the weight is 0 since the two bands are so far apart that they cannot originate from the same source.

In this paper, S is set to 5 and A to 30. The behavior of the weight function is given in Fig. 2. If two bands are closer than about 0.1 standardised units (see Section 3.2) they are assumed to be a perfect match, and if they differ by more than about 0.2 units they are assumed not to match at all. These parameters need to be tuned, but work well in this particular context. In practice this weight function takes into account the uncertainty of matching peaks due to their relative positions.

(ii) A second dissimilarity metric is proposed for qualitative comparison, i.e. presence or absence of bands, adjusted for their positional uncertainty but without taking into account their intensities. It is derived from the Jaccard distance,26 a commonly used qualitative dissimilarity metric defined as:

$$d(i,j) = 1 - \frac{a}{a + b + c}$$

where a is the number of peaks that are common to both samples, and b and c are the numbers of peaks unique to each of the two samples. By taking into account the uncertainty of each match, which is represented by the fuzzy weight function, we can redefine

$$a = \sum_{k=1}^{p} w_k$$

(which would equal p if all peaks were perfectly matched in both lanes), and

$$b + c = N_{i,unique} + N_{j,unique} + \sum_{k=1}^{p} (1 - w_k) = (N_i - p) + (N_j - p) + \left( p - \sum_{k=1}^{p} w_k \right) = N_i + N_j - p - \sum_{k=1}^{p} w_k$$

where $N_{i,unique}$ is the number of bands considered as unique in sample i (see above). Hence, the fuzzy weighted Jaccard distance metric can be written as:

$$d(i,j) = 1 - \frac{\sum_{k=1}^{p} w_k}{\sum_{k=1}^{p} w_k + N_i + N_j - p - \sum_{k=1}^{p} w_k} = 1 - \frac{\sum_{k=1}^{p} w_k}{N_i + N_j - p}$$
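The position correction of Section 3.2 and the matching, weighting and two dissimilarity measures of this section can be sketched in a few functions. This is a minimal illustration in Python rather than the authors' MATLAB implementation: all function names are hypothetical, non-empty lanes are assumed, and bands running faster than the first standard are assumed to take index 0 (the text only states that the previous standard position is set to 0).

```python
import math


def corrected_position(p_raw, standards, max_pixel):
    """Linear interpolation between standard bands (ascending pixel positions).

    Assumption: the image edges act as pseudo-standards with index 0 and n + 1.
    """
    anchors = [0.0] + [float(s) for s in standards] + [float(max_pixel)]
    for index_previous in range(len(anchors) - 1):
        lo, hi = anchors[index_previous], anchors[index_previous + 1]
        if lo <= p_raw <= hi:
            return index_previous + (p_raw - lo) / (hi - lo)
    raise ValueError("band position lies outside the image")


def fuzzy_weight(delta, a=30.0, s=5.0):
    """w = 0.5 * erfc(A * delta - S), with the paper's A = 30 and S = 5."""
    return 0.5 * math.erfc(a * delta - s)


def match_bands(pos_i, pos_j):
    """Nearest-neighbour matching; returns sorted (band_i, band_j, delta) triples.

    If several bands in lane i share a nearest neighbour in lane j,
    only the closest pair is kept and the others count as unique.
    """
    closest = {}  # band index in lane j -> (band index in lane i, delta)
    for bi, p in enumerate(pos_i):
        bj = min(range(len(pos_j)), key=lambda k: abs(p - pos_j[k]))
        delta = abs(p - pos_j[bj])
        if bj not in closest or delta < closest[bj][1]:
            closest[bj] = (bi, delta)
    return sorted((bi, bj, delta) for bj, (bi, delta) in closest.items())


def fuzzy_cosine_distance(pos_i, x_i, pos_j, x_j, a=30.0, s=5.0):
    """Quantitative metric: one minus the fuzzy weighted cosine."""
    if len(pos_i) > len(pos_j):  # lane i must be the lane with fewer bands
        pos_i, x_i, pos_j, x_j = pos_j, x_j, pos_i, x_i
    weighted = sum(x_i[bi] * x_j[bj] * fuzzy_weight(d, a, s)
                   for bi, bj, d in match_bands(pos_i, pos_j))
    norm_i = math.sqrt(sum(v * v for v in x_i))
    norm_j = math.sqrt(sum(v * v for v in x_j))
    return 1.0 - weighted / (norm_i * norm_j)


def fuzzy_jaccard_distance(pos_i, pos_j, a=30.0, s=5.0):
    """Qualitative metric: 1 - sum(w_k) / (N_i + N_j - p)."""
    if len(pos_i) > len(pos_j):
        pos_i, pos_j = pos_j, pos_i
    pairs = match_bands(pos_i, pos_j)
    w_sum = sum(fuzzy_weight(d, a, s) for _, _, d in pairs)
    return 1.0 - w_sum / (len(pos_i) + len(pos_j) - len(pairs))


# Made-up lanes: corrected positions and integrated intensities.
lane_a = ([2.10, 4.52, 7.80], [120.0, 300.0, 80.0])
lane_b = ([2.13, 4.50, 6.00, 7.95], [100.0, 280.0, 50.0, 90.0])
print(fuzzy_cosine_distance(lane_a[0], lane_a[1], lane_b[0], lane_b[1]))
print(fuzzy_jaccard_distance(lane_a[0], lane_b[0]))
```

With A = 30 and S = 5 the weight equals 0.5 at a position difference of S/A, roughly 0.17 units, which is the midpoint of the transition between a near-perfect match and no match.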

A pair-wise dissimilarity matrix D can then be computed between each pair of samples, for either of the measures above, and is employed in further analysis.

Fig. 2 Fuzzy weight function; Δd is the corrected difference in position between two matched peaks.

3.4 Ranking

An interesting question is whether the microbial profile is significantly different from one individual to another. To address this problem, we used a rank-based analysis procedure on the average distance matrix as described below. An average distance matrix between individuals was computed as follows.

(1) For an individual a with two repeats, $a_1$ and $a_2$, the distance between the profiles is defined simply as $d_{aa} = d(a_1, a_2)$. If individual a has three repeats, the distance between the profiles is defined as:

$$d_{aa} = \frac{\sum_{l=2}^{3} \sum_{m=1}^{l-1} d(a_l, a_m)}{3}$$

(2) For dissimilarities between different individuals a and b, the distances of all possible pairs of repeats are averaged. When both individuals have two repeats (the most common case), the four comparisons are averaged as follows:

$$d_{ab} = \frac{\sum_{l=1}^{2} \sum_{m=1}^{2} d(a_l, b_m)}{4}$$

(3) Subjects with only one sample are excluded from ranking analysis.
(4) Given F individuals, there are F × (F − 1)/2 possible distances both between different individuals and between samples originating from the same individual. The distances can be ranked from r = 1 (the most similar with the minimum distance) to r = F × (F − 1)/2 = 18 145 (the least similar with the maximum distance).
(5) Two rank lists can be constructed: an AA rank list that consists of the ranks of the average dissimilarities between microbial lanes originating from repeat samples of the same individual; and an AB rank list that consists of the ranks of the average dissimilarities between samples originating from different individuals. These ranks are taken from the overall rank list obtained in step (4).
(6) The hypothesis is that if there is an individual and constant microbial signature, repeat samples from the same individual should be more similar than repeat samples from different individuals, i.e. the within-individual variation is less than the between-individual variation. A two-sample Kolmogorov–Smirnov (K–S) goodness-of-fit test27 was used for comparing the rank lists. The statistic is K = max(|PAA(r) − PAB(r)|), where PAA(r) is the proportion of the values in rank list AA less than or equal to r, PAB(r) is the proportion of the values in rank list AB less than or equal to r, and P varies from 0 to 1. For comparing the AA and AB lists, a one-tail test was used for K to test the hypothesis that the ranks in the AA list are significantly lower than those in AB. The null hypothesis is that the two lists come from the same underlying distribution, which means there is no difference in the within- and between-individual variability.

3.5 Principal coordinates analysis (PCO)

It is also very useful to present data in low-dimensional space for visualisation.
Principal Coordinates Analysis (also known as Classical Multidimensional Scaling) is a commonly used technique to visualise a pair-wise distance matrix.28 The idea behind PCO is that, given a pattern defined by a distance matrix, a new set of points can be constructed in low-dimensional space (PCO scores) such that the pair-wise Euclidean distance matrix of the PCO scores is as close to the original distance matrix as possible. The original dissimilarity or distance matrix can use any type of distance measure (in this paper we use the fuzzy distances based on the vector cosine and on the Jaccard distance). A short description of the algorithm is given below:

(1) The matrix $D^{(2)}$ is calculated by squaring the elements of the original dissimilarity matrix.
(2) A matrix C is computed, whose elements are given by $c_{ij} = -\frac{1}{2} d_{ij}^{(2)}$.
(3) Finally, the matrix G is computed so that $g_{ij} = c_{ij} - \bar{c}_i - \bar{c}_j + \bar{c}$, where $\bar{c}_i$ and $\bar{c}_j$ represent the row and column means of matrix C and $\bar{c}$ the overall mean. This procedure simultaneously row- and column-centers the data.
(4) Eigen decomposition29 is performed so that $G = V S V'$.

A scores matrix $T = V S^{1/2}$ can be computed in analogy to PCA and the first l components are retained. Visualisation can be done by plotting one column of T against another.

4 Results

4.1 Comparison with bands tables

In order to validate our method against a conventional sample-feature matrix method, we compared the pattern revealed by our method with the one revealed by a bands table (i.e. each row represents a sample and each column represents a unique band). As stated above, it is usually difficult to produce an accurately assigned bands table on a large data set. However, producing a reliable bands table on a relatively small data set is much more feasible. We used family D, which contains 11 members and 22 samples, for this purpose. With this data set, a bands table was constructed by visual inspection with 22 samples and 36 unique detected bands. It is important to recognise that there are only 231 pair-wise comparisons when using 22 samples, which is feasible by manual methods: for 390 samples there will be 75 855 pairwise comparisons, making detailed manual inspection of bands tables impracticable and so potentially prone to serious errors.

We compare both the qualitative (presence/absence) and quantitative dissimilarities between the profiles. For the qualitative measure on the bands table we employ the Jaccard distance26 as discussed above, and for the quantitative dissimilarities we use the unweighted cosine

$$d(i,j) = 1 - \frac{x_i \cdot x_j'}{\|x_i\| \, \|x_j\|}$$

where $x_i$ and $x_j$ are vectors of length 36 consisting of the intensities of the 36 detected bands if present in a sample (equal to 0 if not). These are compared with the fuzzy weighted methods described in Section 3.3.

To compare the patterns represented by these two distance matrices, we performed PCO on the dissimilarity matrices for both the quantitative and qualitative distance measures: a PCO scores matrix based on the bands table is compared with one based on the fuzzy dissimilarity measure; all non-zero components were retained. The similarity between these two scores matrices A and B was evaluated by using the RV coefficient30,31 as follows:

$$RV(A,B) = \frac{\mathrm{trace}(B'AA'B)}{\left\{ \mathrm{trace}\left[(A'A)^2\right] \, \mathrm{trace}\left[(B'B)^2\right] \right\}^{1/2}}$$

By analogy to the correlation coefficient,32 which measures the correlation between two vectors, the RV coefficient measures the similarity between two matrices and varies from 0 to 1: 1 indicates that A = B with appropriate scaling and rotation, and 0 indicates that there is nothing in common between the two matrices. For the qualitative measure of similarity, the RV coefficient of the two scores matrices is 0.9702, which indicates that the patterns represented by these two dissimilarity matrices are indeed very similar.
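The double centering of Section 3.5 and the RV coefficient defined above can be sketched without an eigen decomposition. The sketch relies on an assumption that is not stated in the paper: when all non-zero components are retained and G has no negative eigenvalues, the scores satisfy TT′ = G, so trace(B′AA′B) = trace(G_A G_B) and trace[(A′A)²] = trace(G_A²). With negative eigenvalues the score-based value would differ. The function names are illustrative, and plain Python lists stand in for matrices.

```python
import math


def double_centre(dist):
    """Steps (1) to (3) of PCO: c_ij = -0.5 * d_ij**2, then double centering."""
    n = len(dist)
    c = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in c]
    col_mean = [sum(c[i][j] for i in range(n)) / n for j in range(n)]
    overall = sum(row_mean) / n
    return [[c[i][j] - row_mean[i] - col_mean[j] + overall
             for j in range(n)] for i in range(n)]


def trace_of_product(g, h):
    """trace(G H) for two square matrices of equal size."""
    n = len(g)
    return sum(g[i][j] * h[j][i] for i in range(n) for j in range(n))


def rv_from_gram(g_a, g_b):
    """RV coefficient evaluated from the centered matrices G_A and G_B."""
    return trace_of_product(g_a, g_b) / math.sqrt(
        trace_of_product(g_a, g_a) * trace_of_product(g_b, g_b))


# Made-up example: three points on a line and a rescaled copy.
x = [0.0, 1.0, 3.0]
d_a = [[abs(p - q) for q in x] for p in x]
d_b = [[2.0 * v for v in row] for row in d_a]
print(rv_from_gram(double_centre(d_a), double_centre(d_b)))
```

Because the RV coefficient is invariant to overall scaling, the rescaled copy in the example is indistinguishable from the original configuration.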

For the quantitative comparison, the same approach as described above is used and the RV coefficient is 0.9901, which also indicates a very good match between these two methods. Another way of comparison is to compare the two distance matrices directly by unfolding the lower (or upper) triangle parts of two dissimilarity matrices into two vectors, i.e. given two distance matrices A and B with dimensions of Q 6 Q (where Q = 22 in this application), the two vectors are written as {a11, a21,…aQ1, a22,…, aQ2,…, aQQ} and {b11, b21,…bQ1, b22,…, bQ2,…, bQQ}where aij and bij are the ith row and jth column elements in the matrices A and B respectively. By plotting one unfolded dissimilarity matrix against another, a strong correlation should be the evidence if these two dissimilarity matrices are very similar. The results are shown in Fig. 3(a) and 3(b). The correlation coefficient was also calculated based on the two pairs of unfolded dissimilarity matrices for each type of metric. The qualitative distance metric gives a correlation coefficient at 0.9774 and the quantitative distance metric a correlation coefficient at 0.9954. Both indicate that these two methods have very similar patterns. The significance level of the correlation of these two distance matrices was tested by using a permuted Mantel test.33 We permute the order of the samples in one distance matrix (the fuzzy weighted distance matrix) while keeping the

other unchanged, then we calculate the correlation coefficient between these two unfolded matrices as described above: the procedure is repeated 10 000 times and 10 000 correlation coefficients are obtained and form an empirical null distribution. The results suggest that given 10 000 permutations, not a single case can obtain such a high correlation [see Fig. 3(c)]. These comparison studies suggest that on this validation data set, these two methods gave highly consistent results, which means our method is a reliable alternative to the commonly used bands-table method. A major advantage of our method is that the pair-wise comparisons are automated and fast and allow a feasible and robust comparison of large data sets. 4.2 Rank analysis The empirical Cumulative Distribution Functions (CDF) of the AA rank list (rAA) and AB list (rAB) for the entire population were superimposed and are presented in Fig. 4. When the quantitative distance metric was used, the CDF of the AA list is almost always above that of the AB list, except for a small number of samples. The first 10% of the most similar samples judged by their ranks include more than 65% of the AA list while only around 10% AB samples have been

Fig. 3 Comparison of bands table and fuzzy distance metric: (a) qualitative distances comparison; (b) quantitative distances comparison; (c) permuted Mantel test.

644 | Analyst, 2007, 132, 638–646

This journal is ß The Royal Society of Chemistry 2007

included. The one-tail K–S statistic gives K = 0.62 and the null hypothesis can be rejected at a confidence level of nearly 100% and suggests that, using a quantitative measure of similarity, there is a very significant difference between the withinindividual repeat samples and the between-individual repeat samples. In other words, the differences of microbial profiles between different individuals are generally greater than those of the repeats of the same individual, and hence there is significant evidence for individual microbial signatures. In contrast, only 5% of the AA samples are more dissimilar than the corresponding population of AB samples in the CDF curves: we interpreted these as samples with poor reproducibility or cases where there is an analytical error in one of the analyses as only two repeats were analysed. A similar trend can be observed when the qualitative distance metric was applied, except that the CDF of the AA rank list is always above that of the AB rank list. 4.3 Principal coordinates analysis (PCO)

Fig. 4 Superimposed Cumulative Distribution Functions for the fuzzy weighted method: (a) quantitative distance metric, (b) qualitative distance metric.
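The CDF comparison shown in Fig. 4 and the one-tailed K–S statistic of Section 4.2 can be sketched as below. The function names are hypothetical, the two inputs are simply assumed to be lists of ranks (rAA and rAB), and the p-value uses the standard large-sample approximation for the one-sided two-sample statistic, which may differ from the procedure actually used in the paper.

```python
from bisect import bisect_right
from math import exp


def ecdf(sample):
    """Return a function x -> fraction of sample values <= x."""
    s = sorted(sample)
    n = len(s)
    return lambda x: bisect_right(s, x) / n


def ks_one_sided(r_aa, r_ab):
    """D+ = max over x of F_AA(x) - F_AB(x).

    Both CDFs are step functions, so it is enough to evaluate the
    difference at every observed rank value.
    """
    f_aa, f_ab = ecdf(r_aa), ecdf(r_ab)
    return max(f_aa(x) - f_ab(x) for x in set(r_aa) | set(r_ab))


def ks_one_sided_pvalue(d_plus, n, m):
    """Asymptotic approximation: P(D+ > d) ~ exp(-2 d^2 n m / (n + m))."""
    return exp(-2.0 * d_plus ** 2 * n * m / (n + m))
```

A large positive statistic means the AA ranks are concentrated at the "most similar" end of the list, which is the pattern described above.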

The most obvious separation within the data set is due to the gender of the subjects sampled. This can be seen from the plot of the first three PCO components, where the average score over the repeat samples is computed for each individual. Males and females are considerably separated, with a certain amount of overlap [see Fig. 5(a) and (b)]. In addition, the PCO plot based on the quantitative distance metric gives better separation than the one based on the qualitative distance metric, which suggests that a simple presence/absence criterion on microbial profiles may not be sufficient to discriminate between genders. The plots do, however, suggest that there is a difference between male and female microbial profiles, which is not unexpected. It is important to realise that several factors are likely to influence the microbial profile, including age, individuality, genetics and so on, so gender will not uniquely determine the microbial fingerprint. Hence, it is unlikely that there will be perfect separation between the two clusters. In a related project

Fig. 5 PCO plots: (a) average PCO plot on the full data set using qualitative fuzzy distance metric; (b) average PCO plot on the full data set using quantitative fuzzy distance metric.


looking at the GC–MS fingerprint (ref. 34), we find that there is no perfect prediction of females and males from the chemical signal in sweat, and we also find that different individuals have different levels of masculinity and femininity. There is reason to expect a similar situation for the microbial signal.
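The PCO step of Section 4.3 starts from a pair-wise distance matrix alone, and can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, and plain power iteration with deflation is swapped in for a proper symmetric eigensolver (such as a LAPACK routine) only to keep the example dependency-free. It is suitable only when the leading eigenvalues are positive and well separated.

```python
import random
from math import sqrt


def double_center(d):
    """Gower's transform B = -1/2 * J D^2 J, with J the centring matrix."""
    n = len(d)
    a = [[-0.5 * d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in a]
    grand = sum(row_mean) / n
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def matvec(b, v):
    return [sum(bij * vj for bij, vj in zip(row, v)) for row in b]


def pco(d, k=2, n_iter=200, seed=0, tol=1e-12):
    """Return n x k PCO scores; columns with no positive eigenvalue stay 0."""
    rng = random.Random(seed)
    n = len(d)
    b = double_center(d)
    coords = [[0.0] * k for _ in range(n)]
    for c in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(n_iter):
            w = matvec(b, v)
            norm = sqrt(sum(x * x for x in w))
            if norm < tol:  # nothing left to extract (absolute tolerance)
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        lam = sum(x * y for x, y in zip(v, matvec(b, v))) / vv  # Rayleigh quotient
        if lam < tol:
            break
        for i in range(n):
            coords[i][c] = v[i] / sqrt(vv) * sqrt(lam)
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j] / vv  # deflate the found component
    return coords


def average_scores(coords, individuals):
    """Average the PCO scores of the repeat samples of each individual."""
    groups = {}
    for row, who in zip(coords, individuals):
        groups.setdefault(who, []).append(row)
    return {who: [sum(col) / len(rows) for col in zip(*rows)]
            for who, rows in groups.items()}
```

Plotting the averaged scores, labelled by gender, would give a plot of the kind shown in Fig. 5.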

5 Conclusion and discussion

In this paper we demonstrate the usefulness of a newly devised dissimilarity metric that does not require an accurately assigned sample-feature data matrix and is particularly useful for large-scale 1-D planar chromatographic applications. In such cases, an accurately assigned sample-feature matrix is usually hard to obtain except by manual expertise, which is prone to errors and can be impracticable for very large data sets. Given a sample-feature matrix on a small subset of the data, we also show that both approaches yield very similar results. Although our application is to DGGE profiles, the metric can easily be applied to other types of planar chromatography, such as Thin Layer Chromatography (TLC), or to other types of gel electrophoresis.

The drawback of the method described in this paper is that, since no sample-feature matrix is available, all of the data analysis techniques have to be based on a pair-wise distance matrix. Detailed insight into the data, such as which variables have a significant influence on certain separations or which variables contain little information of interest, is therefore lost.

Based on the dissimilarity metric we devised, we found significant evidence for an individual biometric fingerprint. There is also some separation between genders, which indicates that there might be systematic differences between the microbial profiles of males and females. The methods developed in this paper show great promise for the analysis of one-dimensional chromatographic data in large surveys.

Acknowledgements

Alexandra Katzer is thanked for her superb organisational skills. This work was sponsored by ARO Contract DAAD1903-1-0215. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

References

1 R. Gasser, P. Nansen and P. Guldberg, Mol. Cell. Probes, 1996, 10, 99–105.
2 G. Muyzer, Curr. Opin. Microbiol., 1999, 2, 317–322.
3 B. Díez, C. Pedrós-Alió, T. L. Marsh and R. Massana, Appl. Environ. Microbiol., 2001, 67, 2942–2951.
4 E. Danilo, J. Microbiol. Methods, 2004, 56, 297–314.
5 J. Dubois, S. Hill, L. S. England, T. Edge, L. Masson, J. T. Trevors and R. Brousseau, J. Microbiol. Methods, 2004, 58, 251–262.
6 D. J. Penn, E. Oberzaucher, K. Grammer, G. Fischer, H. A. Soini, M. V. Novotny, S. J. Dixon, Y. Xu and R. G. Brereton, J. R. Soc. Interface, 2007, 4, 331–340.
7 H. A. Soini, K. E. Bruce, I. Klouckova, R. G. Brereton, D. J. Penn and M. V. Novotny, Anal. Chem., 2006, 78, 7161–7168.
8 S. J. Dixon, R. G. Brereton, H. A. Soini, M. V. Novotny and D. J. Penn, J. Chemom., 2006, 20, 325–340.
9 S. Wold, K. Esbensen and P. Geladi, Chemom. Intell. Lab. Syst., 1987, 2, 37–52.
10 R. G. Brereton, Chemometrics: Data Analysis for the Laboratory and Chemical Plant, Wiley, Chichester, 2003.
11 P. Williamson and A. M. Kligman, J. Invest. Dermatol., 1956, 45, 498–503.
12 D. A. Rodríguez-Lázaro, T. Jofré, M. Aymerich, M. Hugas and M. Pla, Appl. Environ. Microbiol., 2004, 70, 6299–6301.
13 G. Muyzer, E. C. de Waal and A. G. Uitterlinden, Appl. Environ. Microbiol., 1993, 59, 695–700.
14 G. Muyzer, S. Hottenträger, A. Teske and C. Wawer, in Molecular Microbial Ecology Manual, ed. A. D. Akkermans, J. D. van Elsas and F. J. de Bruijn, Kluwer Academic Publishers, Dordrecht, 1995, pp. 1–23.
15 K. Trebesius, B. Banowski, C. Beimfohr, I. Bergmaier, C. Jassoy, A. Sättler, R. Scholtyssek, R. Simmering and D. Bockmühl, submitted.
16 C. J. Sanguinetti, N. E. Dias and A. J. G. Simpson, Biotechniques, 1994, 17, 915–919.
17 W. K. Pratt, Digital Image Processing, John Wiley & Sons, New York, 1991.
18 A. Savitzky and M. J. E. Golay, Anal. Chem., 1964, 36, 1627–1639.
19 G. Vivó-Truyols, J. R. Torres-Lapasió, A. M. van Nederkassel, Y. Vander Heyden and D. L. Massart, J. Chromatogr., A, 2005, 1096, 133–145.
20 G. Vivó-Truyols, J. R. Torres-Lapasió, A. M. van Nederkassel, Y. Vander Heyden and D. L. Massart, J. Chromatogr., A, 2005, 1096, 146–155.
21 E. Meijering, Proc. IEEE, 2002, 9, 319–342.
22 N. P. Vest Nielsen, J. M. Carstensen and J. Smedsgaard, J. Chromatogr., A, 1998, 805, 17–35.
23 C. P. Wang and T. L. Isenhour, Anal. Chem., 1987, 59, 649–654.
24 G. Tomasi, F. van den Berg and C. Andersson, J. Chemom., 2004, 18, 231–241.
25 D. Zwillinger, Handbook of Differential Equations, Academic Press, Boston, 3rd edn, 1997.
26 P. Jaccard, Bull. Soc. Vaud. Sci. Nat., 1908, 44, 223–270.
27 R. B. D’Agostino and M. A. Stephens, Goodness-of-Fit Techniques, Marcel Dekker, Inc., New York, 1986.
28 J. C. Gower, Biometrika, 1966, 53, 325–338.
29 E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen, LAPACK User’s Guide, SIAM, Philadelphia, 3rd edn, 1999.
30 P. Robert and Y. Escoufier, Appl. Statist., 1976, 25, 257–265.
31 J. O. Ramsay, J. M. F. Ten Berge and G. P. H. Styan, Psychometrika, 1984, 49, 403–423.
32 A. G. Asuero, A. Sayazo and A. G. González, Crit. Rev. Anal. Chem., 2006, 36, 41–59.
33 N. A. Mantel, Can. Res., 1967, 27, 209–220.
34 S. J. Dixon, R. G. Brereton, E. Oberzaucher, K. Grammer, H. A. Soini, M. V. Novotny and D. J. Penn, Chemom. Intell. Lab. Syst., 2007, 87, 161–172.
