The microbial signature of drinking waters: myth or reality?

J. Harmand*, L. Paulou*, J. Desmoutiers*, L. Garrelly**, P. Dabert* and J.J. Godon*

* Laboratoire de Biotechnologie de l’Environnement, INRA-LBE, Avenue des étangs, 11100 Narbonne, France (E-mail: [email protected])
** Bouisson Bertrand Laboratoires, Parc Euromédecine, 778 rue de la croix verte, 34196 Montpellier cedex 5, France

Abstract This paper presents new software developed for analyzing single strand conformation polymorphism (SSCP) electrophoresis patterns delivered by the ABI310 genetic analyzer (Applied Biosystems). SSCP is a molecular typing technique based on the PCR amplification of microbial 16S rDNA and used for monitoring the dynamics of complex microbial ecosystems. The software, a home-made MATLAB toolbox called MODIMECO, includes a number of basic signal processing abilities as well as widely used statistical tools such as the well-known principal component analysis. The use of SSCP for assessing the hypothesis of the existence of a microbial signature of drinking waters illustrates the typical advantages of using such software tools. Results are discussed and conclusions drawn.

Keywords MATLAB toolbox; software; 16S rDNA single strand conformation polymorphism; pattern analysis; microbial signature; classification; PCA

Introduction

In a recent prospective paper, Yuan and Blackall (2002) stressed that optimizing the structure and properties of the microbial community should be an explicit aim in the design and operation of a treatment plant. To this end, they propose combining modern molecular techniques for monitoring microorganisms, such as fluorescence in situ hybridisation (FISH), with conventional tools. Although we are still far from these objectives, a number of molecular microbiology techniques allow us to take snapshots of a complex ecosystem and to monitor the presence and, in some circumstances, the relative abundance of the majority species within it (such techniques are abbreviated as MMTs, for molecular monitoring techniques, in the sequel). Among them, single strand conformation polymorphism (SSCP) makes it possible to monitor the presence of majority species (when targeting 16S rDNA) and/or their activity (when based on 16S rRNA detection) within a complex ecosystem (Delbès et al., 2000). The result of the analysis is a pattern of DNA fragments in which a position on the x-axis is related to a given species while the y-axis is representative of its relative abundance within the analyzed sample. However, these patterns suffer from a number of drawbacks. First, they can be quite noisy. Second, they should ideally consist of a succession of narrow lines, the size of each being related to the relative quantity of the corresponding species. However, the detector is not perfect and its response to each species is closer to a Gaussian curve. Several elementary Gaussian curves can therefore overlap, leading to a global “Gaussian-like” shape (cf. Figure 1). From this picture, it is quite clear that only the abundance of majority species can be estimated. Third, the distribution of the rDNA fragments along the x-axis is assumed to follow a normal distribution.
Thus, the contribution of minority species is expected to be captured by removing a curve whose shape can be approximated by joining together all local minima of the raw curve (cf. Figure 2; the interface shown is in French because, until now, MODIMECO has only been developed in French).

Figure 1 General shape of a SSCP pattern

Until recently, SSCP curves were analyzed by hand. The majority species were identified and their relative abundance within the sample was estimated manually (usually by assuming that the relative abundance of a species is given by the height of its peak). Because the number of patterns to be analyzed keeps increasing and more accurate analyses are needed, it was decided to develop a specific tool able to analyze a set of patterns systematically and automatically. This software has been developed as a MATLAB® toolbox and is called MODIMECO. It includes a number of basic signal processing tools and allows the user to load and analyze SSCP patterns. In particular, it includes statistical tools such as principal component analysis (PCA), one of the most widely used methods in studies of microbial community dynamics.

Knowledge of the microbiological signature of tap water could have several advantages: (i) to detect modifications of the input and link them to possible contamination, (ii) to identify the direction of preferential flow, and (iii) to identify aberrant intake points, which would be excluded when microbiological analyses are carried out no longer at the source but at the consumer’s tap.

Figure 2 Identification of present majority species within a complex ecosystem

This paper is organized as follows. First, the SSCP analysis toolbox, MODIMECO, is presented. Then, patterns from a water network are analyzed and the hypothesis of a microbial signature of waters is assessed. Finally, some conclusions are drawn.

Water Science & Technology Vol 53 No 1 pp 259–266 © IWA Publishing 2006. doi: 10.2166/wst.2006.028

A toolbox for analyzing a single SSCP pattern

In order to be as open as possible, MODIMECO lets the user choose whether or not to remove the contribution of the minority species from the raw pattern before the analysis. If it is not removed (the analysis is then said to be local), the relative abundance of a given species is assumed to correspond to the area under a reconstructed theoretical Gaussian curve. If it is removed (the analysis is then said to be global), the relative abundance of a species is computed either as the area of a reconstructed theoretical Gaussian (as in a local analysis) or simply as the height of the reconstructed theoretical Gaussian. An example is shown in Figure 2: the original raw pattern is plotted together with the “minority curve” and the curve from which the contribution of the minority species has been subtracted. Recall that the aim of the analysis is to transform the result of the SSCP, given in the form of a pattern, into a 2 × N vector of data where N is the number of peaks (and by extension of species) that have been identified. The first line contains the abscissa of each peak and the second line contains the relative abundance of the corresponding species (cf. Table 1). Once the user has chosen the algorithm to be used (global or local analysis), MODIMECO automatically identifies the peaks and, for each of them, computes the most probable associated standard deviation. From this information (the height of a peak and its associated standard deviation), a standard statistical procedure is called in order to reconstruct a theoretical Gaussian. A theoretical pattern (TS) can then be computed and plotted together with the original raw pattern (RS) and the difference curve (DC) between them (cf. Figure 3).

Based on the heights of the peaks and their closest left and right local minima in the RS, the software classifies all identified peaks into two distinct classes: “sure” (when the differences between the height of a peak and both its closest left and right local minima are significant, cf. circles in Figure 2) and “ambiguous” (when at least one of the differences between the height of a peak and its closest left or right minimum is not significant, cf. crosses in Figure 2).
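The baseline removal and the “sure”/“ambiguous” classification described for the toolbox can be sketched as follows. This is only an illustration in Python with NumPy, not the MODIMECO (MATLAB) code: the function names, the linear interpolation between minima, the clipping of negative values and the `min_drop` significance threshold are all assumptions, and the test pattern is synthetic.

```python
import numpy as np

def local_minima(y):
    """Indices of the interior local minima of y, plus both end points."""
    y = np.asarray(y, dtype=float)
    interior = np.where((y[1:-1] <= y[:-2]) & (y[1:-1] <= y[2:]))[0] + 1
    return np.concatenate(([0], interior, [len(y) - 1]))

def minority_baseline(y):
    """Piecewise-linear curve joining all local minima of the raw pattern."""
    y = np.asarray(y, dtype=float)
    idx = local_minima(y)
    return np.interp(np.arange(len(y)), idx, y[idx])

def classify_peaks(y, min_drop):
    """Label each local maximum 'sure' or 'ambiguous'.

    min_drop is a hypothetical significance threshold: a peak is 'sure'
    only if it rises at least min_drop above BOTH neighbouring minima.
    """
    y = np.asarray(y, dtype=float)
    minima = local_minima(y)
    peaks = np.where((y[1:-1] > y[:-2]) & (y[1:-1] >= y[2:]))[0] + 1
    labels = {}
    for p in peaks:
        left = minima[minima < p].max()    # closest minimum on the left
        right = minima[minima > p].min()   # closest minimum on the right
        drop = min(y[p] - y[left], y[p] - y[right])
        labels[int(p)] = "sure" if drop >= min_drop else "ambiguous"
    return labels

# Synthetic raw pattern: two large peaks, one small bump, one broad hump
# standing in for the minority species (made-up values, illustration only).
x = np.arange(800)

def gauss(height, mu, sigma):
    return height * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

y = gauss(100, 300, 15) + gauss(60, 500, 20) + gauss(3, 400, 10) + gauss(20, 400, 200)

labels = classify_peaks(y, min_drop=10.0)
baseline = minority_baseline(y)
# "Global" analysis: subtract the minority curve; negative values are clipped
# because a straight line between two minima can locally exceed the raw curve.
corrected = np.clip(y - baseline, 0.0, None)
```

With this synthetic pattern the two large peaks are labelled “sure” and the small bump between them “ambiguous”; a different threshold would of course move that boundary.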

Table 1 Results under the form of a matrix (12 peaks identified and the results presented either with respect to the area or with respect to the height of the peak)

Abscissa    % of the peak with respect    % of the peak with respect
            to the total area             to the highest one
641.00       2.87                          11.32
626.00       3.91                          15.44
602.00      11.36                          44.82
581.00       8.07                          31.83
557.00       3.46                          13.65
538.00       5.77                          22.77
511.00       6.18                          24.38
475.00      25.35                         100.00
440.00       4.73                          18.66
413.00       9.79                          38.61
374.00       6.49                          25.61
351.00      12.02                          47.42
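As a quick consistency check on Table 1, the short Python sketch below recomputes the third column from the second. It rests on an observation made from the published numbers, not on a statement by the authors: the “% with respect to the highest one” values appear to be the “% of total area” values rescaled so that the largest peak equals 100. The helper names are hypothetical; `gaussian_area` simply recalls the standard closed form for the area under a Gaussian of given height and standard deviation.

```python
import math

# Columns copied from Table 1.
abscissa = [641, 626, 602, 581, 557, 538, 511, 475, 440, 413, 374, 351]
pct_area = [2.87, 3.91, 11.36, 8.07, 3.46, 5.77, 6.18, 25.35, 4.73, 9.79, 6.49, 12.02]
pct_highest_table = [11.32, 15.44, 44.82, 31.83, 13.65, 22.77, 24.38, 100.00,
                     18.66, 38.61, 25.61, 47.42]

def gaussian_area(height, sigma):
    """Area under height * exp(-(x - mu)^2 / (2 sigma^2)): height * sigma * sqrt(2 pi)."""
    return height * sigma * math.sqrt(2.0 * math.pi)

def relative_to_highest(values):
    """Rescale values so that the largest one equals 100."""
    top = max(values)
    return [100.0 * v / top for v in values]

pct_highest = relative_to_highest(pct_area)

# Largest disagreement between the recomputed and the published third column.
max_gap = max(abs(a - b) for a, b in zip(pct_highest, pct_highest_table))
# The area percentages should add up to 100.
total = sum(pct_area)
```

The recomputed values agree with the published column to within rounding of the two-decimal inputs, and the area percentages sum to 100.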

Figure 3 The RS, TS and DC curves: 12 majority species have been identified

At this step, the user can manually add a peak that has not been automatically recognized, remove a “sure” peak or, conversely, confirm the presence of an “ambiguous” peak. In any case, the algorithms are run again and the data are updated and finally saved in the form of a 2 × N vector.

Dynamical analysis of microbial ecosystems


A particularly attractive challenge is to use the data generated automatically by the software presented above to analyze systematically the dynamics of microbial communities or bacterial populations, either in time (several SSCP patterns taken at the same location but at different dates) or in space (several SSCP patterns taken at the same date but at different locations of a system). To do so, statistical tools such as PCA are currently used to interpret data from MMTs such as polymerase chain reaction-denaturing gradient gel electrophoresis (PCR-DGGE) or 16S rDNA terminal-restriction fragment length polymorphism (T-RFLP) (cf. for instance Westergaard et al., 2001 and Dollhopf et al., 2001, respectively). The use of statistical methods presents a number of advantages. In particular, numerous samples can be analyzed simultaneously, permitting the monitoring of microbial communities or simply of bacterial groups whose occurrence and relative frequency are affected by any environmental parameter (cf. Fromin et al., 2002). Although this last paper underlines that PCA is probably not the most suitable tool for analyzing DGGE patterns (because its underlying model assumes that biological populations have a linear response curve along the axes of ecological variation), it should be noted that it has proved very useful in a number of cases (e.g. Müller et al., 2001). Furthermore, to the best of the present authors’ knowledge, PCA has not yet been applied to SSCP data. This is why the software described above has recently been enriched with a PCA module allowing users to analyze their SSCP data. PCA generates new variables, called principal components or PCs (linear combinations of the original variables), that explain the highest dispersion of the samples.


To do so, one should define a number of descriptors that will play the role of original variables. To characterize a pattern, the following descriptors have been included in the software:

• the absolute values of all peaks and their mean value;
• the values of the peaks by block and their mean value: in order to avoid obtaining a sparse matrix (which can be a problem in the computation of the PCs), the user can group peaks into blocks whose size is left to the user as a free parameter;
• the number of peaks of the pattern that are greater than a given percentage of the maximum;
• the percentage of minority species removed.

PCA is then applied to the data matrix obtained by extracting the values of the descriptors chosen by the user from the set of previously selected patterns. Results are listed in the form of a matrix giving the percentage of the variance of the data that is explained as a function of the number of PCs retained. The list of descriptors described above has been optimized in order to minimize the mean number of PCs necessary to describe the variability of a large number of patterns. Given the cumulative variance, the user can then decide on the “optimal” number of PCs to be retained in order to minimize the residuals (the “quantity” of the data that cannot be explained in the new space). Then, the software projects all data into the new space and a graphical representation is proposed (any PC can be plotted against any other one to study the projection of the points in these specific coordinates).
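The descriptor extraction and PCA steps just described can be illustrated with the minimal Python sketch below. It is not the MODIMECO implementation: the way peaks are grouped into blocks (equal-width bins along the x-axis), the parameter names (`n_blocks`, `threshold_pct`, `minority_pct`, `x_max`), the absence of any scaling of the descriptors and the randomly generated patterns are all assumptions made for the example. PCA is computed through a singular value decomposition of the column-centred descriptor matrix.

```python
import numpy as np

def describe(abscissa, abundance, x_max=800.0, n_blocks=8,
             threshold_pct=40.0, minority_pct=0.0):
    """Fixed-length descriptor vector for one analysed pattern.

    Assumed layout: summed peak values per x-axis block, mean peak value,
    number of peaks above threshold_pct of the highest peak, and the
    percentage of minority species removed.
    """
    abscissa = np.asarray(abscissa, dtype=float)
    abundance = np.asarray(abundance, dtype=float)
    edges = np.linspace(0.0, x_max, n_blocks + 1)
    blocks, _ = np.histogram(abscissa, bins=edges, weights=abundance)
    n_big = np.sum(abundance >= threshold_pct / 100.0 * abundance.max())
    return np.concatenate((blocks, [abundance.mean(), n_big, minority_pct]))

def pca(X):
    """PCA by SVD of the column-centred data matrix.

    Returns the scores (samples in PC space), the components (one PC per
    row) and the cumulative explained variance in percent.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = s ** 2
    cumulative = 100.0 * np.cumsum(variance) / variance.sum()
    return U * s, Vt, cumulative

# Eight synthetic patterns of 12 peaks each (made-up data, illustration only).
rng = np.random.default_rng(0)
patterns = [(rng.uniform(0, 800, 12), rng.uniform(1, 30, 12)) for _ in range(8)]
X = np.array([describe(x, a) for x, a in patterns])

scores, components, cumulative = pca(X)
# Smallest number of PCs whose cumulative variance reaches 95 %.
n_keep = int(np.searchsorted(cumulative, 95.0) + 1)
```

Plotting `scores[:, i]` against `scores[:, j]` gives the kind of pairwise PC projection mentioned in the text; whether the descriptors should be standardized beforehand is a design choice the paper does not specify.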
After that, in order to classify the different patterns into distinct sets (or classes) whose number is not known a priori (in this case, the classification is “unsupervised”), the software uses an algorithm called the “distance algorithm” (the user can choose between a “Euclidean” distance and a “weighted” distance in which the computed distance is weighted by the variance of each PC). The results are then presented in the form of a tree in which the user has access to the intra- and inter-class variances. In order to choose the optimal number of distinct classes, the user must find the best trade-off between maximizing the inter-class variances and minimizing the intra-class ones. It is also possible to project a new pattern into a “model space” for which the number of classes has already been determined using the procedure described above. In this case, the new point is projected and the class to which it is closest is determined using the k-nearest neighbors technique, where k is a free parameter chosen by the user. In the following section, the software is used to assess the hypothesis of a microbial signature of a drinking water distribution network.
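The two distances and the k-nearest neighbors assignment mentioned above can be sketched in Python as follows. The paper does not give the exact weighting formula, so multiplying each squared coordinate difference by a per-PC weight (for instance the fraction of variance explained by that PC) is an assumption, as are the function names and the toy coordinates.

```python
from collections import Counter

import numpy as np

def pc_distance(a, b, weights=None):
    """Distance between two points in PC space.

    Plain Euclidean distance when weights is None; otherwise each squared
    coordinate difference is multiplied by the corresponding weight.
    """
    d2 = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2
    if weights is not None:
        d2 = d2 * np.asarray(weights, dtype=float)
    return float(np.sqrt(d2.sum()))

def knn_class(new_point, points, labels, k=3, weights=None):
    """Class of a new projected pattern by majority vote of its k nearest neighbors."""
    dists = [pc_distance(new_point, p, weights) for p in points]
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy "model space" with two classes (made-up coordinates, illustration only).
points = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (5.0, 5.0), (5.5, 4.8)]
labels = ["A", "A", "A", "B", "B"]

predicted = knn_class((0.3, 0.3), points, labels, k=3, weights=(0.7, 0.3))
dist_plain = pc_distance((0.0, 0.0), (3.0, 4.0))
dist_weighted = pc_distance((0.0, 0.0), (3.0, 4.0), weights=(1.0, 0.0))
```

With a weight of zero on the second PC, the weighted distance ignores that axis entirely, which shows how strongly the choice of weights can change which neighbors are considered closest.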

Figure 4 The drinking water distribution network

Application: the microbial signature of a drinking water distribution network


The water network to be considered is shown in Figure 4. Points 1 and 2 correspond to the sinking of karstic water and to the storage tank, respectively, whereas points 3 to 8 correspond to different consumer taps. After initial treatment by the toolbox, the 8 individually analyzed SSCP patterns corresponding to the different sample points are represented in Figures 5a–h (only the reconstructed patterns, i.e. the TS curves, are plotted).

Figure 5 Patterns of the distribution network; (a) Analyzed pattern #1: TS1; (b) Analyzed pattern #2: TS2; (c) Analyzed pattern #3: TS3; (d) Analyzed pattern #4: TS4; (e) Analyzed pattern #5: TS5; (f) Analyzed pattern #6: TS6; (g) Analyzed pattern #7: TS7; (h) Analyzed pattern #8: TS8

Once each pattern has been analyzed, the problem is to determine the distances between all these patterns. The hypothesis of a microbial signature of drinking waters states that two points that are close to each other within the network should have similar ecosystems in terms of the species present, unless one sample has been contaminated by an external source. Such a hypothesis is particularly interesting because it would make it possible to rapidly identify and locate possible breaks or undesirable contamination points in drinking water distribution networks. To test it, the eight patterns are analyzed using the PCA algorithms presented above. The descriptors used for the study (chosen arbitrarily so as to minimize the number of PCs to be retained) are: a block size of 8; the number of peaks greater than 40% of the maximum; the percentage of removed minority species with respect to the total area of the patterns; and the mean value of the peaks. Retaining four PCs, the cumulative variance is 95.5%; the projection of the 8 reconstructed patterns onto the first three PCs of the new PCA space is plotted in Figure 6.

Discussion

With respect to Figure 6, one can expect patterns 1–4 to be classified into one specific class. It is slightly more complicated to determine how the others should be classified. Using the distance algorithm (on the basis of the weighted distance), it is established that, for 3 classes, the intra- and inter-class variances of one class are almost equal. In other words, that class consists of elements that are as far from each other as the class they define is from the other classes. Thus, we finally decided to keep four distinct classes composed as follows:

• Class A: patterns 1–4,
• Class B: pattern 8,
• Class C: patterns 5 and 6,
• Class D: pattern 7.
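The comparison of intra- and inter-class variances used to settle on four classes can be illustrated with the Python sketch below. The paper does not define these quantities precisely, so the definitions used here are assumptions: the intra-class variance is taken as the mean squared distance of the members to their class centroid, and the inter-class variance as the mean squared distance from that centroid to the other class centroids. The class memberships follow the list above, but the coordinates are invented for the example and are not the PC scores of Figure 6.

```python
import numpy as np

def class_variances(points, labels):
    """Return {class: (intra, inter)} under the assumed definitions.

    intra: mean squared distance of the members to their class centroid.
    inter: mean squared distance from the class centroid to the centroids
           of all other classes.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    centroids = {c: points[labels == c].mean(axis=0) for c in names}
    result = {}
    for c in names:
        members = points[labels == c]
        intra = float(((members - centroids[c]) ** 2).sum(axis=1).mean())
        others = [centroids[o] for o in names if o != c]
        inter = float(np.mean([((centroids[c] - o) ** 2).sum() for o in others])) if others else 0.0
        result[c] = (intra, inter)
    return result

# Invented 2-D coordinates for patterns 1 to 8 (illustration only).
coords = [(0, 0), (0, 2), (2, 0), (2, 2),   # patterns 1-4 -> class A
          (10, 0), (10, 2),                 # patterns 5-6 -> class C
          (10, 10),                         # pattern 7    -> class D
          (1, 10)]                          # pattern 8    -> class B
classes = ["A", "A", "A", "A", "C", "C", "D", "B"]

variances = class_variances(coords, classes)
```

A class whose intra-class value approaches its inter-class value is the situation described above for the three-class partition, and is the signal to split it further.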

Figure 6 Projection of the patterns onto the PCA space


These results can tentatively be interpreted with respect to the hypothesis of a signature of waters in distribution networks. Indeed, except for pattern 7, which is really far from the patterns located around it in the network (2, 3 and 6), quite satisfactory results are obtained: the main pipe corresponds to Class A (patterns 1–4) and the branch groups patterns 5 and 6 into a distinct class. Finally, point 8 is an isolated branch of the network. However, it should be stressed that the results are very sensitive to the choice of the descriptors.

Conclusions

In this paper, new software for analyzing SSCP patterns has been presented. This software includes basic signal processing tools as well as PCA, probably the most widely used statistical method for analyzing microbiology data. The assessment of the hypothesis of a microbial signature of waters illustrates the use of the software for analyzing SSCP patterns. It is shown that the PCA results are in accordance with this hypothesis. Until now, only PCA has been included in the software. In the near future, it is expected that a large library of statistical methods will be available in MODIMECO. In particular, it seems important to be able to tackle the problem of detecting and identifying nonlinear patterns. For instance, Clegg et al. (2003) propose to use Kohonen self-organizing maps (SOMs), which have proven very useful in a number of biological research areas. Used together with advanced statistical tools such as those proposed in MODIMECO, MMTs are expected to rapidly become real sensors available for modeling (and thus for increasing the understanding of bioprocesses) and, possibly in the future, for on-line control, diagnosis and supervision.

Acknowledgements

This study was supported by a grant from the Languedoc Roussillon (France) PRAT région, LR No 021101.

References

Clegg, C.D., Lowell, R.D.L. and Hobbs, P.J. (2003). The impact of grassland management regime on the community structure of selected bacterial groups in soils. FEMS Microbiology Ecology, 43, 263–270.
Delbès, C., Moletta, R. and Godon, J.J. (2000). Monitoring of activity dynamics of an anaerobic digester bacterial community using 16S rRNA PCR-Single-Strand Conformation Polymorphism analysis (SSCP). Environmental Microbiology, 5, 506–515.
Dollhopf, S.L., Hasham, S.A. and Tiedje, J.M. (2001). Interpreting 16S rDNA T-RFLP data: Application of self-organizing maps and principal component analysis to describe community dynamics and convergence. Microbial Ecology, 42(4), 495–505.
Fromin, N., Hamelin, J., Tarnawski, S., Roesti, D., Jourdain-Miserez, K., Forestier, N., Teyssier-Cuvelle, S., Gillet, F., Aragno, M. and Rossi, P. (2002). Statistical analysis of denaturing gel electrophoresis (DGE) fingerprinting patterns. Environmental Microbiology, 4(11), 634–643.
Müller, A.K., Westergaard, K., Christensen, S. and Sorensen, S.J. (2001). The effect of long-term mercury pollution on the soil microbial community. FEMS Microbiology Ecology, 36(1), 11–19.
Westergaard, K., Müller, A.K., Christensen, S., Bloem, J. and Sorensen, S.J. (2001). Effects of tylosin as a disturbance on the soil microbial community. Soil Biology and Biochemistry, 33, 2061–2071.
Yuan, Z. and Blackall, L.L. (2002). Sludge population optimization: a new dimension for the control of biological wastewater treatment systems. Water Research, 36, 482–490.
