Nucleic Acids Research, 1994, Vol. 22, No. 16, 3412-3417

© 1994 Oxford University Press

Tandemly repeated pentanucleotides in DNA sequences of eucaryotes

B.Borštnik*, D.Pumpernik, D.Lukman, D.Ugarković1 and M.Plohl1

National Institute of Chemistry, Hajdrihova 19, PO Box 30, 61115 Ljubljana, Slovenia and 1Department of Molecular Genetics, Rudjer Bošković Institute, Bijenička 54, PO Box 1016, 41000 Zagreb, Croatia

Received March 16, 1994; Revised and Accepted July 14, 1994

ABSTRACT

Genetic sequence data banks were scanned in order to retrieve tandemly repeated pentanucleotides (pnts). It was found that among the 102 (= (1024 - 4)/2/5) possible distinct pnts roughly every fourth is involved in tandem repeats. It is shown that tandemly repeated pnts are composed of frequently occurring di- and trinucleotides, and that those pnts which occur frequently in the form of mono- or di-pnts also form tandem repeats, either as satellites or as shorter tandem repeats. Human satellite III is taken as a specific example. It is shown that the first guanine within the GGAAT pnt exhibits the highest mutability. The sequential distribution of base changes gives evidence that the mutations do not occur at random positions but in a correlated fashion, so that long stretches of original pnts remain intact. It is found that pnts related to satellite III are present in introns and flanking regions of some structural genes, but are not preserved between orthologous genes of related species. The results corroborate the most plausible mechanism of their evolution: rapid amplification followed by successive divergence of repeat units by various mutational processes.

INTRODUCTION

Highly repeated nucleotide sequences such as satellite DNAs are an important building component of any eukaryotic genome. Satellite DNAs can be classified according to the length and complexity of their basic repeating units. If the repeating unit is shorter than 10 base pairs (bp), the sequences are referred to as microsatellites, which can be found as highly abundant components of the constitutive heterochromatin of human (1,2,3), mouse (4), Drosophila (5), teleost fish (6) and other genomes. Their functional role, like the function of satellite DNAs in general, is poorly understood. Some are highly polymorphic and can be used in fingerprint analyses (7). A satellite species that is abundant in the human genome is satellite III, which consists of GGA(A/G)T repeats. Its importance is underlined by the finding that GGAAT repeats are positioned adjacent to alphoid repeats (8), which are able to bind nuclear proteins in kinetochore formation (6).

Pnts related to satellite III were found in centromeres of other organisms (3). Such GGAAT repeats are capable of forming multistrand structures even in the absence of the complementary strands and also exhibit high affinity to HeLa-cell nuclear proteins (3). On the other hand, stretches of simple tandem repeats can be found interspersed in euchromatic regions in the vicinity of genes or even in introns (9), making nonsatellite (= satellite like) stretches of tandem repeats. The copy number of tandem repeats in the vicinity of some genes can be directly linked with an increase in the severity of serious diseases, such as myotonic dystrophy (10) or fragile X syndrome (11).

In this work we present the results of an analysis of pentameric repeats retrieved from the sequences deposited in the gene data banks. We report on the occurrence of satellite and non-satellite tandem repeats and discuss the list of pnts which are involved in the repeats. As far as the distinction between satellite and non-satellite repeats is concerned, we adhere to the specifications available in the GenBank™ entries. As a central issue we examine the human satellite III pentameric repeats and related pentameric repeats found in other species. An analysis of mutations and of the sequential distribution of repeat variants was performed.

MATERIALS AND METHODS

Our basic source of DNA sequences was the GenBank™ genetic sequence data bank (12), releases 79 and 80, together with the EMBL database (13). We also used the BLAST server (14) to search for homologs of specific sequences of interest. We scanned the GenBank™ database by means of several independent computer programs of our own. The first program searches for tandem repeats of arbitrary pnts. It was tolerated that within each pnt one nucleotide can differ with regard to the neighbouring pnts. All repeats with more than 5 pnts within the above-mentioned tolerance were extracted together with the identities of their source sequences. Another computer program was dedicated to the analysis of the frequency of occurrence of short pnt repeats. For mono- and di-pnts the expected number of occurrences in the entire genetic sequence database is statistically significant, while for longer repetitions

*To whom correspondence should be addressed

one can expect less than one representative. The results of this second program make it possible to bridge the information on the occurrence of short versus long repeats. We also used computer programs that provide the frequencies of occurrence of dinucleotides and trinucleotides, which can be used to test the stochastic character of the composition of sequences in terms of di-, tri- and pentanucleotides.

Theoretical aspects of the probability of occurrence of specific oligomers were studied by several authors (15,16,17). It turns out that for long random sequences of length $N$ with equiprobable presence of all four nucleotides, the mean expected number of oligonucleotides of length $n$ with a specified composition is equal to $N/4^n$. The variance of the distribution depends upon the overlap capability of the nucleotides (16,17). Since we are interested only in the expected number of oligonucleotides, we do not need to consider the features which emerge from the overlap capability. In the limit of long sequences we can express the predicted number of pnts in the following way:

$$N^{(5)}_{pred}(a,b,c,d,e) = \frac{N^{(3)}(a,b,c)\, N^{(3)}(b,c,d)\, N^{(3)}(c,d,e)}{N^{(2)}(b,c)\, N^{(2)}(c,d)}$$

provided that there is no correlation in the distribution of nucleotides beyond the second neighbour. The quantities of the type $N^{(3)}(a,b,c)$ were obtained by counting the number of a,b,c triplets among the $N-2$ possible trinucleotide fragments within a sequence of length $N$.

Pnts in sufficiently long tandem repeats that do not show ideal repetitions were analysed with regard to the deviation from the average composition as well as with regard to the sequential distribution of the differences relative to the ideal repeat. For each base in the pnt we calculated the probability of conversion to any of the three remaining bases. In order to obtain more complete information about the mutational changes of tandemly repeated pentamers we also tried to extract information about the sequential ordering of mutations. Our starting point was the supposition that the changes with regard to the ideal pentameric repeats are randomly distributed. This hypothesis can be easily tested by calculating the distribution of runs of mutated or ideal pnts. By the word run we mean a sequence of consecutive elements of the same kind. If a certain sequence of $N$ pnts consists of $pN$ ideal and $(1-p)N$ mutated pnts, then one can expect $N(1-p)^2 p^n$ runs of length $n$ of ideal pnts. A systematic deviation from this rule can give us information about the nature of the distribution of mutations and about possible evolutionary pathways.
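The run test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' program: the function names and the boolean encoding of a repeat array (`True` for an ideal pnt, `False` for a mutated one) are assumptions made here.

```python
from itertools import groupby


def expected_ideal_runs(n_pnts, p_ideal, run_length):
    """Expected number of runs of exactly `run_length` ideal pnts,
    N (1 - p)^2 p^n, for randomly placed mutations."""
    return n_pnts * (1.0 - p_ideal) ** 2 * p_ideal ** run_length


def observed_ideal_runs(is_ideal):
    """Count maximal runs of ideal pnts; returns {run length: number of runs}."""
    counts = {}
    for flag, group in groupby(is_ideal):
        if flag:
            length = sum(1 for _ in group)
            counts[length] = counts.get(length, 0) + 1
    return counts


def run_excess(is_ideal, max_length):
    """Observed minus expected number of runs for lengths 1..max_length."""
    n_pnts = len(is_ideal)
    p_ideal = sum(is_ideal) / n_pnts
    observed = observed_ideal_runs(is_ideal)
    return {
        n: observed.get(n, 0) - expected_ideal_runs(n_pnts, p_ideal, n)
        for n in range(1, max_length + 1)
    }
```

A positive excess at large run lengths together with a negative excess at small run lengths would indicate clustered rather than randomly placed mutations.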

RESULTS

The probability of occurrence of pentanucleotide repeats of various lengths

There exist $4^5 = 1024$ distinct five-letter words of the ACGT alphabet. For tandem repeats the number of distinct pnts is halved on account of the equivalence of the two complementary strands and is further reduced by a factor of 5 on account of the equivalence of cyclic permutants. This means that there exist 102 distinct classes of pnts if one leaves aside the A5/T5 and C5/G5 pnts. The procedure which searches for tandemly repeated pnts retrieved, besides the sequences of human satellite III, Drosophila satellites and putative satellite sequences from chicken, duck and

Human satellite III: GGAAT, GGAGT
Avian satellites: GGAAA, GGGAA, GGAGA
Drosophila satellites: AGAAG, ACAAT, ATAAT
nstr, vertebrates: GGGGT (30), GGGAT (25), GGGTT (30), GGGGA (20), GGGCT (300), GGTGT (20)
nstr, vertebrates: AAAAC (250), AAAAT (200), AGAAC (16), GGCCT (10), AAATT (170), AAAAG (220), AAACC (50), AGCTG (240), AATGC (20)
nstr, invertebrates: AATGC (20)

Figure 1. The list of pnts which were retrieved from GenBank™ in the form of tandem repeats. For each pnt the phylogenetic affiliation is given. It is also marked whether a particular pnt appears as a satellite tandem repeat or as a non-satellite tandem repeat (nstr). For nstr the total number of pnt repeats in stretches longer than 5 is given.
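The figure of 102 distinct pnt classes, and the statement made in the Discussion that 56 of them contain CG or TA, can be checked by brute-force enumeration. The sketch below is not part of the original analysis; the helper names and the choice of the lexicographically smallest rotation as class representative are assumptions made for illustration.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_pnt(pnt):
    """Smallest string among all cyclic rotations of pnt and of its
    reverse complement; serves as a label for the equivalence class."""
    candidates = []
    for strand in (pnt, reverse_complement(pnt)):
        candidates.extend(strand[i:] + strand[:i] for i in range(len(strand)))
    return min(candidates)


def pnt_classes():
    """All classes of pentanucleotides, leaving aside the homopolymers."""
    classes = set()
    for letters in product("ACGT", repeat=5):
        pnt = "".join(letters)
        if len(set(pnt)) > 1:  # skip A5/T5 and C5/G5
            classes.add(canonical_pnt(pnt))
    return sorted(classes)


def has_rare_dinucleotide(pnt, rare=("CG", "TA")):
    """True if the tandem repeat of pnt contains CG or TA; the first base
    is appended so that the junction between repeat units is included."""
    cyclic = pnt + pnt[0]
    return any(dinucleotide in cyclic for dinucleotide in rare)


classes = pnt_classes()
print(len(classes))                                        # 102
print(sum(has_rare_dinucleotide(pnt) for pnt in classes))  # 56
```

Because CG and TA are their own reverse complements, the test gives the same answer for every member of a class, so it is enough to apply it to the representative.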

pheasant genomes, also many non-satellite repeats in various noncoding regions of genes. In Fig. 1 the satellite and frequently occurring non-satellite pnts and their total numbers of repeats are given. We can see that every fourth pnt is involved in tandem repeats. It would be interesting to find out whether there is some rationale behind the distinction between the pnts which appear in tandem repeats and those which do not. To this end we obtained quite valuable information from the analysis of the frequency of occurrence of dinucleotides, trinucleotides and pentanucleotides.

We will illustrate the results on the case of primate sequences. In GenBank™ Release 79, where the primate file (GBPRI.SEQ) contains more than 25 million base pairs, the four bases are fairly uniformly represented: A (26.0%), G (24.8%), C (24.7%) and T (24.5%). The frequencies of occurrence of dinucleotides (in percent) are as follows: CG 2.5, TA 4.3, GT 4.9, AC 5.4, AT 5.6, TC 6.0, GC 6.0, CA 6.7, TT 6.7, GG 7.2, CT 7.3, CC 7.4, AA 7.4, AG 7.5, GA 7.5, TG 7.6. Such an ordering of dinucleotide sequences was interpreted by Ohno and Yomo (18) as a universal TG/CA/CT excess, CG/TA deficiency rule. It is interesting to point out that the frequency of occurrence of longer segments is to a large extent controlled by the dinucleotides which are contained in the segment.

In Fig. 2a it is shown how faithfully the probability of occurrence of pnts can be predicted from the frequencies of occurrence of dinucleotides and trinucleotides (see the equation in the preceding section). The correlation coefficient for the $N^{(2)}, N^{(3)} \to N^{(5)}$ relation (see also Fig. 2a) is r = 0.901. An even higher correlation coefficient (r = 0.911) is obtained when comparing the observed and predicted numbers of trinucleotides ($N^{(3)}_{pred}(a,b,c) = N^{(2)}(a,b)\,N^{(2)}(b,c)/N^{(1)}(b)$) for primate sequences. The predictability of the occurrence probability of sequence elements extends to DNA stretches comprising up to 10 nucleotides, such as tandem repeats of two pnts. In Fig. 2b the relation between the frequencies of occurrence of (pnt)1 and (pnt)2 is plotted. The relation predicting the number of (pnt)2 on the basis of (pnt)1 can be written in the following way: $N^{(5)}_2 = (N^{(5)}_1)^2/N_{tot}$, where $N_{tot}$ is the total number of nucleotides which were analysed. This relation is represented by the dashed line. It is roughly fulfilled. The corresponding correlation coefficient is 0.884. Due to the finite content of the databases (for primates $N_{tot}$ is approx. 25 million bp) the predicted frequency

Figure 3. Mutational pattern of the GGAAT pentanucleotide in human satellite III sequences. The heights of the column segments above the GGAAT letters are proportional to the frequency of conversion to the bases marked within the column segments.
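A conversion profile of the kind plotted in Fig. 3 can be tabulated from a list of repeat units once they have been brought into phase with the ideal pnt. The following is a minimal sketch under that assumption; the function name, the handling of gapped units (simply skipped) and the toy input are hypothetical and do not reproduce the authors' data.

```python
from collections import Counter


def substitution_profile(observed_pnts, consensus="GGAAT"):
    """For each position of the consensus, return the fraction of usable
    repeat units in which the consensus base is replaced by each other base."""
    counts = [Counter() for _ in consensus]
    usable = 0
    for pnt in observed_pnts:
        # Units with deletions ("-") or a different length are left out.
        if len(pnt) != len(consensus) or "-" in pnt:
            continue
        usable += 1
        for position, (ref, obs) in enumerate(zip(consensus, pnt)):
            if obs != ref:
                counts[position][obs] += 1
    if usable == 0:
        raise ValueError("no usable repeat units")
    return [{base: n / usable for base, n in c.items()} for c in counts]


# Toy input, invented for illustration only.
units = ["GGAAT", "AGAAT", "GGAAT", "GAAAT", "GGAGT", "GG-AT"]
for position, conversions in enumerate(substitution_profile(units), start=1):
    print(position, conversions)
```

Summing the fractions at one position gives the overall mutability of that base, which is the quantity compared between the first guanine and the third-position adenine in the text.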

Figure 2. Illustration of how the probability of occurrence of sequential elements can be predicted in terms of the frequency of occurrence of subelements, for primate sequences. (a) Predicted numbers of occurrences of pentanucleotides ($N_{pred}$) versus the observed numbers ($N_{obs}$). The prediction was made on the basis of the observed numbers of occurrences of dinucleotides and trinucleotides, following the equation explained in the text. The correlation coefficient of the distribution of pnts is r = 0.911. Ideal prediction would generate points along the dashed line. (b) Number of (pnt)2 ($= N^{(5)}_2$) as a function of the number of (pnt)1 ($= N^{(5)}_1$) for 102 different pentanucleotides, so that the fragments of all repeats are represented. In order not to overcrowd the figure only one cyclic permutant (GGAATGGAAT versus GGAAT, for example) is entered into the figure. It turns out that the figure has a similar appearance for all five permutants. Sequences containing (pnt)n with n > 2 were omitted to avoid the situation where individual sequences rich in pnt tandem repeats give a predominant contribution to $N^{(5)}$. Four different symbols represent the following four groups: pnts which are found in tandem repeats, either satellite or non-satellite ones (black circles); pnts which contain the rare dinucleotides CG and TA and are not involved in tandem repeats (o); pnts containing CG which are involved in tandem repeats (+); all the remaining pnts (0).
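The two predictions compared in Fig. 2 can be sketched as follows. The code assumes the usual second-order Markov estimate, in which the product of the three overlapping trinucleotide counts is divided by the counts of the two shared dinucleotides, and the random-juxtaposition estimate $(N^{(5)}_1)^k/N_{tot}^{k-1}$ for k tandem copies. Function names and example numbers are illustrative assumptions, not the authors' programs or data.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def predict_pnt_count(pnt, tri, di):
    """Predicted number of occurrences of the pentanucleotide abcde from
    trinucleotide counts `tri` and dinucleotide counts `di`."""
    a, b, c, d, e = pnt
    denominator = di[b + c] * di[c + d]
    if denominator == 0:
        return 0.0
    return tri[a + b + c] * tri[b + c + d] * tri[c + d + e] / denominator


def expected_tandem_count(n_single, n_total, copies):
    """Expected number of tandem arrays of `copies` units if single pnts
    were placed independently: n_single**copies / n_total**(copies - 1)."""
    return n_single ** copies / n_total ** (copies - 1)


seq = "GGAAT" * 4
tri, di = kmer_counts(seq, 3), kmer_counts(seq, 2)
print(predict_pnt_count("GGAAT", tri, di), kmer_counts(seq, 5)["GGAAT"])

# A pnt seen 50 000 times in a 25 million bp collection (invented numbers).
print(expected_tandem_count(50_000, 25_000_000, 2))  # expected (pnt)2
print(expected_tandem_count(50_000, 25_000_000, 3))  # expected (pnt)3
```

With numbers of this size the expected count of (pnt)2 is of the order of a hundred while that of (pnt)3 falls below one, which is why only the (pnt)1 to (pnt)2 relation can be tested against a database of this size.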

of (pnt)3 is below 1 and the relation connecting the numbers of (pnt)1 and (pnt)3 cannot be tested. This means that all pnt tandem repeats longer than (pnt)2, either satellite or non-satellite ones, are the product of some mutational mechanism such as slippage replication or unequal crossing over.

Let us also point out a peculiar feature of the distribution of pnts in Fig. 2b. At high $N^{(5)}_1$ and $N^{(5)}_2$ values there is a systematic deviation from the $N^{(5)}_2 = (N^{(5)}_1)^2/N_{tot}$ rule, since the distribution is shifted towards higher $N^{(5)}_2$ values. What we consider most important is the fact that the pnts located in this region exhibit tandem repeats (black circles), either in the form of satellite or non-satellite ones.

The probability of occurrence of mono-pnts in mammalian DNA sequences was also studied by Volinia et al. (19). They were looking for rare pnts performing a regulatory function. All the pnts which they identified belong, according to our classification presented in Fig. 2, to the class of rare pnts because they all contain the TA dinucleotide.

In invertebrates and plants the occurrence of dinucleotides is different from that in primates, which were discussed above. CG is less deficient and TA exhibits a normal frequency of occurrence. There are also some changes in the patterns of the appearance of pnts. However, it is interesting to point out that the connection between the frequencies of occurrence of (pnt)1 and (pnt)2, as well as their relation to (pnt)n with n > 5, is also valid for these phylogenetic groups. The GGGCT pnt, for instance, which appears in Fig. 1 on account of primate and rodent sequences, is underrepresented on the (pnt)1 and (pnt)2 scale in plants and invertebrates and, consequently, it is not found in tandem repeats within these two phylogenetic groups.

The analysis of the mutations in the satellite tandem repeats

In this section we present the results of the analysis of satellite tandem repeats with regard to the frequency and sequential distribution of point mutations which destroy the ideal pnt repetitions. For this study the human satellite III sequences are the most appropriate. In the nonredundant database selected from the GenBank™ and EMBL databases one can find more than 20 entries with well-resolved human satellite III sequences. These sequences comprise altogether 10000 bp, which is equal to 2000 pnt repeats. This can serve as a decent basis for the analysis of compositional and sequential characteristics. In the databases one can find four entries which contain more than 1000 bp (acc. no. X54108, M21305, S90110 and X06228). The analysis of the above-mentioned sequences shows that the tandemly repeated GGAAT pentanucleotide pattern is approximately 80 percent conserved with respect to ideal repetitions (20). The pattern of the changes is not random, as can be seen in Fig. 3, where the guanine at the first place of the GGAAT pnt exhibits the highest mutability and the adenine at the third place the lowest. The two mutabilities differ by more than a factor of three. The results presented in Fig.

Figure 5, panels (a) to (g):
(a) GAGAT GAAAT GGGAT GAGAT GAAAT GAGAT AGGAT AAAAT GGAAT GGGAT GGA-T GCGAT GGGAT ACGAT GACAT AGAAT AG-AT GGAGT CGGAT G-AAT GGGAT GGGAT GGGAT GG-AT
(b) GGAATI GGAGT GGAGT GGAGT GGAAT GGAAA
(c) TGAAT GGAAT GGAAT GGA-T G-AA- G-AA
(d) (GGAGT)i GGAAT GGAGT GGAAT
(e) GGCTT GAAAT AGAAT -AAAT GGAAT -GAAT GGATT
(f) GGAAG G--TT GGAAT GGAAT GAAAT GCACTGGGAAT
(g) GGAGGaGGAGA GGAGCIGGAGT GGAGT GGAGT

Figure 4. The difference between the observed and expected number of runs of non-mutated GGAAT pnts in human satellite III sequences as a function of run length. See also the explanation in the Materials and Methods section.
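As a rough plausibility check of the run statistics quoted in the Results for Fig. 4, one can insert round numbers from the text into the expression $N(1-p)^2p^n$ given in Materials and Methods. The values $N \approx 2000$ pnts and $p \approx 800/2000 = 0.4$ used here are assumptions inferred from the totals quoted in the text, not parameters stated by the authors. A run of length $n$ contains $n$ ideal pnts, so the expected number of ideal pnts sitting in runs shorter than 4 is

$$\sum_{n=1}^{3} n\,N(1-p)^2p^n = 2000 \times 0.36 \times (0.4 + 2 \times 0.16 + 3 \times 0.064) = 720 \times 0.912 \approx 657,$$

which is close to the quoted expectation of 660. Since the sum over all run lengths equals $Np = 800$, the remainder expected in runs with $n \geq 4$ is about $800 - 657 = 143$, of the same order as the quoted value of 130.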

3 have a meaning only if one can prove that the differences between the mutational probabilities are not the result of stochastic fluctuations. In the case of the above-mentioned results we obtained rather firm evidence that the mutational pattern depicted in Fig. 3 is a real feature: the analysis of independent human satellite III sequences led to results which converged towards those depicted in Fig. 3.

As far as the sequential distribution of the base changes with respect to the ideal pentamer repeat is concerned, the results are depicted in Fig. 4, where the difference between the observed and expected probability of appearance of runs of non-mutated GGAAT pnts is plotted. We can see that there is a well-resolved deficit of short runs and an excess of long runs. Quantitatively, among 800 non-mutated pnts, 660 were expected and 470 were observed to participate in runs shorter than 4. For the runs with n ≥ 4 the respective numbers are 130 and 320. This is clear evidence that the mutations do not occur at random positions but in a correlated fashion, so that long stretches of original pnts remain intact.

It is also interesting to study the pTRS-47 subfamily (16) of satellite III. In this subfamily GGAAT and GGAGT pnts are mixed, with a slight predominance of GGAAT. The analysis of the sequence X54108, encompassing 1366 bp, reveals that the two pnts are mixed in such a way that the distributions of runs of either pnt show a departure from random mixing similar to that depicted in Fig. 4.

Another well-represented pentameric satellite stems from the Drosophila genome, where the satellites with pentameric repeats have a mosaic structure with segments of ideal pnt repeats. Four major sequences (acc. no. X05197) can be represented in shorthand notation as a41C63; 072; Ua6132a16d56; 638-Y43 where α = AATAC, β = AATAG, γ = AAGAG, δ = AGAGG and ε = AATAT, and the subscripts denote the number of repeats. The density of point mutations is below 1%.

Pentameric repeats in flanks and introns

Pentameric repeats can also be found in flanks and introns of structural genes. Maroteaux et al. (9), for instance, found pentameric repetitions of GGAAG, GGAAA and GGAGA pnts in avian genomes. The repetitions are rather well conserved, similarly to the above-mentioned Drosophila satellites. In the intron of the human protein C inhibitor gene (acc. no. M68516)

Figure 5. Human satellite III fragments found in flanks and introns of four human and three non-human genes. (a) Human pulmonary surfactant associated protein, intron 1 (acc. no. M24461). (b) Human mast cell chymase gene, 3' flank (acc. no. M64269). (c) Human loricrin gene, 3' flank, complementary strand (acc. no. M94077). (d) Human fumarylacetoacetate hydrolase gene, intron 1, 5' end (acc. no. L14658). (e) Strongylocentrotus purpuratus fascin mRNA, 3' flank (acc. no. L12047). (f) D.melanogaster Notch gene, intron E, complementary strand (acc. no. M16151). (g) Avena sativa phytochrome regulatory protein, 5' flank, complementary strand (acc. no. X03244).

(22) one locates more than 100 tandemly repeated GGA(A/G)T pentamers, which can be identified as a fragment of the pTRS-47 subfamily of satellite III. In order to see whether segments of satellite III also appear in other entries of the genetic sequence databases we performed a search, and we found four human genes and also several genes belonging to other organisms (Drosophila, sea urchin and oats) where the pentamers GGA(A/G)T are tandemly repeated either in introns or in flanking regions. The number of tandem repeats in these entries is not particularly high (5 to 9 repeats), but we think that this length is above the threshold at which one becomes certain that the sequence emerged as a result of a mechanism which generates tandem repeats. In Fig. 5 we present the fragments with these repetitive sequences.

In order to get some notion about a possible role of poly(GGAAT) we compared orthologous genes of which at least one contains a tandemly repeated pnt of the above-mentioned type. The most informative was the comparison between the genes coding for serine protease inhibitors in human, rat and mouse. In all three cases the gene consists of five exons. In the human gene the fourth intron contains a stretch of more than 100 pnts of the GGAAT and GGAGT types. The mouse (acc. no. X61597) and rat (acc. no. X16362) genes are between 70 and 80 percent homologous to it as far as the coding regions are concerned but exhibit no homology in intron structure, which also means that there is no sign of pnt tandem repeats. The only similarity is the similar length of the fourth intron. To some extent a similar situation can be found in the case of the retinal rod GMP phosphodiesterase α subunit genes (GPD), where the mouse gene (acc. no. X60664) contains 17 repetitions of AGAAC in the 3' flanking region. This pnt differs from GGAAT at two places, and can also be found in the fourth intron of the human protein C inhibitor gene, immediately following the GGA(G/A)T repeat. The bovine (acc. no. M27541) and human (acc. no. M26061) GPD genes exhibit high homology with mouse GPD in coding regions, but no homology in flanks, and no pnt repetitions either.
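The kind of scan used to find such repeats, as outlined in Materials and Methods (each pnt may differ from its neighbour by one nucleotide, and only stretches of more than 5 pnts are kept), can be sketched as below. This is a simplified illustration, not the authors' program: the greedy left-to-right scan, the function names and the default parameters are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equally long strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_pnt_tandem_repeats(seq, unit=5, min_units=6, max_mismatch=1):
    """Return (start, number of units, first unit) for every stretch in which
    each unit differs from the preceding unit at no more than `max_mismatch`
    positions and which contains at least `min_units` units."""
    hits = []
    i = 0
    while i + unit * min_units <= len(seq):
        first = seq[i:i + unit]
        if len(set(first)) == 1:  # homopolymers (A5/T5, C5/G5) are left aside
            i += 1
            continue
        units = 1
        j = i + unit
        while j + unit <= len(seq) and \
                hamming(seq[j - unit:j], seq[j:j + unit]) <= max_mismatch:
            units += 1
            j += unit
        if units >= min_units:
            hits.append((i, units, first))
            i = j  # continue after the reported stretch
        else:
            i += 1
    return hits


example = "GGAAT" * 3 + "GGAGT" + "GGAAT" * 3 + "CCATGCACCA"
print(find_pnt_tandem_repeats(example))
```

Because of the one-mismatch tolerance the reported start may fall one or two bases before the visually obvious phase of the repeat when flanking sequence is present; a production scan would normalise the phase afterwards, for instance by reducing each unit to a canonical cyclic permutant.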

DISCUSSION

We consider the most important result of this work to be the finding that the list of pnts which appear in tandem repeats is to a great extent determined at the dinucleotide level. The basic facts regarding

the nonuniformity in the appearance of genes with regard to dinucleotide composition were presented by several authors (23,24,25,26). This phenomenon stems from the picture in which genes have a mosaic structure composed of sequential elements (on the scale of tens or hundreds of base pairs) with different CG content. It is most clearly exhibited in vertebrate genomes and was thoroughly studied by Bernardi and coworkers (27,28). We have shown earlier (29,30) that the mosaic structure of genes produces an apparent long-range character in the sequential correlations in gene sequences, although other authors claim that the long-range correlations are a real phenomenon (31). As far as the implication of the mosaic structure of genes for the prediction of tandemly repeated pnts on the basis of dinucleotide frequencies is concerned, we think that our approach is correct.

We demonstrated on the case of primate sequences, where CG and TA dinucleotides are rare, that these dinucleotides are the major determinants of the frequency of pnts. Among the 102 possible distinct pnt tandem repeats 56 contain CG or TA or both. Only three among these 56 pnts occur frequently as (pnt)1 and (pnt)2 and are also involved in tandem repeats. These exceptions are the pnts with 100% C+G composition: CCCCG, CCCGG, CCGCG and their reverse complements plus cyclic permutants (we do not include CCCCC because it does not form distinct pnts). Our analyses have shown that tandem repeats of the above-mentioned pnts are located in the CpG-rich islands which are present in the majority of human genes (32,33). All the other pnts which are involved in tandem repeats contain only frequent dinucleotides and appear frequently as (pnt)1 and as (pnt)2. We can propose no explanation for this observation except the loose statement that the above property could be the consequence of constraints imposed at the DNA level. If an isolated pnt fulfils this constraint, then according to the selfish gene scenario (34) its chances of proliferating by some amplification mechanism such as slippage replication or unequal crossing over are increased.

Examination of the tandemly arranged pnts shown in Fig. 1 reveals that most of them exhibit purine-pyrimidine asymmetry. It is known that such DNAs may assume anomalous structures such as fold-back or multistrand structures due to stable GA base pairing between nucleotides on one or multiple strands (3). A conserved hexanucleotide sequence family from the telomeric regions of human chromosomes can form multistrand folding structures on the basis of GG Hoogsteen base pairs (35). G strings potentially capable of forming similar structures can be found in several pnts which appear frequently in eukaryotic genomes. Such unusual DNA structures could be important for the maintenance of chromosomal integrity, as in human telomeres (2), or could affect the expression of genes in their vicinity (9). We propose that the overrepresentation of these particular pnts in eukaryotic genomes, as revealed by this analysis, could be due to their unusual multistrand structures playing a role in the above-mentioned processes.

The analysis of nucleotide changes in satellite III confirms the results obtained by the analysis of the 1.8 kb highly repetitive EcoRI satellite II and III DNA family (20), showing that the pattern of changes is not random. The most variable nucleotides are the Gs at the first and second positions, which are predominantly changed to As. Such a pattern is hard to explain by random point mutations, but it could result from methylation of cytosines in the opposite DNA strand, followed by deamination to thymidine. It is known that CpN dinucleotides are the sites for methylation in most mammalian genomes (36). However,

other mutational mechanisms such as gene conversion and unequal crossing over probably also participate in the evolution of this satellite. Studies of sequential arrangements within satellite III reveal that the mutations do not appear at random positions but in a correlated fashion, meaning a higher similarity of pentanucleotides which lie in close proximity. This can be explained as a result of slippage replication amplifying the basic repeat and producing ideal repetitions which are subsequently changed by point mutations and spread to other parts of the genome. There are, however, examples where mutations are randomly spread in satellite DNA (37), suggesting that the process of interchromosomal spreading is faster than the mutation rate and intrachromosomal homogenisation.

In the case of the D.melanogaster pentameric satellites the point mutations are rather rare, as they are in the case of the putative avian pentameric satellites. These observations indicate the possibility that these satellites were introduced into the genomes rather late in evolutionary history. The sequences of human satellite III exhibit a rich history of modifications because they are far from ideal pentameric repeats. The percentage of accepted point mutations with regard to the ideal pentameric repeats is close to 20%, which produces approximately 50% of modified pnts. On this basis one can try to estimate the time of amplification of the satellite III pentameric repeat. The result depends of course on the mutation rate, which in the evolution of primates is a highly uncertain quantity. It is known that some orthologous non-repetitive DNA of human and chimpanzee diverged as slowly as 0.1% per million years (0.1%/Ma), while, on the other hand, certain hypervariable mitochondrial DNA evolved at a one hundred times higher rate, that is 10%/Ma (38).

In order to be able to make an educated guess regarding the value of the mutational rate of satellite III sequences one should take into consideration the putative relatedness of the function of human alphoid repeats and satellite III repeats (8). However, Jorgensen et al. (39), who studied in detail the evolution of α repetitive sequences, found that their evolution is rather slow and that various families of alphoid repeats diverged before chimpanzee and human separated. Let us also point out that the differences between human satellite III subfamilies (such as pTRS-63 and pTRS-47) are similar to the differences between human alphoid families, that is between 15% and 20%. On this basis we can conclude that human satellite III sequences have their root rather far down the phylogeny, beyond the major speciation events in the primate family.

Searching for pnts related to satellite III revealed their presence in introns and flanking regions of several genes belonging to different organisms. The number of tandemly repeated pnts is usually between 5 and 9. However, comparison between orthologous genes from related species shows that tandemly repeated pnts are not preserved in these genes, despite the high similarity between coding regions. It seems that these tandem repeats occur suddenly in the evolution of species as a result of a quick amplification process, probably caused by slippage replication. It was shown in a few cases that expansion of trinucleotide repeats in 5' or 3' ends of particular genes, as well as in coding regions, can cause severe heritable human diseases (10, 11). The number of repeats is unstable and can vary even between normal individuals, while intermediate numbers of repeats are associated with a higher risk of further expansion into disease. Highly expanded repeat regions are found in affected individuals and the number of repeats correlates roughly with severity of

the disease. It was also shown that these trinucleotide expansions, at the molecular level, induce methylation of the promoter, affect mRNA processing or cause production of a mutant protein (10, 11). It can also be proposed that expansion of pentanucleotides which appear in a substantial number of copies in the vicinity of avian ovotransferrin genes (9), or in many other species, could have a similar effect on the expression of a particular gene.

ACKNOWLEDGEMENTS

This work was supported by the Ministry of Science and Technology of the Republic of Slovenia and by the Research Fund of the Republic of Croatia. The anonymous referees are acknowledged for bringing some relevant published works to the authors' attention and for useful suggestions.

REFERENCES

1. Frommer,M., Prosser,J., Tkachuk,D., Reisner,A.H. and Vincent,P.C. (1982) Nucleic Acids Res., 10, 547-563.
2. Moyzis,R.K., Buckingham,J.M., Cram,L.S., Dani,M., Deaven,L.L., Jones,M.D., Meyne,J., Ratliff,R.L. and Wu,J.R. (1988) Proc. Natl. Acad. Sci. USA, 85, 6622-6626.
3. Grady,D.L., Ratliff,R.L., Robinson,D.L., McCanlies,E.C., Meyne,J. and Moyzis,R.K. (1992) Proc. Natl. Acad. Sci. USA, 89, 1695-1699.
4. Love,J.M., Knight,A.M., McAleer,M.A. and Todd,J.A. (1990) Nucleic Acids Res., 18, 4123-4130.
5. Lohe,A. and Roberts,P. (1988) In Verma,R.S. (ed.), Heterochromatin. Molecular and Structural Aspects. Cambridge University Press, Cambridge, USA, pp. 148-186.
6. Franck,J.P.C., Harris,A.S., Bentzen,P., Denovan-Wright,E.M. and Wright,J.M. (1991) In MacLean,N. (ed.), Oxford Surveys on Eucaryotic Genes. Oxford University Press, Oxford, Vol. 7, pp. 51-82.
7. Jeffreys,A.J., Wilson,V. and Thein,S.L. (1985) Nature, 316, 76-79.
8. Jackson,M.S., Mole,S.E. and Ponder,B.A.J. (1992) Nucleic Acids Res., 20, 4781-4787.
9. Maroteaux,L., Heilig,R., Dupret,D. and Mandel,J.L. (1983) Nucleic Acids Res., 11, 1227-1243.
10. Mahadevan,M., Tsilfidis,C., Sabourin,L., Shutler,G., Amemiya,C., Jansen,G., Neville,C., Narang,M., Barcelo,J., O'Hoy,K., Leblond,S., Earle-MacDonald,J., De Jong,P.J., Wieringa,B. and Korneluk,R.G. (1992) Science, 255, 1253-1255.
11. Kremer,E.J., Pritchard,M., Lynch,M., Yu,S., Holman,K., Baker,E., Warren,S.T., Schlessinger,D., Sutherland,G.R. and Richards,R.I. (1991) Science, 252, 1711-1714.
12. Benson,D., Lipman,D.J. and Ostell,J. (1993) Nucleic Acids Res., 21, 2963-2965.
13. Stoehr,P.J. and Cameron,G.N. (1991) Nucleic Acids Res., 19, suppl. 2227-2230.
14. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403-410.
15. Stückle,E.E., Nielsen,P.J. and Grob,U. (1992) J. Theor. Biol., 159, 299-306.
16. Gentleman,J.F. and Mullin,R.C. (1989) Biometrics, 45, 35-52.
17. Pevzner,P.A., Borodovsky and Mironov,A.A. (1989) J. Biomol. Struct. Dyn., 6, 1013-1026.
18. Ohno,S. and Yomo,T. (1991) Electrophoresis, 12, 103-108.
19. Volinia,S., Bernardi,F., Gambari,R. and Barrai,I. (1988) J. Mol. Biol., 203, 385-390.
20. Sol,K. and DuBow,M.S. (1993) Genome, 36, 334-342.
21. Choo,K.H.A., Earle,E., Vissel,B. and Kalitsis,P. (1992) Am. J. Hum. Genet., 50, 706-716.
22. Meijers,J.C. and Chung,D.W. (1991) J. Biol. Chem., 266, 15028-15034.
23. Nussinov,R. (1981) J. Mol. Biol., 149, 125-131.
24. Nussinov,R. (1984) J. Mol. Evol., 20, 111-119.
25. Hanai,R. and Wada,A. (1988) J. Mol. Evol., 27, 321-325.
26. Gutierrez,G., Oliver,J.L. and Marin,A. (1993) J. Mol. Evol., 37, 131-136.
27. Bernardi,G., Olofsson,B., Filipski,J., Zerial,J., Salivas,J., Cuny,G., Meunier-Rotival,M. and Rodier,F. (1985) Science, 228, 953-958.
28. Bernardi,G. (1989) Annu. Rev. Genet., 23, 637-661.
29. Borštnik,B., Pumpernik,D. and Lukman,D. (1993) Europhys. Lett., 23, 389-394.
30. Borštnik,B. (1994) Int. J. Quant. Chem., 50, 000.
31. Peng,C.K., Buldyrev,S.V., Goldberger,A.L., Havlin,S., Sciortino,F., Simmons,M. and Stanley,H.E. (1992) Nature, 356, 168-000.
32. Cross,S.H., Charlton,J.A., Nan,X. and Bird,A.P. (1994) Nature Genetics, 6, 236-244.
33. Larsen,F., Gundersen,G., Lopez,R. and Prydz,H. (1992) Genomics, 13, 1095-1107.
34. Dawkins,R. (1976) The Selfish Gene. Oxford University Press, Oxford.
35. Williamson,J.R., Raghuraman,M.K. and Cech,T.R. (1989) Cell, 59, 871-880.
36. Bird,A.P. (1980) Nucleic Acids Res., 8, 1499-1504.
37. Plohl,M., Borštnik,B., Lucijanic-Justic,V. and Ugarković,D. (1992) Genet. Res., 60, 7-13.
38. Hasegawa,M., Di Rienzo,A., Kocher,T.D. and Wilson,A.C. (1993) J. Mol. Evol., 37, 347-354.
39. Jorgensen,A.L., Laursen,H.B., Jones,C. and Bak,A.L. (1992) Proc. Natl. Acad. Sci. USA, 89, 3310-3314.