Stochastic DNA in the presence of the coding bias

Stanislaw Cebrat
Institute of Microbiology, University of Wroclaw
ul. Przybyszewskiego 63/77
e-mail: cebrat@angband.microb.uni.wroc.pl

Miroslaw Dudek
Institute of Theoretical Physics, University of Wroclaw
pl. Maxa Borna 9
e-mail: mdudek@ift.uni.wroc.pl

CONTENTS: ABSTRACT, INTRODUCTION, PROPERTIES OF THE DNA RANDOM WALKS, CONCLUSIONS

ABSTRACT

We show that the long-range power-law correlations in the DNA coding function are associated with the appearance of purine-rich sequences in the sense strand of the DNA molecule. Superimposing the 1/f noise rules of a natural chromosome on a random DNA sequence generates long-range correlations in it and gives the stochastic sequence some of the characteristics of natural chromosomes.

PACS numbers: 87.15.By, 05.40.+j, 87.10.+e


INTRODUCTION

In the accompanying paper [1] we announced our discovery of periodicities in the appearance of coding sequences in DNA. We considered a DNA molecule to be a six-phase stream of codons (nucleotide triplets) and found that the correlation function <s(0)s(r)>, where s=1 if the trinucleotide is inside an Open Reading Frame (ORF) and s=0 otherwise, shows periodic oscillations with the distance r between the codons. We discussed in detail the particular case of yeast chromosome II (Ych2) [2], but we also found that periods in the coding sequence distribution are present both within each of the chromosome phases and between the phases in all yeast chromosomes we have examined. The largest periods, of an order not exceeding 10^5 codons, possibly correspond to the presence of a domain-like structure of DNA. Recently, Azbel [3] came to a similar conclusion, but he considered much shorter DNA sequences.
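As an illustration of the quantity <s(0)s(r)>, here is a minimal Python sketch. It assumes that the coding indicator s is already available as a list of 0/1 values, one per codon of a single phase, and that the average runs over all codon pairs at distance r. The function name `coding_correlation` and the toy indicator are hypothetical and are not taken from [1].

```python
def coding_correlation(s, max_r):
    """Return [<s(0)s(r)> for r = 0..max_r].

    s     : list of 0/1 values, 1 if the codon lies inside an ORF
    max_r : largest codon distance, must be smaller than len(s)
    The average is taken over all pairs (i, i + r) in the sequence
    (an assumption of this sketch).
    """
    n = len(s)
    corr = []
    for r in range(max_r + 1):
        pairs = n - r
        total = sum(s[i] * s[i + r] for i in range(pairs))
        corr.append(total / pairs)
    return corr


# Toy indicator with period 4 codons (illustrative data only).
s = [1, 1, 0, 0] * 5
c = coding_correlation(s, 8)
print(c)
```

For a periodic indicator such as the toy one above, the correlation returns to its r=0 value at multiples of the period, which is the kind of oscillation described in the text.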

To show the periodicities in the coding function, the only feature of a triplet that we used was whether it is coding or not. The sequence of nucleotides inside both coding and noncoding sequences was of no significance in those studies. The only sequences considered were the start codon (ATG) and the stop codons (TAA, TAG and TGA), which are the hallmarks of coding sequences and belong to them. Therefore, there is no direct connection between our finding and the results of Stanley et al. [4,5], Voss [6,7] or recent publications [3,8]. Nevertheless, Peng et al. [4], examining "purine/pyrimidine" walks on DNA, showed long-range correlations in noncoding regions. Our studies of the yeast genome have shown that there is a correlation between the length of ORFs and their nucleotide composition (manuscript submitted to YEAST), which suggested that the long-range correlations in coding capacity should be associated with long-range correlations in nucleotide composition.


PROPERTIES OF THE DNA RANDOM WALKS

In Fig. 1 the ratio [A]/[T] in the sense strand is plotted versus the minimal length of ORFs (the brackets [ ... ] denote the number of nucleotides). It is evident that longer ORFs are richer in adenine. We have also noticed that the excess of adenine over thymine in all ORFs with k>100 in each phase is correlated with the total length of these ORFs (correlation coefficient larger than 0.98). The same observations have been made for the ratio [G]/[C]. Since the length of an ORF is a very strong criterion for its qualification as a coding sequence, we conclude that the sense strand of a coding sequence tends to be richer in purines (A and G) than in pyrimidines (T and C). These results suggest that long-range correlations in coding capacity should be associated with long-range correlations in nucleotide distribution. That is why in this paper we have analysed the yeast chromosomes using the method described by the group of Stanley [4].
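The ratios [A]/[T] and [G]/[C] as a function of the minimal ORF length can be sketched as follows. This is only an illustration: it assumes that the sense-strand ORF sequences are given as strings and that the nucleotide counts are pooled over all ORFs that reach the minimal length. The function name `purine_ratios` and the two toy ORFs are hypothetical.

```python
def purine_ratios(orfs, min_codons):
    """Return ([A]/[T], [G]/[C]) pooled over ORFs with >= min_codons codons.

    orfs : list of sense-strand ORF sequences (strings over A, C, G, T)
    A ratio is None when its denominator count is zero.
    Pooling the counts over ORFs is an assumption of this sketch.
    """
    a = t = g = c = 0
    for seq in orfs:
        if len(seq) // 3 >= min_codons:
            a += seq.count("A")
            t += seq.count("T")
            g += seq.count("G")
            c += seq.count("C")
    at = a / t if t else None
    gc = g / c if c else None
    return at, gc


# Two toy ORFs (illustrative data only): 4 codons and 3 codons long.
toy_orfs = ["ATGAAAGGCTAA", "ATGTCTTAA"]
for k in (3, 4):
    print(k, purine_ratios(toy_orfs, k))
```

Scanning `min_codons` over a range of values and plotting the first returned ratio would give a curve of the same kind as the one described for Fig. 1.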

Nevertheless, there is a very significant difference between our method and that applied by Peng et al. [4]. In our studies the coding role of DNA has been taken into consideration. That means that we ask not only whether a sequence is coding, but also in which strand and in which phase it is coding. This is very important, because the coding sequence in the sense strand is quite different from its complementary sequence in the antisense strand. Even the sequences between coding ORFs in one phase (we will call them "spacers") are not of equal value, because they can code in one of the two other phases of the same strand, or in one of the three phases of the second strand, or they can be noncoding sequences. Since the interphase relationships in ORF generation are very strong [9], all these properties have to be considered.

In Fig. 2 we have plotted three different random DNA walks within phases of Ych2. The first walker (PP RW) goes "up" if the codon in phase (1) is richer in purines and "down" if it is richer in pyrimidines. The second walker is also a PP RW, but in phase (4). The third walker (XOR RW) steps "up" if the codon is inside an ORF longer than 150 in at least one of the phases (1-3) and there is no ORF longer than 150 in the phases (4-6), whereas it steps "down" if the codon is inside an ORF longer than 150 in at least one of the phases (4-6) and there is no ORF longer than 150 in the phases (1-3). Otherwise the walker does not move. It is evident that phase (1) tends to be coding when the strand is rich in purines, and phase (4) when this strand is rich in pyrimidines.
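The stepping rules of the two kinds of walkers can be written down as a minimal Python sketch. It assumes a sequence over the letters A, C, G, T read in one phase of one strand, and, for the XOR RW, two precomputed per-codon flags saying whether the codon lies inside a sufficiently long ORF in phases (1-3) or (4-6). The function names, the `phase_offset` parameter and the toy inputs are hypothetical; how the phases map onto offsets and strands is not specified here.

```python
PURINES = set("AG")


def pp_walk(seq, phase_offset=0):
    """Purine/pyrimidine walk on codons.

    Steps +1 if the codon has more purines than pyrimidines (2 or 3 of
    its 3 bases are A or G), otherwise -1. Returns the walker position
    after each codon.
    """
    y, path = 0, []
    for i in range(phase_offset, len(seq) - 2, 3):
        purines = sum(base in PURINES for base in seq[i:i + 3])
        y += 1 if purines >= 2 else -1
        path.append(y)
    return path


def xor_walk(in_orf_123, in_orf_456):
    """XOR walk: +1 if only phases (1-3) carry a long ORF at this codon,
    -1 if only phases (4-6) do, 0 otherwise (both or neither)."""
    y, path = 0, []
    for up, down in zip(in_orf_123, in_orf_456):
        if up and not down:
            y += 1
        elif down and not up:
            y -= 1
        path.append(y)
    return path


# Toy inputs (illustrative only).
print(pp_walk("AGATCTGGC"))
print(xor_walk([True, True, False, False], [False, True, True, False]))
```

Because a codon has three bases, it is always strictly richer in one of the two classes, so the PP RW never stands still, while the XOR RW does whenever both strands or neither strand carry a long ORF.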

In the accompanying paper [1] we showed the results of the analysis of the root mean square displacement F(r) [4], in terms of the local slope alpha, for the coding properties of Ych2. Obviously, the 1/f coding noise that we observed is totally insensitive to changes in both ORF and spacer sequences as long as their lengths do not change. That is why we have checked whether the long-range correlations in the PP RW on codons (as in Fig. 2) are preserved when different rules of the natural chromosome are superimposed on random DNA sequences.
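For orientation, the fluctuation analysis of [4] can be sketched in Python as follows. The sketch takes F(r) to be the standard deviation of the displacement y(i+r)-y(i) over all starting positions i, and estimates a local slope alpha from F(r) ~ r^alpha between two distances. The function names and the choice of estimating the slope from just two distances are assumptions of this sketch, not a description of the procedure used in [1].

```python
import math
import random


def fluctuation(y, r):
    """Root mean square fluctuation F(r) of the displacement y(i+r) - y(i),
    averaged over all starting positions i."""
    diffs = [y[i + r] - y[i] for i in range(len(y) - r)]
    mean = sum(diffs) / len(diffs)
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(mean_sq - mean * mean, 0.0))


def local_alpha(y, r1, r2):
    """Slope of log F(r) versus log r between distances r1 and r2."""
    f1, f2 = fluctuation(y, r1), fluctuation(y, r2)
    return (math.log(f2) - math.log(f1)) / (math.log(r2) - math.log(r1))


# Uncorrelated +1/-1 walk as a reference case (illustrative data only).
random.seed(0)
y, pos = [], 0
for _ in range(20000):
    pos += 1 if random.random() < 0.5 else -1
    y.append(pos)
alpha = local_alpha(y, 10, 100)
print(alpha)
```

For an uncorrelated walk the slope is expected to be close to 1/2, so values of alpha clearly above 1/2 over a range of r are what signal long-range correlations.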

In Figs. 3, 4 and 5 we compare the results for the exponent alpha calculated for the PP RW in phase (1) of Ych2 with those for the PP RW on the following sequences:

The results of Fig. 3 show that the distribution of coding sequences is responsible for the correlations in DNA even when triplet occurrence in spacers follows purely stochastic rules. The observed range of 5000 nucleotide triplets (where alpha > 1/2) is large compared with the short-range correlations connected with the size of ORFs shown in the previous paper [1]. The nucleotide distribution along DNA strands yields power-like behavior within this range once the coding noise of the natural chromosome or periodic rules are imposed on the appearance of the start and stop codons (Figs. 4, 5). These codons are richer in purines and, even though they constitute less than 10% of the codons of the sequence, their biased distribution generates long-range correlations. Besides generating the correlations, such a distribution of start and stop codons, biased by the 1/f coding noise of Ych2, produces a very specific size distribution of ORFs.

In Fig. 6 the numbers of ORFs versus their length (in codons) are plotted for Ych2, for chromosome (B), and for a purely stochastic chromosome with the same nucleotide composition as Ych2. It is evident that 1/f coding noise superimposed on the distribution of start and stop codons generates long ORFs, which are absent from a purely stochastic chromosome.
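The purely stochastic reference case can be illustrated with a short Python sketch that draws a random sequence with a fixed nucleotide composition and collects the ORF lengths in one phase. Several things here are assumptions made for illustration only: an ORF is taken to run from the first ATG to the next in-phase stop codon, both included; the composition values are placeholders and not the measured composition of Ych2; and the function names are hypothetical.

```python
import random
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}


def random_sequence(n, composition, rng):
    """Draw n independent nucleotides with the given probabilities."""
    bases = list(composition)
    weights = [composition[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=n))


def orf_lengths(seq, phase_offset=0):
    """Lengths in codons (start and stop codons included) of the ORFs
    found in one phase, scanning from ATG to the next in-phase stop."""
    lengths, start = [], None
    for i in range(phase_offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            lengths.append((i - start) // 3 + 1)
            start = None
    return lengths


# Placeholder composition (NOT the measured Ych2 values).
composition = {"A": 0.31, "T": 0.31, "G": 0.19, "C": 0.19}
rng = random.Random(1)
seq = random_sequence(30000, composition, rng)
lengths = orf_lengths(seq)
hist = Counter(lengths)  # number of ORFs for each length in codons
print(sorted(hist.items())[:10])
```

In such an uncorrelated sequence every codon after the start has the same chance of being a stop, so the number of ORFs falls off roughly geometrically with length; this is the baseline against which the long ORFs of the biased sequences stand out.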


CONCLUSIONS

There are long-range power-like correlations in the coding functions of the yeast chromosomes. Since these correlations are insensitive to the nucleotide composition of the sequences between and inside coding ORFs, it must be the distribution of coding ORFs that is responsible for them. Because the sense strand of coding sequences is usually richer in purines than the antisense strand, the correlated distribution of coding ORFs generates correlations in the nucleotide distribution along the chromosome sequences. To preserve the correlations shown by purine/pyrimidine codon random walks it is enough to keep the codon composition of ORFs. Furthermore, distributing starts and stops according to the rules resulting from 1/f coding noise generates correlations in purine distribution and an ORF size distribution close to those of natural chromosomes. The sequences generated by stochastic rules biased with 1/f coding noise, even though they resemble natural sequences, are still far from natural chromosomes. The main cause of the differences is that in generating such a sequence we generate only one phase of the DNA molecule. We have shown elsewhere that the relationships between phases are not random. Superimposing the rules resulting from interphase relationships and from the long-range correlations should generate sequences much more similar to natural chromosomes.

The discovery of the long-range power-law correlations in DNA coding capacity has very important consequences for understanding many biological phenomena. One of the most important seems to be the explanation of the famous Haldane dilemma (e.g. [10], [11]). In 1937 Haldane showed that organisms have to pay for maintaining active genes with the elimination of harmful mutations. Given the mutation rate and the size of the genome, the cost of this elimination seems to be unbearable and would exterminate the species.

Thus far, the only known mechanism of elimination of such mutations has been Darwinian selection operating on the phenotypic level. The discovery of the long-range correlations means that some properties of DNA sequences are determined on the molecular level, i.e., the number of possible molecular states is restricted. That means that states forbidden on the molecular level are not, and do not have to be, checked by Darwinian, phenotypic selection. Haldane [11] said: "The history of science makes it almost certain that facts will be discovered which show that the theory of natural selection is not fully adequate to account for evolution."


  1. S. Cebrat and M.R. Dudek, submitted to Phys. Rev. Lett., Preprint IFT UWr 893/95 (1995)
  2. H. Feldmann + 96 coauthors, EMBO J. 13, 5795 (1994)
  3. M.Ya. Azbel, Phys. Rev. Lett. 75, 168 (1995)
  4. C.-K. Peng, S.V. Buldyrev, A.L. Goldberger, S. Havlin, F. Sciortino, M. Simons and H.E. Stanley, Nature 356, 168 (1992)
  5. S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, F. Sciortino and H.E. Stanley, Phys. Rev. Lett. 71, 1776 (1993)
  6. R.F. Voss, Phys. Rev. Lett. 68, 3805 (1992)
  7. R.F. Voss, Phys. Rev. Lett. 71, 1777 (1993)
  8. A. Arneodo, E. Bacry, P.V. Graves and J.F. Muzy, Phys. Rev. Lett. 74, 3293 (1995)
  9. S. Cebrat and M.R. Dudek, Trends Genet. 12, 12 (1996)
  10. M. Kimura, The Neutral Theory of Molecular Evolution, Cambridge University Press (1983)
  11. J.B.S. Haldane, Natural selection, in "Darwin's biological work", ed. P.R. Bell, pp. 101-149, Cambridge University Press (1959)