Coding rhythm of DNA strands

Stanislaw Cebrat
Institute of Microbiology, University of Wroclaw
ul. Przybyszewskiego 63/77
e-mail: cebrat@angband.microb.uni.wroc.pl

Miroslaw Dudek
Institute of Theoretical Physics, University of Wroclaw
pl. Maxa Borna 9
e-mail: mdudek@ift.uni.wroc.pl


ABSTRACT

We have analysed several chromosomes of the Yeast genome and shown the existence of long-range power-law correlations in the distribution of coding sequences in the six phases of the DNA molecule. The observed correlations reflect the hierarchical harmonic structure of the coding capacity of the DNA molecule.

PACS numbers: 87.10, 87.15, 05.40


INTRODUCTION

Genome sequencing programs provide us with thousands of DNA sequences collected in databanks, which now contain even complete sequences of procaryotic genomes and eucaryotic chromosomes. Nevertheless, we do not have any good criteria to determine the coding regions of DNA or to describe the higher level of organisation of the chromosome. Stanley et al. [2,3] have discovered scale-invariant long-range power-law correlations in DNA sequences and concluded that such correlations are characteristic only of noncoding nucleotide sequences. The importance of the coding regions in generating the correlations in DNA still remains controversial. A series of papers has appeared on the topic (e.g. [4-8]), but none of them explains the nature of the long-range correlations and none of them shows correlations in the distribution of coding function along the DNA molecule. To understand the role of the coding regions in DNA structure we have analysed in detail the Yeast chromosome II (Ych2), which is sufficiently long (807188 bp) to avoid the accidental correlations characteristic of small systems. It was obvious to us that a method of looking for correlations in the coding capacity of a DNA molecule has to respect the logical structure of DNA - its phases. Usually, a DNA molecule is presented as a simple sequence of four different nucleotides: Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). Since in the DNA double helix A in one strand always corresponds to T in the second strand and G corresponds to C, one strand strictly determines the sequence of the second strand (which is antiparallel, i.e. read in the inverse direction). That is why the sequence of only one strand is sometimes analysed for the occurrence of long-range correlations. This can be a correct method only when the sequence of nucleotides is analysed without considering its coding role.
We have found a very strong rule: the two strands of a coding region of DNA are asymmetrical - the sense strand (which corresponds to the mRNA sequence) tends to be richer in purines, which means that the antisense strand tends to be richer in pyrimidines [9]. Therefore, the analysis of coding sequences has to give different results depending on the direction of the analysed sequence. Furthermore, if we look for correlations in the nucleotide sequence of one strand, taking into consideration only the sequences coding in this strand, we cannot differentiate between phases. Therefore, there is no possibility of finding relationships between phases in such studies.


PHASE STRUCTURE OF DNA MOLECULE

In our considerations, any DNA molecule has been represented by a six-phase stream of triplets (Fig. 1). The sequence of symbols A, T, G and C in line W (which stands for Watson) represents the sequence of one DNA strand. The sequence of symbols in line C (which stands for Crick) is complementary to the first strand and should be read in the opposite direction. The protein sequence is coded by the genetic code composed of codons, which are trinucleotide sequences, each corresponding to one amino acid in a protein sequence. Codons are read in one direction, they do not overlap, and there are no commas between them (Crick et al., 1961 [10]). Therefore, each strand of the DNA molecule should be read in three different "phases". We have arbitrarily assumed that phase (1) starts from the first nucleotide position in the first strand of DNA (symbols in parentheses represent the codons). Phases (2) and (3) are shifted by one and two nucleotide positions respectively, as in Fig. 1. Phases (4 - 6) on the complementary strand are read in the reverse order. Since Ych2 is composed of 807188 nucleotides (a number that leaves a remainder of two when divided by three), each of the six phases consists of the same number of triplets - 269062. It is easy to see that we get six different sequences of codons when we read the DNA sequence in phases. That means that a DNA molecule read in different phases represents different information in each phase, and even more - each phase represents the whole physical object (in the case of Ych2, without two nucleotides placed at the ends of the strand). Any protein coding sequence starts with the codon ATG and stops with one of three stop codons: TAA, TGA or TAG. Such a sequence, starting with the start codon and ending with a stop codon, is called an Open Reading Frame (ORF). One of the characteristic features of an ORF is its length measured in codons. The shortest ORF has the length k=1 and consists of the pair: start codon - stop codon.
In practice, during sequence analysis, one considers only the ORFs with k > 100-150 to be coding, regarding the shorter ORFs as purely random, which is not true [11]. In the following, we examine the coding properties of the six phases of Ych2 from the point of view of long-range correlations in ORF distribution.
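
The reading of a strand in six phases and the marking of ORF codons described above can be sketched in a few lines of Python. This is only an illustration, not the program used for the analysis: the function names are hypothetical, phase (4) is assumed to start at the first nucleotide of the reverse-complemented strand (the exact alignment of Fig. 1 is not reproduced here), each ORF is assumed to begin at the first ATG after the previous stop codon, and the stop codon itself is not counted in the length k.

```python
# Illustrative sketch; names and alignment conventions are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TGA", "TAG"}


def read_codons(strand, shift):
    """Split a strand into non-overlapping triplets starting at offset `shift`."""
    usable = (len(strand) - shift) // 3
    return [strand[shift + 3 * i: shift + 3 * i + 3] for i in range(usable)]


def six_phases(watson):
    """Return {phase: codons}; phases 1-3 on the W strand, 4-6 on the C strand."""
    watson = watson.upper()
    # The Crick strand is complementary and read in the opposite direction.
    crick = watson.translate(COMPLEMENT)[::-1]
    phases = {}
    for shift in range(3):
        phases[1 + shift] = read_codons(watson, shift)
        phases[4 + shift] = read_codons(crick, shift)
    return phases


def coding_mask(codons, kb):
    """Return s[i] = 1 if codon i lies in an ORF of length k >= kb, else 0.

    Assumption: k counts the codons from ATG up to, but not including, the stop.
    """
    mask = [0] * len(codons)
    start = None
    for i, codon in enumerate(codons):
        if start is None:
            if codon == "ATG":
                start = i
        elif codon in STOP_CODONS:
            if i - start >= kb:
                for j in range(start, i):
                    mask[j] = 1
            start = None
    return mask
```

For a sequence of 807188 nucleotides, each of the six lists returned by six_phases holds 269062 triplets, in agreement with the count given above.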


CODING PROPERTIES OF DNA PHASES

For the Ych2 we have calculated the probability $P_{\alpha \beta}(r)$ that both codons of a pair (one in the phase $\alpha$ and the other in the phase $\beta$, where $\alpha, \beta = 1, 2, \ldots, 6$) a distance $r$ apart are coding. This probability is nothing else but the correlation function $\langle s^{\alpha}(0) s^{\beta}(r) \rangle$, where $s = 1$ if the codon is coding and $s = 0$ otherwise. The term "coding codon" is used for the codons belonging to ORFs with length $k \geq k_b$, where $k_b$ is a varying parameter (the larger the value of $k_b$, the higher the probability that the ORF is coding). In particular, in Fig. 2 we show the plots of $P_{11}(r)$ and $P_{12}(r)$ for $r < 5000$ codon units and $k_b = 200$.
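
A minimal sketch of how such a probability could be estimated is given below. It assumes that each phase has already been reduced to a list of 0/1 values (1 for a coding codon, 0 otherwise), that the two lists are aligned to the same positions along the chromosome, and that the average is taken over all available starting positions; the function name is hypothetical.

```python
def pair_coding_probability(s_a, s_b, r):
    """Estimate <s_a(0) s_b(r)>: the fraction of codon pairs a distance r apart
    (first codon in phase a, second in phase b) that are both coding.

    s_a, s_b: lists of 0/1 coding indicators, assumed aligned position by position.
    """
    n_pairs = min(len(s_a), len(s_b)) - r
    if r < 0 or n_pairs <= 0:
        raise ValueError("r must be non-negative and smaller than the phase length")
    both_coding = sum(s_a[i] * s_b[i + r] for i in range(n_pairs))
    return both_coding / n_pairs
```

With s_a = [1, 1, 0, 0] and s_b = [0, 0, 1, 1], for example, the estimate is 0 at r = 0 and 1 at r = 2, a toy version of the exclusion between phases at short distances.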

In each phase $\alpha = 1, 2, \ldots, 6$ we have observed the exponential decay of $P_{\alpha \alpha}(r)$ within a range of $r$ of the order of 1000 codon units (for $k_b > 100$). In the same range of $r$ the cross-correlation functions $P_{\alpha \beta}(r)$ with $\alpha \neq \beta$ are increasing (Fig. 2), since the coding function of the sequence in one phase usually excludes the possibility of its simultaneous coding in another phase. The exponential decay is connected with short-range correlations over distances related to the length of ORFs. However, in the tails of $P_{\alpha \beta}(r)$ we have observed alternations of the coding and noncoding DNA sequences (for various values of $k_b$), both on large and small scales. One can observe this in Fig. 3 and Fig. 4, where the plots of $P_{11}(r)$ and $P_{12}(r)$ are presented.

One can notice the striking similarity of the graphs in Figs. 3 and 4 to the graphs of the well-known Weierstrass-Mandelbrot fractal function $W(t)$ [12,13] (e.g. see Fig. 2 in Ref. [13]), which can exhibit either deterministic or stochastic behavior and can be used as a model for 1/f noise. (The relationship between the Weierstrass-Mandelbrot function and the correlation function $P_{\alpha \beta}(r)$ for DNA phases is under investigation by us and Dr. M. Wolf.)

One can easily notice in Figs. 3 and 4 that the most slowly varying contributions to $P_{\alpha \beta}(r)$ seem to possess a periodic structure. To look for the periodicity we have used the Lomb periodogram method [14]. This is a method of spectral analysis for unevenly sampled data, which is superior to FFT analysis when the data are the sum of a periodic signal and white noise. The method weights the data on a per-point basis and yields the statistical significance of the measured periodicities.
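
For illustration, the textbook form of the normalized Lomb periodogram can be written directly in Python. The sketch below is not the routine used for the figures; it omits the significance estimate, the function name is hypothetical, and the usage data are synthetic (a pure sine wave of period 50 sampled at uneven points), chosen only to show that the peak is recovered.

```python
import math


def lomb_periodogram(t, y, omegas):
    """Normalized Lomb periodogram of samples y taken at (possibly uneven) points t.

    omegas: angular frequencies (must be non-zero) at which to evaluate the power.
    """
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / (n - 1)
    powers = []
    for w in omegas:
        # The offset tau makes the sine and cosine terms orthogonal at this frequency.
        s2 = sum(math.sin(2 * w * tj) for tj in t)
        c2 = sum(math.cos(2 * w * tj) for tj in t)
        tau = math.atan2(s2, c2) / (2 * w)
        yc = ys = cc = ss = 0.0
        for tj, yj in zip(t, y):
            c = math.cos(w * (tj - tau))
            s = math.sin(w * (tj - tau))
            yc += (yj - mean) * c
            ys += (yj - mean) * s
            cc += c * c
            ss += s * s
        power = yc * yc / cc
        if ss > 0.0:  # guard against a degenerate sampling pattern
            power += ys * ys / ss
        powers.append(power / (2 * var))
    return powers


# Synthetic usage example: uneven sampling of a sine wave with period 50.
t = [j + 0.4 * math.sin(1.7 * j) for j in range(200)]
y = [math.sin(2 * math.pi * tj / 50) for tj in t]
periods = [20, 35, 50, 80, 125]
powers = lomb_periodogram(t, y, [2 * math.pi / p for p in periods])
best_period = periods[powers.index(max(powers))]
```

In an analysis of the kind described here, t would be the distances r and y the corresponding values of the correlation function.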

In Figs. 5 and 6 we have plotted the Lomb periodograms for the probabilities $P_{11}(r)$ (Fig. 3) and $P_{12}(r)$ (Fig. 4), respectively. Similar periodograms can be obtained for the remaining phases and for other yeast chromosomes. In particular, in Fig. 7 we present the Lomb periodogram for $P_{11}(r)$ in phase (1) of the Yeast chromosome XI, which consists of 666448 bp. One can compare these results with the Lomb periodogram in Fig. 8 for a typical stochastic chromosome generated by computer (with the same A, C, T, G frequencies as in Ych2). The left-most peak in Fig. 8 has been cut off because it is associated with the size of the chromosome. The peaks in the Lomb periodograms for Ych2 disappeared when the ORFs and spacers of Ych2 were distributed randomly (discussed in the associated paper).

To visualize the role of coding and noncoding triplets in the observed long-range correlations we have examined the random walk (XOR RW) of the differences in coding properties between the phases $\alpha$ and $\beta$.

The XOR RW is defined as follows. The walker steps "up" if the phase $\alpha$ is coding and the phase $\beta$ is noncoding, whereas it steps "down" if the phase $\beta$ is coding and the phase $\alpha$ is noncoding. Otherwise the walker does not move.
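
The rule can be stated compactly in code. The following sketch assumes that both phases are given as lists of 0/1 coding indicators aligned position by position (for a phase on the complementary strand this presumably requires reversing its codon order first); the function name is hypothetical.

```python
def xor_random_walk(s_a, s_b):
    """Return the walker position after each step of the XOR RW.

    Step +1 if only phase a is coding, -1 if only phase b is coding,
    0 if both or neither are coding.
    """
    positions = []
    position = 0
    for a, b in zip(s_a, s_b):
        position += a - b  # with 0/1 indicators this is +1, -1 or 0
        positions.append(position)
    return positions
```

For s_a = [1, 1, 0, 0, 1] and s_b = [0, 1, 1, 0, 0] the steps are up, none, down, none, up.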

To show how strong the long-range correlations in the coding capacity of the DNA phases are, we have examined the root mean square fluctuation about the average of the walker displacement in terms of the local slope $\alpha$ (here $\alpha$ denotes the scaling exponent, not a phase index).
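
One common way to compute such a quantity, following the DNA-walk analysis of Ref. [1], is sketched below. It is an assumption that the same definition applies here: the fluctuation F(l) is taken as the standard deviation of the displacement over windows of length l, averaged over all starting points, and the local slope is estimated as a finite difference of log F against log l. The function names are hypothetical.

```python
import math


def rms_fluctuation(y, l):
    """Root mean square fluctuation of the displacement y[i + l] - y[i]
    about its average, taken over all starting points i."""
    diffs = [y[i + l] - y[i] for i in range(len(y) - l)]
    mean = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))


def local_slopes(y, lengths):
    """Finite-difference estimate of alpha = d log F / d log l between
    successive window lengths (F must be non-zero at each length)."""
    f = [rms_fluctuation(y, l) for l in lengths]
    return [
        (math.log(f[i + 1]) - math.log(f[i]))
        / (math.log(lengths[i + 1]) - math.log(lengths[i]))
        for i in range(len(f) - 1)
    ]
```

For an uncorrelated walk the slope is expected to be close to 0.5, so values persistently above 0.5 at large l indicate long-range correlations.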

In Fig. 9 we have put together the results for the exponent $\alpha$ for the XOR RW between phases (1) and (4) of Ych2, both for $k > 1$ and $k > 150$. Since $\alpha > 0.5$ for $r$ much larger than the range of the short-range correlations seen in Fig. 2, it is evident that the distribution of coding sequences is highly correlated and is subject to the same strong rules within the range $r$ regardless of their position within the chromosome. Even the distribution of short ORFs is correlated, because many of them are generated by the longer, coding ones [11].


CONCLUSIONS

We have analysed the longest available chromosome sequences in the Yeast genome with respect to their coding properties and found scale-invariant long-range power-law correlations resulting from a very regular harmonic rhythm of coding over a wide range of scales. The angular frequencies at which the DNA phases are coding have a hierarchical structure.

We suggest that these periodicities in coding capacity are connected with the tendency to restore the local symmetry in the DNA molecule, which can be illustrated by the symmetry of purine-pyrimidine dominant codon usage in the whole chromosome (unpublished).

The correlations in the distribution of coding sequences have some biological implications. They prove the nonrandom organization of chromosomes. There are two possible explanations of the role of the observed correlations. The first is that they are forced by conditions external to the DNA molecule and its polymerisation process. In this case the chromosome should be subject to natural selection, by elimination of all sequences not fulfilling the rules of correlations. The second possibility is that the correlations are the result of intrinsic properties of the molecule and/or the physical properties of the environment in which the DNA molecule exists and is synthesized. In this case the coding role is superimposed on a more fundamental property of the DNA molecule. That could explain the fractal character of the function describing the molecule and would be a nice and consistent attempt to resolve the Haldane dilemma, which was discussed in the accompanying paper.


Acknowledgments

One of us, M.R.D., was supported by KBN grant No. 2 P302 057 07.


  1. C.-K. Peng, S.V. Buldyrev, A.L. Goldberger, S. Havlin, F. Sciortino, M. Simons and H.E. Stanley, Nature 356, 168 (1992)
  2. H.E. Stanley, S.V. Buldyrev, A.L. Goldberger, J.M. Hausdorff, S. Havlin, J. Mietus, C.-K. Peng, F. Sciortino and M. Simons, Physica A 191, 1 (1992)
  3. S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, F. Sciortino and H.E. Stanley, Phys. Rev. Lett. 71, 1776 (1993)
  4. R.F. Voss, Phys. Rev. Lett. 68, 3805 (1992)
  5. R.F. Voss, Phys. Rev. Lett. 71, 1777 (1993)
  6. C.-K. Peng, S.V. Buldyrev, A.L. Goldberger, S. Havlin, R.N. Mantegna, M. Simons and H.E. Stanley, Physica A 221, 180 (1995)
  7. M.Ya. Azbel, Phys. Rev. Lett. 75, 168 (1995)
  8. A. Arneodo, E. Bacry, P.V. Graves and J.F. Muzy, Phys. Rev. Lett. 74, 3293 (1995)
  9. H. Feldmann and 96 coauthors, EMBO J. 13, 5795 (1994)
  10. S. Cebrat, M.R. Dudek and A. Rogowska, submitted to Yeast.
  11. F. Crick et al., Nature 192, 1227 (1961)
  12. S. Cebrat and M.R. Dudek, Trends Genet. 12, 12 (1996)
  13. B. Dujon and 106 coauthors, Nature 369, 371 (1994)
  14. B.B. Mandelbrot, Fractals: Form, Chance and Dimension, San Francisco: Freeman (1977)
  15. M.V. Berry and Z.V. Lewis, Proc. R. Soc. Lond. A 370, 459 (1980)
  16. W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., pp. 575-584, Cambridge University Press (1992)
  17. M. Kimura, The Neutral Theory of Molecular Evolution, Cambridge University Press (1983)
  18. J.B.S. Haldane, Natural selection, in Darwin's Biological Work, ed. P.R. Bell, pp. 101-149, Cambridge University Press (1959)
