Information Content in Genomic Sequences
Nucleotide distribution in genomes is known to be non-random, showing overrepresentation of specific motifs, long- and short-range correlations, periodicities, and large-scale compositional heterogeneities. Little is known about the possible functional effects of nucleotide distributions on the conformational landscape of DNA. Nevertheless, it is known that the physicochemical properties of nucleotide stacking are fundamental in determining double-helix dynamics and are critical factors for the transmission of the message encoded in DNA.
A vast literature indicates a relationship between transcription processes and the base composition of the associated regulatory sequences. It is still an open question whether GC content plays a role in determining promoter performance, since it is related to DNA conformational properties such as flexibility, thermal stability, and the opening of the bubbles that permit transcription. Consequently, we are interested in deviations from randomness in nucleotide distribution patterns, as well as in the information content, or 'complexity', encoded in the symbolic representation of sequences upstream of the Transcription Start Site (TSS).
Analysis of regulatory regions
We are investigating DNA sequence heterogeneities at a local level, in functional regions such as promoters, in order to identify possible evolutionary constraints on their structure. We hypothesize that nucleotide distribution patterns correlate with putative differential selection pressures deriving from structural and/or functional constraints.
Base Composition Analysis in Vertebrates. Promoter sequences from -1000 bp to -1 bp relative to the TSS in (A) Danio rerio, (B) Xenopus tropicalis, (C) Monodelphis domestica, (D) Gallus gallus, (E) Canis familiaris, (F) Mus musculus, (G) Bos taurus, (H) Pan troglodytes.
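A positional base composition profile of this kind can be sketched as follows. This is a minimal illustration, not the pipeline used for the figure: it assumes that all promoter sequences have the same length and are aligned so that the last character corresponds to position -1 relative to the TSS. The function name `positional_base_frequencies` and the toy sequences are hypothetical.

```python
from collections import Counter


def positional_base_frequencies(promoters):
    """Return one dict per position with the fractions of A, C, G and T.

    Assumption: sequences are equal-length and aligned on the TSS, so that
    index 0 is the most upstream base and the last index is position -1.
    """
    length = len(promoters[0])
    if any(len(seq) != length for seq in promoters):
        raise ValueError("all promoter sequences must have the same length")

    profile = []
    for i in range(length):
        counts = Counter(seq[i].upper() for seq in promoters)
        # Only unambiguous bases contribute; N and other symbols are ignored.
        total = sum(counts[base] for base in "ACGT")
        profile.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return profile


# Toy data standing in for real promoters (4 bp instead of 1000 bp).
toy_promoters = ["ACGT", "ACGA", "TCGA", "GCGA"]
profile = positional_base_frequencies(toy_promoters)

# Position labels relative to the TSS: -4, -3, -2, -1.
positions = list(range(-len(toy_promoters[0]), 0))
gc_content = [p["G"] + p["C"] for p in profile]
```

Plotting `gc_content` (or each base fraction) against `positions` gives one curve per organism; for real data the sequences would span -1000 bp to -1 bp.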
Positional Shannon Entropy (hn) for 8-bp substrings in different organisms: comparison between real promoters and surrogate sequences (orange). X-axis: nucleotide position relative to the TSS. Y-axis: Positional Shannon Entropy.
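The text does not give the exact definition of the positional entropy or of the surrogate sequences, so the following is only one plausible reading, stated as an assumption: for each position, the 8-bp substrings starting there are collected across all aligned promoters, and the Shannon entropy (in bits) of their empirical distribution is computed. Surrogates are assumed here to be per-sequence random shuffles, which preserve base composition but destroy positional order. The names `positional_entropy` and `shuffled_surrogates` are hypothetical.

```python
import math
import random
from collections import Counter


def positional_entropy(seqs, k=8):
    """Shannon entropy (bits) of the k-mer distribution at each start position.

    Assumption: sequences are equal-length and aligned on the TSS.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    entropies = []
    for i in range(length - k + 1):
        counts = Counter(s[i:i + k] for s in seqs)
        n = sum(counts.values())
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    return entropies


def shuffled_surrogates(seqs, seed=0):
    """Shuffle each sequence independently (assumed surrogate model)."""
    rng = random.Random(seed)
    surrogates = []
    for s in seqs:
        chars = list(s)
        rng.shuffle(chars)
        surrogates.append("".join(chars))
    return surrogates


# Toy example with k=2 in place of the 8-bp substrings used in the figure.
toy = ["ACGT", "TGCA"]
h_real = positional_entropy(toy, k=2)
h_surrogate = positional_entropy(shuffled_surrogates(toy), k=2)
```

With real data, `k=8` and a large promoter set, comparing `h_real` with `h_surrogate` position by position would indicate where real promoters deviate from the shuffled baseline. Other surrogate models (for example, shuffles that preserve dinucleotide frequencies) would give a different baseline.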