Recent innovations in sequencing technologies have lowered both the cost and the labour needed to produce a genomic sequence, especially for bacteria. The main consequence is the availability of many sequences from related species, which has led to the concept of comparative genomics: differences between closely related species (or even strains) can be linked to phenotypic differences. These differences can be found at the nucleotide level or at the protein level; the two approaches can be applied together and linked to give a complete and coherent picture of the sources of genomic and phenotypic variation in a given set of species or strains.
Cellular functions are mainly carried out by proteins and include enzymatic activities related to metabolism, defense and attack mechanisms, structural functions, regulation of cellular activity and so on. Since experimental validation of protein function cannot easily be scaled to a whole genome, computational approaches can be used to infer protein functions and thereby annotate the target proteome. Such annotation can then be transferred to related species using the orthology paradigm, under which proteins derived from the same common ancestor are assumed to share the same function.
Since the annotation process relies on many algorithms and databases, each following a different paradigm, an ortholog-centric protein database can be used to tie all the information together and so allow a quick meta-annotation of each orthologous group in the target genomes.
Orthologs-proteins database structure
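The idea of an ortholog-centric database can be illustrated with a minimal sketch. It is not the actual schema used here: the table names, column names, protein identifiers, annotation sources and terms below are all hypothetical placeholders, chosen only to show how annotations from different sources can be pooled per orthologous group and how unannotated members of a group can be found.

```python
import sqlite3

# Hypothetical schema: one row per orthologous group, per protein, and per
# (protein, annotation source, term) triple.
SCHEMA = """
CREATE TABLE ortholog_group (group_id TEXT PRIMARY KEY);
CREATE TABLE protein (
    protein_id TEXT PRIMARY KEY,
    genome     TEXT NOT NULL,
    group_id   TEXT REFERENCES ortholog_group(group_id)
);
CREATE TABLE annotation (
    protein_id TEXT NOT NULL REFERENCES protein(protein_id),
    source     TEXT NOT NULL,
    term       TEXT NOT NULL
);
"""


def build_example_db():
    """Create an in-memory database filled with made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO ortholog_group VALUES ('OG1')")
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", [
        ("gA_0001", "genomeA", "OG1"),
        ("gB_0417", "genomeB", "OG1"),
        ("gC_0032", "genomeC", "OG1"),
    ])
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", [
        ("gA_0001", "domain_db", "DNA-binding domain"),
        ("gB_0417", "pathway_db", "nitrogen metabolism"),
    ])
    return conn


def meta_annotation(conn, group_id):
    """Pool the annotations of all members of a group, keyed by source."""
    rows = conn.execute(
        """SELECT DISTINCT a.source, a.term
           FROM annotation AS a
           JOIN protein AS p ON p.protein_id = a.protein_id
           WHERE p.group_id = ?
           ORDER BY a.source, a.term""",
        (group_id,),
    )
    pooled = {}
    for source, term in rows:
        pooled.setdefault(source, []).append(term)
    return pooled


def unannotated_members(conn, group_id):
    """List group members that have no annotation of their own."""
    rows = conn.execute(
        """SELECT p.protein_id
           FROM protein AS p
           WHERE p.group_id = ?
             AND NOT EXISTS (SELECT 1 FROM annotation AS a
                             WHERE a.protein_id = p.protein_id)
           ORDER BY p.protein_id""",
        (group_id,),
    )
    return [row[0] for row in rows]
```

Calling meta_annotation(conn, "OG1") on the example database pools the terms found for any member of the group, and unannotated_members(conn, "OG1") lists the proteins that would receive those terms only through orthology. Keeping the group as the central key is what lets results from unrelated tools be joined without the tools knowing about each other.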
Regulatory networks analysis
Not all proteins are needed all the time: expressing protein products that are not needed can cause side effects and can dramatically reduce the metabolic efficiency of a cell, especially a bacterial one. To ensure that protein products are expressed only when needed, a particular class of proteins, called regulators, can bind a specific nucleotide region around the regulated gene in order to activate or repress its expression. Moreover, the regulated gene can itself encode another regulator, giving rise to complex regulatory networks with complex dynamic behaviors.
Such networks can be studied with many different methods, which share a common first step: identifying the nucleotide sequence of the binding motif. Methods such as PWMs (Position Weight Matrices) and HMMs (Hidden Markov Models) can then be applied to find occurrences of such motifs across the entire genome.
Example of a regulatory network in three closely related genomes
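The PWM approach mentioned above can be sketched in a few lines. This is a simplified illustration, not the method implemented here: it assumes a uniform background (0.25 per base), a pseudocount of 1, forward-strand scanning only, and a user-chosen score threshold. The function names are hypothetical, and any binding sites passed in would come from the motif identification step.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned binding sites.

    Each column maps a base to log2(observed frequency / background),
    with a pseudocount so that unseen bases get a finite score.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in ALPHABET
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of a window of PWM width."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, window, score) for every window scoring >= threshold.

    Only the forward strand is scanned; windows containing characters
    outside ACGT (for example N) are skipped.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in ALPHABET for base in window):
            continue
        window_score = score(pwm, window)
        if window_score >= threshold:
            hits.append((start, window, window_score))
    return hits
```

A typical use would be pwm = build_pwm(sites) followed by scan(pwm, genome_sequence, threshold), where the threshold trades sensitivity against false positives. A genome-wide search would normally also scan the reverse complement and use a background model estimated from the genome itself rather than a uniform one.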