Finding New Overlapping Genes and Their Theory (FOG Theory)

The general goal of the project “Finding new overlapping genes and their theory (FOG Theory)” is to find and verify new overlapping protein-coding DNA sequences in prokaryotes, to understand the underlying coding characteristics, and to study their origin and evolution with the help of models from information and communication theory.

In the first part of the joint project, we showed that bacterial genomes contain many non-random long open reading frames, which could be overlapping genes. Indeed, a multitude of overlapping gene candidates have been identified in more than 50 bacterial genomes using data analysis techniques. Experimental work, using EHEC bacteria as a model organism, has revealed several transcriptional units which could be overlapping genes.
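To make the notion of a long open reading frame concrete, the following minimal sketch scans all six reading frames of a DNA sequence for stop-free stretches. It is only an illustration and not the project's actual method: the function names, the stop-to-stop definition of a reading frame, and the `min_codons` threshold are assumptions made for this example.

```python
# Illustrative sketch only; names and the stop-to-stop ORF definition are assumptions.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def stop_free_stretches(seq, min_codons):
    """Find stop-free stretches of at least min_codons codons in all six frames.

    Returns tuples (strand, frame, start, end). Coordinates are half-open and
    refer to the scanned strand, so for strand "-" they are positions on the
    reverse complement of seq.
    """
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOPS:
                    # A stop codon closes the current stretch.
                    if (pos - run_start) // 3 >= min_codons:
                        results.append((strand, frame, run_start, pos))
                    run_start = pos + 3
                pos += 3
            # Keep a stretch that runs to the end of the sequence.
            if (pos - run_start) // 3 >= min_codons:
                results.append((strand, frame, run_start, pos))
    return results
```

Two stretches found at overlapping genome positions in different frames or on opposite strands would be the kind of candidate pair the text refers to; judging whether such a stretch is non-random requires a statistical model that this sketch does not include.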

In the next period, we aim to use our previously acquired knowledge to

  • identify the characteristics and peculiarities of overlapping genes for computational prediction and expert assessment using visualization,
  • experimentally characterize the biological function of selected overlapping gene targets, and
  • establish constraints relevant for the evolutionary origin of overlapping genes.

The project “Finding new overlapping genes and their theory (FOG Theory)” is a joint project of three groups: the Data Analysis and Visualization Group at the University of Konstanz, the Institute of Telecommunications and Applied Information Theory (TAIT) at Ulm University, and the Department of Microbiology at the Central Institute for Food and Nutrition Research (ZIEL) at Technische Universität München. It is part of the priority programme “Information and Communication Theory in Molecular Biology” (InKoMBio) of the German Research Foundation (DFG).

Subproject: Visual Analysis of Next-Generation-Sequencing Data

Next-generation sequencing (NGS) technologies make it possible to sequence large amounts of DNA in a short time and at low cost. The technique is used not only to sequence whole genomes but also to sequence, indirectly, the mRNA (the transcriptome) of a cell. mRNA is transcribed from genes and can be considered a blueprint of a gene which is used to build the protein this gene encodes. Thus, any transcribed genome region can be suspected to contain a protein-coding gene. If mRNA is sequenced in this indirect way, the resulting sequences can be mapped back to the genome. This makes it possible not only to identify genes which are active under a specific condition but also to identify new genes and even to estimate the amount of transcribed mRNA. Therefore, NGS is the method of choice for identifying new overlapping genes.

Since NGS has length limitations, it is not possible to sequence an mRNA over its whole length. The mRNA therefore needs to be fragmented before it is sequenced. Because sampling effects occur when these fragments are sequenced, mapping the sequenced fragments (so-called reads) back to the genome results in a rugged coverage over the transcribed region.

Thus, besides the large amounts of data generated in NGS (this can be more than one million reads per experiment), these rugged coverage profiles are a serious challenge in the analysis of NGS data.
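As a minimal sketch of what coverage means here, the following example counts, for every genome position, how many mapped reads overlap it. The function name and the representation of a read as a half-open `(start, end)` interval are assumptions for illustration; real pipelines typically work on alignment files rather than plain tuples.

```python
# Illustrative sketch only; reads are assumed to be half-open (start, end)
# intervals in genome coordinates with 0 <= start < end <= genome_length.
def coverage(genome_length, reads):
    """Return a list with the number of reads covering each genome position."""
    # Difference array: +1 where a read begins, -1 just after it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1

    # A running sum over the difference array gives the per-position depth.
    profile = []
    depth = 0
    for delta in diff[:genome_length]:
        depth += delta
        profile.append(depth)
    return profile


# Four reads over a toy genome of ten positions.
example = coverage(10, [(0, 4), (2, 6), (2, 5), (8, 10)])
```

Even in this toy case the depth varies from position to position and drops to zero inside the region, which is the kind of ruggedness that makes it hard to decide where a transcribed region begins and ends.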

Novel data analysis and visualization methods are therefore required to allow the biological experimenter to understand the results. Especially for overlapping gene candidates, a subsequent inspection of the results by an expert is needed.

To ease the analysis of next-generation sequencing data, especially with respect to overlapping gene candidates, we developed a new visual analytics (VA) system. The system determines interesting regions according to a user-defined interestingness function. A genome overview bar makes it easy to detect the most interesting cases and helps the analyst deal with the large amount of data. In addition, we visualize the transcription coverage within the open reading frame (ORF) representations so that the coverage can be related to an ORF more easily.
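The idea of a user-defined interestingness function can be sketched as follows. This is a hypothetical illustration, not the interface of the VA system: the function names, the example score (mean coverage weighted by the fraction of covered positions), and the representation of regions as half-open `(start, end)` intervals are all assumptions.

```python
# Illustrative sketch only; the scoring function and all names are hypothetical.
def mean_coverage(profile, start, end):
    """Average read depth over the half-open region [start, end)."""
    return sum(profile[start:end]) / (end - start)


def covered_fraction(profile, start, end):
    """Fraction of positions in [start, end) covered by at least one read."""
    covered = sum(1 for depth in profile[start:end] if depth > 0)
    return covered / (end - start)


def example_interestingness(profile, start, end):
    """One possible score: high and gap-free coverage ranks first."""
    return mean_coverage(profile, start, end) * covered_fraction(profile, start, end)


def rank_regions(profile, regions, interestingness):
    """Score each region with the supplied function and sort best first."""
    scored = [(interestingness(profile, s, e), (s, e)) for s, e in regions]
    return sorted(scored, key=lambda item: item[0], reverse=True)


profile = [1, 1, 3, 3, 2, 1, 0, 0, 1, 1]
ranked = rank_regions(profile, [(6, 10), (0, 6)], example_interestingness)
```

Passing the scoring function as an argument keeps the ranking independent of any particular notion of interest, so an analyst could swap in a different score without changing the rest of the code.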

This work has been partly funded by the German Research Foundation (DFG) under the grant SPP 1395 (Information and Communication Theory in Molecular Biology, InKoMBio), project "Finding new overlapping genes and their theory (FOG Theory)".