Speaker: Prof. Mark Borodovsky
Title: Gene Identification Algorithms: Local and Global Approaches
Date: March 28, 2003
Time: 3:00 pm
Location: GCATT Room 325
Abstract:

Within the first decade of the 21st century, the collection of complete genomes will be measured in the hundreds. Yet these "books of life" will carry many enigmatic "words" and "sentences". The problem of interpreting DNA was encountered as soon as the first DNA sequence was determined in the late 1970s. Over the years, many algorithms for DNA sequence interpretation, and gene finding algorithms in particular, have been developed.

I will talk about ab initio gene finding methods. These methods are capable of predicting genes that cannot be identified by similarity search. The ab initio methods GeneMark and GeneMark.hmm were developed at Georgia Tech. They have been used to annotate several complete prokaryotic and eukaryotic genomes, such as Haemophilus influenzae, Methanococcus jannaschii, Helicobacter pylori, Escherichia coli, Arabidopsis thaliana, etc. Currently, the use of GeneMark and GeneMark.hmm is documented in the GenBank database by 21,941 citations in the DNA section and 29,512 citations in the protein section.

The GeneMark algorithm computes the coding potential of a DNA sequence (the a posteriori probability that it carries protein code) within a rather short sliding window, based on inhomogeneous three-periodic Markov models of coding regions and homogeneous models of non-coding regions. Operating on short DNA segments individually is what makes GeneMark a local approach.

The GeneMark.hmm algorithm, in contrast, analyzes a whole DNA sequence at once. It introduces possible functional (hidden) states for each nucleotide, and the boundaries of coding regions are then predicted as transitions between hidden states along the most likely path through the hidden Markov model for the given DNA sequence, which may be several megabases long. In the ideal case, when the nucleotide sequence has no errors, GeneMark.hmm finds the boundaries between coding and non-coding regions, such as gene 5' and 3' ends and exon-intron junctions, more accurately than GeneMark. However, GeneMark.hmm is more sensitive to sequence errors, which may corrupt predictions over a rather long region. The output of the GeneMark program, by comparison, is more stable and helps detect sequence errors that produce artifactual frameshifts. Both GeneMark and GeneMark.hmm can be used for higher and lower eukaryotes, bacteria and archaea, eukaryotic viruses and phages: http://opal.biology.gatech.edu/GeneMark/.
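
As a rough illustration of the local approach described above, the following Python sketch computes a coding potential for each position of a sliding window. It is a minimal toy, not the GeneMark implementation: the probability tables are invented, the three-periodic model is reduced to zero order (one nucleotide distribution per codon position instead of a higher-order Markov chain), only the direct strand is considered, and the names `CODING`, `NONCODING`, `coding_potential` and `scan` are hypothetical.

```python
import math

# Hypothetical toy parameters, NOT GeneMark's trained values.
# Zero-order stand-in for a three-periodic model: one nucleotide
# distribution per codon position (1, 2, 3).
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},  # codon position 1
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},  # codon position 2
    {"A": 0.15, "C": 0.30, "G": 0.35, "T": 0.20},  # codon position 3
]
# Homogeneous (position-independent) model of non-coding sequence.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_lik_coding(window, frame):
    """Log-likelihood of the window under the coding model in one frame."""
    return sum(math.log(CODING[(i + frame) % 3][base])
               for i, base in enumerate(window))


def log_lik_noncoding(window):
    """Log-likelihood of the window under the non-coding model."""
    return sum(math.log(NONCODING[base]) for base in window)


def coding_potential(window, prior_coding=0.5):
    """Posterior probability that the window is coding in any of 3 frames."""
    # The coding prior is split equally across the three frames (assumption).
    logs = [math.log(prior_coding / 3) + log_lik_coding(window, f)
            for f in range(3)]
    logs.append(math.log(1 - prior_coding) + log_lik_noncoding(window))
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logs]
    return sum(weights[:3]) / sum(weights)


def scan(seq, window=12, step=3):
    """Slide a window along seq and report (start, coding potential)."""
    return [(start, coding_potential(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```

Calling `scan(sequence)` yields one posterior value per window, and each value depends only on the bases inside that window. This is why an isolated sequence error disturbs only the windows that overlap it.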
The general theory of GeneMark and GeneMark.hmm can be explained in terms of posterior decoding (the forward-backward algorithm) and the Viterbi algorithm.
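
To make the two decoding schemes named above concrete, here is a minimal Python sketch on a two-state hidden Markov model. Everything in it is an assumption made for illustration: the states `N` (non-coding) and `C` (coding), the transition and emission tables, and the function names `viterbi` and `posterior` are hypothetical and far simpler than the models used in GeneMark.hmm.

```python
import math

# Hypothetical two-state toy HMM; all values are invented for illustration.
STATES = ("N", "C")  # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Most likely state path for the whole sequence, in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for x in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][x]))
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


def posterior(seq):
    """Per-position state probabilities via the forward-backward algorithm.

    Unscaled probabilities are used, which is adequate only for short toy
    inputs; long sequences would need scaling or log-space arithmetic.
    """
    fwd = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    for x in seq[1:]:
        fwd.append({s: EMIT[s][x] * sum(fwd[-1][r] * TRANS[r][s]
                                        for r in STATES)
                    for s in STATES})
    bwd = [{s: 1.0 for s in STATES}]
    for x in reversed(seq[1:]):
        bwd.insert(0, {r: sum(TRANS[r][s] * EMIT[s][x] * bwd[0][s]
                              for s in STATES)
                       for r in STATES})
    total = sum(fwd[-1].values())  # probability of the whole sequence
    return [{s: f[s] * b[s] / total for s in STATES}
            for f, b in zip(fwd, bwd)]
```

For a sequence `seq`, `viterbi(seq)` returns one label per nucleotide, and a region boundary is any position where the label changes. `posterior(seq)` instead returns a probability for each state at each position, which is closer in spirit to a per-position coding potential.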

Biography:

Prof. Mark Borodovsky received his Ph.D. in Applied Mathematics from the Institute of Physics and Technology in Moscow, Russia, in 1976. He joined the School of Biology at Georgia Tech in 1992. Prof. Borodovsky serves as Adjunct Professor at both the College of Computing at Georgia Tech and the School of Medicine at Emory University. His research is focused on computational analysis of structural and functional aspects of DNA sequences. He is interested in developing new computationally tractable models that combine observable and hidden variables (such as hidden Markov models). These models are trained and tested on already available data and produce new verifiable predictions. The overall aim is to develop new mathematical methods and computer algorithms that reveal new complex mechanisms of genetic information transfer.

Slides: