Algorithmic problems
in Bioinformatics
There has been a tremendous thrust in Bioinformatics research over the
last decade. Being a relatively new discipline, Bioinformatics offers a plethora
of challenging problems. I list below problems that we have started work
on. Although the work done so far involved Signal Processing techniques,
I am very interested in the application of traditional algorithmic techniques
in this area.
Projects:
- Prediction of Protein
Coding Regions in DNA sequences Using Fourier Spectral Characteristics
with Professor Amir Asif
accepted for publication in ICASSP 2005 (pdf file). Also
invited at Asilomar 2004.
An earlier version (with Haoyuan Wang) was published in the bioinformatics
track at MSE-2004.
ABSTRACT :
The exponential growth in the sequencing of genomes has fueled interest in
fast, automated decoding of inherent genomic information from these sequences.
The identification of protein coding regions in DNA sequences is a fundamental
step in computational recognition of genes. Traditional Discrete Fourier
transform (DFT) based approaches [Anastassiou01b,TiwariRBBR97] exploit the
empirical observation that the spectrum of protein coding DNA of length
$N$ nucleotides typically has a peak at frequency k = N/3
corresponding to a periodicity of 3, which is the length of a DNA codon.
We prove the aforementioned and several other empirical observations attributed
to DNA sequences. In addition, we propose an algorithm that predicts genes
based on short-term Fourier spectra. The recall and accuracy of the algorithm
are measured by running the algorithm over a large dataset. Our algorithm
improves on existing algorithms in several ways. We show that it is computationally
more efficient than existing algorithms based on short-term Fourier Transforms
[Anastassiou01b,TiwariRBBR97]. Being deterministic, our algorithm produces
an unique answer as opposed to the randomized algorithms in the Genomic Signal
Processing literature. We motivate and demonstrate the advantages of the
use of windows other than the rectangular ones used so far in the literature.
REFERENCES:
[Anastassiou01b] D.~Anastassiou. Genomic signal processing. IEEE Signal
Processing Magazine, July 2001.
[TiwariRBBR97] S. Tiwari and S. Ramachandran and A. Bhattacharya and S.
Bhattacharya and R. Ramaswamy.
Prediction of probable genes by Fourier analysis of genomic sequences. Computer
Applications in Biosciences,
volume 13, pages 263-270, 1997.
- New Clustering Algorithms for Microarray Data (work in progress)
ABSTRACT :
Microarrays are a recent technology for studying in a single experiment
how many genes express in a wide variety of conditions. Microarrays can be
used to infer the functions of a gene and to predict the roles of a
gene in causing a disease. A large amount of gene expression data has become
available from microarray experiments in recent years. However, finding patterns
in the data produced in microarray experiments is a challenging task, both
due to its large volume and due to the lack of prior knowledge of the patterns
being searched for. I have started working on an important subproblem --
cluster analysis of microarray data. Our approach differs from the large
body of existing work in that we use signal processing techniques to detect
important changes in the microarray data and use the locations of these changes
to do the clustering. A similar idea was investigated in [Syeda]; our approach
differs from this paper in that we use a more general notion of changes and
our clustering algorithm is hierarchical.
REFERENCES:
[Syeda] Tanveer Syeda-Mahmood. Clustering Time-varying Gene Expresssion Profiles
Using Scale-space Signals. Proceedings of the Computational Systems Bioinformatics
(CSB), 2003.
Note: I am often asked about how much Biology and Chemistry students need
to know to do a project in this area. For 4080 students, I will (most likely)
learn the Biological aspects and allow the student to focus on only the computational
aspects (of course a student is welcome to learn the Biological aspects).
M.Sc. students working in this area do need to know a little bit more, but
that can be picked up by self-study. There are no graduate courses suitable
for such students but we are working towards designing one.
Resources (for York students):
- There is a very brief introduction here
and a more detailed introduction here.
- There are several articles on Genomics on this
page.
- A course
page with lecture notes and links to some nice papers.
- Here is a list of On-line
bioinformatics courses and tutorials.
- For an introduction to Genomic Signal Processing, look at this
page. You should be able to download the article from any computer at
York.
- There are some good lecture notes at this page.
- For microarrays, visit this page
for a nice introduction, and do not worry if you do not understand all the
Biology. The problem I am interested in is to design algorithms for analyzing
the data. This
page has a brief description of some the data mining/analysis problems
involved.
Please send me email if you want more information about these projects.
Back to my homepage