Kernel Methods for Bioinformatics

Splice sites are locations in DNA at the boundaries of exons (which code for proteins) and introns (which do not). The more accurately splice sites can be located, the easier it becomes to identify the coding regions in DNA; accurate splice site sensors are therefore critical components of computational gene finders. As ever-larger stretches of DNA are analyzed by gene finders, the problem of accurate splice site recognition has never been more important. We pose splice site recognition as a classification problem in which the classifier is learned from a labeled data set consisting only of local information around the potential splice site. Finding the correct position of splice sites without using global information is assumed to be a difficult task. We apply Support Vector Machines - a state-of-the-art method not previously tried on this problem - to splice site recognition. We analyze the genomes of C. elegans and of humans using specifically designed support vector kernels. One kernel is adapted from our work on detecting translation initiation sites in vertebrates; another uses a Fisher-kernel-like idea. We find that both approaches perform very well on this task.

Principal Investigator

Alexander Smola
Computer Sciences Laboratory



Facilities Used



James Macnicol
Gunnar Raetsch
Computer Sciences Laboratory

Soeren Sonnenburg
Fraunhofer Institute

RFCD Codes


Significant Achievements, Anticipated Outcomes and Future Work

During the first month of the project we developed additional optimization code for the APAC National Facility Compaq platform, and we are now tuning its performance to exploit the full potential of the available machines.

Despite the short period of time, two publications describing work done on the APAC National Facility have been submitted, or will shortly be submitted, to machine learning conferences. Further details and full paper versions will be made available in due course.


Computational Techniques Used

The algorithm to be implemented consists of two stages. The first involves training a Hidden Markov Model (HMM); we will have to test approximately 500 different settings in order to cover two genomes, handle both acceptor and donor sites, optimize over various HMM architectures, etc. Training each model (we have approximately 100,000 training samples, and better results can be obtained with larger datasets) can take several days on an AMD 1.4GHz platform (which has roughly 70% of the performance of a single 1GHz Alpha CPU). Optimization can be parallelized over up to 4 CPUs with a shared-memory architecture. Assuming roughly 3 days (72 hours) per setting, we estimate a usage of 500 x 3 x 24 = 36,000 CPU hours.
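To illustrate the first stage, the core likelihood computation of an HMM is the forward algorithm. The following is a minimal sketch on a toy two-state DNA model with hypothetical parameters; the actual architectures, genomic training data, and training procedure used on the facility are not reproduced here.

```python
# Toy 2-state HMM over the DNA alphabet; all parameters are
# hypothetical and chosen only to make the sketch concrete.
start = [0.5, 0.5]                       # initial state probabilities
trans = [[0.9, 0.1],                     # state transition matrix
         [0.2, 0.8]]
emit = [{"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},   # AT-rich state
        {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}]   # GC-rich state

def forward(seq):
    """Return P(seq) under the toy HMM via the forward algorithm."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = [start[s] * emit[s][seq[0]] for s in range(2)]
    for symbol in seq[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(2)) * emit[s][symbol]
            for s in range(2)
        ]
    return sum(alpha)
```

Since the model defines a proper distribution, the probabilities of all sequences of a fixed length sum to one, which provides a simple sanity check on an implementation.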

In the second stage we use an interior point algorithm for the SV optimization problem (the specific version is the so-called ν-SVM described in [1]) with a Sherman-Morrison-Woodbury decomposition to deal with the large dense matrices involved. Each interior point run requires roughly 40 multiplications of large M x N matrices, where M ~ 10^6 and N ~ 10^4, which amounts to approximately 40 x M x N^2 = 4 x 10^15 operations, or roughly 930 CPU hours assuming 1.2 GFlops sustained performance per CPU. Including overhead, 1,000 CPU hours per training run is a reasonable estimate. For model selection we will need to train 10-15 models per genome, which leads to another 30,000 CPU hours.
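The Sherman-Morrison-Woodbury identity behind this decomposition, (A + UCV^T)^-1 = A^-1 - A^-1 U (C^-1 + V^T A^-1 U)^-1 V^T A^-1, lets one invert a large M x M matrix at the cost of an N x N system when A itself is cheap to invert. A minimal NumPy check of the identity on small random matrices (the actual interior point solver is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5  # in the project, M ~ 10^6 and N ~ 10^4

# A is large but diagonal (cheap to invert); U, V are tall and thin.
a_diag = rng.uniform(1.0, 2.0, m)
U = rng.standard_normal((m, n))
V = rng.standard_normal((m, n))
C = np.eye(n)

# Direct inverse of the full m x m matrix, for comparison.
direct = np.linalg.inv(np.diag(a_diag) + U @ C @ V.T)

# Sherman-Morrison-Woodbury: only an n x n matrix is inverted.
Ainv = np.diag(1.0 / a_diag)
small = np.linalg.inv(np.linalg.inv(C) + V.T @ Ainv @ U)
smw = Ainv - Ainv @ U @ small @ V.T @ Ainv

assert np.allclose(direct, smw)
```

In the solver, this reduces the per-iteration cost from O(M^3) to the O(M N^2) matrix products quoted above.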

Communication between the nodes will mainly use MPI to scatter and gather results. One node will act as a master, summing the partial terms of the product of an N x M matrix with an M x N matrix (M > N).
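The scatter/gather pattern splits the large inner dimension M into blocks: each worker multiplies one column block of the N x M matrix by the corresponding row block of the M x N matrix, and the master sums the resulting N x N partial products. The pattern can be sketched in pure Python on a toy example (the loop over workers stands in for MPI ranks; the actual code uses MPI on the facility's nodes):

```python
def matmul(A, B):
    """Plain dense product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def partitioned_matmul(A, B, workers):
    """Split the inner (M) dimension across `workers` and sum the partials."""
    M = len(B)
    chunk = (M + workers - 1) // workers
    partials = []
    for w in range(workers):          # each iteration stands in for one MPI rank
        lo, hi = w * chunk, min((w + 1) * chunk, M)
        if lo >= hi:                  # fewer blocks than workers
            continue
        A_block = [row[lo:hi] for row in A]   # N x (hi - lo)
        B_block = B[lo:hi]                    # (hi - lo) x N
        partials.append(matmul(A_block, B_block))
    # master-node gather: elementwise sum of the N x N partial products
    return [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
            for i in range(len(partials[0]))]
```

Because each partial product is only N x N, the gather step is cheap compared to the distributed multiplications.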

[1] Schölkopf B, Smola AJ, Williamson RC, Bartlett PL. New support vector algorithms. Neural Computation 12(5):1207-1245, 2000.