Information Efficient NMR Assignment for Large Proteins


Principal Investigator

Andrew Torda

Research School of Chemistry


James Proctor

Research School of Chemistry


x08 - PC

Many unpleasant combinatorial problems exist in the various facets of protein chemistry. Almost
all of these are due to the large size and varied composition of biopolymers.

This project initially aimed to examine one such problem: automated assignment of NMR spectra for large proteins. Specifically, there was a niche application based on very large proteins. The aim was to find a minimal set of assignments which would allow useful chemistry to be done. Early implementations caused us to modify the aims to include more complete assignments using more information.

The question follows : how much more information is required to just make the spectral assignment possible? This is both methodological and quantitative in its enquiry, and relates to other problems in protein structure. The abstraction of the problem is that of finding all maximal common subgraphs (MCS), a well known NP hard problem. Under this abstraction, the problem of pairwise protein structure comparison (determining common structural fragments) becomes identical to the NMR assignment problem. This provides interesting overlap with other work being carried out in our group relating to sequence aligment.



What are the results to date and the future of the work?

We now have a basic implementation of a maximal common subgraph algorithm using a search method employing clique detection. This has entailed developing sundry data-preprocessing and visualization functions. We are now looking at the relation between the graph-theoretic MCS problem and well known structural alignment methodologies based around classical dynamic programming. This should lead to a hybridised heuristic approach to the MCS problem.

Testing and experimentation at this stage has been limited to small systems due to memory limitations of our own workstations. Development is reaching the point that testing of the assignment and structural comparison applications for large proteins will be required. At this stage, one should be better able to assess the success of the MCS abstraction.

Appendix A -



What computational techniques are used?

Apart from certain data-processing routines, all code is locally written. It is entirely in standard C, and employs a grab-bag of network algorithms based on tree searches and a variety of sorting and labelling methodologies.

- Appendix A