A Rigorous and Efficient Method for Inferring Phylogenetic Relationships

The central challenge in biology is to understand how information contained in DNA sequences gives rise to the diverse physical properties of living systems. The differences observed between individuals of the same or different species stem largely from differences in their DNA. The research we seek to perform using the APAC National Facility is aimed at improving the biological realism of models of protein-coding DNA sequences.

Principal Investigator

Alexander Isaev
Centre for Bioinformation Science



Facilities Used



Gavin Huttley
Centre for Bioinformation Science

RFCD Codes


Significant Achievements, Anticipated Outcomes and Future Work

Since November 2001, when our project was first allocated time on the APAC National Facility, we have completed calculations for 7-10 operational taxonomic units (OTUs) (2173509 trees) over 4 simulation rounds and obtained a preliminary distribution of the Z-score statistic (termed X in our original application). These calculations relate to project 1 of our original application.
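
As a rough guide to the size of the search space, the number of distinct unrooted, fully bifurcating topologies for n labelled taxa is (2n - 5)!!. The sketch below evaluates that standard formula. The function name is hypothetical, and the report does not say how its tree count was derived, so treating the trees as unrooted and bifurcating is an assumption.

```python
def unrooted_topology_count(n_taxa: int) -> int:
    """Return (2n - 5)!!, the number of unrooted bifurcating topologies.

    Assumption: trees are unrooted, fully resolved and have labelled tips.
    """
    if n_taxa < 3:
        raise ValueError("at least 3 taxa are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in range(7, 11):
        print(n, unrooted_topology_count(n))
    print("total", sum(unrooted_topology_count(n) for n in range(7, 11)))
```

Under these assumptions the total for 7 to 10 taxa is 2,173,500, which is close to but not identical with the tree count quoted above; the source does not explain the difference.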

Z is a measure of the 'true-tree' location, in units of standard deviations, relative to the mean weighted sum of squares determined from all possible tree topologies for a sample. Understanding its distribution is important for designing a fast and statistically rigorous method of phylogenetic reconstruction. From our first round of simulation we established that the number of taxa and the number of paired tips on the tree topology are two major factors influencing Z in a linear fashion. These effects were verified in simulations 2, 3, and 4 with minor deviations (see the attached figures).

Figure 1 in pdf format
Figure 2 in pdf format
Figure 3 in pdf format
Figure 4 in pdf format

These results have led us to formulate what we call the 'breathing tree' model: tree topologies can be transformed into each other through simultaneous shrinking and growing of branches at different places. We anticipate constructing a numerical description of the 'breathing tree' that represents a parametrisation of tree topology. This parametrisation may allow us to compare different evolutionary models numerically, replacing the current parametric bootstrap procedure, which is computationally prohibitive in many instances.

Project 2 will be started during the holiday period.


Computational Techniques Used

The sequence simulation program Seq-Gen and a custom C program, bushGen, were used to generate phylogenetic trees under a number of evolutionary models and to simulate replicate sequence data sets under these models. A custom application (wss) was written in C to calculate the weighted sum of squares for every simulated data set and every tree topology for the corresponding number of OTUs. The mean and variance of the resulting distribution were used to express, as Z, how far the fit of the true tree lies from the mean.
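
The Z calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's C code: the function name and input layout are hypothetical, the sign convention (true-tree value minus mean) is assumed, and the population standard deviation is used on the assumption that every topology has been enumerated.

```python
import statistics


def true_tree_z(wss_by_topology, true_index):
    """Standardise the true tree's weighted sum of squares (WSS).

    wss_by_topology: WSS values, one per tree topology, for one data set.
    true_index: position of the true tree in that list (hypothetical layout).
    """
    mean = statistics.mean(wss_by_topology)
    # Population SD, assuming the list covers all possible topologies.
    sd = statistics.pstdev(wss_by_topology)
    if sd == 0:
        raise ValueError("all topologies fit equally well; Z is undefined")
    return (wss_by_topology[true_index] - mean) / sd


if __name__ == "__main__":
    # Toy values only; these are not results from the project.
    toy_wss = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(true_tree_z(toy_wss, true_index=0))  # prints -1.5
```

With this sign convention, a negative Z means the true tree fits better (has a smaller weighted sum of squares) than the average topology.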

Parallelisation was achieved by submitting different subsets of the simulated data as separate jobs to the normal queue on the APAC National Facility SC and assembling the results after job completion using Python scripts. The number of CPUs available on the SC was critical in obtaining results in such a short time.
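
The split-and-assemble pattern described above might look like the following minimal sketch. It is not the project's actual script: the function names, the `subset_*.tsv` file naming and the two-column tab-separated result format (data set identifier, Z value) are all assumptions made for illustration, and job submission itself is left out.

```python
import csv
from pathlib import Path


def split_into_subsets(items, n_subsets):
    """Deal items round-robin into n_subsets lists, one list per queue job."""
    return [items[i::n_subsets] for i in range(n_subsets)]


def assemble_results(result_dir, pattern="subset_*.tsv"):
    """Merge per-job result files into one dict of data set id -> Z.

    Assumed (hypothetical) file format: one 'id<TAB>z' row per data set.
    """
    merged = {}
    for path in sorted(Path(result_dir).glob(pattern)):
        with path.open(newline="") as handle:
            for dataset_id, z_value in csv.reader(handle, delimiter="\t"):
                if dataset_id in merged:
                    raise ValueError(f"duplicate result for {dataset_id}")
                merged[dataset_id] = float(z_value)
    return merged


if __name__ == "__main__":
    dataset_ids = [f"sim{i:03d}" for i in range(7)]
    for job, subset in enumerate(split_into_subsets(dataset_ids, 3)):
        print(job, subset)
```

Because each data set is processed independently, the subsets need no communication between jobs, so the merge step only has to guard against duplicated or missing results.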