A Rigorous and Efficient Method for Inferring Phylogenetic Relationships
The central challenge in biology is to understand how information contained in DNA sequences gives rise to the diverse physical properties of living systems. The differences observed between individuals of the same and/or different species stem largely from differences in their DNA. The research we seek to perform using the APAC National Facility is aimed at improving the biological realism of models of protein coding DNA sequences.
Principal Investigator Alexander Isaev 
Project x36
Facilities Used SC
Co-Investigator Gavin Huttley
Centre for Bioinformation Science, JCSMR, ANU

RFCD Codes 270208 
Significant Achievements, Anticipated Outcomes and Future Work
Since November 2001, when our project was first allocated time on the APAC National Facility, we have completed calculations for 7-10 operational taxonomic units (OTUs) (2173509 trees) across 4 simulation rounds and obtained a preliminary distribution of the Z-score statistic (termed X in our original application). These calculations relate to project 1 of our original application.
Z measures the location of the 'true tree', in units of standard deviations, relative to the mean of the weighted sums of squares computed over all possible tree topologies for a sample. Understanding its distribution is important for designing a fast and statistically rigorous method of phylogenetic reconstruction. From our first round of simulation we established that the number of taxa and the number of paired tips on the tree topology are the two major factors influencing Z, and that both do so linearly. These effects were verified in simulations 2, 3, and 4 with minor deviations (see the attached figures).
Figure 1 in pdf format
Figure 2 in pdf format
Figure 3 in pdf format
Figure 4 in pdf format
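The definition of Z given above can be illustrated with a minimal Python sketch. It is not the project's code: the function name z_score, the sign convention (true-tree value minus the mean) and the use of the population standard deviation (on the grounds that every topology is enumerated) are assumptions made for illustration, and the input values are invented.

```python
from statistics import mean, pstdev


def z_score(true_tree_wss, all_wss):
    """Distance of the true tree's WSS from the mean WSS, in standard deviations.

    Assumptions (not stated in the report): Z is signed as
    (true - mean) / sd, and sd is the population standard deviation
    because all_wss holds one value for every possible topology.
    """
    mu = mean(all_wss)
    sigma = pstdev(all_wss)
    if sigma == 0:
        raise ValueError("all WSS values are identical; Z is undefined")
    return (true_tree_wss - mu) / sigma


# Invented WSS values for the three unrooted topologies of a 4-taxon sample;
# the first entry plays the role of the true tree.
wss_by_topology = [2.0, 8.0, 8.0]
z = z_score(wss_by_topology[0], wss_by_topology)
print(round(z, 4))
```

Under this sign convention, a strongly negative Z means the true tree fits the data much better than the average topology.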
These results have led us to formulate what we call the 'breathing tree' model: tree topologies can be transformed into each other through simultaneous shrinking and growing of branches at different places. We anticipate constructing a numerical description of the 'breathing tree' that represents a parametrisation of tree topology. This parametrisation may allow us to numerically compare different evolutionary models, replacing the current parametric bootstrap procedure that is in many instances computationally prohibitive.
Project 2 will be started during the holiday period.
Computational Techniques Used
The sequence simulation program Seq-Gen and a custom C program, bushGen, were used to generate phylogenetic trees under a number of evolutionary models and to simulate replicate sequence data sets under these models. A custom application (wss) was written in C to calculate the weighted sum of squares for every simulated data set and every tree topology for the corresponding number of OTUs. The mean and variance of the resulting distribution were used to express, as Z, how far the fit of the true tree lies from the mean.
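The weighted-sum-of-squares calculation can be sketched as follows. This is a Python illustration rather than the C program wss itself. The report does not state the weighting scheme, so the weights 1/d^power with power = 2 (a common least-squares choice) are an assumption, as are the function name and the distance matrices.

```python
from itertools import combinations


def weighted_sum_of_squares(observed, patristic, power=2.0):
    """Sum over taxon pairs of (d_ij - p_ij)^2 / d_ij^power.

    observed  : symmetric matrix of pairwise distances estimated from sequences
    patristic : symmetric matrix of path-length distances implied by one tree
    power     : exponent of the weighting; 2.0 is an assumed default
    """
    n = len(observed)
    total = 0.0
    for i, j in combinations(range(n), 2):
        d = observed[i][j]
        p = patristic[i][j]
        if d == 0:
            # A zero observed distance would make the weight undefined.
            raise ValueError(f"observed distance between taxa {i} and {j} is zero")
        total += (d - p) ** 2 / d ** power
    return total


# Invented distances for three taxa; only the pair (0, 2) is fitted imperfectly.
observed = [[0.0, 0.2, 0.5],
            [0.2, 0.0, 0.4],
            [0.5, 0.4, 0.0]]
patristic = [[0.0, 0.2, 0.4],
             [0.2, 0.0, 0.4],
             [0.4, 0.4, 0.0]]
wss_value = weighted_sum_of_squares(observed, patristic)
print(wss_value)
```

Repeating this for every topology of a given data set yields the distribution whose mean and variance enter Z.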
Parallelisation was achieved by submitting different subsets of the simulated data to the normal queue on the APAC National Facility SC and assembling the results after job completion using Python scripts. The number of CPUs available on the SC was critical in obtaining results in such a short time interval.
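The split-and-assemble pattern described above might look like the following Python sketch. The actual scripts and the queue submission commands are not given in the report, so the function names, the chunking rule and the stand-in for job output are all hypothetical; no batch-system calls are shown.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def merge_results(per_job_results):
    """Combine per-job {dataset_id: value} mappings, rejecting duplicate ids."""
    merged = {}
    for result in per_job_results:
        overlap = merged.keys() & result.keys()
        if overlap:
            raise ValueError(f"data sets reported by more than one job: {sorted(overlap)}")
        merged.update(result)
    return merged


# Ten hypothetical data set ids divided among four independent jobs.
dataset_ids = list(range(10))
chunks = split_into_chunks(dataset_ids, 4)

# Stand-in for the output each job would write; real jobs would run the
# WSS calculation on their own subset.
per_job = [{i: float(i) for i in chunk} for chunk in chunks]
merged = merge_results(per_job)
print(chunks)
```

Because each subset is processed independently, the jobs need no communication, and the only serial step is the final merge.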