Principal Investigator Henry A Nix Project q56

Centre for Resource and Environmental Studies Machine VP

Co-Investigators David B Singleton, Lee Belbin and Janet L Stein

ANU Supercomputer Facility, CSIRO Division of Wildlife and Ecology and Centre for Resource and Environmental Studies

Environmental Domain Analysis With Very Large Datasets: Evaluation of Attributes, Weightings, Classificatory Strategies

Environmental domain analysis focuses on those abiotic attributes of the environment that modulate physical processes and biological responses. The objective is to provide a consistent framework for the inventory, evaluation, planning and management of land and water resources and biota. Spatially referenced datasets can be developed relatively quickly and cost-effectively at a range of scales from continental to local using new methods of data estimation (digital elevation modelling; climate surface estimation; substrate attribute estimation). The use of the non-hierarchical clustering algorithm, ALOB, developed by Lee Belbin, allows classification of relatively large datasets. However, evaluation of alternate classification parameters and attribute weightings has been limited by the size of the datafiles involved and the time required for a single classification run.

What are the basic questions addressed?

The project assesses a number of alternative strategies for generating environmental domains for Australia from a gridded dataset of environmental attributes at various scales, primarily at 1/40th degree resolution. Firstly, what is the envelope of viable classification parameters? Secondly, what is the effect of varying the set of attributes selected, the weighting of these attributes, and the dissimilarity measure chosen? Thirdly, how do classifications based on individual grid-cells compare with those based on grid-cell aggregations (e.g. to sub-catchments)?
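The interaction between attribute weighting and the dissimilarity measure can be illustrated with a small sketch. It assumes a range-standardised, weighted Manhattan measure (a weighted form of the Gower metric); the text does not state which measures were compared, and the function name, attribute values, ranges and weights below are all hypothetical.

```python
def weighted_gower(a, b, ranges, weights):
    """Weighted, range-standardised Manhattan dissimilarity between two objects.

    Each attribute difference is divided by that attribute's range so that
    attributes in different units are comparable, then multiplied by its
    weight. Illustrative only; not taken from the ALOB source.
    """
    total_weight = sum(weights)
    return sum(
        w * abs(x - y) / r for x, y, r, w in zip(a, b, ranges, weights)
    ) / total_weight


# Two hypothetical grid cells described by (mean temperature, annual rainfall).
cell_a = (20.0, 500.0)
cell_b = (25.0, 1000.0)
attribute_ranges = (10.0, 2000.0)  # assumed range of each attribute

equal = weighted_gower(cell_a, cell_b, attribute_ranges, (1.0, 1.0))
temperature_heavy = weighted_gower(cell_a, cell_b, attribute_ranges, (3.0, 1.0))
print(equal, temperature_heavy)  # 0.375 0.4375
```

Raising the weight on the first attribute moves the two cells further apart because they differ more, relative to range, in that attribute. This is the kind of sensitivity the second question is concerned with.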

What are the results to date and the future of the work?

Comparing classifications generated using the original and modified versions of ALOB yielded parameter settings that provide a reasonable compromise between the efficiency of runs employing 100% vector-mode reallocations and the original algorithm.

After resolving a finite precision problem that appeared only with very large datasets, a series of runs was undertaken on Australia-wide datasets using three combinations of environmental attributes (climate only; climate and terrain; and climate, terrain and substrate) and using both grid-cells and sub-catchments as objects. When included, substrate attributes dominated the resulting classifications, even at higher order groupings.

Future work will examine new ways of including substrate characteristics. A new continental database is being developed at a much finer resolution (250 m grid spacing) and will produce datasets at least 100 times larger. Classification of these datasets will provide a significant test of the techniques developed to date. The ALOB program is also being recoded for the CM-5 to take advantage of parallel processing.

What computational techniques are used and why is a supercomputer required?

The original algorithm incorporates four phases: i) seed generation, sampling domain cells from the dataset based on the association between objects being classified and a user-supplied threshold; ii) fixed allocation, where each object is sequentially allocated to the nearest seed; iii) centroid re-definition, based on the objects assigned to each group; and iv) iteration, where objects are extracted from their groups and relocated to the nearest group.
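The four phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ALOB code: it assumes a plain Manhattan distance, assumes that an object becomes a seed when it lies further than the threshold from every existing seed, and leaves single-member groups untouched during iteration so that no group becomes empty. All function names are hypothetical.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(obj, centres):
    """Index of the centre closest to obj (first one wins on ties)."""
    return min(range(len(centres)), key=lambda g: manhattan(obj, centres[g]))


def generate_seeds(objects, threshold):
    """Phase i (assumed rule): an object becomes a seed if it is further
    than `threshold` from every seed chosen so far."""
    seeds = []
    for obj in objects:
        if all(manhattan(obj, s) > threshold for s in seeds):
            seeds.append(list(obj))
    return seeds


def classify(objects, threshold, max_iter=20):
    seeds = generate_seeds(objects, threshold)
    k, dims = len(seeds), len(objects[0])

    # Phase ii: allocate each object to its nearest seed.
    labels = [nearest(obj, seeds) for obj in objects]

    # Phase iii: define centroids, held here as running sums and counts.
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for obj, g in zip(objects, labels):
        counts[g] += 1
        for d in range(dims):
            sums[g][d] += obj[d]

    # Phase iv: extract each object, test it, and add it to the nearest group.
    for _ in range(max_iter):
        moved = 0
        for i, obj in enumerate(objects):
            g = labels[i]
            if counts[g] == 1:
                continue  # simplification: never empty a group
            counts[g] -= 1
            for d in range(dims):
                sums[g][d] -= obj[d]
            centroids = [[s / counts[j] for s in sums[j]] for j in range(k)]
            new_g = nearest(obj, centroids)
            counts[new_g] += 1
            for d in range(dims):
                sums[new_g][d] += obj[d]
            if new_g != g:
                labels[i] = new_g
                moved += 1
        if moved == 0:
            break  # no object changed group, so the classification is stable
    return labels


cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(classify(cells, threshold=5))  # [0, 0, 0, 1, 1, 1]
```

Because each extraction and addition changes a centroid immediately, the result for one object depends on the objects processed before it. This sequential dependence is what makes the final phase difficult to vectorise.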

To make the most effective use of vector processing, the last phase has been modified. Objects are left in their groups while being tested for reallocation to the nearest group centroid. Optional parameters allow various proportions of the reallocations to be done by the original extract-test-add method.
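A minimal sketch of the modified reallocation pass follows, assuming a Manhattan distance and hypothetical function names. Centroids are computed once and held fixed while every object is tested, so each test is independent of the others and the loop over objects is the kind of work that can be vectorised. The optional mixing with the extract-test-add method is not shown, since the text does not describe how the proportions are applied.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def group_centroids(objects, labels, k):
    """Mean of the objects currently in each group; None for an empty group."""
    dims = len(objects[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for obj, g in zip(objects, labels):
        counts[g] += 1
        for d, value in enumerate(obj):
            sums[g][d] += value
    return [
        [s / counts[g] for s in sums[g]] if counts[g] else None
        for g in range(k)
    ]


def batch_reallocate(objects, labels, k):
    """One pass in which objects stay in their groups while being tested.

    The centroids are not updated until the whole pass is complete, so the
    outcome does not depend on the order in which objects are visited.
    """
    centroids = group_centroids(objects, labels, k)
    live = [g for g in range(k) if centroids[g] is not None]
    return [
        min(live, key=lambda g: manhattan(obj, centroids[g]))
        for obj in objects
    ]


cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [0, 0, 0, 0, 1, 1]  # the fourth cell begins in the wrong group
print(batch_reallocate(cells, start, k=2))  # [0, 0, 0, 1, 1, 1]
```

The trade-off is that a tested object still contributes to its own group's centroid, so a batch pass can give different reallocations from the extract-test-add method.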

The classification of very large datasets (e.g. 1,121,335 grid cells by 35 attributes) with the ALOB algorithm, which requires multiple iterations, each looping through all of the objects, is not feasible on anything other than a supercomputer.
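The scale of the problem can be made concrete with some simple arithmetic. The grid-cell and attribute counts are those quoted above; the number of groups (100) is purely an assumed figure for illustration, and the function names are hypothetical.

```python
N_OBJECTS = 1_121_335  # grid cells, as quoted in the text
N_ATTRIBUTES = 35      # attributes per grid cell, as quoted in the text


def stored_values(n_objects, n_attributes):
    """Number of attribute values in the data matrix."""
    return n_objects * n_attributes


def distance_terms_per_pass(n_objects, n_attributes, n_groups):
    """Attribute differences evaluated when every object is compared with
    every group centroid once, i.e. in a single reallocation pass."""
    return n_objects * n_attributes * n_groups


print(stored_values(N_OBJECTS, N_ATTRIBUTES))                 # 39246725
print(distance_terms_per_pass(N_OBJECTS, N_ATTRIBUTES, 100))  # 3924672500
```

With an assumed 100 groups, a single pass evaluates close to four billion attribute differences, and this is repeated on every iteration.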