Australian National Database Of Spoken Language (ANDOSL)


J Bruce Millar

Computer Science Laboratory, Research School of Information Science and Engineering

The Australian National Database Of Spoken Language (ANDOSL) project was funded by the Australian Research Council as a collaborative venture between several research groups, including CSL, for the benefit of the whole Australian research community. The database comprises both read and spontaneous speech, rigorous phonetic coverage of Australian English, and speakers representing the full spectrum of Australian-born speakers together with some migrant groups. Early in the year a second set of 15 CDROMs of signal and descriptive data was published, representing the spontaneous speech component. Internet support for users of the data has been continually upgraded, giving researchers ready access to the latest developments. In 1997, access to the phonemic annotation of the data was made available in this way.
The purpose of the database is to provide a resource for researchers in speech science and technology and its application to the Australian speaker community.

The major effort in 1997 has been the organisation of access to various tables within the descriptive database held at CSL. This has been pursued on two fronts. First, the organisation of phonemic annotation data into CDROM image format has enabled flexible release of these data to researchers who have been licensed to use them. Second, several publicly available WWW interfaces to the descriptive database itself have been provided. These allow spot checking of details that are of particular interest to prospective users of the data corpus, and have resulted in useful interactions with researchers who have accessed the facility. Work has also continued on integrating newly available data resources into the ANDOSL system.



J P Vonwiller

Electrical Engineering

University of Sydney

J M Harrington

English and Linguistics

Macquarie University






How did the MDSS help in achieving the project's results?

The MDSS has been used as a large data management area in which signal data, descriptive data and annotation data have been stored in both raw and processed form prior to being assembled in multiple ways for the production of CDROM images. The production so far of about 15 GB of CDROM-formatted data has involved the manipulation of many times this volume of raw and semi-processed data. The submission of value-added annotations of the data by researchers who are using the issued data has added another storage load. Comparisons of updated data with original forms and reconfiguration of CDROM data organisation have all been possible using the MDSS facilities.


J. B. Millar, J. M. Harrington, J. P. Vonwiller, Spoken Language Resources for Australian Speech Technology, Journal of Electrical and Electronic Engineers (Australia), in press.

