The Alveo Virtual Laboratory was developed through a NeCTAR-funded project led by Western Sydney University.
Alveo provides online infrastructure for accessing human communication datasets (speech, texts, music, video, etc.) and for using specialised tools to search, analyse and annotate that data.
There are two ways to access the Alveo Web Service: direct login, or login via the Australian Access Federation (AAF) authorisation system.
Data Discovery Interface
Browse and search collections, view documents and create lists of items for further analysis. The Data Discovery Interface provides the jumping-off point for further analysis using the Galaxy Workflow Engine, the NeCTAR Research Cloud, the R statistical package or any other preferred tool or platform. A fully featured API underpins the Data Discovery Interface, providing opportunities to extend the functionality of the Virtual Laboratory.
Galaxy Workflow Engine
Initially targeted at genomics researchers, Galaxy is a scientific workflow system which is largely domain agnostic. The Galaxy Workflow Engine provides Alveo users with a user-friendly interface to run a range of text, audio and video analysis tools. Workflows defining a sequence of steps in an analysis can be created and then shared with other researchers.
The following datasets are contained within Alveo:
- PARADISEC (the Pacific and Regional Archive for Digital Sources in Endangered Cultures), including Indigenous languages, music, and speech data (5.1TB);
- AusTalk, an audio-visual speech corpus from the Big ASC project (24TB);
- The Australian National Corpus (AusNC), incorporating the Australian Corpus of English (ACE), Australian Radio Talkback (ART), AustLit, Braided Channels, Corpus of Oz Early English (COOEE), Email Australia, Griffith Corpus of Spoken English (GCSAusE), the Australian contribution to the International Corpus of English (ICE-AUS), the Mitchell & Delbridge corpus, and the Monash Corpus of Spoken English;
- The Audio-Video OZstralian English Speech (AVOZES) data corpus, a visual speech corpus (15GB);
- A collection of music excerpts from films: samples of Pixar movie theme music expressing different emotions;
- A collection of room impulse responses which, through convolution with speech or music, can create the effect of that speech or music in the acoustic environment they represent;
- A battery of emotional prosody: samples of sung sentences using different prosodic patterns;
- Colloquial Jakartan Indonesian corpus, audio and text recorded in Jakarta in the early 1990s (5GB);
- The ClueWeb dataset (100TB).
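The room impulse response collection listed above relies on convolution: combining a "dry" recording with an impulse response gives the recording as it would sound in that room. The following is a minimal sketch of the idea in plain Python, not code from Alveo. The function name `convolve` and the toy sample values are assumptions made for illustration; real audio would be read from files and processed with an FFT-based routine for speed.

```python
def convolve(signal, impulse_response):
    """Direct-form discrete convolution of two sample sequences."""
    if not signal or not impulse_response:
        return []
    # The output is as long as both inputs combined, minus one sample.
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            # Each input sample triggers a scaled, delayed copy of the response.
            out[i + j] += s * h
    return out


# Hypothetical toy data: a single click followed by silence, and a
# made-up "room" with a direct path plus two decaying echoes.
dry = [1.0, 0.0, 0.0, 0.0]
room = [1.0, 0.0, 0.5, 0.25]

wet = convolve(dry, room)
print(wet)
```

Because the dry signal here is a single click, the result is simply the impulse response padded with silence, which is why an impulse response is said to characterise the acoustic environment.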
The following tools are included in Alveo:
- EOPAS (PARADISEC tool) for interlinear text and media analysis.
- NLTK (Natural Language Toolkit) for text analytics with linguistic data.
- EMU for search, speech analysis, and interactive labelling of spectrograms and waveforms.
- AusNC Tools: KWIC, Concordance, Word Count, statistical summary and statistical analysis on a user-defined subset of content.
- Johnson-Charniak parser, to generate full parse trees for text sentences.
- ParseEval, a tool to evaluate the syllabic parse of consonant clusters.
- HTK modifications, a patch to HTK (the Hidden Markov Model Toolkit) to enable missing data recognition.
- DeMoLib software for video analysis.
- PsySound3 (physical and psycho-acoustical algorithms) for the analysis of complex visual and auditory scenes.
- ParGram (grammar for Indonesian).
- The INDRI tool for information retrieval with large data sets.
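To make the KWIC (keyword in context) and concordance entries in the list above more concrete, here is a minimal sketch of what such a tool computes. It is not the AusNC Tools implementation: the function `kwic`, its `width` parameter and the sample sentence are all hypothetical, and tokenisation is reduced to splitting on whitespace.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, match, right) context tuples for each hit of keyword."""
    hits = []
    target = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Up to `width` tokens of context on either side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text, split naively on whitespace.
text = "the cat sat on the mat and the dog sat on the rug"
rows = kwic(text.split(), "sat", width=2)

# Print the hits with the keyword aligned in a column.
for left, match, right in rows:
    print(f"{left:>12} [{match}] {right}")
```

Aligning the keyword in a single column is what lets a reader scan the contexts quickly; a word count over the same token list would be a short extension using `collections.Counter`.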