This data set contains audio-visual recordings of adult native speakers of Thai, Cantonese, Mandarin and Malaysian-Mandarin producing syllables with all of their tones, and of native Lancaster English speakers producing syllables with different intonations. There are 9,500 tokens from 10 Thai speakers; 7,440 tokens from 8 Cantonese speakers; 6,144 tokens from 8 Mandarin speakers; 9,360 tokens from 10 Malaysian-Mandarin speakers; 6,500 tokens from an additional 5 Thai speakers; and 4,752 tokens from 8 Lancaster English speakers. Audio was extracted from every video file using Adobe Premiere Pro 2.0 or VirtualDub. Each extracted audio file was then segmented and labelled, syllable by syllable, in Praat. These segmentation files were then used as cues to cut the video files into individual syllables for use as stimuli in speech perception experiments. The videos are in .avi format and the audio files extracted from them are in .wav format; both can be opened in most media players. The labelled segmentation files are in .TextGrid format and can be viewed in Praat. Each video file is approximately 2 to 4 GB depending on the language, with 266 files altogether. Each extracted audio file is approximately 100 to 250 MB, and each TextGrid file is approximately 150 to 250 KB, depending on the language.
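As a rough illustration of how the TextGrid segmentation can serve as cutting cues, the Python sketch below reads labelled intervals from a TextGrid and copies one interval out of a .wav file. It is not the procedure used to build this data set. It assumes the TextGrid is saved in Praat's long text format and has already been decoded to a string (Praat may write UTF-8 or UTF-16). The names `read_intervals`, `cut_wav` and `SAMPLE`, the tier name "syllable" and the label "ma1" are all hypothetical. Cutting the .avi files themselves would need a separate video tool and is not shown.

```python
import re
import wave

# Matches one interval (xmin, xmax, text) in Praat's long text TextGrid format.
# Assumption: plain non-negative decimal times, as Praat normally writes them.
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([\d.]+)\s*'
    r'xmax\s*=\s*([\d.]+)\s*'
    r'text\s*=\s*"((?:[^"]|"")*)"'
)


def read_intervals(textgrid_text):
    """Return (start_s, end_s, label) for every interval with a non-empty label."""
    intervals = []
    for xmin, xmax, label in INTERVAL_RE.findall(textgrid_text):
        label = label.replace('""', '"').strip()  # Praat doubles embedded quotes
        if label:
            intervals.append((float(xmin), float(xmax), label))
    return intervals


def cut_wav(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) into a new .wav file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        first = int(round(start_s * rate))
        count = int(round(end_s * rate)) - first
        src.setpos(first)
        frames = src.readframes(count)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # frame count in the header is corrected on close
        dst.writeframes(frames)


# Hypothetical miniature TextGrid with one labelled syllable between two silences.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.0
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "syllable"
        xmin = 0
        xmax = 1.0
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.25
            text = ""
        intervals [2]:
            xmin = 0.25
            xmax = 0.5
            text = "ma1"
        intervals [3]:
            xmin = 0.5
            xmax = 1.0
            text = ""
'''
```

With these definitions, `read_intervals(SAMPLE)` yields the single labelled interval, and each returned `(start_s, end_s, label)` tuple can be passed to `cut_wav` to write one syllable per output file. Unlabelled intervals are skipped on the assumption that they mark silence between tokens.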