Data as of 10 December 2024
Verlag Dr. Hut GmbH, Sternstr. 18, 80538 München, Tel: 0175 / 9263392, Mon-Fri, 9 a.m. - 12 noon
ISBN 978-3-8439-2148-0, Informationstechnik series
Zixing Zhang: Semi-Autonomous Data Enrichment and Optimisation for Intelligent Speech Analysis
186 pages, dissertation, Technische Universität München (2015), hardcover, A5
Intelligent Speech Analysis (ISA) plays an essential role in smart conversational agent systems that aim to enable natural, intuitive, and friendly human-computer interaction. It comprises not only the long-established field of Automatic Speech Recognition (ASR), but also the young field of Computational Paralinguistics, which has attracted increasing attention in recent years. In real-world applications, however, several challenging issues concerning data quantity and quality arise. For example, the predefined databases for most paralinguistic tasks are typically small and few in number, which is insufficient for building robust models. A distributed architecture could be useful for data collection, but the original feature sets are often too large to meet physical transmission constraints such as bandwidth limitations. Furthermore, in hands-free application scenarios, reverberation severely distorts the speech signal, which degrades recogniser performance.
To address these issues, this thesis proposes and analyses semi-autonomous data enrichment and optimisation approaches. More precisely, for the representative paralinguistic task of speech emotion recognition, both labelled and unlabelled data from heterogeneous resources are exploited through data pooling, data selection, confidence-based semi-supervised learning, active learning, and cooperative learning, greatly reducing the manual effort required for data annotation. With the advance of network and information technologies, the thesis further extends the traditional ISA system into a modern distributed paradigm, in which Split Vector Quantisation is employed for feature compression. Moreover, for distant-talking ASR, Long Short-Term Memory (LSTM) recurrent neural networks, which are known to be well suited to context-sensitive pattern recognition, are evaluated for mitigating reverberation. The experimental results demonstrate that the proposed LSTM-based feature enhancement frameworks outperform current state-of-the-art methods.
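To give a concrete flavour of one data-enrichment strategy named in the abstract, the short Python sketch below illustrates confidence-based semi-supervised learning (self-training): a model trained on a small labelled pool labels unlabelled samples and only its most confident predictions are added back to the training set. This is a minimal illustration of the general idea, not the pipeline used in the thesis; the synthetic feature vectors, the SVM classifier, and the confidence threshold are placeholders standing in for acoustic emotion features, the actual recogniser, and the thesis's selection criteria.

# Minimal sketch of confidence-based self-training (semi-supervised learning).
# NOT the thesis's exact method: data, classifier, and threshold are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for acoustic feature vectors with emotion labels.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=2, random_state=0)
# Keep only 10 % of the labels; treat the rest as unlabelled speech data.
X_lab, X_unlab, y_lab, _ = train_test_split(X, y, train_size=0.1, random_state=0)

CONF_THRESHOLD = 0.9   # accept machine labels only above this confidence
clf = SVC(probability=True, random_state=0)

for iteration in range(5):
    clf.fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break
    proba = clf.predict_proba(X_unlab)
    confidence = proba.max(axis=1)
    confident = confidence >= CONF_THRESHOLD
    if not confident.any():
        break
    # Move confidently machine-labelled samples into the training pool,
    # reducing the amount of manual annotation that would otherwise be needed.
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]
    print(f"iteration {iteration}: training pool now {len(X_lab)} samples")

Active learning and cooperative learning, also mentioned above, differ mainly in how samples are selected: active learning routes the least confident samples to a human annotator, while cooperative learning combines machine and human labelling in one loop.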