Affective prosody modeling for a Malay Text-to-speech system (TTS)
Indeed they are always in sympathy with an emotional speaker even when there is nothing in what he says; and that is why many an orator tries to stun the audience with sound and fury – Aristotle in Rhetoric
My involvement with speech studies began while I was working on my dissertation for the Master of Software Engineering at University Malaya, KL. I worked on a project in collaboration with an R&D company, MIMOS Malaysia, to incorporate emotions into the first Malay concatenative TTS, Fasih. The project added an affective component to the Malay TTS system in order to make it more expressive. I introduced a new template-driven method for generating expressive speech by embedding an ‘emotion layer’ called the eXpressive Text Reader Automation Layer, abbreviated as eXTRA. The module is an independent component that can serve as an extension to any Malay TTS system that uses the Multiband Resynthesis Overlap Add (MBROLA) engine for diphone concatenation. Details can be found in (Syaheerah L. Lutfi et al., 2006).
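As a loose illustration of what an emotion layer placed in front of MBROLA has to do, the sketch below rescales phoneme durations and pitch targets in MBROLA-style `.pho` lines (phoneme, duration in ms, then optional position/pitch pairs). It is not the eXTRA algorithm, which is described in the cited paper; the function name `apply_emotion_template` and the scale factors in `EMOTION_TEMPLATES` are hypothetical and chosen only for the example.

```python
# Hypothetical per-emotion scale factors: (duration scale, pitch scale).
# These numbers are illustrative assumptions, not values from eXTRA.
EMOTION_TEMPLATES = {
    "anger": (0.8, 1.25),
    "sadness": (1.2, 0.9),
}


def apply_emotion_template(pho_lines, duration_scale, pitch_scale):
    """Scale durations and pitch targets of MBROLA-style .pho lines."""
    out = []
    for line in pho_lines:
        stripped = line.strip()
        # Comments (starting with ';') and blank lines pass through unchanged.
        if not stripped or stripped.startswith(";"):
            out.append(line)
            continue
        fields = stripped.split()
        phoneme, duration = fields[0], float(fields[1])
        targets = [float(x) for x in fields[2:]]
        new_fields = [phoneme, str(round(duration * duration_scale))]
        # Targets come in (position %, pitch Hz) pairs; only the pitch is scaled.
        for pos, pitch in zip(targets[0::2], targets[1::2]):
            new_fields += [str(round(pos)), str(round(pitch * pitch_scale))]
        out.append(" ".join(new_fields))
    return out


neutral = ["; neutral utterance", "s 100 50 120", "_ 200"]
angry = apply_emotion_template(neutral, *EMOTION_TEMPLATES["anger"])
```

Keeping the transformation on the `.pho` text, rather than inside the synthesizer, is what lets such a layer remain independent of the TTS front end.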
But since everything like and akin to oneself is pleasant, and since every man is himself more like and akin to himself than any one else is, it follows that all of us must be more or less fond of ourselves…that is why we are usually fond of our flatterers, [our lovers,] and honour; also of our children, for our children are our own work. -Aristotle in Rhetoric
It is vital to ensure that intelligent interfaces are also equipped to meet the challenge of cultural diversity. Studies show that the expression and perception of emotions may vary from one culture to another (Matsumoto & Juang, 2007) and that applying a Similarity Principle would enhance the believability of the system. For example, localized synthetic speech from an agent of the same ethnic background as the interactor is perceived to be more socially attractive and trustworthy than speech from an agent of a different background (Nass & Lee, 2002). Based on these studies and personal experience, we realized that it is crucial to give a TTS system a more familiar set of emotions when its users are native speakers. We therefore worked on establishing a localized TTS by concentrating on the culturally specific manner of speaking and choice of words in a given emotional state. Evaluations show that, besides establishing a localized TTS, this approach minimizes the risk of evoking confusion or negative emotions such as annoyance, offense or aggravation in the user (Syaheerah et al., 2008).
Emotion Identification from Voice
This study is concerned with obtaining the best parametric model for emotion recognition, based on a Hidden Markov Model (HMM) classifier. The optimal parameters for the emotion identification task were determined empirically across two representations of the observations, mel-frequency cepstral coefficients (MFCC) and Perceptual Linear Prediction (PLP), and were improved using well-known normalization techniques. An important finding was that certain features were more precise than others at identifying particular types of emotion. This and other findings from the experiments are discussed in (Syaheerah et al., 2009). The article also proposes that the findings could be applied to speech-based affect identification systems, as next-generation biometric identification systems aimed at determining a person’s ‘state of mind’, or psycho-physiological state.
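The text does not say which normalization techniques were used. One common choice for MFCC or PLP features is per-utterance cepstral mean and variance normalization (CMVN), sketched below under that assumption; the function name `cmvn` and the `eps` guard are illustrative, not taken from the cited article.

```python
import numpy as np


def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    `features` is a (num_frames, num_coeffs) array, e.g. MFCC or PLP frames.
    Each coefficient is shifted to zero mean and scaled to unit variance
    over the frames of the utterance.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # eps avoids division by zero for coefficients that are constant.
    return (features - mean) / (std + eps)


# Toy input: 2 frames with 2 coefficients each (assumed values).
frames = np.array([[1.0, 10.0], [3.0, 30.0]])
normalized = cmvn(frames)
```

Normalizing each utterance separately removes constant channel and speaker offsets before the frames are passed to the classifier.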
Spanish Affective TTS
At GTH we also work on emotional Spanish TTS and have participated in a number of expressive TTS competitions, such as the Spanish Albayzin TTS competition (first place) (R. Barra-Chicote et al., 2008) and the INTERSPEECH 2009 Emotion Challenge (R. Barra-Chicote et al., 2009).