Automatic Emotion Recognition from Mandarin Speech
Whether a person is speaking privately with family members or giving a presentation at a conference, emotion is an inevitable element of speech and is present in some form. Moreover, in many social interactions, thoughts, wishes, attitudes, and opinions cannot be fully expressed without emotion. One of the most important functions of emotion is to support interpersonal understanding. The appropriate use of emotional expression helps to achieve better communication, enhance friendship and mutual respect, and improve relationships. Due to the significant impact of emotion on the human exchange of information, the recognition and understanding of emotions in communicative behavior has become a prominent multidisciplinary research topic. The earliest modern scientific studies on emotion trace back to the work of Charles Darwin. In The Expression of the Emotions in Man and Animals, Darwin claimed that (1) the voice works as the main carrier of emotion signals in communication, and (2) clear correlations exist between particular emotional states and the sounds produced by the speaker (see Darwin, 1872). Following this seminal text, emotion studies were dominated by behavioral psychologists for more than 100 years. In this field, William James established the theory of emotion that is still prevalent today (cf. James, 1884). Since that point, the topic has spread to a variety of disciplines (see Tao & Tan, 2005).
In human communication, the speaker generally has two channels for delivering emotional information to the listener: verbal and non-verbal communication (cf. Koolagudi & Rao, 2012). First, emotional information can be conveyed verbally, which is of interest to linguists. When expressing an emotion through speech, a person can organize words in specific ways to send an emotional signal to others. For example, it is common to hear words that explicitly suggest a certain emotion, such as “I am so sad today.” However, emotion can sometimes work against the verbal form in which it is cast. The second way to express and receive emotional information is through non-verbal means. The fundamental non-verbal cues for emotion in human communication fall into three main categories: facial expression, vocalization, and body language (cf. Watzlawick, Bavelas, Jackson, & O’Hanlon, 2011). Of these types, vocalization is one of the most efficient vehicles for information transfer (Postma-Nilsenová, Postma, Tsoumani, & Gu, n.d.). As we speak, our voices convey information about us as individuals. The sound of a person’s voice can reveal whether the speaker is happy, sad, panicked, or in some other emotional state.
Changes in the sound of our voice can signal to the listener that our emotions are shifting in a new direction. Thus, the voice is a way for speakers to demonstrate their emotional state. Given the wide range of emotional information that a listener receives from speech, it is not surprising that researchers from a variety of disciplines are interested in studying speech emotion.
The following section provides a summary of the previous research that forms the basis of this study. The non-verbal conveyance of emotion includes paralinguistic acoustic cues such as pitch and energy. We start in 1935, when Skinner attempted to study happy and sad emotional information by analyzing the pitch of speech. Skinner’s study revealed that a person’s pitch is more likely to change when the person is happy or sad than when experiencing another type of emotion (cf. Skinner, 1935). Later, Ortony et al. (1990) observed that a single sentence can express various emotions as the speaker changes the speaking rate and the energy used.
Nygaard and Queen (2008) subsequently demonstrated that listeners were able to repeat happy or sad words, such as “comedy” or “cancer,” more quickly when the words were spoken in a tone of voice that matched their emotional content; repetition was slower when the emotional tone of voice contradicted the affective meaning of the words. Schirmer and Simpson (2007) also found that the emotional tone of speech can influence a listener’s cognitive processing of words. Furthermore, more than 50 years ago, Kramer’s studies established that in cross-cultural communication, a listener who does not know the cultural background or language of the speaker can still understand and recognize emotional information via non-verbal communication (see Kramer, 1964).
Taken together, the above studies indicate that the non-verbal aspects of speech can convey emotional information independently of the words used. Because the non-verbal aspects of speech carry emotion on their own, understanding them can help to overcome the language and cultural barriers often present in cross-cultural and international communication. In this thesis, we aim to create a novel method for a computer to recognize emotion through non-verbal speech cues in the Mandarin language. The intention is to enable the computer to detect a Mandarin speaker’s different emotional states. Our goal is to find an alternative to the current methods, which accurately characterize non-verbal speech emotion in languages other than Mandarin. In this study, we disregard the verbal aspects of speech and focus solely on its non-verbal aspects in all experiments.
The remainder of this chapter is organized as follows: We first introduce speech emotion recognition (SER) in Section 1.1. The applications of SER are then described in Section 1.2, while Section 1.3 subsequently discusses SER in the Mandarin language. In Section 1.4, we formulate our problem statement (PS) and research questions (RQs). Afterwards, Section 1.5 describes the research methodology, and Section 1.6 offers an overview of our contributions. Finally, Section 1.7 outlines the structure of this thesis.