A Study on Developing Applications Systems Based on Singing Understanding and Singing Expression (in Japanese)

Tomoyasu Nakano
University of Tsukuba, Tsukuba, Japan (March, 2008)


Since the singing voice is one of the most familiar ways of expressing music, a music information processing system that utilizes the singing voice is a promising research topic with a wide range of applications. The singing voice has been studied in many fields, including physiology, anatomy, acoustics, psychology, and engineering. Recently, research interest has turned towards systems that support non-musician users, such as singing training assistance and music information retrieval using a hummed melody (query-by-humming) or singing voice timbre. Basic studies of the singing voice can thus help broaden the scope of music information processing.

The aim of this research is to develop systems that enrich the relationship between humans and music, through the study of singing understanding and singing expression. The specific research themes treated in this thesis are the evaluation of singing skills, within the scope of singing understanding, and voice percussion recognition, within the scope of singing expression.

This thesis consists of two parts, corresponding to the two major topics of the research work. Part 1 deals with the study of human singing skill evaluation, as part of the broader research domain of understanding human singing. Part 2 deals with the study of voice percussion, as part of the study of singing expression.

In both studies, the approach and methodology follow what is called the HMI approach, which is a unification of three research approaches investigating the Human (H), Machine (M), and Interaction/Interface (I) aspects of singing.

Part 1: Singing understanding (singing skill)

Chapter 2 presents the results of two experiments on singing skill evaluation, where human subjects (raters) judge the subjective quality of previously unheard melodies (H domain). This serves as a preliminary basis for developing an automatic singing skill evaluation method for unknown melodies. Such an evaluation system can be a useful tool for improving singing skills, and can also be applied to broadening the scope of music information retrieval and singing voice synthesis. Previous research on singing skill evaluation for unknown melodies has focused on analyzing the characteristics of the singing voice, but it has not been directly applied to automatic evaluation or compared with evaluation by human subjects.

The two experiments used the rank ordering method, where the subjects ordered a group of given stimuli according to their preferred ratings. Experiment 1 was intended to explore the criteria that human subjects use in judging singing skill and the stability of their judgments, using unaccompanied singing sequences (solo singing) as the stimuli. Experiment 2 used F0 sequences (F0 singing) that were extracted from the solo singing and resynthesized as sinusoidal waves; it was intended to identify the contribution of F0 to the judgment. In experiment 1, six key features were extracted from the introspective reports of the subjects as being significant for judging singing skill. The results of experiment 1 show that 88.9% of the correlations between the subjects' evaluations were significant at the 5% level. This dropped to 48.6% in experiment 2, meaning that the contribution of F0 is relatively low, although the median ratings of stimuli evaluated as good were higher than the median ratings of stimuli evaluated as poor in all cases.
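The resynthesis step used to build the F0-singing stimuli can be illustrated with a minimal Python sketch. It is not the procedure used in the thesis: the function name `synthesize_f0_sine`, the 10 ms frame period, the 16 kHz sample rate, and the choice to render unvoiced frames (F0 of 0) as silence are all assumptions made for illustration.

```python
import math

def synthesize_f0_sine(f0_hz, frame_period_s=0.01, sample_rate=16000, amplitude=0.5):
    """Render a frame-wise F0 contour as a single sinusoid.

    f0_hz: one F0 value in Hz per frame; a value <= 0 marks an unvoiced
    frame (assumption: unvoiced frames are rendered as silence).
    Returns a list of float samples in [-amplitude, amplitude].
    """
    samples_per_frame = int(round(frame_period_s * sample_rate))
    phase = 0.0
    samples = []
    for f0 in f0_hz:
        for _ in range(samples_per_frame):
            if f0 > 0:
                # Accumulate phase from the current frequency so that the
                # waveform stays continuous when F0 changes between frames.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                samples.append(amplitude * math.sin(phase))
            else:
                samples.append(0.0)
    return samples

# Ten voiced frames at 440 Hz followed by five unvoiced frames.
samples = synthesize_f0_sine([440.0] * 10 + [0.0] * 5)
```

Accumulating phase, rather than evaluating sin(2*pi*f*t) independently per frame, avoids audible clicks at frame boundaries; a stimulus built this way carries the pitch contour but none of the timbre or lyrics of the original voice.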

Human subjects can be seen to evaluate singing skill for unknown melodies consistently. This suggests that their evaluation utilizes easily discernible features which are independent of the particular singer or melody. The approach presented in Chapter 3 uses pitch interval accuracy and vibrato (intentional, periodic fluctuation of F0), which are independent of the specific characteristics of the singer or melody (M domain). These features were tested in a 2-class (good/poor) classification test with 600 song sequences, and achieved a classification rate of 83.5%.
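To make the vibrato feature concrete, the following Python sketch measures the rate and extent of a periodic F0 fluctuation on a synthetic contour. The abstract does not specify how the thesis computes its features, so the estimator here (zero crossings of the mean-removed contour in cents) and the names `hz_to_cents` and `vibrato_rate_and_extent` are hypothetical stand-ins, not the method of Chapter 3.

```python
import math

def hz_to_cents(f0_hz, ref_hz=440.0):
    """Convert a frequency in Hz to cents relative to ref_hz (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f0_hz / ref_hz)

def vibrato_rate_and_extent(f0_hz, frame_period_s=0.01):
    """Estimate vibrato rate (Hz) and extent (cents) of a voiced F0 segment.

    Hypothetical estimator: the rate comes from zero crossings of the
    mean-removed contour (two crossings per cycle), and the extent is
    half of the peak-to-peak deviation.
    """
    cents = [hz_to_cents(f) for f in f0_hz]
    mean = sum(cents) / len(cents)
    deviation = [c - mean for c in cents]
    crossings = sum(1 for a, b in zip(deviation, deviation[1:]) if a * b < 0)
    duration_s = len(cents) * frame_period_s
    rate_hz = crossings / (2.0 * duration_s)
    extent_cents = (max(deviation) - min(deviation)) / 2.0
    return rate_hz, extent_cents

# Synthetic check: a 1 s note at 440 Hz with a 6 Hz vibrato of +/-50 cents.
contour = [
    440.0 * 2.0 ** (50.0 * math.sin(2.0 * math.pi * 6.0 * i * 0.01 + 0.5) / 1200.0)
    for i in range(100)
]
rate, extent = vibrato_rate_and_extent(contour)
```

Working in cents rather than Hz makes the measurement independent of the singer's register, which matches the requirement that the features not depend on a particular singer or melody.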

Following the results of the subjective evaluation (H domain), MiruSinger, a singing skill visualization interface, was implemented (I domain). MiruSinger provides realtime visual feedback on the singing voice, and focuses on the visualization of two key features: F0 (for pitch accuracy improvement) and vibrato sections (for singing technique improvement). Unlike previous systems, real-world music CD recordings are used as reference data. The F0 of the vocal part is estimated automatically from music CD recordings, and can then be hand-corrected interactively using a graphical interface on the MiruSinger screen.

Part 2: Singing expression (voice percussion)

Voice percussion in our context is the mimicking of drum sounds by voice, expressed in verbal form that can be transcribed into a phonemic representation, or onomatopoeia (e.g. don-tan-do-do-tan). Chapter 5 describes a psychological experiment, the voice percussion expression experiment, which gathers data on how subjects express drum patterns (H domain). This serves as a preliminary basis for developing a voice percussion recognition method. Previous studies on query-by-humming focused on pitch detection and melodic feature extraction, but these features have less relevance in voice percussion recognition, which is primarily concerned with the classification of timbre and the identification of articulation methods. Methods for handling such features can be useful tools for music notation interfaces, and also have promising applications in widening the scope of music information retrieval.

A "drum pattern" in our context means a sequence of drum beats that forms a minimum unit (one measure). In this thesis, drum patterns consist of only two percussion instruments: bass drum (BD) and snare drum (SD). The expression experiment had 17 subjects aged 19 to 31 (two with experience in percussion). The voice percussion sung by the subjects was recorded and analyzed. Significant findings from the expression experiment include: "the onomatopoeic expression had correspondence with the length and rhythmic patterns of the beats" and "some subjects verbally expressed rest notes".

Chapter 6 describes a voice percussion recognition method. The voice percussion is compared with all the patterns in a drum pattern database, and the pattern estimated to be acoustically closest to the voice percussion is selected as the recognition result (M domain). The search first looks for drum patterns over onomatopoeic sequences. This selects the instrument sequences with the highest likelihood ratings, which are then checked against their onset timings. The pattern with the highest ranking is output as the final result. The recognition method was tested in recognition experiments over combinations of different settings of the acoustic model and the pronunciation dictionary. The following four conditions were evaluated.

(A) General acoustic model of speech
(B) Acoustic model tuned by voice percussion utterances not in evaluation data
(C) Acoustic model tuned to individual subjects
(D) Same acoustic model, with the pronunciation dictionary restricted to the expressions used by the subject

The recognition rates in the evaluation experiments were (A) 58.5%, (B) 58.5%, (C) 85.0%, and (D) 92.0%.
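The two-stage search described for Chapter 6 (instrument sequence first, onset timing second) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `recognize_pattern`, the database layout, the log-likelihood scores (which in a real system would come from the acoustic model and pronunciation dictionary), and the use of summed absolute onset differences as the timing measure are all hypothetical and not taken from the thesis.

```python
def recognize_pattern(database, sequence_scores, observed_onsets):
    """Pick a drum pattern in two stages.

    database: list of dicts with "name", "instruments" (e.g. ["BD", "SD"])
        and "onsets" (positions within one measure, from 0.0 to 1.0).
    sequence_scores: dict mapping an instrument-sequence tuple to a
        likelihood score (hypothetical recognizer output, higher is better).
    observed_onsets: onset positions of the sung input, normalized to the
        measure in the same way as the database onsets.
    """
    # Stage 1: keep the instrument sequence with the highest likelihood.
    best_sequence = max(sequence_scores, key=sequence_scores.get)
    candidates = [p for p in database if tuple(p["instruments"]) == best_sequence]
    if not candidates:
        return None

    # Stage 2: among patterns sharing that sequence, compare onset timings.
    def onset_error(pattern):
        return sum(abs(a - b) for a, b in zip(pattern["onsets"], observed_onsets))

    return min(candidates, key=onset_error)["name"]

database = [
    {"name": "straight", "instruments": ["BD", "SD", "BD", "SD"], "onsets": [0.0, 0.25, 0.5, 0.75]},
    {"name": "swung", "instruments": ["BD", "SD", "BD", "SD"], "onsets": [0.0, 0.33, 0.5, 0.83]},
    {"name": "short", "instruments": ["BD", "BD", "SD"], "onsets": [0.0, 0.5, 0.75]},
]
scores = {("BD", "SD", "BD", "SD"): -10.0, ("BD", "BD", "SD"): -25.0}
result = recognize_pattern(database, scores, [0.0, 0.26, 0.5, 0.74])  # selects "straight"
```

Separating the two stages means that timing only has to discriminate between patterns that already agree on which drums are played, which keeps the timing comparison simple.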

Following the encouraging results of the proposed method as a practical tool for voice percussion recognition, a score input interface, Voice Drummer, was developed as an application of the method (I domain). Voice Drummer consists of a score input mode, which is used for drum pattern input intended for use in composition, and an arrangement mode, which edits drum patterns in a given music piece. There is also a practice/adaptation mode where the user can practice and adapt the system to his/her voice, thus increasing the recognition rate.


Part 1 presented the results of the subjective evaluation experiments, and proposed two acoustical features, pitch interval accuracy and vibrato, as key features for evaluating singing skills. The results of the subjective evaluation suggested that the singing skill evaluations of human listeners are generally consistent and in mutual agreement. In the classification experiment, the acoustical features were shown to be effective for evaluating singing skills without score information.

Part 2 presented the results of the voice percussion expression experiment, and proposed a voice percussion recognition method. The onomatopoeic expressions utilized in the recognition experiment were extracted from the expression experiment. In the recognition experiment, the voice percussion recognition method achieved a recognition rate of 91.0% for the highest-tuned setting.

The results of these two studies were applied to the development of two application systems, MiruSinger for singing training assistance and Voice Drummer for percussion instrument notation. Trial usage of the systems suggests that they would be useful and fun tools for average users.

The presented work can be seen as pioneering work in the fields of singing understanding and expression, contributing to the advance of singing voice research.
