Signal Processing Methods for Drum Transcription and Music Structure Analysis

Jouni Paulus
Tampere University of Technology, Finnland (January, 2010)


This thesis proposes signal processing methods for the analysis of musical audio on two time scales: drum transcription on a finer time scale and music structure analysis on the time scale of entire pieces. The former refers to the process of locating drum sounds in the input and recognising the instruments that were used to produce the sounds. The latter refers to the temporal segmentation of a musical piece into parts, such as chorus and verse.

For drum transcription, both low-level acoustic recognition and high-level musicological modelling methods are presented. A baseline acoustic recognition method with a large number of features using Gaussian mixture models for the recognition of drum combinations is presented. Since drums occur in structured patterns, modelling of the sequential dependencies with N-grams is proposed. In addition to the conventional N-grams, periodic N-grams are proposed to model the dependencies between events that occur one pattern length apart. The evaluations show that incorporating musicological modelling improves the performance considerably. As some drums are more probable to occur at certain points in a pattern, this dependency is utilised for producing transcriptions of signals produced with arbitrary sounds, such as beatboxing.

A supervised source separation method using non-negative matrix factorisation is proposed for transcribing mixtures of drum sounds. Despite the simple signal model, a high performance is obtained for signals without other instruments. Most of the drum transcription methods operate only on single-channel inputs, but multichannel signals are available in recording studios. A multichannel extension of the source separation method is proposed, and an increase in performance is observed in evaluations.

Many of the drum transcription methods rely on detecting sound onsets for the segmentation of the signal. Detection errors will then decrease the overall performance of the system. To overcome this problem, a method utilising a network of connected hidden Markov models is proposed to perform the event segmentation and recognition jointly. The system is shown to be able to perform the transcription even from polyphonic music.

The second main topic of this thesis is music structure analysis. Two methods are proposed for this purpose. The first relies on defining a cost function for a description of the repeated parts. The second method defines a fitness function for descriptions covering the entire piece. The abstract cost (and fitness) functions are formulated in terms that can be determined from the input signal algorithmically, and optimisation problems are formulated. In both cases, an algorithm is proposed for solving the optimisation problems. The first method is evaluated on a small data set, and the relevance of the cost function terms is shown. The latter method is evaluated on three large data sets with a total of 831 (557+174+100) songs. This is to date the largest evaluation of a structure analysis method. The evaluations show that the proposed method outperforms a reference system on two of the data sets.

Music structure analysis methods rarely provide musically meaningful names for the parts in the result. A method is proposed to label the parts in descriptions based on a statistical model of the sequential dependencies between musical parts. The method is shown to label the main parts relatively reliably without any additional information. The labelling model is further integrated into the fitness function based structure analysis method.

[BibTex, Return]