Signal Processing Methods for Audio Classification and Music Content Analysis

Antti Eronen
Tampere University of Technology, Tampere, Finland (June, 2009)


Signal processing methods for audio classification and music content analysis are developed in this thesis. Audio classification is here understood as the process of assigning a discrete category label to an unknown recording. Two specific problems of audio classification are considered: musical instrument recognition and context recognition. In the former, the system classifies an audio recording according to the instrument, e.g. violin, flute, piano, that produced the sound. The latter task is about classifying an environment, such a car, restaurant, or library, based on its ambient audio background.

In the field of music content analysis, methods are presented for music meter analysis and chorus detection. Meter analysis methods consider the estimation of the regular pattern of strong and weak beats in a piece of music. The goal of chorus detection is to locate the chorus segment in music which is often the catchiest and most memorable part of a song. These are among the most important and readily commercially applicable content attributes that can be automatically analyzed from music signals.

For audio classification, several features and classification methods are proposed and evaluated. In musical instrument recognition, we consider methods to improve the performance of a baseline audio classification system that uses mel-frequency cepstral coefficients and their first derivatives as features, and continuous-density hidden Markov models (HMMs) for modeling the feature distributions. Two improvements are proposed to increase the performance of this baseline system. First, transforming the features to a base with maximal statistical independence using independent component analysis. Secondly, discriminative training is shown to further improve the recognition accuracy of the system.

For musical meter analysis, three methods are proposed. The first performs meter analysis jointly at three different time scales: at the temporally atomic tatum pulse level, at the tactus pulse level, which corresponds to the tempo of a piece, and at the musical measure level. The features obtained from an accent feature analyzer and a bank of combfilter resonators are processed by a novel probabilistic model which represents primitive musical knowledge and performs joint estimation of the tatum, tactus, and measure pulses.

The second method focuses on estimating the beat and the tatum. The design goal was to keep the method computationally very efficient while retaining sufficient analysis accuracy. Simplified probabilistic modeling is proposed for beat and tatum period and phase estimation, and ensuring the continuity of the estimates. A novel phase-estimator based on adaptive comb filtering is presented. The accuracy of the method is close to the first method but with a fraction of the computational cost.

The third method for music rhythm analysis focuses on improving the accuracy in music tempo estimation. The method is based on estimating the tempo of periodicity vectors using locally weighted k-Nearest Neighbors (k-NN) regression. Regression closely relates to classification, the difference being that the goal of regression is to estimate the value of a continuous variable (the tempo), whereas in classification the value to be assigned is a discrete category label. We propose a resampling step applied to an unknown periodicity vector before finding the nearest neighbors to increase the likelihood of finding a good match from the training set. This step improves the performance of the method significantly. The tempo estimate is computed as a distance-weighted median of the nearest neighbor tempi. Experimental results show that the proposed method provides significantly better tempo estimation accuracies than three reference methods.

Finally, we describe a computationally efficient method for detecting a chorus section in popular and rock music. The method utilizes a self-dissimilarity representation that is obtained by summing two separate distance matrices calculated using the mel-frequency cepstral coefficient and pitch chroma features. This is followed by the detection of off-diagonal segments of small distance in the distance matrix. From the detected segments, an initial chorus section is selected using a scoring mechanism utilizing several heuristics, and subjected to further processing.

[BibTex, External Link, Return]