MIR PhD Thesis: Kazuyoshi Yoshii (2008)

Studies on Hybrid Music Recommendation Using Timbral and Rhythmic Features

Kazuyoshi Yoshii
Kyoto University, Kyoto, Japan (March, 2008)

ABSTRACT

The importance of music recommender systems is increasing because most online services that manage large music collections cannot provide users with entirely satisfactory access to their collections. Many users of music streaming services want to discover “unknown” pieces that truly match their musical tastes even if these pieces are scarcely known by other users. Recommender systems should thus be able to select musical pieces that will likely be preferred by estimating user tastes. To develop a satisfactory system, it is essential to take into account the contents and ratings of musical pieces. Note that the content should be automatically analyzed from multiple viewpoints such as timbral and rhythmic aspects whereas the ratings are provided by users.

This thesis deals with hybrid music recommendation based on users’ ratings and on timbral and rhythmic features extracted from polyphonic musical audio signals. Our goal was to simultaneously satisfy six requirements: (1) accuracy, (2) diversity, (3) coverage, (4) promptness, (5) adaptability, and (6) scalability. To achieve this, we focused on a model-based hybrid filtering method that has been proposed in the field of document recommendation. This method can make accurate and prompt recommendations with wide coverage and diversity of musical pieces by using a probabilistic generative model that statistically unifies content-based and rating-based data.

To apply this method to music recommendation, we tackled four issues: (i) lack of adaptability, (ii) lack of scalability, (iii) no capability for using musical features, and (iv) no flexibility for integrating multiple aspects of music. To solve issue (i), we propose an incremental training method that partially updates the model to promptly reflect partial changes in the data (addition of rating scores and registration of new users and pieces) instead of training the entire model from scratch. To solve issue (ii), we propose a cluster-based training method that efficiently constructs the model at a fixed computational cost regardless of the numbers of users and pieces. To solve issue (iii), we propose a bag-of-features model that represents the time-series features of a musical piece as a set of existence probabilities of predefined features. To solve issue (iv), we propose a flexible method that integrates the musical features of timbral and rhythmic aspects into bag-of-features representations.
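The bag-of-features idea can be illustrated with a minimal sketch. It assumes hard vector quantization against a fixed codebook, which is only a stand-in for the thesis's own model; the names `bag_of_features`, `frames`, and `codebook` are hypothetical and the toy values are made up for illustration.

```python
def bag_of_features(frames, codebook):
    """Summarize a time series of feature vectors as a distribution over codewords.

    Assumption: each frame is assigned to its nearest codeword (squared
    Euclidean distance). The thesis may estimate these weights differently.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    counts = [0] * len(codebook)
    for frame in frames:
        dists = [sum((x - c) ** 2 for x, c in zip(frame, code)) for code in codebook]
        counts[dists.index(min(dists))] += 1
    # Normalize counts so the weights sum to one.
    return [n / len(frames) for n in counts]


# Toy data: four 2-D frames and two predefined features (codewords).
frames = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.2]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
weights = bag_of_features(frames, codebook)
```

The resulting vector has one entry per predefined feature, so pieces of different lengths map to representations of the same size.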

In Chapter 3, we first explain the model-based method of hybrid filtering that takes into account both rating-based and content-based data, i.e., rating scores awarded by users and the musical features of audio signals. The probabilistic model can be used to formulate a generative mechanism that is assumed to lie behind the observed data from the viewpoint of probability theory. We then present incremental training and its application to cluster-based training. The model formulation enables us to incrementally update the partial parameters of the model according to the increase in observed data. Cluster-based training initially builds a compact model called a core model for fixed numbers of representative users and pieces, which are the centroids of clusters of similar users and pieces. To obtain the complete model, the core model is then extended by registering all users and pieces with incremental training. Finally, we describe the bag-of-features model to enable hybrid filtering to deal with musical features extracted from polyphonic audio signals. To capture the timbral aspects of music, we created a model for the distribution of Mel frequency cepstral coefficients (MFCCs). To also take into account the rhythmic aspects of music, we effectively combined rhythmic features based on drum-sound onsets with timbral features (MFCCs) by using principal component analysis (PCA). The onsets of drum sounds were automatically obtained as described in the next chapter.
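As a rough illustration of combining timbral and rhythmic features with PCA, the sketch below concatenates two per-frame feature matrices and projects them onto the leading principal components. The function name, the feature dimensions, the random input data, and the choice to standardize each dimension first are all assumptions made for this example, not details taken from the thesis.

```python
import numpy as np


def combine_with_pca(timbral, rhythmic, n_components):
    """Project concatenated timbral and rhythmic features onto principal components.

    timbral:  array of shape (n_frames, d_timbral), e.g. MFCCs
    rhythmic: array of shape (n_frames, d_rhythmic), e.g. drum-onset features
    """
    joint = np.hstack([timbral, rhythmic])
    # Assumption: standardize each dimension so neither feature set dominates.
    centered = (joint - joint.mean(axis=0)) / joint.std(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random placeholder data: 200 frames, 12 timbral and 3 rhythmic dimensions.
rng = np.random.default_rng(0)
timbral = rng.normal(size=(200, 12))
rhythmic = rng.normal(size=(200, 3))
combined = combine_with_pca(timbral, rhythmic, n_components=5)
```

The combined vectors could then be fed to a bag-of-features representation in place of the timbral features alone.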

Chapter 4 describes a system that detects onsets of the bass drum, snare drum, and hi-hat cymbals from polyphonic audio signals. The system takes a template-matching-based approach that uses the power spectrograms of drum sounds as templates. However, there are two problems. The first is that no single set of templates is appropriate for all songs. The second is that it is difficult to detect drum-sound onsets in sound mixtures that include various other sounds. To solve these problems, we propose two methods: template adaptation and harmonic-structure suppression. First, an initial template for each drum sound (seed template) is prepared. The former method adapts it to the actual drum-sound spectrograms appearing in the song spectrogram. To make our system robust to the overlapping of harmonic sounds with drum sounds, the latter method suppresses harmonic components in the song spectrogram. Experimental results with 70 popular songs demonstrated that our methods improved recognition accuracy, achieving 83%, 58%, and 46% in detecting the onsets of the bass drum, snare drum, and hi-hat cymbals, respectively.
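The basic template-matching step can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a root-mean-square distance and a fixed threshold, and it omits template adaptation and harmonic-structure suppression entirely. The function name, the threshold value, and the toy spectrogram are hypothetical.

```python
import math


def detect_onsets(spectrogram, template, threshold):
    """Return start frames where the template closely matches the spectrogram.

    spectrogram: list of frames, each a list of power values per frequency bin
    template:    shorter list of frames with the same number of bins
    Assumption: a match is an RMS distance below `threshold`.
    """
    t_len = len(template)
    n_values = t_len * len(template[0])
    onsets = []
    for start in range(len(spectrogram) - t_len + 1):
        sq_sum = 0.0
        for seg_frame, tpl_frame in zip(spectrogram[start:start + t_len], template):
            sq_sum += sum((s - t) ** 2 for s, t in zip(seg_frame, tpl_frame))
        if math.sqrt(sq_sum / n_values) < threshold:
            onsets.append(start)
    return onsets


# Toy template: a decaying burst over 3 frames and 4 frequency bins.
template = [[1.0] * 4, [0.5] * 4, [0.25] * 4]
# Toy spectrogram: silence with the burst placed at frames 3 and 10.
spectrogram = [[0.0] * 4 for _ in range(16)]
for onset in (3, 10):
    for i, frame in enumerate(template):
        spectrogram[onset + i] = list(frame)

onsets = detect_onsets(spectrogram, template, threshold=0.1)
```

In real mixtures, other instruments overlap the drum sounds and a fixed seed template rarely fits every song, which is why the adaptation and suppression methods described above are needed.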

In Chapter 5, we discuss the evaluation of our system by using audio signals from commercial CDs and their corresponding rating scores obtained from an e-commerce site. The results revealed that our system accurately recommended pieces, including non-rated ones, from a wide diversity of artists and maintained a high degree of accuracy even when new rating scores, users, and pieces were added. Cluster-based training, which can speed up model training a hundredfold, had the potential to improve the accuracy of recommendations. That is, we found a breakthrough that overcame the trade-off between accuracy and efficiency, which has been considered unavoidable. In addition, we verified the importance of timbral and rhythmic features in making accurate recommendations.

Chapter 6 discusses the major contributions of this study to different research fields, particularly to music recommendation and music analysis. We also discuss issues that still remain to be resolved and future directions of research.

Chapter 7 concludes this thesis.
