MIR PhD Thesis: Masataka Goto (1998)

A Study of Real-time Beat Tracking for Musical Audio Signals (in Japanese)

Masataka Goto
Waseda University, Japan (March, 1998)

ABSTRACT

Although a great deal of music-understanding research has been undertaken, it is still difficult to build a computer system that can understand musical audio signals in a human-like fashion. One popular approach to the computational modeling of music understanding is to build an automatic music transcription system or a sound source separation system, which typically transforms audio signals into a symbolic representation such as a musical score or MIDI data. Although such transcription technologies are important, they have difficulty dealing with real-world audio signals such as those sampled from compact discs. Because only a trained listener can identify musical notes, it can be inferred that musical transcription is an advanced skill that is difficult even for human beings to acquire.

This study, in contrast, focuses on the fact that an untrained listener understands music to some extent without mentally representing audio signals as musical scores. The approach of this study is to first build a computational model that can understand music the way untrained listeners do, without relying on transcription, and then extend the model so that it can understand music the way musicians do. The appropriate first step of this approach is to build a computational model of beat tracking, the process of understanding musical beats and measures, which are fundamental and important concepts in music. Beat tracking is a fundamental skill for both trained and untrained listeners and is indispensable to the perception of Western music. Moreover, beat tracking is useful in various applications in which music synchronization is necessary.

Most previous beat-tracking-related systems have dealt with symbolic musical information such as MIDI signals. They were, however, not able to process audio signals that were difficult to transform into a symbolic representation. Although some systems dealt with audio signals, they had difficulty processing, in real time, audio signals containing sounds of various instruments and tracking beats above the quarter-note level.

This thesis describes a real-time beat tracking system that recognizes a hierarchical beat structure in audio signals of popular music containing sounds of various instruments. The hierarchical beat structure consists of the quarter-note (beat) level, the half-note level, and the measure (bar-line) level. The system can process both music with drums and music without drums.

This thesis consists of the following nine chapters.

Chapter 1 presents the goal, background, and significance of this study. This study is relevant to both computational auditory scene analysis and musical information processing and contributes to various research fields such as music understanding, signal processing, artificial intelligence, parallel processing, and computer graphics.

Chapter 2 defines the beat-tracking problem as a process that organizes musical audio signals into the hierarchical beat structure. This problem can be considered the inverse problem of the following three processes: indicating or implying the beat structure when performing music, playing musical instruments, and acoustic transmission of those sounds. The principal reason why beat tracking is intrinsically difficult is that it is the problem of inferring the original beat structure, which is not explicitly expressed in music. The main issues in solving this problem are: detecting beat-tracking cues in audio signals, interpreting the cues to infer the beat structure, and dealing with ambiguity of interpretation.

Chapter 3 proposes a beat-tracking model that consists of the inverse model of the process of indicating the beat structure and a model of extracting musical elements. The inverse model is represented by three kinds of musical knowledge corresponding to three kinds of musical elements: onset times, chord changes, and drum patterns. This chapter then addresses the three issues as follows:

(1) Detecting beat-tracking cues in audio signals
The three kinds of musical elements are detected as the beat-tracking cues. Since it is difficult to detect chord changes and drum patterns by bottom-up frequency analysis, this chapter proposes a method of detecting them by making use of provisional beat times as top-down information.

(2) Interpreting the cues to infer the beat structure
The quarter-note level is inferred on the basis of the musical knowledge of onset times. The half-note and measure levels are inferred on the basis of the musical knowledge of chord changes and drum patterns that is applied selectively according to the presence or absence of drum-sounds.

(3) Dealing with ambiguity of interpretation
To examine multiple hypotheses of beat positions in parallel, a multiple-agent model is introduced. Each agent makes a hypothesis according to a different strategy while interacting with another agent, and evaluates the reliability of its own hypothesis. The final beat-tracking result is determined on the basis of the most reliable hypothesis.
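
The selection of the most reliable hypothesis can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Agent` class, its `interval` and `phase` fields, and the reliability score (mean onset strength at the predicted beat frames) are all assumptions made for illustration, and the interaction between agents described above is omitted.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical agent holding one beat hypothesis (not the thesis's design)."""
    interval: int  # hypothesized inter-beat interval, in frames
    phase: int     # hypothesized frame index of the first beat

    def predicted_beats(self, n_frames):
        """Frame indices at which this hypothesis expects a beat."""
        return range(self.phase, n_frames, self.interval)

    def reliability(self, onsets):
        """Assumed score: mean onset strength at the predicted beat frames."""
        beats = list(self.predicted_beats(len(onsets)))
        if not beats:
            return 0.0
        return sum(onsets[b] for b in beats) / len(beats)


def most_reliable(agents, onsets):
    """Pick the agent whose hypothesis scores highest on the onset data."""
    return max(agents, key=lambda agent: agent.reliability(onsets))


# Synthetic onset-strength sequence with an onset every 4 frames.
onsets = [1.0 if i % 4 == 0 else 0.0 for i in range(32)]

# Three competing hypotheses: correct, wrong phase, wrong interval.
agents = [Agent(interval=4, phase=0), Agent(interval=4, phase=2), Agent(interval=3, phase=0)]
best = most_reliable(agents, onsets)
```

On this synthetic input, the agent with interval 4 and phase 0 is selected, because every one of its predicted beats coincides with an onset.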

Chapter 4 describes the processing model of the system. In the frequency analysis stage, the system detects onset-time vectors representing onset times of all the frequency ranges. It also detects onset times of a bass drum and a snare drum and judges the presence of drum-sounds by using autocorrelation of the snare drum's onset times. In the beat prediction stage, each agent infers the quarter-note level by using autocorrelation and cross-correlation of the onset-time vectors. The quarter-note level is then utilized as the top-down information to detect chord changes and drum patterns. On the basis of those detected results, each agent infers the higher levels and evaluates the reliability. Finally, the beat-tracking result is transmitted to other application programs via a computer network.
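
As a rough illustration of how autocorrelation can expose the inter-beat interval, the following Python sketch works on a single synthetic onset-strength sequence. It is a simplification under stated assumptions: the system described above uses onset-time vectors across frequency ranges together with cross-correlation, whereas this sketch uses one scalar sequence and autocorrelation only. The function names and the lag search range are hypothetical.

```python
def autocorrelation(signal, lag):
    """Unnormalized autocorrelation of `signal` at a positive integer `lag`."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))


def estimate_beat_interval(onsets, min_lag, max_lag):
    """Return the lag (in frames) with the highest autocorrelation.

    The search range [min_lag, max_lag] is an assumed parameter that
    bounds the tempi considered plausible.
    """
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(onsets, lag))


# Synthetic onset-strength sequence with an onset every 4 frames.
onsets = [1.0 if i % 4 == 0 else 0.0 for i in range(64)]
interval = estimate_beat_interval(onsets, min_lag=2, max_lag=7)
```

Here the strongest autocorrelation peak falls at a lag of 4 frames, matching the spacing of the synthetic onsets. Autocorrelation yields only the interval; locating the beat positions themselves requires a further step, which is the role the abstract assigns to cross-correlation.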

Chapter 5 proposes a method of parallelizing the processing model so that it can run in real time. This method applies four kinds of parallelizing techniques to execute heterogeneous processes simultaneously. The processes are first pipelined, and then each stage of the pipeline is implemented with data/control parallel processing, pipeline processing, and distributed cooperative processing. This chapter also proposes a time-synchronization mechanism for real-time processing on the AP1000, on which the system has been implemented.

Chapter 6 proposes quantitative measures for analyzing beat-tracking accuracy and shows experimental results of the system. By using the proposed measures, the system was evaluated on 85 songs sampled from compact discs of popular music. The results showed that the recognition accuracies were more than 86.7% at each level of the beat structure. It was also confirmed that musical decisions based on chord changes and drum patterns were effective.

Chapter 7 concludes that the processing model proposed by this thesis was robust enough to track beats in real-world audio signals sampled from compact discs. Moreover, the validity of the three kinds of musical knowledge introduced as the inverse model is verified by the results described in the previous chapter. The main contribution of this thesis is to propose a new computational model that can recognize the hierarchical beat structure in audio signals in real time. Although such a beat-tracking model has been desired, it was not built in previous work.

Chapter 8 introduces various applications in which beat tracking is useful. To confirm that the system is effective in a real-world application, an application that displays real-time computer graphics dancers whose motions change in time to musical beats has already been developed.

Chapter 9 concludes this thesis by summarizing the main results. This thesis shows that it is possible to build a computational model of real-time beat tracking for audio signals, which is one of the important processes of music perception, and makes a step toward a complete computational model that can understand music in a human-like fashion.
