MIR PhD Thesis: Matt Wright (2008)

Computer-Based Music Theory and Acoustics

Matt Wright
Stanford University, CA, USA (March, 2008)

ABSTRACT

A musical event’s Perceptual Attack Time (“PAT”) is its perceived moment of rhythmic placement; in general it is after physical or perceptual onset. If two or more events sound like they occur rhythmically together it is because their PATs occur at the same time, and the perceived rhythm of a sequence of events is the timing pattern of the PATs of those events. A quantitative model of PAT is useful for the synthesis of rhythmic sequences with a desired perceived timing as well as for computer-assisted rhythmic analysis of recorded music. Musicians do not learn to make their notes' physical onsets have a certain rhythm; rather, they learn to make their notes' perceptual attack times have a certain rhythm.

PAT is notoriously difficult to measure, because all known methods can measure a test sound’s PAT only relative to a physical action or to a second sound, both of which add their own uncertainty to the measurements. A novel aspect of this work is the use of the ideal impulse (the shortest possible digital audio signal) as a reference sound. Although the ideal impulse is the best possible reference in the sense of being perfectly isolated in time and having a very clear and percussive attack, it is quite difficult to use as a reference for most sounds because it has a perfectly broad frequency spectrum, and it is more difficult to perceive the relative timing of sounds when their spectra differ greatly. This motivates another novel contribution of this work, Spectrally Matched Click Synthesis: the creation of arbitrarily short clicks whose magnitude frequency spectra approximate those of arbitrary input sounds.
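As a rough illustration of the idea of a spectrally matched click, the following Python sketch builds a short click whose magnitude spectrum is a smoothed approximation of an input sound's. It is not the synthesis method of the thesis: it substitutes a simple zero-phase, windowed frequency-sampling design, and the function name `matched_click` and its parameters are hypothetical.

```python
import numpy as np

def matched_click(sound, click_len):
    """Hypothetical sketch: a short click with roughly the same
    magnitude spectrum as `sound` (zero-phase design, then windowing).
    Assumes click_len is odd and click_len <= len(sound)."""
    sound = np.asarray(sound, dtype=float)
    n_fft = len(sound)
    # Keep the magnitude spectrum, discard the phase.
    mag = np.abs(np.fft.rfft(sound, n_fft))
    # A real, non-negative spectrum gives a real signal symmetric about index 0.
    zero_phase = np.fft.irfft(mag, n_fft)
    # Rotate so the peak sits in the middle of the click, then truncate.
    centered = np.roll(zero_phase, click_len // 2)[:click_len]
    # Taper the truncation; a shorter click means a smoother spectral match.
    click = centered * np.hanning(click_len)
    peak = np.max(np.abs(click))
    return click / peak if peak > 0 else click

# Example: a 31-sample click matched to a decaying 1 kHz tone at 44.1 kHz.
t = np.arange(2048) / 44100.0
tone = np.exp(-40.0 * t) * np.sin(2 * np.pi * 1000.0 * t)
click = matched_click(tone, 31)
```

Shortening `click_len` trades spectral detail for temporal compactness, which is the tension the paragraph above describes between a well-isolated reference and a spectrally similar one.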

All existing models represent the PAT of each event as a single instant. However, there is often a range of values that sound equally correct when aligning sounds rhythmically, and this range depends on perceptual characteristics of the specific sounds such as the sharpness of their attacks. Therefore this work represents each event’s PAT as a continuous probability density function indicating how likely a typical listener would be to hear the sound’s PAT at each possible time. The methodological problem of deriving each sound’s own PAT from measurements comparing pairs of sounds therefore becomes the problem of estimating the distributions of the random variables for each sound’s intrinsic PAT given only observations of a random variable corresponding to the difference between the intrinsic PAT distributions for the two sounds plus noise. Methods presented to address this draw from maximum likelihood estimation and the graph-theoretical shortest path problem.
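To make the estimation problem concrete, here is a minimal Python sketch of recovering per-sound PAT locations from pairwise difference measurements. It rests on assumptions that are not from the thesis: each intrinsic PAT is reduced to a mean, the noise is Gaussian with equal variance on every trial (so least squares coincides with maximum likelihood), and one reference sound is pinned to zero. The function name `estimate_pat_means` is hypothetical, and the sketch does not attempt the shortest-path methods mentioned above.

```python
import numpy as np

def estimate_pat_means(pairs, diffs, n_sounds, ref=0):
    """Hypothetical sketch: least-squares PAT means from pairwise data.
    pairs: list of (a, b) sound indices; diffs[i] is the observed
    PAT_a - PAT_b for pairs[i]. Sound `ref` is pinned to 0."""
    design = np.zeros((len(pairs), n_sounds))
    for row, (a, b) in enumerate(pairs):
        design[row, a] = 1.0
        design[row, b] = -1.0
    # Differences only fix PATs up to a constant, so drop the reference column.
    keep = [i for i in range(n_sounds) if i != ref]
    sol, *_ = np.linalg.lstsq(design[:, keep],
                              np.asarray(diffs, dtype=float), rcond=None)
    means = np.zeros(n_sounds)
    means[keep] = sol
    return means

# Example with made-up numbers: three sounds, differences in milliseconds.
pairs = [(1, 0), (2, 0), (2, 1)]
diffs = [5.0, 12.0, 7.0]
means = estimate_pat_means(pairs, diffs, n_sounds=3)
```

Under the same Gaussian assumption, the variance of each observed difference would be the sum of the two sounds' intrinsic variances plus the noise variance, which is why the spreads cannot be read off a single pair of sounds.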

This work describes an online listening test in which subjects download software that presents a series of PAT measurement trials, each of which lets them adjust the relative timing of a pair of sounds until the two sound synchronous. This establishes perceptual “ground truth” for the PAT of a collection of 20 sounds compared against each other in various combinations. As hoped, subjects were indeed able to align a sound more reliably to one of that sound’s spectrally matched clicks than to other sounds of the same duration.

The representation of PAT with probability density functions provides a new perspective on the long-standing problem of predicting PAT directly from acoustical signals. Rather than choosing a single moment for PAT given a segment of sound known a priori to contain a single musical event, the methods presented here estimate the continuous shapes of PAT distributions from continuous (not necessarily presegmented) audio signals. The task is formulated as a supervised machine learning regression problem whose inputs are DSP functions computed from the sound, namely the detection functions used in the automatic onset detection literature. This work concludes with some preliminary musical applications of the resulting models.
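For readers unfamiliar with detection functions, the following Python sketch computes one simple example of the kind used in onset detection: the half-wave-rectified frame-to-frame increase in signal energy. The choice of this particular function, the name `energy_flux`, and the frame and hop sizes are illustrative assumptions; the abstract does not say which detection functions the thesis uses as regression inputs.

```python
import numpy as np

def energy_flux(signal, frame=256, hop=128):
    """Hypothetical sketch: per-frame increase in energy, with
    decreases clipped to zero (half-wave rectification)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    # Sum of squared samples in each (overlapping) frame.
    energy = np.array([np.sum(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    # First frame has no predecessor, so its flux is defined as 0.
    flux = np.diff(energy, prepend=energy[0])
    return np.maximum(flux, 0.0)

# Example: silence followed by a constant signal starting at sample 1024.
step = np.concatenate([np.zeros(1024), np.ones(1024)])
flux = energy_flux(step)
```

In a regression setting of the sort described above, a curve like `flux` would serve as one input feature per time frame, with the target being the value of the PAT density at that time.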
