Time-frequency Methods for Pitch Detection

Herbert Griebel
Vienna University of Technology, Austria (September, 2002)


The thesis proposes new methods for pitch detection in monophonic and polyphonic signals. The signals investigated are speech and music signals with non-ideal, real-world properties, such as a small number of harmonics or stretched harmonics. Pitch detection means detecting the fundamental frequency of a harmonic complex sound, i.e. a sound that consists of a fundamental tone and harmonics at integer multiples of the fundamental frequency. In addition, the detection of a single sinusoid is treated in general, both under strong overlap from arbitrary other narrow-band components and under strong overlap from other stable sinusoidal components.

The fundamental problem of polyphonic pitch detection is the overlap of signal components: estimating frequency, amplitude and phase is then no longer a simple task. Resolving overlapping deterministic signal components has been neglected in the past and is the main part of this thesis.

The detection of individual sinusoidal components is a subproblem of fundamental frequency detection. The motivation for treating it thoroughly is the problem of detecting voiced and unvoiced segments of a speech signal, which is more difficult than detecting the fundamental frequency, and, further, its application in automatic speech recognition. If the amplitudes of the speech harmonics and of the sinusoidal tone have the same order of magnitude, individual kernels of the front-end filterbank are disturbed and the error rate deteriorates. The proposed method uses two additional time-frequency planes, which represent the smoothness of the sinusoidal signal. It can detect a stationary sinusoidal signal even under strong overlap with partial tones of a speech signal.

Polyphonic pitch detection is the main part of an automatic music recognition system. Musicians could use such a system; analysis of musical expression and tune recognition are important applications. The evaluated iterative method identifies the most easily detectable sound and subtracts it from the overall spectrum. Both steps are repeated until no further sound is detectable. Sounds are detected locally in bands, and the method does not use partial tracks. The subtraction simplifies the spectrum in the sense that overlaps are resolved and other sounds become detectable.

Detection of the fundamental frequency of speech is economically the most important problem. With an accurate signal model, many problems can be solved more easily or become solvable at all. Applications include denoising or equalizing speech, estimating syllable rates, speech recognition and speech detection. Despite the vast amount of research already done in the field, currently available methods are not reliable enough. The proposed method overcomes some of these shortcomings and gives more reliable results than other methods, in particular all correlation-based methods.
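
The iterative detect-and-subtract loop described for polyphonic signals can be illustrated with a minimal sketch. It is not the method of the thesis: it assumes a plain harmonic-summation score in place of the band-wise detection, and it "subtracts" a sound simply by zeroing the spectral bins around its harmonics, so it does not resolve overlapping partials. The function names, the candidate grid, the number of harmonics and the threshold are all hypothetical choices made for the example.

```python
import numpy as np

def harmonic_score(mag, df, f0, n_harm):
    """Sum the magnitude spectrum at the first n_harm integer multiples of f0."""
    total = 0.0
    for h in range(1, n_harm + 1):
        k = int(round(h * f0 / df))       # nearest FFT bin of harmonic h
        if k < len(mag):
            total += mag[k]
    return total

def detect_and_subtract(mag, df, candidates, n_harm=5, threshold=0.5,
                        width=1, max_sounds=4):
    """Repeatedly pick the strongest harmonic sound and remove it.

    Assumption: removal zeroes +/- width bins around each harmonic,
    a crude stand-in for a real spectral subtraction.
    """
    residual = mag.copy()
    found = []
    for _ in range(max_sounds):
        scores = [harmonic_score(residual, df, f0, n_harm) for f0 in candidates]
        best = int(np.argmax(scores))
        if scores[best] < threshold:      # nothing detectable any more
            break
        f0 = float(candidates[best])
        found.append(f0)
        for h in range(1, n_harm + 1):
            k = int(round(h * f0 / df))
            residual[max(k - width, 0):k + width + 1] = 0.0
    return found

# Synthetic test signal: two harmonic sounds (200 Hz and 310 Hz),
# five partials each with amplitudes falling off as 1/h.
fs, n = 8000, 8000                        # 1 s of audio, 1 Hz bin spacing
t = np.arange(n) / fs
x = np.zeros(n)
for f0, amp in [(200.0, 1.0), (310.0, 0.6)]:
    for h in range(1, 6):
        x += (amp / h) * np.sin(2 * np.pi * h * f0 * t)

mag = 2.0 * np.abs(np.fft.rfft(x)) / n    # magnitude scaled to sinusoid amplitude
df = fs / n
candidates = np.arange(100.0, 500.0, 1.0)
found = detect_and_subtract(mag, df, candidates)
print(found)
```

In this toy setting the partials of the two sounds do not coincide and all frequencies fall exactly on FFT bins, so zeroing bins is sufficient. With real signals, where partials overlap and leak across bins, the subtraction step is the hard part, which is the problem the abstract identifies as central.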
