Humans seem to perform sound-source separation for quasi-periodic sounds, such as speech, mostly by relying on harmonicity cues. To model this function, most machine algorithms use a pitch-based approach to group the speech parts of the spectrum. In these methods the pitch is obtained either explicitly, as in autocorrelation methods, or implicitly, as in harmonic sieves. If the pitch estimate is wrong, the grouping fails as well. In this paper we present a method that performs harmonic grouping without first calculating the pitch. Instead, a pitch estimate is associated with each grouping hypothesis.
Decoupling the grouping from the pitch estimate makes it more robust in noisy settings.
The algorithm obtains candidate harmonics by tracking energy peaks in a cochleogram. Co-occurring harmonics are compared in terms of their frequency difference. Grouping hypotheses are formed by combining harmonics with similar frequency differences. Consistency checks are performed on these hypotheses, and hypotheses with compatible properties are combined into harmonic complexes. Every harmonic complex is evaluated on the number of harmonics, the number of consecutive harmonics, and the presence of a harmonic at the pitch position. Using the number of consecutive harmonics prevents octave errors.
Multiple concurrent harmonic complexes can be found as long as the spectral overlap is small.
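The grouping and evaluation steps can be illustrated with a minimal single-frame sketch. It is not the implementation described in this paper: it skips the cochleogram and the peak tracking, compares only adjacent peaks, and omits the consistency checks. The function names `harmonic_hypotheses` and `score`, the tolerance `rel_tol`, and the example frequencies are all assumptions made for illustration.

```python
def score(peaks_hz, pitch_hz):
    """Evaluate a set of peaks against a pitch estimate (illustrative only)."""
    # Harmonic number of each peak under this pitch estimate.
    numbers = sorted({round(f / pitch_hz) for f in peaks_hz})
    # Longest run of consecutive (subsequent) harmonic numbers.
    longest = 1 if numbers else 0
    run = 1
    for a, b in zip(numbers, numbers[1:]):
        run = run + 1 if b == a + 1 else 1
        longest = max(longest, run)
    return {
        "pitch_hz": pitch_hz,
        "harmonics": numbers,
        "count": len(numbers),
        "longest_run": longest,
        "has_fundamental": 1 in numbers,  # harmonic at the pitch position
    }


def harmonic_hypotheses(peaks_hz, rel_tol=0.05):
    """Group peaks whose adjacent frequency differences are similar.

    Simplification (assumption): only differences between neighbouring
    peaks are considered, and rel_tol is a hypothetical tolerance.
    """
    peaks = sorted(set(peaks_hz))
    # (difference, lower peak, upper peak), sorted by difference.
    pairs = sorted((hi - lo, lo, hi) for lo, hi in zip(peaks, peaks[1:]))
    clusters = []
    for diff, lo, hi in pairs:
        last = clusters[-1] if clusters else None
        if last and abs(diff - last["mean"]) <= rel_tol * last["mean"]:
            last["diffs"].append(diff)
            last["peaks"].update((lo, hi))
            last["mean"] = sum(last["diffs"]) / len(last["diffs"])
        else:
            clusters.append({"diffs": [diff], "peaks": {lo, hi}, "mean": diff})
    # The pitch estimate is attached to each hypothesis after grouping.
    return [score(c["peaks"], c["mean"]) for c in clusters]


# Five harmonics of a 200 Hz source plus one unrelated peak (made-up values).
frame = [200, 400, 600, 800, 1000, 1370]
hypotheses = harmonic_hypotheses(frame)
best = max(hypotheses, key=lambda h: (h["longest_run"], h["count"]))

# The same peaks read at half the pitch contain no consecutive harmonics.
octave_low = score(frame[:5], 100.0)
```

In this sketch the hypothesis with the longest run of consecutive harmonics is preferred. Reading the same five peaks at 100 Hz yields harmonic numbers 2, 4, 6, 8, 10, so the run length drops to 1, which is how counting consecutive harmonics penalizes an octave-too-low interpretation.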