08-05-2013, 03:39 PM
MUSICAL SOUND SEPARATION USING PITCH-BASED LABELING AND BINARY TIME-FREQUENCY MASKING
ABSTRACT
Monaural musical sound separation attempts to segregate different
instrument lines from single-channel polyphonic music.
We propose a system that decomposes an input into time-frequency
units using an auditory filterbank and uses pitch to label
the instrument line to which each time-frequency unit is
assigned. The system is conceptually simple and computationally
efficient. Systematic evaluation shows that, despite
its simplicity, the proposed system achieves a competitive level
of performance.
INTRODUCTION
As the demand for automatically analyzing, organizing, and
retrieving a vast amount of online music data explodes, musical
sound separation has attracted significant attention in recent
years. Monaural separation that attempts to recover each
source/instrument line from single-channel polyphonic music
is a particularly challenging problem. On the other hand,
a system with such functionality enables more efficient
audio coding, more accurate content-based analysis, and sophisticated
manipulation of musical signals [1].
In music, multiple instruments often play simultaneously.
The polyphonic nature of music creates unique problems for
monaural musical sound separation. One such problem is
overlapping harmonics where a harmonic of one note has a
frequency that is the same as or close to the frequency of a
harmonic from another concurrent note. The phenomenon
of overlapping harmonics is common because Western music
favors harmonically related notes, whose pitches stand in simple
integer ratios [2]. It is in general difficult to recover each
individual harmonic without instrument-specific knowledge.
The interplay of different instrument lines also invalidates
the assumption that the sound sources are independent.
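To make the overlap concrete, here is a small illustrative script (not from the paper) that counts near-coinciding harmonics for two notes a perfect fifth apart; the note frequencies and tolerance are our own choices:

```python
# Illustrative only: for a perfect fifth (frequency ratio ~3:2), every third
# harmonic of the lower note nearly coincides with every second harmonic of
# the upper note.

def harmonics(f0, n=10):
    """First n harmonic frequencies of a tone with fundamental f0 (Hz)."""
    return [f0 * k for k in range(1, n + 1)]

def overlapping(f0_a, f0_b, n=10, tol_hz=15.0):
    """Pairs of harmonic indices whose frequencies lie within tol_hz."""
    pairs = []
    for i, fa in enumerate(harmonics(f0_a, n), start=1):
        for j, fb in enumerate(harmonics(f0_b, n), start=1):
            if abs(fa - fb) < tol_hz:
                pairs.append((i, j, fa, fb))
    return pairs

# C4 (261.6 Hz) and G4 (392.0 Hz), an equal-tempered approximation of 3:2:
for i, j, fa, fb in overlapping(261.6, 392.0):
    print(f"harmonic {i} of C4 ({fa:.1f} Hz) ~ harmonic {j} of G4 ({fb:.1f} Hz)")
```

With ten harmonics per note, three pairs collide (harmonics 3/2, 6/4, and 9/6), so a substantial fraction of each note's energy sits in contested T-F regions.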
SYSTEM DESCRIPTION
Our proposed system is illustrated in Fig. 1. The input to the
system is monaural polyphonic music. In the time-frequency
(T-F) decomposition stage, the system decomposes the input
into its frequency components using an auditory filterbank
and divides the output of each filter into overlapping frames.
We call an element indexed by frame and frequency a T-F
unit. In the next stage, an auditory representation, called the
correlogram, is computed. At the same time, the pitches of
different instrument lines are detected in the multiple pitch
detection module. Multiple pitch detection for music is a
very difficult problem. Since the main focus of this study
is to investigate the performance of pitch-based separation in
music using auditory representations, we do not perform multiple
pitch detection (indicated by the dashed box); instead we
supply the system with ideal pitches detected from the premixed
instrument lines. In the pitch-based labeling stage, pitches are
used to determine which instrument line each T-F unit should
be assigned to. This creates a binary mask for each line. In
this paper we do not attempt to separate overlapping harmonics,
leaving them for future study. In the final stage of the
system, the masks are used to resynthesize individual instrument
lines. The details of each stage are explained in the
following subsections.
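As an illustration of the T-F decomposition stage, the following is a minimal sketch assuming a gammatone filterbank (a common choice in CASA front ends); the channel count, frame length, and hop size here are our own illustrative values, not the paper's settings:

```python
# Sketch of T-F decomposition: gammatone filterbank + overlapping frames.
# All parameter values are illustrative assumptions.
import numpy as np

def gammatone_fir(fc, fs, duration=0.025, order=4):
    """FIR approximation of a gammatone filter centered at fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 + 0.108 * fc              # equivalent rectangular bandwidth
    g = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb * t) \
        * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))   # unit-energy normalization

def tf_decompose(x, fs, n_channels=32, frame_len=0.020, hop=0.010,
                 fmin=80.0, fmax=5000.0):
    """Filter x with a gammatone filterbank, then cut each channel's output
    into overlapping frames.  Returns (units, fcs) where units has shape
    (n_channels, n_frames, samples_per_frame): one T-F unit per cell."""
    # Center frequencies spaced uniformly on the ERB-rate scale.
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv_erb = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3
    fcs = inv_erb(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))

    flen, fhop = int(frame_len * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    units = np.empty((n_channels, n_frames, flen))
    for c, fc in enumerate(fcs):
        y = np.convolve(x, gammatone_fir(fc, fs), mode="same")
        for m in range(n_frames):
            units[c, m] = y[m * fhop : m * fhop + flen]
    return units, fcs

fs = 16000
x = np.random.randn(fs)                  # 1 s of noise as a stand-in signal
units, fcs = tf_decompose(x, fs)
print(units.shape)                       # (32, 99, 320)
```

Each cell `units[c, m]` is one T-F unit: the response of channel c during frame m, which later stages label and mask.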
EVALUATION
To evaluate the proposed system, we constructed a database
of 20 quartet pieces composed by J. S. Bach.
Since it is difficult to obtain multi-track signals where different
instruments are recorded in different tracks, we generate
audio signals from MIDI files. For each MIDI file, we use the
tenor and the alto line for synthesis since we focus on separating
two concurrent instrument lines. Audio signals could
be generated from MIDI data using a MIDI synthesizer, but
such signals tend to have stable spectral content, which is
very different from real music recordings. In this study, we
use recorded note samples from the RWC music instrument
database [13] to synthesize audio signals based on MIDI data.
First, each line is randomly assigned to one of the four instruments:
a clarinet, a flute, a violin, and a trumpet. After that,
for each note in the line, a note sound sample with the closest
average pitch is selected from the samples of the assigned
instrument and used for that note. Details about the synthesis
procedure can be found in [14]. Admittedly, the audio signals
generated this way are a rough approximation of real recordings,
but they exhibit realistic spectral and temporal variations.
The two instrument lines are mixed at 0 dB SNR for separation.
The first 5 seconds of each piece are used for testing.
The pitches of each instrument line are detected using Praat
[15].
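The 0 dB mixing step can be sketched as follows; the function name and test signals are our own, and only the scale-to-equal-energy idea comes from the text:

```python
# Sketch: scale one line so the two lines have a given energy ratio (SNR),
# then sum them.  At 0 dB SNR the two components carry equal energy.
import numpy as np

def mix_at_snr(target, interference, snr_db=0.0):
    """Scale `interference` so that 10*log10(E_target / E_interference)
    equals snr_db, then return the single-channel mixture."""
    e_t = np.sum(target ** 2)
    e_i = np.sum(interference ** 2)
    gain = np.sqrt(e_t / (e_i * 10 ** (snr_db / 10.0)))
    return target + gain * interference

rng = np.random.default_rng(0)
a = rng.standard_normal(8000)            # stand-in for one instrument line
b = 0.1 * rng.standard_normal(8000)      # much weaker before scaling
mix = mix_at_snr(a, b, snr_db=0.0)       # components now have equal energy
```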
CONCLUSION
In this paper, we have proposed a CASA system for monaural
musical sound separation. We label each T-F unit solely based
on the values of the autocorrelation function at time lags corresponding
to the two pitch periods. The SNR evaluation shows that
the proposed system is as effective as more complicated sinusoidal
model-based systems. Besides auditory filtering, the
main computation of our system is to obtain the values of autocorrelation
at two time lags at each T-F unit. Note that the
calculation of a full correlogram is unnecessary, i.e., the system
does not need to calculate autocorrelation for all possible
time lags. We believe there is considerable room to improve
our system. For example, segmentation and grouping, the two
stages widely adopted in CASA, can be applied to make unit
labeling more reliable. One could also first identify reliably
labeled T-F units and use them to guide the processing of
unreliable units. We will pursue these directions in
our future study.
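The labeling rule described above (comparing each T-F unit's autocorrelation at the two pitch lags and assigning the unit to the line with the larger value) could be sketched as follows; array shapes, names, and the synthetic test signal are our own assumptions:

```python
# Sketch of pitch-based labeling: evaluate the normalized autocorrelation of
# each T-F unit at exactly two lags (the two lines' pitch periods in samples)
# and build a binary mask from the comparison.
import numpy as np

def acf_at_lag(frame, lag):
    """Normalized autocorrelation of one T-F unit at an integer lag."""
    if lag >= len(frame):
        return 0.0
    num = np.dot(frame[:-lag], frame[lag:])
    den = np.dot(frame, frame) + 1e-12
    return num / den

def label_units(units, pitch_lags):
    """units: (n_channels, n_frames, frame_len) array of T-F units.
    pitch_lags: per-frame pairs of pitch periods in samples for the two
    instrument lines.  Returns a binary mask of shape (n_channels, n_frames):
    0 where the unit is assigned to line 0, 1 where it goes to line 1."""
    n_ch, n_fr, _ = units.shape
    mask = np.zeros((n_ch, n_fr), dtype=int)
    for m in range(n_fr):
        lag0, lag1 = pitch_lags[m]
        for c in range(n_ch):
            a0 = acf_at_lag(units[c, m], lag0)
            a1 = acf_at_lag(units[c, m], lag1)
            mask[c, m] = int(a1 > a0)
    return mask

frame = np.cos(2 * np.pi * np.arange(320) / 40.0)   # periodic with lag 40
units = frame[None, None, :]                        # one channel, one frame
mask = label_units(units, [(40, 57)])
print(mask)   # → [[0]]: the unit matches line 0's pitch period
```

Note how little computation this needs: per unit, just two dot products, which is why the full correlogram over all lags is unnecessary.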