Abstract:
This paper proposes an approach for indexing a
collection of multimedia clips by the speakers present in
their audio tracks. A Bayesian Information Criterion (BIC) procedure
is used for segmentation, and Mel-Frequency Cepstral
Coefficients (MFCC) are extracted and sampled as
metadata for each segment. Silence detection is also
carried out during segmentation. Gaussian Mixture
Models (GMM) are trained for each speaker, and an
ensemble technique is proposed to reduce errors caused
by the probabilistic nature of GMM training. The
indexing system utilizes sampled MFCC features as
segment metadata and maintains the metadata of the
speakers separately, allowing modifications or additions to
be made independently. The system achieves a True Miss
Rate (TMR) of around 20% and a False Alarm Rate
(FAR) of around 10% for segments between 15 and 25
seconds in length, with performance degrading as
segment length decreases.
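
As an illustration of the BIC-based segmentation step, the following is a minimal sketch of the commonly used delta-BIC test for a single candidate change point. It is not the exact procedure of this paper. It assumes that each side of an analysis window is modelled by one full-covariance Gaussian over the feature frames (for example MFCC vectors); the function name `delta_bic`, the `penalty_weight` parameter, the small regularisation term, and the synthetic data are assumptions introduced here for illustration.

```python
import numpy as np


def delta_bic(frames, split, penalty_weight=1.0):
    """Delta-BIC for splitting `frames` (n_frames x n_dims) at index `split`.

    A positive value favours two separate Gaussians (a change point);
    a negative value favours a single Gaussian (no change).
    """
    frames = np.asarray(frames, dtype=float)
    n, d = frames.shape

    def n_logdet_cov(x):
        # Maximum-likelihood covariance, lightly regularised (assumption)
        # so that the determinant stays well defined on short windows.
        cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
        cov = cov + 1e-6 * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        return len(x) * logdet

    left, right = frames[:split], frames[split:]
    # Log-likelihood gain of the two-model hypothesis over the one-model one.
    gain = 0.5 * (n_logdet_cov(frames) - n_logdet_cov(left) - n_logdet_cov(right))
    # Penalty for the extra mean and covariance parameters of the second model.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)
    return gain - penalty


# Synthetic 4-dimensional "features" standing in for MFCC frames (assumption).
rng = np.random.default_rng(0)
same = rng.normal(0.0, 1.0, size=(400, 4))
changed = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 4)),
    rng.normal(5.0, 1.0, size=(200, 4)),
])

score_same = delta_bic(same, 200)        # homogeneous window
score_change = delta_bic(changed, 200)   # window with a change at frame 200
print("no change:", score_same > 0, "| change:", score_change > 0)
```

In a full segmenter, a test of this kind would typically be evaluated at many candidate positions inside a sliding or growing window, and a boundary declared where the score peaks above zero. The value of `penalty_weight` is usually tuned on held-out data.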