Abstract:
This paper proposes an approach for indexing a collection of multimedia
clips by the speaker in the audio track. A Bayesian Information Criterion
(BIC) procedure is used for segmentation and Mel-Frequency Cepstral
Coefficients (MFCC) are extracted and sampled as metadata for each
segment. Silence detection is also carried out during segmentation.
Gaussian Mixture Models (GMM) are trained for each speaker, and an
ensemble technique is proposed to reduce errors caused by the probabilistic
nature of GMM training. The indexing system utilizes sampled MFCC
features as segment metadata and maintains the metadata of the speakers
separately, allowing modifications or additions to be made independently.
The system achieves a True Miss Rate (TMR) of around 20% and a False
Alarm Rate (FAR) of around 10% for segments between 15 and 25 seconds
in length; performance degrades as segment length decreases.
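
To illustrate the BIC segmentation step mentioned above, the following is a
minimal sketch rather than the system's actual procedure. It assumes each
segment is modelled by a single full-covariance Gaussian over its MFCC
frames, and it uses the common delta-BIC test for a single change point. The
function names (delta_bic, find_change_point), the penalty weight of 1.0,
the 20-frame margin, and the small covariance regularizer are all
hypothetical choices made for this example.

```python
import numpy as np


def delta_bic(frames, t, penalty_weight=1.0):
    """Delta-BIC for splitting `frames` (N x d array) at frame index t.

    A positive value favours two separate Gaussians (a likely speaker
    change at t) over one Gaussian for the whole window.
    """
    n, d = frames.shape
    left, right = frames[:t], frames[t:]

    def logdet_cov(x):
        # Maximum-likelihood covariance, lightly regularized (assumption)
        # so that the determinant stays well defined.
        cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    # Log-likelihood gain from modelling the two halves separately.
    gain = 0.5 * (n * logdet_cov(frames)
                  - len(left) * logdet_cov(left)
                  - len(right) * logdet_cov(right))
    # Penalty for the extra mean vector and covariance matrix.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)
    return gain - penalty


def find_change_point(frames, margin=20, penalty_weight=1.0):
    """Return the frame index with the largest positive delta-BIC, or None."""
    n = frames.shape[0]
    candidates = range(margin, n - margin)
    scores = [delta_bic(frames, t, penalty_weight) for t in candidates]
    if not scores or max(scores) <= 0:
        return None
    return margin + int(np.argmax(scores))
```

In this sketch a window with no positive delta-BIC is treated as belonging to
a single speaker; a full segmenter would slide or grow the window over the
whole track and would also handle the silence detection described above.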