Within the MoCA project, our goal is to extract video content from both the picture and the audio track. One application of such content extraction is automatic indexing, which supports users (professionals such as librarians as well as home users) in their search tasks. The audio track carries important information about a video's content, not only within speech segments but also within the background music and within different noises. We have, for example, tried to detect violence automatically by analyzing the audio stream.
In the following paragraphs, the different activities of audio content analysis within the MoCA project are presented.
As already mentioned above, different types of information are found within the speech, music, silence and noise segments of the audio track. Before such information can be extracted, the audio track therefore has to be segmented. We have implemented algorithms that perform this segmentation based on the similarity of consecutive sound segments. On top of this segmentation, a classification step distinguishes music, speech, noise and silence.
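The following sketch illustrates one way such a similarity-based segmentation could look: it compares the magnitude spectra of consecutive windows and reports a boundary wherever the spectra differ strongly. The window length, sample rate and distance threshold are illustrative assumptions, not the values used in the MoCA implementation.

```python
# Minimal sketch of similarity-based segmentation, assuming a mono signal
# sampled at 22.05 kHz; window length and distance threshold are
# illustrative values, not those of the MoCA implementation.
import numpy as np

def segment_boundaries(signal, sr=22050, win=0.5, threshold=0.25):
    """Return sample positions where consecutive windows sound dissimilar."""
    n = int(win * sr)
    boundaries = []
    prev_spectrum = None
    for start in range(0, len(signal) - n, n):
        frame = signal[start:start + n] * np.hanning(n)
        spectrum = np.abs(np.fft.rfft(frame))
        spectrum /= np.linalg.norm(spectrum) + 1e-12
        if prev_spectrum is not None:
            # Cosine distance between the normalized magnitude spectra of
            # two consecutive windows; a large distance marks a boundary.
            distance = 1.0 - float(np.dot(prev_spectrum, spectrum))
            if distance > threshold:
                boundaries.append(start)
        prev_spectrum = spectrum
    return boundaries
```

The reported boundaries can then be handed to a classifier that labels each resulting segment as music, speech, noise or silence.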
The content of music is described along two dimensions within our
project: rhythm characteristics and tone characteristics.
Rhythm characteristics give a temporal description of musical events.
In a first step, we have extracted the beat of percussion-based music
by analyzing both time-domain parameters (i.e., the temporal
distribution of amplitude values) and frequency-domain parameters
(i.e., the frequency patterns of typical percussion instruments).
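As an illustration of the time-domain part of this analysis, the hedged sketch below estimates a beat period from peaks in the short-time energy envelope of the signal; the frame length, the peak criterion and the minimum gap between onsets are assumed values chosen only for illustration.

```python
# Hedged sketch of time-domain beat extraction for percussion-based music:
# peaks in the short-time energy envelope are taken as onsets and the beat
# period is the median gap between them. All parameters are assumptions.
import numpy as np

def estimate_beat_period(signal, sr=22050, frame=0.01, min_gap=0.2):
    n = int(frame * sr)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    envelope = np.sqrt((frames ** 2).mean(axis=1))   # short-time energy
    # Onsets: frames whose energy clearly exceeds the overall level.
    onsets = np.where(envelope > envelope.mean() + 2 * envelope.std())[0]
    gaps = np.diff(onsets) * frame
    gaps = gaps[gaps > min_gap]                      # ignore double hits
    return float(np.median(gaps)) if len(gaps) else None
```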
Tone characteristics give a direct description of the content of
musical events. Our first efforts have been the extraction of the
fundamental frequency of musical events and the identification of
single notes in single-voiced music. We further intend to extract
melodies and harmonic characteristics.
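A common textbook method for the fundamental frequency step is autocorrelation over a short frame, followed by a mapping of the estimated frequency to the nearest note name in equal temperament (A4 = 440 Hz). The sketch below follows that generic recipe; it is not necessarily the algorithm used in MoCA, and the search range and frame handling are assumptions.

```python
# Illustrative autocorrelation pitch estimator plus a mapping to note names
# in equal temperament (A4 = 440 Hz). A generic textbook method, not
# necessarily the MoCA algorithm; frame size and search range are assumed.
import numpy as np

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def fundamental_frequency(frame, sr=22050, fmin=50.0, fmax=2000.0):
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))   # strongest periodicity in range
    return sr / lag

def note_name(freq):
    semitones = int(round(12 * np.log2(freq / 440.0)))
    return NOTE_NAMES[semitones % 12]
```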
The content of speech is twofold: it can be used to determine who is speaking (speaker recognition) and what is being said (speech recognition). Both dimensions have been major research areas within Artificial Intelligence, and software for both is available. Therefore, our research efforts within MoCA Audio Content Analysis have left out speech analysis. It is, however, interesting to consider the situation speech analysis tools face within video content analysis: the vocabulary is basically unlimited, many different speakers are involved, training is not possible, speech is continuous, and many disturbing noises occur. Common speech recognition packages are therefore difficult to use for this task. In a first approach, complete recognition of speech is not necessary: a simpler word-spotting algorithm is sufficient.
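To make the word-spotting idea concrete, the following sketch matches a prerecorded keyword template against a sound track using dynamic time warping over magnitude-spectrum frames. The feature choice, frame sizes, threshold and function names are illustrative assumptions rather than the MoCA approach.

```python
# Hedged word-spotting sketch: a prerecorded keyword template is matched
# against the sound track with dynamic time warping (DTW) over magnitude
# spectra. Frame sizes, threshold and function names are assumptions.
import numpy as np

def spectral_frames(signal, sr=22050, frame=0.025, hop=0.010):
    n, h = int(frame * sr), int(hop * sr)
    window = np.hanning(n)
    return np.array([np.abs(np.fft.rfft(signal[i:i + n] * window))
                     for i in range(0, len(signal) - n, h)])

def dtw_distance(a, b):
    """Length-normalized DTW cost between two frame sequences."""
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[-1, -1] / (len(a) + len(b))

def spot_keyword(track, template, sr=22050, threshold=5.0):
    """Slide the keyword template over the track and report candidate hits."""
    track_f, tmpl_f = spectral_frames(track, sr), spectral_frames(template, sr)
    step = max(len(tmpl_f) // 2, 1)
    return [start for start in range(0, len(track_f) - len(tmpl_f) + 1, step)
            if dtw_distance(track_f[start:start + len(tmpl_f)], tmpl_f) < threshold]
```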
As an example of the analysis of noises, we implemented an application to detect violence in movie sequences. As violence itself has many aspects and depends strongly on the cultural environment, a computer system cannot recognize violence in all its forms. We therefore concentrated successfully on the recognition of a few indicators of violence: shots, explosions and cries.
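A very reduced illustration of such indicator detection is the sketch below, which flags the sudden energy bursts produced by shots or explosions; the frame length and the burst ratio are assumed values, and the published work relies on considerably richer cues, including the detection of cries.

```python
# Very reduced sketch of detecting shot- or explosion-like events as sudden
# energy bursts; frame length and burst ratio are assumed values. The actual
# MoCA work uses richer cues, including the analysis of cries.
import numpy as np

def burst_times(signal, sr=22050, frame=0.02, ratio=8.0, history=10):
    """Return times (in seconds) where frame energy jumps far above the recent level."""
    n = int(frame * sr)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    energy = (frames ** 2).mean(axis=1)
    times = []
    for i in range(history, len(energy)):
        background = energy[i - history:i].mean() + 1e-12
        if energy[i] / background > ratio:
            times.append(i * frame)
    return times
```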
For more information, see our technical report on Audio Content Analysis.
This paper describes the theoretical framework and applications of automatic audio content analysis. After explaining the basic properties of audio analysis, we present a toolbox that forms the basis for the development of audio analysis algorithms. We also describe new applications that can be developed using the toolbox, among them music indexing and retrieval as well as violence detection in the sound track of videos.
Silvia Pfeiffer, Stephan Fischer and Wolfgang Effelsberg, Automatic
Audio Content Analysis, Proc. ACM Multimedia 96, pp. 21-30, Boston, MA,
1996;
also Technical Report TR-96-008,
1996.