Description
The objective of Audio Content Analysis (ACA) is the extraction of information from audio signals such as music recordings stored on digital media. The information to be extracted is usually referred to as meta data: it is data about (audio) data and can essentially cover any information that allows a meaningful description or explanation of the raw audio data. The meta data represents (among other things) the musical content of the recording. Nowadays, attempts are being made to automatically extract practically everything from a music recording, including formal, perceptual, musical, and technical meta data. Examples range from tempo and key analysis – ultimately leading to the complete transcription of recordings into a score-like format – over the analysis of artists' performances of specific pieces of music, to approaches to modeling the listener's emotional response to music.
In addition to the meta data extractable from the signal itself, there is also meta data which is neither implicitly nor explicitly contained in the music signal but represents additional information about it, such as the year of the composition or recording, the record label, the song title, information on the artists, etc.
The examples given above already imply that ACA is a multi-disciplinary research field. Since it deals with audio signals, the main emphasis lies on (digital) signal processing. Depending on the task at hand, however, the researcher may need to draw on knowledge from other fields such as musicology and music theory, (music) psychology, psychoacoustics, audio engineering, library science, and, last but not least, computer science for pattern recognition and machine learning. If the research is driven by commercial interests, even legal and economic issues may be of importance.
The term audio content analysis is not the only one used for systems analyzing audio signals. Frequently, the research field is also called Music Information Retrieval (MIR). MIR should be understood as a more general, broader field of which ACA is a part. Downie and Orio have both published valuable introductory articles on the field of MIR [1, 2]. In contrast to ACA, MIR also includes the analysis of symbolic non-audio music formats such as musical scores and files or signals compliant with the so-called Musical Instrument Digital Interface (MIDI) protocol [3]. Furthermore, MIR may include the analysis and retrieval of information that is music-related but cannot be (easily) extracted from the audio signal, such as the song lyrics, user ratings, performance instructions in the score, or bibliographical information such as publisher, publishing date, the work's title, etc. Therefore, the term audio content analysis seems to be the most accurate for the description of the approaches to be covered in the following. In the past, other terms have been used more or less synonymously with audio content analysis; examples of such synonyms are machine listening and computer audition. Computational Auditory Scene Analysis (CASA) is closely related to ACA but usually has a strong focus on modeling the human perception of audio.
Historically, the first systems analyzing the content of audio signals appeared shortly after technology provided the means of storing and reproducing recordings on media in the 20th century. One early example is Seashore's Tonoscope, which allowed one to analyze the pitch of an audio signal by visualizing the fundamental frequency of the incoming audio signal on a rotating drum [4]. However, the development of digital storage media and digital signal processing over the last decades, along with the growing amount of digital audio data available through broadband connections, has significantly increased both the need for and the possibilities of automatic systems for analyzing audio content, resulting in a lively and growing research field. A short introduction to extracting information from audio on different levels has been published by Ellis [5].
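To make the notion of automatic pitch analysis more concrete, the following Python sketch estimates the fundamental frequency of a signal block by searching for the maximum of its autocorrelation function. This is a deliberately simplistic illustration, not a description of the Tonoscope or of any production pitch tracker; the function name and parameter choices are hypothetical.

```python
import numpy as np

def estimate_f0(x, fs, f_min=50.0, f_max=1000.0):
    """Estimate the fundamental frequency (Hz) of a signal block via the
    maximum of the autocorrelation function (illustrative sketch only)."""
    x = x - np.mean(x)
    # autocorrelation for non-negative lags
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    # search only the lag range corresponding to plausible pitches
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / lag

# usage: a one-second 440 Hz test tone
fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
print(estimate_f0(x, fs))  # close to 440 Hz (limited by integer lag resolution)
```

Note that real-world pitch trackers must additionally cope with noise, polyphony, and octave errors, which this sketch ignores.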
Audio content analysis systems can be used for a relatively wide variety of tasks. Obviously, the automatic generation of meta data is of great use for the retrieval of music signals with specific characteristics from large databases or the Internet. Here, the manual annotation of meta data by humans is simply not feasible due to the sheer amount of (audio) data. Therefore, only computerized tags can be used to find files or excerpts of files with, e.g., a specific tempo, instrumentation, chord progression, etc. The same information can be used in end-consumer applications such as the automatic generation of play lists in music players or automatic music recommendation systems based on the user's music database or listening habits. Another typical area of application is music production software. Here, the aim of ACA is, on the one hand, to offer the user a more "musical" software interface – e.g., by displaying score-like information along with the audio data – and thus to enable a more intuitive approach to visualizing and editing the audio data. On the other hand, the software can support the user with suggestions on how to combine and process different audio signals. For instance, software applications for DJs nowadays include technology allowing the (semi-) automatic alignment of audio loops and complete mixes based on previously extracted information such as the tempo and key of the signals. In summary, ACA can help with
• automatic organization of audio content in large databases as well as the search for and retrieval of audio files with specific characteristics from such databases (including the tasks of song identification and recommendation),
• new approaches and interfaces to the search and retrieval of audio data such as query-by-humming systems,
• new ways of sound visualization, user interaction, and musical processing in music software such as an audio editor displaying the current score position or an automatically generated accompaniment,
• intelligent, content-dependent control of audio processing (effect parameters, intelligent cross-fades, time stretching, etc.) and audio coding algorithms, and
• automatic play list generation in media players.
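As a toy illustration of the kind of meta data extraction described above, the following Python sketch derives a rough tempo estimate from an energy-based onset envelope. All names and parameter choices here are hypothetical, and real tempo induction systems are considerably more sophisticated; this merely shows the basic idea of searching a novelty curve for its dominant periodicity.

```python
import numpy as np

def estimate_tempo(x, fs, hop=441):
    """Rough tempo estimate in BPM: frame-wise energy differences serve as a
    crude onset envelope, whose autocorrelation is searched for the strongest
    periodicity in a musically plausible range (toy sketch only)."""
    n_frames = len(x) // hop
    energy = np.array([np.sum(x[i * hop:(i + 1) * hop] ** 2)
                       for i in range(n_frames)])
    env = np.maximum(np.diff(energy, prepend=0.0), 0.0)  # keep increases only
    env = env - np.mean(env)
    r = np.correlate(env, env, mode="full")[n_frames - 1:]
    fps = fs / hop                      # envelope frames per second
    lag_min = int(fps * 60 / 240)       # upper tempo bound: 240 BPM
    lag_max = int(fps * 60 / 40)        # lower tempo bound: 40 BPM
    lag = lag_min + np.argmax(r[lag_min:lag_max])
    return 60.0 * fps / lag

# usage: a synthetic click track, one click every 0.5 s, i.e., 120 BPM
fs = 22050
x = np.zeros(8 * fs)
x[::fs // 2] = 1.0
print(estimate_tempo(x, fs))  # → 120.0
```

The hop size of 441 samples is chosen so that the click period of the test signal is an integer number of envelope frames; with real music, octave errors (reporting half or double the perceived tempo) are a well-known difficulty that this sketch does not address.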