What is Audio Mining?
January 6, 2011. Posted by hasnain110 in Uncategorized.
Audio mining approaches
There are two main approaches to audio mining.
Text-based indexing. Text-based indexing, also known as large-vocabulary continuous speech recognition (LVCSR), converts speech to text and then identifies words in a dictionary that can contain up to several hundred thousand entries. If a word or name is not in the dictionary, the LVCSR system will choose the most similar word it can find.
The system uses language understanding to create a confidence level for its findings. For findings with less than a 100 percent confidence level, the system offers other possible word matches, said Professor Dan Ellis, who leads Columbia University’s Laboratory for Recognition and Organization of Speech and Audio (http://labrosa.ee.columbia.edu).
Thus, an LVCSR system can improve its chances of finding a spoken word by storing, alongside each recognized word, alternative words that sound much like it, although this approach also generates some wrong results.
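To make the idea of confidence levels and alternative matches concrete, here is a minimal sketch in Python. It is not how any particular LVCSR product stores its results; the `RecognizedWord` record, the `search` function, the confidence values, and the sample transcript are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RecognizedWord:
    """One recognized word plus lower-confidence alternatives (hypothetical record)."""
    start_sec: float
    best: str
    confidence: float
    alternatives: dict = field(default_factory=dict)  # alternative word -> confidence


def search(transcript, term, min_confidence=0.2):
    """Return (start_sec, confidence) wherever `term` is the best guess or a stored alternative."""
    hits = []
    for word in transcript:
        # Merge the best guess and the alternatives into one lookup table.
        candidates = {word.best: word.confidence, **word.alternatives}
        score = candidates.get(term)
        if score is not None and score >= min_confidence:
            hits.append((word.start_sec, score))
    return hits


# Made-up recognizer output: the second word was ambiguous.
transcript = [
    RecognizedWord(0.0, "the", 0.99),
    RecognizedWord(0.4, "whether", 0.55, {"weather": 0.40}),
    RecognizedWord(0.9, "report", 0.97),
]

print(search(transcript, "weather"))  # [(0.4, 0.4)]
```

A search for "weather" succeeds here only because the alternative was kept. The same mechanism would also return this position to someone searching for a sound-alike word that was never spoken, which is the source of the wrong results mentioned above.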
Phoneme-based indexing. Phoneme-based indexing doesn’t convert speech to text but instead works only with sounds.
The system first analyzes and identifies sounds in a piece of audio content to create a phonetic index. It then uses a dictionary of several dozen phonemes to convert a user’s search term to the correct phoneme string. (A phoneme is the smallest unit of speech that distinguishes one utterance from another. For example, “ai,” “eigh,” and “ey” are all spellings of the long “a” phoneme. Each language has a finite set of phonemes, and every word is a sequence of phonemes.) Finally, the system looks for the search term’s phoneme string in the index.
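The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real product's design: the `PRONUNCIATIONS` dictionary, the ARPAbet-like phoneme labels, the `phoneticize` and `find` helpers, and the sample index are all hypothetical.

```python
# Toy pronunciation dictionary; the phoneme labels are illustrative, ARPAbet-like symbols.
PRONUNCIATIONS = {
    "audio": ["AO", "D", "IY", "OW"],
    "mining": ["M", "AY", "N", "IH", "NG"],
}


def phoneticize(term):
    """Convert a search term to a flat list of phonemes, word by word."""
    phonemes = []
    for word in term.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def find(index, query):
    """Return the start times at which the query phonemes occur contiguously in the index."""
    symbols = [phoneme for phoneme, _ in index]
    n = len(query)
    return [
        index[i][1]
        for i in range(len(symbols) - n + 1)
        if symbols[i:i + n] == query
    ]


# Hypothetical phonetic index of a recording: (phoneme, start time in seconds) pairs.
index = [
    ("DH", 0.0), ("AH", 0.1),
    ("M", 0.3), ("AY", 0.4), ("N", 0.5), ("IH", 0.6), ("NG", 0.7),
    ("K", 0.9),
]

print(find(index, phoneticize("mining")))  # [0.3]
```

A real system would match approximately rather than exactly, since the recognized phonemes are themselves uncertain. Exact matching is used here only to keep the sketch short.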
“A phonetic system requires a more proprietary search tool because it must phoneticize the query term, then try to match it with the existing phonetic-string output,” said Weideman of ScanSoft. This is considerably more complex than using one of the many existing text-based search tools.
Phoneme-based searches can result in more false matches than the text-based approach, particularly for short search terms, because many words sound alike or sound like parts of other words. For example, Weideman explained, a search for the word “ray” might get a match from within the word “trading.”
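The “ray” inside “trading” problem is easy to reproduce. The sketch below assumes ARPAbet-like phoneme labels and a hypothetical `contains` helper; the transcriptions are simplified for illustration.

```python
def contains(sequence, query):
    """True if the query phonemes appear contiguously anywhere in the sequence."""
    n = len(query)
    return any(sequence[i:i + n] == query for i in range(len(sequence) - n + 1))


# Simplified, illustrative phoneme transcriptions.
RAY = ["R", "EY"]
RAILWAY = ["R", "EY", "L", "W", "EY"]
TRADING = ["T", "R", "EY", "D", "IH", "NG"]

print(contains(TRADING, RAY))      # True: the short query matches inside "trading"
print(contains(TRADING, RAILWAY))  # False: the longer query does not
```

The two-phoneme query matches by accident, while the five-phoneme query does not, which is why short search terms are the most prone to false matches.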
According to Ellis, it’s difficult for a phonetic system to accurately classify a phoneme except by recognizing the entire word that it is part of or by understanding that a language permits only certain phoneme sequences.
However, he added, phonetic indexing can still be useful if the analyzed material contains important words that are likely to be missing from a text system’s dictionary, such as foreign terms and names of people and places.
How the technology works
Text- and phoneme-based systems operate in much the same way, except that the former uses a text-based dictionary and the latter uses a phonetic dictionary.
The most important and complex component technology for audio mining is speech recognition. In these systems, explained University of Texas Assistant Professor Latifur R. Khan, “A speech recognizer converts the observed acoustic signal into the corresponding [written] representation of the spoken [words].”
Speech recognition software contains acoustic models of the way in which all phonemes are represented. Also, TMA’s Meisel said, there is a statistical language model that indicates how likely words are to follow each other in a specific language. By using these capabilities, as well as complex probability analysis, the technology can take a speech signal of unknown content and convert it to a series of words from the program’s dictionary.
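As a rough sketch of how the two models work together, the Python below combines an acoustic score with a bigram language model score and picks the most probable word sequence. Every number here is invented, and the names `BIGRAM`, `UNSEEN`, `language_log_prob`, and `best_hypothesis` are hypothetical. A real recognizer searches a vastly larger space of candidates rather than comparing two fixed hypotheses.

```python
import math

# Hypothetical bigram probabilities P(next word | previous word); "<s>" marks the sentence start.
BIGRAM = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}
UNSEEN = 1e-6  # floor probability for word pairs missing from this toy model


def language_log_prob(words):
    """Sum the log-probabilities of each word given the word before it."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(BIGRAM.get((previous, word), UNSEEN))
        previous = word
    return total


def best_hypothesis(hypotheses):
    """Pick the word sequence with the highest acoustic + language model log score.

    hypotheses: list of (words, acoustic_log_prob) pairs.
    """
    return max(hypotheses, key=lambda h: h[1] + language_log_prob(h[0]))[0]


# Two candidate transcriptions of the same audio, with made-up acoustic scores.
hypotheses = [
    (["recognize", "speech"], -12.0),
    (["wreck", "a", "nice", "beach"], -11.5),
]

print(best_hypothesis(hypotheses))  # ['recognize', 'speech']
```

In this toy case the second hypothesis fits the sound slightly better, but the language model considers its word sequence far less likely, so the first one wins.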
Khan noted that this process is more difficult with tonal languages, such as Chinese, in which tone changes the meaning of a word.
Some audio mining dictionaries are domain specific, for use by professionals in different fields, such as law or medicine. In any event, users can update dictionaries, usually manually but sometimes automatically by scanning Web sites or other sources into an audio mining product.
Some products, such as ScanSoft’s AudioMining Development System, use XML’s ability to tag data so that it can be read by other XML-capable systems, ScanSoft’s Weideman noted. This lets the product export speech index information to other systems, he said.
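To give a feel for what a tagged speech index might look like, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The element and attribute names (`speechIndex`, `word`, `start`, `confidence`) and the `index_to_xml` helper are invented for illustration and are not ScanSoft's actual format.

```python
import xml.etree.ElementTree as ET


def index_to_xml(source, entries):
    """Serialize (word, start_sec, confidence) entries as an XML string (hypothetical schema)."""
    root = ET.Element("speechIndex", {"source": source})
    for word, start_sec, confidence in entries:
        item = ET.SubElement(
            root,
            "word",
            {"start": f"{start_sec:.2f}", "confidence": f"{confidence:.2f}"},
        )
        item.text = word
    return ET.tostring(root, encoding="unicode")


xml_text = index_to_xml("meeting.wav", [("audio", 1.5, 0.93), ("mining", 1.9, 0.88)])
print(xml_text)

# Any XML-capable consumer can read the index back without knowing how it was produced.
parsed = ET.fromstring(xml_text)
print([(w.text, w.get("start")) for w in parsed.findall("word")])
# [('audio', '1.50'), ('mining', '1.90')]
```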
Most audio mining technology achieves high performance by combining powerful host-system processors, large memories, and efficient algorithms.
For example, Fast-Talk says its newest technology can index a one-hour audio file in five minutes and can search 30 hours of content per second for a specific 10-phoneme query on a host system running a 2.53-GHz Pentium CPU.
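Those figures are easier to compare as multiples of real time. The short Python calculation below uses only the numbers quoted by Fast-Talk; the `realtime_factor` helper and the 1,000-hour archive are assumptions added for illustration.

```python
def realtime_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are handled per second of processing."""
    return audio_seconds / processing_seconds


# Figures quoted above: 1 hour indexed in 5 minutes; 30 hours searched per second.
indexing = realtime_factor(60 * 60, 5 * 60)
searching = realtime_factor(30 * 60 * 60, 1)
print(indexing)   # 12.0
print(searching)  # 108000.0

# Hypothetical archive size, used only to illustrate the scale.
archive_hours = 1000
print(round(archive_hours * 5 / 60, 1))  # 83.3 hours to index the archive once
print(round(archive_hours / 30, 1))      # 33.3 seconds to search it for one query
```

Indexing is the slow, one-time step. Once the index exists, each search runs several thousand times faster than indexing did.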