Sphinx: An Open Source Speech-to-Text Engine

Posted by Julian Dunn on August 29, 2006
Telephony

I attended the Toronto Asterisk Users’ Group meeting tonight, and one of the hot topics discussed over dinner was speech-to-text (i.e., speech recognition). Text-to-speech (TTS) in Asterisk is already well handled by Festival and the corresponding Asterisk application, but I think you’ll agree that speech recognition is a far more interesting topic. (Unless you hate Emily, Bell Canada’s vocal equivalent of the stupid Microsoft paperclip.)

Carnegie Mellon University has long had a group working on a recognition engine called Sphinx, funded by a DARPA grant. I’m told that Sphinx-II, the original C version, is available as an application for Asterisk, but that later versions of Sphinx have much higher accuracy. Sphinx-3 is also written in C, and Sphinx-4 is written entirely in Java. Sphinx is different from many other speech recognition systems in that it does not need to be trained on each speaker’s voice, which makes it ideal for use in telephony applications. Instead, you supply it with pretrained acoustic models and a dictionary of known words (the bigger the dictionary, the more RAM is used). Mike Ashton of QualityTrack claims over 96% accuracy with Sphinx, which he uses to strip sensitive information out of recorded phone calls in a call centre monitoring application.

This is really fascinating technology, and the best part about it is that despite having been developed under a DARPA grant, it’s open source! Apparently this was one of the stipulations the CMU researchers made when they first agreed to accept the grant, and the community is the better for it. According to the site, it’s rather difficult to install and set up, particularly for those of us with no knowledge of speech patterns and the like, but perhaps one day I’ll be able to have a system that I can dial and say “Please reboot programGuide” and Asterisk will be able to do the right thing.
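
To make that last idea a little more concrete, here is a rough Python sketch of the part that would sit after the recognizer: taking the text it hands back and deciding what, if anything, to do. Everything here is my own assumption for illustration. The names KNOWN_HOSTS and parse_command are hypothetical, no Sphinx or Asterisk API is called, and the transcript is simply assumed to arrive as a plain string.

```python
import re

# Hypothetical table of hosts a caller is allowed to reboot.
# Keys are lowercased spoken names; values are the canonical host names.
KNOWN_HOSTS = {
    "programguide": "programGuide",
}

# Matches phrases such as "please reboot programGuide" or "reboot program guide".
REBOOT_PATTERN = re.compile(r"^(?:please\s+)?reboot\s+(.+)$", re.IGNORECASE)


def parse_command(transcript):
    """Return ("reboot", host) for a recognized reboot request, else None."""
    match = REBOOT_PATTERN.match(transcript.strip())
    if match is None:
        return None
    # A recognizer may emit the name as separate lowercase words, so drop
    # whitespace and case before looking the name up.
    spoken_name = re.sub(r"\s+", "", match.group(1)).lower()
    host = KNOWN_HOSTS.get(spoken_name)
    if host is None:
        # Unknown host: refuse rather than guess.
        return None
    return ("reboot", host)
```

Checking the spoken name against a fixed list of hosts means a misheard word gets ignored instead of rebooting the wrong machine, which seems like the safer default for anything driven by a phone call.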