OpenAI, the artificial intelligence research laboratory founded by Elon Musk and Y Combinator’s Sam Altman in 2015, has had tremendous success in promoting the cause of large-vocabulary Natural Language Understanding with high-visibility models like GPT-2 and GPT-3. More recently, as described in this post from Scott Baker, it made DALL·E 2 generally available to the public. As Scott asserts in his post, “DALL·E has instantly democratized concept art design and collaboration.”
Lost in the hoopla surrounding DALL·E 2’s uptake is the open sourcing of another OpenAI product, Whisper. It is a general-purpose, multitask, multilingual speech recognition model that also performs language identification and speech translation into English. It is a downloadable resource trained on 680,000 hours of audio and corresponding transcripts spanning 98 languages. In the classic “your mileage may vary” disclaimer, its originators observe that transcription accuracy in any particular language correlates directly with the amount of training data available in that language. Still, they close their introductory blog post with this wish: “We hope Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications.”
The Whisper Campaign is Working
For developers, the model card and the official codebase for running its ASR models are available on GitHub, the fifteen-year-old developer platform that is now a subsidiary of Microsoft. Nine models of different sizes and capabilities are provided; four are English-only variants, while the rest support recognition and transcription in the language that is spoken. They were developed by the researchers at OpenAI to evaluate how robustness scales with the size of the model deployed.
Now that all nine models are available to the reported 83 million developers registered on GitHub, they are invited to put the robustness of the model to the test and to develop applications of their own. This is in stark contrast with how the sharing of DALL·E and GPT versions has been handled: even though they were called “open source,” their release was controlled, or the APIs to reach them were restricted to paying customers.
Unlike those predecessor learning models, Whisper can be downloaded to run on a developer’s computing platform of choice. This approach conforms to what Ben Dickson of VentureBeat refers to as a “return to openness”. He points out in this article that it may be in response to initiatives from Meta and Hugging Face, companies that released open source models to rival GPT-3 (OPT-175B and BLOOM, respectively). Likewise, Stability.ai released Stable Diffusion, an open-source image generation model that is functionally similar to DALL·E.
Developers and users who have tried Whisper are impressed. While Opus Research does not have a lab to benchmark and compare results from multiple engine providers, we’ve been told that, in many use cases, Whisper performs better than the most popular ASR/transcription services on the market today. This was confirmed in Dickson’s article through a quote from “MLops expert” Noah Gift, who finds it superior to SaaS providers offering “transcription only”.
Transcription as a Commodity: Gateway to Conversational Intelligence
For decades, the road to progress in conversational technologies has been paved with the products of commoditization. Telephony’s stack, supporting switching, call routing and call completion, was commoditized by SIP. Interactive Voice Response systems (IVRs), especially the speech-enabled variety, were commoditized by VoiceXML. Now OpenAI’s Whisper and other open source initiatives accelerate the commoditization of automated speech recognition (ASR).
One way to look at it is from an investor’s standpoint: commoditization is poised to destroy billions in market cap assigned to cloud-based ASR specialists. The more positive view, borne out by historical precedent, is that low-cost or no-cost key solution components redirect that investment toward innovative yet affordable solutions. The creativity of hundreds of thousands of developers is about to be drawn toward creating and delivering new services that are speech-enabled, reliable and affordable on a global scale.