Audio transcription
- Quicktate API
- Speech Transcription
- Call Auditing for customer service call centers
- Call Transcription for phone meetings
- SMS dictation for driver safety
- The overall process of successfully transcribing an audio file is as follows:
- Submit the job into our system
- Our system assigns a Job ID, which is returned to your application for your tracking purposes
- We transcribe the file into text (using thousands of typists on call 24/7 ready to work at a moment’s notice)
- Our servers send the results via HTTP POST to a Callback URL you specify
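For reference, the callback half of this flow is just an HTTP endpoint that accepts the service's POST. A minimal Java sketch, assuming a hypothetical form-encoded payload (the actual field names aren't covered in these notes):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal callback receiver: the transcription service POSTs the finished
// result to this URL; we read the body and acknowledge with a 200.
public class TranscriptionCallbackServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/transcription-callback", exchange -> {
            try (InputStream in = exchange.getRequestBody()) {
                String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                // Hypothetical payload such as job_id=...&transcription=...; parse/store as needed.
                System.out.println("Callback received: " + body);
            }
            exchange.sendResponseHeaders(200, -1); // acknowledge so the service stops retrying
            exchange.close();
        });
        server.start();
    }
}
```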
- NexiWave API
- Provides audio search, subtitling, indexing, and similar tasks for podcasters and video creators, with 80%+ accuracy, probably using Sphinx underneath
- Partnered with UbiCast which records seminars and creates webcasts/podcasts from them
- Twilio allows users to call the website and receive some TTS-generated speech and some help, and also to call cell phones and landlines to get human service.
- Greg Tracy's example
- paid service
- two minutes long
- append ".txt" to the end of a Recording resource URI to retrieve the transcription text for that recording
- SpeakerText Beta uses auto-generated captions that are corrected by a human to produce transcribed videos for bloggers and others
- $2/minute
- Process
- User uploads videos on YouTube, Vimeo, Blip.tv, Ooyala, Brightcove
- 72 hours later transcriptions are emailed back
- Ribbit API
- Consumer application Ribbit Mobile links mobile phones and the internet to create an integrated voice and data solution tailored for the lifestyle of the modern professional.
- The enterprise solution, Ribbit for Salesforce, integrates mobile phones and advanced voice automation features directly into Salesforce.com to increase sales team productivity.
- Process to transcribe voicemail
- Create a folder, enable transcription services for that folder, and upload media (.mp3, .wav, or .ulaw) to that folder
- A transcription event is triggered if there is no corresponding .txt file (see the sketch after this list)
- Transcription costs debited from the user's account
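The trigger rule can be illustrated with a local sketch that scans a folder for media files that still lack a matching .txt result; this is only an illustration of the rule above, not a call into the Ribbit API:

```java
import java.io.File;
import java.util.List;

public class PendingTranscriptions {
    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : "voicemail"); // transcription-enabled folder
        List<String> mediaExtensions = List.of(".mp3", ".wav", ".ulaw");

        File[] files = folder.listFiles();
        if (files == null) return; // folder does not exist or is not a directory

        for (File media : files) {
            String name = media.getName().toLowerCase();
            if (mediaExtensions.stream().noneMatch(name::endsWith)) continue;

            // A transcription event fires only when no sibling .txt result exists yet.
            String baseName = media.getName().replaceFirst("\\.[^.]+$", "");
            File transcript = new File(folder, baseName + ".txt");
            if (!transcript.exists()) {
                System.out.println("Would trigger transcription for: " + media.getName());
            }
        }
    }
}
```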
- Ditech has the exclusive rights to resell SimulScribe’s speech-to-text transcription services on a wholesale basis to telephone companies and developers.
- CastingWords Integration uses Mechanical Turk workers to do the transcription
- Scribie claims to use humans in India to transcribe for $0.99/minute, using Skype as the audio transmission service
- TalkScribe is a startup that is positioning itself to serve Jott's "refugees"
- Jott: "Nearly two years after Jott’s acquisition and a successful integration into Nuance, we officially ended the Jott service on May 3rd, 2011. This may seem counter-intuitive: success leading to a shutdown. But while it is an ending of sorts, the reality is that the technology, service, talent and imagination of Jott will continue on as part of a far broader set of services."
Google speech recognition stands to be the best quality of all "available" systems. Their search language model is based on the billions of Google searches. Their free-form language models are based on transcriptions of Google Voice voicemail messages and YouTube videos (YouTube generates closed captions, and users can upload corrected versions so that viewers get accurate closed captions), among other unconfirmed data sources.
- A YouTube captioning API has functionality similar to the desired chunking on silence, annotating blocks of audio with text, since subtitles are simply arrays of timespan<->text pairs (see the sketch after this list).
- However, there are conditions that prevent using it as a general-purpose transcription service.
- Video: the audio must be part of a video that has been uploaded and has a video ID
- Ownership: the video must be owned by the user. It could therefore be possible to create a single user account and push the audio to YouTube, or to ask the user to let the app access their YouTube account, request the auto-transcribed version, and render that as blog text for the user, even providing an interface that helps them navigate their text in time, audio, and text form. But that would likely be a huge terms-of-use violation for the developer API key, as potentially millions of blank, useless YouTube videos would be created and (publicly) available. The privacy problem could be avoided if this were done not with YouTube videos but with videos uploaded to Google Docs, where the audio stays private to that user's Google account. There is also Google Video for Business among the Apps products, but neither of those two APIs is available yet, just the YouTube one.
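Since subtitles are just arrays of timespan<->text pairs, the chunking idea is easy to sketch. The Cue type and the 2-second silence threshold below are hypothetical, not part of the YouTube API:

```java
import java.util.ArrayList;
import java.util.List;

public class CaptionChunker {
    // One subtitle entry: a timespan (in milliseconds) annotated with text.
    record Cue(long startMs, long endMs, String text) {}

    // Merge consecutive cues into larger blocks, splitting wherever the gap
    // between one cue's end and the next cue's start exceeds silenceGapMs.
    static List<Cue> chunkOnSilence(List<Cue> cues, long silenceGapMs) {
        List<Cue> blocks = new ArrayList<>();
        Cue current = null;
        for (Cue cue : cues) {
            if (current == null) {
                current = cue;
            } else if (cue.startMs() - current.endMs() <= silenceGapMs) {
                current = new Cue(current.startMs(), cue.endMs(),
                        current.text() + " " + cue.text());
            } else {
                blocks.add(current);
                current = cue;
            }
        }
        if (current != null) blocks.add(current);
        return blocks;
    }

    public static void main(String[] args) {
        List<Cue> cues = List.of(
                new Cue(0, 1800, "Hello and welcome"),
                new Cue(1900, 3500, "to the show."),
                new Cue(9000, 11000, "Today's topic is transcription."));
        // With a 2 s silence threshold, the first two cues merge into one block.
        chunkOnSilence(cues, 2000).forEach(System.out::println);
    }
}
```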
- Chromium hack by Mike Pultz results in a general Perl + HTTP POST approach; others made it work in PHP and Java
- Chromium speech source code, for your reverse-engineering pleasure
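A Java sketch of the same POST approach; the endpoint, query parameters, and audio/x-flac content type follow the write-up, but the interface is unofficial and may be rate-limited or withdrawn at any time:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class GoogleSpeechPost {
    public static void main(String[] args) throws Exception {
        // Endpoint reported in the Chromium write-up (unofficial, subject to change).
        URI endpoint = URI.create(
                "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US");

        // Audio must be FLAC; the sample rate in the header has to match the file.
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "audio/x-flac; rate=16000")
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("utterance.flac")))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with hypotheses and confidence scores
    }
}
```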
- The sample VoiceRecognition.java uses the Android speech package android.speech, more specifically the RecognizerIntent. The example works well and is very clear; you can test it out in the API Demos sample code in the SDK (see the sketch after this list).
- But it only handles short speech samples (until the user pauses), and it is an Intent -> GUI -> Record -> Result use case. There is no GUI-free/eyes-free access yet; there are feature requests for it on the Android Google Code issue tracker.
- The implementation of the RecognizerIntent itself (or other files in android.speech) should provide some exposure to the Google speech recognition servers.
- The Android source on GitHub contains the core code, not the com.google code, and there don't appear to be any speech recognition clients in there
- Relevant packages that could be tweaked to provide a GUI-free solution:
- com.google.android.voicesearch
- com.google.android.voicesearch.speechservice
- The Android SpeechRecognizer does allow more control than just using the intent, but there is still no way to pre-process and chunk audio in order to send a longer file/sample.
- The Voice Recognizer sample in the Android SDK (just create a new project in Eclipse and select "create from existing source") shows a skeleton example of how to build a new speech recognizer that is automatically registered on the device and can even be configured in Android Preferences > Voice Input and Output. This could be a direction to follow if one were to implement a new speech recognition service using another server, such as a machine running Sphinx, or the semi-exposed Google service discussed in Mike Pultz's Chromium investigation. Beware: it causes android.process.acore to force close unexpectedly, probably because the Android core was heavily tied to the exact implementation of RecognizerIntent/SpeechRecognizer in previous development; this will likely change in the future.
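For reference, a condensed sketch of the Intent -> GUI -> Record -> Result flow described above (essentially what the VoiceRecognition.java sample does):

```java
import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;
import android.speech.RecognizerIntent;
import java.util.ArrayList;

public class VoiceDemoActivity extends Activity {
    private static final int REQUEST_SPEECH = 1;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // Launch the built-in recognition dialog; it records until the user pauses.
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak now");
        startActivityForResult(intent, REQUEST_SPEECH);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == REQUEST_SPEECH && resultCode == RESULT_OK) {
            // Candidate transcriptions, best match first.
            ArrayList<String> matches =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            for (String match : matches) {
                android.util.Log.d("VoiceDemo", match);
            }
        }
        super.onActivityResult(requestCode, resultCode, data);
    }
}
```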
A classic and long-standing project now hosting Google Summer of Code students. CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under BSD style license. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems.
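For offline transcription of a whole file, a minimal sketch against the Sphinx4 Java API; the class names and bundled model paths assume a recent sphinx4-5prealpha-style build rather than the older XML-configured releases:

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;
import java.io.FileInputStream;

public class SphinxTranscribe {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Bundled US English models shipped with the sphinx4-data artifact.
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        recognizer.startRecognition(new FileInputStream("speech.wav")); // 16 kHz mono WAV expected

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.println(result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
```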
"Julius" is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. Based on word N-gram and context-dependent HMM, it can perform almost real-time decoding on most current PCs in 60k word dictation task. Major search techniques are fully incorporated such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection, etc. Besides search efficiency, it is also modularized carefully to be independent from model structures, and various HMM types are supported such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted to cope with other free modeling toolkit such as HTK, CMU-Cam SLM toolkit, etc.
- http://julius.sourceforge.jp/en_index.php?q=index-en.html
- The Speech2Text project provides a ready-to-use interface to the Julius CSR engine for a handicapped child who is not able to use the keyboard well. It integrates into X11 and Windows.
- Our goal is to support a bit of bootstrapping, even for non-standard languages so that experiments on any language provide at least a bit of audio analysis.
- The MARF project has some libraries for audio analysis. Not sure how complete it is or which of its goals have been realized yet.
MARF is an open-source research platform and a collection of voice/sound/speech/text and natural language processing (NLP) algorithms written in Java and arranged into a modular and extensible framework facilitating addition of new algorithms.